Real-time Prediction of Segmentation Quality

by Robert Robinson et al.

Recent advances in deep learning based image segmentation methods have enabled real-time performance with human-level accuracy. However, occasionally even the best method fails due to low image quality, artifacts or unexpected behaviour of black box algorithms. Being able to predict segmentation quality in the absence of ground truth is of paramount importance in clinical practice, but also in large-scale studies to avoid the inclusion of invalid data in subsequent analysis. In this work, we propose two approaches of real-time automated quality control for cardiovascular MR segmentations using deep learning. First, we train a neural network on 12,880 samples to predict Dice Similarity Coefficients (DSC) on a per-case basis. We report a mean absolute error (MAE) of 0.03 on 1,610 test samples and 97% accuracy in separating low and high quality segmentations. Secondly, in the scenario where no manually annotated data is available, we train a network to predict DSC scores from estimated quality obtained via a reverse testing strategy. We report an MAE of 0.14 and 91% accuracy. Predictions are obtained in real-time which, when combined with real-time segmentation methods, enables instant feedback on whether an acquired scan is analysable while the patient is still in the scanner. This further enables new applications of optimising image acquisition towards best possible analysis results.





1 Introduction

Finding out that an acquired medical image is not usable for the intended purpose is not only costly but can be critical if image-derived quantitative measures should have supported clinical decisions in diagnosis and treatment. Real-time assessment of the downstream analysis task, such as image segmentation, is highly desired. Ideally, such an assessment could be performed while the patient is still in the scanner, so that in the case an image is not analysable, a new scan could be obtained immediately (even automatically). Such a real-time assessment requires two components, a real-time analysis method and a real-time prediction of the quality of the analysis result. This paper proposes a solution to the latter with a particular focus on image segmentation as the analysis task.

Recent advances in deep learning based image segmentation have brought highly efficient and accurate methods, most of which are based on Convolutional Neural Networks (CNNs). However, even the best method will occasionally fail due to insufficient image quality (e.g., noise, artefacts, corruption) or show unexpected behaviour on new data. In clinical settings, it is of paramount importance to be able to detect such failure cases on a per-case basis. In clinical research, such as population studies, it is important to be able to detect failure cases in automated pipelines, so invalid data can be discarded in the subsequent statistical analysis.

Here, we focus on automatic quality control of image segmentation. Specifically, we assess the quality of automatically generated segmentations of cardiovascular MR (CMR) from the UK Biobank (UKBB) Imaging Study [1].

Automated quality control is dominated by research in the natural-image domain and is often referred to as image quality assessment (IQA). The literature proposes methodologies to quantify the technical characteristics of an image, such as the amount of blur, and more recently a way to assess the aesthetic quality of such images [2]. In the medical image domain, IQA is an important topic of research in the fields of image acquisition and reconstruction. An example is the work by Farzi et al. [3] proposing an unsupervised approach to detect artefacts. Where research is conducted into the quality or accuracy of image segmentations, it is almost entirely assumed that there is a manually annotated ground truth (GT) labelmap available for comparison. Our domain has seen little work on assessing the quality of generated segmentations particularly on a per-case basis and in the absence of GT.

Related Work:

Some previous studies have attempted to deliver quality estimates of automatically generated segmentations when GT is unavailable. Most methods tend to rely on a reverse-testing strategy. Both Reverse Validation [4] and Reverse Testing [5] employ a form of cross-validation by training segmentation models on a dataset that are then evaluated either on a different fold of the data or a separate test-set. Both of these methods require a fully-labeled set of data for use in training. Additionally, these methods are limited to conclusions about the quality of the segmentation algorithms rather than the individual labelmaps as the same data is used for training and testing purposes.

Where work has been done in assessing individual segmentations, it often also requires large sets of labeled training data. In [6], a model was trained using numerous statistical and energy measures from segmentation algorithms. Although this model is able to give individual predictions of accuracy for a given segmentation, it again requires the use of a fully-annotated dataset. Moving away from this limitation, [7, 8] have shown that applying Reverse Classification Accuracy (RCA) gives accurate predictions of traditional quality metrics on a per-case basis. They accomplish this by comparing a set of reference images with manual segmentations to the test-segmentation, evaluating a quality metric between these, and then taking the best value as a prediction for segmentation quality. This is done using a set of only 100 reference images with verified labelmaps. However, the time taken to complete RCA on a single segmentation, around 11 minutes, prohibits its use in real-time quality control frameworks.


In this study, we show that applying a modern deep learning approach to the problem of automated quality control in deployed image-segmentation frameworks can decrease the per-case analysis time to the order of milliseconds whilst maintaining good accuracy. We predict the Dice Similarity Coefficient (DSC) at large scale, analyzing over 16,000 segmentations of images from the UKBB. We also show that measures derived from RCA can be used to inform our network, removing the need for a large, manually-annotated dataset. When pairing our proposed real-time quality assessment with real-time segmentation methods, one can envision new avenues of optimising image acquisition automatically towards the best possible analysis results.

2 Method & Material

Figure 1: (left) Histogram of Dice Similarity Coefficients (DSC) for 29,292 segmentations over the range [0, 1] with 10 equally spaced bins. The red line marks the minimum count per bin (1,610), used to balance scores. (right) The 5 input channels of the CNNs in both experiments: the image and one-hot-encoded labelmaps for background (BG), left-ventricular cavity (LVC), left-ventricular myocardium (LVM) and right-ventricular cavity (RVC).

We use the Dice Similarity Coefficient (DSC) as a metric of quality for segmentations. It measures the overlap between a proposed segmentation and its ground truth (GT) (usually a manual reference). We aim to predict DSC for segmentations in the absence of GT. We perform two experiments in which CNNs are trained to predict DSC. First we describe our input data and the models.
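As a concrete reference, the overlap metric can be sketched as below. This is a minimal illustration of the standard per-class DSC definition, not the authors' code; the function and argument names are ours.

```python
import numpy as np

def dice_coefficient(pred, gt, label):
    """DSC for one class: 2|P n G| / (|P| + |G|), where P and G are the
    voxel sets assigned to `label` in the prediction and the ground truth."""
    p = (np.asarray(pred) == label)
    g = (np.asarray(gt) == label)
    denom = p.sum() + g.sum()
    if denom == 0:
        return 1.0  # class absent from both maps: treat as perfect agreement
    return 2.0 * np.logical_and(p, g).sum() / denom
```

A DSC of 1 indicates perfect overlap with the reference labelmap, 0 indicates no overlap at all.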

Our initial dataset consists of 4,882 3D (2D-stacks) end-diastolic (ED) cardiovascular magnetic resonance (CMR) scans from the UK Biobank (UKBB) Imaging Study (UK Biobank Resource under Application Number 2964). All images have a manual segmentation, which is unprecedented at this scale. We take these labelmaps as reference GT. Each labelmap contains 3 classes: left-ventricular cavity (LVC), left-ventricular myocardium (LVM) and right-ventricular cavity (RVC), which are separate from the background class (BG). In this work, we also consider the segmentation as a single binary entity comprising all classes: whole-heart (WH).

A random forest (RF) of 350 trees and maximum depth 40 is trained on 100 cardiac atlases from an in-house database and used to segment the 4,882 images at depths of 2, 4, 6, 8, 10, 15, 20, 24, 36 and 40. We calculate DSC against the GT for the 29,292 generated segmentations. The distribution is shown in Fig 1. Due to the imbalance in DSC scores of this data, we take a random subset of 1,610 segmentations from each DSC bin, equal to the minimum number of counts-per-bin across the distribution. Our final dataset comprises 16,100 score-balanced segmentations with reference GT.
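The bin-balancing step can be sketched as follows. This is our illustrative reconstruction of the subsampling described above (equal-width DSC bins, equal draws per bin), with names of our own choosing.

```python
import numpy as np

def balance_by_dsc(dsc_scores, n_bins=10, seed=0):
    """Indices of a score-balanced subset: the same number of samples is
    drawn at random from each occupied equal-width DSC bin, matching the
    smallest occupied bin (1,610 in the paper's data)."""
    scores = np.asarray(dsc_scores)
    # Assign each score in [0, 1] to one of n_bins bins; DSC = 1.0 falls in the last bin.
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(bins, minlength=n_bins)
    n_keep = counts[counts > 0].min()
    rng = np.random.default_rng(seed)
    keep = [rng.choice(np.where(bins == b)[0], size=n_keep, replace=False)
            for b in range(n_bins) if counts[b] > 0]
    return np.sort(np.concatenate(keep))
```

After this step every occupied bin contributes the same number of samples, so the training labels are roughly uniform over the DSC range.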

From each segmentation we create 4 one-hot-encoded masks: masks 1 to 4 correspond to the classes BG, LVC, LVM and RVC respectively. The voxels of a mask are set to 0 when they do not belong to the mask’s class and to 1 otherwise. For example, the mask for LVC is 0 everywhere except for voxels of the LVC class, which are given the value 1. This gives the network a greater chance to learn the relationships between the voxels’ classes and their locations. An example of the segmentation masks is shown in Fig 1.
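The one-hot encoding above amounts to a binary mask per class; a minimal sketch, with class indices (0=BG, 1=LVC, 2=LVM, 3=RVC) assumed by us for illustration:

```python
import numpy as np

def one_hot_masks(labelmap, n_classes=4):
    """One binary mask per class: voxels of the class are set to 1,
    all other voxels to 0. Output has one channel per class."""
    return np.stack([(labelmap == c).astype(np.float32)
                     for c in range(n_classes)], axis=-1)
```

Concatenating these 4 masks with the image itself yields the 5-channel network input shown in Fig 1.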

At training time, our data-generator re-samples the UKBB images and our segmentations to a consistent shape, making our network fully 3D with 5 data channels: the image and the 4 segmentation masks. The images are also normalized such that the entire dataset falls in a fixed intensity range.

For comparison and consistency, we use the same input data and network architecture in each of our experiments. We employ a 50-layer 3D residual network written in Python with the Keras library and trained on an 11GB Nvidia GeForce GTX 1080 Ti GPU. Residual networks are advantageous as they allow deeper networks to be trained by repeating smaller blocks, and their skip connections allow information to travel deeper into the network. We use the Adam optimizer with a decay of 0.005. Batch sizes are kept constant at 46 samples per batch. We run validation at the end of each epoch for model-selection purposes.


Can we take advantage of a CNN’s inference speed to give fast and accurate predictions of segmentation quality? This is an important question for analysis pipelines which could benefit from the increased confidence in segmentation quality without compromising processing time. To answer this question we conduct the following experiments.

Experiment 1: Directly predicting DSC.

Is it possible to directly predict the quality of a segmentation given only the image-segmentation pair? In this experiment we calculate, per class, the DSC between our segmentations and the GT. These are used as training labels. The final layer of the network has 5 nodes, whose output is the vector of predicted DSC values per class, including background and whole-heart. We use mean-squared-error loss and report mean-absolute-error between the output and GT DSC. We split our data 80:10:10, giving 12,880 training samples and 1,610 samples each for validation and testing. Performing this experiment is costly as it requires a large manually-labeled dataset, which is not readily available in practice.
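The split, training objective and reported metric can be sketched as follows; this is a numpy illustration under our own naming, not the authors' training code.

```python
import numpy as np

def split_80_10_10(n, seed=0):
    """Shuffle sample indices and split them 80:10:10 into train/val/test."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def mse_loss(y_true, y_pred):
    """Training objective: mean squared error over the 5 DSC outputs."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Reported metric: mean absolute error between predicted and GT DSC."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```

With 16,100 balanced samples this split yields exactly the 12,880/1,610/1,610 partition used in Experiment 1.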

Experiment 2: Predicting RCA scores.

Considering the promising results of the RCA framework [7, 8] in accurately predicting the quality of segmentations in the absence of large labeled datasets, can we use the predictions from RCA as training labels to allow a network to give comparably accurate predictions on a test-set? In this experiment, we perform RCA on all 16,100 segmentations. To ensure that we train on balanced scores, we again perform histogram binning on the RCA scores and take equal numbers from each bin. We finish with a total of 5,363 samples split into training, validation and test sets of 4,787, 228 and 228 respectively. The per-class RCA predictions are used as labels during training. Similar to Experiment 1, we obtain a single predicted DSC output for each class using the same network and hyper-parameters, but without the need for the large, often-unobtainable manually-labeled training set.

3 Results

Figure 2: Examples showing excellent prediction of Dice Similarity Coefficient (DSC) in Experiment 1. Quality increases from top-left to bottom-right. Each panel shows (left to right) the image, test-segmentation and reference GT.

Results from Experiment 1 are shown in Table 1. We report mean absolute error (MAE) and standard deviations per class between reference GT and predicted DSC. Our results show that our network can directly predict whole-heart DSC from the image-segmentation pair with an MAE of 0.03 (SD = 0.04). We see similar performance on individual classes. Table 1 also shows MAE over the top and bottom halves of the GT DSC range, which suggests that the error is equally distributed over poor and good quality segmentations. For WH, 72% of the data have MAE less than 0.05, with outliers comprising only 6% of the data. Distributions of the MAEs for each class can be seen in Fig 3. Examples of good and poor quality segmentations are shown in Fig 2 with their GT and predictions. Results show excellent true-positive (TPR) and false-positive (FPR) rates on a whole-heart binary classification task with a DSC threshold of 0.70. The reported accuracy of 97% is better than the 95% reported with RCA in [8].
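The binary classification statistics reported here can be computed as below; a minimal sketch of the standard definitions, with 'good' (DSC at or above the threshold) treated as the positive class.

```python
import numpy as np

def classification_stats(gt_dsc, pred_dsc, threshold=0.70):
    """TPR, FPR and accuracy when GT and predicted DSC are each binarised
    into 'good' (DSC >= threshold, positive class) vs 'poor'."""
    gt_pos = np.asarray(gt_dsc) >= threshold
    pr_pos = np.asarray(pred_dsc) >= threshold
    tp = np.sum(gt_pos & pr_pos)
    fn = np.sum(gt_pos & ~pr_pos)
    fp = np.sum(~gt_pos & pr_pos)
    tn = np.sum(~gt_pos & ~pr_pos)
    tpr = tp / max(tp + fn, 1)   # sensitivity on truly good segmentations
    fpr = fp / max(fp + tn, 1)   # poor segmentations wrongly accepted
    acc = (tp + tn) / len(gt_pos)
    return tpr, fpr, acc
```

In a deployed pipeline, the FPR is the most safety-critical of these figures: it is the fraction of genuinely poor segmentations that would slip through as 'good'.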

Our results for Experiment 2 are recorded in Table 1. It is expected that direct predictions of DSC from the RCA labels are less accurate than in Experiment 1. The reasoning is two-fold: first, the RCA labels are themselves predictions and retain inherent uncertainty; second, the training set here is much smaller than in Experiment 1. However, we report an MAE of 0.14 (SD = 0.09) for the WH case and 91% accuracy on the binary classification task. Distributions of the MAEs are shown in Fig 3. LVM has a greater variance in MAE, which is in line with previous results using RCA. Thus, the network would be a valuable addition to an analysis pipeline where operators can be informed of likely poor-quality segmentations, along with some confidence interval, in real-time.

On average, the inference time for each network was of the order of 600 ms on CPU and 40 ms on GPU. This is over 10,000 times faster than with RCA (660 seconds) whilst maintaining good accuracy. In an automated image analysis pipeline, this method would deliver excellent performance at high speed and at large scale. When paired with a real-time segmentation method, it would be possible to provide real-time feedback during image acquisition on whether an acquired image is of sufficient quality for the downstream segmentation task.

Figure 3: Distribution of the mean absolute errors (MAE) for Experiments 1 (left) and 2 (right). Results are shown for each class: background (BG), left-ventricular cavity (LV), left-ventricular myocardium (LVM), right-ventricular cavity (RVC) and for the whole-heart (WH).
Mean Absolute Error (MAE)

                 Experiment 1                                    Experiment 2
Class   All            DSC < 0.5      DSC >= 0.5      All            DSC < 0.5      DSC >= 0.5
BG      0.008 (0.011)  0.012 (0.014)  0.004 (0.002)   0.034 (0.042)  0.048 (0.046)  0.074 (0.002)
LV      0.038 (0.040)  0.025 (0.024)  0.053 (0.047)   0.120 (0.128)  0.069 (0.125)  0.213 (0.065)
LVM     0.055 (0.064)  0.027 (0.027)  0.083 (0.078)   0.191 (0.218)  0.042 (0.041)  0.473 (0.111)
RVC     0.039 (0.041)  0.021 (0.020)  0.058 (0.047)   0.127 (0.126)  0.076 (0.109)  0.223 (0.098)
WH      0.031 (0.035)  0.018 (0.018)  0.043 (0.043)   0.139 (0.091)  0.112 (0.093)  0.188 (0.060)

Binary classification:  TPR 0.975  FPR 0.060  Acc. 0.965       TPR 0.879  FPR 0.000  Acc. 0.906
Table 1: For Experiments 1 and 2, mean absolute error (MAE) for poor and good quality segmentations over individual classes and whole-heart (WH); standard deviations in brackets. Statistics from binary classification (threshold DSC = 0.70 [8]): true-positive (TPR) and false-positive (FPR) rates over the full DSC range, with classification accuracy (Acc).

4 Conclusion

Ensuring the quality of an automatically generated segmentation in a deployed image analysis pipeline in real-time is challenging. We have shown that we can employ Convolutional Neural Networks to tackle this problem with great computational efficiency and good accuracy.

We recognize that our networks are prone to learning features specific to assessing the quality of Random Forest segmentations. We can build on this by training the network with segmentations generated from an ensemble of methods. However, we must reiterate that the purpose of the framework in this study is to give an indication of the predicted quality and not a direct one-to-one mapping to the reference DSC. Currently, these networks will correctly predict whether a segmentation is ‘good’ or ‘poor’ on some threshold, but will not confidently distinguish between two segmentations of similar quality.

Our trained CNNs are insensitive to small regional or boundary differences in labelmaps which are of good quality. Thus, they cannot be used to assess the quality of a segmentation at fine scale. Again, this may be improved by a more diverse and granular training set. The labels for training the network in Experiment 1 are not easily available in most cases. However, by performing RCA, one can automatically obtain training labels for the network in Experiment 2, and this could be applied to segmentations generated with other algorithms. The cost of using data obtained with RCA is an increase in MAE. This is reasonable compared to the effort required to obtain a large, manually-labeled dataset.


RR is funded by KCL&Imperial EPSRC CDT in Medical Imaging (EP/L015226/1) and GlaxoSmithKline; VV by the Indonesia Endowment for Education (LPDP) Indonesian Presidential PhD Scholarship; KF is supported by The Medical College of Saint Bartholomew’s Hospital Trust. AL and SEP acknowledge support from the NIHR Barts Biomedical Research Centre and an EPSRC programme grant (EP/P001009/1). SN and SKP are supported by the Oxford NIHR BRC and the Oxford British Heart Foundation Centre of Research Excellence. This project is supported by the MRC (grant number MR/L016311/1). NA is supported by a Wellcome Trust Research Training Fellowship (203553/Z/Z). The authors SEP, SN and SKP acknowledge the British Heart Foundation (BHF) (PG/14/89/31194). BG received funding from the ERC under Horizon 2020 (grant agreement No 757173, project MIRA, ERC-2017-STG).


  • [1] Petersen, S.E., Aung, N., Sanghvi, M.M., Zemrak, F., Fung, K., Paiva, J.M., Francis, J.M., Khanji, M.Y., Lukaschuk, E., Lee, A.M., Carapella, V., Kim, Y.J., Leeson, P., Piechnik, S.K., Neubauer, S.: Reference ranges for cardiac structure and function using cardiovascular magnetic resonance (CMR) in caucasians from the UK biobank population cohort. Journal of Cardiovascular Magnetic Resonance 19(1) (feb 2017)
  • [2] Bosse, S., Maniry, D., Müller, K.R., Wiegand, T., Samek, W.: Deep Neural Networks for No-Reference and Full-Reference Image Quality Assessment. (1) (2016) 1–14
  • [3] Farzi, M., Pozo, J.M., McCloskey, E.V., Wilkinson, J.M., Frangi, A.F.: Automatic quality control for population imaging: A generic unsupervised approach. In Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W., eds.: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016, Cham, Springer International Publishing (2016) 291–299
  • [4] Zhong, E., Fan, W., Yang, Q., Verscheure, O., Ren, J.: Cross validation framework to choose amongst models and datasets for transfer learning. In Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M., eds.: Machine Learning and Knowledge Discovery in Databases, Berlin, Heidelberg, Springer Berlin Heidelberg (2010) 547–562

  • [5] Fan, W., Davidson, I.: Reverse testing. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD 2006, New York, New York, USA, ACM Press (2006) 147
  • [6] Kohlberger, T., Singh, V., Alvino, C., Bahlmann, C., Grady, L.: Evaluating Segmentation Error without Ground Truth. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2012. Springer Berlin Heidelberg (2012) 528–536
  • [7] Valindria, V.V., Lavdas, I., Bai, W., Kamnitsas, K., Aboagye, E.O., Rockall, A.G., Rueckert, D., Glocker, B.: Reverse Classification Accuracy: Predicting Segmentation Performance in the Absence of Ground Truth. IEEE Transactions on Medical Imaging (2017) 1–1
  • [8] Robinson, R., Valindria, V.V., Bai, W., Suzuki, H., Matthews, P.M., Page, C., Rueckert, D., Glocker, B.: Automatic Quality Control of Cardiac MRI Segmentation in Large-Scale Population Imaging. In Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S., eds.: Medical Image Computing and Computer Assisted Intervention - MICCAI 2017, Cham, Springer International Publishing (2017) 720–727