Biomedical image data are increasingly processed with automated image analysis pipelines, which employ a variety of tools to extract clinically useful information. It is important to understand the limitations of such pipelines and to assess the quality of the results being reported. This is a particular issue when we consider large-scale population imaging databases comprising thousands of images, such as the UK Biobank (UKBB) Imaging Study. There are often many modules in automated pipelines
where each may contribute to inaccuracies in the final output and reduce the overall quality of the analysis, e.g. intensity normalisation, segmentation, registration and feature extraction. On a large scale, it is infeasible to perform a manual, visual inspection of all outputs, and even more difficult to perform quality control (QC) within the pipeline itself. We break down this challenge and focus on the automated QC of image segmentation.
Image segmentation is the process of partitioning an image into several parts, where each part is a collection of pixels (or voxels) corresponding to a particular structure. The purpose of segmentation is to derive quantitative measures of these structures, e.g. calculating ventricular volume or vessel thickness. Automated segmentation is desirable to reduce the workload of this tedious, time-consuming and error-prone task. A number of such methods have been developed, ranging from basic region-growing techniques and graph cuts to more advanced algorithms involving machine learning.
Segmentation performance is traditionally evaluated on a labelled validation dataset, which is a subset of the data that the algorithm does not see during training. This evaluation is done using a series of metrics to compare the predicted segmentation against a reference ‘ground truth’ (GT). Popular metrics include volumetric overlap, surface distances and other statistical measures. Due to the lack of actual GT, manual expert annotations are used as the reference, despite inter- and intra-rater variability. Once a segmentation method is deployed in clinical practice, no such quantitative evaluation can be carried out routinely.
Evaluating the average performance of an algorithm on validation data is arguably less important than being able to assess the quality on a per-case basis, and it is crucial to identify cases where the segmentation has failed. We show that we can effectively predict the per-case quality of automated segmentations of 3D cardiac MRI (CMR) from the UKBB which enables fully automated QC in large-scale population studies and clinical practice.
In this article we will first present related work that attempts to address the problem of automated QC at large-scale. Our method and datasets are then described in detail before we present our results and discuss their implications.
Despite its practical importance, there is relatively little work on automatically predicting the performance of image analysis methods. Much of the prior work on automated quality control has focused on the quality of the images themselves. This focus on image quality assessment (IQA) is also true in the medical-imaging community [7, 8]. In the context of image segmentation, only a few methods exist, which we outline here.
Algorithms often rely on ‘labels’ to support their training. In our case, each label would indicate the quality of a segmentation, either as a categorical label, e.g. 0 for ‘poor’ and 1 for ‘good’, or as a continuous value such as a Dice similarity coefficient. In cases where such labelled data are scarce, Reverse Validation and Reverse Testing use labels generated by one model, trained on a subset of available data, to train another model which is evaluated on the remaining data. This is effectively cross-validation where the amount of labelled data is limited. In Reverse Testing, ‘some rules’ are created to assess the performance of, and rank, the different models. In our context, this would involve creating a segmentation quality model from a subset of MR scans, and their corresponding segmentations, which can then be tested on the remaining images. Different models would be created and tested in order to choose the best model. The difficulty with these methods is that we require all of the scans to be accurately segmented in order to train, and to evaluate, a good model. That is, we need a large, fully-annotated training dataset, which is often not available in our field. Additionally, Reverse Validation and Reverse Testing do not allow us to identify individual cases where a segmentation may have failed; instead, they focus on the segmentation method as a whole.
In a method proposed by Kohlberger et al., the quality of segmentations is assessed on a per-case basis using machine learning. The authors used 42 different hand-crafted statistics about the intensity and appearance of multi-organ computed tomography (CT) scans to inform their model. Whilst this method achieved good performance metrics and an accuracy of around 85%, it requires a large amount of training data containing both good and bad segmentations, which is non-trivial to obtain.
In this work, we adopt the recently proposed approach of Reverse Classification Accuracy (RCA). Unlike Reverse Validation and Reverse Testing, RCA can accurately predict the quality of a segmentation on a case-by-case basis while requiring only a relatively small set of accurately segmented reference images. In RCA, the predicted segmentation being assessed is used to create a small model to re-segment the reference images, for which segmentations are available. If at least one image in the reference set is re-segmented well, the predicted segmentation that we wish to assess must have been of good quality. We employ RCA to perform segmentation quality analysis on a per-case basis while requiring only a small set of reference images and segmentations.
Methods and Data
Our purpose is to have a system that is able to predict the per-case quality of a segmentation produced by any algorithm deployed in clinical practice. We want our method to not only give us a prediction of the quality of the segmentation, but to be able to identify if that segmentation has failed. To this end, we employ RCA which will give a prediction about the quality of individual segmentations.
Reverse Classification Accuracy
In RCA the idea is to build a model, also known as an ‘RCA classifier’, solely using one test image and its predicted segmentation which acts as pseudo ground truth. This classifier is then evaluated on a reference dataset for which segmentations are available. There are two possible outcomes to this procedure:
Case 1: assuming that the predicted segmentation is of good quality, the created model should be able to segment at least one of the reference images with high accuracy. This is likely to be a reference image which is similar to the test image.
Case 2: if none of the reference images are segmented successfully, then the predicted segmentation is likely to be of poor quality.
These assumptions are valid if the reference dataset is representative of the test data. This is usually the case in the context of machine learning, where the reference data could have been used in the first place to train the automated method for which we want to predict test performance. If the test data were very different, the automated method would in any case not perform well, and the RCA scores would reflect this. It is a great advantage that the same reference dataset can be used to train an automated segmentation method and afterwards serve as the reference database enabling prediction of performance after deployment of the segmentation method.
The performance of the RCA classifier on the reference set is measured with any chosen quality metric, e.g., the Dice similarity coefficient (DSC). The highest score among all reference images determines the quality estimate for the predicted segmentation obtained for a test image.
The original work on RCA explored a variety of possible classifiers that could be trained on a single test image and its segmentation, including Atlas Forests (AF) and Convolutional Neural Networks (CNNs). In this context, and throughout this paper, an ‘atlas’ refers to an image-segmentation pair whose segmentation has been verified by a manual annotator. In Valindria’s paper, a simple single-atlas registration classifier outperformed both the AF and CNN approaches in predicting segmentation accuracy. For this reason, we chose this simple approach for the model in our work. Registration is the process of aligning two or more images based upon similar content within them, e.g. structures or intensities. Rigid registration restricts the images to move only by linear translations and rotations. More complex non-rigid registration methods exist that allow for differences in scale between the images and for more complex distortions. The single-atlas registration classifier in RCA works by performing non-rigid registration of the test image to a set of individual reference images. The resulting transformations are then used to warp the test segmentation. This yields a set of warped segmentations which are quantitatively compared to the reference segmentations. The overlap between the pairs is calculated as the DSC, whilst boundary agreement is computed using surface-distance metrics. The best metric values among the reference set are taken to be the prediction for the quality of the test segmentation.
We chose to modify the single-atlas registration classifier from that used in Valindria et al.’s proposal of the RCA method. Processing and modifying the test segmentation is not usually desirable, as this may introduce discretisation artefacts, adding false positives into the binary labelmap. We therefore perform the single-atlas registration in reverse: we register the reference images to the test image and use these transformations to warp the reference segmentations. This results in a set of warped segmentations in the test-image space, which are then compared to the test segmentation. Figure 1 gives an overview of RCA as applied in our study. We now set out our framework more formally.
For the RCA reference images, we use a set of cardiac atlases with reference segmentations. We have a test set of images with automatically generated predicted segmentations whose quality we would like to assess. If GT segmentations exist for the test images, one can evaluate the accuracy of these quality assessments. Using RCA, we estimate the quality of the predicted segmentations and compare the estimates to the real quality with respect to the GT.
We take each test image and its predicted segmentation in turn. To apply RCA, all reference images are first registered to the test image by performing a rigid registration in the form of a centre-of-mass (CoM) alignment. Initial versions of our work used landmark registration at this stage, but we now opt for CoM alignment to reduce computational cost. We then perform non-linear registration of each aligned reference image to the test image to obtain a set of warped reference images. The same transformations are used to warp the GT reference segmentations, giving a set of warped reference segmentations. Each warped reference segmentation is compared against the predicted segmentation by evaluating the set of metrics detailed below. The best value of each metric over all warped reference segmentations is taken to be the prediction of segmentation accuracy for the test image. In our validation studies, we can compute the real metrics by comparing the predicted segmentation with its GT.
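The per-case procedure above can be summarised in a short sketch. This is an illustrative implementation only: the non-rigid registration step is abstracted behind a caller-supplied function (the registration toolkit itself is not specified here), and the metric defaults to a simple Dice overlap.

```python
import numpy as np

def dice(a, b):
    # Dice similarity coefficient between two binary masks.
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def rca_predict(test_img, test_seg, references, register_and_warp, metric=dice):
    """Predict the quality of `test_seg` via RCA.

    references:        list of (ref_img, ref_seg) atlas pairs.
    register_and_warp: callable that non-rigidly registers ref_img to
                       test_img and returns ref_seg warped into the
                       test-image space (a stand-in for the actual
                       registration step).
    """
    scores = [metric(register_and_warp(ref_img, ref_seg, test_img), test_seg)
              for ref_img, ref_seg in references]
    # The best score over the reference set is the RCA quality estimate.
    return max(scores)
```

In practice, `register_and_warp` would wrap the deformable registration of each CoM-aligned reference image to the test image, returning its warped GT segmentation.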
Evaluation of Predicted Accuracy
The segmentation quality metrics predicted with RCA include the Dice similarity coefficient (DSC), mean surface distance (MSD), root-mean-square surface distance (RMS) and Hausdorff distance (HD). For two segmentations, A and B, the DSC is a measure of overlap given by DSC(A, B) = 2|A ∩ B| / (|A| + |B|). The surface distance between a point on the surface of A and the surface of B is given by the minimum of the Euclidean norm over all points on the surface of B. The total surface distance is the sum of the surface distances for all points in A. We do not assume symmetry in these calculations, so the surface distances are also calculated from B to A. Taking the mean over all points gives the MSD. The RMS is calculated by squaring the surface distances, averaging, and taking the square root. Finally, the HD is taken to be the maximum surface distance.
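For binary masks, these metrics can be computed directly. The sketch below, which assumes unit voxel spacing by default, extracts surface voxels via morphological erosion and evaluates DSC, MSD, RMS and HD as defined above; a production implementation would pass the true voxel spacing and use a distance transform for efficiency.

```python
import numpy as np
from scipy import ndimage

def surface_points(mask, spacing):
    # Surface voxels are foreground voxels with at least one background neighbour.
    mask = mask.astype(bool)
    surface = mask & ~ndimage.binary_erosion(mask)
    return np.argwhere(surface) * np.asarray(spacing)

def dsc(a, b):
    # Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|).
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def surface_distances(a, b, spacing=(1.0, 1.0)):
    # Symmetric: distances from A's surface to B's, and from B's to A's.
    pa, pb = surface_points(a, spacing), surface_points(b, spacing)
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return np.concatenate([d.min(axis=1), d.min(axis=0)])

def msd(a, b):
    return surface_distances(a, b).mean()                   # mean surface distance

def rms(a, b):
    return np.sqrt((surface_distances(a, b) ** 2).mean())   # RMS surface distance

def hd(a, b):
    return surface_distances(a, b).max()                    # Hausdorff distance
```

The pairwise-distance matrix is adequate for small masks; for full-resolution 3D volumes a Euclidean distance transform over each surface is the usual choice.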
For each test image, we report the evaluation metrics for each class label: left-ventricular (LV) cavity, LV myocardium (LVM) and right-ventricular (RV) cavity (RVC). We incorporate the voxels of the papillary muscles into the LV cavity class. The right-ventricular myocardium is difficult to segment because it is thin; it is therefore seldom seen in SAX CMR segmentations and is not considered in this paper. For each evaluation metric (DSC and surface distances), we could report two different average values: first, a whole-heart average obtained by combining all class labels into a single ‘whole-heart’ (WH) class or, second, the mean across the individual class scores. The WH-class average is usually higher because a voxel attributed to an incorrect class will reduce the mean calculated across the classes, but will still be considered correct in the single WH-class case.
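The difference between the two averages can be made concrete with a toy example. In the sketch below (the label values 1, 2 and 3 for LV cavity, LVM and RVC are illustrative), a voxel assigned to the wrong cardiac class lowers the per-class mean but leaves the whole-heart score unaffected:

```python
import numpy as np

def dice(a, b):
    # Dice similarity coefficient between two binary masks.
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def per_class_vs_wholeheart(pred, gt, labels=(1, 2, 3)):
    # labels are illustrative: 1 = LV cavity, 2 = LVM, 3 = RVC.
    per_class_mean = np.mean([dice(pred == l, gt == l) for l in labels])
    # Whole-heart: any heart label counts as foreground.
    whole_heart = dice(np.isin(pred, labels), np.isin(gt, labels))
    return per_class_mean, whole_heart
```

For instance, a prediction that mislabels one LVM voxel as RVC still achieves a whole-heart DSC of 1.0, while its per-class mean drops below 1.0.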
We perform three investigations in this work, which are summarised in Table 1: A) an initial small-scale validation study on 400 test segmentations of 100 images from an internal cardiac atlas dataset; B) a large-scale validation study on a further 4,805 UKBB images with manual ground truth; and C) a real-world application to a large set of 7,250 UKBB 3D CMR segmentations.
The reference image set is the same in all of our studies. We use 100 2D-stack short-axis (SAX) end-diastolic (ED) CMR scans that were automatically segmented and validated by expert clinicians at Hammersmith Hospital, London. Note that the reference set is distinct from all other datasets used. Compared with data from the UKBB, the reference images are of higher in-plane resolution and have a smaller slice thickness of 2 mm. These images are not used for any purpose other than this reference set. When choosing a reference set, one should ensure that it is representative of the dataset on which it is being used, i.e. it should be of the same domain (SAX CMR in this case) and large enough to capture some variability across the dataset. A reference set that is too small may underestimate the RCA prediction, though we argue that this may be preferable to overestimating the quality of a segmentation. Conversely, too large a reference set will cause a significant lengthening of RCA execution time. We have explored the effect of the RCA reference set size on prediction accuracy as part of our evaluation, which we present in our Discussion.
Experiment A: Initial Validation Study
Data: We validate RCA on predicting cardiac image segmentation quality using 100 manually verified image-segmentation pairs (different from the reference dataset). Each atlas contains a SA ED 3D (2D-stack) CMR and its manual segmentation. The images have a pixel-resolution of mm and span voxels. Each manual segmentation identifies voxels belonging to the LV cavity, LV myocardium and RV cavity separating the heart from the background class.
For validation, we generate automatic segmentations of our atlases with varying quality. We employ Random Forests (RFs) trained on the same set of 100 cardiac atlases used for testing RCA in this experiment. RFs allow us to produce a variety of test segmentations with intentionally degraded quality by limiting the depth of the trees at test time. We obtain 4 sets of 100 segmentations by using depths of 5, 20, 30 and 40. Thus, a total of 400 segmentations are used in our initial validation study.
Evaluation: We perform RCA on all 400 segmentations to yield predictions of segmentation quality. The manual segmentations allow us to evaluate the real metrics for each automated segmentation. We compare these to the quality predicted by RCA. To identify individual cases where segmentation has failed, we implement a simple classification strategy similar to that in Valindria’s work. We consider a two-group binary classification where DSC scores in the range [0.0, 0.7) are considered ‘poor’ and those in the range [0.7, 1.0] are considered ‘good’. These boundaries are somewhat arbitrary and would be adjusted for a particular use-case. Other strategies could be employed on a task-specific basis, e.g. formulation as outlier detection with further statistical measures. The thresholding approach allows us to calculate true (TPR) and false (FPR) positive rates for our method, as well as an overall accuracy, from the confusion matrix.
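Given arrays of predicted and real DSC values, this binary QC evaluation reduces to a thresholded confusion matrix. A minimal sketch, treating ‘good’ segmentations as the positive class and the 0.7 boundary as a configurable parameter:

```python
import numpy as np

def qc_rates(pred_scores, real_scores, threshold=0.7):
    # 'good' (score >= threshold) is the positive class.
    pred_good = np.asarray(pred_scores) >= threshold
    real_good = np.asarray(real_scores) >= threshold
    tp = np.sum(pred_good & real_good)
    fp = np.sum(pred_good & ~real_good)
    fn = np.sum(~pred_good & real_good)
    tn = np.sum(~pred_good & ~real_good)
    tpr = tp / (tp + fn)                      # sensitivity to 'good' cases
    fpr = fp / (fp + tn)                      # 'poor' cases misjudged as 'good'
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return tpr, fpr, accuracy
```

Note that for surface-distance metrics, where lower is better, the comparisons would flip (e.g. a 2.0 mm MSD threshold would use `<=` for ‘good’).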
Experiment B: Large-scale Validation on Manually-segmented UKBB Data
In this experiment we demonstrate that RCA is robust for employment in large-scale studies, and indeed produces accurate predictions of segmentation quality on an individual basis. As part of our collaboration under UK Biobank Application 2964 we have access to 4,805 CMR images with manually drawn contours.
Data: In the context of UKBB data, ‘3D’ means a stack of 2D acquisitions with a slice thickness of 8.0 mm and a slice gap of 2 mm. The CMR scans have an in-plane resolution of mm and span around pixels per slice. The number of slices per scan varies between 4 and 14, with the majority (89%) having 9-12 slices.
Petersen and colleagues [16, 17] manually segmented all slices of each 3D cardiac MRI scan available under the data access application. Several annotators were employed following a standard operating procedure to generate almost 5,000 high-quality segmentations. With these manual segmentations acting as GT we directly compare predicted segmentation scores with real scores at large scale.
We use the same RF trained in Experiment A to perform automated segmentations, at tree depths chosen randomly, across the 4,805 scans, yielding segmentations of varying quality. In addition to our RF segmentations, we evaluate RCA with 900 segmentations generated by a recent deep-learning-based approach. As part of UKBB Application 2964, Bai et al. trained a CNN on 3,900 manually segmented images. The remaining 900 were then automatically segmented using the trained network. The results of Bai et al.’s CNN approach reflect the state of the art in automated 3D cardiac MR segmentation, with an accuracy matching the performance of human experts.
Evaluation: We perform RCA on all 4,805 RF segmentations to yield predictions of segmentation quality. We also perform RCA separately on the 900 CNN segmentations produced by a state-of-the-art deep learning approach. With the availability of GT manual segmentations, we can evaluate this experiment in the same way as Experiment A.
Experiment C: Automatic Quality Control in the UKBB Imaging Study
Having evaluated RCA in the previous two experiments, this experiment mimics how our approach would behave in a real-world application where the GT is unavailable. We apply RCA to segmentations of CMR images from the UKBB.
Data: In total, 7,250 cardiac MR images were available to us through the UKBB resource. Each image has been automatically segmented using a multi-atlas segmentation approach. As part of a genome-wide association study (GWAS), each automatic segmentation has been checked manually to confirm segmentation quality. As there is no GT segmentation, we rely on manual QC scores for these segmentations assessed by a clinical expert. The manual QC is based only on visual inspection of the basal, mid and apical layers. For each layer, a score between 0 and 2 is assigned based on the quality of only the LV myocardium segmentation. The total QC score is thus between 0 and 6, where a 6 would be considered a highly accurate segmentation. Scores for individual layers were not recorded. Where the UKBB images had a poor field-of-view (FOV), the segmentations were immediately discarded from use in the GWAS study; we have given these images a score of -1. For the GWAS study, a poor FOV meant any image in which the entire heart was not visible. We expect that, despite the poor FOV of these images, the segmentations themselves may still be of good quality, as the algorithms can still see most of the heart. Out of the 7,250 segmented images, 152 have a bad FOV (score -1) and 42 have an obviously poor segmentation (score 0). There are 2, 14, 44, 300, 2866 and 3830 images with QC scores 1 to 6 respectively. This investigation explores how well RCA-based quality predictions correlate with these manual QC scores.
Evaluation: We perform RCA on all 7,250 segmentations to yield predictions of segmentation quality for the LVM. In the absence of GT segmentations, we are unable to perform the same evaluation as in Experiments A and B. In this case, we determine the correlation between the predicted scores from RCA (for the LV myocardium) and the manual QC scores. A visual inspection of individual cases is also performed across the quality categories.
Table 1: Summary of the three investigations.

| Experiment | Dataset | Segmentations | Manual GT | Method under test |
| A | Internal cardiac atlases | 400 | Yes | RF |
| B | UKBB-2964 | 4,805 | Yes | RF and CNN |
| C | UKBB-2964 | 7,250 | No | Multi-atlas |
Here we present results from our three investigations: (A) the initial small-scale validation study; (B) application to a large set of UKBB cardiac MRI with visual QC scores; and (C) a further large-scale validation study on UKBB with manual expert segmentations.
Quantitative results for the experiments are presented in each section. Figure 2 demonstrates the additional qualitative inspection that can be performed on a per-case basis during RCA. The top row of Figure 2 shows the mid-ventricular slice of an ED CMR scan and an RF-generated segmentation which is under test. An overlay of the two is also shown alongside the manual reference segmentation, which is not available in practice. Below this, an array of further panels is shown. Each of these panels presents one of the 100 reference images used, its corresponding reference segmentation and the result of warping the segmentation-under-test (top panel, second image) to this reference image. The calculated DSC between the reference image’s GT and the warped segmentation is displayed above each panel. The array shows the reference images with the highest (top-left) and lowest (bottom-right) calculated DSC, with the remaining panels showing DSCs that are uniformly spaced amongst the remaining 98 reference images. We can see in this example that there is a large range of predicted DSC values, but only the maximum prediction, highlighted in red, is used as the prediction of segmentation quality. For the example in Figure 2, we show a ‘good’ quality segmentation-under-test for which we predict a DSC of 0.904 using RCA. The real DSC between the segmentation-under-test and the GT manual segmentation is 0.944. Note that in this case these values are calculated for the ‘whole heart’, where the individual class labels are merged into one. These values are shown above the top panel along with the DSC calculated on a per-class basis.
For considerations of space, we do not show more visual examples but note that a visualisation as in Figure 2 could be produced on a per-case basis in a deployed system aiding interpretability and visual means for manual validation by human experts.
(A) Initial Validation Study
A summary of the results is shown in Table 2. We observe low mean absolute error (MAE) across all evaluation metrics and all class labels. The scatter plots of real and predicted scores in Figure 3 illustrate the very good performance of RCA in predicting segmentation quality scores. We also find that, from the 400 test segmentations, RCA is able to classify ‘good’ and ‘poor’ segmentations with an accuracy of 99% at a DSC threshold of 0.7. From 171 poor segmentations at this threshold, 166 could be correctly identified by RCA, i.e. 97.1%. 100% of good-quality segmentations were correctly labelled. Additionally, we find a binary classification accuracy of 95% when applying a threshold of 2.0 mm on the MSD. From 365 poor segmentations at this threshold, 348 could be correctly identified by RCA, i.e. 95.3%. Similarly, 31 from 35 (88.6%) good-quality segmentations were correctly labelled. For all evaluation metrics, there is a strong, positive linear relationship between predicted and real values. Further analysis of our data shows increasing absolute error in each metric as the real score gets worse, e.g. the error for MSD increases with increasing surface distance. This correlates larger MAE with lower segmentation quality. In addition, when we consider only those segmentations where the real metric is 30 mm or less, the MAE drops significantly to 0.65, 1.71 and 6.78 mm for MSD, RMS and HD respectively. We are not concerned with greater errors for poor segmentations, as they are still likely to be identified by RCA as having failed.
(B) Large-scale Validation with Manual GT on UKBB
Results for the RF segmentations are shown in Table 3. We report 95% binary classification accuracy with a DSC threshold of 0.7 and low MAE on the DSC. From 589 poor segmentations at this threshold, 443 could be correctly identified by RCA, i.e. 75.2%. Similarly, 4139 from 4216 (98.2%) good-quality segmentations were correctly labelled. Additionally, we find a binary classification accuracy of 98% when applying a threshold of 2.0 mm on the MSD. From 2497 poor segmentations at this threshold, 2429 could be correctly identified by RCA, i.e. 97.3%. Similarly, 2270 from 2308 (98.3%) good-quality segmentations were correctly labelled. The true positive rates (TPR) are high across the classes, showing that RCA is able to correctly and consistently identify ‘good’ quality segmentations. The MSD-based false positive rates (FPR) are lower than those based on DSC, indicating that the MSD is more discriminative for ‘poor’ quality segmentations and misclassifies them less often than the DSC does. We identify only two instances where the RCA predictions do not conform to the overall trend and predict a much higher value than the real DSC. On inspection, we find that the GT for these segmentations was missing mid-slices, causing the real DSC to drop. These points can be seen in the upper-left quadrant of Figure 4. The figure also shows that, over all metrics, there is a high correlation between predicted and real quality metrics. This is very much comparable to the results from our initial validation study (A) in Figure 3. The strong relationship between the predicted quality metrics from RCA and the equivalent scores calculated with respect to the manual segmentations demonstrates concretely that RCA is capable of correctly identifying, on a case-by-case basis, segmentations of poor quality in large-scale imaging studies.
On the CNN segmentations, we report 99.8% accuracy in binary classification for the whole-heart class. With a DSC threshold set at 0.7, RCA correctly identified 898 from 900 good-quality segmentations with 2 false-negatives. A visualization of this can be seen in the top panel of Figure 5 where the predicted and real DSC can be seen clustered in the high-quality corner of each metric’s plot (upper-right for DSC and lower-left for surface-distance metrics). This reflects the high quality segmentations of the deep learning approach which have been correctly identified as such using RCA. Table 4 shows the detailed statistics for this experiment.
We note that the individual class accuracy for the LV myocardium is lower in the CNN case when the DSC is used as the quality metric. We show the results for this class in the bottom panel of Figure 5. Segmentors can have difficulty with this class due to its more complex shape. From the plotted points we see that all cases fall into a similar cluster to the average WH case, but the RCA score under-predicts the real DSC. This exemplifies a task-specific setting for how RCA would be used in practice. In this case one cannot rely only on the DSC to predict the quality of the segmentation, so the MSD could provide a more appropriate quality prediction.
(C) Quality Control on 7,250 UK Biobank Images
Figure 6 shows the relationship between manual QC scores and the predicted DSC, MSD, RMS and HD obtained from RCA. Note, these predictions are for the LV myocardium and not the overall segmentation as this class was the focus of the manual QC procedure. Manual QC was not performed for the other classes.
Figure 6 also shows a sample of segmentations with manual QC scores of 0, 1, 5 and 6 for the LV myocardium. With a score of 0, example ‘A’ must have a ‘poor’ quality segmentation of the LV myocardium at the basal, mid and apical layers. Example ‘B’ shows relatively low surface-distance metrics and a low DSC; we see this visually as the boundary of the myocardium lies in the expected region but is incomplete in all slices. This segmentation has been given a score of 1 because the mid-slice is well segmented while the rest is not, which is correctly identified by RCA. In example ‘C’, the segmentation of the LV myocardium is clearly not good with respect to the image, yet it has been given a manual QC score of 5. Again, RCA is able to pick up such outliers by predicting a lower DSC. The final example ‘D’ displays agreement between the high predicted DSC from RCA and the high manual QC score. These examples demonstrate RCA’s ability to correctly identify both good and poor quality segmentations when performing assessments over an entire 3D segmentation. They also demonstrate the limitations of manual QC and the success of RCA in identifying segmentation failure on a per-case basis.
Creating a set of manual QC scores for over 7,200 images is a laborious task, but it has provided worthwhile evidence for the utility of RCA in large-scale studies. It is clear, however, that the 3-layer inspection approach with a single rater has limitations. First, inspecting all layers would be preferable but is highly time-consuming; second, with multiple raters, averaging or majority voting could be employed to reduce human error.
We should note that RCA is unlikely to mimic the exact visual manual QC process, and nor should it, as it naturally provides a different, more comprehensive assessment of segmentation quality. The manual QC is a rather crude assessment of segmentation quality; as such, we did not perform a direct, quantitative comparison using the visual QC categories, but rather wanted to demonstrate that there is a general correlation between manual QC and the predicted RCA score.
We have shown that it is possible to assess the quality of individual cardiac MR segmentations at large-scale and in the absence of ground truth. Previous approaches have primarily focused on evaluating overall, average performance of segmentation methods or required large sets of pre-annotated data of good and bad quality segmentations for training a classifier. Our method is well suited for use in image-analysis pipelines and clinical workflows where the quality of segmentations should be assessed on a per-case basis. We have also shown that RCA can provide predictions on a per-class basis. Note that our manually labelled dataset did not include the RV Myocardium as a label and therefore has been omitted from our study.
The RCA validation process was carried out on 8-core Intel i7 3.6 GHz machines. The whole process for a single test segmentation - including 100 reference image registrations, the warping of 100 reference segmentations and the metric evaluations - took on average 11 minutes, making it suitable for background processing in large-scale studies and clinical practice. However, the per-case runtime is a limitation, as it currently does not allow immediate feedback and prohibits applications with real-time constraints. For example, one could envision a process where cardiac MR scans are segmented immediately after acquisition, and feedback on the quality would be required while the patient is still in the scanner. For this, the computation time of RCA would need to be reduced, possibly through an automatic selection of a subset of reference images. We have reported preliminary results on using a deep learning approach to speed up the process. With a real-time RCA framework, the method could be used to identify challenging cases for CNN-based segmentors, where the RCA feedback could be used to improve the segmentation algorithm.
As noted earlier, using a subset of the reference set could help to optimise the runtime of RCA predictions. To better understand the effect of reference set size on prediction accuracy, we performed an empirical evaluation using the data from Experiment B. We took the 4,805 automated segmentations and their manual GT and performed RCA using randomly selected subsets of the full reference set of 100 image-segmentation pairs. For each of the sizes 10, 15, 25, 35, 50, 65 and 75, five different randomly selected subsets were created and used to obtain RCA predictions on the 4,805 images. Figure 7 shows the mean accuracy computed across the five runs for each reference set size; error bars indicate the highest and lowest accuracy achieved across the five runs. Accuracy is computed using the same DSC threshold of 0.7 as in Experiment B. The figure shows that the mean accuracy increases with the number of reference images, while the error bars shrink as the reference set grows. As the reference set grows, greater variability in the images is captured, which allows the RCA process to become more accurate. Notably, even with small reference sets of about 20 images, a high accuracy of more than 90% is obtained.
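The evaluation loop above can be sketched as follows. This is a minimal illustration, not the actual implementation: `predict_fn` is a hypothetical stand-in for the full RCA prediction (registration, label propagation and metric evaluation), assumed to be supplied by the caller, and the accuracy is the agreement of predicted and true DSC at the 0.7 good/poor threshold.

```python
import random
from statistics import mean


def binary_agreement(pred_dsc, true_dsc, threshold=0.7):
    """Fraction of cases where predicted and true DSC fall on the
    same side of the good/poor threshold (0.7, as in Experiment B)."""
    hits = sum((p >= threshold) == (t >= threshold)
               for p, t in zip(pred_dsc, true_dsc))
    return hits / len(pred_dsc)


def subset_evaluation(reference_set, predict_fn, true_dsc,
                      sizes=(10, 15, 25, 35, 50, 65, 75),
                      runs=5, seed=42):
    """For each subset size, draw `runs` random reference subsets,
    run the caller-supplied RCA predictor on each subset and record
    the (mean, min, max) agreement with the ground-truth DSC."""
    rng = random.Random(seed)
    results = {}
    for size in sizes:
        accs = []
        for _ in range(runs):
            subset = rng.sample(reference_set, size)
            pred = predict_fn(subset)  # predicted DSC per test case
            accs.append(binary_agreement(pred, true_dsc, threshold=0.7))
        results[size] = (mean(accs), min(accs), max(accs))
    return results
```

The min/max per size correspond to the error bars in Figure 7; in practice `predict_fn` would be the expensive step dominating the runtime discussed above.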
Although RCA can give a good indication of the real DSC score for an individual segmentation, an accurate one-to-one mapping between the predicted and real DSC has not been achieved. However, we have shown that the method can confidently differentiate between ‘good’ and ‘poor’ quality segmentations based on an application-specific threshold, which can be chosen depending on the application’s requirements for what qualifies as a ‘good’ segmentation. Failed segmentations could be re-segmented with different parameters, regenerated with alternative methods, discarded from further analyses or, more likely, sent to a user for manual inspection. Additionally, whilst RCA has been shown to be robust for cardiac anatomy, it would need to be re-evaluated for use in other anatomical regions.
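Operationally, such threshold-based triage of the RCA-predicted DSC might look like the following sketch. The review margin around the threshold is our own illustrative addition, not part of the RCA method, which uses a single application-specific cut-off.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    MANUAL_REVIEW = "manual review"
    RESEGMENT = "re-segment"


def triage(predicted_dsc, good_threshold=0.7, review_margin=0.05):
    """Route a segmentation by its RCA-predicted DSC.

    Scores comfortably above the threshold are accepted; scores
    within a small (hypothetical) margin of it are flagged for manual
    inspection; clear failures are sent back for re-segmentation.
    """
    if predicted_dsc >= good_threshold + review_margin:
        return Action.ACCEPT
    if predicted_dsc >= good_threshold - review_margin:
        return Action.MANUAL_REVIEW
    return Action.RESEGMENT
```

With `review_margin=0`, this reduces to the binary good/poor decision used in our experiments.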
Reverse classification accuracy had previously been shown to effectively predict the quality of whole-body multi-organ segmentations. We have successfully validated the RCA framework on 3D cardiac MR, demonstrating the robustness of the methodology to different anatomy. RCA has been successful in identifying poor-quality image segmentations as measured by DSC, MSD, RMS and HD, achieving low MAE against all of these metrics. It has also produced outcomes comparable to a manual quality control procedure on a large database of 7,250 images from the UKBB. We have shown further success in accurately predicting quality metrics on 4,805 segmentations from Petersen et al., for which manual segmentations were available for evaluation. Predicting segmentation accuracy in the absence of ground truth is a step towards fully automated QC in image analysis pipelines.
Our contributions to the field are three-fold: 1) a thorough validation of RCA for the application of cardiac MR segmentation QC, where our results indicate highly accurate predictions of segmentation quality across various metrics; 2) a feasibility study of using RCA for automatic QC in large-scale studies, where RCA predictions correlate with a set of manual QC scores and enable outlier detection in a large set of 7,250 cases; and 3) a large-scale validation on 4,805 cardiac MR images from the UKBB. Furthermore, we have done this without the need for a large, labelled dataset, and we can predict segmentation quality on a per-case basis.
CMR: cardiovascular magnetic resonance; LV: left ventricle; RV: right ventricle; ED: end-diastole; RCA: reverse classification accuracy; GT: ground truth; DSC: Dice similarity coefficient; MSD: mean surface distance; RMS: root-mean-squared surface distance; HD: Hausdorff distance; MAE: mean absolute error; RF: random forest; CNN: convolutional neural network; UKBB: UK Biobank; 3D: 3-dimensional; MR: magnetic resonance; QC: quality control; GWAS: genome-wide association study; FOV: field of view; TPR: true-positive rate; FPR: false-positive rate; AF: atlas forest
Ethics approval and consent to participate
The UKBB has approval from the North West Research Ethics Committee (REC reference: 11/NW/0382). All participants have given written informed consent.
Availability of data and materials
The imaging data and manual annotations were provided by the UKBB Resource under Application Number 2964. Part (B) was conducted with data obtained through Application Number 18545. Researchers can apply to use the UKBB data resource for health-related research in the public interest. Python code for performing RCA as implemented in this work is in the public repository: https://github.com/mlnotebook/RCA. Example sets of reference and test images are provided as links in the repository.
Competing interests
Steffen E. Petersen provides consultancy to Circle Cardiovascular Imaging Inc. (Calgary, Alberta, Canada). Ben Glocker receives research funding from HeartFlow Inc. (Redwood City, CA, USA).
RR and BG conceived and designed the study; RR performed implementations, data analysis and wrote the manuscript. VV developed the original RCA framework. WB provided the automated segmentations that were given QC scores by HS. OO provided landmarks used in the original version of this work. SN, SKP and SEP are overall consortium leads for UKBB access application 2964 and were responsible for the conceptualisation of creating a CMR segmentation reference standard. Data curation for application 2964 was performed by SEP, NA, AML and VC, and manual contours of 5,000 CMR scans were produced by MMS, NA, JMP, FZ, KF, EL, VC and YJK. PM and DR provided advice during model development and on clinical applications.
RR is funded by both the King’s College London & Imperial College London EPSRC Centre for Doctoral Training in Medical Imaging (EP/L015226/1) and GlaxoSmithKline; VV by the Indonesia Endowment for Education (LPDP) Indonesian Presidential PhD Scholarship; and HS by a Research Fellowship from the Uehara Memorial Foundation. This work was also supported by the following institutions: KF is supported by The Medical College of Saint Bartholomew’s Hospital Trust, an independent registered charity that promotes and advances medical and dental education and research at Barts and The London School of Medicine and Dentistry. AL and SEP acknowledge support from the NIHR Barts Biomedical Research Centre and from the “SmartHeart” EPSRC programme grant (EP/P001009/1). SN and SKP are supported by the Oxford NIHR Biomedical Research Centre and the Oxford British Heart Foundation Centre of Research Excellence. This project was enabled through access to the MRC eMedLab Medical Bioinformatics infrastructure, supported by the Medical Research Council (grant number MR/L016311/1). NA is supported by a Wellcome Trust Research Training Fellowship (203553/Z/Z). The authors SEP, SN and SKP acknowledge the British Heart Foundation (BHF) for funding the manual analysis to create a cardiovascular magnetic resonance imaging reference standard for the UKBB imaging resource in 5,000 CMR scans (PG/14/89/31194). PMM gratefully acknowledges support from the Edmond J. Safra Foundation and Lily Safra, the Imperial College Healthcare Trust Biomedical Research Centre, the EPSRC Centre for Mathematics in Precision Healthcare, the UK Dementia Research Institute and the MRC. BG received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 757173, project MIRA, ERC-2017-STG).
This work was carried out under UKBB Applications 18545 and 2964. The authors wish to thank all UKBB participants and staff.
-  Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., Liu, B., Matthews, P., Ong, G., Pell, J., Silman, A., Young, A., Sprosen, T., Peakman, T., Collins, R.: UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLoS Medicine 12(3), 1–10 (2015). doi:10.1371/journal.pmed.1001779
-  Shariff, A., Kangas, J., Coelho, L.P., Quinn, S., Murphy, R.F.: Automated Image Analysis for High-Content Screening and Analysis. Journal of Biomolecular Screening 15(7), 726–734 (2010). doi:10.1177/1087057110370894
-  de Bruijne, M.: Machine learning approaches in medical image analysis: From detection to diagnosis. Medical Image Analysis 33, 94–97 (2016). doi:10.1016/j.media.2016.06.032
-  Bai, W., Sinclair, M., Tarroni, G., Oktay, O., Rajchl, M., Vaillant, G., Lee, A.M., Aung, N., Lukaschuk, E., Sanghvi, M.M., Zemrak, F., Fung, K., Paiva, J.M., Carapella, V., Kim, Y.J., Suzuki, H., Kainz, B., Matthews, P.M., Petersen, S.E., Piechnik, S.K., Neubauer, S., Glocker, B., Rueckert, D.: Human-level CMR image analysis with deep fully convolutional networks. arXiv:1710.09289v3
-  Crum, W.R., Camara, O., Hill, D.L.G.: Generalized overlap measures for evaluation and validation in medical image analysis. IEEE Transactions on Medical Imaging 25(11), 1451–1461 (2006). doi:10.1109/TMI.2006.880587
-  Taha, A.A., Hanbury, A.: Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC medical imaging 15, 29 (2015). doi:10.1186/s12880-015-0068-x
-  Carapella, V., Jiménez-Ruiz, E., Lukaschuk, E., Aung, N., Fung, K., Paiva, J., Sanghvi, M., Neubauer, S., Petersen, S., Horrocks, I., Piechnik, S.: Towards the Semantic Enrichment of Free-Text Annotation of Image Quality Assessment for UK Biobank Cardiac Cine MRI Scans. In: MICCAI Workshop on Large-scale Annotation of Biomedical Data and Expert Label Synthesis (LABELS), pp. 238–248. Springer, Cham (2016). doi:10.1007/978-3-319-46976-8_25
-  Zhang, L., Gooya, A., Dong, B., Hua, R., Petersen, S.E., Medrano-Gracia, P., Frangi, A.F.: Automated Quality Assessment of Cardiac MR Images Using Convolutional Neural Networks. In: Tsaftaris, S.A., Gooya, A., Frangi, A.F., Prince, J.L. (eds.) Medical Image Computing and Computer-Assisted Intervention – SASHIMI 2016. Lecture Notes in Computer Science, vol. 9968, pp. 138–145. Springer, Cham (2016). doi:10.1007/978-3-319-46630-9_14
-  Zhong, E., Fan, W., Yang, Q., Verscheure, O., Ren, J.: Cross Validation Framework to Choose amongst Models and Datasets for Transfer Learning. In: Lecture Notes in Computer Science (including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6323 LNAI, pp. 547–562. Springer (2010). doi:10.1007/978-3-642-15939-8_35
-  Fan, W., Davidson, I.: Reverse testing. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’06, p. 147. ACM Press, New York, New York, USA (2006). doi:10.1145/1150402.1150422
-  Valindria, V.V., Lavdas, I., Bai, W., Kamnitsas, K., Aboagye, E.O., Rockall, A.G., Rueckert, D., Glocker, B.: Reverse Classification Accuracy: Predicting Segmentation Performance in the Absence of Ground Truth. IEEE Transactions on Medical Imaging, 1–1 (2017). doi:10.1109/TMI.2017.2665165
-  Zikic, D., Glocker, B., Criminisi, A.: Encoding atlases by randomized classification forests for efficient multi-atlas label propagation. Medical Image Analysis 18(8), 1262–1273 (2014). doi:10.1016/j.media.2014.06.010
-  Robinson, R., Valindria, V.V., Bai, W., Suzuki, H., Matthews, P.M., Page, C., Rueckert, D., Glocker, B.: Automatic quality control of cardiac mri segmentation in large-scale population imaging. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) Medical Image Computing and Computer Assisted Intervention - MICCAI 2017, pp. 720–727. Springer, Cham (2017)
-  Oktay, O., Bai, W., Guerrero, R., Rajchl, M., de Marvao, A., O'Regan, D.P., Cook, S.A., Heinrich, M.P., Glocker, B., Rueckert, D.: Stratified decision forests for accurate anatomical landmark localization in cardiac images. IEEE Transactions on Medical Imaging 36(1), 332–342 (2017). doi:10.1109/tmi.2016.2597270
-  Petersen, S.E., Matthews, P.M., Francis, J.M., Robson, M.D., Zemrak, F., Boubertakh, R., Young, A.A., Hudson, S., Weale, P., Garratt, S., Collins, R., Piechnik, S., Neubauer, S.: Uk biobank’s cardiovascular magnetic resonance protocol. Journal of Cardiovascular Magnetic Resonance 18(1), 8 (2016). doi:10.1186/s12968-016-0227-4
-  Petersen, S.E., Aung, N., Sanghvi, M.M., Zemrak, F., Fung, K., Paiva, J.M., Francis, J.M., Khanji, M.Y., Lukaschuk, E., Lee, A.M., Carapella, V., Kim, Y.J., Leeson, P., Piechnik, S.K., Neubauer, S.: Reference ranges for cardiac structure and function using cardiovascular magnetic resonance (CMR) in caucasians from the UK biobank population cohort. Journal of Cardiovascular Magnetic Resonance 19(1) (2017). doi:10.1186/s12968-017-0327-9
-  Petersen, S.E., Sanghvi, M.M., Aung, N., Cooper, J.A., Paiva, J.M., Zemrak, F., Fung, K., Lukaschuk, E., Lee, A.M., Carapella, V., Kim, Y.J., Piechnik, S.K., Neubauer, S.: The impact of cardiovascular risk factors on cardiac structure and function: Insights from the uk biobank imaging enhancement study. PLOS ONE 12(10), 1–14 (2017). doi:10.1371/journal.pone.0185114
-  Bai, W., Shi, W., O'Regan, D.P., Tong, T., Wang, H., Jamil-Copley, S., Peters, N.S., Rueckert, D.: A Probabilistic Patch-Based Label Fusion Model for Multi-Atlas Segmentation With Registration Refinement: Application to Cardiac MR Images. IEEE Transactions on Medical Imaging 32(7), 1302–1315 (2013). doi:10.1109/TMI.2013.2256922
-  Robinson, R., Oktay, O., Bai, W., Valindria, V., Sanghvi, M., Aung, N., Paiva, J., Zemrak, F., Fung, K., Lukaschuk, E., Lee, A., Carapella, V., Kim, Y.J., Kainz, B., Piechnik, S., Neubauer, S., Petersen, S., Page, C., Rueckert, D., Glocker, B.: Real-time Prediction of Segmentation Quality. arXiv:1806.06244 (2018)