1 Introduction
Automated brain segmentation is a basic tool for processing magnetic resonance imaging (MRI) and provides imaging biomarkers of neuroanatomy such as volume, thickness, and shape. Despite efforts to deliver robust segmentation results across scans from different age groups, diseases, field strengths, and manufacturers, inaccuracies in the segmentation outcome are inevitable (Keshavan et al., 2018). A fundamental limitation of existing methods for whole-brain segmentation is that they do not estimate segmentation quality. Hence, manual quality control (QC) is advised before continuing with the analysis, but it has several shortcomings: it is (i) time consuming, (ii) subject to intra- and inter-rater variability, (iii) binary (pass/fail), and (iv) global for the entire scan. Manual QC is particularly time consuming on large datasets, so cohort-level summary statistics on biomarkers have, for instance, been used for identifying outliers (Sabuncu et al., 2016). A shortcoming of such heuristics is that they operate decoupled from the actual image and segmentation procedure.
Bayesian approaches for image segmentation are an alternative because they provide not only the mode (i.e., the most likely segmentation) but also the posterior distribution of the segmentation. Most such Bayesian approaches use point estimates in the inference, whereas marginalizing over parameters has only been proposed in combination with Markov Chain Monte Carlo sampling (Iglesias et al., 2013) or the Laplace approximation (Wachinger et al., 2015). Although sampling-based approaches incorporate fewer assumptions, they are computationally intense, especially when used in conjunction with atlas-based segmentation, and thus have only been used for segmenting substructures but not the whole brain (Iglesias et al., 2013).
Fully convolutional neural networks (FCNNs) have become the tool of choice for semantic segmentation in computer vision (Badrinarayanan et al., 2017; Long et al., 2015) and medical imaging (Ronneberger et al., 2015). In prior work, we introduced QuickNAT (Roy et al., 2017, 2018b), an FCNN for whole-brain segmentation of MRI T1 scans that not only outperformed existing atlas-based approaches, but also accomplished the segmentation orders of magnitude faster. QuickNAT is also much faster than DeepNAT, a previous patch-based approach for brain segmentation with neural networks (Wachinger et al., 2018). Although FCNNs provide high accuracy, they are often poorly calibrated and fail to estimate a confidence margin with the output (Guo et al., 2017). The predictive probability at the end of the network, i.e., the output of the softmax layer, does not capture the model uncertainty (Gal and Ghahramani, 2016).
Recent progress in Bayesian deep learning utilized the concept of Monte Carlo (MC) sampling via dropout to approximate samples from the posterior distribution (Gal and Ghahramani, 2016). Dropout was originally proposed to prevent overfitting during training (Srivastava et al., 2014). Dropout at test time approximates sampling from a Bernoulli distribution over network weights. As dropout layers do not have learnable parameters, adding them to the network does not increase model complexity or decrease performance. Thanks to fast inference with CNNs, multiple MC samples can be generated to reliably approximate the posterior distribution in acceptable time. MC dropout for estimating uncertainty in deep learning was originally proposed for classification (Gal and Ghahramani, 2016) and later applied to semantic segmentation with FCNNs in computer vision (Kendall et al., 2017), providing a pixel-wise model uncertainty estimate.
In this article, we propose to inherently measure the quality of whole-brain segmentation with a Bayesian extension of QuickNAT. For this purpose, we add dropout layers to the QuickNAT architecture, which enables highly efficient Monte Carlo sampling. Thus, for a given input brain scan, multiple possible segmentations are generated by MC sampling. In addition to estimating voxel-wise segmentation uncertainty, we propose four metrics for quantifying the segmentation uncertainty for each brain structure. We show that these metrics are highly correlated with the segmentation accuracy (Dice score) and can therefore be used to predict segmentation accuracy in the absence of reference manual annotation. Finally, we propose to use the uncertainty estimates as quality control measures in large-scale group analysis to estimate reliable effect sizes.
The automated QC proposed in this article offers advantages over manual QC. Most importantly, it does not require manual interaction, so an objective measure of quality is available together with the segmentation, which is particularly important for processing large neuroimaging repositories. Furthermore, we obtain a continuous measure of segmentation quality, which may be a more faithful representation than dichotomizing into pass and fail. Finally, the segmentation quality is estimated for each brain structure, instead of a global assessment for the entire brain as in manual QC, which better reflects variation in segmentation quality within a scan.
The main contributions of the work are as follows:

First approach for whole-brain segmentation with inherent quality estimation

Monte Carlo dropout for uncertainty estimation in brain segmentation with FCNN

Four metrics to quantify structure-wise uncertainty in contrast to voxel-wise uncertainty

Comprehensive experiments on four unseen datasets (variation in quality, scanner, pathology) to substantiate the high correlation of structure-wise uncertainty with Dice score

Integration of segmentation uncertainty in group analysis for estimating more reliable effect sizes.
While end-to-end learning approaches achieve high segmentation accuracy, the ‘black box’ nature of complex neural networks may impede their wider adoption in clinical applications. The lack of transparency of such models makes it difficult to trust the outcome. In addition, the performance of learning-based approaches is closely tied to the scans used during training. If scans presented to the network during testing are very different from those seen during training, a lower segmentation accuracy is to be expected. With the uncertainty measures proposed in this work, we address these points by also estimating a quality or confidence measure of the segmentation. This allows identifying scans with low segmentation accuracy, potentially due to low image quality or variation from the training set. While the contributions in this work do not increase the segmentation accuracy, we believe that assigning a meaningful confidence estimate will be just as important for practical use.
2 Prior Art
Prior work exists in medical image computing for evaluating segmentation performance in the absence of manual annotation. In one of the earliest works, the common agreement strategy (STAPLE) was used to evaluate classifier performance for the task of segmenting brain scans into WM, GM and CSF (Bouix et al., 2007). In another approach, features were extracted from the output segmentation map to train a separate regressor for predicting the Dice score (Kohlberger et al., 2012). More recent work proposed the reverse classification accuracy (RCA), whose pipeline involves training a separate classifier on the segmentation output of the method to be evaluated, serving as pseudo ground truth (Valindria et al., 2017). Similar to previous approaches, it also tries to estimate the Dice score. The idea of RCA was extended for segmentation quality control in large-scale cardiac MRI scans (Robinson et al., 2017).
In contrast to the approaches detailed above, our approach provides a quality measure or prediction confidence that is inherently computed within the segmentation framework (i.e., derived from the same model, in contrast to using a separate model for estimating quality), based on model uncertainty. Thus, it does not require training a second, independent classifier for evaluation, which itself might be subject to prediction errors. An earlier version of this work was presented at a conference (Roy et al., 2018a) and has here been extended with methodological improvements and more experimental evaluation. To the best of our knowledge, this is the first work to provide an uncertainty measure for each structure in whole-brain segmentation and its downstream application in group analysis for reliable estimation.
3 Method
We propose a fully convolutional neural network that produces, in addition to the segmentation, an estimate of the confidence or quality of the segmentation for each brain structure. To this end, we use a Bayesian approach detailed in the following sections.
3.1 Background on Bayesian Inference
Given a set of training scans $\mathcal{I} = \{I_1, \dots, I_m\}$ with corresponding manual segmentations $\mathcal{S} = \{S_1, \dots, S_m\}$, we aim at learning a probabilistic function $F_{seg}: \mathcal{I} \rightarrow \mathcal{S}$. This function generates the most likely segmentation $S^*$ given a test scan $I^*$. The probability of the predicted segmentation is
$$p(S^* \mid I^*, \mathcal{I}, \mathcal{S}) = \int p(S^* \mid I^*, W)\, p(W \mid \mathcal{I}, \mathcal{S})\, dW, \qquad (1)$$
where $W$ are the weight parameters of the function $F_{seg}$. The posterior distribution over weights in Eq. (1) is generally intractable, so we use variational inference to approximate it. Thus, a variational distribution over the network's weights $q(W)$ is learned by minimizing the Kullback-Leibler divergence $\mathrm{KL}\big(q(W) \,\|\, p(W \mid \mathcal{I}, \mathcal{S})\big)$, yielding the approximate predictive distribution
$$q(S^* \mid I^*, \mathcal{I}, \mathcal{S}) = \int p(S^* \mid I^*, W)\, q(W)\, dW. \qquad (2)$$
In Bayesian neural networks, the stochastic weights $W$ are composed of $L$ layers, $W = (W_i)_{i=1}^{L}$. The variational distribution $q(W_i)$ for layer $i$ is sampled as
$$W_i = M_i \cdot \mathrm{diag}\big([z_{i,j}]_{j=1}^{K_i}\big). \qquad (3)$$
Here, $z_{i,j}$ are Bernoulli distributed random variables with probabilities $p_i$, and $M_i$ are variational parameters to be optimized. The diag-operator maps vectors to diagonal matrices whose diagonals are the elements of the vectors. Also, $K_i$ represents the number of nodes in the $i$-th layer.
The integral in Eq. (2) is estimated by summing over Monte Carlo samples drawn from $W \sim q(W)$. Note that sampling from $q(W_i)$ can be approximated by performing dropout on layer $i$ in a network whose weights are $(M_i)_{i=1}^{L}$ (Gal and Ghahramani, 2016). The binary variable $z_{i,j} = 0$ corresponds to unit $j$ in layer $i-1$ being dropped out as an input to the $i$-th layer. Each sample of $W$ provides a different segmentation for the same input image. The mean of all the segmentations provides the final segmentation, whereas the variance among segmentations provides model uncertainty for the prediction.
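The sampling scheme described above can be illustrated on a toy model. The following is a minimal sketch, not the QuickNAT implementation: it assumes a single linear layer followed by a softmax, and the names `mc_dropout_predict`, `M`, `keep_prob` and the random inputs are hypothetical. Each MC sample multiplies the fixed weight matrix by a fresh Bernoulli mask over its input units, and the mean and variance of the resulting predictions are then computed.

```python
import numpy as np

def mc_dropout_predict(x, M, keep_prob, n_samples, rng):
    """Draw n_samples stochastic softmax predictions from a one-layer toy model.

    Each sample uses weights W = M @ diag(z) with z ~ Bernoulli(keep_prob),
    so input units with z = 0 are dropped for that forward pass.
    """
    samples = []
    for _ in range(n_samples):
        z = rng.binomial(1, keep_prob, size=M.shape[1])  # one mask entry per input unit
        W = M * z                      # equals M @ np.diag(z): zeroes dropped columns
        logits = W @ x
        e = np.exp(logits - logits.max())  # numerically stable softmax
        samples.append(e / e.sum())
    return np.stack(samples)

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 8))   # hypothetical layer parameters (3 classes, 8 inputs)
x = rng.normal(size=8)        # hypothetical input feature vector
probs = mc_dropout_predict(x, M, keep_prob=0.8, n_samples=20, rng=rng)

mean_prediction = probs.mean(axis=0)  # MC estimate of the predictive distribution
variance = probs.var(axis=0)          # spread across samples, a proxy for model uncertainty
```

With `keep_prob=1.0` every mask is all ones, so all samples coincide and the variance collapses to zero, which is the deterministic test-time behaviour of a standard network.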
3.2 QuickNAT architecture
As the base architecture, we use our recently proposed QuickNAT (Roy et al., 2018b). QuickNAT consists of three 2D FCNN models, segmenting an input scan slice-wise along coronal, axial and sagittal axes. This is followed by a view aggregation stage where the three generated segmentations are combined to provide a final segmentation. Each 2D FCNN model has an encoder-decoder based architecture, with four encoder blocks and four decoder blocks separated by a bottleneck block. Dense connections are added within each encoder and decoder block to promote feature reusability and the learning of better representations (Huang et al., 2017). Skip connections exist between each encoder and decoder block, similar to U-Net (Ronneberger et al., 2015). The network is trained by optimizing a combined loss function of weighted logistic loss and Dice loss. Median frequency balancing is employed to compensate for class imbalance (Roy et al., 2018b).
3.3 Bayesian QuickNAT
We use dropout layers (Srivastava et al., 2014) to introduce stochasticity during inference with the QuickNAT architecture. A dropout mask generated from a Bernoulli distribution $z_{i,j}$ generates a probabilistic weight $W_i$, see Eq. (3), with random neuron connectivity similar to a Bayesian neural network (Gal and Ghahramani, 2016). For Bayesian QuickNAT, we insert dropout layers after every encoder and decoder block with a dropout rate $q$, as illustrated in Fig. 1. Dropout is commonly used during training of neural networks to prevent overfitting, but deactivated during testing. Here, we keep dropout active in the testing phase and generate multiple segmentations from the posterior distribution of the model. To this end, the input scan $I$ is fed forward $N$ times through QuickNAT, each time with a different random dropout mask. This process simulates sampling from a space of sub-models with different connectivity among the neurons. This MC sampling of the models generates $N$ samples of predicted probability maps $\{P_1, \dots, P_N\}$, from which hard segmentation maps $\{S_1, \dots, S_N\}$ can be inferred by the ‘arg max’ operator across the channels $c$. This approximates the process of variational inference as in Bayesian neural networks (Gal and Ghahramani, 2016). The final segmentation $S$ is estimated by computing the average over all the MC probability maps, followed by an ‘arg max’ operator as
$$S = \arg\max_{c} \frac{1}{N} \sum_{i=1}^{N} P_i. \qquad (4)$$
The probability map $P_i$ consists of $C$ channels, representing probability maps for each individual class, which includes the addressed brain structures and background.
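The aggregation step described above (average the MC probability maps, then take the arg max over channels) can be sketched as follows. This is an illustrative sketch only: the array layout `(N, C, H, W)` and the function name `aggregate_mc_samples` are assumptions made for the example, not part of the original implementation.

```python
import numpy as np

def aggregate_mc_samples(prob_maps):
    """prob_maps: array of shape (N, C, H, W) with N MC probability maps.

    Returns the final label map (H, W) and the per-sample hard label maps (N, H, W).
    """
    mean_prob = prob_maps.mean(axis=0)       # average over the N MC samples
    final_seg = mean_prob.argmax(axis=0)     # arg max across the C channels
    sample_segs = prob_maps.argmax(axis=1)   # hard segmentation of each MC sample
    return final_seg, sample_segs

# Toy input: N = 2 samples, C = 2 classes, a 1 x 2 image.
prob_maps = np.array([
    [[[0.9, 0.4]], [[0.1, 0.6]]],   # sample 1: class-0 map, class-1 map
    [[[0.7, 0.7]], [[0.3, 0.3]]],   # sample 2
])
final_seg, sample_segs = aggregate_mc_samples(prob_maps)
```

In the toy input the two samples disagree on the second voxel, but the averaged probability (0.55 for class 0) decides the final label, which shows why the aggregated segmentation can differ from an individual MC sample.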
3.4 Uncertainty Measures
3.4.1 Voxelwise Uncertainty
The model uncertainty $U_s$ for a given voxel $\mathbf{x}$ and a specific structure $s$ is estimated as entropy over all $N$ MC probability maps $\{P^s_1, \dots, P^s_N\}$,
$$U_s(\mathbf{x}) = -\sum_{i=1}^{N} P^s_i(\mathbf{x}) \log P^s_i(\mathbf{x}). \qquad (5)$$
The global voxel-wise uncertainty is the sum over all structures, $U = \sum_s U_s$. Voxels with low uncertainty (i.e., low entropy) receive the same predictions even when different random neurons are dropped out from the network. An intuitive explanation for this is that the network is highly confident about the decision and that the result does not change much when the neuron connectivity is partially changed by using dropout. In contrast, the prediction confidence is low if predictions change a lot with altering neuron connectivity.
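A voxel-wise uncertainty map of this kind can be sketched in a few lines. The example below assumes that the entropy-style measure is the sum of $-P \log P$ over the MC samples for each structure, and that the global map is the sum over structures; the array layout `(N, C, ...)`, the small constant `eps` and the function name are assumptions made for illustration.

```python
import numpy as np

def voxelwise_uncertainty(prob_maps, eps=1e-12):
    """prob_maps: array of shape (N, C, ...) holding N MC probability maps.

    Returns (per_structure, global_map): per_structure has shape (C, ...) and
    global_map has the spatial shape of a single channel.
    """
    # eps avoids log(0); the term p * log(p) tends to 0 as p -> 0.
    per_structure = -(prob_maps * np.log(prob_maps + eps)).sum(axis=0)
    global_map = per_structure.sum(axis=0)   # sum over all structures
    return per_structure, global_map

# Two MC samples, two classes, one voxel.
ambiguous = np.full((2, 2, 1), 0.5)                          # every sample predicts 0.5 / 0.5
confident = np.array([[[1.0], [0.0]], [[1.0], [0.0]]])      # every sample predicts class 0
_, u_ambiguous = voxelwise_uncertainty(ambiguous)
_, u_confident = voxelwise_uncertainty(confident)
```

Note that the full set of probability maps has to be held in memory for this computation, which is the main cost of the voxel-wise measure.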
3.4.2 Structurewise Uncertainty
As most quantitative measures extracted from segmentation maps (e.g., Hippocampus volume) relate to specific brain structures, it is helpful to have an uncertainty measure corresponding to each brain structure, rather than each voxel. Here, we propose four different metrics for computing structure-wise uncertainty from MC segmentations, illustrated in Fig. 2 for MC samples.
Type-1: We measure the variation of the volume across the MC samples. As volume estimates are commonly used for neuroanatomical analysis, this type of uncertainty provides a confidence margin with the estimate. We compute the coefficient of variation,
$$CV_s = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(V^s_i - \mu_s\right)^2}}{\mu_s}, \qquad (6)$$
with mean $\mu_s = \frac{1}{N}\sum_{i} V^s_i$ of structure $s$ for MC volume estimates $\{V^s_1, \dots, V^s_N\}$. Note that this estimate is agnostic to the size of the structure.
Type-2: We use the overlap between samples as a measure of uncertainty. To this end, we compute the average Dice score over all possible pairs of MC samples,
$$d^{MC}_s = \mathrm{E}\Big[\big\{\mathrm{Dice}\big((S_i = s), (S_j = s)\big)\big\}_{i \neq j}\Big]. \qquad (7)$$
This measures the agreement in area overlap between all the MC samples in a pairwise fashion.
Type-3: We use the intersection over union (IoU) metric over all the MC samples for a specific structure as a measure of its uncertainty. The value of IoU is constrained to $[0, 1]$ and it is computed as
$$\mathrm{IoU}_s = \frac{\left|(S_1 = s) \cap (S_2 = s) \cap \dots \cap (S_N = s)\right|}{\left|(S_1 = s) \cup (S_2 = s) \cup \dots \cup (S_N = s)\right|}. \qquad (8)$$
Type-4: We define the uncertainty for a structure $s$ as the mean global voxel-wise uncertainty over the voxels which were labeled as $s$,
$$\mathcal{U}_s = \mathrm{E}\Big[\big\{U(\mathbf{x})\big\}_{\mathbf{x} \in \{S = s\}}\Big]. \qquad (9)$$
It must be noted that $d^{MC}_s$ and IoU are directly related to segmentation accuracy, while $\mathcal{U}_s$ and $CV_s$ are inversely related to accuracy. Also, it is worth mentioning that computing voxel-wise uncertainty maps requires all $N$ segmentation probability maps (each one having a size around 2 GB), which can be computationally demanding. In contrast, our proposed metrics (except $\mathcal{U}_s$) use label maps (size around 200 KB), which are much smaller in size and can be computed faster.
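The three label-map based metrics (volume coefficient of variation, mean pairwise Dice, and intersection over union across MC samples) can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the standard deviation is the population form (division by the number of samples), and the structure is assumed to be present in at least one sample so that the mean volume and the union are non-zero.

```python
import numpy as np
from itertools import combinations

def dice(a, b):
    """Dice overlap of two boolean masks (defined as 1.0 if both are empty)."""
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0

def structure_uncertainty(segs, label):
    """segs: integer label maps of shape (N, ...) from N MC samples."""
    masks = segs == label                                        # (N, ...) boolean masks
    volumes = masks.reshape(len(masks), -1).sum(axis=1).astype(float)
    cv = volumes.std() / volumes.mean()                          # Type-1: coefficient of variation
    d_mc = np.mean([dice(a, b) for a, b in combinations(masks, 2)])  # Type-2: mean pairwise Dice
    intersection = np.logical_and.reduce(masks).sum()
    union = np.logical_or.reduce(masks).sum()
    iou = intersection / union                                   # Type-3: IoU over all samples
    return cv, d_mc, iou

# Three toy 1D "segmentations" of structure 1 with volumes 3, 2 and 4.
segs = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
])
cv, d_mc, iou = structure_uncertainty(segs, label=1)
```

Only the hard label maps are needed here, which is why these metrics are much cheaper to compute than a measure based on the full probability maps.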
3.5 Segmentation Uncertainty in Group Analyses
Commonly, image segmentation is only a means to an end, where image-derived measures are used in follow-up statistical analyses. We are interested in propagating the uncertainty from the segmentation to the follow-up analyses. The rationale is that segmentations with high uncertainty potentially correspond to scans of poor quality, whose inclusion would confound the true effect sizes and limit the statistical significance of observed group differences. We demonstrate the integration of uncertainty for generalized linear models (GLMs) in the following, but it can also be generalized to other statistical models. GLMs are frequently used in neuroimaging studies for identifying significant associations between image measures and variables of interest. For instance, in numerous group analysis studies, Hippocampus volume was shown to be an important imaging biomarker with significant associations to Alzheimer’s disease.
In solving the regression model, each equation, i.e., each subject, has equal importance in the optimization routine (i.e., $\omega_i = 1$). In contrast, we propose to integrate the structure-wise uncertainty in the analysis. This is achieved by solving a weighted linear regression model with a unique weight $\omega_i$ for subject $i$,
$$\hat{\beta} = \arg\min_{\beta} \sum_{i} \omega_i \left(V_i - X_i \beta\right)^2, \qquad (10)$$
with design matrix $X$, vector of coefficients $\beta$, and normalized brain structure volume $V_i$ (normalized by intracranial volume). We use the proposed structure-wise uncertainties ($CV_s$, $d^{MC}_s$, and IoU) and set the weight as,
(11) 
Including weights in the regression increases its robustness as scans with reliable segmentation are emphasized. Setting all weights to a constant results in standard regression. In our experiments, we use
$$V_i = \beta_0 + \beta_A A_i + \beta_S S_i + \beta_D D_i, \qquad (12)$$
with age $A_i$, sex $S_i$ and diagnosis $D_i$ for subject $i$. Of particular interest is the regression coefficient $\beta_D$, which estimates the effect of diagnosis on the volume of a brain structure.
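A weighted linear regression of this form can be solved by rescaling each subject's row by the square root of its weight and then applying ordinary least squares. The sketch below uses synthetic data; the function name `weighted_regression`, the coefficient values and the covariates are hypothetical and only serve to show how a low weight suppresses an unreliable measurement.

```python
import numpy as np

def weighted_regression(X, v, w):
    """Minimise sum_i w_i * (v_i - X_i @ beta)**2 via rescaled least squares."""
    sw = np.sqrt(np.asarray(w, dtype=float))
    beta, *_ = np.linalg.lstsq(X * sw[:, None], v * sw, rcond=None)
    return beta

# Synthetic cohort: intercept, age, sex, diagnosis (all values are made up).
n = 20
age = 60.0 + np.arange(n)
sex = np.arange(n) % 2
diagnosis = (np.arange(n) // 2) % 2
X = np.column_stack([np.ones(n), age, sex, diagnosis])

beta_true = np.array([1.0, -0.01, 0.05, -0.3])
v = X @ beta_true
v[3] += 5.0                      # one subject with a grossly wrong volume estimate

w = np.ones(n)
w[3] = 0.0                       # an unreliable segmentation receives (near) zero weight

beta_plain = weighted_regression(X, v, np.ones(n))   # constant weights: standard regression
beta_weighted = weighted_regression(X, v, w)         # corrupted subject is ignored
```

In this toy setting the weighted fit recovers the generating coefficients, whereas the unweighted fit is pulled away by the single corrupted volume. In practice the weights would come from the structure-wise uncertainty of each subject rather than being set by hand.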
4 Experimental Setup
4.1 Architecture and Training Procedure
We set the dropout rate to (other values of decreased the segmentation performance compared to not using dropout) and produce MC samples ( 2 minutes), after which performance saturates (shown in Sec. 5.1). For training the neural network with limited data, we use the pre-training strategy with auxiliary labels proposed earlier (Roy et al., 2017). To this end, we pre-train the network on volumes of the IXI dataset (http://brain-development.org/ixi-dataset/) with segmentations produced by FreeSurfer (Fischl et al., 2002) and subsequently fine-tune on 15 of the 30 manually annotated volumes from the Multi-Atlas Labelling Challenge (MALC) dataset (Landman and Warfield, 2012). The remaining 15 volumes were used for testing. The split is consistent with the challenge instructions. This trained model is used for all our experiments. In this work, we segment brain structures (listed in the appendix).
4.2 Test Datasets
We test on four datasets, three of which were not seen during training.

MALC15: 15 of the 30 volumes from the MALC dataset that were not used for training are used for testing. MALC is a subset of the OASIS repository (Marcus et al., 2007).

ADNI29: The dataset consists of 29 scans from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu), with a balanced distribution of Alzheimer’s Disease (AD) and control subjects, and scans acquired with 1.5T and 3T scanners. The objective is to observe uncertainty changes due to variability in scanner and pathologies. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. For up-to-date information, see www.adni-info.org.

CANDI13: The dataset consists of 13 brain scans of children (age 5-15) with psychiatric disorders, part of the CANDI dataset (Kennedy et al., 2012). The objective is to observe changes in uncertainty for data with an age range not included in training.

IBSR18: The dataset consists of 18 scans publicly available at https://www.nitrc.org/projects/ibsr. The objective is to assess the sensitivity of uncertainty to low-resolution and poor-contrast scans.
Note that the training set (MALC) did not contain scans with AD or scans from children. Manual segmentations for MALC, ADNI29, and CANDI13 were provided by Neuromorphometrics, Inc. (http://Neuromorphometrics.com/).
5 Experimental Results and Discussion
5.1 Number of MC Samples
First, we examine the choice of the number of MC samples ($N$) needed for our task. This choice mainly depends on two factors: (i) the segmentation accuracy obtained by averaging all the MC predictions needs to be similar to the segmentation accuracy without dropout at test time, and (ii) the estimated uncertainty map needs to be stable, i.e., adding more MC samples should not affect the computed entropy values. We use the CANDI13 dataset for this experiment as it represents an out-of-sample dataset, i.e., data not used in training the model. It therefore provides a realistic test case on unseen data. We performed experiments with $N \in \{3, 6, 9, 12, 15, 18\}$.
The mean global Dice scores for different values of $N$ are reported in Tab. 1. We observe that the Dice score remains more or less constant as $N$ increases from 3 to 18, and is very close to the Dice performance with no dropout at test time. This is in contrast to prior work that reported a performance increase with more MC samples (Kendall et al., 2017). A potential reason for this is that the QuickNAT framework aggregates segmentations across the three principal axes (coronal, axial and sagittal) (Roy et al., 2018b). Hence, $N$ MC samples actually represent aggregating $3N$ segmentations in our framework. Furthermore, the view aggregation step compensates for the slight decrease in segmentation performance due to dropout at test time.
#MC samples ($N$)  Mean Dice score 

3  
6  
9  
12  
15  
18  
No Dropout 
Next, we investigate the number of MC samples needed to reliably estimate the model uncertainty. The voxelwise uncertainty can be considered stable if the estimated entropy values do not change substantially with larger . Let the uncertainty maps for and MC samples be and , respectively. We estimate the mean absolute difference between them, to quantify the stability. We report this value for different consecutive transitions of MC samples in Tab. 2. We observe that the transition yields a small difference, indicating a stable estimation of the uncertainty maps.
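The stability check described above reduces to a mean absolute difference between two uncertainty maps computed with different numbers of MC samples. A minimal sketch follows; the function name and the toy arrays are hypothetical, and the maps are assumed to be arrays of identical shape.

```python
import numpy as np

def mean_abs_difference(u_a, u_b):
    """Mean absolute voxel-wise difference between two uncertainty maps."""
    u_a, u_b = np.asarray(u_a, dtype=float), np.asarray(u_b, dtype=float)
    if u_a.shape != u_b.shape:
        raise ValueError("uncertainty maps must have the same shape")
    return np.abs(u_a - u_b).mean()

# Toy maps standing in for uncertainty estimated with fewer / more MC samples.
u_few = np.array([[0.0, 1.0], [0.5, 0.5]])
u_more = np.array([[1.0, 3.0], [0.5, 0.0]])
delta = mean_abs_difference(u_few, u_more)
```

Evaluating this quantity for consecutive sample counts gives one number per transition, and a small value indicates that adding further samples no longer changes the map appreciably.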
Transitions ()  

It is worth mentioning that as the number of MC samples increases, not only does the segmentation time per scan increase, but also the required computational resources and complexity. This is because all the intermediate 3D segmentation probability maps (4D tensors) need to be loaded into RAM for estimating the voxel-wise uncertainty map. We set for all the following experiments, which provides high segmentation accuracy and reliable uncertainty estimates, while keeping the computational complexity within acceptable margins.
Datasets  Mean Dice score  Corr(, DS)  Mean  

IoU  IoU  
MALC15  
ADNI29  
CANDI13  
IBSR18 
5.2 Uncertainty based quality control across different datasets
In this section, we conduct experiments to explore the ability of the proposed structure-wise uncertainty metrics to predict the segmentation quality across different seen and unseen datasets. To this end, we compute the correlation coefficient between each of the four uncertainty metrics and the Dice scores to quantify its efficacy in providing quality control. We report the mean Dice score, the correlation coefficients and mean IoU in Table 3 for all four datasets described in Sec. 4.2. Firstly, we observed that the segmentation Dice score is the highest on the MALC dataset (), while the performance drops by Dice points for other datasets (ADNI, CANDI and IBSR). The reason for this is that part of the MALC dataset was used for training, whereas the other datasets are unseen scans resembling more realistic scenarios with training and testing scans coming from different datasets. This decrease in Dice score is accompanied by a decrease in mean IoU (i.e., an increase in structure-wise uncertainty). We also observe that all the correlation values with the four metrics for all datasets are within acceptable margins (). IoU has the highest correlation across all four datasets. In addition to reporting correlations, we show the scatter plots for the four uncertainty measures with respect to the actual Dice score on the CANDI13 dataset in Fig. 3. In the scatter plots, we represent one dot per structure per scan, with unique colors for each of the classes. For the sake of clarity, only structures from the left hemisphere of the brain are displayed. We note that and IoU show compact point clouds, whereas is more dispersed, indicating lower correlation. It must be noted that each of the three unseen datasets has unique characteristics, which are not present in the training MALC scans. IBSR consists of scans with low resolution and thick slices. ADNI contains subjects exhibiting neurodegenerative pathologies, whereas training was done on healthy subjects. CANDI consists of scans of children, whereas none of the training subjects were from that age range. So, we believe our experiments cover a wide variability of out-of-sample data (resolution, pathology, age range), which the model might encounter in a more uncontrolled setting. This is shown in Fig. 4 and explained in detail in Sec. 5.5.
Dataset  Corr(IoU, DS)  MAE  Accuracy 

MALC15  
ADNI29  
CANDI13  
IBSR18 
5.3 IoU as a proxy for the Dice score
The Dice coefficient is the most widely used metric for evaluating segmentation accuracy and provides an intuitive ‘goodness’ measure to the user. This has motivated earlier works to directly regress the Dice score for segmentation quality control (Kohlberger et al., 2012; Valindria et al., 2017). Our approach is different because we provide inherent measures of uncertainty of the segmentation model. While we have demonstrated that our measures are highly correlated with Dice scores (Sec. 5.2), the actual structure-wise uncertainty values may be challenging to interpret because it is not immediately clear which values indicate a good or bad segmentation. When looking at the scatter plot in Fig. 3, we see that the uncertainty measures on the x-axis and the Dice score on the y-axis are in different ranges, with the only exception of IoU. Indeed, the values of IoU closely resemble the Dice score, and we will demonstrate in the following paragraph that it is a suitable proxy for the Dice score.
We estimated the mean absolute error (MAE) between IoU and Dice score and reported the results in Table 4. Also, similar to Valindria et al. (2017), we define three categories, i.e., Dice range as ‘bad’, as ‘medium’ and as ‘good’. We categorize the segmentations with actual Dice score and IoU, and report the 3-class classification accuracy in Table 4. MAE varies between , while accuracy between as reported in Table 4. All the similarity metrics (correlation, MAE and 3-class classification accuracy) between IoU and Dice score have values similar to or better than the ones reported in (Valindria et al., 2017) over 4 different datasets. This is remarkable because Valindria et al. (2017) trained a dedicated model to predict the Dice score, while we are simply computing the intersection over union of the MC samples without any supervision.
We also present a structure-wise analysis to investigate the similarity between Dice score and IoU in Fig. 5. Again, only structures in the left hemisphere of the brain are shown for clarity. In the boxplot, we observe that for most structures IoU is very close to the actual Dice score. The worst similarity is observed for the inferior lateral ventricles, where there is about 15% difference between the two metrics. A potential reason could be the small size of the structure. With all these experiments, we substantiate that IoU can be effectively used as a proxy for the actual Dice score, without any reference manual annotations.
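The comparison between IoU and Dice score described above (mean absolute error plus agreement on a three-way bad/medium/good categorization) can be sketched as follows. The category boundaries `(0.6, 0.8)` are placeholders chosen for this example and are an assumption, as are the function name and the toy values.

```python
import numpy as np

def proxy_agreement(iou, dice, thresholds=(0.6, 0.8)):
    """Return (MAE, 3-class accuracy) between IoU and Dice values.

    thresholds are hypothetical category boundaries: values below the first
    are 'bad' (0), values from the second upwards are 'good' (2), else 'medium' (1).
    """
    iou = np.asarray(iou, dtype=float)
    dice = np.asarray(dice, dtype=float)
    mae = np.abs(iou - dice).mean()
    cat_iou = np.digitize(iou, thresholds)     # category predicted from IoU
    cat_dice = np.digitize(dice, thresholds)   # category from the actual Dice score
    accuracy = (cat_iou == cat_dice).mean()
    return mae, accuracy

iou_values = [0.50, 0.70, 0.90, 0.85]
dice_values = [0.55, 0.82, 0.88, 0.90]
mae, accuracy = proxy_agreement(iou_values, dice_values)
```

In the toy data, three of the four structures fall into the same category under both measures, so the categorical agreement is 0.75 while the MAE stays small.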
5.4 Sensitivity of Uncertainty to scan Quality
MRI scans of poor quality can lead to a degradation of the segmentation performance. Such poor quality scans can occur for various reasons such as noise, motion artifacts, and poor contrast. Model uncertainty is expected to be sensitive to the scan quality and should increase whenever segmentation accuracy decreases due to poor data quality. In this section, we investigate whether this property holds for our proposed model. To this end, we performed an experiment where we artificially degraded the quality of the input brain MRI scan with Rician noise. Here we use the MALC test dataset for evaluation purposes. We corrupt the scans with Rician noise at dB levels of 3, 5, 7, and 9 and report the mean global Dice score and mean IoU at each noise level in Tab. 5. We observe that the mean Dice score reduces as the dB level of the added Rician noise increases, and the mean IoU also decreases (indicating an increase in uncertainty). This confirms our hypothesis that our model is sensitive to scan quality. We also observe that mean IoU falls at a faster rate than mean Dice score, indicating that uncertainty is more sensitive to noise than segmentation accuracy. It must be noted that in all our experiments with real scans, we did not encounter any scenario where segmentation failed (Dice score ). The experiment with Rician noise with resembles an artificially induced failure case.
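Rician noise is commonly simulated by adding independent Gaussian noise to the real and imaginary parts of the signal and taking the magnitude. The sketch below follows that construction; the function name is hypothetical, and the parameter `sigma` is an assumption of this example, since the mapping from the dB levels used in the experiment to a noise standard deviation is not specified here.

```python
import numpy as np

def add_rician_noise(image, sigma, rng):
    """Corrupt a magnitude image with Rician noise of scale sigma."""
    image = np.asarray(image, dtype=float)
    n_real = rng.normal(0.0, sigma, image.shape)   # noise on the real channel
    n_imag = rng.normal(0.0, sigma, image.shape)   # noise on the imaginary channel
    return np.sqrt((image + n_real) ** 2 + n_imag ** 2)

rng = np.random.default_rng(0)
clean = np.zeros((100, 100))               # toy background-only "scan"
noisy = add_rician_noise(clean, sigma=1.0, rng=rng)
unchanged = add_rician_noise(np.array([1.0, 2.0]), sigma=0.0, rng=rng)
```

Unlike additive Gaussian noise, the result is always non-negative and biased upwards in dark regions: for a zero-intensity background the noisy values follow a Rayleigh distribution with mean close to $\sigma\sqrt{\pi/2}$.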
Noise Levels  Mean Dice score  Mean IoU  MAE  Accuracy 

No Noise  
dB = 3  
dB = 5  
dB = 7  
dB = 9 
5.5 Qualitative Analysis
We present qualitative results of Bayesian QuickNAT in Fig. 4. From left to right, the input MRI scan, its corresponding segmentation, voxel-wise uncertainty map and structure-wise uncertainty (IoU) heat map are illustrated. The scale of the heat map replicates the Dice score , where red corresponds to , indicating higher reliability in segmentation. Each row shows an example from one of the four datasets, where we selected the scan with the worst segmentation accuracy for each dataset. The first row shows results on a test sample from the MALC dataset, where segmentation is overall of high quality. This is reflected by the thin lines in the voxel-wise uncertainty (anatomical boundaries) and redness in the structure-wise uncertainty heat map. Since the same dataset was used for training, we obtain high segmentation accuracy on MALC. The second row presents the scan with the worst performance on the IBSR18 dataset. Careful inspection of the MRI scan shows poor contrast with prominent ringing artifacts. The mean Dice score of the scan is , which is below the mean score for the overall dataset. An increase in voxel-wise uncertainty can be observed visually by the thickening of the lines along anatomical boundaries (in contrast to MALC). The structure-wise uncertainty map shows lighter shades of red in some subcortical structures, indicating a less reliable segmentation in comparison to MALC. The third row presents the scan with the worst performance in ADNI29, which belongs to a subject of age 95 with severe AD pathology. Prominent atrophy in the cortex along with enlarged ventricles can be visually observed in the scan. In addition to the pathology, ringing artifacts at the top of the scan can be observed. The mean Dice score is , which is below the mean Dice score for the dataset. Its IoU heat map shows higher uncertainty in some subcortical structures with brighter shades, whereas the reliability of cortex and lateral ventricles segmentation is good. It must be noted that the training scans did not include any subjects with AD, and this example illustrates the performance of our framework for unseen pathology. The last row presents the MRI scan with the worst performance on CANDI13. The mean Dice score of the scan is , which is below the mean Dice performance of the dataset. This scan can be considered an outlier in the dataset. The scan belongs to a subject of age 5 with strong motion artifacts together with poor contrast. Scans of such an age range and such poor quality were not used in the training pipeline, which explains the degradation of the segmentation performance. Its voxel-wise uncertainty is higher in comparison to the others, with some prominent dark, highly uncertain patches in subcortical regions. The heat map shows the lowest confidence for this scan in comparison to the other results. The cortical regions show shades of yellow, whereas some subcortical structures show shades of blue, which is towards the lower end of the reliability scale.
5.6 Uncertainty for Group Analysis
In the following section, we integrate structurewise uncertainty in regression models for robust group analyses.
5.6.1 Group analysis on ADNI29
ADNI29 is a small subset of the ADNI dataset with 15 control subjects and 14 Alzheimer’s patients. We perform a group analysis as per Eq. (10) with age, sex, and diagnosis as independent variables and the volume of a brain structure as the dependent variable. Since we have manual annotations for ADNI29, we can compute the actual volumes and accordingly estimate the ground truth regression coefficients. Table 6 reports the regression coefficients for diagnosis for twelve brain structures. The coefficients are estimated based on manual annotations, segmentation with FreeSurfer, and segmentation with Bayesian QuickNAT. Further, we use uncertainty-based weighting on the volume measures from Bayesian QuickNAT. Weighting was done using three of the proposed structurewise uncertainty measures, as presented in Eq. (11). Our hypothesis is that weighting will result in regression coefficients that are numerically equal or closer to the estimates from the manual annotation than those without weighting. We observe that out of the selected 12 structures, more reliable estimation of is achieved with weighting and five structures using based weighting. Also, for all structures, any weighting resulted in an estimate that is closer to its actual value, thus substantiating our hypothesis. These results demonstrate that integrating segmentation quality in the statistical analysis leads to more reliable estimates.
| Structures | Manual Annotations | FreeSurfer | QuickNAT | Bayesian QuickNAT | Bayesian QuickNAT | Bayesian QuickNAT (IoU) |
|---|---|---|---|---|---|---|
| White Matter | 1.129 | 0.788 | 0.779 | 0.778 | 0.779 | 0.799 |
| Cortex | 0.202 | 0.406 | 0.156 | 0.158 | 0.177 | 0.146 |
| Lateral ventricle | 0.368 | 0.392 | 0.372 | 0.376 | 0.423 | 0.405 |
| Caudate | 0.111 | 0.026 | 0.047 | 0.088 | 0.131 | 0.067 |
| Putamen | 0.109 | 0.225 | 0.276 | 0.237 | 0.055 | 0.130 |
| 3rd Ventricle | 0.214 | 0.333 | 0.353 | 0.357 | 0.391 | 0.325 |
| 4th Ventricle | 0.022 | 0.055 | 0.076 | 0.063 | 0.019 | 0.022 |
| Hippocampus | 1.149 | 0.979 | 1.282 | 1.280 | 1.249 | 1.191 |
| Amygdala | 1.005 | 0.891 | 1.149 | 1.104 | 1.039 | 0.908 |
| Accumbens | 0.343 | 0.738 | 0.516 | 0.469 | 0.384 | 0.473 |
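As a minimal sketch of how uncertainty-based weighting can enter such a regression, the following Python example solves a weighted least-squares problem with NumPy. The exact forms of Eq. (10) and Eq. (11) are not reproduced here; the linear model, the synthetic data, the weight values, and the helper name `weighted_least_squares` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Return beta minimizing sum_i w_i * (y_i - X_i @ beta)**2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    sw = np.sqrt(np.asarray(w, dtype=float))
    # Scaling each row by sqrt(w_i) turns the weighted problem into ordinary least squares.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Synthetic cohort (assumption): 15 controls and 14 patients, as in ADNI29.
n = 29
age = np.linspace(60.0, 95.0, n)
sex = (np.arange(n) % 2).astype(float)
diagnosis = np.array([0.0] * 15 + [1.0] * 14)
X = np.column_stack([np.ones(n), age, sex, diagnosis])

# Hypothetical "ground truth" coefficients: intercept, age, sex, diagnosis.
true_beta = np.array([5.0, -0.02, 0.3, -1.0])
true_volume = X @ true_beta

# One failed segmentation: its volume is badly off and its confidence is low.
measured = true_volume.copy()
measured[20] += 3.0
confidence = np.ones(n)
confidence[20] = 0.1  # hypothetical structurewise confidence, e.g. an IoU-like score

beta_plain = weighted_least_squares(X, measured, np.ones(n))
beta_weighted = weighted_least_squares(X, measured, confidence)

print("diagnosis coefficient, unweighted:", beta_plain[3])
print("diagnosis coefficient, weighted:  ", beta_weighted[3])
print("diagnosis coefficient, reference: ", true_beta[3])
```

In this toy setting, down-weighting the single unreliable volume pulls the diagnosis coefficient back towards the reference value, which mirrors the hypothesis tested in Table 6.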
5.6.2 Group analysis on ABIDE-I
We perform group analysis on the ABIDE-I dataset (Di Martino et al., 2014) consisting of scans, with normal subjects and subjects with autism. The dataset was collected from 20 different sites with a high variability in scan quality. To factor out changes due to site, we added site as a covariate in Eq. (10). We report with corresponding p-values for the volume of brain structures that have recently been associated with autism in a large ENIGMA study (Van Rooij et al., 2017). We compare uncertainty-weighted regression (weighted by , and IoU) to normal regression in Table 7. Strikingly, uncertainty-weighted regression results in significant associations with autism, identical to (Van Rooij et al., 2017), whereas normal regression is only significant for the amygdala.
Standard approaches for group analysis on large cohorts involve detecting outlier volume estimates and removing the corresponding subjects from the regression. This sometimes also requires a manual inspection of the segmentation quality. In contrast to these approaches, we propose to use all scans and to associate a continuous weight with each of them, which determines its relative importance in estimating the regression coefficients, without the need for any outlier detection or manual inspection.
Autism Biomarkers  Normal Regression  IoU
Amygdala
Lat. Ventricles
Pallidum
Putamen
Accumbens
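The contrast between outlier removal and continuous weighting can be illustrated with a short Python sketch. It builds a design matrix with site as a categorical covariate and shows that discarding subjects is the special case of weights restricted to 0 and 1, whereas continuous weights keep every scan. The data, site labels, confidence values, and the helper names `design_matrix` and `fit_weighted` are hypothetical and only serve the illustration.

```python
import numpy as np

def design_matrix(age, sex, diagnosis, site):
    """Intercept, age, sex, diagnosis, plus one dummy column per site except the first."""
    site = np.asarray(site)
    levels = np.unique(site)
    dummies = (site[:, None] == levels[None, 1:]).astype(float)
    core = np.column_stack([np.ones(len(site)), age, sex, diagnosis])
    return np.hstack([core, dummies])

def fit_weighted(X, y, w):
    """Weighted least squares via row scaling by sqrt(w)."""
    sw = np.sqrt(np.asarray(w, dtype=float))
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Synthetic multi-site cohort (assumption): 3 sites with 6 subjects each.
n = 18
idx = np.arange(n)
site = np.repeat(["A", "B", "C"], 6)
age = np.linspace(8.0, 40.0, n)
sex = (idx % 2).astype(float)
diagnosis = (idx % 3 == 0).astype(float)
X = design_matrix(age, sex, diagnosis, site)

# Hypothetical volumes with a site offset, a diagnosis effect, and small fluctuations.
beta_true = np.array([2.0, 0.01, 0.1, -0.2, 0.3, -0.4])
y = X @ beta_true + 0.05 * np.sin(idx)
y[[4, 13]] += 1.5  # two scans with failed segmentations

# Hypothetical structurewise confidence per scan (low for the two failures).
confidence = np.full(n, 0.9)
confidence[[4, 13]] = 0.2

# Outlier removal corresponds to binary weights; continuous weighting keeps all scans.
w_binary = (confidence > 0.5).astype(float)
beta_removed = fit_weighted(X, y, w_binary)
beta_continuous = fit_weighted(X, y, confidence)

print("diagnosis coefficient, outlier removal:     ", beta_removed[3])
print("diagnosis coefficient, continuous weighting:", beta_continuous[3])
```

The binary variant needs a threshold that has to be chosen by hand, while the continuous variant uses the confidence values directly.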
5.7 General Discussion
We introduced an approach that estimates not only the segmentation but also the uncertainty in the segmentation. The uncertainty is estimated directly from the segmentation model. Consequently, the uncertainty increases if a test scan presented to the network differs from the scans seen during training. On the one hand, this holds for individuals with different demographic characteristics or pathologies. On the other hand, it holds for image quality, which depends on the image acquisition process. Learning-based approaches can achieve high segmentation accuracy, but this accuracy depends strongly on the scans used during training. Since it is impossible to represent all scans that can occur in practice in the training set, uncertainty is a key concept for flagging scans with lower segmentation accuracy. Uncertainty could therefore be used to decide whether scans have to be acquired again due to insufficient quality. Further, it could guide the inclusion of particular types of scans in training.
Our experiments have demonstrated that structurewise uncertainty measures are highly correlated with the Dice score. They can therefore be used for automated quality control. In particular, the intersection over union of the Monte Carlo samples has the same range as the Dice score and is highly correlated with Dice on unseen datasets. Consequently, it can be interpreted as a proxy for the Dice score when manual annotations are not available to compute the actual Dice score. This is beneficial for judging the segmentation quality of single scans.
For the analysis of groups of images, we went one step further and integrated uncertainty measures in the follow-up analysis. We have demonstrated the impact of such an integration for regression analysis, but the general concept of weighting instances by their uncertainty can be used for many approaches, although it may require some adaptation. Such an approach is particularly advantageous for the analysis of large repositories, where manual quality control is very time consuming. Our results for the regression models have shown that weighting samples according to the segmentation quality yields estimates that are more similar to those from the manual annotation.
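To make the intersection over union of Monte Carlo samples concrete, the following Python sketch computes it for one structure from a stack of label maps. It assumes the measure is the number of voxels assigned to the structure in all samples divided by the number of voxels assigned to it in at least one sample; the function name `mc_iou`, the toy label maps, and the quality control threshold are hypothetical.

```python
import numpy as np

def mc_iou(mc_labels, structure):
    """IoU of one structure across N Monte Carlo segmentations.

    mc_labels: integer array of shape (N, ...) holding N label maps.
    Returns |intersection| / |union| of the structure masks, or NaN if the
    structure does not appear in any sample.
    """
    masks = np.asarray(mc_labels) == structure
    intersection = np.logical_and.reduce(masks, axis=0).sum()
    union = np.logical_or.reduce(masks, axis=0).sum()
    return float(intersection) / float(union) if union > 0 else float("nan")

# Three toy MC samples of a one-dimensional "scan" with labels 0, 1, and 2.
samples = np.array([
    [0, 1, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 0, 2, 2],
    [0, 1, 1, 1, 1, 2, 2],
])

print(mc_iou(samples, 1))  # samples disagree on the extent of structure 1
print(mc_iou(samples, 2))  # all samples agree on structure 2

# Hypothetical quality control rule: flag structures whose IoU falls below a threshold.
qc_threshold = 0.8
flagged = [s for s in (1, 2) if mc_iou(samples, s) < qc_threshold]
print("structures flagged for review:", flagged)
```

Because the value lies between 0 and 1, like the Dice score, a single threshold can be applied across structures without rescaling.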
6 Conclusion
In this article, we introduced Bayesian QuickNAT, an FCNN for whole brain segmentation with a structurewise uncertainty estimate. Dropout is used at test time to produce multiple Monte Carlo samples of the segmentation, which are used in estimating uncertainty. We introduced four different metrics to quantify structurewise uncertainty. We extensively validated the approach on multiple unseen datasets and demonstrated that the proposed metrics correlate highly with segmentation accuracy and provide effective quality control in the absence of reference manual annotations. The datasets used in the experiments cover a wide variety of unseen data, including scans from children, scans with pathologies, scans with low resolution, and scans with low contrast. Strikingly, one of our proposed metrics, the intersection over union of MC samples, closely approximates the Dice score. In addition, we proposed to integrate the uncertainty metrics into group analysis as a confidence in each observation, yielding reliable effect sizes. Although all experiments were performed on neuroimaging applications, the basic idea is generic and can easily be extended to other segmentation applications. We believe our framework will aid the adoption of automated frameworks in large-scale neuroimaging studies, as it comes with a fail-safe mechanism that alerts the user, for manual intervention, whenever the system is not sure about a decision.
Acknowledgement
We thank SAP SE and the Bavarian State Ministry of Education, Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B) for funding and the NVIDIA corporation for GPU donation. We thank Neuromorphometrics Inc. for providing manual annotations. Data collection and sharing was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC; Johnson & Johnson Pharmaceutical Research & Development LLC; Medpace, Inc; Merck & Co., Inc.; Meso Scale Diagnostics, LLC; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Appendix
List of Classes: (1) Left white matter, (2) Left cortex, (3) Left lateral ventricle, (4) Left inferior lateral ventricle, (5) Left cerebellum white matter, (6) Left cerebellum cortex, (7) Left thalamus, (8) Left caudate, (9) Left putamen, (10) Left pallidum, (11) 3rd ventricle, (12) 4th ventricle, (13) Brain stem, (14) Left hippocampus, (15) Left amygdala, (16) CSF, (17) Left accumbens, (18) Left ventral DC, (19) Right white matter, (20) Right cortex, (21) Right lateral ventricle, (22) Right inferior lateral ventricle, (23) Right cerebellum white matter, (24) Right cerebellum cortex, (25) Right thalamus, (26) Right caudate, (27) Right putamen, (28) Right pallidum, (29) Right hippocampus, (30) Right amygdala, (31) Right accumbens, (32) Right ventral DC, and (33) Optic Chiasma.
7 References
 Badrinarayanan et al. (2017) Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (12), 2481–2495.
 Bouix et al. (2007) Bouix, S., Martin-Fernandez, M., Ungar, L., Nakamura, M., Koo, M.-S., McCarley, R. W., Shenton, M. E., 2007. On evaluating brain tissue classifiers without a ground truth. Neuroimage 36 (4), 1207–1224.
 Di Martino et al. (2014) Di Martino, A., Yan, C.-G., Li, Q., Denio, E., Castellanos, F. X., Alaerts, K., Anderson, J. S., Assaf, M., Bookheimer, S. Y., Dapretto, M., et al., 2014. The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular psychiatry 19 (6), 659.
 Fischl et al. (2002) Fischl, B., Salat, D. H., Busa, E., Albert, M., Dieterich, M., Haselgrove, C., Van Der Kouwe, A., Killiany, R., Kennedy, D., Klaveness, S., et al., 2002. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 33 (3), 341–355.
 Gal and Ghahramani (2016) Gal, Y., Ghahramani, Z., 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: international conference on machine learning. pp. 1050–1059.
 Guo et al. (2017) Guo, C., Pleiss, G., Sun, Y., Weinberger, K. Q., 2017. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599.
 Huang et al. (2017) Huang, G., Liu, Z., van der Maaten, L., Weinberger, K. Q., 2017. Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 2261–2269.
 Iglesias et al. (2013) Iglesias, J. E., Sabuncu, M. R., Van Leemput, K., 2013. Improved inference in bayesian segmentation using monte carlo sampling: Application to hippocampal subfield volumetry. Medical image analysis 17 (7), 766–778.
 Kendall et al. (2017) Kendall, A., Badrinarayanan, V., Cipolla, R., 2017. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. Proceedings of the British Machine Vision Conference (BMVC).
 Kennedy et al. (2012) Kennedy, D. N., Haselgrove, C., Hodge, S. M., Rane, P. S., Makris, N., Frazier, J. A., 2012. Candishare: a resource for pediatric neuroimaging data.
 Keshavan et al. (2018) Keshavan, A., Datta, E., McDonough, I. M., Madan, C. R., Jordan, K., Henry, R. G., 2018. Mindcontrol: A web application for brain segmentation quality control. NeuroImage 170, 365–372.
 Kohlberger et al. (2012) Kohlberger, T., Singh, V., Alvino, C., Bahlmann, C., Grady, L., 2012. Evaluating segmentation error without ground truth. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 528–536.
 Landman and Warfield (2012) Landman, B., Warfield, S., 2012. Miccai 2012 workshop on multi-atlas labeling. In: Medical image computing and computer assisted intervention conference.
 Long et al. (2015) Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440.
 Marcus et al. (2007) Marcus, D. S., Wang, T. H., Parker, J., Csernansky, J. G., Morris, J. C., Buckner, R. L., 2007. Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults. Journal of cognitive neuroscience 19 (9), 1498–1507.
 Robinson et al. (2017) Robinson, R., Valindria, V. V., Bai, W., Suzuki, H., Matthews, P. M., Page, C., Rueckert, D., Glocker, B., 2017. Automatic quality control of cardiac mri segmentation in large-scale population imaging. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 720–727.
 Ronneberger et al. (2015) Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. Springer, pp. 234–241.
 Roy et al. (2018a) Roy, A. G., Conjeti, S., Navab, N., Wachinger, C., 2018a. Inherent brain segmentation quality control from fully convnet monte carlo sampling. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 664–672.
 Roy et al. (2018b) Roy, A. G., Conjeti, S., Navab, N., Wachinger, C., 2018b. Quicknat: Segmenting mri neuroanatomy in 20 seconds. arXiv preprint arXiv:1801.04161.
 Roy et al. (2017) Roy, A. G., Conjeti, S., Sheet, D., Katouzian, A., Navab, N., Wachinger, C., 2017. Error corrective boosting for learning fully convolutional networks with limited data. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 231–239.
 Sabuncu et al. (2016) Sabuncu, M. R., Ge, T., Holmes, A. J., Smoller, J. W., Buckner, R. L., Fischl, B., Weiner, M. W., Aisen, P., Weiner, M., Petersen, R., et al., 2016. Morphometricity as a measure of the neuroanatomical signature of a trait. Proceedings of the National Academy of Sciences 113 (39), E5749–E5756.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), 1929–1958.
 Valindria et al. (2017) Valindria, V. V., Lavdas, I., Bai, W., Kamnitsas, K., Aboagye, E. O., Rockall, A. G., Rueckert, D., Glocker, B., 2017. Reverse classification accuracy: predicting segmentation performance in the absence of ground truth. IEEE transactions on medical imaging 36 (8), 1597–1606.
 Van Rooij et al. (2017) Van Rooij, D., Anagnostou, E., Arango, C., Auzias, G., Behrmann, M., Busatto, G. F., Calderoni, S., Daly, E., Deruelle, C., Di Martino, A., et al., 2017. Cortical and subcortical brain morphometry differences between patients with autism spectrum disorder and healthy individuals across the lifespan: Results from the enigma asd working group. American Journal of Psychiatry 175 (4), 359–369.
 Wachinger et al. (2015) Wachinger, C., Fritscher, K., Sharp, G., Golland, P., 2015. Contour-driven atlas-based segmentation. IEEE transactions on medical imaging 34 (12), 2492–2505.
 Wachinger et al. (2018) Wachinger, C., Reuter, M., Klein, T., 2018. Deepnat: Deep convolutional neural network for segmenting neuroanatomy. NeuroImage 170, 434–445.