Decisions by medical experts are increasingly supported and augmented by intelligent machines, e.g., through computer-aided diagnosis (CAD). The quality of this joint decision process would improve if the automatic systems were able to indicate their uncertainty, provided that the uncertainty information is reliable, i.e., worth taking into account. A system indicating high uncertainty in image areas of incorrect segmentation could be used to detect these regions and subsequently refer them to medical experts. Such a human-in-the-loop setting would increase segmentation performance. In addition, it could mitigate a severe deficiency of current state-of-the-art deep learning segmentation methods, which occasionally generate anatomically implausible segmentations that a medical expert would never produce.
Previous research has mainly focused on the assessment of uncertainty in disease prediction or tissue segmentation by utilizing Bayesian neural networks (BNNs) or test-time data augmentation. Additional methods to estimate uncertainty are deep ensembles and learned confidence estimates. In the former, multiple models are trained and the variance of their predictions is used as a confidence measure, whereas in the latter, the model outputs a confidence measure simultaneously with its prediction.
In this work, using multi-structure segmentation in cardiac MR images, we introduce a method that simultaneously generates segmentation masks and uncertainty maps using a dilated convolutional neural network (DCNN). To obtain segmentation uncertainty maps, we compare two approaches. First, we employ Bayesian uncertainty maps (u-maps), obtained with Bayesian DCNNs (B-DCNN). Second, we use entropy maps (e-maps), which can be generated efficiently by any probabilistic classifier, as entropy is a theoretically grounded quantification of uncertainty in information theory. In addition, we show that a valuable uncertainty measure can be obtained if the applied model is well calibrated, i.e., if the generated probabilities represent the likelihood of being correct. We demonstrate this by simulating a human-in-the-loop setting and provide evidence that image areas indicated as highly uncertain almost entirely cover regions of incorrect segmentation. Hence, the fused information can be employed in clinical practice to inform an expert whether and where a generated segmentation should be adjusted.
2 Data description
In this work, data from the MICCAI Automated Cardiac Diagnosis Challenge (ACDC) was used. The dataset consists of cardiac cine MR images (CMRI) of 150 patients, clinically diagnosed into five classes: normal, dilated cardiomyopathy (DCM), hypertrophic cardiomyopathy (HCM), heart failure with infarction (MINF), or right ventricular abnormality (RVA). Cases are uniformly distributed over the classes. Manual reference segmentations of the LV cavity, RV endocardium, and myocardium at end-diastole (ED) and end-systole (ES) are provided for cases. For each patient, short-axis (SA) CMRIs with 28-40 frames are available, in which the ED and ES frames have been indicated. On average, images consist of nine slices, each with a spatial resolution of voxels. The image slices cover the LV from base to apex. In-plane voxel spacing varies from to , with slice thickness from to and sometimes an inter-slice gap of . To correct for differences in voxel size, all 2D image slices were resampled to spacing. Furthermore, to correct for intensity differences among images, each MR volume was normalized to [, ] according to the th and th percentile of intensities in the image. For detailed specifications of the acquisition protocol, we refer the reader to Bernard et al.
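For illustration, the percentile-based intensity normalization described above can be sketched as follows; the specific percentiles (here 5th/95th) and the target range [0, 1] are assumptions, since the exact values are elided above:

```python
import numpy as np

def normalize_volume(volume, low_pct=5.0, high_pct=95.0):
    """Clip an MR volume to two intensity percentiles and rescale to [0, 1].

    The percentile values and target range are illustrative assumptions;
    the exact values used in this work are not stated in the excerpt.
    """
    lo, hi = np.percentile(volume, [low_pct, high_pct])
    clipped = np.clip(volume, lo, hi)
    return (clipped - lo) / max(hi - lo, 1e-8)
```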
3 Method

To segment tissue classes in cardiac 2D MR scans, we used the DCNN developed by Wolterink et al. The architecture comprises a sequence of ten convolutional layers with increasing levels of kernel dilation, which yields a large receptive field for each voxel. The network has two input channels that take anatomically corresponding ED and ES slices, motivated by the assumption that the DCNN can leverage cardiac motion differences between the ED and ES time points to better localize the target structures. The network has eight output channels to simultaneously segment the LV, RV, myocardium, and background in ED and ES. Softmax probabilities are calculated over the four tissue classes for ED and ES, respectively. To enhance generalization performance, the model uses batch normalization and weight decay.
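The effect of increasing kernel dilation on the field of view can be illustrated with a short receptive-field calculation. The dilation scheme below is a hypothetical example in the spirit of Wolterink et al., not the exact configuration, which is elided above:

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions.

    Each layer enlarges the field of view by (kernel_size - 1) * dilation.
    """
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Hypothetical scheme: eight 3x3 layers with exponentially increasing
# dilation, followed by two 1x1 classification layers.
rf = receptive_field([3] * 8 + [1, 1], [1, 1, 2, 4, 8, 16, 32, 1, 1, 1])
```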
To acquire spatial uncertainty maps of the segmentation during testing, two approaches were evaluated. First, to obtain Bayesian uncertainty maps (u-maps), we implemented Monte Carlo dropout (MC dropout), introduced by Gal & Ghahramani for approximate Bayesian inference. We added dropout as the last operation in all but the final layer, randomly switching off 10 percent of a layer's hidden units. By enabling dropout during testing, softmax probabilities are obtained with 10 samples per voxel. As an overall measure of uncertainty we used the maximum softmax variance per voxel over all classes, where the per-class variance per voxel is computed from the softmax samples of that class. We chose the maximum instead of the mean (as used, e.g., by Leibig et al.) because we found that averaging attenuates the uncertainties. Second, to obtain entropy maps (e-maps), we computed the multi-class entropy per voxel. The quality of these maps, however, depends on the calibration of the acquired probabilities.
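A minimal sketch of this uncertainty measure, using synthetic softmax samples in place of actual MC-dropout network outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MC-dropout output:
# T dropout samples, C classes, an H x W slice of voxels.
T, C, H, W = 10, 4, 8, 8
logits = rng.normal(size=(T, C, H, W))
samples = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Per-voxel, per-class variance over the T samples ...
var_per_class = samples.var(axis=0)   # shape (C, H, W)
# ... and the maximum over classes as the overall u-map.
u_map = var_per_class.max(axis=0)     # shape (H, W)
```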
Therefore, we trained the model with three different loss functions: soft-Dice (SD), cross-entropy (CE), and the Brier score (BS). The Brier score measures the average gap between softmax probabilities and the reference labels, and thus provides information about both the accuracy and the uncertainty of the model. Computationally, the BS equals the squared error between a one-hot encoded reference label and the associated softmax probabilities.
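A small sketch of the Brier score as described, i.e., the mean squared error between one-hot references and softmax probabilities:

```python
import numpy as np

def brier_score(probs, labels, num_classes=4):
    """Mean squared error between softmax outputs and one-hot references.

    probs:  (N, C) softmax probabilities per sample.
    labels: (N,) integer reference labels.
    """
    one_hot = np.eye(num_classes)[labels]
    return np.mean(np.sum((probs - one_hot) ** 2, axis=1))
```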
For four-fold cross-validation, we split the dataset into training and test images. Each model was evaluated on the held-out test images, and we report combined results over all images. During training we used image samples of voxels, padded to accommodate the receptive field. Training samples were augmented by rotations of the images and references. The model was trained for 150,000 iterations using the snapshot ensemble technique; after every 10,000th iteration we reset the learning rate to its original value and stored the model. We used mini-batches and applied Adam as the stochastic gradient descent optimizer. To compare u-maps with e-maps, each model was evaluated twice at test time. First, to obtain u-maps, we used the last six stored models (iterations 100,000 to 150,000) of each fold. The tissue class per voxel was determined using the mean softmax probability over 60 samples (10 samples per voxel per model). In addition, these samples served to compute the maximum per-class variance described above. Second, to obtain e-maps, we employed only the last stored model of each fold. We disabled dropout during inference and used a single forward pass to compute the softmax probabilities and determine the tissue class per voxel. The corresponding e-maps were computed as the entropy of the four-class probability distribution.
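The per-voxel entropy computation for the e-maps can be sketched as follows (entropy here is in bits; the base of the logarithm is an assumption):

```python
import numpy as np

def entropy_map(probs, eps=1e-12):
    """Per-voxel entropy (in bits) of a (C, H, W) class-probability volume.

    A small epsilon guards against log(0) for confident predictions.
    """
    return -np.sum(probs * np.log2(probs + eps), axis=0)
```

Uniformly distributed probabilities over the four classes yield the maximum entropy of two bits; a one-hot prediction yields (near) zero.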
Finally, for both evaluations, as a post-processing step the 3D probability volumes were filtered by selecting the largest 3D connected component for each class. The models were implemented in the PyTorch framework and trained on an Nvidia GTX Titan X GPU with 12 GB of memory.
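The post-processing step can be sketched with `scipy.ndimage`; the use of SciPy and of default face connectivity are assumptions, as the text does not name the implementation:

```python
import numpy as np
from scipy import ndimage

def largest_component(mask):
    """Keep only the largest 3D connected component of a binary mask.

    Uses scipy's default (face) connectivity for 3D labeling.
    """
    labeled, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labeled, range(1, n + 1))
    return labeled == (np.argmax(sizes) + 1)
```

In practice this would be applied per tissue class to the thresholded probability volume.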
4 Results and Discussion
To evaluate model calibration we created so-called reliability diagrams (RDs). Figs. 0(a), 0(b) and 0(c) show the predicted probabilities, discretized into ten bins, plotted against the true positive fraction of each bin (y-axis). If a model is perfectly calibrated, the curve matches the dashed diagonal. We conclude that a model trained with the soft-Dice loss produces inferior calibrated probabilities compared to the other two loss functions. We conjecture that this is caused by the relatively low penalty the soft-Dice loss induces when the model is underconfident for true positive tissue labels (see Fig. 0(d)).
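The binning behind such a reliability diagram can be sketched as follows (a generic sketch, not the authors' evaluation code):

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    """Per-bin mean confidence and observed accuracy for a reliability diagram.

    confidences: predicted probability of the chosen class per voxel.
    correct:     boolean array, whether the prediction matched the reference.
    Empty bins are reported as NaN.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_conf, bin_acc = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            bin_conf.append(confidences[in_bin].mean())
            bin_acc.append(correct[in_bin].mean())
        else:
            bin_conf.append(np.nan)
            bin_acc.append(np.nan)
    return np.array(bin_conf), np.array(bin_acc)
```

For a well-calibrated model, the per-bin accuracy tracks the per-bin confidence, i.e., the plotted points lie on the diagonal.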
To compare the quality of the obtained uncertainty maps, we simulate a human-in-the-loop setting. We combine the predicted segmentation masks with the u-maps or e-maps and assume that voxels above a tolerated uncertainty or entropy threshold are corrected to their reference label by an expert. For each threshold we compute the Dice score of the corrected segmentation mask. Figs. 1(a) and 1(b) visualize the Dice score as a function of the average percentage of voxels thus referred. We observe a monotonic increase in segmentation accuracy as more voxels are referred. For example, inspecting the referral curves for the Brier loss in Fig. 1(b), we note that referring on average 1% of the voxels in an image increases performance by , and % for RV, Myo, and LV, respectively. These results are similar for the u-maps and the e-maps. In each experiment, the case in which no voxels are referred is considered the baseline (leftmost y-axis values). Baseline segmentation performance is highest when the model is trained with the Brier loss, slightly lower for soft-Dice, and lowest for cross-entropy. Except for the soft-Dice loss, the u-map and e-map curves follow each other closely, which suggests that both carry similar information. Excluding the soft-Dice loss, segmentation performance with referral using u-maps or e-maps reaches a Dice score of nearly one when a sufficient number of voxels is referred. Hence, we may conclude that areas of high uncertainty and entropy almost completely cover the regions of incorrect segmentation (without covering the complete image, in which case all voxels would be referred, corresponding to a trivial solution). Results obtained after referral using entropy maps for a model trained with the soft-Dice loss are clearly inferior to the performance achieved with the Bayesian uncertainty maps.
We assume this is due to the miscalibration of the model (see Fig. 0(b)). Furthermore, u-maps in general contain more voxels with high uncertainty that are nevertheless segmented correctly. This is visible for the cross-entropy loss in Fig. 1(a), where the Myo referral curve obtained with u-maps lags behind the corresponding curve based on entropy.
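The referral simulation described above can be sketched on synthetic masks; `dice` and `refer_and_correct` are illustrative helpers, not the authors' code:

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / max(a.sum() + b.sum(), 1)

def refer_and_correct(pred, reference, u_map, threshold):
    """Simulate expert correction: voxels whose uncertainty exceeds the
    threshold are replaced by the reference label. Returns the corrected
    mask and the fraction of voxels referred."""
    corrected = pred.copy()
    referred = u_map > threshold
    corrected[referred] = reference[referred]
    return corrected, referred.mean()
```

Sweeping the threshold and plotting Dice against the referred fraction reproduces the kind of referral curve discussed above.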
5 New or Breakthrough Work to Be Presented
This study shows how automatic segmentation can be combined with spatial uncertainty maps to increase segmentation performance in a human-in-the-loop setting. Furthermore, our results reveal that valuable spatial uncertainty maps can be obtained with low computational effort using well-calibrated DCNNs.
Using a publicly available cardiac cine MRI dataset, we showed that a (Bayesian) dilated CNN trained with the Brier loss produces valuable Bayesian uncertainty and entropy maps. Our results show that regions of high uncertainty almost completely cover areas of incorrect segmentation. Well-calibrated models thus yield useful spatial entropy maps, which can be used to increase the segmentation performance of the model.
This work has not been submitted for publication or presentation elsewhere.
-  Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.-A., Cetin, I., Lekadir, K., Camara, O., Ballester, M. A. G., et al., “Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?,” IEEE Transactions on Medical Imaging (2018).
-  Leibig, C., Allken, V., Ayhan, M. S., Berens, P., and Wahl, S., “Leveraging uncertainty information from deep neural networks for disease detection,” Scientific reports 7(1), 17816 (2017).
-  Kwon, Y., Won, J.-H., Kim, B. J., and Paik, M. C., “Uncertainty quantification using bayesian neural networks in classification: Application to ischemic stroke lesion segmentation,” in [Medical Imaging with Deep Learning Conference ], (2018).
-  Ayhan, M. S. and Berens, P., “Test-time data augmentation for estimation of heteroscedastic aleatoric uncertainty in deep neural networks,” in [Medical Imaging with Deep Learning Conference ], (2018).
-  Lakshminarayanan, B., Pritzel, A., and Blundell, C., “Simple and scalable predictive uncertainty estimation using deep ensembles,” in [Advances in Neural Information Processing Systems ], 6402–6413 (2017).
-  DeVries, T. and Taylor, G. W., “Learning confidence for out-of-distribution detection in neural networks,” arXiv preprint arXiv:1802.04865 (2018).
-  Wolterink, J. M., Leiner, T., Viergever, M. A., and Išgum, I., “Automatic segmentation and disease classification using cardiac cine MR images,” in [International Workshop on Statistical Atlases and Computational Models of the Heart ], 101–110, Springer (2017).
-  Gal, Y. and Ghahramani, Z., “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in [International Conference on Machine Learning (ICML) ], 1050–1059 (2016).
-  Brier, G. W., “Verification of forecasts expressed in terms of probability,” Monthly Weather Review 78(1), 1–3 (1950).
-  Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q., “Snapshot ensembles: Train 1, get m for free,” arXiv preprint arXiv:1704.00109 (2017).
-  Kingma, D. and Ba, J., “Adam: A method for stochastic optimization,” in [ICLR ], 5 (2015).
-  DeGroot, M. H. and Fienberg, S. E., “The comparison and evaluation of forecasters,” The Statistician , 12–22 (1983).