1 Purpose
Decisions by medical experts are increasingly enriched and augmented by intelligent machines, e.g., through computer-aided diagnosis (CAD). The quality of the joint decision process would improve if the automatic systems were able to indicate their uncertainty. This assumes that the provided uncertainty information is reliable, i.e., valuable enough to be taken into account. A system indicating high uncertainty in image areas of incorrect segmentations could be used to detect such regions and subsequently refer them to medical experts. Applying such a human-in-the-loop setting would result in increased segmentation performance. In addition, such a setting could mitigate a severe deficiency of current state-of-the-art deep learning segmentation methods, which occasionally generate anatomically implausible segmentations [1] that a medical expert would never make.
Previous research has mainly focused on the assessment of uncertainty in disease prediction [2] or tissue segmentation [3] by utilizing Bayesian neural networks (BNNs) or test-time data augmentation techniques [4]. Additional methods to estimate uncertainty are Deep Ensembles [5] and Learned Confidence Estimates [6]. In the former, multiple models are trained and the variance of their predictions is used as a confidence measure, whereas in the latter, the model outputs a confidence measure simultaneously with the prediction.
In this work, using multi-structure segmentation in cardiac MR images, we introduce a method that simultaneously generates segmentation masks and uncertainty maps by using a dilated convolutional neural network (DCNN). To obtain segmentation uncertainty maps, we compare two approaches. First, we employ Bayesian uncertainty maps (u-maps) that are obtained by Bayesian DCNNs (BDCNNs). Second, we use entropy maps (e-maps), which can be efficiently generated by any probabilistic classifier, as entropy is a theoretically grounded quantification of uncertainty in information theory. In addition, we show that a valuable uncertainty measure can be obtained if the applied model is well calibrated, i.e., if the generated probabilities represent the likelihood of being correct. We demonstrate this by simulating a human-in-the-loop setting and provide evidence that image areas indicated as highly uncertain almost entirely cover the regions of incorrect segmentation. Hence, the fused information can be employed in clinical practice to inform an expert whether and where the generated segmentation should be adjusted.
2 Data description
In this work, data from the MICCAI Automated Cardiac Diagnosis Challenge (ACDC) [1] was used. The dataset consists of cardiac cine MR images (CMRIs) of 150 patients, each clinically diagnosed as one of five classes: normal, dilated cardiomyopathy (DCM), hypertrophic cardiomyopathy (HCM), heart failure with infarction (MINF), or right ventricular abnormality (RVA). Cases are uniformly distributed over the classes. Manual reference segmentations of the left ventricular (LV) cavity, right ventricular (RV) endocardium and myocardium at end-diastole (ED) and end-systole (ES) are provided for cases. For each patient, short-axis (SA) CMRIs with 28–40 frames are available, in which the ED and ES frames have been indicated. On average, images consist of nine slices, where each slice has a spatial resolution of voxels. The image slices cover the LV from the base to the apex. In-plane voxel spacing varies from to , with slice thickness from to and sometimes an inter-slice gap of . To correct for differences in voxel size, all 2D image slices were resampled to spacing. Furthermore, to correct for image intensity differences among images, each MR volume was normalized between [, ] according to the th and th percentile of intensities in the image. For detailed specifications of the acquisition protocol we refer the reader to Bernard et al. [1].
3 Method
To perform segmentation of the tissue classes in cardiac 2D MR scans, we used the DCNN developed by Wolterink et al. [7]. The architecture comprises a sequence of ten convolutional layers with increasing levels of kernel dilation, which results in a receptive field of voxels for each output voxel. The network has two input channels that take anatomically corresponding ED and ES slices, motivated by the assumption that the DCNN can leverage cardiac motion differences between the ED and ES time points to better localize the target structures. The network has eight output channels to simultaneously segment the LV, RV, myocardium and background in ED and ES. Softmax probabilities are calculated over the four tissue classes for ED and ES, respectively. To enhance generalization performance, the model uses batch normalization and weight decay.
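The layer structure described above can be sketched in PyTorch. The dilation schedule, channel width, and dropout rate below are plausible assumptions in the spirit of Wolterink et al. [7], not values taken verbatim from this paper:

```python
import torch
import torch.nn as nn

class DilatedCNN(nn.Module):
    """Sketch of the dilated network: 2 input channels (paired ED and ES
    slices), 8 output channels (4 tissue classes x 2 time points). The
    dilation schedule and channel count are illustrative assumptions."""
    def __init__(self, channels=32):
        super().__init__()
        layers, in_ch = [], 2
        for d in (1, 1, 2, 4, 8, 16, 32):  # growing dilation widens the receptive field
            layers += [nn.Conv2d(in_ch, channels, 3, dilation=d),
                       nn.BatchNorm2d(channels), nn.ReLU(inplace=True)]
            in_ch = channels
        layers += [nn.Dropout2d(0.1),                # allows MC dropout at test time
                   nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
                   nn.Conv2d(channels, 8, 1)]        # per-voxel logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, 2, H, W); output spatial size shrinks because no padding is used
        return self.net(x)
```

With this (assumed) schedule, each 3×3 convolution of dilation d trims 2d voxels per spatial dimension, so the dilated stack removes 128 voxels in total.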
To acquire spatial uncertainty maps of the segmentation during testing, two different approaches were evaluated. First, to obtain Bayesian uncertainty maps (u-maps), we implemented Monte Carlo dropout (MC dropout), introduced by Gal & Ghahramani [8] for approximate Bayesian inference. We added dropout as the last operation in all but the final layer (by randomly switching off 10 percent of a layer's hidden units). By enabling dropout during testing, softmax probabilities are obtained with 10 samples per voxel. As an overall measure of uncertainty we used the maximum softmax variance per voxel over all classes, where the variance per voxel per class is computed from the softmax samples of that class. We chose the maximum instead of the mean (as used, e.g., by Leibig et al. [2]) because we found that averaging attenuates the uncertainties. Second, to obtain entropy maps (e-maps), we computed the multi-class entropy per voxel. However, the quality of these maps depends on the calibration of the acquired probabilities. Therefore, we trained the model with three different loss functions: soft-Dice (SD), cross-entropy (CE), and the Brier score (BS) [9], which is equal to the average gap between the softmax probabilities and the references and thus provides information about both the accuracy and the uncertainty of the model. Computationally, the BS is equal to the squared error between a one-hot encoded label and its associated probability.
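To make the Brier-score formulation concrete, the following minimal NumPy sketch computes it for per-voxel softmax outputs (the function name and flattened input shape are our own):

```python
import numpy as np

def brier_loss(probs, labels, num_classes=4):
    """Brier score: mean squared error between softmax probabilities and
    one-hot encoded reference labels, averaged over voxels and classes.
    probs: (n_voxels, num_classes), labels: (n_voxels,) integer classes."""
    onehot = np.eye(num_classes)[labels]      # (n_voxels, num_classes)
    return np.mean((probs - onehot) ** 2)
```

A perfectly confident correct prediction yields a score of 0; an under-confident one is penalized by the squared gap to the one-hot target, which is what makes the resulting probabilities informative about uncertainty.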
To apply four-fold cross-validation, we split the dataset into and training and test images, respectively. Each model was evaluated on the held-out test images, and we report combined results over all images. During training we used images with voxel samples, padded to to accommodate the voxel receptive field. Training samples were augmented by degree rotations of the images and references. The model was trained for 150,000 iterations using the snapshot ensemble technique described in [10]: after every 10,000th iteration we reset the learning rate to its original value of and stored the model. We used mini-batches of size and applied Adam [11] as stochastic gradient descent optimizer. To compare u-maps with e-maps, each model was evaluated twice at test time. First, to obtain u-maps, we used the last six stored models (iterations 100,000 to 150,000) of each fold to generate segmentation results. The tissue class per voxel was determined using the mean softmax probability over 60 samples (10 samples per voxel per model). In addition, these probabilities served to compute the maximum variance (as described above). Second, to obtain e-maps, we employed only the last stored model of each fold. We disabled dropout during inference and used one forward pass to compute the softmax probabilities and determine the tissue class per voxel. The corresponding e-maps were computed as the entropy of the four-class probability distribution.
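The two kinds of maps reduce to the following computations, assuming the stochastic softmax samples and the single-pass softmax output are already available as arrays (shapes and names are illustrative):

```python
import numpy as np

def umap_max_variance(samples):
    """samples: (T, C, H, W) softmax outputs from T stochastic forward
    passes with dropout enabled. The per-voxel, per-class variance is
    computed over the T samples; the maximum over classes is the u-map."""
    var_per_class = samples.var(axis=0)   # (C, H, W)
    return var_per_class.max(axis=0)      # (H, W)

def emap_entropy(probs, eps=1e-12):
    """probs: (C, H, W) softmax output of one deterministic forward pass.
    Returns the per-voxel entropy of the class distribution (the e-map)."""
    return -(probs * np.log(probs + eps)).sum(axis=0)
```

A voxel on which all dropout samples agree gets zero variance, and a one-hot softmax output gets (near-)zero entropy; disagreement or a flat distribution drives the respective map value up.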
Finally, for both evaluations, as a post-processing step the 3D probability volumes were filtered by selecting the largest 3D connected component for each class. The models were implemented using the PyTorch framework and trained on one Nvidia GTX Titan X GPU with 12 GB of memory.
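The post-processing step can be sketched with SciPy's connected-component labelling (a minimal illustration for one binary class mask; in practice it is applied to each tissue class in turn):

```python
import numpy as np
from scipy import ndimage

def keep_largest_component(mask):
    """Keep only the largest 3D connected component of a binary mask,
    discarding smaller spurious islands."""
    labeled, n = ndimage.label(mask)   # assign a label to each component
    if n == 0:
        return mask                    # empty mask: nothing to filter
    # size (voxel count) of each component, labels run from 1 to n
    sizes = ndimage.sum(mask, labeled, range(1, n + 1))
    return labeled == (np.argmax(sizes) + 1)
```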
4 Results and Discussion
To evaluate model calibration we created so-called reliability diagrams (RD) [12]. Figs. 0(a), 0(b) and 0(c) show the predicted probabilities discretized into ten bins and plotted against the true positive fraction of each bin (y-axis). If the model is perfectly calibrated, the diagram matches the dashed diagonal. We conclude that a model trained with the soft-Dice loss produces less well-calibrated probabilities than the other two loss functions. We conjecture that this is caused by the relatively low penalty that the soft-Dice loss induces when the model is under-confident for true positive tissue labels (see Fig. 0(d)).
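A reliability diagram reduces to the following binning computation; this sketch (names and bin handling are our own) returns, per bin, the mean predicted probability and the observed true positive fraction:

```python
import numpy as np

def reliability_bins(probs, correct, n_bins=10):
    """probs: predicted probabilities in [0, 1]; correct: 1.0 where the
    prediction matched the reference, else 0.0. Returns per-bin mean
    confidence and per-bin accuracy (NaN for empty bins)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    conf = np.full(n_bins, np.nan)
    acc = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            conf[b] = probs[in_bin].mean()
            acc[b] = correct[in_bin].mean()
    return conf, acc
```

Plotting `acc` against `conf` gives the diagram: for a well-calibrated model the points lie on the diagonal, while a soft-Dice-style over-confident model sags below it.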
To compare the quality of the obtained uncertainty maps, we simulate a human-in-the-loop setting. We combine the information of the predicted segmentation masks with the u-maps or e-maps and assume that voxels above a tolerated uncertainty or entropy threshold are corrected to their reference label by an expert. For each threshold we compute the Dice score of the corrected segmentation mask. Figs. 1(a) and 1(b) visualize the Dice score as a function of the average percentage of voxels thus referred. We observe a monotonic increase in prediction accuracy as more voxels are referred. For example, inspecting the referral curves for the Brier loss in Fig. 1(b), we note that when referring on average 1% of the voxels in an image, performance increases by , and % for RV, Myo and LV, respectively. These results are similar for the u-maps and the e-maps. In each experiment, the case in which no voxels are referred for correction is considered the baseline (leftmost y-axis values). We observe that baseline segmentation performance is highest when the model is trained with the Brier loss, slightly lower with the soft-Dice loss, and lowest with cross-entropy. Except for the soft-Dice loss, we note that u-maps and e-maps follow each other quite closely, which suggests that both carry similar information. Excluding the soft-Dice loss, segmentation performance with referral using u-maps or e-maps reaches a Dice score of nearly one when a sufficient number of voxels is referred, without covering the complete image (in which case all voxels would be referred, corresponding to a trivial solution). Hence, we may conclude that areas of high uncertainty and entropy almost completely cover the regions of incorrect segmentation. Results obtained after referral using entropy maps for a model trained with the soft-Dice loss are clearly inferior to the performance achieved with the Bayesian uncertainty maps.
We assume that this is due to the miscalibration of the model (see Fig. 0(b)). Furthermore, u-maps in general contain more voxels with high uncertainty that are nevertheless correctly segmented. This is visible for the cross-entropy loss in Fig. 1(a), where the Myo referral curve obtained with u-maps lags behind the corresponding curve that uses the entropy information.
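The referral simulation described in this section amounts to the following sketch (a minimal single-class illustration with hypothetical names; the paper's evaluation averages over structures and images):

```python
import numpy as np

def dice(pred, ref, cls):
    """Dice overlap between prediction and reference for one class."""
    a, b = pred == cls, ref == cls
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum() + 1e-8)

def referral_dice(pred, ref, uncertainty, threshold, cls):
    """Simulated expert referral: voxels whose uncertainty (or entropy)
    exceeds the threshold are replaced by their reference label, then the
    Dice score of the corrected mask is computed."""
    corrected = np.where(uncertainty > threshold, ref, pred)
    return dice(corrected, ref, cls)
```

Sweeping the threshold from high to low traces the referral curve: if high-uncertainty regions cover the segmentation errors, the Dice score climbs toward one as more voxels are referred.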
5 New or Breakthrough Work to Be Presented
This study shows how automatic segmentation can be combined with spatial uncertainty maps to increase segmentation performance in a human-in-the-loop setting. Furthermore, our results reveal that valuable spatial uncertainty maps can be obtained with low computational effort using well-calibrated DCNNs.
6 Conclusions
Using a publicly available cardiac cine MRI dataset, we showed that a (Bayesian) dilated CNN trained with the Brier loss produces valuable Bayesian uncertainty and entropy maps. Our results convey that regions of high uncertainty almost completely cover areas of incorrect segmentation. Well-calibrated models enable us to obtain useful spatial entropy maps, which can be used to increase the segmentation performance of the model.
This work has not been submitted for publication or presentation elsewhere.
References
 [1] Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.-A., Cetin, I., Lekadir, K., Camara, O., Ballester, M. A. G., et al., “Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?,” IEEE Transactions on Medical Imaging (2018).
 [2] Leibig, C., Allken, V., Ayhan, M. S., Berens, P., and Wahl, S., “Leveraging uncertainty information from deep neural networks for disease detection,” Scientific Reports 7(1), 17816 (2017).
 [3] Kwon, Y., Won, J.-H., Kim, B. J., and Paik, M. C., “Uncertainty quantification using Bayesian neural networks in classification: Application to ischemic stroke lesion segmentation,” in [Medical Imaging with Deep Learning Conference ], (2018).

 [4] Ayhan, M. S. and Berens, P., “Test-time data augmentation for estimation of heteroscedastic aleatoric uncertainty in deep neural networks,” in [Medical Imaging with Deep Learning Conference ], (2018).
 [5] Lakshminarayanan, B., Pritzel, A., and Blundell, C., “Simple and scalable predictive uncertainty estimation using deep ensembles,” in [Advances in Neural Information Processing Systems ], 6402–6413 (2017).
 [6] DeVries, T. and Taylor, G. W., “Learning confidence for outofdistribution detection in neural networks,” arXiv preprint arXiv:1802.04865 (2018).
 [7] Wolterink, J. M., Leiner, T., Viergever, M. A., and Išgum, I., “Automatic segmentation and disease classification using cardiac cine MR images,” in [International Workshop on Statistical Atlases and Computational Models of the Heart ], 101–110, Springer (2017).

 [8] Gal, Y. and Ghahramani, Z., “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in [International Conference on Machine Learning (ICML) ], 1050–1059 (2016).
 [9] Brier, G. W., “Verification of forecasts expressed in terms of probability,” Monthly Weather Review 78(1), 1–3 (1950).
 [10] Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q., “Snapshot ensembles: Train 1, get m for free,” arXiv preprint arXiv:1704.00109 (2017).
 [11] Kingma, D. and Ba, J., “Adam: A method for stochastic optimization,” in [ICLR ], 5 (2015).
 [12] DeGroot, M. H. and Fienberg, S. E., “The comparison and evaluation of forecasters,” The Statistician, 12–22 (1983).