1 Introduction
Automatic segmentation of structures is a fundamental task in medical image analysis. Segmentations either serve as an intermediate step in a more elaborate pipeline or as an end goal in themselves. The clinical interest often lies in the volume of a certain structure (e.g. the volume of a tumor, the volume of a stroke lesion), which can be derived from its segmentation [7]. The segmentation task can also carry inherent uncertainty (e.g. noise, lack of contrast, artifacts, incomplete information).
To evaluate and compare the quality of a segmentation, the similarity between the true segmentation (i.e. the segmentation derived from an expert's delineation of the structure) and the predicted segmentation must be measured. For this purpose, multiple metrics exist. Among others, overlap measures (e.g. Dice score, Jaccard index) and surface distances (e.g. Hausdorff distance, average surface distance) are commonly used [11]. The focus on one particular metric, the Dice score, has led to the adoption of a differentiable surrogate loss, the so-called soft Dice [5, 13, 16], to train convolutional neural networks (CNNs). Many state-of-the-art methods clearly outperform the established cross-entropy losses by using soft Dice as the loss function [10, 3].
In this work, we investigate the effect on volume estimation when optimizing a CNN w.r.t. cross-entropy or soft Dice, and relate this to the inherent uncertainty in a task. First, we look into this volumetric bias theoretically, with some numerical examples. We find that the use of soft Dice leads to a systematic under- or overestimation of the predicted volume of a structure, which depends on the inherent uncertainty present in the task. Second, we empirically validate these results on four medical tasks: two tasks with relatively low inherent uncertainty (i.e. the segmentation of third molars from dental radiographs [4], BRATS 2018 [14, 12, 1, 2]) and two tasks with relatively high inherent uncertainty (i.e. ISLES 2017 [8, 18], ISLES 2018 [9]).

2 Theoretical analysis
Let us formalize an image as $I$ voxels $x_i$, each voxel corresponding to a true class label $y_i$ with $i = 1 \ldots I$, forming the true class label map $Y$. Typical in medical image analysis, there is uncertainty on the true class label map (e.g. due to intra- and inter-rater variability; see Sect. 2.2). Under the assumption of binary image segmentation with $y_i \in \{0, 1\}$, a probabilistic label map $P$ can be constructed as $P = [p_i]$, where each $p_i = \Pr(y_i = 1)$ is the probability of $x_i$ belonging to the structure of interest. Similarly, we have the maps of voxel-wise label predictions $\tilde{Y} = [\tilde{y}_i]$ and probabilities $\tilde{P} = [\tilde{p}_i]$. In this setting, the class label map is constructed from the map of predicted probabilities according to the highest likelihood. The Dice score is defined on the label maps as:
$$\mathrm{Dice}(Y, \tilde{Y}) = \frac{2\,|Y \cap \tilde{Y}|}{|Y| + |\tilde{Y}|} = \frac{2 \sum_i y_i \tilde{y}_i}{\sum_i y_i + \sum_i \tilde{y}_i} \quad (1)$$
The volumes of the true structure and of the predicted structure are then, with $v$ the volume of a single voxel:
$$V(Y) = v \sum_i y_i, \qquad V(\tilde{Y}) = v \sum_i \tilde{y}_i \quad (2)$$
In case the label map is probabilistic, we need to work out the expectations:
$$\mathbb{E}[V(Y)] = v \sum_i p_i, \qquad \mathbb{E}[V(\tilde{Y})] = v \sum_i \tilde{p}_i \quad (3)$$
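These expectations can be computed directly from a probability map; a minimal numeric sketch (the values are illustrative and assume a 2 mm isotropic voxel of 8 mm³, as used later in Sect. 3.1):

```python
import numpy as np

# Toy check of Eqs. (2)-(3): the expected volume of a probabilistic label map
# is the voxel volume times the sum of the foreground probabilities.
v = 8.0                             # mm^3 per voxel (2 mm isotropic, illustrative)
p = np.array([0.0, 0.2, 0.9, 1.0])  # per-voxel foreground probabilities (made up)
expected_volume = v * p.sum()       # v * (0.0 + 0.2 + 0.9 + 1.0)
```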
2.1 Risk minimization
In the setting of supervised and gradient-based training of CNNs [6] we are performing empirical risk minimization. Assume the CNN, with a certain topology, is parametrized by $\theta$ and represents the functions $\tilde{P} = f(X; \theta)$. Further assume we have access to the entire joint probability distribution $\Pr(X, Y)$ at both training and testing time, with $X$ the information (for CNNs this is typically a centered image patch around the location of $x_i$) that the network uses to make a prediction for $x_i$. Under these conditions, the general risk minimization principle is applicable and states that, in order to optimize the performance for a certain non-negative and real-valued loss $\ell$ (e.g. the metric or its surrogate loss) at test time, we can optimize the same loss during the learning phase [17]. The risk $R_\ell(\theta)$ associated with the loss $\ell$ and parametrization $\theta$ of the CNN, without regularization, is defined as the expectation of the loss function:

$$R_\ell(\theta) = \mathbb{E}_{(X, Y)}\left[\ell(Y, f(X; \theta))\right] \quad (4)$$
For years, minimizing the negative log-likelihood has been the gold standard in terms of risk minimization. For this purpose, and due to its elegant mathematical properties, the voxel-wise binary cross-entropy loss ($\ell_{CE}$) is used:
$$\ell_{CE}(Y, \tilde{P}) = -\sum_i \left[ y_i \log \tilde{p}_i + (1 - y_i) \log (1 - \tilde{p}_i) \right] \quad (5)$$
More recently, the soft Dice loss ($\ell_{SD}$) has been used in the optimization of CNNs to directly optimize the Dice score at test time [5, 13, 16]. Rewriting Eq. 1 into its non-negative and real-valued surrogate loss function as in [5]:
$$\ell_{SD}(Y, \tilde{P}) = 1 - \frac{2 \sum_i y_i \tilde{p}_i}{\sum_i y_i + \sum_i \tilde{p}_i} \quad (6)$$
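Both losses take only a few lines; a minimal NumPy sketch of Eqs. 5-6 (the `eps` clipping is our own numerical safeguard, not part of the definitions):

```python
import numpy as np

def cross_entropy(y, p_hat, eps=1e-7):
    """Voxel-wise binary cross-entropy (Eq. 5) on flat arrays."""
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def soft_dice(y, p_hat, eps=1e-7):
    """Soft Dice loss (Eq. 6): 1 - 2*sum(y*p)/(sum(y) + sum(p))."""
    return 1.0 - 2.0 * np.sum(y * p_hat) / (np.sum(y) + np.sum(p_hat) + eps)

# Illustrative label map and predicted probabilities
y = np.array([0, 0, 1, 1, 1])
p_hat = np.array([0.1, 0.2, 0.8, 0.9, 0.7])
loss_ce = cross_entropy(y, p_hat)
loss_sd = soft_dice(y, p_hat)
```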
2.2 Uncertainty
There is considerable uncertainty in the segmentation of medical images. Images might lack contrast, contain artifacts, be noisy or be incomplete regarding the necessary information (e.g. in ISLES 2017 we need to predict the infarction after treatment from images taken before, which straightforwardly introduces inherent uncertainty). Even at the level of the true segmentation, uncertainty exists due to intra- and inter-rater variability. We will investigate what happens with the estimated volume of a certain structure in an image under the assumption of a perfect segmentation algorithm (i.e. the prediction is the one that minimizes the empirical risk).
Assuming independent voxels, or more generally that we can simplify Eq. 3 into $R$ independent regions $r$ with true uncertainty $p_r$ and predicted uncertainty $\tilde{p}_r$, and corresponding volumes $v_r = n_r v$, with $n_r$ the number of voxels belonging to region $r$ (having each voxel as an independent region when $n_r = 1$), we get:
$$\mathbb{E}[V(Y)] = \sum_r v_r p_r, \qquad \mathbb{E}[V(\tilde{Y})] = \sum_r v_r \tilde{p}_r \quad (7)$$
We analyze for $\ell_{CE}$ the predicted uncertainty $\tilde{p}_r$ that minimizes the risk $R_{CE}$:
$$R_{CE} = \mathbb{E}_Y\left[\ell_{CE}(Y, \tilde{P})\right] = -\sum_r n_r \left[ p_r \log \tilde{p}_r + (1 - p_r) \log (1 - \tilde{p}_r) \right] \quad (8)$$
We need to find, for each independent region $r$:
$$\tilde{p}_r^* = \arg\min_{\tilde{p}_r} \; -\left[ p_r \log \tilde{p}_r + (1 - p_r) \log (1 - \tilde{p}_r) \right] \quad (9)$$
This function is continuous and its first derivative is monotonically increasing in the interval $]0, 1[$. First-order conditions w.r.t. $\tilde{p}_r$ give the optimal value for the predicted uncertainty: $\tilde{p}_r^* = p_r$. With the predicted uncertainty equal to the true uncertainty, $\mathbb{E}[V(\tilde{Y})]$ becomes an unbiased volume estimator.
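This first-order condition can be verified numerically; a small grid search over the per-region cross-entropy risk (an illustrative sketch with an arbitrary true uncertainty of 0.3):

```python
import numpy as np

# Per-region CE risk R(q) = -(p*log q + (1-p)*log(1-q)) for true uncertainty p.
# A grid search confirms the minimum at q = p, i.e. the CE-optimal prediction
# reproduces the true uncertainty and the expected volume is unbiased.
def ce_risk(p, q):
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

p = 0.3                               # illustrative true uncertainty
qs = np.linspace(0.01, 0.99, 99)      # candidate predictions, step 0.01
q_star = qs[np.argmin(ce_risk(p, qs))]
```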
We analyze for $\ell_{SD}$ the predicted uncertainty $\tilde{p}_r$ that minimizes the risk $R_{SD}$:
$$R_{SD} = \mathbb{E}_Y\left[\ell_{SD}(Y, \tilde{P})\right] = \mathbb{E}_Y\left[ 1 - \frac{2 \sum_r n_r y_r \tilde{p}_r}{\sum_r n_r y_r + \sum_r n_r \tilde{p}_r} \right] \quad (10)$$
We need to find, for each independent region $r$:
$$\tilde{p}_r^* = \arg\min_{\tilde{p}_r} R_{SD} \quad (11)$$
This minimization is more complex, and we analyze its behavior by inspecting the values of $R_{SD}$ numerically (the expectation runs over the joint configurations of the region labels $y_r \sim \mathrm{Bernoulli}(p_r)$). We will consider scenarios with either a single region or multiple independent regions with inherent uncertainty in the image. For each scenario we vary the inherent uncertainty and the total uncertain volume.
2.2.1 Single region of uncertainty.
Imagine the segmentation of an image with three independent regions $r_0$, $r_1$ and $r_2$, as depicted in Fig. 1 (A0). Region $r_0$ is certainly not part of the structure ($p_0 = 0$, i.e. background), region $r_1$ belongs to the structure with probability $p_1 = p$, and region $r_2$ is certainly part of the structure ($p_2 = 1$). Let their volumes be $v_0$, $v_1 = \alpha v_2$ and $v_2$, respectively, with $\alpha$ the volume ratio of the uncertain to the certain part of the structure. Assuming a perfect algorithm, the optimal predictions under the empirical risk from Eq. 11 are:
$$(\tilde{p}_0^*, \tilde{p}_1^*, \tilde{p}_2^*) = \arg\min_{\tilde{p}_0, \tilde{p}_1, \tilde{p}_2} \left[ p \left( 1 - \frac{2 (v_1 \tilde{p}_1 + v_2 \tilde{p}_2)}{(v_1 + v_2) + (v_0 \tilde{p}_0 + v_1 \tilde{p}_1 + v_2 \tilde{p}_2)} \right) + (1 - p) \left( 1 - \frac{2 v_2 \tilde{p}_2}{v_2 + (v_0 \tilde{p}_0 + v_1 \tilde{p}_1 + v_2 \tilde{p}_2)} \right) \right] \quad (12)$$
It is trivial to show that $\tilde{p}_0^* = 0$ and $\tilde{p}_2^* = 1$ are solutions of this equation. The behavior of $R_{SD}$ w.r.t. $\tilde{p}_1$, $p$ and $\alpha$ can be observed qualitatively in Fig. 1 (A1-A4).
Indeed, only for $p = 0.5$ is the predicted uncertainty exact. The location of the minimum in $\tilde{p}_1$ switches from 0 to 1 when $p$ crosses 0.5. Therefore, when $p$ decreases or increases from 0.5 (different opacity in A1-A3), respectively under- or overestimation will occur (A4). The resulting volumetric bias is highest when the inherent uncertainty is around 0.5 and decreases towards the points of complete certainty, $p$ being 0 or 1. The effect of the volume ratio $\alpha$ (colors) is twofold. With $\alpha$ increasing, the optimal loss value increases (A1-A3) and the volumetric bias increases (A4; solid lines). However, the error on the estimated uncertainty is not influenced by $\alpha$ (A4; dashed lines).
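The single-region analysis can be reproduced numerically. The sketch below evaluates the expected soft Dice risk over the two states of the uncertain region $r_1$, with the certain regions fixed at their trivially optimal predictions ($\tilde{p}_0 = 0$, $\tilde{p}_2 = 1$); the region sizes are arbitrary illustrative choices:

```python
import numpy as np

# Expected soft Dice risk for the single-region scenario (illustrative sketch).
# n1 voxels have true uncertainty p; n2 voxels are certain foreground.
# Background and certain foreground are predicted perfectly; only the
# prediction q for the uncertain region varies.
def sd_risk(p, q, n1=100, n2=100):
    loss_fg = 1 - 2 * (n1 * q + n2) / ((n1 + n2) + (n1 * q + n2))  # r1 foreground
    loss_bg = 1 - 2 * n2 / (n2 + (n1 * q + n2))                    # r1 background
    return p * loss_fg + (1 - p) * loss_bg

qs = np.linspace(0, 1, 101)
q_low = qs[np.argmin(sd_risk(0.3, qs))]   # true uncertainty p = 0.3
q_high = qs[np.argmin(sd_risk(0.7, qs))]  # true uncertainty p = 0.7
```

The minimizer snaps to 0 for p below 0.5 (volume underestimated) and to 1 for p above 0.5 (volume overestimated), matching the qualitative behavior in Fig. 1 (A4).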
2.2.2 Multiple regions of uncertainty.
In a similar way we can imagine the segmentation of a structure with multiple independent regions, for which we further divide the uncertain region $r_1$ into $k$ equally large independent subregions. Let us further assume they have the same inherent uncertainty $p$ and volume ratio $\alpha$ (in order to keep the total uncertain volume the same). If we limit the analysis to a qualitative observation of Fig. 1 for two values of $k$ (B0-B4 and C0-C4), we notice three things. First, the uncertainty at which the switch between under- and overestimation happens decreases (A4, B4, C4). Second, this effect is proportional to $k$, and the maximal error on the predicted uncertainty becomes higher (B0-B4, C0-C4). Third, there is a trend towards easier volumetric overestimation, with the maximal error being more pronounced when the number of regions increases (A4, B4, C4).
3 Empirical analysis
In this section we investigate whether the aforementioned characteristics can be observed under real circumstances. In a practical scenario, the joint probability distribution $\Pr(X, Y)$ is unknown and presents itself as a training set. The risk (Eq. 4) becomes empirical: the expectation of the loss function becomes the mean of the losses across the training set. Furthermore, the loss absorbs the explicit (e.g. weight decay, L2) or implicit (e.g. early stopping, dropout) regularization that is often present in some aspect of the optimization of CNNs. Finally, the classifier is no longer perfect, and in addition to the inherent uncertainty in the task we now have uncertainty introduced by the classifier itself.
To investigate how these factors impact our theoretical findings, we train three models of increasing complexity: LR (logistic regression on the input features), ConvNet (a simpler version of the next) and U-Net. We use five-fold cross-validation on the training images from two tasks with relatively low inherent uncertainty (i.e. lower-left third molar segmentation from panoramic dental radiographs (MOLARS) [4], BRATS 2018 [14]) and from two tasks with relatively high inherent uncertainty (i.e. ISLES 2017 [8], ISLES 2018 [9]). Next, we describe the experimental setup, followed by a dissemination of the predicted volume errors by $\ell_{CE}$- and $\ell_{SD}$-trained models.

3.1 Task description and training
We (re)formulate a binary segmentation task for each dataset, having one (multi-modal) input and giving one binary segmentation map as output (for BRATS 2018 we limit the task to whole-tumor segmentation). For the 3D public benchmarks we use all of the provided images, resampled to an isotropic voxel size of 2 mm, as input (for both ISLES challenges we omit perfusion images). In MOLARS (a 2D dataset from [4]), we first extract a 448x448 ROI around the geometrical center of the lower-left third molar from the panoramic dental radiograph. We further downsample the ROI by a factor of two. The output is the segmentation of the third molar, as provided by the experts. All images are normalized according to the dataset's mean and standard deviation.
For our U-Net model we start from the successful No New-Net implementation from last year's BRATS challenge [10]. We adapt it with three 3x3(x3) average pooling layers with corresponding linear upsampling layers, and strip the instance normalization layers. Each level has two 3x3(x3) convolutional layers before and after the pooling and upsampling layer, respectively, with [[10, 20], [20, 10]], [[20, 40], [40, 20]], [[40, 80], [80, 40]] and [40, 20] filters. For the ConvNet model, we remove the final two levels. The LR model uses the inputs directly for classification, thus performing logistic regression on the input features.
The images are augmented intensively during training and the inputs are central image crops of 162x162x108 (in MOLARS, 243x243). We train the models w.r.t. $\ell_{CE}$ or $\ell_{SD}$ with ADAM, without any explicit regularization, and with the initial learning rate set at (for the LR model at 1). We lower the learning rate by a factor of five when the validation loss has not improved over the last 75 epochs, and stop training when there is no improvement over the last 150 epochs.
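The schedule logic described above can be sketched as follows (our own minimal re-implementation of the stated rules, not the authors' code; the tie-breaking of simultaneous events is an assumption):

```python
# Sketch of the described schedule: divide the learning rate by five after
# `patience` epochs without validation improvement, stop after `stop` epochs
# without improvement. Returns the final learning rate.
def lr_schedule(val_losses, lr0, patience=75, stop=150, factor=5.0):
    lr, best, since = lr0, float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, since = loss, 0
        else:
            since += 1
            if since >= stop:      # early stopping
                break
            if since % patience == 0:
                lr /= factor       # reduce on plateau
    return lr
```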
3.2 Results and discussion
Table 1: Results after five-fold cross-validation for each dataset, each model (LR, ConvNet, U-Net) and each training loss ($\ell_{CE}$, $\ell_{SD}$). Reported per dataset: the cross-entropy loss, the soft Dice score and the volume error $\Delta V$.

Model                |      LR       |    ConvNet    |     U-Net
Training loss        | CE      SD    | CE      SD    | CE      SD
---------------------+---------------+---------------+--------------
MOLARS (2D)
  CE loss            | 0.240   5.534 | 0.194   1.456 | 0.024   0.103
  soft Dice score    | 0.068   0.153 | 0.150   0.270 | 0.865   0.931
  ΔV (pixels)        | 0.069   302.3 | 0.276   87.09 | 0.092   0.187
BRATS 2018 (3D)
  CE loss            | 0.039   0.173 | 0.030   0.069 | 0.012   0.027
  soft Dice score    | 0.080   0.355 | 0.196   0.715 | 0.585   0.820
  ΔV                 | 2.841   276.4 | 3.936   19.93 | 6.778   1.905
ISLES 2017 (3D)
  CE loss            | 0.025   0.155 | 0.018   0.069 | 0.014   0.066
  soft Dice score    | 0.099   0.255 | 0.114   0.321 | 0.188   0.340
  ΔV                 | 15.71   82.42 | 4.227   23.83 | 2.875   13.44
ISLES 2018 (3D)
  CE loss            | 0.055   0.225 | 0.044   0.139 | 0.029   0.128
  soft Dice score    | 0.136   0.329 | 0.200   0.449 | 0.362   0.518
  ΔV                 | 0.773   34.03 | 0.374   12.44 | 0.878   5.442
In Table 1 the results are shown for each dataset (i.e. MOLARS, BRATS 2018, ISLES 2017, ISLES 2018), each model (i.e. LR, ConvNet, U-Net) and each loss (i.e. $\ell_{CE}$, $\ell_{SD}$) after five-fold cross-validation. We performed a pairwise non-parametric significance test (bootstrapping) with a p-value of 0.05 to assess inferiority or superiority between pairs of optimization methods.
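A pairwise bootstrap test of this kind can be sketched as follows (our own minimal version, in the spirit of the described procedure but not necessarily the authors' exact implementation; the interval construction is an assumption):

```python
import numpy as np

# Paired bootstrap: resample per-case metric differences with replacement and
# check whether the 95% interval of the mean difference excludes zero.
def paired_bootstrap(scores_a, scores_b, n_boot=10000, seed=0):
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    means = np.array([rng.choice(diff, size=diff.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [2.5, 97.5])
    return not (lo <= 0.0 <= hi)  # True -> significant at p = 0.05
```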
Optimizing the $\ell_{CE}$ loss reaches significantly higher log-likelihoods under all circumstances, while soft Dice scores (i.e. $1 - \ell_{SD}$) are significantly higher for $\ell_{SD}$-optimized models. Looking at the volume errors $\Delta V$, the expected outcomes are, more or less, confirmed. For the LR and ConvNet models, $\ell_{CE}$-optimized models are unbiased w.r.t. volume estimation. For these models, $\ell_{SD}$ optimization leads to significant overestimation due to the remaining uncertainty, partly introduced by the models themselves.
The transition to the more complex U-Net model brings forward two interesting observations. First, for the two tasks with relatively low inherent uncertainty (i.e. MOLARS, BRATS 2018), the model is able to reduce the uncertainty to such an extent that it can avoid significant bias on the estimated volumes. The significant underestimation for $\ell_{CE}$ in BRATS 2018 can be due to the optimization difficulties that arise in circumstances with high class imbalance. Second, although the model now has the ability to extend its view wide enough and propagate the information in a complex manner, the inherent uncertainty present in both ISLES tasks again brings forward the discussed bias. In ISLES 2017, having to predict the infarction after treatment straightforwardly introduces uncertainty. In ISLES 2018, the task was to detect the acute lesion, as observed on MR DWI, from CT perfusion-derived parameter maps. It is still unknown to what extent these parameter maps contain the necessary information to predict the lesion.
The $\ell_{CE}$-optimized U-Net models result in Dice scores (Eq. 1) of 0.924, 0.763, 0.177 and 0.454 for MOLARS, BRATS 2018, ISLES 2017 and ISLES 2018, respectively. The Dice scores obtained with their $\ell_{SD}$-optimized counterparts are significantly higher: 0.932, 0.826, 0.343 and 0.527, respectively. This is in line with recent theory and practice from [3] and justifies $\ell_{SD}$ optimization when the segmentation quality is measured in terms of the Dice score.
4 Conclusion
It is clear that, in cases with high inherent uncertainty, the volumes estimated with soft Dice-optimized models are biased, while cross-entropy-optimized models predict unbiased volume estimates. For tasks with low inherent uncertainty, one can still favor soft Dice optimization due to its higher Dice scores.
We want to highlight the importance of choosing an appropriate loss function w.r.t. the goal. In a clinical setting where volume estimates are important and for tasks with high or unknown inherent uncertainty, optimization with cross-entropy can be preferred.
4.0.1 Acknowledgements.
J.B. is part of NEXIS [15], a project that has received funding from the European Union’s Horizon 2020 Research and Innovations Programme (Grant Agreement #780026). D.R. is supported by an innovation mandate of Flanders Innovation and Entrepreneurship (VLAIO).
References
[1] (2017) Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Scientific Data 4, pp. 1-13.
[2] (2018) Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge. arXiv:1811.02629.
[3] (2019) Optimizing the Dice score and Jaccard index for medical image segmentation: theory and practice. In Medical Image Computing and Computer-Assisted Intervention.
[4] (2017) An automated technique to stage lower third molar development on panoramic radiographs for age estimation: a pilot study. Journal of Forensic Odonto-Stomatology 35 (2), pp. 49-60.
[5] (2016) Deep Learning and Data Labeling for Medical Applications. LNCS 10008, pp. 179-187.
[6] (2016) Deep Learning. MIT Press.
[7] (2016) Endovascular thrombectomy after large-vessel ischaemic stroke: a meta-analysis of individual patient data from five randomised trials. The Lancet 387 (10029), pp. 1723-1731.
[8] (2017) Ischemic Stroke Lesion Segmentation (ISLES) challenge.
[9] (2018) Ischemic Stroke Lesion Segmentation (ISLES) challenge.
[10] (2018) No New-Net. LNCS 11384, pp. 234-244. arXiv:1809.10483.
[11] (2017) Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis 36, pp. 61-78.
[12] (2015) The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Transactions on Medical Imaging.
[13] (2016) V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. International Conference on 3D Vision, pp. 1-11.
[14] (2018) Multimodal Brain Tumor Segmentation (BRATS) challenge.
[15] NEXIS: Next gEneration X-ray Imaging System.
[16] (2017) Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. LNCS 10553, pp. 240-248.
[17] (1995) The Nature of Statistical Learning Theory. Springer.
[18] (2018) ISLES 2016 and 2017: benchmarking ischemic stroke lesion outcome prediction based on multispectral MRI. Frontiers in Neurology 9.