Automatic segmentation of structures is a fundamental task in medical image analysis. Segmentations either serve as an intermediate step in a more elaborate pipeline or as an end goal by itself. The clinical interest often lies in the volume of a certain structure (e.g. the volume of a tumor, the volume of a stroke lesion), which can be derived from its segmentation . The segmentation task can also carry inherent uncertainty (e.g. noise, lack of contrast, artifacts, incomplete information).
To evaluate and compare the quality of a segmentation, the similarity between the true segmentation (i.e. the segmentation derived from an expert’s delineation of the structure) and the predicted segmentation must be measured. For this purpose, multiple metrics exist. Among others, overlap measures (e.g. Dice score, Jaccard index) and surface distances (e.g. Haussdorf distance, average surface distance) are commonly used.
The focus on one particular metric, the Dice score, has led to the adoption of a differentiable surrogate loss, the so-called soft Dice [5, 13, 16], to train convolutional neural networks (CNNs). Many state-of-the-art methods clearly outperform the established cross-entropy losses using soft Dice as loss function [10, 3].
In this work, we investigate the effect on volume estimation when optimizing a CNN w.r.t. cross-entropy or soft Dice, and relate this to the inherent uncertainty in a task. First, we look into this volumetric bias theoretically, with some numerical examples. We find that the use of soft Dice leads to a systematic under- or overestimation of the predicted volume of a structure, which is dependent on the inherent uncertainty that is present in the task. Second, we empirically validate these results on four medical tasks: two tasks with relatively low inherent uncertainty (i.e. the segmentation of third molars from dental radiographs, BRATS 2018 [14, 12, 1, 2]) and two tasks with relatively high inherent uncertainty (i.e. ISLES 2017 [8, 18], ISLES 2018 ).
2 Theoretical analysis
Let us formalize an image into voxels, each voxel corresponding to a true class label with , forming the true class label map . Typical in medical image analysis, is the uncertainty of the true class label map (e.g. due to intra- and inter-rater variability; see Sect. 2.2). Under the assumption of binary image segmentation with , a probabilistic label map can be constructed as , where each
is the probability ofbelonging to the structure of interest. Similarly, we have the maps of voxel-wise label predictions and probabilities . In this setting, the class label map is constructed from the map of predictions according to the highest likelihood.
The Dice score is defined on the label maps as:
The volumes of the true structure and of the predicted structure are then, with the volume of a single voxel:
In case the label map is probabilistic, we need to work out the expectations:
2.1 Risk minimization
In the setting of supervised and gradient-based training of CNNs  we are performing empirical risk minimization. Assume the CNN, with a certain topology, is parametrized by and represents the functions
. Further assume we have access to the entire joint probability distributionat both training and testing time, with the information (for CNNs this is typically a centered image patch around the location of ) of the network that is used to make a prediction for . For these conditions, the general risk minimization principle is applicable and states that in order to optimize the performance for a certain non-negative and real-valued loss (e.g. the metric or its surrogate loss) at test time, we can optimize the same loss during the learning phase . The risk associated with the loss and parametrization of the CNN, without regularization, is defined as the expectation of the loss function:
For years, minimizing the negative log-likelihood has been the gold standard in terms of risk minimization. For this purpose, and due to its elegant mathematical properties, the voxel-wise binary cross-entropy loss () is used:
More recently, the soft Dice loss () is used in the optimization of CNNs to directly optimize the Dice score at test time [5, 13, 16]. Rewriting Eq. 1 to its non-negative and real-valued surrogate loss function as in :
There is considerable uncertainty in the segmentation of medical images. Images might lack contrast, contain artifacts, be noisy or incomplete regarding the necessary information (e.g. in ISLES 2017 we need to predict the infarction after treatment from images taken before, which is straightforwardly introducing inherent uncertainty). Even at the level of the true segmentation, uncertainty exists due to intra- and inter-rater variability. We will investigate what happens with the estimated volume of a certain structure in an image under the assumption of having perfect segmentation algorithms (i.e. the prediction is the one that minimizes the empirical risk).
Assuming independent voxels, or that we can simplify Eq. 3 into independent regions with true uncertainty and predicted uncertainty , and corresponding volumes , with the number of voxels belonging to region (having each voxel as an independent region when ), we get:
We analyze for the predicted uncertainty that minimizes the risk :
We need to find for each independent region :
This function is continuous and its first derivative monotonously increasing in the interval . First order conditions w.r.t. give the optimal value for the predicted uncertainty . With the predicted uncertainty being the true uncertainty, becomes an unbiased volume estimator.
We analyze for the predicted uncertainty that minimizes the risk :
We need to find for each independent region :
This minimization is more complex and we analyze its behavior by inspecting the values of numerically. We will consider the scenarios with only a single region or with multiple independent regions with inherent uncertainty in the image. For each scenario we will vary the inherent uncertainty and the total uncertain volume.
2.2.1 Single region of uncertainty.
Imagine the segmentation of an image with independent regions, and , as depicted in Fig. 1 (A0). Region is certainly not part of the structure (, i.e. background), region belongs to the structure with probability and region is certainly part of the structure (). Let their volumes be , , , respectively, with the volume ratio of uncertain to certain part of the structure. Assuming a perfect algorithm, the optimal predictions under the empirical risk from Eq. 11 are:
It is trivial to show that and are solutions for this equation. The behavior of w.r.t. and can be observed qualitatively in Fig. 1 (A1-A4).
Indeed, only for the predicted uncertainty is exact. The location of the local minimum in switches from 0 to 1 when . Therefore, when decreases or increases from 0.5 (different opacity in A1-A3), respectively under- or overestimation will occur (A4). The resulting volumetric bias will be highest when the inherent uncertainty and decreases towards the points of complete certainty, being always 0 or 1. The effect of the volume ratio (colors) is two-fold. With increasing, the optimal loss value increases (A1-A3) and the volumetric bias increases (A4; solid lines). However, the error on the estimated uncertainty is not influenced by (A4; dashed lines).
2.2.2 Multiple regions of uncertainty.
In a similar way we can imagine the segmentation of a structure with independent regions, for which we further divided the region into equally large independent sub-regions with . Let us further assume they have the same inherent uncertainty and volume ratio (in order to keep the total uncertain volume the same). If we limit the analysis to a qualitative observation of Fig. 1 with (B0-B4) and (C0-C4), we notice three things. First, the uncertainty for which under- or overestimation will happen decreases (A4, B4, C4). Second, this effect is proportional with and the maximal error on the predicted uncertainty becomes higher (B0-B4, C0-C4). Third, there is a trend towards easier volumetric overestimation and with the maximal error being more pronounced when the number of regions increases (A4, B4, C4).
3 Empirical analysis
In this section we will investigate whether the aforementioned characteristics can be observed under real circumstances. In a practical scenario, the joint probability distribution is unknown and presents itself as a training set. The risk (Eq. 4) becomes empirical, where the expectation of the loss function becomes the mean of the losses across the training set. Furthermore, the loss
absorbs the explicit (e.g. weight decay, L2) or implicit (e.g. early stopping, dropout) regularization, which is often present in some aspect of the optimization of CNNs. Finally, the classifier is no longer perfect and additionally to the inherent uncertainty in the task we now have inherent uncertainty introduced by the classifier itself.
To investigate how these factors impact our theoretical findings, we train three models with increasing complexity: LR (logistic regression on the input features), ConvNet (simpler version of the next) and U-Net. We use five-fold cross-validation on the training images from two tasks with relatively low inherent uncertainty (i.e. lower-left third molar segmentation from panoramic dental radiographs (MOLARS), BRATS 2018 ) and from two tasks with relatively high inherent uncertainty (i.e. ISLES 2017 , ISLES 2018 ). Next, we describe the experimental setup, followed by a dissemination of the predicted volume errors by and trained models.
3.1 Task description and training
We (re-)formulate a binary segmentation task for each dataset having one (multi-modal) input, and giving one binary segmentation map as output (for BRATS 2018 we limit the task to whole tumor segmentation). For the 3D public benchmarks we use all of the provided images, resampled to an isotropic voxel-size of 2 mm, as input (for both ISLES challenges we omit perfusion images). In MOLARS (2D dataset from 
), we first extract a 448x448 ROI around the geometrical center of the lower-left third molar from the panoramic dental radiograph. We further downsample the ROI by a factor of two. The output is the segmentation of the third molar, as provided by the experts. All images are normalized according to the dataset’s mean and standard deviation.
For our U-Net model we start from the successful No New-Net implementation during last year’s BRATS challenge . We adapt it with three 3x3(x3) average pooling layers with corresponding linear up-sampling layers and strip the instance normalization layers. Each level has two 3x3(x3) convolutional layers before and after the pooling and up-sampling layer, respectively, with [[10, 20], [20, 10]], [[20, 40], [40, 20]], [[40, 80], [80, 40]] and [40, 20] filters. For the ConvNet model, we remove the final two levels. The LR model uses the inputs directly for classification, thus performing logistic regression on the input features.
The images are augmented intensively during training and inputs are central image crops of 162x162x108 (in MOLARS 243x243). We train the models w.r.t. or with ADAM, without any explicit regularization, and with the initial learning rate set at
(for LR model at 1). We lower the learning rate by a factor of five when the validation loss did not improve over the last 75 epochs and stop training with no improvement over the last 150 epochs.
3.2 Results and discussion
|( pixels)||-0.069||direct 2 Tr 0.3 w 302.3 direct 0 Tr 0 w||-0.276||direct 2 Tr 0.3 w 87.09 direct 0 Tr 0 w||0.092||-0.187|
|BRATS 2018 (3D)||0.039||0.173||0.030||0.069||0.012||0.027|
|()||-2.841||direct 2 Tr 0.3 w 276.4 direct 0 Tr 0 w||3.936||direct 2 Tr 0.3 w 19.93 direct 0 Tr 0 w||-6.778||-1.905|
|ISLES 2017 (3D)||0.025||0.155||0.018||0.069||0.014||0.066|
|()||15.71||direct 2 Tr 0.3 w 82.42 direct 0 Tr 0 w||-4.227||direct 2 Tr 0.3 w 23.83 direct 0 Tr 0 w||-2.875||direct 2 Tr 0.3 w 13.44 direct 0 Tr 0 w|
|ISLES 2018 (3D)||0.055||0.225||0.044||0.139||0.029||0.128|
|()||0.773||direct 2 Tr 0.3 w 34.03 direct 0 Tr 0 w||-0.374||direct 2 Tr 0.3 w 12.44 direct 0 Tr 0 w||-0.878||direct 2 Tr 0.3 w 5.442 direct 0 Tr 0 w|
In Table. 1 the results are shown for each dataset (i.e. MOLARS, BRATS 2018, ISLES 2017, ISLES 2018), for each model (i.e. LR, ConvNet, U-Net) and for each loss (i.e. , ) after five-fold cross-validation. We performed a pairwise non-parametric significance test (bootstrapping) with a p-value of 0.05 to assess inferiority or superiority between pairs of optimization methods.
Optimizing the loss reaches significantly higher log-likelihoods under all circumstances, while soft Dice scores (i.e. ) are significantly higher for optimized models. Looking at the volume errors , the expected outcomes are, more or less, confirmed. For the LR and ConvNet models, optimized models are unbiased w.r.t. volume estimation. For these models, optimization leads to significant overestimation due to the remaining uncertainty, partly being introduced by the models themselves.
The transition to the more complex U-Net model brings forward two interesting observations. First, for the two tasks with relatively low inherent uncertainty (i.e. MOLARS, BRATS 2018), the model is able to reduce the uncertainty to such an extent it can avoid significant bias on the estimated volumes. The significant underestimation for in BRATS 2018 can be due to the optimization difficulties that arise in circumstances with high class-imbalance. Second, although the model now has the ability to extend its view wide enough and propagate the information in a complex manner, the inherent uncertainty that is present in both of the ISLES tasks, brings again forward the discussed bias. In ISLES 2017, having to predict the infarction after treatment straightforwardly introduces uncertainty. In ISLES 2018, the task was to detect the acute lesion, as observed on MR DWI, from CT perfusion-derived parameter maps. It is still unknown to what extent these parameter maps contain the necessary information to predict the lesion.
The optimized U-Net models result in Dice scores (Eq. 1) of 0.924, 0.763, 0.177 and 0.454 for MOLARS, BRATS 2018, ISLES 2017 and ISLES 2018, respectively. The Dice scores obtained with their optimized counterparts are significantly higher, respectively 0.932, 0.826, 0.343 and 0.527. This is in line with recent theory and practice from  and justifies optimization when the segmentation quality is measured in terms of Dice score.
It is clear that, in cases with high inherent uncertainty, the estimated volumes with soft Dice-optimized models are biased, while cross-entropy-optimized models predict unbiased volume estimates. For tasks with low inherent uncertainty, one can still favor soft Dice optimization due to a higher Dice score.
We want to highlight the importance of choosing an appropriate loss function w.r.t. the goal. In a clinical setting where volume estimates are important and for tasks with high or unknown inherent uncertainty, optimization with cross-entropy can be preferred.
J.B. is part of NEXIS , a project that has received funding from the European Union’s Horizon 2020 Research and Innovations Programme (Grant Agreement #780026). D.R. is supported by an innovation mandate of Flanders Innovation and Entrepreneurship (VLAIO).
-  (2017) Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Scientific Data 4 (March), pp. 1–13. External Links: Cited by: §1.
Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge. External Links: Cited by: §1.
-  (2019) Optimizing the Dice score and Jaccard index for medical image segmentation: theory and practice. In Medical Image Computing and Computer-Assisted Intervention, Cited by: §1, §3.2.
-  (2017) An automated technique to stage lower third molar development on panoramic radiographs for age estimation: a pilot study. Journal of Forensic Odonto-Stomatology 35 (2), pp. 49–60. Cited by: §1, §3.1, §3.
-  (2016) Deep Learning and Data Labeling for Medical Applications. LNCS 10008, pp. 179–187. External Links: Cited by: §1, §2.1.
-  (2016) Deep Learning. MIT Press. Cited by: §2.1.
-  (2016) Endovascular thrombectomy after large-vessel ischaemic stroke: a meta-analysis of individual patient data from five randomised trials. The Lancet 387 (10029), pp. 1723–1731. External Links: Cited by: §1.
-  (2017) Ischemic Stroke Lesion Segmentation (ISLES) challenge. External Links: Cited by: §1, §3.
-  (2018) Ischemic Stroke Lesion Segmentation (ISLES) challenge. External Links: Cited by: §1, §3.
-  (2018) No New-Net. LNCS 11384, pp. 234–244. External Links: Cited by: §1, §3.1.
-  (2017) Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis 36, pp. 61–78. Cited by: §1.
-  (2015) The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Transactions on Medical Imaging. External Links: Cited by: §1.
-  (2016) V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. International Conference on 3D Vision 4, pp. 1–11. External Links: Cited by: §1, §2.1.
-  (2018) Multimodal Brain Tumor Segmentation (BRATS) challenge. External Links: Cited by: §1, §3.
-  NEXIS - Next gEneration X-ray Imaging System. External Links: Cited by: §4.0.1.
-  (2017) Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. LNCS 10553, pp. 240–248. External Links: Cited by: §1, §2.1.
The Nature of Statistical Learning Theory. Springer. Cited by: §2.1.
-  (2018) ISLES 2016 and 2017-benchmarking ischemic stroke lesion outcome prediction based on multispectral MRI. Frontiers in Neurology 9 (SEP). External Links: Cited by: §1.