In the medical imaging community, challenges have become a prominent forum to benchmark the segmentation performance of the latest methodological advances [menze2014multimodal, sekuboyina2020verse, bilic2019liver, bakas2018identifying]. Across all challenges, convolutional neural networks (CNNs), and in particular the U-Net architecture [ronneberger2015u], have gained increasing popularity over the years. Typically, segmentation models are evaluated (and trained) using well-established criteria, such as the Dice similarity coefficient or the Hausdorff distance, or combinations thereof. Segmentation models for the BraTS challenge [bakas2018identifying], for example, are typically trained and evaluated on three label channels: the enhancing tumor (ET), tumor core (TC), and whole tumor (WT) compartments. For the final ranking in the BraTS challenge, these three channels are treated equally. From a medical perspective, however, the ET channel is of higher importance: in glioma surgery, gross total resection of the ET is aimed for [weller2014eano], and tumor progression is mainly defined by an increase in ET volume [wen2010updated].
Like BraTS, most challenges rely on a combination of the Dice coefficient (DICE) with other metrics [maier2018rankings] for their scoring [menze2014multimodal]. However, it is not well understood how such metrics reflect expert opinion on segmentation quality and clinical relevance. Taha et al. [taha2015metrics] investigated how different aspects of segmentation quality are captured by different metrics. Despite the central importance of segmentation metrics for evaluating segmentation performance, there is a clear gap in knowledge on how to best capture expert assessment of a "good" (clinically meaningful) segmentation. This problem arises especially when selecting a loss function for CNN training. In contrast to a plethora of volumetric loss functions [hashemi2018asymmetric, milletari2016v, sudre2017generalised, rahman2016optimizing, brosch2015deep, salehi2017tversky], only few non-volumetric losses, such as [karimi2019reducing], are established for CNN training. (This is especially true when searching for multi-class segmentation losses that support GPU-based training; for instance, according to the authors, there is no implementation of [karimi2019reducing] with GPU support.)
Our contribution to the field is multifaceted. First, our study develops a better understanding of qualitative human expert assessment, using glioma image analysis as an example, by identifying its quantitative correlates. Building upon these insights, we propose a method that exploits techniques from classical statistics and experimental psychology to form new compound losses for modern neural network training. These losses achieve a better fit with human quality assessment and improve on conventional baselines.
2.1 Collection of expert ratings
To understand how expert radiologists assess the quality of tumor segmentations, we conduct experiments in which participants rate segmentation quality on a six-point Likert scale (see Fig. 1). The selection options range from one star for strong rejection to six stars for a perfect segmentation.
2.1.1 Experiment 1:
We randomly select three exams from eight institutions that provided data to the BraTS 2019 [menze2014multimodal] test set. We also present one additional exam with apparently erroneous segmentation labels to check whether participants follow the instructions. We display stimulus material according to four experimental conditions, i.e. four different segmentation labels for each exam: the ground truth segmentations, segmentations from team zyx [zhao2019multi], a SIMPLE [langerak2010label] fusion of seven segmentations [zhao2019multi, xfeng, micdkfz, scan, lfb, gbmnet, econib] obtained from BraTS-Toolkit [kofler2020brats], and a segmentation randomly selected without replacement from all the teams participating in BraTS 2019. This results in a total of 300 trials that are presented to each expert radiologist.
2.1.2 Experiment 2:
Experts evaluate only the axial views of the gliomas' center of mass, on a random sample of 50 patients from our test set. We present our four candidate losses vs. the ground truth and the DICE+BCE baseline as conditions, so the experiment again comprises 300 trials.
2.2 Metrics and losses vs. qualitative expert assessment
For each experimental condition, we compute a comprehensive set of segmentation quality metrics, inspired by [taha2015metrics], comparing the segmentation to the expert-annotated ground truth. The metrics are calculated for the BraTS evaluation criteria WT, TC, and ET, as well as for the individual labels necrosis and edema. In addition, we compute mean aggregates across the BraTS criteria and the individual labels (Fig. 4A). The computation of metrics is implemented using pymia [jungo2021pymia].
We use the binary labels to compute established segmentation losses between the candidate segmentations and the ground truth (GT), following the same evaluation criteria. (To test the validity of this approach, we conducted a supporting experiment comparing binary segmentations to non-binary network outputs and achieved comparable results across all loss functions.) The losses we consider are shown in Table 1. (For the Tversky loss, we choose combinations of α and β that have proven successful for similar segmentation problems; the parameter values are appended to the code name.)
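To make the comparison concrete, the following minimal numpy sketch shows how Dice and Tversky losses can be evaluated on binary masks; the implementations, epsilon smoothing, and the α/β values are illustrative choices, not the paper's exact code.

```python
import numpy as np

def dice_loss(pred, gt, eps=1e-7):
    """1 - Dice overlap between two binary masks."""
    inter = np.sum(pred * gt)
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def tversky_loss(pred, gt, alpha=0.3, beta=0.7, eps=1e-7):
    """Tversky index: alpha weights false positives, beta false negatives."""
    tp = np.sum(pred * gt)
    fp = np.sum(pred * (1 - gt))
    fn = np.sum((1 - pred) * gt)
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# toy masks: half the foreground voxels are hit
pred = np.array([1.0, 1.0, 0.0, 0.0])
gt   = np.array([1.0, 0.0, 1.0, 0.0])
```

Setting β > α penalizes missed tumor voxels more than spurious ones, which is the usual motivation for Tversky variants in highly imbalanced medical segmentation.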
2.3 Construction of compound losses
We aim to design a loss function, combining complementary loss functions, to achieve a better correlation with human segmentation quality assessment. Our compound loss is structured as a weighted linear combination of losses per label channel:

$\mathcal{L}_{\text{compound}} = \sum_{c} w_c \sum_{l} \lambda_l \, \mathcal{L}_l(\hat{y}_c, y_c)$

where $w_c$ denotes the weights per channel $c$, and $\lambda_l$ denotes the weights per loss $\mathcal{L}_l$.
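The weighted linear combination described above can be sketched as follows; the concrete loss implementations, channel names, and weight values below are illustrative assumptions, not the fitted weights from the paper.

```python
import numpy as np

def dice_loss(pred, gt, eps=1e-7):
    inter = np.sum(pred * gt)
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def bce_loss(prob, gt, eps=1e-7):
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(gt * np.log(prob) + (1 - gt) * np.log(1 - prob)))

LOSSES = {"dice": dice_loss, "bce": bce_loss}

def compound_loss(pred, gt, channel_weights, loss_weights):
    """Weighted linear combination: sum_c w_c * sum_l lambda_{c,l} * L_l."""
    total = 0.0
    for c, w_c in channel_weights.items():
        for name, lam in loss_weights[c].items():
            total += w_c * lam * LOSSES[name](pred[c], gt[c])
    return total

# illustrative setup: three BraTS channels, ET weighted more strongly
gt = {c: np.array([1.0, 1.0, 0.0, 0.0]) for c in ("WT", "TC", "ET")}
w_c = {"WT": 1.0, "TC": 1.0, "ET": 5.0}
lam = {c: {"dice": 0.5, "bce": 0.5} for c in ("WT", "TC", "ET")}
```

Because the combination is linear, per-channel and per-loss weights can be tuned independently, which is what allows the mixed-model fit described below to supply them.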
2.3.1 Identifying promising loss combinations:
2.3.2 Obtaining weights from the linear mixed model:
Second, we evaluate the predictive performance of our combinations through linear mixed models using the lme4 package [bates]. We average the human expert rating across views to obtain a quasi-metric variable. (We deem this approach valid, as the distribution is consistent across views; see the supplemental material.) We then model the human expert assessment as the dependent variable and predict it by plugging in the loss values of our candidate combinations as predictor variables. Mixed models allow us to account for the non-independence of data points by modeling exam and experimental condition as random factors. To identify loss candidates for CNN training, we evaluate the predictive power of our models by computing the pseudo-R² [nakagawa2017coefficient], while monitoring typical mixed regression model criteria such as multi-collinearity, non-normality of residuals and outliers, homoscedasticity, homogeneity of variance, and normality of random effects.
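For reference, the pseudo-R² of [nakagawa2017coefficient] comes in a marginal form (variance explained by the fixed effects alone) and a conditional form (fixed plus random effects); writing $\sigma_f^2$ for the fixed-effects variance, $\sigma_l^2$ for the variance of random effect $l$, and $\sigma_\varepsilon^2$ for the residual variance:

```latex
R^2_{\text{marginal}} = \frac{\sigma_f^2}{\sigma_f^2 + \sum_l \sigma_l^2 + \sigma_\varepsilon^2},
\qquad
R^2_{\text{conditional}} = \frac{\sigma_f^2 + \sum_l \sigma_l^2}{\sigma_f^2 + \sum_l \sigma_l^2 + \sigma_\varepsilon^2}
```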
2.4 CNN training
We train nnU-Net [isensee2021nnu] using a standard BraTS trainer [isensee2020nnu] with a moderate augmentation pipeline on fold 0 of the BraTS 2020 training set, using the rest of the training set for validation. The official BraTS 2020 validation set is used for testing. Apart from replacing the default DICE+BCE loss with our custom-tailored loss candidates, and, in some cases, necessary learning-rate adjustments, we keep all training parameters constant. (The code for our nnU-Net training will be publicly available on GitHub.)
3.1 Qualitative assessment vs. quantitative performance measures
A total of n=15 radiologists from six institutions took part in our first segmentation quality assessment experiment. Participants had an average experience of 10.0 ± 5.1 working years as radiologists. Fig. 3 depicts how the experts evaluated segmentation quality. Next, we explore how the human expert assessment correlates with quantitative measures of segmentation performance.
3.1.1 Segmentation metrics vs. expert assessment:
Fig. 4 depicts the Pearson correlation matrix between segmentation quality metrics and the expert assessment. We find only low to moderate correlations across all metrics; correlations are especially low for the clinically important enhancing tumor label. Notably, DICE is outperformed by other, less established metrics.
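The per-metric correlation underlying such a matrix reduces to the plain Pearson coefficient between a metric's per-trial values and the averaged star ratings; a minimal sketch (the metric values and ratings below are made-up placeholders):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return float(np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2)))

# hypothetical example: per-trial DICE values vs. view-averaged star ratings
dice_scores = [0.62, 0.81, 0.45, 0.90, 0.70]
ratings     = [3.0, 4.5, 2.5, 5.0, 3.5]
r = pearson_r(dice_scores, ratings)
```

Repeating this for every metric and every label channel yields one row of the correlation matrix per metric.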
3.1.2 Segmentation losses vs. expert assessment:
Next, we examine how these findings translate into CNN training. Training a CNN with stochastic gradient descent (SGD) requires a differentiable loss function. As this property does not hold for all metrics, we investigate how established loss functions correlate with expert quality assessment; the results are presented in Fig. 4B.
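The differentiability requirement can be illustrated with a soft Dice loss evaluated on continuous network probabilities rather than thresholded masks; the finite-difference check below (a sketch with made-up voxel values) confirms the loss has a smooth, finite slope in each probability, which is what SGD needs.

```python
import numpy as np

def soft_dice_loss(prob, gt, eps=1e-7):
    """Soft Dice on continuous probabilities -- smooth in prob, hence
    usable with SGD (unlike, e.g., the Hausdorff distance)."""
    inter = np.sum(prob * gt)
    return 1.0 - (2.0 * inter + eps) / (np.sum(prob) + np.sum(gt) + eps)

# numerical slope w.r.t. the first voxel probability
prob = np.array([0.8, 0.3, 0.6])
gt = np.array([1.0, 0.0, 1.0])
h = 1e-6
bumped = prob.copy()
bumped[0] += h
grad0 = (soft_dice_loss(bumped, gt) - soft_dice_loss(prob, gt)) / h
```

Raising the probability of a true-foreground voxel lowers the loss, so the slope at that voxel is negative and finite, exactly the behavior a gradient-based optimizer exploits.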
3.2 Performance evaluation
Our methodology allows generating a variety of loss candidates. To evaluate the performance of our approach, we construct four promising loss candidates.
3.2.1 Construction of four loss candidates
The first two candidates use the same combination across all label channels. Candidate gdice_bce combines GDICE_W with BCE. Candidate gdice_ss_bce iterates upon this by adding the SS loss. In contrast, the candidates channel_wise_weighted and channel_wise use different losses per channel: the whole tumor channel uses the gdice_ss_bce candidate loss, while the tumor core channel is computed via the gdice_bce variant. The enhancing tumor channel relies solely on GDICE_W, as this is the only loss candidate with at least a moderate correlation with expert assessment (compare Fig. 4B). While candidate channel_wise treats all channels equally, the weighted variant prioritizes the clinically more relevant tumor core and enhancing tumor channels by a factor of five.
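The channel-wise candidates can be summarized as a small configuration; the encoding below is an illustrative sketch (loss names follow the paper's code names, but the dictionary layout is our own):

```python
# per-channel loss mixtures for the channel_wise(_weighted) candidates
channel_losses = {
    "whole_tumor":     ("gdice_w", "ss", "bce"),   # gdice_ss_bce mixture
    "tumor_core":      ("gdice_w", "bce"),         # gdice_bce mixture
    "enhancing_tumor": ("gdice_w",),               # GDICE_W only
}

# channel_wise treats all channels equally ...
channel_weights_equal = {
    "whole_tumor": 1.0, "tumor_core": 1.0, "enhancing_tumor": 1.0,
}
# ... while channel_wise_weighted prioritizes TC and ET by a factor of five
channel_weights_weighted = {
    "whole_tumor": 1.0, "tumor_core": 5.0, "enhancing_tumor": 5.0,
}
```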
3.2.2 Evaluation of the loss candidates
As depicted in Fig. 5, our loss candidates outperform the established Dice and Dice+BCE losses with regard to the clinically most important enhancing tumor label while maintaining performance for the tumor core and whole tumor channel. This finding is also supported by a second qualitative experiment, where three expert radiologists from two institutions detect a subtle improvement in segmentation quality (see Fig. 6).
Existing literature [reinke_weaknesses_2018, maier2018rankings] has reported on the discrepancy between expert segmentation quality assessment and established segmentation quality metrics. Our approach, combining methods from experimental psychology and classical statistics, identifies the quantitative correlates of qualitative human expert perception to shed more light on this complex phenomenon. While both segmentation quality metrics and expert radiologists attempt to measure segmentation quality, their signals correlate only moderately, especially for the enhancing tumor (ET).
Even though our training set is small and the 15 radiologists judge the complex 3D segmentations on three 2D views, our method manages to extract the signal. When training a modern convolutional neural network with our expert inspired loss functions, all four proposed candidates lead to quantitative improvement on the ET channel, outperforming established baselines. This is in line with our expectations as the ET channel is of high clinical relevance for human experts.
We find that BCE is one of the few loss functions complementary to volumetric losses, which might explain why the empirically found Dice+BCE provides a solid baseline across many segmentation tasks. Our findings question the status of the Dice coefficient as the de facto gold standard for measuring glioma segmentation quality. (As experts consistently scored CNN segmentations higher than the human-annotated ground truth in our experiments (see Fig. 3, 6), our findings raise the question of whether segmentation performance in general, and challenge rankings in particular, should be determined by Dice similarity with the latter. This phenomenon might occur because humans, in contrast to machines, are prone to random errors.) We highlight the need for the development of new segmentation metrics and losses that better correlate with expert assessment. As primate perception follows a fixed, partially understood set of rules, e.g. [wagemans2012century, roelfsema2006cortical], we speculate that our loss generation approach might generalize well to other segmentation problems; however, further research is required to explore this.
Bjoern Menze, Benedikt Wiestler and Florian Kofler are supported through the SFB 824, subproject B12.
Supported by Deutsche Forschungsgemeinschaft (DFG) through TUM International Graduate School of Science and Engineering (IGSSE), GSC 81.
With the support of the Technical University of Munich – Institute for Advanced Study, funded by the German Excellence Initiative.
J.C. Paetzold is supported by the Graduate School of Bioengineering, Technical University of Munich.
Research reported in this publication was partly supported by the National Cancer Institute (NCI) and the National Institute of Neurological Disorders and Stroke (NINDS) of the National Institutes of Health (NIH), under award numbers NCI:U01CA242871, NCI:U24CA189523, NINDS:R01NS042645. The content of this publication is solely the responsibility of the authors and does not represent the official views of the NIH.
Part of this work was funded by the Helmholtz Imaging Platform (HIP), a platform of the Helmholtz Incubator on Information and Data Science.
I. Ezhov and S. Shit are supported by the Translational Brain Imaging Training Network under the EU Marie Sklodowska-Curie programme (Grant ID: 765148).