Are we using appropriate segmentation metrics? Identifying correlates of human expert perception for CNN training beyond rolling the DICE coefficient

In this study, we explore quantitative correlates of qualitative human expert perception. We discover that current quality metrics and loss functions, considered for biomedical image segmentation tasks, correlate moderately with segmentation quality assessment by experts, especially for small yet clinically relevant structures, such as the enhancing tumor in brain glioma. We propose a method employing classical statistics and experimental psychology to create complementary compound loss functions for modern deep learning methods, towards achieving a better fit with human quality assessment. When training a CNN for delineating adult brain tumor in MR images, all four proposed loss candidates outperform the established baselines on the clinically important and hardest to segment enhancing tumor label, while maintaining performance for other label channels.






1 Introduction

In the medical imaging community, challenges have become a prominent forum to benchmark segmentation performance of the latest methodological advances [menze2014multimodal, sekuboyina2020verse, bilic2019liver, bakas2018identifying]. Across all challenges, convolutional neural networks (CNNs), and in particular the U-Net architecture [ronneberger2015u], gained increasing popularity over the years. Typically, segmentation models are evaluated (and trained) using well-established criteria, such as the Dice similarity coefficient or the Hausdorff distance, or combinations thereof. Segmentation models for the BraTS challenge [bakas2018identifying], for example, are typically trained and evaluated on three label channels: enhancing tumor (ET), tumor core (TC) and whole tumor (WT). For the final ranking in the BraTS challenge these three channels are treated equally. In contrast, from a medical perspective, the ET channel is of higher importance: in glioma surgery, gross total resection of the ET is the goal [weller2014eano], and tumor progression is mainly defined by an increase in ET volume [wen2010updated].
Like BraTS, most challenges rely on a combination of the Dice coefficient (DICE) with other metrics [maier2018rankings] for their scoring [menze2014multimodal]. However, it is not well understood how such metrics reflect expert opinion on segmentation quality and clinical relevance. Taha et al. [taha2015metrics] investigated how different aspects of segmentation quality are captured by different metrics. Despite the central importance of segmentation metrics for evaluating segmentation performance, there is a clear gap in knowledge regarding how to best capture expert assessment of a "good" (clinically meaningful) segmentation. This problem arises especially when selecting a loss function for CNN training. In contrast to a plethora of volumetric loss functions [hashemi2018asymmetric, milletari2016v, sudre2017generalised, rahman2016optimizing, brosch2015deep, salehi2017tversky], only few non-volumetric losses are established for CNN training, such as [karimi2019reducing]. (This is especially true when searching for multi-class segmentation losses that support GPU-based training; for instance, according to the authors, there is no implementation of [karimi2019reducing] with GPU support.)
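As a concrete illustration of why DICE alone can be misleading for small but clinically relevant structures such as the enhancing tumor, the following sketch (our own toy example, not code from this study) computes the Dice coefficient with numpy and shows that a single missed voxel costs far more Dice on a small structure than on a large one:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice similarity coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# A single missed voxel hurts a small structure far more than a large one:
small_gt = np.zeros(100, dtype=bool); small_gt[:4] = True
small_pred = small_gt.copy(); small_pred[0] = False       # 1 of 4 voxels missed
large_gt = np.zeros(10_000, dtype=bool); large_gt[:1000] = True
large_pred = large_gt.copy(); large_pred[0] = False       # 1 of 1000 voxels missed
```

Here the small structure drops to a Dice of about 0.86 while the large one stays above 0.999, even though both predictions miss exactly one voxel.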

Our contribution to the field is multifaceted. First, our study develops a better understanding of qualitative human expert assessment, using the example of glioma image analysis, by identifying its quantitative correlates. Building upon these insights, we propose a method exploiting techniques of classical statistics and experimental psychology to form new compound losses for modern neural network training. These losses achieve a better fit with human quality assessment and improve on conventional baselines.

2 Methods

2.1 Collection of expert ratings

To understand how expert radiologists assess the quality of tumor segmentations, we conduct experiments where participants rate segmentation quality on a six-point Likert scale (see Fig. 1). The selection options range from one star for strong rejection to six stars for perfect segmentations.

2.1.1 Experiment 1:

We randomly select three exams from each of the eight institutions that provided data to the BraTS 2019 [menze2014multimodal] test set. We also present one additional exam with apparently erroneous segmentation labels to check whether participants follow the instructions. We display stimulus material according to four experimental conditions, i.e. four different segmentation labels for each exam: the ground truth segmentations, segmentations from team zyx [zhao2019multi], a SIMPLE [langerak2010label] fusion of seven segmentations [zhao2019multi, xfeng, micdkfz, scan, lfb, gbmnet, econib] obtained from BraTS-Toolkit [kofler2020brats], and a segmentation randomly selected without replacement from all the teams participating in BraTS 2019. This results in a total of 300 trials presented to each expert radiologist.

2.1.2 Experiment 2:

Experts evaluate only the axial view of the gliomas' center of mass, for a random sample of 50 patients from our test set. We present our four candidate losses vs. the ground truth and the DICE+BCE baseline as conditions, so the experiment again comprises 300 trials.

Figure 1: Stimulus material presented to the participants, implemented via JS Psych [de2015jspsych]. One trial consisted of the presentation of an MR exam in either axial, sagittal or coronal view, along with the controls for the quality rating and a legend for the segmentation labels. We presented the glioma's center of mass according to the TC, defined by the necrosis and enhancing tumor voxels, in 2D slices of T1, T1c, T2 and T2 FLAIR in a horizontal row arrangement. The stimulus presentation is conducted in line with best practices in experimental psychology (details are outlined in supplementary materials).

2.2 Metrics and losses vs. qualitative expert assessment

2.2.1 Metrics:

For each experimental condition, we compute a comprehensive set of segmentation quality metrics comparing the segmentation to the expert-annotated ground truth, inspired by [taha2015metrics]. The metrics are calculated for the BraTS evaluation criteria, WT, TC, and ET, as well as for the individual labels necrosis and edema. In addition, we compute mean aggregates for the BraTS and individual labels (Fig. 4A). Computation of metrics is implemented using pymia [jungo2021pymia].

2.2.2 Losses:

We use the binary labels to compute established segmentation losses between the candidate segmentations and the ground truth (GT), following the same evaluation criteria. (To test the validity of this approach, we conduct a supportive experiment comparing binary segmentations to non-binary network outputs and achieve comparable results across all loss functions.) The losses we considered are shown in Table 1. (For the Tversky loss we choose combinations of the α and β parameters that have proven successful for similar segmentation problems; the parameter values are written behind the code name.)

Loss name               | Abbreviation | Implementation | Reference
------------------------|--------------|----------------|--------------------------
Asymmetric              | ASYM         | junma          | [hashemi2018asymmetric]
Binary cross-entropy    | BCE          | pytorch        | n/a
Dice                    | DICE         | monai          | [milletari2016v]
Generalized Dice        | GDICE_L      | liviaets       | [sudre2017generalised]
Generalized Dice        | GDICE_W      | wolny          | [sudre2017generalised]
Generalized Dice        | GDICE_M      | monai          | [sudre2017generalised]
Hausdorff DT            | HDDT         | patryg         | [karimi2019reducing]
Hausdorff ER            | HDER         | patryg         | [karimi2019reducing]
Jaccard                 | IOU          | junma          | [rahman2016optimizing]
Jaccard                 | JAC          | monai          | [milletari2016v]
Sensitivity-Specificity | SS           | junma          | [brosch2015deep]
Soft Dice               | SOFTD        | nnUNet         | [drozdzal2016importance]
Tversky                 | TVERSKY      | monai          | [salehi2017tversky]

Table 1: Losses considered in our analysis. Implementations are GitHub links.
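For intuition, differentiable variants of two of the listed losses can be sketched in a few lines of numpy. This is a hedged illustration of the general form (a Milletari-style squared-denominator soft Dice and a plain BCE), not the referenced monai/pytorch implementations:

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-8):
    """Soft (differentiable) Dice loss with the squared denominator of Milletari et al."""
    num = 2.0 * np.sum(probs * target)
    den = np.sum(probs ** 2) + np.sum(target ** 2)
    return 1.0 - (num + eps) / (den + eps)

def bce_loss(probs, target, eps=1e-7):
    """Binary cross-entropy averaged over voxels; probabilities are clipped for stability."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))
```

Both accept soft network outputs in [0, 1], which is what makes them usable as training losses rather than only as evaluation metrics.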

2.3 Construction of compound losses

We aim to design a loss function, combining complementary loss functions, to achieve a better correlation with human segmentation quality assessment. Our compound loss is structured as a weighted linear combination of losses per label channel:

L_compound = \sum_c w_c \sum_l v_l L_l(\hat{y}_c, y_c)

where w_c denotes the weights per channel and v_l denotes the weights per loss.
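A minimal sketch of this weighted linear combination, assuming the per-channel loss values and both weight sets are given as plain dictionaries (the loss names and weight values below are illustrative, not the paper's fitted choices):

```python
def compound_loss(loss_values, channel_weights, loss_weights):
    """Weighted linear combination of losses per label channel.

    loss_values:     {channel: {loss_name: value}}
    channel_weights: {channel: weight per channel}
    loss_weights:    {loss_name: weight per loss}
    """
    total = 0.0
    for channel, losses in loss_values.items():
        inner = sum(loss_weights.get(name, 0.0) * val for name, val in losses.items())
        total += channel_weights.get(channel, 0.0) * inner
    return total

# Illustrative values: weight the enhancing tumor channel more heavily.
loss_values = {"WT": {"gdice": 0.2, "bce": 0.1}, "ET": {"gdice": 0.4}}
channel_weights = {"WT": 1.0, "ET": 5.0}
loss_weights = {"gdice": 1.0, "bce": 1.0}
```

In a real training setup the inner loss values would of course be differentiable tensors rather than floats; the combination logic stays the same.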

2.3.1 Identifying promising loss combinations:

Losses produce different signals (gradients) when they react to input signals (network outputs). With hierarchical clustering we analyze similarity in loss response patterns, see Fig. 2. We can now build complementary loss combinations by selecting elements from the different cluster groups.
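The first step, measuring how similarly losses respond, can be sketched as a pairwise Euclidean distance matrix over loss response vectors, from which the most dissimilar (i.e. most complementary) pair can be read off. The response vectors below are hypothetical toy values, not the study's data:

```python
import numpy as np

def loss_distance_matrix(responses):
    """Pairwise Euclidean distances between loss response vectors."""
    names = sorted(responses)
    X = np.stack([np.asarray(responses[n], dtype=float) for n in names])
    diff = X[:, None, :] - X[None, :, :]
    return names, np.sqrt((diff ** 2).sum(axis=-1))

def most_complementary_pair(responses):
    """Pick the pair of losses whose response patterns differ the most."""
    names, D = loss_distance_matrix(responses)
    i, j = np.unravel_index(np.argmax(D), D.shape)
    return names[i], names[j]

# Toy response vectors: two overlap-based losses behave alike, BCE differs.
responses = {
    "dice": np.array([0.1, 0.2, 0.3]),
    "iou": np.array([0.12, 0.21, 0.33]),
    "bce": np.array([0.9, 0.8, 0.7]),
}
```

Hierarchical clustering (e.g. scipy.cluster.hierarchy) operates on exactly such a distance matrix; the sketch shows only the distance computation and pair selection.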

Figure 2: Phylogenetic tree of a hierarchical clustering on a Euclidean distance matrix. Losses are colored according to ten cluster groups. We included the expert assessment for reference; one can observe how it resides between the distance-based losses and the group of generalized Dice losses and the SS loss. We obtain comparable findings from a Principal Component Analysis (PCA), see supplemental materials.

2.3.2 Obtaining weights from the linear mixed model:

Second, we evaluate the predictive performance of our combinations through linear mixed models using the lme4 package [bates]. We average the human expert rating across views to obtain a quasi-metric variable. (We deem this approach valid, as the distribution is consistent across views; see supplemental material.) We then model the human expert assessment as the dependent variable and predict it by plugging in the loss values of our candidate combinations as predictor variables. Mixed models allow us to account for the non-independence of data points by modelling exam and experimental condition as random factors. To identify loss candidates for CNN training, we evaluate the predictive power of our models by computing Pseudo R2 [nakagawa2017coefficient], while monitoring typical mixed regression model criteria such as multi-collinearity, non-normality of residuals, outliers, homoscedasticity, homogeneity of variance and normality of random effects.
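For reference, the marginal and conditional pseudo-R² of Nakagawa and Schielzeth decompose the explained variance as follows (our notation: σ²_f fixed-effect variance, σ²_α summed random-effect variances, σ²_ε residual variance):

```latex
% Pseudo-R^2 for mixed models (Nakagawa & Schielzeth):
R^2_{\mathrm{marginal}}    = \frac{\sigma^2_f}{\sigma^2_f + \sigma^2_\alpha + \sigma^2_\varepsilon}
\qquad
R^2_{\mathrm{conditional}} = \frac{\sigma^2_f + \sigma^2_\alpha}{\sigma^2_f + \sigma^2_\alpha + \sigma^2_\varepsilon}
```

The marginal variant quantifies the variance explained by the fixed effects alone (here: the loss values), the conditional variant that explained by fixed and random effects together.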

2.4 CNN training

We train nnU-Net [isensee2021nnu] using a standard BraTS trainer [isensee2020nnu] with a moderate augmentation pipeline on the fold 0 split of the BraTS 2020 training set, using the rest of the training set for validation. The official BraTS 2020 validation set is used for testing. Apart from replacing the default DICE+BCE loss with our custom-tailored loss candidates, and in some cases necessary learning rate adjustments, we keep all training parameters constant. (The code for our nnU-Net training will be publicly available on GitHub.)

3 Results

3.1 Qualitative assessment vs. quantitative performance measures

A total of n=15 radiologists from six institutions took part in our first segmentation quality assessment experiment. Participants had an average experience of 10.0 ± 5.1 working years as radiologists. Fig. 3 depicts how experts evaluated the segmentation quality. Next, we explore how the human expert assessment correlates with quantitative measures of segmentation performance.

Figure 3: Expert assessment in star ratings. Diamonds indicate mean scores. Expert radiologists rated the SIMPLE fusion slightly higher than the best individual docker algorithm, zyx. The fusion's mean improvement is mainly driven by more robust segmentation performance with fewer outliers towards the bottom; interestingly, this effect is partially compensated by minor imperfections. We observe that both of these algorithmic segmentations receive slightly higher scores than the human expert annotated ground truth (gt). The quality of the randomly selected BraTS submissions (rnd) still lags behind.

3.1.1 Segmentation metrics vs. expert assessment:

Fig. 4 depicts the Pearson correlation matrix between segmentation quality metrics and the expert assessment. We find only low to moderate correlations across all metrics; correlations are especially low for the clinically important enhancing tumor label. Notably, DICE is outperformed by other, less established metrics.
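The correlation analysis itself reduces to one Pearson coefficient per metric and label channel; a minimal sketch with hypothetical metric values and expert ratings (not the study's data):

```python
import numpy as np

# Hypothetical per-case values of one metric and the averaged expert ratings:
metric_values = np.array([0.6, 0.7, 0.8, 0.9])
expert_ratings = np.array([4.0, 3.0, 5.0, 4.0])

# Pearson correlation between the metric and the expert assessment.
r = np.corrcoef(metric_values, expert_ratings)[0, 1]
```

With these toy numbers r lands around 0.32, i.e. in the "low to moderate" range the study reports for most metric/channel pairs.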

3.1.2 Segmentation losses vs. expert assessment:

Next we look into how these findings translate into CNN training. To train a CNN with stochastic gradient descent (SGD), a differentiable loss function is required. As this property does not apply to all metrics, we investigate how established loss functions correlate with expert quality assessment; the results are presented in Fig. 4B.


Figure 4: A: Pearson correlation matrix: expert assessment vs. segmentation quality metrics. The rows show the correlations for the individual label channels. In addition, we present the mean aggregates for the BraTS labels (enhancing tumor, tumor core and whole tumor), as well as for the single channels (enhancing tumor, necrosis and edema). For the abbreviations in the figure refer to [taha2015metrics, jungo2021pymia].
B: Pearson correlation matrix: expert assessment vs. segmentation losses. With the exception of one Generalized Dice loss implementation, we observe only low correlations for the enhancing tumor and tumor core channels. In contrast, multiple losses are moderately correlated for the whole tumor channel. While the IOU and JAC implementations provide very similar signals, as expected, we observe large variance across the implementations of GDICE.

3.2 Performance evaluation

Our methodology allows generating a variety of loss candidates. To evaluate the performance of our approach, we construct four promising loss candidates.

3.2.1 Construction of four loss candidates

The first two candidates use the same combination across all label channels. Candidate gdice_bce uses GDICE_W in combination with BCE. Candidate gdice_ss_bce iterates upon this by adding the SS loss. In contrast, the candidates channel_wise_weighted and channel_wise use different losses per channel: the whole tumor channel uses the gdice_ss_bce candidate loss, while the tumor core channel is computed via the gdice_bce variant. The enhancing tumor channel relies solely on GDICE_W, as this is the only loss with at least moderate correlation with expert assessment (compare Fig. 4B). While candidate channel_wise treats all channels equally, the weighted variant prioritizes the clinically more relevant tumor core and enhancing tumor channels by a factor of five.

3.2.2 Evaluation of the loss candidates

As depicted in Fig. 5, our loss candidates outperform the established Dice and Dice+BCE losses with regard to the clinically most important enhancing tumor label, while maintaining performance for the tumor core and whole tumor channels. This finding is also supported by a second qualitative experiment, where three expert radiologists from two institutions detect a subtle improvement in segmentation quality (see Fig. 6).
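The paired-samples t-tests reported in Fig. 5 compare per-case Dice scores of a candidate against the baseline on the same exams; the t statistic can be sketched as follows (computing the p-value additionally requires a t-distribution CDF, e.g. from scipy.stats; the per-case scores below are hypothetical):

```python
import numpy as np

def paired_t_statistic(a, b):
    """t statistic of a paired-samples t-test: a and b are scores on the same cases."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(d.size)))

# Hypothetical per-case Dice scores for a candidate loss and the dice_bce baseline:
candidate = [0.82, 0.79, 0.88, 0.75, 0.91]
baseline = [0.80, 0.78, 0.85, 0.74, 0.89]
t = paired_t_statistic(candidate, baseline)
```

Pairing by case is what lets the test use the difference per exam rather than the two group means, which matters on small test sets like the one used here.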

Figure 5: Dice comparison of loss candidates vs. baselines across BraTS label channels. Diamonds indicate mean scores. P-values for paired-samples t-tests comparing our candidates with the dice_bce baseline, from left to right: 0.1547, 0.04642, 0.05958, 0.01881 (significance threshold: 0.05). While not all t-tests yield significant p-values on our small test set, it is important to note that the achieved improvement is larger than the training variance; see supplemental materials.
Figure 6: Expert assessment of the four loss candidates vs. the ground truth and the DICE+BCE baseline. Diamonds indicate mean scores. We observe only subtle differences in expert rating between the DICE+BCE baseline and our candidates. As in Experiment 1, the human-annotated ground truth is rated worse (compare Fig. 3).

4 Discussion

Existing literature [reinke_weaknesses_2018, maier2018rankings] reported on the discrepancy between expert segmentation quality assessment and established segmentation quality metrics. Our approach, combining methods of experimental psychology and classical statistics, identifies the quantitative correlates of qualitative human expert perception to shed more light on this complex phenomenon. While both segmentation quality metrics and expert radiologists try to measure segmentation quality, their signals correlate only moderately, especially for enhancing tumor (ET).

Even though our training set is small and the 15 radiologists judge the complex 3D segmentations on three 2D views, our method manages to extract the signal. When training a modern convolutional neural network with our expert inspired loss functions, all four proposed candidates lead to quantitative improvement on the ET channel, outperforming established baselines. This is in line with our expectations as the ET channel is of high clinical relevance for human experts.

We find that BCE is one of the few loss functions complementary to volumetric losses, which might explain why the empirically found Dice+BCE combination provides a solid baseline across many segmentation tasks. Our findings question the status of the Dice coefficient as the de facto gold standard for measuring glioma segmentation quality. (As experts consistently scored CNN segmentations higher than the human annotated ground truth in our experiments (see Figs. 3 and 6), our findings raise the question whether segmentation performance in general, and challenge rankings in particular, should be determined based upon Dice similarity with the latter. This phenomenon might occur because humans, in contrast to machines, are prone to random errors.) We highlight the need for the development of new segmentation metrics and losses that better correlate with expert assessment. As primate perception follows a fixed, partially understood set of rules, e.g. [wagemans2012century, roelfsema2006cortical], we speculate that our loss generation approach might generalize well to other segmentation problems; however, further research is required to explore this.

5 Acknowledgments

Bjoern Menze, Benedikt Wiestler and Florian Kofler are supported through the SFB 824, subproject B12.

Supported by Deutsche Forschungsgemeinschaft (DFG) through TUM International Graduate School of Science and Engineering (IGSSE), GSC 81.

With the support of the Technical University of Munich – Institute for Advanced Study, funded by the German Excellence Initiative.

J.C. Paetzold is supported by the Graduate School of Bioengineering, Technical University of Munich.

Research reported in this publication was partly supported by the National Cancer Institute (NCI) and the National Institute of Neurological Disorders and Stroke (NINDS) of the National Institutes of Health (NIH), under award numbers NCI:U01CA242871, NCI:U24CA189523, NINDS:R01NS042645. The content of this publication is solely the responsibility of the authors and does not represent the official views of the NIH.

Part of this work was funded by the Helmholtz Imaging Platform (HIP), a platform of the Helmholtz Incubator on Information and Data Science.

I. Ezhov and S. Shit are supported by the Translational Brain Imaging Training Network under the EU Marie Sklodowska-Curie programme (Grant ID: 765148).