
Effect of Prior-based Losses on Segmentation Performance: A Benchmark

Today, deep convolutional neural networks (CNNs) have demonstrated state-of-the-art performance for medical image segmentation on various imaging modalities and tasks. Despite this success, segmentation networks may still generate anatomically aberrant segmentations, with holes or inaccuracies near the object boundaries. To enforce anatomical plausibility, recent research studies have focused on incorporating prior knowledge, such as object shape or boundary, as constraints in the loss function. The integrated prior can be low-level, referring to reformulated representations extracted from the ground-truth segmentations, or high-level, representing external medical information such as the organ's shape or size. Over the past few years, prior-based losses have attracted rising interest in the research field since they allow integration of expert knowledge while still being architecture-agnostic. However, given the diversity of prior-based losses on different medical imaging challenges and tasks, it has become hard to identify which loss works best for which dataset. In this paper, we establish a benchmark of recent prior-based losses for medical image segmentation. The main objective is to provide intuition about which losses to choose given a particular task or dataset. To this end, four low-level and high-level prior-based losses are selected. The considered losses are validated on 8 different datasets from a variety of medical image segmentation challenges including the Decathlon, the ISLES and the WMH challenge. Results show that whereas low-level prior-based losses can guarantee an increase in performance over the Dice loss baseline regardless of the dataset characteristics, high-level prior-based losses can increase anatomical plausibility depending on the data characteristics.





1 Introduction

Medical image segmentation is the process of making per-pixel predictions in an image in order to separate organs or lesions from the background. Medical images are highly diverse in nature, depending on the acquisition process and the type of object to be segmented. Imaging modalities include magnetic resonance imaging (MRI), computed tomography (CT), nuclear medicine functional imaging, ultrasound imaging and microscopy, to name a few. Hence, images vary widely in their characteristics and in the anatomical object of interest. As such, guaranteeing high performance for medical image segmentation can be considered very challenging when compared to other types of images or segmentation tasks. Regardless, segmentation in the medical domain is considered a key step in assisting early disease detection, diagnosis, treatment monitoring and follow-up.

In recent years, deep learning has reached pivotal milestones in many fields including pattern recognition, object detection and natural language processing, with medical image segmentation being no exception. Convolutional neural networks (CNNs), a class of deep learning models, have been known to achieve considerable results due to their generalization ability. Since the segmentation process involves indicating not only what is present in an image but also where, medical image segmentation via CNNs involves a trade-off between contextual and spatial understanding. A pioneering approach for segmentation is the U-Net [27] model, which is known for its ability to consider semantic and contextual information while achieving promising performance. U-Net has gained a high level of success within image segmentation generally, and medical image segmentation particularly.

Despite undeniable success, segmentation networks for medical images, including U-Net and its variants, may still generate anatomically aberrant segmentations, with holes or inaccuracies near the object boundaries, as demonstrated in papers [14, 4]. Such models thus lack the anatomical knowledge that a medical expert has. Moreover, they often require large amounts of annotated training data, which is not easy to obtain in the medical domain. Un-annotated or partially labeled data are, by contrast, more easily available and less costly to obtain.

To mitigate these limitations, recent research studies have focused on incorporating medical expert information, known as prior knowledge, as constraints within the deep learning framework. Prior knowledge can be information concerning the object shape, size, topology, boundary or location, and was already known to be useful in variational approaches prior to the deep learning era. The exploitation of prior knowledge allows enforcing anatomical plausibility within segmentations provided by deep networks and can also overcome the need for fully labeled data [16, 19].

Constraints via prior knowledge can be incorporated in CNNs either at the level of the network architecture [10, 11, 30, 13, 26, 21] or at the level of the loss function [17, 20, 18, 28, 6, 24, 1]. Whereas structural constraints are rather robust, loss constraints are more generic and can be plugged into any backbone network. Thus, prior-based loss functions offer a versatile way to include constraints at different scales, while maintaining interactions between regions as well as the computational efficiency of the backbone network.

The integrated prior can be low-level, in which case it is a reformulated representation extracted from the ground-truth segmentations, for example distance maps [6, 16, 25] or Laplacian filters [1]. The prior can also be high-level, representing actual external medical information such as the shape of the organ, its compactness or its size, and is then optimized directly based on ground-truth prior tags [17, 9, 24].

Over the past few years, prior-based losses, whether low-level or high-level, have shown a rising trend in research on semantic image segmentation, particularly in the medical field. Given the diversity of prior-based losses on different medical imaging challenges and tasks, it has become hard to identify which loss works best for which dataset. For this reason, we establish in this paper a benchmark of recent prior-based losses for medical image segmentation. Our main objective is to provide intuition about which losses to choose given a particular task or dataset, based on dataset characteristics and properties.

In the literature, the benchmarks most relevant to ours are those proposed in paper [23] and paper [22]. In paper [22], a benchmark of 20 losses is conducted with a thorough comparison on 4 main segmentation tasks: Liver, Liver Tumor, Pancreas and Multi-Abdominal Organ Segmentation. However, the authors do not address prior-based losses. Instead, they consider regular fitting losses like Dice, cross entropy and their variants, and their benchmark is limited to only 4 datasets. The benchmark proposed in paper [23] targets some low-level prior losses. However, it is limited to losses based on distance maps, such as the boundary loss [16] or the Hausdorff loss [15], and does not compare them to high-level prior losses. Benchmark [23] also reports results on some structural constraints (i.e., regarding the architecture), which do not lie within the scope of our work. In addition, that benchmark is limited to two datasets: an organ segmentation task of the left atrial structure within MRI images and a liver tumor segmentation task within CT scans. In this work, we specifically target prior-based losses, both high-level and low-level, on 8 datasets of different tasks and modalities. Hence, to our knowledge, there is no other benchmark that aims to compare prior-based losses on a number of datasets in order to quantify common trends and limitations.

The main objective of the proposed benchmark is to study the performance of prior-based losses on a variety of datasets, tasks and modalities. In this way, we provide readers with intuition about which losses to choose given a particular task of interest. Prior-based losses are of particular interest because they allow integration of expert knowledge while still being architecture-agnostic, that is to say, they can be plugged into any backbone network. As a result, we are able to use a single segmentation network and the same learning environment, while varying only the prior-based loss. We note that each of the considered losses was proposed in its respective paper in order to carry out a particular task. We believe that, aside from the initial motive for which the considered losses were designed, additional insight may be drawn on other segmentation tasks and dataset characteristics. For this reason, we validate the chosen prior-based losses on 8 different datasets from a variety of medical image segmentation challenges including the Decathlon, the ISLES and the WMH challenge. The main contributions of this paper are summarized as follows:

  • We present a benchmark of architecture-agnostic prior-based losses for medical image segmentation.

  • We attempt to shed light on the underlying relationship between the prior-based losses and some dataset characteristics.

The rest of the paper is organized as follows. Section 2 presents the selected loss functions for the proposed benchmark and elaborates on why they were chosen. In Section 3, we describe the datasets considered and the meta-features extracted to compare the loss performances. Section 4 illustrates the experimental setting adopted in order to evaluate the considered prior-based losses on the different datasets. Finally, Section 5 presents the results and analyzes the loss performances relative to segmentation tasks and dataset characteristics.

2 Selected Loss Functions

In this section, we present the chosen prior-based losses for our proposed benchmark. Prior-based losses can be high-level, when the type of prior considered is based on external knowledge (e.g. shape), or low-level, when they integrate ground-truth map transformations such as distance or contour maps in order to reveal geometrical and location properties, as demonstrated in survey [12]. The proposed benchmark mainly focuses on 4 recent prior-based losses that have raised interest within the field of medical image segmentation: 2 are low-level and the other 2 are high-level.

2.1 Low-level prior-based losses

A low-level prior can be based on distance maps, as demonstrated in papers [16, 15]. In this context, two major contributions are the boundary loss [16] and the Hausdorff loss [15].

The Boundary loss is an approximation of the distance between the real and the estimated boundaries. Based on graph theories, an equivalent term that fine-tunes the probability distribution via ground-truth distance maps is derived in paper [16] and is defined as:

$$\mathcal{L}_{Boundary} = \sum_{p \in \Omega} \phi_G(p)\, s_\theta(p)$$

where $\phi_G(p)$ denotes the distance of pixel $p$ to the closest contour ($\partial G$) point, $s_\theta(p)$ being the predicted value at pixel $p$, and $\Omega$ the image spatial domain.
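The boundary term described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the benchmark: the helper names (`contour_of`, `signed_distance_map`, `boundary_loss`), the brute-force distance computation and the averaging over pixels are assumptions, and the sign convention (negative inside the object, positive outside) follows the level-set formulation of [16].

```python
import numpy as np

def contour_of(mask: np.ndarray) -> np.ndarray:
    """Foreground pixels that touch the background (4-neighbourhood)."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    # A pixel is interior when its four neighbours are all foreground.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return m & ~interior

def signed_distance_map(mask: np.ndarray) -> np.ndarray:
    """Distance of every pixel to the closest contour pixel,
    negative inside the object and positive outside (assumed convention)."""
    m = mask.astype(bool)
    pts = np.argwhere(contour_of(m))              # (K, 2) contour coordinates
    if len(pts) == 0:                             # empty mask: no contour
        return np.zeros(m.shape, dtype=float)
    grid = np.indices(m.shape).reshape(2, -1).T   # (H*W, 2) pixel coordinates
    diff = grid[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1)).min(1).reshape(m.shape)
    return np.where(m, -dist, dist)

def boundary_loss(prob: np.ndarray, gt_mask: np.ndarray) -> float:
    """Mean over the image domain of distance map times predicted probability."""
    return float((signed_distance_map(gt_mask) * prob).mean())

# Toy example: a 4x4 square in an 8x8 image.
gt = np.zeros((8, 8), dtype=np.uint8)
gt[2:6, 2:6] = 1
perfect = gt.astype(float)              # prediction equal to the ground truth
shifted = np.roll(perfect, 3, axis=1)   # same shape, displaced by 3 columns
print(boundary_loss(perfect, gt), boundary_loss(shifted, gt))
```

The distance map depends only on the ground truth, so in practice it can be precomputed once per image; probability mass placed far outside the object is penalized more strongly than mass close to the contour.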

The Hausdorff loss [15] conducts a direct point-by-point optimization of the predicted and ground-truth contours; its loss term is given in paper [15].

The boundary loss was initially designed to segment lesions within the brain, with the WMH and the ISLES datasets, whereas the Hausdorff loss was tested on 4 different single-organ segmentation tasks, including the prostate, liver and pancreas from the Decathlon and PROMISE challenges. However, these losses were not evaluated on multi-organ segmentation. Since both losses belong to the same family of low-level prior-based losses and rely on the distance map, it is interesting to investigate their performance on the same datasets in order to pinpoint common behaviors. Moreover, we also aim to extend the scope of these losses to the multi-organ case.
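To make the contrast with the boundary loss concrete, the sketch below implements one distance-map formulation of a Hausdorff-style loss. It is an assumption that this matches the variant used in the benchmark: the squared prediction error at each pixel is weighted by the distances to both the ground-truth contour and the contour of the thresholded prediction, raised to a power `alpha`. The function names, `alpha=2.0` and the 0.5 threshold are illustrative choices.

```python
import numpy as np

def distance_to_contour(mask: np.ndarray) -> np.ndarray:
    """Unsigned Euclidean distance of every pixel to the closest contour pixel."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    pts = np.argwhere(m & ~interior)              # contour pixel coordinates
    if len(pts) == 0:                             # empty mask: no contour
        return np.zeros(m.shape, dtype=float)
    grid = np.indices(m.shape).reshape(2, -1).T
    diff = grid[:, None, :] - pts[None, :, :]
    return np.sqrt((diff ** 2).sum(-1)).min(1).reshape(m.shape)

def hausdorff_dt_loss(prob, gt_mask, alpha=2.0, threshold=0.5):
    """Squared error weighted by ground-truth and predicted distance maps."""
    gt = gt_mask.astype(float)
    d_gt = distance_to_contour(gt_mask)              # fixed for a given image
    d_pred = distance_to_contour(prob > threshold)   # recomputed at each step
    weight = d_gt ** alpha + d_pred ** alpha
    return float((((prob - gt) ** 2) * weight).mean())

gt = np.zeros((8, 8), dtype=np.uint8)
gt[2:6, 2:6] = 1
perfect = gt.astype(float)
shifted = np.roll(perfect, 3, axis=1)
print(hausdorff_dt_loss(perfect, gt), hausdorff_dt_loss(shifted, gt))
```

Unlike the boundary term, the predicted distance map `d_pred` changes with every update of the network, which is why a loss of this kind has to recompute a distance map during training.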

2.2 High-level prior-based losses

Regarding high-level prior losses, we analyze the performance of the clDice loss [28] and the size loss [17] .

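The Size loss [17] estimates the organ size from a soft probability map and constrains it, based on upper and lower threshold values of the organ size, according to the following:

$$\mathcal{L}_{Size} = \begin{cases} \left(\sum_{p \in \Omega} s_\theta(p) - a\right)^2 & \text{if } \sum_{p \in \Omega} s_\theta(p) < a \\ \left(\sum_{p \in \Omega} s_\theta(p) - b\right)^2 & \text{if } \sum_{p \in \Omega} s_\theta(p) > b \\ 0 & \text{otherwise} \end{cases}$$

where $s_\theta(p)$ is the predicted probability at pixel $p$ of the image domain $\Omega$, and $b$ and $a$ are respectively the upper and lower permissible bounds that the size of the considered object can attain. The size loss was originally designed for weakly supervised learning, to guide the network through the training despite the lack of full label maps. We are particularly interested in studying the effect of the size loss on small structures that are known to be more difficult to segment.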
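As a minimal sketch of the size constraint, assuming the soft size is the sum of the predicted foreground probabilities and the penalty is quadratic outside the permissible interval, the loss can be written in plain Python. The function name `size_loss` and the bound values below are hypothetical and chosen only for illustration.

```python
def size_loss(prob_map, lower, upper):
    """Quadratic penalty on the soft organ size when it leaves [lower, upper].

    prob_map: nested list of foreground probabilities (one row per image row).
    lower, upper: permissible bounds on the organ size, in pixels (assumed units).
    """
    soft_size = sum(sum(row) for row in prob_map)
    if soft_size < lower:
        return (soft_size - lower) ** 2
    if soft_size > upper:
        return (soft_size - upper) ** 2
    return 0.0

# A 4x4 probability map whose soft size is 16 * 0.5 = 8 pixels.
prob = [[0.5] * 4 for _ in range(4)]
too_large = size_loss(prob, lower=2, upper=6)     # size above the upper bound
in_range = size_loss(prob, lower=6, upper=10)     # size inside the interval
too_small = size_loss(prob, lower=10, upper=12)   # size below the lower bound
print(too_large, in_range, too_small)
```

The penalty is zero anywhere inside the interval, so the constraint only acts when the predicted size is implausible, which is consistent with its original use as weak supervision.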

The clDice loss [28], also called skeleton loss, exploits skeletonization maps, which are compact representations of images and objects that preserve topological properties. The objective of this loss is to constrain the skeleton of the predicted map to match the skeleton of the ground-truth map. This prior was used in the segmentation of vessels and neurons in both 2D and 3D. Let $S_G$ and $S_P$ be the ground-truth and the predicted skeletons respectively, and $V_G$ and $V_P$ the ground-truth and predicted segmentation masks. The sensitivity (or recall) between the predicted segmentation and the ground-truth skeleton is introduced as $T_{sens}(S_G, V_P) = \frac{|S_G \cap V_P|}{|S_G|}$. Likewise, the precision between the ground-truth mask and the predicted skeleton is defined as $T_{prec}(S_P, V_G) = \frac{|S_P \cap V_G|}{|S_P|}$. The clDice is defined as the F1-score between precision and sensitivity as follows:

$$clDice(V_P, V_G) = 2 \times \frac{T_{prec}(S_P, V_G) \times T_{sens}(S_G, V_P)}{T_{prec}(S_P, V_G) + T_{sens}(S_G, V_P)}$$

The clDice was originally designed to segment vessels; however, due to the nature of the skeletonization feature that it targets, we believe that it may be good at distinguishing between different structures lying in close proximity to each other, such as when the organs are made of multiple instances.
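The precision/sensitivity construction above can be illustrated with a short NumPy sketch. It assumes the skeletons are already available as binary arrays (here they are written by hand for a toy bar-shaped object); how they are extracted, and how that step is made differentiable for training, is outside this sketch. The function name `cl_dice` and the `eps` smoothing term are illustrative assumptions.

```python
import numpy as np

def cl_dice(v_gt, v_pred, s_gt, s_pred, eps=1e-8):
    """F1-score between skeleton precision and skeleton sensitivity."""
    v_gt, v_pred, s_gt, s_pred = (
        np.asarray(a, dtype=float) for a in (v_gt, v_pred, s_gt, s_pred))
    t_prec = (s_pred * v_gt).sum() / (s_pred.sum() + eps)   # predicted skeleton inside GT mask
    t_sens = (s_gt * v_pred).sum() / (s_gt.sum() + eps)     # GT skeleton inside predicted mask
    return 2 * t_prec * t_sens / (t_prec + t_sens + eps)

# Ground truth: a horizontal bar, 3 pixels thick, with its centre line as skeleton.
v_gt = np.zeros((7, 10)); v_gt[2:5, 1:9] = 1
s_gt = np.zeros((7, 10)); s_gt[3, 1:9] = 1

# Prediction: the same bar broken in the middle (two disconnected pieces).
v_broken = v_gt.copy(); v_broken[:, 4:6] = 0
s_broken = s_gt.copy(); s_broken[:, 4:6] = 0

score_same = cl_dice(v_gt, v_gt, s_gt, s_gt)
score_broken = cl_dice(v_gt, v_broken, s_gt, s_broken)
print(score_same, score_broken)
```

In the broken prediction every predicted skeleton pixel still lies inside the ground-truth mask (precision 1), but only 6 of the 8 ground-truth skeleton pixels are covered (sensitivity 0.75), so the score drops to 6/7: the measure reacts to the loss of connectivity rather than to the number of missing pixels alone.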

Figure 1: Brain lesion segmentation task
Dataset       Object   # of patients    Organ size       # of      # of          # of CC
                       Train   Test     mean    std      classes   modalities
WMH                    48      12       0.33    0.56     1         2             0   26
ISLES                  74      20       2.11    1.91     1         5             0   3
Atrium                 16      4        0.69    0.43     1         1             0   4
Colon                  100     38       0.6     0.59     1         1             0   3
Spleen                 32      9        1.57    1.03     1         1             0   1
Hippocampus   H1       206     54       4.08    3.87     2         1             0   4
              H2                        3.53    2.53                             0   3
Prostate      CG       26      6        0.9     0.89     2         2
              PZ                        3.1     2.98                             0   1
ACDC          RVC      99      24       1.29    1.03     3         1             0   1
              MYO                       1.38    0.69
              LVC                       1.28    0.84

Table 1: Dataset Description: # of patients: patient split is 80 % / 20 % on the original dataset; Organ Size: % of pixels occupied by the organ w.r.t. the entire image; # of CC: number of connected components.

3 Datasets and Tasks

In this section, we present a brief description of the datasets under consideration. The datasets were chosen to cover different tasks, modalities and characteristics. Each dataset encompasses a particular set of challenges the segmentation network must consider while training. A summary of the meta-dataset characteristics is presented in Table 1.

Figure 2: Single-organ segmentation tasks from the Decathlon Challenge [29].

3.1 Brain Lesion Segmentation

To investigate the significance of prior-based losses on brain lesion segmentation tasks, we mainly focus on the white matter hyperintensities (WMH) segmentation dataset and the ischemic stroke lesion segmentation (ISLES) dataset. Both datasets are multi-modal, with anatomical objects that are sparse and composed of multiple instances (see Figure 1).

3.2 Single Organ Segmentation

Organs can generally be single-connected, consisting of only 1 structure, or multi-connected, composed of multiple structures that are close to each other. To investigate the segmentation performance of prior-based losses on single-organ segmentation tasks where the organ considered is characterized by multi-connected structures, we targeted the segmentation of the atrium and colon from the Decathlon Challenge. Additionally, we target the spleen to investigate the performance of prior-based losses on single-label single-connected organs. The spleen and colon are characterized by a largely varying size and mild convexity issues at the boundary level. On the other hand, the atrium is a multi-instance anatomical object with up to 4 elements of varying sizes lying in close proximity to each other (see Figure 2).

Figure 3: Multi-Organ Segmentation Tasks.

3.3 Multi-Organ Segmentation

For multi-organ segmentation, we have targeted the Prostate (Prostate central gland and peripheral zone) and Hippocampus (tissues H1 & H2) datasets from the Decathlon Challenge and the ACDC dataset (Three Cardiac Structures).

3.4 Meta-dataset Features

In order to reveal the underlying relationship between loss performance and dataset characteristics, we propose a set of meta-features that describe the datasets. These include the size of the anatomical object, taken as the percentage of the entire image that it occupies; the number of connected components, i.e., how many instances an anatomical object is made of; and the number of classes, i.e., whether the segmentation task is single-label or multi-label.
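The three meta-features can be computed from a label map with a few lines of plain Python. The sketch below is illustrative only: the function names are hypothetical, and the use of 4-connectivity for connected components is an assumption, since the connectivity rule is not stated in the text.

```python
from collections import deque

def organ_size_percent(mask):
    """Percentage of image pixels occupied by the (binary) object."""
    total = sum(len(row) for row in mask)
    return 100.0 * sum(sum(row) for row in mask) / total

def count_connected_components(mask):
    """Number of 4-connected foreground components (breadth-first flood fill)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                count += 1
                seen[i][j] = True
                queue = deque([(i, j)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

def number_of_classes(label_map):
    """Number of distinct foreground labels (0 is taken as background)."""
    return len({v for row in label_map for v in row} - {0})

mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]
print(organ_size_percent(mask), count_connected_components(mask))
```

For the toy mask, 7 of the 20 pixels are foreground and they form three separate instances; a single-label task corresponds to `number_of_classes` returning 1.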

4 Experimental Setting

We deploy a unified U-Net based framework [16, 17] and modify the loss function accordingly. Training is done using a batch size of 8, and the learning rate is halved if the validation performance does not improve during 20 epochs. The U-Net model is trained via each prior-based loss in conjunction with the Dice loss, weighted by a parameter $\alpha$ according to the following equation:

$$\mathcal{L} = \mathcal{L}_{Dice} + \alpha\, \mathcal{L}_{prior}$$

The parameter $\alpha$ is tuned via the dynamic training strategy [16] such that its value is initially set to 0.01 and increased by 0.01 per epoch for 200 epochs. Our code is publicly available on GitHub.
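The dynamic weighting strategy can be sketched as follows. The sketch assumes that the weight multiplies the prior term while the Dice term keeps a unit weight, and that epochs are counted from 0; the function names `prior_weight` and `total_loss` are hypothetical.

```python
def prior_weight(epoch, start=0.01, step=0.01):
    """Weight of the prior-based term at a given epoch (epochs counted from 0)."""
    return start + step * epoch

def total_loss(dice_loss, prior_loss, epoch):
    """Dice loss plus the prior-based loss scaled by the epoch-dependent weight."""
    return dice_loss + prior_weight(epoch) * prior_loss

# Weight at the first and last of 200 epochs, and one combined loss value.
first, last = prior_weight(0), prior_weight(199)
combined = total_loss(dice_loss=0.3, prior_loss=0.5, epoch=0)
print(first, last, combined)
```

Under these assumptions the prior term starts almost switched off, so the network first learns a coarse region overlap from the Dice loss, and it reaches a weight of about 2 by the end of training.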

For pre-processing, we have resized the images to 256 × 256 pixels and normalized them to the range [0, 1]. For multi-modal datasets, we have concatenated the channels at the level of the input. Each dataset was split into train and validation sets based on an 80 % / 20 % partition, as shown in Table 1, and validated via three Monte-Carlo simulations [2].
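A dependency-free sketch of two of these steps is given below: intensity normalization to [0, 1] and the repeated 80 % / 20 % random split. It assumes min-max scaling for the normalization and independent random shuffles of the patient list for the three Monte-Carlo repetitions; the function names and the seed are illustrative, and image resizing is omitted because it requires an image library.

```python
import random

def min_max_normalize(values):
    """Rescale a flat list of intensities to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant image: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def monte_carlo_splits(patient_ids, train_fraction=0.8, repetitions=3, seed=0):
    """Repeated random train/validation partitions at the patient level."""
    rng = random.Random(seed)
    n_train = round(train_fraction * len(patient_ids))
    splits = []
    for _ in range(repetitions):
        shuffled = list(patient_ids)
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits

# 60 patients give 48 training and 12 validation cases, as for WMH in Table 1.
splits = monte_carlo_splits(list(range(60)))
print([(len(train), len(val)) for train, val in splits])
print(min_max_normalize([0, 5, 10]))
```

Splitting at the patient level rather than the slice level keeps all slices of one patient on the same side of the partition, which avoids leakage between training and validation.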

Dataset       Object   Dice             + Boundary       + Hausdorff      + Size           + clDice
WMH                    74.64 ± 1.34     77.29 ± 0.75     78.77 ± 0.70     78.06 ± 1.61     66.97 ± 11.48
ISLES                  53.41 ± 4.61     62.93 ± 2.24     63.53 ± 1.66     46.86 ± 7.74     62.53 ± 5.22
Atrium                 83.67 ± 3.66     82.80 ± 3.68     84.57 ± 1.86     84.59 ± 2.62     83.85 ± 2.56
Colon                  84.82 ± 1.71     88.71 ± 0.48     88.30 ± 0.78     88.71 ± 0.48     84.52 ± 2.64
Spleen                 76.80 ± 7.59     80.38 ± 5.46     91.79 ± 2.67     86.44 ± 15.86    87.15 ± 13.61
Hippocampus   H1       49.38 ± 0.33     65.20 ± 0.31     68.54 ± 1.46     66.24 ± 0.33     68.39 ± 2.60
              H2       71.70 ± 1.30     81.33 ± 0.74     82.12 ± 0.44     81.84 ± 0.63     82.82 ± 1.22
Prostate      CG       45.17 ± 6.41     44.89 ± 7.09     44.15 ± 5.61     34.12 ± 7.49     42.45 ± 7.03
              PZ       65.13 ± 11.57    68.99 ± 9.94     64.38 ± 9.33     29.61 ± 12.07    61.57 ± 11.44
ACDC          RVC      80.79 ± 0.95     81.04 ± 0.87     80.54 ± 1.30     41.02 ± 38.39    83.83 ± 1.39
              MYO      83.92 ± 0.13     84.16 ± 0.83     83.91 ± 0.85     83.41 ± 0.72     83.24 ± 0.66
              LVC      90.26 ± 0.13     89.53 ± 0.74     88.98 ± 0.90     89.74 ± 0.71     89.56 ± 1.10

Table 2: Average Dice scores ± standard deviation. Blue (resp. pink) background represents Dice accuracy superior (resp. inferior) to the corresponding Dice baseline. The bold result is the best Dice score (i.e., the greatest) obtained on the dataset.

Dataset       Object   Dice             + Boundary       + Hausdorff      + Size           + clDice
WMH                    0.98 ± 0.13      0.94 ± 0.17      0.93 ± 0.16      0.94 ± 0.18      1.16 ± 0.38
ISLES                  3.75 ± 0.35      3.05 ± 0.22      3.07 ± 0.18      3.45 ± 0.79      3.29 ± 0.62
Atrium                 1.62 ± 0.16      1.64 ± 0.16      1.67 ± 0.13      1.59 ± 0.17      1.64 ± 0.16
Colon                  0.58 ± 0.04      0.50 ± 0.02      0.51 ± 0.03      0.50 ± 0.02      0.58 ± 0.07
Spleen                 0.92 ± 0.15      1.07 ± 0.53      (two entries only; columns not identified)
Hippocampus   H1       2.31 ± 0.05      1.99 ± 0.01      1.98 ± 0.03      1.97 ± 0.02      1.99 ± 0.04
              H2       3.82 ± 0.14      3.09 ± 0.01      2.97 ± 0.05      3.07 ± 0.01      3.20 ± 0.18
Prostate      CG       2.80 ± 0.34      2.77 ± 0.43      2.88 ± 0.27      3.10 ± 0.26      3.48 ± 0.66
              PZ       3.24 ± 0.35      2.94 ± 0.27      3.17 ± 0.47      4.41 ± 0.90      3.45 ± 0.58
ACDC          RVC      2.44 ± 0.04      2.41 ± 0.05      2.33 ± 0.04      3.88 ± 1.44      2.34 ± 0.08
              MYO      2.60 ± 0.01      2.57 ± 0.01      2.65 ± 0.01      2.62 ± 0.00      2.71 ± 0.04
              LVC      1.95 ± 0.02      1.95 ± 0.02      1.98 ± 0.01      1.94 ± 0.01      1.98 ± 0.04

Table 3: Average Hausdorff Distances ± standard deviation. Blue (resp. pink) background represents HD inferior (resp. superior) to the corresponding Dice baseline. The bold result is the best (i.e. the smallest) Hausdorff Distance obtained on the dataset.

Dataset       Object   Dice             + Boundary       + Hausdorff      + Size           + clDice
WMH                    1.04 ± 0.14      0.98 ± 0.17      1.01 ± 0.23      0.91 ± 0.22      2.14 ± 1.26
ISLES                  0.69 ± 0.19      0.48 ± 0.04      0.57 ± 0.19      1.34 ± 1.08      0.39 ± 0.10
Atrium                 0.25 ± 0.01      0.29 ± 0.02      0.32 ± 0.01      0.28 ± 0.03      0.28 ± 0.03
Colon                  0.17 ± 0.02      0.13 ± 0.01      0.13 ± 0.01      0.13 ± 0.01      0.18 ± 0.03
Spleen                 0.22 ± 0.05      0.24 ± 0.09      0.09 ± 0.01      0.18 ± 0.13      0.12 ± 0.15
Hippocampus   H1       3.76 ± 0.14      1.81 ± 0.14      2.67 ± 0.11      2.88 ± 0.39      1.30 ± 0.96
              H2       0.95 ± 0.01      0.23 ± 0.01      0.87 ± 0.10      0.74 ± 0.06      0.10 ± 0.03
Prostate      CG       8.96 ± 3.33      9.05 ± 3.33      8.89 ± 3.11      8.78 ± 2.79      8.98 ± 3.32
              PZ       0.36 ± 0.13      0.23 ± 0.09      0.33 ± 0.07      0.80 ± 0.12      0.26 ± 0.11
ACDC          RVC      0.18 ± 0.03      0.16 ± 0.00      0.13 ± 0.02      0.11 ± 0.04      0.12 ± 0.03
              MYO      0.04 ± 0.02      0.06 ± 0.01      0.08 ± 0.03      0.07 ± 0.03      0.06 ± 0.02
              LVC      0.06 ± 0.01      0.06 ± 0.01      0.07 ± 0.01      0.07 ± 0.02      0.05 ± 0.02

Table 4: Mean Absolute Error (MAE) on the number of connected components (CC) of the ground truth vs the number of CC of the predicted segmentation map. Blue (resp. pink) background represents an MAE inferior (resp. superior) to the corresponding Dice baseline.

5 Results and Analysis

In this section, we report the results obtained on the benchmark datasets with the losses under consideration, based on the training strategy explained in Section 4. The segmentation performances are compared via the 2 usual segmentation metrics: the Dice score [8] (DSC), presented in Table 2, and the Hausdorff distance metric [3] (HD), presented in Table 3. In addition, we have computed the mean absolute error on the number of instances (connected components), presented in Table 4.
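For reference, the three evaluation quantities can be sketched in NumPy as below. This is an illustrative version, not the evaluation code of the benchmark: the Hausdorff distance is computed by brute force between the sets of foreground pixels and in pixel units, both of which are assumptions, and the function names are hypothetical.

```python
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """Dice overlap between two binary masks."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))

def hausdorff_distance(pred, gt):
    """Symmetric Hausdorff distance between the foreground pixel sets."""
    a, b = np.argwhere(pred), np.argwhere(gt)
    if len(a) == 0 or len(b) == 0:
        return float("inf")            # undefined when one mask is empty
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

def mae_connected_components(n_cc_pred, n_cc_gt):
    """Mean absolute error between predicted and ground-truth instance counts."""
    diff = np.asarray(n_cc_pred, dtype=float) - np.asarray(n_cc_gt, dtype=float)
    return float(np.abs(diff).mean())

gt = np.zeros((8, 8), dtype=np.uint8); gt[2:6, 2:6] = 1
pred = np.zeros((8, 8), dtype=np.uint8); pred[2:6, 3:7] = 1   # shifted by one column
print(dice_score(pred, gt), hausdorff_distance(pred, gt))
print(mae_connected_components([1, 3, 2], [1, 1, 2]))
```

In the toy example the masks share 12 of their 16 pixels each, giving a Dice score of 0.75, while the one-column shift gives a Hausdorff distance of 1 pixel; the two metrics therefore capture overlap and worst-case boundary error respectively.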

5.1 Added value of prior losses over the Dice loss baseline

From the performance tables, we observe that there is generally at least one prior-based loss that is superior to the Dice baseline (denoted by cells with blue background in the tables). In terms of Dice score, the exploitation of prior-based losses has enhanced segmentation performance for 10 out of 12 anatomical objects of the 8 datasets. For example, the Hausdorff loss has registered the best performances on brain lesion segmentation tasks (WMH, ISLES) and single-organ segmentation datasets. On the other hand, the boundary loss registered performances close to the best-case performance on lesion tasks (ISLES, WMH). The clDice registers the best performance in 1 out of 3 multi-organ segmentation datasets, and the size loss obtained good results on a selection of datasets including WMH, Atrium and Colon.
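The "10 out of 12" count can be checked directly from the mean Dice scores of Table 2. In the sketch below the values are copied from the table (standard deviations omitted); the row labels for the unlabeled rows are assumed to follow the dataset order of Table 1, and the column order of the four prior-based losses is not needed for this check.

```python
# Mean Dice per anatomical object: (Dice baseline, [four Dice + prior losses]).
table2 = {
    "WMH":            (74.64, [77.29, 78.77, 78.06, 66.97]),
    "ISLES":          (53.41, [62.93, 63.53, 46.86, 62.53]),
    "Atrium":         (83.67, [82.80, 84.57, 84.59, 83.85]),
    "Colon":          (84.82, [88.71, 88.30, 88.71, 84.52]),
    "Spleen":         (76.80, [80.38, 91.79, 86.44, 87.15]),
    "Hippocampus-H1": (49.38, [65.20, 68.54, 66.24, 68.39]),
    "Hippocampus-H2": (71.70, [81.33, 82.12, 81.84, 82.82]),
    "Prostate-CG":    (45.17, [44.89, 44.15, 34.12, 42.45]),
    "Prostate-PZ":    (65.13, [68.99, 64.38, 29.61, 61.57]),
    "ACDC-RVC":       (80.79, [81.04, 80.54, 41.02, 83.83]),
    "ACDC-MYO":       (83.92, [84.16, 83.91, 83.41, 83.24]),
    "ACDC-LVC":       (90.26, [89.53, 88.98, 89.74, 89.56]),
}

# An object counts as improved when at least one prior-based loss beats the baseline.
improved = [name for name, (base, priors) in table2.items() if max(priors) > base]
not_improved = [name for name in table2 if name not in improved]
print(len(improved), "of", len(table2), "improved; exceptions:", not_improved)
```

The two objects without any Dice gain are the prostate central gland and the left ventricular cavity, i.e. the hardest and the easiest structure in terms of baseline Dice, which matches the discussion of the Prostate and ACDC datasets.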

Taking a close look at the Dice baseline performance over the entire set of datasets (first column in the tables), one can observe that the Prostate is quite challenging since it has the lowest Dice baseline performance. On the contrary, the ACDC dataset is the easiest, with the highest Dice accuracy; the problem of cardiac structure segmentation is well known and has been argued to be almost solved, as demonstrated in paper [4]. Intuitively, an easy dataset would already register good performance with the simple Dice baseline, and one would expect the addition of prior-based losses to have no added value other than adding to the complexity of the training and degrading system performance. Indeed, the results on the ACDC dataset show little to no added value relative to the baseline. Conversely, if the dataset is too complex, as in the case of the Prostate (multi-label segmentation, large organ size imbalance, large number of connected components), customized prior-based losses may be needed to accommodate its characteristics: almost no gain is obtained from prior losses for the Prostate dataset.

5.2 Low-level vs. High-level Prior-based Losses

Both the Hausdorff and Boundary losses register good performances on most datasets and over all segmentation tasks: brain lesions, single-organ and multi-organ segmentation. The Hausdorff loss is superior to the Boundary loss on some datasets (Spleen, Hippocampus). For example, for the spleen dataset, the Hausdorff loss registered the best-case performance in Dice accuracy (added value of 14 %) and reduced the Hausdorff distances by over 30 % in comparison to the Boundary loss. The superiority of the Hausdorff loss over the Boundary loss is mainly due to the fact that the Hausdorff loss extracts distance maps from both predicted and ground-truth contours and minimizes the error between the two maps, whereas the Boundary loss simply fine-tunes the probability distribution via the ground-truth distance maps. Since the Hausdorff loss directly optimizes the distance maps of the predicted and ground-truth labels, it can provide a better mapping between predicted outputs and the ground truth than the Boundary loss. Despite this advantage, the Hausdorff loss is computationally very expensive since it requires computing the predicted distance maps online while training, which directly affects training time. Hence, the Boundary loss may represent a reasonable trade-off between good segmentation performance and computational cost.
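As a worked check of the 14 % figure, the relative gains can be computed from the spleen row of Table 2. The reading below is an interpretation rather than a statement from the text: the 14 % appears to correspond to the relative Dice gain of the Hausdorff loss over the Boundary loss, whereas the gain over the Dice baseline is larger. The helper name `relative_change` is hypothetical.

```python
def relative_change(new, reference):
    """Relative change of `new` with respect to `reference`, in percent."""
    return 100.0 * (new - reference) / reference

# Mean Dice scores on the spleen dataset (Table 2).
spleen_baseline = 76.80     # Dice loss only
spleen_boundary = 80.38     # Dice + Boundary
spleen_hausdorff = 91.79    # Dice + Hausdorff

gain_vs_boundary = relative_change(spleen_hausdorff, spleen_boundary)
gain_vs_baseline = relative_change(spleen_hausdorff, spleen_baseline)
print(round(gain_vs_boundary, 1), round(gain_vs_baseline, 1))
```

In absolute terms the difference between the two low-level losses on this dataset is 11.41 Dice points, which is large compared with the spread observed on most other datasets.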

Regarding the high-level prior-based losses, results are mixed: the size-based loss can either provide great improvement (e.g., WMH, Atrium, Colon) or much worse results (e.g., ISLES). For example, the size loss registers performance equivalent to the best-case segmentation result on the WMH dataset, but performs poorly on ISLES, despite the similarity in nature between the two datasets. We hypothesize that this may be due to the overall lesion sizes. From the meta-data characteristics in Table 1, we can gather that, on small-sized organs (e.g., WMH, Atrium, Colon), the size loss registers performance either better than or equivalent to the Dice baseline. On datasets that have large size variability (e.g., ISLES, Prostate or ACDC), the exploitation of the size loss degrades segmentation performance. This is mainly due to the fact that the size loss generally leads the network to learn average sizes of the organs. In the same vein, the results show that the size loss cannot accommodate multi-organ segmentation. The above observations are illustrated in Figure 4, which shows the Dice performance relative to organ sizes. We note that the datasets where the size loss registered degraded results (red dots) are those whose organ sizes have large variability or that include multi-label segmentation. Hence, despite the fact that the size loss was initially designed for weakly supervised segmentation, it may be useful in full supervision when the anatomical objects under consideration are very small structures occupying a tiny percentage of the overall image, as in the case of the WMH dataset.

The clDice has a similar behavior, but to a lesser extent. It generally registers better performance than the Dice baseline in most single-label segmentation cases and in one multi-organ segmentation dataset. However, the clDice loss degraded performance on other datasets such as the WMH and the ACDC. Despite the equivalence in Dice accuracy between the Hausdorff loss and the clDice loss on the Hippocampus dataset, the Hausdorff loss outperformed the clDice in terms of Hausdorff distance (the Hausdorff loss is about 8 % lower than the clDice loss in Hausdorff distance). This indicates the ability of the Hausdorff loss to take shape and border specifications into consideration. The degraded performance of clDice on the Hausdorff distance can be explained by the fact that the loss is based on skeleton maps, which tend to blur boundary specifications for the sake of revealing topological properties. This limitation is further confirmed by the Hausdorff distance results of the clDice on the ACDC dataset. Thus, even when the clDice registered the best-ranked results in terms of Dice accuracy, the Hausdorff distance is degraded, becoming even worse than the Dice baseline for the myocardium, for instance. Given tasks with high border irregularities, such as lesions, failing to consider boundary specifications can hinder overall performance (e.g., the case of brain lesions in the WMH dataset).

When studying other meta-data features such as the number of connected components, one can see that the exploitation of high-level prior-based losses does not have a great influence on the results (see Table 4). We hypothesize that this may be due to the fact that high-level prior-based losses are customized to serve a particular task or satisfy a particular constraint. If that constraint does not match the dataset characteristics or attributes, the prior-based loss may generally have no added value.

Overall, we can hypothesize that contour-based losses are rather generic and can be useful for enhancing segmentation performance on any type of dataset. However, if the aim is to preserve a particular characteristic or anatomical property, a customized high-level prior-based loss may be a feasible solution. Thus, high-level losses may provide improvement; however, they are not very stable and cannot be generalized to all datasets and tasks.

Figure 4: Influence of the organ size on the average Dice score. Each dot represents a dataset. Blue (resp. red) dots show the Dice score obtained with the Dice loss (resp. the Size loss).

5.3 Limitations of the Current Proposed Benchmark

Despite our intuitive analysis of some relationships between loss performance and dataset characteristics, we acknowledge several limitations. For starters, the proposed benchmark cannot be considered generic, as there are many existing prior-based losses that it does not include: low-level prior [6, 31, 25], high-level topological [7] or shape prior [9, 24]. Moreover, since high-level prior-based losses are customized to target a particular property, how to compare their effectiveness is open to debate. Another key component to take into consideration is their optimization algorithms. Many prior-based losses are discrete in nature; hence, they require particular optimization strategies in order to ensure good performance. Our proposed benchmark is based on plugging the considered losses into a penalty-based Lagrangian optimization technique and training via stochastic gradient descent and the ADAM optimizer. On the level of the datasets, despite some similarities between them (lesion tasks: ISLES, WMH; single vs. multi-organ tasks), the datasets are very different, each with its own set of characteristics and properties. Hence, there are a lot of variables to take into consideration, which often limits the means of comparison. Despite these limitations, presenting a benchmark that can test prior-based losses on different tasks and datasets is important, because it can give the reader an intuitive initial judgment on which loss to choose based on the considered requirements and dataset properties.

6 Conclusion

In this paper, we proposed a benchmark of prior-based losses on medical image segmentation datasets. We provided intuitive explanations of a few existing relationships between prior-based loss significance and dataset characteristics. The paper's findings can be summarized as follows: the size loss is generally beneficial when considering datasets of small structures and limited size variability, whereas the contour-based losses in general, and the Hausdorff loss in particular, accommodate objects with multiple structures and border irregularities.

Future work includes expanding the proposed benchmark to encompass a broader range of losses. We also aim to add other metadata features to better characterize the organ and the task at hand, to develop robust similarity feature vectors between datasets for more accurate comparison, and to conduct meta-learning to predict loss ranks and outputs, so as to address the computational complexity of comparing losses with their peers.


The authors have no relevant financial interests in this article and no potential conflicts of interest to disclose.


The authors would like to acknowledge the ANR (Project APi, grant ANR-18-CE23-0014) and the CRIANN for providing computational resources. This work is part of the WeSmile project funded by PHC VanGogh.

Data, Materials, and Code Availability

The code used to conduct the benchmark is publicly available on GitHub:


  • [1] A. Arif, S. M. M. Rahman, K. Knapp, and G. Slabaugh (2018) Shape-aware deep convolutional neural network for vertebrae segmentation. In Computational Methods and Clinical Applications in Musculoskeletal Imaging, pp. 12–24. External Links: ISBN 978-3-319-74113-0 Cited by: §1, §1.
  • [2] S. Arlot and A. Celisse (2010) A survey of cross-validation procedures for model selection. Statistics Surveys 4, pp. 40–79. External Links: Document Cited by: §4.
  • [3] M. Beauchemin, K. P. B. Thomson, and G. Edwards (1998) On the hausdorff distance used for the evaluation of segmentation results. Cited by: §5.
  • [4] O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. Gonzalez Ballester, G. Sanroma, S. Napel, S. Petersen, G. Tziritas, E. Grinias, M. Khened, V. A. Kollerathu, G. Krishnamurthi, M. Rohé, X. Pennec, M. Sermesant, F. Isensee, P. Jäger, K. H. Maier-Hein, P. M. Full, I. Wolf, S. Engelhardt, C. F. Baumgartner, L. M. Koch, J. M. Wolterink, I. Išgum, Y. Jang, Y. Hong, J. Patravali, S. Jain, O. Humbert, and P. Jodoin (2018-11) Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?. IEEE Transactions on Medical Imaging 37 (11), pp. 2514–2525. External Links: ISSN 1558-254X Cited by: §1, §5.1.
  • [5] Y. Boykov, O. Veksler, and R. Zabih (2001) Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (11), pp. 1222–1239. Cited by: §2.1.
  • [6] F. Caliva, C. Iriondo, A. M. Martinez, S. Majumdar, and V. Pedoia (2019-07) Distance map loss penalty term for semantic segmentation. In International Conference on Medical Imaging with Deep Learning – Extended Abstract Track, London, UK. Cited by: §1, §1, §5.3.
  • [7] J. R. Clough, I. Oksuz, N. Byrne, J. A. Schnabel, and A. P. King (2019-01) Explicit topological priors for deep-learning based image segmentation using persistent homology. arXiv:1901.10244 [cs]. External Links: 1901.10244 Cited by: §5.3.
  • [8] W. R. Crum, O. Camara, and D. L. G. Hill (2006) Generalized overlap measures for evaluation and validation in medical image analysis. IEEE Transactions on Medical Imaging 25 (11), pp. 1451–1461. External Links: ISSN 1558-254X, Document Cited by: §5.
  • [9] J. Dolz, I. Ben Ayed, and C. Desrosiers (2017) Unbiased shape compactness for segmentation. In MICCAI, pp. 755–763. External Links: ISBN 978-3-319-66182-7 Cited by: §1, §5.3.
  • [10] R. El Jurdi, C. Petitjean, P. Honeine, and F. Abdallah (2020) BB-UNet: U-net with bounding box prior. IEEE Journal of Selected Topics in Signal Processing 14 (6), pp. 1189–1198. Cited by: §1.
  • [11] R. El Jurdi, T. Dargent, C. Petitjean, P. Honeine, and F. Abdallah (2020-11) Investigating CoordConv for fully and weakly supervised medical image segmentation. In Tenth International Conference on Image Processing Theory, Tools and Applications, IPTA 2020, Paris, France, Cited by: §1.
  • [12] R. El Jurdi, C. Petitjean, P. Honeine, V. Cheplygina, and F. Abdallah (2021) High-level Prior-based Loss Functions for Medical Image Segmentation: A Survey. Submitted to Computer Vision and Image Understanding. Cited by: §2.
  • [13] M. Ghafoorian, N. Karssemeijer, T. Heskes, I. Van Uden, C. Sanchez, G. Litjens, F. Leeuw, B. Ginneken, E. Marchiori, and B. Platel (2016-10) Location sensitive deep convolutional neural networks for segmentation of white matter hyperintensities. Scientific Reports 7. Cited by: §1.
  • [14] F. Isensee, P. F. Jaeger, P. M. Full, I. Wolf, S. Engelhardt, and K. H. Maier-Hein (2018) Automatic Cardiac Disease Assessment on cine-MRI via Time-Series Segmentation and Domain Specific Features. In Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges, Lecture Notes in Computer Science, pp. 120–129 (English). External Links: Document, ISBN 978-3-319-75541-0 Cited by: §1.
  • [15] D. Karimi and S. E. Salcudean (2019-04) Reducing the Hausdorff Distance in Medical Image Segmentation with Convolutional Neural Networks. arXiv:1904.10030 [cs, eess, stat]. External Links: 1904.10030 Cited by: §1, §2.1, §2.1.
  • [16] H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, and I. Ben Ayed (2019-07) Boundary loss for highly unbalanced segmentation. In Medical Imaging with Deep Learning, Proceedings of Machine Learning Research, Vol. 102, London, UK, pp. 285–296. Cited by: §1, §1, §1, §2.1, §2.1, §4, §4.
  • [17] H. Kervadec, J. Dolz, M. Tang, E. Granger, Y. Boykov, and I. B. Ayed (2018) Constrained-CNN losses for weakly supervised segmentation. In 1st Conference on Medical Imaging with Deep Learning (MIDL), Amsterdam, the Netherlands, Cited by: §1, §1, §2.2, §2.2, §4.
  • [18] H. Kervadec, J. Dolz, S. Wang, E. Granger, and I. B. Ayed (2020-04) Bounding boxes for weakly supervised segmentation: Global constraints get close to full supervision. arXiv:2004.06816 [cs] (en). External Links: 2004.06816 Cited by: §1.
  • [19] H. Kervadec, J. Dolz, S. Wang, E. Granger, and I. ben Ayed (2020) Bounding boxes for weakly supervised segmentation: Global constraints get close to full supervision. In Medical Imaging with Deep Learning, Cited by: §1.
  • [20] H. Kervadec, J. Dolz, J. Yuan, C. Desrosiers, E. Granger, and I. B. Ayed (2019-09) Constrained Deep Networks: Lagrangian Optimization via Log-Barrier Extensions. arXiv:1904.04205 [cs]. External Links: 1904.04205 Cited by: §1.
  • [21] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele (2017-07) Simple does it: Weakly supervised instance and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1665–1674. External Links: ISSN 1063-6919, Document Cited by: §1.
  • [22] J. Ma, J. Chen, M. Ng, R. Huang, Y. Li, C. Li, X. Yang, and A. L. Martel (2021) Loss odyssey in medical image segmentation. Medical Image Analysis 71, pp. 102035. External Links: ISSN 1361-8415, Document, Link Cited by: §1.
  • [23] J. Ma, Z. Wei, Y. Zhang, Y. Wang, R. Lv, C. Zhu, G. Chen, J. Liu, C. Peng, L. Wang, Y. Wang, and J. Chen (2020) How distance transform maps boost segmentation CNNs: an empirical study. In Medical Imaging with Deep Learning. External Links: Link Cited by: §1.
  • [24] Z. Mirikharaji and G. Hamarneh (2018) Star shape prior in fully convolutional networks for skin lesion segmentation. In MICCAI, Vol. 11073, pp. 737–745. External Links: Document Cited by: §1, §1, §5.3.
  • [25] A. Mosinska, P. Márquez-Neila, M. Kozinski, and P. Fua (2018) Beyond the pixel-wise loss for topology-aware delineation. In CVPR, pp. 3136–3145. Cited by: §1, §5.3.
  • [26] H. Ravishankar, R. Venkataramani, S. Thiruvenkadam, P. Sudhakar, and V. Vaidya (2017) Learning and incorporating shape models for semantic segmentation. In MICCAI, pp. 203–211. External Links: ISBN 978-3-319-66182-7 Cited by: §1.
  • [27] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. External Links: ISBN 978-3-319-24574-4 Cited by: §1.
  • [28] S. Shit, J. C. Paetzold, A. Sekuboyina, A. Zhylka, I. Ezhov, A. Unger, J. P. W. Pluim, G. Tetteh, and B. H. Menze (2019) clDice - a topology-preserving loss function for tubular structure segmentation. In Medical Imaging Meets NeurIPS 2019 Workshop, Cited by: §1, §2.2, §2.2.
  • [29] A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. van Ginneken, A. Kopp-Schneider, B. A. Landman, G. Litjens, B. H. Menze, O. Ronneberger, R. M. Summers, P. Bilic, P. F. Christ, R. K. G. Do, M. Gollub, J. Golia-Pernicka, S. Heckers, W. R. Jarnagin, M. McHugo, S. Napel, E. Vorontsov, L. Maier-Hein, and M. J. Cardoso (2019) A large annotated medical image dataset for the development and evaluation of segmentation algorithms. CoRR abs/1902.09063. External Links: Link, 1902.09063 Cited by: Figure 2.
  • [30] R. Trullo, C. Petitjean, S. Ruan, B. Dubray, D. Nie, and D. Shen (2017) Joint segmentation of multiple thoracic organs in CT images with two collaborative deep architectures. MICCAI’17 workshop Deep Learning in Medical Image Analysis. Cited by: §1.
  • [31] X. Yang, C. Bian, L. Yu, D. Ni, and P. Heng (2018-01) Class-balanced deep neural network for automatic ventricular structure segmentation. In STACOM@MICCAI, pp. 152–160. Cited by: §5.3.