CHAOS Challenge – Combined (CT-MR) Healthy Abdominal Organ Segmentation

by A. Emre Kavur, et al.

Segmentation of abdominal organs has been a comprehensive, yet unresolved, research field for many years. In the last decade, intensive developments in deep learning (DL) have introduced new state-of-the-art segmentation systems. Although these systems outperform earlier ones in overall accuracy, the effects of DL model properties and parameters on performance are hard to interpret. This makes comparative analysis a necessary tool for achieving explainable studies and systems. Moreover, the performance of DL in emerging learning approaches such as cross-modality and multi-modal tasks has rarely been discussed. In order to expand the knowledge on these topics, the CHAOS challenge (Combined (CT-MR) Healthy Abdominal Organ Segmentation) was organized at the IEEE International Symposium on Biomedical Imaging (ISBI) 2019 in Venice, Italy. In contrast to the large number of previous abdomen-related challenges, the majority of which focus on tumor/lesion detection and/or classification with a single modality, CHAOS provides both abdominal CT and MR data from healthy subjects. Five different and complementary tasks have been designed to analyze the capabilities of current approaches from multiple perspectives. The results are investigated thoroughly and compared with manual annotations and interactive methods. The outcomes are reported in detail to reflect the latest advancements in the field. The CHAOS challenge and data will be available online to provide a continuous benchmark resource for segmentation.



I Introduction

In the last decade, medical imaging and image processing benchmarks have become the standard strategy for comparing the performance of different approaches in clinically important tasks [2]. These benchmarks have gained a particularly important role in the analysis of learning-based systems by enabling the use of the same dataset for training and testing [38]. Challenges, which use these benchmarks, play a prominent role in reporting state-of-the-art results in a structured way [25]. In this respect, the benchmarks establish standard datasets, evaluation strategies, fusion possibilities (e.g. ensembles), and (un)resolved difficulties related to the specific biomedical image processing task(s) being tested [30]. An extensive website [42] has been designed for hosting challenges related to medical image segmentation and currently includes around 200 challenges.

Comprehensive studies of biomedical image analysis challenges reveal that the construction of the datasets, inter- and intra-observer variations in ground truth generation, and the evaluation criteria might prevent such events from reaching their true potential [35]. Suggestions, caveats, and roadmaps for improving the challenges are provided by reviews [29, 34].

Challenge | Task(s) | Structure (Modality) | Organization and year
SLIVER07 | Single model segmentation | Liver (CT) | MICCAI 2007, Australia
LTSC08 | Single model segmentation | Liver tumor (CT) | MICCAI 2008, USA
Shape 2014 | Building organ model | Liver (CT) | Delémont, Switzerland
Shape 2015 | Completing partial segmentation | Liver (CT) | Delémont, Switzerland
Anatomy3 | Multi-model segmentation | Kidney, urinary bladder, gallbladder, spleen, liver, and pancreas (CT and MRI for all organs) | VISCERAL Consortium, 2014
LiTS | Single model segmentation | Liver and liver tumor (CT) | ISBI 2017, Australia; MICCAI 2017, Canada
Pancreatic Cancer Survival Prediction | Quantitative assessment of cancer | Pancreas (CT) | MICCAI 2018, Spain
MSD | Multi-model segmentation | Liver (CT), liver tumor (CT), spleen (CT), hepatic vessels in the liver (CT), pancreas and pancreas tumor (CT) | MICCAI 2018, Spain
KiTS19 | Single model segmentation | Kidney and kidney tumor (CT) | MICCAI 2019, China
PAIP 2019 | Detection | Liver cancer (Whole-slide images) | MICCAI 2019, China
CHAOS | Multi-model segmentation | Liver, kidney(s), spleen (CT, MRI for all organs) | ISBI 2019, Italy
TABLE I: Overview of challenges that have upper abdomen data and tasks. (Other structures are not shown in the table.)

Considering the dominance of machine learning (ML) approaches, two main points are continuously being emphasized: 1) recognition of the current roadblocks in applying ML to medical imaging, and 2) increasing the dialogue between radiologists and data scientists [33]. Accordingly, existing challenges are either continuously updated [30] or repeated after some time [39], or new ones with similar focuses are organized to overcome the pitfalls and shortcomings of the existing ones.

A detailed literature review about the challenges related to abdominal organs (see Section II) revealed that the existing challenges in the field are significantly dominated by CT scans and tumor/lesion classification tasks. Up to now, there have only been a few benchmarks containing abdominal MRI series (Table I). Although this situation was typical for the last decades, the emerging technology of MRI makes it the preferred modality for further and detailed analysis of the abdomen. The remarkable developments in MRI technology in terms of resolution, dynamic range, and speed enable joint analyses of these modalities [14].

To gauge the current state-of-the-art in automated abdominal segmentation and observe the performance of various approaches on different tasks such as cross-modality learning and multi-modal segmentation, we organized Combined (CT-MR) Healthy Abdominal Organ Segmentation (CHAOS) in conjunction with the IEEE International Symposium on Biomedical Imaging (ISBI) in 2019. For this purpose, we prepared and made available a unique dataset of CT and MR scans from unpaired abdominal image series. A consensus-based multiple expert annotation strategy was used to generate the ground truths. A subset of this dataset was provided to the participants for training, and the remaining images were used to test performance against the (hidden) manual delineations using various metrics. In this paper, we report the setup as well as the results of this CHAOS benchmark and its outcomes.

The rest of the paper is organized as follows. A review of the current challenges in abdominal organ segmentation is given in Section II together with surveys on benchmark methods. Next, CHAOS datasets, setup, ground truth generation, and employed tasks are presented in Section III. Section IV describes the evaluation strategy. Then, participating methods are comparatively summarized in Section V. Section VI presents the results, and Section VII concludes the paper.

II Related Work

According to our literature analysis, there currently exist 11 challenges focusing on abdominal organs [42]. As one of the pioneering challenges, SLIVER07 initiated liver benchmarking [13, 43]. It provided a comparative study of a range of algorithms for liver segmentation under several intentionally included difficulties such as patient orientation variations or tumors and lesions. Its outcomes reported a snapshot of the methods that were popular for medical image analysis. However, since then, abdomen-related challenges have mostly been aimed at disease and tumor detection rather than organ segmentation. In 2008, the “3D Liver Tumor Segmentation Challenge (LTSC08)” [6] was organized as the continuation of SLIVER07 to segment liver tumors from abdominal CT scans. Similarly, the Shape 2014 and 2015 [23] challenges focused on liver segmentation from CT data. Anatomy3 [18] provided a unique challenge, which was a very comprehensive platform for segmenting not only upper-abdominal organs, but also various others such as left/right lung, urinary bladder, and pancreas. In a similar vein, LiTS (Liver Tumor Segmentation Challenge) [3] is another example that covers liver and liver tumor segmentation tasks in CT. Other similar challenges include Pancreatic Cancer Survival Prediction [10], which targets pancreas cancer tissues in CT scans; the KiTS19 [24] challenge, which provides CT data for kidney tumor segmentation; and PAIP 2019 [32], which aims at automatically detecting liver cancer from whole-slide images.

In 2018, the Medical Segmentation Decathlon (MSD) [38] was organized by a joint team and provided an immense challenge that contained many structures such as liver parenchyma, hepatic vessels and tumors, spleen, brain tumors, hippocampus, and lung tumors. The focus of the challenge was not only to measure the performance for each structure, but also to observe the generalizability, translatability, and transferability of a system to unseen data. Thus, the main idea behind MSD was to understand the key elements of DL systems that can work on many tasks. To provide such a source, MSD included a wide range of difficulties including small and unbalanced sample sizes, varying object scales, and multi-class labels. The approach of MSD underlines the ultimate goal of the challenges, which is to provide large datasets on several highly different tasks, and evaluation through a standardized analysis and validation process.

In this respect, a recent survey shows that another trend in medical image segmentation is the development of more comprehensive computational anatomical models leading to multi-organ related tasks rather than traditional organ and/or disease-specific tasks [4]. By incorporating inter-organ relations into the process, multi-organ related tasks require a complete representation of the complex and flexible abdominal anatomy. Thus, this emerging field requires new efficient computational and machine learning models.

Under the influence of the above mentioned visionary studies, CHAOS has been organized to strengthen the field by aiming at objectives that involve emerging ML concepts (cross-modality learning, multi-modal segmentation, etc.) through an extensive dataset. In this respect, it focuses on segmenting multiple organs from unpaired patient datasets acquired by two modalities, CT and MR (including two different pulse sequences).

III CHAOS Challenge

III-A Aims and Tasks

CHAOS challenge has two separate but related aims:

  1. Achieving accurate segmentation of the liver from CT.

  2. Achieving accurate segmentation of abdominal organs (liver, spleen, kidneys) from MRI sequences.

CHAOS provides different segmentation algorithm design opportunities to the participants through five individual tasks:

Task 1: Liver Segmentation (CT-MRI) focuses on using a single system that can segment the liver from both CT and multi-modal MRI (T1-DUAL and T2-SPIR sequences). This corresponds to "cross-modality" learning, which is expected to be used more frequently as the abilities of DL develop [41].

Task 2: Liver Segmentation (CT) covers a regular segmentation task, which can be considered relatively easier due to the inclusion of only healthy livers aligned in the same direction and patient position. On the other hand, the diffusion of contrast agent to the parenchyma and the enhancement of the inner vascular tree create additional difficulties.

Task 3: Liver Segmentation (MRI) has the same aim as Task 2, but presents multi-modal MRI datasets, which are randomly collected within the routine clinical workflow. The methods are expected to work on both T1-DUAL (in and oppose phases) and T2-SPIR MR sequences.

Task 4: Segmentation of abdominal organs (CT-MRI) is similar to Task 1 with an extension to multiple organ segmentation from MR. In this task, the interesting part is that only the liver is annotated as ground truth in the CT datasets, but the MRI datasets have four annotated abdominal organs.

Task 5: Segmentation of abdominal organs (MRI) is the same as Task 3 but extended to four abdominal organs.

In all tasks, a fusion of individual models for different modalities (i.e. two models, one working on CT and the other on MRI) is not valid. However, the fusion of individual models for MRI sequences (T1-DUAL and T2-SPIR) is allowed in all MRI-included tasks. More details about the tasks are available on the CHAOS challenge website.

III-B Data Information and Details

The CHAOS challenge data contains 80 patients. 40 of them went through a single CT scan and 40 of them went through MR scans including 2 pulse sequences of the upper abdomen area. Both CT and MR datasets include healthy abdomen organs without any tumors, lesions, etc. The datasets were collected from the Department of Radiology, Dokuz Eylul University Hospital, Izmir, Turkey. The scan protocols are briefly explained in the following subsections. Further details and explanations are available on the CHAOS website.

III-B1 CT Data Specifications

The CT volumes were acquired at the portal venous phase after contrast agent injection. In this phase, the liver parenchyma is enhanced maximally through blood supply by the portal vein. Portal veins are well enhanced but some enhancements also exist for hepatic veins. This phase is widely used for liver and vessel segmentation prior to surgery. Since the tasks related to CT data only include liver segmentation, this set has only annotations for the liver. The details of the data are presented in Table II.

III-B2 MRI Data Specifications

The MRI dataset includes two different sequences (T1 and T2) for 40 patients. In total, there are 120 DICOM datasets from T1-DUAL in phase (40 datasets), oppose phase (40 datasets), and T2-SPIR (40 datasets). Each of these sequences is routinely used to scan the abdomen in clinical practice. T1-DUAL in and oppose phase images are registered; therefore, their ground truths are the same. On the other hand, T1 and T2 sequences are not registered. The datasets were acquired on a 1.5T Philips MRI, which produces 12-bit DICOM images. The details of this dataset are given in Table II.

Specification | CT | MR
Number of patients (Train + Test) | 20 + 20 | 20 + 20
Number of sets (Train + Test) | 20 + 20 | 60 + 60*
Spatial resolution of files | 512 x 512 | 256 x 256
Number of files in all sets [min-max] | [78 - 294] | [26 - 50]
Average number of files in a set | 160 | 32x3*
Total files in the whole dataset | 6407 | 3868x3*
X space (mm) [min-max] | [0.54 - 0.79] | [0.72 - 2.03]
Y space (mm) [min-max] | [0.54 - 0.79] | [0.72 - 2.03]
Slice thickness (mm) [min-max] | [2.0 - 3.2] | [4.4 - 8.0]

* MRI sets are collected from 3 different pulse sequences. For each patient, T1 in and oppose phases (registered) and the T2 sequence are acquired.

TABLE II: Statistics about the CHAOS CT and MRI datasets.

III-C Annotations for reference segmentation

All 2D slices were labeled manually by three different radiology experts who have 10, 12, and 28 years of experience, respectively. The final shapes of the reference segmentations were decided by majority voting. Also, in some extraordinary situations, the experts made joint decisions. Although this manual annotation process took a significant amount of time, it was preferred in order to create a consistent and consensus-based ground truth image series.
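The voxel-wise majority voting described above can be illustrated with a minimal sketch. It assumes each expert's annotation for a slice is a binary mask stored as nested lists of 0/1 values with identical shape; the function name `majority_vote` and the toy masks are hypothetical and are not taken from the challenge tooling.

```python
def majority_vote(masks):
    """Combine binary masks voxel-wise.

    A voxel is marked as foreground when more than half of the
    annotators labeled it as foreground.
    """
    n_annotators = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [1 if 2 * sum(m[r][c] for m in masks) > n_annotators else 0
         for c in range(cols)]
        for r in range(rows)
    ]


# Toy 2x3 slice labeled by three (hypothetical) experts.
expert_a = [[0, 1, 1], [0, 1, 0]]
expert_b = [[0, 1, 0], [0, 1, 1]]
expert_c = [[1, 1, 0], [0, 0, 1]]

consensus = majority_vote([expert_a, expert_b, expert_c])
print(consensus)
```

With three annotators a tie cannot occur, which is one practical reason for using an odd number of experts.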

III-D Challenge Setup and Distribution of the Data

Both CT and MRI datasets were divided into 20 sets for training and 20 sets for testing. Typically, training data is presented with ground truth labels, while testing data only contains original images. To provide sufficient data that contains enough variability, the datasets in the training data were selected to represent all the difficulties that are observed on the whole database.

The images are distributed as DICOM files to present the data in its original form. The only modification was removing patient-related information for anonymization. The ground truths are also presented as image series to match the original format. CHAOS data can be accessed with its DOI number via webpage under CC-BY-SA 4.0 license [21].

IV Evaluation

IV-A Metrics

Since the outcomes of medical image segmentation are used for various clinical procedures, using a single metric for 3D segmentation evaluation is not a proper approach to ensure acceptable results for all requirements  [29, 46]. Thus, in the CHAOS challenge, four different metrics are combined. The metrics have been chosen among the most preferred ones in previous challenges [29] and to analyze results in terms of overlapping, volumetric, and spatial differences:

Assume that $S$ represents the set of voxels in a segmentation result and $G$ represents the set of voxels in the ground truth.

DICE coefficient (DICE)

The Dice coefficient is calculated as $\mathrm{DICE} = \frac{2\,|S \cap G|}{|S| + |G|}$, where $|\cdot|$ denotes cardinality (the larger, the better).

Relative absolute volume difference (RAVD)

RAVD compares two volumes: $\mathrm{RAVD} = \frac{\mathrm{abs}(|S| - |G|)}{|G|}$, where ‘abs’ denotes the absolute value (the smaller, the better).

Average symmetric surface distance (ASSD)

This metric is the average Hausdorff distance between border voxels in $S$ and $G$. The unit of this metric is millimeters.

Maximum symmetric surface distance (MSSD)

This metric is the maximum Hausdorff distance between border voxels in $S$ and $G$. The unit of this metric is millimeters.
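The four metrics can be sketched in a few lines of plain Python by treating the segmentation result S and the ground truth G as sets of integer voxel coordinates. This is only an illustrative sketch under stated assumptions: border voxels are taken to be voxels with at least one 6-connected neighbor outside the set, surface distances are computed by brute force, and the function names and `spacing` parameter are hypothetical. The official evaluation code may define borders and distances differently.

```python
import math

# 6-connected neighborhood used to decide whether a voxel lies on the border.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dice(seg, ref):
    """DICE = 2|S ∩ G| / (|S| + |G|)."""
    return 2 * len(seg & ref) / (len(seg) + len(ref))


def ravd(seg, ref):
    """RAVD in percent: abs(|S| - |G|) / |G| * 100."""
    return abs(len(seg) - len(ref)) / len(ref) * 100


def border(voxels):
    """Voxels that have at least one 6-connected neighbor outside the set."""
    return {
        v for v in voxels
        if any((v[0] + dx, v[1] + dy, v[2] + dz) not in voxels
               for dx, dy, dz in NEIGHBOURS)
    }


def surface_distances(seg, ref, spacing=(1.0, 1.0, 1.0)):
    """Distances in mm from every border voxel to the closest border voxel
    of the other set, collected in both directions (brute force)."""
    def to_mm(v):
        return tuple(c * s for c, s in zip(v, spacing))

    seg_border = [to_mm(v) for v in border(seg)]
    ref_border = [to_mm(v) for v in border(ref)]
    seg_to_ref = [min(math.dist(p, q) for q in ref_border) for p in seg_border]
    ref_to_seg = [min(math.dist(q, p) for p in seg_border) for q in ref_border]
    return seg_to_ref + ref_to_seg


def assd(seg, ref, spacing=(1.0, 1.0, 1.0)):
    """Average symmetric surface distance in mm."""
    distances = surface_distances(seg, ref, spacing)
    return sum(distances) / len(distances)


def mssd(seg, ref, spacing=(1.0, 1.0, 1.0)):
    """Maximum symmetric surface distance in mm."""
    return max(surface_distances(seg, ref, spacing))


# Toy example: a 4x4 single-slice square and the same square shifted by one voxel.
ref = {(x, y, 0) for x in range(4) for y in range(4)}
seg = {(x + 1, y, 0) for x in range(4) for y in range(4)}
print(dice(seg, ref), ravd(seg, ref), assd(seg, ref), mssd(seg, ref))
```

The brute-force search is quadratic in the number of border voxels, so it is only suitable for small examples; a practical implementation would typically rely on distance transforms or a spatial index.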

IV-B Scoring System

In the literature, there are two main ways of ranking results via multiple metrics. One way is ordering the results by the metrics’ statistical significance with respect to all results. Another way is converting the metric outputs to the same scale and averaging them [27]. In CHAOS, we adopted the second approach. Values coming from each metric have been transformed to span the interval [0, 100] so that higher values correspond to better segmentation. For this transformation, it was reasonable to apply thresholds in order to cut off the unacceptable results and increase the sensitivity of the corresponding metric. We are aware of the fact that decisions on metrics and thresholds have a very critical impact on ranking [29]. Instead of setting arbitrary thresholds, we used intra- and inter-user similarities among our experts who created the ground truth. We asked some of the experts to repeat the annotation process at different times. These collections of reference masks were used for the calculation of our metrics in a pair-wise manner. These values were used to specify the thresholds given in Table III. For instance, two manual segmentations performed by the same expert on the same CT data set resulted in liver volumes of 1491 mL and 1496 mL. The volumetric overlap is found to be 97.21%, while RVD is 0.347%, ASSD is 0.611 (0.263 mm), RMSD is 1.04 (0.449 mm), and MSSD is 13.038 (5.632 mm). With the chosen thresholds, these measurements yielded a total grade of 95.14. A similar analysis of the segmentation of the liver from MRI showed a slightly lower grade of 93.01.

Metric name | Best value | Worst value | Threshold
DICE | 1 | 0 | DICE > 0.8
RAVD | 0% | 100% | RAVD < 5%
ASSD | 0 mm | $x_{\max}$ | ASSD < 15 mm
MSSD | 0 mm | $x_{\max}$ | MSSD < 60 mm
TABLE III: Summary of metrics and threshold values. $x_{\max}$ represents the longest possible distance in the 3D image.

The metric values outside the threshold range get zero points. The values within the range are mapped to the interval [0, 100]. Then, the score of each case in the testing data is calculated as the mean of the four scores. The missing cases (sets which do not have segmentation results) get zero points, and these points are included in the final score calculation. The average of the scores across all test cases determines the overall score of the team for the specified task. The code for all metrics (in MATLAB, Python, and Julia) is publicly available online. More details about the metrics, the CHAOS scoring system, and a mini-experiment that compares the sensitivities of different metrics to distorted segmentations are provided at the same location.
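The scoring procedure can be sketched as follows. The thresholds come from Table III; the exact transformation is not spelled out in the text, so the mapping below is an assumption: DICE above its threshold is scored as 100 × DICE, and the other metrics are scored linearly from 100 at the best value (0) down to 0 at the threshold, on an assumed [0, 100] scale. This choice is at least consistent with the values reported in Section VI-A (e.g. an ASSD of 0.89 mm corresponds to a score of about 94). All function names are hypothetical.

```python
# Thresholds from Table III (RAVD in percent, ASSD and MSSD in mm).
DICE_THRESHOLD = 0.8
UPPER_THRESHOLDS = {"RAVD": 5.0, "ASSD": 15.0, "MSSD": 60.0}


def metric_scores(dice, ravd, assd, mssd):
    """Map raw metric values to scores in [0, 100] (assumed linear mapping).

    Values outside the threshold range get zero points.
    """
    scores = {"DICE": dice * 100 if dice > DICE_THRESHOLD else 0.0}
    for name, value in (("RAVD", ravd), ("ASSD", assd), ("MSSD", mssd)):
        limit = UPPER_THRESHOLDS[name]
        scores[name] = (1 - value / limit) * 100 if value < limit else 0.0
    return scores


def case_score(dice, ravd, assd, mssd):
    """Score of one test case: the mean of the four metric scores."""
    scores = metric_scores(dice, ravd, assd, mssd)
    return sum(scores.values()) / len(scores)


def team_score(submitted_cases, n_test_cases):
    """Overall score: the average over all test cases.

    submitted_cases holds one (dice, ravd, assd, mssd) tuple per submitted
    case; missing cases contribute zero points to the average.
    """
    total = sum(case_score(*case) for case in submitted_cases)
    return total / n_test_cases


# A case with good overlap but a large maximum surface error.
print(metric_scores(0.98, 1.0, 0.89, 70.0))
print(case_score(0.98, 1.0, 0.89, 70.0))
```

Because each thresholded metric contributes a quarter of the case score, a single metric beyond its threshold (here MSSD) caps the case score at 75 regardless of how good the other three are.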

Team | Details of the method | Training strategy

OvGUMEMoRIAL (P. Ernst, S. Chatterjee, O. Speck, A. Nürnberger)
  Details of the method:
  • Modified Attention U-Net [1], employing soft attention gates and a multiscaled input image pyramid for better feature representation, is used.
  • Parametric ReLU activation is used instead of ReLU, where an extra parameter, i.e. the coefficient of leakage, is learned during training.
  Training strategy:
  • Tversky loss is computed for the four different scaled levels.
  • Adam optimizer is used; training is accomplished in 120 epochs with a batch size of 256.

ISDUE (D. D. Pham, G. Dovletov, J. Pauli)
  Details of the method:
  • The proposed architecture consists of three main modules:
    • Autoencoder net composed of a prior encoder and a decoder;
    • Hourglass net composed of an imitating encoder and a decoder;
    • U-Net module, which is used to enhance the decoder by guiding the decoding process for better localization capabilities.
  Training strategy:
  • The segmentation networks are optimized separately using the Dice loss together with a weighted regularization term.
  • The autoencoder is optimized separately using Dice loss.
  • Adam optimizer with an initial learning rate of 0.001 is used, and 2400 iterations are performed to train each model.

Lachinov (D. Lachinov)
  Details of the method:
  • 3D U-Net, with skip connections between contracting/expanding paths and an exponentially growing number of channels across consecutive resolution levels [26].
  • The encoding path is constructed by a residual network for efficient training.
  • Group normalization [44] is adopted instead of batch normalization [16] (# of groups = 4).
  • Pixel shuffle is used as an upsampling operator.
  Training strategy:
  • The network was trained with the Adam optimizer with a learning rate of 0.001, decaying with a rate of 0.1 at the 7th and 9th epochs.
  • The network is trained with batch size 6 for 10 epochs. Each epoch has 3200 iterations.
  • The loss function employed is Dice loss.

IITKGP-KLIV (R. Sathish, R. Rajan, D. Sheet)
  Details of the method:
  • To achieve multi-modality segmentation using a single framework, a multi-task adversarial learning strategy is employed to train a base segmentation network, SUMNet [31], with batch normalization.
  • Adversarial learning is performed by two auxiliary classifiers, namely C1 and C2, and a discriminator network D.
  Training strategy:
  • The segmentation network and C2 are trained using cross-entropy loss, while the discriminator D and auxiliary classifier C1 are trained by binary cross-entropy loss.
  • Adam optimizer is used. Input is the combination of all four modalities, i.e. CT, MRI T1-DUAL in and oppose phases, and MRI T2-SPIR.

METU_MMLAB (S. Özkan, B. Baydar, G. B. Akar)
  Details of the method:
  • A U-Net variation and a Conditional Adversarial Network (CAN) are introduced.
  • Batch normalization is performed before convolution to prevent vanishing gradients and increase selectivity.
  • Parametric ReLU is used to preserve negative values using a trainable leakage parameter.
  Training strategy:
  • To improve the performance around the edges, a CAN is employed during training (not as a post-process operation).
  • This introduces a new loss function to the system, which regularizes the parameters for sharper edge responses.

PKDIA (P.-H. Conze)
  Details of the method:
  • Conditional Generative Adversarial Networks (cGANs): the generator is built by cascaded pre-trained encoder-decoder (ED) networks extending the standard U-Net (sU-Net) [36] (VGG19, following [5]), with 64 channels (instead of 32 for sU-Net) generated by the first convolutional layer. After each max-pooling, the channel number doubles until 512 (256 for sU-Net). Max-pooling is followed by 4 consecutive conv. layers instead of 2. The auto-context paradigm is adopted by cascading two EDs [45]: the output of the first is used as features for the second.
  Training strategy:
  • Adam optimizer is used.
  • Fuzzy Dice score is employed as the loss function.
  • Batch size was set to 3 for CT and 5 for MR scans.

MedianCHAOS (V. Groza)
  Details of the method:
  • An averaged ensemble of five different networks is used. The first one is DualTail-Net, which is composed of an encoder, a central block, and 2 dependent decoders.
  • The other four networks are U-Net variants, i.e. TernausNet (U-Net with VGG11 backbone [15]), LinkNet34 [37], and two with ResNet-50 and SE-ResNet50.
  Training strategy:
  • The training for each network was performed with Adam.
  • DualTail-Net and LinkNet34 were trained with soft Dice loss, and the other three networks were trained with the combined loss: 0.5*soft Dice + 0.5*BCE (binary cross-entropy).

Mountain (Shuo Han)
  Details of the method:
  • 3D network adopting the U-Net variant in [11], which differs from sU-Net [36] by: 1) a pre-activation residual block in each scale level at the encoder, 2) convolutions with stride 2 to reduce the spatial size, 3) instance normalization.
  • Two nets, i.e. NET1 and NET2, adopting [11] with different channels and levels. NET1 locates the organ and outputs a mask for NET2, which performs finer segmentation.
  Training strategy:
  • Adam optimizer is used.
  • Dice coefficient was used as the loss function. Batch size was set to 1.

CIRMPerkonigg (M. Perkonigg)
  Details of the method:
  • For joint training with all modalities, the IVD-Net [7] is used with a number of modifications: 1) dense connections between the encoder paths of IVD-Net are not used since no improvement is achieved, 2) training images are split.
  • Moreover, residual convolutional blocks [12] are used.
  Training strategy:
  • Modality Dropout [28] is used as the regularization technique to decrease over-fitting on certain modalities.
  • Training is done by using the Adam optimizer with a learning rate of 0.001 for 75 epochs.

nnU-Net (F. Isensee, K. H. Maier-Hein)
  Details of the method:
  • An internal variant of nnU-Net [17], which is the winner of the Medical Segmentation Decathlon (MSD) in 2018 [38], is used.
  • Ensemble of five 3D U-Nets (“3d_fullres” configuration), which originate from cross-validation on the training cases. An ensemble of T1 in and oppose phases was used.
  Training strategy:
  • T1 in and oppose phases are treated as separate training examples, resulting in a total of 60 training examples for the tasks.
  • Task 3 is a subset of Task 5, so training was done only once and the predictions for Task 3 were generated by isolating the liver.

TABLE IV: Participating methods
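
Several entries in Table IV rely on Dice-based losses, and MedianCHAOS reports a combined loss of 0.5*soft Dice + 0.5*BCE. The sketch below illustrates one common formulation of such a loss on flattened lists of predicted probabilities and binary targets. It is an assumption-laden illustration: the smoothing constants, the clamping, and the function names are hypothetical, and the participating teams' actual implementations are not given in the text.

```python
import math


def soft_dice_loss(probs, targets, eps=1e-6):
    """1 - soft Dice, where the overlap is computed on probabilities."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1 - (2 * intersection + eps) / (denominator + eps)


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped away from 0 and 1."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(probs)


def combined_loss(probs, targets, dice_weight=0.5, bce_weight=0.5):
    """Weighted sum of soft Dice loss and BCE (0.5 / 0.5 as listed in Table IV)."""
    return (dice_weight * soft_dice_loss(probs, targets)
            + bce_weight * bce_loss(probs, targets))


targets = [1, 0, 1]
print(combined_loss([0.9, 0.1, 0.9], targets))  # confident and correct
print(combined_loss([0.5, 0.5, 0.5], targets))  # uninformative prediction
```

The Dice term reacts to region overlap and is less affected by class imbalance, whereas the BCE term provides a per-voxel gradient; combining them is a common compromise between the two behaviors.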

V Participating Methods

The participating methods are summarized in Tables IV and VI. Detailed descriptions with figures can be found in the challenge proceedings; here, an overview is given.

The majority of the applied methods (i.e. all except IITKGP-KLIV) used variations of U-Net [36]. This seems to be a typical situation, as the corresponding architecture dominates most of the recent DL-based studies. Among all, two of them (i.e. MedianCHAOS and nnU-Net) rely on ensembles, which use multiple models and combine their results.

To compare the automatic DL methods with semi-automatic ones, interactive methods including both traditional iterative models and more recent techniques are employed from our previous work [20]. In this respect, the results also compare the accuracy and repeatability of emerging automatic DL algorithms with those of well-established interactive methods, which are applied by a team of imaging scientists and radiologists through two dedicated viewers, Slicer [22] and exploreDICOM [8].

Submission numbers | Task 1 | Task 2 | Task 3 | Task 4 | Task 5
ISBI 2019 | 5 | 14 | 7 | 4 | 5
Online | 20 | 263 | 42 | 15 | 71
The most by the same team (ISBI 2019) | 0 | 5 | 0 | 0 | 0
The most by the same team (Online) | 3 | 8 | 8 | 2 | 7
TABLE V: Submission statistics

VI Results

The training dataset was published approximately three months before ISBI 2019. The testing dataset was given 24 hours before the challenge session. The submissions were evaluated during the conference, and the winners were announced. After ISBI 2019, the training and testing (only DICOM images) datasets were published on the website [21] and the online submission system was activated on the challenge website.

Team | Pre-process | Data augmentation
OvGUMEMoRIAL | TRI*. Inference: full-sized. | -
ISDUE | TRI (96,128,128) | Random translate and rotate
Lachinov | RS**, z-score normalization | Random ROI crop, mirror X-Y, transpose X-Y, WL-WW***
IITKGP-KLIV | TRI, whitening. Additional class for body. | -
METUMMLAB | Min-max normalization for CT | -
PKDIA | TRI (separate sizes for MR and CT) | Random scale, rotate, shear and shift
MedianCHAOS | LUT [-240,160] HU range, normalization. | -
Mountain | RS, zero padding. TRI. Rigid register MR. | Random rotate, scale, elastic deformation
CIRMPerkonigg | Normalization to zero mean unit variance. | 2D affine and elastic transforms, histogram shift, flip and adding Gaussian noise.
nnU-Net | Normalization to zero mean unit variance, RS | Add Gaussian noise / blur, rotate, scale, WL-WW, simulated low resolution, Gamma, mirroring

Post-process (all teams): All teams thresholded their outputs by 0.5. PKDIA, METUMMLAB, and Mountain used connected component analysis for selecting/eliminating some of the model outputs. ISDUE used bicubic interpolation for refinement.

*TRI = Training with resized images; **RS = Resampling; ***WL-WW = Window Level - Window Width
TABLE VI: Pre-, post-processing and data augmentation operations
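
A few of the operations listed in Table VI are simple enough to sketch directly: intensity windowing to a fixed HU range followed by min-max scaling (as in the LUT [-240,160] entry), normalization to zero mean and unit variance, and thresholding of network outputs at 0.5. The snippet below works on flat lists of values and is only a hedged illustration; the function names are hypothetical, and details such as whether the threshold comparison is inclusive are assumptions.

```python
def window_and_scale(hu_values, low=-240.0, high=160.0):
    """Clip intensities to the [low, high] HU window and rescale to [0, 1]."""
    return [(min(max(v, low), high) - low) / (high - low) for v in hu_values]


def z_score(values):
    """Normalize to zero mean and unit variance (population variance)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    return [(v - mean) / std if std > 0 else 0.0 for v in values]


def binarize(probabilities, threshold=0.5):
    """Turn per-voxel probabilities into a binary mask (>= is an assumption)."""
    return [1 if p >= threshold else 0 for p in probabilities]


print(window_and_scale([-1000, -240, -40, 160, 500]))
print(z_score([1.0, 2.0, 3.0]))
print(binarize([0.2, 0.5, 0.9]))
```

Windowing discards intensity variation outside the soft-tissue range before scaling, whereas z-score normalization keeps the full range and only standardizes it, which may explain why the former appears for CT and the latter mostly for MR in the table.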

There exist two separate leaderboards at the challenge website, one for the conference session and another for post-conference online submissions. In this section, we present the majority of the results from the conference participants and the two most significant post-conference results collected among the online submissions. To be specific, the METU_MMLAB and nnU-Net results belong to online submissions, while the others are from the conference session. Each method is assigned a unique color code as shown in the figures and Table VII.

Box plots of all results for each task are presented separately in Fig. 1. Also, scores on each testing case are shown in Fig. 2 for all tasks. As expected, the tasks that received the highest number of submissions and scores were the ones focusing on the segmentation of a single organ from a single modality. Thus, the vast majority of the submissions were for liver segmentation from CT images (Task 2), followed by liver segmentation from MR images (Task 3). Accordingly, in the following subsections, the results are presented in the order of performance/participation (Table V), i.e. from the task having the highest submissions and scores to the one having the lowest. In this way, the segmentation from cross- and multi-modality/organ concepts (Tasks 1 and 4) is discussed in the light of the performances obtained for more conventional approaches (Tasks 2, 3 and 5).

Fig. 1: Box plot of the methods’ score for (a) Task 1, (b) Task 2, (c) Task 3, (d) Task 4, and (e) Task 5 on testing data. White diamonds represent the mean values of the scores. Solid vertical lines inside of the boxes represent medians. Separate dots show scores of each individual case.
Fig. 2: Distribution of the methods’ scores for individual cases on testing data.

VI-A CT Liver Segmentation (Task 2)

This task includes one of the most studied cases and a very mature field of abdominal segmentation. Therefore, it provides a good opportunity to test the effectiveness of the participating models compared to the existing approaches. Although the provided datasets only include healthy organs, the injection of contrast media creates several additional challenges, as described in Section III-B. Nevertheless, the highest scores of the challenge were obtained in this task (Fig. 1b).

The on-site winner was MedianCHAOS with a score of 80.45±8.61 and the online winner is PKDIA with 82.46±8.47. Since MedianCHAOS is an ensemble strategy, the performances of its sub-networks are illustrated in Fig. 2.a. When individual metrics are analyzed, DICE performances seem to be outstanding (i.e. 0.98±0.00) for both winners (i.e. scores of 97.79±0.43 for PKDIA and 97.55±0.42 for MedianCHAOS). Similarly, ASSD performances have a very high mean score and small variance (i.e. 0.89±0.36 [score: 94.06±2.37] for PKDIA and 0.90±0.24 [score: 94.02±1.6] for MedianCHAOS). On the other hand, RAVD and MSSD scores are significantly lower, which reduces the overall performance. In fact, this outcome holds for all tasks and participating methods.
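The gap between raw metric values and their scores can be illustrated with a minimal sketch of a threshold-based scoring rule. The function names and the cut-off values below (5% for RAVD, 60 mm for MSSD) are assumptions made for illustration and are not a restatement of the official evaluation code. Under these assumptions, an error at or above the cut-off earns zero points, so a few poor cases pull the mean score down sharply, and the mean of the per-case scores differs from the score of the mean error.

```python
# Hypothetical cut-offs, assumed for illustration only.
ASSUMED_THRESHOLDS = {"RAVD": 5.0, "MSSD": 60.0}  # percent, millimetres


def error_to_score(value, threshold):
    """Map an error metric (lower is better) to a 0-100 score.

    Zero error gives 100; an error at or beyond the threshold gives 0.
    """
    if value >= threshold:
        return 0.0
    return 100.0 * (1.0 - value / threshold)


def mean_case_score(values, threshold):
    """Average the per-case scores, as opposed to scoring the average error."""
    scores = [error_to_score(v, threshold) for v in values]
    return sum(scores) / len(scores)


# Three made-up per-case RAVD values (percent); the last one is a failure case.
ravd_cases = [2.0, 4.0, 30.0]
mean_ravd = sum(ravd_cases) / len(ravd_cases)

# Score of the mean error versus mean of the per-case scores.
print(error_to_score(mean_ravd, ASSUMED_THRESHOLDS["RAVD"]))
print(mean_case_score(ravd_cases, ASSUMED_THRESHOLDS["RAVD"]))
```

With the assumed 5% cut-off, the mean RAVD of 1.32 listed for PKDIA in Task 2 of Table VII maps to 73.6, which agrees with the tabulated RAVD score. This agreement is suggestive only and does not confirm the exact rule used by the organizers.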

Regarding the semi-automatic approaches in [20], the best three received scores of 72.8 (active contours, with a mean interaction time (MIT) of 25 minutes), 68.1 (robust static segmenter, with an MIT of 17 minutes), and 62.3 (watershed, with an MIT of 8 minutes). Thus, the successful designs among the participating deep learning-based automatic segmentation algorithms have outperformed the interactive approaches by a large margin. This performance almost reaches the inter-expert level for volumetric analysis and average surface differences. However, there is still a need for improvement considering the metrics related to maximum error margins (i.e. RAVD and MSSD). An important drawback of the deep approaches is that they might completely fail and generate unreasonably low scores for particular cases.
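To make the distinction between overlap and volumetric measures concrete, the following sketch computes DICE and RAVD from two sets of voxel coordinates. It uses the standard textbook definitions; the function names and the toy masks are hypothetical, and the challenge's own evaluation code may differ in details such as the handling of empty masks.

```python
def dice(segmentation, reference):
    """DICE overlap between two voxel sets (1.0 means identical masks)."""
    seg, ref = set(segmentation), set(reference)
    if not seg and not ref:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(seg & ref) / (len(seg) + len(ref))


def ravd(segmentation, reference):
    """Relative absolute volume difference in percent (0.0 means equal volume)."""
    seg, ref = set(segmentation), set(reference)
    if not ref:
        raise ValueError("reference mask must not be empty")
    return 100.0 * abs(len(seg) - len(ref)) / len(ref)


# Toy 2D example: a 10x10 square and the same square shifted by one voxel.
reference_mask = {(x, y) for x in range(10) for y in range(10)}
shifted_mask = {(x + 1, y) for x in range(10) for y in range(10)}

print(dice(shifted_mask, reference_mask))
print(ravd(shifted_mask, reference_mask))
```

In this toy case the two masks have the same volume, so RAVD is zero even though the overlap is imperfect. This is one reason why volumetric agreement alone does not imply a correct boundary.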

Regarding the effect of architectural design differences on performance, comparative analyses have been performed through some well-established deep frameworks (i.e. DeepMedic [19] and NiftyNet [9]). These models have been applied with their default parameters and both achieved scores around 70. Thus, considering the participating models that scored below 70, it is safe to conclude that, even after intense research and an extensive literature, new deep architectural designs and parameter tweaking do not necessarily translate into more successful systems.

VI-B MR Liver Segmentation (Task 3)

Segmentation from MR can be considered a more difficult operation than segmentation from CT because CT images have a typical histogram and dynamic range defined by Hounsfield Units (HU), whereas MRI does not have such a standardization. Moreover, the artifacts and other factors in clinical routine cause significant degradation of MR image quality. The on-site winner of this task is PKDIA with a score of 70.71±6.40, which had the most successful results not only for the mean score but also for the distribution of the results (shown in Fig. 2.c and 2.d). Robustness to the deviations in MR data quality is an important factor that affects performance. For instance, CIR_MPerkonigg has the most successful scores for some cases, but could not achieve a high overall score.

The online winner is nnU-Net with 75.10±7.61. When the scores of individual metrics are analyzed for PKDIA and nnU-Net, DICE (i.e. 0.94±0.01 [score: 94.47±1.38] for PKDIA and 0.95±0.01 [score: 95.42±1.32] for nnU-Net) and ASSD (i.e. 1.32±0.83 [score: 91.19±5.55] for nnU-Net and 1.56±0.68 [score: 89.58±4.54] for PKDIA) performances are again extremely good, while RAVD and MSSD scores are significantly lower than the CT results. The reason behind this can also be attributed to the low resolution and higher spacing of the MR data, which cause a higher spatial error for each misclassified pixel/voxel (see Table II). Comparisons with the interactive methods show that they tend to make regional mistakes due to the spatial enlargement strategies. The main challenge for them is to differentiate the outline when the liver is adjacent to isodense structures. On the other hand, automatic methods show much more distributed mistakes all over the liver. Further analysis also revealed that interactive methods seem to make fewer over-segmentations. This is partly related to the iterative parameter adjustment by the operator, which prevents unexpected results. Overall, the participating methods performed equally well with interactive methods if only volumetry metrics are considered. However, the interactive methods seem to outperform deep models on the other measures.
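The effect of voxel spacing on distance-based measures can be sketched with a brute-force maximum symmetric surface distance computed on physical coordinates. The function name, the toy point sets, and the spacing values below are hypothetical and are not taken from Table II; a practical implementation would extract surface voxels from the masks and use a faster nearest-neighbour search.

```python
import math


def max_symmetric_distance(points_a, points_b, spacing):
    """Largest nearest-neighbour distance between two surface point sets, in mm.

    points_* hold voxel indices; spacing holds the voxel size per axis in mm.
    """
    def to_mm(point):
        return tuple(index * size for index, size in zip(point, spacing))

    a_mm = [to_mm(p) for p in points_a]
    b_mm = [to_mm(p) for p in points_b]

    def directed(src, dst):
        # For each source point take the closest target point, then keep the worst case.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a_mm, b_mm), directed(b_mm, a_mm))


reference = [(0, 0, 0), (1, 0, 0)]
predicted = [(0, 0, 0), (1, 0, 1)]  # one point is off by a single slice

# Same voxel-level mistake, two assumed slice spacings (2 mm and 8 mm).
fine = max_symmetric_distance(reference, predicted, spacing=(1.0, 1.0, 2.0))
coarse = max_symmetric_distance(reference, predicted, spacing=(1.0, 1.0, 8.0))
print(fine, coarse)
```

The same one-slice error yields a larger physical distance when the slices are further apart, which is the mechanism suggested above for the lower MSSD scores on MR data.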

VI-C CT-MR Liver Segmentation (Task 1)

This task targets cross-modality learning and involves the usage of CT and MR information together during training. A model that can effectively accomplish cross-modality learning would: 1) help to satisfy the big data needs of deep models by providing more images and 2) reveal common features of the incorporated modalities for an organ. To compare cross-modality learning with individual ones, Fig. 2.a should be compared to Fig. 2.c for CT. Such a comparison clearly reveals that participating models trained only on CT data show significantly better performance than models trained on both modalities. A similar observation can also be made for MR results by comparing Fig. 2.b and Fig. 2.d.

The on-site winner of this task was OvGUMEMoRIAL with a score of 55.78±19.20. Although its DICE performance is quite satisfactory (i.e. 0.88±0.15, corresponding to a score of 83.14±0.43), the other measures cause the low grade. Here, a very interesting observation is that the score of OvGUMEMoRIAL is lower than its score on CT (61.13±19.72) but higher than its score on MR (41.15±21.61). Another interesting observation is that the highest-scoring non-ensemble model, PKDIA, for both Task 2 (CT) and Task 3 (MR), had a significant performance drop in this task. Finally, it is worth pointing out that the online results have reached up to 73, but those results could not be validated (i.e. the participant did not respond) and were achieved after multiple submissions by the same team. In such cases, the possibility of peeking is very strong and therefore these results are not included in the manuscript.

It is important to examine the scores of cases with their distribution across all data. This can help to analyze the generalization capabilities and real-life use of these systems. For example, Fig. 2.a shows a noteworthy situation. The winner of Task 1, OvGUMEMoRIAL, shows lower performance than the second method (ISDUE) in terms of standard deviation. Figures 2.a and 2.b show that the competing algorithms have slightly higher scores on the CT data than on the MR data. However, if we consider the scattering of the individual scores across the data, CT scores have higher variability. This shows that reaching equal generalization for multiple modalities is a challenging task for Convolutional Neural Networks (CNNs).
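The trade-off between mean score and case-to-case variability can be made explicit by ranking the same methods under both criteria. The sketch below uses the Task 1 mean and standard deviation values of the top three methods in Table VII; the coefficient of variation is an illustrative choice of variability measure and is not a metric used in the challenge.

```python
# (mean score, standard deviation) for Task 1, copied from Table VII.
task1 = {
    "OvGUMEMoRIAL": (55.78, 19.20),
    "ISDUE": (55.48, 16.59),
    "PKDIA": (50.66, 23.95),
}


def coefficient_of_variation(mean, std):
    """Standard deviation relative to the mean; lower means more consistent."""
    return std / mean


# Ranking by mean score (higher is better).
by_mean = sorted(task1, key=lambda team: task1[team][0], reverse=True)

# Ranking by relative variability (lower is better).
by_cv = sorted(task1, key=lambda team: coefficient_of_variation(*task1[team]))

print(by_mean)
print(by_cv)
```

Under this illustrative measure, the order of the first two methods is reversed, which mirrors the observation that the winner is less consistent across cases than the runner-up.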

Team Name | Mean Score | DICE | DICE Score | RAVD | RAVD Score | ASSD | ASSD Score | MSSD | MSSD Score

Task 1

OvGUMEMoRIAL | 55.78 ± 19.20 | 0.88 ± 0.15 | 83.14 ± 28.16 | 13.84 ± 30.26 | 24.67 ± 31.15 | 11.86 ± 65.73 | 76.31 ± 21.13 | 57.45 ± 67.52 | 31.29 ± 26.01
ISDUE | 55.48 ± 16.59 | 0.87 ± 0.16 | 83.75 ± 25.53 | 12.29 ± 15.54 | 17.82 ± 30.53 | 5.17 ± 8.65 | 75.10 ± 22.04 | 36.33 ± 21.97 | 44.83 ± 21.78
PKDIA | 50.66 ± 23.95 | 0.85 ± 0.26 | 84.15 ± 28.45 | 6.65 ± 6.83 | 21.66 ± 30.35 | 9.77 ± 23.94 | 75.84 ± 28.76 | 46.56 ± 45.02 | 42.28 ± 27.05
Lachinov | 45.10 ± 21.91 | 0.87 ± 0.13 | 77.83 ± 33.12 | 10.54 ± 14.36 | 21.59 ± 32.65 | 7.74 ± 14.42 | 63.66 ± 31.32 | 83.06 ± 74.13 | 24.30 ± 27.78
METU_MMLAB | 42.54 ± 18.79 | 0.86 ± 0.09 | 75.94 ± 32.32 | 18.01 ± 22.63 | 14.12 ± 25.34 | 8.51 ± 16.73 | 60.36 ± 28.40 | 62.61 ± 51.12 | 24.94 ± 25.26
IITKGP-KLIV | 40.34 ± 20.25 | 0.72 ± 0.31 | 60.64 ± 44.95 | 9.87 ± 16.27 | 24.38 ± 32.20 | 11.85 ± 16.87 | 50.48 ± 37.71 | 95.43 ± 53.17 | 7.22 ± 18.68

Task 2

PKDIA* | 82.46 ± 8.47 | 0.98 ± 0.00 | 97.79 ± 0.43 | 1.32 ± 1.302 | 73.6 ± 26.44 | 0.89 ± 0.36 | 94.06 ± 2.37 | 21.89 ± 13.94 | 64.38 ± 20.17
MedianCHAOS6 | 80.45 ± 8.61 | 0.98 ± 0.00 | 97.55 ± 0.42 | 1.54 ± 1.22 | 69.19 ± 24.47 | 0.90 ± 0.24 | 94.02 ± 1.6 | 23.71 ± 13.66 | 61.02 ± 21.06
OvGUMEMoRIAL | 61.13 ± 19.72 | 0.90 ± 0.21 | 90.18 ± 21.25 | 9x ± 4x | 44.35 ± 35.63 | 4.89 ± 12.05 | 81.03 ± 20.46 | 55.99 ± 38.47 | 28.96 ± 26.73
ISDUE | 55.79 ± 11.91 | 0.91 ± 0.04 | 87.08 ± 20.6 | 13.27 ± 7.61 | 4.16 ± 12.93 | 3.25 ± 1.64 | 78.30 ± 10.96 | 27.99 ± 9.99 | 53.60 ± 15.76
IITKGP-KLIV | 55.35 ± 17.58 | 0.92 ± 0.22 | 91.51 ± 21.54 | 8.36 ± 21.62 | 30.41 ± 27.12 | 27.55 ± 114.04 | 81.97 ± 21.88 | 102.37 ± 110.9 | 17.50 ± 21.79
Lachinov | 39.86 ± 27.90 | 0.83 ± 0.20 | 68 ± 40.45 | 13.91 ± 20.4 | 22.67 ± 33.54 | 11.47 ± 22.34 | 53.28 ± 33.71 | 93.70 ± 79.40 | 15.47 ± 24.15

Task 3

nnU-Net | 75.10 ± 7.61 | 0.95 ± 0.01 | 95.42 ± 1.32 | 2.85 ± 1.55 | 47.92 ± 25.36 | 1.32 ± 0.83 | 91.19 ± 5.55 | 20.85 ± 10.63 | 65.87 ± 15.73
PKDIA | 70.71 ± 6.40 | 0.94 ± 0.01 | 94.47 ± 1.38 | 3.53 ± 2.14 | 41.8 ± 24.85 | 1.56 ± 0.68 | 89.58 ± 4.54 | 26.06 ± 8.20 | 56.99 ± 12.73
Mountain | 60.82 ± 10.94 | 0.92 ± 0.02 | 91.89 ± 1.99 | 5.49 ± 2.77 | 25.97 ± 27.95 | 2.77 ± 1.32 | 81.55 ± 8.82 | 35.21 ± 14.81 | 43.88 ± 17.60
ISDUE | 55.17 ± 20.57 | 0.85 ± 0.19 | 82.08 ± 28.11 | 11.8 ± 15.69 | 24.65 ± 27.58 | 6.13 ± 10.49 | 73.50 ± 25.91 | 40.50 ± 24.45 | 40.45 ± 20.90
CIR_MPerkonigg | 53.60 ± 17.92 | 0.91 ± 0.07 | 84.35 ± 19.83 | 10.69 ± 20.44 | 31.38 ± 25.51 | 3.52 ± 3.05 | 77.42 ± 18.06 | 82.16 ± 50 | 21.27 ± 23.61
METU_MMLAB | 53.15 ± 10.92 | 0.89 ± 0.03 | 81.06 ± 18.76 | 12.64 ± 6.74 | 10.94 ± 15.27 | 3.48 ± 1.97 | 77.03 ± 12.37 | 35.74 ± 14.98 | 43.57 ± 17.88
Lachinov | 50.34 ± 12.22 | 0.90 ± 0.05 | 82.74 ± 18.74 | 8.85 ± 6.15 | 21.04 ± 21.51 | 5.87 ± 5.07 | 68.85 ± 19.21 | 77.74 ± 43.7 | 28.72 ± 15.36
OvGUMEMoRIAL | 41.15 ± 21.61 | 0.81 ± 0.15 | 64.94 ± 37.25 | 49.89 ± 71.57 | 10.12 ± 14.66 | 5.78 ± 4.59 | 64.54 ± 24.43 | 54.47 ± 24.16 | 25.01 ± 20.13
IITKGP-KLIV | 34.69 ± 8.49 | 0.63 ± 0.07 | 46.45 ± 1.44 | 6.09 ± 6.05 | 43.89 ± 27.02 | 13.11 ± 3.65 | 40.66 ± 9.35 | 85.24 ± 23.37 | 7.77 ± 12.81

Task 4

ISDUE | 58.69 ± 18.65 | 0.85 ± 0.21 | 81.36 ± 28.89 | 14.04 ± 18.36 | 14.08 ± 27.3 | 9.81 ± 51.65 | 78.87 ± 25.82 | 37.12 ± 60.17 | 55.95 ± 28.05
PKDIA | 49.63 ± 23.25 | 0.88 ± 0.21 | 85.46 ± 25.52 | 8.43 ± 7.77 | 18.97 ± 29.67 | 6.37 ± 18.96 | 82.09 ± 23.96 | 33.17 ± 38.93 | 56.64 ± 29.11
OvGUMEMoRIAL | 43.15 ± 13.88 | 0.85 ± 0.16 | 79.10 ± 29.51 | 5x ± 5x | 12.07 ± 23.83 | 5.22 ± 12.43 | 73.00 ± 21.83 | 74.09 ± 52.44 | 22.16 ± 26.82
IITKGP-KLIV | 35.33 ± 17.79 | 0.63 ± 0.36 | 50.14 ± 46.58 | 13.51 ± 20.33 | 15.17 ± 27.32 | 16.69 ± 19.87 | 40.46 ± 38.26 | 130.3 ± 67.59 | 8.39 ± 22.29

Task 5

nnU-Net | 72.44 ± 5.05 | 0.95 ± 0.02 | 94.6 ± 1.59 | 5.07 ± 2.57 | 37.17 ± 20.83 | 1.05 ± 0.55 | 92.98 ± 3.69 | 14.87 ± 5.88 | 75.52 ± 8.83
PKDIA | 66.46 ± 5.81 | 0.93 ± 0.02 | 92.97 ± 1.78 | 6.91 ± 3.27 | 28.65 ± 18.05 | 1.43 ± 0.59 | 90.44 ± 3.96 | 20.1 ± 5.90 | 66.71 ± 9.38
Mountain | 60.2 ± 8.69 | 0.90 ± 0.03 | 85.81 ± 10.18 | 8.04 ± 3.97 | 21.53 ± 15.50 | 2.27 ± 0.92 | 84.85 ± 6.11 | 25.57 ± 8.42 | 58.66 ± 10.81
ISDUE | 56.25 ± 19.63 | 0.83 ± 0.23 | 79.52 ± 28.07 | 18.33 ± 27.58 | 12.51 ± 15.14 | 5.82 ± 11.72 | 77.88 ± 26.93 | 32.88 ± 33.38 | 57.05 ± 21.46
METU_MMLAB | 56.01 ± 6.79 | 0.89 ± 0.03 | 80.22 ± 12.37 | 12.44 ± 4.99 | 15.63 ± 13.93 | 3.21 ± 1.39 | 79.19 ± 8.01 | 32.70 ± 9.65 | 49.29 ± 12.69
OvGUMEMoRIAL | 44.34 ± 14.92 | 0.79 ± 0.15 | 64.37 ± 32.19 | 76.64 ± 122.44 | 9.45 ± 11.98 | 4.56 ± 3.15 | 71.11 ± 18.22 | 42.93 ± 17.86 | 39.48 ± 16.67
IITKGP-KLIV | 25.63 ± 5.64 | 0.56 ± 0.06 | 41.91 ± 11.16 | 13.38 ± 11.2 | 11.74 ± 11.08 | 18.7 ± 6.11 | 35.92 ± 8.71 | 114.51 ± 45.63 | 11.65 ± 13.00
* Corrected submission of PKDIA right after the ISBI 2019 conference (i.e. during the challenge, they submitted the same results, but in reversed orientation; therefore, the winner of Task 2 at the conference session is MedianCHAOS6).
TABLE VII: Metric values and corresponding scores of submissions. The given values represent the average of all cases and all organs of the related tasks in the test data (The best results are given in bold).

VI-D Multi-Modal MR Abdominal Organ Segmentation (Task 5)

Task 5 investigates how DL models contribute to the development of more comprehensive computational anatomical models leading to multi-organ related tasks. Deep models have the potential to provide a complete representation of the complex and flexible abdominal anatomy by incorporating inter-organ relations through their internal hierarchical feature extraction process.

The on-site winner was PKDIA with a score of 66.46±5.81 and the online winner is nnU-Net with 72.44±5.05. When the scores of individual metrics are analyzed in comparison to Task 3, the DICE performances seem to remain almost the same for nnU-Net and PKDIA. This is a significant outcome as all four organs are segmented instead of a single one. It is also worth pointing out that the third-place model (i.e. Mountain) has almost exactly the same overall score for Tasks 3 and 5. The same observation is also valid for the standard deviation of these models. Considering RAVD, the performance decrease seems to be higher compared to DICE. These reduced DICE and RAVD performances are partially compensated by better MSSD and ASSD performances. On the other hand, this increase might not be directly related to multi-organ segmentation. One should keep in mind that generally the most complex abdominal organ for segmentation is the liver (Task 3) and the other organs in Task 5 can be considered relatively easier to analyze.
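The per-method change in overall score between Task 3 and Task 5 can be read off Table VII with a few lines of code. The sketch below only restates the tabulated mean scores of the three methods discussed above; the variable names are hypothetical.

```python
# Mean overall scores copied from Table VII.
task3_scores = {"nnU-Net": 75.10, "PKDIA": 70.71, "Mountain": 60.82}  # MR liver only
task5_scores = {"nnU-Net": 72.44, "PKDIA": 66.46, "Mountain": 60.2}   # four MR organs

# Change in mean score when moving from liver-only to multi-organ segmentation.
deltas = {
    team: round(task5_scores[team] - task3_scores[team], 2)
    for team in task3_scores
}

# The method whose overall score changes least between the two tasks.
most_stable = min(deltas, key=lambda team: abs(deltas[team]))

for team, delta in deltas.items():
    print(f"{team}: {delta:+.2f}")
print("smallest change:", most_stable)
```

All three differences are small relative to the reported standard deviations, and Mountain shows the smallest change, in line with the observation above.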

VI-E CT-MR Abdominal Organ Segmentation (Task 4)

This task covers segmentation of both the liver in CT and four abdominal organs in MRI data. Hence, it can be considered as the most difficult task since it contains both cross-modality learning and multiple organ segmentation. Therefore, it is not surprising that it has the lowest attendance and scores.

The on-site winner was ISDUE with a score of 58.69±18.65. Fig. 2.e-f shows that their solution had a consistent and high-performance distribution in both CT and MR data. It can be thought that the two convolutional encoders in their system boost performance on cross-modality data. These encoders are able to compress information about anatomy. On the other hand, PKDIA also shows promising performance with a score of 49.63±23.25. Despite their success on the MRI sets, their CT performance can be considered unsatisfactory, similar to their situation in Task 1. This suggests that the CNN may not have been trained efficiently. The encoder part of their solution uses transfer learning, and the pre-trained weights approach might not be successful on multiple modalities. The OvGUMEMoRIAL team achieved third position with an average score and they have a balanced performance on both modalities. Their method can be considered successful in terms of generalization.

Together with the outcomes of Tasks 1 and 5, it is shown that with current strategies and architectures, CNNs have better segmentation performance on single-modality tasks. This might be considered an expected outcome because the success of CNNs is very dependent on the consistency and homogeneity of the data. Using multiple modalities creates a high variance in the data even though all data were normalized. On the other hand, the results also revealed that CNNs have good potential for cross-modality tasks if appropriate models are constructed. This potential was not that clear before the development of deep learning strategies for segmentation.

VII Conclusions and Discussions

In this paper, we presented the CHAOS abdominal healthy organ segmentation benchmark. We generated an unpaired multi-modality (CT-MR), multi-sequence (T1 in/oppose, T2) public dataset for five tasks and evaluated a significant number of well-established and state-of-the-art segmentation methods. The evaluation is performed using four metrics. Our results indicate various important outcomes. First, deep learning-based automatic methods outperformed interactive semi-automatic strategies for CT liver segmentation. They have reached inter-expert variability for DICE and volumetry, but still need improvement in distance-based measures, which are critical for determining surgical error margins. Second, considering MR liver segmentation, the participating deep models have performed almost equally well with interactive ones for DICE, but lack performance for distance-based measures. Third, when all four abdominal organs are considered, the performance of deep models gets better compared to liver-only segmentation. However, it is unclear if this improvement can be attributed to multi-tasking since the liver can be considered the most complex organ to segment among the ones in the challenge. Fourth, cross-modality (CT-MR) learning still proved more challenging than individual training. Last, but not least, multi-organ cross-modality segmentation remains the most challenging problem until appropriate ways are developed to take advantage of the multi-tasking properties of deep models and the bigger data advantage of cross-modal medical data. Such complicated tasks would benefit from spatial priors and global topological or shape representations in their loss functions, as employed by some of the suggested models.

Given the outstanding results for single-modality tasks and the fact that the resulting volumes will be visualized by a radiologist-surgeon team prior to clinical operations in the context of the clinical workflow, it can be concluded that minimal user interaction, especially in the post-processing phase, would easily bring the single-modality results to clinically acceptable levels. This would require not only a software implementation of the successful participating methods, but also their integration into an adequate workstation/DICOM viewer that is easily accessible in the daily workflow of clinicians.

Except for one, all teams involved in this challenge have used a modification of U-Net as a primary classifier or as a support system. However, the high variance between the reported scores, even though the teams use the same baseline CNN structure, shows that the interpretability of model performance still relies on many parameters including architectural design, implementation, parametric modifications, and tuning. Although several common algorithmic properties can be derived for high-scoring models, an interpretation and/or explanation of why a particular model performs well or not is far from trivial when even one of these factors is poorly determined. As discussed in the previous challenges, such an analysis is almost impossible to perform on a heterogeneous set of models developed by different teams and in different programming environments. Moreover, the selection of evaluation metrics, their transformation to scores, and the calculation of the final scores might have a significant impact on the reported performances.

Since the start of the competition, the most popular task, Task 2, has received more than 200 submissions in eight months. Quantitative analyses on Task 2 show that CNNs for segmentation of the liver from CT have achieved great success. Supporting the quantitative analyses, our qualitative observations reveal that the top methods can be used in real-life solutions with little post-processing effort. That is why we believe that the solutions of Task 2 have reached saturation. Of course, future results may have better segmentation performance than the methods we review here. However, the impact and the significance of these slight improvements may not justify the effort in developing them. We suggest that researchers focus more on the implementation of their solutions in real-world applications instead of pushing hard to gain minimal score upgrades. They should try to reduce the computational cost, increase generalization, attach importance to repeatability and reproducibility, and make the solutions easy to implement.


The organizers would like to thank the whole ISBI 2019 team, especially Ivana Isgum and Tom Vercauteren in the challenge committee for their guidance and support. We express gratitude to all authors and supporting organizations of the platform for hosting our challenge. We thank Esranur Kazaz, Umut Baran Ekinci, Ece Köse, David Völgyes, and Javier Coronel for their contributions. Last but not least, our special thanks go to Ludmila I. Kuncheva for her valuable contributions.


  • [1] N. Abraham and N. M. Khan (2019) A novel focal tversky loss function with improved attention u-net for lesion segmentation. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 683–687.
  • [2] N. Ayache and J. Duncan (2016-10) 20th anniversary of the medical image analysis journal (MedIA). Medical Image Analysis 33, pp. 1–3.
  • [3] P. Bilic, P. F. Christ, E. Vorontsov, G. Chlebus, H. Chen, Q. Dou, C. Fu, X. Han, P. Heng, J. Hesser, et al. (2019) The liver tumor segmentation benchmark (lits). arXiv preprint arXiv:1901.04056.
  • [4] J. J. Cerrolaza, M. L. Picazo, L. Humbert, Y. Sato, D. Rueckert, M. Á. G. Ballester, and M. G. Linguraru (2019) Computational anatomy for multi-organ analysis in medical imaging: a review. Medical Image Analysis 56, pp. 44–67. ISSN 1361-8415.
  • [5] P. Conze, C. Pons, V. Burdin, F. T. Sheehan, and S. Brochard (2019) Deep convolutional encoder-decoders for deltoid segmentation using healthy versus pathological learning transferability. In IEEE International Symposium on Biomedical Imaging, Venice, Italy, pp. 36–39.
  • [6] X. Deng and G. Du (2008) 3D segmentation in the clinic: a grand challenge ii-liver tumor segmentation. In MICCAI workshop.
  • [7] J. Dolz, C. Desrosiers, and I. B. Ayed (2018) IVD-net: intervertebral disc localization and segmentation in mri with a multi-modal unet. In International Workshop and Challenge on Computational Methods and Clinical Applications for Spine Imaging, pp. 130–143.
  • [8] F. Fischer, M. Alper Selver, W. Hillen, and C. Guzelis (2010-07) Integrating segmentation methods from different tools into a visualization program using an object-based plug-in interface. IEEE Transactions on Information Technology in Biomedicine 14 (4), pp. 923–934. ISSN 1558-0032.
  • [9] E. Gibson, W. Li, C. Sudre, L. Fidon, D. I. Shakir, G. Wang, Z. Eaton-Rosen, R. Gray, T. Doel, Y. Hu, T. Whyntie, P. Nachev, M. Modat, D. C. Barratt, S. Ourselin, M. J. Cardoso, and T. Vercauteren (2018) NiftyNet: a deep-learning platform for medical imaging. Computer Methods and Programs in Biomedicine 158, pp. 113–122. ISSN 0169-2607.
  • [10] J. Guinney, T. Wang, T. D. Laajala, K. K. Winner, J. C. Bare, E. C. Neto, S. A. Khan, G. Peddinti, A. Airola, T. Pahikkala, et al. (2017) Prediction of overall survival for patients with metastatic castration-resistant prostate cancer: development of a prognostic model through a crowdsourced challenge with open clinical trial data. The Lancet Oncology 18 (1), pp. 132–142.
  • [11] S. Han, Y. He, A. Carass, S. H. Ying, and J. L. Prince (2019) Cerebellum parcellation with convolutional neural networks. In Medical Imaging 2019: Image Processing, Vol. 10949, pp. 109490K.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
  • [13] T. Heimann, B. Van Ginneken, M. A. Styner, Y. Arzhaeva, V. Aurich, C. Bauer, A. Beck, C. Becker, R. Beichel, G. Bekes, et al. (2009) Comparison and evaluation of methods for liver segmentation from ct datasets. IEEE transactions on medical imaging 28 (8), pp. 1251–1265.
  • [14] Y. Hirokawa, H. Isoda, Y. S. Maetani, S. Arizono, K. Shimada, and K. Togashi (2008) MRI artifact reduction and quality improvement in the upper abdomen with propeller and prospective acquisition correction (pace) technique. American Journal of Roentgenology 191 (4), pp. 1154–1158.
  • [15] V. Iglovikov and A. Shvets (2018) Ternausnet: u-net with vgg11 encoder pre-trained on imagenet for image segmentation. arXiv preprint arXiv:1801.05746.
  • [16] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37, pp. 448–456.
  • [17] F. Isensee, J. Petersen, S. A. A. Kohl, P. F. Jäger, and K. H. Maier-Hein (2019) NnU-net: breaking the spell on successful medical image segmentation. CoRR abs/1904.08128.
  • [18] O. Jimenez-del-Toro, H. Müller, M. Krenn, K. Gruenberg, A. A. Taha, M. Winterstein, I. Eggel, A. Foncubierta-Rodríguez, O. Goksel, A. Jakab, et al. (2016) Cloud-based evaluation of anatomical structure segmentation and landmark detection algorithms: visceral anatomy benchmarks. IEEE transactions on medical imaging 35 (11), pp. 2459–2475.
  • [19] K. Kamnitsas, C. Ledig, V. F.J. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker (2017-02) Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Medical Image Analysis 36, pp. 61–78. ISSN 1361-8415.
  • [20] A. E. Kavur, N. S. Gezer, M. Barış, Y. Şahin, S. Özkan, B. Baydar, U. Yüksel, Ç. Kılıkçıer, Ş. Olut, G. Bozdağı Akar, G. Ünal, O. Dicle, and M. A. Selver (2020-01) Comparison of semi-automatic and deep learning based automatic methods for liver segmentation in living liver transplant donors. Diagnostic and Interventional Radiology 26, pp. 11–21.
  • [21] A. E. Kavur, M. A. Selver, O. Dicle, M. Barış, and N. S. Gezer (2019-04-11) (Website). Accessed: 2019-04-11.
  • [22] R. Kikinis, S. D. Pieper, and K. G. Vosburgh (2014) 3D slicer: a platform for subject-specific image analysis, visualization, and clinical support. In Intraoperative Imaging and Image-Guided Therapy, F. A. Jolesz (Ed.), pp. 277–289. ISBN 978-1-4614-7657-3.
  • [23] M. Kistler, S. Bonaretti, M. Pfahrer, R. Niklaus, and P. Büchler (2013) The virtual skeleton database: an open access repository for biomedical research and collaboration. Journal of medical Internet research 15 (11), pp. e245.
  • [24] KiTS19 challenge. Accessed: 2019-07-08.
  • [25] M. Kozubek (2016) Challenges and benchmarks in bioimage analysis. In Focus on Bio-Image Informatics, pp. 231–262.
  • [26] D. Lachinov (2019) Segmentation of thoracic organs using pixel shuffle. In Proceedings of the 2019 Challenge on Segmentation of THoracic Organs at Risk in CT Images, SegTHOR@ISBI 2019, April 8, 2019.
  • [27] A. N. Langville and C. D. Meyer (2013) Who’s #1? : the science of rating and ranking. pp. 247. ISBN 069116231X.
  • [28] F. Li, N. Neverova, C. Wolf, and G. Taylor (2016) Modout: learning to fuse modalities via stochastic regularization. Journal of Computational Vision and Imaging Systems 2 (1).
  • [29] L. Maier-Hein, M. Eisenmann, A. Reinke, S. Onogur, M. Stankovic, P. Scholz, T. Arbel, H. Bogunovic, A. P. Bradley, A. Carass, C. Feldmann, A. F. Frangi, P. M. Full, B. van Ginneken, A. Hanbury, K. Honauer, M. Kozubek, B. A. Landman, K. März, O. Maier, K. Maier-Hein, B. H. Menze, H. Müller, P. F. Neher, W. Niessen, N. Rajpoot, G. C. Sharp, K. Sirinukunwattana, S. Speidel, C. Stock, D. Stoyanov, A. A. Taha, F. van der Sommen, C. W. Wang, M. A. Weber, G. Zheng, P. Jannin, and A. Kopp-Schneider (2018-12) Why rankings of biomedical image analysis competitions should be interpreted with care. Nature Communications 9 (1), pp. 5217. ISSN 20411723.
  • [30] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, et al. (2014) The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging 34 (10), pp. 1993–2024.
  • [31] S. Nandamuri, D. China, P. Mitra, and D. Sheet (2019) SUMNet: fully convolutional model for fast segmentation of anatomical structures in ultrasound volumes. arXiv preprint arXiv:1901.06920.
  • [32] PAIP 2019 challenge. Accessed: 2019-07-08.
  • [33] L. M. Prevedello, S. S. Halabi, G. Shih, C. C. Wu, M. D. Kohli, F. H. Chokshi, B. J. Erickson, J. Kalpathy-Cramer, K. P. Andriole, and A. E. Flanders (2019) Challenges related to artificial intelligence research in medical imaging and the importance of image analysis competitions. Radiology: Artificial Intelligence 1 (1), pp. e180031.
  • [34] A. Reinke, M. Eisenmann, S. Onogur, M. Stankovic, P. Scholz, P. M. Full, H. Bogunovic, B. A. Landman, O. Maier, B. Menze, et al. (2018) How to exploit weaknesses in biomedical challenge design and organization. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 388–395.
  • [35] A. Reinke, S. Onogur, M. Stankovic, P. Scholz, T. Arbel, H. Bogunovic, A. P. Bradley, A. Carass, C. Feldmann, A. F. Frangi, et al. (2018) Is the winner really the best? a critical analysis of common research practice in biomedical image analysis competitions. arXiv preprint arXiv:1806.02051.
  • [36] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
  • [37] A. A. Shvets, A. Rakhlin, A. A. Kalinin, and V. I. Iglovikov (2018) Automatic instrument segmentation in robot-assisted surgery using deep learning. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 624–628.
  • [38] A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. van Ginneken, A. Kopp-Schneider, B. A. Landman, G. Litjens, B. Menze, et al. (2019) A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063.
  • [39] J. Staal, M. D. Abràmoff, M. Niemeijer, M. A. Viergever, and B. Van Ginneken (2004) Ridge-based vessel segmentation in color images of the retina. IEEE transactions on medical imaging 23 (4), pp. 501–509.
  • [40] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2017) Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6924–6932.
  • [41] V. V. Valindria, N. Pawlowski, M. Rajchl, I. Lavdas, E. O. Aboagye, A. G. Rockall, D. Rueckert, and B. Glocker (2018-03) Multi-modal learning from unpaired images: application to multi-organ segmentation in ct and mri. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 547–556.
  • [42] B. van Ginneken and S. Kerkstra (1999) Grand challenges in biomedical image analysis. Accessed: 2019-07-07.
  • [43] B. Van Ginneken, T. Heimann, and M. Styner (2007) 3D segmentation in the clinic: a grand challenge. 3D segmentation in the clinic: a grand challenge, pp. 7–15.
  • [44] Y. Wu and K. He (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
  • [45] Y. Yan, P.-H. Conze, E. Decencière, M. Lamard, G. Quellec, B. Cochener, and G. Coatrieux (2019) Cascaded multi-scale convolutional encoder-decoders for breast mass segmentation in high-resolution mammograms. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Berlin, Germany, pp. 6738–6741. ISSN 1557-170X.
  • [46] V. Yeghiazaryan and I. Voiculescu (2015) An overview of current evaluation methods used in medical image segmentation. Technical Report RR-15-08, Department of Computer Science, Oxford, UK.