In the last decade, medical imaging and image processing benchmarks have become the standard strategy for comparing the performance of different approaches on clinically important tasks. These benchmarks have gained a particularly important role in the analysis of learning-based systems by enabling the use of the same dataset for training and testing. Challenges built on these benchmarks play a prominent role in reporting state-of-the-art results in a structured way. In this respect, benchmarks establish standard datasets, evaluation strategies, fusion possibilities (e.g., ensembles), and (un)resolved difficulties related to the specific biomedical image processing task(s) being tested. An extensive website, grand-challenge.org, has been designed for hosting challenges related to medical image segmentation and currently includes around 200 challenges.
Comprehensive studies of biomedical image analysis challenges reveal that the construction of the datasets, inter- and intra-observer variations in ground truth generation, and the evaluation criteria might prevent such events from realizing their true potential. Suggestions, caveats, and roadmaps are provided by reviews [29, 34] to improve the challenges.
Table I: Challenges related to abdominal organs.

| Challenge | Task(s) | Structure (Modality) | Organization and year |
| --- | --- | --- | --- |
| SLIVER07 | Single model segmentation | Liver (CT) | MICCAI 2007, Australia |
| LTSC08 | Single model segmentation | Liver tumor (CT) | MICCAI 2008, USA |
| Shape 2014 | Building organ model | Liver (CT) | Delémont, Switzerland |
| Shape 2015 | Completing partial segmentation | Liver (CT) | Delémont, Switzerland |
| Anatomy3 | Multi-model segmentation | Kidney, urinary bladder, gallbladder, spleen, liver, and pancreas (CT and MRI for all organs) | VISCERAL Consortium, 2014 |
| LiTS | Single model segmentation | Liver and liver tumor (CT) | ISBI 2017, Australia; MICCAI 2017, Canada |
| Pancreatic Cancer Survival Prediction | Quantitative assessment of cancer | Pancreas (CT) | MICCAI 2018, Spain |
| MSD | Multi-model segmentation | Liver (CT), liver tumor (CT), spleen (CT), hepatic vessels in the liver (CT), pancreas and pancreas tumor (CT) | MICCAI 2018, Spain |
| KiTS19 | Single model segmentation | Kidney and kidney tumor (CT) | MICCAI 2019, China |
| PAIP 2019 | Detection | Liver cancer (Whole-slide images) | MICCAI 2019, China |
| CHAOS | Multi-model segmentation | Liver, kidney(s), spleen (CT, MRI for all organs) | ISBI 2019, Italy |
Considering the dominance of machine learning (ML) approaches, two main points are continuously emphasized: 1) recognition of the current roadblocks in applying ML to medical imaging, and 2) increasing the dialogue between radiologists and data scientists. Accordingly, challenges are either continuously updated, repeated after some time, or new ones with similar focuses are organized to overcome the pitfalls and shortcomings of the existing ones.
A detailed literature review of challenges related to abdominal organs (see Section II) revealed that the existing challenges in the field are significantly dominated by CT scans and tumor/lesion classification tasks. Up to now, there have only been a few benchmarks containing abdominal MRI series (Table I). Although this situation was typical for the last decades, emerging MRI technology makes it the preferred modality for further and more detailed analysis of the abdomen. The remarkable developments in MRI technology in terms of resolution, dynamic range, and speed enable joint analyses of these modalities.
To gauge the current state-of-the-art in automated abdominal segmentation and observe the performance of various approaches on different tasks such as cross-modality learning and multi-modal segmentation, we organized Combined (CT-MR) Healthy Abdominal Organ Segmentation (CHAOS) in conjunction with the IEEE International Symposium on Biomedical Imaging (ISBI) in 2019. For this purpose, we prepared and made available a unique dataset of CT and MR scans from unpaired abdominal image series. A consensus-based multiple expert annotation strategy was used to generate the ground truths. A subset of this dataset was provided to the participants for training, and the remaining images were used to test performance against the (hidden) manual delineations using various metrics. In this paper, we report the setup as well as the results of this CHAOS benchmark and its outcomes.
The rest of the paper is organized as follows. A review of the current challenges in abdominal organ segmentation is given in Section II together with surveys on benchmark methods. Next, CHAOS datasets, setup, ground truth generation, and employed tasks are presented in Section III. Section IV describes the evaluation strategy. Then, participating methods are comparatively summarized in Section V. Section VI presents the results, and Section VII concludes the paper.
II Related Work
According to our literature analysis, there currently exist 11 challenges focusing on abdominal organs. Being one of the pioneering challenges, SLIVER07 initiated liver benchmarking [13, 43]. It provided a comparative study of a range of algorithms for liver segmentation under several intentionally included difficulties such as patient orientation variations or tumors and lesions. Its outcomes reported a snapshot of the methods that were popular for medical image analysis at the time. Since then, however, abdomen-related challenges have mostly aimed at disease and tumor detection rather than organ segmentation. In 2008, the "3D Liver Tumor Segmentation Challenge (LTSC08)" was organized as the continuation of SLIVER07 to segment liver tumors from abdominal CT scans. Similarly, the Shape 2014 and 2015 challenges focused on liver segmentation from CT data. Anatomy3 provided a unique and very comprehensive platform for segmenting not only upper-abdominal organs, but also various others such as the left/right lung, urinary bladder, and pancreas. In a similar vein, LiTS - Liver Tumor Segmentation Challenge - is another example that covers liver and liver tumor segmentation tasks in CT. Other similar challenges can be listed as Pancreatic Cancer Survival Prediction, which targets pancreatic cancer tissues in CT scans; the KiTS19 challenge, which provides CT data for kidney tumor segmentation; and PAIP 2019, which aims at automatically detecting liver cancer from whole-slide images.
In 2018, the Medical Segmentation Decathlon (MSD) was organized by a joint team and provided an immense challenge that contained many structures such as liver parenchyma, hepatic vessels and tumors, spleen, brain tumors, hippocampus, and lung tumors. The focus of the challenge was not only measuring the performance for each structure, but also observing the generalizability, translatability, and transferability of a system to unseen data. Thus, the main idea behind MSD was to understand the key elements of DL systems that can work on many tasks. To provide such a source, MSD included a wide range of difficulties including small and unbalanced sample sizes, varying object scales, and multi-class labels. The approach of MSD underlines the ultimate goal of challenges, which is to provide large datasets on several highly different tasks, and evaluation through a standardized analysis and validation process.
In this respect, a recent survey shows that another trend in medical image segmentation is the development of more comprehensive computational anatomical models leading to multi-organ related tasks rather than traditional organ and/or disease-specific tasks . By incorporating inter-organ relations into the process, multi-organ related tasks require a complete representation of the complex and flexible abdominal anatomy. Thus, this emerging field requires new efficient computational and machine learning models.
Under the influence of the above-mentioned visionary studies, CHAOS was organized to strengthen the field by aiming at objectives that involve emerging ML concepts (cross-modality learning, multi-modal segmentation, etc.) through an extensive dataset. In this respect, it focuses on segmenting multiple organs from unpaired patient datasets acquired with two modalities, CT and MR (including two different pulse sequences).
III CHAOS Challenge
III-A Aims and Tasks
The CHAOS challenge has two separate but related aims:
Achieving accurate segmentation of the liver from CT.
Achieving accurate segmentation of abdominal organs (liver, spleen, kidneys) from MRI sequences.
CHAOS provides different segmentation algorithm design opportunities to the participants through five individual tasks:
Task 1: Liver Segmentation (CT-MRI) focuses on using a single system that can segment the liver from both CT and multi-modal MRI (T1-DUAL and T2-SPIR sequences). This corresponds to "cross-modality" learning, which is expected to be used more frequently as the abilities of DL develop.
Task 2: Liver Segmentation (CT) covers a regular segmentation task, which can be considered relatively easier due to the inclusion of only healthy livers aligned in the same direction and patient position. On the other hand, the diffusion of the contrast agent into the parenchyma and the enhancement of the inner vascular tree create considerable difficulties.
Task 3: Liver Segmentation (MRI) has the same aim as Task 2, but presents multi-modal MRI datasets, which were randomly collected within the routine clinical workflow. The methods are expected to work on both T1-DUAL (in and oppose phases) and T2-SPIR MR sequences.
Task 4: Segmentation of abdominal organs (CT-MRI) is similar to Task 1 with an extension to multiple organ segmentation from MR. In this task, the interesting part is that only the liver is annotated as ground truth in the CT datasets, but the MRI datasets have four annotated abdominal organs.
Task 5: Segmentation of abdominal organs (MRI) is the same as Task 3 but extended to four abdominal organs.
In all tasks, a fusion of individual models for different modalities (i.e. two models, one working on CT and the other on MRI) is not allowed. However, the fusion of individual models for MRI sequences (T1-DUAL and T2-SPIR) is allowed in all MRI-included tasks. More details about the tasks are available on the CHAOS challenge website (description: https://chaos.grand-challenge.org/; FAQ: https://chaos.grand-challenge.org/News_and_FAQ/).
III-B Data Information and Details
The CHAOS challenge data contains 80 patients: 40 of them went through a single CT scan, and 40 of them went through MR scans of the upper abdomen including two pulse sequences. Both the CT and MR datasets include healthy abdominal organs without any tumors, lesions, etc. The datasets were collected from the Department of Radiology, Dokuz Eylul University Hospital, Izmir, Turkey. The scan protocols are briefly explained in the following subsections. Further details and explanations are available on the CHAOS website (https://chaos.grand-challenge.org/Data/).
III-B1 CT Data Specifications
The CT volumes were acquired at the portal venous phase after contrast agent injection. In this phase, the liver parenchyma is maximally enhanced through the blood supply by the portal vein. The portal veins are well enhanced, and some enhancement also exists in the hepatic veins. This phase is widely used for liver and vessel segmentation prior to surgery. Since the tasks related to CT data only include liver segmentation, this set has annotations only for the liver. The details of the data are presented in Table II.
III-B2 MRI Data Specifications
The MRI dataset includes two different sequences (T1 and T2) for 40 patients. In total, there are 120 DICOM datasets: T1-DUAL in phase (40 datasets), oppose phase (40 datasets), and T2-SPIR (40 datasets). Each of these sequences is routinely used to scan the abdomen in clinical practice. T1-DUAL in and oppose phase images are registered; therefore, their ground truths are the same. On the other hand, the T1 and T2 sequences are not registered. The datasets were acquired on a 1.5T Philips MRI scanner, which produces 12-bit DICOM images. The details of this dataset are given in Table II.
Table II: Details of the CHAOS CT and MRI datasets.

| Property | CT | MRI |
| --- | --- | --- |
| Number of patients (Train + Test) | 20 + 20 | 20 + 20 |
| Number of sets (Train + Test) | 20 + 20 | 60 + 60* |
| Spatial resolution of files | 512 x 512 | 256 x 256 |
| Number of files in all sets [min-max] | [78 - 294] | [26 - 50] |
| Average number of files in a set | 160 | 32x3* |
| Total files in the whole dataset | 6407 | 3868x3* |
| X space (mm) [min-max] | [0.54 - 0.79] | [0.72 - 2.03] |
| Y space (mm) [min-max] | [0.54 - 0.79] | [0.72 - 2.03] |
| Slice thickness (mm) [min-max] | [2.0 - 3.2] | [4.4 - 8.0] |

* MRI sets are collected from 3 different pulse sequences. For each patient, T1 (in) and (oppose) phases (registered) and the T2 phase are acquired.
III-C Annotations for Reference Segmentation
All 2D slices were labeled manually by three different radiology experts who have 10, 12, and 28 years of experience, respectively. The final shapes of the reference segmentations were decided by majority voting. In some extraordinary situations, the experts made joint decisions. Although this handcrafted annotation process took a significant amount of time, it was preferred in order to create a consistent and consensus-based ground truth image series.
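The per-voxel majority voting described above can be sketched as follows. This is an illustrative reconstruction rather than the organizers' actual annotation tooling, and the toy arrays are hypothetical:

```python
import numpy as np

def majority_vote(masks):
    """Fuse binary expert masks into one consensus mask.

    A voxel is labeled foreground when more than half of the experts
    marked it (with three experts: at least two votes).
    """
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks])
    votes = stack.sum(axis=0)
    return votes > (len(masks) / 2.0)

# Three hypothetical 2D expert annotations of the same slice:
a = np.array([[1, 1], [0, 0]])
b = np.array([[1, 0], [0, 0]])
c = np.array([[1, 1], [1, 0]])
consensus = majority_vote([a, b, c])  # foreground where >= 2 experts agree
```

Cases where the vote is ambiguous or anatomically questionable correspond to the "extraordinary situations" resolved by joint expert decisions.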
III-D Challenge Setup and Distribution of the Data
Both the CT and MRI datasets were divided into 20 sets for training and 20 sets for testing. The training data are presented with ground truth labels, while the testing data contain only the original images. To provide sufficient data with enough variability, the sets in the training data were selected to represent all the difficulties observed in the whole database.
The images are distributed as DICOM files to present the data in its original form. The only modification was removing patient-related information for anonymization. The ground truths are also presented as image series to match the original format. The CHAOS data can be accessed with its DOI number via the zenodo.org webpage under a CC-BY-SA 4.0 license.
IV Evaluation
IV-A Metrics
Since the outcomes of medical image segmentation are used for various clinical procedures, using a single metric for 3D segmentation evaluation is not a proper approach to ensure acceptable results for all requirements [29, 46]. Thus, in the CHAOS challenge, four different metrics are combined. The metrics were chosen among the most preferred ones in previous challenges so as to analyze results in terms of overlapping, volumetric, and spatial differences:
Assume that S represents the set of voxels in a segmentation result and G represents the set of voxels in the ground truth.
DICE coefficient (DICE)
The Dice coefficient is calculated as DICE = 2|S ∩ G| / (|S| + |G|), where |·| denotes cardinality (the larger, the better).
Relative absolute volume difference (RAVD)
RAVD compares two volumes: RAVD = abs(|S| − |G|) / |G| × 100%, where 'abs' denotes the absolute value (the smaller, the better).
Average symmetric surface distance (ASSD)
This metric is the average Hausdorff distance between the border voxels of S and G. The unit of this metric is millimeters.
Maximum symmetric surface distance (MSSD)
This metric is the maximum Hausdorff distance between the border voxels of S and G. The unit of this metric is millimeters.
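The challenge provides official metric code in MATLAB, Python, and Julia; as a rough sketch of the four definitions above, the metrics could be computed as below. The use of SciPy distance transforms and the voxel-spacing argument are our own assumptions, not the reference implementation:

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(S, G):
    """DICE = 2|S ∩ G| / (|S| + |G|); the larger, the better."""
    S, G = np.asarray(S, dtype=bool), np.asarray(G, dtype=bool)
    return 2.0 * np.logical_and(S, G).sum() / (S.sum() + G.sum())

def ravd(S, G):
    """Relative absolute volume difference in percent; the smaller, the better."""
    S, G = np.asarray(S, dtype=bool), np.asarray(G, dtype=bool)
    return abs(int(S.sum()) - int(G.sum())) / G.sum() * 100.0

def _border(mask):
    """Border voxels: the mask minus its erosion."""
    return mask & ~binary_erosion(mask)

def _symmetric_surface_distances(S, G, spacing):
    """Distances (mm) from each border voxel of S to G's border and vice versa."""
    bS = _border(np.asarray(S, dtype=bool))
    bG = _border(np.asarray(G, dtype=bool))
    to_G = distance_transform_edt(~bG, sampling=spacing)
    to_S = distance_transform_edt(~bS, sampling=spacing)
    return np.concatenate([to_G[bS], to_S[bG]])

def assd(S, G, spacing=(1.0, 1.0, 1.0)):
    """Average symmetric surface distance in mm."""
    return _symmetric_surface_distances(S, G, spacing).mean()

def mssd(S, G, spacing=(1.0, 1.0, 1.0)):
    """Maximum symmetric surface distance in mm."""
    return _symmetric_surface_distances(S, G, spacing).max()
```

The `spacing` argument matters for CHAOS because slice thickness and pixel spacing differ between the CT and MR sets (Table II).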
IV-B Scoring System
In the literature, there are two main ways of ranking results via multiple metrics. One way is ordering the results by the metrics' statistical significance with respect to all results. Another way is converting the metric outputs to the same scale and averaging them. In CHAOS, we adopted the second approach. Values coming from each metric were transformed to span the interval [0, 100] so that higher values correspond to better segmentation. For this transformation, it was reasonable to apply thresholds in order to cut off unacceptable results and increase the sensitivity of the corresponding metric. We are aware of the fact that decisions on metrics and thresholds have a very critical impact on ranking. Instead of setting arbitrary thresholds, we used intra- and inter-user similarities among the experts who created the ground truth. We asked some of the experts to repeat the annotation process at different times. These collections of reference masks were used for the calculation of our metrics in a pair-wise manner, and the resulting values were used to specify the thresholds given in Table III. As an example, two manual segmentations performed by the same expert on the same CT data set resulted in liver volumes of 1491 mL and 1496 mL. The volumetric overlap is 97.21%, while RAVD is 0.347%, ASSD is 0.611 (0.263 mm), RMSD is 1.04 (0.449 mm), and MSSD is 13.038 (5.632 mm). With these thresholds, the measurements yielded a total grade of 95.14. A similar analysis of liver segmentation from MRI showed a slightly lower grade of 93.01.
Table III: Best values and thresholds of the distance metrics.

| Metric name | Best value | Worst value | Threshold |
| --- | --- | --- | --- |
| ASSD | 0 mm | | ASSD < 15 mm |
| MSSD | 0 mm | | MSSD < 60 mm |
Metric values outside the threshold range get zero points. The values within the range are linearly mapped to the interval [0, 100]. Then, the score of each case in the testing data is calculated as the mean of the four scores. Missing cases (sets that do not have segmentation results) get zero points, and these points are included in the final score calculation. The average of the scores across all test cases determines the overall score of the team for the specified task. The code for all metrics (in MATLAB, Python, and Julia) is available at https://github.com/emrekavur/CHAOS-evaluation. More details about the metrics, the CHAOS scoring system, and a mini-experiment that compares the sensitivities of different metrics to distorted segmentations are provided on the same website.
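A minimal sketch of this thresholded linear mapping, using the published ASSD threshold of 15 mm and, for illustration only, assuming 0 as the worst DICE value:

```python
def metric_score(value, best, worst):
    """Map a metric value to [0, 100].

    `best` maps to 100 and `worst` (the threshold) to 0; values at or
    beyond the threshold score zero. Works for larger-is-better metrics
    (e.g. DICE: best=1, worst=0) and smaller-is-better ones
    (e.g. ASSD in mm: best=0, worst=15).
    """
    frac = (value - worst) / (best - worst)
    return max(0.0, min(1.0, frac)) * 100.0

def case_score(metric_scores):
    """Score of one test case: the mean of its four metric scores.
    Missing submissions simply contribute a zero case score."""
    return sum(metric_scores) / len(metric_scores)

assd_score = metric_score(3.0, best=0.0, worst=15.0)  # 80.0
dice_score = metric_score(0.9, best=1.0, worst=0.0)   # 90.0
```

The per-task team score is then the mean of the case scores over all test cases.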
Table IV: Participating teams.

| Team | Details of the method | Training strategy |
| --- | --- | --- |
| OvGUMEMoRIAL (P. Ernst, S. Chatterjee, O. Speck, A. Nürnberger) | | |
| ISDUE (D. D. Pham, G. Dovletov, J. Pauli) | | |
| Lachinov (D. Lachinov) | | |
| IITKGP-KLIV (R. Sathish, R. Rajan, D. Sheet) | | |
| METU_MMLAB (S. Özkan, B. Baydar, G. B. Akar) | | |
| PKDIA (P.-H. Conze) | | |
| MedianCHAOS (V. Groza) | | |
| Mountain (Shuo Han) | | |
| CIR_MPerkonigg (M. Perkonigg) | | |
| nnU-Net (F. Isensee, K. H. Maier-Hein) | | |
V Participating Methods
The participating methods are summarized in Tables IV and VI. Detailed descriptions with figures can be found in the challenge proceedings (https://chaos.grand-challenge.org/Results_CHAOS/); here, an overview is given.
The majority of the applied methods (i.e. all except IITKGP-KLIV) used variations of U-Net. This seems typical, as this architecture dominates most recent DL-based studies. Among all, two rely on ensembles (i.e. MedianCHAOS and nnU-Net), which use multiple models and combine their results.
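As an illustration of the ensembling idea (not the teams' specific fusion rules, which are not detailed here), per-voxel probability maps from several models can be averaged before thresholding at 0.5:

```python
import numpy as np

def ensemble_segmentation(prob_maps, threshold=0.5):
    """Average per-voxel foreground probabilities from several models
    and binarize the result; a simple mean-fusion ensemble."""
    mean_prob = np.mean(np.stack(prob_maps), axis=0)
    return mean_prob > threshold

# Two hypothetical model outputs for a 2x2 slice:
p1 = np.array([[0.9, 0.4], [0.2, 0.8]])
p2 = np.array([[0.7, 0.8], [0.1, 0.3]])
mask = ensemble_segmentation([p1, p2])  # mean: [[0.8, 0.6], [0.15, 0.55]]
```

Mean fusion tends to suppress the idiosyncratic errors of individual models, which is one reason ensembles performed strongly in the challenge.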
To compare the automatic DL methods with semi-automatic ones, interactive methods, including both traditional iterative models and more recent techniques, are employed from our previous work. In this respect, the results also present and discuss the accuracy and repeatability of emerging automatic DL algorithms against those of well-established interactive methods, which were applied by a team of imaging scientists and radiologists through two dedicated viewers, Slicer and exploreDICOM.
Table V: Submission statistics per task.

| Submission numbers | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 |
| --- | --- | --- | --- | --- | --- |
| The most by the same team (ISBI 2019) | 0 | 5 | 0 | 0 | 0 |
| The most by the same team (Online) | 3 | 8 | 8 | 2 | 7 |
The training dataset was published approximately three months before ISBI 2019. The testing dataset was given 24 hours before the challenge session. The submissions were evaluated during the conference, and the winners were announced. After ISBI 2019, the training and testing (only DICOM images) datasets were published on the zenodo.org website, and the online submission system was activated on the challenge website (https://grand-challenge.org/challenges/).
Table VI: Pre-processing and data augmentation strategies of the participating methods.

| Team | Pre-processing | Data augmentation |
| --- | --- | --- |
| OvGUMEMoRIAL | TRI* (). Inference: full-sized. | - |
| ISDUE | TRI (96,128,128) | Random translate and rotate |
| Lachinov | RS**, z-score normalization | Random ROI crop, mirror X-Y, transpose X-Y, WL-WW*** |
| IITKGP-KLIV | TRI (), whitening. Additional class for body. | - |
| METU_MMLAB | Min-max normalization for CT | - |
| PKDIA | TRI: MR, CT. | Random scale, rotate, shear and shift |
| MedianCHAOS | LUT [-240,160] HU range, normalization. | - |
| Mountain | Zero padding, TRI, rigid register MR. | Random rotate, scale, elastic deformation |
| CIR_MPerkonigg | Normalization to zero mean unit variance. | 2D affine and elastic transforms, histogram shift, flip and adding Gaussian noise |
| nnU-Net | Normalization to zero mean unit variance, RS | Add Gaussian noise / blur, rotate, scale, WL-WW, simulated low resolution, Gamma, mirroring |

*TRI = Training with resized images. **RS = Resampling. ***WL-WW = Window Level-Width.

All teams thresholded their outputs by 0.5. PKDIA, METU_MMLAB, and Mountain used connected component analysis for selecting/eliminating some of the model outputs. ISDUE used bicubic interpolation for refinement.
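The intensity pre-processing entries above (HU windowing via a LUT, min-max normalization, z-score normalization) can be sketched as follows. The [-240, 160] HU window follows the MedianCHAOS row; the function names and the epsilon guard are our own illustrative choices:

```python
import numpy as np

def window_hu(ct_volume, lo=-240.0, hi=160.0):
    """Clip a CT volume to a Hounsfield-unit window, e.g. [-240, 160] HU."""
    return np.clip(ct_volume.astype(np.float32), lo, hi)

def min_max_normalize(volume, eps=1e-8):
    """Linearly rescale intensities to [0, 1]."""
    v = volume.astype(np.float32)
    return (v - v.min()) / (v.max() - v.min() + eps)

def z_score_normalize(volume, eps=1e-8):
    """Shift and scale intensities to zero mean and unit variance."""
    v = volume.astype(np.float32)
    return (v - v.mean()) / (v.std() + eps)
```

Windowing is meaningful for CT only; since MRI lacks a standardized intensity scale (see Section VI-B), the MR-oriented methods relied on per-volume min-max or z-score normalization instead.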
VI Results
There are two separate leaderboards at the challenge website: one for the conference session (https://chaos.grand-challenge.org/Results_CHAOS/) and another for post-conference online submissions (https://chaos.grand-challenge.org/evaluation/results/). In this section, we present the majority of the results from the conference participants and the two most significant post-conference results collected among the online submissions. To be specific, the METU_MMLAB and nnU-Net results belong to online submissions, while the others are from the conference session. Each method is assigned a unique color code, as shown in the figures and Table VII.
Box plots of all results for each task are presented separately in Fig. 2, together with the scores on each testing case for all tasks. As expected, the tasks that received the highest number of submissions and scores were the ones focusing on the segmentation of a single organ from a single modality. Thus, the vast majority of the submissions were for liver segmentation from CT images (Task 2), followed by liver segmentation from MR images (Task 3). Accordingly, in the following subsections, the results are presented in the order of participation given in Table V (i.e. from the task with the highest submissions and scores to the one with the lowest). In this way, the segmentation from cross- and multi-modality/organ concepts (Tasks 1 and 4) are discussed in the light of the performances obtained for the more conventional approaches (Tasks 2, 3 and 5).
VI-A CT Liver Segmentation (Task 2)
This task includes one of the most studied cases and a very mature field of abdominal segmentation. Therefore, it provides a good opportunity to test the effectiveness of the participating models compared to existing approaches. Although the provided datasets only include healthy organs, the injection of contrast media creates several additional challenges, as described in Section III-B. Nevertheless, the highest scores of the challenge were obtained in this task (Fig. 2.b).
The on-site winner was MedianCHAOS with a score of 80.45 ± 8.61, and the online winner is PKDIA with 82.46 ± 8.47. MedianCHAOS being an ensemble strategy, the performances of its sub-networks are illustrated in Fig. 2.a. When the individual metrics are analyzed, the DICE performances are outstanding (i.e. 0.98 ± 0.00) for both winners (i.e. scores of 97.79 ± 0.43 for PKDIA and 97.55 ± 0.42 for MedianCHAOS). Similarly, the ASSD performances have a very high mean and small variance (i.e. 0.89 ± 0.36 [score: 94.06 ± 2.37] for PKDIA and 0.90 ± 0.24 [score: 94.02 ± 1.6] for MedianCHAOS). On the other hand, the RAVD and MSSD scores are significantly low, resulting in reduced overall performance. This outcome is valid for all tasks and participating methods.
Regarding the semi-automatic approaches, the best three received scores of 72.8 (active contours with a mean interaction time (MIT) of 25 minutes), 68.1 (robust static segmenter with an MIT of 17 minutes), and 62.3 (watershed with an MIT of 8 minutes). Thus, the successful deep learning-based automatic segmentation algorithms among the participants outperformed the interactive approaches by a large margin. This improvement almost reaches the inter-expert level for volumetric analysis and average surface differences. However, there is still a need for improvement considering the metrics related to maximum error margins (i.e. RAVD and MSSD). An important drawback of the deep approaches is that they might completely fail and generate unreasonably low scores for particular cases.
Regarding the effect of architectural design differences on performance, comparative analyses were performed with some well-established deep frameworks (i.e. DeepMedic and NiftyNet). These models were applied with their default parameters, and both achieved scores around 70. Thus, considering the participating models that received scores below 70, it is safe to conclude that, even after intense research studies, new deep architectural designs and parameter tweaking do not necessarily translate into more successful systems.
VI-B MR Liver Segmentation (Task 3)
Segmentation from MR can be considered a more difficult task than segmentation from CT because CT images have a typical histogram and dynamic range defined by Hounsfield Units (HU), whereas MRI has no such standardization. Moreover, artifacts and other factors in the clinical routine cause significant degradation of MR image quality. The on-site winner of this task is PKDIA with a score of 70.71 ± 6.40, which had the most successful results not only for the mean score but also for the distribution of the results (shown in Fig. 2.c and 2.d). Robustness to deviations in MR data quality is an important factor that affects performance. For instance, CIR_MPerkonigg achieved the most successful scores for some cases, but could not attain a high overall score.
The online winner is nnU-Net with 75.10 ± 7.61. When the scores of the individual metrics are analyzed for PKDIA and nnU-Net, the DICE (i.e. 0.94 ± 0.01 [score: 94.47 ± 1.38] for PKDIA and 0.95 ± 0.01 [score: 95.42 ± 1.32] for nnU-Net) and ASSD (i.e. 1.32 ± 0.83 [score: 91.19 ± 5.55] for nnU-Net and 1.56 ± 0.68 [score: 89.58 ± 4.54] for PKDIA) performances are again extremely good, while the RAVD and MSSD scores are significantly lower than the CT results. The reason behind this can be attributed to the lower resolution and higher spacing of the MR data, which cause a higher spatial error for each misclassified pixel/voxel (see Table II). Comparisons with the interactive methods show that they tend to make regional mistakes due to their spatial enlargement strategies. The main challenge for them is to differentiate the outline when the liver is adjacent to isodense structures. On the other hand, automatic methods show mistakes that are much more distributed all over the liver. Further analysis also revealed that interactive methods seem to make fewer over-segmentations. This is partly related to the iterative parameter adjustment of the operator, which prevents unexpected results. Overall, the participating methods performed on par with the interactive methods if only volumetry metrics are considered. However, interaction seems to outperform the deep models on the other measures.
VI-C CT-MR Liver Segmentation (Task 1)
This task aims at cross-modality learning and involves the use of CT and MR information together during training. A model that can effectively accomplish cross-modality learning would: 1) help to satisfy the big data needs of deep models by providing more images, and 2) reveal common features of the incorporated modalities for an organ. To compare cross-modality learning with individual training, Fig. 2.a should be compared to Fig. 2.c for CT. Such a comparison clearly reveals that the models trained only on CT data show significantly better performance than the models trained on both modalities. A similar observation can also be made for the MR results by comparing Fig. 2.b and Fig. 2.d.
The on-site winner of this task was OvGUMEMoRIAL with a score of 55.78 ± 19.20. Although its DICE performance is quite satisfactory (i.e. 0.88 ± 0.15, corresponding to a DICE score of 83.14), the other measures cause the low grade. Here, a very interesting observation is that the score of OvGUMEMoRIAL in this task is lower than its score on CT only (61.13 ± 19.72) but higher than on MR only (41.15 ± 21.61). Another interesting observation is that PKDIA, the highest-scoring non-ensemble model for both Task 2 (CT) and Task 3 (MR), had a significant performance drop in this task. Finally, it is worth pointing out that the online results have reached up to 73, but those results could not be validated (i.e. the participant did not respond) and were achieved after multiple submissions by the same team. In such cases, the possibility of peeking is very strong; therefore, these results are not included in the manuscript.
It is important to examine the scores of the cases together with their distribution across all data. This can help to analyze the generalization capabilities and real-life usability of these systems. For example, Fig. 2.a shows a noteworthy situation: the winner of Task 1, OvGUMEMoRIAL, has a larger standard deviation than the second-ranked method (ISDUE). Figures 2.a and 2.b show that the competing algorithms have slightly higher scores on the CT data than on the MR data. However, if we consider the scattering of the individual scores across the data, the CT scores have higher variability. This shows that reaching equal generalization for multiple modalities is a challenging task for Convolutional Neural Networks (CNNs).
|Team Name||Mean Score||DICE||DICE Score||RAVD||RAVD Score||ASSD||ASSD Score||MSSD||MSSD Score|
|OvGUMEMoRIAL||55.78 19.20||0.88 0.15||83.14 28.16||13.84 30.26||24.67 31.15||11.86 65.73||76.31 21.13||57.45 67.52||31.29 26.01|
|ISDUE||55.48 16.59||0.87 0.16||83.75 25.53||12.29 15.54||17.82 30.53||5.17 8.65||75.10 22.04||36.33 21.97||44.83 21.78|
|PKDIA||50.66 23.95||0.85 0.26||84.15 28.45||6.65 6.83||21.66 30.35||9.77 23.94||75.84 28.76||46.56 45.02||42.28 27.05|
|Lachinov||45.10 ± 21.91||0.87 ± 0.13||77.83 ± 33.12||10.54 ± 14.36||21.59 ± 32.65||7.74 ± 14.42||63.66 ± 31.32||83.06 ± 74.13||24.30 ± 27.78|
|METU_MMLAB||42.54 ± 18.79||0.86 ± 0.09||75.94 ± 32.32||18.01 ± 22.63||14.12 ± 25.34||8.51 ± 16.73||60.36 ± 28.40||62.61 ± 51.12||24.94 ± 25.26|
|IITKGP-KLIV||40.34 ± 20.25||0.72 ± 0.31||60.64 ± 44.95||9.87 ± 16.27||24.38 ± 32.20||11.85 ± 16.87||50.48 ± 37.71||95.43 ± 53.17||7.22 ± 18.68|
|PKDIA*||82.46 ± 8.47||0.98 ± 0.00||97.79 ± 0.43||1.32 ± 1.302||73.6 ± 26.44||0.89 ± 0.36||94.06 ± 2.37||21.89 ± 13.94||64.38 ± 20.17|
|MedianCHAOS6||80.45 ± 8.61||0.98 ± 0.00||97.55 ± 0.42||1.54 ± 1.22||69.19 ± 24.47||0.90 ± 0.24||94.02 ± 1.6||23.71 ± 13.66||61.02 ± 21.06|
|OvGUMEMoRIAL||61.13 ± 19.72||0.90 ± 0.21||90.18 ± 21.25||9x ± 4x||44.35 ± 35.63||4.89 ± 12.05||81.03 ± 20.46||55.99 ± 38.47||28.96 ± 26.73|
|ISDUE||55.79 ± 11.91||0.91 ± 0.04||87.08 ± 20.6||13.27 ± 7.61||4.16 ± 12.93||3.25 ± 1.64||78.30 ± 10.96||27.99 ± 9.99||53.60 ± 15.76|
|IITKGP-KLIV||55.35 ± 17.58||0.92 ± 0.22||91.51 ± 21.54||8.36 ± 21.62||30.41 ± 27.12||27.55 ± 114.04||81.97 ± 21.88||102.37 ± 110.9||17.50 ± 21.79|
|Lachinov||39.86 ± 27.90||0.83 ± 0.20||68 ± 40.45||13.91 ± 20.4||22.67 ± 33.54||11.47 ± 22.34||53.28 ± 33.71||93.70 ± 79.40||15.47 ± 24.15|
|nnU-Net||75.10 ± 7.61||0.95 ± 0.01||95.42 ± 1.32||2.85 ± 1.55||47.92 ± 25.36||1.32 ± 0.83||91.19 ± 5.55||20.85 ± 10.63||65.87 ± 15.73|
|PKDIA||70.71 ± 6.40||0.94 ± 0.01||94.47 ± 1.38||3.53 ± 2.14||41.8 ± 24.85||1.56 ± 0.68||89.58 ± 4.54||26.06 ± 8.20||56.99 ± 12.73|
|Mountain||60.82 ± 10.94||0.92 ± 0.02||91.89 ± 1.99||5.49 ± 2.77||25.97 ± 27.95||2.77 ± 1.32||81.55 ± 8.82||35.21 ± 14.81||43.88 ± 17.60|
|ISDUE||55.17 ± 20.57||0.85 ± 0.19||82.08 ± 28.11||11.8 ± 15.69||24.65 ± 27.58||6.13 ± 10.49||73.50 ± 25.91||40.50 ± 24.45||40.45 ± 20.90|
|CIR_MPerkonigg||53.60 ± 17.92||0.91 ± 0.07||84.35 ± 19.83||10.69 ± 20.44||31.38 ± 25.51||3.52 ± 3.05||77.42 ± 18.06||82.16 ± 50||21.27 ± 23.61|
|METU_MMLAB||53.15 ± 10.92||0.89 ± 0.03||81.06 ± 18.76||12.64 ± 6.74||10.94 ± 15.27||3.48 ± 1.97||77.03 ± 12.37||35.74 ± 14.98||43.57 ± 17.88|
|Lachinov||50.34 ± 12.22||0.90 ± 0.05||82.74 ± 18.74||8.85 ± 6.15||21.04 ± 21.51||5.87 ± 5.07||68.85 ± 19.21||77.74 ± 43.7||28.72 ± 15.36|
|OvGUMEMoRIAL||41.15 ± 21.61||0.81 ± 0.15||64.94 ± 37.25||49.89 ± 71.57||10.12 ± 14.66||5.78 ± 4.59||64.54 ± 24.43||54.47 ± 24.16||25.01 ± 20.13|
|IITKGP-KLIV||34.69 ± 8.49||0.63 ± 0.07||46.45 ± 1.44||6.09 ± 6.05||43.89 ± 27.02||13.11 ± 3.65||40.66 ± 9.35||85.24 ± 23.37||7.77 ± 12.81|
|ISDUE||58.69 ± 18.65||0.85 ± 0.21||81.36 ± 28.89||14.04 ± 18.36||14.08 ± 27.3||9.81 ± 51.65||78.87 ± 25.82||37.12 ± 60.17||55.95 ± 28.05|
|PKDIA||49.63 ± 23.25||0.88 ± 0.21||85.46 ± 25.52||8.43 ± 7.77||18.97 ± 29.67||6.37 ± 18.96||82.09 ± 23.96||33.17 ± 38.93||56.64 ± 29.11|
|OvGUMEMoRIAL||43.15 ± 13.88||0.85 ± 0.16||79.10 ± 29.51||5x ± 5x||12.07 ± 23.83||5.22 ± 12.43||73.00 ± 21.83||74.09 ± 52.44||22.16 ± 26.82|
|IITKGP-KLIV||35.33 ± 17.79||0.63 ± 0.36||50.14 ± 46.58||13.51 ± 20.33||15.17 ± 27.32||16.69 ± 19.87||40.46 ± 38.26||130.3 ± 67.59||8.39 ± 22.29|
|nnU-Net||72.44 ± 5.05||0.95 ± 0.02||94.6 ± 1.59||5.07 ± 2.57||37.17 ± 20.83||1.05 ± 0.55||92.98 ± 3.69||14.87 ± 5.88||75.52 ± 8.83|
|PKDIA||66.46 ± 5.81||0.93 ± 0.02||92.97 ± 1.78||6.91 ± 3.27||28.65 ± 18.05||1.43 ± 0.59||90.44 ± 3.96||20.1 ± 5.90||66.71 ± 9.38|
|Mountain||60.2 ± 8.69||0.90 ± 0.03||85.81 ± 10.18||8.04 ± 3.97||21.53 ± 15.50||2.27 ± 0.92||84.85 ± 6.11||25.57 ± 8.42||58.66 ± 10.81|
|ISDUE||56.25 ± 19.63||0.83 ± 0.23||79.52 ± 28.07||18.33 ± 27.58||12.51 ± 15.14||5.82 ± 11.72||77.88 ± 26.93||32.88 ± 33.38||57.05 ± 21.46|
|METU_MMLAB||56.01 ± 6.79||0.89 ± 0.03||80.22 ± 12.37||12.44 ± 4.99||15.63 ± 13.93||3.21 ± 1.39||79.19 ± 8.01||32.70 ± 9.65||49.29 ± 12.69|
|OvGUMEMoRIAL||44.34 ± 14.92||0.79 ± 0.15||64.37 ± 32.19||76.64 ± 122.44||9.45 ± 11.98||4.56 ± 3.15||71.11 ± 18.22||42.93 ± 17.86||39.48 ± 16.67|
|IITKGP-KLIV||25.63 ± 5.64||0.56 ± 0.06||41.91 ± 11.16||13.38 ± 11.2||11.74 ± 11.08||18.7 ± 6.11||35.92 ± 8.71||114.51 ± 45.63||11.65 ± 13.00|
|* Corrected submission of PKDIA shortly after the ISBI 2019 conference (during the challenge, they submitted the same results in a reversed orientation; therefore, the winner of Task 2 at the conference session was MedianCHAOS6).|
VI-D Multi-Modal MR Abdominal Organ Segmentation (Task 5)
Task 5 investigates how DL models contribute to the development of more comprehensive computational anatomical models for multi-organ tasks. Deep models have the potential to provide a complete representation of the complex and flexible abdominal anatomy by incorporating inter-organ relations through their internal hierarchical feature extraction process.
The on-site winner was PKDIA with a score of 66.46 ± 5.81, and the online winner is nnU-Net with 72.44 ± 5.05. When the scores of the individual metrics are analyzed in comparison to Task 3, the DICE performance remains almost the same for nnU-Net and PKDIA. This is a significant outcome, as all four organs are segmented instead of a single one. It is also worth pointing out that the third-place model (i.e. Mountain) has almost exactly the same overall score for Tasks 3 and 5. The same observation holds for the standard deviations of these models. Considering RAVD, the performance decrease is larger than for DICE. These reduced DICE and RAVD performances are partially compensated by better MSSD and ASSD performances. On the other hand, this increase might not be directly related to multi-organ segmentation: the liver (Task 3) is generally the most complex abdominal organ to segment, and the other organs in Task 5 can be considered relatively easier to analyze.
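The volumetric metrics compared above can be made concrete with a minimal sketch of DICE and RAVD for binary masks. This illustrates the standard definitions only, not the challenge's actual evaluation code; the distance-based MSSD and ASSD metrics are omitted for brevity.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """DICE coefficient between two binary masks (1.0 = perfect overlap)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, gt).sum() / denom

def ravd(pred: np.ndarray, gt: np.ndarray) -> float:
    """Relative absolute volume difference, as a percentage of the
    ground-truth volume (0.0 = identical volumes)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    if gt.sum() == 0:
        raise ValueError("ground-truth mask is empty")
    return 100.0 * abs(int(pred.sum()) - int(gt.sum())) / gt.sum()
```

Note how a prediction can score well on DICE yet poorly on RAVD (or vice versa), which is why the two columns in the tables above can diverge for the same team.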
VI-E CT-MR Abdominal Organ Segmentation (Task 4)
This task covers segmentation of both the liver in CT and four abdominal organs in MRI. Hence, it can be considered the most difficult task, since it involves both cross-modality learning and multi-organ segmentation. It is therefore not surprising that it has the lowest participation and the lowest scores.
The on-site winner was ISDUE with a score of 58.69 ± 18.65. Fig. 2.e-f shows that their solution had a consistent, high-performance distribution on both CT and MR data. The two convolutional encoders in their system, which compress information about the anatomy, likely boost performance on cross-modality data. PKDIA also showed promising performance with a score of 49.63 ± 23.25. Despite their success on the MRI sets, their CT performance can be considered unsatisfactory, similar to their situation in Task 1. This suggests that the CNN may not have been trained efficiently: the encoder part of their solution uses transfer learning, and the pre-trained weights might not transfer well across multiple modalities. The OvGUMEMoRIAL team achieved third place with an average score and a balanced performance on both modalities; their method can be considered successful in terms of generalization.
Together with the outcomes of Tasks 1 and 5, this shows that with current strategies and architectures, CNNs achieve better segmentation performance on single-modality tasks. This can be considered an expected outcome, because the success of CNNs depends strongly on the consistency and homogeneity of the data; using multiple modalities creates high variance in the data even after normalization. On the other hand, the results also reveal that CNNs have good potential for cross-modality tasks if appropriate models are constructed. This potential was not as clear before the development of deep learning strategies for segmentation.
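As a concrete illustration of the normalization mentioned above, here is a minimal per-volume z-score sketch, a common choice for reducing intensity differences between CT and MR inputs. The source does not specify the exact normalization each team used, so this is an assumed, representative scheme; as the text notes, it cannot remove the modality gap entirely.

```python
import numpy as np

def normalize_volume(vol: np.ndarray) -> np.ndarray:
    """Per-volume z-score normalization: zero mean, unit variance.

    Applied independently to each CT or MR volume, so absolute intensity
    scales (Hounsfield units vs. arbitrary MR intensities) are removed,
    but the modality-specific appearance of tissues remains.
    """
    vol = vol.astype(np.float32)
    std = vol.std()
    if std == 0:
        return vol - vol.mean()  # constant volume: only center it
    return (vol - vol.mean()) / std
```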
VII Conclusions and Discussion
In this paper, we presented the CHAOS abdominal healthy organ segmentation benchmark. We generated an unpaired multi-modality (CT-MR), multi-sequence (T1 in-phase/oppose-phase, T2) public dataset for five tasks and evaluated a significant number of well-established and state-of-the-art segmentation methods using four metrics. Our results indicate several important outcomes. First, deep learning-based automatic methods outperformed interactive semi-automatic strategies for CT liver segmentation. They reached inter-expert variability for DICE and volumetry, but still need improvement on distance-based measures, which are critical for determining surgical error margins. Second, for MR liver segmentation, the participating deep models performed almost as well as interactive ones for DICE, but lag behind on distance-based measures. Third, when all four abdominal organs are considered, the performance of deep models improves compared to liver-only segmentation. However, it is unclear whether this improvement can be attributed to multi-tasking, since the liver can be considered the most complex organ to segment among those in the challenge. Fourth, cross-modality (CT-MR) learning proved more challenging than individual training. Last, but not least, multi-organ cross-modality segmentation remains the most challenging problem until appropriate ways are developed to exploit the multi-tasking properties of deep models and the larger data pool offered by cross-modal medical data. Such complicated tasks would benefit from spatial priors and global topological or shape representations in their loss functions, as employed by some of the submitted models.
Given the outstanding results for single-modality tasks, and the fact that in the clinical workflow the resulting volumes are reviewed by a radiologist-surgeon team prior to operations, it can be concluded that minimal user interaction, especially in the post-processing phase, would easily bring the single-modality results to clinically acceptable levels. This would require not only software implementations of the successful participating methods, but also their integration into an adequate workstation/DICOM viewer that is easily accessible in the daily workflow of clinicians.
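One typical example of the light post-processing meant here is keeping only the largest connected component of a predicted organ mask, which removes small spurious islands before clinical review. This is a sketch assuming a binary numpy mask and scipy; the source does not prescribe which post-processing steps a workstation would actually apply.

```python
import numpy as np
from scipy import ndimage

def keep_largest_component(mask: np.ndarray) -> np.ndarray:
    """Keep only the largest connected component of a binary mask.

    Useful for single-organ predictions such as a CT liver mask, where
    the anatomy is known to be one connected region.
    """
    labeled, n = ndimage.label(mask.astype(bool))
    if n == 0:
        return mask.astype(bool)  # nothing segmented: return empty mask
    # Component sizes, skipping the background label 0.
    sizes = np.bincount(labeled.ravel())[1:]
    largest = int(np.argmax(sizes)) + 1
    return labeled == largest
```

For multi-organ outputs, the same routine would be applied per organ label rather than to the combined mask.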
Except for one, all teams in this challenge used a modification of U-Net as the primary classifier or as a support system. However, the high variance between reported scores, even though the teams use the same baseline CNN structure, shows that model performance still depends on many factors, including architectural design, implementation, parametric modifications, and tuning. Although several common algorithmic properties can be identified among high-scoring models, interpreting and/or explaining why a particular model performs well or not is far from trivial when even one of these factors is poorly specified. As discussed in previous challenges, such an analysis is almost impossible to perform on a heterogeneous set of models developed by different teams in different programming environments. Moreover, the selection of evaluation metrics, their transformation into scores, and the calculation of the final scores might have a significant impact on the reported performances.
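The metric-to-score transformation mentioned here can be sketched as a clipped linear mapping onto [0, 100], with the final score taken as the average of the per-metric scores. The `best`/`worst` thresholds in the example below are purely illustrative, not the values used by CHAOS.

```python
def metric_to_score(value: float, best: float, worst: float) -> float:
    """Map a raw metric value onto [0, 100].

    `best` maps to 100 and `worst` (the acceptability threshold) maps
    to 0; values beyond the threshold are clipped to 0. Works whether
    higher is better (e.g. DICE) or lower is better (e.g. RAVD).
    """
    span = worst - best
    score = 100.0 * (worst - value) / span
    return max(0.0, min(100.0, score))

def final_score(scores: list[float]) -> float:
    """Average the per-metric scores into one overall score."""
    return sum(scores) / len(scores)
```

Because the thresholds and the clipping decide how heavily outliers are punished, two reasonable scoring designs can rank the same set of methods differently, which is exactly the sensitivity the paragraph above warns about.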
Since the start of the competition, the most popular task, Task 2, has received more than 200 submissions in eight months. Quantitative analyses of Task 2 show that CNNs have achieved great success in segmenting the liver from CT. Supporting the quantitative analyses, our qualitative observations reveal that the top methods can be used in real-life solutions with little post-processing effort. We therefore believe that the solutions to Task 2 have reached saturation. Of course, future submissions may achieve better segmentation performance than the methods reviewed here, but the impact and significance of such slight improvements may not justify the effort of developing them. We suggest that researchers focus on bringing their solutions into real-world applications instead of pushing for marginal score gains: reducing computational cost, increasing generalization, attaching importance to repeatability and reproducibility, and making the solutions easy to implement.
The organizers would like to thank the whole ISBI 2019 team, especially Ivana Isgum and Tom Vercauteren of the challenge committee, for their guidance and support. We express our gratitude to all authors and supporting organizations of the grand-challenge.org platform for hosting our challenge. We thank Esranur Kazaz, Umut Baran Ekinci, Ece Köse, David Völgyes, and Javier Coronel for their contributions. Last but not least, our special thanks go to Ludmila I. Kuncheva for her valuable contributions.