Radiation therapy is one of the standard treatments for lung and esophageal cancer. It consists of irradiating the tumor with ionizing beams to prevent the proliferation of cancer cells. The goal is to destroy the target tumor while sparing healthy tissues and surrounding organs, called Organs at Risk (OARs), from radiation. Thus, delineating the target tumor and the OARs on computed tomography (CT) images is the first step in treatment planning. This segmentation task is mainly performed manually by an expert who relies on experience and on medical guidelines. Manual segmentation is also time-consuming and tedious. For these reasons, an automatic approach could improve and simplify the segmentation of OARs, and thus reduce the harmful effects of radiation therapy.
To make the segmentation of organs at risk automatic and more widely available, we have recently set up a dataset with data acquired at the Henri Becquerel Center (CHB), a regional anti-cancer center in Rouen, France. This dataset, called SegTHOR for Segmentation of THoracic Organs at Risk, contains 60 CT scans from patients with lung cancer or Hodgkin’s lymphoma. In this dataset, we focus on four thoracic organs: the heart, aorta, esophagus and trachea (Fig. 1). These organs have varying shapes and appearances. The esophagus is the most difficult organ to contour: its shape and position vary greatly from one patient to another, and it is almost invisible on CT images.
To the best of our knowledge, few datasets exist for the purpose of organ at risk segmentation. The challenge (http://aapmchallenges.cloudapp.net/competitions/3) proposed by the AAPM (American Association of Physicists in Medicine) has a similar goal: it aims to segment the esophagus, heart, spinal cord, and left and right lungs in CT images. The scans of 30 patients are available as a training set, while the test set includes the scans of 12 patients. The organs differ from those of the SegTHOR dataset: the AAPM dataset does not include the trachea and aorta. Very recently, the StructSeg 2019 challenge (https://structseg2019.grand-challenge.org/) proposed two OAR segmentation tasks. The purpose of the first is to segment 22 OARs in head and neck CT scans from nasopharynx cancer patients. The second aims to segment 6 OARs in chest CT scans from lung cancer patients. These OARs are the same as those of the AAPM challenge, with the trachea in addition. For both databases, 50 CT scans compose the training data and 10 others constitute the test data.
The goal of this paper is to present the SegTHOR dataset and to give some baseline results using state-of-the-art segmentation networks. Note that this dataset has been the subject of a challenge that we organized between January and April 2019, held at the IEEE International Symposium on Biomedical Imaging (ISBI) in April 2019 in Venice, Italy. (Due to a major crash of the Codalab servers in July 2019, all results collected for the challenge have disappeared and are no longer accessible. A new submission system has been set up and is accessible at https://competitions.codalab.org/competitions/21145.)
The paper is organized as follows. The dataset is described in the next section. In Section 3, a brief overview of medical image segmentation is given to identify the state-of-the-art architectures; the proposed 2D network used for the automatic segmentation of thoracic organs on the SegTHOR dataset is then presented. Results are reported in Section 4.
2 The SegTHOR Dataset
The database consists of 60 thoracic CT scans, acquired with or without intravenous contrast, of 60 patients diagnosed with lung cancer or Hodgkin’s lymphoma. These patients were treated by curative-intensive radiotherapy, between February 2016 and June 2017, at the Henri Becquerel Center (CHB, regional anti-cancer center), Rouen, France. All scans are $512 \times 512 \times (150 \sim 284)$ voxels in size, the number of slices varying from patient to patient. The in-plane resolution varies between 0.90 mm and 1.37 mm per pixel and the $z$-resolution fluctuates between 2 mm and 3.7 mm. The most common resolution is $0.98 \times 0.98 \times 2.5$ mm$^3$.
Each CT scan is associated with a manual segmentation performed by an experienced radiotherapist at the CHB, using a SomaVision platform (Varian Medical Systems, Inc., Palo Alto, USA). Manual segmentation takes approximately 30 minutes for each patient. The body and lung contours were segmented with the automatic tools available on the platform. The esophagus was manually delineated from the fourth cervical vertebra to the esophago-gastric junction. The heart was delineated as recommended by the Radiation Therapy Oncology Group. The trachea was contoured from the lower limit of the larynx to 2 cm below the carina, excluding the lobar bronchi. The aorta was delineated from its origin above the heart down to below the diaphragm pillars.
The segmentation of these 4 OARs raises the following challenges. First, the tissues surrounding the heart, the aorta, and especially the esophagus have gray levels similar to those of these organs; the lack of contrast forces the radiotherapist to use anatomical knowledge, resulting in a segmentation that does not rely on the CT scan only. Note that the trachea, on the contrary, is easily identifiable because it is filled with air and thus appears black on the image. Another challenge is the three-dimensional relationship between these OARs: they are intricately interlocked, as shown in Fig. 1. Finally, the 4 OARs have varying shapes and sizes: the esophagus and trachea have a tubular structure and are the smallest organs; the aorta has a cane shape; and the heart, the largest organ, has a blob shape.
To define the SegTHOR dataset, we have split the data into a training set of 40 patients and a test set of 20 patients, which represents 7390 slices of training data and 3694 slices of test data. The dataset is available at https://competitions.codalab.org/competitions/21145, with online automated evaluation. The Dice metric and the Average Hausdorff distance are provided for each OAR of the test set patients.
3 A segmentation framework based on U-Net
3.1 Related work in medical image segmentation
Due to the lack of contrast between the organs and surrounding tissues, OAR segmentation requires external knowledge, such as pairs of CT images and their corresponding manual labelings. Prior knowledge and labeled images have long been used in medical image segmentation to guide the segmentation process in case of noise and occlusion, and to handle object variability. For example, an atlas-based method, in addition to other techniques, was used to segment 17 OARs throughout the body [4]. The segmentation of thoracic organs at risk is obtained in [14] by combining multi-atlas deformable registration with a level set-based local search. In recent times, traditional image segmentation methods have been outperformed by methods based on convolutional neural networks (CNNs). One of the first CNN architectures to allow automatic end-to-end semantic segmentation is the Fully Convolutional Network (FCN) [7]. FCN has paved the way for encoder-decoder segmentation networks. Among its successors, one of the best-known architectures is DeepLab [1], where a combination of dilated convolutions and feature pyramid pooling is introduced. The U-Net architecture [12] is also a popular segmentation framework, initially designed for medical applications. It has a symmetrical encoder-decoder structure: the image is downsampled throughout the encoding path, and upsampled using transposed convolution (also called deconvolution) to reach the initial resolution. Some U-Net variants change the backbone model used for encoding, e.g. VGG, DenseNet, etc. Extensions to 3D have been proposed in the 3D-UNet model [2] and the V-Net model [8]. For example, in [13] a multi-class 3D FCN is trained on CT scans to segment seven abdominal structures. In [10], 21 OARs are segmented in the head and neck using a 3D-UNet architecture. The liver is segmented on CT images with a 3D deeply supervised network in [3], or with a hybrid densely connected UNet architecture in [5]. In [18], a distance map that provides the localization of each organ and the spatial relationship between them is used to guide the segmentation task in a fully convolutional setting.
3.2 A simplified segmentation framework
Since U-Net is the state-of-the-art model for image segmentation, our first intention is to evaluate this architecture [12] on each 2D image of the SegTHOR test dataset. Given the high inter- and intra-patient variability of OAR contours, it is expected to be prone to overfitting. Our strategy has consisted in adapting U-Net to our problem in a few simple steps. The first step to tackle overfitting is to add dropout, which randomly removes neurons with a probability $p$, and therefore their connections, during each step of the training. This prevents the neurons from adapting too much to each other. A second way to reduce overfitting is to limit the number of network layers and feature maps, in order to reduce the number of trainable parameters. The result is a simplified architecture with one less hidden layer and at most 256 feature maps. Finally, we have chosen to replace the transposed convolution (also called deconvolution) by a bilinear interpolation for the upsampling operation in the expansion phase. The former requires learning the weights of its filters, while the latter computes the value of each new pixel from its neighboring pixels through linear interpolations, which further reduces the number of parameters.
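The difference between the two upsampling operations can be checked directly in PyTorch. The snippet below is a minimal sketch: the channel counts and the feature-map size are illustrative assumptions, not values taken from the paper. It only shows that a transposed convolution carries learnable weights whereas bilinear upsampling has none.

```python
import torch
from torch import nn


def count_params(module: nn.Module) -> int:
    """Number of trainable parameters of a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Channel counts (128 -> 64) are assumptions chosen for illustration only.
transposed = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
bilinear = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

x = torch.randn(1, 128, 38, 38)       # hypothetical low-resolution feature map
up_transposed = transposed(x)         # spatial size doubled, filters are learned
up_bilinear = bilinear(x)             # spatial size doubled, nothing to learn

n_transposed = count_params(transposed)
n_bilinear = count_params(bilinear)
```

Note that bilinear upsampling leaves the number of channels unchanged, so any channel reduction has to be done by the convolutions that follow it.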
As shown in Figure 2, our simplified network has an encoder-decoder path composed of 7 convolutional blocks, some of which are connected by skip connections. Each convolutional block consists of two convolution operations with a $3 \times 3$ kernel size. The ReLU (Rectified Linear Unit) activation function and then a batch normalization are applied to the output of each convolution, along with a dropout. In the encoder part, the two convolution operations are followed by a max-pooling operation that halves the spatial resolution of the input, while in the decoder part, the two convolution operations are preceded by a bilinear upsampling operation that doubles the spatial resolution, so that the initial resolution is finally reached. Three skip connections are used to concatenate the features of the first layers with those of the deeper ones to compensate for the loss of resolution. At the end of the network, a last convolution operation with a $1 \times 1$ kernel size produces the feature maps associated with each segmentation class, i.e. the background and the OARs. This architecture has 4.8 million trainable parameters, compared to 7.2 million for the same architecture with the transposed convolution operation, while the original U-Net, based on a VGG backbone, has about 65 million trainable parameters.
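The description above can be sketched in PyTorch as follows. This is a hedged illustration, not the authors' implementation: the class name `SimplifiedUNet`, the channel widths (32, 64, 128, 256), the single input channel and the use of `nn.Dropout` are assumptions; the text only fixes the 7 blocks, the 3 skip connections, the $3 \times 3$ and $1 \times 1$ kernels, the bilinear upsampling and the limit of 256 feature maps. The parameter count of this sketch is therefore not expected to match the figures quoted in the paper.

```python
import torch
from torch import nn
import torch.nn.functional as F


def conv_block(in_ch: int, out_ch: int, p_drop: float) -> nn.Sequential:
    """Two 3x3 convolutions, each followed by ReLU, batch norm and dropout."""
    layers = []
    for ch_in in (in_ch, out_ch):
        layers += [
            nn.Conv2d(ch_in, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_ch),
            nn.Dropout(p_drop),
        ]
    return nn.Sequential(*layers)


class SimplifiedUNet(nn.Module):
    """Hypothetical 7-block encoder-decoder with 3 skip connections."""

    def __init__(self, in_ch=1, n_classes=5, widths=(32, 64, 128, 256), p_drop=0.2):
        super().__init__()
        w1, w2, w3, w4 = widths  # assumed widths, capped at 256 feature maps
        self.enc1 = conv_block(in_ch, w1, p_drop)
        self.enc2 = conv_block(w1, w2, p_drop)
        self.enc3 = conv_block(w2, w3, p_drop)
        self.bottom = conv_block(w3, w4, p_drop)
        self.dec3 = conv_block(w4 + w3, w3, p_drop)
        self.dec2 = conv_block(w3 + w2, w2, p_drop)
        self.dec1 = conv_block(w2 + w1, w1, p_drop)
        # 1x1 convolution: one output map per class (background + 4 OARs).
        self.head = nn.Conv2d(w1, n_classes, kernel_size=1)

    @staticmethod
    def _up(x):
        # Parameter-free bilinear upsampling that doubles the resolution.
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(F.max_pool2d(e1, 2))
        e3 = self.enc3(F.max_pool2d(e2, 2))
        b = self.bottom(F.max_pool2d(e3, 2))
        d3 = self.dec3(torch.cat([self._up(b), e3], dim=1))   # skip connection 1
        d2 = self.dec2(torch.cat([self._up(d3), e2], dim=1))  # skip connection 2
        d1 = self.dec1(torch.cat([self._up(d2), e1], dim=1))  # skip connection 3
        return self.head(d1)


model = SimplifiedUNet().eval()
with torch.no_grad():
    logits = model(torch.randn(1, 1, 304, 304))  # one cropped 304 x 304 slice
```

With three pooling stages, the input side length must be divisible by 8 for the skip connections to line up, which holds for 304.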
4 Experiments and results
All images are normalized by subtracting the image mean and dividing by the standard deviation. We used data augmentation techniques to artificially triple the size of the database. Each image is modified by a random affine transform on the one hand, and by a random deformation of a $2 \times 2 \times 2$ control point grid with a B-Spline interpolation on the other hand [8, 19]. For computational reasons, images are cropped around the center to a size of $304 \times 304$ pixels. In addition, only slices containing at least one of the four organs are passed through the network during training.
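The deterministic part of this preprocessing can be sketched with NumPy. The function names are hypothetical, and two points are assumptions rather than statements from the paper: volumes are stored as arrays of shape (slices, height, width), and label 0 denotes the background. The random affine and B-Spline deformations are not covered here.

```python
import numpy as np


def normalize(image: np.ndarray) -> np.ndarray:
    """Subtract the mean and divide by the standard deviation of the array."""
    return (image - image.mean()) / image.std()


def center_crop(image: np.ndarray, size: int = 304) -> np.ndarray:
    """Crop the last two axes around their center to size x size."""
    h, w = image.shape[-2:]
    top, left = (h - size) // 2, (w - size) // 2
    return image[..., top:top + size, left:left + size]


def slices_with_organs(labels: np.ndarray) -> np.ndarray:
    """Indices of axial slices containing at least one non-background label."""
    flat = labels.reshape(labels.shape[0], -1)
    return np.flatnonzero(flat.any(axis=1))


# Synthetic stand-in for a CT volume and its label map (illustration only).
rng = np.random.default_rng(0)
volume = rng.normal(loc=-300.0, scale=400.0, size=(4, 512, 512))
labels = np.zeros((4, 512, 512), dtype=np.uint8)
labels[1, 250:260, 250:260] = 1
labels[3, 200:210, 300:310] = 3

cropped = center_crop(normalize(volume))
kept = slices_with_organs(center_crop(labels))
training_slices = cropped[kept]
```

Whether normalization is applied per slice or per volume is not specified in the text; `normalize` simply uses whatever array it is given.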
The four classes and the background are highly unbalanced. Indeed, the background represents about 99% of the voxels on average. The remaining voxels are divided into 70.7% for the heart, 23% for the aorta, 3.7% for the esophagus and 2.6% for the trachea. To overcome this problem, the multi-class Dice loss function, a generalization of the binary Dice loss function [8, 16], is used. It is optimized using the stochastic gradient descent algorithm with an initial learning rate of 1e, over mini-batches of size 5. When learning no longer progresses, this learning rate is reduced by a factor of ten. Weight decay and momentum are set to 5e and 9e, respectively. Finally, the weights in the network are initialized by Xavier’s initialization. The deep network is implemented with PyTorch.
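A possible form of this training setup is sketched below. Several elements are assumptions: the soft Dice formulation (one common variant, averaged over classes), the smoothing term `eps`, the use of `ReduceLROnPlateau` to reduce the learning rate tenfold, the `PATIENCE` value, and the uniform flavour of Xavier initialization. The numeric values of `LR`, `WEIGHT_DECAY` and `MOMENTUM` are placeholders, since the exponents are not legible in this copy of the paper.

```python
import torch
from torch import nn
import torch.nn.functional as F


def multiclass_dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss averaged over classes.

    logits: (N, C, H, W) raw network outputs.
    target: (N, H, W) integer labels in [0, C).
    """
    n_classes = logits.shape[1]
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, n_classes).permute(0, 3, 1, 2).to(probs.dtype)
    dims = (0, 2, 3)  # sum over batch and spatial axes, keep the class axis
    intersection = (probs * one_hot).sum(dims)
    denominator = probs.sum(dims) + one_hot.sum(dims)
    dice_per_class = (2 * intersection + eps) / (denominator + eps)
    return 1 - dice_per_class.mean()


# Placeholder hyperparameters: assumed values, not taken from the paper.
LR, WEIGHT_DECAY, MOMENTUM, PATIENCE = 1e-3, 5e-4, 0.9, 5

model = nn.Conv2d(1, 5, kernel_size=1)  # stand-in for the segmentation network
nn.init.xavier_uniform_(model.weight)   # Xavier initialization (uniform variant)

optimizer = torch.optim.SGD(
    model.parameters(), lr=LR, momentum=MOMENTUM, weight_decay=WEIGHT_DECAY
)
# Divide the learning rate by ten when the monitored loss stops improving;
# scheduler.step(validation_loss) would be called once per epoch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=PATIENCE
)
```

Because each class contributes equally to the mean, small organs such as the esophagus and trachea weigh as much as the background in this loss, which is the motivation for using it on unbalanced classes.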
4.3 Evaluation metrics
To quantify our segmentation results, two metrics are used. The first is the Dice score, which measures the overlap between the manual and automatic segmentations. To complement this metric, the Average Hausdorff distance (AHD), in mm, is calculated as the maximum of the average distance from manual to closest automatic contours and the average distance from automatic to closest manual contours. These two scores are obtained for each of the four OARs.
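Both metrics can be written in a few lines of NumPy. The sketch below follows the definitions given above; the function names are hypothetical, the value returned for two empty masks is a convention assumed here, and the extraction of contour points (in mm) from the label volumes is left out. The pairwise distance matrix is computed by brute force, which is only practical for moderately sized contours.

```python
import numpy as np


def dice_score(manual: np.ndarray, automatic: np.ndarray) -> float:
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    manual, automatic = manual.astype(bool), automatic.astype(bool)
    total = manual.sum() + automatic.sum()
    if total == 0:
        return 1.0  # both masks empty: convention assumed for this sketch
    return float(2.0 * np.logical_and(manual, automatic).sum() / total)


def average_hausdorff(manual_pts: np.ndarray, automatic_pts: np.ndarray) -> float:
    """Average Hausdorff distance between two (N, D) sets of contour points in mm."""
    diff = manual_pts[:, None, :] - automatic_pts[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # pairwise Euclidean distances
    manual_to_auto = dist.min(axis=1).mean()      # manual -> closest automatic
    auto_to_manual = dist.min(axis=0).mean()      # automatic -> closest manual
    return float(max(manual_to_auto, auto_to_manual))


# Toy example: two 4 x 4 masks that overlap on one row out of two.
a = np.zeros((4, 4), dtype=bool)
b = np.zeros((4, 4), dtype=bool)
a[0:2, :] = True
b[1:3, :] = True
toy_dice = dice_score(a, b)

# Toy contours on a line, coordinates in mm.
manual_contour = np.array([[0.0, 0.0], [1.0, 0.0]])
automatic_contour = np.array([[0.0, 0.0], [4.0, 0.0]])
toy_ahd = average_hausdorff(manual_contour, automatic_contour)
```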
In the first experiment, we compare the performance of U-Net with that of the simplified sU-Net. We also assess the difference in segmentation accuracy without and with dropout, with the drop probability $p$ set to 0.2. Next, we assess two different configurations in the decoder phase, which are used to recover the initial resolution of the image: (i) with a 2D transposed convolution operation (denoted conv2Dtranspose in the result table), and (ii) with a bilinear upsampling operation. Whenever necessary, we assess the statistical significance of the results by performing a Wilcoxon signed-rank test on the Dice values (as well as on the AHD values) of the two methods of interest, at a confidence level of 95%.
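Such a paired test can be run with SciPy's `scipy.stats.wilcoxon`. The example below uses synthetic per-patient Dice values generated for illustration only; they are not results from the paper, and the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Synthetic Dice values for 20 test patients and two methods (illustration only).
dice_method_a = rng.uniform(0.70, 0.85, size=20)
dice_method_b = dice_method_a + rng.uniform(0.01, 0.05, size=20)

# Paired, two-sided Wilcoxon signed-rank test on the per-patient differences.
statistic, p_value = wilcoxon(dice_method_a, dice_method_b)

# Difference deemed significant at the 95% confidence level.
significant = p_value < 0.05
```

The test is paired: both arrays must list the same patients in the same order, since it operates on the per-patient differences.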
Comparison of sU-Net vs U-Net and influence of dropout. Results are reported in Table 1. Comparing sU-Net to U-Net without dropout (columns (1) and (3)), it can be seen that the results are similar. If dropout is included in both networks, sU-Net shows enhanced performance compared to U-Net (columns (2) vs (4)) for all organs but the trachea. This is confirmed by the $p$-values of the Wilcoxon test, which are below the 0.05 threshold for the esophagus, trachea, and aorta. Some qualitative results illustrating the difference between U-Net and sU-Net for the esophagus are given in Figure 3. The contribution of dropout to the sU-Net framework can be assessed by comparing columns (3) and (4): for 3 out of the 4 OARs, dropout provides a substantial improvement, especially for the esophagus.
| OAR | Metric | (1) U-Net without DR | (2) U-Net with DR | (3) sU-Net without DR | (4) sU-Net with DR |
|---|---|---|---|---|---|
| Esophagus | Dice | 0.76 ± 0.10 | 0.79 ± 0.08 | 0.75 ± 0.11 | 0.82 ± 0.05 |
| | AHD (mm) | 1.74 ± 2.77 | 0.94 ± 0.63 | 1.69 ± 2.02 | 0.70 ± 0.39 |
| Trachea | Dice | 0.85 ± 0.05 | 0.85 ± 0.04 | 0.86 ± 0.04 | 0.85 ± 0.04 |
| | AHD (mm) | 1.32 ± 1.20 | 1.30 ± 1.12 | 1.06 ± 0.83 | 1.21 ± 1.13 |
| Aorta | Dice | 0.92 ± 0.05 | 0.91 ± 0.04 | 0.91 ± 0.02 | 0.91 ± 0.03 |
| | AHD (mm) | 0.50 ± 0.64 | 0.77 ± 0.93 | 0.57 ± 0.65 | 0.58 ± 0.67 |
| Heart | Dice | 0.93 ± 0.03 | 0.93 ± 0.03 | 0.92 ± 0.03 | 0.93 ± 0.03 |
| | AHD (mm) | 0.23 ± 0.21 | 0.25 ± 0.28 | 0.31 ± 0.22 | 0.27 ± 0.20 |
Influence of upsampling method for the decoder. Comparing the transposed convolution method and the bilinear interpolation method in Table 2, we find that the Dice and AHD values are not significantly different ($p > 0.05$) for all organs but the aorta, for which the $p$-value is 1.2e in favor of bilinear upsampling. Thus a bilinear upsampling operation is sufficient in this application. Moreover, choosing bilinear interpolation can help reduce computation time.
| OAR | Metric | U-Net | sU-Net, conv2Dtranspose | sU-Net, bilinear |
|---|---|---|---|---|
| Esophagus | Dice | 0.79 ± 0.08 | 0.82 ± 0.05 | 0.81 ± 0.06 |
| | AHD (mm) | 0.94 ± 0.63 | 0.70 ± 0.39 | 0.68 ± 0.35 |
| Trachea | Dice | 0.85 ± 0.04 | 0.85 ± 0.04 | 0.86 ± 0.04 |
| | AHD (mm) | 1.30 ± 1.12 | 1.21 ± 1.13 | 1.08 ± 0.85 |
| Aorta | Dice | 0.91 ± 0.04 | 0.91 ± 0.03 | 0.92 ± 0.02 |
| | AHD (mm) | 0.77 ± 0.93 | 0.58 ± 0.67 | 0.52 ± 0.66 |
| Heart | Dice | 0.93 ± 0.03 | 0.93 ± 0.03 | 0.93 ± 0.03 |
| | AHD (mm) | 0.25 ± 0.28 | 0.27 ± 0.20 | 0.26 ± 0.22 |
4.5 Labeling issue
Manual segmentation of the SegTHOR dataset is tailored to the needs of radiotherapy and has not been performed for systematic segmentation evaluation. Thus, following the recommendations for manual segmentation, some slices located at the bottom or the top of the patient CT scan were not segmented. While this lack of manual labeling does not hinder the evaluation of heart segmentation, it may be a problem for tubular organs that are perpendicular to the axial plane, such as the esophagus, the trachea, and to a lesser extent, the aorta. For a majority of the 20 test patients, the automatic segmentation of the esophagus, trachea and aorta extends beyond the upper and lower limits of the manual segmentation, as shown in Figure 4, and produces a labeling that is counted as missegmentation, since the corresponding ground truth (GT) does not exist.
We have thus run new experiments to assess the gain obtained when evaluating on the restricted range of slices where the GT is present. From Table 3, one can gather that for the esophagus, trachea and aorta, there is an improvement in Dice scores, especially for the trachea, and an even more marked improvement in Average Hausdorff distances. For future submissions on the Codalab platform, we now offer two types of evaluation of the predicted segmentation: on all slices, and on slices where the GT is present, i.e. by restricting the evaluation to a range of slices.
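The restricted evaluation can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code of the Codalab platform: the function names and the organ label value are hypothetical, volumes are assumed to be label arrays of shape (slices, height, width), and the range is taken per organ, between the first and last slices annotated in the GT.

```python
import numpy as np


def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice overlap between two binary masks (1.0 if both are empty)."""
    total = a.sum() + b.sum()
    return 1.0 if total == 0 else float(2.0 * np.logical_and(a, b).sum() / total)


def restrict_to_gt_range(gt: np.ndarray, pred: np.ndarray, organ: int):
    """Binary masks of `organ`, limited to the axial range annotated in the GT."""
    gt_mask, pred_mask = gt == organ, pred == organ
    annotated = np.flatnonzero(gt_mask.any(axis=(1, 2)))
    if annotated.size == 0:
        return gt_mask[:0], pred_mask[:0]  # organ never annotated: empty range
    first, last = annotated[0], annotated[-1]
    return gt_mask[first:last + 1], pred_mask[first:last + 1]


# Toy volume: the GT covers slices 2-3, the prediction extends to slices 1-4.
ORGAN = 1  # hypothetical label value
gt = np.zeros((6, 4, 4), dtype=int)
pred = np.zeros_like(gt)
gt[2:4, 1:3, 1:3] = ORGAN
pred[1:5, 1:3, 1:3] = ORGAN

dice_all_slices = dice(gt == ORGAN, pred == ORGAN)
gt_r, pred_r = restrict_to_gt_range(gt, pred, ORGAN)
dice_restricted = dice(gt_r, pred_r)
```

In this toy case the prediction is perfect wherever the GT exists, so the restricted Dice is higher than the Dice on all slices, which mirrors the effect discussed above.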
| OAR | Metric | original dataset | restricted dataset |
|---|---|---|---|
| Esophagus | Dice | 0.81 ± 0.06 | 0.83 ± 0.06 |
| | AHD (mm) | 0.68 ± 0.35 | 0.32 ± 0.20 |
| Trachea | Dice | 0.86 ± 0.04 | 0.92 ± 0.02 |
| | AHD (mm) | 1.08 ± 0.85 | 0.15 ± 0.09 |
| Aorta | Dice | 0.92 ± 0.02 | 0.93 ± 0.02 |
| | AHD (mm) | 0.52 ± 0.66 | 0.19 ± 0.31 |
| Heart | Dice | 0.93 ± 0.03 | 0.93 ± 0.03 |
| | AHD (mm) | 0.26 ± 0.22 | 0.16 ± 0.15 |
5 Discussion and conclusion
In this paper we have introduced SegTHOR, a dataset for the segmentation of organs at risk in CT images, available from the Codalab platform. The aim of the SegTHOR challenge is to foster research on this clinical application, but also to inspire the field of multilabel segmentation for (volumetric) anatomical images. We have presented several variants of a U-Net based architecture that may be used as first-line processing when dealing with a new medical image segmentation problem. Given the limited amount of data available, an architecture that is too deep and includes a large number of feature maps does not seem suitable for our semantic segmentation problem, in particular for the segmentation of the esophagus. We have presented a simplified CNN that is more appropriate to the problem at hand. Results show that the addition of dropout has a major influence on accuracy, and helps improve both the Dice metric and the AHD for most organs. In the decoding phase, the transposed convolution did not yield improved results compared to the bilinear upsampling operation; in this case, bilinear interpolation should be favored to reduce computation time.
One limitation of our approach is that we use only a single reference segmentation. It is known that the variability of manual segmentation, be it intra- or inter-expert, is not negligible. Most importantly, the OAR segmentation has a tremendous influence on dosimetric metrics. Thus our next step will be to quantitatively assess the influence of OAR segmentation on dosimetry. In a study of a patient with oropharyngeal cancer [9], the authors found substantial dose differences resulting strictly from contouring variation, depending on the size, shape and location of the OAR. This emphasizes the need to accurately contour the OARs, in addition to the target tumor, when planning radiotherapy. A dosimetric study would also make it possible to avoid the labeling issue present in the dataset.
Another use case of this dataset could be weakly supervised learning for image segmentation, or the handling of missing annotations [11]. Weakly supervised learning makes it possible to reach a full segmentation from partially annotated data, thus reducing the cost of full annotation. This paradigm raises new challenges (how to leverage the weak labels? how to model and make use of external knowledge to help in the process?) and has been identified as a hot topic for the coming years [17].
This project was co-financed by the European Union with the European regional development fund (ERDF, 18P03390/18E01750/18P02733) and by the Haute-Normandie Regional Council via the M2SINUM project. The authors would like to thank Prof. Carole Le Guyader (LMI, INSA Rouen) for her advice and the CRIANN (Centre des Ressources Informatiques et Applications Numérique de Normandie, France) for providing computational resources.
No conflicts of interest, financial or otherwise, are declared by the authors.
[1] (2017) DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
[2] (2016) 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In MICCAI, pp. 424–432.
[3] (2016) 3D Deeply Supervised Network for Automatic Liver Segmentation from CT Volumes. CoRR abs/1607.00582.
[4] (2015) Segmentation of organs at risk in CT volumes of head, thorax, abdomen, and pelvis. In SPIE Medical Imaging 2015: Image Processing, Vol. 9413, pp. 94133J.
[5] (2017) H-DenseUNet: Hybrid Densely Connected UNet for Liver and Liver Tumor Segmentation from CT Volumes. CoRR abs/1709.07330.
[6] A survey on deep learning in medical image analysis. Medical Image Analysis 42, pp. 60–88.
[7] (2015) Fully Convolutional Networks for Semantic Segmentation. pp. 3431–3440.
[8] (2016) V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In IEEE International Conference on 3D Vision (3DV), pp. 565–571.
[9] (2012) Variations in the contouring of organs at risk: test case from a patient with oropharyngeal cancer. International Journal of Radiation Oncology, Biology, Physics 82 (1), pp. 368–378.
[10] (2018) Deep learning to achieve clinically applicable segmentation of head and neck anatomy for radiotherapy. arXiv preprint arXiv:1809.04430.
[11] (2018) Handling missing annotations for semantic segmentation with deep convnets. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 20–28.
[12] (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
[13] (2017) Hierarchical 3D fully convolutional networks for multi-organ segmentation. arXiv preprint arXiv:1704.06382.
[14] (2014) Multiatlas segmentation of thoracic and abdominal anatomy with level set-based local search. Journal of Applied Clinical Medical Physics 15 (4), pp. 22–38.
[15] Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
[16] (2017) Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. CoRR abs/1707.03237.
[17] (2019) Embracing imperfect datasets: a review of deep learning solutions for medical image segmentation.
[18] (2019) Multi-Organ Segmentation using Distance-Aware Adversarial Networks. Journal of Medical Imaging 6 (1), pp. 014001.
[19] Segmentation of Organs at Risk in thoracic CT images using a SharpMask architecture and Conditional Random Fields. In IEEE International Symposium on Biomedical Imaging, pp. 1003–1006.
[20] (2016) A review of interventions to reduce inter-observer variability in volume delineation in radiation oncology. Journal of Medical Imaging and Radiation Oncology 60 (3), pp. 393–406.