Organs at risk (OARs) in the head and neck area are a group of organs that may be damaged during radiotherapy. Their three-dimensional segmentation in Computed Tomography (CT) images is the first step required for reliable planning in image-guided radiotherapy during head and neck cancer treatment. Producing 3D segmentations of clinical data manually is tedious, and effort is therefore being put into developing automatic methods that can produce accurate segmentation masks for the objects of interest. For head and neck OARs, however, this is a very challenging task because the soft tissue structures have very little contrast.
The MICCAI 2015 Head and Neck Auto Segmentation Challenge provides a dataset for evaluation of head and neck OAR segmentation methods. The challenge also established baseline methods for segmentation of each structure. During the challenge, most approaches relied on statistical shape model or active appearance model registration with atlas-based initialization. Although several methods were able to produce segmentations of every structure, smaller objects in particular, such as the submandibular glands, optic nerves, and optic chiasm, did not reach the accuracy required for clinical application. More recent methods use convolutional neural networks (CNNs), which have gained popularity in most computer vision fields, including medical image data analysis, since the introduction of AlexNet in 2012.
Fritscher et al. first used a patch-based CNN with three orthogonal input patches to obtain a pixel-wise prediction map for each structure. Using an atlas-based probability map for each structure as an additional input slightly increased the accuracy of parotid and submandibular gland segmentation. Concurrently, Ibragimov et al. used a very similar model with a Markov random field post-processing step. They obtained segmentations of all OARs, but accuracy remained low for the optic chiasm and optic nerves. Most recently, Wang et al. applied a hierarchical random forest vertex regression method to some of the OARs, further improving the accuracy of brain stem, mandible, and parotid gland segmentation.
In this paper, we design a CNN with an encoder-decoder architecture, first used in biomedical segmentation by Ronneberger et al., and evaluate its performance on the head and neck OAR segmentation task. Since the model fails to learn some of the structures when trained using the standard cross-entropy loss, we make use of the Dice loss function, first introduced by Pastor-Pellicer et al. and later used for medical segmentation by Milletari et al. This form of soft Dice loss has been employed quite extensively in recent literature [12, 13]. Despite obtaining acceptable results on most of the structures, we also observed rather low performance on smaller low-contrast structures when using the standard Dice loss. We propose a modification to the standard soft Dice loss function, the Batch soft Dice Loss, to overcome this problem and show that it enables the model to outperform models trained using other loss functions. For optic chiasm and optic nerve segmentation in particular, we improve on current state-of-the-art methods in terms of the Dice overlap measure.
2 Proposed Method for Organ Segmentation
In this section, we describe the architecture of our CNN model and different loss functions that have been evaluated. We also give details on the training phase and data preprocessing.
2.1 Head and Neck Segmentation Challenge
The dataset includes CT scans of patients with manual segmentations of 6 anatomical structures: brain stem (BS), mandible (MA), optic chiasm (OC), bilateral optic nerves (ON), parotid glands (PG), and submandibular glands (SG). A total of 48 patient scans of the head and neck area are available in the challenge dataset. However, 18 of these scans contain incomplete ground truth annotation for some structures. Since our model is trained using image patches that span almost the complete head area, these scans had to be excluded from our experiments to prevent introducing false background voxels into the ground truth.
2.2 Model Architecture
Our segmentation model architecture is of the encoder-decoder type. In the first, encoder part, convolutional layers are coupled with max-pooling layers to increase the field of view of deeper features while decreasing their resolution. In the second, decoder part, the features are upscaled again using bilinear interpolation. Each upscaling step is accompanied by concatenating the feature maps of matching resolution from the encoder part of the model to improve the gradient flow through the model.
We limit our model to operate only on two-dimensional axial slices for the following two reasons. First, because we use image patches large enough to include sufficient context, there is an intra-image class imbalance issue. Although we compensate for this during training, as described in the following sections, using three-dimensional image patches amplifies the issue because some structures, such as the optic chiasm, are highly planar in the axial plane. Second, memory requirements are an issue as well. Multi-class segmentation requires the mini-batches used in the training phase to contain a balanced number of image patches containing each of the structures in order to correctly compute the gradient step. This is easier to accomplish with 2D image patches. Our results show that the two-dimensional approach has only a small impact on the z-dimension discontinuities and on the overall performance.
The architecture scheme is shown in Figure 2. We use only a standard convolution kernel size. Each convolutional block comprises convolution kernel filtering, batch normalization, and ReLU activation, except for the last convolutional block, which uses softmax activation to produce the final label probabilities. Unlike U-Net, we do not use any regularization beyond batch normalization since the model does not tend to overfit. The concatenation skip connections that we employ have been shown to perform better than the popular residual skip connections in segmentation of medical volumetric data. U-Net uses deconvolution layers to upscale feature maps, but our experiments showed that bilinear interpolation performs at least as well. This is likely because the concatenation skip connections, which were not in use when deconvolutional CNNs were first introduced, provide the model with sufficient information about fine structure during upscaling.
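To make the data flow through one resolution level concrete, the following is a minimal NumPy sketch of the three operations named above: max-pooling, bilinear upscaling, and concatenation of the encoder feature map. The convolutional blocks are omitted, and the function names, the 2x pooling factor, and the feature map sizes are illustrative assumptions rather than details taken from the model.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling on a (H, W, C) feature map with even H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_bilinear_2x(x):
    """Double the spatial size of a (H, W, C) map by separable linear interpolation."""
    h, w, c = x.shape
    rows = np.linspace(0, h - 1, 2 * h)
    cols = np.linspace(0, w - 1, 2 * w)
    r0 = np.floor(rows).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c0 = np.floor(cols).astype(int)
    c1 = np.minimum(c0 + 1, w - 1)
    fr = (rows - r0)[:, None, None]
    fc = (cols - c0)[None, :, None]
    x = x[r0] * (1 - fr) + x[r1] * fr            # interpolate along rows
    return x[:, c0] * (1 - fc) + x[:, c1] * fc   # then along columns

# Hypothetical encoder feature map: 64x64 pixels, 16 channels.
encoder_features = np.random.default_rng(0).random((64, 64, 16))
pooled = max_pool_2x2(encoder_features)          # encoder: lower resolution
upsampled = upsample_bilinear_2x(pooled)         # decoder: back to 64x64
# Skip connection: concatenate encoder features of matching resolution.
decoder_input = np.concatenate([upsampled, encoder_features], axis=-1)
print(pooled.shape, upsampled.shape, decoder_input.shape)
```

In a full model, convolutional blocks would sit between these steps; the sketch only shows how resolution and channel count change.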
2.3 Loss functions & Optimization
Several different multi-class loss functions used for segmentation in current literature were evaluated in this paper along with our proposed batch Dice loss. We use the following notation. Let the number of image patches in a training mini-batch be $N$ and let each image patch consist of $M$ pixels. The segmentation model then maps each of the $N \cdot M$ pixels in the mini-batch to a probability $p_{i,l}$ for each of the $L$ labels. The training procedure ensures that the resulting output label probability vectors $p_i$ correspond to the one-hot encoded ground truth label vectors $y_i$ as closely as possible on the training data. During inference, we choose the output label of each pixel as $\hat{l}_i = \arg\max_l p_{i,l}$.
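The inference rule and the one-hot encoding can be sketched in a few lines of NumPy. The array layout (patches, height, width, labels) and the function names are assumptions made for illustration.

```python
import numpy as np

def predict_labels(probs):
    """probs: (N, H, W, L) softmax outputs -> (N, H, W) map of most probable labels."""
    return np.argmax(probs, axis=-1)

def one_hot(labels, num_labels):
    """Integer label map -> one-hot vectors along a new last axis."""
    return np.eye(num_labels, dtype=np.float64)[labels]

# One patch of 1x2 pixels with three labels.
probs = np.array([[[[0.7, 0.2, 0.1],
                    [0.1, 0.3, 0.6]]]])
labels = predict_labels(probs)      # label 0 for the first pixel, 2 for the second
targets = one_hot(labels, 3)
print(labels, targets.shape)
```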
Cross-Entropy. Also known as log-loss, cross-entropy is the most widely used loss function for classification CNNs. When applied to a segmentation task, cross-entropy measures the divergence of the predicted probability from the ground truth label for each pixel separately and then averages the value over all pixels in the mini-batch.
This loss function tends to under-estimate the prediction probabilities for classes that are under-represented in the mini-batch, which is inevitable in our training data, as can also be seen in Figure 3.
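Written out, the averaged cross-entropy takes the standard form below. The symbols are assumptions chosen for illustration: $N$ patches of $M$ pixels each, $L$ labels, $p_{i,l}$ the predicted probability of label $l$ at pixel $i$, and $y_{i,l}$ the corresponding one-hot ground truth entry.

$$
\mathcal{L}_{CE} = -\frac{1}{NM} \sum_{i=1}^{NM} \sum_{l=1}^{L} y_{i,l} \log p_{i,l}
$$

Because every pixel contributes equally, a label covering only a few pixels has little influence on the sum, which is consistent with the under-estimation described above.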
Weighted Cross-Entropy. The tendency to under-estimate can be mitigated by assigning higher weights to loss contributions from pixels with under-represented class labels. Each pixel is assigned a weight computed from the prior probability of its ground truth label in the given mini-batch.
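A minimal NumPy sketch of plain and weighted cross-entropy follows. It assumes that the weight of a pixel is the inverse of the prior probability of its ground truth label in the mini-batch and that the weighted sum is normalized by the total weight; both choices, like the function names, are assumptions for illustration and may differ from the exact weighting used in the experiments.

```python
import numpy as np

def cross_entropy(probs, onehot, weights=None, eps=1e-7):
    """probs, onehot: (P, L) arrays over all P pixels of the mini-batch."""
    per_pixel = -np.sum(onehot * np.log(probs + eps), axis=1)
    if weights is None:
        return per_pixel.mean()
    # Assumed normalization: divide by the total weight.
    return np.sum(weights * per_pixel) / np.sum(weights)

def inverse_prior_weights(onehot):
    """Assumed weighting: 1 / (frequency of the pixel's ground truth label in the batch)."""
    prior = onehot.mean(axis=0)                 # (L,) label frequencies
    safe = np.where(prior > 0, prior, 1.0)      # labels absent from the batch get no pixels
    return onehot @ (1.0 / safe)

# 98 background pixels, 2 foreground pixels; the model predicts 0.9 background everywhere.
onehot = np.zeros((100, 2))
onehot[:98, 0] = 1.0
onehot[98:, 1] = 1.0
probs = np.tile([0.9, 0.1], (100, 1))

plain = cross_entropy(probs, onehot)
weighted = cross_entropy(probs, onehot, inverse_prior_weights(onehot))
print(plain, weighted)
```

With these weights both labels contribute equally to the loss, so ignoring the two foreground pixels is penalized far more heavily than in the unweighted case.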
Soft Dice. Inspired by the Dice coefficient often used to evaluate binary segmentation accuracy, the differentiable soft Dice loss was introduced by Milletari et al. to tackle the class imbalance issue without the need for explicit weighting. Several formulations of it are in use.
The loss generalizes easily to multi-class segmentation by treating each image as a 3D volume whose third dimension is the position in the one-hot encoded label vector.
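One formulation, following the squared-denominator form of Milletari et al., is given below for a single image patch $n$ and averaged over the mini-batch. The symbols are assumptions chosen for illustration ($p_{i,l}$ predicted probability, $y_{i,l}$ one-hot ground truth, $N$ patches, $L$ labels), and the exact variant used in the experiments may differ, for example by using unsquared sums in the denominator.

$$
\mathcal{L}_{Dice}^{(n)} = 1 - \frac{2 \sum_{i \in n} \sum_{l=1}^{L} p_{i,l}\, y_{i,l}}{\sum_{i \in n} \sum_{l=1}^{L} p_{i,l}^{2} + \sum_{i \in n} \sum_{l=1}^{L} y_{i,l}^{2}},
\qquad
\mathcal{L}_{Dice} = \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_{Dice}^{(n)}
$$

The double sum over pixels and labels is what treating the label axis as a third image dimension amounts to.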
Batch Soft Dice. We hypothesize that one of the advantages of the soft Dice loss is that it is a global operator, as opposed to the point-wise cross-entropy, and is therefore able to better estimate the correct overall gradient direction. Our modification extends the computation by treating the whole data mini-batch as a single 4-dimensional tensor during the loss computation. In other words, instead of computing the Dice loss over the voxels of each image patch separately and then averaging, we compute a single Dice loss over all voxels in the mini-batch without averaging.
Our intuition behind this choice is that during the training phase, the standard Dice loss gradient estimation on a single image/slice does not take into account the fact that the same set of filters should also be capable of segmenting structures not present in the current training slice. This is tackled by averaging the gradient over multiple slices in the batch. This can, however, cause individual gradients to more or less cancel out if their directions are very different. By contrast, computing the Dice loss gradient over the whole batch of slices as a single global operator should enforce the gradient to steadily push the filter weights towards the correct segmentation of each structure in the batch.
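The difference between the two reductions can be sketched in NumPy as follows. The squared-denominator Dice form, the (patches, height, width, labels) layout, the `eps` stabilizer, and the function names are assumptions for illustration, not the exact implementation used in the experiments.

```python
import numpy as np

def soft_dice_loss(probs, onehot, eps=1e-7):
    """Standard soft Dice: one loss per patch, then averaged. Inputs: (N, H, W, L)."""
    axes = (1, 2, 3)
    inter = np.sum(probs * onehot, axis=axes)
    denom = np.sum(probs ** 2, axis=axes) + np.sum(onehot ** 2, axis=axes)
    return np.mean(1.0 - 2.0 * inter / (denom + eps))

def batch_soft_dice_loss(probs, onehot, eps=1e-7):
    """Batch soft Dice: a single loss over the mini-batch treated as one 4-D tensor."""
    inter = np.sum(probs * onehot)
    denom = np.sum(probs ** 2) + np.sum(onehot ** 2)
    return 1.0 - 2.0 * inter / (denom + eps)

# Two 2x2 patches, two labels. Patch 0 is predicted perfectly,
# patch 1 gets an uninformative 0.5 / 0.5 prediction everywhere.
onehot = np.zeros((2, 2, 2, 2))
onehot[..., 0] = 1.0
probs = onehot.copy()
probs[1] = 0.5

per_patch = soft_dice_loss(probs, onehot)       # mean of 0 and 1/3
whole_batch = batch_soft_dice_loss(probs, onehot)  # 1 - 12/14
print(per_patch, whole_batch)
```

The two values differ because the batch variant pools numerator and denominator across patches before taking the ratio, so no per-patch losses are averaged.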
It should be noted that, for clarity, we omitted from all of the above equations the regularizing term used to avoid division by zero.
In the training phase, the model weights are updated by the Adam optimizer, with each step computed over a mini-batch of 30 image patches. As some structures are under-represented in the dataset, we use standard data augmentation techniques such as random flips, translations, and elastic transformations to prevent overfitting.
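As an illustration of the augmentation step, the sketch below applies the same random horizontal flip and integer translation to a 2D slice and its label map. Elastic transformations are left out, and the function names, the maximum shift, and the fill values (minimum image intensity, background label 0) are assumptions for illustration.

```python
import numpy as np

def shift(arr, dy, dx, fill):
    """Translate a 2-D array by (dy, dx) pixels, filling uncovered pixels with `fill`."""
    out = np.full_like(arr, fill)
    h, w = arr.shape
    src = arr[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    y0, x0 = max(0, dy), max(0, dx)
    out[y0:y0 + src.shape[0], x0:x0 + src.shape[1]] = src
    return out

def augment_slice(image, labels, rng, max_shift=8):
    """Apply one random flip and translation identically to image and labels."""
    if rng.random() < 0.5:
        image, labels = image[:, ::-1], labels[:, ::-1]
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return shift(image, dy, dx, fill=image.min()), shift(labels, dy, dx, fill=0)

rng = np.random.default_rng(0)
labels = np.zeros((64, 64), dtype=np.int64)
labels[28:36, 28:36] = 3                 # a small hypothetical structure with label 3
image = labels.astype(np.float64) * 100  # toy image aligned with the labels
aug_image, aug_labels = augment_slice(image, labels, rng)
```

Applying identical transforms to both arrays keeps the ground truth aligned with the image.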
3 Experimental Results
We optimized our model until convergence using each of the loss functions, with 25 training patient scans to keep the challenge format. We cross-validated the models on different test and training scan subsets so that a total of 10 scans were used for testing.
We first demonstrate the performance of the models on optic nerve, optic chiasm, and brain stem segmentation. On the other structures, the difference in performance between models trained with different loss functions is less significant. The results in terms of the Dice coefficient (the measure of segmentation quality on which the loss function is based) are shown in Figure 3. The superiority of the soft Dice-trained models can be observed. However, the standard soft Dice-trained model reaches significantly lower precision on the optic nerves and sometimes misses the structure altogether, as also illustrated in Figure 1. The model trained using the proposed batch soft Dice loss does not seem to suffer from this issue, and we therefore conclude that it is more suitable for training models for segmentation of small anatomical structures with low contrast, such as head and neck OARs.
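For reference, the Dice coefficient used as the evaluation measure can be computed on binary masks as in the following sketch. The function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration.

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask):
    """Dice overlap 2|A and B| / (|A| + |B|) between two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    total = pred.sum() + true.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, true).sum() / total

partial = dice_coefficient([1, 1, 0, 0], [0, 1, 1, 0])   # one of two voxels overlaps
missed = dice_coefficient([0, 0, 0, 0], [0, 1, 1, 0])    # structure missed altogether
print(partial, missed)
```

A structure that the model misses entirely scores zero, which is why missed optic nerves weigh so heavily on the reported results.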
Except for several outlier cases where optic nerve segmentation reached a lower Dice coefficient, we obtained acceptable results for all structures. We quantitatively compare our method with other published methods in terms of the Dice coefficient (Table 1) and in terms of average surface distance (Table 2). Although the difference is most pronounced for optic nerve and optic chiasm segmentation, our model also surpasses current state-of-the-art results on all the remaining OARs.
|Method||BS||OC||MA||ON||PG||SG|
|MICCAI 2015 ||88||55||93||62||84||78|
|Fritscher et al. ||-||-||-|
|Ibragimov et al. ||-||37.4||89.5||64.2||77.3||71.4|
|Wang et al. ||90.3||-||94.4||-||82.6||-|
We designed an encoder-decoder CNN model for head and neck OAR segmentation and proposed the Batch Dice Loss for multi-class segmentation of structures with small sizes. We compared the loss function to other standard loss functions in terms of their ability to optimize a model for OAR segmentation. The model trained using the batch Dice loss reached the best performance when compared to other loss functions and also to current state-of-the-art methods on this dataset.
In future work, we will evaluate the performance of the batch Dice loss when applied to the optimization of different models. These could include models trained on different datasets where three-dimensional models would be feasible. We will also assess whether the loss is beneficial for binary segmentation. Another potential area to explore is explicitly weighting the loss function according to current classification performance rather than prior occurrence probabilities.
This work was supported in part by the company TESCAN 3DIM (fka 3Dim Laboratory) and by the Technology Agency of the Czech Republic project TE01020415 (V3C – Visual Computing Competence Center).
- [1] Dawson, L. A., Sharpe, M. B.: Image-guided radiotherapy: rationale, benefits, and limitations. Lancet Oncology, 7(10), 848–858 (2006)
- [2] Raudaschl, P. F., Zaffino, P., Sharp, G. C., Spadea, M. F., Chen, A., Dawant, B. M., Fritscher, K. D.: Evaluation of segmentation methods on head and neck CT: Auto-segmentation challenge 2015. Medical Physics, 44(5), 2020–2036 (2017)
- [3] Heimann, T., Meinzer, H. P.: Statistical shape models for 3D medical image segmentation: a review. Med. Image Anal., 13, 543–563 (2009)
- [4] Jung, F., Steger, S., Knapp, O., Noll, M., Wesarg, S.: COSMO - coupled shape model for radiation therapy planning of head and neck cancer. Clinical Image-Based Procedures, LNCS, vol. 8680. Springer, Cham, 25–32 (2014)
- [5] Krizhevsky, A., Sutskever, I., Hinton, G. E.: Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
- [6] Fritscher, K., Raudaschl, P., Zaffino, P., Spadea, M. F., Sharp, G. C., Schubert, R.: Deep Neural Networks for Fast Segmentation of 3D Medical Images. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016. Lecture Notes in Computer Science, vol 9901. Springer, Cham (2016)
- [7] Ibragimov, B., Xing, L.: Segmentation of organs-at-risks in head and neck CT images using convolutional neural networks. Medical Physics, 44(2), 547–557 (2017)
- [8] Wang, Z., Wei, L., Wang, L., Gao, Y., Chen, W., Shen, D.: Hierarchical Vertex Regression-Based Segmentation of Head and Neck CT Images for Radiotherapy Planning. IEEE Transactions on Image Processing, 27(2), 923–937 (2018)
- [9] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Lecture Notes in Computer Science, vol 9351. Springer, Cham (2015)
- [10] Milletari, F., Navab, N., Ahmadi, S.: V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. Fourth International Conference on 3D Vision (3DV), Stanford, CA, pp. 565–571 (2016)
- [11] Pastor-Pellicer, J., Zamora-Martínez, F., España-Boquera, S., Castro-Bleda, M. J.: International Work-Conference on Artificial Neural Networks, 376–384 (2013)
- [12] Sudre, C. H., Li, W., Vercauteren, T., Ourselin, S., Cardoso, M. J.: Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. DLMIA 2017, ML-CDS 2017. Lecture Notes in Computer Science, vol 10553. Springer, Cham (2017)
- [13] Fidon, L., Li, W., Garcia-Peraza-Herrera, L. C., Ekanayake, J., Kitchen, N., Ourselin, S., Vercauteren, T.: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. BrainLes 2017. Lecture Notes in Computer Science, vol 10670. Springer, Cham (2018)
- [14] Kayalibay, B., Jensen, G., Smagt, P.: CNN-based Segmentation of Medical Imaging Data. https://arxiv.org/pdf/1701.03056.pdf (2017)
- [15] Dice, L. R.: Measures of the Amount of Ecologic Association Between Species. Ecology, 26(3), 297–302 (1945)
- [16] Van Ginneken, B., Heimann, T., Styner, M.: 3D segmentation in the clinic: A grand challenge. MICCAI Workshop on 3D Segmentation in the Clinic: A Grand Challenge (2007)