Quantitative assessment of brain tumors provides valuable information and therefore constitutes an essential part of diagnostic procedures. Automatic segmentation is attractive in this context, as it allows for faster, more objective and potentially more accurate description of relevant tumor parameters, such as the volume of its subregions. Due to the irregular nature of tumors, however, the development of algorithms capable of automatic segmentation remains challenging.
The brain tumor segmentation challenge (BraTS) [1] aims at encouraging the development of state of the art methods for tumor segmentation by providing a large dataset of annotated low grade gliomas (LGG) and high grade glioblastomas (HGG). The BraTS 2018 training dataset, which consists of 210 HGG and 75 LGG cases, was annotated manually by one to four raters, and all segmentations were approved by expert raters [2, 3, 4]. For each patient, a T1-weighted, a post-contrast T1-weighted, a T2-weighted and a FLAIR MRI was provided. The MRI data originate from 19 institutions and were acquired with different protocols, magnetic field strengths and MRI scanners. Each tumor was segmented into edema, necrosis and non-enhancing tumor, and active/enhancing tumor. The segmentation performance of participating algorithms is measured by the Dice coefficient, sensitivity, specificity and the 95th percentile of the Hausdorff distance.
It is unchallenged by now that convolutional neural networks (CNNs) dictate the state of the art in biomedical image segmentation [5, 6, 7, 8, 9, 10]. As a consequence, all winning contributions to recent BraTS challenges were built exclusively around CNNs. One of the first notably successful neural networks for brain tumor segmentation was DeepMedic, a 3D CNN introduced by Kamnitsas et al. [5]. It comprises a low and a high resolution pathway that capture semantic information at different scales and recombine them to predict a segmentation based on precise local as well as global image information. Kamnitsas et al. later enhanced their architecture with residual connections for BraTS 2016 [11]. With the success of encoder-decoder architectures for semantic segmentation, such as FCN [12, 13] and most notably the U-Net [14], it is unsurprising that these architectures are used in the context of brain tumor segmentation as well. In BraTS 2017, all winning contributions were at least partially based on encoder-decoder networks. Kamnitsas et al. [9], the clear winners of the challenge, created an ensemble by combining three different network architectures, namely 3D FCN, 3D U-Net [15, 14] and DeepMedic, trained with different loss functions (dice loss [16, 17] and crossentropy) and different normalization schemes. Wang et al. [10] used an FCN-inspired architecture, enhanced with dilated convolutions and residual connections [18]. Instead of directly learning to predict the regions of interest, they trained a cascade of networks that first segments the whole tumor, then, given the whole tumor, the tumor core, and finally, given the tumor core, the enhancing tumor. Isensee et al. [6] employed a U-Net inspired architecture that was trained on large input patches to allow the network to capture as much contextual information as possible. This architecture made use of residual connections [18] in the encoder only, while keeping the decoder part of the network as simple as possible. The network was trained with a multiclass dice loss and deep supervision to improve the gradient flow.
Recently, a growing number of architectural modifications to encoder-decoder networks have been proposed that are designed to improve the performance of the networks for their specific tasks [17, 19, 20, 21, 10, 6, 22, 7]. Due to the sheer number of such variants, it becomes increasingly difficult for researchers to keep track of which modifications remain useful beyond the few datasets they are typically demonstrated on. We have implemented a number of these variants and found that they provide no additional benefit when integrated into a well trained U-Net. In this context, our contribution to the BraTS 2018 challenge is intended to demonstrate that such a U-Net, without significant architectural alterations, is capable of generating competitive, state of the art segmentations.
In the following, we present the network architecture and training schemes used for our submission. As hinted in the previous paragraph, we use a 3D U-Net architecture that is very close to its original publication [15] and optimize the training procedure to maximize its performance on the BraTS 2018 training and validation data.
2.1 Preprocessing

With MRI intensity values being non-standardized, normalization is critical to allow data from different institutions, scanners and acquisition protocols to be processed by a single algorithm. This is particularly true for neural networks, where imaging modalities are typically treated as color channels. Here we need to ensure that the value ranges match not only between patients but also between modalities in order to avoid initial biases of the network. We found the following workflow to work well: we normalize each modality of each patient independently by subtracting the mean and dividing by the standard deviation of the brain region. The region outside the brain is set to 0.
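This normalization step can be sketched as follows; the function name and the epsilon guard are ours, not taken from the authors' implementation:

```python
import numpy as np

def normalize_modality(volume, brain_mask):
    """Z-score normalize one modality using statistics of the brain region
    only, then zero out everything outside the brain (illustrative sketch)."""
    brain_voxels = volume[brain_mask]
    # small epsilon (our addition) guards against a zero standard deviation
    normalized = (volume - brain_voxels.mean()) / (brain_voxels.std() + 1e-8)
    normalized[~brain_mask] = 0
    return normalized
```

This is applied per patient and per modality, so each of the four input channels ends up with zero mean and unit variance inside the brain.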
2.2 Network architecture
U-Net [14] is a successful encoder-decoder network that has received a lot of attention in recent years. Its encoder part works similarly to a traditional classification CNN in that it successively aggregates semantic information at the expense of reduced spatial information. Since in segmentation both semantic and spatial information are crucial for the success of a network, the missing spatial information must somehow be recovered. U-Net does this through the decoder, which receives semantic information from the bottom of the 'U' (refer to Fig. 1) and recombines it with higher resolution feature maps obtained directly from the encoder through skip connections. Unlike other segmentation networks, such as FCN [12] and previous iterations of DeepLab [13], this allows U-Net to segment fine structures particularly well.
As in our previous contribution [6], we stick with our design choice to process patches of size 128×128×128 with a batch size of two. Due to the high memory consumption of 3D convolutions with large patch sizes, we implemented our network carefully to still allow for an adequate number of feature maps. By reducing the number of filters right before upsampling and by using inplace operations whenever possible, we arrive at a network with 30 feature channels at the highest resolution, which is nearly double the number we could train with in our previous model (on a 12 GB NVIDIA Titan X GPU). Due to our choice of loss function, traditional ReLU activation functions did not reliably produce the desired results, which is why we replaced them with leaky ReLUs (leakiness 0.01). With a batch size of only two, batch normalization statistics [24] are unstable and do not reflect the feature map activations at test time very well. We found instance normalization [23] to provide more consistent results and therefore used it to normalize all feature map activations (between convolution and nonlinearity). For an overview of our segmentation architecture, please refer to Fig. 1.
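To illustrate the normalization choice, instance normalization and the leaky ReLU nonlinearity can be sketched in numpy; this simplified version omits the learned scale and shift parameters that framework implementations typically add:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization for a 5D tensor of shape (N, C, D, H, W).
    Each (sample, channel) pair is normalized with its own mean/variance,
    so the result does not depend on batch composition (illustrative)."""
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def leaky_relu(x, negative_slope=1e-2):
    """Leaky ReLU with the leakiness used in the text."""
    return np.where(x >= 0, x, negative_slope * x)
```

Because every (sample, channel) pair is normalized with its own statistics, the activations at test time behave the same as during training, which is exactly what batch statistics fail to guarantee at batch size two.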
2.3 Training Procedure
Our network architecture is trained with randomly sampled patches of size 128×128×128 voxels and a batch size of 2. We refer to an epoch as an iteration over 250 batches and train for a maximum of 500 epochs. Training is terminated early if the exponential moving average of the validation loss has not improved within the last 60 epochs. Training is done using the ADAM optimizer with an initial learning rate that is reduced by a factor of 5 whenever the above mentioned moving average of the validation loss has not improved in the last 30 epochs. We regularize with l2 weight decay.
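This schedule can be sketched in plain Python; the smoothing factor `alpha` and the exact trigger logic are our assumptions, since the corresponding values are not fully specified in the text:

```python
class ValidationLossScheduler:
    """Tracks an exponential moving average (EMA) of the validation loss,
    lowers the learning rate by a factor of 5 after 30 epochs without
    improvement, and signals early stopping after 60 (sketch only)."""

    def __init__(self, alpha=0.95, patience_lr=30, patience_stop=60):
        self.alpha = alpha            # EMA smoothing factor (our assumption)
        self.patience_lr = patience_lr
        self.patience_stop = patience_stop
        self.ema = None
        self.best = float("inf")
        self.best_epoch = 0

    def step(self, epoch, val_loss, lr):
        # update the moving average of the validation loss
        self.ema = (val_loss if self.ema is None
                    else self.alpha * self.ema + (1 - self.alpha) * val_loss)
        if self.ema < self.best:
            self.best, self.best_epoch = self.ema, epoch
        stale = epoch - self.best_epoch
        # reduce the learning rate every 30 stale epochs
        if stale >= self.patience_lr and stale % self.patience_lr == 0:
            lr /= 5.0
        stop = stale >= self.patience_stop  # terminate training
        return lr, stop
```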
One of the main challenges in brain tumor segmentation is the class imbalance in the dataset. While networks can be trained with a crossentropy loss function, the resulting segmentations may not be ideal in terms of the dice scores they obtain. Since the dice score is one of the most important metrics by which contributions are ranked, it is imperative to optimize this metric. We achieve that by using a soft dice loss for the training of our network. While several formulations of the dice loss exist in the literature [25, 16, 17], we prefer a multiclass adaptation that has given us good results in segmentation challenges in the past [8, 6]. This multiclass dice loss function is differentiable and can be easily integrated into deep learning frameworks:

$$ L_{dc} = -\frac{2}{|K|} \sum_{k \in K} \frac{\sum_{i} u_i^k \, v_i^k}{\sum_{i} u_i^k + \sum_{i} v_i^k} $$

where $u$ is the softmax output of the network and $v$ is a one-hot encoding of the ground truth segmentation map. Both $u$ and $v$ have shape $I \times K$, with $I$ being the number of voxels in the training patch and $K$ being the classes.
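A sketch of this multiclass soft dice loss in numpy, assuming `u` holds softmax probabilities and `v` a one-hot ground truth, both flattened to shape (I, K); the epsilon guard is our addition:

```python
import numpy as np

def soft_dice_loss(u, v, eps=1e-8):
    """Multiclass soft dice loss.

    u: softmax output of shape (I, K); v: one-hot ground truth of the
    same shape, with I voxels and K classes. Returns the negative mean
    per-class dice. `eps` (ours) guards against empty classes."""
    intersect = (u * v).sum(axis=0)                 # per-class overlap
    denominator = u.sum(axis=0) + v.sum(axis=0)     # per-class mass
    dice_per_class = 2.0 * intersect / (denominator + eps)
    return -dice_per_class.mean()
```

A perfect prediction yields a loss of -1, while a prediction with no overlap yields 0, so minimizing this loss directly maximizes the mean dice.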
When training large neural networks from limited training data, special care has to be taken to prevent overfitting. We address this problem by utilizing a large variety of data augmentation techniques. The following augmentation techniques were applied on the fly during training: random rotations, random scaling, random elastic deformations, gamma correction augmentation and mirroring. Data augmentation was done with our own in-house framework, which is publicly available at https://github.com/MIC-DKFZ/batchgenerators.
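As one example from this list, gamma correction augmentation can be sketched as follows; the gamma range is an assumption of ours, not a value from the paper:

```python
import numpy as np

def augment_gamma(volume, rng, gamma_range=(0.7, 1.5)):
    """Gamma correction augmentation: rescale intensities to [0, 1],
    apply a random gamma, then restore the original range. The range
    (0.7, 1.5) is an illustrative assumption."""
    gamma = rng.uniform(*gamma_range)
    vmin, vmax = volume.min(), volume.max()
    rescaled = (volume - vmin) / (vmax - vmin + 1e-8)
    return rescaled ** gamma * (vmax - vmin) + vmin
```

The transform is monotonic and range preserving, so it perturbs the intensity distribution without invalidating the preceding normalization.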
The fully convolutional nature of our network allows us to process arbitrarily sized inputs. At test time, we therefore segment an entire patient at once, alleviating problems that may arise when computing the segmentation in tiles with a network that uses padded convolutions. We furthermore use test time data augmentation by mirroring the images and averaging the softmax outputs.
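Test time mirroring can be sketched as follows; `predict_fn` is a hypothetical stand-in for a forward pass of the network that maps a (C, D, H, W) image to (K, D, H, W) class probabilities:

```python
import numpy as np
from itertools import product

def predict_with_mirroring(predict_fn, image):
    """Average softmax predictions over all 8 combinations of flips along
    the three spatial axes (illustrative sketch)."""
    accumulated = None
    for flip_d, flip_h, flip_w in product([False, True], repeat=3):
        axes = tuple(ax for ax, f in zip((1, 2, 3),
                                         (flip_d, flip_h, flip_w)) if f)
        flipped = np.flip(image, axes) if axes else image
        probs = predict_fn(flipped)
        # undo the flip so all predictions are spatially aligned
        probs = np.flip(probs, axes) if axes else probs
        accumulated = probs if accumulated is None else accumulated + probs
    return accumulated / 8.0
```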
2.4 Region based prediction
Wang et al. [10] use a cascade of CNNs to segment first the whole tumor, then the tumor core and finally the enhancing tumor. We believe that the cascade and their rather complicated network architecture were of lesser importance; the key to their good performance in last year's challenge was that they did not learn the labels (enhancing tumor, edema, necrosis) but instead directly optimized the regions that are finally evaluated in the challenge. For this reason, we also train a version of our model in which we replace the final softmax with a sigmoid and optimize the three (overlapping) regions (whole tumor, tumor core and enhancing tumor) directly with the dice loss.
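Using the standard BraTS label convention (1 = necrosis/non-enhancing tumor core, 2 = edema, 4 = enhancing tumor), the training targets for the three sigmoid outputs can be derived from a label map as follows:

```python
import numpy as np

def labels_to_regions(seg):
    """Convert a BraTS label map into the three overlapping regions that
    the challenge evaluates, stacked as targets for sigmoid outputs."""
    whole_tumor = np.isin(seg, (1, 2, 4))   # everything that is tumor
    tumor_core = np.isin(seg, (1, 4))       # core = necrosis + enhancing
    enhancing = seg == 4                    # enhancing tumor only
    return np.stack([whole_tumor, tumor_core, enhancing]).astype(np.float32)
```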
285 training cases is a lot for medical image segmentation, but may still not be enough to prevent overfitting entirely. We therefore also experiment with cotraining on institutional data. Since the label definitions of the BraTS dataset and our own differ slightly, we add a second segmentation layer (1×1×1 convolution) at the end, which acts as a supervised version of m heads [26]. During training, the BraTS segmentation layer only receives gradients from BraTS examples, while the other segmentation layer is trained only on institutional data. The losses of both layers are averaged to obtain the total loss of a minibatch. The rest of the network weights are shared.
One of the most challenging parts of the BraTS challenge is distinguishing small blood vessels in the tumor core region (which must be labeled either as edema or as necrosis) from enhancing tumor. Since false positive enhancing tumor predictions are particularly detrimental for LGG patients, who may have no enhancing tumor at all, we replace all enhancing tumor voxels with necrosis if fewer than 500 enhancing tumor voxels are present in a patient.
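This postprocessing rule is straightforward to implement; label ids follow the standard BraTS convention (4 = enhancing tumor, 1 = necrosis):

```python
import numpy as np

ENHANCING, NECROSIS = 4, 1  # BraTS label ids

def remove_spurious_enhancing(seg, threshold=500):
    """If a patient has fewer than `threshold` predicted enhancing-tumor
    voxels, relabel them all as necrosis, as described in the text."""
    seg = seg.copy()
    enhancing_mask = seg == ENHANCING
    if enhancing_mask.sum() < threshold:
        seg[enhancing_mask] = NECROSIS
    return seg
```

For an LGG patient with a handful of falsely predicted enhancing voxels, this turns an enhancing tumor dice of 0 into 1, which explains the large effect on the mean dice reported below.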
3 Experiments and Results
We designed our training scheme by running a five fold cross-validation on the 285 training cases of BraTS 2018. Training set results are summarized in Table 1; validation set results can be found in Table 2. Validation set results were obtained by using the five networks from the training cross-validation as an ensemble. For consistency with other publications, all reported values were computed by the online evaluation platform (https://ipp.cbica.upenn.edu/).
Due to the relatively small size of the validation set (66 cases vs 285 training cases) we base our main analysis on the cross-validation results. We are confident that conclusions drawn from the training set are more robust and will generalize well to the test set. The vast majority of observations we draw from the training set translate into the validation set as well.
Table 1: Dice scores and 95th percentile Hausdorff distances (HD95) on the BraTS 2018 training set (five fold cross-validation).

| model | Dice enh. | Dice whole | Dice core | HD95 enh. | HD95 whole | HD95 core |
| --- | --- | --- | --- | --- | --- | --- |
| baseline + postprocess | 77.11 | 89.76 | 82.17 | 3.99 | 5.86 | 7.11 |
| baseline + regions | 73.81 | 90.02 | 82.87 | 5.01 | 6.26 | 6.48 |
| baseline + regions + postprocess | 77.82 | 90.02 | 82.87 | 3.93 | 6.26 | 6.48 |
| baseline + regions + cotraining | 74.62 | 90.08 | 84.30 | 4.68 | 5.61 | 6.00 |
| baseline + regions + cotraining + postprocess | 77.17 | 90.08 | 84.30 | 3.68 | 5.61 | 6.00 |
Table 2: Dice scores and 95th percentile Hausdorff distances (HD95) on the BraTS 2018 validation set (ensemble of the five cross-validation models).

| model | Dice enh. | Dice whole | Dice core | HD95 enh. | HD95 whole | HD95 core |
| --- | --- | --- | --- | --- | --- | --- |
| baseline + postprocess | 80.36 | 90.80 | 84.32 | 2.55 | 4.79 | 8.02 |
| baseline + regions | 79.25 | 90.72 | 85.14 | 3.80 | 5.23 | 7.23 |
| baseline + regions + postprocess | 80.48 | 90.72 | 85.14 | 2.81 | 5.23 | 7.23 |
| baseline + regions + cotraining | 80.45 | 90.83 | 85.44 | 3.12 | 4.97 | 7.04 |
| baseline + regions + cotraining + postprocess | 81.01 | 90.83 | 85.44 | 2.54 | 4.97 | 7.04 |
We refer to our U-Net trained on the BraTS labels and training data as baseline. With Dice scores of 73.43/89.76/82.17 (enh/whole/core) on the training set and 79.59/90.80/84.32 (enh/whole/core) on the validation set, this baseline model is by itself already strong. Adding region based training (regions) improved the dice scores of both the enhancing tumor and the tumor core. When training with our institutional data (cotraining), we gain an additional dice point on enhancing tumor and tumor core. Our postprocessing, which is targeted at fixing false positive enhancing tumor predictions in LGG patients, has a substantial impact on the enhancing tumor dice. On the training set, it increases the mean dice by up to 4.01 points (baseline+regions vs baseline+regions+postprocess). However, its impact on the best performing model is less pronounced, resulting in an increase of only 2.55 dice points (baseline+regions+cotraining vs baseline+regions+cotraining+postprocess on the training set). This is most likely due to an increase in segmentation quality that partly removes the necessity for postprocessing. Especially the inclusion of institutional data seems to have resolved many false positives in LGG patients by itself. Comparing baseline+regions+postprocess with baseline+regions+cotraining+postprocess, we notice that training with additional data yields only marginal improvements: on the training data, including institutional data reduced the enhancing tumor dice score by 0.65 but increased the tumor core dice by 1.43. On the validation set, the impact of our postprocessing is lower, yielding only a minor improvement in dice score (0.56) for our best model. We believe this is caused by a discrepancy in data distribution between the training and validation set, resulting in the validation set having relatively few patients with no enhancing tumor label.
With enhancing tumor dice scores on the test set being generally lower than validation and training set scores for almost all participants in recent BraTS challenges, we believe that the test set contains many LGG cases without an enhancing tumor label and that our postprocessing is therefore a promising approach to maximizing our dice scores.
Figure 2 shows a qualitative example generated by our best performing model. The patient shown is taken from the validation set (CBICA_AZA_1). As can be seen in the middle (t1ce), there are several blood vessels close to the enhancing tumor. Segmentation CNNs typically struggle to correctly differentiate between such vessels and actual enhancing tumor. This is most likely due to a) the difficulty of detecting tube-like structures, b) few training cases where these vessels are an issue and c) the use of dice loss functions that do not sufficiently penalize false segmentations of vessels due to their relatively small size. In the case shown here, our model correctly segmented the vessels as background.
Finally, we compare our best model (baseline+regions+cotraining+postprocess) with other entries in the validation leaderboard at https://www.cbica.upenn.edu/BraTS18/lboardValidation.html (accessed on July 14th 2018, see Table 3).
In this paper we demonstrated that a generic U-Net architecture with only minor modifications can obtain very competitive segmentations if trained correctly. Our baseline model is already strong on the BraTS 2018 validation data, with dice scores of 79.59, 90.80 and 84.32 for enhancing tumor, whole tumor and tumor core, respectively. Using region based training improved the results for the tumor core by about one dice point. Additional training data yielded some improvements for the enhancing tumor dice. Our simple postprocessing technique of removing enhancing tumor entirely from a patient if the total number of predicted enhancing tumor voxels was below 500 proved to be effective. Overall, our current best model achieves dice scores of 81.01, 90.83 and 85.44 and 95th percentiles of the Hausdorff distance of 2.54, 4.97 and 7.04 for enhancing tumor, whole tumor and tumor core, respectively. We will continue to improve our model until the release of the testing data and look forward to testing it on new, unseen patients.
-  B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest et al., “The multimodal brain tumor image segmentation benchmark (BRATS),” IEEE TMI, vol. 34, no. 10, pp. 1993–2024, 2015.
-  S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, J. Freymann, K. Farahani, and C. Davatzikos, “Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features,” Nature Scientific Data, 2017 (In Press).
-  S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, J. Freymann, K. Farahani, and C. Davatzikos, “Segmentation labels and radiomic features for the pre-operative scans of the TCGA-GBM collection,” TCIA, 2017.
-  S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, J. Freymann, K. Farahani, and C. Davatzikos, “Segmentation labels and radiomic features for the pre-operative scans of the TCGA-LGG collection,” TCIA, 2017.
-  K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, “Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation,” MIA, vol. 36, pp. 61–78, 2017.
-  F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, and K. H. Maier-Hein, “Brain tumor segmentation and radiomics survival prediction: Contribution to the BraTS 2017 challenge,” in International MICCAI Brainlesion Workshop. Springer, 2017, pp. 287–297.
-  X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P. A. Heng, “H-DenseUNet: Hybrid densely connected UNet for liver and liver tumor segmentation from CT volumes,” arXiv preprint arXiv:1709.07330, 2017.
-  F. Isensee, P. F. Jaeger, P. M. Full, I. Wolf, S. Engelhardt, and K. H. Maier-Hein, “Automatic cardiac disease assessment on cine-MRI via time-series segmentation and domain specific features,” in International Workshop on Statistical Atlases and Computational Models of the Heart. Springer, 2017, pp. 120–129.
-  K. Kamnitsas, W. Bai, E. Ferrante, S. McDonagh, M. Sinclair, N. Pawlowski, M. Rajchl, M. Lee, B. Kainz, D. Rueckert et al., “Ensembles of multiple models and architectures for robust brain tumour segmentation,” in International MICCAI Brainlesion Workshop. Springer, 2017, pp. 450–462.
-  G. Wang, W. Li, S. Ourselin, and T. Vercauteren, “Automatic brain tumor segmentation using cascaded anisotropic convolutional neural networks,” in International MICCAI Brainlesion Workshop. Springer, 2017, pp. 178–190.
-  K. Kamnitsas, E. Ferrante, S. Parisot, C. Ledig, A. V. Nori, A. Criminisi, D. Rueckert, and B. Glocker, “DeepMedic for brain tumor segmentation,” in International Workshop on Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. Springer, 2016, pp. 138–149.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015, pp. 3431–3440.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI. Springer, 2015, pp. 234–241.
-  Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning dense volumetric segmentation from sparse annotation,” in MICCAI. Springer, 2016, pp. 424–432.
-  M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal, “The importance of skip connections in biomedical image segmentation,” in Deep Learning and Data Labeling for Medical Applications. Springer, 2016, pp. 179–187.
-  F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in International Conference on 3D Vision. IEEE, 2016, pp. 565–571.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in ECCV. Springer, 2016, pp. 630–645.
-  S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. IEEE, 2017, pp. 1175–1183.
-  O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018.
-  A. G. Roy, N. Navab, and C. Wachinger, “Concurrent spatial and channel squeeze & excitation in fully convolutional networks,” arXiv preprint arXiv:1803.02579, 2018.
-  B. Kayalibay, G. Jensen, and P. van der Smagt, “CNN-based segmentation of medical imaging data,” arXiv preprint arXiv:1701.03056, 2017.
-  D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
-  C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. J. Cardoso, “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2017, pp. 240–248.
-  S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, and D. Batra, “Why m heads are better than one: Training a diverse ensemble of deep networks,” arXiv preprint arXiv:1511.06314, 2015.