No New-Net

09/27/2018 ∙ by Fabian Isensee, et al.

In this paper we demonstrate the effectiveness of a well trained U-Net in the context of the BraTS 2018 challenge. This endeavour is particularly interesting given that researchers are currently besting each other with architectural modifications that are intended to improve the segmentation performance. We instead focus on the training process, argue that a well trained U-Net is hard to beat and intend to back up this assumption with a strong participation in this year's BraTS challenge. Our baseline U-Net, which has only minor modifications and is trained with a large patch size and a dice loss function, already achieves competitive dice scores on the BraTS 2018 validation data. By incorporating region based training, additional training data and a simple postprocessing technique, we obtain dice scores of 81.01, 90.83 and 85.44 and Hausdorff distances (95th percentile) of 2.54, 4.97 and 7.04.


1 Introduction

Quantitative assessment of brain tumors provides valuable information and therefore constitutes an essential part of diagnostic procedures. Automatic segmentation is attractive in this context, as it allows for faster, more objective and potentially more accurate description of relevant tumor parameters, such as the volume of its subregions. Due to the irregular nature of tumors, however, the development of algorithms capable of automatic segmentation remains challenging.

The brain tumor segmentation challenge (BraTS) [1] aims at encouraging the development of state of the art methods for tumor segmentation by providing a large dataset of annotated low grade gliomas (LGG) and high grade glioblastomas (HGG). The BraTS 2018 training dataset, which consists of 210 HGG and 75 LGG cases, was annotated manually by one to four raters and all segmentations were approved by expert raters [2, 3, 4]. For each patient a T1-weighted, a post-contrast T1-weighted, a T2-weighted and a FLAIR MRI was provided. The MRI scans originate from 19 institutions and were acquired with different protocols, magnetic field strengths and MRI scanners. Each tumor was segmented into edema, necrosis and non-enhancing tumor, and active/enhancing tumor. The segmentation performance of participating algorithms is measured based on the Dice coefficient, sensitivity, specificity and the 95th percentile of the Hausdorff distance.

It is unchallenged by now that convolutional neural networks (CNNs) dictate the state of the art in biomedical image segmentation [5, 6, 7, 8, 9, 10]. As a consequence, all winning contributions to recent BraTS challenges were built exclusively around CNNs. One of the first notably successful neural networks for brain tumor segmentation was DeepMedic, a 3D CNN introduced by Kamnitsas et al. [5]. It comprises a low and a high resolution pathway that capture semantic information at different scales and recombines them to predict a segmentation based on precise local as well as global image information. Kamnitsas et al. later enhanced their architecture with residual connections for BraTS 2016 [11].

With the success of encoder-decoder architectures for semantic segmentation, such as FCN [12, 13] and most notably the U-Net [14], it is unsurprising that these architectures are used in the context of brain tumor segmentation as well. In BraTS 2017, all winning contributions were at least partially based on encoder-decoder networks. Kamnitsas et al. [9], who were the clear winners of the challenge, created an ensemble by combining three different network architectures, namely 3D FCN [12], 3D U-Net [15, 14] and DeepMedic [5], trained with different loss functions (dice loss [16, 17] and crossentropy) and different normalization schemes. Wang et al. [10] used an FCN-inspired architecture, enhanced with dilated convolutions [13] and residual connections [18]. Instead of directly learning to predict the regions of interest, they trained a cascade of networks that would first segment the whole tumor, then, given the whole tumor, the tumor core and finally, given the tumor core, the enhancing tumor. Isensee et al. [6] employed a U-Net inspired architecture that was trained on large input patches to allow the network to capture as much contextual information as possible. This architecture made use of residual connections [18] in the encoder only, while keeping the decoder part of the network as simple as possible. The network was trained with a multiclass dice loss and deep supervision to improve the gradient flow.

Recently, a growing number of architectural modifications to encoder-decoder networks have been proposed that are designed to improve the performance of the networks for their specific tasks [17, 19, 20, 21, 10, 6, 22, 7]. Due to the sheer number of such variants, it becomes increasingly difficult for researchers to keep track of which modifications remain useful beyond the few datasets they are typically demonstrated on. We have implemented a number of these variants and found that they provide no additional benefit if integrated into a well trained U-Net. In this context, our contribution to the BraTS 2018 challenge is intended to demonstrate that such a U-Net, without significant architectural alterations, is capable of generating competitive state of the art segmentations.

2 Methods

In the following we present the network architecture and training schemes used for our submission. As hinted in the previous paragraph, we use a 3D U-Net architecture that stays very close to the original publication [15] and optimize the training procedure to maximize its performance on the BraTS 2018 training and validation data.

2.1 Preprocessing

With MRI intensity values being non-standardized, normalization is critical to allow data from different institutes and scanners, acquired with varying protocols, to be processed by a single algorithm. This is particularly true for neural networks, where imaging modalities are typically treated as color channels. Here we need to ensure that the value ranges match not only between patients but between modalities as well, in order to avoid initial biases of the network. We found the following workflow to work well: we normalize each modality of each patient independently by subtracting the mean and dividing by the standard deviation of the brain region. The region outside the brain is set to 0.
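As an illustration, here is a minimal sketch of this per-patient, per-modality z-score normalization in Python/NumPy, assuming a boolean brain mask is available (all names are illustrative, not taken from our code):

    import numpy as np

    def normalize_case(modalities, brain_mask):
        # modalities: float array of shape (C, X, Y, Z), one channel per MRI modality
        # brain_mask: boolean array of shape (X, Y, Z), True inside the brain
        normalized = np.zeros_like(modalities, dtype=np.float32)
        for c in range(modalities.shape[0]):
            brain_voxels = modalities[c][brain_mask]
            mean, std = brain_voxels.mean(), brain_voxels.std()
            # subtract mean and divide by std of the brain region only;
            # everything outside the brain stays 0
            normalized[c][brain_mask] = (brain_voxels - mean) / std
        return normalized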

2.2 Network architecture

U-Net [14] is a successful encoder-decoder network that has received a lot of attention in recent years. Its encoder part works similarly to a traditional classification CNN in that it successively aggregates semantic information at the expense of reduced spatial information. Since in segmentation both semantic and spatial information are crucial for the success of a network, the missing spatial information must somehow be recovered. U-Net does this through the decoder, which receives semantic information from the bottom of the 'U' (refer to Fig. 1) and recombines it with higher resolution feature maps obtained directly from the encoder through skip connections. Unlike other segmentation networks, such as FCN [12] and previous iterations of DeepLab [13], this allows U-Net to segment fine structures particularly well.

Figure 1: We use a 3D U-Net architecture with minor modifications. It uses instance normalization [23] and leaky ReLU nonlinearities and reduces the number of feature maps before upsampling. Feature map dimensionality is noted next to the convolutional blocks, with the first number being the number of feature channels.

Our network architecture is an instantiation of the 3D U-Net [15] with minor modifications. Following our successful participation in 2017 [6], we stick with our design choice to process patches of size 128x128x128 with a batch size of two. Due to the high memory consumption of 3D convolutions with large patch sizes, we implemented our network carefully to still allow for an adequate number of feature maps. By reducing the number of filters right before upsampling and by using inplace operations whenever possible, we arrive at a network with 30 feature channels at the highest resolution, which is nearly double the number we could train with in our previous model (on a 12 GB NVIDIA Titan X GPU). Due to our choice of loss function, traditional ReLU activation functions did not reliably produce the desired results, which is why we replaced them with leaky ReLUs throughout the entire network. With a small batch size of 2, the exponential moving averages of mean and variance within a batch learned by batch normalization [24] are unstable and do not reflect the feature map activations at test time very well. We found instance normalization [23] to provide more consistent results and therefore used it to normalize all feature map activations (between convolution and nonlinearity). For an overview of our segmentation architecture, please refer to Fig. 1.
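As a sketch of the repeating unit just described (convolution, then instance normalization, then leaky ReLU) in PyTorch; the kernel size and the leakiness value are assumptions, since they are not specified above:

    import torch.nn as nn

    class ConvBlock(nn.Module):
        # Conv3d -> InstanceNorm3d -> LeakyReLU, the repeating unit of the network.
        # Kernel size 3 and PyTorch's default negative slope are assumptions.
        def __init__(self, in_channels, out_channels):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1),
                nn.InstanceNorm3d(out_channels),  # normalize between convolution and nonlinearity
                nn.LeakyReLU(inplace=True),       # inplace to save memory on large 3D patches
            )

        def forward(self, x):
            return self.block(x)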

2.3 Training Procedure

Our network architecture is trained with randomly sampled patches of size 128x128x128 voxels and a batch size of 2. We refer to an epoch as an iteration over 250 batches and train for a maximum of 500 epochs. Training is terminated early if the exponential moving average of the validation loss has not improved within the last 60 epochs. Training is done using the ADAM optimizer with a fixed initial learning rate, which is reduced by a factor of 5 whenever the above mentioned moving average of the validation loss has not improved within the last 30 epochs. We regularize with l2 weight decay.
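A sketch of this plateau-based schedule follows; the initial learning rate, the smoothing factor of the moving average and the training helper are placeholders, since their exact values are not given above:

    # All constants here are placeholders, not the values used for our submission.
    INITIAL_LR = 1e-4      # hypothetical initial learning rate
    EMA_ALPHA = 0.95       # hypothetical smoothing of the validation-loss moving average
    MAX_EPOCHS = 500       # one epoch = 250 batches

    lr, ema, best_ema, best_epoch = INITIAL_LR, None, float("inf"), 0
    for epoch in range(MAX_EPOCHS):
        val_loss = run_one_epoch(lr)  # run_one_epoch is a hypothetical helper
        ema = val_loss if ema is None else EMA_ALPHA * ema + (1 - EMA_ALPHA) * val_loss
        if ema < best_ema:
            best_ema, best_epoch = ema, epoch
        if epoch - best_epoch > 60:   # no improvement for 60 epochs: stop early
            break
        if epoch - best_epoch > 30:   # no improvement for 30 epochs: reduce LR
            lr /= 5
            best_epoch = epoch        # assumption: patience restarts after each reduction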

One of the main challenges with brain tumor segmentation is the class imbalance in the dataset. While networks will train with a crossentropy loss function, the resulting segmentations may not be ideal in terms of the dice score they obtain. Since the dice score is one of the most important metrics based upon which contributions are ranked, it is imperative to optimize this metric directly. We achieve that by using a soft dice loss for the training of our network. While several formulations of the dice loss exist in the literature [25, 16, 17], we prefer to use a multiclass adaptation of [16], which has given us good results in segmentation challenges in the past [8, 6]. This multiclass Dice loss function is differentiable and can be easily integrated into deep learning frameworks:

\[ \mathcal{L}_{dc} = -\frac{2}{|K|} \sum_{k \in K} \frac{\sum_{i \in I} u_i^k v_i^k}{\sum_{i \in I} u_i^k + \sum_{i \in I} v_i^k} \tag{1} \]

where $u$ is the softmax output of the network and $v$ is a one hot encoding of the ground truth segmentation map. Both $u$ and $v$ have shape $I \times K$, with $I$ being the number of pixels in the training patch and $K$ being the classes.
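A sketch of Eq. (1) in PyTorch; the smoothing epsilon in the denominator is an implementation detail added here for numerical stability, not part of the formula above:

    import torch

    def soft_dice_loss(u, v, eps=1e-6):
        # u: softmax output, v: one hot ground truth, both of shape (N, K, X, Y, Z)
        axes = (0, 2, 3, 4)                      # sum over batch and spatial dims, keep classes
        intersection = (u * v).sum(dim=axes)
        denominator = u.sum(dim=axes) + v.sum(dim=axes)
        dice_per_class = 2.0 * intersection / (denominator + eps)
        return -dice_per_class.mean()            # -2/|K| * sum over classes, as in Eq. (1)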

When training large neural networks from limited training data, special care has to be taken to prevent overfitting. We address this problem by utilizing a large variety of data augmentation techniques. The following augmentation techniques were applied on the fly during training: random rotations, random scaling, random elastic deformations, gamma correction augmentation and mirroring. Data augmentation was done with our own in-house framework, which is publicly available at https://github.com/MIC-DKFZ/batchgenerators.
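The actual pipeline uses the batchgenerators framework linked above; purely as an illustration of two of the listed transforms, mirroring and gamma correction on a single patch could be sketched as follows (the gamma range is a hypothetical choice):

    import numpy as np

    def mirror_and_gamma(patch, rng, gamma_range=(0.7, 1.5)):
        # patch: array of shape (C, X, Y, Z); rng: np.random.RandomState
        for axis in (1, 2, 3):                   # mirror each spatial axis with p = 0.5
            if rng.rand() < 0.5:
                patch = np.flip(patch, axis=axis)
        gamma = rng.uniform(*gamma_range)        # gamma correction on rescaled intensities
        mn, mx = patch.min(), patch.max()
        patch = ((patch - mn) / (mx - mn + 1e-8)) ** gamma * (mx - mn) + mn
        return np.ascontiguousarray(patch)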

The fully convolutional nature of our network allows it to process arbitrarily sized inputs. At test time we therefore segment an entire patient at once, alleviating problems that may arise when computing the segmentation in tiles with a network that has padded convolutions. We furthermore use test time data augmentation by mirroring the images and averaging the softmax outputs.
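A sketch of this test time augmentation, averaging the softmax output over all eight mirror combinations of the three spatial axes (the model interface and tensor layout are assumptions):

    import itertools
    import torch

    @torch.no_grad()
    def predict_with_mirroring(model, image):
        # image: tensor of shape (1, C, X, Y, Z); model returns softmax probabilities
        predictions = []
        for r in range(4):                                   # subsets of the spatial axes
            for axes in itertools.combinations((2, 3, 4), r):
                flipped = torch.flip(image, dims=axes) if axes else image
                pred = model(flipped)
                # flip the prediction back before averaging
                predictions.append(torch.flip(pred, dims=axes) if axes else pred)
        return torch.stack(predictions).mean(dim=0)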

2.4 Region based prediction

Wang et al. [10] use a cascade of CNNs to segment first the whole tumor, then the tumor core and finally the enhancing tumor. We believe the cascade and their rather complicated network architecture to be of lesser importance; instead, the key to their strong performance in last year's challenge is that they did not learn the labels (enhancing tumor, edema, necrosis) but directly optimized the regions that are finally evaluated in the challenge. For this reason we also train a version of our model in which we replace the final softmax with a sigmoid and optimize the three (overlapping) regions (whole tumor, tumor core and enhancing tumor) directly with the dice loss.
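A sketch of how a BraTS label map can be converted into the three overlapping region masks that the sigmoid outputs are trained against, using the standard BraTS label convention (1 = necrosis/non-enhancing, 2 = edema, 4 = enhancing tumor):

    import numpy as np

    def labels_to_regions(seg):
        # seg: integer label map of shape (X, Y, Z) with labels {0, 1, 2, 4}
        whole = np.isin(seg, (1, 2, 4))     # whole tumor: all tumor labels
        core = np.isin(seg, (1, 4))         # tumor core: whole tumor without edema
        enhancing = seg == 4                # enhancing tumor
        # three binary channels, each trained with sigmoid + dice loss
        return np.stack([whole, core, enhancing]).astype(np.float32)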

2.5 Cotraining

285 training cases is a lot for medical image segmentation, but may still not be enough to prevent overfitting entirely. We therefore also experiment with cotraining on institutional data. Since the label definitions between the BraTS dataset and our own differ slightly, we add a second segmentation layer (1x1x1 convolution) at the end, which acts as a supervised version of m heads [26]. During training, the BraTS segmentation layer only receives gradients from BraTS examples and the other segmentation layer is trained only on institutional data. The losses of both layers are averaged to obtain the total loss of a minibatch. The rest of the network weights are shared.
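A sketch of this two-headed setup in PyTorch: a shared trunk with one 1x1x1 segmentation head per label convention, where each example only contributes gradients through the head of its own dataset (all names are illustrative):

    import torch.nn as nn

    class TwoHeadedSegNet(nn.Module):
        # shared U-Net trunk, two dataset-specific 1x1x1 segmentation heads
        def __init__(self, trunk, n_features, n_brats_classes, n_inst_classes):
            super().__init__()
            self.trunk = trunk
            self.head_brats = nn.Conv3d(n_features, n_brats_classes, kernel_size=1)
            self.head_inst = nn.Conv3d(n_features, n_inst_classes, kernel_size=1)

        def forward(self, x, is_brats):
            features = self.trunk(x)
            # only the selected head (plus the shared trunk) receives gradients
            return self.head_brats(features) if is_brats else self.head_inst(features)

    # The total loss of a minibatch is the average of the losses of both heads.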

2.6 Postprocessing

One of the most challenging parts of the BraTS challenge data is distinguishing small blood vessels in the tumor core region (which must be labeled either as edema or as necrosis) from enhancing tumor. Since false positives here are particularly detrimental for LGG patients, who may have no enhancing tumor at all, we replace all enhancing tumor voxels with necrosis if fewer than 500 enhancing tumor voxels are present in a patient.
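A sketch of this postprocessing step on a predicted label map, again using the BraTS convention (4 = enhancing tumor, 1 = necrosis):

    import numpy as np

    def remove_spurious_enhancing(seg, threshold=500):
        # relabel all enhancing tumor as necrosis if fewer than `threshold`
        # enhancing voxels were predicted for this patient
        if (seg == 4).sum() < threshold:
            seg = seg.copy()
            seg[seg == 4] = 1
        return seg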

3 Experiments and Results

We designed our training scheme by running a five fold cross-validation on the 285 training cases of BraTS 2018. Training set results are summarized in Table 1, validation set results can be found in Table 2. Validation set results were obtained by using the five networks from the training cross-validation as an ensemble. For consistency with other publications, all reported values were computed by the online evaluation platform (https://ipp.cbica.upenn.edu/).

Due to the relatively small size of the validation set (66 cases vs 285 training cases) we base our main analysis on the cross-validation results. We are confident that conclusions drawn from the training set are more robust and will generalize well to the test set. The vast majority of observations we draw from the training set carry over to the validation set as well.

                                                      Dice                  HD95
                                               enh.   whole  core    enh.  whole  core
baseline                                       73.43  89.76  82.17   4.88  5.86   7.11
baseline + postprocess                         77.11  89.76  82.17   3.99  5.86   7.11
baseline + regions                             73.81  90.02  82.87   5.01  6.26   6.48
baseline + regions + postprocess               77.82  90.02  82.87   3.93  6.26   6.48
baseline + regions + cotraining                74.62  90.08  84.30   4.68  5.61   6.00
baseline + regions + cotraining + postprocess  77.17  90.08  84.30   3.68  5.61   6.00
Table 1: Results on BraTS 2018 training data (285 cases). All results were obtained by running a five fold cross-validation. Metrics were computed by the online evaluation platform.
                                                      Dice                  HD95
                                               enh.   whole  core    enh.  whole  core
baseline                                       79.59  90.80  84.32   3.12  4.79   8.02
baseline + postprocess                         80.36  90.80  84.32   2.55  4.79   8.02
baseline + regions                             79.25  90.72  85.14   3.80  5.23   7.23
baseline + regions + postprocess               80.48  90.72  85.14   2.81  5.23   7.23
baseline + regions + cotraining                80.45  90.83  85.44   3.12  4.97   7.04
baseline + regions + cotraining + postprocess  81.01  90.83  85.44   2.54  4.97   7.04
Table 2: Results on BraTS2018 validation data (66 cases). Results were obtained by using the five models from the training set cross-validation as an ensemble. Metrics were computed by the online evaluation platform.

We refer to our U-Net trained on the BraTS labels and training data as baseline. With Dice scores of 73.43/89.76/82.17 (enh/whole/core) on the training set and 79.59/90.80/84.32 (enh/whole/core) on the validation set, this baseline model is by itself already strong. Adding region based training (regions) improved the dice scores of both the enhancing tumor and the tumor core. When training with our institutional data (cotraining), we gain an additional dice point on enhancing tumor and tumor core. Our postprocessing, which is targeted at fixing false positive enhancing tumor predictions in LGG patients, has a substantial impact on the enhancing tumor dice. On the training set it increases the mean dice by up to 4.01 points (baseline+regions vs baseline+regions+postprocess). However, its impact on the best performing model is less pronounced, resulting in an increase of only 2.55 dice points (baseline+regions+cotraining+postprocess on the training set). This is most likely due to an increase in segmentation quality which partly removes the necessity for postprocessing. Especially the inclusion of institutional data seems to have resolved many false positives in LGG patients by itself. Comparing baseline+regions+postprocess vs baseline+regions+cotraining+postprocess, we notice that training with additional training data only yields marginal improvements: on the training data, including institutional data reduced the enhancing tumor dice score by 0.65 but increased the tumor core dice by 1.43. On the validation set, the impact of our postprocessing is lower, yielding only a minor improvement in dice score (0.56) for our best model. We believe this is caused by a discrepancy in data distribution between the training and validation set, resulting in the validation set having relatively few patients with no enhancing tumor label. With enhancing tumor dice scores on the test set being generally lower than validation and training set scores for almost all participants in recent BraTS challenges, we believe the test set to contain many LGG cases without an enhancing tumor label and therefore consider our postprocessing a promising approach to maximize our dice scores.

Figure 2: Qualitative results. The case shown here is patient CBICA_AZA_1 from the validation set. Left: FLAIR, middle: T1ce, right: our segmentation. Enhancing tumor is shown in yellow, necrosis in turquoise and edema in violet.

Figure 2 shows a qualitative example generated from our best performing model. The patient shown is taken from the validation set (CBICA_AZA_1). As can be seen in the middle image (T1ce), there are several blood vessels close to the enhancing tumor. Segmentation CNNs typically struggle to correctly differentiate between such vessels and actual enhancing tumor. This is most likely due to a) the difficulty of detecting tube-like structures, b) the scarcity of training cases where these vessels are an issue and c) the use of a dice loss function that does not sufficiently penalize false segmentations of vessels due to their relatively small size. In the case shown here, our model correctly segmented the vessels as background.

Finally, we compare our best model (baseline+regions+cotraining+postprocess) with other entries in the validation leaderboard at https://www.cbica.upenn.edu/BraTS18/lboardValidation.html (accessed on July 14th 2018, see Table 3).

                  Dice                  HD95
           enh.   whole  core    enh.  whole  core
xuhuaren   83.70  90.76  87.40   3.35  5.20   5.55
SCAN       79.65  90.01  85.13   3.60  4.41   5.58
ETS_livia  79.26  90.29  85.42   3.32  5.38   6.59
NVDLMED    82.33  91.00  86.68   3.92  4.52   6.85
MIC-DKFZ   81.01  90.83  85.44   2.54  4.97   7.04
Table 3: Comparison with other teams (validation leaderboard).

4 Discussion

In this paper we demonstrated that a generic U-Net architecture with only minor modifications can obtain very competitive segmentations, if trained correctly. Our baseline model is already strong on the BraTS 2018 validation data, with dice scores of 79.59, 90.80 and 84.32 for enhancing tumor, whole tumor and tumor core, respectively. Using region based training improved results for the tumor core by about one dice point. Additional training data yielded some improvements for the enhancing tumor dice. Our simple postprocessing technique of removing enhancing tumor entirely from a patient if the total number of predicted enhancing tumor voxels was below 500 proved to be effective. Overall, our current best model achieves dice scores of 81.01, 90.83 and 85.44 and 95th percentile Hausdorff distances of 2.54, 4.97 and 7.04 for enhancing tumor, whole tumor and tumor core, respectively. We will continue to improve our model until the release of the testing data and look forward to testing it on new, unseen patients.

References