Automated Brain Tumour Segmentation Using Deep Fully Convolutional Residual Networks

08/12/2019 · Indrajit Mazumdar, et al.

Automated brain tumour segmentation has the potential of making a massive improvement in disease diagnosis, surgery, monitoring and surveillance. However, this task is extremely challenging. Here, we describe our automated segmentation method using 2D CNNs that are based on U-Net. To deal with class imbalance effectively, we have formulated a weighted Dice loss function. We found that increasing the depth of the 'U' shape beyond a certain level results in a decrease in performance, so it is essential to choose an optimum depth. We also found that 3D contextual information cannot be captured by a single 2D network that is trained with patches extracted from multiple views whereas an ensemble of three 2D networks trained in multiple views can effectively capture the information and deliver much better performance. Our method obtained Dice scores of 0.79 for enhancing tumour, 0.90 for whole tumour, and 0.82 for tumour core on the BraTS 2018 validation set and its performance is comparable to the state-of-the-art methods.

1 Introduction

Tumours of the brain are among the most lethal types of tumour DeAngelis2001. The most frequently occurring primary brain tumours are gliomas Ohgaki2005 ; Goodenberger2012. They originate from the neuroglia within the brain Goodenberger2012 ; Bauer2013. A considerable amount of research has been done in this area, but there remains a great need for improving patient treatment. Gliomas fall into two main categories, namely high-grade gliomas (HGG) and low-grade gliomas (LGG) Louis2016 ; Bauer2013. HGGs are malignant and more aggressive. They have a worse prognosis, and patients with HGG survive on average two years or less Ohgaki2005. LGGs are generally benign but may progress into HGG. They are less aggressive and have a better prognosis, and patients with LGG survive on average for several years Ohgaki2005. Gliomas have four types of histologically heterogeneous tumoural structures, namely peritumoral oedema (ED), enhancing core (ET), necrotic core (NCR), and non-enhancing core (NET) Menze2015. These tumoural structures are organised into three mutually inclusive tumour sub-regions, namely enhancing tumour (ET), tumour core (TC), and whole tumour (WT) Menze2015. The TC contains within it the ET, NCR, and NET. The WT contains within it the TC and ED. Among the various imaging modalities available, MRI is preferred for imaging of the brain Isin2016. Multiple MRI modalities are used to emphasise the different sub-regions of glioma. The modalities used are T2-weighted (T2), Fluid Attenuated Inversion Recovery (FLAIR), T1-weighted (T1), and contrast-enhanced T1-weighted (T1ce) Menze2015. T1ce emphasises the enhancing tumour, T1 and T1ce emphasise the tumour core, and T2 and FLAIR emphasise the whole tumour.

Automated segmentation of brain tumours is essential for quantitative assessment of brain tumours. It is much more accurate and objective than qualitative assessment, which is subjective. Thus, automated segmentation has the potential to make a massive improvement in diagnosis, surgery, treatment planning, and monitoring of the disease. It is much faster than manual segmentation, which makes it scalable and ensures that patients receive more rapid treatment. Automated segmentation systems are objective as they are not subject to intra- and inter-rater variations. Because automated segmentation does not depend on radiologists, it reduces human labour, which in turn reduces cost. Automated brain tumour segmentation is also highly clinically relevant since accurate segmentation of the tumour is required for extracting accurate radiomic features that are used to predict the patient's overall survival Bakas2018.

Automated brain tumour segmentation is extremely difficult and challenging. Brain tumours are highly heterogeneous, with significant variation in shape, size, and location among patients, which makes prior information about these properties of little value for segmentation. Multiple tumours of different grades may reside simultaneously inside the brain. The boundary between tumour and adjacent healthy tissue is usually unclear since the intensity at the boundaries changes very smoothly. Bias field artefacts and partial volume artefacts are also present in the MRI images. The problem of brain tumour segmentation also suffers from extreme class imbalance.

Generative and discriminative approaches are the two most used approaches for automated segmentation of brain tumours. Generative methods explicitly define the probabilistic distributions of brain tumours Menze2010. However, it is extremely challenging and time-consuming to construct an accurate probabilistic model from meaningful analysis of an image. Discriminative methods directly learn the relationship between image intensities and labels. In the past, discriminative methods used for brain tumour segmentation employed feature extraction from images followed by classification Isin2016. However, it is extremely challenging to select highly representative features for the classifier.

At present, the latest techniques used for automated brain tumour segmentation are discriminative methods based on deep learning Bakas2018. Such methods can automatically learn highly representative features and are thus better than traditional discriminative methods. The success achieved by the numerous deep learning networks varies mainly due to differences in their network architecture and training procedure. Pereira et al. Pereira2016 used two different 2D CNN architectures, one for segmenting HGG and the other for segmenting LGG. However, their trained CNN predicts only the label of the central voxel within a patch, which leads to high memory usage and time consumption. DeepMedic Kamnitsas2017a is a 3D CNN that uses two pathways of different resolutions to combine semantic information at distinct scales. Ensembles of Multiple Models and Architectures (EMMA) Kamnitsas2018 is an ensemble of widely varying CNN models that includes two variations of the DeepMedic model Kamnitsas2017a ; Kamnitsas2015, three variations of the 3D FCN model Long2015, and two variations of the 3D U-Net model Ronneberger2015. This network demonstrated good generalisation performance. Wang et al. Wang2018d used a cascade approach in which three networks were used to sequentially segment the three tumour sub-regions, wherein each network's output is cropped and fed as input to the next network. They also incorporated 3D contextual information by training their network in three different orthogonal views and averaging the outputs of those networks. However, the cascade approach consumes more time and memory since it is not end-to-end. Isensee et al. Isensee2018 used a U-Net Ronneberger2015 inspired 3D CNN in which each encoder block is a residual block He2016a, and deep supervision is used in the decoder to improve the gradient flow. Myronenko Myronenko2019 used an asymmetric encoder-decoder network; notably, a variational autoencoder was used to regularise the encoder. Isensee et al. Isensee2019 used a modification of the 3D U-Net Cicek2016 in which the number of activation maps is reduced before each upsampling operation to decrease memory usage. McKinley et al. McKinley2019 used a U-Net style network in which DenseNet Huang2017 structures were placed; they also introduced a label-uncertainty loss that can model label noise and uncertainty. Zhou et al. Zhou2019 used a model cascade approach to segment the three tumour sub-regions sequentially.

Figure 1: Schematic illustration of the network architecture. The violet box denotes the input. Each green box corresponds to a residual block. The number of channels is denoted on top of each box, and the dimensions of the activation maps are denoted on the left side of each level. Each orange arrow pointing down represents a downsampling operation, and each pink arrow pointing up represents an upsampling operation. Each yellow circle denotes the concatenation of the upsampled activations with the activations from the equivalent level of the encoder. The red box at the end of the decoder represents a convolution that is followed by a softmax output layer.

In this work, we describe our automated glioma segmentation method using 2D CNNs that are based on U-Net Ronneberger2015. We examine how the performance of a U-Net based network is affected when we increase the depth of the ‘U’ shape and how important it is to choose an optimum depth. We formulate a weighted Dice loss function, taking inspiration from the various Dice loss functions available in the literature Milletari2016 ; Sudre2017, to handle the class imbalance problem effectively. We explore the effectiveness of two different approaches for incorporating 3D contextual information into a 2D network to enhance its performance. In the first approach, we use a single model that is trained with patches extracted from multiple views, and in the second approach, we use an ensemble of three networks trained in multiple views. In comparison to 3D networks, 2D networks have the advantage that they consume much less memory, which allows them to use a larger batch size and make use of batch normalisation. Because of this advantage, we decided to use a 2D network and investigate whether a 2D network that has incorporated 3D contextual information from three orthogonal views can match the performance of the latest 3D networks on the problem of automated brain tumour segmentation.

The remainder of this paper is organised as follows. We describe our proposed segmentation method in Section 2. In Section 3, we present the results of our experiments. Finally, in Section 4, we conclude the paper.

2 Methods

2.1 Preprocessing

It is essential to standardise the MRI images so that the intensity range remains comparable not only among MRI images of different patients but also among different MRI modalities of the same patient. Therefore, we normalised each MRI modality separately by subtracting the mean and dividing by the standard deviation of the nonzero voxels inside the brain region. The voxels outside the brain region contain only zero intensity.
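As a concrete illustration, the following is a minimal sketch of this per-modality normalisation, assuming each modality is loaded as a 3D numpy array (the function name is ours, not the paper's):

```python
import numpy as np

def normalise_modality(volume):
    """Zero-mean, unit-variance normalisation computed over the nonzero (brain) voxels only.

    Sketch: `volume` is a 3D numpy array holding one MRI modality of one patient.
    Background voxels (zero intensity, outside the brain) are left untouched.
    """
    brain_mask = volume > 0
    brain_voxels = volume[brain_mask]
    mean, std = brain_voxels.mean(), brain_voxels.std()

    normalised = volume.astype(np.float32).copy()
    normalised[brain_mask] = (brain_voxels - mean) / std
    return normalised
```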

After normalisation, we extracted 2D patches from the 3D MRI images and discarded the patches that do not contain any tumour. We then shuffled the entire set of extracted patches and partitioned it into a training set (80%) for training our network and an internal validation set (20%) for parameter tuning.
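A minimal sketch of this extraction and the 80/20 split, assuming numpy arrays and treating axial slices as the extracted patches (the exact patch size and sampling details of the paper are not reproduced here):

```python
import numpy as np

def extract_tumour_slices(volume, mask):
    """Collect the 2D axial slices that contain at least one tumour voxel.

    Sketch: `volume` is a normalised 3D (or stacked multimodal) array and
    `mask` is the corresponding 3D label map of one patient."""
    keep = [z for z in range(mask.shape[2]) if np.any(mask[:, :, z] > 0)]
    return (np.stack([volume[:, :, z] for z in keep]),
            np.stack([mask[:, :, z] for z in keep]))

def shuffle_and_split(patches, labels, val_fraction=0.2, seed=0):
    """Shuffle the pooled patches and split them 80% / 20% into training and internal validation sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(patches))
    split = int(len(patches) * (1.0 - val_fraction))
    train_idx, val_idx = order[:split], order[split:]
    return (patches[train_idx], labels[train_idx]), (patches[val_idx], labels[val_idx])
```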

2.2 Network Architecture

We have designed a U-Net Ronneberger2015 inspired 2D fully convolutional neural network. We illustrate our network architecture in Fig. 1. The network is made up of an encoder part to capture context and a symmetric decoder part to enable precise localisation.

The encoder part is composed of encoder blocks. Each encoder block is a residual block He2016a containing two convolutions, each of which is followed by Batch Normalisation Ioffe2015 and a ReLU activation. While downsampling, we decreased the resolution of the activation maps by a factor of 2 using strided convolutions with stride 2. Additionally, we doubled the number of features in every downsampling operation. The first convolutional layer at the beginning of the encoder contains 32 filters. The encoder endpoint contains the activation maps with the lowest spatial resolution.

In the decoder part, we upsampled the activation maps using bilinear upsampling to increase the resolution of the feature maps by a factor of 2. Additionally, we halved the number of features in every upsampling operation. The decoder part is composed of decoder blocks, each of which has the same structure as an encoder block. We used skip connections to concatenate the upsampled activation maps with the activation maps from the corresponding level of the encoder. A final convolution at the end of the decoder maps the feature vector to the four classes and is followed by a softmax output layer.
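The following tf.keras sketch illustrates this kind of architecture: residual blocks with two convolutions, each followed by Batch Normalisation and ReLU, strided-convolution downsampling, bilinear upsampling, encoder-decoder skip connections, and a final convolution with softmax over four classes. The 3x3 kernel sizes, the projection shortcut, and the placeholder input patch size are our assumptions rather than values taken from the paper.

```python
from tensorflow.keras import layers, Model

def residual_block(x, filters):
    """Residual block: two 3x3 convolutions, each followed by BatchNorm and ReLU,
    with an identity (or 1x1 projection) shortcut. Kernel size and projection
    shortcut are assumptions; the paper only specifies two convolutions with
    Batch Normalisation and ReLU inside a residual block."""
    shortcut = x
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding='same')(shortcut)
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])
    return layers.Activation('relu')(y)

def build_unet(input_shape=(192, 192, 4), base_filters=32, depth=3, n_classes=4):
    """Encoder-decoder with a 'U' depth of 3, 32 filters at the first level,
    strided-convolution downsampling, bilinear upsampling, and skip connections.
    The input patch size is a placeholder; 4 channels correspond to the 4 MRI modalities."""
    inputs = layers.Input(input_shape)
    x, skips = inputs, []
    # Encoder: residual blocks, doubling the number of features at each downsampling.
    for level in range(depth):
        x = residual_block(x, base_filters * 2 ** level)
        skips.append(x)
        x = layers.Conv2D(base_filters * 2 ** (level + 1), 3, strides=2, padding='same')(x)
    x = residual_block(x, base_filters * 2 ** depth)          # encoder endpoint (bottleneck)
    # Decoder: bilinear upsampling, halving the features, concatenating encoder skips.
    for level in reversed(range(depth)):
        x = layers.UpSampling2D(interpolation='bilinear')(x)
        x = layers.Conv2D(base_filters * 2 ** level, 1, padding='same')(x)
        x = layers.Concatenate()([x, skips[level]])
        x = residual_block(x, base_filters * 2 ** level)
    outputs = layers.Conv2D(n_classes, 1, activation='softmax')(x)  # final conv + softmax
    return Model(inputs, outputs)
```

With depth 3 and 32 initial filters this corresponds to the baseline configuration examined in Section 3; setting depth to 4 reproduces the deeper variant whose performance drops.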

Method Dice score Hausdorff distance (95% quantile)
ET WT TC ET WT TC
Baseline 0.78 0.93 0.91 5.71 11.94 9.71
Baseline (depth of ‘U’ shape is 4) 0.74 0.90 0.79 12.12 28.23 24.52
Baseline (training patches extracted from three orthogonal views) 0.80 0.94 0.91 4.96 8.31 7.29
Baseline (sagittal view) 0.81 0.93 0.92 5.06 13.87 5.24
Baseline (coronal view) 0.78 0.93 0.92 4.36 11.82 9.51
Ensemble of three models 0.82 0.94 0.93 2.30 5.02 2.70
Table 1: Quantitative results on the training set of BraTS 2018.

2.3 Training

Keras Keras and TensorFlow Abadi2016 were used to implement our network. We employed an NVIDIA GeForce GTX TITAN X 12 GB GPU for training our network. The batch size was 8, and training continued for 300 epochs. We used the Adam optimizer Kingma2015 with an initial learning rate and a learning rate decay, and we applied L2 regularisation to the convolutional filter weights.
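For illustration, a hedged sketch of this training configuration is given below. The learning rate, decay, and L2 strength are placeholders rather than the values used in the paper, and build_unet and weighted_dice_loss refer to the sketches in Sections 2.2 and 2.4.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay

# Placeholder hyper-parameters (illustrative only, not the paper's values).
lr_schedule = ExponentialDecay(initial_learning_rate=1e-4,
                               decay_steps=10000, decay_rate=0.96)
optimizer = Adam(learning_rate=lr_schedule)

# L2 weight regularisation would be attached to every convolutional layer, e.g.
#   layers.Conv2D(32, 3, padding='same',
#                 kernel_regularizer=tf.keras.regularizers.l2(1e-5))

model = build_unet()                                    # sketch from Section 2.2
model.compile(optimizer=optimizer, loss=weighted_dice_loss)  # loss sketched in Section 2.4

# x_train, y_train, x_val, y_val: patches and one-hot labels from Section 2.1.
model.fit(x_train, y_train, batch_size=8, epochs=300,
          validation_data=(x_val, y_val))
```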

2.4 Loss Function

In the problem of brain tumour segmentation, an extremely small part of the entire image contains tumour. In the training set of the BraTS 2018 Menze2015 ; Bakas2017 ; Bakas2017a ; Bakas2017b ; Bakas2018 dataset, approximately 98.88% of the voxels belong to the background, 0.64% to ED, 0.28% to NCR and NET, and 0.20% to ET. Thus, there is one majority class and three minority classes. Since the distribution of class labels in the dataset is highly disproportionate, the dataset suffers from the class imbalance problem. To deal with class imbalance, we formulated a weighted Dice loss function by taking inspiration from the various Dice loss functions available in the literature Milletari2016 ; Sudre2017 . Let $g_{i,c}$ be the ground truth for the $i$-th voxel belonging to the class $c$, $p_{i,c}$ be the corresponding prediction, $N$ be the total number of voxels, and $C$ be the total number of classes. Then, we define the weighted Dice loss function as

$$\mathcal{L}_{\mathrm{WDL}} = 1 - \frac{2 \sum_{c=1}^{C} w_c \sum_{i=1}^{N} g_{i,c}\, p_{i,c}}{\sum_{c=1}^{C} w_c \sum_{i=1}^{N} \left( g_{i,c} + p_{i,c} \right)} \qquad (1)$$

where $w_c$ is the weight allocated to the class $c$. Thus, a class containing a smaller number of voxels gets more weight. The weights are important in helping to differentiate the three minority classes better. The weighted Dice loss function is designed to deliver good performance for a multi-class segmentation problem like brain tumour segmentation.
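A minimal TensorFlow sketch of a weighted Dice loss of this kind is given below. The inverse squared class-volume weighting is an assumption in the spirit of the generalised Dice loss Sudre2017 and may differ from the exact weights used in the paper.

```python
import tensorflow as tf

def weighted_dice_loss(y_true, y_pred, smooth=1e-5):
    """Weighted Dice loss over C classes.

    Assumptions: y_true is one-hot encoded with shape (batch, H, W, C) and
    y_pred is the softmax output of the same shape."""
    axes = (0, 1, 2)                                            # sum over batch and spatial dims
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)    # per-class overlap
    cardinality = tf.reduce_sum(y_true + y_pred, axis=axes)     # per-class total volume
    # Assumed weighting: inverse of the squared class volume, so smaller classes get larger weights.
    weights = 1.0 / (tf.square(tf.reduce_sum(y_true, axis=axes)) + smooth)
    dice = (2.0 * tf.reduce_sum(weights * intersection) + smooth) / \
           (tf.reduce_sum(weights * cardinality) + smooth)
    return 1.0 - dice
```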

2.5 Data Augmentation

The shape, size, and location of brain tumours vary massively. This variability can lead to overfitting when training data are limited. To reduce overfitting and make the network robust to such variability, we enhanced the diversity of the training dataset using data augmentation. The techniques we used are random rotations, random horizontal flips, and random vertical flips. We performed these augmentations on the fly, before each epoch, so that memory consumption does not increase excessively.
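A minimal sketch of such on-the-fly augmentation with Keras generators follows; the rotation range and the shared-seed pairing of image and label generators are illustrative choices, not details taken from the paper.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Identical random transforms for images and label maps, kept in sync via a shared seed.
data_gen_args = dict(rotation_range=20,     # random rotations (placeholder range)
                     horizontal_flip=True,  # random horizontal flip
                     vertical_flip=True)    # random vertical flip
image_datagen = ImageDataGenerator(**data_gen_args)
mask_datagen = ImageDataGenerator(**data_gen_args)

seed = 42
# x_train: (n, H, W, 4) image patches; y_train: (n, H, W, 4) one-hot label maps.
image_generator = image_datagen.flow(x_train, batch_size=8, seed=seed)
mask_generator = mask_datagen.flow(y_train, batch_size=8, seed=seed)
train_generator = zip(image_generator, mask_generator)

# Note: rotated label maps should be re-binarised (or rotated with nearest-neighbour
# interpolation) before being used as training targets.
```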

2.6 Incorporating 3D Contextual Information

In segmentation, a voxel's class label is highly correlated with that of its neighbouring voxels. Since our network is 2D, it can only capture the correlation of a voxel with its neighbours on the same 2D slice. To capture the correlation of a voxel with its neighbours on different 2D slices, we need to incorporate 3D contextual information from the axial, sagittal, and coronal views. This can help determine the class label of a voxel more accurately, thereby improving the performance of our model. We used two different approaches for combining the 3D contextual information from multiple views. In the first approach, we extracted training patches from the axial, sagittal, and coronal views and used all of these patches as input for training a single network. In the second approach, we trained three separate networks, one in each of the axial, sagittal, and coronal views, and used an ensemble of these three networks to make predictions by averaging their softmax outputs.
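A sketch of the second approach, averaging the softmax outputs of three single-view models over a whole volume, is shown below. The axis convention and the helper names are ours; in practice, slices from the sagittal and coronal views may need padding or cropping to the network's input size, which is omitted here.

```python
import numpy as np

VIEW_AXIS = {'axial': 2, 'sagittal': 0, 'coronal': 1}  # assumed (H, W, D[, channels]) layout

def predict_volume(model, volume, view):
    """Run a 2D model slice-by-slice along one orthogonal view and return a
    per-voxel softmax volume of shape (H, W, D, n_classes). Sketch only."""
    axis = VIEW_AXIS[view]
    stacked = np.moveaxis(volume, axis, 0)   # slices of the chosen view become the batch axis
    probs = model.predict(stacked)           # (n_slices, h, w, n_classes)
    return np.moveaxis(probs, 0, axis)       # restore the original voxel ordering

def ensemble_predict(models, volume):
    """Average the softmax outputs of the axial, sagittal, and coronal models, then take argmax."""
    views = ['axial', 'sagittal', 'coronal']
    mean_probs = np.mean([predict_volume(m, volume, v) for m, v in zip(models, views)], axis=0)
    return np.argmax(mean_probs, axis=-1)
```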

3 Experiments and Results

3.1 Dataset

We carried out our experiments using the BraTS 2018 Menze2015 ; Bakas2017 ; Bakas2017a ; Bakas2017b ; Bakas2018 dataset. The BraTS training dataset includes pre-operative MRI brain images of 285 patients, of which 210 are HGG and 75 are LGG. The ground truths are also provided for the training dataset. The BraTS validation dataset contains MRI images of 66 patients. The ground truths are not provided for the validation set, and the online evaluation tool PennImaging must be used to determine an algorithm's performance on it. For all patients, the T1, T1ce, T2, and FLAIR MRI modalities are provided. The four MRI modalities of each patient have been co-registered, and every image has been resampled to 1 mm³ isotropic resolution and skull stripped. Every MRI image has dimensions of 240 × 240 × 155 voxels. The voxels in the MRI images are grouped into four classes: the NCR and the NET are assigned class label 1, ED is assigned label 2, ET is assigned label 4, and everything else is assigned label 0. Label 3 is not used.
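For clarity, a small sketch of the label bookkeeping this convention implies (mapping label 4 to a contiguous class index for training and back before evaluation); the helper names are ours.

```python
import numpy as np

# BraTS label convention: 0 = background, 1 = NCR/NET, 2 = ED, 4 = ET (label 3 unused).
def brats_to_classes(label_map):
    """Map BraTS labels {0, 1, 2, 4} to contiguous class indices {0, 1, 2, 3} for training."""
    classes = label_map.copy()
    classes[classes == 4] = 3
    return classes

def classes_to_brats(class_map):
    """Map predicted class indices back to BraTS labels before submission to the evaluation tool."""
    labels = class_map.copy()
    labels[labels == 3] = 4
    return labels
```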

Method Dice score Hausdorff distance (95% quantile)
ET WT TC ET WT TC
Baseline 0.76 0.89 0.81 7.29 10.39 11.60
Baseline (depth of ‘U’ shape is 4) 0.74 0.87 0.79 12.65 27.91 27.31
Baseline (training patches extracted from three orthogonal views) 0.78 0.89 0.81 8.18 13.76 7.22
Baseline (sagittal view) 0.75 0.88 0.80 5.23 18.31 10.69
Baseline (coronal view) 0.77 0.88 0.79 14.30 15.62 16.08
Ensemble of three models 0.79 0.90 0.82 2.99 6.28 5.90
Table 2: Quantitative results on the validation set of BraTS 2018.
Dataset Sensitivity Specificity
ET WT TC ET WT TC
Training 0.86 0.93 0.91 0.999 0.997 0.998
Validation 0.81 0.89 0.79 0.997 0.995 0.998
Table 3: Quantitative results in terms of sensitivity and specificity on the BraTS 2018 training and validation sets.
Figure 2: Boxplots for each of the three tumour sub-regions on the BraTS training set. The first column represents the boxplots of Dice score, second column represents the boxplots of sensitivity, third column represents the boxplots of specificity, and fourth column represents the boxplots of 95% quantile of the Hausdorff distance. The mean is denoted using a red coloured diamond.
Figure 3: Boxplots for each of the three tumour sub-regions on the BraTS validation set. The first column represents the boxplots of Dice score, second column represents the boxplots of sensitivity, third column represents the boxplots of specificity, and fourth column represents the boxplots of 95% quantile of the Hausdorff distance. The mean is denoted using a red coloured diamond.
Figure 4: Qualitative result for a patient from the BraTS training set. The left column displays the FLAIR image, the middle column displays the ground truth overlaid over the FLAIR image, and right column displays our predicted segmentation mask overlaid over the FLAIR image. The top row displays the axial slices, middle row displays the sagittal slices, and bottom row displays the coronal slices. Yellow colour represents the ET, red colour represents the NCR and the NET, and green colour represents the ED.

3.2 Results

We have used the online evaluation tool PennImaging for measuring our method's performance. The online evaluation tool reports the performance of the algorithm in terms of Dice score, sensitivity, specificity, and Hausdorff distance (95% quantile) for the ET, WT, and TC tumour sub-regions. Quantitative segmentation results in terms of the Dice coefficient and the 95% quantile of the Hausdorff distance on the BraTS training set and the BraTS validation set are presented in Table 1 and Table 2, respectively. In the tables, we use the term baseline to denote our network that used the weighted Dice loss function and was trained in the axial view. Next, we increased the depth of the ‘U’ shape from 3 to 4 by adding one extra level, and it resulted in a decrease in performance. This happens because increasing the depth of the ‘U’ shape beyond a certain level results in a massive reduction in image resolution, leading to a significant loss of spatial information. Thus, we must always use an optimum depth of the ‘U’ shape to achieve excellent performance. We then trained a single model by extracting patches from three orthogonal views. It resulted in only a slight improvement in Dice score for the enhancing tumour, and on the validation set the Hausdorff distance increased for the ET and WT sub-regions. This shows that a single 2D network cannot capture 3D contextual information from multiple views. We also trained our baseline in the sagittal and coronal views. Then we used an ensemble of the three models trained in the three orthogonal views. It increased the Dice score but, more importantly, it helped to bring down the Hausdorff distance. This shows that 3D contextual information can be effectively captured by our ensemble of 2D networks trained in multiple views. We show the quantitative segmentation results on the BraTS dataset in terms of sensitivity and specificity in Table 3. In Fig. 2 and Fig. 3 we present the boxplots of the four evaluation metrics for each of the three tumour sub-regions on the BraTS training set and the BraTS validation set, respectively. In the boxplots for Dice score and sensitivity, the WT sub-region has the lowest spread and the highest mean, whereas the ET sub-region has the lowest mean. The boxplots for specificity have the highest spread and the lowest mean for the WT sub-region. In the boxplots for the 95% quantile of the Hausdorff distance, the ET sub-region has the lowest spread and the lowest mean.

We show the qualitative result for a patient from the BraTS training set in Fig. 4. In that figure, yellow represents the ET, red represents the NCR and the NET, and green represents the ED. It can be observed in the figure that our model effectively segments the different tumoural structures. We used the ITK-SNAP Yushkevich2006a ; ITK-SNAP tool for visualising the MRI image.

In Table 4, we compare the results obtained by our model with those of the state-of-the-art methods on the validation set of BraTS 2018 BraTS18Leaderboard . The results achieved by our technique are comparable to those of the state-of-the-art methods.

Methods Dice score Hausdorff distance (95% quantile)
ET WT TC ET WT TC
Proposed 0.79 0.90 0.82 2.99 6.28 5.90
Myronenko Myronenko2019 0.82 0.91 0.86 3.92 4.51 6.85
Isensee et al. Isensee2019 0.80 0.91 0.86 2.41 4.27 6.52
McKinley et al. McKinley2019 0.79 0.90 0.84 3.55 4.17 4.93
Zhou et al. Zhou2019 0.81 0.90 0.86 2.71 4.17 6.54
Table 4: Results obtained by our technique on the validation set of BraTS 2018 (submission id IML). We also display the results of the state-of-the-art methods for comparison.

4 Conclusions

In this work, we have described our automated glioma segmentation method using 2D CNNs that are based on U-Net Ronneberger2015 . To effectively deal with the class imbalance problem, we have formulated a weighted Dice loss function taking inspiration from Milletari2016 ; Sudre2017 . We found that increasing the depth of the ‘U’ shape beyond a certain level results in a massive reduction in image resolution, leading to a loss of spatial information and a decrease in performance. Thus, for CNNs based on U-Net, we must always use an optimum depth of the ‘U’ shape to achieve the best possible results. We investigated the effectiveness of two different approaches for incorporating 3D contextual information into a 2D network to enhance its performance. In the first approach, we used a single model trained with patches extracted from three orthogonal views, but found that this approach resulted in a decrease in performance. In the second approach, we used an ensemble of three models trained in multiple views, and this approach resulted in an improvement in performance. We found that 3D contextual information cannot be captured by a single 2D network trained with patches extracted from multiple views, whereas an ensemble of three 2D networks trained in multiple views can effectively capture this information and deliver much better performance. Finally, our method obtained Dice scores of 0.79, 0.90, and 0.82 for the ET, WT, and TC, respectively, on the BraTS 2018 validation set, and its results are comparable to those of the state-of-the-art methods. Although we used a 2D network while the other state-of-the-art methods used 3D networks, we managed to achieve comparable performance because we incorporated 3D contextual information from multiple views. This shows that a 2D network that incorporates 3D contextual information from orthogonal views can deliver performance comparable to a 3D network.

References