DeepMRSeg: A convolutional deep neural network for anatomy and abnormality segmentation on MR images

by Jimit Doshi, et al.

Segmentation has been a major task in neuroimaging. A large number of automated methods have been developed for segmenting healthy and diseased brain tissues. In recent years, deep learning techniques have attracted a lot of attention as a result of their high accuracy in different segmentation problems. We present a new deep learning based segmentation method, DeepMRSeg, that can be applied in a generic way to a variety of segmentation tasks. The proposed architecture combines recent advances in the field of biomedical image segmentation and computer vision. We use a modified UNet architecture that takes advantage of multiple convolution filter sizes to achieve multi-scale feature extraction adaptive to the desired segmentation task. Importantly, our method operates on minimally processed raw MRI scans. We validated our method on a wide range of segmentation tasks, including white matter lesion segmentation, segmentation of deep brain structures and hippocampus segmentation. We provide code and pre-trained models to allow researchers to apply our method on their own datasets.



1 Introduction

Segmentation has been a major task in medical image analysis since the early years of the field, as it enables quantification of normal and abnormal anatomical regions both for individual assessments and for comparative group analyses (Gonzalez-Villa et al., 2016). In neuroimaging, multiple automated methods have been developed for various problems, such as brain extraction, segmentation of anatomical regions of interest (ROIs), white matter lesion (WML) segmentation and segmentation of brain tumor sub-regions. Importantly, each of these problems has its own specific challenges, mainly due to variations in image modalities and in the imaging signatures that best characterize target regions. These variations motivated the development of a large number of distinct task-specific segmentation methods (Kalavathi P, 2016; Anbeek et al., 2004; Eugenio Iglesias and Sabuncu, 2014; Gordillo et al., 2013; Despotovic et al., 2015).

Machine learning has played a key role in enabling novel methods that achieve accuracy comparable to, or surpassing, that of human raters. In the commonly used supervised learning framework, examples with ground-truth labels are presented to the learning algorithm in order to construct a model that learns the imaging patterns that characterize the target segmentations. The model is then applied to new scans to segment the target areas on them. Common supervised learning techniques, such as support vector machines, have obtained very promising results. However, they require a number of sophisticated preprocessing and feature elimination/selection steps, which makes them vulnerable to scanner variations and limits their widespread use in clinical settings.

In recent years, deep learning techniques have attracted a lot of attention as a result of their state-of-the-art performance in multiple major problems in computer vision and image analysis (Guo et al., 2018). In neuroimaging, convolutional neural networks (CNNs) have started to gain popularity, with successful applications on various image recognition tasks (Kamnitsas et al., 2016; Akkus et al., 2017; Anwar et al., 2018). CNNs are deep neural networks designed to take advantage of the 2D or 3D structure of the input images. An input image passes through a series of convolution layers followed by pooling layers, which act together as filters that extract multiple translation-invariant local features, without the need for the manual feature engineering traditionally required.

In this paper we present DeepMRSeg, a deep learning based segmentation method that can be applied in a generic way to a variety of segmentation tasks. Our method uses a modified version of the UNet architecture (Ronneberger et al., 2015), with ResNet (He et al., 2015) and modified Inception-ResNet-A (Szegedy et al., 2016) blocks in the encoding and decoding paths, taking advantage of recent advances in biomedical image segmentation and image classification. The residual connections allow the UNet architecture to learn residual mappings, while multiple branches of convolutional layers with different kernel sizes allow the network to learn multi-scale features that are adaptive to the segmentation task at hand. Also, we replace the maxpool operations in the UNet architecture with 1x1 convolutional filters to prevent the loss of fine boundary details that are important for a segmentation task. Importantly, our method operates on minimally processed raw MRI scans and it can be directly applied to different segmentation problems after training a model with the specific training data type. We validated our method on a wide range of segmentation tasks, including WML segmentation, segmentation of deep brain structures, and hippocampus segmentation. The DeepMRSeg method and trained models are provided on our online platform, IPP, to allow users to apply our methods and models without the need to install any software packages.

2 Related Work

The UNet architecture was introduced by Ronneberger et al. (2015). UNet has been an important advancement in the application of deep CNNs to the problem of biomedical image segmentation. CNNs were initially used on classification problems, mapping input images to output class labels. However, in segmentation tasks, the desired output is an image, e.g. a binary segmentation map. UNet extended the CNN architecture by supplementing the usual contracting or encoding path with a symmetric expanding or decoding path, where pooling operators are replaced by upsampling operators. The encoding path allows the architecture to learn spatially relevant context information and the decoding path adds the precise localization back to the architecture, leading to a final segmentation image as output of the model.

A straightforward way to improve the performance of deep neural networks is to increase the network size, either by increasing the depth (number of layers) or the width (number of units at each layer). The ResNet architecture (He et al., 2015) was introduced to address an important limitation of deep learning, known as the degradation problem. As a network goes deeper, the gradient of the error function used for the weight updates may vanish, resulting in degrading accuracy. The main idea in ResNet is to learn a “residual mapping” instead of directly learning the desired underlying mapping. For this purpose, shortcut connections that skip one or more layers are introduced. These shortcut connections are identity mappings that simply add their outputs to the outputs of the stacked convolutional layers, thus making the layers learn the residual. ResNet allowed training deeper networks, while providing significantly faster convergence.

Large convolution operations are computationally expensive, as the network will have a larger number of parameters to learn. The Inception framework (Szegedy et al., 2014) was proposed to overcome this limitation by introducing sparsity in the network architecture. The main idea is to constrain the network to lower dimensions and group them into highly correlated filter units. The authors used different convolutional filter sizes (1x1, 3x3, 5x5) instead of a single filter size (3x3), allowing the network to learn different representations of the input image. This strategy helped to reduce the number of parameters/connections, thus reducing the complexity of the network and allowing the network to go wider while keeping the computational budget constant.
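The residual mapping used by ResNet can be illustrated with a minimal NumPy sketch. This is not the implementation used in this work: the function name `residual_block` is hypothetical, and two dense layers stand in for the stacked convolutional layers. It only shows that the block computes y = x + F(x), so that a zero residual reduces the block to the identity.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Toy residual block: y = x + F(x).

    F is two linear layers with a ReLU in between, used here as a
    stand-in (assumption) for the stacked convolutional layers.
    """
    hidden = np.maximum(x @ w1, 0.0)   # first layer + ReLU
    residual = hidden @ w2             # second layer: the learned residual F(x)
    return x + residual                # identity shortcut added to F(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))

# With zero weights the residual vanishes and the block is the identity.
zeros = np.zeros((8, 8))
identity_out = residual_block(x, zeros, zeros)

# With non-zero weights the block outputs the input plus a correction.
w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = rng.normal(scale=0.1, size=(8, 8))
out = residual_block(x, w1, w2)
```

Because the shortcut carries the input unchanged, the stacked layers only need to model the difference between the input and the desired output.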

As noted by Krizhevsky et al. (2012), when convolutional filters are arranged in different groups, the network can learn distinct features from each group, with low correlation between the learned features across groups. This was demonstrated in AlexNet, where the network consistently identified color-agnostic and color-specific features in different filter groups. The same concept also applies to the Inception network, where the use of different convolutional filters at a single layer lets the network learn feature representations at different resolution levels.

3 Network Architecture

3.1 Overview

An overview of the proposed architecture is illustrated in figure 1. DeepMRSeg is built upon components that combine ideas from recent advances in the field, as described in the previous section. The network architecture consists of an encoding path and a corresponding decoding path, as in UNet (Ronneberger et al., 2015), followed by a voxel-wise, multi-class softmax classifier that produces class probabilities for each voxel independently. An initial projection layer transforms the input feature maps (m) into the desired number of features (f). These features go through a pre-encoding block, consisting of ResNet blocks that extract various features from the input images and form the input to the UNet. The encoding path of the network consists of encoder blocks that operate at different feature map resolutions. At each layer, the feature maps are subsampled using the “transition down” operation and fed into a ResInc block. The size of the feature maps decreases at each layer, while their receptive field increases, thus encoding more context information into the network. The decoding pathway includes up-sampling operations symmetric to the encoding blocks, coupled with ResInc blocks. Individual components of the DeepMRSeg architecture are explained in more detail below.
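As an illustration of the final voxel-wise softmax classifier, the following NumPy sketch converts per-voxel class scores into class probabilities and a hard label map. The function name `voxelwise_softmax`, the array shapes and the random inputs are assumptions made for this example only.

```python
import numpy as np

def voxelwise_softmax(logits):
    """Softmax over the last (class) axis, applied to each voxel independently."""
    # Subtract the per-voxel maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical 4x4 slice with 3 classes (background plus two structures).
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4, 3))

probs = voxelwise_softmax(logits)   # class probabilities per voxel
labels = probs.argmax(axis=-1)      # hard segmentation map
```

Each voxel is normalized on its own, so the probabilities of one voxel do not depend on the scores of its neighbours at this stage.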

Figure 1: Overview of the DeepMRSeg network architecture

3.2 Projection Layer

This layer is used at the start of the network to project the input image channels, or modalities (m), to a set of feature maps (f). This is accomplished with a convolution layer with a kernel size of 1x1 and stride 1. Intuitively, this layer learns a linear combination of the input channels and projects it onto the desired number of feature maps required for subsequent layers. The convolution in this layer is followed by batch normalization (Ioffe and Szegedy, 2015) and ReLU activation (Nair and Hinton, 2010).
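The statement that a 1x1 convolution learns a linear combination of the input channels can be made concrete with a short NumPy sketch. The function name `project_channels` and the sizes (m = 2 modalities, f = 16 feature maps) are assumptions for illustration, and batch normalization is left out for brevity.

```python
import numpy as np

def project_channels(image, weights, bias):
    """1x1 convolution with stride 1, followed by ReLU.

    image:   (H, W, m) input with m channels/modalities
    weights: (m, f)    one linear combination of the m channels per output map
    bias:    (f,)
    Batch normalization is omitted in this sketch.
    """
    features = image @ weights + bias   # same linear map applied at every pixel
    return np.maximum(features, 0.0)    # ReLU

rng = np.random.default_rng(0)
m, f = 2, 16                            # e.g. T1 + FLAIR projected to 16 feature maps
image = rng.normal(size=(8, 8, m))
weights = rng.normal(size=(m, f))
bias = np.zeros(f)

features = project_channels(image, weights, bias)
```

The spatial size is unchanged; only the channel dimension goes from m to f.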

3.3 ResNet Module

Following the general design principles described in (Szegedy et al., 2015), we avoid representational bottlenecks with extreme compression early in the network. To achieve this, we add the traditional ResNet block (fig. 2.A) before the encoding pathway to ensure that the input features go through a few layers of convolutions before being fed into the ResInc block.

Figure 2: The ResNet and ResInc block architectures. The input data is the set of “f” feature maps obtained from the output of the previous layer.

3.4 ResInc Module

The ResInc module (fig. 2.B), modified from the Inception-ResNet-A module of the Inception-ResNet-v2 architecture, is used at every level of the UNet architecture, coupled with the “Transition Down” operation. This module splits the incoming input feature maps into 4 branches and an identity mapping that is added back to the final output. Each branch reduces the input dimensionality from “f” feature maps to “f/4” feature maps. This introduces a bottleneck by reducing the dimensionality of the incoming features and also reduces the number of learnable parameters in the network. Each branch subsequently transforms the input features with a varying number of 3x3 convolution layers. This ensures that each branch learns a different representation of the input features by learning shallow as well as deeper features, and allows the subsequent layer to abstract features from different scales simultaneously. This property can be extremely useful when dealing with segmentation tasks for more complex or heterogeneous structures.

The 4 branches are concatenated to form a single feature vector that then goes through a 1x1 convolution layer before the residual connection is added. This 1x1 convolution acts as a linear weighting of the features learned in each branch. Thus, if a certain representation is not useful, it can be assigned a lower weight.

This configuration of the ResInc module has less than one-third the number of learnable parameters compared to the traditional ResNet block. This allows us to increase the width and depth of the network while keeping the number of learnable parameters low.
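The parameter saving can be checked with a rough count. The sketch below is an estimate under stated assumptions, not the exact configuration of the ResInc module: it assumes 2D convolutions, ignores biases and batch normalization parameters, takes the traditional ResNet block to be two 3x3 convolutions with f input and output maps, and assumes the four branches apply 0, 1, 2 and 3 stacked 3x3 convolutions after their 1x1 reduction (the text only states that the number varies). All function names are hypothetical.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights of a 2D convolution, ignoring biases and batch-norm parameters."""
    return in_ch * out_ch * kernel * kernel

def resnet_block_params(f):
    # Assumption: two stacked 3x3 convolutions, f -> f each.
    return 2 * conv_params(f, f, 3)

def resinc_block_params(f, branch_depths=(0, 1, 2, 3)):
    # Assumption: each branch applies a 1x1 reduction f -> f/4, followed by
    # branch_depths[i] 3x3 convolutions operating on f/4 feature maps.
    reduced = f // 4
    total = 0
    for depth in branch_depths:
        total += conv_params(f, reduced, 1)
        total += depth * conv_params(reduced, reduced, 3)
    # 1x1 convolution on the concatenated f maps before the residual is added.
    total += conv_params(f, f, 1)
    return total

f = 64
resnet_count = resnet_block_params(f)
resinc_count = resinc_block_params(f)
ratio = resinc_count / resnet_count
print(resnet_count, resinc_count, round(ratio, 3))
```

Under these assumptions the ResInc count is 5.375 f² against 18 f² for the ResNet block, a ratio of about 0.30, which is in line with the "less than one-third" figure; other branch depths would give a different ratio.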

3.5 Transition Down Blocks

In the traditional UNet architecture, a maxpool layer is used to reduce the dimensionality of the feature maps. This allows the subsequent convolutional layers to have a larger field of view and therefore, more contextual information. This operation is used to achieve translation invariance over small spatial shifts in the input image. Adding several such operations in the network achieves more translation invariance for robust classification, but it can also lead to a loss of spatial resolution (boundary detail) of the input feature maps. This lossy representation of the boundary details is not desirable for segmentation tasks where boundary delineation is important.

Motivated by the work of Springenberg et al. (2014), we replaced the maxpool layer with a 1x1 convolution layer with stride 2, with the intuition that the network can learn the parameters required for downsampling the feature maps. This adds a few more learnable parameters to the network, resulting in a larger model size compared to maxpooling. However, while maxpooling subsamples within each feature map separately, the proposed convolution operation allows the network to model inter-feature dependencies.

The choice of kernel size was based on minimizing the number of learnable parameters outside of blocks containing residual connections. This operation serves the two tasks of subsampling the feature maps and doubling the number of feature maps simultaneously. This is followed by batch normalization and ReLU activation.
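A minimal NumPy sketch of this transition down operation is given below. The function name `transition_down` and the tensor shapes are assumptions, and batch normalization is omitted. A 1x1 kernel with stride 2 visits every second position along each spatial axis and applies the same channel-mixing matrix there, which is what lets it combine information across feature maps while doubling their number.

```python
import numpy as np

def transition_down(features, weights):
    """1x1 convolution with stride 2, followed by ReLU (batch norm omitted).

    features: (H, W, f)
    weights:  (f, 2f)   doubles the number of feature maps
    """
    # A 1x1 kernel with stride 2 only reads every second position per axis.
    subsampled = features[::2, ::2, :]
    # The same linear map mixes the f input maps at each retained position.
    return np.maximum(subsampled @ weights, 0.0)

rng = np.random.default_rng(0)
f = 4
features = rng.normal(size=(8, 8, f))
weights = rng.normal(size=(f, 2 * f))

down = transition_down(features, weights)
```

In contrast, 2x2 maxpooling would keep the number of feature maps unchanged and treat each map independently.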

3.6 Transition Up Blocks

For upsampling the feature maps, we use a transposed convolution layer with a kernel size of 1 and stride 2. Upsampling the feature maps allows the addition of the skip connection from the corresponding layer of the encoding path. These skip connections are essential as they add the spatial localization information back into the network. Along with upsampling, the number of output feature maps is reduced to one-fourth.
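The effect of a transposed convolution with kernel size 1 and stride 2 can be sketched in NumPy as follows. The function name `transition_up` is hypothetical, and the sketch assumes that the output is padded to exactly twice the input size, which is one common convention and is not stated in the text.

```python
import numpy as np

def transition_up(features, weights):
    """Transposed 1x1 convolution with stride 2.

    features: (H, W, f)
    weights:  (f, f_out)  here f_out = f // 4, as described in the text
    Assumption: the output has size (2H, 2W), i.e. one row/column of
    output padding is added.
    """
    h, w, _ = features.shape
    out = np.zeros((2 * h, 2 * w, weights.shape[1]))
    # Each input position is written to every second output position;
    # with a 1x1 kernel the positions in between remain zero.
    out[::2, ::2, :] = features @ weights
    return out

rng = np.random.default_rng(0)
f = 8
features = rng.normal(size=(4, 4, f))
weights = rng.normal(size=(f, f // 4))

up = transition_up(features, weights)
```

The zero positions are filled in by the layers that follow, after the skip connection from the encoding path has been added.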

3.7 Training Methodology

The cost function to be minimized was a combination of softmax cross-entropy ($L_{CE}$), mean squared error ($L_{MSE}$) and soft Intersection over Union ($L_{IoU}$). $L_{MSE}$ and $L_{IoU}$ were calculated between the one-hot encoded labels and the predicted softmax probabilities. The final loss, $L$, was an equally weighted sum of $L_{CE}$, $L_{MSE}$ and $L_{IoU}$. The Adam optimizer was used to minimize the loss function. We used a decaying learning rate schedule, in which the learning rate was reduced by a fixed decay factor every epoch. Data augmentation was done using randomized left/right flipping, followed by random translation, rotation and brightness/contrast adjustment.
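A combined loss of this kind could be computed as in the NumPy sketch below. It is a sketch under assumptions rather than the training code: the function name `combined_loss` is hypothetical, the soft IoU term uses one common formulation (one minus the ratio of soft intersection to soft union), the cross-entropy is computed from probabilities instead of logits, and all three terms get unit weight.

```python
import numpy as np

def combined_loss(probs, onehot, eps=1e-7):
    """Equally weighted sum of cross-entropy, MSE and soft IoU loss.

    probs:  (N, C) predicted softmax probabilities for N voxels
    onehot: (N, C) one-hot encoded ground-truth labels
    """
    # Cross-entropy, averaged over voxels.
    ce = -np.mean(np.sum(onehot * np.log(probs + eps), axis=-1))
    # Mean squared error between probabilities and one-hot labels.
    mse = np.mean((probs - onehot) ** 2)
    # Soft IoU loss (assumed form): 1 - soft intersection / soft union.
    intersection = np.sum(probs * onehot)
    union = np.sum(probs + onehot - probs * onehot)
    soft_iou = 1.0 - intersection / (union + eps)
    return ce + mse + soft_iou

onehot = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
good = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
bad = np.array([[0.4, 0.6], [0.6, 0.4], [0.3, 0.7]])

loss_perfect = combined_loss(onehot, onehot)
loss_good = combined_loss(good, onehot)
loss_bad = combined_loss(bad, onehot)
```

All three terms approach zero for a perfect prediction and grow as the predicted probabilities move away from the labels.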

3.8 Evaluation Datasets

We performed validation experiments on 3 different segmentation problems, specifically WML, mid-brain and hippocampus segmentation. We used publicly available datasets with ground-truth labels for each problem.

WML segmentation: We used the training dataset that was provided as part of the MICCAI 2017 WML Segmentation Challenge (Kuijf et al., 2019). This dataset included 3D T1-weighted and 2D multi-slice FLAIR scans from 60 subjects and manually delineated WML masks for these scans. The MRI scans were acquired from three different institutes/scanners: the University Medical Center (UMC), Utrecht, VU University Medical Centre (VU), Amsterdam, and the National University Health System (NUHS), Singapore. Manual segmentations were generated by one expert rater, following the STandards for ReportIng Vascular changes on nEuroimaging (STRIVE) protocol (Wardlaw et al., 2013) (Figure 3).

Figure 3: Example ground-truth segmentation for the WML.

Deep brain segmentation: We applied DeepMRSeg for segmentation of deep brain structures using the publicly available dataset from the MICCAI 2013 segmentation challenge (Asman et al., 2013). This dataset included T1-weighted scans of 35 subjects from the OASIS project and corresponding manually created reference labels for 14 deep brain structures. The target regions of interest included accumbens, amygdala, caudate, hippocampus, pallidum, putamen and thalamus, separately for the left and the right hemispheres (fig 4).

Figure 4: Example ground-truth segmentation for deep brain structures.

Hippocampus segmentation: We also applied DeepMRSeg for segmenting the hippocampus, a structure critical in learning and memory, and particularly vulnerable to damage in early stages of AD (Mu and Gage, 2011). We used the dataset provided as part of the Decathlon challenge, consisting of 3D T1-weighted MPRAGE scans of 195 subjects and manual hippocampus segmentations into two sub-regions, hippocampus tail and body (Simpson et al., 2019) (Figure 5).

Figure 5: Example ground-truth segmentation for the hippocampus.

4 Evaluation Experiments and Metrics

We compared DeepMRSeg against a modified UNet architecture in which a batch normalization layer was added between the convolution and ReLU layers. This served as the benchmark method for the proposed architecture. The loss function, optimizer and data augmentation were the same as those used for the proposed architecture. The two network models were trained using the appropriate type of labeled data for each specific segmentation task. We performed four-fold cross-validation in all experiments, with 25% of the data left out for testing and the remaining data used for training and validation. This was repeated 20 times with randomization, giving us a robust estimate of the performance of the networks.
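One way to generate such splits is sketched below using only the Python standard library. The function name `repeated_kfold` is hypothetical, and the sketch assumes that "repeated 20 times with randomization" means reshuffling the subjects before each of 20 four-fold runs; splitting the non-test data further into training and validation sets is left out.

```python
import random

def repeated_kfold(n_subjects, n_folds=4, n_repeats=20, seed=0):
    """Yield (repeat, fold, train_ids, test_ids) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        ids = list(range(n_subjects))
        rng.shuffle(ids)                                  # new random partition per repeat
        folds = [ids[i::n_folds] for i in range(n_folds)]
        for k, test_ids in enumerate(folds):
            # All subjects outside the held-out fold are used for training/validation.
            train_ids = [s for j, fold in enumerate(folds) if j != k for s in fold]
            yield repeat, k, train_ids, test_ids

# Example with the 60 subjects of the WML dataset.
splits = list(repeated_kfold(60))
repeat, fold, train_ids, test_ids = splits[0]
```

With 60 subjects, each split holds out 15 subjects (25%) for testing and keeps 45 for training and validation, and the 20 repeats give 80 splits in total.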

We quantified the performance of the networks using three complementary metrics that are commonly used for measuring the overlap between binary segmentation masks. We report the $F_1$ score (also known as the Dice coefficient), the $F_2$ score and the balanced accuracy ($BA$) between automated and expert-delineated ground-truth segmentations. We calculated the scores individually for each subject and target ROI, and we report the mean and standard deviation of the scores across all subjects.

In neuroimaging analyses, rather than binary segmentation masks, total volumes of segmented regions are often used as primary variables of interest. For this reason, we also calculated metrics to evaluate the accuracy of total volume estimations from the segmentations. We used the concordance correlation coefficient ($CCC$), a metric that measures the agreement between two variables and that is commonly applied for evaluating reproducibility or inter-rater consistency. We report the $CCC$ between automated and ground-truth segmentation volumes for all subjects.

The metrics that are used in our evaluations are briefly described below.

The $F_1$ score, or Dice coefficient (Dice, 1945), is a spatial overlap statistic used to gauge the similarity of two segmentations. It is defined as:

$$F_1(P, G) = \frac{2\,|P \cap G|}{|P| + |G|} = \frac{2\,TP}{2\,TP + FP + FN}$$

where $P$ and $G$ are the predicted and ground-truth labels, and $TP$, $TN$, $FP$ and $FN$ are the true positive, true negative, false positive and false negative counts, respectively.

The $F_2$ score is commonly used in applications where recall is more important than precision (as compared to $F_1$):

$$F_2 = \frac{5\,TP}{5\,TP + 4\,FN + FP}$$

Considering that our target datasets may typically be imbalanced, i.e. the foreground (target segmentation) may be much smaller than the background, we also report the balanced accuracy ($BA$), which is defined as:

$$BA = \frac{TPR + TNR}{2}$$

where $TPR = \frac{TP}{TP + FN}$ and $TNR = \frac{TN}{TN + FP}$.
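The three overlap metrics can be computed from a pair of binary masks as in the following sketch. It uses the standard confusion-matrix definitions of the Dice/F1 score, the F2 score and balanced accuracy; the function name `overlap_metrics` and the toy masks are assumptions, and degenerate cases such as an empty ground-truth mask (division by zero) are not handled.

```python
import numpy as np

def overlap_metrics(pred, truth):
    """F1 (Dice), F2 and balanced accuracy for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))     # foreground in both
    tn = int(np.sum(~pred & ~truth))   # background in both
    fp = int(np.sum(pred & ~truth))    # predicted foreground, true background
    fn = int(np.sum(~pred & truth))    # predicted background, true foreground
    f1 = 2 * tp / (2 * tp + fp + fn)
    f2 = 5 * tp / (5 * tp + 4 * fn + fp)   # weights recall more than precision
    tpr = tp / (tp + fn)                   # sensitivity
    tnr = tn / (tn + fp)                   # specificity
    return {"F1": f1, "F2": f2, "BA": 0.5 * (tpr + tnr)}

# Toy flattened masks: TP = 2, FP = 1, FN = 1, TN = 4.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
truth = [1, 1, 0, 1, 0, 0, 0, 0]
scores = overlap_metrics(pred, truth)
```

For 3D segmentations the same function applies unchanged, since the counts are taken over all voxels regardless of array shape.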

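The concordance correlation coefficient is defined as:

$$CCC = \frac{2\rho\,\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$$

where $\mu_x$ and $\mu_y$ are the means and $\sigma_x^2$ and $\sigma_y^2$ are the variances of the two variables, and $\rho$ is the correlation coefficient between them.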
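A small pure-Python sketch of the concordance correlation coefficient is given below, using the standard definition with population (divide-by-n) variances and the identity that the correlation times the two standard deviations equals the covariance. The function name `concordance_cc` and the example volumes are hypothetical.

```python
def concordance_cc(x, y):
    """Concordance correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    # Covariance equals rho * sigma_x * sigma_y.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)

# Hypothetical volumes: a constant offset lowers the CCC even though the
# Pearson correlation between the two series is exactly 1.
truth_volumes = [1.0, 2.0, 3.0]
shifted_volumes = [2.0, 3.0, 4.0]

ccc_identical = concordance_cc(truth_volumes, truth_volumes)
ccc_shifted = concordance_cc(truth_volumes, shifted_volumes)
```

This is why the coefficient suits volume agreement: unlike plain correlation, it penalizes systematic over- or under-estimation.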

5 Experimental Results

5.1 White matter lesion segmentation

The distributions of the balanced accuracy, $F_1$ and $F_2$ scores for the segmentations obtained using UNet and DeepMRSeg are shown in figure 6. DeepMRSeg obtained a significantly better balanced accuracy and $F_2$ score. The mean Dice ($F_1$) score for both methods was similar, with no significant differences. DeepMRSeg also obtained a significantly higher $CCC$ score (Table 1).

Figure 6: Distribution of scores for the 3 evaluation metrics for segmentation of WML using UNet and DeepMRSeg models

| ROI | BA (UNet) | BA (DeepMRSeg) | F1 (UNet) | F1 (DeepMRSeg) | F2 (UNet) | F2 (DeepMRSeg) | CCC (UNet) | CCC (DeepMRSeg) |
|---|---|---|---|---|---|---|---|---|
| WML | 0.876 (0.06) | 0.889 (0.06) | 0.768 (0.10) | 0.765 (0.10) | 0.759 (0.10) | 0.769 (0.10) | 0.956 | 0.962 |

Table 1: Scores for the 4 evaluation metrics for segmentation of WML using UNet and DeepMRSeg models. For the three overlap metrics, $BA$, $F_1$ and $F_2$, we report mean and standard deviation across all subjects.

5.2 Mid-brain segmentation

UNet and DeepMRSeg networks were applied for segmenting each scan into 14 target ROIs. We calculated evaluation scores for each ROI independently. The distributions of overlap scores over all ROIs and subjects are shown in figure 7. DeepMRSeg obtained a significantly higher balanced accuracy for each ROI independently, as well as on average across all ROIs. DeepMRSeg also obtained a higher $CCC$ for all ROIs except left caudate and left and right pallidum (table 2).

Figure 7: Distribution of scores for the 3 evaluation metrics for segmentation of deep brain structures using UNet and DeepMRSeg models

| ROI | BA (UNet) | BA (DeepMRSeg) | F1 (UNet) | F1 (DeepMRSeg) | F2 (UNet) | F2 (DeepMRSeg) | CCC (UNet) | CCC (DeepMRSeg) |
|---|---|---|---|---|---|---|---|---|
| Right Accumbens Area | 0.867 (0.05) | 0.894 (0.05) | 0.762 (0.05) | 0.765 (0.07) | 0.743 (0.07) | 0.777 (0.09) | 0.682 | 0.764 |
| Left Accumbens Area | 0.855 (0.05) | 0.896 (0.05) | 0.755 (0.06) | 0.775 (0.06) | 0.725 (0.08) | 0.784 (0.08) | 0.565 | 0.808 |
| Right Amygdala | 0.845 (0.05) | 0.88 (0.03) | 0.751 (0.06) | 0.782 (0.04) | 0.712 (0.08) | 0.767 (0.05) | 0.266 | 0.466 |
| Left Amygdala | 0.855 (0.04) | 0.889 (0.03) | 0.764 (0.05) | 0.798 (0.04) | 0.73 (0.07) | 0.785 (0.05) | 0.349 | 0.592 |
| Right Caudate | 0.928 (0.04) | 0.945 (0.05) | 0.883 (0.06) | 0.893 (0.07) | 0.866 (0.08) | 0.89 (0.08) | 0.802 | 0.854 |
| Left Caudate | 0.943 (0.02) | 0.948 (0.03) | 0.891 (0.04) | 0.897 (0.05) | 0.887 (0.04) | 0.896 (0.05) | 0.912 | 0.900 |
| Right Hippocampus | 0.892 (0.03) | 0.924 (0.03) | 0.833 (0.03) | 0.858 (0.03) | 0.802 (0.05) | 0.852 (0.04) | 0.425 | 0.736 |
| Left Hippocampus | 0.893 (0.03) | 0.921 (0.03) | 0.832 (0.03) | 0.856 (0.03) | 0.803 (0.05) | 0.847 (0.04) | 0.396 | 0.662 |
| Right Pallidum | 0.916 (0.04) | 0.952 (0.03) | 0.854 (0.05) | 0.859 (0.04) | 0.84 (0.06) | 0.886 (0.04) | 0.778 | 0.687 |
| Left Pallidum | 0.927 (0.05) | 0.955 (0.03) | 0.857 (0.06) | 0.856 (0.04) | 0.854 (0.08) | 0.886 (0.04) | 0.674 | 0.473 |
| Right Putamen | 0.941 (0.02) | 0.953 (0.02) | 0.901 (0.03) | 0.908 (0.03) | 0.889 (0.04) | 0.906 (0.04) | 0.863 | 0.899 |
| Left Putamen | 0.939 (0.03) | 0.955 (0.03) | 0.899 (0.04) | 0.907 (0.04) | 0.886 (0.05) | 0.908 (0.05) | 0.815 | 0.894 |
| Right Thalamus Proper | 0.936 (0.02) | 0.959 (0.02) | 0.9 (0.02) | 0.914 (0.01) | 0.883 (0.04) | 0.916 (0.02) | 0.752 | 0.906 |
| Left Thalamus Proper | 0.946 (0.02) | 0.954 (0.02) | 0.906 (0.01) | 0.912 (0.01) | 0.898 (0.03) | 0.91 (0.02) | 0.849 | 0.888 |
| Average | 0.906 (0.02) | 0.93 (0.02) | 0.842 (0.03) | 0.856 (0.03) | 0.823 (0.04) | 0.858 (0.04) | 0.987 | 0.993 |

Table 2: Scores for the 4 evaluation metrics for segmentation of deep brain structures using UNet and DeepMRSeg models. For the three overlap metrics, $BA$, $F_1$ and $F_2$, we report mean and standard deviation across all subjects.

5.3 Hippocampus segmentation

The hippocampus was segmented into two sub-regions using UNet and DeepMRSeg. We calculated overlap scores for each sub-region independently. DeepMRSeg obtained a significantly better accuracy for both hippocampus sub-regions (Figure 8 and table 3).

Figure 8: Distribution of scores for the 3 evaluation metrics for segmentation of hippocampus sub-regions using UNet and DeepMRSeg models

| ROI | BA (UNet) | BA (DeepMRSeg) | F1 (UNet) | F1 (DeepMRSeg) | F2 (UNet) | F2 (DeepMRSeg) | CCC (UNet) | CCC (DeepMRSeg) |
|---|---|---|---|---|---|---|---|---|
| Anterior | 0.917 (0.03) | 0.925 (0.03) | 0.866 (0.04) | 0.869 (0.04) | 0.848 (0.06) | 0.858 (0.05) | 0.765 | 0.800 |
| Posterior | 0.908 (0.03) | 0.920 (0.03) | 0.849 (0.05) | 0.858 (0.04) | 0.830 (0.06) | 0.847 (0.05) | 0.624 | 0.734 |
| Average | 0.913 (0.03) | 0.922 (0.02) | 0.857 (0.04) | 0.862 (0.04) | 0.839 (0.05) | 0.852 (0.04) | 0.726 | 0.786 |

Table 3: Scores for the 4 evaluation metrics for segmentation of hippocampus sub-regions using UNet and DeepMRSeg models. For the three overlap metrics, $BA$, $F_1$ and $F_2$, we report mean and standard deviation across all subjects.

6 Conclusions

We presented a novel deep learning based MRI segmentation method that combines elements from recent major advances in the field. The proposed network architecture was built with two main motivations: providing a generic tool that can be used for different types of segmentation problems, rather than being specific to a single type of target label; and allowing a wide range of users to easily segment their images by directly using their raw T1 scans without need for any pre-processing steps. In our validation experiments, we showed that DeepMRSeg outperformed a standard UNet implementation used as benchmark in three different segmentation tasks, achieving highly accurate segmentations in all tasks. We provide code and pre-trained models that can be used for applying DeepMRSeg on new scans.