Multiple sclerosis (MS) is a chronic, autoimmune and demyelinating disease of the central nervous system causing lesions in the brain tissues, especially in white matter (WM) (Steinman, 1996). Nowadays, magnetic resonance imaging (MRI) scans are the most common solution to visualize these kind of abnormalities owing to their sensitivity to detect WM damage (Compston and Coles, 2008).
Precise segmentation of MS lesions is an important task for understanding and characterizing the progression of the disease (Rolak, 2003). To this aim, both manual and automated methods are used to compute the total number of lesions and total lesion volume. Although manual segmentation is considered the gold standard (Simon et al., 2006), this method is a challenging task as delineation of 3-dimensional (3D) information from MRI modalities is time-consuming, tedious and prone to intra- and inter-observer variability (Sweeney et al., 2013)
. This motivates machine learning (ML) experts to develop automated lesion segmentation techniques, which can be orders of magnitude faster and immune to expert bias.
Among automated methods, supervised ML algorithms can learn from previously labeled train data and provide high performance in MS lesion segmentation. More specifically, traditional supervised ML methods rely on hand-crafted or low-level features. For instance, Cabezas et al. (2014)
exploited a set of features, including intensity channels (fluid-attenuated inversion-recovery (FLAIR), proton density-weighted (PDw), T1-weighted (T1w), and T2-weighted (T2w)), probabilistic tissue atlases (WM, grey matter (GM), and cerebrospinal fluid (CSF)), a map of outliers with respect to these atlases(Schmidt et al., 2012), and a set of low-level contextual features. A Gentleboost algorithm (Friedman et al., 2000) was then used with these features to segment multiple sclerosis lesions through a voxel by voxel classification.
During the last decade, deep learning methods, especially convolutional neural networks (CNNs)(LeCun et al., 1998), have demonstrated outstanding performance in biomedical image analysis. Unlike traditional supervised ML algorithms, these methods can learn by themselves how to design features directly from data during the training procedure (LeCun et al., 2015)
. They provided state-of-the-art results in different problems such as segmentation of neuronal structures(Ronneberger et al., 2015), retinal blood vessel extraction (Liskowski and Krawiec, 2016), cell classification (Han et al., 2016), brain extraction (Kleesiek et al., 2016), brain tumor (Havaei et al., 2017), tissue (Moeskops et al., 2016), and MS lesion segmentation (Valverde et al., 2017).
In particular, CNN-based biomedical image segmentation methods can be categorized into two different groups: patch-based and image-based methods. In patch-based methods, a moving window scans the image generating a local representation for each pixel/voxel. Then, a CNN is trained using all extracted patches, classifying the central pixel/voxel of each patch as a healthy or unhealthy region. These methods are frequently used in biomedical image analysis since they considerably increase the amount of training samples. However, they suffer of an increased training time due to repeated computations over the overlapping features of the sliding window. Moreover, they neglect the information over the global structure because of the small size of patches(Tseng et al., 2017). On the contrary, image-based approaches process the entire image exploiting the global structure information (Tseng et al., 2017; Brosch et al., 2016). These methods can be further categorized into two groups according to the processing of the data: slice-based segmentation of 3D data (Tseng et al., 2017) and 3D-based segmentation (Brosch et al., 2016). In slice-based segmentation methods, each 3D image is converted to its 2D slices which are then processed individually. Subsequently, the segmented slices are concatenated together to reconstruct the 3D volume. However, in almost all proposed pipelines based on this approach, the segmentation is not accurate, most likely because the method ignores part of the contextual information (Tseng et al., 2017). In 3D-based segmentation, a CNN with 3D kernels is used for extracting meaningful information directly from the original 3D image. The main significant disadvantages of these methods are related to the training procedure which needs a large number of parameters and has a high risk of overfitting in the presence of small datasets. Unfortunately, this is a quite common situation in biomedical applications (Brosch et al., 2016).
1.1 Related works
The literature offers some methods based on CNNs for MS lesion segmentation. For example, Vaidya et al. (2015) proposed a shallow 3D patch-based CNN using the idea of sparse convolution (Li et al., 2014) for effective training. Moreover, they added a post-processing stage which increased the segmentation performance by applying a WM mask to the output predictions. Ghafoorian and Platel (2015) developed a deep CNN based on 2D patches in order to increase the number of the training samples and avoid the overfitting problems of 3D-based approaches. Similarly, in (Birenbaum and Greenspan, 2016), multiple 2D patch-based CNNs have been designed to take advantage of the common information within longitudinal data. Valverde et al. (2017) proposed a pipeline relying on a cascade of two 3D patch-based CNNs. They trained the first network using extracted patches and the second network was used to refine the training procedure utilizing misclassified samples from the first network. Roy et al. (2018) proposed a 2D patch-based CNN including two pathways. They used different MRI modalities as input for each pathway and the outputs were concatenated to create a membership function for lesions. Recently, Hashemi et al. (2018)
proposed a method relying on a 3D patch-based CNN using the idea of a densely connected network. They also developed an asymmetric loss function for dealing with highly unbalanced data. Despite the fact that all the proposed patch-based techniques have good segmentation performance, they suffer from lacking global structural information. This means that global structure of the brain and the absolute location of lesions are not exploited during the segmentation.Brosch et al. (2016) developed a whole-brain segmentation method using a 3D CNN. They used single shortcut connection between the coarsest and the finest layers of the network, which enables the network to concatenate the features from the deepest layer to the shallowest layer in order to learn information about the structure and organization of MS lesions. However, they did not exploit middle-level features which have been shown to have a considerable impact on the segmentation performance (Ronneberger et al., 2015)
In this paper, we propose a novel deep learning architecture for automatic MS lesion segmentation consisting of a multi-branch 2D convolutional encoder-decoder network. In this study, we concentrated on (whole) slice-based segmentation in order to prevent both the overfitting present in 3D-based segmentation (Brosch et al., 2016) and the lack of global structure information in patch-based methods (Birenbaum and Greenspan, 2016; Valverde et al., 2017; Roy et al., 2018). We designed an end-to-end encoder-decoder network including a multi-branch downsampling path as the encoder, multi-scale feature fusion and multi-scale upsampling blocks as the decoder. In the encoder, each branch is assigned to a specific MRI modality in order to take advantage of each modality individually. During the decoding stage of the network, different scales of the encoded attributes related to each modality, from the coarsest to the finest, including the middle-level attributes, are combined together and upconvolved gradually to get fine details (more contextual information) of the lesion shape. Moreover, we used three different (orthogonal) planes for each 3D modality as an input to the network to better exploit the contextual information in all directions. In summary, the main contributions in this work are:
A (whole) slice-based approach to exploit the overall structural information, combined with a multi-plane strategy to take advantage of full contextual information.
A multi-level feature fusion and upsampling approach to exploit contextual information at multiple scales.
The evaluation of different versions of the proposed model so as to find the most performant combination of MRI modalities for MS lesion segmentation.
The demonstration of top performance on two different datasets.
In order to evaluate the performance of the proposed method for MS lesion segmentation, two different datasets were used: one publicly available, the ISBI 2015 Longitudinal MS Lesion Segmentation Challenge (Carass et al., 2017) (denoted as the ISBI dataset), and one from the neuroimaging research unit (NRU) in Milan (denoted as the NRU dataset).
2.1 ISBI 2015 Longitudinal MS Lesion Segmentation Challenge
The ISBI dataset included 19 subjects divided into two sets, 5 subjects in the train set and 14 subjects in the test set. Each subject had different time-points, ranging from 4 to 6. For each time-point, T1w, T2w, PDw, and FLAIR image modalities were provided. The volumes were composed of 182 slices with FOV=182256 and 1-millimeter cubic voxel resolution. All images available were already segmented manually by two different raters, therefore representing two ground truth lesion masks. For all 5 training images, lesion masks were made publicly available. For the remaining 14 subjects in the test set, there was no publicly available ground truth. The performance evaluation of the proposed method over the test dataset was done through an online service by submitting the binary masks to the challenge111http://iacl.ece.jhu.edu/index.php/MSChallenge website (Carass et al., 2017).
2.2 Neuroimaging Research Unit
The NRU dataset was collected by a research team from Ospedale San Raffaele, Milan, Italy.
The dataset consisted in 37 MS patients acquired on a 3.0 Tesla Philips Ingenia CX scanner (Philips Medical Systems) with standardized procedures for subjects positioning. The following sequences were collected: Sagittal 3D FLAIR sequence, FOV=256256, pixel size=11 mm, 192 slices, 1 mm thick; Sagittal 3D T2w turbo spin echo (TSE) sequence, FOV=256256, pixel size=11 mm, 192 slices, 1 mm thick; Sagittal 3D high resolution T1w, FOV=256256, pixel size=11 mm, 204 slices, 1 mm thick.
3.1 Data Preprocessing
From the ISBI dataset, we selected the preprocessed version of the images available online at the challenge website. All images were already skull-stripped using Brain Extraction Tool (BET) (Smith, 2002), rigidly registered to 1 MNI-ICBM152 template (Oishi et al., 2008) and N3 intensity normalized (Sled et al., 1998).
In the NRU dataset, all sagittal acquisitions were reoriented in axial plane and the exceeding portion of the neck was removed. T1w and T2w sequences were aligned on FLAIR MRI using FMRIB’s Linear Image Registration tool (FLIRT) (Jenkinson and Smith, 2001; Jenkinson et al., 2002) and brain tissues were separated from non-brain tissues on FLAIR sequences using BET (Smith, 2002)
. The estimated brain mask on FLAIR MRI was then used on both registered T1w and T2w images to extract brain tissues. Then, on this dataset, hyperintense MS lesions were manually identified and segmented by an expert neuroradiologist. Finally, all images were rigidly registered to a 1MNI-ICBM152 template (Oishi et al., 2008) to obtain volumes of size () and then N3 intensity normalized (Sled et al., 1998).
3.2 Network Architecture
In this work, we propose a 2D end-to-end convolutional network based on the residual network (ResNet) (He et al., 2016)
. The core idea of ResNet is the use of identity shortcut connections which have benefits such as preventing gradient vanishing and reducing computational complexity. Recently, ResNets have shown outstanding performance in computer vision problems, specifically in image recognition task(He et al., 2016). We modified ResNet50 (version with 50 layers) to work as a pixel-level segmentation network. This has been obtained by changing the last prediction layer with a dense pixel-level prediction layer inspired by the idea of the fully convolutional network (FCN) (Long et al., 2015). To exploit the MRI multi-modality analysis, we built a pipeline of parallel ResNets without weights sharing. Moreover, a multi-modal feature fusion block (MMFF) and a multi-scale feature upsampling block (MSFU) were proposed to combine and upsample features from different modalities and different resolutions, respectively.
In the following subsections, we first describe how the input features were generated by decomposing 3D data into 2D images. Then, we describe the proposed network architecture in details and the related training procedure. Finally, we introduce the multi-plane reconstruction block, which defines how we combined the 2D binary slices of the network output to match the original 3D data.
3.2.1 Input Features Preparation
For each MRI modality volume, three different plane orientations (axial, coronal and sagittal) were considered for generating 2D slices along x, y, and z axes. Since the size of each slice depended on orientation (axial=, coronal=, sagittal=), each of them was zero-padded (while centering the brain) to get the same size () for each plane orientation. This procedure was applied to all three modalities. Figure 1 illustrates the described procedure using FLAIR, T1w, and T2w modalities.
3.2.2 Network Architecture Details
Our proposed model essentially integrates multiple ResNets with other blocks to handle multi-modality and multi-resolution approaches, respectively. As can be seen in figure 2, the proposed network includes three main parts: downsampling networks, multi-modal feature fusion using MMFF blocks, and multi-scale upsampling using MSFU blocks. In the downsampling stage, multiple parallel ResNets (without weights sharing) were used for extracting multi-resolution features, with each ResNet associated to one specific modality (in our experiments, we used FLAIR, T1w, T2w). To take advantage of the multi-resolution approach, each ResNet was divided into five blocks according to the resolution of the feature maps (, , , , and ). Then, features with the same resolution from different modalities were combined using MMFF blocks as illustrated in figure 3(a). Each MMFF block included convolutions to reduce the number of feature maps (halving them), followed by convolutions for adaptation. Then, a simple concatenation layer was used to combine the features from different modalities. In the upsampling stage, MSFU blocks fused the multi-resolution representations and gradually upsized them back to the original resolution of the input image. Figure 3(b) illustrates the proposed MSFU block consisting of a convolutional layer to reduce the number of feature maps (halving them) and an upconvolutional layer with
kernel size and a stride of 2, transforming low-resolution feature maps to higher resolution maps. Then, a concatenation layer was used to combine the two sets of feature maps, followed by aconvolutional layer to reduce the number of feature maps (halving them) and a
convolutional layer for adaptation. After the last MSFU block, a soft-max layer of size 2 was used to get the output probability maps of the lesions. Probabilistic outputs were then thresholded (0.5) to generate binary classification for each pixel (lesion vs. non-lesion). It is important to mention that in all proposed blocks before each convolutional and upconvolutional layer, we used a batch normalization layer(Ioffe and Szegedy, 2015)2010). Size and number of feature maps in the input and output of all convolutional layers were kept same.
3.2.3 Implementation Details
The proposed model was implemented in Python language222https://www.python.org
using Keras333https://keras.io (Chollet et al., 2015)
with Tensorflow444https://www.tensorflow.org (Abadi et al., 2015) backend. All experiments were done on a Nvidia GTX Titan X GPU. Our multi-branch slice-based network was trained end-to-end. In order to train the proposed CNN, we created a train set using the pipeline introduced in the subsection 3.2.1. Then, a train subset was determined by selecting only slices with at least one lesion pixel to limit extremely unbalanced data and omit uninformative samples. According to the size of lesion volume, the total number of selected slices for training ranged from 1500 to 2000 for each subject. To optimize the network weights and early stopping criterion, the created train set was divided into train and validation subsets depending on the experiments described in the following section. We trained our network using Adam optimizer (Kingma and Ba, 2014)
with an initial learning rate of 0.0001 and mini-batches of size 15. The maximum number of training epochs was fixed to 1000 for all experiments and the training computation time was approximately 36 hours. The best model was selected according to the validation set. With respect to the network initialization, in the downsampling branches, we used ResNet50 pre-trained on ImageNet and all other blocks (MMFFs and MSFUs) were randomly initialized from a Gaussian distribution. It is worth noticing that we did not use parameter sharing in parallel ResNets. The dice loss function (DL) was used to train the proposed network(Milletari et al., 2016), and was defined as:
where and are the predicted and ground truth binary volumes, respectively.
3.2.4 3D Binary Image Reconstruction
Output binary slices of the network were concatenated to form a 3D volume matching the original data. In order to reconstruct the 3D image from the output binary 2D slices, we proposed a multi-planes reconstruction (MPR) block. Feeding each 2D slice to the network, we got as output the associated 2D binary lesion classification map. Since each original modality was duplicated three times in the input, once for each slice orientation (coronal, axial, sagittal), concatenating the binary lesion maps belonging to the same orientation resulted in three 3D lesion classification maps. To obtain a single lesion segmentation volume, these three lesion maps were combined via majority voting (the most frequent lesion classification was selected) as illustrated in figure 4.
4.1 Evaluation Metrics
The following measures were calculated for evaluation purposes and comparison with other state-of-the-art methods.
Dice Similarity Coefficient:
Where , and indicate the true positive, false negative and false positive voxels, respectively.
Lesion-wise True Positive Rate:
Where denotes the number of lesions in the reference segmentation that overlap with a lesion in the output segmentation (at least one voxel overlap), and is the total number of lesions in the reference segmentation.
Lesion-wise False Positive Rate:
Where denotes the number of lesions in the output segmentation that do not overlap with a lesion in the reference segmentation and is the total number of lesions in the produced segmentation.
As described in (Carass et al., 2017), the ISBI challenge website provides a report on the submitted test set including some measures such as:
Positive Prediction Value:
Absolute Volume Difference:
Where and reveal the total number of the segmented lesion voxels in the output and manual annotations masks, respectively.
ISBI test set overall evaluation score (Carass et al., 2017):
Where is the set of all subjects, is the set of all raters and is the pearson’s correlation coefficient of the volumes.
4.2 Experiments on the ISBI Dataset
To evaluate the performance of the proposed method on the ISBI dataset, two different experiments were performed according to the availability of the ground truth.
Since the ground truth was available only for the train set, in the first experiment, we ignored the official ISBI test set. We only considered data with available ground truth (train set with 5 subjects) as mentioned in (Brosch et al., 2016). To get a fair result, we tested our approach with a nested leave-one-subject-out cross-validation (3 subjects for training, 1 subject for validation and 1 subject for testing). please refer to appendix for more details.
In the second experiment, the performance of the proposed method was evaluated on the official ISBI test set (with 14 subjects), for which the ground truth was not available, using the challenge web service. We trained our model doing a leave-one-subject-out cross-validation on the whole train set with 5 subjects (4 subjects for training and 1 subject for validation). We executed the ensemble of 5 trained models on the official ISBI test set and the final prediction was generated with a majority voting over the ensemble. The 3D output binary lesion maps were then submitted to the challenge website for evaluation.
4.3 Experiment on the NRU Dataset
To test the robustness of the proposed model, we performed two experiments using the NRU dataset including 37 subjects.
In the first experiment, we implemented a nested 4-fold cross-validation over the whole dataset (21 subjects for training, 7 subjects for validation and 9 subjects for testing). please refer to appendix for more details. For comparison purpose, we tested three different publicly available MS lesion segmentation software. We tested OASIS (Automated Statistic Inference for Segmentation) (Sweeney et al., 2013), TODAS (Topology reserving Anatomy Driven Segmentation) (Shiee et al., 2010), and LST (Lesion Segmentation Toolbox)(Schmidt et al., 2012). For OASIS, there was just a thresholding parameter to set, which was optimized to obtain the best DSC. FLAIR, T1w, and T2w modalities were used in this method. TODAS is a software free from parameter tuning and it only required FLAIR and T1w modalities for segmentation. LST included a single thresholding parameter which initialized lesion segmentation. This parameter was also optimized to get the best DSC in this experiment. In this method FLAIR and T1w were used for segmentation.
In the second experiment, to investigate the importance of each single modality in MS lesion segmentation, we tested our model with various combination of modalities. This means that the model was adapted accordingly in the number of parallel branches in the downsampling network. In this experiment, we randomly split the corresponding dataset into fixed train (21 subjects), validation (7 subjects) and test (9 subjects) sets.
Single-branch (SB): In a single-branch version of the proposed model, we used a single ResNet as the downsampling part of the network. Attributes from different levels of the single-branch were supplied to the MMFF blocks. In this version of our model, each MMFF block had single input since there was only one downsampling branch. Therefore, MMFF blocks include a convolutional layer followed by a convolutional layer. We trained and tested the single-branch version of our proposed network with each modality separately.
Multi-branch (MB): The multi-branch version of the proposed model used multiple parallel ResNets in the downsampling network without weights sharing. In this experiment, we used two-branch and three-branch versions, which were trained and tested using two modalities and three modalities, respectively. We trained and tested the mentioned models with all possible combination of modalities (two-branches: [FLAIR, T1w], [FLAIR, T2w], [T1w, T2w], three-branches: [FLAIR, T1w, T2w]).
|Method||Rater 1||Rater 2|
|Maier and Handels (2015) (GT1)||0.7000||0.5333||0.4888||0.6555||0.3777||0.4444|
|Maier and Handels (2015) (GT2)||0.7000||0.5555||0.4888||0.6555||0.3888||0.4333|
|Brosch et al. (2016) (GT1)||0.6844||0.7455||0.5455||0.6444||0.6333||0.5288|
|Brosch et al. (2016) (GT2)||0.6833||0.7833||0.6455||0.6588||0.6933||0.6199|
|Aslani et al. (2018) (GT1)||0.6980||0.7460||0.4820||0.6510||0.6410||0.4506|
|Aslani et al. (2018) (GT2)||0.6940||0.7840||0.4970||0.6640||0.6950||0.4420|
|Hashemi et al. (2018)||92.48||0.5841||0.9207||0.4135||0.0866||0.4972|
|Andermatt et al. (2017)||92.07||0.6298||0.8446||0.2013||0.4870||0.4045|
|Valverde et al. (2017)||91.33||0.6304||0.7866||0.3669||0.1529||0.3384|
|Maier and Handels (2015)||90.28||0.6050||0.7746||0.2657||0.3672||0.3653|
|Birenbaum and Greenspan (2016)||90.07||0.6271||0.7889||0.4975||0.5678||0.3522|
|Aslani et al. (2018)||89.85||0.5850||0.6351||0.5994||0.4500||0.4540|
|Deshpande et al. (2015)||89.81||o.5960||0.6740||0.3383||0.3661||0.1306|
|Jain et al. (2015)||88.74||0.5560||0.7300||0.3742||0.3225||0.3748|
|Sudre et al. (2015)||88.00||0.5511||0.6276||0.3862||0.2870||0.3058|
|Tomas-Fernandez and Warfield (2015)||87.01||0.4317||0.6973||0.4115||0.2101||0.5109|
|Ghafoorian et al. (2017)||86.92||0.5009||0.5491||0.5765||0.4288||0.5707|
5.1 ISBI Dataset
In the first experiment, we evaluated our model using three measures: DSC, LTPR, and LFPR to make our results comparable to those obtained in (Brosch et al., 2016; Maier and Handels, 2015; Aslani et al., 2018). Table 1 summarizes the results of the first experiment when comparing our model with previously proposed methods. The table shows the mean DSC, LTPR, and LFPR. As shown, our method outperformed the other methods regarding DSC and LFPR, while the highest LTPR was achieved by our recently published method (Aslani et al., 2018). Figures 5 shows the segmentation outputs of the proposed method for subject 2 (with high lesion load) and subject 3 (with low lesion load) compared to both ground truth annotations (rater 1 and rater 2).
For the second experiment, the official ISBI test set was used. In this experiment, all 3D binary output masks on the test set were submitted to the ISBI website. Several measures were calculated online by the challenge website. Table 2 shows the results on all measures reported as a mean across raters. At the time of the submission, our method had an overall evaluation score of 92.12 on the official ISBI challenge web service555http://iacl.ece.jhu.edu/index.php/MSChallenge being amongst the top-ranked methods with published papers or technical reports.
5.2 NRU Dataset
Table 3 shows the results of the first and second experiments by measuring the mean values of DSC, LFPR, LTPR, PPV, and VD. It illustrates the comparison of our method with the other methods. As shown in the table, our method achieved the best results with respect to DSC, PPV and VD measures while showing a good trade-off between LTPR and LFPR, comparable to the best results of the other methods.
illustrates boxplots of the DSC, LTPR, LFPR, and VD evaluation metrics obtained from the different methods. Confirming the results of table3, this figure shows that our model has the highest median value of DSC (0.7) and the lowest median value of VD (0.5) while providing the best trade-off between LTPR and LFPR compared with the other methods. The output segmentation of all methods applied to a random subject (with medium lesion load) can be seen with different plane orientations on figure 7.
Figure 8 depicts the relationship between the ground truth and the estimated lesion volumes for each of the evaluated methods. It can be seen that TODAS and OASIS methods tend to overestimate lesion volumes, while, LST method tends to underestimate the lesion volumes. Among the other methods, our model produced slope with the highest value () between estimated and ground truth lesion volumes close to unity. Moreover, it provided the best pearson correlation (0.75) than all other methods.
Finally, table 4 shows the performance of the proposed model with respect to different combinations of modalities.
The SB version of the proposed model trained with FLAIR modality has noticeably better performance regarding the DSC, PPV, LTPR, and LFPR measures compared with the other modalities. It is also important to notice that the SB network trained with the T2w modality showed considerably low value for VD measure.
In MB versions of the model, all possible two-branch and three-branch versions were considered. As seen from table 4, two-branch versions including FLAIR modality have increased performance with respect to all measures. This emphasizes the importance of using FLAIR modality together with among others (T1w and T2w). However, overall, a combination of all modalities in the three-branch version of the model demonstrated outstanding performance compared to the other versions of the network with different subsets of modalities. This version of the model with three input modalities showed the best performance for DSC (with 4% improvement), PPV (with 7.5% improvement), LFPR (with 0.8% improvement), and VD (with 0.01% improvement) measures.
|TODAS Shiee et al. (2010)||0.5241||0.5965||0.4608||0.6277||0.4659|
|LST Schmidt et al. (2012)||0.4905||0.8004||0.1361||0.0097||0.5119|
|OASIS Sweeney et al. (2013)||0.4193||0.3483||0.3755||0.4143||2.0588|
|Method||Set of Modalities||DSC||PPV||LTPR||LFPR||VD|
|FLAIR, T1w, T2w||0.7067||0.6844||0.6136||0.1284||0.1488|
6 Discussion and Conclusions
In this work, we have designed an automated pipeline for the MRI MS lesion segmentation. The proposed model is a deep end-to-end 2D CNN consisting of a multi-branch downsampling network, MSFF blocks fusing the features from different modalities at different stages of the network, and MSFU blocks combining and upsampling multi-scale features.
When having insufficient training data in deep learning based approaches, which is very common in the medical domain, transfer learning has demonstrated to be a good solution(Chen et al., 2015, 2016; Hoo-Chang et al., 2016). It not only helps boosting the performance of the network but also significantly reduces overfitting. Therefore, we used the parallel ResNet50s pre-trained on ImageNet as a multi-branch downsampling network while the other layers in MMFF and MSFU blocks were randomly initialized from a gaussian distribution. We fine-tuned the whole network on the given MS lesion segmentation task.
In brain image segmentation using deep networks, a combination of MRI modalities overcomes the limitations of single modality approaches, and allows the networks to provide more accurate segmentation (Kleesiek et al., 2016; Moeskops et al., 2016; Aslani et al., 2018). Unlike previously proposed networks (Brosch et al., 2016; Aslani et al., 2018), which stacked all modalities together as a single input, we designed a network with several downsampling branches, with a separate branch for each individual modality. We believe that stacking all modalities together as a single input to a network is not an optimal solution since during the downsampling procedure, information related to the most informative modality can vanish. The multi-branch approach allows the network to have weights and biases in each downsampling branch to be specifically optimized for each modality. Results in table 4 confirm our claim that a network with separate branches shows more accurate segmentation (DSC=0.7649) than networks with a single branch (stacking all modalities together) as that proposed by Brosch et al. (2016) (DSC=0.6844) and Aslani et al. (2018) (DSC=0.6980). The mentioned methods showed higher LTPR values (0.70). However, LFPR values are much higher in these methods (0.48) due to an overestimation of lesion volumes. Our proposed method showed the best trade-off between LTPR () and LFPR (0.2022) while having the highest DSC value ().
When examining the influence of different modalities, results in table 4 demonstrates that the most important modality for MS lesion segmentation is FLAIR sequence (DSC0.65). This is likely due to the fact that FLAIR sequences benefit from CSF signal suppression and hence a higher image contrast between MS lesions and the surrounding normal appearing WM. However, using all modalities together in a multi-branch downsampling network showed outstanding segmentation performance (DSC=0.7067). The combination with other modalities, such as with T1w, could help the algorithm in identifying additional information regarding the location of lesions or artifacts.
In deep CNNs, attributes from different layers include different information for example with coarse layers relating to high-level semantic information (category specific), and shallow layers to low-level spatial information (appearance specific) (Long et al., 2015), while information from middle layer attributes have shown a significant impact on segmentation performance (Ronneberger et al., 2015). Combining these multi-level attributes from the different stage of the network makes the attributes richer than using single-level attributes. Unlike previously CNN based method proposed by Brosch et al. (2016), where a single shortcut connection between the deepest and the shallowest layers is used, our model includes several shortcut connections between all layers of the network to combine multi-scale features from different stages of the network as inspired by U-Net architecture (Ronneberger et al., 2015). The results shown in table 1 suggest that the combination of multi-level features during the upsampling procedure helps the network exploiting more contextual information associated to the lesions. This could explain that the performance of our proposed model (DSC=0.7649) is higher than the method proposed by Brosch et al. (2016) (DSC=0.6844).
Patch-based CNNs suffer from lacking spatial information about the lesions because of the patch size limitation. To deal with this problem, we proposed a (whole) slice-based approach. Compared with patch-based methods (Valverde et al., 2017; Ghafoorian et al., 2017), our experiments have shown that our model has better performance with respect to almost all measures as can be clearly seen in table 2. Although the CNN proposed by Valverde et al. (2017) has the highest DSC value among all, our method showed better performance regarding the LTPR and LFPR, which indicates that our model is robust in identifying the correct lesion locations. The patch-based CNN proposed by Ghafoorian et al. (2017) has been optimized to have the highest LTPR. However, their method showed significantly lower performance in LFPR and DSC. Compared with this method, our model has better overlap between segmented and ground truth lesions (DSC=0.6114) with 11% improvement.
The proposed method also has some limitations. We observed that the proposed pipeline is slightly slow in segmenting each 3D input image since it is based on segmenting whole-slices which needs more memory and takes a longer time compared to other CNN-based approaches (Roy et al., 2018). The required time for segmenting an input 3D image is estimated by three sequential steps: input features preparation 3.2.1, slice-level segmentation 3.2.2, and 3D image reconstruction 3.2.4. The calculating time depends on the size of the input image and the type of the GPU used by the system. In both the ISBI and NRU datasets, the average time for segmenting an input image, including all 3 steps, was determined to be around 90 seconds.
As a future work, we aim to design a multi-task network for segmenting different parts of brain including different tissue types (WM, GM, CSF) and different types of MS lesions (including cortical lesions). Since MS lesions are most visible in WM, we believe that introducing information from the tissue class could help improve the network identifying cortical and subcortical lesions. Moreover, we plan to use other MRI modalities such as double inversion recovery (DIR) sequences for the identification of cortical lesions which benefits of the signal suppression from both CSF and WM.
We respectfully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
Appendix A. Evaluation Protocols
This appendix includes 3 tables that describe the training procedures in details related to sections 4.2 and 4.3.
Table A.1 and A.2 give detailed information about how we implemented training procedure on the ISBI dataset for the first and second experiments. Table A.3 describes the a nested 4-fold cross-validation training procedure applied on the NRU dataset in the first experiment.
|1,2,3,4||5||ISBI test set|
|1,2,3,5||4||ISBI test set|
|1,2,4,5||3||ISBI test set|
|1,3,4,5||2||ISBI test set|
|2,3,4,5||1||ISBI test set|
|[10-16 @ 24-37]||[17-23]||[1-9]|
|[10-23 @ 31-37]||[24-30]||[1-9]|
|[10-30 @ 31-37]||[31-37]||[1-9]|
|[8-9 @ 19-37]||[1-7]||[10-18]|
|[1-7 @ 24-37]||[8-9 @ 19-23]||[10-18]|
|[1-9 @ 19-23 @ 31-37]||[24-30]||[10-18]|
|[1-9 @ 19-30]||[31-37]||[10-18]|
|[8-18 @ 28-37]||[1-7]||[19-27]|
|[1-7 @ 15-18 @ 27-37]||[8-14]||[19-27]|
|[1-14 @ 31-37]||[15-18 @ 28-30]||[19-27]|
|[1-18 @ 28-30]||[31-37]||[19-27]|
|[1-7 @ 15-27]||[8-14]||[28-37]|
|[1-14 @ 22-27]||[15-21]||[28-37]|
Abadi et al. (2015)
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado,
G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp,
A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M.,
Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C.,
Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P.,
Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P.,
Wattenberg, M., Wicke, M., Yu, Y., Zheng, X., 2015. TensorFlow: Large-scale
machine learning on heterogeneous systems. Software available from
Andermatt et al. (2017)
Andermatt, S., Pezold, S., Cattin, P. C., 2017. Automated segmentation of multiple sclerosis lesions using multi-dimensional gated recurrent units. In: International MICCAI Brainlesion Workshop. Springer, pp. 31–42.
- Aslani et al. (2018) Aslani, S., Michael, D., Vittorio, M., Diego, S., 2018. Deep 2d encoder-decoder convolutional neural network for multiple sclerosis lesion segmentation in brain mri. In: International MICCAI Brainlesion Workshop. Springer.
- Birenbaum and Greenspan (2016) Birenbaum, A., Greenspan, H., 2016. Longitudinal multiple sclerosis lesion segmentation using multi-view convolutional neural networks. In: International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis. Springer, pp. 58–67.
- Brosch et al. (2016) Brosch, T., Tang, L. Y., Yoo, Y., Li, D. K., Traboulsee, A., Tam, R., 2016. Deep 3d convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation. IEEE transactions on medical imaging 35 (5), 1229–1239.
- Cabezas et al. (2014) Cabezas, M., Oliver, A., Valverde, S., Beltran, B., Freixenet, J., Vilanova, J. C., Ramió-Torrentà, L., Rovira, À., Lladó, X., 2014. Boost: A supervised approach for multiple sclerosis lesion segmentation. Journal of neuroscience methods 237, 108–117.
- Carass et al. (2017) Carass, A., Roy, S., Jog, A., Cuzzocreo, J. L., Magrath, E., Gherman, A., Button, J., Nguyen, J., Prados, F., Sudre, C. H., et al., 2017. Longitudinal multiple sclerosis lesion segmentation: Resource and challenge. NeuroImage 148, 77–102.
Chen et al. (2015)
Chen, H., Dou, Q., Ni, D., Cheng, J.-Z., Qin, J., Li, S., Heng, P.-A., 2015. Automatic fetal ultrasound standard plane detection using knowledge transferred recurrent neural networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 507–514.
Chen et al. (2016)
Chen, H., Qi, X., Yu, L., Heng, P.-A., 2016. Dcan: deep contour-aware networks for accurate gland segmentation. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 2487–2496.
- Chollet et al. (2015) Chollet, F., et al., 2015. Keras. https://github.com/fchollet/keras.
- Compston and Coles (2008) Compston, A., Coles, A., 2008. Multiple sclerosis. The Lancet 372 (9648), 1502–1517.
- Deshpande et al. (2015) Deshpande, H., Maurel, P., Barillot, C., 2015. Adaptive dictionary learning for competitive classification of multiple sclerosis lesions. In: Biomedical Imaging (ISBI), 2015 IEEE 12th International Symposium on. IEEE, pp. 136–139.
Friedman et al. (2000)
Friedman, J., Hastie, T., Tibshirani, R., et al., 2000. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics 28 (2), 337–407.
- Ghafoorian et al. (2017) Ghafoorian, M., Karssemeijer, N., Heskes, T., Bergkamp, M., Wissink, J., Obels, J., Keizer, K., de Leeuw, F.-E., van Ginneken, B., Marchiori, E., et al., 2017. Deep multi-scale location-aware 3d convolutional neural networks for automated detection of lacunes of presumed vascular origin. NeuroImage: Clinical 14, 391–399.
- Ghafoorian and Platel (2015) Ghafoorian, M., Platel, B., 2015. Convolutional neural networks for ms lesion segmentation, method description of diag team. Proceedings of the 2015 Longitudinal Multiple Sclerosis Lesion Segmentation Challenge, 1–2.
- Han et al. (2016) Han, X.-H., Lei, J., Chen, Y.-W., 2016. Hep-2 cell classification using k-support spatial pooling in deep cnns. In: International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis. Springer, pp. 3–11.
Hashemi et al. (2018)
Hashemi, S. R., Salehi, S. S. M., Erdogmus, D., Prabhu, S. P., Warfield, S. K.,
Gholipour, A., 2018. Tversky as a loss function for highly unbalanced image
segmentation using 3d fully convolutional deep networks. CoRR abs/1803.11078.
- Havaei et al. (2017) Havaei, M., Davy, A., Warde-Farley, D., Biard, A., Courville, A., Bengio, Y., Pal, C., Jodoin, P.-M., Larochelle, H., 2017. Brain tumor segmentation with deep neural networks. Medical image analysis 35, 18–31.
- He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778.
- Hoo-Chang et al. (2016) Hoo-Chang, S., Roth, H. R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R. M., 2016. Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35 (5), 1285.
- Ioffe and Szegedy (2015) Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Jain et al. (2015) Jain, S., Sima, D. M., Ribbens, A., Cambron, M., Maertens, A., Van Hecke, W., De Mey, J., Barkhof, F., Steenwijk, M. D., Daams, M., et al., 2015. Automatic segmentation and volumetry of multiple sclerosis brain lesions from mr images. NeuroImage: Clinical 8, 367–375.
- Jenkinson et al. (2002) Jenkinson, M., Bannister, P., Brady, M., Smith, S., 2002. Improved optimization for the robust and accurate linear registration and motion correction of brain images. Neuroimage 17 (2), 825–841.
- Jenkinson and Smith (2001) Jenkinson, M., Smith, S., 2001. A global optimisation method for robust affine registration of brain images. Medical image analysis 5 (2), 143–156.
Kingma and Ba (2014)
Kingma, D. P., Ba, J., 2014. Adam: A method for stochastic optimization. CoRR
- Kleesiek et al. (2016) Kleesiek, J., Urban, G., Hubert, A., Schwarz, D., Maier-Hein, K., Bendszus, M., Biller, A., 2016. Deep mri brain extraction: a 3d convolutional neural network for skull stripping. NeuroImage 129, 460–469.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. nature 521 (7553), 436.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), 2278–2324.
- Li et al. (2014) Li, H., Zhao, R., Wang, X., 2014. Highly efficient forward and backward propagation of convolutional neural networks for pixelwise classification. arXiv preprint arXiv:1412.4526.
- Liskowski and Krawiec (2016) Liskowski, P., Krawiec, K., 2016. Segmenting retinal blood vessels with¡? pub _newline?¿ deep neural networks. IEEE transactions on medical imaging 35 (11), 2369–2380.
- Long et al. (2015) Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3431–3440.
Maier and Handels (2015)
Maier, O., Handels, H., 2015. Ms lesion segmentation in mri with random forests. Proc. 2015 Longitudinal Multiple Sclerosis Lesion Segmentation Challenge, 1–2.
- Milletari et al. (2016) Milletari, F., Navab, N., Ahmadi, S.-A., 2016. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, pp. 565–571.
- Moeskops et al. (2016) Moeskops, P., Viergever, M. A., Mendrik, A. M., de Vries, L. S., Benders, M. J., Išgum, I., 2016. Automatic segmentation of mr brain images with a convolutional neural network. IEEE transactions on medical imaging 35 (5), 1252–1261.
Nair and Hinton (2010)
Nair, V., Hinton, G. E., 2010. Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th international conference on machine learning (ICML-10). pp. 807–814.
- Oishi et al. (2008) Oishi, K., Zilles, K., Amunts, K., Faria, A., Jiang, H., Li, X., Akhter, K., Hua, K., Woods, R., Toga, A. W., et al., 2008. Human brain white matter atlas: identification and assignment of common anatomical structures in superficial white matter. Neuroimage 43 (3), 447–457.
- Rolak (2003) Rolak, L. A., 2003. Multiple sclerosis: it is not the disease you thought it was. Clinical Medicine and Research 1 (1), 57–60.
- Ronneberger et al. (2015) Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 234–241.
- Roy et al. (2018) Roy, S., Butman, J. A., Reich, D. S., Calabresi, P. A., Pham, D. L., 2018. Multiple sclerosis lesion segmentation from brain mri via fully convolutional neural networks. arXiv preprint arXiv:1803.09172.
- Schmidt et al. (2012) Schmidt, P., Gaser, C., Arsic, M., Buck, D., Förschler, A., Berthele, A., Hoshi, M., Ilg, R., Schmid, V. J., Zimmer, C., et al., 2012. An automated tool for detection of flair-hyperintense white-matter lesions in multiple sclerosis. Neuroimage 59 (4), 3774–3783.
- Shiee et al. (2010) Shiee, N., Bazin, P.-L., Ozturk, A., Reich, D. S., Calabresi, P. A., Pham, D. L., 2010. A topology-preserving approach to the segmentation of brain images with multiple sclerosis lesions. NeuroImage 49 (2), 1524–1535.
- Simon et al. (2006) Simon, J., Li, D., Traboulsee, A., Coyle, P., Arnold, D., Barkhof, F., Frank, J., Grossman, R., Paty, D., Radue, E., et al., 2006. Standardized mr imaging protocol for multiple sclerosis: Consortium of ms centers consensus guidelines. American Journal of Neuroradiology 27 (2), 455–461.
- Sled et al. (1998) Sled, J. G., Zijdenbos, A. P., Evans, A. C., 1998. A nonparametric method for automatic correction of intensity nonuniformity in mri data. IEEE transactions on medical imaging 17 (1), 87–97.
- Smith (2002) Smith, S. M., 2002. Fast robust automated brain extraction. Human brain mapping 17 (3), 143–155.
- Steinman (1996) Steinman, L., 1996. Multiple sclerosis: a coordinated immunological attack against myelin in the central nervous system. Cell 85 (3), 299–302.
- Sudre et al. (2015) Sudre, C. H., Cardoso, M. J., Bouvy, W. H., Biessels, G. J., Barnes, J., Ourselin, S., 2015. Bayesian model selection for pathological neuroimaging data applied to white matter lesion segmentation. IEEE transactions on medical imaging 34 (10), 2079–2102.
- Sweeney et al. (2013) Sweeney, E. M., Shinohara, R. T., Shiee, N., Mateen, F. J., Chudgar, A. A., Cuzzocreo, J. L., Calabresi, P. A., Pham, D. L., Reich, D. S., Crainiceanu, C. M., 2013. Oasis is automated statistical inference for segmentation, with applications to multiple sclerosis lesion segmentation in mri. NeuroImage: clinical 2, 402–413.
- Tomas-Fernandez and Warfield (2015) Tomas-Fernandez, X., Warfield, S. K., 2015. A model of population and subject (mops) intensities with application to multiple sclerosis lesion segmentation. IEEE transactions on medical imaging 34 (6), 1349–1361.
- Tseng et al. (2017) Tseng, K.-L., Lin, Y.-L., Hsu, W., Huang, C.-Y., 2017. Joint sequence learning and cross-modality convolution for 3d biomedical segmentation. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, pp. 3739–3746.
- Vaidya et al. (2015) Vaidya, S., Chunduru, A., Muthuganapathy, R., Krishnamurthi, G., 2015. Longitudinal multiple sclerosis lesion segmentation using 3d convolutional neural networks. Proceedings of the 2015 Longitudinal Multiple Sclerosis Lesion Segmentation Challenge, 1–2.
- Valverde et al. (2017) Valverde, S., Cabezas, M., Roura, E., González-Villà, S., Pareto, D., Vilanova, J. C., Ramió-Torrentà, L., Rovira, À., Oliver, A., Lladó, X., 2017. Improving automated multiple sclerosis lesion segmentation with a cascaded 3d convolutional neural network approach. NeuroImage 155, 159–168.