Multi-stream 3D FCN with Multi-scale Deep Supervision for Multi-modality Isointense Infant Brain MR Image Segmentation

11/28/2017 ∙ by Guodong Zeng, et al. ∙ Universität Bern

We present a method to address the challenging problem of segmenting multi-modality isointense infant brain MR images into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). Our method is based on context-guided, multi-stream fully convolutional networks (FCN), which, after training, can directly map whole volumetric data to its volume-wise labels. In order to alleviate the potential gradient vanishing problem during training, we designed multi-scale deep supervision. Furthermore, context information was used to further improve the performance of our method. Validated on the test data of the MICCAI 2017 Grand Challenge on 6-month infant brain MRI segmentation (iSeg-2017), our method achieved an average Dice Overlap Coefficient of 95.4




1 Introduction

Brain development is a complex process that extends through childhood and adolescence and involves numerous processes such as neural induction, neuronal proliferation and migration, synaptogenesis, and myelination. Thus, it is important to develop quantitative tools for the analysis of neurodevelopment at all ages. Brain segmentation in MR images is a central piece of such quantitative analysis tools, because it delivers quantitative volume measurements of different brain structures and provides context information for further quantification. An example is the study of normal and abnormal early brain development, where accurate tissue segmentation of infant brain images into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) plays an important role.

Despite the significant progress achieved in the segmentation of adult brain MR images [1], the segmentation of infant brain MR images remains a challenge due to the ongoing maturation and myelination process in the first year of life [2, 3]. Moreover, most of the existing infant brain image segmentation methods relied either on the T2 modality for neonates less than 3 months old [2] or on the T1 modality for infants over 1 year old [4], as at those ages the respective modalities demonstrate a relatively good contrast between WM and GM. Only a few methods [3, 5, 6, 7, 8, 9, 10] addressed the challenges in segmentation of MR images of isointense-phase infants (around 6-8 months of age). At this stage, the T1 and T2 modalities have their lowest contrast, reflected by the fact that WM and GM have almost the same intensity level. Different methods have been proposed to address this challenge. In [3], Wang et al. proposed a longitudinally guided level set method to segment serial infant brain MR images acquired from 2 weeks up to 1.5 years of age, including the isointense images. To address the difficulty caused by the low contrast, their method leveraged the complementary tissue distribution information from 4D longitudinal T1, T2 and diffusion-weighted images. The dependence on 4D longitudinal data is regarded as the major limitation of this method. To address this limitation, the same authors later proposed a method that integrates sparse multi-modality representation and anatomical constraints for segmentation of cross-sectional, single-time-point isointense infant brain MR images [5]. They reported a Dice Overlap Coefficient (DOC) of 0.889 ± 0.008 for white matter and 0.870 ± 0.006 for gray matter.

Recently, machine learning-based methods have gained increasing interest in the field of medical image analysis and have been applied successfully to a range of medical image analysis problems. For example, Wang et al. [6] proposed a learning-based multi-source integration framework for segmentation of infant brain images. More specifically, they employed the random forest technique to effectively integrate features from multi-source images for tissue segmentation. More recently, with the advance of deep learning techniques [11, 12, 13], many researchers have proposed deep learning-based methods for automatic infant brain image segmentation [7, 9, 10]. Both deep convolutional neural network (CNN)-based methods and fully convolutional network (FCN)-based methods have been introduced. For example, Zhang et al. [7] proposed a 2D patch-wise CNN to learn a hierarchy of increasingly complex features from T1, T2 and fractional anisotropy (FA) images for the segmentation of multi-modality isointense infant brain images. They showed that their CNN approach outperforms prior methods and classical machine learning algorithms using support vector machine (SVM) and random forest (RF) classifiers. Nie et al. [9] presented a 2D semantic-wise multi-stream FCN to segment infant brain images using the same datasets that Zhang et al. [7] used. They obtained improved results in comparison to those achieved by Zhang et al. [7]: their overall DOCs were 85.5% (CSF), 87.3% (GM) and 88.8% (WM) vs. 83.5% (CSF), 85.3% (GM) and 86.4% (WM) by [7]. Moeskops and Pluim [10] investigated using a dilated triplanar CNN in combination with a non-dilated 3D CNN for the segmentation of isointense-phase brain MR images.

In this paper, we propose a 3D semantic segmentation method for accurate tissue segmentation of multi-modality isointense infant brain MR images. Our method is based on a context-guided, multi-stream 3D FCN, which, after training, can directly map whole volumetric data to its volume-wise labels. Inspired by previous work [14, 15], multi-scale deep supervision is designed to alleviate the potential gradient vanishing problem during training. It is also used together with partial transfer learning to boost the training efficiency when only a small set of labeled training data is available. Moreover, context information is used to further improve the performance of our method.

2 Data

The data used in this study were provided by the MICCAI 2017 Grand Challenge on 6-month infant brain MRI segmentation (details about the challenge are available on the challenge website). The training data provided by the challenge organizers consist of T1- and T2-weighted MR images of 10 subjects. The organizers also released test data containing T1- and T2-weighted images of another 13 subjects. Thus, in this paper, our method is first trained on the training data and then evaluated on the test data.

All images were preprocessed by the challenge organizers, which included linear alignment of T2 images onto the corresponding T1 images, skull stripping, intensity inhomogeneity correction, and removal of the cerebellum and brain stem. All images were up-sampled into an isotropic grid with a resolution of . Fig. 1 shows the T1- and T2-weighted MR images and the associated ground truth segmentation of one training subject.

Figure 1: T1- and T2-weighted MR images and the associated ground truth segmentation of one training subject.

3 Method

Fig. 2 illustrates our two-stage method for automatic infant brain segmentation in multi-modality MR images. We first develop FCN-1, which is used at Stage one to learn the probability map of each brain tissue from the multi-modality MR images (T1 and T2). An initial segmentation of the different brain tissues is then obtained from the probability maps, which further allows us to compute a distance map for each brain tissue. The computed distance maps are used to model the spatial context information. We then develop FCN-2, which is used at Stage two to obtain the final segmentation from both the spatial context information and the multi-modality MR images. In this section, we first elaborate the detailed architecture of our proposed model and then introduce the multi-scale deep supervision. Finally, we describe partial transfer learning, which is designed to boost the training efficiency.
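As a rough illustration of the step between the two stages, the sketch below derives an initial segmentation from per-class probability maps and then computes one distance map per tissue. It is only a sketch under assumptions: the text does not state which distance transform is used, so an unsigned Euclidean distance to the nearest voxel of each tissue is assumed, the label coding (0 for background, 1 to 3 for the tissues) is hypothetical, and the function names are ours. The brute-force computation is only practical for toy volumes; a library routine such as `scipy.ndimage.distance_transform_edt` would normally be preferred.

```python
import numpy as np

def initial_segmentation(prob_maps):
    """Turn per-class probability maps (C, D, H, W) into a label volume (D, H, W)."""
    return np.argmax(prob_maps, axis=0)

def distance_map(mask):
    """Unsigned Euclidean distance from every voxel to the nearest voxel of `mask`.

    Voxels inside the mask get 0. Brute force (all voxels x mask voxels),
    so this is meant for small examples only.
    """
    fg = np.argwhere(mask)                                    # (M, 3) mask coordinates
    if fg.size == 0:
        return np.full(mask.shape, np.inf)                    # tissue absent
    grid = np.indices(mask.shape).reshape(mask.ndim, -1).T    # (N, 3) all coordinates
    diff = grid[:, None, :] - fg[None, :, :]                  # (N, M, 3)
    dist = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)      # nearest mask voxel
    return dist.reshape(mask.shape)

def context_maps(prob_maps, tissue_labels=(1, 2, 3)):
    """One distance map per tissue, stacked as (n_tissues, D, H, W)."""
    labels = initial_segmentation(prob_maps)
    return np.stack([distance_map(labels == t) for t in tissue_labels])
```

In a two-stage setup of this kind, the stacked maps would be passed to the second network alongside the T1 and T2 volumes.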

Figure 2: A schematic illustration of our proposed network architecture. For each block, the digits below take a format as “the number of feature stack : the data size”

3.1 Multi-stream 3D FCN with Skip Connection

At both stages, a multi-stream 3D FCN with long and short skip connections is employed to integrate information from multiple sources, i.e., T1- and T2-weighted images (and context information for FCN-2). More specifically, both FCN-1 and FCN-2 consist of two parts, i.e., the encoder part (contracting path) and the decoder part (expansive path). The encoder part focuses on analysis and feature representation learning from the input data, while the decoder part generates segmentation results, relying on the features learned by the encoder part. Unlike previous work [7], which accommodates multiple sources of information in the form of channels presented to the input layer, we propose to construct an encoder path for each modality and then fuse the high-level information from all modalities at the beginning of the decoder path. Our rationale is that the high-level information extracted from the different modalities at the end of the encoder paths is more complementary than the original images from the different modalities.
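The difference between fusing modalities at the input and fusing them at the end of the encoder paths can be illustrated with a toy NumPy sketch. This is not the network described here: the "encoder" below is a stand-in that only applies 2×2×2 average pooling (a real encoder would use learned 3D convolutions), and all function names are hypothetical. The sketch only shows where the streams are joined and what shapes result.

```python
import numpy as np

def avg_pool2(x):
    """Halve each spatial dimension of a (C, D, H, W) array by 2x2x2 average pooling."""
    c, d, h, w = x.shape
    return x.reshape(c, d // 2, 2, h // 2, 2, w // 2, 2).mean(axis=(2, 4, 6))

def toy_encoder(volume, levels=2):
    """Stand-in for one modality's encoder path; returns the features of every level."""
    feats = [volume[None]]                 # add a channel axis -> (1, D, H, W)
    for _ in range(levels):
        feats.append(avg_pool2(feats[-1]))
    return feats

def late_fusion(t1, t2):
    """One encoder per modality; only the deepest features are concatenated."""
    f1, f2 = toy_encoder(t1), toy_encoder(t2)
    return np.concatenate([f1[-1], f2[-1]], axis=0)

def early_fusion(t1, t2):
    """Alternative design: modalities stacked as channels of a single input."""
    return np.stack([t1, t2], axis=0)
```

With 8×8×8 inputs and two pooling levels, `late_fusion` returns a (2, 2, 2, 2) array holding one feature channel per modality, whereas `early_fusion` returns a (2, 8, 8, 8) input for a single shared encoder.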

Inspired by 3D U-net [14], long and short skip connections, which help recover the spatial context lost in the contracting encoder, are used in our network as shown in Fig. 2. The importance of skip connections in biomedical image segmentation has been demonstrated by previous work [16, 13].

It has been shown that small convolutional kernels are more beneficial for training and performance. In our deeply supervised network, all convolutional layers use a kernel size of  and a stride of 1, and all max pooling layers use a kernel size of  and a stride of 2. In the convolutional and deconvolutional blocks, batch normalization (BN) [17] and rectified linear units (ReLU) are adopted to speed up the training and to enhance the gradient back-propagation.

3.2 Multi-scale Deep Supervision

Training a deep neural network is challenging. Because of vanishing gradients, the final loss cannot be efficiently back-propagated to the shallow layers, a problem that is more severe in 3D when only a small set of annotated data is available. To address this issue, we inject two down-scaled branch classifiers into our network in addition to the classifier of the main network. By doing this, segmentation is performed at multiple output layers. As a result, classifiers at different scales can take advantage of multi-scale context, which has been demonstrated in previous work on segmentation of 3D liver CT and 3D heart MR images [15]. Furthermore, since the loss is calculated from the predictions of classifiers at different scales, more effective gradient back-propagation can be achieved through direct supervision of the hidden layers.

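Specifically, let $W$ be the weights of the main network and $w = \{w^0, w^1, \ldots, w^{M-1}\}$ be the weights of the classifiers at different scales, where $M$ is the number of classifier branches. For the training samples $S = (X, Y)$, where $X$ represents the training sub-volume patches and $Y$ represents the class labels, the classification loss is

$$\mathcal{L}_{cls}(X, Y; W, w) = \sum_{m=0}^{M-1} \alpha_m \, \ell^m(X^m, Y^m; W, w^m)$$

where $X = \{X^0, X^1, \ldots, X^{M-1}\}$; $X^0$ is a sub-volume patch directly sampled from a training image, while $X^m$ contains the examples ($m > 0$) at scale $m$, which is obtained by downsampling $X^0$ by a factor of $2^m$ along each dimension; $w^m$ is the weights of the classifier at scale $m$; and $\alpha_m$ is the weight of $\ell^m$, which is the loss calculated from a training sample at scale $m$:

$$\ell^m(X^m, Y^m; W, w^m) = \sum_{(x_i, y_i) \in (X^m, Y^m)} -\log p(y_i \mid x_i; W, w^m)$$

where $p(y_i \mid x_i; W, w^m)$ is the predicted probability of the class label $y_i$ corresponding to sample $x_i$.

The total loss of our multi-scale deeply supervised model is then:

$$\mathcal{L}_{total} = \mathcal{L}_{cls}(X, Y; W, w) + \lambda \, \psi(W)$$

where $\psi(W)$ is the regularization term ($L_2$ norm in our experiment) with hyper parameter $\lambda$.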
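The structure of this loss, a weighted sum of per-scale negative log-likelihood terms plus a regularization term, can be sketched in NumPy as follows. Several details are assumptions made for illustration rather than statements about our implementation: the labels at the coarser scales are obtained by nearest-neighbour striding with a factor that doubles per scale, the regularizer is taken as the squared L2 norm of the weights, and the loss weights and function names are hypothetical.

```python
import numpy as np

def downsample_labels(labels, factor):
    """Nearest-neighbour downsampling of a (D, H, W) integer label volume."""
    return labels[::factor, ::factor, ::factor]

def cross_entropy(probs, labels):
    """Sum over voxels of -log p(true class); probs has shape (C, D, H, W)."""
    n_classes = probs.shape[0]
    flat_p = probs.reshape(n_classes, -1)
    flat_y = labels.reshape(-1)
    picked = flat_p[flat_y, np.arange(flat_y.size)]   # probability of the true class
    return float(-np.log(picked + 1e-12).sum())

def deep_supervision_loss(probs_per_scale, labels, alphas, weights, lam):
    """Weighted per-scale losses plus lam times the squared L2 norm of the weights.

    probs_per_scale[m] is the softmax output of the classifier at scale m,
    whose spatial size is the label size divided by 2**m (assumed).
    """
    cls = 0.0
    for m, (probs, alpha) in enumerate(zip(probs_per_scale, alphas)):
        cls += alpha * cross_entropy(probs, downsample_labels(labels, 2 ** m))
    reg = sum(float((w ** 2).sum()) for w in weights)
    return cls + lam * reg
```

Because every scale contributes its own term, the gradient reaches the layers feeding each branch classifier directly rather than only through the final output layer.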

3.3 Partial Transfer Learning

It is difficult to train a deep neural network from scratch when annotated data are limited. Training a deep neural network requires a large amount of annotated data, which is not always available, although data augmentation can partially address the problem. Furthermore, randomly initialized parameters make it more difficult to search for an optimal solution in a high-dimensional space. A previous study [18] demonstrated that transferring features from a pre-trained model can boost generalization, and that the effect of transfer learning is related to the similarity between the task of the pre-trained model and the target task. The same study also demonstrated that the weights of shallow layers in a deep neural network are generic, while those of deep layers are more related to specific tasks.

To best utilize the advantage of transfer learning, we need to transfer from a model trained on a related task. In this paper, we used a pre-trained model from our previous work [19], which was designed for the segmentation of the proximal femur from 3D T1-weighted MR images. More specifically, the weights of the complete path for the T1 modality (including encoder, decoder and all classifiers) are initialized from our previous model [19], while the weights of the encoder path for the T2 modality are partially transferred from the C3D model [20], which is one of the few 3D models that has been trained on a very large dataset in the field of computer vision.
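A common way to realize such a partial transfer is to copy only those pre-trained weights whose layer name and shape match the target network and to leave the remaining layers at their initial values. The sketch below shows this idea on plain dictionaries of NumPy arrays; the layer names, the prefix filter and the function name are hypothetical, and the sketch makes no claim about which layers were actually transferred in our experiments.

```python
import numpy as np

def partial_transfer(target, source, prefixes=None):
    """Copy source arrays into `target` where names and shapes match.

    target, source: dicts mapping layer names to weight arrays.
    prefixes: optional iterable of name prefixes restricting which target
    layers may be overwritten (e.g. only one encoder path).
    Returns the list of transferred layer names; other layers are untouched.
    """
    transferred = []
    for name, value in target.items():
        if prefixes is not None and not name.startswith(tuple(prefixes)):
            continue
        src = source.get(name)
        if src is not None and src.shape == value.shape:
            target[name] = src.copy()      # replace the initial value
            transferred.append(name)
    return transferred
```

Layers that are skipped, for example because the pre-trained model has a different number of channels at that depth, simply keep their random initialization and are learned during fine-tuning.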

4 Experiments and Results

4.1 Experimental Setup

Training data augmentation. Data augmentation was used to enlarge the set of training samples by rotating each image by 90, 180 and 270 degrees around the z axis of the image and by flipping it horizontally (along the y axis).
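A minimal NumPy sketch of this augmentation is given below. It assumes that volumes are stored with axes ordered (x, y, z), so that a rotation about z acts in the plane of the first two axes, and that the flip is applied to the original image only; both points are assumptions, and the function name is hypothetical. The same transforms would have to be applied to the T1 image, the T2 image and the label volume of a subject.

```python
import numpy as np

def augment(volume):
    """Return the original volume plus its rotated and flipped copies.

    volume: array with axes (x, y, z). Rotations are about the z axis,
    i.e. in the plane spanned by axes 0 and 1.
    """
    out = [volume]
    for k in (1, 2, 3):                                  # 90, 180, 270 degrees
        out.append(np.rot90(volume, k=k, axes=(0, 1)))
    out.append(np.flip(volume, axis=1))                  # flip along the y axis
    return out
```

Under these assumptions each training subject yields five volumes, the original plus four transformed copies.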

Training patches preparation. All sub-volume patches fed to our neural network have a size of . We randomly cropped sub-volume patches from the training samples. Each sampled image patch was normalized to zero mean and unit variance before being fed into the network.
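The patch preparation can be sketched as follows. The patch size used in the example is a placeholder, not the size used in our experiments, and the function names are hypothetical. The sketch crops the same random location from both modalities and the label volume, and normalizes each image patch separately.

```python
import numpy as np

def normalize(patch, eps=1e-8):
    """Scale a patch to zero mean and unit variance."""
    patch = patch.astype(np.float64)
    return (patch - patch.mean()) / (patch.std() + eps)

def sample_patch(t1, t2, labels, patch_size, rng):
    """Crop one random sub-volume at the same location from T1, T2 and labels."""
    starts = [rng.integers(0, dim - p + 1) for dim, p in zip(t1.shape, patch_size)]
    sl = tuple(slice(s, s + p) for s, p in zip(starts, patch_size))
    return normalize(t1[sl]), normalize(t2[sl]), labels[sl]

# Example with placeholder sizes:
# rng = np.random.default_rng(0)
# t1_patch, t2_patch, label_patch = sample_patch(t1, t2, labels, (8, 8, 8), rng)
```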

| Tissue | Metric | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | #11 | #12 | #13 | Mean (std) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CSF | DOC | 0.957 | 0.951 | 0.959 | 0.946 | 0.956 | 0.955 | 0.956 | 0.960 | 0.958 | 0.940 | 0.956 | 0.942 | 0.959 | 0.954 (0.007) |
| CSF | ASD (mm) | 0.119 | 0.131 | 0.123 | 0.144 | 0.118 | 0.125 | 0.118 | 0.115 | 0.115 | 0.158 | 0.120 | 0.151 | 0.110 | 0.127 (0.015) |
| CSF | MHD (mm) | 7.28 | 9.0 | 11.23 | 9.90 | 9.38 | 11.66 | 8.66 | 9.0 | 8.94 | 11.23 | 7.81 | 10.82 | 10.10 | 9.62 (1.35) |
| GM | DOC | 0.923 | 0.907 | 0.920 | 0.913 | 0.925 | 0.916 | 0.926 | 0.917 | 0.914 | 0.901 | 0.920 | 0.910 | 0.919 | 0.916 (0.007) |
| GM | ASD (mm) | 0.30 | 0.361 | 0.337 | 0.351 | 0.318 | 0.333 | 0.280 | 0.320 | 0.335 | 0.428 | 0.321 | 0.411 | 0.333 | 0.341 (0.041) |
| GM | MHD (mm) | 5.39 | 5.10 | 6.78 | 7.87 | 5.66 | 8.54 | 4.90 | 6.78 | 5.10 | 7.0 | 6.16 | 8.31 | 6.32 | 6.46 (1.23) |
| WM | DOC | 0.906 | 0.880 | 0.902 | 0.895 | 0.905 | 0.906 | 0.914 | 0.901 | 0.90 | 0.870 | 0.898 | 0.882 | 0.893 | 0.896 (0.012) |
| WM | ASD (mm) | 0.353 | 0.424 | 0.393 | 0.392 | 0.367 | 0.388 | 0.317 | 0.379 | 0.411 | 0.498 | 0.382 | 0.463 | 0.401 | 0.40 (0.046) |
| WM | MHD (mm) | 7.55 | 6.40 | 8.12 | 7.21 | 5.10 | 7.48 | 4.36 | 7.81 | 7.81 | 6.40 | 6.71 | 7.55 | 5.66 | 6.78 (1.16) |

Table 1: Segmentation performance in terms of DOC, ASD (unit: mm), and MHD (unit: mm) achieved by the present method on the 13 test subjects.

Training. We trained our network for  iterations after partial transfer learning. All weights were updated by the stochastic gradient descent algorithm (momentum= , weight decay= ). The learning rate was initialized as  and halved every  iterations. In our experiment, we used three branch classifiers. The loss weights of the three classifiers , and  are , and , respectively. The hyper parameter was chosen to be 0.005.
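A learning rate that is halved at fixed intervals corresponds to a simple step-decay schedule, sketched below. The initial rate and the interval in the usage note are placeholders chosen for illustration, not the values used in our experiments, and the function name is hypothetical.

```python
def learning_rate(iteration, base_lr, step, factor=0.5):
    """Step decay: multiply the base rate by `factor` once every `step` iterations."""
    return base_lr * factor ** (iteration // step)
```

For example, with a placeholder base rate of 0.1 and an interval of 100 iterations, the rate stays at 0.1 for iterations 0 to 99, drops to 0.05 at iteration 100 and to 0.025 at iteration 200.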


Our trained models can estimate the labels of an arbitrary-sized volumetric image. Given the images of a test subject, we extracted overlapping sub-volume patches with the size of , and fed them to the trained network to get prediction probability maps. For the overlapping voxels, the final probability maps are the average of the probability maps of the overlapping patches, which were then used to derive the final segmentation results. After that, we conducted morphological operations to remove isolated small volumes and internal holes.
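The averaging of overlapping patch predictions can be sketched as a sliding-window loop. In this sketch `predict_patch` stands for the trained network and is a hypothetical callable that maps a (d, h, w) patch to class probabilities of shape (C, d, h, w); the patch size and stride are placeholders, and the morphological post-processing is omitted.

```python
import numpy as np

def start_positions(dim, patch, stride):
    """Patch start indices covering [0, dim); the last patch ends flush at dim."""
    if dim <= patch:
        return [0]
    starts = list(range(0, dim - patch + 1, stride))
    if starts[-1] != dim - patch:
        starts.append(dim - patch)
    return starts

def predict_volume(volume, predict_patch, patch_size, stride, n_classes):
    """Average overlapping patch-wise probability maps over a whole volume.

    Returns the averaged probabilities (C, D, H, W) and the label volume (D, H, W).
    """
    prob_sum = np.zeros((n_classes,) + volume.shape)
    count = np.zeros(volume.shape)
    grids = [start_positions(d, p, s)
             for d, p, s in zip(volume.shape, patch_size, stride)]
    for z in grids[0]:
        for y in grids[1]:
            for x in grids[2]:
                sl = (slice(z, z + patch_size[0]),
                      slice(y, y + patch_size[1]),
                      slice(x, x + patch_size[2]))
                prob_sum[(slice(None),) + sl] += predict_patch(volume[sl])
                count[sl] += 1                 # how many patches cover each voxel
    probs = prob_sum / count
    return probs, probs.argmax(axis=0)
```

Adding a final start position flush with the volume border ensures that every voxel is covered at least once, so the division by `count` is always defined.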

Evaluation metrics. For each test subject, the automatic segmentation was evaluated against the associated manual segmentation using several measurements, including DOC, Average Surface Distance (ASD) and Modified Hausdorff Distance (MHD). For details about the evaluation metrics, we refer to the challenge website.
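For orientation, the overlap measure can be sketched with the standard Dice definition, twice the size of the intersection divided by the sum of the sizes of the two regions. This is the textbook formula and not the challenge's evaluation code; the convention of returning 1.0 when a label is absent from both volumes is an assumption, as is the function name. ASD and MHD require surface extraction and are not sketched here.

```python
import numpy as np

def dice(pred, truth, label):
    """Dice overlap of one label between a predicted and a reference label volume."""
    p = (pred == label)
    t = (truth == label)
    denom = p.sum() + t.sum()
    if denom == 0:
        return 1.0                      # label absent in both volumes (assumed convention)
    return 2.0 * np.logical_and(p, t).sum() / denom
```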

4.2 Results

Table 1 shows the segmentation performance in terms of DOC, ASD and MHD achieved by the present method when evaluated on the 13 test subjects provided by the 2017 MICCAI Grand Challenge on 6-month infant brain MRI segmentation. Average DOCs of 0.954, 0.916 and 0.896 were achieved for CSF, GM and WM, respectively.

| Method | Tissue | DOC | ASD (mm) | MHD (mm) |
|---|---|---|---|---|
| Adding context | CSF | 0.954 ± 0.007 | 0.127 ± 0.015 | 9.62 ± 1.35 |
| Adding context | GM | 0.916 ± 0.007 | 0.341 ± 0.041 | 6.46 ± 1.23 |
| Adding context | WM | 0.896 ± 0.012 | 0.40 ± 0.046 | 6.78 ± 1.16 |
| Without context | CSF | 0.950 ± 0.006 | 0.137 ± 0.014 | 8.94 ± 0.98 |
| Without context | GM | 0.911 ± 0.008 | 0.366 ± 0.041 | 6.61 ± 1.24 |
| Without context | WM | 0.888 ± 0.012 | 0.433 ± 0.050 | 7.12 ± 1.31 |

Table 2: Average results on the test data achieved by the present method when adding context information vs. without adding context information.

We also evaluated the effectiveness of adding context information by comparing the results achieved with and without it. The results are presented in Table 2. Adding context information improved the average DOC for all three tissues (0.954 vs. 0.950 for CSF, 0.916 vs. 0.911 for GM, and 0.896 vs. 0.888 for WM) and reduced the average ASD for all three tissues. The average MHD decreased for GM and WM but increased for CSF (9.62 vs. 8.94). Figure 3 shows a qualitative comparison of the automatic segmentation with context information and the automatic segmentation without context information, taking the ground truth segmentation as the reference.

Figure 3: A comparison of the ground truth segmentation (the 2nd row), automatic segmentation with context information (the 3rd row) and automatic segmentation without using context information (the 4th row).

Our method was implemented in Python using the TensorFlow framework. Running on a desktop with a 3.6GHz Intel(R) i7 CPU and a GTX 1080 Ti graphics card with 11GB GPU memory, our network took about 8 seconds on average to segment the data of one test subject.

5 Discussions and Conclusions

In this paper, we proposed a 3D semantic segmentation method for accurate tissue segmentation of multi-modality isointense infant brain MR images into CSF, GM and WM. Our method is based on a context-guided, multi-stream 3D FCN with multi-scale deep supervision, which, after training, can directly map whole volumetric data to its volume-wise labels.

In total, 21 teams participated in the 2017 MICCAI Grand Challenge on 6-month infant brain MRI segmentation. The performance of all the methods was evaluated and ranked by the challenge organizers. Although our team was placed 3rd out of the 21 teams, the segmentation performance achieved by our method differed only slightly from that achieved by the teams in 1st and 2nd place. More specifically, the overall DOCs achieved by our method and by the team in 1st place were 0.954 (CSF), 0.916 (GM) and 0.896 (WM) vs. 0.958 (CSF), 0.919 (GM) and 0.901 (WM).


References

  • [1] I. Despotovic, B. Goossens, and W. Philips, “MRI segmentation of the human brain: Challenges, methods, and applications,” Computational and Mathematical Methods in Medicine, vol. 2015, Article ID 450341, 2015.
  • [2] N.I. Weisenfeld and S.K. Warfield, “Automatic segmentation of newborn brain MRI,” NeuroImage, vol. 47, no. 2, pp. 564–572, 2009.
  • [3] L. Wang, F. Shi, P.T. Yap, J.H. Gilmore, W. Lin, and D. Shen, “4D multi-modality tissue segmentation of serial infant images,” PLOS ONE, vol. 7, no. 9, pp. e44596, 2012.
  • [4] F. Shi, P.-T. Yap, G. Wu, H. Jia, J.H. Gilmore, W. Lin, and D. Shen, “Infant brain atlases from neonates to 1- and 2-year-olds,” PLOS ONE, vol. 6, no. 4, pp. e18746, 2011.
  • [5] L. Wang, F. Shi, Y. Gao, G. Li, J.H. Gilmore, W. Lin, and D. Shen, “Integration of sparse multi-modality representation and anatomical constraint for isointense infant brain MR image segmentation,” NeuroImage, vol. 89, pp. 152–164, 2014.
  • [6] L. Wang, Y. Gao, F. Shi, G. Li, J.H. Gilmore, W. Lin, and D. Shen, “LINKS: Learning-based multi-source integration framework for segmentation of infant brain images,” NeuroImage, vol. 108, pp. 160–172, 2015.
  • [7] W. Zhang, R. Li, H. Deng, L. Wang, W. Lin, S. Ji, and D. Shen, “Deep convolutional neural networks for multi-modality isointense infant brain image segmentation,” NeuroImage, vol. 108, pp. 214–224, 2015.
  • [8] L. Wang, F. Shi, Y. Gao, G. Li, W. Lin, and D. Shen, “Isointense infant brain segmentation by stacked kernel canonical correlation analysis,” in Patch Based Tech Med Imaging, 2015, vol. LNCS 9467, pp. 28–36.
  • [9] D. Nie, L. Wang, Y. Gao, and D. Shen, “Fully convolutional networks for multi-modality isointense infant brain image segmentation,” in Proc IEEE Int Symp Biomed Imaging, 2016, pp. 1342–1345.
  • [10] P. Moeskops and J.P.W. Pluim, “Isointense infant brain MRI segmentation with a dilated convolutional neural network,” CoRR, abs/1708.02282, 2017.
  • [11] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [12] Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  • [13] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
  • [14] O. Cicek, A. Abdulkadir, S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning dense volumetric segmentation from sparse annotation,” in MICCAI 2016, vol. LNCS 9901, pp. 424–432. Springer, 2016.
  • [15] Q. Dou, L. Yu, H. Chen, Y. Jin, X. Yang, J. Qin, and P.A. Heng, “3D deeply supervised network for automated segmentation of volumetric medical images,” Medical Image Analysis, vol. 41, pp. 40–54, 2017.
  • [16] M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal, “The importance of skip connections in biomedical image segmentation,” in Proceedings of DLMA 2016, 2016, pp. 179–187.
  • [17] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of ICML, 2015, pp. 448–456.
  • [18] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
  • [19] G. Zeng, X. Yang, J. Li, L. Yu, P.A. Heng, and G. Zheng, “3D U-net with multi-level deep supervision: Fully automatic segmentation of proximal femur in 3D MR images,” in MLMI 2017, 2017, pp. 274–282.
  • [20] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proceedings of CVPR 2015, 2015, pp. 4489–4497.