Global Deep Learning Methods for Multimodality Isointense Infant Brain Image Segmentation

12/10/2018 ∙ by Zhengyang Wang, et al. ∙ University of North Carolina at Chapel Hill ∙ Texas A&M University

An important step in early brain development study is to perform automatic segmentation of infant brain magnetic resonance (MR) images into cerebrospinal fluid (CSF), gray matter (GM) and white matter (WM) regions. This task is especially challenging in the isointense stage (approximately 6-8 months of age) when GM and WM exhibit similar levels of intensities in MR images. Deep learning has shown its great promise in various image segmentation tasks. However, existing models do not have an efficient and effective way to aggregate global information. They also suffer from information loss during up-sampling operations. In this work, we address these problems by proposing a global aggregation block, which can be flexibly used for global information fusion. We build a novel model based on 3D U-Net to make fast and accurate voxel-wise dense prediction. We perform thorough experiments, and results indicate that our model outperforms previous best models significantly on 3D multimodality isointense infant brain MR image segmentation.




I Introduction

Human brains undergo rapid tissue growth and a dramatic development of cognitive and motor functions during the first year after birth [1, 2]. The study of human brain development in this phase is essential to understanding the human brain and has thus attracted considerable attention [3, 4, 5, 6]. Recently, the availability of infant brain magnetic resonance (MR) image data has increased, which has motivated studies that provide more insight into both normal and diseased brain growth [3]. For example, the Baby Connectome Project collects multimodality MR image data of 500 children from birth to five years old. Many important studies have been performed on such data. For instance, identifying early neuro-markers of the risk for disorders such as autism and schizophrenia may help disease prognosis or even prevention [4].

In this research area, an important step is to obtain tissue segmentation of 3D infant brain MR images into cerebrospinal fluid (CSF), gray matter (GM) and white matter (WM) regions [7, 8, 9]. Accurate segmentation can lead to volumetric quantification of GM and WM structures, which may indicate very early neuro-anatomical developmental events [5, 6]. Segmentation of infant brain MR images is known to be more difficult than that of adult brain MR images, as infant brain MR images typically have lower tissue contrast, a higher noise level, a severe partial volume effect and ongoing WM myelination [10, 11, 12]. The task is especially challenging in the isointense stage (approximately 6-8 months of age), during which GM and WM exhibit highly similar intensity levels in MR images. While many previous studies addressed the segmentation task for either the infantile (up to 3 months) or the early-adult-like (beyond 12 months) stage [13, 12, 14, 15, 16, 10, 11], these methods did not work well for the isointense stage. The left two images in Fig. 1 show a 2D slice of 3D isointense infant brain T1 and T2 MR images, where it is hard to separate GM and WM regions.

Fig. 1: A 2D slice of multimodality isointense infant brain MR images. From left to right: T1 MR image, T2 MR image and the segmentation label. Note that the segmentation label only contains 4 voxel values (0,1,2,3) corresponding to 4 classes.

Due to the difficulty of this task, manual segmentation of 3D multimodality isointense infant brain MR images requires expertise and experience and is very time-consuming. This raises the need for automatic segmentation tools. Some studies proposed to use longitudinal datasets [17, 18] to guide the segmentation. However, in most cases the infant brain MR images come from a single time point, so longitudinal datasets are not available. For this case, machine learning methods [19, 20, 21, 22] that do not rely on longitudinal datasets have been developed. Yet the performance of these models in terms of accuracy and speed is not satisfactory for medical research.

In recent years, deep learning methods, such as fully convolutional networks (FCN) [23], U-Net [24], Deeplab [25, 26, 27], and RefineNet [28], have been proposed to set new performance records on 2D image segmentation. Motivated by these successes in 2D cases, the study in [8] proposed to extend FCN into 3D-FCN to perform 3D segmentation. Particularly, their model was composed of an encoder and a decoder. The encoder used convolutional layers for extracting features along with pooling layers for reducing the spatial sizes of feature maps. The decoder applied deconvolutional layers to up-sample feature maps and finally output a segmentation map with the same size as the input. Additionally, they employed the strategy of U-Net and added skip connections. Experimental results indicated that their model, known as the convolution-concatenate 3D-FCN (CC-3D-FCN), achieved promising results on 3D multimodality isointense infant brain MR image segmentation.

We conduct an in-depth study of prior models for this task and observe three limitations shared by most of them. First, in the encoding part, the feature maps have to be down-sampled to a very small size in order to perform effective global information aggregation. For example, in CC-3D-FCN [8], the encoder applies three down-sampling layers so that the output feature maps are small enough for a single deconvolution kernel to cover them entirely and fuse global information. As the number of feature maps usually increases after each down-sampling layer, employing more down-sampling layers introduces a considerably larger number of trainable parameters, making the model less efficient. In addition, more down-sampling layers cause the loss of more spatial information during encoding, and this information is crucial for image segmentation tasks. On the other hand, reducing the number of down-sampling layers results in feature maps of larger spatial sizes before global information aggregation. In this case, using operations based on small local kernels, like convolutions or deconvolutions, to aggregate global information may not be effective. In short, a new method capable of performing global information aggregation on feature maps of any size would improve both effectiveness and efficiency. The second limitation is in the decoding part, where the spatial sizes of feature maps gradually increase through up-sampling. Deconvolutions are the most popular up-sampling operations used in previous models. However, deconvolutions apply the same local kernels to scan each location, without taking global information into consideration. When the input feature maps have a larger spatial size than the kernel, the deconvolution fails to recover all necessary information during up-sampling. The third and last limitation lies in the setting of the number of feature maps in each layer. Empirical results [24] indicate that it is beneficial to have the number of feature maps increase after down-sampling and decrease after up-sampling. We find that this strategy also works in this task, as demonstrated by the experiments in Section IV-D.

In this work, we address the three limitations and propose a novel model for 3D multimodality isointense infant brain MR image segmentation. To address the first limitation, we propose a global aggregation block based on the self-attention scheme [29], which is able to aggregate global information from feature maps of any size without introducing more parameters. This module is further extended to an up-sampling global aggregation block, which alleviates the second limitation. To the best of our knowledge, we are the first to make this extension. To address the third limitation, we build our model on the U-Net [24] framework. We conduct extensive experiments to compare our model with CC-3D-FCN on our dataset. The results and analysis show that our model improves significantly on CC-3D-FCN in terms of segmentation accuracy.

II Related Work

The U-Net [24] architecture incorporates both local and global contextual information through the encoding-decoding process. In the past several years, many variants of U-Net have been developed and they achieved improved performance on biomedical image segmentation. For example, FusionNet [30], residual deconvolutional network (RDN) [31] and residual symmetric U-Net [32] addressed the 2D electron microscopy image segmentation task by building a U-Net-based network with additional short-range residual connections [33]. In addition, U-Net was extended from 2D to 3D cases for volumetric biomedical images, leading to models like 3D U-Net [34], V-Net [35], and CC-3D-FCN [8]. Meanwhile, DeepMedic [36] explored another way to fuse both local and global contextual information by removing the decoder of U-Net and employing a dual pathway architecture. However, without the decoder, the spatial sizes of outputs become smaller as compared to U-Net-based models, which harms the inference efficiency since more patches need to be processed during inference. DeepMedic has been outperformed by U-Net-based models like CC-3D-FCN [8]. In this work, we unify previous models and employ the 3D U-Net architecture with short-range residual connections as the basic framework.

The self-attention mechanism was used in the Transformer [29] for machine translation tasks. The Transformer does not apply any recurrence or convolution, based on the insight that the self-attention mechanism makes it easier to learn long-range dependencies within sequences and requires less computation [29]. The study in [37] proposed the spacetime non-local block for video classification, where the self-attention mechanism was also employed for capturing long-range dependencies. In this work, we explore the self-attention mechanism for a different functionality. As the self-attention mechanism provides an operation with a global receptive field, it can aggregate global information both effectively and efficiently. We further generalize it to perform down-sampling and up-sampling.

III The Proposed Methods

In this section, we introduce our proposed model. Section III-A illustrates our model framework derived from U-Net [24]. Based on the framework, our model is composed of different blocks. Basic residual blocks are discussed in Section III-B. We then propose the global aggregation block and its extension for up-sampling in Section III-C. Finally, Section III-D provides details of training and inference strategies.

III-A U-Net Framework

We adopt the 3D U-Net [34] as the framework of our proposed model for 3D multimodality isointense infant brain MR image segmentation. An illustration is given in Fig. 2. The input first goes through an encoding input block, which extracts low-level features. Two down-sampling blocks are used to reduce the spatial sizes and obtain high-level features. Note that the number of channels is doubled after each down-sampling block. A bottom block then aggregates global information and gives the output of the encoder. Correspondingly, the decoder uses two up-sampling blocks to recover the spatial sizes for the segmentation output. The number of feature maps is halved after an up-sampling operation.

To assist the decoding process, skip connections copy feature maps from the encoder to the decoder. In the proposed model, the copied feature maps are combined with decoding feature maps through summation, instead of concatenation used in CC-3D-FCN [8] and U-Net [24]. The intuitive way to combine features from the encoder and the decoder is concatenation, providing two sources of inputs to the up-sampling operation. Using summation instead has two advantages. First, summation does not increase the number of feature maps, thus reducing the number of trainable parameters in the following layer. Second, skip connections with summation can be considered as long-range residual connections, which are known to be capable of facilitating the training of models.
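To make the first advantage concrete, the following minimal sketch counts the weights of the convolution that follows a skip connection under both schemes. The kernel size of 3 and the 64 channels are illustrative assumptions, not values taken from the model.

```python
def conv3d_params(in_channels, out_channels, kernel=3, bias=True):
    """Number of trainable weights in one 3D convolution layer."""
    weights = kernel ** 3 * in_channels * out_channels
    return weights + (out_channels if bias else 0)


channels = 64  # hypothetical width of both the encoder and decoder feature maps

# Concatenation stacks the two sources, so the next layer sees 2 * channels inputs.
concat_params = conv3d_params(2 * channels, channels)
# Summation keeps the channel count unchanged.
sum_params = conv3d_params(channels, channels)

print(concat_params, sum_params)
```

Under these assumptions the layer after a summation needs roughly half as many weights as the layer after a concatenation.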

Given the output of the decoder, the output block produces the segmentation probability map. Specifically, for each voxel, the probabilities that it belongs to BG, CSF, GM and WM are provided, respectively. The final segmentation map can be obtained through a single argmax operation on this probability map. The details of each block are introduced in the following sections.
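As a minimal illustration of this last step, the sketch below converts a per-voxel probability map into a label map with NumPy. The channel-last layout and the label order (0 = BG, 1 = CSF, 2 = GM, 3 = WM) are assumptions made for the example; random values stand in for the network output.

```python
import numpy as np

CLASS_NAMES = ("BG", "CSF", "GM", "WM")  # label order 0..3 is an assumption


def probabilities_to_labels(prob_map):
    """Turn a (D, H, W, 4) probability map into a (D, H, W) label map."""
    return np.argmax(prob_map, axis=-1)


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4, 4, len(CLASS_NAMES)))
# Softmax over the class axis gives per-voxel probabilities that sum to 1.
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
prob_map = exp / exp.sum(axis=-1, keepdims=True)

label_map = probabilities_to_labels(prob_map)
```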

Fig. 2: An illustration of the U-Net framework employed by our proposed model.
Fig. 3: An illustration of the residual blocks employed by our proposed model. Details are provided in Sections III-B and III-C.

III-B Residual Blocks

Residual connections have been shown to facilitate the training of deep learning models and achieve better performance [33]. Note that skip connections with summation in our U-Net framework are equivalent to long-range residual connections. To further improve U-Net, the study in [30] proposed to add short-range residual connections as well. A similar strategy was employed in [32, 28, 31]. However, those works did not apply residual connections to down-sampling and up-sampling blocks. Down-sampling blocks with residual connections have been explored in ResNet [33]. We explore the idea for up-sampling blocks based on our proposed up-sampling global aggregation block, which is discussed in detail in Section III-C.

In our proposed model, four different residual blocks are used to form a fully residual network [28], as shown in Fig. 3. Notably, all of them apply the pre-activation pattern [38]. Fig. 3(a) shows a regular residual block with two consecutive convolutional layers. Batch normalization [39] and ReLU6 [40] are used before each convolutional layer. This block is used as the input block in our framework. The output block is constructed by this block followed by a convolution with a stride of 1. Moreover, after the summation of skip connections, we insert one such block. Fig. 3(b) is a down-sampling residual block. A convolution with a stride of 2 is used to replace the identity residual connection, in order to adjust the spatial sizes of feature maps accordingly. We employ this block as the down-sampling blocks. Fig. 3(c) illustrates our bottom block. Basically, a residual connection is applied on the proposed global aggregation block. The up-sampling residual block is provided in Fig. 3(d). Similar to the down-sampling block in Fig. 3(b), the identity residual connection is replaced by a deconvolution with a stride of 2 and the other branch is the up-sampling global aggregation block. Our model uses this block as the up-sampling blocks.

Fig. 4: An illustration of our proposed global aggregation block. Note that the spatial size of the output is determined by that of the query (Q) matrix.

III-C Global Aggregation Block

To achieve global information fusion through a block, each position of the output feature maps should depend on all positions of the input feature maps. Such an operation is opposite to local operations like convolution and deconvolution, where each output location has a local receptive field on the input. In fact, a fully-connected layer has this global property. However, it is prone to over-fitting and does not work well in practice. As introduced in Section II, the self-attention block used in the Transformer computes outputs at one position by attending to every position of the input. Later, the study in [37] proposed non-local neural networks for video classification, which employed a similar block. While both studies applied self-attention blocks with the aim of capturing long-term dependencies in sequences, we point out that global information of image feature maps can be aggregated through self-attention blocks.

Based on this insight, we propose the global aggregation block, which is able to fuse global information from feature maps of any size. We further generalize it to handle down-sampling and up-sampling, making it a block that can be used anywhere in deep learning models. Let $X$ represent the input to the global aggregation block and $Y$ represent the output. For simplicity, we use $\mathrm{Conv}_{N}$ to denote a $1 \times 1 \times 1$ convolution with a stride of 1 and $N$ output channels. Note that $\mathrm{Conv}_{N}$ does not change the spatial size. The first step of the proposed block is to generate the query ($Q$), key ($K$) and value ($V$) matrices [29], given by

$$Q = \mathrm{Unfold}(\mathrm{QueryTransform}_{C_K}(X)), \quad K = \mathrm{Unfold}(\mathrm{Conv}_{C_K}(X)), \quad V = \mathrm{Unfold}(\mathrm{Conv}_{C_V}(X)), \tag{1}$$

where $\mathrm{Unfold}(\cdot)$ unfolds a $D \times H \times W \times C$ tensor into a $(D \times H \times W) \times C$ matrix, $\mathrm{QueryTransform}_{C_K}(\cdot)$ can be any operation that produces $C_K$ feature maps, and $C_K$ and $C_V$ are hyper-parameters representing the dimensions of the keys and values. Suppose the size of $X$ is $D \times H \times W \times C$. Then the dimensions of $K$ and $V$ are $(D \times H \times W) \times C_K$ and $(D \times H \times W) \times C_V$, respectively. The dimension of $Q$, however, is $(D_Q \times H_Q \times W_Q) \times C_K$, where $D_Q$, $H_Q$ and $W_Q$ depend on $\mathrm{QueryTransform}_{C_K}(\cdot)$. The left part of Fig. 4 illustrates this step. Here, a $D \times H \times W \times C$ tensor is represented by a $D \times H \times W$ cube, whose voxels correspond to $C$-dimensional vectors.

Each row of the $Q$, $K$ and $V$ matrices denotes a query vector, a key vector and a value vector, respectively. Note that the query vector has the same dimension as the key vector. Meanwhile, the number of key vectors is the same as that of value vectors, which indicates a one-to-one correspondence. In the second step, the attention mechanism is applied on $Q$, $K$ and $V$ [29], defined as

$$A = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{C_K}}\right), \quad O = A V, \tag{2}$$

where the dimension of the attention weight matrix $A$ is $(D_Q \times H_Q \times W_Q) \times (D \times H \times W)$ and the dimension of the output matrix $O$ is $(D_Q \times H_Q \times W_Q) \times C_V$. To see how it works, we take one query vector from $Q$ as an example. In the attention mechanism, the query vector interacts with all key vectors, where the dot-product between the query vector and one key vector produces a scalar weight for the corresponding value vector. The output of the query vector is a weighted sum of all value vectors, where the weights are normalized through $\mathrm{Softmax}$. This process is repeated for all query vectors and generates $D_Q \times H_Q \times W_Q$ $C_V$-dimensional vectors. This step is illustrated in the box of Fig. 4. Note that Dropout [41] can be applied on $A$ to avoid over-fitting. As shown in Fig. 4, the final step of the block computes $Y$ by

$$Y = \mathrm{Conv}_{C_O}(\mathrm{Fold}(O)), \tag{3}$$

where $\mathrm{Fold}(\cdot)$ is the reverse operation of $\mathrm{Unfold}(\cdot)$ and $C_O$ is a hyper-parameter representing the dimension of the outputs. As a result, the size of $Y$ is $D_Q \times H_Q \times W_Q \times C_O$.

It is worth noting that the spatial size of $Y$ is determined by that of the $Q$ matrix, i.e., by the $\mathrm{QueryTransform}_{C_K}(\cdot)$ function in (1). Therefore, with appropriate $\mathrm{QueryTransform}_{C_K}(\cdot)$ functions, our global aggregation block can be used for same-size processing, down-sampling and up-sampling. Our proposed model explores two different $\mathrm{QueryTransform}_{C_K}(\cdot)$ functions. For the global aggregation block in Fig. 3(c), $\mathrm{QueryTransform}_{C_K}(\cdot)$ is $\mathrm{Conv}_{C_K}$. For the up-sampling global aggregation block in Fig. 3(d), $\mathrm{QueryTransform}_{C_K}(\cdot)$ is a deconvolution with a stride of 2. The use of this block alleviates the problem that up-sampling through a single deconvolution loses information. By taking global information into consideration, the up-sampling block is able to recover more accurate details. In our model, we set .
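The three steps of the block can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the per-voxel convolutions are written as matrix products, the dot products are scaled by the square root of the key dimension as in [29], the channel sizes are arbitrary, Dropout is omitted, and nearest-neighbour repetition stands in for the learned stride-2 deconvolution used as the query transform in the up-sampling case. The names `global_aggregation` and `query_input` are hypothetical.

```python
import numpy as np


def unfold(x):
    """Flatten a (D, H, W, C) tensor into a (D*H*W, C) matrix."""
    return x.reshape(-1, x.shape[-1])


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def global_aggregation(x, w_q, w_k, w_v, w_o, query_input=None):
    """Self-attention over all voxels of x.

    x: (D, H, W, C) input feature maps.
    w_q, w_k: (C, C_K) weights; w_v: (C, C_V); w_o: (C_V, C_O).
    A per-voxel convolution is a matrix product over channels, written as `@`.
    query_input: optional (D_Q, H_Q, W_Q, C) tensor standing in for the
    spatial part of the query transform (e.g. an up-sampled copy of x).
    """
    q_src = x if query_input is None else query_input
    q = unfold(q_src) @ w_q                      # (N_Q, C_K) query vectors
    k = unfold(x) @ w_k                          # (N, C_K) key vectors
    v = unfold(x) @ w_v                          # (N, C_V) value vectors
    a = softmax(q @ k.T / np.sqrt(k.shape[1]))   # (N_Q, N) attention weights
    o = a @ v                                    # (N_Q, C_V) weighted sums of values
    y = o @ w_o                                  # (N_Q, C_O)
    return y.reshape(*q_src.shape[:3], w_o.shape[1]), a


rng = np.random.default_rng(0)
C = C_K = C_V = C_O = 8  # arbitrary sizes chosen for the example
x = rng.normal(size=(2, 3, 4, C))
w_q, w_k = rng.normal(size=(C, C_K)), rng.normal(size=(C, C_K))
w_v, w_o = rng.normal(size=(C, C_V)), rng.normal(size=(C_V, C_O))

# Same-size block: queries come from x itself.
y_same, a_same = global_aggregation(x, w_q, w_k, w_v, w_o)

# Up-sampling block: the query source has twice the spatial size of x.
x_up = x.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)
y_up, a_up = global_aggregation(x, w_q, w_k, w_v, w_o, query_input=x_up)
```

In both cases every output voxel attends to all input voxels, and only the query source decides the output's spatial size.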

III-D Training and Inference Strategies

Our proposed model applies Dropout [41] with a rate of 0.5 in each global aggregation block and in the output block before the final convolution. Weight decay [42] is also employed. To train the model, we use randomly cropped small patches. In this way, we obtain sufficient training data and the memory requirement is reduced. No extra data augmentation is needed. The experimental results in Section IV-F indicate which patch size leads to the best performance. The batch size is set to 5. The Adam optimizer [43] with a learning rate of 0.001 is employed to perform gradient descent.

In the inference process, following [8], we extract patches with the same size as that used in training. To generate patches for inference, we slide a window of the patch size through the original image with a constant overlapping step size. The overlapping step size must be smaller than or equal to the patch size, in order to guarantee that the extracted patches cover the whole image. Consequently, prediction for all these patches provides segmentation probability results for every voxel in the original image. For voxels that receive multiple results due to overlapping, we average them to produce the final prediction. The overlapping step size is an important hyper-parameter affecting the inference speed and the segmentation accuracy. A smaller overlapping step size results in better accuracy, but increases the inference time as more patches are generated. We explore the trade-off in Section IV-E.
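A minimal sketch of this inference procedure is given below. It is not the authors' implementation: the function names are hypothetical, the handling of the image border (shifting the last window so that it ends exactly at the border) is an assumption, and a dummy model that returns uniform probabilities stands in for the trained network.

```python
import numpy as np


def window_starts(length, patch, step):
    """Start indices along one axis; the last window is shifted to end at the border."""
    starts = list(range(0, length - patch + 1, step))
    if starts[-1] != length - patch:
        starts.append(length - patch)
    return starts


def sliding_window_predict(image, predict_fn, patch, step, num_classes):
    """Average overlapping patch predictions into a full probability map.

    image: (D, H, W, M) volume with M modalities, each spatial side >= patch.
    predict_fn: maps a (patch, patch, patch, M) array to per-voxel class
    probabilities of shape (patch, patch, patch, num_classes).
    """
    d, h, w = image.shape[:3]
    prob_sum = np.zeros((d, h, w, num_classes))
    counts = np.zeros((d, h, w, 1))
    for z in window_starts(d, patch, step):
        for y in window_starts(h, patch, step):
            for x in window_starts(w, patch, step):
                sl = (slice(z, z + patch), slice(y, y + patch), slice(x, x + patch))
                prob_sum[sl] += predict_fn(image[sl])
                counts[sl] += 1
    return prob_sum / counts  # every voxel is covered at least once


def dummy_model(patch):
    # Placeholder for the trained network: uniform probabilities over 4 classes.
    return np.full(patch.shape[:3] + (4,), 0.25)


volume = np.zeros((20, 24, 28, 2))  # toy size; 2 channels for T1 and T2
probs = sliding_window_predict(volume, dummy_model, patch=16, step=8, num_classes=4)
```

Keeping a running sum and a per-voxel count avoids storing every patch prediction before averaging.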

IV Results and Discussion

We perform experiments to evaluate our model and demonstrate its effectiveness. Section IV-A introduces our dataset. Section IV-B describes the baseline model and the evaluation methods used in our experiments. We compare our proposed model with the baseline in Section IV-C. The ablation study in Section IV-D shows that each single part of our model improves the baseline. The trade-off between the inference speed and accuracy is explored in Section IV-E, based on different overlapping step sizes. The impact of patch size is analyzed in Section IV-F.

Modality | Direction | #Slices | TR/TE | Flip Angle | Resolution
T1 | Sagittal | 144 | 1900/4.38 ms | | 1 mm × 1 mm × 1 mm
T2 | Axial | 64 | 7380/119 ms | | 1.25 mm × 1.25 mm × 1.95 mm
DW | Axial | 60 | 7680/82 ms | - | 2 mm × 2 mm × 2 mm
TABLE I: The parameters of the scanning process for T1, T2 and diffusion-weighted (DW) MR images.

Model | CSF | GM | WM | Average
Baseline | 0.9250±0.0118 | 0.9084±0.0056 | 0.8926±0.0119 | 0.9087±0.0066
Our Model | 0.9530±0.0074 | 0.9245±0.0049 | 0.9102±0.0101 | 0.9292±0.0050
TABLE II: Comparison of segmentation performance between our proposed model and the baseline model in terms of DR. The leave-one-subject-out cross-validation is used. Larger values indicate better performance.

Model | CSF | GM | WM | Average
Baseline | 0.3417±0.0245 | 0.6537±0.0483 | 0.4817±0.0454 | 0.4924±0.0345
Our Model | 0.2554±0.0207 | 0.5950±0.0428 | 0.4454±0.0040 | 0.4319±0.0313
TABLE III: Comparison of segmentation performance between our proposed model and the baseline model in terms of 3D-MHD. The leave-one-subject-out cross-validation is used. Smaller values indicate better performance.

IV-A Data Acquisition and Preprocessing

We conduct experiments on a dataset consisting of multimodality brain MR images. To be specific, the dataset contains T1 and T2 MR images from 10 healthy 6-month-old infants. The data collection from these infants was authorized by their parents with written consent forms. Institutional Review Board (IRB) approval was obtained for these experiments. A Siemens 3T head-only MR scanner with a circular polarized head coil was used to scan the infants' brains to acquire T1, T2 and diffusion-weighted (DW) MR images. During the scanning process, the infants were asleep and wore ear protection. A vacuum-fixation device was applied to protect their heads. Table I provides the settings used when scanning each of the three modalities. Specifically for DW images, 42 noncollinear diffusion gradients with a diffusion weight of 1000 s/mm² were employed. In addition, 7 non-diffusion-weighted reference scans were acquired.

To align T2 images with T1 images of the same infant, we first performed a rigid alignment with the help of distortion-corrected DW images and then up-sampled the T2 images to an isotropic resolution of 1 mm × 1 mm × 1 mm. If moderate or severe motion artifacts [44] were detected, the images were re-scanned. After the alignment, an intensity inhomogeneity correction [45] was applied to the T1 and aligned T2 images. Finally, the skull, cerebellum, and brain stem were removed by in-house tools, the Brain Surface Extractor (BSE) [46] and the Brain Extraction Tool (BET) [47]. The skull stripping [48] results were reviewed and manually edited by a trained rater using ITK-SNAP [49] to ensure the removal of non-brain tissues.

The segmentation labels for training were initially generated by a publicly available infant brain segmentation software named iBEAT [50]. Careful manual editing was further performed by an experienced rater in order to correct possible errors. Specifically, under the guidance of an experienced neuroradiologist, segmentation errors and geometric defects were corrected using ITK-SNAP, with the help of surface rendering. Generally, the correction took almost one week for one subject. The segmentation labels are composed of 4 classes, i.e., CSF, GM, WM as well as a background (BG) class. Fig. 1 provides an example of the data.

IV-B Experimental Setup

We re-implement the CC-3D-FCN model [8] as our baseline. CC-3D-FCN is a 3D fully convolutional network (3D-FCN) with convolution and concatenate (CC) skip connections, which is designed for 3D multimodality isointense infant brain image segmentation. It has been shown to outperform traditional machine learning methods, such as FMRIB's automated segmentation tool (FAST) [19], majority voting (MV) [20], random forest (RF) [21] and random forest with auto-context model (LINKS) [22]. Moreover, the study in [8] showed the superiority of CC-3D-FCN over previous deep learning models, like 2D and 3D CNNs [7], DeepMedic [36], and the original 3D U-Net [34]. Therefore, it is appropriate to use CC-3D-FCN as the baseline of our experiments.

In our experiments, we employ the Dice ratio (DR) and 3D modified Hausdorff distance (3D-MHD) as the evaluation metrics. These two methods evaluate the accuracy only for binary segmentation tasks, so it is required to transform the 4-class segmentation task into 4 binary segmentation tasks for evaluation. That is, a 3D binary segmentation map should be constructed for each class, where 1 denotes the voxel in the position belongs to the class and 0 means the opposite. In our experiments, we derive binary segmentation maps directly from 4-class segmentation maps. The evaluation is performed on binary segmentation maps for CSF, GM and WM.

Let $P$ and $L$ represent the predicted binary segmentation map for one class and the corresponding ground truth label, respectively. The DR is given by

$$\mathrm{DR} = \frac{2\,|P \cap L|}{|P| + |L|},$$

where $|\cdot|$ denotes the number of 1's in a segmentation map and $|P \cap L|$ means the number of 1's shared by $P$ and $L$. Apparently, DR is a value in $[0, 1]$ and a larger DR indicates a more accurate segmentation.
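The evaluation procedure described above, deriving one binary map per class and computing the Dice ratio on it, can be sketched as follows. The label order (0 = BG, 1 = CSF, 2 = GM, 3 = WM), the helper names and the handling of two empty maps are assumptions made for this example.

```python
import numpy as np


def binary_map(label_map, class_id):
    """1 where the voxel belongs to class_id, 0 elsewhere."""
    return (label_map == class_id).astype(np.uint8)


def dice_ratio(pred, truth):
    """Twice the number of shared 1's divided by the total number of 1's in both maps."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # convention chosen here: two empty maps agree perfectly
    return 2.0 * np.logical_and(pred, truth).sum() / denom


# Toy 4-class label maps of shape (2, 2, 2).
truth = np.array([[[0, 1], [2, 3]], [[1, 1], [2, 2]]])
pred = np.array([[[0, 1], [2, 2]], [[1, 0], [2, 2]]])

# Evaluate CSF, GM and WM; the background class is skipped.
scores = {c: dice_ratio(binary_map(pred, c), binary_map(truth, c)) for c in (1, 2, 3)}
```

On this toy input the scores are 0.8 for class 1, 6/7 for class 2 and 0 for class 3, since the single voxel of class 3 is missed.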

The modified Hausdorff distance (MHD) [51] is designed to compute the similarity between two objects. Here, an object is a set of points where a point is represented by a vector. Specifically, given two sets of vectors $P$ and $L$, MHD is computed by

$$\mathrm{MHD}(P, L) = \max\{d(P, L),\, d(L, P)\},$$

where the distance between two sets is defined as

$$d(P, L) = \frac{1}{|P|} \sum_{p \in P} d(p, L),$$

and the distance between a vector and a set is defined as

$$d(p, L) = \min_{l \in L} \lVert p - l \rVert.$$

Previous studies [22, 7, 8] applied MHD for evaluation by treating a 3D $D \times H \times W$ map as $H \times W$ $D$-dimensional vectors. However, there are two more ways to vectorize the 3D map, depending on the direction of forming vectors, i.e., $D \times W$ $H$-dimensional vectors and $D \times H$ $W$-dimensional vectors. Each vectorization leads to different evaluation results by MHD. To make it a direction-independent evaluation metric like DR, we define 3D-MHD, which computes the averaged MHD based on the three different vectorizations. A smaller 3D-MHD indicates a higher segmentation accuracy.
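The sketch below shows one way to compute MHD and its three-direction average with NumPy. It follows the standard MHD definition of [51] with the Euclidean distance; how exactly a binary 3D map is cut into vectors along each axis is our reading of the description above, and the function names are hypothetical. The pairwise distance matrix makes it suitable only for small examples.

```python
import numpy as np


def directed_distance(a, b):
    """Mean over vectors in a of the Euclidean distance to the nearest vector in b."""
    diffs = a[:, None, :] - b[None, :, :]          # (|a|, |b|, dim)
    dists = np.sqrt((diffs ** 2).sum(axis=-1))     # pairwise distances
    return dists.min(axis=1).mean()


def mhd(a, b):
    """Modified Hausdorff distance between two sets of vectors (rows)."""
    return max(directed_distance(a, b), directed_distance(b, a))


def mhd_3d(pred, truth):
    """Average MHD over the three axis-wise vectorizations of binary 3D maps."""
    values = []
    for axis in range(3):
        # Put the chosen axis last so that each row is one vector along it.
        p = np.moveaxis(pred, axis, -1).reshape(-1, pred.shape[axis]).astype(float)
        t = np.moveaxis(truth, axis, -1).reshape(-1, truth.shape[axis]).astype(float)
        values.append(mhd(p, t))
    return float(np.mean(values))


truth = np.zeros((2, 2, 2), dtype=np.uint8)
pred = truth.copy()
pred[0, 0, 0] = 1  # one wrongly predicted voxel

same = mhd_3d(truth, truth)
near = mhd_3d(pred, truth)
```

In this toy case each vectorization contains four vectors, exactly one of which differs by a distance of 1, so every directional MHD and hence the average equals 0.25.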

IV-C Comparison with the Baseline

We compare our proposed model with the baseline on our dataset. The patch size and the overlapping step size for inference are set following [8]. To remove the bias of different subjects, the leave-one-subject-out cross-validation is used for evaluating segmentation performance. That is, for the 10 subjects in our dataset, we train and evaluate models 10 times. Each time, one of the 10 subjects is left out for validation and the other 9 subjects are used for training. The mean and standard deviation of the segmentation performance over the 10 runs are reported.
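The leave-one-subject-out protocol amounts to the following loop. This is a schematic sketch: the subject numbering and the function name are hypothetical, and training and evaluation are left as a comment.

```python
def leave_one_subject_out(subject_ids):
    """Yield (training_ids, held_out_id) pairs, one per subject."""
    for held_out in subject_ids:
        training = [s for s in subject_ids if s != held_out]
        yield training, held_out


subjects = list(range(1, 11))  # 10 subjects; the numbering is hypothetical
folds = list(leave_one_subject_out(subjects))

for training, held_out in folds:
    # A real run would train on `training`, evaluate DR and 3D-MHD on `held_out`,
    # and finally report the mean and standard deviation over the 10 scores.
    assert held_out not in training and len(training) == 9
```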

Tables II and III provide the experimental results. In terms of both evaluation metrics, our model achieves a significant improvement over the baseline model. Due to the small variances of the results, we focus on one of the 10 runs for visualization and other studies, where the models are trained on the first 9 subjects and evaluated on the 10th subject. A visualization of the segmentation results in this run is given in Fig. 6. By comparing the areas in red circles, we can see that our model captures more details than the baseline model. We also visualize the training processes to illustrate the superiority of our model. Fig. 5 shows the training and validation curves of our model and the baseline model in this run. Clearly, our model converges faster and to a lower training loss. In addition, the better validation results indicate that our model does not suffer from over-fitting.

To further show the efficiency of our proposed model, we compare the number of parameters in Table IV. Our model uses about 28% fewer parameters than CC-3D-FCN and achieves better performance. A comparison of inference time is also provided in Table V. The settings of our device are: GPU: Nvidia Titan Xp 12GB; CPU: Intel Xeon E5-2620v4 2.10GHz; operating system: Ubuntu 16.04.3 LTS.

Since our data has been used as the training data in the iSeg-2017 challenge, we also compare the results evaluated on the 13 testing subjects in Table VI. According to the leader board, our model achieves the best results for WM and GM and comparable results for CSF.

Model | Number of Parameters
Baseline [8] | 2,534,276
Our Model | 1,821,124
TABLE IV: Comparison of the number of parameters between our proposed model and the baseline model.

Model | Inference Time (min)
Baseline [8] | 3.85±0.15
Our Model | 3.06±0.12
TABLE V: Comparison of inference time between our proposed model and the baseline model. The leave-one-subject-out cross-validation is used, with a fixed patch size and overlapping step size for inference.
Fig. 5: Comparison of training processes and validation results between our proposed model and the baseline model when training on the first 9 subjects and using the 10th subject for validation.

Model | CSF | GM | WM
Baseline | 0.9324±0.0067 | 0.9146±0.0074 | 0.8974±0.0123
Our Model | 0.9557±0.0060 | 0.9219±0.0089 | 0.9044±0.0153
TABLE VI: Comparison of segmentation performance on the 13 testing subjects of iSeg-2017 between our proposed model and the baseline model in terms of DR. Larger values indicate better performance.

Model | CSF | GM | WM | Average
Baseline | 0.9235 | 0.9085 | 0.8639 | 0.8986
Model1 | 0.9585 | 0.9099 | 0.8625 | 0.9103
Model2 | 0.9568 | 0.9172 | 0.8728 | 0.9156
Model3 | 0.9576 | 0.9198 | 0.8749 | 0.9174
Model4 | 0.9578 | 0.9210 | 0.8769 | 0.9186
Model5 | 0.9554 | 0.9225 | 0.8804 | 0.9194
Our Model | 0.9572 | 0.9278 | 0.8867 | 0.9239
TABLE VII: Ablation study by comparing segmentation performance between different models in terms of DR. All models are trained on the first 9 subjects and evaluated on the 10th subject. Larger values indicate better performance. Details of models are provided in Section IV-D.

Model | CSF | GM | WM | Average
Baseline | 0.3422 | 0.6331 | 0.4541 | 0.4765
Model1 | 0.2363 | 0.6277 | 0.4705 | 0.4448
Model2 | 0.2404 | 0.6052 | 0.4480 | 0.4312
Model3 | 0.2392 | 0.5993 | 0.4429 | 0.4271
Model4 | 0.2397 | 0.5926 | 0.4336 | 0.4220
Model5 | 0.2444 | 0.5901 | 0.4288 | 0.4211
Our Model | 0.2477 | 0.5692 | 0.4062 | 0.4077
TABLE VIII: Ablation study by comparing segmentation performance between different models in terms of 3D-MHD. All models are trained on the first 9 subjects and evaluated on the 10th subject. Smaller values indicate better performance. Details of models are provided in Section IV-D.

IV-D Ablation Study of Different Modules

We perform an ablation study to show the effectiveness of each part in our proposed model. Specifically, we compare the following models in addition to our model and the baseline:

Model1 is a 3D U-Net without short-range residual connections. Down-sampling and up-sampling are implemented by convolutions and deconvolutions with a stride of 2, respectively. The bottom block is simply a convolutional layer.

Model2 is Model1 with short-range residual connections, i.e., the blocks in Fig. 3(a) and (b) are applied. The bottom block and up-sampling blocks are the same as those in Model1.

Model3 replaces the first up-sampling block in Model2 with the block in Fig. 3(d).

Model4 replaces both up-sampling blocks in Model2 with the block in Fig. 3(d).

Model5 replaces the bottom block in Model2 with the block in Fig. 3(c).
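The five variants above differ in only three design choices, which can be written down as a small configuration table. The sketch below is a bookkeeping aid, not the authors' code: the class and field names are hypothetical, and the full model is left out because its configuration is described in Section III rather than in this list.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AblationVariant:
    name: str
    residual: bool         # short-range residual connections, Fig. 3(a) and (b)
    fig3d_upsampling: int  # how many of the two up-sampling blocks use Fig. 3(d)
    fig3c_bottom: bool     # bottom block uses Fig. 3(c) instead of a convolution


VARIANTS = {
    v.name: v
    for v in (
        AblationVariant("Model1", residual=False, fig3d_upsampling=0, fig3c_bottom=False),
        AblationVariant("Model2", residual=True, fig3d_upsampling=0, fig3c_bottom=False),
        AblationVariant("Model3", residual=True, fig3d_upsampling=1, fig3c_bottom=False),
        AblationVariant("Model4", residual=True, fig3d_upsampling=2, fig3c_bottom=False),
        AblationVariant("Model5", residual=True, fig3d_upsampling=0, fig3c_bottom=True),
    )
}


def changed_fields(base, other):
    """Design choices on which two variants differ, as (base, other) pairs."""
    a, b = asdict(base), asdict(other)
    return {k: (a[k], b[k]) for k in a if k != "name" and a[k] != b[k]}


for name in ("Model3", "Model4", "Model5"):
    print(name, changed_fields(VARIANTS["Model2"], VARIANTS[name]))
```

Each of Model3, Model4 and Model5 differs from Model2 in exactly one field, so any difference in their scores relative to Model2 can be attributed to that single block replacement.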

All models are trained on the first 9 subjects. We report the segmentation performance on the held-out subject in Table VII and Table VIII. The results indicate that the U-Net framework, residual blocks and global aggregation blocks introduced in Section III all provide an improvement over the baseline in terms of average segmentation accuracy, and that our full model achieves the best average DR and 3D-MHD.
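To read the ablation numbers more easily, the following sketch recomputes the Average column of Table VII from the per-tissue scores and lists each model's gain in average DR over a chosen reference row. The values are copied from Table VII; the helper names are hypothetical.

```python
# DR per tissue (CSF, GM, WM) and the reported Average, copied from Table VII.
TABLE_VII = {
    "Baseline": ((0.9235, 0.9085, 0.8639), 0.8986),
    "Model1": ((0.9585, 0.9099, 0.8625), 0.9103),
    "Model2": ((0.9568, 0.9172, 0.8728), 0.9156),
    "Model3": ((0.9576, 0.9198, 0.8749), 0.9174),
    "Model4": ((0.9578, 0.9210, 0.8769), 0.9186),
    "Model5": ((0.9554, 0.9225, 0.8804), 0.9194),
    "Our Model": ((0.9572, 0.9278, 0.8867), 0.9239),
}


def mean(values):
    return sum(values) / len(values)


def gain_over(table, reference):
    """Average-DR difference between every other row and the reference row."""
    base = table[reference][1]
    return {
        name: round(avg - base, 4)
        for name, (_, avg) in table.items()
        if name != reference
    }


for name, (per_tissue, reported) in TABLE_VII.items():
    # The Average column should be the mean of the three tissue scores.
    assert abs(mean(per_tissue) - reported) < 1e-4, name

print(gain_over(TABLE_VII, "Baseline"))
print(gain_over(TABLE_VII, "Model2"))
```

The Average column is consistent with the per-tissue scores for every row. Relative to Model2, the single-block replacements in Model3, Model4 and Model5 add 0.0018, 0.0030 and 0.0038 average DR, respectively, on this one validation subject.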

Fig. 6: Visualization of the segmentation results on the held-out subject by our proposed model and the baseline model. Both models are trained on the first 9 subjects. The first column shows the original segmentation maps. The second, third and fourth columns show the binary segmentation maps for CSF, GM and WM, respectively.
Fig. 7: Changes of segmentation performance in terms of DR, with respect to different overlapping step sizes during inference. The model is trained on the first 9 subjects and evaluated on the held-out subject.

IV-E Impact of the Overlapping Step Size

As discussed in Section III-D, a small overlapping step size usually results in better segmentation, due to the ensemble effect. However, a small overlapping step size also requires inference on more validation patches, which slows inference down. We explore this trade-off by setting the overlapping step size to 4, 8, 16 and 32. Again, we train our model on the first 9 subjects and perform evaluation on the held-out subject, with the patch size held fixed. These step sizes require 11880, 1920, 387 and 80 patches to be processed during inference, respectively, as shown in Fig. 8. In addition, Fig. 7 plots the changes of segmentation performance in terms of DR. Step sizes of 8 and 16 offer a good balance between segmentation accuracy and inference speed.
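The trade-off between step size and patch count can be illustrated with a minimal sliding-window sketch. It assumes that overlapping predictions are combined by averaging, which is one common way to obtain an ensemble effect, and it places a final window flush with the border of each axis. The volume shape, the patch size and all function names are illustrative assumptions, so the counts it prints are not expected to reproduce the numbers reported above.

```python
def window_starts(length, patch, step):
    """Start offsets of windows of size `patch` along one axis."""
    if step < 1 or step > patch:
        raise ValueError("step must satisfy 1 <= step <= patch")
    if patch >= length:
        return [0]
    starts = list(range(0, length - patch + 1, step))
    if starts[-1] != length - patch:
        # Add a last window that ends exactly at the border.
        starts.append(length - patch)
    return starts


def count_patches(shape, patch, step):
    """Number of patches needed to cover a volume of the given shape."""
    total = 1
    for dim in shape:
        total *= len(window_starts(dim, patch, step))
    return total


def sliding_average(signal, patch, step, predict_patch):
    """1D illustration: average the predictions of all windows per position."""
    sums = [0.0] * len(signal)
    votes = [0] * len(signal)
    for start in window_starts(len(signal), patch, step):
        output = predict_patch(signal[start:start + patch])
        for offset, value in enumerate(output):
            sums[start + offset] += value
            votes[start + offset] += 1
    return [s / v for s, v in zip(sums, votes)]


# Hypothetical 64 x 64 x 64 volume with 32 x 32 x 32 patches.
for step in (4, 8, 16, 32):
    print(step, count_patches((64, 64, 64), 32, step))
```

For this hypothetical volume the patch count grows from 8 at a step of 32 to 729 at a step of 4, since halving the step roughly doubles the number of windows along each of the three axes. In `sliding_average`, a smaller step means more windows vote on each position, which is the source of the averaging benefit.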

Fig. 8: Changes of the number of validation patches for the held-out subject, with respect to different overlapping step sizes during inference.

IV-F Impact of the Patch Size

The patch size affects the total number of distinct training samples. Meanwhile, it controls the range of available global information when performing segmentation for a patch. To choose the appropriate patch size for our model, we perform a grid search by training on the first 9 subjects and evaluating on the held-out subject with the overlapping step size of 8. Experiments are conducted with five different patch sizes, and the results are provided in Fig. 9. The best-performing patch size is selected as the default setting of our model.
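Both effects of the patch size can be quantified with a short sketch. It assumes that training patches may be cropped at any voxel offset, so the number of distinct samples per volume is the number of valid crop positions. The volume shape, the candidate sizes and the function names are hypothetical.

```python
def distinct_patches(shape, patch):
    """Number of distinct crop positions for a cubic patch in one volume."""
    count = 1
    for dim in shape:
        count *= max(dim - patch + 1, 0)
    return count


def context_fraction(shape, patch):
    """Share of the volume's voxels that a single patch can see."""
    volume = 1
    visible = 1
    for dim in shape:
        volume *= dim
        visible *= min(patch, dim)
    return visible / volume


# Hypothetical 64 x 64 x 64 volume and candidate patch sizes.
for size in (16, 32, 48):
    print(size, distinct_patches((64, 64, 64), size), context_fraction((64, 64, 64), size))
```

In this example, going from a patch size of 16 to 48 reduces the number of distinct crops from 117649 to 4913 while increasing the visible share of the volume from about 1.6% to about 42%, which is the tension that the grid search resolves empirically.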

Fig. 9: Changes of segmentation performance in terms of DR, with respect to different patch sizes. The model is trained on the first 9 subjects and evaluated on the held-out subject.

V Conclusion

In this work, we investigate 3D multimodality isointense infant brain MR image segmentation. Existing models lack an efficient and effective way to aggregate global information and suffer from information loss during up-sampling operations, which limits their performance. To address these problems, we propose a global aggregation block, which can be flexibly used for global information fusion, and build a novel model based on 3D U-Net. Thorough experiments indicate that our model outperforms the previous best model significantly. In addition, the ablation study shows that each part of our design yields an improvement and that our model effectively takes advantage of all of them.


  • [1] K. Zilles, E. Armstrong, A. Schleicher, and H.-J. Kretschmann, “The human pattern of gyrification in the cerebral cortex,” Anatomy and embryology, vol. 179, no. 2, pp. 173–179, 1988.
  • [2] T. Paus, D. Collins, A. Evans, G. Leonard, B. Pike, and A. Zijdenbos, “Maturation of white matter in the human brain: a review of magnetic resonance studies,” Brain research bulletin, vol. 54, no. 3, pp. 255–266, 2001.
  • [3] G. Li, L. Wang, F. Shi, A. E. Lyall, W. Lin, J. H. Gilmore, and D. Shen, “Mapping longitudinal development of local cortical gyrification in infants from birth to 2 years of age,” Journal of Neuroscience, vol. 34, no. 12, pp. 4228–4238, 2014.
  • [4] H. C. Hazlett, M. D. Poe, G. Gerig, M. Styner, C. Chappell, R. G. Smith, C. Vachet, and J. Piven, “Early brain overgrowth in autism associated with an increase in cortical surface area before age 2 years,” Archives of general psychiatry, vol. 68, no. 5, pp. 467–476, 2011.
  • [5] A. E. Lyall, F. Shi, X. Geng, S. Woolson, G. Li, L. Wang, R. M. Hamer, D. Shen, and J. H. Gilmore, “Dynamic development of regional cortical thickness and surface area in early childhood,” Cerebral cortex, vol. 25, no. 8, pp. 2204–2212, 2014.
  • [6] W. Gao, J. H. Gilmore, D. Shen, J. K. Smith, H. Zhu, and W. Lin, “The synchronization within and interaction between the default and dorsal attention networks in early infancy,” Cerebral cortex, vol. 23, no. 3, pp. 594–603, 2012.
  • [7]

    W. Zhang, R. Li, H. Deng, L. Wang, W. Lin, S. Ji, and D. Shen, “Deep convolutional neural networks for multi-modality isointense infant brain image segmentation,”

    NeuroImage, vol. 108, pp. 214–224, 2015.
  • [8] D. Nie, L. Wang, E. Adeli, C. Lao, W. Lin, and D. Shen, “3-d fully convolutional networks for multimodal isointense infant brain image segmentation,” IEEE Transactions on Cybernetics, 2018.
  • [9] J. Nie, G. Li, L. Wang, J. H. Gilmore, W. Lin, and D. Shen, “A computational growth model for measuring dynamic cortical development in the first year of life,” Cerebral cortex, vol. 22, no. 10, pp. 2272–2284, 2011.
  • [10] N. I. Weisenfeld and S. K. Warfield, “Automatic segmentation of newborn brain mri,” Neuroimage, vol. 47, no. 2, pp. 564–572, 2009.
  • [11] H. Xue, L. Srinivasan, S. Jiang, M. Rutherford, A. D. Edwards, D. Rueckert, and J. V. Hajnal, “Automatic segmentation and reconstruction of the cortex from neonatal mri,” Neuroimage, vol. 38, no. 3, pp. 461–477, 2007.
  • [12] L. Gui, R. Lisowski, T. Faundez, P. S. Hüppi, F. Lazeyras, and M. Kocher, “Morphology-driven automatic segmentation of mr images of the neonatal brain,” Medical image analysis, vol. 16, no. 8, pp. 1565–1579, 2012.
  • [13] M. J. Cardoso, A. Melbourne, G. S. Kendall, M. Modat, N. J. Robertson, N. Marlow, and S. Ourselin, “Adapt: an adaptive preterm segmentation algorithm for neonatal brain mri,” NeuroImage, vol. 65, pp. 97–108, 2013.
  • [14] F. Shi, Y. Fan, S. Tang, J. H. Gilmore, W. Lin, and D. Shen, “Neonatal brain image segmentation in longitudinal mri studies,” Neuroimage, vol. 49, no. 1, pp. 391–400, 2010.
  • [15] Z. Song, S. P. Awate, D. J. Licht, and J. C. Gee, “Clinical neonatal brain mri segmentation using adaptive nonparametric data models and intensity-based markov priors,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2007, pp. 883–890.
  • [16] L. Wang, F. Shi, P.-T. Yap, W. Lin, J. H. Gilmore, and D. Shen, “Longitudinally guided level sets for consistent tissue segmentation of neonates,” Human brain mapping, vol. 34, no. 4, pp. 956–972, 2013.
  • [17] S. H. Kim, V. S. Fonov, C. Dietrich, C. Vachet, H. C. Hazlett, R. G. Smith, M. M. Graves, J. Piven, J. H. Gilmore, S. R. Dager et al.

    , “Adaptive prior probability and spatial temporal intensity change estimation for segmentation of the one-year-old human brain,”

    Journal of neuroscience methods, vol. 212, no. 1, pp. 43–55, 2013.
  • [18] L. Wang, F. Shi, P.-T. Yap, J. H. Gilmore, W. Lin, and D. Shen, “4d multi-modality tissue segmentation of serial infant images,” PloS one, vol. 7, no. 9, p. e44596, 2012.
  • [19]

    Y. Zhang, M. Brady, and S. Smith, “Segmentation of brain mr images through a hidden markov random field model and the expectation-maximization algorithm,”

    IEEE transactions on medical imaging, vol. 20, no. 1, pp. 45–57, 2001.
  • [20] A. Criminisi, J. Shotton, E. Konukoglu et al.

    , “Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning,”

    Foundations and Trends® in Computer Graphics and Vision, vol. 7, no. 2–3, pp. 81–227, 2012.
  • [21] A. Criminisi and J. Shotton,

    Decision forests for computer vision and medical image analysis

    .   Springer Science & Business Media, 2013.
  • [22] L. Wang, Y. Gao, F. Shi, G. Li, J. H. Gilmore, W. Lin, and D. Shen, “Links: Learning-based multi-source integration framework for segmentation of infant brain images,” NeuroImage, vol. 108, pp. 160–172, 2015.
  • [23] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in

    Proceedings of the IEEE conference on computer vision and pattern recognition

    , 2015, pp. 3431–3440.
  • [24] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [25] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018.
  • [26] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
  • [27] Z. Wang and S. Ji, “Smoothed dilated convolutions for improved dense prediction,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.   ACM, 2018, pp. 2486–2495.
  • [28] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 6000–6010.
  • [30] T. M. Quan, D. G. Hildebrand, and W.-K. Jeong, “Fusionnet: A deep fully residual convolutional neural network for image segmentation in connectomics,” arXiv preprint arXiv:1612.05360, 2016.
  • [31] A. Fakhry, T. Zeng, and S. Ji, “Residual deconvolutional networks for brain electron microscopy image segmentation,” IEEE transactions on medical imaging, vol. 36, no. 2, pp. 447–456, 2017.
  • [32] K. Lee, J. Zung, P. Li, V. Jain, and H. S. Seung, “Superhuman accuracy on the snemi3d connectomics challenge,” arXiv preprint arXiv:1706.00120, 2017.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [34] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense volumetric segmentation from sparse annotation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2016, pp. 424–432.
  • [35] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3D Vision (3DV), 2016 Fourth International Conference on.   IEEE, 2016, pp. 565–571.
  • [36] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, “Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation,” Medical image analysis, vol. 36, pp. 61–78, 2017.
  • [37] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” arXiv preprint arXiv:1711.07971, 2017.
  • [38] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision.   Springer, 2016, pp. 630–645.
  • [39] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [40]

    A. Krizhevsky and G. Hinton, “Convolutional deep belief networks on cifar-10,”

    Unpublished manuscript, vol. 40, p. 7, 2010.
  • [41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [42] A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Advances in neural information processing systems, 1992, pp. 950–957.
  • [43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [44] J. D. Blumenthal, A. Zijdenbos, E. Molloy, and J. N. Giedd, “Motion artifact in magnetic resonance imaging: implications for automated analysis,” Neuroimage, vol. 16, no. 1, pp. 89–92, 2002.
  • [45] J. G. Sled, A. P. Zijdenbos, and A. C. Evans, “A nonparametric method for automatic correction of intensity nonuniformity in mri data,” IEEE transactions on medical imaging, vol. 17, no. 1, pp. 87–97, 1998.
  • [46] D. W. Shattuck and R. M. Leahy, “Automated graph-based analysis and correction of cortical volume topology,” IEEE transactions on medical imaging, vol. 20, no. 11, pp. 1167–1177, 2001.
  • [47] S. M. Smith, “Fast robust automated brain extraction,” Human brain mapping, vol. 17, no. 3, pp. 143–155, 2002.
  • [48] F. Shi, L. Wang, Y. Dai, J. H. Gilmore, W. Lin, and D. Shen, “Label: pediatric brain extraction using learning-based meta-algorithm,” Neuroimage, vol. 62, no. 3, pp. 1975–1986, 2012.
  • [49] P. A. Yushkevich, J. Piven, H. C. Hazlett, R. G. Smith, S. Ho, J. C. Gee, and G. Gerig, “User-guided 3d active contour segmentation of anatomical structures: significantly improved efficiency and reliability,” Neuroimage, vol. 31, no. 3, pp. 1116–1128, 2006.
  • [50] Y. Dai, F. Shi, L. Wang, G. Wu, and D. Shen, “ibeat: a toolbox for infant brain magnetic resonance image processing,” Neuroinformatics, vol. 11, no. 2, pp. 211–225, 2013.
  • [51] M.-P. Dubuisson and A. K. Jain, “A modified hausdorff distance for object matching,” in Pattern Recognition, 1994. Vol. 1-Conference A: Computer Vision & Image Processing., Proceedings of the 12th IAPR International Conference on, vol. 1.   IEEE, 1994, pp. 566–568.