I Introduction
Human brains undergo rapid tissue growth and a dramatic development of cognitive and motor functions during the first year after birth [1, 2]. Studying human brain development in this phase is essential to understanding the human brain and has thus attracted considerable attention [3, 4, 5, 6]. Recently, the availability of infant brain magnetic resonance (MR) image data has increased, which has motivated studies that provide more insight into both normal and diseased brain growth [3]. For example, the Baby Connectome Project (http://babyconnectomeproject.org) collects multimodality MR image data of 500 children from birth to five years old. Many important studies build on such data. For instance, identifying early neuromarkers of the risk for disorders such as autism and schizophrenia may help disease prognosis or even prevention [4].
In this research area, an important step is to segment 3D infant brain MR images into cerebrospinal fluid (CSF), gray matter (GM) and white matter (WM) regions [7, 8, 9]. Accurate segmentation enables volumetric quantification of GM and WM structures, which may indicate very early neuroanatomical developmental events [5, 6]. Segmenting infant brain MR images is known to be more difficult than segmenting adult brain MR images, as infant images typically have lower tissue contrast, higher noise levels, severe partial volume effects and ongoing WM myelination [10, 11, 12]. The task is especially challenging in the isointense stage (approximately 6-8 months of age), during which GM and WM exhibit highly similar intensity levels in MR images. While many previous studies addressed segmentation for either the infantile (≤ 3 months) or the early adult-like (≥ 12 months) stage [13, 12, 14, 15, 16, 10, 11], those methods do not work well for the isointense stage. The two leftmost panels of Fig. 1 show a 2D slice of 3D isointense infant brain T1 and T2 MR images, in which the GM and WM regions are hard to separate.
Due to the difficulty of this task, manual segmentation of 3D multimodality isointense infant brain MR images requires expertise and experience and is very time-consuming. This raises the need for automatic segmentation tools. Some studies proposed to use longitudinal datasets [17, 18] to guide the segmentation. However, in most cases the infant brain MR images come from a single time point, so longitudinal datasets are not available. For this setting, machine learning methods [19, 20, 21, 22] that do not rely on longitudinal datasets have been developed. Yet the accuracy and speed of these models are not satisfactory for medical research.
In recent years, deep learning methods such as fully convolutional networks (FCN) [23], U-Net [24], Deeplab [25, 26, 27], and RefineNet [28] have set new performance records on 2D image segmentation. Motivated by these successes in 2D, the study in [8] extended FCN into 3D-FCN to perform 3D segmentation. Their model is composed of an encoder and a decoder. The encoder uses convolutional layers to extract features and pooling layers to reduce the spatial sizes of feature maps. The decoder applies deconvolutional layers to upsample feature maps and finally outputs a segmentation map with the same size as the input. Additionally, they follow the strategy of U-Net and add skip connections. Experimental results indicated that their model, known as the convolution-concatenate 3D-FCN (CC-3D-FCN), achieved promising results on 3D multimodality isointense infant brain MR image segmentation.
We conduct an in-depth study of prior models for this task and observe three limitations shared by most of them.
First, in the encoding part, the feature maps have to be downsampled to a very small size in order to perform effective global information aggregation. For example, the encoder of CC-3D-FCN [8] applies three downsampling layers to its input feature maps, so that a deconvolution whose kernel is as large as the resulting feature maps can cover them entirely and fuse global information. As the number of feature maps usually increases after each downsampling layer, employing more downsampling layers introduces a considerably larger number of trainable parameters, making the model less efficient. In addition, more downsampling layers cause a greater loss of spatial information during encoding, and this information is crucial for image segmentation. On the other hand, reducing the number of downsampling layers leaves feature maps of larger spatial sizes before global information aggregation. In that case, operations based on small local kernels, like convolutions or deconvolutions, may not aggregate global information effectively. A method capable of performing global information aggregation on feature maps of any size would therefore improve both effectiveness and efficiency.
The second limitation is in the decoding part, where the spatial sizes of feature maps gradually increase through upsampling. Deconvolutions are the most popular upsampling operations in previous models. However, deconvolutions apply the same local kernels at each location, without taking global information into consideration. When the input feature maps have a larger spatial size than the kernel, the deconvolution fails to recover all necessary information during upsampling.
The third and last limitation lies in the setting of the number of feature maps in each layer. Empirical results [24] indicate that it is beneficial to have the number of feature maps increase after downsampling and decrease after upsampling. We find that this strategy also works in this task, as demonstrated by the experiments in Section IV-D.
In this work, we address the three limitations and propose a novel model for 3D multimodality isointense infant brain MR image segmentation. To address the first limitation, we propose a global aggregation block based on the self-attention scheme [29], which is able to aggregate global information from feature maps of any size without introducing more parameters. This module is further extended to an upsampling global aggregation block, which alleviates the second limitation. To the best of our knowledge, we are the first to make this extension. To address the third limitation, we build our model on the U-Net [24] framework. We conduct extensive experiments to compare our model with CC-3D-FCN on our dataset. The results and analysis show that our model improves on CC-3D-FCN significantly in terms of segmentation accuracy.
II Related Work
The U-Net [24] architecture incorporates both local and global contextual information through the encoding-decoding process. In the past several years, many variants of U-Net have been developed and have achieved improved performance on biomedical image segmentation. For example, FusionNet [30], the residual deconvolutional network (RDN) [31] and the residual symmetric U-Net [32] addressed the 2D electron microscopy image segmentation task by building a U-Net-based network with additional short-range residual connections [33]. In addition, U-Net was extended from 2D to 3D for volumetric biomedical images, leading to models like 3D U-Net [34], V-Net [35], and CC-3D-FCN [8]. Meanwhile, DeepMedic [36] explored another way to fuse local and global contextual information by removing the decoder of U-Net and employing a dual pathway architecture. However, without the decoder, the outputs have smaller spatial sizes than those of U-Net-based models, which harms inference efficiency since more patches need to be processed. DeepMedic has been outperformed by U-Net-based models like CC-3D-FCN [8]. In this work, we unify previous models and employ the 3D U-Net architecture with short-range residual connections as the basic framework.
The self-attention mechanism was used in the Transformer [29] for machine translation. The Transformer does not apply any recurrence or convolution, based on the insight that the self-attention mechanism makes it easier to learn long-range dependencies in sequences and requires less computation [29]. The study in [37] proposed the space-time non-local block for video classification, where the self-attention mechanism was also employed to capture long-range dependencies. In this work, we explore the self-attention mechanism for a different functionality. As the self-attention mechanism provides an operation with a global receptive field, it can aggregate global information both effectively and efficiently. We further generalize it to perform downsampling and upsampling.
III The Proposed Methods
In this section, we introduce our proposed model. Section III-A illustrates our model framework derived from U-Net [24]. Based on the framework, our model is composed of different blocks. Basic residual blocks are discussed in Section III-B. We then propose the global aggregation block and its extension for upsampling in Section III-C. Finally, Section III-D provides details of the training and inference strategies.
III-A U-Net Framework
We adopt the 3D U-Net [34] as the framework of our proposed model for 3D multimodality isointense infant brain MR image segmentation. An illustration is given in Fig. 2. The input first goes through an encoding input block, which extracts low-level features. Two downsampling blocks then reduce the spatial sizes and produce high-level features. Note that the number of channels is doubled after each downsampling block. A bottom block then aggregates global information and gives the output of the encoder. Correspondingly, the decoder uses two upsampling blocks to recover the spatial sizes for the segmentation output. The number of feature maps is halved after each upsampling operation.
To assist the decoding process, skip connections copy feature maps from the encoder to the decoder. The intuitive way to combine features from the encoder and the decoder is concatenation, as used in CC-3D-FCN [8] and U-Net [24], which provides two sources of inputs to the upsampling operation. In the proposed model, the copied feature maps are instead combined with the decoding feature maps through summation, which has two advantages. First, summation does not increase the number of feature maps, thus reducing the number of trainable parameters in the following layer. Second, skip connections with summation can be considered as long-range residual connections, which are known to facilitate the training of models.
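The parameter saving from summation can be illustrated with a small count. The sketch below assumes that the layer following the skip connection is a 3×3×3 convolution with a bias term and that each branch carries 64 feature maps; both the kernel size and the channel count are illustrative assumptions, not values stated in this paper.

```python
def conv3d_params(in_channels, out_channels, kernel=3, bias=True):
    """Number of trainable parameters in one 3D convolution layer."""
    weights = kernel ** 3 * in_channels * out_channels
    return weights + (out_channels if bias else 0)

channels = 64  # hypothetical number of feature maps on each side of the skip connection

# Summation keeps the channel count, while concatenation doubles the
# number of input channels seen by the following layer.
params_sum = conv3d_params(channels, channels)
params_concat = conv3d_params(2 * channels, channels)

print(params_sum, params_concat)
```

Under these assumptions the layer after a concatenation holds roughly twice as many weights as the same layer after a summation.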
Given the output of the decoder, the output block produces the segmentation probability map. Specifically, for each voxel, it provides the probabilities that the voxel belongs to background (BG), CSF, GM and WM, respectively. The final segmentation map is obtained through a single argmax operation on this probability map. The details of each block are introduced in the following sections.
III-B Residual Blocks
Residual connections have been shown to facilitate the training of deep learning models and achieve better performance [33]. Note that skip connections with summation in our U-Net framework are equivalent to long-range residual connections. To further improve U-Net, the study in [30] proposed to add short-range residual connections as well. A similar strategy was employed in [32, 28, 31]. However, those works did not apply residual connections to downsampling and upsampling blocks. Downsampling blocks with residual connections have been explored in ResNet [33]. We explore the idea for upsampling blocks based on our proposed upsampling global aggregation block, which is discussed in detail in Section III-C.
In our proposed model, four different residual blocks are used to form a fully residual network [28], as shown in Fig. 3. Notably, all of them apply the pre-activation pattern [38]. Fig. 3(a) shows a regular residual block with two consecutive convolutional layers. Batch normalization [39] and ReLU6 [40] are used before each convolutional layer. This block is used as the input block in our framework. The output block is constructed from this block followed by a convolution with a stride of 1. Moreover, we insert one such block after the summation of each skip connection. Fig. 3(b) is a downsampling residual block. A convolution with a stride of 2 replaces the identity residual connection, in order to adjust the spatial sizes of feature maps accordingly. We employ this block as the downsampling blocks. Fig. 3(c) illustrates our bottom block, in which a residual connection is applied on the proposed global aggregation block. The upsampling residual block is provided in Fig. 3(d). Similar to the downsampling block in Fig. 3(b), the identity residual connection is replaced by a deconvolution with a stride of 2, and the other branch is the upsampling global aggregation block. Our model uses this block as the upsampling blocks.
III-C Global Aggregation Block
To achieve global information fusion through a block, each position of the output feature maps should depend on all positions of the input feature maps. Such an operation is the opposite of local operations like convolution and deconvolution, where each output location has a local receptive field on the input. A fully-connected layer does have this global property. However, it is prone to overfitting and does not work well in practice. As introduced in Section II, the self-attention block used in the Transformer computes the output at one position by attending to every position of the input. Later, the study in [37] proposed non-local neural networks for video classification, which employed a similar block. While both studies applied self-attention blocks with the aim of capturing long-term dependencies in sequences, we point out that global information of image feature maps can also be aggregated through self-attention blocks.
Based on this insight, we propose the global aggregation block, which is able to fuse global information from feature maps of any size. We further generalize it to handle downsampling and upsampling, making it a block that can be used anywhere in deep learning models. Let $X$ represent the input to the global aggregation block and $Y$ represent the output. For simplicity, we use $\mathrm{Conv1}_{N}$ to denote a $1\times1\times1$ convolution with a stride of 1 and $N$ output channels. Note that $\mathrm{Conv1}_{N}$ does not change the spatial size. The first step of the proposed block is to generate the query ($Q$), key ($K$) and value ($V$) matrices [29], given by
$$Q = \mathrm{Unfold}(\mathrm{QueryTransform}_{C_K}(X)), \quad K = \mathrm{Unfold}(\mathrm{Conv1}_{C_K}(X)), \quad V = \mathrm{Unfold}(\mathrm{Conv1}_{C_V}(X)), \tag{1}$$
where $\mathrm{Unfold}(\cdot)$ unfolds a $D\times H\times W\times C$ tensor into a $(D\times H\times W)\times C$ matrix, $\mathrm{QueryTransform}_{C_K}(\cdot)$ can be any operation that produces $C_K$ feature maps, and $C_K$ and $C_V$ are hyperparameters representing the dimensions of the keys and values. Suppose the size of $X$ is $D\times H\times W\times C$. Then the dimensions of $K$ and $V$ are $(D\times H\times W)\times C_K$ and $(D\times H\times W)\times C_V$, respectively. The dimension of $Q$, however, is $(D_Q\times H_Q\times W_Q)\times C_K$, where $D_Q$, $H_Q$ and $W_Q$ depend on $\mathrm{QueryTransform}_{C_K}(\cdot)$. The left part of Fig. 4 illustrates this step. Here, a $D\times H\times W\times C$ tensor is represented by a $D\times H\times W$ cube, whose voxels correspond to $C$-dimensional vectors.
Each row of the $Q$, $K$ and $V$ matrices denotes a query vector, a key vector and a value vector, respectively. Note that the query vector has the same dimension as the key vector. Meanwhile, the number of key vectors is the same as that of value vectors, which indicates a one-to-one correspondence. In the second step, the attention mechanism is applied on $Q$, $K$ and $V$ [29], defined as
$$A = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{C_K}}\right), \quad O = AV, \tag{2}$$
where the dimension of the attention weight matrix $A$ is $(D_Q\times H_Q\times W_Q)\times(D\times H\times W)$ and the dimension of the output matrix $O$ is $(D_Q\times H_Q\times W_Q)\times C_V$. To see how it works, we take one query vector from $Q$ as an example. In the attention mechanism, the query vector interacts with all key vectors, where the dot-product between the query vector and one key vector produces a scalar weight for the corresponding value vector. The output for the query vector is a weighted sum of all value vectors, where the weights are normalized through $\mathrm{Softmax}$. This process is repeated for all query vectors and generates $D_Q\times H_Q\times W_Q$ $C_V$-dimensional vectors. This step is illustrated in the box of Fig. 4. Note that Dropout [41] can be applied on $A$ to avoid overfitting. As shown in Fig. 4, the final step of the block computes $Y$ by
$$Y = \mathrm{Conv1}_{C_O}(\mathrm{Fold}(O)), \tag{3}$$
where $\mathrm{Fold}(\cdot)$ is the reverse operation of $\mathrm{Unfold}(\cdot)$ and $C_O$ is a hyperparameter representing the dimension of the outputs. As a result, the size of $Y$ is $D_Q\times H_Q\times W_Q\times C_O$.
It is worth noting that the spatial size of $Y$ is determined by that of the $Q$ matrix, i.e., by the function $\mathrm{QueryTransform}_{C_K}(\cdot)$ in Eq. (1). Therefore, with appropriate $\mathrm{QueryTransform}_{C_K}(\cdot)$ functions, our global aggregation block can be used for same-size processing, downsampling and upsampling. Our proposed model explores two different $\mathrm{QueryTransform}_{C_K}(\cdot)$ functions. For the global aggregation block in Fig. 3(c), $\mathrm{QueryTransform}_{C_K}(\cdot)$ is $\mathrm{Conv1}_{C_K}$. For the upsampling global aggregation block in Fig. 3(d), $\mathrm{QueryTransform}_{C_K}(\cdot)$ is a deconvolution with a stride of 2. The use of this block alleviates the problem that upsampling through a single deconvolution loses information. By taking global information into consideration, the upsampling block is able to recover more accurate details. In our model, we set […].
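The three steps of the block can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scaled dot-product form of [29] is assumed for the attention weights, the per-voxel convolutions with a stride of 1 are written as matrix products over the channel axis, biases and Dropout are omitted, and a nearest-neighbour repetition (`upsample2`, a hypothetical helper) stands in for the stride-2 deconvolution of the upsampling variant. All tensor sizes are toy values.

```python
import numpy as np

def unfold(x):
    """Flatten a (D, H, W, C) tensor into a (D*H*W, C) matrix."""
    return x.reshape(-1, x.shape[-1])

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def upsample2(x):
    """Hypothetical stand-in for a stride-2 deconvolution: repeat each voxel twice per axis."""
    return x.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)

def global_aggregation(x, w_q, w_k, w_v, w_o, spatial_transform=None):
    """x: (D, H, W, C). w_q, w_k: (C, C_K); w_v: (C, C_V); w_o: (C_V, C_O)."""
    # Step 1: queries, keys and values. A per-voxel convolution with stride 1
    # acts on each voxel independently, so it is a matmul over channels.
    x_q = x if spatial_transform is None else spatial_transform(x)
    q = unfold(x_q @ w_q)                        # (N_Q, C_K)
    k = unfold(x @ w_k)                          # (N, C_K)
    v = unfold(x @ w_v)                          # (N, C_V)
    # Step 2: every query attends to all N input positions.
    a = softmax(q @ k.T / np.sqrt(k.shape[1]))   # (N_Q, N), each row sums to 1
    o = a @ v                                    # (N_Q, C_V)
    # Step 3: fold back to the spatial size of the queries, then project channels.
    y = o.reshape(x_q.shape[:3] + (o.shape[1],)) @ w_o
    return y, a

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 5))            # toy input with C = 5
w_q, w_k = rng.standard_normal((5, 6)), rng.standard_normal((5, 6))
w_v, w_o = rng.standard_normal((5, 7)), rng.standard_normal((7, 8))

y_same, a_same = global_aggregation(x, w_q, w_k, w_v, w_o)
y_up, a_up = global_aggregation(x, w_q, w_k, w_v, w_o, spatial_transform=upsample2)
print(y_same.shape, y_up.shape)                  # the spatial size follows the queries
```

The attention matrix has one row per query position and one column per input position, so its memory cost grows with the product of the two.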
III-D Training and Inference Strategies
Our proposed model applies Dropout [41] with a rate of 0.5 in each global aggregation block and in the output block before the final convolution. Weight decay [42] is also employed. To train the model, we use randomly cropped small patches. In this way, we obtain sufficient training data and reduce the memory requirement. No extra data augmentation is needed. The patch size that leads to the best performance is chosen based on the experimental results in Section IV-F. The batch size is set to 5. The Adam optimizer [43] with a learning rate of 0.001 is employed to perform gradient descent.
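Random patch cropping can be sketched as below. The array layout (modalities stacked on the last axis), the cubic patch shape, and the helper name `random_patch` are assumptions for illustration, and the patch size of 16 and the volume size in the toy example are arbitrary.

```python
import numpy as np

def random_patch(image, label, patch_size, rng):
    """Crop one cubic patch at a random location from an image/label pair.

    image: (D, H, W, M) array with M modalities; label: (D, H, W) array.
    """
    # rng.integers excludes the upper bound, so every start keeps the patch inside the volume.
    starts = [rng.integers(0, dim - patch_size + 1) for dim in label.shape]
    window = tuple(slice(s, s + patch_size) for s in starts)
    return image[window], label[window]

rng = np.random.default_rng(0)
image = rng.random((20, 24, 18, 2))                # toy T1/T2 volume
label = rng.integers(0, 4, size=(20, 24, 18))      # toy 4-class label map

# One training batch of 5 patches, matching the batch size used above.
batch = [random_patch(image, label, 16, rng) for _ in range(5)]
```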
In the inference process, following [8], we extract patches with the same size as that used in training. That is, we slide a window of the patch size through the original image with a constant overlapping step size. The overlapping step size must be smaller than or equal to the patch size, in order to guarantee that the extracted patches cover the whole image. Consequently, prediction for all these patches provides segmentation probability results for every voxel in the original image. For voxels that receive multiple results due to overlapping, we average them to produce the final prediction. The overlapping step size is an important hyperparameter affecting the inference speed and the segmentation accuracy. A smaller overlapping step size results in better accuracy, but increases the inference time as more patches are generated. We explore the tradeoff in Section IV-E.
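A minimal sketch of this sliding-window inference with averaging is given below. The treatment of the image border (adding one last window flush with the edge) and the function names are assumptions, since the text does not specify them; `predict_fn` is a hypothetical stand-in for the trained model and must return per-voxel class probabilities for a patch.

```python
import numpy as np

def window_starts(dim, patch, step):
    """Start indices of windows of length `patch` that together cover [0, dim)."""
    starts = list(range(0, dim - patch + 1, step))
    if starts[-1] != dim - patch:      # assumed border rule: add a final flush window
        starts.append(dim - patch)
    return starts

def sliding_window_predict(image, predict_fn, patch, step, num_classes):
    """Average overlapping patch predictions into one (D, H, W, num_classes) map."""
    D, H, W = image.shape[:3]
    prob_sum = np.zeros((D, H, W, num_classes))
    counts = np.zeros((D, H, W, 1))
    for d in window_starts(D, patch, step):
        for h in window_starts(H, patch, step):
            for w in window_starts(W, patch, step):
                win = (slice(d, d + patch), slice(h, h + patch), slice(w, w + patch))
                prob_sum[win] += predict_fn(image[win])
                counts[win] += 1       # how many predictions each voxel has received
    return prob_sum / counts

# Toy check with a voxel-wise "model", so averaging must reproduce its output exactly.
toy_model = lambda p: np.stack([p[..., 0], 1.0 - p[..., 0]], axis=-1)
rng = np.random.default_rng(0)
image = rng.random((10, 9, 8, 1))
probs = sliding_window_predict(image, toy_model, patch=4, step=3, num_classes=2)
segmentation = probs.argmax(axis=-1)
```

With a step size no larger than the patch size, consecutive windows touch or overlap, so every voxel is counted at least once and the division is well defined.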
IV Results and Discussion
We perform experiments to evaluate our model and demonstrate its effectiveness. Section IV-A introduces our dataset. Section IV-B describes the baseline model and the evaluation methods used in our experiments. We compare our proposed model with the baseline in Section IV-C. The ablation study in Section IV-D shows that each single part of our model improves on the baseline. The tradeoff between inference speed and accuracy is explored in Section IV-E, based on different overlapping step sizes. The impact of the patch size is analyzed in Section IV-F.
Table I: Scanning settings for the three modalities.
Modality  Direction  #Slices  TR/TE         Flip Angle  Resolution
T1        Sagittal   144      1900/4.38 ms              1 mm × 1 mm × 1 mm
T2        Axial      64       7380/119 ms               1.25 mm × 1.25 mm × 1.95 mm
DW        Axial      60       7680/82 ms                2 mm × 2 mm × 2 mm
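Table II: Comparison with the baseline in terms of DR (mean±standard deviation over the 10 cross-validation runs).
Model      CSF            GM             WM             Average
Baseline   0.9250±0.0118  0.9084±0.0056  0.8926±0.0119  0.9087±0.0066
Our Model  0.9530±0.0074  0.9245±0.0049  0.9102±0.0101  0.9292±0.0050
Table III: Comparison with the baseline in terms of 3D-MHD (mean±standard deviation over the 10 cross-validation runs).
Model      CSF            GM             WM             Average
Baseline   0.3417±0.0245  0.6537±0.0483  0.4817±0.0454  0.4924±0.0345
Our Model  0.2554±0.0207  0.5950±0.0428  0.4454±0.0040  0.4319±0.0313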
IV-A Data Acquisition and Preprocessing
We conduct experiments on a dataset consisting of multimodality brain MR images. To be specific, the dataset contains T1 and T2 MR images from 10 healthy 6-month-old infants. The data collection from these infants was authorized by their parents with written forms. The approval of the Institutional Review Board (IRB) has been obtained for these experiments. A Siemens 3T head-only MR scanner with a circular polarized head coil was used to scan the infants' brains to acquire T1, T2 and diffusion-weighted (DW) MR images. During the scanning process, the infants were asleep with ear protection. A vacuum-fixation device was applied to protect their heads. Table I provides the settings used when scanning each of the three modalities. Specifically for DW images, 42 non-collinear diffusion gradients with a diffusion weight of 1000 s/mm² were employed. In addition, 7 non-diffusion-weighted reference scans were executed.
To align the T2 images with the T1 images of the same infant, we first performed a rigid alignment with the help of distortion-corrected DW images and then upsampled the T2 images to an isotropic resolution of 1 mm × 1 mm × 1 mm. If moderate or severe motion artifacts [44] were detected, the images were rescanned. After the alignment, an intensity inhomogeneity correction [45] was applied to the T1 and aligned T2 images. Finally, the skull, cerebellum, and brain stem were removed by in-house tools, the Brain Surface Extractor (BSE) [46] and the Brain Extraction Tool (BET) [47]. The skull stripping [48] results were reviewed and manually edited by a trained rater using ITK-SNAP [49] to ensure the removal of non-brain tissues.
The segmentation labels for training were initially generated by a publicly available infant brain segmentation software named iBEAT (http://www.nitrc.org/projects/ibeat) [50]. Careful manual editing was further performed by an experienced rater in order to correct possible errors. Specifically, under the guidance of an experienced neuroradiologist, segmentation errors and geometric defects were corrected using ITK-SNAP, with the help of surface rendering. Generally, the correction took almost one week for one subject. The segmentation labels are composed of 4 classes, i.e., CSF, GM and WM as well as a background (BG) class. Fig. 1 provides an example of the data.
IV-B Experimental Setup
We reimplement the CC-3D-FCN model [8] as our baseline. CC-3D-FCN is a 3D fully convolutional network (3D-FCN) with convolution and concatenate (CC) skip connections, which is designed for 3D multimodality isointense infant brain image segmentation. It has been shown to outperform traditional machine learning methods, such as FMRIB's automated segmentation tool (FAST) [19], majority voting (MV) [20], random forest (RF) [21] and random forest with auto-context model (LINKS) [22]. Moreover, the study in [8] showed the superiority of CC-3D-FCN over previous deep learning models, like 2D and 3D CNNs [7], DeepMedic [36], and the original 3D U-Net [34]. Therefore, it is appropriate to use CC-3D-FCN as the baseline of our experiments.
In our experiments, we employ the Dice ratio (DR) and the 3D modified Hausdorff distance (3D-MHD) as the evaluation metrics. These two metrics evaluate accuracy only for binary segmentation tasks, so the 4-class segmentation task has to be transformed into 4 binary segmentation tasks for evaluation. That is, a 3D binary segmentation map is constructed for each class, where 1 denotes that the voxel at that position belongs to the class and 0 means the opposite. In our experiments, we derive the binary segmentation maps directly from the 4-class segmentation maps. The evaluation is performed on the binary segmentation maps for CSF, GM and WM.
Let $P$ and $L$ represent the predicted binary segmentation map for one class and the corresponding ground truth label, respectively. The DR is given by
$$\mathrm{DR} = \frac{2\,|P \cap L|}{|P| + |L|}, \tag{4}$$
where $|\cdot|$ denotes the number of 1's in a segmentation map and $|P \cap L|$ means the number of 1's shared by $P$ and $L$. DR is a value in $[0, 1]$, and a larger DR indicates a more accurate segmentation.
The modified Hausdorff distance (MHD) [51] is designed to compute the similarity between two objects. Here, an object is a set of points where a point is represented by a vector. Specifically, given two sets of vectors $A$ and $B$, MHD is computed by
$$\mathrm{MHD}(A, B) = \max\{\, d(A, B),\ d(B, A) \,\}, \tag{5}$$
where the distance between two sets is defined as
$$d(A, B) = \frac{1}{N_A} \sum_{a \in A} d(a, B), \tag{6}$$
with $N_A$ the number of vectors in $A$, and the distance between a vector and a set is defined as
$$d(a, B) = \min_{b \in B} \lVert a - b \rVert. \tag{7}$$
Previous studies [22, 7, 8] applied MHD for evaluation by vectorizing a 3D map along one direction, i.e., by treating it as a set of vectors taken along one of its three axes. However, there are two more ways to vectorize the 3D map, one along each of the other two axes. Each vectorization leads to different evaluation results by MHD. To make it a direction-independent evaluation metric like DR, we define 3D-MHD, which computes the averaged MHD over the three different vectorizations. A smaller 3D-MHD indicates a higher segmentation accuracy.
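Both metrics can be sketched in NumPy as below. This is one reading of the definitions above rather than the evaluation code used in the experiments: the Euclidean norm is assumed for the distance between vectors, the function names are hypothetical, and the dense pairwise distance matrix is only practical for small toy arrays (a full volume would need a chunked or tree-based nearest-neighbour search).

```python
import numpy as np

def dice_ratio(pred, label):
    """Dice ratio of two binary maps; undefined when both maps are empty."""
    pred, label = pred.astype(bool), label.astype(bool)
    return 2.0 * np.logical_and(pred, label).sum() / (pred.sum() + label.sum())

def directed_distance(a, b):
    """Mean, over the rows of `a`, of the distance to the nearest row of `b`."""
    diff = a[:, None, :] - b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))    # assumed Euclidean norm
    return dist.min(axis=1).mean()

def mhd(a, b):
    """Modified Hausdorff distance between two sets of row vectors."""
    return max(directed_distance(a, b), directed_distance(b, a))

def mhd_3d(pred, label):
    """Average the MHD over the three ways of reading a 3D map as a set of vectors."""
    values = []
    for axis in range(3):
        # Move `axis` last so that each row of the matrix is one vector along that axis.
        p = np.moveaxis(pred, axis, -1).reshape(-1, pred.shape[axis]).astype(float)
        l = np.moveaxis(label, axis, -1).reshape(-1, label.shape[axis]).astype(float)
        values.append(mhd(p, l))
    return float(np.mean(values))

# Toy 4-class maps (0 = BG); one binary map per class is derived with a comparison.
truth = np.zeros((2, 2, 2), dtype=int)
truth[0, 0, 0] = truth[0, 0, 1] = 1
guess = np.zeros((2, 2, 2), dtype=int)
guess[0, 0, 0] = 1

dr = dice_ratio(guess == 1, truth == 1)
distance = mhd_3d(guess == 1, truth == 1)
```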
IV-C Comparison with the Baseline
We compare our proposed model with the baseline on our dataset. The patch size and the overlapping step size for inference are set following [8]. To remove the bias of different subjects, leave-one-subject-out cross-validation is used for evaluating segmentation performance. That is, for the 10 subjects in our dataset, we train and evaluate models 10 times. Each time, one of the 10 subjects is left out for validation and the other 9 subjects are used for training. The mean and standard deviation of the segmentation performance over the 10 runs are reported.
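The cross-validation protocol can be sketched as follows. `train_and_score` is a hypothetical callback that trains a model on the training subjects and returns one score on the held-out subject; the population standard deviation (NumPy's default) is an assumption, as the text does not state which estimator is used.

```python
import numpy as np

def leave_one_subject_out(num_subjects):
    """Yield (training subject ids, held-out subject id) for every fold."""
    for held_out in range(num_subjects):
        train_ids = [s for s in range(num_subjects) if s != held_out]
        yield train_ids, held_out

def cross_validate(num_subjects, train_and_score):
    """Run all folds and return the mean and standard deviation of the scores."""
    scores = [train_and_score(train_ids, held_out)
              for train_ids, held_out in leave_one_subject_out(num_subjects)]
    return float(np.mean(scores)), float(np.std(scores))

folds = list(leave_one_subject_out(10))

# Dummy callback for illustration only; a real one would train and evaluate a model.
mean_score, std_score = cross_validate(10, lambda train_ids, held_out: 0.9)
```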
Tables II and III provide the experimental results. In terms of both evaluation metrics, our model achieves a significant improvement over the baseline model. Due to the small variances of the results, we focus on one of the 10 runs for visualization and the other studies, in which the models are trained on the first 9 subjects and evaluated on the tenth subject. A visualization of the segmentation results in this run is given in Fig. 6. By comparing the areas in red circles, we can see that our model captures more details than the baseline model. We also visualize the training processes to illustrate the superiority of our model. Fig. 5 shows the training and validation curves of our model and the baseline model in this run. Our model converges faster and to a lower training loss. In addition, according to the better validation results, our model does not suffer from overfitting.
To further show the efficiency of our proposed model, we compare the number of parameters in Table IV. Our model uses about 28% fewer parameters than CC-3D-FCN and achieves better performance. A comparison of inference time is provided in Table V. Our device settings are: GPU, Nvidia Titan Xp 12GB; CPU, Intel Xeon E5-2620 v4 2.10GHz; operating system, Ubuntu 16.04.3 LTS.
Since our data has been used as the training data in the iSeg2017 challenge (http://iseg2017.web.unc.edu/), we also compare the results evaluated on the 13 testing subjects in Table VI. According to the leader board (http://iseg2017.web.unc.edu/rules/results/), our model achieves the best results for WM and GM and comparable results for CSF.
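Table IV: Comparison of the number of parameters.
Model         Number of Parameters
Baseline [8]  2,534,276
Our Model     1,821,124
Table V: Comparison of inference time.
Model         Inference Time (min)
Baseline [8]  3.85±0.15
Our Model     3.06±0.12
Table VI: Comparison on the 13 testing subjects of the iSeg2017 challenge.
Model      CSF            GM             WM
Baseline   0.9324±0.0067  0.9146±0.0074  0.8974±0.0123
Our Model  0.9557±0.0060  0.9219±0.0089  0.9044±0.0153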
Table VII: Ablation study in terms of DR.
Model  CSF  GM  WM  Average

Baseline  0.9235  0.9085  0.8639  0.8986 
Model1  0.9585  0.9099  0.8625  0.9103 
Model2  0.9568  0.9172  0.8728  0.9156 
Model3  0.9576  0.9198  0.8749  0.9174 
Model4  0.9578  0.9210  0.8769  0.9186 
Model5  0.9554  0.9225  0.8804  0.9194 
Our Model  0.9572  0.9278  0.8867  0.9239 
Table VIII: Ablation study in terms of 3D-MHD.
Model  CSF  GM  WM  Average

Baseline  0.3422  0.6331  0.4541  0.4765 
Model1  0.2363  0.6277  0.4705  0.4448 
Model2  0.2404  0.6052  0.4480  0.4312 
Model3  0.2392  0.5993  0.4429  0.4271 
Model4  0.2397  0.5926  0.4336  0.4220 
Model5  0.2444  0.5901  0.4288  0.4211 
Our Model  0.2477  0.5692  0.4062  0.4077 
IV-D Ablation Study of Different Modules
We perform an ablation study to show the effectiveness of each part in our proposed model. Specifically, we compare the following models in addition to our model and the baseline:
Model1 is a 3D U-Net without short-range residual connections. Downsampling and upsampling are implemented by convolutions and deconvolutions with a stride of 2, respectively. The bottom block is simply a convolutional layer.
Model2 is Model1 with short-range residual connections, i.e., the blocks in Fig. 3(a) and (b) are applied. The bottom block and upsampling blocks are the same as those in Model1.
Model3 replaces the first upsampling block in Model2 with the block in Fig. 3(d).
Model4 replaces both upsampling blocks in Model2 with the block in Fig. 3(d).
Model5 replaces the bottom block in Model2 with the block in Fig. 3(c).
All models are trained on the first 9 subjects. We report the segmentation performance on the tenth subject in Table VII and Table VIII. The results indicate that the U-Net framework, the residual blocks and the global aggregation blocks introduced in Section III all provide an improvement over the baseline in terms of segmentation accuracy.
IV-E Impact of the Overlapping Step Size
As discussed in Section III-D, a small overlapping step size usually results in better segmentation, due to the ensemble effect. However, with a small overlapping step size, the model has to perform inference on more patches, which slows down inference. We explore the tradeoff in our model by setting the overlapping step size to 4, 8, 16 and 32, respectively. Again, we train our model on the first 9 subjects and evaluate on the tenth subject, with the patch size kept fixed. For these overlapping step sizes, 11880, 1920, 387 and 80 patches need to be processed during inference, respectively, as shown in Fig. 8. In addition, Fig. 7 plots the changes in segmentation performance in terms of DR. Step sizes of 8 and 16 are good choices that achieve both accurate and fast segmentation.
IV-F Impact of the Patch Size
The patch size affects the total number of distinct training samples. Meanwhile, it controls the range of available global information when performing segmentation for a patch. To choose the appropriate patch size for our model, we perform a grid search by training on the first 9 subjects and evaluating on the tenth subject with an overlapping step size of 8. Experiments are conducted with five different patch sizes. The results are provided in Fig. 9, and the best-performing patch size is selected as the default setting of our model.
V Conclusion
In this work, we investigate 3D multimodality isointense infant brain MR image segmentation. As pointed out, existing models do not have an efficient and effective way to aggregate global information and suffer from information loss during upsampling operations, which limits their performance. To address these problems, we propose a global aggregation block that can be flexibly used for global information fusion, and we build a novel model based on 3D U-Net. Thorough experiments indicate that our model outperforms the previous best model significantly. In addition, the ablation study shows that every part of our design results in an improvement and that our model effectively takes advantage of all of them.
References
 [1] K. Zilles, E. Armstrong, A. Schleicher, and H.J. Kretschmann, “The human pattern of gyrification in the cerebral cortex,” Anatomy and embryology, vol. 179, no. 2, pp. 173–179, 1988.
 [2] T. Paus, D. Collins, A. Evans, G. Leonard, B. Pike, and A. Zijdenbos, “Maturation of white matter in the human brain: a review of magnetic resonance studies,” Brain research bulletin, vol. 54, no. 3, pp. 255–266, 2001.
 [3] G. Li, L. Wang, F. Shi, A. E. Lyall, W. Lin, J. H. Gilmore, and D. Shen, “Mapping longitudinal development of local cortical gyrification in infants from birth to 2 years of age,” Journal of Neuroscience, vol. 34, no. 12, pp. 4228–4238, 2014.
 [4] H. C. Hazlett, M. D. Poe, G. Gerig, M. Styner, C. Chappell, R. G. Smith, C. Vachet, and J. Piven, “Early brain overgrowth in autism associated with an increase in cortical surface area before age 2 years,” Archives of general psychiatry, vol. 68, no. 5, pp. 467–476, 2011.
 [5] A. E. Lyall, F. Shi, X. Geng, S. Woolson, G. Li, L. Wang, R. M. Hamer, D. Shen, and J. H. Gilmore, “Dynamic development of regional cortical thickness and surface area in early childhood,” Cerebral cortex, vol. 25, no. 8, pp. 2204–2212, 2014.
 [6] W. Gao, J. H. Gilmore, D. Shen, J. K. Smith, H. Zhu, and W. Lin, “The synchronization within and interaction between the default and dorsal attention networks in early infancy,” Cerebral cortex, vol. 23, no. 3, pp. 594–603, 2012.

 [7] W. Zhang, R. Li, H. Deng, L. Wang, W. Lin, S. Ji, and D. Shen, “Deep convolutional neural networks for multimodality isointense infant brain image segmentation,” NeuroImage, vol. 108, pp. 214–224, 2015.
 [8] D. Nie, L. Wang, E. Adeli, C. Lao, W. Lin, and D. Shen, “3d fully convolutional networks for multimodal isointense infant brain image segmentation,” IEEE Transactions on Cybernetics, 2018.
 [9] J. Nie, G. Li, L. Wang, J. H. Gilmore, W. Lin, and D. Shen, “A computational growth model for measuring dynamic cortical development in the first year of life,” Cerebral cortex, vol. 22, no. 10, pp. 2272–2284, 2011.
 [10] N. I. Weisenfeld and S. K. Warfield, “Automatic segmentation of newborn brain mri,” Neuroimage, vol. 47, no. 2, pp. 564–572, 2009.
 [11] H. Xue, L. Srinivasan, S. Jiang, M. Rutherford, A. D. Edwards, D. Rueckert, and J. V. Hajnal, “Automatic segmentation and reconstruction of the cortex from neonatal mri,” Neuroimage, vol. 38, no. 3, pp. 461–477, 2007.
 [12] L. Gui, R. Lisowski, T. Faundez, P. S. Hüppi, F. Lazeyras, and M. Kocher, “Morphology-driven automatic segmentation of mr images of the neonatal brain,” Medical image analysis, vol. 16, no. 8, pp. 1565–1579, 2012.
 [13] M. J. Cardoso, A. Melbourne, G. S. Kendall, M. Modat, N. J. Robertson, N. Marlow, and S. Ourselin, “Adapt: an adaptive preterm segmentation algorithm for neonatal brain mri,” NeuroImage, vol. 65, pp. 97–108, 2013.
 [14] F. Shi, Y. Fan, S. Tang, J. H. Gilmore, W. Lin, and D. Shen, “Neonatal brain image segmentation in longitudinal mri studies,” Neuroimage, vol. 49, no. 1, pp. 391–400, 2010.
 [15] Z. Song, S. P. Awate, D. J. Licht, and J. C. Gee, “Clinical neonatal brain mri segmentation using adaptive nonparametric data models and intensity-based markov priors,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2007, pp. 883–890.
 [16] L. Wang, F. Shi, P.-T. Yap, W. Lin, J. H. Gilmore, and D. Shen, “Longitudinally guided level sets for consistent tissue segmentation of neonates,” Human brain mapping, vol. 34, no. 4, pp. 956–972, 2013.

 [17] S. H. Kim, V. S. Fonov, C. Dietrich, C. Vachet, H. C. Hazlett, R. G. Smith, M. M. Graves, J. Piven, J. H. Gilmore, S. R. Dager et al., “Adaptive prior probability and spatial temporal intensity change estimation for segmentation of the one-year-old human brain,” Journal of neuroscience methods, vol. 212, no. 1, pp. 43–55, 2013.
 [18] L. Wang, F. Shi, P.-T. Yap, J. H. Gilmore, W. Lin, and D. Shen, “4d multimodality tissue segmentation of serial infant images,” PloS one, vol. 7, no. 9, p. e44596, 2012.

 [19] Y. Zhang, M. Brady, and S. Smith, “Segmentation of brain mr images through a hidden markov random field model and the expectation-maximization algorithm,” IEEE transactions on medical imaging, vol. 20, no. 1, pp. 45–57, 2001.
 [20] A. Criminisi, J. Shotton, E. Konukoglu et al., “Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning,” Foundations and Trends® in Computer Graphics and Vision, vol. 7, no. 2–3, pp. 81–227, 2012.
 [21] A. Criminisi and J. Shotton, Decision forests for computer vision and medical image analysis. Springer Science & Business Media, 2013.
 [22] L. Wang, Y. Gao, F. Shi, G. Li, J. H. Gilmore, W. Lin, and D. Shen, “Links: Learning-based multisource integration framework for segmentation of infant brain images,” NeuroImage, vol. 108, pp. 160–172, 2015.

 [23] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
 [24] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
 [25] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018.
 [26] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
 [27] Z. Wang and S. Ji, “Smoothed dilated convolutions for improved dense prediction,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 2486–2495.
 [28] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 6000–6010.
 [30] T. M. Quan, D. G. Hildebrand, and W.-K. Jeong, “Fusionnet: A deep fully residual convolutional neural network for image segmentation in connectomics,” arXiv preprint arXiv:1612.05360, 2016.
 [31] A. Fakhry, T. Zeng, and S. Ji, “Residual deconvolutional networks for brain electron microscopy image segmentation,” IEEE transactions on medical imaging, vol. 36, no. 2, pp. 447–456, 2017.
 [32] K. Lee, J. Zung, P. Li, V. Jain, and H. S. Seung, “Superhuman accuracy on the snemi3d connectomics challenge,” arXiv preprint arXiv:1706.00120, 2017.
 [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [34] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense volumetric segmentation from sparse annotation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 424–432.
 [35] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 565–571.
 [36] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, “Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation,” Medical image analysis, vol. 36, pp. 61–78, 2017.
 [37] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” arXiv preprint arXiv:1711.07971, 2017.
 [38] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
 [39] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.

 [40] A. Krizhevsky and G. Hinton, “Convolutional deep belief networks on cifar-10,” Unpublished manuscript, vol. 40, p. 7, 2010.
 [41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [42] A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Advances in neural information processing systems, 1992, pp. 950–957.
 [43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [44] J. D. Blumenthal, A. Zijdenbos, E. Molloy, and J. N. Giedd, “Motion artifact in magnetic resonance imaging: implications for automated analysis,” Neuroimage, vol. 16, no. 1, pp. 89–92, 2002.
 [45] J. G. Sled, A. P. Zijdenbos, and A. C. Evans, “A nonparametric method for automatic correction of intensity nonuniformity in mri data,” IEEE transactions on medical imaging, vol. 17, no. 1, pp. 87–97, 1998.
 [46] D. W. Shattuck and R. M. Leahy, “Automated graph-based analysis and correction of cortical volume topology,” IEEE transactions on medical imaging, vol. 20, no. 11, pp. 1167–1177, 2001.
 [47] S. M. Smith, “Fast robust automated brain extraction,” Human brain mapping, vol. 17, no. 3, pp. 143–155, 2002.
 [48] F. Shi, L. Wang, Y. Dai, J. H. Gilmore, W. Lin, and D. Shen, “Label: pediatric brain extraction using learning-based meta-algorithm,” Neuroimage, vol. 62, no. 3, pp. 1975–1986, 2012.
 [49] P. A. Yushkevich, J. Piven, H. C. Hazlett, R. G. Smith, S. Ho, J. C. Gee, and G. Gerig, “User-guided 3d active contour segmentation of anatomical structures: significantly improved efficiency and reliability,” Neuroimage, vol. 31, no. 3, pp. 1116–1128, 2006.
 [50] Y. Dai, F. Shi, L. Wang, G. Wu, and D. Shen, “ibeat: a toolbox for infant brain magnetic resonance image processing,” Neuroinformatics, vol. 11, no. 2, pp. 211–225, 2013.
 [51] M.-P. Dubuisson and A. K. Jain, “A modified hausdorff distance for object matching,” in Pattern Recognition, 1994. Vol. 1-Conference A: Computer Vision & Image Processing., Proceedings of the 12th IAPR International Conference on, vol. 1. IEEE, 1994, pp. 566–568.