Accurate automatic brain image segmentation in magnetic resonance (MR) images is a prerequisite for the quantitative assessment of the brain in large-scale studies with images acquired at all ages [1, 2, 3, 4, 5, 6, 7, 8, 9]. Especially the field of neonatal brain image segmentation has developed rapidly in the last ten years. A popular approach for brain image segmentation is the use of (population specific) atlases[10, 11, 12, 13, 14, 15, 16, 17]
, but also pattern recognition methods are used[18, 19], sometimes in combination with an atlas-based approach [20, 21, 22]. To obtain anatomically correct segmentations, these methods need explicit spatial and intensity information. For atlas-based methods spatial information is provided in the form of an atlas which is deformed to match the image at hand. For methods based on pattern recognition spatial information is included in the feature set as distances within an atlas space , distances to a brain mask , probabilistic results of a previous segmentation step [18, 19], or by imposing anatomical constraints . Intensity information is, in pattern recognition methods, included as a set of features based on (local) intensity, and atlas-based methods are typically performed by matching intensity information between the atlas and target images.
The explicit definition of such spatial and intensity features could be avoided by using convolutional neural networks (CNNs). CNNs have shown to be successful in several computer vision tasks including the recognition of handwriting24]. In recent years, CNNs also gained popularity in the field of medical image analysis [25, 26, 27, 28, 29, 30, 31, 32, 33, 34]
. In contrast to classical machine learning methods, CNNs do not require a set of hand-crafted features for the classification, but learn sets of convolution kernels that are specifically trained for the classification problem at hand. While classical machine learning methods applied to image segmentation would use e.g. Gaussian or Haar-like kernels to acquire appearance information, CNNs optimise sets of kernels based on the provided training data. In this way, the system can automatically extract information that is relevant for the task. If spatial and intensity information within the image are important to distinguish between classes, this can be learned from the provided information, much like a human observer would recognise objects within a (medical) image.
CNNs have also been used for brain image segmentation. Zhang et al.  presented a method using CNNs for the segmentation of three tissue types: white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF), in MR brain images of 6–8 months old infants, which have low contrast between WM and GM. The method uses 2D patches of a single size from one image plane in T1-weighted, T2-weighted and fractional anisotropy images as input for a CNN. Prior to the classification, all voxels that do not belong to either of the three tissue types, i.e. skull, cerebellum and brain stem are excluded using in-house tools. De Brébisson et al.  presented a method for the segmentation of adult T1-weighted MR brain images in 134 regions as provided by the MICCAI challenge on multi-atlas labelling 111https://masi.vuse.vanderbilt.edu/workshop2012. The method uses multiple parallel networks of 2D patches in orthogonal planes, a 3D patch, and distances to a previous segmentation step.
Recent MICCAI challenges in neonatal  and adult  MR brain image segmentation show that various segmentation methods achieve accurate results, but also that different methods are better at different aspects of brain image segmentation. The best results per tissue type in both the NeoBrainS12 222http://neobrains12.isi.uu.nl/mainResults.php and the MRBrainS13 333http://mrbrains13.isi.uu.nl/results.php challenges were not obtained by a single best performing method. These challenges also show that, despite the overall accurate segmentations achieved by these automatic methods, various inaccuracies are still present.
This paper presents a method for the automatic segmentation of anatomical MR brain images into a number of classes based on a multi-scale CNN. The multi-scale approach allows the method to obtain accurate segmentation details as well as spatial consistency. Unlike previous work that employs CNNs for a brain image segmentation task , the proposed method allows omitting the explicit definition of spatial features. Furthermore, unlike previous work used for brain image segmentation 
, the method uses multiple patch and kernel sizes combined. This approach allows the method to learn multi-scale features that estimate both intensity and spatial characteristics. In contrast to using these multiple patch and kernel sizes, other approaches to multi-scale CNNs, used in different applications, provide multi-scale features by directly using the feature maps after the first convolution layer as additional input for a fully connected layer[37, 38].
Additionally, unlike previous work on brain image segmentation, the method is applied to the segmentation of images of developing neonates at different ages as well as young adults and ageing adults, and to coronal as well as axial images. This allows demonstrating that the method is able to adapt to the segmentation task at hand based on training data.
The paper is organised as follows. Section II provides a general description of the method. Section III describes the five test data sets that were used. Section IV-A describes the preprocessing that was performed. Section IV-B provides the parameter settings of the CNN that were determined experimentally with a subset of the data. Section IV-C lists the evaluation experiments and their results. Section IV-D compares the obtained results with the results in the recent literature and in the NeoBrainS12 and MRBrainS13 challenges. Section IV-E illustrates the influence of the multi-scale approach, and Section IV-F illustrates the influence of the single-plane approach. Finally, Section V discusses the results.
Each voxel in the image is classified to one of the brain tissue classes using a CNN. Information about each voxel is provided in the form of image patches where the voxel of interest is in the centre. To allow the method to use multi-scale information about each voxel, multiple patch sizes are used. The larger scales implicitly provide spatial information, because they are large enough to recognise where in the image the voxel is located, while the smaller scales provide detailed information about the local neighbourhood of the voxel. For each of these patch sizes, different kernel sizes are trained, i.e. larger kernel sizes are used for larger patches. This multi-scale approach allows the network to incorporate local details as well as global spatial consistency.
For each of the patch sizes, a separate network branch is used; only the output layer is shared. This allows the weights and biases in the CNN to be specifically optimised for each patch size and corresponding kernel size. Multiple convolution layers are used; after each convolution, the resulting images are sub-sampled by extracting the highest responses with max-pooling. While the dimensions of the patches decrease in each layer, the number of kernels that are trained increases.
After the convolution layers, separate fully connected layers are used for each input patch size. To perform the final classification, these layers are connected to a single softmax output layer with one node for each class (including a background class).
Because the number of samples per tissue class might vary between the classes, a defined number of samples is extracted per class from each training image. In this way, the number of samples used in the training procedure is the same for each class, which avoids a bias towards larger tissue classes. To provide the network with as many training samples as possible, in every epoch, a new random selection of samples is made and provided to the network in a randomised order. Rectified linear units are used for all nodes because of their speed in training CNNs . Drop-out 
is used on the fully connected layers to decrease the effect of overfitting on the training set. Mini-batch learning and RMSprop are used to train the network and cross-entropy is used as cost function to optimise the weights and biases.
|Cor. 30 wks||Cor. 40 wks||Ax. 40 wks||Ageing adults||Young adults|
|Age||30 weeks PMA||40 weeks PMA||40 weeks PMA||70 years||23 years|
|Acquisition protocol||Coronal T2-weighted||Coronal T2-weighted||Axial T2-weighted||Axial T1-weighted||Sagittal T1-weighted|
|Number of images||10||5||7||20||15|
|Reconstruction matrix||384 384 50||512 512 110||512 512 50||240 240 48||256 256 (261–334)|
|Reconstructed voxel sizes [mm3]||0.34 0.34 2.0||0.35 0.35 1.2||0.35 0.35 2.0||0.96 0.96 3.0||1.0 1.0 1.0|
|1||Cor. 30 wks||5 training, 5 test||0.92 0.02||0.69 0.06||0.91 0.02||0.88 0.05||0.96 0.00||0.87 0.02||0.84 0.01||0.91 0.02|
|2||LOSO 10, NBS12||0.91 0.02||0.74 0.06||0.89 0.02||0.91 0.01||0.96 0.00||0.88 0.02||0.83 0.02||0.91 0.02|
|3||Cor. 40 wks||LOSO 5, NBS12||0.93 0.01||0.56 0.05||0.86 0.02||0.85 0.04||0.92 0.00||0.78 0.05||0.82 0.01||0.86 0.02|
|4||Ax. 40 wks||LOSO 7||0.93 0.01||0.55 0.05||0.91 0.02||0.81 0.03||0.92 0.01||0.84 0.02||0.88 0.02||0.84 0.04|
|5||LOSO 7, NBS12||0.93 0.01||0.56 0.06||0.91 0.02||0.81 0.03||0.93 0.01||0.85 0.02||0.87 0.02||0.83 0.04|
|6||Ageing adults||5 training, 15 test||0.90 0.03||-||0.81 0.03||0.92 0.02||0.88 0.02||0.90 0.02||0.84 0.01||0.76 0.04|
|7||Young adults||5 training, 10 test||0.95 0.01||-||0.85 0.01||0.85 0.04||0.94 0.01||0.92 0.01||0.91 0.01||-|
|8||15 training, 20 test||0.95 0.01||-||0.86 0.02||0.86 0.05||0.93 0.02||0.93 0.01||0.91 0.03||-|
|Mean surface distance [mm]|
|1||Cor. 30 wks||5 training, 5 test||0.42 0.32||0.90 0.70||0.49 0.20||0.23 0.07||0.12 0.01||0.29 0.02||0.13 0.02||0.10 0.02|
|2||LOSO 10, NBS12||0.40 0.11||0.59 0.34||0.49 0.16||0.22 0.08||0.12 0.01||0.25 0.08||0.11 0.01||0.10 0.02|
|3||Cor. 40 wks||LOSO 5, NBS12||0.72 0.32||0.94 0.44||0.96 0.50||0.45 0.11||0.15 0.01||0.87 0.56||0.12 0.01||0.17 0.02|
|4||Ax. 40 wks||LOSO 7||1.04 0.38||0.82 0.30||0.49 0.20||0.67 0.62||0.13 0.02||0.39 0.12||0.11 0.02||0.19 0.04|
|5||LOSO 7, NBS12||1.14 0.30||0.68 0.24||0.46 0.19||0.43 0.14||0.12 0.02||0.35 0.08||0.11 0.02||0.19 0.05|
|6||Ageing adults||5 training, 15 test||2.01 2.10||-||0.67 0.17||0.43 0.67||0.28 0.05||2.41 2.21||0.23 0.02||0.48 0.10|
|7||Young adults||5 training, 10 test||1.17 0.80||-||0.75 0.07||0.52 0.12||0.21 0.05||0.64 0.28||0.28 0.06||-|
|8||15 training, 20 test||0.61 0.25||-||0.72 0.23||0.52 0.25||0.28 0.17||0.46 0.19||0.39 0.23||-|
|Gui et al., 2012 ||3T coronal T1 and T2||40 weeks PMA||10||0.87||0.78||0.88||-||0.94||0.90||0.92||-||-||-||0.84|
|Cardoso et al., 2013 ||1.5T T1||40 weeks PMA||5||0.87||-||0.84||0.94||0.91||0.74||0.76||-||-||-||-|
|Makropoulos et al., 2014 ||3T axial T1 and T2||40 weeks PMA||20||0.93||-||0.84||0.84||-||0.92||-||-||-||-||-|
|Wang et al., 2015 ||3T isotropic T1, T2 and FA||40 weeks PMA||-||-||-||-||-||-||-||-||-||0.92||0.89||0.84|
|Moeskops et al., 2015 ||3T coronal T2||30 weeks PMA||5||-||-||-||-||0.95||-||0.81||0.89||-||-||-|
|3T coronal T2||40 weeks PMA||5||-||-||-||-||0.92||-||0.80||0.85||-||-||-|
|NeoBrainS12||3T axial T1 and T2||40 weeks PMA||5||0.92||0.47||0.92||0.86||0.90||0.83||0.86||0.79||0.92||-||0.80|
|3T coronal T1 and T2||30 weeks PMA||5||0.88||0.69||0.84||0.88||0.91||0.76||0.71||0.83||0.90||-||0.84|
|3T coronal T1 and T2||40 weeks PMA||5||0.92||0.48||0.88||0.84||0.89||0.75||0.77||0.77||0.84||-||0.79|
|MRBrainS13||3T axial T1, T1 IR and T2 FLAIR||70 years||15||-||-||-||-||-||-||-||-||0.89||0.86||0.84|
|Image set||Rank||Method||Dice||Image set||Rank||Method||Dice||Image set||Rank||Method||Dice|
|Cor. 30 wks||1||Proposed||0.827||Ax. 40 wks||1||Proposed||0.805||Cor. 40 wks||1||Proposed||0.819*|
The method was applied to the segmentation of 3 different sets of volumetric T2-weighted MR brain images of preterm infants and 2 sets of volumetric T1-weighted MR brain images of adults, hence, 5 different sets of images in total. Because of the inverted contrast between WM and GM in neonatal images as compared with adult images, T2-weighted images of neonates provide a similar contrast between these tissue types as T1-weighted images of adults.
Basic information about the images is listed in Table I.
Iii-a Neonatal images
In total, 22 images of preterm infants were acquired on a Philips Achieva 3T scanner in accordance with standard clinical practice in the neonatal intensive care unit of the University Medical Center Utrecht (UMCU), The Netherlands. Permission from the medical ethical review committee of the UMCU and informed parental consent were obtained. 10 coronal images were acquired at an age of 30.9 0.6 (mean standard deviation) weeks postmenstrual age (PMA). For 5 of these patients a coronal follow-up image was acquired at 41.3 1.3 weeks PMA. Furthermore, 7 axial images (of different patients) were acquired at an age of 41.3 0.5 weeks PMA. No brain pathology was visible on these images.
The neonatal images were manually segmented by experts into 8 classes: cerebellum (CB), myelinated white matter (mWM), basal ganglia and thalami (BGT), ventricular cerebrospinal fluid (vCSF), unmyelinated white matter (uWM), brain stem (BS), cortical grey matter (cGM), and extracerebral cerebrospinal fluid (eCSF).
Detailed information about the acquisition and the manual segmentation protocol can be found in the paper by Išgum et al.  and on the NeoBrainS12 website 444http://neobrains12.isi.uu.nl/reference.php. Išgum et al. also evaluated the inter-observer variability for the manual segmentations of the data in the NeoBrainS12 challenge.
Iii-B Ageing adult images
20 axial images of adults were acquired on a Philips Achieva 3T scanner at an age of 70.5 4.0 years, as provided by the MRBrainS13 challenge. Permission from the medical ethical review committee of the UMCU was obtained and all participants signed an informed consent form. The patients had a varying degree of atrophy and white matter lesions.
The ageing adult images were manually segmented by experts into the same classes as the neonatal images. However, because all WM is myelinated in adults, only one WM class was made, resulting in 7 different classes. Possible white matter lesions were included in the WM segmentation.
Iii-C Young adult images
15 images from the OASIS project  were acquired on a Siemens Vision 1.5T scanner at an age of 23.0 4.1 years, as provided by the MICCAI challenge on multi-atlas labelling . The images were acquired in the sagittal plane with a voxel size of 1.0 1.0 1.25 mm3 and were subsequently resized to an isotropic resolution of 1.0 1.0 1.0 mm3 666https://masi.vuse.vanderbilt.edu/workshop2012.
The images were manually segmented, in the coronal plane, into 134 classes. To match the tissue definitions used in this paper, these classes were merged into the same tissue classes as used for the other images. Because no eCSF was segmented in these images, this resulted in 6 tissue classes.
Detailed information about the acquisition can be found in the paper by Marcus et al.  and on the OASIS website 777http://www.oasis-brains.org/. Detailed information about the manual segmentation protocol can be found on the NeuroMorphometrics website 888http://neuromorphometrics.org:8080/Seg/ 999http://braincolor.mindboggle.info/protocols/.
The automatic segmentations were evaluated in 3D using the Dice coefficient and the mean surface distance between the manual and automatic segmentations.
Iv Experiments and results
The parameter settings of the CNN were defined in preliminary leave-one-subject-out experiments with 5 of the images acquired at 30 weeks PMA. This set of images was selected such that it included none of the patients with a follow-up image acquired at 40 weeks PMA. This allowed independent evaluation on the remaining 5 images acquired at 30 weeks PMA as well as on the images of the other 4 test sets by training with representative training data.
Prior to the segmentation, the neonatal images were bias corrected using the method of Likar et al. . The adult images were bias corrected as provided in the MRBrainS13 challenge and the multi-atlas labelling challenge. To limit the number of voxels considered in the classification, brain masks were generated with BET . In each of the experiments, the samples from the training images were only selected from within the brain mask volumes. For each test image, only voxels within the brain mask volume were considered in the classification. The intensities were scaled such that for all images, the range of intensities within the brain mask was between 0 and 1023.
Iv-B CNN parameters
For all voxels within the brain mask, three in-plane patches with sizes of 25 25, 51 51 and 75 75 voxels are extracted, where the voxel of interest is in the centre. Because of the large slice thickness relative to the in-plane voxel size in 4 of the 5 image sets, orthogonal or 3D patches are not used, i.e. only information from the imaging plane is used for the segmentation of volumetric images.
A CNN with multiple convolution layers is used; a schematic of the network is shown in Figure 1. In the first layers 24 kernels are trained for each patch size. For the patches of 25 25 voxels, kernels of 5 5 voxels are used, for the patches of 51 51 voxels, kernels of 7 7 voxels are used, and for the patches of 75 75 voxels, kernels of 9 9 voxels are used. Max-pooling with kernels of 2 2 voxels is performed on the convolved images. The output images are input for the second layer, where 32 kernels of 3 3, 5 5 and 7 7 voxels are used. Again, max-pooling with kernels of 2 2 voxels is performed on the convolved images. In the third layer, 48 kernels of 3 3, 3 3 and 5 5 voxels are used. After this layer, max-pooling is only performed for the two largest patches, because for the smallest patch only an image of 3
3 voxels is left after three convolution layers. The contiguous even-sized max-pooling kernels cannot exactly cover the odd-sized patches. Therefore, before all max-pooling operations, the values are mirrored along the borders. The output of the third layer is input to fully connected layers with 256 nodes for each patch size. These nodes are subsequently connected to the softmax output layer with 9 nodes (8 tissue classes and background) for the neonatal images, 8 nodes (7 tissue classes and background) for the ageing adult images, and 7 nodes (6 tissue classes and background) for the young adult images. An example of the trained kernels in the first layer and an example of the convolution responses for images acquired at 30 weeks PMA is shown in Figure2.
The tissue classes consist of a very different number of voxels in these images. For example, in the images acquired at 30 weeks PMA, mWM consists of about 200 voxels, while uWM consists of 400 000 voxels. Therefore, to better balance the number of training samples of each class a fixed number of samples are extracted per class from each training image. If a class contains less than this number of samples, all samples are used. Multiple epochs are used in the training process, where a new set of random samples is extracted in each epoch. The performance on the test images acquired at 30 weeks PMA for each epoch using 25 000 and 50 000 training samples is shown in Figure 3. Based on this evaluation, in all experiments, 10 epochs and 50 000 training samples per class from each image are used.
Iv-C Evaluation experiments
The performance for the images acquired at 30 weeks PMA was evaluated on the independent set of 5 images not included in the parameter estimation, using the other 5 images as training data. For the other image sets, the method was retrained using representative training data. Because of the limited number of available training images, the performance on the coronal and axial images acquired at 40 weeks PMA was evaluated in leave-one-subject-out cross-validation for each set. The average Dice coefficients and mean surface distances for each tissue class are shown per set in Table II: Experiments 1, 3, and 4. The segmentation results in one slice of one image of each of the test sets are shown in Figure 4.
A subset of the neonatal images used in this paper is used as test data in the NeoBrainS12 challenge. This includes 5 of the coronal images acquired at 30 weeks PMA, all 5 images acquired at 40 weeks PMA, and 5 of the axial images acquired at 40 weeks PMA. Therefore, to allow an indication of the performance in comparison with the results in the challenge, the average results for the axial images acquired at 40 weeks are presented separately for the 5 test images in the challenge (Table II: Experiment 5). For the coronal images acquired at 30 weeks PMA an additional experiment is done in the same fashion as for the other images: leave-one-subject-out cross-validation with all ten images where the average over the test images in the challenge is listed in Table II: Experiment 2.
Given that a larger set of manually annotated images of ageing adults was available, their performance was evaluated using 5 images as training data and 15 images to test the performance (Table II: Experiment 6, and Figure 4). This furthermore allowed (indirect) comparison with the results of the MRBrainS13 challenge, which is performed using the same training and test images, except that only 3 combined tissue classes, namely WM, GM, and CSF are evaluated in the challenge.
Similar to the images of ageing adults, the images of young adults were evaluated using 5 images as training data and 10 images to test the performance (Table II: Experiment 7, and Figure 4). Additionally, the performance on the independent test set of 20 images  is evaluated using all 15 images as training data (Table II: Experiment 8).
Iv-D Comparison with previous methods
For comparison purposes, Table III lists segmentation results in terms of average Dice coefficients reported in the recent literature on neonatal MR brain image segmentation. Note that these methods used different scans that were segmented possibly using different tissue definitions. Hence, these results can only be used as an indication. Furthermore, Table III lists the best results per tissue class at the time of writing in the NeoBrainS12 and MRBrainS13 challenges. Note that in these challenges, the best results for different tissue types are not necessarily obtained by a single method.
To directly compare the method with previous methods, it was also evaluated in the NeoBrainS12 challenge using only the training images provided by the challenge. Table IV compares the results of the proposed method with other methods that have segmented the same tissue classes in the same images. Comparison has been performed by averaging the Dice coefficients over all tissue classes in all test images. Detailed comparison per tissue class is available on the NeoBrainS12 website 101010http://neobrains12.isi.uu.nl/mainResults.php.
Iv-E Multi-scale approach
To show the influence of combining different patch sizes in the network, the method was additionally trained using each of the three patch sizes separately. A segmentation result for the lateral sulcus and the hippocampus in an image acquired at 30 weeks PMA using only the patches of 25 25, 51 51 and 75 75 voxels, as well as those three patches combined, is shown in Figure 5. The results in terms of Dice coefficients and mean surface distance over all 5 test images acquired at 30 weeks PMA are shown in Table V: Experiments 1, 3, 5 and 7.
In addition, the proposed multi-scale approach was compared with the multi-scale approach of Sermanet and LeCun . To allow fair comparison with the other experiments, we used the largest patch size, i.e. 75 75 voxels, as input. To acquire multi-scale information in this approach, the output of the first, second and third layers provide combined input for the fully connected layer (Table V: Experiment 6).
Iv-F Single-plane approach
To motivate the choice of providing the network with patches from the acquisition planes only, the method was additionally evaluated using patches from consecutive slices, patches from the orthogonal planes, and 3D patches.
First, the method was trained for the images acquired at 30 weeks PMA using three consecutive slices for each of the three patch sizes (25 25, 51 51 and 75 75). Instead of a network with three branches, i.e. one for each of the three patch sizes (Figure 1), now a network with nine branches, i.e. three for each of the three patch sizes, was used (Table V: Experiment 8). Because of the large anatomical variation between slices due to the large slice thickness in 4 of the 5 image sets, separate branches allow the network to extract the relevant information by optimising dedicated kernels for each of the nine provided input patches.
Second, the method was trained for the images acquired at 30 weeks PMA using patches from the orthogonal planes (coronal, sagittal and axial) for each of the three patch sizes. Because these images are highly anisotropic, they were first interpolated to isotropic resolution using an interpolation factor of 6 in the out-of-plane direction. This resulted in voxel sizes of 0.340.34 0.33. Again, these orthogonal patches of three sizes were input for a network with nine branches (Table V: Experiment 9).
Third, two networks with interpolated 3D input patches and 3D convolutions were evaluated for the images acquired at 30 weeks PMA, one with input patches of 25 25 25 voxels, and one with input patches of 51 51 51 voxels (Table V: Experiments 2 and 4).
Fourth, because the images of the young adults were already acquired with a (nearly) isotropic resolution, the performance of the network with three orthogonal patches for each of the three patch sizes was evaluated for this set as well. This network was evaluated for the segmentation of the 6 combined tissue classes (Table V: Experiment 11), as well as for the segmentation of all 134 classes. The average Dice coefficient for the segmentation of 134 classes in the 20 test images was 0.7353 overall, 0.7170 for the cortical regions, and 0.7850 for the non-cortical regions. This performance would result in the 9th overall ranking out of 26 participants on the MICCAI challenge on multi-atlas labelling .
|1||Cor. 30 wks||2525 patch||0.69 0.04||0.33 0.13||0.70 0.05||0.76 0.06||0.91 0.02||0.58 0.03||0.78 0.02||0.86 0.02|
|2||252525 patch||0.78 0.03||0.60 0.03||0.77 0.04||0.78 0.02||0.92 0.01||0.79 0.02||0.75 0.02||0.85 0.02|
|3||5151 patch||0.90 0.02||0.57 0.08||0.87 0.02||0.84 0.04||0.94 0.01||0.85 0.01||0.78 0.02||0.88 0.02|
|4||515151 patch||0.82 0.02||0.48 0.14||0.77 0.06||0.70 0.07||0.90 0.01||0.77 0.02||0.67 0.03||0.80 0.01|
|5||7575 patch||0.91 0.01||0.62 0.08||0.88 0.02||0.85 0.06||0.94 0.01||0.85 0.01||0.78 0.02||0.88 0.02|
|6||7575, multi-scale||0.89 0.02||0.61 0.07||0.89 0.01||0.83 0.07||0.94 0.01||0.86 0.02||0.74 0.03||0.86 0.01|
|7||3 patch sizes||0.92 0.02||0.69 0.06||0.91 0.02||0.88 0.05||0.96 0.00||0.87 0.02||0.84 0.01||0.91 0.02|
|8||3 patch sizes, slices||0.92 0.01||0.68 0.08||0.90 0.01||0.89 0.05||0.96 0.00||0.87 0.02||0.84 0.01||0.91 0.02|
|9||3 patch sizes, ortho||0.93 0.01||0.68 0.06||0.90 0.01||0.89 0.03||0.96 0.00||0.88 0.02||0.83 0.01||0.91 0.01|
|10||Young adults||3 patch sizes||0.95 0.01||-||0.85 0.01||0.85 0.04||0.94 0.01||0.92 0.01||0.91 0.01||-|
|11||3 patch sizes, ortho||0.96 0.01||-||0.88 0.01||0.84 0.04||0.94 0.01||0.93 0.02||0.91 0.01||-|
|Mean surface distance [mm]|
|1||Cor. 30 wks||2525 patch||8.12 1.25||2.83 0.94||3.82 0.99||1.53 0.65||0.40 0.07||2.09 0.74||0.21 0.03||0.18 0.04|
|2||252525 patch||4.76 0.95||1.40 0.62||2.12 0.55||0.85 0.58||0.31 0.04||0.91 0.09||0.21 0.03||0.19 0.04|
|3||5151 patch||0.47 0.23||1.41 0.64||0.79 0.18||0.33 0.09||0.19 0.01||0.53 0.21||0.19 0.04||0.14 0.03|
|4||515151 patch||1.82 0.71||1.97 0.70||2.66 1.30||1.58 0.65||0.36 0.02||0.88 0.13||0.25 0.04||0.26 0.05|
|5||7575 patch||0.32 0.16||1.33 0.53||0.63 0.16||0.29 0.11||0.16 0.01||0.36 0.09||0.15 0.02||0.15 0.02|
|6||7575, multi-scale||0.46 0.18||1.19 0.51||0.56 0.13||0.33 0.11||0.19 0.02||0.29 0.04||0.19 0.03||0.18 0.03|
|7||3 patch sizes||0.42 0.32||0.90 0.70||0.49 0.20||0.23 0.07||0.12 0.01||0.29 0.02||0.13 0.02||0.10 0.02|
|8||3 patch sizes, slices||0.33 0.15||0.72 0.37||0.55 0.15||0.25 0.08||0.12 0.01||0.35 0.04||0.11 0.01||0.10 0.02|
|9||3 patch sizes, ortho||0.19 0.05||0.65 0.56||0.70 0.38||0.21 0.05||0.12 0.01||0.25 0.04||0.12 0.02||0.10 0.02|
|10||Young adults||3 patch sizes||1.17 0.80||-||0.75 0.07||0.52 0.12||0.21 0.05||0.64 0.28||0.28 0.06||-|
|11||3 patch sizes, ortho||0.65 0.19||-||0.64 0.20||0.77 0.67||0.20 0.05||0.52 0.35||0.27 0.06||-|
We have presented a method for the automatic segmentation of MR brain images, which has been evaluated on manually segmented preterm neonatal and adult images. Accurate segmentation results were obtained in images acquired at different ages (preterm neonatal vs. ageing adult) and with different acquisition protocols (coronal vs. axial, T2- vs. T1-weighted). The method uses CNNs to automatically extract the relevant information from the training data to learn convolution kernels, which allows adaptation to the problem at hand. Post-processing of the segmentation results, which is typically performed in brain image segmentation tasks, was here not necessary.
The method achieved accurate segmentations in terms of Dice coefficients for all tissue classes except mWM in the neonatal images. mWM consists, especially in the images acquired at 30 weeks PMA, of very few voxels, hence each mislabelled voxel strongly influences the Dice coefficient. Furthermore, this tissue is poorly visible in T2-weighted images and slightly better visible in T1-weighted images, which makes the manual annotation very difficult and prone to inter-observer variability . This consequently has influence on the performance of the neighbouring tissue classes, which can be seen from e.g. the performance on BS for the coronal images acquired at 40 weeks PMA (Table II: Experiment 3). Even though the performance in terms of Dice coefficients is lowest for mWM, the location of the tissue class is quite well recognised by the method, even based on T2-weighted images only. This can be seen from the mean surface distances (Table II) and from Figure 4: e.g. for the coronal images acquired at 40 weeks PMA, where mWM is recognised but does not follow the exact shape of the manual segmentation.
The method uses patches of three different sizes. Because of the highly anisotropic voxel sizes in 4 of the 5 evaluated image sets, the use of orthogonal or 3D patches would result in anisotropic patches, non-square/cubic patches, or strongly resampled data. Therefore, only in-plane information was used, i.e. the volumetric images were segmented using information from the imaging plane only. This corresponds to the way a human observer performs annotation of these images: by selecting voxels that belong to each of the tissue types in a single image plane based on trained anatomical knowledge. This approach is further motivated by additional experiments using patches from consecutive slices and using (resampled) orthogonal patches (Table V: Experiments 8, 9 and 11), which did not provide information that allowed more accurate classification. The performance in terms of average Dice coefficient over all tissue classes remained 0.87 for the anisotropic images acquired at 30 weeks PMA, and improved slightly, from 0.90 to 0.91, for the (nearly) isotropic images of young adults. The limited increase in performance for the images of young adults was possibly influenced by the manual segmentations, which have been performed in the coronal plane and are therefore less accurate in the other planes. Furthermore, experiments with interpolated 3D patches of single sizes suggested that 3D convolutions also do not necessarily result in increased performance (Table V: Experiments 2 and 4), possibly influenced by the large increase in number of weights and biases in the optimisation. The performance of a multi-scale 3D CNN was not evaluated because training a network with three large 3D input patches would become computationally infeasible.
The results using each of the three patches separately show that each patch focusses on a different aspect of the segmentation problem. The smallest patch allows detailed analysis of local texture but misses spatial consistency. The largest patch results in a smooth segmentation but misses small details. This can especially be seen for the segmentation of cGM; the average Dice coefficients for each of the patch sizes separately is 0.78, while the combined patches result in an average Dice coefficient of 0.84 (Table V: Experiments 1, 3, 5 and 7). This can also be observed in Figure 5: the internal CSF in the lateral sulcus is recognised using only the smallest patch size, but not with the two larger sizes. On the other hand, spatially inconsistent results are obtained for the segmentation of the hippocampus using only the smallest patch size, whereas the largest patch size shows better consistency in that respect. In both cases, the result with the three patches combined shows the most accurate segmentation. The kernels and responses in the first layer (Figure 2) show that different kernels enhance different parts of the image and the larger kernels operate on a larger scale. In addition, the proposed multi-scale approach is compared with the multi-scale approach of Sermanet and LeCun  and shows better performance for every tissue class.
The method is based on T2-weighted images for the neonatal images and on T1-weighted images in adults. Furthermore, the segmentation of both coronal and axial images was evaluated. This shows that the method is able to adapt to different acquisition protocols based on representative training data. Furthermore, being able to segment the images based on a single T1- or T2-weighted image, instead of relying on multiple acquisitions, allows omitting registration between images and thus precludes possible registration errors.
The segmentation method was applied to MR images of the developing, the young adult, and the ageing brain. No images acquired between these three age ranges were evaluated, but the method can likely be straightforwardly applied to images acquired at other ages. The method might however be limited in the application to images with abnormalities that are not included in the training set.
Additional evaluation in the segmentation of 134 classes as provided in the MICCAI challenge on multi-atlas labelling demonstrated that the approach can straight-forwardly be applied to a different segmentation task, without any parameter tuning. This resulted in an overall average Dice coefficient of 0.7353 as compared with 0.725 reported by de Brébisson et al. .
A brain mask  is used to limit the number of samples considered in the classification. Because the results of this brain mask were not always consistent for the images of ageing adults, a very conservative setting for the fractional intensity parameter was chosen for these images such that almost the whole head was included in the mask. It might be possible to completely omit the use of a brain mask, but this would also increase the computational load.
CNNs are known to need a large number of training samples. Because we applied CNNs in a voxel classification task, many training samples were available. They were, however, extracted from a limited number of training images. More training images could therefore increase the performance. In addition, including diverse training data from different data sets in a single training step could generalise the problem and would in this way not require retraining the network. This generalisability over a more diverse data set might, however, be achieved at the cost of a decreased performance on each of the data sets that are included.
Many adjustments to the architecture of the network are possible, including more or different patch and kernel sizes, a different number of convolution layers, or a different number of nodes or layers in the fully connected network. Several settings were evaluated in preliminary experiments, but future work could focus on additional optimisation in this respect. Note that the network architecture provides a search space for the method and all weights and biases are optimised by the system. Small variations in the architecture are therefore not likely to have a large effect on the performance.
The presented CNN method for the automatic segmentation of MR brain images shows accurate segmentation results in images acquired at different ages and with different acquisition protocols.
The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used in this research.
The authors thank Bennett Landman for providing the data of the MICCAI challenge on multi-atlas labelling.
-  D. H. Salat, R. L. Buckner, A. Z. Snyder, D. N. Greve, R. S. Desikan, E. Busa, J. C. Morris, A. M. Dale, and B. Fischl, “Thinning of the cerebral cortex in aging,” Cereb Cortex, vol. 14, no. 7, pp. 721–730, 2004.
-  J. Dubois, M. Benders, C. Borradori-Tolsa, A. Cachia, F. Lazeyras, R. Ha-Vinh Leuchter, S. V. Sizonenko, S. K. Warfield, J. F. Mangin, and P. S. Hüppi, “Primary cortical folding in the human newborn: an early marker of later functional development.,” Brain, vol. 131, pp. 2028–2041, Aug 2008.
-  C. E. Rodriguez-Carranza, P. Mukherjee, D. Vigneron, J. Barkovich, and C. Studholme, “A framework for in vivo quantification of regional brain folding in premature neonates.,” NeuroImage, vol. 41, pp. 462–478, Jun 2008.
-  M. Thambisetty, J. Wan, A. Carass, Y. An, J. L. Prince, and S. M. Resnick, “Longitudinal changes in cortical thickness associated with normal aging,” NeuroImage, vol. 52, no. 4, pp. 1215–1223, 2010.
-  L. J. Hogstrom, L. T. Westlye, K. B. Walhovd, and A. M. Fjell, “The structure of the cerebral cortex across adult life: age-related patterns of surface area, thickness, and gyrification,” Cereb Cortex, vol. 23, no. 11, pp. 2521–2530, 2013.
-  G. Li, L. Wang, F. Shi, A. E. Lyall, W. Lin, J. H. Gilmore, and D. Shen, “Mapping longitudinal development of local cortical gyrification in infants from birth to 2 years of age,” J Neurosci, vol. 34, no. 12, pp. 4228–4238, 2014.
-  A. E. Lyall, F. Shi, X. Geng, S. Woolson, G. Li, L. Wang, R. M. Hamer, D. Shen, and J. H. Gilmore, “Dynamic development of regional cortical thickness and surface area in early childhood,” Cereb Cortex, vol. 25, no. 8, pp. 2204–2212, 2014.
-  R. Wright, V. Kyriakopoulou, C. Ledig, M. Rutherford, J. Hajnal, D. Rueckert, and P. Aljabar, “Automatic quantification of normal cortical folding patterns from fetal brain MRI,” NeuroImage, vol. 91, pp. 21–32, 2014.
-  P. Moeskops, M. J. Benders, K. J. Kersbergen, F. Groenendaal, L. S. de Vries, M. A. Viergever, and I. Išgum, “Development of cortical morphology evaluated with longitudinal MR brain images of preterm infants,” PLOS ONE, vol. 10, no. 7, p. e0131552, 2015.
-  B. Fischl, D. H. Salat, E. Busa, M. Albert, M. Dieterich, C. Haselgrove, A. van der Kouwe, R. Killiany, D. Kennedy, S. Klaveness, A. Montillo, N. Makris, B. Rosen, and A. M. Dale, “Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain.,” Neuron, vol. 33, no. 3, pp. 341–355, 2002.
-  M. Prastawa, J. H. Gilmore, W. Lin, and G. Gerig, “Automatic segmentation of MR images of the developing newborn brain.,” Med Image Anal, vol. 9, no. 5, pp. 457–466, 2005.
-  H. Xue, L. Srinivasan, S. Jiang, M. Rutherford, A. D. Edwards, D. Rueckert, and J. V. Hajnal, “Automatic segmentation and reconstruction of the cortex from neonatal MRI.,” NeuroImage, vol. 38, no. 3, pp. 461–477, 2007.
-  N. I. Weisenfeld and S. K. Warfield, “Automatic segmentation of newborn brain MRI.,” NeuroImage, vol. 47, no. 2, pp. 564–572, 2009.
-  P. A. Habas, K. Kim, F. Rousseau, O. A. Glenn, A. J. Barkovich, and C. Studholme, “Atlas-based segmentation of developing tissues in the human brain with quantitative validation in young fetuses.,” Hum Brain Mapp, vol. 31, no. 9, pp. 1348–1358, 2010.
-  F. Shi, Y. Fan, S. Tang, J. H. Gilmore, W. Lin, and D. Shen, “Neonatal brain image segmentation in longitudinal MRI studies.,” NeuroImage, vol. 49, no. 1, pp. 391–400, 2010.
-  M. J. Cardoso, A. Melbourne, G. S. Kendall, M. Modat, N. J. Robertson, N. Marlow, and S. Ourselin, “AdaPT: An adaptive preterm segmentation algorithm for neonatal brain MRI.,” NeuroImage, vol. 65, pp. 97–108, 2013.
-  A. Makropoulos, I. Gousias, C. Ledig, P. Aljabar, A. Serag, J. Hajnal, A. D. Edwards, S. Counsell, and D. Rueckert, “Automatic whole brain MRI segmentation of the developing neonatal brain.,” IEEE Trans Med Imaging, vol. 33, no. 9, pp. 1818–1831, 2014.
-  L. Wang, Y. Gao, F. Shi, G. Li, J. H. Gilmore, W. Lin, and D. Shen, “LINKS: Learning-based multi-source integration framework for segmentation of infant brain images,” NeuroImage, vol. 108, pp. 160–172, 2015.
-  P. Moeskops, M. J. Benders, S. M. Chiţǎ, K. J. Kersbergen, F. Groenendaal, L. S. de Vries, M. A. Viergever, and I. Išgum, “Automatic segmentation of MR brain images of preterm infants using supervised classification,” NeuroImage, vol. 118, pp. 628–641, 2015.
-  H. A. Vrooman, C. A. Cocosco, F. van der Lijn, R. Stokking, M. A. Ikram, M. W. Vernooij, M. M. B. Breteler, and W. J. Niessen, “Multi-spectral brain tissue segmentation using automatically trained k-nearest-neighbor classification.,” NeuroImage, vol. 37, no. 1, pp. 71–81, 2007.
-  V. Srhoj-Egekher, M. J. Benders, M. A. Viergever, and I. Išgum, “Automatic neonatal brain tissue segmentation with MRI,” in SPIE Medical Imaging, pp. 86693X–86693X, International Society for Optics and Photonics, 2013.
-  P. Anbeek, I. Išgum, B. J. van Kooij, C. P. Mol, K. J. Kersbergen, F. Groenendaal, M. A. Viergever, L. S. de Vries, and M. J. Benders, “Automatic segmentation of eight tissue classes in neonatal brain MRI,” PLOS ONE, vol. 8, no. 12, p. e81895, 2013.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
A. Prasoon, K. Petersen, C. Igel, F. Lauze, E. Dam, and M. Nielsen, “Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network,” inMedical Image Computing and Computer-Assisted Intervention–MICCAI 2013, pp. 246–253, Springer, 2013.
-  H. R. Roth, L. Lu, A. Seff, K. M. Cherry, J. Hoffman, S. Wang, J. Liu, E. Turkbey, and R. M. Summers, “A new 2.5 D representation for lymph node detection using random sets of deep convolutional neural network observations,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2014, pp. 520–527, Springer, 2014.
-  M. Veta, P. J. van Diest, S. M. Willems, H. Wang, A. Madabhushi, A. Cruz-Roa, F. Gonzalez, A. B. Larsen, J. S. Vestergaard, A. B. Dahl, et al., “Assessment of algorithms for mitosis detection in breast cancer histopathology images,” Med Image Anal, vol. 20, no. 1, pp. 237–248, 2015.
-  H. R. Roth, A. Farag, L. Lu, E. B. Turkbey, and R. M. Summers, “Deep convolutional networks for pancreas segmentation in CT imaging,” in SPIE Medical Imaging, pp. 94131G–94131G, International Society for Optics and Photonics, 2015.
-  H. R. Roth, L. Lu, A. Farag, H.-C. Shin, J. Liu, E. B. Turkbey, and R. M. Summers, “DeepOrgan: Multi-level deep convolutional networks for automated pancreas segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, pp. 556–564, Springer, 2015.
-  Y. Bar, I. Diamant, L. Wolf, and H. Greenspan, “Deep learning with non-medical training used for chest pathology identification,” in SPIE Medical Imaging, pp. 94140V–94140V, International Society for Optics and Photonics, 2015.
-  A. de Brébisson and G. Montana, “Deep neural networks for anatomical brain segmentation,” in CVPR Bioimage Computing Workshop, 2015.
-  W. Zhang, R. Li, H. Deng, L. Wang, W. Lin, S. Ji, and D. Shen, “Deep convolutional neural networks for multi-modality isointense infant brain image segmentation,” NeuroImage, vol. 108, pp. 214–224, 2015.
-  F. Ciompi, B. de Hoop, S. J. van Riel, K. Chung, E. T. Scholten, M. Oudkerk, P. A. de Jong, M. Prokop, and B. van Ginneken, “Automatic classification of pulmonary peri-fissural nodules in computed tomography using an ensemble of 2D views and a convolutional neural network out-of-the-box,” Med Image Anal, vol. 26, no. 1, pp. 195–202, 2015.
-  J. M. Wolterink, T. Leiner, M. A. Viergever, and I. Išgum, “Automatic coronary calcium scoring in cardiac CT angiography using convolutional neural networks,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, pp. 589–596, Springer, 2015.
-  I. Išgum, M. J. Benders, B. Avants, M. J. Cardoso, S. J. Counsell, E. F. Gomez, L. Gui, P. Hüppi, K. J. Kersbergen, A. Makropoulos, et al., “Evaluation of automatic neonatal brain segmentation algorithms: the NeoBrainS12 challenge,” Med Image Anal, vol. 20, no. 1, pp. 135–151, 2015.
-  A. M. Mendrik, K. L. Vincken, H. J. Kuijf, M. Breeuwer, W. H. Bouvy, J. de Bresser, A. Alansary, M. de Bruijne, A. Carass, A. El-Baz, A. Jog, R. Katyal, A. R. Khan, F. van der Lijn, Q. Mahmood, R. Mukherjee, A. van Opbroek, S. Paneri, S. Pereira, M. Persson, M. Rajchl, D. Sarikaya, Örjan Smedby, C. A. Silva, H. A. Vrooman, S. Vyas, C. Wang, L. Zhao, G. J. Biessels, , and M. A. Viergever, “MRBrainS challenge: Online evaluation framework for brain image segmentation in 3T MRI scans,” Computational Intelligence and Neuroscience, 2015.
-  P. Sermanet and Y. LeCun, “Traffic sign recognition with multi-scale convolutional networks,” in Neural Networks (IJCNN), The 2011 International Joint Conference on, pp. 2809–2813, IEEE, 2011.
-  J. Li, C. Niu, and M. Fan, “Multi-scale convolutional neural networks for natural scene license plate detection,” in Advances in Neural Networks–ISNN 2012, pp. 110–119, Springer, 2012.
V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” inProceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” in COURSERA: Neural Networks for Machine Learning, vol. 4, 2012.
-  L. Gui, R. Lisowski, T. Faundez, P. S. Hüppi, F. Lazeyras, and M. Kocher, “Morphology-driven automatic segmentation of MR images of the neonatal brain.,” Med Image Anal, vol. 16, no. 8, pp. 1565–1579, 2012.
-  D. S. Marcus, T. H. Wang, J. Parker, J. G. Csernansky, J. C. Morris, and R. L. Buckner, “Open access series of imaging studies (OASIS): cross-sectional MRI data in young, middle aged, nondemented, and demented older adults,” J Cognitive Neurosci, vol. 19, no. 9, pp. 1498–1507, 2007.
-  B. A. Landman, A. Ribbens, B. Lucas, C. Davatzikos, B. Avants, C. Ledig, D. Ma, D. Rueckert, D. Vandermeulen, F. Maes, G. Erus, J. Wang, H. Holmes, H. Wang, J. Doshi, J. Kornegay, J. Manjon, A. Hammers, A. Akhondi-Asl, A. J. Asman, and S. K. Warfield, MICCAI 2012 workshop on multi-atlas labeling. CreateSpace Independent Publishing Platform, 2012.
-  B. Likar, M. A. Viergever, and F. Pernus, “Retrospective correction of MR intensity inhomogeneity by entropy minimization,” IEEE Trans Med Imaging, vol. 20, pp. 1398–1410, 2001.
-  S. M. Smith, “Fast robust automated brain extraction.,” Hum Brain Mapp, vol. 17, no. 3, pp. 143–155, 2002.