Accurate segmentation of brain anatomy provides the basis for quantitative measurements such as volume, thickness and shape from magnetic resonance images (MRIs). These measurements are widely used in neuroscience research to investigate structural brain changes associated with age and disease. Since manual segmentation of brain MRI scans requires expert knowledge and is time-consuming, a growing number of computational approaches have been proposed for accurate automatic segmentation of subcortical brain structures, which is particularly important given the rapid growth of large-scale brain studies.
Recently, several state-of-the-art methods based on convolutional neural networks (CNNs) have been proposed for fast or accurate segmentation of brain anatomy. These CNN-based networks can be divided into three main categories: multi-task voting models, U-shaped models and downsampling models, where the latter two are based on fully convolutional networks (FCNs). Multi-task voting models, like DeepNAT, consist of conventional CNN components such as convolution, max-pooling and linear fully connected layers. Multiple predictions are generated from repeated linear fully connected layers and then reshaped into three-dimensional space. These spatial predictions then add their votes to the segmentation mask step by step, which results in a segmentation time of no less than one hour per scan. With engineered means such as a cascaded approach and spectral coordinates, such 3D multi-task models segment subcortical structures with state-of-the-art accuracy, although they are time-consuming. Unlike multi-task voting models, U-shaped models concentrate more on dense prediction. Originating from the FCN with an upsampling and skip-connection path, this category has the advantage of fast full-image segmentation with satisfactory but limited accuracy. Later, units built upon ResNet and DenseNet were separately embedded into the U-shaped model, yielding the successive state-of-the-art methods voxel-wise residual network (VoxResNet) and fully convolutional DenseNet (FC-DenseNet). Although 3D U-shaped models are capable of segmenting a full image in seconds, their complex architecture and millions of learnable parameters make fast full-image training infeasible given the limits of GPU memory and annotated data. In practice, U-shaped models are generally trained on patches cropped from the original images, which leads to a time-consuming training process with only small minibatches, and their accuracy needs further improvement to match that of multi-task models. In summary, the absence of lightweight learnable models for fast but accurate segmentation remains a bottleneck in brain anatomy research.
Downsampling models offer a way to address this issue [8, 9]. Similarly based on the FCN, this type of method discards the upsampling layers and obtains downsampled dense inference directly from the outputs of the bottleneck layer, which enables fast segmentation and reduces the computational burden of training. However, because the oversimplified architecture limits performance, this branch of FCNs needs further exploration. In this paper, we propose a 3D end-to-end downsampling model, the fully dense and fully convolutional network (FD-FCN), for structural segmentation of T1-weighted brain MRI, which is competitive in accuracy and efficient in both training and segmentation. In terms of time consumption, the method retains the fast segmentation inherited from FCNs, while greatly accelerating training and occupying less memory than state-of-the-art U-shaped models, owing to its carefully designed architecture. In terms of segmentation quality, experiments on the IBSR dataset show that FD-FCN achieves higher Dice accuracy on 11-structure brain segmentation than the other FCN-based methods. For a convincing comparison, we also apply FC-DenseNet, the state-of-the-art method among FCNs, to multi-label subcortical brain segmentation of 3D volumes for the first time. The experiments show that FD-FCN achieves a 3.66% absolute improvement in Dice accuracy over FC-DenseNet (89.81% vs 86.15%). Furthermore, FD-FCN matches the segmentation accuracy of the multi-task models (89.81% vs 89.76%) while taking on average 53 seconds rather than 73 minutes per scan.
The main contributions of FD-FCN are (i) a clear separation of local and global information, which gives the proposed FD-FCN a better capability for dense inference while alleviating the parameter explosion caused by dense connections, (ii) newly designed dense blocks that enlarge receptive fields without significantly increasing parameters and (iii) the first incorporation of spectral coordinates into FCNs for spatial context. FD-FCN could be further improved for semantic segmentation of brain anatomy by incorporating a fully connected conditional random field (CRF) and a fine-tuning learning strategy.
We start by presenting the proposed multi-scale fully dense and fully convolutional architecture of FD-FCN, which is at the core of our segmenting method. Section 2.2 introduces the incorporation of dilated convolutions to design new dense blocks with enlarged receptive fields but negligible parameter increase. Then section 2.3 describes the calculation of spectral coordinates and finally section 2.4 gives other details.
2.1 Network Architecture.
As mentioned in section 1, incorporating DenseNets leads to outstanding performance among U-shaped FCNs. However, since the training process is costly in both time and memory, such an architecture struggles to provide a practical solution for three-dimensional brain segmentation. As downsampling models offer similar dense inference and are easier to train owing to their naturally lightweight size, we propose to carefully apply DenseNet to the downsampling model for further improvement.
DenseNet designs a sophisticated connectivity pattern that iteratively concatenates all output features in a feedforward fashion. Dense blocks form the basis of DenseNet and are in turn composed of unit layers. The output of each unit layer has $k$ feature maps, where $k$, hereafter referred to as the growth rate, is typically set to a small value.
Assuming a dense block contains $n$ densely connected unit layers, the output of the dense block is the concatenation of the outputs of all layers, with the block input concatenated as well. The number of feature maps therefore increases by $n \times k$ after each dense block, growing linearly. With such linear growth of input channels, the number of convolution parameters explodes as depth increases.
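The channel and parameter growth described above can be made concrete with a small counting sketch. It assumes 3×3×3 convolutions, ignores biases, BN and transition layers, and uses illustrative values (16 input channels, growth rate 8, four layers per block, three blocks) that are not taken from the paper; `dense_block` and `stack` are hypothetical helper names.

```python
def dense_block(in_channels, growth_rate, num_layers, keep_input, kernel=3):
    """Count output channels and conv weights of one dense block.

    Layer i sees the block input plus the outputs of all previous layers,
    so its input width is in_channels + i * growth_rate.
    """
    params = 0
    for i in range(num_layers):
        layer_in = in_channels + i * growth_rate
        params += layer_in * growth_rate * kernel ** 3  # 3D conv weights, no bias
    out_channels = num_layers * growth_rate
    if keep_input:
        # Standard DenseNet: the block input is concatenated to the output.
        out_channels += in_channels
    return out_channels, params


def stack(num_blocks, in_channels, growth_rate, num_layers, keep_input):
    """Chain several dense blocks and track width and total weights."""
    widths, total = [], 0
    channels = in_channels
    for _ in range(num_blocks):
        channels, params = dense_block(channels, growth_rate, num_layers, keep_input)
        widths.append(channels)
        total += params
    return widths, total


# Illustrative values only (assumptions, not the paper's configuration).
dense_widths, dense_total = stack(3, 16, 8, 4, keep_input=True)
slim_widths, slim_total = stack(3, 16, 8, 4, keep_input=False)
print("input kept   :", dense_widths, dense_total)
print("input dropped:", slim_widths, slim_total)
```

With the block input kept, the width grows by the number of layers times the growth rate after every block. Dropping the input from the block output keeps the width constant, which is the kind of slimming the next paragraph motivates.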
As shown in Fig. 1, FD-FCN is mainly composed of dense blocks. To alleviate parameter explosion, the inputs of dense blocks are no longer directly concatenated to subsequent layers in the feature extraction (FE) process. This makes the model slimmer, although global context tends to be lost as well. To model both local and global context, FD-FCN embeds intermediate-layer outputs in the final prediction, which encourages consistency between features extracted at different scales and embeds fine-grained information directly in the segmentation process. Since this construction gives the model both a slimmer size and better performance, we describe FD-FCN as a multi-scale network architecture. Apart from the FE dense blocks, dense blocks are also explored, for the first time, in the fully connected (FC) process. Dense connections hierarchically organize the FC layers and let local and global context pass through all of them, which improves performance. In addition, the FC layers produce outputs of lower dimension, so the number of parameters in the FC process is reduced several-fold, which helps to alleviate overfitting. The other components of FD-FCN, namely the downsampling and classifying layers, each carry out only one convolution operation. The varying feature resolution across downsampling layers and the limited number of label classes in the classifying layers prevent dense connections from being applied there. Since these layers are few in number compared with the feature extraction and fully connected ones, we describe FD-FCN as a fully dense network architecture.
2.2 New Dense Blocks.
Unlike the unit layers in the original DenseNet blocks, which commonly consist of batch normalization (BN), an activation function and a convolution, hybrid dilated (HD) layers are newly designed to enlarge receptive fields without significantly increasing parameters. The HD layers are built on dilated convolution and form the basis of the newly designed dense blocks. A dilated convolution is constructed by inserting $r-1$ zeros between neighboring voxels in the 3D convolution kernel, where $r$ corresponds to the dilation rate. For a convolution kernel of size $s \times s \times s$, the size of the resulting dilated filter is $s_d \times s_d \times s_d$, where $s_d = s + (s-1)(r-1)$. Given $r > 1$, we have $s_d > s$, which enlarges the receptive field. Since the gridding problem of dilated convolution has been addressed by hybrid dilated convolution, we are inspired to apply that solution to construct the unit layers of dense blocks. The new HD layer consists of BN, PReLU and a group of parallel convolutions, whose outputs are summed to form the layer output (see Fig. 1). These convolutions share the same number of output channels and the same kernel size, and differ only in their dilation rates and corresponding padding lengths. The group of dilation rates should meet two conditions to ensure that the parallel convolutions cover a larger region without any holes or missing edges. Firstly, the dilation rates within a group should not share a common factor. And secondly, giving , there should be . Here we adopt parallel convolutions with dilation rates and other combinations could be explored for further improvement.
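The effect of the dilation rate on kernel extent, and the reason a group of rates can leave holes, can be checked with a short one-dimensional sketch. The relation used for the dilated size is the standard one, $s + (s-1)(r-1)$ for kernel size $s$ and rate $r$. The rate groups `(1, 2, 3)` and `(2, 4)` are illustrative assumptions rather than the rates adopted in FD-FCN, and the pairwise-coprime check is only one possible reading of the "no common factor" condition. All function names are hypothetical.

```python
from itertools import combinations
from math import gcd


def dilated_size(kernel_size, rate):
    """Extent of a dilated kernel along one axis: s + (s - 1)(r - 1)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def taps_1d(kernel_size, rate):
    """Offsets from the kernel centre that a 1D dilated kernel samples."""
    half = kernel_size // 2
    return {rate * i for i in range(-half, half + 1)}


def covered_offsets(kernel_size, rates):
    """Union of the offsets sampled by parallel kernels with the given rates."""
    covered = set()
    for rate in rates:
        covered |= taps_1d(kernel_size, rate)
    return covered


def pairwise_coprime(rates):
    """One reading of 'no common factor': every pair of rates has gcd 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(rates, 2))


for rates in [(1, 2, 3), (2, 4)]:  # illustrative groups, not the paper's
    extent = dilated_size(3, max(rates))
    full = set(range(-(extent // 2), extent // 2 + 1))
    holes = sorted(full - covered_offsets(3, rates))
    print(rates, "coprime:", pairwise_coprime(rates), "holes:", holes)
```

For 3-tap kernels, the group `(1, 2, 3)` samples every offset in its 7-voxel extent, whereas `(2, 4)` only ever samples even offsets, which is the gridding pattern the conditions above are meant to avoid.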
2.3 Spectral Coordinates.
A downside of patch-based FD-FCN is the loss of spatial context, which provides valuable information for structures with low tissue contrast. To add spatial information, we adopt the spectral coordinates proposed in DeepNAT, augmenting the patches with location information in the final prediction. The calculation of spectral coordinates starts with the definition of the adjacency matrix $W$. The weight $W_{ij}$ between two points $i$ and $j$ is set to $1$ if both points are neighbors and within the brain mask, and to $0$ otherwise. The Laplacian operator on the volume is then $L = D - W$, with the node degree matrix $D$, where $D_{ii} = \sum_j W_{ij}$ and all other entries are zero. We then solve the Laplacian eigenvalue problem $L\mathbf{f} = \lambda\mathbf{f}$ with eigenvalues $\lambda$ and eigenvectors $\mathbf{f}$. We compute the first three eigenvectors corresponding to the three eigenvalues with largest real part, where each eigenvector is reshaped to a 3D image and the ensemble forms the spectral brain coordinates. To provide FD-FCN with more information, we combine the three spectral coordinates with three Cartesian ones. The Cartesian coordinates are normalized by dividing by the length of the brain volume along the respective dimension. To the best of our knowledge, this is the first application of eigenvectors to FCN-based methods.
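As a minimal sketch of the first step of this construction, the graph Laplacian $L = D - W$ can be assembled for a tiny binary mask in plain Python. The 6-connected neighborhood is an assumption, since the neighborhood definition is not specified here, and `graph_laplacian` is a hypothetical helper name. For a real brain volume a sparse matrix and a sparse eigensolver would be needed, which this sketch does not cover.

```python
from itertools import product

# 6-connected neighbourhood (assumption: face-adjacent voxels are neighbours).
NEIGHBOUR_OFFSETS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def graph_laplacian(mask):
    """Build L = D - W for the voxels inside a binary mask[z][y][x]."""
    shape = (len(mask), len(mask[0]), len(mask[0][0]))
    voxels = [p for p in product(*(range(s) for s in shape)) if mask[p[0]][p[1]][p[2]]]
    index = {p: i for i, p in enumerate(voxels)}
    n = len(voxels)

    # Adjacency: 1 if both voxels are neighbours and inside the mask, else 0.
    W = [[0] * n for _ in range(n)]
    for (z, y, x), i in index.items():
        for dz, dy, dx in NEIGHBOUR_OFFSETS:
            j = index.get((z + dz, y + dy, x + dx))
            if j is not None:
                W[i][j] = 1

    # Degree matrix is diagonal with D_ii = sum_j W_ij.
    degree = [sum(row) for row in W]
    L = [[(degree[i] if i == j else 0) - W[i][j] for j in range(n)] for i in range(n)]
    return voxels, L


# Tiny 1 x 2 x 2 mask with one voxel outside the "brain".
mask = [[[1, 1],
         [1, 0]]]
voxels, L = graph_laplacian(mask)
for voxel, row in zip(voxels, L):
    print(voxel, row)
```

Each row of `L` sums to zero and the matrix is symmetric, as expected for an unnormalized graph Laplacian. The eigenvectors of this matrix, reshaped onto the mask, would give the spectral coordinates.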
2.4 Details.
In FD-FCN, there are four FE dense blocks, three transition down convolutions, one FC dense block and one classifying layer (convolution II) in total, including the first feature extractor convolution I. Convolution I has a kernel size of , with padding and stride lengths of 3 and 1. The FE dense blocks consist of HD layers and the FC dense block consists of FC layers (see Fig. 1), where the growth rate of HD layers is and that of FC layers is . The transition down convolutions adopt a kernel size of with padding and stride lengths of either 1 and 2 (the first) or 0 and 1 (the second and the third), where the included convolutional transition down (CTD) layers increase the channels by from inputs to outputs. The outputs of three FE dense blocks are concatenated before the FC dense block, where the outputs of the former two are center-cropped to maintain size consistency. In addition, the spectral and Cartesian coordinate patches are also concatenated here, both centered at the same point as the input and the output of FD-FCN. Finally, the FC and classifying layers share a convolution kernel size of and a padding length of . Note that PReLU is adopted as the activation function and BN is used in HD, CTD and FC layers. Since FD-FCN is an end-to-end approach, the widely used CRF is not adopted here. A version of FD-FCN with CRF could yield better performance if needed.
We evaluate FD-FCN on the IBSR dataset, which consists of T1-weighted MRI scans with size for all. The dataset contains expert-labelled segmentations of brain structures, among which a subset of 11 important structures is considered (see Fig. 3). In addition, we employ a -fold cross-validation strategy for unbiased estimates of model performance, where each fold is composed of training examples, validation examples and test examples.
In data arrangement, we select the input patch size and corresponding output patch size for FD-FCN as a trade-off between a large enough image region and a fast processing speed. In the training process, we randomly sample at most patches per structure from the skull-stripped MRIs, doubling the number of patches for cerebral cortex and cerebral white matter to account for the higher variability in these classes. In the segmentation process, the output patches of size are stacked to form the segmented image, and the corresponding input patches of size are cropped from the original image, centered at the same locations. Furthermore, we apply intensity normalization to the input patches by dividing by .
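The relation between stacked output patches and their larger, co-centered input patches can be sketched along a single axis. The sizes below (a volume length of 16, output size 4, input size 10) are purely illustrative assumptions, `patch_grid` is a hypothetical helper name, and the handling of borders is not described in the text, so the sketch only reports where padding would be required.

```python
def patch_grid(volume_len, out_size, in_size):
    """Tile one axis with output patches and locate the matching input patches.

    Returns a list of (out_start, in_start) pairs. A negative in_start, or an
    input patch running past volume_len, means the volume would have to be
    padded there (an assumption; the border policy is not given in the text).
    """
    if (in_size - out_size) % 2:
        raise ValueError("in_size - out_size must be even to share a centre")
    margin = (in_size - out_size) // 2
    return [(start, start - margin) for start in range(0, volume_len, out_size)]


# Illustrative sizes only (not the paper's patch sizes).
OUT_SIZE, IN_SIZE = 4, 10
grid = patch_grid(16, OUT_SIZE, IN_SIZE)
for out_start, in_start in grid:
    out_centre = out_start + OUT_SIZE / 2
    in_centre = in_start + IN_SIZE / 2
    print(f"output [{out_start}, {out_start + OUT_SIZE})  "
          f"input [{in_start}, {in_start + IN_SIZE})  centre {out_centre} / {in_centre}")
```

Applying the same tiling independently along all three axes gives the 3D patch grid. Because the output patches do not overlap, the number of forward passes per scan grows as the output patch shrinks.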
FD-FCN is implemented in the PyTorch framework. The network parameters are optimized with adaptive moment estimation (Adam) for fast convergence, using cross-entropy as the cost function. The actual learning rate at each epoch is , where the base learning rate is set to and the maximum epoch . However, we observed that the performance did not improve after epochs, allowing us to stop early at this point. In addition, the minibatch size of fills up most of the 12 GB GPU memory on the NVIDIA Tesla P100 GPU.
In the experiment, we compare the accuracy of FD-FCN against two state-of-the-art methods, FC-DenseNet and DeepNAT, on the IBSR dataset with a cross-validation strategy. The average Dice coefficients of FD-FCN, FC-DenseNet and DeepNAT are 89.81%, 86.15% and 89.76% respectively, with average IoU coefficients of 81.93%, 76.25% and 81.83%. The segmentation process of FD-FCN takes on average 53 seconds per image, compared with 21 seconds for FC-DenseNet and 73 minutes for DeepNAT. The 12 GB GPU memory rules out full-image segmentation with FC-DenseNet, let alone full-image training, so we adopt patch-based FC-DenseNet with patch resolution and observe an increase in segmentation time. In addition, we incorporate coordinates in the bottleneck layer of FC-DenseNet as a control. Fig. 3 shows the comparison of Dice and IoU scores over the 11 structures for all three methods, where the Dice and IoU coefficients show the same trend. A visual comparison of the three methods is illustrated in Fig. 3, where the rugged surface is accurately segmented by FD-FCN without spurious fragments. While DeepNAT is a highly accurate but time-consuming segmentation method, FD-FCN matches its accuracy (Dice 89.81% vs 89.76%) at a far lower segmentation time (53 seconds vs 73 minutes). While FC-DenseNet performs best among existing FCN-based methods and enables fast segmentation, FD-FCN performs clearly better than FC-DenseNet (Dice 89.81% vs 86.15%) with easier training (about 1.5 hours vs 3 days per epoch) and comparably fast segmentation (53 vs 21 seconds). The slight increase in segmentation time of FD-FCN comes from its downsampled output patch size, compared with that of FC-DenseNet. Since acquiring a T1-weighted MRI scan takes about 10 minutes on average, we believe this increase in time is negligible and FD-FCN could achieve nearly real-time segmentation.
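For reference, the two overlap measures reported above can be computed per structure as in the following sketch, which uses the standard definitions of Dice and IoU on flattened label maps. The toy label lists are made up for illustration, and `dice_and_iou` is a hypothetical helper name, not the evaluation code used in the experiments.

```python
def dice_and_iou(pred, truth, label):
    """Dice and IoU of one label between two flattened label maps."""
    p = {i for i, v in enumerate(pred) if v == label}
    t = {i for i, v in enumerate(truth) if v == label}
    if not p and not t:
        # Label absent from both maps: treat as perfect agreement (a convention).
        return 1.0, 1.0
    intersection = len(p & t)
    dice = 2 * intersection / (len(p) + len(t))
    iou = intersection / len(p | t)
    return dice, iou


# Toy flattened label maps (illustrative data only).
pred = [0, 1, 1, 2, 2, 2]
truth = [0, 1, 2, 2, 2, 0]

for label in (1, 2):
    dice, iou = dice_and_iou(pred, truth, label)
    # For a single structure, Dice = 2 * IoU / (1 + IoU).
    print(label, round(dice, 4), round(iou, 4), round(2 * iou / (1 + iou), 4))
```

For a single structure the two measures are monotonically related through Dice = 2·IoU / (1 + IoU), which is why they show the same trend. The identity does not carry over exactly to averages taken across structures.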
Furthermore, we evaluate the impact of each proposed contribution in FD-FCN through control experiments. Introducing the newly designed dense blocks yields a 1.24% improvement in Dice accuracy, and incorporating spectral and Cartesian coordinates yields a 1.37% Dice improvement. Even without these contributions, the multi-scale fully dense and fully convolutional architecture with ordinary dense blocks and no coordinates still outperforms the other 3D FCN-based methods.
4 Discussion and Conclusion
We have described FD-FCN, an FCN-based multi-scale fully dense network for semantic segmentation of brain anatomy. In the 11-structure segmentation experiment on IBSR, FD-FCN achieves the best accuracy among the compared state-of-the-art methods, with fast segmentation and easy training. In the future, we will investigate larger-scale experiments on more extensive datasets. We also intend to adopt a CRF and a fine-tuning training strategy to further improve model performance.
-  Wachinger C, Reuter M, Klein T. DeepNAT: deep convolutional neural network for segmenting neuroanatomy[J]. NeuroImage, 2018, 170: 434-445.
-  Çiçek Ö, Abdulkadir A, Lienkamp S S, et al. 3D U-Net: learning dense volumetric segmentation from sparse annotation[C]//International conference on medical image computing and computer-assisted intervention. Springer, Cham, 2016: 424-432.
-  He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
-  Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4700-4708.
-  Chen H, Dou Q, Yu L, et al. VoxResNet: Deep voxelwise residual networks for brain segmentation from 3D MR images[J]. NeuroImage, 2018, 170: 446-455.
-  Zhang R, Zhao L, Lou W, et al. Automatic segmentation of acute ischemic stroke from DWI using 3-D fully convolutional densenets[J]. IEEE transactions on medical imaging, 2018, 37(9): 2149-2160.
-  Kamnitsas K, Ledig C, Newcombe V F J, et al. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation[J]. Medical image analysis, 2017, 36: 61-78.
-  Dolz J, Desrosiers C, Ayed I B. 3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study[J]. NeuroImage, 2018, 170: 456-470.
-  Jégou S, Drozdzal M, Vazquez D, et al. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2017: 11-19.
-  Akkus Z, Galimzianova A, Hoogi A, et al. Deep learning for brain MRI segmentation: state of the art and future directions[J]. Journal of digital imaging, 2017, 30(4): 449-459.