Tissue magnetic susceptibility is a physical magnetic resonance imaging (MRI) parameter that indicates how the local magnetic field changes in tissues when an external magnetic field is applied. It reflects unique information about tissue composition, including iron and myelin[li_association_2015, Chai2019]. Focal abnormal iron accumulation in the brain has been observed in various neurodegenerative diseases, such as Alzheimer's Disease[Gong2019], Parkinson's Disease[Xuan2017] and Huntington's Disease[Chen2019]. The abnormal brain iron deposition in these diseases tends to occur in the subcortical nuclei, including the caudate nucleus (CN), globus pallidus (GP), putamen (PUT), thalamus (THA), substantia nigra (SN), red nucleus (RN), and dentate nucleus (DN). These subcortical nuclei are involved in executive functions and motor control, such as behavioral control, emotion, and motor learning[li_association_2015, florio_basal_2018]. Owing to its ability to quantitatively measure magnetic susceptibility, quantitative susceptibility mapping (QSM)[liu_susceptibility-weighted_2015] has been shown to be an important non-invasive method for monitoring neurodegenerative diseases, and has great potential in testing new therapies or drugs.
Currently, to quantitatively measure the magnetic susceptibility of the subcortical nuclei from QSM, the regions of interest (ROIs) are mostly obtained by manual delineation, which is tedious and labor-intensive, and heavily dependent on the evaluator's experience. Therefore, to improve both efficiency and accuracy, there is a pressing need for an automatic segmentation method. Conventionally in medical imaging, automatic nuclei segmentation methods were mostly atlas based[igual_fully-automatic_2011, igual_automatic_2012, xia_automatic_2007, su_thalamus_2019, li_multi-atlas_2019], where an atlas of the targeted ROIs was annotated on a standard brain, typically in the Montreal Neurological Institute (MNI) space. To segment the ROIs, the target images were first registered to the standard brain using T1-weighted imaging (T1WI) or T2-weighted imaging (T2WI) with thin slice thickness to obtain a transform between the subject-in-question's brain and the standard brain. After annotating the ROIs according to the templates, an inverse transform was applied to the ROI segmentation maps to obtain the segmentation results. Due to the dissimilarities between the subject-in-question's brain and the standard brain, one key challenge is obtaining an accurate registration between the two, so as to assure the accuracy of the ROI templates. Another challenge comes from the fact that the appearances of the gray matter nuclei are highly correlated with the anatomical structures, which may vary due to aging, tumors or other diseases. To tackle these challenges, probabilistic models are usually adopted to improve the segmentation performance [li_multi-atlas_2019, su_thalamus_2019]. By building on several subjects of both sexes and various ages, the segmentation method becomes more robust against anatomical dissimilarities, which further increases the segmentation accuracy.
Another possible way to solve the imperfect registration and the dissimilarities between the subject-in-question and the MNI space is to avoid registering with the standard brain, which requires that the algorithm understands
the tissue to be segmented. Deep learning has recently been adopted in computer vision and medical image processing thanks to its strong ability to extract features from big data. By using a large amount of labeled data, a deep artificial neural network is able to learn effective feature representations that improve performance on specific tasks. The rapid development in the parallel computing capability of graphics processing units (GPUs) further accelerates the training of complicated neural network structures on massive numbers of data samples. In the image classification task, for instance, the accuracy of the convolutional neural network (CNN) has surpassed average human performance. The strong power of the CNN in learning representations from images is mostly attributed to its convolution layers[lecun_deep_2015]. Although deeper networks have been shown to be more expressive than their shallower counterparts, simply stacking many convolution layers imposes significant difficulties in training due to the so-called gradient-vanishing problem. By introducing residual structures to the CNN, ResNet is able to build very deep yet trainable networks, and achieved remarkable performance in image recognition tasks[he_deep_2015]. With residual structures, the networks have no spurious local optima, making them easier to converge [hardt_identity_2018].
Most well-performing segmentation networks [chen_deeplab_2017, chen_encoder-decoder_2018] were designed for 2D natural images. In medical image segmentation, the U-Net structure [ronneberger_u-net_2015] and its variants have achieved remarkable performance in 2D segmentation tasks, such as segmenting cell nuclei or cell boundaries from histopathological images [isensee_nnu-net_2018, dolz_ivd-net_2019]. When processing 3D images such as CT and MR images, although methods developed for 2D images can be directly applied by treating the 2D slices as independent images, the segmentation performance suffers from the loss of inter-slice contextual information, as the CNN cannot identify the inter-slice relationships [zhang_automatic_2018]. It is, therefore, straightforward to generalize the conventional 2D convolution layers to 3D ones. Compared to 2D images, however, 3D volumetric data occupies much larger memory space, especially during training. In deep learning, the GPU has to store not only the input images, but also the activation maps of all neurons to compute the gradients during training. Therefore, when dealing with 3D volumetric images, one usually faces a dilemma between memory efficiency and segmentation accuracy: segmentation accuracy is lost due to the absence of spatial contextual information if we treat the volumetric image slices as separate 2D images, while training may become infeasible if we feed the whole volume into the network. One solution to this dilemma is to split the whole volumetric image into 3D image patches, which reduces the memory cost while preserving the inter-slice contextual information. It is the most commonly adopted approach in designing 3D-CNN based methods[cicek_3d_2016, kamnitsas_deepmedic_2016, kamnitsas_efficient_2017, chen_voxresnet:_2018, zhang_automatic_2018, isensee_nnu-net_2018, chang_brain_2018], and has achieved remarkable performance in various tasks such as brain lesion and tumor segmentation[maier_isles_2017, menze_multimodal_2015].
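The patch-splitting strategy described above can be sketched as follows. This is an illustrative NumPy sketch under assumed names and parameters (`extract_patches`, the patch size and stride are not the paper's actual settings); it simply tiles a 3D volume into sub-volumes that preserve inter-slice context within each patch.

```python
import numpy as np

def extract_patches(volume, patch_size, stride):
    """Split a 3D volume into (possibly overlapping) 3D patches.

    Hypothetical helper illustrating the patch-based strategy; the
    patch size and stride are free parameters, not the paper's values.
    Returns the stacked patches and the origin of each patch, so that
    predictions can later be stitched back into the full volume.
    """
    D, H, W = volume.shape
    pd, ph, pw = patch_size
    sd, sh, sw = stride
    patches, origins = [], []
    for z in range(0, max(D - pd, 0) + 1, sd):
        for y in range(0, max(H - ph, 0) + 1, sh):
            for x in range(0, max(W - pw, 0) + 1, sw):
                patches.append(volume[z:z + pd, y:y + ph, x:x + pw])
                origins.append((z, y, x))
    return np.stack(patches), origins

vol = np.zeros((8, 8, 8), dtype=np.float32)     # toy volume
patches, origins = extract_patches(vol, (4, 4, 4), (4, 4, 4))
# 2 x 2 x 2 = 8 non-overlapping patches of size 4^3
```

At inference time, the per-patch predictions are placed back at their recorded origins to reassemble the full segmentation map.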
Although splitting into 3D patches reduces the memory cost while preserving the spatial correlations across slices, it in turn brings another kind of spatial contextual information loss, as the fields of view (FoVs) of the CNNs are strictly limited by the patch size. When the foreground object is large, or located near the edge of a patch, the segmentation accuracy is reduced due to the loss of semantic information. To enlarge the FoV while remaining memory-feasible, Kamnitsas et al. proposed to introduce downsampled images to their DeepMedic model as an auxiliary input[kamnitsas_deepmedic_2016], where the output feature maps of the downsampled and the original image patches were fused before the last convolution layer to generate the final segmentation map. Owing to its multi-resolution input structure, DeepMedic won both the ischemic stroke lesion segmentation and the brain tumor segmentation challenges in 2015. Inspired by DeepMedic, we propose to adopt patches with different resolutions as inputs. To further improve the performance, the local and global features are fused in a layer-by-layer manner.
U-Net has been one of the most successful structures in medical image segmentation[Ronneberger2015]. It is designed as a symmetric encoder-decoder structure, with several skip connections between the encoder and decoder to refine the segmentation results, as depicted in Fig. 2a. To tackle the limited-FoV problem of 3D volumetric segmentation networks, we propose to add another encoder branch to the U-Net, as shown in Fig. 2b. The two encoders, denoted the local branch and the global branch, take patches with the same matrix size but different resolutions as inputs. In particular, the local branch uses patches at the original resolution as input to extract local patterns, while the global branch uses downsampled patches to enlarge the FoV. The feature maps from the local and global branches are fused at every layer and fed into the decoder network to generate the segmentation results. As we will show in this paper, the proposed double-branch residual-structured U-Net (DB-ResUNet) achieves better segmentation accuracy than its single-branch counterpart.
To evaluate the segmentation performance, we collected 43 subjects from Tianjin First Central Hospital (Tianjin, China), and aimed to segment seven gray matter nuclei, namely the CN, GP, PUT, THA, SN, RN and DN, in both left and right hemispheres. By training on 20 subjects, the proposed DB-ResUNet is able to achieve much better performance than the atlas-based method and the single-branch U-Net.
II-A Residual-Structured U-Net
In this paper, a 3D residual-structured U-Net (ResUNet) is first proposed as a baseline model for nuclei segmentation, as depicted in Fig. 4. The ResUNet employs a structure similar to the U-Net, with several modifications made to improve the performance.
First, we adapt the original U-Net, which takes 2D images as input, into a 3D CNN, since each nucleus appears across several slices, and better segmentation accuracy is expected by utilizing the inter-slice spatial contextual information.
Second, the stacked convolution layers in the original U-Net are replaced by residual blocks, as shown in Fig. 3. As pointed out in [he_deep_2015], the residual structure enables the network parameters to be updated from the beginning of training, and elegantly alleviates the gradient-vanishing problem.
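For concreteness, a 3D residual block in the spirit of He et al. can be sketched in PyTorch as below. This is a minimal illustrative sketch, not the paper's exact block: the layer ordering, normalization choice, and channel counts are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Minimal 3D residual block sketch: two 3x3x3 convolutions with
    batch normalization and a skip connection. A 1x1x1 projection is
    used on the skip path when the channel counts differ."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm3d(out_ch)
        self.conv2 = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.proj = (nn.Conv3d(in_ch, out_ch, kernel_size=1)
                     if in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # the skip connection: add the (projected) input back in
        return self.relu(out + self.proj(x))

x = torch.randn(1, 1, 8, 8, 8)          # toy single-channel 3D patch
y = ResidualBlock3D(1, 16)(x)            # spatial size preserved
```

Because the identity path lets gradients flow around the convolutions, stacks of such blocks remain trainable at depths where plain stacked convolutions would suffer from vanishing gradients.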
Due to limited GPU memory, the images are divided into patches before being fed into the network. Note that although feeding patches instead of the whole image is a regular approach in 3D CNNs, utilizing only local context information from the patches may lead to segmentation performance loss. To this end, we further propose to incorporate global context information without significantly increasing the number of parameters or the memory cost, as introduced in detail in the next subsection.
Fig. 2b illustrates the overall structure of our proposed double-branch residual-structured U-Net (DB-ResUNet). We propose to segment on 3D volumetric patches because each nucleus appears across several slices, and using 3D patches efficiently utilizes the cross-slice information. To trade off between memory efficiency and a large FoV, the proposed model is basically an encoder-decoder structure that integrates both local patches, which have a higher resolution but a smaller FoV, and global patches, which are downsampled but have a larger FoV, to generate an accurate segmentation. Although the local and global patches have different resolutions, we propose to use patches with the same matrix size. We employ the residual block shown in Fig. 3 as the basic building block for feature extraction, and adopt a weighted cross-entropy loss function to improve its optimization.
II-B1 Local Branch
Fig. 5 presents the main branch of our proposed DB-ResUNet. The local patches are fed into this local branch, and the final segmentation map is generated with the assistance of the feature maps extracted from the global branch. The local branch is basically a U-Net with residual blocks, similar to the ResUNet structure introduced in Sec. II-A. We directly cut the images into patches and feed them into the network. The encoder part employs a structure similar to the 3D U-Net, except that we use residual blocks instead of stacked convolution layers to extract features.
In the decoder part, the feature maps are upsampled by learnable transposed convolution layers. The upsampled feature maps are concatenated with the feature maps from both the local and global branch encoders, followed by a residual block. The feature maps from the global branch compensate for the loss of spatial contextual information due to the limited patch size, and are beneficial in improving the segmentation accuracy.
II-B2 Global Branch
Fig. 6 illustrates the global branch encoder. The global branch employs a fully convolutional network (FCN) structure. The original input image is first downsampled and then fed into the network. The output feature maps of each residual block are cropped to the same FoVs as the corresponding feature maps in the local branch, upsampled to the same spatial resolution, and then concatenated to the decoder part of the local branch.
The global branch also generates its own downsampled segmentation map by employing an FCN-8s head. This downsampled segmentation map is likewise cropped, upsampled and concatenated with the feature maps before the final convolution layer of the local branch.
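The crop-upsample-concatenate fusion between the two branches can be sketched as follows. This is an illustrative PyTorch sketch under the assumption of a 2x-downsampled global branch; the function name `fuse_global_local`, the coordinate convention, and the feature sizes are all assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def fuse_global_local(local_feat, global_feat, offset):
    """Align a global-branch feature map with a local-branch one.

    Assuming the global branch works on a 2x-downsampled image, the
    sub-region of `global_feat` covering the local patch's FoV is half
    the local spatial size. We crop it at `offset` (given in global
    feature coordinates), upsample it trilinearly to the local size,
    and concatenate along the channel dimension.
    """
    d, h, w = local_feat.shape[2:]
    cd, ch, cw = d // 2, h // 2, w // 2          # 2x downsampling assumed
    z, y, x = offset
    crop = global_feat[:, :, z:z + cd, y:y + ch, x:x + cw]
    up = F.interpolate(crop, size=(d, h, w), mode='trilinear',
                       align_corners=False)
    return torch.cat([local_feat, up], dim=1)

local = torch.randn(1, 8, 16, 16, 16)            # local-branch features
glob = torch.randn(1, 8, 16, 16, 16)             # global-branch features
fused = fuse_global_local(local, glob, (0, 0, 0))
```

The decoder then processes `fused`, so each decoder stage sees both fine local detail and coarse long-range context.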
II-C Loss Function
The loss function measures the difference between the prediction and the label, and provides gradients to update the network parameters. In our work, we use the weighted sum of the losses from both the global and local branches:

$$\mathcal{L}(\theta) = \mathcal{L}_{local}\big(f_l(x;\theta), y\big) + \lambda\, \mathcal{L}_{global}\big(f_g(x;\theta), y_g\big),$$

where $x$ and $y$ are the data and the corresponding labels, respectively, $\theta$ denotes the network parameters, and $f_l$ and $f_g$ denote the predicted outputs of the local and global branches. $\mathcal{L}_{local}$ and $\mathcal{L}_{global}$ denote the loss functions of the final segmentation and the global auxiliary result, respectively, $y_g$ is the downsampled label, and $\lambda$ is a tradeoff constant. The global branch loss in fact serves as a regularizer, and helps improve the segmentation accuracy.
We use a weighted cross-entropy loss in both the global and local loss functions. Considering that there are significantly more background voxels than foreground ones, different weights were assigned to the background and foreground categories in both losses.
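A minimal sketch of this combined weighted cross-entropy loss in PyTorch is shown below. The class count, the class weights, and the tradeoff value are placeholders for illustration, not the paper's tuned values.

```python
import torch
import torch.nn as nn

# Assumed setup: background + 7 nuclei per hemisphere = 15 classes.
num_classes = 15
weights = torch.ones(num_classes)
weights[0] = 0.1        # down-weight the dominant background class (placeholder)
lam = 1.0               # tradeoff between local and global terms (placeholder)

ce = nn.CrossEntropyLoss(weight=weights)

def total_loss(local_logits, local_target, global_logits, global_target):
    """Weighted sum of the local (full-resolution) segmentation loss and
    the global-branch auxiliary loss, the latter acting as a regularizer."""
    return ce(local_logits, local_target) + lam * ce(global_logits, global_target)

# Toy tensors: (batch, classes, D, H, W) logits and (batch, D, H, W) labels.
ll = torch.randn(2, num_classes, 4, 4, 4)
lt = torch.randint(0, num_classes, (2, 4, 4, 4))
gl = torch.randn(2, num_classes, 4, 4, 4)
gt = torch.randint(0, num_classes, (2, 4, 4, 4))
loss = total_loss(ll, lt, gl, gt)
```

`nn.CrossEntropyLoss(weight=...)` applies the per-class weights voxel-wise, which is what makes the loss robust to the background/foreground imbalance.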
II-D Evaluation Metrics
The Dice coefficient (DC) is adopted to evaluate the segmentation performance. In particular, for the $i$-th category, the DC is defined as

$$DC_i = \frac{2\,|P_i \cap G_i|}{|P_i| + |G_i|},$$

where $P_i$ and $G_i$ denote the predicted and ground-truth regions, respectively, and $|\cdot|$ denotes the area (the number of voxels).
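The Dice coefficient is straightforward to compute from label maps; a small NumPy sketch (the function name is illustrative):

```python
import numpy as np

def dice_coefficient(pred, gt, label):
    """Dice coefficient for one category: 2|P ∩ G| / (|P| + |G|),
    where P and G are the sets of voxels predicted as / labeled as
    `label`, and |.| counts voxels."""
    p = (pred == label)
    g = (gt == label)
    denom = p.sum() + g.sum()
    if denom == 0:
        return 1.0   # both empty: treat as perfect agreement
    return 2.0 * np.logical_and(p, g).sum() / denom

# Toy 2x2 example: |P|=3, |G|=2, |P ∩ G|=2 -> DC = 4/5 = 0.8
pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [1, 0]])
dc = dice_coefficient(pred, gt, label=1)
```

A DC of 1 indicates perfect overlap with the manual annotation, while 0 indicates no overlap at all.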
III Experiment Results
III-A Data Acquisition and Preprocessing
We collected 43 subjects from Tianjin First Central Hospital (Tianjin, China), with a mean age of . Ethical approval was granted by the Tianjin First Central Hospital Ethics Committee. The 3D T1-weighted imaging (T1WI) and the susceptibility weighted imaging (SWI) were acquired using a 3.0T MR scanner (TRIO scanner, Siemens Healthineers, Erlangen, Germany) with an 8-channel phased-array head coil. The voxel sizes for the 3D T1WI and the SWI are and , respectively. The corresponding matrix sizes for the 3D T1WI and the SWI are and , respectively. The magnitude and phase images of the SWI were used to generate the QSM following the method in [Chen2018].
As shown in Fig. 1, in this paper we focus on the segmentation of gray matter nuclei, including the CN, GP, PUT, THA, SN, RN and DN, which were manually annotated by a trained neuroradiologist (Dr. Chao Chai, with 9 years' experience in neuroimaging). The whole dataset was randomly split into training, validation and test sets of 20, 4 and 19 subjects, respectively. The training and validation sets were used for tuning the network parameters and hyperparameters; the test set was used only for evaluating the segmentation performance.
Note that the T1WI and the SWI differ in both spatial resolution and matrix size. The T1WI was first registered to the corresponding SWI using a rigid affine transform with mutual information as the criterion, and then resampled to the same spatial resolution and matrix size as the SWI using a linear interpolator. Both the QSM and the T1WI were then zero-padded to matrix sizes of before use. Finally, we clipped the intensities of the QSM and the T1WI to the ranges of and , respectively, and linearly rescaled them to the range of . The T1WI and the QSM were then concatenated as dual-channel images before being fed into the network.
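The intensity clipping and rescaling step can be sketched as follows. This is an illustrative NumPy sketch; the clipping windows used below are placeholders (the actual QSM and T1WI windows are dataset-specific and not reproduced here), and the target range is assumed to be [0, 1].

```python
import numpy as np

def clip_and_rescale(img, lo, hi):
    """Clip intensities to [lo, hi] and linearly rescale to [0, 1].

    `lo` and `hi` are modality-specific clipping bounds (placeholder
    values are used in the example below).
    """
    img = np.clip(img.astype(np.float32), lo, hi)
    return (img - lo) / (hi - lo)

# Toy QSM values (ppm-like scale); window [-0.2, 0.2] is illustrative.
qsm = np.array([-0.5, 0.0, 0.3], dtype=np.float32)
norm = clip_and_rescale(qsm, -0.2, 0.2)

# After normalizing each modality, the two channels would be stacked,
# e.g. np.stack([qsm_norm, t1_norm]), to form the dual-channel input.
```

Clipping before rescaling prevents a few extreme voxels (e.g. vessels or artifacts) from compressing the useful intensity range of the whole volume.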
III-B Implementation Setup
The experiments were performed on a workstation with an Intel Core i7-7700K CPU, 64GB RAM and an Nvidia GeForce 1080Ti GPU with 11GB memory. The workstation ran Linux (Ubuntu 14.04 LTS) with CUDA 8.0. The network was implemented in PyTorch v1.0[Paszke2017]. The MR image files were stored in the Neuroimaging Informatics Technology Initiative (NIfTI) format and processed using the Simple Insight Toolkit (SimpleITK)[Lowekamp2013]. The visualized results were presented using ITK-SNAP[Yushkevich2006].
Data augmentation, including random flipping along the three axes and random rotation around the z-axis, was adopted to prevent overfitting, where the rotation was restricted to the axial plane within an angle range of . We adopted the Adam method[Kingma2014] as the optimizer, and set the initial learning rate as with and , and a weight decay of . Due to the limited memory space on a single GPU, the batch size was set to . The learning rate was reduced by a factor of if no progress was observed on the validation loss, and training was terminated when the learning rate fell below . The tradeoff coefficient was set to 1.
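The training schedule described above (Adam with plateau-based learning-rate decay and a minimum-learning-rate stopping rule) can be sketched in PyTorch as below. The concrete hyperparameters (initial learning rate, decay factor, patience, minimum learning rate) are placeholders, since the paper's exact values are not reproduced here.

```python
import torch

model = torch.nn.Conv3d(2, 15, kernel_size=3, padding=1)  # stand-in for DB-ResUNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), weight_decay=1e-5)
# Reduce the lr by a factor when the validation loss stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=5)

min_lr = 1e-7
for epoch in range(3):            # toy loop; a real loop trains and validates
    val_loss = 1.0                # stand-in validation loss
    scheduler.step(val_loss)
    lr = optimizer.param_groups[0]['lr']
    if lr < min_lr:               # stopping criterion from the text
        break
```

`ReduceLROnPlateau` tracks the best validation loss seen so far and only decays the learning rate after `patience` epochs without improvement, which matches the "reduce if no progress on the validation loss" rule.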
III-C Experiment Results
Method | # of parameters | Inference Time (GPU) | Inference Time (CPU)
Fig. 7 presents some examples of the predicted segmentation maps of the proposed ResUNet and DB-ResUNet. For comparison, the segmentation results of an atlas-based method [li_multi-atlas_2019] and another deep-learning-based method, the 3D-UNet[cicek_3d_2016], are also presented. The 3D-UNet was trained on the same training samples with the same patch size as the ResUNet and DB-ResUNet. The results of the atlas-based method [li_multi-atlas_2019] were obtained by uploading our test data samples to their website (http://www.mricloud.org).
As we can see from Fig. 7, the segmentation maps produced by the atlas-based method were less smooth and contained many false positives. The deep-learning methods, on the other hand, were able to extract features at various levels, leading to much smoother segmentations, as can be clearly observed from the 3D views presented in the last row of Fig. 7.
The numerical results on the test set are summarized in Tab. I. As we can see, the deep learning methods were in general much better than the MRICloud results. The atlas-based method must align the images to the MNI template, so that the nuclei segmentation can be performed by comparing the registered image with the images in the training set in a pixel-by-pixel manner; it is therefore more vulnerable to deformations of the brain structure. The deep-learning-based methods, however, enable the neural network to extract multi-level features by itself, making them more robust to image deformation and leading to better segmentation performance.
We can also observe from Tab. I that, thanks to the residual structure, the ResUNet achieved better segmentation accuracy than the 3D-UNet. In fact, it was shown in [hardt_identity_2018] that the use of residual structures makes the objective function smoother, bringing the convergence point closer to the global optimum. The DB-ResUNet achieved the best segmentation performance overall. The improvement should be attributed to the larger FoV brought by the global branch: with a larger FoV, the network is able to incorporate more spatial context information, making the voxel-wise classification more accurate.
Fig. 8 further compares the susceptibility values and the volumes between the predicted and manual segmentations, together with the regression lines. As we can see, the deep-learning-based methods were more accurate in both susceptibility values and volumes, with the DB-ResUNet achieving the best performance, which highlights the effectiveness of the proposed method.
Another advantage of deep-learning-based methods lies in their fast inference. As reported in [li_multi-atlas_2019], the method adopted in MRICloud takes h to process one subject. The deep learning methods, as summarized in Tab. II, consume much less time, with the proposed DB-ResUNet taking an average of s to generate the segmentation map for each subject. For a fair comparison, we also evaluated the trained network on the CPU (Intel Core i7-7700K with 64GB RAM). As Tab. II shows, the CPU running time, despite being significantly longer than the GPU running time, is still much shorter than that of the atlas-based method, which highlights the computational advantage of deep learning.
The numbers of parameters of the deep-learning methods are also summarized in Tab. II. As we can see, by replacing the stacked convolution layers with residual blocks, the ResUNet achieves better segmentation performance while significantly reducing the number of parameters. Also note from Figs. 4 and 5 that the numbers of filters in both the global and local branches of the DB-ResUNet were halved compared to the ResUNet. Therefore, despite the additional branch, the number of parameters is further reduced, which brings a shorter computation time during inference.
IV-A Impact of Downsampling Rate
In the global branch of the proposed DB-ResUNet, the image patches were cropped from the image downsampled by a factor of 2, i.e., the width and height were halved compared to the original image. Note that compared to the baseline ResUNet, the DB-ResUNet also introduces an auxiliary output for the global branch, whose loss is added as a regularizer. It is, therefore, necessary to verify whether the performance gain of the DB-ResUNet really comes from the enlarged FoV.
Tab. III presents the numerical results where the global branch input patches were cropped from images downsampled at different rates. As we can see, the 2x case achieves the best segmentation accuracy, while the 4x case performs worse on small nuclei such as the SN, RN and DN. Intuitively, with a larger downsampling rate, the global branch input brings more spatial contextual but less detailed information, leading to worse segmentation accuracy for small nuclei.
IV-B Segmentation on Either T1WI or QSM
It has been shown in Sec. III-C that the proposed segmentation method achieves high segmentation accuracy. One limitation of the T1WI/QSM-based method is that both T1WI and QSM images must be acquired, which may not be feasible in all studies. Therefore, we further assess the performance of the proposed DB-ResUNet when only the QSM or the T1WI is available.
In particular, we adopted either the T1WI or the QSM as a single-channel image and fed it into the network. Fig. 9 presents visualized examples of the segmentation results. For comparison, the segmentation results using both the T1WI and the QSM are also plotted. As we can see from Fig. 9, using only one image may lead to worse performance.
Tab. IV presents the numerical evaluation results. As we can see, with only the T1WI, the method achieved the worst segmentation accuracy due to the low T1 contrast in the iron-rich deep gray matter nuclei, which was also confirmed by the comparison between the predicted and manually annotated results in both susceptibility values and volumes shown in Fig. 10. In fact, the QSM, despite its low structural resolution, has an advantage in depicting the iron-accumulating deep gray matter nuclei. Our study found that the segmentation accuracy was highest with the combined QSM/T1WI input, and that using only the QSM outperformed using only the T1WI. In atlas-based nuclei segmentation, most pipelines [igual_fully-automatic_2011, igual_automatic_2012, su_thalamus_2019] require the T1WI, so that an accurate registration between the subject-in-question's brain and the MNI template can be achieved. Our method, on the other hand, suggests that with deep learning, a reasonably accurate nuclei segmentation can be achieved even when only the QSM is available.
In this paper, we proposed a 3D CNN-based method to segment gray matter nuclei from the QSM and T1WI. To tackle the loss of FoV that comes from splitting 3D volumetric images into patches, we proposed a double-branch network that takes patches at both the original and a lower resolution as input. The proposed method achieved much higher segmentation accuracy than either the atlas-based method or the conventional 3D-UNet. The contributions of the T1WI and the QSM were also discussed; experimental results imply that the QSM contributed more to distinguishing the nuclei, making accurate quantitative susceptibility assessment possible even when 3D T1WI is absent.