Automated segmentation of volumetric medical images is a challenging task. The more recent development of deep neural networks, and in particular convolutional neural networks (CNN) , suggests a course of methods . Contrary to conventional shallow learning methods, where feature design is crucial, deep learning methods automatically learn hierarchies of relevant features directly from the training data. Early works 
treat the image segmentation as a classification problem with sliding window where CNNs are applied to input patches to classify the central pixel/voxel of each patch. As classifying each pixel/voxel in a sliding window fashion results in orders of magnitude of redundant calculation, most of recent works[10, 3] are based on Fully Convolutional Networks (FCNs) , which can process the input image in an end-to-end way and can provide a full resolution segmentation map . In several biomedical image segmentation benchmarking competitions, methods built on CNNs [10, 14] are on the top list of the associated leaderboard. Despite the fact that CNNs-based methods have achieved state-of-the-art performance in many different 2D medical image analysis tasks, in clinical practice, however, a large part of the medical imaging data available is in 3D. This has motivated the development of 3D CNNs for volumetric image segmentation in order to benefit from more spatial context. For example, Kamnitsas et al.  proposed a dual pathway, 11 layers deep 3D multi-scale CNN with fully connected Conditional Random Field (CRF) for brain lesion segmentation and achieved state-of-the-art performance. Çiçek et al.  proposed the 3D U-net as an extension to the 2D U-net by replacing all 2D operations with their 3D counterparts. By incorporation of residual blocks and using a similar architecture as the 3D U-net, Milletari et al.  proposed the 3D V-net for volumetric medical image segmentation. One thing common to all these 3D CNNs-based approaches is that they all follow a fully convolutional downsample-upsample pathway. More specifically, the downsampling path tries to achieve higher-level feature abstraction by gradually downsampling low-level features with high spatial resolutions while the upsampling path aims to upsample the learned high-level features to achieve a full-resolution segmentation. Deviating from the fully convolutional downsample-upsample pathway, Li et al.  proposed a high-resolution network architecture referred as “HighRes3DNet” for the segmentation of fine structure in volumetric images. HighRes3DNet preserves the spatial resolution throughout the layers and the enlargement of the receptive field is then achieved by incorporating dilated convolution.
Due to GPU memory restrictions caused by moving to fully 3D, state-of-the-art methods [2, 8, 3, 5] depend on subvolume/patch processing. The size of the input patch is usually small if no specialized hardware with large GPU memory is used, limiting the incorporation of larger context information for a better performance. To tackle these challenges, we present a novel and efficient approach which allows for using large size of patches for an effective and efficient semantic segmentation of volumetric images by leveraging context information in a large patch. Our contributions can be summarized as follows:
First, we propose a novel Holistic Decomposition Convolution (HDC), which can be regarded as an inverse operation to the previously introduced Dense Upsampling Convolution (DUC) [11, 12]. HDC consists of a periodic down-shuffling operation followed by a conventional 3D convolution. HDC has the advantage of significantly reducing the size of the data for sub-sequential processing while using all the information available in the input irrespective of the down-shuffling factors. We apply HDC directly to the input data, whose output will be used as the input for sub-sequential CNNs. In order to achieve volumetric dense prediction at final output, we need to recover full resolution, which is done by using DUC.
Second, we extensively validate the proposed approach on the task of segmentation of hip bony structures from T1 MR images of limited field of view. We show that HDC and DUC are network agnostic and can be combined with different FCNs for an improved performance. More specifically, we demonstrate that the improved performance can be obtained when HDC and DUC are used with 3D U-net, 3D V-net , and HighRes3DNet , respectively. We investigate the influence of the down-shuffling factors on the segmentation results.
Third, in addition to the hip MR image segmentation task, we apply the proposed approach off-the-shelf to a typical yet highly challenging segmentation task, i.e., intervertebral disc (IVD) segmentation from T2 MR images. We conduct comprehensive cross-validation experiments on an open dataset to compare the performance of our approach with that of state-of-the-art methods. We have achieved better segmentation results than state-of-the-art methods.
In this section, we will first briefly present the usage of DUC for semantic segmentation, followed by a detailed description of HDC. We will then show how to combine HDC and DUC with FCNs for effective segmentation of volumetric images.
2.1 Dense Upsampling Convolution for Semantic Segmentation
For a typical FCNs-based approach that follows the downsample-upsample pathway, in order to achieve volumetric dense prediction, we need to recover full resolution at output. Conventional methods such as bilinear upsampling  is not attractive as the upsampling parameters are not learnable. Deconvolution could be an alternative but, unfortunately, it can easily lead to “uneven overlap”, resulting in checkboard artifacts . In 
, DUC, which consists of low-resolution convolution with a periodic up-shuffling operator, was proposed to jointly learn the feature extraction and upsampling weights for super-resolution reconstruction. DUC was later used as the last layer for semantic segmentation in. For details about DUC, we refer to previouw work [11, 12].
2.2 Holistic Decomposition Convolution
HDC can be regarded as the inverse operation to DUC and as shown in Fig. 1, it consists of a periodic down-shuffling operator with low-resolution convolution. HDC is designed to be directly applicable to the input data with the aim to reduce the size of the data for sub-sequential processing while using all the information available in the input irrespective of the down-shuffling factors. This is also the reason why we call this novel operation as “Holistic Decomposition Convolution”. More specifically, let’s assume that the size of input data () is and the size of the output from HDC is , where are the down-shuffling factors along the three spatial axes, respectively; is the number of channels in the input data; is the number of feature maps in the output of HDC. Instead of applying convolution to high resolution (HR) images, we first apply a periodic down-shuffling operator to the input data to get channels of feature maps with low resolution (LR) and then apply convolutions with a kernel size of to get the feature maps of size . Mathematically, this can be described as:
is an non-linear activation function that is applied element-wise;, are trainable weights and bias, respectively;
is a periodic down-shuffling operator which aims to rearrange the tensor () in the shape of to the tensor () in the shape of . And the operation can be mathematically described as below:
where are the coordinates of the voxels in the low resolution space, and .
2.3 HDC and DUC Augmented FCNs for Volumetric Image Segmentation
Both DUC and HDC are network agnostic and can be combined with existing FCNs such as 3D U-net , 3D V-net , and HighRes3dNet  for semantic segmentation as shown in Fig. 2, as long as the dimensions of the output from HDC satisfy the input requirement of the deep neural networks. Fig. 3 shows an example of combining HDC and DUC with 3D U-net for segmenting 3D hip MR images of limited field of view. The advantage of such a pipeline is apparent. When a HDC with down-shuffling factors of is applied to the input data, both the computational and the storage cost for the underlying 3D U-net will be reduced by a factor of , allowing one to use large patch as the input. The full resolution segmentation map is then obtained at the final output by applying a DUC with up-shuffling factors of . To differentiate from the original 3D U-net, we call the 3D U-net augmented with HDC and DUC as 3D large patch U-net (3D LP-U-net). Similarly we can derive 3D LP-V-net and LP-HighRes3DNet respectively by augmenting the original 3D V-net  and HighRes3DNet  with HDC and DUC. In this study, we take the original 3D U-net, 3D V-net and HighRes3DNet as the baseline to evaluate the performance of the associated networks augmented with HDC and DUC. For all the studies, a combination of cross entropy loss with Dice loss as introduced in  is used.
2.4 Implementation Details
All methods reported in this study were implemented in Python using TensorFlow framework and were trained and tested on a desktop with a 3.6 GHz Intel(R) i7 CPU and a NVIDIA GTX 1080 Ti graphics card with 11 GB GPU memory. We empirically fixed the number of output feature maps from the HDC as
. All the weights were initialized from Gaussian distribution (,
) and were then updated by the stochastic gradient descent (SGD) algorithm (momentum = 0.9, weight decay = 0.005). For the baseline networks, the initial learning rate was chosen to be 0.001 and was halved every 3000times iterations. For the networks augmented with HDC and DUC, depending on the shuffling factors, different initial learning rates were used as described below. During a training stage, we randomly cropped sub-volume patches of a fixed size from training samples. Each sampled patch was normalized as zero mean and unit variance before fed into network. During a testing stage, given a test volumetric image, we extracted overlapped sub-volume patches with the same size as we used in the associate training stage and fed them to the trained network to get prediction probability maps. For the overlapped voxels, the final probability maps would be the average of the overlapped patches, which were then used to derive the final segmentation results.
3 Experiments and Results
In this section, we present experimental results of the proposed pipeline for volumetric image segmentation. Two datasets, i.e., an in-house dataset consisting of 25 T1 hip MR images with limited field of view and a publicly available dataset from the MICCAI 2015 IVD localization and segmentation challenge , are used in our study. More specifically, first, we conduct an ablation study on the in-house hip dataset to evaluate the influence of the shuffling factors and of the underlying FCNs on the performance of the proposed pipeline. Based on the findings from the ablation study, we choose the 3D LP-U-net for our remaining studies. Following 
, we used Dice Overlap Coefficients (DOC), Average Surface Distance (ASD) and Hausdorff Distance (HD) as the evaluation metrics.
3.1 Ablation study on hip MR images with limited field of view
3.1.1 Data and augmentation
In this study, we used 25 3D T1 MR images, acquired from patients with hip pain. Those images were acquired by using a dual-flip angle 3D gradient-echo technique (TR/TE = 15/3.3 ms; flip angles: and ; slice thickness: 1.0mm; field of view: ). All images were resampled to have a uniform size of voxels with an average voxel spacing of 0.374mm 0.363mm
1.078mm. Slice by slice manual segmentation was used to create the reference ground truth segmentation. We randomly distributed the 25 datasets into two groups with one group containing 20 datasets as the training data and the remaining 5 datasets as the testing data. During training, data augmentation was used to enlarge the training samples. Specifically, we applied a smooth deformation field on both data and ground truth labels. For this, we sampled random vectors from a normal distribution with a standard deviation of 15 voxels in a
grid of control points and then applied a B-spline interpolation. For each training sample, we generated four additional augmented samples. All the networks used in this study were trained on the augmented training data for 10,000 iterations.
|Patch size||(50, 50, 40)||(96, 96, 96)||(200, 200, 40)|
|DOC (%)||37.45 5.73||30.62 3.55||91.30 5.84||95.89 1.21||92.06 5.37||96.84 0.90|
|ASD (mm)||28.15 5.04||29.27 4.90||5.11 7.57||1.41 1.11||0.88 0.76||0.63 0.31|
|HD (mm)||111.10 10.41||95.1 8.98||33.92 29.0||29.78 20.35||13.71 5.07||10.85 6.09|
3.1.2 Ablation study
We first investigated the influence of patch sizes on the performance of the original 3D U-net. The results are presented in Table 1. It was observed that better performance was obtained when larger patch size was used. Due to the GPU memory constraint, is the maximum size that we can use.
We then examined the effect of different shuffling factors on the performance of the 3D LP-U-net when a fixed patch size of was used. The results are reported in Table 2. From this table, we can see that (1) the higher the shuffling factors, the bigger the initial learning rate that we used; (2) the higher the shuffling factor, in general the less accurate the results but the best results were achieved when the shuffling factor was (4, 4, 2); (3) even with a shuffling factor as high as (25, 25, 2), we still get sub-millimeter segmentation accuracy for both the acetabulum and the proximal femur; and (4) in comparison with the results reported in Table 1, 3D LP-U-net achieved better results than the original 3D U-net with the largest patch size when the shuffling factor was smaller than (16, 16, 2).
We further analyze the learning process of the proposed approach and the 3D U-net. As shown in Fig. 4, in all cases, as the training loss goes down, the validation loss decreases consistently, demonstrating that there is no serious over-fitting for all models even with such small datasets. From Fig. 4, we observe that due to the smaller patch size allowed by the 3D U-net, its learning curves are not smooth. Furthermore, the 3D U-net with large patch size has lower losses on both training and validation datasets than the one with small patch size, demonstrating the importance of using large patch size.
When comparing the learning curves of the 3D LP-U-net and the 3D U-net in Fig. 4, clear distinctions can be observed. First, due to the usage of large patch size, the learning curves of 3D LP-U-net are quite smooth. More importantly, the 3D LP-U-net not only converges much faster than the 3D U-net but also produces much lower losses on both training and validation datasets. It is also interesting to observe that for the 3D LP-U-net, in general, the bigger the shuffling factors, the larger the converged losses but the best results were obtained when the shuffling factor was (4, 4, 2). Such a qualitative observation was consistent with the quantitative results shown in Table 2. These results also demonstrate that the proposed HDC can effectively speed up the training procedure by overcoming optimization difficulties via learning better context features from large patches.
Fig. 5 visually compares the segmentation results obtained by the 3D LP-U-net with a fixed patch size of but different shuffling factors and the 3D U-net with different patch sizes. In this figure, we show both the overall segmentation and the probability of each structure as well as the results around the hip joint. From this figure, we observe that (1) less false positive segmentation was observed when comparing the results obtained by the 3D LP-U-net with those by the 3D U-net; and (2) for the 3D LP-U-net, the larger the shuffling factors, the higher the uncertainty around the boundary.
|Shuffling factors||(2, 2, 2)||(4, 4, 2)||(8, 8, 2)||(16, 16, 2)||(25, 25, 2)|
|Initial learning rate||1.0E-03||2.0E-03||3.0E-03||5.0E-03||2.0E-02|
|DOC (%)||96.77 1.27||97.41 1.34||96.77 1.26||97.95 0.63||96.30 0.97||97.25 0.59||94.24 1.73||95.75 1.02||91.57 2.03||93.82 1.52|
|ASD (mm)||0.39 0.28||0.43 0.28||0.39 0.28||0.33 0.16||0.37 0.11||0.41 0.09||0.62 0.23||0.64 0.16||0.86 0.25||0.96 0.25|
|HD (mm)||7.73 3.81||6.23 2.32||8.57 5.68||3.59 3.95||6.97 3.15||5.15 1.43||10.69 7.44||6.39 1.50||12.64 2.87||8.18 0.66|
|DOC (%)||88.71 6.21||92.27 3.68||92.78 0.50||96.67 0.85||90.66 6.68||86.18 5.08||93.04 4.31||93.58 2.46|
|ASD (mm)||1.77 1.29||1.75 0.77||0.97 0.97||0.59 0.23||1.80 2.36||2.37 0.59||1.77 2.34||1.50 0.82|
|HD (mm)||15.77 6.16||14.0 3.70||12.15 6.81||9.92 4.26||15.94 11.70||17.27 4.93||22.79 13.92||16.67 6.53|
|Results of the 3D LP-V-Net with a fixed patch size of but different shuffling factors|
|Shuffling factors||(2, 2, 2)||(4, 4, 2)||(8, 8, 2)||(16, 16, 2)||(25, 25, 2)|
|DOC (%)||95.58 1.43||97.11 0.63||94.98 1.81||96.62 0.38||93.21 1.74||94.55 0.88||91.66 2.06||93.45 1.40||90.05 2.83||92.69 1.47|
|ASD (mm)||0.63 0.58||0.49 0.15||0.56 0.34||0.51 0.06||0.69 0.25||0.82 0.14||0.85 0.27||1.0 0.22||1.05 0.36||1.10 0.21|
|HD (mm)||11.21 9.97||7.24 2.01||8.51 4.33||5.97 1.74||10.77 6.88||6.76 0.97||11.22 4.76||7.85 1.26||11.73 6.62||7.48 1.29|
|Results of the LP-HighRes3DNet with a fixed patch size of but different shuffling factors|
|Shuffling factors||(4, 4, 1)||(4, 4, 2)||(8, 8, 2)||(16, 16, 2)||(25, 25, 2)|
|DOC (%)||95.99 1.18||97.38 0.52||95.35 1.30||96.62 1.08||93.72 1.69||95.52 0.94||91.15 2.09||92.41 1.69||88.21 2.48||90.0 2.24|
|ASD (mm)||0.43 0.18||0.44 0.12||0.53 0.29||0.60 0.33||0.66 0.21||0.75 0.19||0.90 0.25||1.22 0.32||1.27 0.51||1.55 0.36|
|HD (mm)||8.21 4.05||6.76 2.68-||8.48 4.33||7.85 4.11||11.31 4.47||8.38 3.75||12.95 7.14||9.53 2.25||16.23 5.12||9.62 2.30|
Finally, we checked the influence of different architectures of the underlying FCNs on the performance of the proposed pipeline. Table 3 shows the results when the original 3D V-net and the original HighRes3DNet were used with different patch sizes. Please note that caused by high spatial resolution, HighRes3DNet  requires largest GPU memory to store intermediate results among all three architectures, though it has the smallest number of training parameters. Thus, the maximally allowed size of the input patch for the HighRes3DNet was . In comparison, the results of the 3D LP-V-net and the LP-HighRes3DNet with different shuffling factors are reported in Table 4. From the results reported in Tables 2, 3 and 4, we can observe that (1) results achieved by the 3D LP-V-Net and the LP-HighRes3DNet are better than those achieved by the associated baseline when the chosen shuffling factor is not too big. For example, even with a shuffling factor of (8, 8, 2), the performance of the LP-HighRes3DNet is much better than that achieved by the original HighRes3DNet with the largest patch size allowed; and (2) the bigger the shuffling factor, the less accurate the results.
3.2 Validation on hip MR images
We conducted a standard 5-fold cross validation study on the 25 T1 hip MR images with limited field of view. We used the same data augmentation strategy and the same training strategy as we used in the ablation study. In this study, for the 3D LP-U-net, we chose a fixed patch size of and a fixed shuffling factor of (4, 4, 2). We compared the performance of the 3D LP-U-net with state-of-the-art methods such as 3D U-net , 3D V-net , and HighRes3dNet . For the 3D U-net and the 3D V-net, the chosen patch size is while for the HighRes3DNet, the patch size was chosen to be . The top row of Fig. 6 shows boxplots for overall DOC, ASD and HD of all four methods for segmenting the acetabulum. An average DOC of 96.76 0.92%, 94.01 2.80%, 93.35 3.21% and 90.51 7.32% was found for the 3D LP-U-net, the 3D U-net, the 3D V-net and the HighRes3DNet, respectively. The 3D LP-U-net showed significantly higher accuracy than all other three methods (). For ASD, the same significance was also observed. The bottom row of Fig. 6 shows the comparison results for the proximal femur. An average DOC of 98.14 0.47%, 96.89 0.85%, 96.47 1.54% and 89.99 4.91% was found for the 3D LP-U-net, the 3D U-net, the 3D V-net and the HighRes3DNet, respectively. The 3D LP-U-net showed significantly higher accuracy than all other three methods () when segmenting the proximal femur. These results demonstrate the efficacy of the proposed approach.
3.3 Validation on MICCAI 2015 IVD localization and segmentation challenge data
We conducted experiments on the MICCAI 2015 IVD localization and segmentation challenge data , which contains 25 3D T2-weighted MR images. The resolution of all images were resampled to . The size of the images is between and voxels. Each image contains at least 7 IVDs T11-S1. These 25 MR images were divided into three non-overlapped subsets as training data (15 3D MR images), Test1 data (5 3D MR images) and Test2 data (the remaining 5 3D MR images). All methods were trained on the training data and then separately evaluated on the two testing datasets. Manual segmentation was used as the reference for all evaluations.
We compared the performance of the 3D LP-U-net with top-5 state-of-the-art methods described in . In the training phase, we chose a fixed patch size of voxels and a fixed shuffling factor of (1, 2, 2) for the 3D LP-U-net in order to incorporate as large as possible context information. Table 5 shows the accuracy comparison between 3D LP-U-net and the state-of-the-art methods as described in . For both testing datasets, the 3D LP-U-net achieved consistently better results than other state-of-the-art methods. It is worth to mention that the 3D LP-U-net outperforms the method from the team UNICHK, which is a deeply supervised 3D segmentation network, by nearly 3.6% in terms of average DOC, which is a large improvement. The lower standard deviation of DOC shows that the 3D LP-U-net is the most stable and robust across all different IVD cases. The results that we obtained proves the effectiveness of our approach.
|Method||Test1 results (%)||Test2 results (%)||Average (%)|
|3D LP-U-net||92.4 1.5||92.1 1.7||92.2 1.7|
|UNILJU||91.5 2.3||92.0 1.9||91.8 2.1|
|UNIBE||89.8 2.9||91.2 2.0||90.5 2.6|
|UNIEXE||89.8 3.6||90.2 2.6||90.0 3.1|
|Sectra||90.0 2.6||90.0 2.2||90.0 2.4|
|UNICHK||88.4 3.7||88.9 3.4||88.6 3.5|
We proposed a simple yet effective holistic decomposition convolution for improving semantic segmentation systems. The HDC consists of a periodic down-shuffling operation followed by a conventional 3D convolution. It can be directly applied to the input data and has the advantage of significantly reducing the size of the data for sub-sequential processing while using all the information available in the input irrespective of the down-shuffling factors. To achieve volumetric dense prediction at the output, we used a previously introduced dense upsampling convolution. We showed that HDC and DUC were network agnostic and could be combined with different FCNs for an improved performance. Experimental results demonstrated the effectiveness of our framework on different semantic segmentation tasks.
-  Aitken, A., Ledig, C., Theis, L., Caballero, J., Wang, Z., Shi, W.: Checkerboard artifact free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize. arXiv preprint arXiv:1707.02937 (2017)
-  Çiçek, Ö., Abdulkadir, A., et al.: 3d u-net: learning dense volumetric segmentation from sparse annotation. In: Proc. MICCCAI. vol. 9901, pp. 424–432 (2016)
-  Kamnitsas, K., Ledig, C., et al.: Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Med Image Anal 36, 61–78 (2017)
Krizhevsky, A., ISutskever, Hinton, G.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012),http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
-  Li, W., Wang, G., et al.: On the compactness, efficiency, and representation of 3d convolutional networks: Brain parcellation as a pretext task. In: Proc. IPMI. vol. 10265, pp. 348–360 (2017)
-  Litjens, G., Kooi, T., et al.: A survey on deep learning in medical image analysis. Med Image Anal 42, 60–88 (2017)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proc. CVPR. pp. 3431–3440 (2015)
-  Milletari, F., Navab, N., Ahmadi, S.A.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: Proceedings of 2016 International Conferece on 3D Vision (3DV), pp. 565–571. IEEE (2016)
Prasson, A., Igel, C., Petersen, K., et al: Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network. In: MICCAI 2013;16(Pt 2). pp. 246–53 (2013)
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Proc. MICCAI. vol. 9351, pp. 234–241 (2015)
Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1874–1883 (2016)
-  Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., Cottrell, G.: Understanding convolution for semantic segmentation. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1451–1460. IEEE (2018)
-  Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: Proc. CVPR. pp. 636–644 (2017)
-  Zheng, G., Chu, C., Belav́y, D., et al.: Evaluation and comparison of 3d intervertebral disc localization and segmentation methods for 3d t2 mr data: A grand challenge. Medical Image Analysis 35, 327–344 (2017)