Semantic segmentation, which is essential for various applications, is a challenging task in medical imaging. Accurate volumetric medical image segmentation not only enables quantitative assessment of the volumes of interest (VOIs), but also contributes to precise disease diagnosis, computer-aided interventions, and surgical planning [9, 19]. Manually annotating volumetric medical images (with hundreds of slices and complicated structures) is tedious, time-consuming, and error-prone. Thus, automatic volumetric medical image segmentation methods are highly desired.
Two-dimensional fully convolutional neural network (2D FCN)-based methods have been widely adopted for medical image segmentation [16, 21]. However, medical images are commonly 3D with rich spatial information. Meanwhile, large variations exist in structural appearance, size, and shape among patients. Thus, exploiting 3D structural and anatomical information is critical for accurate volumetric medical image segmentation. Recent works extended 2D FCNs to 3D FCNs by directly adding an operation in the extra dimension [3, 4, 6, 13, 15]. Although satisfactory performance was obtained, the parameters and floating-point operations (FLOPs) increased dramatically compared with the 2D counterparts. As a result, the demands for large training datasets and advanced computational resources increase.
To reduce model parameters and FLOPs while maintaining the segmentation performance, convolutional kernel factorization-based methods have been extensively investigated for deep convolutional neural networks (DCNNs) [5, 17, 18, 20, 22]. In the earliest DCNNs, filters with large kernels were designed to enlarge the receptive field (RF) and make full use of the spatial context [10]. Later studies found that by decomposing a large filter into several consecutive small filters, the same RF could be obtained and superior performance with fewer parameters and FLOPs could be achieved [17, 18]. For example, a $7 \times 7$ filter can be decomposed into three $3 \times 3$ filters. Decomposing a high-dimensional filter into several low-dimensional filters along different dimensions is another method of convolutional kernel factorization. Depthwise separable convolutions (DSCs) decompose filters along the spatial and channel dimensions [5]. DSCs treat pointwise convolutions ($1 \times 1$ for 2D networks and $1 \times 1 \times 1$ for 3D networks) as the endpoint of convolution factorization. Pointwise convolutions are the most efficient convolutions in DCNNs, with the fewest parameters and FLOPs. Nonetheless, the severely limited RF of pointwise convolutions makes it difficult to construct a working neural network with pure pointwise convolutions.
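The savings from kernel factorization can be checked with simple arithmetic. The sketch below counts the weights of a single convolution layer as the product of the kernel extents, the input channels, and the output channels. The channel width `C = 64`, the omission of bias terms, and the $7 \times 7$ versus three $3 \times 3$ comparison are illustrative assumptions rather than settings taken from the experiments.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in one convolution layer (bias ignored): prod(kernel) * c_in * c_out."""
    n = c_in * c_out
    for k in kernel:
        n *= k
    return n


C = 64  # hypothetical channel width, used for both input and output

one_7x7 = conv_params((7, 7), C, C)          # one large 2D filter bank
three_3x3 = 3 * conv_params((3, 3), C, C)    # three stacked small 2D filter banks
conv_3d = conv_params((3, 3, 3), C, C)       # standard 3D convolution
pointwise_3d = conv_params((1, 1, 1), C, C)  # pointwise 3D convolution

print(one_7x7, three_3x3)       # 200704 110592
print(conv_3d, pointwise_3d)    # 110592 4096
print(conv_3d // pointwise_3d)  # 27
```

With stride 1 and equal feature map sizes, the multiply-accumulate count per output voxel scales by the same factors, so a $1 \times 1 \times 1$ layer is 27 times cheaper than a $3 \times 3 \times 3$ layer of the same width.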
In this paper, we attempt to build a novel DCNN for volumetric medical image segmentation by answering the following question: Can we replace all the convolutions in DCNNs with pointwise convolutions while keeping the segmentation performance? To achieve the objective, we need to solve the following problems of FCNs with only stacked pointwise convolutions: (1) The receptive field never enlarges. (2) The 3D spatial image context cannot be utilized. (3) Long-term dependencies in images are not exploited. To address these issues, we propose Group Shift (GS), a parameter-free operation. Equipped with GS, our final model with only pointwise convolutions (pointwise FCNs) can achieve comparable or even better performances than the corresponding 3D FCNs with significantly reduced parameters and FLOPs.
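Problem (1) follows directly from the standard receptive field recurrence for stacked convolutions. The helper below is a minimal sketch of that recurrence along one axis; the function name and the example layer stacks are illustrative and not part of the proposed method.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field along one axis of a stack of convolutions.

    Recurrence: rf += (k - 1) * jump, then jump *= stride,
    where jump is the spacing of adjacent outputs measured in input voxels.
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


print(receptive_field([3, 3, 3]))  # 7
print(receptive_field([7]))        # 7
print(receptive_field([1] * 50))   # 1
```

Because every pointwise layer contributes `(1 - 1) * jump = 0`, no depth of stacked pointwise convolutions can see beyond a single voxel unless another operation moves features across space.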
The major innovation of our proposed method lies in the design of GS. GS is developed to compensate for the limited RF of pointwise convolutions in a parameter-free manner and construct long-term regional dependencies. GS consists of two key steps, grouping and shift. In this section, we will describe the two steps as well as the formulation of GS in detail.
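Spatial Grouping. Denote the input and output feature maps of GS as $F \in \mathbb{R}^{D \times H \times W \times C}$ and $F_s \in \mathbb{R}^{D \times H \times W \times C}$, where $D$, $H$, and $W$ are the three spatial dimensions and $C$ is the number of channels. We first divide the images equally into $g_d$, $g_h$, and $g_w$ groups along the three spatial dimensions as shown in Fig. 1a, resulting in $g_d g_h g_w$ image groups in total. The dimension of each spatial group is $d \times h \times w$, and we have $D = g_d d$, $H = g_h h$, and $W = g_w w$. So after spatial grouping, the input feature maps are transformed to $F_{sg} \in \mathbb{R}^{(g_d \cdot d) \times (g_h \cdot h) \times (g_w \cdot w) \times C}$.
Channel Grouping. Empirically, we want to shift only a part of the features. The un-shifted features contain the original localization information that is also important for the final segmentation performance. Suppose the number of channels to be shifted is $C_s$ and the number of channels to keep un-shifted is $C_k$, with $C = C_s + C_k$. Then, we split $C_s$ into $g_d g_h g_w$ groups (the same as the number of spatial groups). Each channel group contains $c_s$ channels, and $C_s = g_d g_h g_w c_s$. After channel grouping, the output feature map is $F_{cg} \in \mathbb{R}^{D \times H \times W \times (g_d g_h g_w c_s + C_k)}$. Channel grouping is illustrated in Fig. 1b.
Therefore, the input feature maps are transformed to $F_{scg} \in \mathbb{R}^{(g_d \cdot d) \times (g_h \cdot h) \times (g_w \cdot w) \times (g_d g_h g_w c_s + C_k)}$ after spatial and channel grouping. $F_{scg}$ can proceed to the subsequent shift operation.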
To force the pointwise convolutions to extract more spatial information, we design a dedicated shift operation. Fig. 1c illustrates how the shift operation works. We assume that the feature maps are divided into four spatial groups (corresponding to the four columns with different colors) and rearrange the spatial groups in a column-wise manner (Fig. 1c, left figure). The channels are divided into shifted channels and un-shifted channels. The shifted channels are further grouped into four groups (corresponding to the upper four rows in Fig. 1c). Then, we shift each channel group of the shifted channels with a step equal to the index of the channel group (Fig. 1c, right figure). Shifting by one step means moving one spatial group in the specific channel group to the neighboring spatial group. All the channel groups shift in the same direction, and shifting happens only within the specific channel group, without shifting across channels.
From Fig. 1c, we can observe that after shifting, every spatial group (i.e., every column) contains one channel group from each of the other spatial groups. In other words, one voxel at a specific location in a spatial group contains one group of channels of the corresponding voxel at the same location in each of the other spatial groups. Thus, the shift operation can not only increase the RF but also take full advantage of the spatial context, especially long-term dependencies. Ideally, it can effectively solve the three problems raised above.
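The grouping and shift steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the volume is channels-last with shape `(D, H, W, C)`, spatial groups are linearized in row-major order, channel group indices start at zero (so the first channel group stays in place), and shifting wraps around cyclically. The function name `group_shift` and its parameters are hypothetical.

```python
import numpy as np


def group_shift(x, groups, shift_channels):
    """Sketch of Group Shift on a channels-last volume x of shape (D, H, W, C)."""
    D, H, W, C = x.shape
    gd, gh, gw = groups
    G = gd * gh * gw  # number of spatial groups and of channel groups
    if D % gd or H % gh or W % gw:
        raise ValueError("spatial dimensions must be divisible by the group counts")
    if shift_channels > C or shift_channels % G:
        raise ValueError("shift_channels must be <= C and divisible by the group count")
    d, h, w = D // gd, H // gh, W // gw
    cs = shift_channels // G  # channels per channel group

    # Spatial grouping: (D, H, W, C) -> (G, d, h, w, C), one entry per spatial group.
    blocks = x.reshape(gd, d, gh, h, gw, w, C).transpose(0, 2, 4, 1, 3, 5, 6)
    blocks = blocks.reshape(G, d, h, w, C)

    out = blocks.copy()  # channels beyond shift_channels stay un-shifted
    for k in range(G):
        ch = slice(k * cs, (k + 1) * cs)
        # Channel group k moves k spatial groups forward, wrapping around (assumed).
        out[..., ch] = np.roll(blocks[..., ch], shift=k, axis=0)

    # Undo the spatial grouping to recover the (D, H, W, C) layout.
    out = out.reshape(gd, gh, gw, d, h, w, C).transpose(0, 3, 1, 4, 2, 5, 6)
    return out.reshape(D, H, W, C)


x = np.arange(2 * 4 * 4 * 8, dtype=np.float32).reshape(2, 4, 4, 8)
y = group_shift(x, groups=(1, 2, 2), shift_channels=4)
print(y.shape)  # (2, 4, 4, 8)
```

In this toy call there are four spatial groups and four one-channel groups, so each spatial group of `y` holds one channel from every spatial group of `x`, while the last four channels are left untouched. The operation only moves data and therefore adds no parameters.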
2.3 Formulation of Group Shift
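Let $(x', y', z', c')$ be the coordinates of a specific voxel in the shifted feature map $F_s$ and $(x, y, z, c)$ be the corresponding coordinates of the same voxel in the input feature map $F$. Specifically, we should find the coordinates that satisfy
$$F_s(x', y', z', c') = F(x, y, z, c),$$
where $x, x' \in [0, D-1]$, $y, y' \in [0, H-1]$, $z, z' \in [0, W-1]$, and $c, c' \in [0, C-1]$ for a feature map with spatial dimensions $D \times H \times W$ and $C$ channels. The numbers of spatial groups along the three dimensions are $g_d$, $g_h$, and $g_w$. The spatial size of each spatial group is $d \times h \times w$, with $d = D/g_d$, $h = H/g_h$, and $w = W/g_w$. The number of channels to be shifted is $C_s$. Suppose the current spatial group index of $(x, y, z, c)$ in the input feature map is $n_{\mathrm{cur}}$, the shift step is $n_{\mathrm{step}}$, and the shifted spatial group index of $(x', y', z', c')$ in the shifted feature map is $n_{\mathrm{sft}}$. The coordinates in the shifted feature map and the input feature map are related through these three quantities.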
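One way to write this coordinate mapping explicitly is sketched below. It is a formulation consistent with the description of the shift operation, not necessarily the exact notation of the original derivation. It assumes a feature map of spatial size $D \times H \times W$ split into $g_d \times g_h \times g_w$ spatial groups of size $d \times h \times w$, zero-based indices, spatial groups linearized in row-major order, cyclic wrap-around, $C_s$ shifted channels in channel groups of size $c_s$, and a shift step equal to the zero-based channel group index. The names $n_{\mathrm{sft}}$, $n_{\mathrm{step}}$, and $n_{\mathrm{cur}}$ are illustrative. For a shifted channel $c' < C_s$:

```latex
\begin{align}
n_{\mathrm{sft}}  &= \left(\left\lfloor x'/d \right\rfloor g_h + \left\lfloor y'/h \right\rfloor\right) g_w + \left\lfloor z'/w \right\rfloor, \\
n_{\mathrm{step}} &= \left\lfloor c'/c_s \right\rfloor, \qquad c_s = \frac{C_s}{g_d\, g_h\, g_w}, \\
n_{\mathrm{cur}}  &= \left(n_{\mathrm{sft}} - n_{\mathrm{step}}\right) \bmod \left(g_d\, g_h\, g_w\right), \\
x &= \left\lfloor \frac{n_{\mathrm{cur}}}{g_h\, g_w} \right\rfloor d + \left(x' \bmod d\right), \\
y &= \left(\left\lfloor \frac{n_{\mathrm{cur}}}{g_w} \right\rfloor \bmod g_h\right) h + \left(y' \bmod h\right), \\
z &= \left(n_{\mathrm{cur}} \bmod g_w\right) w + \left(z' \bmod w\right), \\
c &= c'.
\end{align}
```

For an un-shifted channel $c' \geq C_s$, the mapping is the identity, $(x, y, z, c) = (x', y', z', c')$. The offsets inside a spatial group, $x' \bmod d$, $y' \bmod h$, and $z' \bmod w$, are preserved, which reflects that a voxel only exchanges features with voxels at the same location in other spatial groups.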
Extensive experiments are conducted on two benchmark datasets, PROMISE12 [11] and BraTS18 [1, 2, 12]. PROMISE12 released 50 transversal T2-weighted MR images of the prostate and corresponding segmentation ground truths as the training set and 30 MR images without ground truths as the validation set. Network inputs for this dataset are obtained through random cropping. BraTS18 provides multimodal MR scans (T1, T1ce, T2, and FLAIR) for brain tumor segmentation. In the training set, there are 285 images with segmentation labels. All provided volumes have the same matrix size of $240 \times 240 \times 155$. Network inputs for BraTS18 are likewise obtained through random cropping.
The network architecture of the 3D FCN adopted in this study is a tiny 3D U-Net [6] (see supplementary material for the detailed structure). When all convolutions in the network are replaced by pointwise convolutions, the 3D FCN becomes our pointwise FCN. The proposed GS can be inserted at any position of the pointwise FCN. In this paper, we investigate four GS-related configurations, “CSC”, “CCS”, “CSCS”, and “CSCSUpShift”, as shown in Fig. 2. The numbers of spatial and channel groups of GS are determined by the size of the input feature maps.
Two baselines are investigated: 3D FCNs with $3 \times 3 \times 3$ convolutions and pointwise FCNs without GS. We randomly split each of the two datasets into two groups with a ratio of 8:2 for network training and validation. For preprocessing, we normalize each 3D image independently with the mean and standard deviation calculated from the corresponding foreground regions. The poly learning rate policy is adopted with an initial learning rate of 0.01 and a power of 0.9. The optimizer is stochastic gradient descent (SGD), and the loss function is the Dice loss [13]. All our models are implemented with PyTorch on a Titan XP GPU (12 GB) with a batch size of 4. Two evaluation metrics, “dice” and “mDice”, are reported. Here, “dice” is the Dice score calculated for each foreground class, and “mDice” is the average “dice” of all foreground classes.
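The preprocessing, learning rate schedule, and evaluation metrics described above can be sketched as follows. This is a NumPy illustration under stated assumptions, not the training code: foreground is taken to mean voxels with intensity greater than zero, the poly policy is written in its common form `base_lr * (1 - step / max_steps) ** power`, and the Dice function is the hard-mask metric used for evaluation (the training loss would use a soft version on predicted probabilities). All function names are hypothetical.

```python
import numpy as np


def normalize_foreground(volume, eps=1e-8):
    """Z-score a volume with statistics from its foreground voxels (assumed: > 0)."""
    fg = volume[volume > 0]
    if fg.size == 0:  # nothing to estimate statistics from
        return volume
    return (volume - fg.mean()) / (fg.std() + eps)


def poly_lr(base_lr, step, max_steps, power=0.9):
    """Poly learning rate policy, decaying from base_lr to 0 over max_steps."""
    return base_lr * (1.0 - step / max_steps) ** power


def dice_score(pred, target, eps=1e-8):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)


def mean_dice(pred_labels, target_labels, foreground_classes):
    """Per-class "dice" and their average "mDice" over the foreground classes."""
    scores = [dice_score(pred_labels == c, target_labels == c)
              for c in foreground_classes]
    return scores, float(np.mean(scores))


pred = np.array([0, 1, 1, 2])
target = np.array([0, 1, 2, 2])
scores, mdice = mean_dice(pred, target, foreground_classes=[1, 2])
print(poly_lr(0.01, step=0, max_steps=100))  # 0.01
```

In the toy label arrays, each foreground class overlaps in one voxel out of three labeled voxels in total, so both per-class scores and the resulting mDice are approximately 2/3.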
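Table 1. Baseline results on BraTS18 and PROMISE12.

| Method | BraTS18 WT | BraTS18 TC | BraTS18 ET | BraTS18 mDice (%) | PROMISE12 mDice (%) |
|---|---|---|---|---|---|
| pointwise FCN without GS | 86.4 | 79.7 | 72.7 | 79.6 | 65.5 |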
Table 2. Different SG settings (Stage 1 to Stage 5) and results under different SG and IP (CSC to CSCSUpShift).

| SG | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | CSC | CCS | CSCS | CSCSUpShift |
|---|---|---|---|---|---|---|---|---|---|
| ProSGv1 | (2, 2, 2) | (2, 2, 2) | (2, 4, 4) | (1, 8, 8) | (1, 8, 8) | 84.9 | 84.0 | 83.1 | 83.0 |
| ProSGv2 | (1, 2, 2) | (1, 4, 4) | (2, 4, 4) | (1, 8, 8) | (1, 8, 8) | 84.5 | 85.4 | 85.2 | 84.5 |
| ProSGv3 | (2, 2, 2) | (1, 4, 4) | (1, 4, 4) | (1, 8, 8) | (1, 8, 8) | 85.6 | 84.3 | 84.8 | 83.6 |
| ProSGv4 | (1, 2, 2) | (2, 2, 2) | (2, 4, 4) | (1, 8, 8) | (1, 8, 8) | 85.0 | 84.3 | 85.3 | 84.6 |
3.1 Results on PROMISE12
Results of the two baselines on PROMISE12 are shown in Table 1. As expected, when all convolutions in 3D FCNs are replaced with pointwise convolutions, the network performance drops dramatically. The mDice value decreases by more than 20 percentage points (from 87.3% to 65.5%). This reflects that large effective RFs and long-term dependencies in images are important for large foreground object segmentation, such as the prostate.
Considering the matrix sizes of the input images and the feature maps at different network stages, four settings of spatial groups (ProSGv1 to ProSGv4 in Table 2) are investigated. Specifically, we test different spatial group numbers at different stages. In general, more spatial groups are used at deeper stages and in the in-plane dimensions. Together with the four GS configurations (“CSC”, “CCS”, “CSCS”, and “CSCSUpShift”), there are 16 experimental conditions in total.
Overall, the segmentation results of pointwise FCNs with GS (Table 2) are better than those without GS (65.5% in Table 1) by a large margin. Among the four spatial group settings, “ProSGv2” achieves the best average result (84.9%) over the four GS configurations. Among the four GS configurations, “CSC” achieves the best average result (85.0%) over the four spatial group settings. Nevertheless, “ProSGv3” with “CSC” achieves the best single result with an mDice value of 85.6%, which is only slightly worse than that obtained with normal 3D FCNs (87.3%) utilizing computationally intensive 3D convolutions.
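The averages quoted above can be reproduced directly from the mDice values in Table 2. The short script below is only an arithmetic check; the dictionary layout and variable names are illustrative.

```python
# mDice (%) from Table 2, ordered as CSC, CCS, CSCS, CSCSUpShift.
configs = ["CSC", "CCS", "CSCS", "CSCSUpShift"]
table2 = {
    "ProSGv1": [84.9, 84.0, 83.1, 83.0],
    "ProSGv2": [84.5, 85.4, 85.2, 84.5],
    "ProSGv3": [85.6, 84.3, 84.8, 83.6],
    "ProSGv4": [85.0, 84.3, 85.3, 84.6],
}

# Average over GS configurations for each spatial group setting (row means).
row_means = {sg: sum(vals) / len(vals) for sg, vals in table2.items()}
# Average over spatial group settings for each GS configuration (column means).
col_means = {
    cfg: sum(vals[i] for vals in table2.values()) / len(table2)
    for i, cfg in enumerate(configs)
}

best_sg = max(row_means, key=row_means.get)
best_cfg = max(col_means, key=col_means.get)
print(best_sg, round(row_means[best_sg], 1))    # ProSGv2 84.9
print(best_cfg, round(col_means[best_cfg], 1))  # CSC 85.0
```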
With the best configuration of our pointwise FCN (“ProSGv3” with “CSC”), we further investigate the influence of the ratio of the shifted channels on the network performance. When all the input feature channels are allowed to shift, the segmentation results (mDice = 81.4%) are much worse than those obtained when only half of the input features are shifted (mDice = 85.6%). Therefore, we conclude that both local information (preserved by the un-shifted channel groups) and spatial information (extracted through the shifted channel groups) are important for the final prostate segmentation.
3.2 Results on BraTS18
Surprisingly, for the two baselines, the results of pointwise FCNs (mDice = 79.6%) are only slightly worse than those of 3D FCNs (mDice = 80.5%) on BraTS18 as shown in Table 1, which is quite different from the results achieved on PROMISE12. We suspect that this phenomenon is caused by the different properties of the two datasets. The target objects of BraTS18 data (brain tumors) are much smaller than those of PROMISE12 data (prostate regions). The local information within the limited RF of pointwise FCNs is enough to achieve satisfactory segmentation results on BraTS18.
We investigate the influence of the insert positions of GS on the final performance with BraTS18 data when utilizing the following spatial group setting for Stages 1 to 5: (2,2,2), (2,2,2), (2,2,2), (4,4,4), and (5,5,5) (see supplementary material). A similar conclusion can be drawn: “CSC” achieves the best average result (mDice = 81.2%) among the four GS configurations (80.7%, 80.1%, and 80.2% for “CCS”, “CSCS”, and “CSCSUpShift”, respectively), which is even slightly better than that given by the 3D FCN (80.5%). This indicates the effectiveness of our pointwise FCNs with GS (GSP-Conv) for small object segmentation tasks.
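Table 3. Comparison with state-of-the-art methods on BraTS18.

| Method | Params (M) | FLOPs (G) | Dice ET (%) | Dice WT (%) | Dice TC (%) | HD95 ET (mm) | HD95 WT (mm) | HD95 TC (mm) |
|---|---|---|---|---|---|---|---|---|
| Kao et al. [8] | 9.45 | 203.96 | 78.75 | 90.47 | 81.35 | 3.81 | 4.32 | 7.56 |
| No New-Net [7] | 10.36 | 202.25 | 81.01 | 90.83 | 85.44 | 2.41 | 4.27 | 6.52 |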
With this dataset, we treat the encoder and the decoder of the network differently and add the GS operations to one of them at a time. Results reflect that adding GS to the decoder (82.6%) is more effective for the brain tumor segmentation task than adding GS to the encoder (81.5%) or to both (81.2%). We speculate that when adding GS only to the decoder, we can keep more local detailed information un-shifted, which is essential for small object segmentation.
Comparisons to state-of-the-art methods [4, 7, 8, 14, 15], including factorization-based methods [4, 15], are performed on the test set of BraTS18 through the online server (Table 3). Following the best practices, we use the same data preprocessing, training strategies, and training hyper-parameters as . Overall, our method achieves competitive results when compared to these methods with far fewer parameters and FLOPs. With less than 8% of the parameters and less than 11% of the FLOPs, our method can still generate very accurate brain tumor segmentation, which is crucial for acute situations when fast diagnoses are important.
Two major limitations exist in our current experimental design. First, we only experimented with the tiny 3D U-Net architecture. Second, our model contains a number of hyper-parameters that might need to be tuned for different applications. Therefore, we believe that we have not made the most of the capability of the proposed GS operation. In our future work, we will investigate the effects of the data (imaging modality, spacing, volume size, and target object size) on the choice of the best model configurations. We will also try to design a dedicated network architecture according to these properties of the data. In particular, the number of stages, the number of channels in each stage, the number of convolution operations in each “Conv Block” (Fig. 2), and the number of “Conv Blocks” in both the encoder and the decoder will be optimized accordingly. Together with the different settings of the proposed GS operation, all these factors form a large search space. We are considering introducing neural architecture search (NAS) to automate the process.
Nevertheless, equipped with the current version of our proposed GS, the pointwise FCNs can already achieve comparable or even better performance than the corresponding 3D FCNs. To the best of our knowledge, this is the first attempt to segment volumetric images with only pointwise convolutions. We provide a new perspective on model compression. Our proposed GSP-Conv operation can be of high application value when fast and accurate imaging diagnoses are needed. In addition, we believe that the proposed method can be easily extended to other image processing tasks, including image classification, object detection, image synthesis, and image super-resolution.
This research is partially supported by the National Key Research and Development Program of China (No. 2020YFC2004804 and 2016YFC0106200), the Scientific and Technical Innovation 2030 “New Generation Artificial Intelligence” Project (No. 2020AAA0104100 and 2020AAA0104105), the Shanghai Committee of Science and Technology, China (No. 20DZ1100800 and 21DZ1100100), Beijing Natural Science Foundation-Haidian Original Innovation Collaborative Fund (No. L192006), the funding from Institute of Medical Robotics of Shanghai Jiao Tong University, the 863 national research fund (No. 2015AA043203), the National Natural Science Foundation of China (No. 61871371 and 81830056), the Key-Area Research and Development Program of GuangDong Province (No. 2018B010109009), the Key Laboratory for Magnetic Resonance and Multimodality Imaging of Guangdong Province (No. 2020B1212060051), the Basic Research Program of Shenzhen (No. JCYJ20180507182400762), and the Youth Innovation Promotion Association Program of Chinese Academy of Sciences (No. 2019351).
[1] Bakas, S., et al.: Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 4, 170117 (2017)
[2] Bakas, S., et al.: Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv preprint arXiv:1811.12506 (2018)
[3] Chen, C., Liu, X., Ding, M., Zheng, J., Li, J.: 3D dilated multi-fiber network for real-time brain tumor segmentation in MRI. In: Shen, D., et al. (eds.) MICCAI 2019, LNCS, vol. 11766, pp. 184–192. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32248-9_21
[4] Chen, W., Liu, B., Peng, S., Sun, J., Qiao, X.: S3D-UNet: separable 3D U-Net for brain tumor segmentation. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018, LNCS, vol. 11384, pp. 358–368. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-11726-9_32
[5] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on CVPR, pp. 1251–1258 (2017)
[6] Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M., Unal, G., Wells, W. (eds.) MICCAI 2016, LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_49
[7] Isensee, F., Kickingereder, P., Wick, W., Bendszus, M., Maier-Hein, K.H.: No new-net. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018, LNCS, vol. 11384, pp. 234–244. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-11726-9_21
[8] Kao, P.Y., Ngo, T., Zhang, A., Chen, J.W., Manjunath, B.S.: Brain tumor segmentation and tractographic feature extraction from structural MR images for overall survival prediction. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018, LNCS, vol. 11384, pp. 128–141. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-11726-9_12
[9] Khened, M., Kollerathu, V.A., Krishnamurthi, G.: Fully convolutional multi-scale residual DenseNets for cardiac segmentation and automated cardiac diagnosis using ensemble of classifiers. Med. Image Anal. 51, 21–45 (2019)
[10] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)
[11] Litjens, G., et al.: Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Med. Image Anal. 18(2), 359–373 (2014)
[12] Menze, B.H., et al.: The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 34(10), 1993–2024 (2014)
[13] Milletari, F., Navab, N., Ahmadi, S.-A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: Proceedings of the 4th International Conference on 3DV, pp. 565–571 (2016)
[14] Myronenko, A.: 3D MRI brain tumor segmentation using autoencoder regularization. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018, LNCS, vol. 11384, pp. 311–320. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-11726-9_28
[15] Nuechterlein, N., Mehta, S.: 3D-ESPNet with pyramidal refinement for volumetric brain tumor image segmentation. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018, LNCS, vol. 11384, pp. 245–253. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-11726-9_22
[16] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015, LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
[17] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
[18] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on CVPR, pp. 2818–2826 (2016)
[19] Tang, H., et al.: Clinically applicable deep learning framework for organs at risk delineation in CT images. Nat. Mach. Intell. 1(10), 480–491 (2019)
[20] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on CVPR, pp. 6450–6459 (2018)
[21] Xian, M., et al.: Automatic breast ultrasound image segmentation: A survey. Pattern Recogn. 79, 340–355 (2018)
[22] Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In: Proceedings of ECCV, pp. 305–321 (2018)