Head and neck (HaN) cancers, such as cancers of the oral cavity and nasopharynx, are among the most prevalent cancer types worldwide. Treatment of HaN cancers relies primarily on radiation therapy. To prevent post-treatment complications, accurate segmentation of organs-at-risk (OARs) is vital during treatment planning. In the clinic, computed tomography (CT)-based treatment planning is routinely conducted because of its high efficiency, high spatial resolution, and ability to provide relative electron density information. Manual delineation of OARs in CT images remains the primary choice despite being a time-consuming and tedious process: several hours are required to process the images of a single patient. In addition, manual delineation is subject to high inter- and intra-observer variations, which can significantly influence the prognosis of the treatment. Automatic segmentation methods are therefore urgently needed to speed up the process and achieve robust outcomes.
The low contrast of soft tissues in HaN CT images and the large volume variations across organs make it challenging to achieve automatic and accurate segmentation of all OARs in an end-to-end fashion. Conventional learning approaches often rely on one or multiple atlases or require the extraction of hand-crafted image features [2, 21], which are difficult to make sufficiently comprehensive and discriminative for the segmentation task. Deep neural networks, especially convolutional neural networks (CNNs), have proved highly effective for medical image segmentation in different applications [9, 13]. Many efforts have been devoted to CNN-based segmentation of OARs in HaN CT images. To deal with the class imbalance caused by the differently sized organs, image patches based on certain prior knowledge were extracted before conducting CNN-based segmentation [8, 15]. Two-step CNNs consisting of a region detector and a segmentation unit were also employed [12, 17]. To make full use of the image information, a joint localization and segmentation network with a multi-view spatial aggregation framework was proposed. The inputs to these models were either 3D image patches lacking global features or 2D images without depthwise information. AnatomyNet was designed specifically to process whole-volume 3D HaN CT images. Its major contributions include a novel network architecture for effective feature extraction and a combined loss function to combat the class imbalance problem. Following AnatomyNet, FocusNet was proposed to better handle the segmentation of both large and small organs through a delicate network structure design.
Despite the inspiring results achieved, several issues remain in the developed approaches. First, some studies dealt with only 2D inputs and thus did not fully exploit the 3D image information [8, 10, 12]. Others conducted 3D convolutions without paying attention to the different in-plane and depth resolutions [4, 15, 22]. The in-plane resolution of 3D HaN CT images is normally several times higher than the depth resolution, so the direct employment of 3D convolutions can lead to the extraction of distorted image features, which might not be optimal for the segmentation task. Anisotropic convolutions have been proposed to solve this issue, but without distinguishing the low-level and high-level features. Second, for networks processing whole-volume 3D CT images (AnatomyNet and FocusNet), only one downsampling layer was used to preserve the information of small anatomies; consequently, the receptive fields of these networks are limited. To increase the receptive field, DenseASPP with four dilation rates (3, 6, 12, and 18) was introduced to FocusNet. However, when the dilation rates of cascaded dilated convolutions share a common factor, the gridding issue may appear and degrade the segmentation accuracy. Besides, pure 3D networks suffer from large parameter counts and heavy computational burdens, which also limit the network depth and performance.
To address these issues, a hybrid convolutional neural network, OrganNet2.5D, is proposed in this work to improve the segmentation performance for OARs in HaN CT images. OrganNet2.5D integrates 2D convolutions with 3D convolutions to simultaneously extract clear low-level edge features and rich high-level semantic features. The hybrid dilated convolution (HDC) module is introduced to OrganNet2.5D as a replacement for the DenseASPP in FocusNet. The HDC module increases the network receptive field without decreasing the image resolution and, at the same time, avoids the gridding issue. OrganNet2.5D has three blocks: a 2D convolution block for the extraction of clear edge image features, a coarse 3D convolution block for the extraction of coarse high-level semantic features with a limited receptive field, and a fine 3D convolution block for the extraction of refined high-level semantic features with an enlarged receptive field through the HDC module. Similar to AnatomyNet and FocusNet [4, 22], a combination of Dice loss and focal loss is employed to handle the class imbalance problem. The effectiveness of the proposed OrganNet2.5D is evaluated on two datasets. On the publicly available MICCAI Head and Neck Auto Segmentation Challenge 2015 dataset (MICCAI 2015 challenge dataset), promising performance is achieved compared to state-of-the-art approaches.
2.1 Dataset
We evaluate the performance of our proposed model on two datasets. The first dataset is collected from two sources of 3D HaN CT images: the Head-Neck Cetuximab collection (46 samples) and the collection of Martin Vallières of the Medical Physics Unit, McGill University, Montreal, Canada (261 samples, https://wiki.cancerimagingarchive.net/display/Public/Head-Neck-PET-CT). This first dataset is utilized to validate the effectiveness of the different blocks of our model. Segmentation annotations of 24 OARs are provided by experienced radiologists with quality control management. We randomly grouped the 307 samples into a training set of 240 samples, a validation set of 20 samples, and a test set of 47 samples. To compare the performance of our proposed method with existing approaches, we utilize the MICCAI 2015 challenge dataset. It contains 48 samples, among which 33 are provided as the training set, 10 as the off-site test set, and the remaining 5 as the on-site test set. Manual segmentation of 9 OARs is available for the 33 training samples and the 10 off-site test samples. Similar to previous studies, we optimize our model with the training samples and report model performance on the 10 off-site test samples.
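The random split of the first dataset can be sketched as follows; the case identifiers and the seed are placeholders for illustration, not the actual split used in this work:

```python
import random

# Sketch: split 307 cases into 240 train / 20 validation / 47 test.
# Case IDs and the seed are hypothetical placeholders.
cases = [f"case_{i:03d}" for i in range(307)]
random.Random(42).shuffle(cases)  # fixed seed for a reproducible split
train, val, test = cases[:240], cases[240:260], cases[260:]
print(len(train), len(val), len(test))  # 240 20 47
```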
2.2 Network architecture
The overall network architecture of our proposed OrganNet2.5D is shown in Fig. 1. OrganNet2.5D follows the classic encoder-decoder segmentation network structure. The inputs to our network are whole-volume 3D HaN CT images, and the outputs are the segmentation results of 25 categories for the first dataset (24 OARs and the background) or 10 categories (9 OARs and the background) for the MICCAI 2015 challenge dataset. OrganNet2.5D contains three major blocks: the 2D convolution block, the coarse 3D convolution block, and the fine 3D convolution block.
2.2.1 2D convolution block.
The 2D convolution block is designed for the extraction of clear edge image features. It is widely accepted that during image encoding, low-level features capture geometric information while high-level features capture semantic information. Therefore, in our model, only the first two convolutions near the inputs and the corresponding last two convolutions near the outputs are replaced with 2D convolutions. By avoiding the direct application of 3D convolutions at these stages, distorted edge feature extraction can be prevented. Meanwhile, considering the different in-plane and depth image resolutions, in-plane downsampling is conducted within the 2D convolution block to calibrate the image features for the subsequent 3D convolution operations.
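As an illustration, in-plane (2D) convolutions on a 3D volume can be realized with 1×3×3 kernels so the depth axis is left untouched. This is a minimal sketch assuming PyTorch; the channel counts and layer arrangement are illustrative, not the exact OrganNet2.5D configuration:

```python
import torch
import torch.nn as nn

class Conv2DBlock(nn.Module):
    """In-plane convolutions via 3D convolutions with 1x3x3 kernels:
    only edge features within each slice are extracted, and the depth
    resolution is preserved. Layer sizes are illustrative."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )
        # In-plane downsampling only: the depth axis keeps its resolution.
        self.down = nn.MaxPool3d(kernel_size=(1, 2, 2))

    def forward(self, x):
        return self.down(self.block(x))

x = torch.randn(1, 1, 48, 64, 64)  # (batch, channel, depth, height, width)
y = Conv2DBlock(1, 16)(x)
print(y.shape)  # depth kept at 48, in-plane halved to 32x32
```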
2.2.2 Coarse 3D convolution block.
The 2D convolution block is followed by the coarse 3D convolution block. To prevent information loss, especially for small anatomies, only one downsampling layer is used. The coarse 3D convolution block is designed to extract rich semantic features that are important for the voxel-wise classification task. Following the successful practice of existing methods, the basic unit of our coarse 3D convolution block is composed of two standard 3D convolution layers and one squeeze-and-excitation residual module (ResSE module, Fig. 2). The ResSE module is responsible for feature filtering, highlighting the important features and suppressing the irrelevant ones. With the filtered image features, the final segmentation step can concentrate on the important features, and better results can be expected.
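A squeeze-and-excitation residual module of the kind described above can be sketched as follows in PyTorch; the reduction ratio `r` is a common default and not necessarily the value used in this work:

```python
import torch
import torch.nn as nn

class ResSE(nn.Module):
    """Squeeze-and-excitation residual module: channel-wise feature
    filtering with a residual shortcut. Reduction ratio r=4 is an
    assumed default."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)   # squeeze: global spatial context
        self.fc = nn.Sequential(              # excitation: per-channel weights
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x + x * w                      # reweighted features + shortcut

x = torch.randn(2, 16, 8, 16, 16)
out = ResSE(16)(x)
print(out.shape)  # shape is preserved: (2, 16, 8, 16, 16)
```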
2.2.3 Fine 3D convolution block.
With the 2D convolution block and the coarse 3D convolution block, clear edge and rich semantic image features are extracted. However, since only two downsampling layers are used (one 2D downsampling and one 3D downsampling), the receptive field of the network is limited. Without global image information, the segmentation accuracy may be compromised. As such, a series of hybrid dilated convolution (HDC) modules is employed to integrate global image information with the semantic features and, at the same time, to prevent the gridding issue. Moreover, by using different dilation rates, multi-scale image features are extracted, which better handle OARs of different sizes.
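A hybrid dilated convolution module can be sketched as cascaded 3D convolutions whose dilation rates share no common factor, so the receptive field grows without the gridding artifact of repeated same-rate dilations. The rates (1, 2, 5) below follow the common HDC convention; the actual rates used in OrganNet2.5D may differ:

```python
import torch
import torch.nn as nn

class HDC(nn.Module):
    """Hybrid dilated convolution: cascaded dilated 3D convolutions with
    co-prime rates (assumed here to be 1, 2, 5) to avoid gridding."""
    def __init__(self, channels, rates=(1, 2, 5)):
        super().__init__()
        layers = []
        for r in rates:
            layers += [
                # padding = dilation keeps the spatial size unchanged
                nn.Conv3d(channels, channels, kernel_size=3,
                          padding=r, dilation=r),
                nn.BatchNorm3d(channels),
                nn.ReLU(inplace=True),
            ]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

x = torch.randn(1, 8, 16, 24, 24)
y = HDC(8)(x)
print(y.shape)  # spatial size unchanged: (1, 8, 16, 24, 24)
```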
2.3 Loss function
A combination of focal loss and Dice loss is employed to prevent the model from biasing toward large objects.
Focal loss forces the network to focus on hard samples, i.e., samples predicted by the network with high uncertainty. It improves on the cross-entropy loss with both fixed and dynamic loss weighting strategies. The focal loss is calculated as:

$$\mathcal{L}_{focal} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C}\alpha_c\,(1-p_{n,c})^{\gamma}\,y_{n,c}\log p_{n,c}$$

where $N$ refers to the sample size, $c$ indexes the categories ($C=25$ for the first dataset and $C=10$ for the second), $\alpha_c$ is the fixed loss weight of the $c$-th OAR, $p_{n,c}$ is the network prediction, $(1-p_{n,c})^{\gamma}$ is the dynamic loss weight, and $y_{n,c}$ is the manual label.
Dice loss deals with the class imbalance problem by minimizing the distribution distance between the network prediction and the manual segmentation. For multi-class segmentation, one Dice loss is calculated for each class and the final Dice loss is the average over all classes. In this work, the average Dice loss is calculated as:

$$\mathcal{L}_{dice} = 1 - \frac{1}{C}\sum_{c=1}^{C}\frac{2\sum_{n=1}^{N} p_{n,c}\,y_{n,c}}{\sum_{n=1}^{N} p_{n,c} + \sum_{n=1}^{N} y_{n,c}}$$
The final loss function for our network training is a weighted summation of the two losses:

$$\mathcal{L} = \lambda_1\mathcal{L}_{dice} + \lambda_2\mathcal{L}_{focal}$$
For our experiments, the loss weights $\lambda_1$ and $\lambda_2$ are set empirically. The fixed weights $\alpha_c$ in the focal loss for the first dataset are 0.5, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 3.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 1.0, and 1.0 for the 25 categories (background, brain stem, eye left, eye right, lens left, lens right, optic nerve left, optic nerve right, optic chiasma, temporal lobes left, temporal lobes right, pituitary, parotid gland left, parotid gland right, inner ear left, inner ear right, middle ear left, middle ear right, tongue, temporomandibular joint left, temporomandibular joint right, spinal cord, mandible left, and mandible right), and for the second dataset are 0.5, 1.0, 4.0, 1.0, 4.0, 4.0, 1.0, 1.0, 3.0, and 3.0 for the 10 categories (background, brain stem, optic chiasma, mandible, optic nerve left, optic nerve right, parotid gland left, parotid gland right, submandibular left, and submandibular right).
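The combined loss can be sketched as below, assuming PyTorch. The focal exponent `gamma=2` is the common default rather than a value confirmed by this work, and the `lam_dice`/`lam_focal` defaults are placeholders for the empirically chosen weights:

```python
import torch

def focal_loss(pred, target, alpha, gamma=2.0, eps=1e-7):
    """Class-weighted focal loss (gamma=2 is an assumed default).
    pred, target: (N, C, D, H, W) softmax probabilities / one-hot labels.
    alpha: (C,) fixed per-class weights."""
    a = alpha.view(1, -1, 1, 1, 1)
    loss = -a * (1 - pred) ** gamma * target * torch.log(pred + eps)
    return loss.sum(dim=1).mean()

def dice_loss(pred, target, eps=1e-7):
    """Dice loss averaged over all classes."""
    dims = (0, 2, 3, 4)  # sum over batch and spatial axes, keep classes
    inter = (pred * target).sum(dim=dims)
    denom = pred.sum(dim=dims) + target.sum(dim=dims)
    return (1 - (2 * inter + eps) / (denom + eps)).mean()

def combined_loss(pred, target, alpha, lam_dice=1.0, lam_focal=1.0):
    # The paper sets the two lambdas empirically; 1.0/1.0 are placeholders.
    return lam_dice * dice_loss(pred, target) + lam_focal * focal_loss(pred, target, alpha)

# Toy example with 3 classes.
logits = torch.randn(2, 3, 4, 8, 8)
pred = torch.softmax(logits, dim=1)
target = torch.zeros_like(pred)
target[:, 0] = 1.0  # every voxel labeled as background
alpha = torch.tensor([0.5, 1.0, 4.0])
loss = combined_loss(pred, target, alpha)
print(float(loss))
```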
2.4 Implementation details
All our models are implemented with PyTorch and trained on an NVIDIA GeForce GTX 1080Ti GPU (11 GB) with a batch size of 2. The inputs to the networks are resized to a uniform volume size. The Adam optimizer is utilized to train the models. A step-decay learning rate strategy is used, with an initial learning rate of 0.001 that is reduced by a factor of 10 every 50 epochs until it reaches 0.00001. Two evaluation metrics are calculated to characterize network performance: the Dice similarity coefficient (DSC) and the 95% Hausdorff distance (95HD).
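The step-decay schedule described above can be sketched with PyTorch's `StepLR`; the dummy model here is only a placeholder to drive the optimizer:

```python
import torch

# Step decay: start at 1e-3, divide by 10 every 50 epochs, floor at 1e-5.
model = torch.nn.Linear(4, 2)  # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.1)

lrs = []
for epoch in range(150):
    lrs.append(opt.param_groups[0]["lr"])
    # ... forward/backward pass and opt.step() would go here ...
    sched.step()
    for g in opt.param_groups:          # enforce the 1e-5 floor
        g["lr"] = max(g["lr"], 1e-5)

print(lrs[0], lrs[50], lrs[100])
```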
3 Experimental results
3.1 Results on the collected public dataset
Ablation studies regarding our network design are conducted. Average DSC and 95HD on the test set are listed in Table 1. DSC values of the 10 small organs are presented in Table 2. See supplementary material for results on all 24 organs. Four network configurations are involved. 3DUNet-SE refers to the baseline where 3D UNet is combined with the ResSE module. 3DUNet-SE includes only the coarse 3D convolution block in Fig. 1. Introducing the 2D convolution block to 3DUNet-SE, we obtain the 3DUNet-SE-2D model. 3DUNet-SE-2D-C replaces the HDC module in the proposed OrganNet2.5D (Fig. 1) with standard 3D convolutions, and 3DUNet-SE-2D-DC replaces the HDC module with dilated convolutions of the same dilation rate of 2.
Overall, our proposed model achieves the highest mean DSC and lowest mean 95HD. Statistical analysis confirms that our model performs significantly better than the other network configurations (paired t-tests on the DSC values). These results reflect that both the 2D convolution block and the fine 3D convolution block enhance the segmentation results. Furthermore, our proposed OrganNet2.5D gives excellent performance on small organ segmentation, generating the best results for 7 of the 10 small organs (Table 2).
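The significance test above compares two models on the same test cases, which calls for a paired t-test on the per-case DSC differences. A sketch with `scipy.stats.ttest_rel` follows; the per-case scores here are synthetic illustrations, not the study's data:

```python
import numpy as np
from scipy import stats

# Hypothetical per-case DSC values for two models on the same 47 test cases.
rng = np.random.default_rng(0)
dsc_baseline = rng.normal(0.78, 0.05, size=47)
dsc_proposed = dsc_baseline + rng.normal(0.02, 0.01, size=47)  # small, consistent gain

# Paired t-test: same cases under two models, so test the per-case differences.
t, p = stats.ttest_rel(dsc_proposed, dsc_baseline)
print(t, p)
```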
[Table 3. Per-organ DSC comparison on the MICCAI 2015 challenge dataset among the best MICCAI 2015 results, AnatomyNet, FocusNet, SOARS, SCAA, and the proposed method.]
3.2 Results on MICCAI 2015 challenge dataset
We compare the performance of our proposed model with state-of-the-art methods on the MICCAI 2015 challenge dataset (Table 3). It should be noted that in the table, the MICCAI 2015 results are the best results obtained for each OAR, possibly by different methods, and AnatomyNet was trained with additional samples beyond the MICCAI 2015 challenge dataset. All results of the existing methods are adopted from the respective papers without re-implementation to avoid implementation biases.
Segmentation results show that our proposed model achieves better performance than the three most prevalent methods in the field (MICCAI 2015, AnatomyNet, and FocusNet) as indicated by the mean DSC, which confirms the effectiveness of the proposed network. Compared to the two recently published methods, SOARS and SCAA, our method performs slightly worse. However, it should be noted that SOARS utilized neural architecture search to find the optimal network architecture, which is computationally intensive, and SCAA combined 2D and 3D convolutions with a very complicated network design. Nevertheless, with its simple and easy-to-implement architecture, our OrganNet2.5D still performs best when segmenting the smallest organ, the optic chiasma. This observation reflects the suitability of our network design and training strategy for the task. Visual results lead to similar conclusions as the quantitative results (see supplementary material for details).
In this study, we present a novel network, OrganNet2.5D, for the segmentation of OARs in 3D HaN CT images, which is a necessity for the treatment planning of radiation therapy for HaN cancers. To fully utilize the 3D image information, deal with the different in-plane and depth image resolutions, and solve the difficulty of simultaneous segmentation of large and small organs, OrganNet2.5D consists of a 2D convolution block to extract clear edge image features, a coarse 3D convolution block to obtain rich semantic features, and a fine 3D convolution block to generate global and multi-scale image features. The effectiveness of OrganNet2.5D was evaluated on two datasets. Promising performance was achieved by our proposed OrganNet2.5D compared to the state-of-the-art approaches, especially on the segmentation of small organs.
This research is partially supported by the National Key Research and Development Program of China (No. 2020YFC2004804 and 2016YFC0106200), the Scientific and Technical Innovation 2030 "New Generation Artificial Intelligence" Project (No. 2020AAA0104100 and 2020AAA0104105), the Shanghai Committee of Science and Technology, China (No. 20DZ1100800 and 21DZ1100100), the Beijing Natural Science Foundation-Haidian Original Innovation Collaborative Fund (No. L192006), funding from the Institute of Medical Robotics of Shanghai Jiao Tong University, the 863 national research fund (No. 2015AA043203), Shenzhen Yino Intelligence Technology Co., Ltd., Shenying Medical Technology (Shenzhen) Co., Ltd., the National Natural Science Foundation of China (No. 61871371 and 81830056), the Key-Area Research and Development Program of Guangdong Province (No. 2018B010109009), the Basic Research Program of Shenzhen (No. JCYJ20180507182400762), and the Youth Innovation Promotion Association Program of the Chinese Academy of Sciences (No. 2019351).
-  Brouwer, C.L., Steenbakkers, R.J.H.M., Heuvel, E.v.d., et al.: 3D variation in delineation of head and neck organs at risk. Radiation Oncology. 7(1), 32 (2012)
-  Chen, A., Niermann, K.J., Deeley, M.A., Dawant, B.M.: Evaluation of multiple-atlas-based strategies for segmentation of the thyroid gland in head and neck CT images for IMRT. Phys. Med. Biol. 57(1), 93–111 (2012)
-  Clark, K., Vendt, B., Smith, K., et al.: The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository. J. Digit. Imaging 26(6), 1045–1057 (2013)
-  Gao, Y., Huang, R., Chen, M., Wang, Z., Deng, J., Chen, Y., Yang, Y., Zhang, J., Tao, C., Li, H.: FocusNet: Imbalanced large and small organ segmentation with an end-to-end deep neural network for head and neck CT images. In: Shen, D. (eds.) MICCAI 2019, LNCS, vol. 11766, pp. 829–838. Springer, Cham (2019). 10.1007/978-3-030-32248-9_92
-  Guo, D., Jin, D., Zhu, Z., Ho, T., Harrison, A.P., Chao, C., Xiao, J., Lu, L.: Organ at risk segmentation for head and neck cancer using stratified learning and neural architecture search. In: 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4223–4232. Virtual Conference (2020)
-  Han, X., Hoogeman, M.S., Levendag, P.C., Hibbard, L.S., Teguh, D.N., Voet, P., Cowen, A.C., Wolf, T.K.: Atlas-based auto-segmentation of head and neck CT images. In: Metaxas D., Alex L., Fichtinger G., Szekely G. (eds.) MICCAI 2008, LNCS, vol. 5242, pp. 434–441. Springer, Berlin, Heidelberg (2008). 10.1007/978-3-540-85990-1_52
-  Harari, P.M., Song, S., Tome, W.A.: Emphasizing conformal avoidance versus target definition for IMRT planning in head-and-neck cancer. Int. J. Radiation Oncology Biol. Phys. 77(3), 950–958 (2010)
-  Ibragimov, B., Xing, L.: Segmentation of organs-at-risks in head and neck CT images using convolutional neural networks. Med. Phys. 44(2), 547–557 (2017)
-  Li, C., Sun, H., Liu, Z., Wang, M., Zheng, H., Wang, S.: Learning cross-modal deep representations for multi-modal MR image segmentation. In: Shen, D. (eds.) MICCAI 2019, LNCS, vol. 11765, pp. 57–65. Springer, Cham (2019). 10.1007/978-3-030-32245-8_7
-  Liang, S., Thung, K.-H., Nie, D., Zhang, Y., Shen, D.: Multi-view spatial aggregation framework for joint localization and segmentation of organs at risk in head and neck CT images. IEEE Trans. Med. Imaging 39(9), 2794–2805 (2020)
-  Liu, S., Xu, D., Zhou, S.K., Pauly, O., Grbic, S., Mertelmeier, T., Wicklein, J., Jerebko, A., Cai, W., Comaniciu, D.: 3D anisotropic hybrid network: Transferring convolutional features from 2D images to 3D anisotropic volumes. In: Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018, LNCS, vol. 11071, pp. 851–858. Springer, Cham (2018). 10.1007/978-3-030-00934-2_94
-  Men, K., Geng, H., Cheng, C., et al.: More accurate and efficient segmentation of organs-at-risk in radiotherapy with convolutional neural networks cascades. Med. Phys. 46(1), 286–292 (2019)
-  Qi, K., Yang, H., Li, C., Liu, Z., Wang, M., Liu, Q., Wang, S.: X-Net: Brain stroke lesion segmentation based on depthwise separable convolution and long-range dependencies. In: Shen, D. (eds.) MICCAI 2019, LNCS, vol. 11766, pp. 247–255. Springer, Cham (2019). 10.1007/978-3-030-32248-9_28
-  Raudaschl, P.F., Zaffino, P., Sharp, G.C., et al.: Evaluation of segmentation methods on head and neck CT: Auto-segmentation challenge 2015. Med. Phys. 44(5), 2020–2036 (2017)
-  Ren, X., Xiang, L., Nie, D., Shao, Y., Zhang, H., Shen, D., Wang, Q.: Interleaved 3D-CNNs for joint segmentation of small-volume structures in head and neck CT images. Med. Phys. 45(5), 2063–2075 (2018)
-  Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W., Frangi, A. (eds.) MICCAI 2015, LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). 10.1007/978-3-319-24574-4_28
-  Tang, H., Chen, X., Liu, Y., et al.: Clinically applicable deep learning framework for organs at risk delineation in CT images. Nat. Mach. Intell. 1(10), 480–491 (2019)
-  Tang, H., Liu, X., Han, K., et al.: Spatial context-aware self-attention model for multi-organ segmentation. In: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 939–949. Virtual Conference (2021)
-  Torre, L.A., Bray, F., Siegel, R.L., Ferlay, J., Lortet-Tieulent, J., Jemal, A.: Global cancer statistics, 2012. CA Cancer J. Clin. 65(2), 87–108 (2015)
-  Wang, P., Chen, P., Yuan, Y., et al.: Understanding convolution for semantic segmentation. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1451–1460. Lake Tahoe, NV, USA (2018)
-  Wang, Z., Wei, L., Wang, L., Gao, Y., Chen, W., Shen, D.: Hierarchical vertex regression-based segmentation of head and neck CT images for radiotherapy planning. IEEE Trans. Image. Process. 27(2), 923–937 (2018)
-  Zhu, W., Huang, Y., Zeng, L., Chen, X., Liu, Y., Qian, Z., Du, N., Fan, W., Xie, X.: AnatomyNet: Deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy. Med. Phys. 46(2), 576–589 (2019)