In this paper, we address the segmentation of imbalanced large and small organs in head and neck (HaN) CT images, where the key challenge is to precisely segment small organs whose volumes are much smaller than the average. In HaN radiotherapy planning, it is vitally important to accurately determine the locations and volumes of organs at risk (OARs), so that oncologists can design plans that concentrate the radiation on the lesion area without damaging normal organs. Currently, all OARs are manually annotated by oncologists, which is time-consuming, tedious, and prone to high inter- and intra-observer variation. A computer-aided head organ segmentation system could therefore significantly reduce the workload of doctors.
The main difficulty of the task is the severe imbalance between large and small organs (e.g., the smallest organ, the lens, occupies only 0.0028% of the whole 3D volume, while the parotid gland is over 250 times larger than the lens). State-of-the-art segmentation networks trained on the samples’ natural frequencies therefore perform poorly on small organs. In addition, owing to the limitations of CT imaging and the complex anatomical structure of the human head, the contrast between organs and their surroundings is often low. Together, these factors make it difficult to develop a method that segments both small and large organs simultaneously and accurately.
Over the past decade, many approaches have been proposed for the challenging problem of HaN organ segmentation. Early approaches include atlas-based methods, active contours, and graph cuts. Atlas-based methods were commonly used when only a small number of annotated images were available. However, they rely on image registration techniques and may generate incorrect organ maps if the organs are occupied by tumors.
Recently, convolutional neural networks (CNNs), with their powerful feature representation capability, have made revolutionary progress in many tasks. 2D CNN models such as U-Net and its 3D variants have achieved large performance gains over traditional methods in 2D and 3D segmentation. For OAR segmentation, Ibragimov et al. proposed the first deep learning-based algorithm. Ren et al. proposed an interleaved 3D-CNN for segmenting small organs in HaN, where the region of interest is obtained by registration. Zhu et al. proposed a 3D Squeeze-and-Excitation U-Net for fast segmentation. However, existing segmentation CNNs are not optimized for imbalanced organ segmentation: they generally produce accurate segmentation maps for large organs, while the accuracy on small organs is often sacrificed.
To address the issue of imbalanced large and small organ segmentation, we observed how oncologists annotate OARs. They usually annotate large organs at the normal scale, whereas for small organs they first find the location and then zoom in to mark them accurately. Based on this observation, we propose a novel end-to-end 3D convolutional neural network, FocusNet, which is specifically designed for accurate segmentation of both large and small organs.
The whole network has three parts: the main segmentation network (S-Net), the Small-Organ Localization branch (SOL-Net), and the Small-Organ Segmentation branch (SOS-Net). S-Net is a strong backbone network that is responsible for segmenting all large organs and also provides features for small-organ segmentation. SOL-Net is trained to highlight the center locations of the small organs. Based on the results of SOL-Net, high-resolution feature volumes concatenated with multi-scale feature volumes are ROI-pooled and fed into the SOS-Net for fine segmentation of small organs. All three networks share feature volumes and are jointly optimized. In addition, a weighted focal loss combined with a generalized dice loss is used to better handle the severe sample imbalance. The proposed method was evaluated on two datasets: a self-collected HaN dataset with 50 CT scans, and the MICCAI 2015 Head and Neck Auto Segmentation Challenge dataset. Our proposed algorithm outperforms state-of-the-art methods on HaN organ segmentation by a large margin.
The overall structure of the proposed FocusNet is illustrated in Fig. 1. FocusNet first segments large organs with the main segmentation network (S-Net) and localizes the small-organ centers with the Small-Organ Localization branch (SOL-Net). Multi-scale features and high-resolution features are then ROI-pooled around each small organ and used to generate small-organ label maps. The network thus decouples the localization and segmentation of small organs and addresses the sample imbalance problem, which makes the segmentation of small organs much easier. The results for large and small organs are finally fused to generate the final segmentation label maps.
2.0.1 Main Segmentation Network (S-Net).
U-Net is a commonly used 2D CNN, and many studies show that its 3D variants have better representation capability for 3D images because they better capture volumetric contextual information. However, the vanilla 3D U-Net performs poorly in OAR segmentation, and we address this problem from two aspects. First, the U-Net encoder embeds high-resolution information into feature maps through four down-sampling operations, while the decoder reconstructs the spatial resolution and obtains dense predictions by deconvolution operations. Down-sampling this many times loses high-resolution information, which is catastrophic for small organs that occupy only a few voxels. The shortcut connections between encoder and decoder only slightly mitigate this problem. Second, U-Net captures features only at the fixed scales produced by down-sampling, which limits its representation capability.
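The effect of repeated down-sampling on a small organ can be illustrated with simple arithmetic. The sketch below assumes that each down-sampling operation halves every spatial axis (a common choice, not stated in the text) and uses a hypothetical organ of 500 voxels.

```python
def voxels_after_downsampling(num_voxels, num_downsamplings, factor=2):
    """Approximate voxel count of a structure after repeated down-sampling.

    Assumes every operation shrinks each of the three axes by `factor`.
    """
    return num_voxels / (factor ** 3) ** num_downsamplings

organ_voxels = 500  # hypothetical small organ ("hundreds of voxels")
four_steps = voxels_after_downsampling(organ_voxels, 4)
one_step = voxels_after_downsampling(organ_voxels, 1)
print(four_steps, one_step)
```

Under these assumptions the organ shrinks to roughly 0.12 voxels after four down-sampling operations, but still covers 62.5 voxels after a single one.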
The S-Net is designed to solve the two problems mentioned above. As shown in Fig. 2, S-Net has a strong backbone, which is a variant of 3D U-Net with residual connections. Squeeze-and-excitation modules are used for channel-wise attention. To solve the first problem, the S-Net performs down-sampling only once. However, such a structure has the disadvantage that the receptive field of the convolution kernels is limited, which makes it difficult to integrate global image patterns and learn high-level features. Therefore, dilated convolution and densely connected atrous spatial pyramid pooling (DenseASPP) are utilized in our S-Net. DenseASPP can be seen as a combination of serially connected and parallel-connected dilated convolutions, which can combine features of arbitrary scales by adjusting the dilation rates and provides better feature reuse.
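The receptive-field argument can be made concrete with a short calculation. For stride-1 convolutions stacked in series, each layer with kernel size k and dilation d enlarges the receptive field by (k - 1) * d along each axis. The dilation rates below are illustrative assumptions, not the values used in S-Net; in a densely connected block they correspond to the longest serial path.

```python
def receptive_field(layers):
    """Receptive field (along one axis) of stride-1 convolutions in series.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf

plain = [(3, 1)] * 4                            # four ordinary 3-wide convolutions
dilated = [(3, 3), (3, 6), (3, 12), (3, 18)]    # hypothetical dilation rates

print(receptive_field(plain))    # 9
print(receptive_field(dilated))  # 79
```

With the same number of layers and parameters, the dilated stack covers a much wider context, which is the property the text relies on when down-sampling is reduced to a single operation.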
2.0.2 Small-Organ Localization Network (SOL-Net).
Mimicking the way oncologists annotate small organs, we design an SOL-Net to first localize the centers of the small organs. As shown in Fig. 1, the feature volumes from the last decoder layer of the S-Net are used as the input of SOL-Net. The training targets are small-organ center location heat maps, created as 3D Gaussian distributions centered at the organ centers, with a separate map for each small organ. The SOL-Net is trained to predict these location maps with a mean-squared-error loss. It consists of 2 Squeeze-and-Excitation Residual Blocks (SEResBlocks) and a final convolution layer followed by a sigmoid layer that outputs the small-organ location probability maps.
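A minimal NumPy sketch of such a training target is given below. The peak value of 1 and the standard deviation `sigma` are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma):
    """3D Gaussian heat map with peak value 1 at `center` (z, y, x indices)."""
    zz, yy, xx = np.meshgrid(
        np.arange(shape[0]), np.arange(shape[1]), np.arange(shape[2]),
        indexing="ij",
    )
    sq_dist = ((zz - center[0]) ** 2 + (yy - center[1]) ** 2
               + (xx - center[2]) ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def mse_loss(pred, target):
    """Mean-squared error averaged over all voxels."""
    return float(np.mean((pred - target) ** 2))

# One target map per small organ; the center here is hypothetical.
target = gaussian_heatmap(shape=(16, 32, 32), center=(8, 10, 20), sigma=2.0)
peak = np.unravel_index(np.argmax(target), target.shape)
print(peak, mse_loss(np.zeros_like(target), target))
```

Because the target lies in [0, 1], it matches the range of the sigmoid output described above.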
2.0.3 Small-Organ Segmentation Network (SOS-Net).
Given the center locations of the small organs from the SOL-Net outputs, we further improve the segmentation accuracy by focusing on the regions surrounding the small organs. Specifically, we first identify the voxel with the highest location probability from SOL-Net as the small-organ center, and ROI-pool a 3D feature volume around it. An SOS-Net, which contains 2 SEResBlocks and a convolution layer, is created for each small organ to output its binary segmentation map. The side length of the ROI is fixed at three times the average diameter of the small organ. In this way, the imbalance between negative and positive samples is solved.
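The localization and cropping steps can be sketched as follows. This is a simplified illustration: the ROI is taken as a plain crop without resampling, the cube is shifted inward at the volume border (an assumption, since border handling is not described), and the average diameter is a hypothetical value.

```python
import numpy as np

def locate_center(heatmap):
    """Voxel index (z, y, x) with the highest location probability."""
    idx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return tuple(int(i) for i in idx)

def crop_roi(features, center, side):
    """Crop a cube of `side` voxels around `center` from a (C, D, H, W) array.

    The cube is shifted inward where it would cross the volume border.
    Assumes `side` does not exceed any spatial dimension.
    """
    slices = [slice(None)]
    for c, size in zip(center, features.shape[1:]):
        start = min(max(c - side // 2, 0), size - side)
        slices.append(slice(start, start + side))
    return features[tuple(slices)]

heatmap = np.zeros((16, 32, 32))
heatmap[3, 30, 12] = 0.9                      # predicted organ center
features = np.random.default_rng(0).normal(size=(4, 16, 32, 32))

mean_diameter = 4                             # hypothetical, in voxels
side = 3 * mean_diameter                      # three times the diameter
center = locate_center(heatmap)
roi = crop_roi(features, center, side)
print(center, roi.shape)
```

Inside the cropped cube the organ occupies a far larger fraction of the voxels than in the whole volume, which is how the ROI reduces the imbalance between positive and negative samples.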
To make the best use of all available information, the multi-scale feature volumes from the last layer of the S-Net decoder, the raw image, and the high-resolution feature volumes from the first layer of the S-Net encoder are ROI-pooled from the small-organ ROI and concatenated as the input of SOS-Net. Furthermore, the small-organ location probability heat map is also concatenated as a spatial location prior. Intuitively, the multi-scale feature volumes from S-Net already encode the small organs’ segmentation results, and the high-resolution feature volumes help refine them. Finally, we integrate the small-organ and large-organ segmentation results to obtain the final prediction for all organs.
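The final integration step might look like the sketch below. The fusion rule is not specified in the text; here we assume that a positive voxel in a small-organ binary map overrides the label predicted by S-Net at that position. The function name and label values are hypothetical.

```python
import numpy as np

def paste_small_organ(label_map, roi_mask, roi_start, organ_label):
    """Write a binary ROI mask into a copy of the full label map.

    Voxels where `roi_mask` is True receive `organ_label`; all other
    voxels keep their original label.
    """
    fused = label_map.copy()
    region = tuple(slice(s, s + n) for s, n in zip(roi_start, roi_mask.shape))
    fused[region][roi_mask] = organ_label     # assignment through a view
    return fused

large = np.zeros((8, 8, 8), dtype=np.int64)
large[:, :, :4] = 1                           # large-organ output, label 1
mask = np.zeros((3, 3, 3), dtype=bool)
mask[1, 1, 1] = True                          # one voxel from a small-organ branch
final = paste_small_organ(large, mask, roi_start=(2, 2, 2), organ_label=5)
print(final[3, 3, 3])
```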
2.0.4 Loss function.
In our task, the ratio between the background and the smallest organ is extreme (the lens occupies only 0.0028% of the whole volume), so the loss is dominated by the large number of easy background samples. We use a focal loss for multi-class classification to solve this problem, and further use weights to balance between organs:

$$L_{\mathrm{focal}} = -\sum_{t=1}^{C} \alpha_t \, (1 - p_t)^{\gamma} \log(p_t),$$

where $C$ is the number of categories, $p_t$ is the probability of class $t$, and $\alpha_t$ is the weight of each organ, which is inversely proportional to the organ’s average size. $(1 - p_t)^{\gamma}$ is the modulating factor, which down-weights easy samples (voxels) whose prediction confidence is close to 1. In our experiments, $\gamma$ is set to 2.
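We also use a dice loss:

$$L_{\mathrm{dice}} = \sum_{t=1}^{C} \left( 1 - \frac{2 \sum_{i} g_{t,i} \, p_{t,i}}{\sum_{i} g_{t,i} + \sum_{i} p_{t,i}} \right),$$

where $g_{t,i}$ and $p_{t,i}$ are the label and probability of class $t$ at voxel $i$. In our experiments, the combination of focal loss and dice loss results in the best segmentation accuracy, and the total loss is as follows:

$$L = L_{\mathrm{focal}} + \lambda L_{\mathrm{dice}},$$

where $\lambda$ weights the dice term.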
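A compact NumPy sketch of a combined loss of this kind is shown below. Several details are assumptions rather than the authors' implementation: the focal term is evaluated on the ground-truth class of each voxel (as in the standard focal loss), the dice term is a soft dice summed over classes, the class weights are normalized inverse sizes, and the balance weight `lam` defaults to 1. The mean organ sizes are hypothetical.

```python
import numpy as np

def weighted_focal_loss(probs, onehot, alpha, gamma=2.0, eps=1e-7):
    """Weighted multi-class focal loss averaged over voxels.

    probs, onehot: arrays of shape (N, C); alpha: class weights of shape (C,).
    """
    p = np.clip(probs, eps, 1.0)
    per_voxel = -np.sum(onehot * alpha * (1.0 - p) ** gamma * np.log(p), axis=1)
    return float(per_voxel.mean())

def dice_loss(probs, onehot, eps=1e-7):
    """Soft dice loss summed over classes."""
    intersection = np.sum(probs * onehot, axis=0)
    denom = np.sum(probs, axis=0) + np.sum(onehot, axis=0)
    return float(np.sum(1.0 - (2.0 * intersection + eps) / (denom + eps)))

def total_loss(probs, onehot, alpha, gamma=2.0, lam=1.0):
    """Focal loss plus `lam` times dice loss (`lam` is an assumed weight)."""
    return (weighted_focal_loss(probs, onehot, alpha, gamma)
            + lam * dice_loss(probs, onehot))

# Toy example: 4 voxels, 3 classes (background, large organ, small organ).
onehot = np.eye(3)[[0, 0, 1, 2]]
mean_sizes = np.array([1e6, 2e4, 1e2])        # hypothetical mean voxel counts
alpha = (1.0 / mean_sizes) / np.sum(1.0 / mean_sizes)
uniform = np.full((4, 3), 1.0 / 3.0)
loss_uniform = total_loss(uniform, onehot, alpha)
loss_perfect = total_loss(onehot, onehot, alpha)
print(loss_uniform, loss_perfect)
```

The modulating factor is what suppresses easy voxels: a correctly classified voxel with confidence 0.9 contributes only 1% of its cross-entropy when the exponent is 2.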
The proposed FocusNet was evaluated on two datasets of HaN CT images. The first dataset, denoted as our dataset, consists of 50 samples collected from hospitals, in which 18 organs were delineated manually by doctors. We randomly shuffled our dataset and selected 40 samples for training and 10 samples for testing. For a fair comparison with state-of-the-art methods in OAR segmentation, we further evaluated FocusNet on a public dataset. The MICCAI Head and Neck Auto Segmentation Challenge 2015 dataset, denoted as the MICCAI’15 dataset, consists of 38 samples for training and 10 samples for testing, and has annotations for 9 organs. Two evaluation metrics are used in this study: the Dice score coefficient (DSC) and the 95% Hausdorff Distance (95HD).
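For reference, the two metrics can be sketched in NumPy as follows. This is a simplified illustration, not the evaluation code used in the experiments: the 95HD here is computed by brute force over all foreground voxels rather than extracted surfaces, and it takes the larger of the two directed 95th-percentile distances, which is one of several conventions in use.

```python
import numpy as np

def dice_score(pred, gt):
    """Dice score coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    total = pred.sum() + gt.sum()
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(2.0 * np.logical_and(pred, gt).sum() / total)

def hd95(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile symmetric Hausdorff distance between mask voxels.

    Brute force over all foreground voxels, so suitable for small masks
    only. Assumes both masks are non-empty.
    """
    a = np.argwhere(pred) * np.asarray(spacing)
    b = np.argwhere(gt) * np.asarray(spacing)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = dists.min(axis=1)
    b_to_a = dists.min(axis=0)
    return float(max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95)))

gt = np.zeros((10, 10, 10), dtype=bool)
gt[2:6, 2:6, 2:6] = True
pred = np.zeros_like(gt)
pred[2:6, 2:6, 3:7] = True                    # same cube shifted by one voxel

print(dice_score(pred, gt))  # 0.75
print(hd95(pred, gt))        # 1.0
```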
Training is performed in three stages. We first train the S-Net, and then train the SOL-Net while fixing the trained parameters of S-Net. The SOS-Net is trained last because it needs the resulting feature volumes and the location probability maps from the SOL-Net. Finally, we fine-tune the whole network end to end for joint optimization.
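The schedule can be written down as a small table of which sub-networks are updated in each stage. Whether S-Net and SOL-Net stay frozen while SOS-Net is trained is not stated explicitly, so that part of the sketch is an assumption; the stage names are ours.

```python
SUBNETS = ("S-Net", "SOL-Net", "SOS-Net")

# Each stage lists the sub-networks whose parameters are updated.
TRAINING_STAGES = [
    {"name": "stage 1", "trainable": {"S-Net"}},
    {"name": "stage 2", "trainable": {"SOL-Net"}},
    {"name": "stage 3", "trainable": {"SOS-Net"}},   # assumes the rest is frozen
    {"name": "joint fine-tuning", "trainable": set(SUBNETS)},
]

def frozen_subnets(stage):
    """Sub-networks that are not updated in the given stage."""
    return [name for name in SUBNETS if name not in stage["trainable"]]

for stage in TRAINING_STAGES:
    print(stage["name"], "frozen:", frozen_subnets(stage))
```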
3.1 Experiments on our collected dataset
The average number of voxels of each organ is shown in the supplementary material. The sample imbalance is severe: large organs occupy tens of thousands of voxels or more, while small organs occupy only hundreds or even tens of voxels. Organs with fewer than 1000 voxels are considered small organs, which results in 10 small organs in total.
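The small-organ criterion can be applied to a set of label volumes as sketched below. The toy label map and its organ sizes are made up for illustration; only the threshold of 1000 voxels comes from the text.

```python
import numpy as np

SMALL_ORGAN_THRESHOLD = 1000  # voxels, as stated in the text

def mean_voxels_per_label(label_maps, num_labels):
    """Average voxel count of each label over a list of integer label volumes."""
    counts = [np.bincount(lm.ravel(), minlength=num_labels) for lm in label_maps]
    return np.mean(counts, axis=0)

def small_organ_labels(label_maps, num_labels):
    """Foreground labels whose mean voxel count is below the threshold."""
    mean_counts = mean_voxels_per_label(label_maps, num_labels)
    return [label for label in range(1, num_labels)
            if mean_counts[label] < SMALL_ORGAN_THRESHOLD]

scan = np.zeros((20, 20, 20), dtype=np.int64)
scan[:12, :12, :12] = 1            # 1728 voxels: a large organ
scan[15:20, 15:20, 15:20] = 2      # 125 voxels: a small organ

print(small_organ_labels([scan, scan], num_labels=3))  # [2]
```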
3.1.1 Comparison with other methods.
We compare our proposed method with 3D U-Net, a 3D variant of DeepLab-v3+, and an atlas-based method. U-Net is popular in 2D medical image segmentation. We modify the U-Net by replacing its original convolutions with 3D Squeeze-and-Excitation Residual Blocks (SEResBlocks) to increase its segmentation accuracy, and denote the result as SERes U-Net. DeepLab-v3+ is successful in 2D semantic segmentation, and we extend its network structure to 3D for volumetric segmentation. All the compared deep learning-based methods were randomly initialized and trained using the same loss function as our proposed FocusNet. For the atlas-based method, we use Symmetric Normalization (SyN) as implemented in the ANTs software package to generate the template image (atlas) and the template label from the training set. Given a CT scan to be segmented, the optimal transformation between the atlas and the target is obtained by registration and then applied to the template label.
Comparative results are shown in Table 2. For large organs, deep learning methods achieve slightly better results, but for small organs they have a large advantage over the atlas-based method SyN. This is because small organs have more complex anatomical structures, and the atlas-based method has limited capability for dealing with complicated and diverse anatomical variations. Among the deep learning-based methods, our proposed FocusNet outperforms the others by large margins, because it has specific mechanisms for handling small organs and can therefore generate more accurate label maps.
For qualitative comparison, as seen in Fig. 3, the optic-nerve prediction given by DeepLab-v3+ is much larger than the ground truth, whereas FocusNet gives a precise segmentation result. For the optic chiasm, the results of the other methods are fragmented and show no clear shape, while FocusNet gives the best segmentation, with an X-like shape. SERes U-Net has poor results on the lens, because the lens occupies only tens of voxels and too much information is lost during down-sampling.
3.1.2 Ablation Study of FocusNet.
The ablation results are shown in Table 2. Our baseline is the 3D SERes U-Net, which uses SEResBlocks but has 4 down-sampling operations and adopts the cross-entropy loss. The baseline yields decent segmentation results for large organs but poor results for small organs. We then test the baseline with 1, 2, and 3 down-sampling operations, respectively; the variant with 1 down-sampling operation achieves the highest performance. Using the focal loss (FL) and then combining it with the dice loss (DL) further improves the segmentation accuracy. Adding the DenseASPP module to the network, which yields our S-Net, slightly increases the performance. Our proposed FocusNet boosts the final performance by adopting the specifically designed small-organ networks.
We also increase the number of channels of all layers of S-Net by a constant ratio, so that the resulting network, named Fat U-Net, has the same number of parameters as FocusNet. The results show that, without solving the organ imbalance problem, more parameters cannot boost the performance. We further conducted experiments to find the best ROI size, which is obtained when the side length of the ROI is 3 times that of the organ.
3.2 Experiments on MICCAI’15 dataset
We also test our FocusNet on the MICCAI 2015 Head and Neck dataset. All settings of FocusNet are the same as those used in the experiments on our collected dataset, except that the number of small-organ branches is set to 3, since only 3 organs meet our previous definition of small organs: the left and right optic nerves and the optic chiasm. Visualizations can be found in the supplementary material.
The evaluation results are shown in Table 4 and Table 5. We compare against the highest score among the top four teams in the MICCAI 2015 challenge. For the result of Zhu et al., it should be noted that they used the 38 samples provided by the MICCAI 2015 Challenge combined with an additional 216 samples for training.
Our backbone S-Net achieves state-of-the-art performance. It reaches a Dice score comparable to that of Zhu et al. while using only the 38 challenge samples for training, rather than their 254, which shows that S-Net has stronger feature representation capability. Moreover, S-Net has much better results in terms of 95HD, because outliers are alleviated by enlarging the receptive field. After adding SOL-Net and SOS-Net, FocusNet achieves further improvement, especially for the three small organs.
Wang et al. proposed a vertex regression-based method, which performs well on the brain stem and mandible but relatively poorly on the parotid glands; they did not report results for other organs. Compared with the registration-based region proposal and patch-based segmentation method used by Ren et al., our approach integrates the localization and segmentation of small organs into a unified deep learning framework, which is much faster, has no redundant computation, and results in better performance.
We proposed an end-to-end deep neural network, FocusNet, which outperforms state-of-the-art methods on the segmentation of imbalanced OARs in HaN CT images. By reducing the number of down-sampling operations and utilizing multi-scale features learned by DenseASPP, our S-Net ensures accurate segmentation of large organs. Trained to predict small-organ center location maps, SOL-Net generates accurate small-organ center locations. SOS-Net addresses the sample imbalance problem and uses high-resolution feature volumes to accurately segment small organs, which further boosts the performance. A weighted focal loss combined with a dice loss is introduced to mitigate the sample imbalance problem. Extensive experiments on real patients’ data and the MICCAI 2015 dataset show the effectiveness of our proposed FocusNet.
This work has been supported in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14208417 and CUHK14239816, in part by the Hong Kong Innovation and Technology Support Programme (No. ITS/312/18FX), in part by National Key R&D Program of China (2017YFC0113201) and Zhejiang Key R&D Program (2019C03003).
-  Avants, B.B., et al.: Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis 12(1), 26–41 (2008)
-  Chen, L.C., et al.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Computer Vision – ECCV 2018. Springer (2018)
-  Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks
-  Ibragimov, B., Xing, L.: Segmentation of organs-at-risks in head and neck ct images using convolutional neural networks. Medical physics 44(2), 547–557 (2017)
-  Lin, T.Y., et al.: Focal loss for dense object detection. In: Proceedings of the IEEE Conference on CVPR. pp. 2980–2988 (2017)
-  Raudaschl, P.F., et al.: Eval. of seg. methods on han ct: Auto-seg. challenge 2015. Medical physics 44(5), 2020–2036 (2017)
-  Ren, X., et al.: Interleaved 3D-CNNs for joint segmentation of small-volume structures in head and neck ct images. Medical physics 45(5), 2063–2075 (2018)
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image seg. In: International Conf. on MICCAI. pp. 234–241. Springer (2015)
-  Wang, Z., Wei, L., Wang, L., Gao, Y., Chen, W., Shen, D.: Hierarchical vertex regression-based segmentation of head and neck ct images for radiotherapy planning. IEEE Transactions on Image Processing 27(2), 923–937 (2018)
-  Yang, M., Yu, K., Zhang, C., Li, Z., Yang, K.: Denseaspp for semantic segmentation in street scenes. In: Proceedings of the IEEE Conf. on CVPR. pp. 3684–3692 (2018)
-  Zhu, W., et al.: Anatomynet: Deep 3d squeeze-and-excitation u-nets for fast and fully automated whole-volume anatomical segmentation. Medical Physics 46(2), 576–589 (2019)