Head and Neck (H&N) tumour is the fifth most prevalent cancer worldwide.
Improving the accuracy and efficiency of disease diagnosis and treatment is the rationale behind the developments of computer-aided systems in medical imaging [1, 2].
However, obtaining manual segmentations, which can be used for diagnostic and treatment purposes, is time-consuming and suffers from intra- and inter-observer biases.
Furthermore, segmentation of H&N tumours is challenging compared to other parts of the body, as the tumour displays intensity values similar to those of the adjacent tissues, making it indistinguishable to the human eye in Computed Tomography (CT) images. Previous attempts at developing deep learning models to segment head and neck tumours suffered from a relatively high number of false positives [3, 4].
Currently, in the normal clinical pathway, a combination of Positron Emission Tomography (PET) and CT images plays a key role in the diagnosis of H&N tumors.
This multi-modal approach has dual benefits: the metabolic information is provided by PET and anatomical information is available in CT.
Furthermore, accurate segmentation of H&N tumors could also be used in automating pipelines for extraction of quantitative imaging features (e.g. radiomics) in prediction of patient survival.
The 3D UNet is one of the most widely employed encoder-decoder architectures for medical image segmentation, inspired by Fully Convolutional Networks. Promising results have been obtained using a 3D UNet based architecture and attention mechanisms with early fusion of PET/CT images. While the performance of that model was significantly improved compared to a baseline 3D UNet, a number of false positives were reported, where the model segmented not only the primary tumour but also other isolated areas, such as the soft palate, due to tracer overactivity in that region.
In this paper, we propose to segment the 3D H&N tumor volume from multimodal PET/CT imaging using a full-scale 3D UNet3+ architecture with an attention mechanism. Our model, NormResSE-UNet3+, is trained with a hybrid loss function combining the Log-Cosh Dice loss and the Focal loss. The segmentation maps predicted by our model are further refined by Conditional Random Fields post-processing [12, 13] to reduce the number of false positives and to improve tumour boundary segmentation. For the progression-free survival prediction task, we propose a Cox proportional hazard regression model using a combination of clinical, radiomic, and deep learning features from PET/CT images.
The paper is organised as follows. Section 2 outlines the data set and pre-processing steps (Section 2.1), the methods used for 3D H&N tumor segmentation and for prediction of progression free survival tasks (Section 2.2.4), and the evaluation criteria (Section 2.3). The experimental set-up and results are described in Section 3. Finally, the discussion and conclusion are found in Section 4.
2 Methods and Data
2.1.1 Data for Segmentation.
The PET and CT images used in this challenge were provided by the organisers of the HECKTOR challenge at the 24th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI). The training set comprises 224 cases from 5 centers: CHGJ, CHMR, CHUS, CHUP, and CHUM. Ground-truth annotations of the primary gross tumor volume are provided by expert clinicians. For testing, an additional 101 cases from two centers, namely CHUP and CHUV, are provided; however, no expert annotations are available to the participants.
2.1.2 Data Preprocessing for Segmentation Task.
For the segmentation task, we used trilinear interpolation to resample the PET and CT images. Bounding boxes of x x voxels were provided by the organizers and used for patch extraction. PET intensities (given in standardised uptake values) were normalised with Z-score, while CT intensities (given in Hounsfield units) were clipped to the range.
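The intensity normalisation described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code: the function names and the Hounsfield window (`hu_min`, `hu_max`) are hypothetical placeholders, since the clipping range is not recoverable from the text.

```python
import numpy as np

def zscore_normalise(pet, eps=1e-8):
    """Z-score normalisation of a PET volume: zero mean, unit variance."""
    pet = np.asarray(pet, dtype=np.float32)
    return (pet - pet.mean()) / (pet.std() + eps)

def clip_ct(ct, hu_min=-1024.0, hu_max=1024.0):
    """Clip CT intensities (Hounsfield units) to [hu_min, hu_max].

    The default window is an assumed placeholder, not the paper's value.
    """
    return np.clip(np.asarray(ct, dtype=np.float32), hu_min, hu_max)

# Toy volumes standing in for a resampled PET/CT patch.
rng = np.random.default_rng(0)
pet_patch = rng.gamma(shape=2.0, scale=1.5, size=(8, 8, 8))
ct_patch = rng.uniform(-2000.0, 3000.0, size=(8, 8, 8))

pet_norm = zscore_normalise(pet_patch)
ct_clipped = clip_ct(ct_patch)
```

Whether the statistics are computed per patch or per whole volume is a design choice that the text does not specify; the sketch uses whatever array it is given.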
2.1.3 Data for Progression Free Survival.
Patient clinical data are provided for the prediction of progression-free survival in days. The covariates (a combination of categorical and continuous variables) are as follows: center ID, age, gender, TNM staging (7th/8th edition) and clinical stage, tobacco and alcohol consumption, performance status, HPV status, and treatment (radiotherapy only, or chemoradiotherapy). Dummy variables were used to encode the categorical variables, i.e. all of the above except age (continuous). Among the 224 training patients, the median age was 63 years (range: 34-90 years); a progression event occurred in 56 patients, and the average progression-free survival was 1218 days (range: 160-3067 days). The testing cohort comprised 129 patients with a median age of 61 years (range: 40-84 years).
2.1.4 Data Preprocessing for Progression Free Survival Task.
For the prediction of progression-free survival, we used multiple imputation to handle missing values. Multiple imputation models each feature that contains missing values as a function of the other features, and uses this estimate to impute the missing values in a round-robin fashion. At each iteration, one feature is designated as the output y and the other features are treated as inputs X. A regressor is fit on (X, y) for the known values of y and is then used to predict the missing values of y. This process is repeated for each feature and for ten imputation rounds.

Feature selection was performed using Lasso regression with 5-fold cross-validation on a combination of clinical features, radiomic features, and features extracted from the 3D UNet (see Fig. 1). The correlation between these features was evaluated with Spearman's correlation coefficient to assess potential redundancy, and a threshold of 0.80 was set to filter out highly correlated features. The feature selection process reduced the number of features from 275 to the 70 most relevant ones. In particular, we kept 7 clinical features (i.e. Age, Chemotherapy, Tstage, Nstage, TNMgroup, TNM edition, and HPV status), 14 radiomic features (5 intensity-based histogram features, 3 shape features, and 6 texture features from the Gray Level Co-occurrence Matrix (GLCM), Gray Level Run Length Matrix (GLRLM), and Gray Level Size Zone Matrix (GLSZM)), and 49 deep learning features. Radiomic features were extracted from PET and CT images using the pyradiomics package.

Convolutional Neural Networks (CNNs), which use stacks of filtering layers together with pooling and activation layers, are becoming increasingly popular in the field of radiomics. This can be explained by the fact that CNNs do not require prior knowledge: their kernels are learned automatically, as opposed to hand-crafted features. In this study, deep learning features were extracted at the 5th convolutional layer of our model by averaging feature maps. A vector was created for each feature map, and these vectors were then concatenated to form a single vector of deep learning features. The power of CNNs lies in their ability to learn multiple filters in parallel, each capturing different characteristics of the image, thus extracting both low- and high-level features such as edges, intensity, and texture (see a schematic overview of the survival pipeline in Fig. 1).
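The redundancy check with Spearman's correlation can be illustrated with a short NumPy sketch. It assumes a greedy rule (a feature is kept only if its absolute correlation with every feature already kept is below the 0.80 threshold) and ranks without tie handling; both are simplifying assumptions, and the function names are hypothetical rather than taken from the authors' pipeline.

```python
import numpy as np

def spearman_matrix(X):
    """Spearman correlation between columns of X (assumes no tied values)."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def drop_correlated(X, threshold=0.80):
    """Greedily keep columns whose |rho| with all kept columns is < threshold."""
    rho = np.abs(spearman_matrix(X))
    kept = []
    for j in range(X.shape[1]):
        if all(rho[j, k] < threshold for k in kept):
            kept.append(j)
    return kept

# Toy feature matrix: column 1 is a monotone transform of column 0,
# so their Spearman correlation is 1 and column 1 is redundant.
a = np.arange(10.0)
c = np.array([3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0, 6.0, 5.5, 3.5])
X = np.column_stack([a, a ** 3, c])

rho = spearman_matrix(X)
kept_columns = drop_correlated(X, threshold=0.80)
```

Because Spearman's coefficient works on ranks, it flags monotone but non-linear relationships (such as the cubic transform above) that a Pearson-based filter could underestimate.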
2.2 Models Description
2.2.1 Models for Segmentation Task.
The UNet model, an encoder-decoder architecture, is one of the most widely employed segmentation models in medical imaging. Skip connections couple the high-level feature maps obtained by the decoder with the corresponding low-level feature maps from the encoder. UNet++ is an extension of UNet that introduces nested and dense skip connections to reduce the merging of dissimilar features caused by the plain skip connections in UNet. However, since UNet++ still fails to capture relevant information from full scales and to recover information lost in down- and up-sampling, Huang et al. proposed UNet3+ to take full advantage of multi-scale features. The design of inter- and intra-connections between the encoder and decoder pathways at full scale enables the exploration of both fine- and coarse-level details (see a schematic overview in Fig. 2). Low-level details carry spatial and boundary information about the tumour, while high-level details encode information about its location. Deep supervision is integrated in the decoder pathway to reduce false positives. To further reduce false positives and improve segmentation, we make use of attention mechanisms, which have achieved state-of-the-art segmentation results. In particular, we use the 3D normalised squeeze-and-excitation residual blocks proposed by Iantsen et al. and evaluated on the PET/CT H&N dataset from the MICCAI 2020 challenge.
2.2.2 Loss function for Segmentation Task.
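The Dice loss, derived from the Dice coefficient, is a widely used loss function for segmentation tasks, and is defined as follows:

$$ \mathcal{L}_{Dice}(y, \hat{y}) = 1 - \frac{2 y \hat{y} + 1}{y + \hat{y} + 1} $$

where $y$ is the ground truth and $\hat{y}$ is the prediction. In addition, 1 is added to the numerator and the denominator to ensure that the function is not undefined when $y = \hat{y} = 0$, i.e. when the tumour is not present.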
The Focal loss is a variation of the Cross-Entropy loss. It is well suited to class imbalance problems, as it down-weights easy examples to focus on hard ones, and is defined as follows:

$$ \mathcal{L}_{Focal}(p_t) = -(1 - p_t)^{\gamma} \log(p_t) $$

where $p_t$ is the predicted probability of the true class and $\gamma$ in the modulating factor $(1 - p_t)^{\gamma}$ is optimised at 2. The Log-Cosh function is also popular for smoothing the curve in regression problems.
For the data in the HECKTOR challenge, we tested the above-mentioned loss functions and their combinations. In our best performing model, we used a hybrid loss function that combines the Log-Cosh Dice loss, i.e. the logarithm of the hyperbolic cosine of the Dice loss, with the Focal loss.
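As a rough illustration of how such a hybrid loss could be assembled, the NumPy sketch below combines a smoothed Dice loss, its Log-Cosh variant, and a Focal loss with gamma = 2. The unweighted sum in `hybrid_loss`, the voxel-wise mean in `focal_loss`, and all function names are assumptions made for this example; the relative weighting used in the paper is not recoverable from the text.

```python
import numpy as np

def dice_loss(y_true, y_prob):
    """Dice loss with +1 smoothing in numerator and denominator."""
    intersection = np.sum(y_true * y_prob)
    return 1.0 - (2.0 * intersection + 1.0) / (np.sum(y_true) + np.sum(y_prob) + 1.0)

def log_cosh_dice_loss(y_true, y_prob):
    """Log-Cosh Dice loss: log(cosh(Dice loss))."""
    return np.log(np.cosh(dice_loss(y_true, y_prob)))

def focal_loss(y_true, y_prob, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over voxels (no alpha balancing term)."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    p_t = np.where(y_true == 1, p, 1.0 - p)  # probability of the true class
    return np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t))

def hybrid_loss(y_true, y_prob, gamma=2.0):
    """Assumed unweighted sum of Log-Cosh Dice and Focal losses."""
    return log_cosh_dice_loss(y_true, y_prob) + focal_loss(y_true, y_prob, gamma)

# Toy example: a fully foreground patch with a perfect and a wrong prediction.
target = np.ones(8)
loss_perfect = hybrid_loss(target, np.ones(8))
loss_wrong = hybrid_loss(target, np.zeros(8))
loss_empty = dice_loss(np.zeros(8), np.zeros(8))  # no tumour, nothing predicted
```

The smoothing term keeps the Dice loss defined (and equal to zero) when both the target and the prediction are empty, which is the case the added constant is meant to handle.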
2.2.3 Refining segmentation maps.
We used 3D Conditional Random Fields (CRF) to refine the segmentation maps [12, 13]. The segmentation outputs produced by CNNs tend to be too smooth because neighbouring voxels share spatial information. A CRF is a graphical model that captures contextual, shape, and region connectivity information, which has made it a popular refinement procedure for improving segmentation performance; for example, CRFs have been used to refine segmentation outputs as a post-processing step.
2.2.4 Models for Progression Free Survival Task.
The Cox proportional hazard (CoxPH) regression model is the most commonly used hazard model in the medical field because it deals effectively with censoring. Random Survival Forest is also a popular model for survival time prediction, and works better for large sample sizes [19, 20]. It builds an ensemble of trees on different bootstrap samples of the training data before aggregating the predictions. DeepSurv showed improvements over the traditional CoxPH model, as it better captures the complex relationship between a patient's features and the effectiveness of different treatments. DeepSurv is a Cox proportional hazards deep neural network, which estimates the individual's effect based on the parametrized weights of the neural network. The architecture is a multi-layer perceptron with a configurable number of hidden layers. In this study, we used 32 hidden layers, which are fully connected layers with nonlinear activations. Dropout layers are added to reduce over-fitting. The output layer of DeepSurv has a single node with a linear activation function, which gives estimates of the log-risk function. Whereas traditional Cox regression is optimized with the Cox partial likelihood, DeepSurv minimises the negative log partial likelihood with the addition of a regularization term. DeepSurv achieved state-of-the-art results for cancer prognosis prediction, with a concordance index close to or higher than 0.8 [22, 23].
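The training objective described above, the average negative log partial likelihood over observed events plus an L2 penalty, $-\frac{1}{N_{E=1}} \sum_{i: E_i = 1} \big( h(x_i) - \log \sum_{j \in R(T_i)} e^{h(x_j)} \big) + \lambda \lVert \theta \rVert_2^2$, can be sketched in NumPy as follows. The function name, the `weights` argument, and the handling of tied event times (ignored here) are assumptions for illustration, not the Pycox or DeepSurv implementation.

```python
import numpy as np

def neg_log_partial_likelihood(log_risk, times, events, weights=None, l2=0.0):
    """Cox negative log partial likelihood with an optional L2 penalty.

    log_risk: predicted log-risk h(x_i) for each subject
    times:    observed follow-up times
    events:   1 if progression was observed, 0 if censored
    weights:  optional list of parameter arrays for the L2 term
    """
    log_risk = np.asarray(log_risk, dtype=float)
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=float)

    # Sort by decreasing time so the risk set of subject i is subjects 0..i.
    order = np.argsort(-times, kind="stable")
    h = log_risk[order]
    e = events[order]

    # log of the cumulative sum of exp(h) over each risk set.
    log_risk_set = np.logaddexp.accumulate(h)

    n_events = max(e.sum(), 1.0)
    loss = -np.sum((h - log_risk_set) * e) / n_events
    if weights is not None:
        loss += l2 * sum(float(np.sum(np.square(w))) for w in weights)
    return float(loss)

# Two subjects, both with observed events; the second progresses first.
loss_uninformative = neg_log_partial_likelihood([0.0, 0.0], [2.0, 1.0], [1, 1])
loss_informative = neg_log_partial_likelihood([0.0, 1.0], [2.0, 1.0], [1, 1])
```

Censored subjects contribute to the risk sets but not to the sum over events, which is how the partial likelihood accounts for censoring.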
2.3 Evaluation Metrics
2.3.1 Evaluation Metrics for Segmentation Task.
The Dice Similarity Coefficient (DSC) is a region-based measure that evaluates the overlap between the prediction (P) and the ground truth (G). DSC is given as follows:

$$ \mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|} $$
The DSC ranges between 0 and 1, with a larger DSC denoting better performance.
The average Hausdorff distance (HD) between the voxel sets of the ground truth and the segmentation is defined as:

$$ \mathrm{HD}_{avg} = \frac{1}{2}\left(\frac{GtoP}{G} + \frac{PtoG}{P}\right) $$
where GtoP is the directed average HD from the ground-truth to the segmentation, PtoG is the directed average HD from the segmentation to the ground truth, G is the number of voxels in the ground truth, and P is the number of voxels in the segmentation. The 95th percentile of the distances between voxel sets of ground truth and segmentation (HD95) is used in this work to reduce the impact of outliers.
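The segmentation metrics above can be illustrated on binary masks with a brute-force NumPy sketch. It makes several assumptions that are not stated in the text: distances are computed between all foreground voxels rather than surface voxels, they are expressed in voxel units, the directed terms GtoP and PtoG are taken as sums of nearest-neighbour distances, and HD95 is taken as the larger of the two directed 95th percentiles. The function names are hypothetical, and challenge evaluation code may differ.

```python
import numpy as np

def dice_coefficient(pred, gt):
    """DSC = 2|P ∩ G| / (|P| + |G|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * np.logical_and(pred, gt).sum() / denom

def directed_distances(a_pts, b_pts):
    """For each point in a_pts, the distance to its nearest point in b_pts."""
    diff = a_pts[:, None, :] - b_pts[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def average_hausdorff(pred, gt):
    """0.5 * (GtoP / |G| + PtoG / |P|), with GtoP and PtoG as distance sums."""
    p, g = np.argwhere(pred), np.argwhere(gt)
    g_to_p = directed_distances(g, p).sum()
    p_to_g = directed_distances(p, g).sum()
    return 0.5 * (g_to_p / len(g) + p_to_g / len(p))

def hd95(pred, gt):
    """Larger of the two directed 95th-percentile distances."""
    p, g = np.argwhere(pred), np.argwhere(gt)
    return max(np.percentile(directed_distances(g, p), 95),
               np.percentile(directed_distances(p, g), 95))

# Toy masks: the ground truth is the prediction shifted by one voxel.
pred_mask = np.zeros((4, 4, 4), dtype=np.uint8)
gt_mask = np.zeros((4, 4, 4), dtype=np.uint8)
pred_mask[1:3, 1:3, 1:3] = 1
gt_mask[1:3, 1:3, 2:4] = 1

dsc = dice_coefficient(pred_mask, gt_mask)
ahd = average_hausdorff(pred_mask, gt_mask)
hd_95 = hd95(pred_mask, gt_mask)
```

The pairwise distance matrix grows with the product of the two voxel counts, so this form is only practical for small masks; a distance transform would be the usual choice for full volumes.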
2.3.2 Evaluation Metrics for Survival Task.
Harrell’s concordance index (C-index) is the most widely used measure of goodness-of-fit in survival models. It is defined as the ratio of correctly ordered (concordant) pairs divided by the total number of possible evaluation pairs. The C-index is used in this study to evaluate survival prediction outcome as it takes into account censoring. The C-index quantifies how well an estimated risk score is able to discriminate among subjects who develop an event from those who do not. In this work, the event of interest is progression. The C-index ranges between 0 and 1 with 1 denoting perfect predicted risk.
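The pairwise definition of the C-index can be made concrete with a small pure-Python sketch. It assumes the common convention that a pair is comparable when the subject with the shorter time had an observed event, counts tied risk scores as half-concordant, and ignores ties in time; the function name is hypothetical and this is not the Lifelines implementation.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ordered correctly.

    times:  observed follow-up times
    events: 1 if the event (progression) was observed, 0 if censored
    risks:  predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Subject i must have an observed event before subject j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Four patients; the third is censored at day 300.
times = [100, 200, 300, 400]
events = [1, 1, 0, 1]

c_perfect = concordance_index(times, events, [4.0, 3.0, 2.0, 1.0])
c_reversed = concordance_index(times, events, [1.0, 2.0, 3.0, 4.0])
c_constant = concordance_index(times, events, [1.0, 1.0, 1.0, 1.0])
```

A constant risk score yields 0.5, which is why 0.5 is usually read as chance-level discrimination.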
3.1 Segmentation Task
The model was trained on 2 NVIDIA A100 GPUs for 1000 epochs using the Adam optimizer with β1 = 0.9 and β2 = 0.999. The scheduler is cosine annealing with warm restarts with the input learning rate value of, and reducing the learning rate every 25 epochs. A batch size of 2 was used for training and validation. Data augmentation, namely random flipping and random rotation, was used during training to reduce over-fitting. The Lifelines and Pycox packages were used for all statistical analyses.
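For readers unfamiliar with cosine annealing with warm restarts, the schedule can be written in a few lines of pure Python. The sketch assumes that the 25 epochs mentioned above correspond to the restart period (`t_0`), that the period is not lengthened after each restart, and that the minimum learning rate is zero; `base_lr=1e-3` is a hypothetical placeholder, since the initial learning rate is not recoverable from the text.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr, t_0=25, eta_min=0.0):
    """Learning rate at a given epoch for cosine annealing with warm restarts.

    The rate decays from base_lr towards eta_min along a half cosine and
    jumps back to base_lr every t_0 epochs.
    """
    t_cur = epoch % t_0  # epochs elapsed since the last restart
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_0))

# Hypothetical initial learning rate, for illustration only.
schedule = [cosine_warm_restart_lr(epoch, base_lr=1e-3) for epoch in range(50)]
```

With these assumptions, `schedule[0]` and `schedule[25]` both equal the initial rate, and the values in between decrease monotonically.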
We trained the 3D NormResSE-UNet3+ using a leave-one-center-out cross-validation scheme, and performed model ensembling by averaging the test-set predictions of the 5 trained models (see Fig. 3 and Tab. 1). An example of a good quality segmentation map predicted by our model is shown in Fig. 3 (first row). An example of a predicted segmentation map that benefited from the CRF post-processing to reduce false positives is shown in Fig. 3 (second row). An example where our pipeline failed to separate false positives from the true primary tumour is shown in Fig. 3 (third row). The quantitative results are summarised in Tab. 1. For each fold, the segmentation results are presented in terms of DSC. We obtained an average DSC of 0.753 and an average 95th percentile Hausdorff distance (HD95) of 3.28 with post-processing and ensembling techniques. On the test set provided by HECKTOR 2021, our model achieved an average DSC of 0.7595 and an HD95 of 3.27, showing good generalisability.
|Cross-validation fold||NormResSE-UNet3+||NormResSE-UNet3+ + CRF|
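The leave-one-center-out training and ensembling described above can be sketched as follows. The case identifiers, the toy probability maps, and the choice to average probabilities before thresholding at 0.5 are assumptions for illustration; the text only states that predictions of the 5 models were averaged.

```python
import numpy as np

def leave_one_center_out(case_centers):
    """Yield (held_out_center, train_ids, val_ids) for each center in turn."""
    for held_out in sorted(set(case_centers.values())):
        train_ids = [c for c, ctr in case_centers.items() if ctr != held_out]
        val_ids = [c for c, ctr in case_centers.items() if ctr == held_out]
        yield held_out, train_ids, val_ids

def ensemble_average(prob_maps, threshold=0.5):
    """Average per-model probability maps, then threshold to a binary mask."""
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return (mean_prob >= threshold).astype(np.uint8)

# Hypothetical case-to-center mapping using the five training centers.
case_centers = {
    "case_01": "CHGJ", "case_02": "CHMR", "case_03": "CHUS",
    "case_04": "CHUP", "case_05": "CHUM", "case_06": "CHUM",
}
folds = list(leave_one_center_out(case_centers))

# Toy 2x2 probability maps standing in for three models' test predictions.
fold_predictions = [
    np.array([[0.9, 0.2], [0.6, 0.1]]),
    np.array([[0.8, 0.4], [0.3, 0.0]]),
    np.array([[0.7, 0.3], [0.9, 0.2]]),
]
final_mask = ensemble_average(fold_predictions)
```

Holding out a whole center per fold means each validation score reflects performance on scanners and protocols unseen during training, which is the property the scheme is meant to test.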
3.2 Survival Task
We trained three models, CoxPH regression, Random Survival Forest, and DeepSurv, on 5-fold cross-validation splits. Each model was trained with different configurations of clinical, PET/CT radiomic, and deep learning features. CoxPH regression trained with a combination of clinical, CT radiomic, and deep learning features achieved the best C-index of 0.82, which was statistically significant (p-value < 0.05) according to a corrected paired two-tailed t-test at the 5% significance level used to compare each pair of models (see Tab. 2). The second best C-index of 0.75 was obtained with CoxPH regression trained with clinical and deep learning features (see Tab. 2).
On the test set provided by HECKTOR 2021, the CoxPH regression using clinical, CT radiomic, and deep learning features performed substantially worse, with a C-index of 0.62, suggesting over-fitting.
|Model (features)||C-index|
|CoxPH Regression (clinical)||0.70|
|CoxPH Regression (clinical + PET radiomics)||0.67|
|CoxPH Regression (clinical + CT radiomics)||0.68|
|CoxPH Regression (clinical + PET/CT radiomics)||0.72|
|CoxPH Regression (clinical + deep learning features)||0.76|
|CoxPH Regression (clinical + CT radiomics + deep learning features)||0.82|
|Random Survival Regression (clinical)||0.59|
|Random Survival Regression (clinical + PET radiomics)||0.60|
|Random Survival Regression (clinical + CT radiomics)||0.61|
|Random Survival Regression (clinical + PET/CT radiomics)||0.59|
|Random Survival Regression (clinical + CT radiomics + deep learning features)||0.58|
|DeepSurv (clinical + PET radiomics)||0.68|
|DeepSurv (clinical + CT radiomics)||0.69|
|DeepSurv (clinical + PET/CT radiomics)||0.73|
|DeepSurv (clinical + PET/CT radiomics + deep learning features)||0.65|
We proposed a multimodal 3D H&N tumor segmentation model, NormResSE-UNet3+, which combines squeeze-and-excitation layers with a UNet3+ architecture to take advantage of full-scale features while allowing the model to focus on the relevant regions of interest. The combination of local and global (e.g. context) information aimed to improve the accuracy of the segmentation. We investigated the proposed neural network architecture with different training schemes and different loss functions (the Lovasz-Softmax loss and the Tversky loss); however, they did not significantly improve the overall segmentation performance compared to the hybrid loss function of Log-Cosh Dice and Focal loss, which was used in our final model. In turn, post-processing the predicted segmentation outputs based on uncertainty using Conditional Random Fields [12, 13], to filter out false positives and refine boundaries, improved segmentation accuracy.
Future work will include Bayesian uncertainty measurements followed by a tailored post-processing technique based on active-contour algorithms for multimodal PET and CT images.
Graph-based methods and volume clustering, as well as multi-task learning, have been shown to improve segmentation and will also be considered in future work.
Increasing the multi-centre sample size for training and validation is expected to strengthen model inferences in order to demonstrate the robustness of the model and its ability to generalise.
A larger sample size would also enable stronger inferences and thereby improve the prediction of progression-free survival.
Further work is required to reduce over-fitting issues in progression free survival e.g. by adding regularization to the model.
The addition of mid-level deep learning features effectively improved progression free survival predictions compared to baseline models with only clinical and radiomic features both on training and test sets.
The extraction of relevant features is an active area of research and will be the focus of future work along with work on model architectures and custom loss functions.
This work was supported by the EPSRC grant number EP/S024093/1 and the Centre for Doctoral Training in Sustainable Approaches to Biomedical Science: Responsible and Reproducible Research (SABS: R³) Doctoral Training Centre, University of Oxford. The authors acknowledge the HECKTOR 2021 challenge for the free publicly available PET/CT images and clinical data used in this study .
- [1] Head and Neck Tumor Segmentation in PET/CT: The HECKTOR Challenge, Valentin Oreiller et al., Medical Image Analysis, 2021 (under revision)
- [2] Vincent Andrearczyk, Valentin Oreiller, Sarah Boughdad, Joel Castelli, Catherine Cheze Le Rest, Hesham Elhalawani, Mario Jreige, John O. Prior, Martin Vallières, Dimitris Visvikis, Mathieu Hatt, Adrien Depeursinge, Overview of the HECKTOR challenge at MICCAI 2021: Automatic Head and Neck Tumor Segmentation and Outcome Prediction in PET/CT images. LNCS challenges, 2021
- [3] Bin Huang, Zhewei Chen, Po-Man Wu, Yufeng Ye, Shi-Ting Feng, Ching-Yee Oliver Wong, Liyun Zheng, Yong Liu, Tianfu Wang, Qiaoliang Li, Bingsheng Huang, “Fully Automated Delineation of Gross Tumor Volume for Head and Neck Cancer on PET-CT Using Deep Learning: A Dual-Center Study”, Contrast Media & Molecular Imaging, vol. 2018, Article ID 8923028, 12 pages, 2018
- [4] Andrearczyk, V., Oreiller, V., Vallières, M., Castelli, J., Elhalawani, H., Jreige, M., Boughdad, S., Prior, J.O., Depeursinge, A. (2020). Automatic Segmentation of Head and Neck Tumors and Nodal Metastases in PET-CT scans. Proceedings of the Third Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 121:33-43
- [5] Ronneberger O, Fischer P, and Brox T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer. pp. 234–241.
- [6] J. Long, E. Shelhamer and T. Darrell, “Fully convolutional networks for semantic segmentation,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431-3440, doi: 10.1109/CVPR.2015.7298965.
- [7] Andrearczyk, Vincent and Oreiller, Valentin and Jreige, Mario and Vallières, Martin and Castelli, Joël and Elhalawani, Hesham and Boughdad, Sarah and Prior, John and Depeursinge, Adrien. (2021). Overview of the HECKTOR Challenge at MICCAI 2020: Automatic Head and Neck Tumor Segmentation in PET/CT.
- [8] Iantsen, A., Visvikis, D., Hatt, M.: Squeeze-and-excitation normalization for automated delineation of head and neck primary tumors in combined PET and CT images. In: Andrearczyk, V., et al. (eds.) HECKTOR 2020. LNCS, vol. 12603, pp. 37–43. Springer, Cham (2021)
- [9] Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.; Wu, J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059.
- [10] S. Jadon, “A survey of loss functions for semantic segmentation,” 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2020, pp. 1-7.
- [11] TY Lin, P Goyal, R Girshick, K He, and P Dollar. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
- [12] Yuri Boykov and Vladimir Kolmogorov, “An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision”, IEEE TPAMI, 2004.
- [13] Kamnitsas, Konstantinos, Ledig, Christian, Newcombe, Virginia, Simpson, JP, Kane, Andrew, Menon, David, Rueckert, D, Glocker, Ben. (2017). Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. 10.17863/CAM.6936.
- [14] Baek S, He Y, Allen BG, et al. Deep segmentation networks predict survival of non-small cell lung cancer. Sci Rep. 2019;9(1):17286. Published 2019 Nov 21.
- [15] P. Afshar, A. Mohammadi, K. N. Plataniotis, A. Oikonomou and H. Benali, “From Handcrafted to Deep-Learning-Based Cancer Radiomics: Challenges and Opportunities,” in IEEE Signal Processing Magazine, vol. 36, no. 4, pp. 132-160, July 2019.
- [16] Z.W. Zhou, M.M.R. Siddiquee, N. Tajbakhsh and J.M. Liang, “UNet++: A Nested U-Net Architecture for Medical Image Segmentation,” Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3-11, 2018.
- [17] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks, CoRR, vol. abs/1709.01507 (2017)
- [18] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected CRFs, in: Proceedings of the International Conference on Learning Representations (ICLR), 2015.
- [19] Akai H, Yasaka K, Kunimatsu A, Nojima M, Kokudo T, Kokudo N, Hasegawa K, Abe O, Ohtomo K, Kiryu S. Predicting prognosis of resected hepatocellular carcinoma by radiomics analysis with random survival forest. Diagn Interv Imaging. 2018 Oct;99(10):643-651. Epub 2018 Jun 14. PMID: 29910166.
- [20] Qiu X, Gao J, Yang J, et al. A Comparison Study of Machine Learning (Random Survival Forest) and Classic Statistic (Cox Proportional Hazards) for Predicting Progression in High-Grade Glioma after Proton and Carbon Ion Radiotherapy. Front Oncol. 2020;10:551420. Published 2020 Oct 30.
- [21] Katzman, J.L., Shaham, U., Cloninger, A. et al. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol 18, 24 (2018).
- [22] Kim, D.W., Lee, S., Kwon, S. et al. Deep learning-based survival prediction of oral cancer patients. Sci Rep 9, 6994 (2019).
- [23] Sae-Ryung Kang, Seungwon Oh, In-Jae Oh, Jung-Joon Min, Hee-Seung Bom, Hyung-Jeong Yang, Guee-Sang Lee, Soo-Hyung Kim, Min Soo Kim. Survival prediction of non-small cell lung cancer by deep learning model integrating clinical and positron emission tomography data [abstract]. In: Proceedings of the AACR Virtual Special Conference on Artificial Intelligence, Diagnosis, and Imaging; 2021 Jan 13-14. Philadelphia (PA): AACR; Clin Cancer Res 2021;27(5 Suppl):Abstract nr PO-029.
- [24] Nadeau, C., and Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52, 239–281.
- [25] Abraham, Nabila and N. Khan. “A Novel Focal Tversky Loss Function With Improved Attention U-Net for Lesion Segmentation.” 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019) (2019): 683-687.
- [26] Swierczynski, Piotr, et al. “A level-set approach to joint image segmentation and registration with application to CT lung imaging.” Computerized Medical Imaging and Graphics 65 (2018): 58-68.
- [27] Irving, Benjamin, et al. “Pieces-of-parts for supervoxel segmentation with global context: Application to DCE-MRI tumour delineation.” Medical Image Analysis 32 (2016): 69-83.
- [28] Z. Zhong et al., “3D fully convolutional networks for co-segmentation of tumors on PET-CT images,” 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), 2018, pp. 228-231.