Combined positron emission tomography / computed tomography (PET/CT) imaging is broadly used in clinical practice for radiotherapy treatment planning, initial staging and response assessment. In radiomics analyses, quantitative evaluation of radiotracer uptake in PET images and tissue density in CT images aims at extracting clinically relevant features and building diagnostic, prognostic and predictive models. The segmentation step of the radiomics workflow is the most time-consuming bottleneck, and variability in the usual semi-automatic segmentation methods can significantly affect the extracted features, especially in the case of manual segmentation, which suffers from the highest magnitude of inter- and intra-observer variability. Under these circumstances, fully automated segmentation is highly desirable to automate the whole process and facilitate its use in clinical routine.
The MICCAI 2020 Head and Neck Tumor segmentation challenge (HECKTOR) aims at evaluating automatic algorithms for the segmentation of Head and Neck (H&N) tumors in combined PET and CT images. A dataset of 201 patients from four medical centers in Québec (CHGJ, CHMR, CHUM and CHUS) with histologically proven H&N cancer in the oropharynx is provided for model development. A test set comprising 53 patients from a different center in Switzerland (CHUV) is used for evaluation. All images were re-annotated by an expert for the purpose of the challenge in order to determine the primary gross tumor volumes (GTV), on which the methods are evaluated using the Dice score (DSC), precision and recall.
This paper describes our approach based on convolutional neural networks supplemented with Squeeze-and-Excitation Normalization (SE Normalization or SE Norm) layers to address the goal of the HECKTOR challenge.
2 Materials & Methods
2.1 SE Normalization
The key element of our model is the SE Normalization layer that we recently proposed in the context of the Brain Tumor Segmentation Challenge (BraTS 2020). Similarly to Instance Normalization, for an input with C channels, the SE Norm layer first normalizes all channels of each example in a batch using the mean and standard deviation:

\hat{x}_i = \frac{x_i - \mu_i}{\sigma_i}

where \mu_i is the channel mean and \sigma_i = \sqrt{\mathrm{Var}(x_i) + \epsilon}, with \epsilon as a small constant to prevent division by zero. Afterwards, a pair of parameters \gamma_i, \beta_i are applied to each channel to scale and shift the normalized values:

y_i = \gamma_i \hat{x}_i + \beta_i
In the case of Instance Normalization, both parameters \gamma and \beta, fitted in the course of training, stay fixed and independent of the input during inference. By contrast, we propose to model these parameters as functions of the input x by means of Squeeze-and-Excitation (SE) blocks, i.e.

\gamma = f_{\gamma}(x), \qquad \beta = f_{\beta}(x)

where \gamma \in \mathbb{R}^C and \beta \in \mathbb{R}^C are the scale and shift parameters for all channels; f_{\gamma} is the original SE block with the sigmoid activation, and f_{\beta} is modeled as the SE block with the tanh activation function to enable a negative shift (see Fig. 1). Both SE blocks first apply global average pooling (GAP) to squeeze each channel into a single descriptor. Then, two fully connected (FC) layers aim at capturing non-linear cross-channel dependencies. The first FC layer is implemented with a reduction ratio r to form a bottleneck for controlling model complexity. Throughout this paper, we apply SE Norm layers with a fixed reduction ratio r.
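As a hedged illustration of the computation described above (a NumPy sketch, not the authors' implementation; the weight-matrix names and shapes are assumptions), an SE Norm forward pass for a single example can be written as:

```python
import numpy as np

def se_norm(x, w1_g, w2_g, w1_b, w2_b, eps=1e-5):
    """SE Norm forward pass for one example x of shape (C, D, H, W).

    The weight matrices stand in for the two FC layers of each SE block:
    w1_* has shape (C // r, C) (the bottleneck with reduction ratio r)
    and w2_* has shape (C, C // r).
    """
    c = x.shape[0]
    flat = x.reshape(c, -1)
    # Step 1: instance normalization of each channel
    mu = flat.mean(axis=1)[:, None, None, None]
    sigma = flat.std(axis=1)[:, None, None, None]
    x_hat = (x - mu) / (sigma + eps)
    # Step 2: squeeze -- global average pooling to one descriptor per channel
    z = flat.mean(axis=1)
    relu = lambda v: np.maximum(v, 0.0)
    # Step 3: excitation -- gamma via a sigmoid SE block (positive scale),
    # beta via a tanh SE block (enables a negative shift)
    gamma = 1.0 / (1.0 + np.exp(-(w2_g @ relu(w1_g @ z))))
    beta = np.tanh(w2_b @ relu(w1_b @ z))
    # Step 4: scale and shift the normalized feature maps
    return gamma[:, None, None, None] * x_hat + beta[:, None, None, None]
```

Unlike Instance Normalization, both gamma and beta here are recomputed from each input at inference time.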
2.2 Network Architecture
The model follows an encoder-decoder architecture built of convolutional blocks, each consisting of 3x3x3 convolutions and ReLU activations followed by SE Norm layers. Residual blocks in the encoder consist of convolutional blocks with shortcut connections (see Fig. 1). If the number of input/output channels in a residual block is different, a non-linear projection is performed by adding a 1x1x1 convolutional block to the shortcut in order to match the dimensions (see Fig. 1).
In the encoder, downsampling is done by applying max pooling with a kernel size of 2x2x2. To linearly upsample feature maps in the decoder, transposed convolutions are used. In addition, we supplement the decoder with three upsampling paths to transfer low-resolution features further in the model by applying a 1x1x1 convolutional block to reduce the number of channels, and utilizing trilinear interpolation to increase the spatial size of the feature maps (see Fig. 2, yellow blocks).
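The shortcut projection used in the residual blocks can be sketched in NumPy by treating a 1x1x1 convolution as a per-voxel linear map across channels (the helper names and shapes below are hypothetical):

```python
import numpy as np

def conv1x1x1(x, w):
    """A 1x1x1 convolution acts independently on every voxel:
    x: (C_in, D, H, W), w: (C_out, C_in) -> output (C_out, D, H, W)."""
    return np.einsum('oc,cdhw->odhw', w, x)

def residual_block(x, body, w_proj=None):
    """Add the block output to a shortcut connection; when the channel
    counts differ, the shortcut is projected with a 1x1x1 convolution
    so the dimensions match before the addition."""
    shortcut = x if w_proj is None else conv1x1x1(x, w_proj)
    return body(x) + shortcut
```

The same channel-matching trick is what makes the decoder's 1x1x1 convolutional blocks cheap: they mix channels without touching the spatial dimensions.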
2.3 Data Preprocessing & Sampling
Both PET and CT images were first resampled to a common isotropic resolution with trilinear interpolation. Each training example was a fixed-size patch of voxels randomly extracted from a whole PET/CT image, whereas validation examples were obtained from the bounding boxes provided by the organizers. Training patches were extracted to include the tumor class with a probability of 0.9 to facilitate model training.
CT intensities were clipped to a fixed range of Hounsfield Units and then linearly rescaled. PET images were transformed independently with the use of Z-score normalization, performed on each patch.
2.4 Training Procedure
The model was trained for 800 epochs using the Adam optimizer on two NVIDIA GeForce GTX 1080 Ti GPUs (11 GB) with a batch size of 2 (one sample per worker). A cosine annealing schedule was applied to reduce the learning rate from its maximum to its minimum value within every 25 epochs.
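The cyclic schedule can be written down explicitly; lr_max and lr_min below are placeholders, since the exact learning-rate range is not restated above:

```python
import math

def cosine_annealing_lr(epoch, lr_max, lr_min, cycle=25):
    """Learning rate under cosine annealing with a restart every `cycle`
    epochs: it starts at lr_max, decays to lr_min, then jumps back up."""
    t = epoch % cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / cycle))
```

The periodic restarts mean the optimizer repeatedly revisits high learning rates, which can help it escape poor local minima over a long 800-epoch run.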
2.5 Loss Function
Our loss function is the sum of the Soft Dice Loss and the Focal Loss. The Soft Dice Loss is defined as

\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i} y_i p_i + 1}{\sum_{i} y_i + \sum_{i} p_i + 1}

and the Focal Loss as

\mathcal{L}_{\mathrm{Focal}} = -\frac{1}{N} \sum_{i} \left[ y_i (1 - p_i)^{\gamma} \log p_i + (1 - y_i)\, p_i^{\gamma} \log (1 - p_i) \right]

In both definitions, y_i is the label for the i-th voxel, p_i is the predicted probability for the i-th voxel, and N is the total number of voxels. Additionally, we add +1 to the numerator and denominator in the Soft Dice Loss to avoid division by zero in cases when the tumor class is not present in training patches. The parameter \gamma in the Focal Loss is set at 2.
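A NumPy sketch of the two loss terms (an illustrative re-implementation, not the authors' code):

```python
import numpy as np

def soft_dice_loss(y, p):
    """Soft Dice Loss; the +1 in numerator and denominator avoids
    division by zero when a patch contains no tumor voxels."""
    return 1.0 - (2.0 * np.sum(y * p) + 1.0) / (np.sum(y) + np.sum(p) + 1.0)

def focal_loss(y, p, gamma=2.0, eps=1e-8):
    """Binary Focal Loss with focusing parameter gamma, averaged over
    all voxels; eps guards the log against p = 0 or p = 1."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * (1.0 - p) ** gamma * np.log(p)
                    + (1.0 - y) * p ** gamma * np.log(1.0 - p))
```

The (1 - p)^gamma factor down-weights easy background voxels, which dominate in H&N volumes where the tumor occupies a tiny fraction of the patch.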
Table 1. Performance results on different cross-validation splits. Average results (row 'Average') are provided for each evaluation metric across all centers in the leave-one-center-out cross-validation (first four rows). The mean and standard deviation of each metric are computed across all data samples in the corresponding validation center. The row 'Average (rs)' indicates the average results on the four random data splits.
Our results on the test set were produced with the use of an ensemble of eight models trained and validated on different splits of the training set. Four models were built using a leave-one-center-out cross-validation, i.e., the data from three centers was used for training and the data from the fourth center was held out for validation. Four other models were fitted on random training/validation splits of the whole dataset. Predictions on the test set were produced by averaging the predictions of the individual models and applying a threshold operation.
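The ensembling step amounts to a voxel-wise average followed by thresholding; the threshold of 0.5 below is an assumption for the sketch, as the exact value is not restated above:

```python
import numpy as np

def ensemble_predict(prob_maps, threshold=0.5):
    """Average the probability maps predicted by the individual models,
    then binarize to obtain the final segmentation mask."""
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return (mean_prob > threshold).astype(np.uint8)
```

Averaging soft probabilities before thresholding lets confident models outvote uncertain ones at each voxel, unlike majority voting on binary masks.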
3 Results & Discussion
Our validation results in the context of the HECKTOR challenge are summarized in Table 1. The best outcome in terms of all evaluation metrics was obtained for the 'CHGJ' center with 55 patients. The model demonstrated the poorest performance for the 'CHMR' center, which is the least represented in the whole dataset. The differences with the two other centers were minor for all evaluation metrics. The small spread between all centers and the average results implies that the model predictions were robust and no center-specific data standardization was required. This finding is supported by the lack of a significant difference in the average results between the leave-one-center-out and random-split cross-validations.
The ensemble results on the test set consisting of 53 patients from the 'CHUV' center are presented in Table 2. On the previously unseen data, the ensemble of eight models achieved the highest results among 21 participating teams, with a Dice score of 75.9%, precision of 83.3% and recall of 74%.
References

- V. Andrearczyk, V. Oreiller, M. Jreige, M. Vallières, J. Castelli, H. Elhalawani, S. Boughdad, J. O. Prior, A. Depeursinge, "Overview of the HECKTOR challenge at MICCAI 2020: Automatic Head and Neck Tumor Segmentation in PET/CT", 2021.
- V. Andrearczyk, V. Oreiller, M. Vallières, J. Castelli, H. Elhalawani, M. Jreige, S. Boughdad, J. O. Prior, A. Depeursinge, "Automatic Segmentation of Head and Neck Tumors and Nodal Metastases in PET-CT scans", Medical Imaging with Deep Learning (MIDL), 2020.
- B. H. Menze et al., "The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)", IEEE Transactions on Medical Imaging, vol. 34, no. 10, pp. 1993-2024, 2015.
- D. Ulyanov, A. Vedaldi, V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization", arXiv preprint arXiv:1607.08022, 2016.
- J. Hu, L. Shen, G. Sun, "Squeeze-and-excitation networks", arXiv preprint arXiv:1709.01507, 2017.
- A. Iantsen, V. Jaouen, D. Visvikis, M. Hatt, "Squeeze-and-Excitation Normalization for Brain Tumor Segmentation", International MICCAI Brainlesion Workshop, 2020.
- O. Ronneberger, P. Fischer, T. Brox, "U-Net: Convolutional networks for biomedical image segmentation", MICCAI, Springer, 2015, pp. 234-241.
- Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, O. Ronneberger, "3D U-Net: Learning dense volumetric segmentation from sparse annotation", International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2016, pp. 424-432.
- F. Milletari, N. Navab, S.-A. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation", International Conference on 3D Vision, IEEE, 2016, pp. 565-571.
- T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, "Focal Loss for Dense Object Detection", arXiv preprint arXiv:1708.02002, 2017.