Squeeze-and-Excitation Normalization for Automated Delineation of Head and Neck Primary Tumors in Combined PET and CT Images

02/20/2021 ∙ by Andrei Iantsen, et al.

Development of robust and accurate fully automated methods for medical image segmentation is crucial in clinical practice and radiomics studies. In this work, we present an automated approach for Head and Neck (H&N) primary tumor segmentation in combined positron emission tomography / computed tomography (PET/CT) images in the context of the MICCAI 2020 Head and Neck Tumor segmentation challenge (HECKTOR). Our model is based on the U-Net architecture with residual layers and supplemented with Squeeze-and-Excitation Normalization. The described method achieved competitive results in cross-validation (DSC 0.745, precision 0.760, recall 0.789) performed on different centers, as well as on the test set (DSC 0.759, precision 0.833, recall 0.740), which allowed us to win first prize in the HECKTOR challenge among 21 participating teams. The full implementation based on PyTorch and the trained models are available at https://github.com/iantsen/hecktor




1 Introduction

Combined positron emission tomography / computed tomography (PET/CT) imaging is broadly used in clinical practice for radiotherapy treatment planning, initial staging and response assessment. In radiomics analyses, quantitative evaluation of radiotracer uptake in PET images and tissue density in CT images aims at extracting clinically relevant features and building diagnostic, prognostic and predictive models. The segmentation step of the radiomics workflow is the most time-consuming bottleneck, and variability in the usual semi-automatic segmentation methods can significantly affect the extracted features. This is especially true of manual segmentation, which is affected by the highest magnitude of inter- and intra-observer variability. Under these circumstances, fully automated segmentation is highly desirable to automate the whole process and facilitate its routine clinical use.

The MICCAI 2020 Head and Neck Tumor segmentation challenge (HECKTOR) [1] aims at evaluating automatic algorithms for segmentation of Head and Neck (H&N) tumors in combined PET and CT images. A dataset of 201 patients from four medical centers in Québec (CHGJ, CHMR, CHUM and CHUS) with histologically proven H&N cancer in the oropharynx is provided for model development. A test set comprising 53 patients from a different center in Switzerland (CHUV) is used for evaluation. All images were re-annotated by an expert for the purpose of the challenge in order to determine primary gross tumor volumes (GTV), on which the methods are evaluated using the Dice score (DSC), precision and recall.
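For reference, the three evaluation metrics can be computed from voxel-wise counts of true positives, false positives and false negatives. The following is a minimal sketch on flattened binary masks; the function name and the convention of returning 0.0 for an empty denominator are assumptions made for illustration and are not taken from the challenge evaluation code.

```python
def segmentation_metrics(pred, target):
    """Return (DSC, precision, recall) for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    # Assumed convention: a metric with an empty denominator is reported as 0.0.
    dsc = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return dsc, precision, recall


# Toy masks: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]
dsc, precision, recall = segmentation_metrics(pred, target)
print(dsc, precision, recall)
```

In this toy case all three metrics equal 2/3, since the numbers of false positives and false negatives coincide.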

This paper describes our approach based on convolutional neural networks supplemented with Squeeze-and-Excitation Normalization (SE Normalization or SE Norm) layers to address the goal of the HECKTOR challenge.

2 Materials & Methods

2.1 SE Normalization

The key element of our model is the SE Normalization layer [6], which we recently proposed in the context of the Brain Tumor Segmentation Challenge (BraTS 2020) [3]. Similarly to Instance Normalization [4], for an input $X = (x_1, x_2, \dots, x_N)$ with $N$ channels, the SE Norm layer first normalizes all channels of each example in a batch using the mean and standard deviation:

$$\hat{x}_i = \frac{x_i - \mu_i}{\sigma_i}$$

where $\mu_i = \mathrm{E}[x_i]$ and $\sigma_i = \sqrt{\mathrm{Var}[x_i] + \epsilon}$, with $\epsilon$ as a small constant to prevent division by zero. Afterwards, a pair of parameters $\gamma_i$, $\beta_i$ is applied to each channel to scale and shift the normalized values:

$$y_i = \gamma_i \hat{x}_i + \beta_i$$

In the case of Instance Normalization, both parameters $\gamma_i$, $\beta_i$, fitted in the course of training, stay fixed and independent of the input $X$ during inference. By contrast, we propose to model the parameters $\gamma$, $\beta$ as functions of the input $X$ by means of Squeeze-and-Excitation (SE) blocks [5], i.e.,

$$\gamma = f_\gamma(X), \qquad \beta = f_\beta(X)$$

where $\gamma = (\gamma_1, \dots, \gamma_N)$ and $\beta = (\beta_1, \dots, \beta_N)$ are the scale and shift parameters for all channels, $f_\gamma$ is the original SE block with the sigmoid, and $f_\beta$ is modeled as the SE block with the tanh activation function to enable a negative shift (see Fig. 1). Both SE blocks first apply global average pooling (GAP) to squeeze each channel into a single descriptor. Then, two fully connected (FC) layers aim at capturing non-linear cross-channel dependencies. The first FC layer is implemented with a reduction ratio $r$ to form a bottleneck for controlling model complexity. Throughout this paper, we apply SE Norm layers with a fixed reduction ratio $r$.
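The computation described in this section can be sketched for a single example in plain Python. This is an illustrative sketch only and not the released PyTorch implementation: the function names are hypothetical, each channel is flattened into a list of voxel values, the FC weights are passed in explicitly, biases are omitted, and a ReLU between the two FC layers is assumed, as in the standard SE block [5].

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def se_block(descriptors, w1, w2, activation):
    """Two FC layers on the squeezed descriptors (biases omitted for brevity)."""
    # First FC layer: bottleneck with N / r units, followed by an assumed ReLU.
    hidden = [max(0.0, sum(w * d for w, d in zip(row, descriptors))) for row in w1]
    # Second FC layer: back to N units, followed by sigmoid (scale) or tanh (shift).
    return [activation(sum(w * h for w, h in zip(row, hidden))) for row in w2]


def se_norm(x, w_gamma, w_beta, eps=1e-5):
    """x: list of N channels, each a flat list of voxel values of ONE example."""
    # Squeeze: global average pooling gives one descriptor per channel.
    means = [sum(ch) / len(ch) for ch in x]
    stds = [math.sqrt(sum((v - m) ** 2 for v in ch) / len(ch) + eps)
            for ch, m in zip(x, means)]
    # Instance-wise normalization of every channel.
    x_hat = [[(v - m) / s for v in ch] for ch, m, s in zip(x, means, stds)]
    # Scale and shift are predicted from the input instead of being fixed.
    gamma = se_block(means, *w_gamma, activation=sigmoid)
    beta = se_block(means, *w_beta, activation=math.tanh)
    return [[g * v + b for v in ch] for ch, g, b in zip(x_hat, gamma, beta)]


# Toy example: N = 2 channels, bottleneck of 1 unit; all weights are made up.
x = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 20.0, 20.0]]
w_gamma = ([[0.2, 0.1]], [[1.0], [-1.0]])   # (first FC, second FC)
w_beta = ([[0.2, 0.1]], [[0.5], [-0.5]])
out = se_norm(x, w_gamma, w_beta)
```

Because each normalized channel has zero mean, the mean of every output channel equals the predicted shift for that channel, which makes the role of the tanh branch easy to check on small inputs.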

Figure 1: Layers with SE Normalization: (senorm) SE Norm layer, (resblock1) residual layer with the shortcut connection, and (resblock2) residual layer with the non-linear projection. Output dimensions are depicted in italics.

2.2 Network Architecture

Our model is built upon the seminal U-Net architecture [7, 8] with the use of SE Norm layers [6]. Convolutional blocks, which form the model decoder, are stacks of convolutions and ReLU activations followed by SE Norm layers. Residual blocks in the encoder consist of convolutional blocks with shortcut connections (see Fig. 1). If the numbers of input and output channels in a residual block differ, a non-linear projection is performed by adding a convolutional block to the shortcut in order to match the dimensions (see Fig. 1).

In the encoder, downsampling is done by applying max pooling. To upsample feature maps in the decoder, transposed convolutions are used. In addition, we supplement the decoder with three upsampling paths to transfer low-resolution features further in the model by applying a convolutional block to reduce the number of channels, and utilizing trilinear interpolation to increase the spatial size of the feature maps (see Fig. 2, yellow blocks). Kernel sizes are given in Fig. 2.

The first residual block placed after the input is implemented with a larger kernel size (see Fig. 2) to increase the receptive field of the model without significant computational overhead. The sigmoid function is applied to output probabilities for the target class.

Figure 2: The model architecture with SE Norm layers. The input consists of PET/CT patches. The encoder consists of residual blocks with identity (solid arrows) and projection (dashed arrows) shortcuts. The decoder is formed by convolutional blocks. Additional upsampling paths are added to transfer low-resolution features further in the decoder. Kernel sizes and numbers of output channels are depicted in each block.

2.3 Data Preprocessing & Sampling

Both PET and CT images were first resampled to a common resolution with trilinear interpolation. Each training example was a patch randomly extracted from a whole PET/CT image, whereas validation examples were obtained from the bounding boxes provided by the organizers. To facilitate model training, training patches were extracted so as to include the tumor class with a probability of 0.9.
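The biased patch sampling can be illustrated along a single axis; in 3D the same rule would be applied per axis. This is a hedged sketch and not the authors' sampler: the function name, its arguments and the toy sizes are hypothetical, and only the probability of 0.9 comes from the text.

```python
import random


def sample_patch_start(length, patch, tumor_indices, p_tumor=0.9, rng=random):
    """Pick the start index of a patch of size `patch` on an axis of size `length`.

    With probability p_tumor the patch is forced to contain a tumor voxel,
    otherwise it is drawn uniformly from all valid positions.
    """
    if tumor_indices and rng.random() < p_tumor:
        c = rng.choice(tumor_indices)
        # Valid starts s satisfy s <= c < s + patch and 0 <= s <= length - patch.
        lo = max(0, c - patch + 1)
        hi = min(c, length - patch)
        return rng.randint(lo, hi)
    return rng.randint(0, length - patch)


# Toy axis of 100 voxels with tumor voxels at 40..42 and a patch of 16 voxels.
start = sample_patch_start(100, 16, [40, 41, 42], rng=random.Random(0))
print(start, start + 16)
```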

CT intensities were clipped to a fixed range of Hounsfield Units and then mapped to a fixed interval. PET images were transformed independently using Z-score normalization, performed on each patch.
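A minimal sketch of the two intensity transforms follows. The clipping range and the target interval are left as required arguments, and the values used in the example call are placeholders chosen for illustration only, not the settings used in this work. A linear mapping and the small constant added to the standard deviation are likewise assumptions.

```python
import math


def preprocess_ct(values, hu_min, hu_max, out_min, out_max):
    """Clip CT intensities to [hu_min, hu_max], then map linearly to [out_min, out_max]."""
    scale = (out_max - out_min) / (hu_max - hu_min)
    return [out_min + (min(max(v, hu_min), hu_max) - hu_min) * scale for v in values]


def zscore(values, eps=1e-8):
    """Z-score normalization of one PET patch given as a flat list of voxel values."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    # eps (assumed) guards against a constant patch with zero standard deviation.
    return [(v - mean) / (std + eps) for v in values]


# Placeholder range and interval, for illustration only.
ct = preprocess_ct([-500.0, 0.0, 50.0, 300.0],
                   hu_min=-100.0, hu_max=100.0, out_min=0.0, out_max=1.0)
pet = zscore([1.0, 2.0, 3.0, 4.0])
print(ct, pet)
```

With these placeholder settings, values below the lower bound map to the start of the interval and values above the upper bound map to its end, while the normalized PET patch has zero mean.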

2.4 Training Procedure

The model was trained for 800 epochs using the Adam optimizer on two NVIDIA GeForce GTX 1080 Ti GPUs (11 GB) with a batch size of 2 (one sample per worker). A cosine annealing schedule was applied to reduce the learning rate from its initial to its final value within every 25 epochs.
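A cosine annealing schedule that restarts every 25 epochs can be sketched as below. The function name is hypothetical, the learning-rate bounds are arguments because their values are not given here, and the numbers in the example are placeholders. The convention that the cycle position runs from 0 to period - 1 before restarting is also an assumption.

```python
import math


def cosine_annealing_lr(epoch, lr_max, lr_min, period=25):
    """Learning rate at a given epoch for cosine annealing with periodic restarts."""
    t = epoch % period  # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / period))


# Placeholder bounds, for illustration only.
schedule = [cosine_annealing_lr(e, lr_max=1.0, lr_min=0.0) for e in range(50)]
print(schedule[0], schedule[24], schedule[25])
```

Within each cycle the rate decays monotonically from the upper bound towards the lower bound, then jumps back to the upper bound at the start of the next cycle.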

2.5 Loss Function

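The unweighted sum of the Soft Dice Loss [9] and the Focal Loss [10] is utilized to train the model. Based on [9], the Soft Dice Loss for one training example can be written as

$$L_{Dice}(y, \hat{y}) = 1 - \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i + 1}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i + 1}$$

The Focal Loss is defined as

$$L_{Focal}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i (1 - \hat{y}_i)^{\gamma} \ln \hat{y}_i + (1 - y_i)\, \hat{y}_i^{\gamma} \ln (1 - \hat{y}_i) \right]$$

In both definitions, $y_i$ is the label for the i-th voxel, $\hat{y}_i$ is the predicted probability for the i-th voxel, and $N$ is the total number of voxels. Additionally, we add +1 to the numerator and denominator in the Soft Dice Loss to avoid division by zero in cases when the tumor class is not present in training patches. The parameter $\gamma$ in the Focal Loss is set to 2.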
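The combined objective can be sketched in plain Python on flattened voxel lists. The sketch assumes the Soft Dice Loss with +1 added to the numerator and denominator, as described above, and the standard two-term binary form of the Focal Loss from [10]. The function names and the clamping constant `eps` are illustrative assumptions rather than the released implementation.

```python
import math


def soft_dice_loss(y, y_hat):
    """Soft Dice Loss with +1 smoothing in the numerator and denominator."""
    intersection = sum(t * p for t, p in zip(y, y_hat))
    return 1.0 - (2.0 * intersection + 1.0) / (sum(y) + sum(y_hat) + 1.0)


def focal_loss(y, y_hat, gamma=2.0, eps=1e-7):
    """Binary Focal Loss averaged over voxels (standard two-term form, assumed)."""
    total = 0.0
    for t, p in zip(y, y_hat):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        total += t * (1.0 - p) ** gamma * math.log(p)
        total += (1.0 - t) * p ** gamma * math.log(1.0 - p)
    return -total / len(y)


def combined_loss(y, y_hat, gamma=2.0):
    """Unweighted sum of the two losses."""
    return soft_dice_loss(y, y_hat) + focal_loss(y, y_hat, gamma)


# A patch without tumor and an all-zero prediction: the smoothing keeps the Dice term defined.
print(soft_dice_loss([0, 0, 0], [0.0, 0.0, 0.0]))
print(combined_loss([1, 0], [0.9, 0.1]), combined_loss([1, 0], [0.1, 0.9]))
```

The first call shows why the smoothing matters: without it, an empty patch with an empty prediction would give 0/0 in the Dice term.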

Center          DSC     Precision   Recall
Average         0.745   0.760       0.789
Average (rs)    0.757   0.762       0.820

Table 1: The performance results on different cross-validation splits. Average results (the row ’Average’) are provided for each evaluation metric across all centers in the leave-one-center-out cross-validation (first four rows). The mean and standard deviation of each metric are computed across all data samples in the corresponding validation center. The row ’Average (rs)’ indicates the average results on the four random data splits.

2.6 Ensembling

Our results on the test set were produced with an ensemble of eight models trained and validated on different splits of the training set. Four models were built using leave-one-center-out cross-validation, i.e., the data from three centers was used for training and the data from the fourth center was held out for validation. The four other models were fitted on random training / validation splits of the whole dataset. Predictions on the test set were produced by averaging the predictions of the individual models and applying a threshold to the averaged probabilities.
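The averaging and thresholding step can be sketched as follows. The function name is hypothetical, the threshold is a required argument, and the value used in the example is a placeholder for illustration. Treating a voxel as tumor when the averaged probability is strictly greater than the threshold is also an assumption.

```python
def ensemble_predict(model_probs, threshold):
    """Average per-voxel probabilities over models and binarize the result.

    model_probs: list with one flat list of voxel probabilities per model.
    """
    n_models = len(model_probs)
    averaged = [sum(vals) / n_models for vals in zip(*model_probs)]
    return [1 if p > threshold else 0 for p in averaged]


# Three toy "models" predicting three voxels; placeholder threshold.
probs = [
    [0.9, 0.2, 0.6],
    [0.8, 0.1, 0.4],
    [0.7, 0.3, 0.2],
]
mask = ensemble_predict(probs, threshold=0.5)
print(mask)
```

In the toy case the third voxel is rejected even though one model is fairly confident, because the average over the three models falls below the placeholder threshold.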

Center          DSC     Precision   Recall
CHUV (n = 53)   0.759   0.833       0.740

Table 2: The test set results of the ensemble of eight models.

3 Results & Discussion

Our validation results in the context of the HECKTOR challenge are summarized in Table 1. The best outcome in terms of all evaluation metrics was obtained for the ’CHGJ’ center with 55 patients. The model demonstrated the poorest performance for the ’CHMR’ center, which is the least represented in the whole dataset. The differences with the two other centers were minor for all evaluation metrics. The small spread between all centers and the average results implies that the model predictions were robust and that no center-specific data standardization was required. This finding is supported by the lack of a significant difference in the average results between the leave-one-center-out and random split cross-validations.

The ensemble results on the test set consisting of 53 patients from the ’CHUV’ center are presented in Table 2. On this previously unseen data, the ensemble of eight models achieved the highest results among 21 participating teams, with a Dice score of 75.9%, precision of 83.3% and recall of 74.0%.