
AutoPET Challenge: Combining nn-Unet with Swin UNETR Augmented by Maximum Intensity Projection Classifier

by   Lars Heiliger, et al.

Tumor volume and changes in tumor characteristics over time are important biomarkers for cancer therapy. In this context, FDG-PET/CT scans are routinely used for staging and re-staging of cancer, as the radiolabeled fluorodeoxyglucose is taken up in regions of high metabolism. Unfortunately, these regions with high metabolism are not specific to tumors and can also represent physiological uptake by normal functioning organs, inflammation, or infection, making detailed and reliable tumor segmentation in these scans a demanding task. This gap in research is addressed by the AutoPET challenge, which provides a public data set with FDG-PET/CT scans from 900 patients to encourage further improvement in this field. Our contribution to this challenge is an ensemble of two state-of-the-art segmentation models, the nn-Unet and the Swin UNETR, augmented by a maximum intensity projection classifier that acts like a gating mechanism. If it predicts the existence of lesions, both segmentations are combined by a late fusion approach. Our solution achieves a Dice score of 72.12% on patients diagnosed with lung cancer, melanoma, and lymphoma in our cross-validation. Code:



1 Introduction

Medical imaging is crucial to detect and assess the progress of cancer. In the clinical routine, positron emission tomography / computed tomography (PET/CT) enables the visualization of metabolic processes inside tissues. In the oncologic setting, fluorodeoxyglucose (FDG) is the most widely used PET tracer, which can display the glucose consumption of tissues. Considering malignant solid tumor entities, lesions typically exhibit an increased glucose consumption.

Due to the lack of automated solutions for the analysis of FDG-PET/CT imaging, which would possibly allow for more precise and individualized treatment decisions, experienced physicians analyze the images in a qualitative way.

The AutoPET challenge addresses this gap by providing a large, publicly available training data set, with the aim of advancing the development of 3D semantic segmentation models that avoid false positive segmentations.

The training dataset consists of 1,014 PET/CT scans from 900 patients from the University Hospital Tübingen, Germany. There are 513 scans without lesions, and 188, 168, and 145 scans are associated with melanoma, lung cancer, and lymphoma, respectively.

The challenge’s test set consists of 200 PET/CT scans collected from the University Hospital Tübingen (100 scans) and the University Hospital of LMU Munich, Germany (100 scans). The test set is hidden from the competitors, and the containerized models have to be submitted to the Grand Challenge platform. The evaluation procedure applies three metrics: (1) the foreground Dice score of segmented lesions, (2) the volume of false positive connected components that do not overlap with positives (false positive volume, FPV), and (3) the volume of positive connected components in the ground truth that do not overlap with the estimated segmentation (false negative volume, FNV).
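The three metrics can be illustrated with a minimal sketch. It is not the official evaluation code: the function names, the face-connected neighbourhood, and the `voxel_volume_ml` parameter are assumptions made for illustration.

```python
from collections import deque

import numpy as np

# Face-connected neighbours; the connectivity is an assumption of this sketch.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def label_components(mask):
    """Label the connected components of a 3D binary mask with a flood fill."""
    mask = np.asarray(mask, dtype=bool)
    labels = np.zeros(mask.shape, dtype=np.int32)
    count = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue  # voxel already belongs to a labelled component
        count += 1
        labels[seed] = count
        queue = deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBOURS:
                n = (z + dz, y + dy, x + dx)
                inside = all(0 <= n[i] < mask.shape[i] for i in range(3))
                if inside and mask[n] and not labels[n]:
                    labels[n] = count
                    queue.append(n)
    return labels, count


def dice_score(gt, pred):
    """Foreground Dice; defined as 1.0 here when both masks are empty."""
    gt, pred = np.asarray(gt, dtype=bool), np.asarray(pred, dtype=bool)
    denom = gt.sum() + pred.sum()
    return 2.0 * np.logical_and(gt, pred).sum() / denom if denom else 1.0


def non_overlapping_volume(components_of, other, voxel_volume_ml):
    """Volume of the components in `components_of` that do not touch `other`."""
    labels, count = label_components(components_of)
    other = np.asarray(other, dtype=bool)
    volume = 0.0
    for idx in range(1, count + 1):
        component = labels == idx
        if not np.any(component & other):
            volume += float(component.sum()) * voxel_volume_ml
    return volume


def false_positive_volume(gt, pred, voxel_volume_ml=1.0):
    return non_overlapping_volume(pred, gt, voxel_volume_ml)


def false_negative_volume(gt, pred, voxel_volume_ml=1.0):
    return non_overlapping_volume(gt, pred, voxel_volume_ml)
```

A predicted component that overlaps the ground truth in even one voxel does not count towards the FPV, and a ground-truth lesion that is hit at least once does not count towards the FNV.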

When evaluating developed models on the hidden test set, all three metrics are considered for non-healthy patients, whereas only FPV is considered for healthy cases. The final leader board position is based on the ranks in these three metrics with the weighting (0.5, 0.25, 0.25).
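How a rank-based leader board with the weighting (0.5, 0.25, 0.25) might be computed is sketched below. The text does not give the exact aggregation rule, so the weighted sum of per-metric ranks, the tie handling, and the team names and scores are all assumptions made for illustration.

```python
def metric_ranks(values, higher_is_better):
    """Return 1-based ranks (1 = best); tied values share the best rank."""
    ordered = sorted(values, reverse=higher_is_better)
    return [ordered.index(v) + 1 for v in values]


def weighted_rank(teams, weights=(0.5, 0.25, 0.25)):
    """teams maps a name to (dice, fpv, fnv); returns (name, score) sorted best first."""
    names = list(teams)
    dice_r = metric_ranks([teams[n][0] for n in names], higher_is_better=True)
    fpv_r = metric_ranks([teams[n][1] for n in names], higher_is_better=False)
    fnv_r = metric_ranks([teams[n][2] for n in names], higher_is_better=False)
    scores = {
        n: weights[0] * d + weights[1] * p + weights[2] * q
        for n, d, p, q in zip(names, dice_r, fpv_r, fnv_r)
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up scores for three hypothetical teams: (Dice, FPV, FNV).
example = {"A": (0.72, 8.0, 6.0), "B": (0.70, 5.0, 7.0), "C": (0.65, 9.0, 5.0)}
leader_board = weighted_rank(example)
```

In this toy example team A ranks first in Dice and second in both volumes, which gives the lowest weighted rank.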

We present the methodology of our solution in the subsequent section.

2 Methodology

2.1 Pipeline of Our Final Submission

Figure 1: Our final submission pipeline, combining the Swin UNETR and nn-Unet for segmentation with the ResNet18- and ResNet50-based classifiers.

Our pipeline is illustrated in Figure 1. The PET volume is used to compute sagittal and axial maximum intensity projections (MIPs) and their de-brained versions. These MIPs are fed to our ensemble of classifiers, whose binary predictions are combined through a logical OR fusion. If all classifiers agree that a patient is healthy, the segmentation process is skipped to avoid unnecessary computations and an empty prediction is used as the output. Otherwise, the PET and CT volumes are concatenated and used as input to the nn-Unet and Swin UNETR models. Their predictions are fused by averaging, and our post-processing (described in Section 3.5) is applied to obtain the final prediction.
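The control flow of this gating step can be sketched as follows. The function name and the callables `classifiers` and `segment` are hypothetical placeholders, not the submitted code. In this sketch each classifier receives the PET volume and is assumed to compute its own MIP internally, and `segment` stands for the fused nn-Unet and Swin UNETR prediction including post-processing.

```python
import numpy as np


def gated_segmentation(pet, ct, classifiers, segment):
    """Return an empty mask if no classifier suspects a lesion, else segment.

    classifiers: callables mapping a PET volume to True (lesion suspected) or False.
    segment: callable mapping a (2, D, H, W) PET/CT array to a (D, H, W) mask.
    """
    # Logical OR fusion: a single positive vote is enough to run the segmentation.
    if not any(clf(pet) for clf in classifiers):
        return np.zeros(pet.shape, dtype=np.uint8)
    # PET and CT are concatenated along a new channel axis.
    return segment(np.stack([pet, ct], axis=0))
```

Because `any` short-circuits, the remaining classifiers are not evaluated once one of them votes for a lesion, and the segmentation models never run for patients that every classifier considers healthy.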

2.2 Segmentation Models

2.2.1 nn-Unet

The nn-Unet [8] is a deep-learning-based framework that can segment various kinds of medical images. It automatically performs hyperparameter tuning, pre-processing, network architecture adjustment, training, and post-processing with little manual intervention.

We explored two different architectures: 3D U-Net cascade and 3D full resolution U-Net.

2.2.2 Swin UNETR

The recent success of vision transformers in image recognition [3, 10, 9] paved the way for attention-based architectures in the field of semantic segmentation [1]. By proposing the architectures UNETR (UNEt TRansformers) and Swin UNETR, [6] and [5] showed that vision transformers are competitive with the nn-Unet in 3D medical image segmentation. Compared with fully convolutional neural networks, transformer-based models excel at learning long-range information, which is especially important for segmenting tumors of variable size.

To provide further evidence of Swin UNETR’s capabilities, we trained a model that processes FDG-PET/CT scans in a channel-wise manner. Our architecture processes a volume of size and outputs two channels followed by a softmax. The feature size of the network was set to . We optimized a Dice-Cross-Entropy loss (excluding the background) with an AdamW optimizer using a fixed learning rate of and a weight decay of .
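A Dice-Cross-Entropy loss for a two-channel softmax output might look like the NumPy sketch below. It is an illustration under assumptions, not the training code: the Dice term is computed on the foreground channel only (one reading of "excluding the background"), the cross-entropy is taken over all voxels, both terms are summed with equal weight, and `eps` is a smoothing constant chosen here.

```python
import numpy as np


def dice_ce_loss(probs, target, eps=1e-5):
    """probs: (2, ...) softmax probabilities; target: integer labels (...) in {0, 1}."""
    probs = np.asarray(probs, dtype=np.float64)
    target = np.asarray(target)

    # Soft Dice on the foreground channel only (background excluded).
    fg_prob = probs[1]
    fg_true = (target == 1).astype(np.float64)
    intersection = (fg_prob * fg_true).sum()
    dice = (2.0 * intersection + eps) / (fg_prob.sum() + fg_true.sum() + eps)

    # Cross-entropy: negative log-probability of the true class, averaged over voxels.
    true_prob = np.where(target == 1, probs[1], probs[0])
    cross_entropy = -np.log(np.clip(true_prob, 1e-12, 1.0)).mean()

    return (1.0 - dice) + cross_entropy
```

A perfect prediction drives both terms to zero, while a uniform prediction of 0.5 on a half-foreground target yields roughly 0.5 from the Dice term plus ln 2 from the cross-entropy.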

During training, we z-transformed the standardized uptake value (SUV) channel and applied percentile clipping and scaling to the CT channel such that its values lie between 0 and 1. After foreground cropping, we randomly cropped a volume of size , whose center is a lesion with probability . Lastly, we randomly rotated the volume along each axis (with ). When performing inference, we used a sliding window inferer with an overlap of and gave equal weight to all predictions when blending the output of overlapping windows.
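The two intensity normalizations can be sketched in NumPy as follows. The clipping percentiles are not stated in the text, so the defaults `lower_pct=0.5` and `upper_pct=99.5` are assumptions, as are the function names.

```python
import numpy as np


def normalize_suv(suv):
    """Z-transform the SUV channel to zero mean and unit variance."""
    suv = np.asarray(suv, dtype=np.float64)
    std = suv.std()
    return (suv - suv.mean()) / (std if std > 0 else 1.0)


def normalize_ct(ct, lower_pct=0.5, upper_pct=99.5):
    """Clip the CT channel to a percentile range and scale it to [0, 1]."""
    ct = np.asarray(ct, dtype=np.float64)
    lo, hi = np.percentile(ct, [lower_pct, upper_pct])
    if hi <= lo:
        return np.zeros_like(ct)  # constant image: nothing to scale
    return (np.clip(ct, lo, hi) - lo) / (hi - lo)
```

Percentile clipping keeps a few extreme Hounsfield values, for example from metal implants, from compressing the remaining intensity range after scaling.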

Besides segmenting lesions in scans that were diagnosed with cancer, the challenge focuses on the avoidance of false positive predictions in scans of healthy patients. We addressed the latter by attempting to classify whether lesions are existent in a scan. This approach is outlined below.

2.3 Classification Models

To further improve the segmentation performance on healthy patients with empty ground truth masks, we incorporated binary tumor classification models into our segmentation pipeline. The core idea is that the classification task is a simpler objective than full pixelwise classification, and a good classifier can de-noise an uncertain segmentation output. In essence, the classifier acts as a gating mechanism for the segmentation model. To this end, we investigated several backbones for classification and used the Maximum Intensity Projections (MIPs) of the PET volumes to train our classification models. We decided to use 2D MIPs instead of the whole volumes, as MIPs are typically used for tumor identification in standard clinical practice [4] and our experiments showed that using MIPs can lead to sufficient accuracy ().
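A maximum intensity projection keeps, for every ray along one axis, the largest value in the volume. The sketch below uses a toy volume; which array axis corresponds to the sagittal or axial direction depends on the orientation of the stored volume, so the axis indices are assumptions.

```python
import numpy as np


def max_intensity_projection(volume, axis):
    """Collapse a 3D volume to 2D by taking the maximum along `axis`."""
    return np.asarray(volume).max(axis=axis)


# Toy PET volume with a single hot voxel.
toy_pet = np.zeros((4, 5, 6), dtype=np.float32)
toy_pet[1, 2, 3] = 10.0

mip_axis0 = max_intensity_projection(toy_pet, axis=0)  # shape (5, 6)
mip_axis2 = max_intensity_projection(toy_pet, axis=2)  # shape (4, 5)
```

The hot voxel stays visible in both projections, which is why a 2D classifier can still pick up a focal lesion from the MIP.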

2.3.1 Objective of Classification

As stated in Section 2.2, our objective in adding the classification models is to avoid false negatives, i.e., classifying an unhealthy patient as healthy, while still identifying obviously healthy patients and gating the segmentation model for them, i.e., multiplying its output by 0. Thus, instead of optimizing the accuracy, we minimize the false negatives of our classification model as the primary objective and the false positives as a secondary objective. This prioritization is necessary so that our classification model does not discard correct segmentation predictions for unhealthy patients, which would dramatically decrease the Dice score. Instead, our goal is to build on top of the segmentation model and further enhance its performance. The need for a classification approach that discerns between healthy and non-healthy patients is amplified by the fact that our segmentation models are trained only on non-healthy patients.
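The prioritization of false negatives over false positives can be written as a lexicographic comparison. The sketch below applies it to the choice of a decision threshold. This is only an illustration of the stated priority, not the procedure used in the paper, and the scores, labels, and candidate thresholds are made up.

```python
def count_errors(scores, labels, threshold):
    """Return (false negatives, false positives) for one decision threshold."""
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return fn, fp


def pick_threshold(scores, labels, candidates):
    """Minimize false negatives first and false positives second.

    Python compares tuples element by element, so (fn, fp) encodes the priority.
    """
    return min(candidates, key=lambda t: count_errors(scores, labels, t))


# Made-up lesion probabilities and labels (1 = lesion present).
scores = [0.05, 0.2, 0.35, 0.6, 0.9]
labels = [0, 0, 1, 0, 1]
best = pick_threshold(scores, labels, candidates=[0.1, 0.3, 0.5])
```

Here the thresholds 0.1 and 0.3 both avoid false negatives, and 0.3 wins because it produces one false positive instead of two.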

We investigated three dimensions in varying our classifiers: their backbone architecture, the MIP-axis (sagittal and axial) which they are trained on, and whether the classifiers use MIPs with or without the brain removed. We show that forming an ensemble of classifiers trained with different configurations increases the generalizability of the models.

3 Results

3.1 Validation Strategy

To make the best use of all the data, we choose a 5-fold cross-validation (CV) scheme to split the data. Since some scans stem from the same patient our split is grouped by the patient identifier to avoid information leakage. We utilized the provided metadata to make our splits stratified by sex and diagnosis to guarantee similar distributions across the folds.
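A grouped and stratified split can be sketched with the standard library alone. The sketch assigns whole patients to folds, so all scans of one patient share a fold, and deals the patients of each (sex, diagnosis) stratum round-robin across the folds. The dictionary keys `patient`, `sex`, and `diagnosis`, the use of a patient's first scan to define the stratum, and the seed are assumptions. Libraries such as scikit-learn offer comparable splitters.

```python
import random
from collections import defaultdict


def grouped_stratified_folds(scans, n_folds=5, seed=0):
    """Map each patient id to a fold index, stratified by (sex, diagnosis)."""
    strata = defaultdict(list)
    seen = set()
    for scan in scans:
        patient = scan["patient"]
        if patient not in seen:  # the first scan of a patient defines the stratum
            seen.add(patient)
            strata[(scan["sex"], scan["diagnosis"])].append(patient)

    rng = random.Random(seed)
    fold_of = {}
    next_fold = 0
    for key in sorted(strata):
        patients = strata[key]
        rng.shuffle(patients)
        # Continue the round-robin across strata so that fold sizes stay balanced.
        for patient in patients:
            fold_of[patient] = next_fold
            next_fold = (next_fold + 1) % n_folds
    return fold_of


def fold_of_scan(scan, fold_of):
    """All scans of a patient inherit the patient's fold, which avoids leakage."""
    return fold_of[scan["patient"]]
```

Splitting at the patient level is what prevents two scans of the same patient from ending up on both sides of a train/validation boundary.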

3.2 Segmentation Models

3.2.1 nn-Unet

We trained the nn-Unet on the whole training data set as well as on the non-healthy scans only. After training all folds, we found that the full resolution 3D U-net had the best overall Dice score when trained with only non-healthy patients, as seen in Table 1. Since the 3D full resolution network exhibits the best cross-validation Dice score of 0.6973 (see Table 1), we refer to it as the nn-Unet in the subsequent sections. The false negative volume and false positive volume are listed in Table 2.

Fold   3D FullRes       3D FullRes           3D Cascade       3D Cascade
       (all patients)   (non-healthy only)   (all patients)   (non-healthy only)
1      0.587            0.6876               0.6493           0.6357
2      0.6467           0.6829               0.6636           0.6125
3      0.7173           0.7133               0.5617           0.692
4      0.7110           0.7275               0.6067           0.6675
5      0.6211           0.6755               0.6474           0.5983
CV     0.6566           0.6973               0.6258           0.6412
Table 1: Dice score of the nn-Unet for each network (3D Full Resolution and 3D Cascade) and each splitting (all patients and only non-healthy patients) trained and evaluated on non-healthy cases

Fold   FNV       FPV
1      4.3625    21.4024
2      7.3647    17.6611
3      28.3955   7.6103
4      5.3868    13.9205
5      6.994     11.1126
CV     10.5007   14.3414
Table 2: False Negative Volume (FNV) and False Positive Volume (FPV) of the 3D full resolution nn-Unet trained and evaluated on non-healthy cases

3.2.2 Swin UNETR

We trained the Swin UNETR, for iterations, and evaluated the model each steps after iteration . Inspired by the effect of distilling the signal by excluding healthy patients during training, we restricted our training data set to cases diagnosed with cancer. The cross-validation Dice scores are listed in Table 3. The Swin UNETR model is our second-best model, achieving a cross-validation Dice score of 0.6675.

Fold Dice
1 0.6627
2 0.6574
3 0.6732
4 0.6905
5 0.6540
CV 0.6675
Table 3: Dice scores of Swin UNETR trained and evaluated on non-healthy cases

3.3 Late Fusion of nn-Unet and Swin UNETR

Motivated by the hypothesis that both models learn complementary features, we applied late fusion by averaging the softmax probability outputs of both models. To evaluate the late fusion of the nn-Unet and the Swin UNETR, we computed the challenge’s metrics on the training set (without healthy cases). The metrics are listed in Table 4. The cross-validation metrics of the ensemble are better than our best single model (nn-Unet) in all considered metrics. Only the false negative volume of fold 1 is worse than its nn-Unet counterpart (cf. Table 2).

Fold Dice FNV FPV
1 0.7146 5.1083 8.5464
2 0.6976 6.5240 7.8235
3 0.7264 7.3805 4.8489
4 0.7545 4.3221 10.7778
5 0.7131 6.3486 7.6357
CV 0.7212 5.9367 7.9265
Table 4: Dice Score, False Negative Volume (FNV), and False Positive Volume (FPV) of (nn-Unet + Swin UNETR) Late Fusion
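The late fusion evaluated in Table 4 averages the softmax outputs of both models before taking the argmax. A minimal NumPy sketch, with made-up probabilities and a hypothetical function name, shows the effect on voxels where the models disagree.

```python
import numpy as np


def late_fusion(prob_maps):
    """Average per-model softmax maps of shape (classes, ...) and take the argmax."""
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return mean_prob.argmax(axis=0).astype(np.uint8)


# Foreground probabilities of two models for three voxels (made-up values).
fg_a = np.array([0.9, 0.6, 0.1])
fg_b = np.array([0.2, 0.1, 0.2])
model_a = np.stack([1.0 - fg_a, fg_a])  # channel 0 = background, 1 = lesion
model_b = np.stack([1.0 - fg_b, fg_b])

fused = late_fusion([model_a, model_b])
```

In the first voxel the confident model outvotes the other (mean 0.55), whereas in the second voxel the weaker vote of 0.6 is overruled (mean 0.35). Averaging probabilities therefore takes each model's confidence into account instead of counting hard votes.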

3.4 Classification Approach

3.4.1 Varying the Backbone Architecture

We trained and evaluated classification models based on three CNN architectures: ResNet [7], EfficientNet [11], and CoAtNet [2]. Our experiments showed that the ResNet model exhibits the most consistent and best performance on the classification task. The classification accuracy and FN/FP (false negative / false positive) counts for fold 1 are listed in Table 5 (results on other folds are left out for the sake of brevity). On this fold, the classification models reach accuracies between 83.66% and 91.09%. However, only the ResNet50 and ResNet18 backbones achieve reasonable FN-rates, whereas EfficientNet and CoAtNet were not able to achieve fewer than 4 false negatives. Hence, we opt to only use these two backbones for our late fusion ensemble experiments.

3.4.2 Varying Brain/De-Brained Data

We trained additional classification models on MIPs with the brain removed, as seen in Figure 2. The MIPs show that the tumor (positioned slightly above the heart) is clearly visible and that removing the brain increases the global contrast in the image. We also speculate that removing the brain reduces the number of "distractions" with high metabolic uptake. The other two anatomical regions with high physiological uptake are the heart and the urinary bladder. However, removing them is dangerous, since there are patients whose tumors are in close proximity to these organs and a simple threshold filtering would remove the tumors as well. Table 5 also shows the difference in performance between the MIP views (sagittal (X) and axial (Y)). The MIP-Y performance is consistently better than the MIP-X performance.

Figure 2: Example MIP images with and without the brain.

                   ResNet                  EfficientNet    CoAtNet
Backbone Version   18      50      101     B0      B4      3       4
Acc. MIP-X [%]     83.66   87.62   87.13   83.66   89.60   87.13   86.63
Acc. MIP-Y [%]     87.13   86.14   88.91   88.61   91.09   87.13   86.63
FN/FP MIP-X        14/33   16/23   8/32    16/22   4/32    6/29    10/33
FN/FP MIP-Y        10/24   2/65    1/49    4/28    10/19   11/12   6/32
Table 5: Results for the architectures on fold 1 with a decision threshold

We further analyze how the models’ performance changes when using de-brained MIPs. The results are listed in Table 6. They are similar to the ones with the brain; however, the models exhibit a different distribution of FN/FP samples. This indicates that the models have learned diversified features and that an ensemble of models trained on MIP data with brains and models trained on de-brained data would be beneficial.

                   ResNet                  EfficientNet    CoAtNet
Backbone Version   18      50      101     B0      B4      3       4
Acc. MIP-X [%]     88.61   84.65   89.11   88.61   87.13   88.12   87.62
Acc. MIP-Y [%]     87.62   87.62   84.16   88.12   90.10   89.60   88.61
FN/FP MIP-X        12/28   13/27   21/15   31/38   24/12   8/23    18/15
FN/FP MIP-Y        9/30    20/22   15/35   14/28   12/29   15/17   16/22
Table 6: Results for the architectures on fold 1 without the brain with

3.4.3 Ensemble of Classifiers

We utilized the ResNet18 and ResNet50 backbones for our ensemble experiments as they showed the best FN/FP ratio, which is our primary objective. We built the ensemble along three dimensions: trained on either MIP-X or MIP-Y data, trained with or without the brain, and using either ResNet18 or ResNet50 as a backbone, i.e., 2 × 2 × 2 = 8 models for the ensemble. We employed decision-based late fusion, where the final prediction is the logical OR of the predictions of all classifiers. This means that all classifiers must agree that the patient is healthy in order to classify them as such. To further increase the robustness against false negatives, we used a conservative decision threshold for each classifier of . The final results can be seen in Table 7.

Fold 1 2 3 4 5 CV
Accuracy [%] 69.8 70.9 77.9 73.8 79.4 74.3
FN/FP 1/60 2/58 4/39 3/50 2/41 2.4/49.6
Table 7: Results for the ensemble of all classifiers with
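The decision-based late fusion behind Table 7 can be sketched as follows. The threshold value is missing from the text, so the value 0.3 and the probabilities are made up for illustration.

```python
def or_fusion(probabilities, threshold):
    """Flag a patient as unhealthy if any classifier reaches the threshold."""
    return any(p >= threshold for p in probabilities)


# Lesion probabilities of three classifiers for three hypothetical patients.
patients = {
    "all_low": [0.10, 0.05, 0.20],
    "one_high": [0.10, 0.60, 0.20],
    "borderline": [0.30, 0.10, 0.10],
}
flags = {name: or_fusion(probs, threshold=0.3) for name, probs in patients.items()}
```

A single classifier at or above the threshold is enough to trigger the segmentation. Adding classifiers or lowering the threshold can therefore only reduce false negatives, at the cost of more false positives, which is consistent with the low FN and high FP counts in Table 7.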

3.5 Post-Processing

PET scans sometimes exhibit image artifacts at the boundaries. Therefore, for the nn-Unet trained on all patients with tumors, we tested several post-processing steps. We set all predictions of the network at the boundaries to zero. To find out which boundary size works best, we set 0.5%, 1%, 1.5%, 5%, 10%, 12% of the scan on both ends of the axis to zero. We also tried setting a fixed number of slices to zero. These steps yielded only marginal improvements (see Appendix A, Tables 8, 9, and 10). Furthermore, we set all predictions with fewer than voxels in total to zero. Among all patients, this post-processing step only affected one patient with a tumor, whereas otherwise only false positive predictions in healthy patients were reduced. Nevertheless, the improvement was only marginal (cf. Appendix A, Table 11).
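Both post-processing steps can be sketched in NumPy. The axis, the boundary fraction, and the voxel threshold are parameters here because the text does not preserve their values, and the function names are hypothetical.

```python
import numpy as np


def zero_boundaries(pred, axis, fraction):
    """Set `fraction` of the slices at both ends of `axis` to zero."""
    pred = np.array(pred, copy=True)
    n = int(round(pred.shape[axis] * fraction))
    if n > 0:
        index = [slice(None)] * pred.ndim
        index[axis] = slice(0, n)  # lower boundary
        pred[tuple(index)] = 0
        index[axis] = slice(pred.shape[axis] - n, None)  # upper boundary
        pred[tuple(index)] = 0
    return pred


def drop_small_prediction(pred, min_voxels):
    """Return an empty mask if the prediction has fewer than `min_voxels` voxels in total."""
    pred = np.asarray(pred)
    return pred if int(pred.sum()) >= min_voxels else np.zeros_like(pred)
```

The second step acts on the total predicted volume of a scan rather than on single connected components, which matches the description that it mostly removes small false positive predictions in healthy patients.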

4 Conclusion

We presented our approach to the AutoPET challenge. Our solution consists of two state-of-the-art segmentation models, namely nn-Unet and Swin UNETR, and a maximum intensity projection classifier that acts as a gating mechanism for the segmentation part of our pipeline. By performing a late fusion of both segmentation models, we were able to boost the performance of nn-Unet, a challenge-winning automated 3D segmentation framework. We suspect that this performance boost originates in Swin UNETR’s ability to learn long-range dependencies and the resulting complementary features. We hope to further investigate this assumption in the near future. Although we provided more evidence about the competitiveness of attention-based models in 3D semantic segmentation, our exploration of suitable hyperparameters for the Swin UNETR was limited by the time constraints of the challenge and by the high runtime caused by the large amount of data inherent to full-body PET/CT scans.


  • [1] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang (2021) Swin-unet: unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537.
  • [2] Z. Dai, H. Liu, Q. V. Le, and M. Tan (2021) Coatnet: marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems 34, pp. 3965–3977.
  • [3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  • [4] T. W. Georgi, A. Zieschank, K. Kornrumpf, L. Kurch, O. Sabri, D. Körholz, C. Mauz-Körholz, R. Kluge, and S. Posch (2022) Automatic classification of lymphoma lesions in fdg-pet–differentiation between tumor and non-tumor uptake. PloS one 17 (4), pp. e0267275.
  • [5] A. Hatamizadeh, V. Nath, Y. Tang, D. Yang, H. R. Roth, and D. Xu (2022) Swin unetr: swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI Brainlesion Workshop, pp. 272–284.
  • [6] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. R. Roth, and D. Xu (2022) Unetr: transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 574–584.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
  • [8] F. Isensee, P. F. Jaeger, S. A. A. Kohl, et al. (2021) nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 18, pp. 203–211.
  • [9] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, et al. (2022) Swin transformer v2: scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12009–12019.
  • [10] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022.
  • [11] M. Tan and Q. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114.


Appendix A:

The following tables show the results of the post-processing steps.

Fold 1 2 3 4 5
3 slices
5 slices
7 slices
Table 8: Dice scores if 3, 5, or 7 slices at the lower boundary of the axis are set to zero
Fold 1 2 3 4 5
3 slices
5 slices
7 slices
Table 9: Dice scores if 3, 5, or 7 slices at the upper boundary of the axis are set to zero
Fold 1 2 3 4 5
1 %
2 %
5 %
10 %
12 %
15 %
Table 10: Dice scores if 1%, 2%, 5%, 10%, 12%, or 15% of the image size is set to zero at both boundaries of the axis
Fold 1 2 3 4 5
post-processed 22.3448 21.6601 8.2047 15.3556 29.1160
Table 11: False Positive Volume after predictions less than or equal to set to zero, including healthy and non-healthy patients in the validation sets