Combining CNNs With Transformer for Multimodal 3D MRI Brain Tumor Segmentation With Self-Supervised Pretraining

by Mariia Dobko, et al.
Ukrainian Catholic University

We apply an ensemble of modified TransBTS, nnU-Net, and a combination of both for the segmentation task of the BraTS 2021 challenge. We change the original architecture of the TransBTS model by adding Squeeze-and-Excitation blocks, increasing the number of CNN layers, and replacing the positional encoding in the Transformer block with learnable Multilayer Perceptron (MLP) embeddings, which makes the Transformer adjustable to any input size during inference. With these modifications, we substantially improve TransBTS performance. Inspired by the nnU-Net framework, we combine it with our modified TransBTS by replacing the architecture inside nnU-Net with our custom model. On the validation set of BraTS 2021, the ensemble of these approaches achieves Dice scores of 0.8496, 0.8698, and 0.9256 and HD95 of 15.72, 11.057, and 3.374 for enhancing tumor, tumor core, and whole tumor, respectively. Our code is publicly available.




1 Introduction

Glioma is one of the most common types of brain tumors. It can severely affect brain function and be life-threatening, depending on its location and rate of growth. Fast, automated, and accurate segmentation of these tumors reduces the time clinicians spend on each case, while also providing a second opinion that supports a more confident clinical diagnosis. Magnetic Resonance Imaging (MRI) is a well-known scanning procedure for brain tumor analysis. It is usually acquired with several complementary modalities: T1-weighted scans, T2-weighted scans, and Fluid-Attenuated Inversion Recovery (FLAIR). T1-weighted images are produced by using short TE (echo time) and TR (repetition time); the contrast and brightness of the image are determined by the T1 properties of tissue. In the FLAIR sequence, abnormalities remain bright but normal cerebrospinal fluid (CSF) is attenuated and made dark, so FLAIR is sensitive to pathology. In general, these modalities indicate different tissue properties and areas of tumor spread.

In this paper, we present our solution for the Brain Tumor Segmentation (BraTS) Challenge, which is held every year and aims to evaluate state-of-the-art methods for the segmentation of brain tumors. In 2021 it is jointly organized by the Radiological Society of North America (RSNA), the American Society of Neuroradiology (ASNR), and the Medical Image Computing and Computer-Assisted Interventions (MICCAI) society. The main task of this challenge is to develop a method for semantic segmentation of the different glioma sub-regions (the "enhancing tumor" (ET), the peritumoral edematous/invaded tissue (ED), and the "necrotic" core (NCR)) in multi-parametric MRI (mpMRI) scans. All models in this work have been trained and evaluated on the BraTS 2021 dataset [1, 11, 4, 2, 3].

The proposed method is inspired by two approaches: nnU-Net [8] and the TransBTS [13] model. We incorporate several modifications to their architectures as well as to the training process. For example, for TransBTS we add Squeeze-and-Excitation blocks, change the positional encoding in the Transformer, and incorporate self-supervised pretraining. We also evaluated different postprocessing procedures to improve results and increase generalizability. For instance, we applied connected-component analysis, thresholding for noise filtering via class replacement, and Test Time Augmentations (TTA). We use our modified TransBTS to replace the model architecture inside the nnU-Net library, keeping the original preprocessing and training of nnU-Net while also including the useful features of the Transformer in combination with a CNN.

2 Methods

Our solution is based on the TransBTS architecture proposed by Wang et al. [13]. However, we also train nnU-Net [8] and combine both approaches using ensembling. We also tested the incorporation of our modified TransBTS inside nnU-Net; see Section 2.3.

In the following sections, we describe all of our custom components introduced for data preprocessing, training, and inference postprocessing including architecture modifications.

2.1 Data Preprocessing and Augmentations

We use a different training strategy for each model in the final ensemble, including different data preprocessing.

Modified TransBTS:

We combine all MRI modalities of a patient into one 4-channel 3D volume for the input. The normalization used in our experiments is per-channel standardization: each channel is rescaled using estimates of its mean and standard deviation. Every scan is randomly cropped to the shape of 128x128x128. During training we also apply Random Flip for every dimension, including depth, and Random Intensity Shift according to this formulation:

where the factor value was set to 0.1.
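The preprocessing and augmentation steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code. Because the intensity-shift formula is not reproduced in this copy, the sketch assumes a common form: a per-channel multiplicative scale drawn from [1 - factor, 1 + factor] plus an additive shift drawn from [-factor, factor]. The function names and the (C, H, W, D) layout are also assumptions. In practice the label map must receive the same crop and flips as the image.

```python
import numpy as np

def zscore_per_channel(volume, eps=1e-8):
    """Standardize each channel (modality) of a (C, H, W, D) volume."""
    mean = volume.mean(axis=(1, 2, 3), keepdims=True)
    std = volume.std(axis=(1, 2, 3), keepdims=True)
    return (volume - mean) / (std + eps)

def random_crop(volume, size, rng):
    """Cut a random (size, size, size) block out of a (C, H, W, D) volume."""
    h, w, d = (rng.integers(0, dim - size + 1) for dim in volume.shape[1:])
    return volume[:, h:h + size, w:w + size, d:d + size]

def random_flip(volume, rng, p=0.5):
    """Flip each spatial axis (including depth) independently with probability p."""
    for axis in (1, 2, 3):
        if rng.random() < p:
            volume = np.flip(volume, axis=axis)
    return volume

def random_intensity_shift(volume, rng, factor=0.1):
    """ASSUMED form: per-channel scale in [1-factor, 1+factor] plus shift in [-factor, factor]."""
    channels = volume.shape[0]
    scale = rng.uniform(1.0 - factor, 1.0 + factor, size=(channels, 1, 1, 1))
    shift = rng.uniform(-factor, factor, size=(channels, 1, 1, 1))
    return volume * scale + shift

def augment(volume, crop_size, rng, factor=0.1):
    """Normalize, crop, flip, and intensity-shift one training sample."""
    volume = zscore_per_channel(volume)
    volume = random_crop(volume, crop_size, rng)
    volume = random_flip(volume, rng)
    return random_intensity_shift(volume, rng, factor)
```

For a BraTS-sized input one would call `augment(volume, crop_size=128, rng=np.random.default_rng())` on an array of shape (4, 240, 240, 155).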

nnU-Net & modified TransBTS: For this training, we use the preprocessing recommended by the nnU-Net authors, which includes per-sample normalization and non-zero mask cropping. The augmentations applied during training include elastic transformation, scaling in the range 0.85 to 1.25, rotations for all dimensions, gamma correction in the range 0.7 to 1.5, and mirroring for all axes.

2.2 Self-pretraining

Training deep architectures from scratch on 3D data is extremely time-consuming. Transfer learning allows the model to converge faster by incorporating knowledge (weights) acquired for one task to solve a related one. However, the use of any models pretrained on external datasets is forbidden in the BraTS Challenge. This is why we perform self-pretraining on the same dataset, with the same model, but for a different task: image reconstruction. We train an autoencoder whose encoder is identical to that of our segmentation model to reconstruct 3D scans. Mean absolute error (MAE) loss was used for this stage.

Since this step is mainly needed to ensure quicker convergence we train the model for 10 epochs. When segmentation starts we load pretrained weights for the encoder part of our TransBTS.
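The two-stage procedure can be sketched in PyTorch as below. This is a toy illustration under stated assumptions: the `Encoder`, `Autoencoder`, and `Segmenter` classes are hypothetical stand-ins (a single strided convolution instead of the real encoder), the optimizer and learning rate are placeholders, and the data is random. Only the overall pattern comes from the text: reconstruct the input with an MAE (L1) loss, then copy the encoder weights into the segmentation model.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Hypothetical tiny encoder shared by both models."""
    def __init__(self, in_ch=4, feat=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, feat, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class Autoencoder(nn.Module):
    """Encoder plus a decoder that reconstructs the input scan."""
    def __init__(self, in_ch=4, feat=8):
        super().__init__()
        self.encoder = Encoder(in_ch, feat)
        self.decoder = nn.ConvTranspose3d(feat, in_ch, kernel_size=2, stride=2)

    def forward(self, x):
        return self.decoder(self.encoder(x))

class Segmenter(nn.Module):
    """Same encoder, but with a segmentation head (4 output classes assumed)."""
    def __init__(self, in_ch=4, feat=8, n_classes=4):
        super().__init__()
        self.encoder = Encoder(in_ch, feat)
        self.head = nn.ConvTranspose3d(feat, n_classes, kernel_size=2, stride=2)

    def forward(self, x):
        return self.head(self.encoder(x))

torch.manual_seed(0)
scans = torch.randn(2, 4, 8, 8, 8)  # random stand-in for 4-modality MRI crops

# Stage 1: reconstruction pretraining with mean absolute error.
autoencoder = Autoencoder()
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
mae = nn.L1Loss()
for _ in range(3):  # the paper trains for 10 epochs; 3 steps here for brevity
    optimizer.zero_grad()
    loss = mae(autoencoder(scans), scans)
    loss.backward()
    optimizer.step()

# Stage 2: initialize the segmentation encoder from the pretrained one.
segmenter = Segmenter()
segmenter.encoder.load_state_dict(autoencoder.encoder.state_dict())
```

The decoder of the autoencoder is discarded after pretraining; only the encoder weights carry over.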

2.3 Models

TransBTS shows the best or comparable results on both the BraTS 2019 and BraTS 2020 datasets. The model is based on an encoder-decoder structure: the encoder captures local information using a 3D CNN, and the resulting features are passed to a Transformer, which learns global features and feeds them to a decoder for upsampling and segmentation prediction.

The idea behind this architecture is to use 3D CNN to generate compact feature maps capturing spatial and depth information while applying a Transformer following the encoder to handle long-distance dependency in a global space.

Our custom modifications (see Figure 1 and Figure 2 for comparison with the original TransBTS):

  • We add Squeeze-and-Excitation (SE) blocks [5] to every layer of the encoder. SE blocks perform dynamic channel-wise feature recalibration.

  • The depth of the model was increased compared to TransBTS by adding one layer to the encoder and, correspondingly, one to the decoder.

  • We also replaced the positional encoding from TransBTS with a learnable MLP block; for more details, please refer to Section 2.4.
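To make the channel-wise recalibration of the first modification concrete, here is a minimal NumPy sketch of a Squeeze-and-Excitation step on a single 3D feature map. It follows the general SE formulation (global average pooling, a bottleneck of two linear maps, a sigmoid gate); the weight shapes, the omission of biases, and the function name are simplifying assumptions, and it does not reflect where exactly the blocks sit inside the authors' encoder.

```python
import numpy as np

def squeeze_excitation(features, w_reduce, w_expand):
    """Recalibrate channels of a (C, H, W, D) feature map.

    w_reduce: (C // r, C) bottleneck weights, w_expand: (C, C // r).
    Biases are omitted for brevity (an assumption of this sketch).
    """
    squeezed = features.mean(axis=(1, 2, 3))             # squeeze: one value per channel
    hidden = np.maximum(w_reduce @ squeezed, 0.0)        # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w_expand @ hidden)))   # sigmoid gate in (0, 1)
    return features * gates[:, None, None, None]         # excite: rescale each channel
```

Each channel is multiplied by a single learned scalar between 0 and 1, so the block can suppress or emphasize whole feature channels depending on the global content of the volume.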

nnU-Net [8] proposes a robust and self-adapting framework based on three variations of the U-Net architecture, namely U-Net, 3D U-Net, and Cascade U-Net. The method dynamically adapts to the dataset's image geometry and, more critically, emphasizes the stages that many researchers underestimate, where a model can gain a significant increase in performance. These stages are preprocessing (e.g., normalization), training (e.g., loss, optimizer settings, and data augmentation), inference (e.g., patch-based strategy and ensembling across test time augmentations), and potential postprocessing. nnU-Net shows that non-model changes in the solution are just as important as the model architecture. We therefore decided to exploit the nnU-Net pipeline to train our modified TransBTS.

Figure 1: Architecture of Original TransBTS, visualization inspired by [13]. Best viewed in color and zoomed in.
Figure 2: Architecture of Our Modified TransBTS, visualization inspired by [13]. Best viewed in color and zoomed in.

2.4 MLP for Positional Encoding

In TransBTS, the learnable position embeddings, which introduce location information, have a fixed size. This limits the input shape at inference, since test images cannot deviate in scale from the training set if the positional code is fixed. To address this issue, we include a data-driven positional encoding (PE) module in the form of a Multilayer Perceptron (MLP). By directly using a 1x1 3D convolution, we eliminate the problem with fixed resolution and add extra learnable parameters useful for positional embeddings. The MLP architecture consists of three consecutive blocks, where a single block has a 3D convolution and a ReLU activation followed by batch normalization. This operation is formulated as follows:

where the MLP block is displayed in Figure 3.

Figure 3: The architecture of MLP 3D-coordinate module for positional embeddings.
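A minimal PyTorch sketch of such a module is given below. The block structure (three blocks of 3D convolution with kernel size 1, ReLU, batch normalization) follows the description above. The class name, the channel count, and in particular the choice to add the MLP output to the input features are assumptions of this sketch, since the formula is not reproduced in this copy.

```python
import torch
import torch.nn as nn

class ConvPositionalEncoding(nn.Module):
    """Hypothetical data-driven positional encoding built from 1x1x1 convolutions."""

    def __init__(self, channels, num_blocks=3):
        super().__init__()
        layers = []
        for _ in range(num_blocks):
            layers += [
                nn.Conv3d(channels, channels, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.BatchNorm3d(channels),
            ]
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):
        # ASSUMPTION: the encoding is added to the features it is computed from.
        return x + self.mlp(x)

pe = ConvPositionalEncoding(channels=16).eval()
with torch.no_grad():
    small = pe(torch.randn(1, 16, 8, 8, 8))
    large = pe(torch.randn(1, 16, 12, 10, 6))  # different spatial size, same module
```

Because a convolution with kernel size 1 has no parameters tied to the spatial extent, the same weights apply to feature maps of any resolution, which is the property the section relies on.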

2.5 Loss

Our training objective for the TransBTS model is a linear combination of Dice and cross-entropy (CE) losses [6, 7], whereas the original TransBTS model was trained solely with softmax Dice loss. The loss operates on three class labels: GD-enhancing tumor (ET, label 4), peritumoral edematous/invaded tissue (ED, label 2), and necrotic tumor core (NCR, label 1). The weights of the two loss components were chosen experimentally: 0.4 for Dice and 0.6 for cross-entropy.

When we combine nnU-Net with TransBTS architecture, we also use Dice with CE, but in this case, the weight for both of them is the same, so they have an equal contribution.
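The weighted objective can be illustrated with a small NumPy sketch. It assumes a soft Dice averaged over classes, a background class at index 0, and labels already remapped to consecutive indices (for example 0, 1, 2, 3 instead of 0, 1, 2, 4); none of these details are stated in the text, so treat them as placeholders.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def dice_ce_loss(logits, target, w_dice=0.4, w_ce=0.6, eps=1e-5):
    """logits: (K, H, W, D) scores; target: (H, W, D) integer class indices in [0, K)."""
    num_classes = logits.shape[0]
    probs = softmax(logits, axis=0)
    one_hot = np.moveaxis(np.eye(num_classes)[target], -1, 0)  # (K, H, W, D)

    # Soft Dice per class, then averaged (assumed reduction).
    spatial = (1, 2, 3)
    intersection = (probs * one_hot).sum(axis=spatial)
    denominator = probs.sum(axis=spatial) + one_hot.sum(axis=spatial)
    dice_loss = 1.0 - ((2.0 * intersection + eps) / (denominator + eps)).mean()

    # Voxel-wise cross-entropy, averaged over the volume.
    ce_loss = -(one_hot * np.log(probs + 1e-12)).sum(axis=0).mean()

    return w_dice * dice_loss + w_ce * ce_loss
```

Setting `w_dice=0.5, w_ce=0.5` gives the equal weighting used when TransBTS is trained inside nnU-Net.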

2.6 Optimization

We apply mixed-precision training, which enables both computational speedup and memory optimization. It is achieved by performing operations in half-precision format and requires two steps: setting the model to use the float16 data type where possible and adding loss scaling to keep small gradient values.
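The need for loss scaling can be shown with a few lines of NumPy. The gradient value and the scale factor below are illustrative assumptions. In PyTorch, both steps are commonly handled by the automatic mixed precision utilities (`autocast` and `GradScaler`), although the text does not state which mechanism the authors used.

```python
import numpy as np

grad = 1e-8              # a small gradient value, fine in float32
loss_scale = 2.0 ** 16   # illustrative scale factor

# Cast directly to half precision: the value underflows to zero.
unscaled_fp16 = np.float16(grad)

# Scale first, cast to half precision, then unscale in float32.
scaled_fp16 = np.float16(grad * loss_scale)
recovered = np.float32(scaled_fp16) / np.float32(loss_scale)
```

Multiplying the loss by a constant scales all gradients by the same constant, which lifts small values into the range float16 can represent; dividing by the constant before the optimizer step restores the original magnitudes.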

2.7 Postprocessing and Ensembling

First, to reduce memory consumption, we divide the original volume of shape 240x240x155 into eight overlapping regions of shape 128x128x128. After each region has passed through the model, we form our final prediction by combining all regions. Where regions intersect, the prediction from the region processed later is used.
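The tiling can be sketched in NumPy as follows. The exact window placement is not given in the text; this sketch assumes two windows per axis, one flush with the start and one flush with the end, which yields 2 x 2 x 2 = 8 regions (offsets 0 and 112 along the 240-voxel axes, 0 and 27 along the 155-voxel axis). `predict_fn` is a hypothetical stand-in for the model.

```python
import itertools
import numpy as np

def window_starts(length, patch):
    """Two window offsets per axis; covers the axis when patch <= length <= 2 * patch."""
    return (0, length - patch)

def predict_tiled(volume, predict_fn, patch=128):
    """Predict a (C, H, W, D) volume in eight overlapping patches.

    predict_fn maps a (C, p, p, p) crop to a (p, p, p) label map.
    In overlaps, later patches overwrite earlier ones.
    """
    _, h, w, d = volume.shape
    output = np.zeros((h, w, d), dtype=np.int64)
    offsets = itertools.product(*(window_starts(n, patch) for n in (h, w, d)))
    for x, y, z in offsets:
        crop = volume[:, x:x + patch, y:y + patch, z:z + patch]
        output[x:x + patch, y:y + patch, z:z + patch] = predict_fn(crop)
    return output
```

Averaging probabilities in the overlapping areas would be a natural alternative to overwriting, but that is a different design from the one described above.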

Secondly, we perform connected-component analysis (CCA) of ground truth labels on the whole training set. Connected-component analysis applies graph theory: the input data is labeled based on the given heuristics, and the algorithm disassembles the segmentation mask into components according to the given connectivity. CCA can use a 4-connected or 8-connected neighborhood (in 2D; the 3D analogues are 6-, 18-, and 26-connectivity). Based on this analysis, we remove predicted components that are smaller than a 15-voxel threshold. This postprocessing, however, did not yield any significant improvement, so we do not include it in the final submissions.
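A minimal sketch of the component filtering using SciPy's `ndimage.label` is shown below. The function name and the default face connectivity (6-connectivity in 3D) are assumptions; the text does not state which connectivity was used for the 3D masks.

```python
import numpy as np
from scipy import ndimage

def remove_small_components(mask, min_voxels=15, connectivity=1):
    """Drop connected components with fewer than min_voxels voxels from a binary mask.

    connectivity=1 means face neighbors only (6-connectivity in 3D);
    connectivity=3 would give 26-connectivity.
    """
    structure = ndimage.generate_binary_structure(mask.ndim, connectivity)
    labeled, num_components = ndimage.label(mask, structure=structure)
    if num_components == 0:
        return mask.astype(bool)
    sizes = np.bincount(labeled.ravel())  # sizes[0] counts the background
    keep = sizes >= min_voxels
    keep[0] = False                       # never keep the background
    return keep[labeled]
```

For a multi-class segmentation, the function can be applied either to the union of all tumor labels or to each class mask separately, which corresponds to the "C.C" and "C.C per class" variants reported in Table 3.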

We adopted an idea from another paper [14], whose authors noticed, after analyzing the training set, that some cases do not contain the enhancing tumor class. Therefore, if the number of voxels of this class in our prediction does not exceed the experimentally selected threshold of 300 voxels, we replace it with necrosis.
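This replacement rule is simple enough to state directly in code. The sketch below uses the BraTS label values given in Section 2.5 (NCR = 1, ED = 2, ET = 4); the function name is hypothetical.

```python
import numpy as np

NCR, ED, ET = 1, 2, 4  # BraTS label values

def replace_small_enhancing_tumor(segmentation, max_voxels=300):
    """Relabel ET as NCR when the predicted ET region has at most max_voxels voxels."""
    segmentation = segmentation.copy()
    et_mask = segmentation == ET
    if 0 < et_mask.sum() <= max_voxels:
        segmentation[et_mask] = NCR
    return segmentation
```

The rule matters for the challenge metrics because a case without enhancing tumor in the ground truth scores a Dice of zero for ET as soon as a single false-positive ET voxel is predicted.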

To increase the performance and robustness of our solution, we create an ensemble of our trained models. One such submission included the modified TransBTS trained for 700 epochs and nnU-Net with the default configuration trained for 1,000 epochs; in Table 1 this experiment is named 'nnU-Net + our TransBTS'. The weights for the probabilities were selected separately for each class (the first coefficient corresponds to nnU-Net, the second to our custom TransBTS): 0.5 and 0.5 for NCR, 0.7 and 0.3 for ED, 0.6 and 0.4 for ET.

Our final solution is also an ensemble, which averages probabilities of three models: nnU-Net with default training, nnU-Net trained with our custom TransBTS, and our custom TransBTS trained independently for 700 epochs. The coefficients for these models (following the same order as they were named) are: 0.359, 0.347, 0.294 for NCR class, 0.253, 0.387, 0.36 for ED, 0.295, 0.353, 0.351 for ET class. The results can be viewed in Table 1 under the name ’Our Final Solution’.
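The class-wise weighted averaging can be sketched in NumPy as below, using the coefficients of the final solution. The array layout, the function name, and the restriction to the three tumor classes are assumptions of this sketch; how the background class is treated and how the final labels are derived from the averaged probabilities is not described in the text.

```python
import numpy as np

# Rows: NCR, ED, ET. Columns: default nnU-Net, TransBTS inside nnU-Net, standalone TransBTS.
FINAL_WEIGHTS = np.array([
    [0.359, 0.347, 0.294],
    [0.253, 0.387, 0.360],
    [0.295, 0.353, 0.351],
])

def ensemble(probabilities, weights=FINAL_WEIGHTS):
    """Weighted average of per-model class probabilities.

    probabilities: array-like of shape (M, K, H, W, D) for M models and K classes.
    weights: (K, M) matrix, one weight per class and model.
    Returns the combined probabilities with shape (K, H, W, D).
    """
    stacked = np.asarray(probabilities)
    return np.einsum("km,mk...->k...", weights, stacked)
```

Since each row of the weight matrix sums to approximately 1, the combined values stay on the same scale as the individual model outputs.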

3 Results

We evaluate our proposed method on the BraTS 2021 dataset and provide a short ablation study for our postprocessing customization. We also compare several ensembling combinations against other approaches trained on the same data.

3.1 Metrics

This year's assessment follows the same configuration as previous BraTS challenges. The Dice Similarity Coefficient and the 95th-percentile Hausdorff distance (HD95) are used, and the aggregation of these metrics per class determines the winners. The challenge leaderboard also shows Sensitivity and Specificity for each tumor class.
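As an illustration of the Dice part of the evaluation, the sketch below computes the score for the three regions reported on the leaderboard. It uses the standard BraTS convention that tumor core (TC) is the union of NCR and ET and whole tumor (WT) is the union of all three labels. Returning 1.0 when both masks are empty is an assumption of this sketch, and HD95 is omitted.

```python
import numpy as np

# Region name -> BraTS label values it contains (NCR = 1, ED = 2, ET = 4).
REGIONS = {"ET": (4,), "TC": (1, 4), "WT": (1, 2, 4)}

def dice(pred_mask, true_mask):
    """Dice similarity coefficient between two boolean masks."""
    denominator = pred_mask.sum() + true_mask.sum()
    if denominator == 0:
        return 1.0  # ASSUMPTION: two empty masks count as a perfect match
    return 2.0 * np.logical_and(pred_mask, true_mask).sum() / denominator

def region_dice(prediction, truth):
    """Dice for ET, TC, and WT given integer label maps of equal shape."""
    return {
        name: dice(np.isin(prediction, labels), np.isin(truth, labels))
        for name, labels in REGIONS.items()
    }
```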

3.2 Training phase and Evaluation

We split the training data into two sets: training (1,000 patients) and local validation (251 patients). This lets us evaluate our customizations locally and see their impact before submitting to the challenge. Many of our additional modifications showed a negative impact or no impact locally, so they were not used in the final method. These include, but are not limited to, the 2D segmentation model TransUNet [9], an ensemble of our 3D model and the 2D TransUNet, and gamma correction TTA.

To see which architecture or ensemble shows the best performance, we computed metrics locally and/or on the BraTS 2021 validation set. On local validation, TransBTS trained for 500 epochs, for instance, achieves Dice of 0.56284, 0.85408, and 0.8382 for NCR, ED, and ET, respectively, and HD of 21.858, 4.3625, and 3.6681. The model with our modifications, trained for the same number of epochs, achieves Dice of 0.7737, 0.8444, and 0.8424 and HD of 5.017, 4.57, and 3.125. A comparison of our solution with other models is displayed in Table 1. We trained nnU-Net with the default configuration as a reference for our proposed model, and we also tested the ensemble of our method together with nnU-Net.

For postprocessing, we applied different hyperparameters and evaluated them on local validation. The best configurations were also evaluated on the leaderboard validation data; refer to Table 3. We tested several TTA combinations on the local validation set to determine the best fit; see results in Table 2.
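Flip-based TTA of the kind compared in Table 2 can be sketched as follows. The sketch assumes one flip per spatial axis (the "W/H/D flips" setting), that the unaugmented prediction is included in the average, and that predictions are averaged with equal weight; `predict_fn` is a hypothetical stand-in for the model.

```python
import numpy as np

def flip_tta(volume, predict_fn, axes=(1, 2, 3)):
    """Average predictions over the original volume and one flip per spatial axis.

    volume: (C, H, W, D) input; predict_fn returns (K, H, W, D) probabilities.
    Each flipped prediction is flipped back before averaging.
    """
    outputs = [predict_fn(volume)]
    for axis in axes:
        flipped_input = np.flip(volume, axis=axis)
        flipped_output = predict_fn(flipped_input)
        outputs.append(np.flip(flipped_output, axis=axis))
    return np.mean(outputs, axis=0)
```

Flipping the prediction back before averaging is what keeps the outputs voxel-aligned; omitting that step would average mirrored segmentations.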


| Method | Dice ET | Dice TC | Dice WT | HD ET | HD TC | HD WT |
|---|---|---|---|---|---|---|
| Our TransBTS 500 epochs | 0.78676 | 0.82102 | 0.89721 | 19.826 | 15.1904 | 6.725 |
| Our TransBTS 700 epochs | 0.81912 | 0.82491 | 0.90083 | 15.858 | 16.7709 | 5.8139 |
| nnU-Net default | 0.81536 | 0.8780 | 0.92505 | 21.288 | 7.76043 | 3.6352 |
| nnU-Net + our TransBTS | 0.84565 | 0.87201 | 0.92394 | 17.364 | 7.78478 | 3.6339 |
| Our TransBTS inside nnU-Net | 0.79818 | 0.86844 | 0.9244 | 24.8750 | 7.7489 | 3.6186 |
| Our Final Solution | 0.8496 | 0.86976 | 0.9256 | 15.723 | 11.0572 | 3.3743 |

Table 1: Comparison of different methods on the BraTS 2021 validation set. Dice is computed per class; HD corresponds to Hausdorff Distance. 'nnU-Net + our TransBTS' stands for an ensemble (averaging class probabilities) of both models trained separately (ours for 700 epochs). 'Our TransBTS inside nnU-Net' is the proposed model based on TransBTS wrapped in the nnU-Net training pipeline. 'Our Final Solution' is an ensemble of three models: default nnU-Net, nnU-Net trained with custom TransBTS, and our modified TransBTS.


| TTA Type | Dice NCR | Dice ED | Dice ET |
|---|---|---|---|
| w/o TTA | 0.75079 | 0.84068 | 0.82727 |
| All flips & Rotation | 0.72535 | 0.83284 | 0.81143 |
| W/H/D flips & Rotation & Gamma | 0.74752 | 0.84871 | 0.82896 |
| W/H/D flips | 0.75201 | 0.84297 | 0.82917 |
| W/H/D flips & Rotation | 0.75230 | 0.84518 | 0.83033 |

Table 2: Comparison of TTA techniques on the local validation set using our modified TransBTS trained for 500 epochs with self-supervised pretraining. Dice is computed per class. W/H/D signifies that three flips were used separately, one dimension flip per TTA component.


| Postprocessing Type | Dice ET | Dice TC | Dice WT | HD ET | HD TC | HD WT |
|---|---|---|---|---|---|---|
| Original | 0.78676 | 0.82102 | 0.89721 | 19.82 | 15.19 | 6.72 |
| Replacing | 0.81515 | 0.81868 | 0.89403 | 17.74 | 19.84 | 6.95 |
| C.C + Replacing | 0.81517 | 0.81869 | 0.89406 | 17.75 | 19.84 | 6.96 |
| C.C per class + Replacing | 0.81514 | 0.81868 | 0.89403 | 17.76 | 19.89 | 6.97 |

Table 3: Comparison of postprocessing techniques on the challenge validation set. HD stands for the Hausdorff Distance metric, while C.C stands for connected components.

We show some qualitative results of our segmentation predictions in Figure 4.

Figure 4: Qualitative local validation set results. Upper row represents ground truth masks while lower row contains predictions from our custom TransBTS model trained for 700 epochs.

3.3 GPU Resources

We implemented our methods in PyTorch [12] and trained them on a single NVIDIA GeForce RTX 3090 GPU with 24GB of graphics RAM. At inference time we use one NVIDIA GeForce RTX 2080 Ti GPU with 11GB. For the model input to fit within these memory constraints, we had to decrease the volume size, so we split the whole 3D sample of 240x240x155 voxels into 8 smaller overlapping sub-volumes of 128x128x128 voxels and merge the outputs into one prediction.

4 Discussion and Conclusion

Our proposed solution to the BraTS 2021 challenge aggregates predictions from several models and achieves better performance than any of those methods separately. We selected a CNN combined with a Vision Transformer (based on TransBTS), customized its architecture, and incorporated it into the nnU-Net pipeline. This model was used in an ensemble with the default nnU-Net and our standalone modified TransBTS. There is still room for improvement in our method, and we discuss some ideas below.

In our solution, data augmentations were not explored in depth. This creates an opportunity to improve the current results for every model in the proposed ensemble. The easiest way to approach this would be to use the augmentations described by the winners of last year's challenge [7].

We also suggest experimenting with computing a Hausdorff loss during training and optimizing it alongside Dice and cross-entropy. This should improve the Hausdorff distance metric and possibly the overall Dice performance as well. However, this loss is very time-consuming and is usually implemented on CPU, so we recommend using the version based on applying morphological erosion to the difference between the true and estimated segmentation maps [10], which saves computation.

Knowledge of how the labels combine into whole tumor and tumor core could also be used during training; one could perhaps even train a separate model on the most challenging class.


The authors thank Avenga, Eleks, and Ukrainian Catholic University for providing the necessary computing resources. We also express gratitude to Marko Kostiv and Dmytro Fishman for their help and support in the last week of the competition.


  • [1] U. Baid et al. (2021) The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314.
  • [2] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, et al. (2017) Segmentation labels for the pre-operative scans of the TCGA-GBM collection. The Cancer Imaging Archive.
  • [3] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, et al. (2017) Segmentation labels for the pre-operative scans of the TCGA-LGG collection. The Cancer Imaging Archive.
  • [4] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, et al. (2017) Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Nature Scientific Data 4 (1).
  • [5] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [6] F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, and K. H. Maier-Hein (2020) nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18 (2), pp. 203–211.
  • [7] F. Isensee, P. F. Jäger, P. M. Full, P. Vollmuth, and K. H. Maier-Hein (2021) nnU-Net for brain tumor segmentation. pp. 118–132.
  • [8] F. Isensee, J. Petersen, A. Klein, D. Zimmerer, P. F. Jaeger, S. Kohl, J. Wasserthal, G. Koehler, T. Norajitra, S. Wirkert, and K. H. Maier-Hein (2019) Abstract: nnU-Net: self-adapting framework for U-Net-based medical image segmentation. In Informatik aktuell, pp. 22–22.
  • [9] C. Jieneng, L. Yongyi, Y. Qihang, L. Xiangde, A. Ehsan, W. Yan, L. Le, Y. A. L., and Z. Yuyin (2021) TransUNet: transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306.
  • [10] D. Karimi and S. E. Salcudean (2020) Reducing the Hausdorff distance in medical image segmentation with convolutional neural networks. IEEE Transactions on Medical Imaging 39 (2), pp. 499–513.
  • [11] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, et al. (2015) The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging 34 (10), pp. 1993–2024.
  • [12] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. pp. 8024–8035.
  • [13] W. Wang, C. Chen, M. Ding, J. Li, H. Yu, and S. Zha (2021) TransBTS: multimodal brain tumor segmentation using transformer. In International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI).
  • [14] Y. Wang, Y. Zhang, F. Hou, Y. Liu, J. Tian, C. Zhong, Y. Zhang, and Z. He (2020) Modality-pairing learning for brain tumor segmentation. arXiv:2010.09277.