Transformers, the default model of choice in natural language processing, have drawn scant attention from the medical imaging community. Given their ability to exploit long-term dependencies, transformers are promising candidates to help typical convolutional neural networks (convnets) overcome their inherent shortcomings of spatial inductive bias. However, most recently proposed transformer-based segmentation approaches simply treat transformers as auxiliary modules that help encode global context into convolutional representations, without investigating how to optimally combine self-attention (i.e., the core of transformers) with convolution. To address this issue, in this paper, we introduce nnFormer (i.e., Not-aNother transFormer), a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution. In practice, nnFormer learns volumetric representations from 3D local volumes. Compared to the naive voxel-level self-attention implementation, such volume-based operations help to reduce the computational complexity by approximately 98% and 99.5% on Synapse and ACDC datasets, respectively. In comparison to prior-art network configurations, nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC. For instance, nnFormer outperforms Swin-UNet by over 7 percentage points on Synapse. Even when compared to nnUNet, currently the best performing fully-convolutional medical segmentation network, nnFormer still provides slightly better performance on Synapse and ACDC.
Transformers, which are the de-facto choice for natural language processing (NLP) problems, have recently been widely exploited in vision-based applications [8, 21]. The core idea is to apply the self-attention mechanism to capture long-range dependencies. Compared to convnets (i.e., convolutional neural networks), transformers relax the inductive bias of locality, making them more capable of dealing with non-local interactions. It has also been shown that the prediction errors of transformers are more consistent with those of humans than the errors of convnets.
Given these advantages of transformers over convnets, a number of approaches have tried to apply transformers to the field of medical image analysis. TransUNet was the first to explore the potential of transformers in the context of medical image segmentation. The overall architecture of TransUNet is similar to that of U-Net, where convnets act as feature extractors and transformers help encode the global context. In fact, a major feature of TransUNet and most of its followers [43, 33, 4, 5] is to treat convnets as the main body, on top of which transformers are further applied to capture long-term dependencies. However, this design may cause a problem: the advantages of transformers are not fully exploited. In other words, we believe one- or two-layer transformers are not enough to entangle long-term dependencies with convolutional representations, which often contain precise spatial information and provide hierarchical concepts.
Karimi et al. were the first to introduce a convolution-free segmentation model by forwarding flattened image representations to transformers, whose outputs are then reorganized into 3D tensors to align with segmentation masks. Recently, Swin Transformer showed that, by borrowing the feature pyramids used in convnets, transformers can learn hierarchical object concepts at different scales by applying appropriate down-sampling to feature maps. Inspired by this idea, Swin-UNet utilized hierarchical transformer blocks to construct the encoder and decoder within a U-Net like architecture, based on which DS-TransUNet added one more encoder to accept different-sized inputs. Both Swin-UNet and DS-TransUNet have achieved consistent improvements over TransUNet. Nonetheless, they did not explore how to appropriately combine convolution and self-attention for building an optimal medical segmentation network.
The main contribution of nnFormer (i.e., not-another transFormer) is its hybrid stem, where convolution and self-attention are interleaved to give full play to their strengths. Figure 1 presents the effects of different components used in the encoder of nnFormer. Firstly, we put a light-weight convolutional embedding layer ahead of the transformer blocks. In comparison to directly flattening raw pixels and applying 1D pre-processing, the convolutional embedding layer encodes precise (i.e., pixel-level) spatial information and provides low-level yet high-resolution 3D features. After the embedding block, transformer and convolutional down-sampling blocks are interleaved to fully entangle long-term dependencies with high-level and hierarchical object concepts at various scales, which helps improve the generalization ability and robustness of the learned representations.
The other contribution of nnFormer lies in proposing a computationally efficient way to capture inter-slice dependencies. To be specific, nnFormer introduces volume-based multi-head self-attention (V-MSA) to learn representations on 3D local volumes, which are then aggregated to produce whole-volume predictions. Compared to the naive multi-head self-attention (MSA), V-MSA is able to reduce the computational complexity by about 98% and 99.5% in transformer blocks on Synapse and ACDC datasets, respectively.
In the experiment section, we compare nnFormer with a wide range of baseline segmentation approaches. The proposed nnFormer surpasses Swin-UNet by over 7 percentage points in the task of multi-organ segmentation on Synapse. When performing automated cardiac diagnosis on the ACDC dataset, nnFormer outperforms Swin-UNet by nearly 2 percentage points on average. Considering that the average dice score on ACDC is already over 90 percent, we believe the 2-point improvement on ACDC is as impressive as the 7-point improvement on Synapse.
In this section, we mainly review methodologies that resort to transformers to improve segmentation results of medical images. Since most of them employ hybrid architectures of convolution and self-attention, we divide them into two categories based on whether the majority of the stem is convolutional or transformer-based.
Convolution-based stem. TransUNet was the first to apply transformers to improve the segmentation results of medical images. TransUNet treats the convnet as a feature extractor to generate a feature map for the input slice. Patch embedding is then applied to patches of feature maps in the bottleneck instead of to raw images as in ViT. Concurrently, similar to TransUNet, Li et al. proposed to use a squeezed attention block to regularize the self-attention modules of transformers and an expansion block to learn diversified representations for fundus images, both of which are implemented in the bottleneck within convnets. TransFuse introduced a BiFusion module to fuse features from the shallow convnet-based encoder and the transformer-based segmentation network to make final predictions on 2D images. Compared to TransUNet, TransFuse mainly applied the self-attention mechanism to the input embedding layer to improve segmentation models on 2D images. Yun et al. employed transformers to incorporate spectral information, which is entangled with the spectral information encoded by convolutional features to address the problem of hyperspectral pathology. Xu et al. extensively studied the trade-off between transformers and convnets and proposed a more efficient encoder named LeViT-UNet. Li et al. presented a new up-sampling approach and incorporated it into the decoder of UNet to model long-term dependencies and global information for better reconstruction results. TransClaw U-Net utilized transformers in UNet with more convolutional feature pyramids. TransAttUNet explored the feasibility of combining transformer self-attention with convolutional global spatial attention. Xie et al. adopted transformers to capture long-term dependencies of multi-scale convolutional features from different layers of convnets. TransBTS first utilized 3D convnets to extract volumetric spatial features and down-sample the input 3D images to produce hierarchical representations. The outputs of the encoder in TransBTS are then reshaped into a vector (i.e., token) and fed into transformers for global feature modeling, after which an ordinary convolutional decoder is appended to up-sample feature maps for reconstruction. Different from these approaches, which directly employ convnets as feature extractors, our nnFormer functionally relies on both convolutional and transformer-based blocks, which are interleaved to take advantage of each other.
Transformer-based stem. Valanarasu et al. proposed a gated axial-attention model (i.e., MedT), which extends existing convnet architectures by introducing an additional control mechanism in the self-attention. Karimi et al. removed the convolutional operations and built a 3D segmentation model based on transformers. The main idea is to first split the local volume block into 3D patches, which are then flattened and embedded into 1D sequences and passed to a ViT-like backbone to extract representations. Swin-UNet built a U-shaped transformer-based segmentation model on top of Swin Transformer blocks, where observable improvements were achieved. DS-TransUNet further extended Swin-UNet by adding one more encoder to handle multi-scale inputs and introduced a fusion module to effectively establish global dependencies between features of different scales through the self-attention mechanism. Compared to these transformer-based stems, nnFormer inherits the superiority of convolution in encoding precise spatial information and producing hierarchical representations that help model object concepts at various scales.
The overall architecture of nnFormer is presented in Figure 2. It maintains a U shape similar to that of U-Net and mainly consists of two branches, i.e., the encoder and the decoder. Concretely, the encoder involves one embedding block, seven transformer blocks and three down-sampling blocks. Symmetrically, the decoder branch includes seven transformer blocks, three up-sampling blocks and one patch expanding block for making final predictions. Inspired by U-Net, we add long residual connections between corresponding feature pyramids of the encoder and decoder in a symmetrical manner, which helps to recover fine-grained details in the prediction.
The input of nnFormer is a 3D patch $\mathcal{X} \in \mathbb{R}^{H \times W \times D}$ (usually randomly cropped from the original image), where $H$, $W$ and $D$ denote the height, width and depth of each input patch, respectively.
Embedding block. The embedding block is responsible for transforming each input scan into a high-dimensional tensor $\mathcal{X}_e \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times \frac{D}{2} \times C}$, where $\frac{H}{4} \times \frac{W}{4} \times \frac{D}{2}$ represents the number of patch tokens and $C$ represents the sequence length. In practice, we set $C$ to 192 and 96 on Synapse and ACDC datasets, respectively. Different from ViT and Swin Transformer, which use large convolutional kernels in the embedding block to extract features, we found that applying successive convolutional layers with small convolutional kernels brings more benefits in the initial stage. This can be explained from two perspectives: i) why we apply successive convolutional layers, and ii) why we use small-sized kernels. For i), we use convolutional layers in the embedding block because they encode pixel-level spatial information more precisely than the patch-wise positional encoding used in transformers. For ii), compared to large-sized kernels, stacked small kernels help reduce computational complexity while providing an equal-sized receptive field. As shown in Figure 2b, the embedding block consists of four convolutional layers whose kernel size is 3. After each convolutional layer (except the last one), one GELU layer and one layer normalization (i.e., LayerNorm) layer are appended. In practice, the strides of convolution in the embedding block may vary depending on the size of the input patch.
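The receptive-field argument for small kernels can be checked with a few lines of arithmetic. The sketch below is only illustrative: the channel width, the stride-1 setting and the 5x5x5 comparison kernel are assumptions chosen for the example, not the configuration reported in the paper.

```python
def stacked_receptive_field(kernels, strides):
    """Receptive field along one axis of a stack of convolutional layers."""
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= s             # strides compound the spacing between outputs
    return rf


def conv3d_weights(kernel, c_in, c_out):
    """Number of weights in a 3D convolution with a cubic kernel (bias ignored)."""
    return kernel ** 3 * c_in * c_out


# Two stacked 3x3x3 convolutions cover the same extent as one 5x5x5 convolution.
rf_small = stacked_receptive_field([3, 3], [1, 1])
rf_large = stacked_receptive_field([5], [1])

# For the same (hypothetical) channel width, the stacked version needs fewer weights.
C = 96
w_small = 2 * conv3d_weights(3, C, C)
w_large = conv3d_weights(5, C, C)
```

With these assumed numbers, both variants reach a receptive field of 5 per axis, while the stacked 3x3x3 layers use 54 weights per channel pair instead of 125.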
Transformer block. After the embedding block, we pass the high-dimensional tensor to interleaved transformer blocks. The main idea is to fully entangle the captured long-term dependencies with the hierarchical object concepts at various scales provided by the subsequent down-sampling convolutions and with the high-resolution spatial information encoded by the initial embedding block. Like Swin Transformer, we conduct self-attention in a hierarchical way, but we compute it within 3D local volumes (i.e., V-MSA, volume-based multi-head self-attention) instead of 2D local windows.
Suppose $\mathcal{X} \in \mathbb{R}^{L \times C}$ is the input of the transformer blocks. It is first reshaped to $\hat{\mathcal{X}} \in \mathbb{R}^{N_V \times N_T \times C}$, where $N_V$ is the number of 3D local volumes and $N_T = S_H \times S_W \times S_D$ denotes the number of patch tokens in each volume. $\{S_H, S_W, S_D\}$ stands for the size of the local volume. In nnFormer, to adapt to the various shapes of MRI/CT scans, we design $\{S_H, S_W, S_D\}$ so that a single local volume covers all patch tokens of the output of the last transformer block in the encoder. The intuition is that it may not be desirable to pad the data by brute force in order to satisfy a fixed $\{S_H, S_W, S_D\}$. Thus, the size of the input cropped patch needs to be adaptively adjusted to accord with the size of the local volumes. In practice, $\{S_H, S_W, S_D\}$ is set separately for Synapse and ACDC.
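To make the reshaping into local volumes concrete, here is a minimal sketch that groups the token coordinates of a 3D grid into non-overlapping volumes. The grid and volume sizes are hypothetical, and the helper name `partition_into_volumes` is introduced only for illustration.

```python
from itertools import product


def partition_into_volumes(grid, vol):
    """Group token coordinates of an (h, w, d) grid into non-overlapping local volumes.

    Returns a dict mapping a volume index (i, j, k) to the token coordinates it holds.
    Each grid side must be divisible by the matching volume side, so no padding is needed.
    """
    if any(g % v for g, v in zip(grid, vol)):
        raise ValueError("grid must be divisible by the local volume size")
    volumes = {}
    for coord in product(*(range(g) for g in grid)):
        key = tuple(c // v for c, v in zip(coord, vol))
        volumes.setdefault(key, []).append(coord)
    return volumes


grid = (8, 8, 4)  # hypothetical token grid
vol = (4, 4, 2)   # hypothetical local volume size
volumes = partition_into_volumes(grid, vol)
n_volumes = len(volumes)                     # number of local volumes
tokens_per_volume = len(volumes[(0, 0, 0)])  # tokens inside each volume
```

Raising an error on non-divisible sizes mirrors the choice described in the text: the crop size is adjusted to fit the local volume instead of padding the data.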
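Following the design of Swin Transformer, we stack two successive transformer blocks, where the main difference is that our computation is built on top of 3D volumes instead of 2D windows. The computational procedure can be summarized as follows:

$$
\begin{aligned}
\hat{z}^{l} &= \text{V-MSA}(\text{LN}(z^{l-1})) + z^{l-1}, \\
z^{l} &= \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \text{SV-MSA}(\text{LN}(z^{l})) + z^{l}, \\
z^{l+1} &= \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}.
\end{aligned}
$$

Here, $l$ stands for the layer index and LN denotes layer normalization. MLP is an abbreviation for multi-layer perceptron. V-MSA and SV-MSA denote the volume-based multi-head self-attention and its shifted version. The computational complexity of V-MSA on a volume of $h \times w \times d$ patches is:

$$
\Omega(\text{V-MSA}) = 4hwdC^{2} + 2 S_H S_W S_D \, hwd \, C,
$$

where $\{S_H, S_W, S_D\}$ denotes the size of the local volume.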
V-MSA reduces the computational complexity by approximately 98% and 99.5% on Synapse and ACDC datasets, respectively.
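The size of this saving can be estimated with the usual cost accounting for attention: a projection term that is linear in the number of tokens, plus an attention term that is quadratic for global MSA but only linear for volume-based attention. The sketch below assumes that accounting; the token grid, channel width and volume size are hypothetical, so the resulting number is not meant to reproduce the figures reported for Synapse or ACDC.

```python
def msa_cost(h, w, d, C):
    """Cost of global multi-head self-attention over h*w*d tokens with C channels."""
    n = h * w * d
    return 4 * n * C ** 2 + 2 * n ** 2 * C


def vmsa_cost(h, w, d, C, vol):
    """Cost of self-attention restricted to local volumes of size vol."""
    n = h * w * d
    s = vol[0] * vol[1] * vol[2]  # tokens attended to by each query
    return 4 * n * C ** 2 + 2 * s * n * C


def reduction(h, w, d, C, vol):
    """Fraction of the global MSA cost that volume-based attention saves."""
    return 1 - vmsa_cost(h, w, d, C, vol) / msa_cost(h, w, d, C)


# Hypothetical setting: a 32x32x32 token grid, 192 channels, 4x4x4 local volumes.
saving = reduction(32, 32, 32, 192, (4, 4, 4))
```

When the local volume spans the whole grid, both functions return the same value, which is a quick sanity check on the two formulas.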
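SV-MSA displaces the 3D local volume used in V-MSA by $\left(\lfloor \frac{S_H}{2} \rfloor, \lfloor \frac{S_W}{2} \rfloor, \lfloor \frac{S_D}{2} \rfloor\right)$, where $\{S_H, S_W, S_D\}$ is the size of the local volume, to introduce more interactions between different local volumes.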
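The effect of the shift can be illustrated on token coordinates alone. The sketch below assumes a cyclic shift by half the volume size along each axis, as is common in window-based transformers; whether nnFormer implements the displacement exactly this way is an assumption, and the grid and volume sizes are hypothetical.

```python
def plain_volume_index(coord, vol):
    """Index of the local volume a token falls into without any shift."""
    return tuple(c // v for c, v in zip(coord, vol))


def shifted_volume_index(coord, grid, vol):
    """Index of the local volume after cyclically shifting the grid by vol // 2."""
    return tuple(((c - v // 2) % g) // v for c, g, v in zip(coord, grid, vol))


grid, vol = (8, 8, 4), (4, 4, 2)
a, b = (3, 3, 1), (4, 4, 2)  # neighbouring tokens on opposite sides of a volume boundary

before = (plain_volume_index(a, vol), plain_volume_index(b, vol))
after = (shifted_volume_index(a, grid, vol), shifted_volume_index(b, grid, vol))
```

In this example the two tokens sit in different volumes before the shift and in the same volume after it, so they can only attend to each other in the shifted block.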
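The query-key-value (QKV) attention in each 3D local volume can be computed as follows:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}} + B\right)V,
$$

where $Q, K, V \in \mathbb{R}^{N_T \times d_k}$ denote the query, key and value matrices, $N_T$ is the number of patch tokens in each local volume and $d_k$ is the dimension of the query and key. $B \in \mathbb{R}^{N_T \times N_T}$ is the relative position encoding. In practice, we first initialize a smaller-sized position matrix $\hat{B} \in \mathbb{R}^{(2S_H-1) \times (2S_W-1) \times (2S_D-1)}$, where $\{S_H, S_W, S_D\}$ is the size of the local volume, and take corresponding values from $\hat{B}$ to build the larger position matrix $B$.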
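One way to build the larger position matrix from the smaller one is to precompute, for every query-key pair inside a local volume, the flat index of its relative offset in the small table. The following sketch shows that lookup; the function name and the row-major flattening order are assumptions made for illustration.

```python
from itertools import product


def relative_position_index(vol):
    """Map each (query, key) token pair in a local volume to an entry of the small bias table."""
    sh, sw, sd = vol
    coords = list(product(range(sh), range(sw), range(sd)))
    index = []
    for q in coords:
        row = []
        for k in coords:
            # Offsets lie in [-(s - 1), s - 1]; shift them so they start at 0.
            dh = q[0] - k[0] + sh - 1
            dw = q[1] - k[1] + sw - 1
            dd = q[2] - k[2] + sd - 1
            # Flatten the 3D offset into one index of the small table.
            row.append((dh * (2 * sw - 1) + dw) * (2 * sd - 1) + dd)
        index.append(row)
    return index


vol = (2, 2, 2)                 # hypothetical local volume size
index = relative_position_index(vol)
table_size = (2 * 2 - 1) ** 3   # entries in the small position table
```

The small table holds one learnable value per relative offset, so token pairs with the same offset share the same bias regardless of where they sit in the volume.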
It is worth noting that the last transformer block of the encoder (the bottom-most block in Figure 2) only employs V-MSA, as we found that introducing SV-MSA there deteriorates the overall segmentation results.
Down-sampling block. We found that by replacing the neighboring concatenation operation used in prior work with direct strided convolution, nnFormer provides more improvements to volumetric segmentation. The intuition is that convolutional down-sampling produces hierarchical representations that help model object concepts at multiple scales. As displayed in Figure 2b, in most cases, the down-sampling block involves a strided convolution operation where the stride is set to 2 in all dimensions. However, in practice, the stride with respect to a specific dimension (refer to Table 1b) is set to 1 when the number of slices is limited in this dimension, since over-down-sampling (i.e., a large stride) might be harmful.
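The effect of a per-axis stride on the feature-map size follows from the standard convolution output-size formula. In the sketch below, the kernel is set equal to the stride with no padding; that choice, together with the input shape, is an assumption for illustration only.

```python
def conv_output_shape(shape, kernel, stride, padding=(0, 0, 0)):
    """Spatial output shape of a 3D convolution, computed axis by axis."""
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(shape, kernel, stride, padding)
    )


tokens = (32, 32, 8)  # hypothetical feature-map size with few slices on the last axis

# Stride 2 on every axis halves all three dimensions.
halved = conv_output_shape(tokens, kernel=(2, 2, 2), stride=(2, 2, 2))

# Stride 1 on the last axis keeps the limited number of slices intact.
kept_depth = conv_output_shape(tokens, kernel=(2, 2, 1), stride=(2, 2, 1))
```

Keeping the stride at 1 along the thin axis prevents the slice dimension from collapsing after a few down-sampling stages.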
The architectures of the transformer blocks in the decoder are highly symmetrical to those of the encoder. In contrast to the down-sampling blocks, we employ strided deconvolution to up-sample low-resolution feature maps to high-resolution ones, which in turn are merged with representations from the encoder via long-range residual connections to capture both semantic and fine-grained information. Similar to the up-sampling blocks, the last patch expanding block also uses a deconvolutional operation to produce final predictions.
To fairly compare nnFormer with previous transformer-based architectures, we conduct experiments on the Synapse and Automatic Cardiac Diagnosis Challenge (ACDC) datasets. We repeat each experiment three times and report the average results.
Synapse for multi-organ CT segmentation. This dataset includes 30 cases of abdominal CT scans. Following the split used in prior work, 18 cases are used to build the training set while the remaining 12 cases are used for testing. We report the model performance evaluated with the average Dice Similarity Coefficient (DSC) on 8 abdominal organs, which are aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas and stomach.
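For readers unfamiliar with the metric, the Dice Similarity Coefficient for one label is twice the overlap between prediction and ground truth divided by the sum of their sizes. The sketch below computes it on flat label sequences; returning 1.0 when a label is absent from both inputs is a convention assumed here, not one stated in the paper.

```python
def dice_score(pred, target, label):
    """Dice similarity coefficient for one label over two equally long label sequences."""
    p = [x == label for x in pred]
    t = [x == label for x in target]
    intersection = sum(a and b for a, b in zip(p, t))
    denom = sum(p) + sum(t)
    # Assumed convention: a label absent from both inputs counts as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


def mean_dice(pred, target, labels):
    """Average Dice over a list of foreground labels."""
    return sum(dice_score(pred, target, l) for l in labels) / len(labels)


# Toy example with background (0) and two foreground labels (1 and 2).
pred = [0, 1, 1, 2, 2, 2, 0, 0]
target = [0, 1, 1, 1, 2, 2, 2, 0]
average = mean_dice(pred, target, labels=[1, 2])
```

In the toy example, label 1 scores 0.8 and label 2 scores about 0.667, so the average over the two foreground labels is about 0.733.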
ACDC involves 100 patients, each of whom is labeled with three structures to be segmented: the cavity of the left ventricle (LV), the cavity of the right ventricle (RV) and the myocardium of the left ventricle (MYO). The dataset is split into 70 training samples, 10 validation samples and 20 testing samples.
We run all experiments with Python 3.6, PyTorch 1.8.1 and Ubuntu 18.04. All training procedures have been performed on a single NVIDIA 2080 GPU with 11GB memory. The initial learning rate is set to 0.01 and we employ a “poly” decay strategy as described in Equation 5. The default optimizer is SGD, where we set the momentum to 0.99. The weight decay is set to 3e-5. We utilize both cross entropy loss and dice loss by simply summing them up. The number of training epochs (i.e., max_epoch in Equation 5) is 1000 and one epoch contains 250 iterations. The numbers of heads of multi-head self-attention used in different encoder stages are [6, 12, 24, 48] and [3, 6, 12, 24] on Synapse and ACDC, respectively.
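A "poly" schedule typically scales the initial learning rate by a power of the remaining training fraction. The sketch below uses the initial learning rate and epoch count given above; the exponent of 0.9 is an assumption borrowed from common practice and is not stated in this text.

```python
def poly_lr(epoch, max_epoch=1000, initial_lr=0.01, exponent=0.9):
    """Polynomially decayed learning rate for a given epoch.

    The exponent is an assumed value, not taken from the text.
    """
    return initial_lr * (1 - epoch / max_epoch) ** exponent


# Learning rate at the start, the midpoint and the end of training.
schedule = [poly_lr(e) for e in (0, 500, 1000)]
```

The rate starts at 0.01, decays smoothly and reaches zero at the final epoch.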
Pre-processing and augmentation strategies. All images are first resampled to the same target spacing. Augmentations such as rotation, scaling, Gaussian noise, Gaussian blur, brightness and contrast adjustment, simulation of low resolution, gamma augmentation and mirroring are applied in the given order during the training process.
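Deep supervision. We also add deep supervision during the training stage. Specifically, the output of each stage in the decoder is passed to the final expanding block, where cross entropy loss and dice loss are applied. In practice, given the prediction of one stage, we down-sample the ground truth segmentation mask to match the prediction’s resolution. Thus, the final training objective function is the sum of all losses at three resolutions:

$$
\mathcal{L}_{all} = \alpha_1 \mathcal{L}_1 + \alpha_2 \mathcal{L}_2 + \alpha_3 \mathcal{L}_3,
$$

where $\mathcal{L}_i$ is the sum of cross entropy loss and dice loss at the $i$-th resolution, ordered from the highest resolution to the lowest. Here, $\alpha_1, \alpha_2, \alpha_3$ denote the magnitude factors for losses at different resolutions. In practice, the factors halve with each decrease in resolution, leading to $\alpha_2 = \frac{1}{2}\alpha_1$ and $\alpha_3 = \frac{1}{4}\alpha_1$. Finally, all weight factors are normalized so that they sum to 1.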
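The weighting scheme can be written out in a few lines. This sketch assumes that "normalized to 1" means the weights are rescaled to sum to one, and it uses placeholder loss values instead of real cross entropy and dice terms.

```python
def deep_supervision_weights(num_resolutions=3):
    """Weights that halve with each lower resolution and are rescaled to sum to 1."""
    raw = [0.5 ** i for i in range(num_resolutions)]
    total = sum(raw)
    return [w / total for w in raw]


def total_loss(stage_losses):
    """Weighted sum of per-resolution losses, ordered from highest to lowest resolution."""
    weights = deep_supervision_weights(len(stage_losses))
    return sum(w * l for w, l in zip(weights, stage_losses))


weights = deep_supervision_weights(3)
# Placeholder per-resolution losses (each would be cross entropy plus dice in training).
loss = total_loss([0.7, 0.9, 1.4])
```

For three resolutions, the weights come out as 4/7, 2/7 and 1/7, so the full-resolution prediction dominates the objective.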
Pre-trained model weights. Pre-training can be vastly important for providing generalized and transferable representations for downstream tasks. Given that most operations in nnFormer operate on 1D sequences, we explore the possibility of transferring weights pre-trained on natural images to the medical imaging field. More concretely, we aim to reap the benefit of the weights of MLP layers and QKV attention obtained from ImageNet pre-training. To this end, we align the channel numbers of our transformer blocks with those of the pre-trained models so that we can load the weights of the MLP layers and QKV attention. Besides, considering that the architectures of the encoder and decoder are highly symmetrical, we propose symmetrical initialization to reuse the pre-trained weights of the encoder in the decoder. Specifically, transformer blocks with the same input and output resolution are initialized using the same set of model weights (i.e., symmetrical transformer blocks of the encoder and decoder in Figure 2).
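Symmetrical initialization amounts to copying weights between stages that operate at the same resolution. The sketch below shows the idea with plain dictionaries; the data layout, the function name `symmetric_init` and the placeholder values are all hypothetical and do not reflect the actual nnFormer code.

```python
import copy


def symmetric_init(encoder_stages, decoder_resolutions):
    """Give each decoder stage a copy of the encoder weights at the same resolution.

    encoder_stages: list of {"resolution": tuple, "weights": dict} (hypothetical layout).
    decoder_resolutions: resolutions of the decoder stages, in decoder order.
    """
    by_resolution = {s["resolution"]: s["weights"] for s in encoder_stages}
    return [
        {"resolution": r, "weights": copy.deepcopy(by_resolution[r])}
        for r in decoder_resolutions
    ]


# Placeholder values standing in for pre-trained MLP and QKV parameters.
encoder = [
    {"resolution": (32, 32, 32), "weights": {"qkv": [0.1, 0.2], "mlp": [0.3]}},
    {"resolution": (16, 16, 16), "weights": {"qkv": [0.4, 0.5], "mlp": [0.6]}},
]
decoder = symmetric_init(encoder, [(16, 16, 16), (32, 32, 32)])
```

A deep copy is used so that the decoder starts from the same values but is updated independently of the encoder during training.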
Network configurations. In Table 1, we display network configurations of experiments on Synapse and ACDC. Compared to nnUNet, in nnFormer, better segmentation results can be achieved with smaller-sized input patches.
| Method | Average | Aorta | Gallbladder | Kidney (Left) | Kidney (Right) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|
| R50 U-Net | 74.68 | 87.74 | 63.66 | 80.60 | 78.19 | 93.74 | 56.90 | 85.87 | 74.16 |
| R50 Att-UNet | 75.57 | 55.92 | 63.91 | 79.20 | 72.71 | 93.56 | 49.37 | 87.19 | 74.95 |
| ViT None | 61.50 | 44.38 | 39.59 | 67.46 | 62.94 | 89.21 | 43.14 | 75.45 | 68.78 |
| ViT CUP | 67.86 | 70.19 | 45.10 | 74.70 | 67.40 | 91.32 | 42.00 | 81.75 | 70.44 |
| R50 ViT CUP | 71.29 | 73.73 | 55.13 | 75.80 | 72.20 | 91.51 | 45.99 | 81.99 | 73.95 |
| TransClaw U-Net | 78.09 | 85.87 | 61.38 | 84.83 | 79.36 | 94.28 | 57.65 | 87.74 | 73.55 |
| nnUNet (3D) | 86.99 | 93.01 | 71.77 | 85.57 | 88.18 | 97.23 | 83.01 | 91.86 | 85.25 |
As shown in Table 2, we conduct experiments on Synapse to compare our nnFormer against a variety of both transformer- and convnet-based baselines. The major evaluation metric is the dice score.
Apart from nnUNet, the best performing convnet-based method is DualNorm-UNet, which achieves an average dice score of 80.37. In comparison, WAD reports the best transformer-based results, with an average of 80.30, slightly lower than DualNorm-UNet. Our nnFormer is able to outperform WAD and DualNorm-UNet by over 8 and 7 percentage points on average, respectively, which are quite impressive improvements on Synapse. Besides, we found that the performance of nnUNet has been severely underestimated. When carefully tuned, nnUNet reaches an average dice score of 86.99, which is much better than DualNorm-UNet and WAD but still worse than the proposed nnFormer.
| Method | Average | RV | MYO | LV |
|---|---|---|---|---|
| R50-Attn UNet | 86.75 | 87.58 | 79.20 | 93.47 |
| nnUNet (3D) | 91.59 | 90.25 | 89.10 | 95.41 |
Table 3 presents the experimental results on ACDC, where the overall performance of transformer-based baselines is better than that of convnet-based ones. The underlying reason is that images from ACDC have far fewer slices along the depth axis (i.e., the spacing along this axis is quite large in Figure 1b), which is exactly the case where transformers have more advantages, as they are designed to deal with 2D inputs with less interaction along this axis. From Table 3, we can see that the best transformer baseline is LeViT-UNet-384s, whose average dice is slightly higher than that of Swin-UNet but much higher than that of the convnet-based Dual-Attn. In contrast, nnFormer surpasses LeViT-UNet-384s by nearly 1.5 percentage points on average, again displaying its advantages over transformer-based baselines.
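| Method | Average | Aorta | Gallbladder | Kidney (Left) | Kidney (Right) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|
| More transformer blocks | 85.98 | 89.02 | 71.74 | 86.76 | 87.06 | 96.37 | 82.30 | 89.04 | 85.51 |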
In this section, we present the significance of the embedding block and the convolutional down-sampling blocks. Besides, we also investigate the influence of adding more transformer blocks to the encoder and of employing pre-training on natural images.
Significance of the embedding block. To investigate the influence of our embedding block, we replace it with patch-wise convolution, as shown in Table 4. Patch-wise convolution [8, 21] only contains one convolutional layer that has a large kernel size and stride. For instance, on Synapse, both the kernel size and the convolutional stride are set to [4,4,2]. We can see that the successive convolutional layers of the embedding block surpass the typical patch-wise convolution by nearly 3 percentage points on average. Similar phenomena have been observed in prior work, where small-sized kernels were found to be more effective than large ones.
Influences of convolutional down-sampling blocks. In Table 5, we report the results of using neighboring concatenation in nnFormer to replace the convolutional down-sampling blocks. The main operation of neighboring concatenation is to concatenate neighboring patches of feature maps in the channel dimension, after which a fully-connected layer is added to reduce the number of channels. From Table 5, we can see that convolutional down-sampling blocks provide improvements of over 3 percentage points on average over neighboring concatenation, demonstrating that interleaved convolutional down-sampling blocks are more helpful for building hierarchical object concepts at various scales.
Investigation of adding more transformer blocks. We add one more SV-MSA block to the last encoder stage and report the results in Table 6. Generally speaking, more transformer blocks help nnFormer to encode more long-term dependencies into representations. Somewhat interestingly, Table 6 shows that capturing more long-term dependencies may not be an optimal choice for nnFormer. For instance, although introducing more transformer blocks achieves a higher segmentation dice score on gallbladder (the most difficult organ on Synapse), it deteriorates the segmentation performance on other organs. We will explore the underlying reason in the future.
Influences of using pre-trained models on natural images. In Table 7, we show that it is crucial to make use of weights pre-trained on natural images: removing pre-training deteriorates the overall segmentation performance by over 3 percentage points. The underlying reason is that Synapse does not have enough labeled 3D scans to fully realize the potential of nnFormer.
In Figures 3 and 4, we compare the segmentation results of nnUNet and nnFormer on some hard samples. On Synapse, nnFormer has quite apparent advantages on stomach, where nnUNet often fails to generate a complete delineation mask. Meanwhile, compared to nnUNet, nnFormer is able to reduce false positive predictions of spleen, which is also consistent with the performance reported in Table 2.
Similar phenomena can also be observed in Figure 4, where nnFormer greatly reduces false positive predictions of the right ventricle (RV) and myocardium (MYO), especially the myocardium. These segmentation results support the claim that nnFormer can produce more robust and discriminative representations than nnUNet.
In this paper, we present a new medical image segmentation network named nnFormer. nnFormer is constructed on top of an interleaved stem of convolution and self-attention, where convolution helps encode precise spatial information into high-resolution low-level features and build hierarchical object concepts at multiple scales. On the other hand, self-attention in transformer blocks entangles long-term dependencies with convolutional representations to capture global context. Based on this hybrid architecture, nnFormer achieves tremendous progress over previous transformer-based segmentation methodologies. Even when compared to nnUNet, currently the best performing segmentation network, nnFormer still provides consistent and observable improvements. In the future, we hope nnFormer can draw more attention from the medical imaging community and encourage efforts on developing more efficient segmentation models.