
Demystify Transformers & Convolutions in Modern Image Deep Networks

by   Jifeng Dai, et al.

The recent success of vision transformers has inspired a series of vision backbones with novel feature transformation paradigms, which report steady performance gains. Although the novel feature transformation designs are often claimed as the source of these gains, some backbones may benefit from advanced engineering techniques, which makes it hard to identify the real gain from the key feature transformation operators. In this paper, we aim to identify the real gain of popular convolution and attention operators and make an in-depth study of them. We observe that the main difference among these feature transformation modules, e.g., attention or convolution, lies in the way of spatial feature aggregation, or the so-called "spatial token mixer" (STM). Hence, we first elaborate a unified architecture to eliminate the unfair impact of different engineering techniques, and then fit STMs into this architecture for comparison. Based on various experiments on upstream/downstream tasks and the analysis of inductive bias, we find that the engineering techniques boost the performance significantly, but a performance gap still exists among different STMs. The detailed analysis also reveals some interesting findings about different STMs, such as their effective receptive fields and their behavior under invariance tests. The code and trained models will be publicly available at



1 Introduction

Vision transformers [dosovitskiy2020image] have revolutionized visual backbone models and inspired a series of deep networks with attention [vaswani2021scaling, liu2021swin, wang2021pyramid], convolution [liu2022convnet, ding2022scaling, liu2022more] and hybrid [dai2021coatnet, rao2022hornet] blocks. These networks report steady performance gains with novel feature transformation operators. However, the networks may also benefit from the development of more advanced engineering techniques such as the macro architecture design [DBLP:conf/cvpr/YuLZSZWFY22] and training recipe [liu2022convnet, liu2021swin, pmlr-v139-touvron21a]. Besides, the discrepancy of benchmark settings on upstream and downstream tasks makes some backbone networks incomparable when only the reported results are used. This raises the question: do the performance gains come from novel operator designs or from advanced engineering techniques?

Figure 1: Development of spatial token mixers. Using the original implementation or reported results, steady performance gains are obtained. When unifying the engineering techniques (overall architecture and training recipe), however, early STM such as halo attention achieves state-of-the-art performance.
Figure 2: The schematic illustration of (a) our unified block (following vanilla Transformer [vaswani2017attention] block design) and the spatial token mixer designs (top), and the original block designs (bottom) of (b) HaloNet, (c) PVT, (d) Swin Transformer, (e) ConvNeXt, and (f) InternImage.

To answer this question, we survey the recent vision backbones with novel operators [vaswani2021scaling, liu2021swin, wang2021pyramid, liu2022convnet, dong2022cswin], and observe that the difference among these operators and their claimed novelty mainly lie in the way of spatial feature aggregation, i.e., the so-called spatial token mixer (STM). We are not the only ones concerned with the real gain of STMs. A recent work, MetaFormer [DBLP:conf/cvpr/YuLZSZWFY22], finds that the transformer block design contributes a lot to the performance, and that the STM in the transformer block, i.e., the attention layer, can be replaced with a simple mixing operation such as pooling without a significant performance drop. This further motivates us to explore the real performance gain of various STMs without unfair engineering discrepancy.

To achieve this, we first construct a modern and unified architecture for comparison, where we sequentially determine the stage-level design (i.e., stem and transition layers) and block-level design (i.e., the topology of basic blocks) by a series of experiments and analyses. We find that some macro designs such as different downsampling operations may improve performances for a certain type of STM and the proper block design can also contribute a lot to performance gain. We combine the findings and beneficial designs into a unified architecture. By instantiating different STMs on this architecture, we achieve comparable or even better results than the reported performance in their original papers, as shown in Fig. 1. We consider four types of popular STMs covering both transformer- and CNN-based methods: local attention [vaswani2021scaling, liu2021swin], global attention [wang2021pyramid], depth-wise convolution [liu2022convnet] and dynamic convolution [2023intern]; see Fig. 2 for the detailed operations.

With the instantiated models, we further investigate and compare STMs from multiple aspects. First, we evaluate different models on upstream (classification) and downstream (object detection) tasks. We find that gaps exist between the performance of different STMs, but the trends differ from those reported with the original implementations, as shown in Fig. 1. Early STMs equipped with advanced design principles can obtain comparable or even better performance than recent counterparts on both upstream and downstream tasks. From our results on ImageNet-1k classification and COCO [lin2014microsoft] object detection, STMs with vision-specific inductive bias and a dynamic information modeling mechanism, such as local attention and dynamic convolution, obtain the best performance. Static convolution works better on low-capacity models (4.5M parameters), while falling behind other models at larger scales. We also find that the inter-window information transfer strategy of local-attention STMs impacts the performance significantly.

Moreover, we aim to further reveal the characteristics of different STMs beyond performance comparison. The STM design reflects the prior knowledge and restrictions on the hypothesis space, i.e., the inductive bias. We further analyze how the inductive bias affects the characteristics of learned models, e.g., locality, shift invariance, rotation invariance, and scale invariance. We find that different STMs vary significantly in the range of information they tend to capture when aggregating features. We investigate the relation between the effective receptive field (ERF) [luo2016understanding] and downstream tasks, and find that a larger ERF does not necessarily lead to better downstream performance, and that the benefits of a larger ERF may saturate when the model scales up.

For the invariance test, we find models with stronger performance also show better robustness against different variations, indicating that most invariance can be learned from data and various data augmentations. Besides, static convolution with weight-sharing and locality shows stronger translation invariance. With the flexible sampling strategy, where the features can be aggregated dynamically, dynamic convolution (DCNv3 [2023intern]) shows better rotation and scaling invariance compared with other STMs.

| STM | Type | Abbr. | Sampling Point Set | Aggregation Weight | Method |
| --- | --- | --- | --- | --- | --- |
| Halo Attention | Local Attention | Halo-Attn | Local | Dynamic | HaloNet [vaswani2021scaling] |
| Spatial Reduction Attention | Global Attention | SR-Attn | Global | Dynamic | PVT [wang2022pvt] |
| Shifted Window (Swin) Attention | Local Attention | SW-Attn | Local | Dynamic | Swin Transformer [liu2021swin] |
| Depth-Wise Convolution | Convolution | DW-Conv | Local | Static | ConvNeXt [liu2022convnet] |
| Deformable Convolution v3 | Dynamic Convolution | DCNv3 | Global | Dynamic | InternImage [2023intern] |

Table 1: Comparison of different spatial token mixer designs.

Our contributions can be summarized as follows:

  • We compare typical spatial token mixers with various experiments to demystify the characteristics of local attention, global attention, depth-wise convolution, and dynamic convolution.

  • We distill the successful engineering techniques from the modern backbone designs and form a unified architecture that improves the performance of spatial token mixers, compared with their original implementations.

  • We conduct various experiments of STMs with the unified architecture on image classification and object detection tasks and provide a detailed analysis of their performance. Experiment results show that the engineering techniques boost performance significantly while the performance gap among different STMs still exists. Also, an early STM, i.e., halo attention, can achieve state-of-the-art performance with advanced engineering techniques.

  • We analyze how inductive bias affects the characteristics of STMs. We find the effective receptive field should not be a metric to evaluate the quality of STMs, and the STMs with strong performance show relatively high invariance, but the design of STMs also contributes to the invariance.

2 Related Work

Convolution-based spatial token mixers.

Equipped with image-specific priors, i.e., sparse interactions, parameter sharing, and translation equivariance [Goodfellow-et-al-2016], convolution dominated the early vision backbones [krizhevsky2017imagenet, simonyan2014very, szegedy2015going]. In the convolution era, the development of vision backbones mainly focused on architectural aspects such as residual learning [he2016deep], dense connection [huang2017densely], grouping techniques [howard2017mobilenets, xie2017aggregated] and channel attention [hu2018squeeze]. In contrast, the spatial feature aggregation mechanisms were less studied. Deformable convolution [dai2017deformable] and non-local networks [wang2018non] propose to sample points flexibly and build long-range dependencies, which breaks the constraint of locality. After the advent of vision transformers [dosovitskiy2020image], convolution has also been fitted into more advanced engineering designs [liu2022convnet, ding2022scaling, liu2022more].

Attention-based spatial token mixers.

Ever since the transformer was introduced into vision [dosovitskiy2020image], its core operator, attention, has greatly challenged the status of convolution. Compared with standard convolution, attention features a global receptive field and dynamic spatial aggregation. However, the dense attention calculation introduces a huge computation overhead, which scales quadratically with the size of the input maps. Hence, attention operators in vision usually reduce the number of locations to attend to during attention calculation. PVT [wang2021pyramid, wang2022pvt] and Linformer [wang2020linformer] adopt global attention on the down-sampled feature maps, while Pale Transformer [wu2022pale] uses a pale operation to sample the features on the whole feature map. Learning from the locality prior of convolution, local attention is introduced [vaswani2021scaling, liu2021swin], where attention is constrained to local windows. As these local windows are non-overlapped, inter-window information transfer mechanisms have been developed, such as “haloing” [vaswani2021scaling] and shifted windows [liu2021swin].

Engineering optimization and backbone evaluation.

Instead of introducing new spatial token mixers, some works aim to identify beneficial engineering techniques. ResNet-RSB [wightman2021resnet] shows that an advanced training recipe can significantly improve performance compared with the naïve setting. ConvNeXt [liu2022convnet] modernizes a standard ResNet by adopting recent techniques in both training recipe and network architecture design. MetaFormer [DBLP:conf/cvpr/YuLZSZWFY22] finds that the abstracted architecture of the Transformer block plays a significant role in achieving competitive performance. Recently, Yu et al. [yu2022metaformer] further prove the effectiveness of the Transformer-style block design by using the most basic or common mixers to achieve strong performance. Although MetaFormer emphasizes the importance of the transformer block and shifts attention away from the STMs, we find that a performance gap exists among different STMs. As shown in Fig. 1, an advanced STM design, e.g., halo attention, with a modern network architecture outperforms other STMs with the same architecture by a large margin. Besides, we find that engineering techniques are not limited to the block-level design; we therefore conduct further ablation studies on the stage-level design and comparisons with ConvNeXt’s block design. Note that our primary goal is not to build an extremely optimized framework, but to compare and analyze different STMs under a unified setting from various aspects.

3 Spatial Token Mixer

3.1 Definition of Spatial Token Mixer

The spatial token mixer (STM) aims to aggregate spatial features around each position to mix and transfer context information. Given an input feature map $X$, we define the token mixing function for the mixing target point $x_q$ as:

$$\mathrm{STM}(x_q, X) = W_o \sum_{p \in P} w_p \, x_p,$$

where $P$ denotes the sampling point set that records the positions to attend; $w_p$ denotes the aggregation weight w.r.t. the sample point $p$, whose feature is $x_p$; and $W_o$ is the output projection matrix. Note that in practice, STM usually works with a multi-head strategy, where the input feature is projected or split into different heads.

STMs differ in the generation of the sampling point set and the aggregation weight, as shown in Table 1. In this paper, we consider four popular types of STM for comparison, i.e., depth-wise convolution, dynamic convolution, local attention, and global attention.
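The definition above can be read as a small generic routine. The following is a minimal sketch in plain Python, assuming tokens are stored as lists of floats and the output projection is an ordinary matrix; the names `stm`, `sample_points`, and `weight_fn` are illustrative and do not come from the paper.

```python
def stm(features, q, sample_points, weight_fn, w_out):
    """Mix the features at the sampled positions into target position q.

    features: dict mapping position -> feature vector (list of floats)
    sample_points: function q -> list of positions to attend (sampling point set)
    weight_fn: function (q, p) -> scalar aggregation weight
    w_out: output projection matrix, given as a list of rows
    """
    dim = len(features[q])
    mixed = [0.0] * dim
    for p in sample_points(q):
        w = weight_fn(q, p)
        for c in range(dim):
            mixed[c] += w * features[p][c]
    # Apply the output projection to the aggregated feature.
    return [sum(row[c] * mixed[c] for c in range(dim)) for row in w_out]


# Toy 1D example: 4 tokens with 2 channels, uniform weights over a 3-wide window.
feats = {i: [float(i), 1.0] for i in range(4)}
window = lambda q: [p for p in (q - 1, q, q + 1) if p in feats]
uniform = lambda q, p: 1.0 / len(window(q))
identity = [[1.0, 0.0], [0.0, 1.0]]
out = stm(feats, 1, window, uniform, identity)
```

In this reading, the four STM families differ only in what is plugged in for `sample_points` (local or global) and `weight_fn` (static or input-dependent).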

3.2 Taxonomy of Spatial Token Mixer

Depth-wise convolution.

Depth-wise convolution aggregates features in an input-independent way. The sampling point set is a series of sliding windows centered at the mixing target point, and the aggregation weight is a learnable weight matrix. Albeit simple, recent “modern” convolution networks [liu2022convnet, ding2022scaling] have shown that depth-wise convolution with an advanced architecture and training recipe can attain performance comparable to the top-performing vision transformers. Note that we adopt the depth-wise convolution used in ConvNeXt [liu2022convnet] for comparison. Specifically, we add input and output projection layers before and after the depth-wise convolution in our unified block design.
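To make the "local and static" behavior concrete, here is a minimal 1D sketch in plain Python. It is a simplification for illustration only: the models in this paper operate on 2D feature maps, and the function name, zero padding, and toy kernels are assumptions.

```python
def depthwise_conv1d(x, kernels):
    """x: list of tokens, each a list of C floats. kernels: C kernels of odd length K.

    Each channel is filtered with its own kernel (no cross-channel mixing), and the
    same kernel is reused at every position, so the weights do not depend on the input.
    Positions outside the sequence are treated as zeros.
    """
    n, channels = len(x), len(x[0])
    half = len(kernels[0]) // 2
    out = [[0.0] * channels for _ in range(n)]
    for q in range(n):
        for c in range(channels):
            for k, w in enumerate(kernels[c]):
                p = q + k - half
                if 0 <= p < n:
                    out[q][c] += w * x[p][c]
    return out


x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
kernels = [[0.0, 1.0, 0.0],   # channel 0: identity
           [1.0, 1.0, 1.0]]   # channel 1: sum over the window
y = depthwise_conv1d(x, kernels)
```

Because channels never interact inside the operator, the input and output projection layers mentioned above are what provide the channel mixing around it.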

Dynamic convolution.

Dynamic convolution is input-dependent, whose sampling point set and aggregation weight are generated conditioned on inputs. Under the context of spatial token mixing, we adopt the widely-used deformable convolution [dai2017deformable, zhu2019deformable, 2023intern]. Specifically, the sampling point set and aggregation weights are inferred from the input feature with several convolution layers. We adopt the deformable convolution v3 (DCNv3) from InternImage [2023intern] for comparison. Note that we use nine sampling points following the original implementation. The offsets and weights for the sampling points are obtained with a depth-wise convolution followed by a linear projection. Weights are normalized via a Softmax function.
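A heavily simplified 1D sketch of this sampling scheme is given below. It assumes three sampling points instead of nine, linear instead of bilinear interpolation, and takes the offsets and weight logits as arguments, whereas in DCNv3 they are predicted from the input feature. All function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_linear(signal, pos):
    """Linearly interpolate a 1D signal at a fractional position (zeros outside)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return signal[i] if 0 <= i < len(signal) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def deformable_mix(signal, q, offsets, logits):
    """Aggregate K samples around q taken at positions q + base_k + offsets[k]."""
    k = len(offsets)
    base = [i - k // 2 for i in range(k)]   # regular grid, e.g. -1, 0, 1
    weights = softmax(logits)               # aggregation weights sum to one
    return sum(w * sample_linear(signal, q + b + o)
               for w, b, o in zip(weights, base, offsets))


signal = [0.0, 1.0, 2.0, 3.0, 4.0]
# Zero offsets and equal logits reduce to a plain average over the 3-point window.
plain = deformable_mix(signal, 2, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
# Offsets of 0.5 move every sample half a step to the right.
shifted = deformable_mix(signal, 2, [0.5, 0.5, 0.5], [0.0, 0.0, 0.0])
```

With zero offsets the operator behaves like a small static window; non-zero offsets let the same number of points reach anywhere on the map, which is why the sampling point set is listed as global in Table 1.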

Global attention.

Global attention-based STM mixes features across the whole spatial domain. The aggregation weights are dynamically generated with the inner product between the mixing target point and each sampling point. Global attention enjoys a theoretically global receptive field and introduces no image-specific inductive bias, such as locality. However, it suffers from large computation overhead, which scales quadratically with increasing input resolutions. Hence, to fit into pyramid-like vision backbones, global attention STMs are optimized for efficiency by constraining the cardinality of sampling point set [wang2021pyramid, dong2022cswin, wu2022pale]. We adopt the spatial reduction attention from PVT [wang2021pyramid] for experiments. When calculating the attention, key feature maps are downsampled.
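The effect of reducing the sampling point set can be sketched as follows. This is a minimal 1D illustration in plain Python that assumes average pooling as the downsampling operator and omits the query, key, and value projections; it is not PVT's actual implementation, and the names are illustrative.

```python
import math


def avg_pool(tokens, ratio):
    """Downsample a token sequence by averaging non-overlapping groups of `ratio` tokens."""
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens), ratio):
        group = tokens[start:start + ratio]
        pooled.append([sum(t[c] for t in group) / len(group) for c in range(dim)])
    return pooled


def sr_attention(tokens, ratio):
    """Every query attends to all downsampled keys; the pooled tokens also act as values."""
    keys = avg_pool(tokens, ratio)
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * k[c] for e, k in zip(exps, keys)) for c in range(dim)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
mixed = sr_attention(tokens, ratio=2)
num_pairs_full = len(tokens) * len(tokens)                  # dense attention
num_pairs_reduced = len(tokens) * len(avg_pool(tokens, 2))  # spatially reduced keys
```

The number of query-key pairs drops by the reduction ratio, while each query still sees a summary of the whole sequence.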

Local attention.

Another line of optimizing global attention is to introduce the locality into attention, i.e., constraining the sampling point set into local window [vaswani2021scaling, liu2021swin] instead of the whole spatial domain. Due to the complexity of attention, compared with the densely-overlapped sliding windows in convolution, local attention employs non-overlapped local windows, i.e., pixels in each window share the same sampling point set. Hence, to transfer information among different windows, HaloNet [vaswani2021scaling] enlarges each window with an extra band, which looks like a “halo”. Swin Transformer [liu2021swin] shifts the windows to produce overlapped sampling points, thus transferring information among different blocks. To better investigate the designing principle of local attention STMs, we adopt both halo attention from HaloNet and shifted window attention from Swin Transformer for comparison.
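The three ways of forming the sampling point set can be compared on a 1D toy sequence. The sketch below is an illustration under simplifying assumptions: positions are 1D indices, and border windows in the shifted partition are truncated rather than cyclically wrapped and masked. Function names are hypothetical.

```python
def window_points(q, n, window):
    """Plain non-overlapped windows: q attends to every position in its own window."""
    start = (q // window) * window
    return list(range(start, min(start + window, n)))


def halo_points(q, n, window, halo):
    """Halo attention: the window of q is enlarged by `halo` positions on each side."""
    start = (q // window) * window
    return list(range(max(start - halo, 0), min(start + window + halo, n)))


def shifted_window_points(q, n, window, shift):
    """Shifted windows: the window boundaries are displaced by `shift` positions."""
    start = ((q - shift) // window) * window + shift
    return list(range(max(start, 0), min(start + window, n)))


n, window = 8, 4
plain_3 = window_points(3, n, window)               # [0, 1, 2, 3]
plain_4 = window_points(4, n, window)               # [4, 5, 6, 7]
halo_3 = halo_points(3, n, window, halo=1)          # [0, 1, 2, 3, 4]
shifted_3 = shifted_window_points(3, n, window, 2)  # [2, 3, 4, 5]
```

Positions 3 and 4 never share a window in the plain partition. Halo attention lets position 3 see position 4 within a single layer, whereas shifted windows rely on alternating the plain and shifted partitions across consecutive layers.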

Figure 3: The unified architecture design. We use overlapped downsampling for the input stem and transition layer and adopt the standard transformer [vaswani2017attention] block design.

4 Unified Architecture and Training Recipe

In this section, we introduce our unified architecture and training strategies. We sequentially ablate the architecture design on the stage level and block level. Then we instantiate different STMs discussed in Sec. 3 into the unified architecture for comparison.

4.1 Stage-Level Design

First, we determine the stage-level design of the unified architecture, i.e., the input stem and transition layers. Recent backbones [liu2021swin, liu2022convnet, wang2021pyramid] tend to perform downsampling separately in the transition layer, while the downsampling operation can be categorized into overlapped [DBLP:conf/cvpr/YuLZSZWFY22] and non-overlapped [liu2021swin, liu2022convnet] patch merging. Here, we compare the performance of these two ways.

As shown in Table 2 (left part), using overlapped downsampling improves the classification accuracy of Swin Transformer by 1.2 points (81.1% to 82.3%), and yields a minor improvement of 0.1 points for ConvNeXt. A plausible explanation is that overlapped downsampling helps to transfer information among windows for local-attention STMs, which compensates for the inherent weakness of the non-overlapped window partition: the lack of inter-window information transfer. However, for the depth-wise convolution used in ConvNeXt, only minor gains are obtained, as the sliding windows of convolution are already densely overlapped. Hence, we use overlapped convolutions with a stride of two for the input stem and the transition layers; see Fig. 3 for details.
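The difference between the two downsampling styles can be seen by listing which input positions feed each output position. The sketch below is 1D and assumes a kernel of size 2 for non-overlapped patch merging and a kernel of size 3 with padding 1 for the overlapped variant; the kernel size and padding are illustrative assumptions rather than values taken from the text.

```python
def conv_output_size(n, kernel, stride, padding):
    """Standard output-length formula for a strided convolution."""
    return (n + 2 * padding - kernel) // stride + 1


def receptive_indices(out_index, n, kernel, stride, padding):
    """Input positions that contribute to one output position (padding excluded)."""
    start = out_index * stride - padding
    return [i for i in range(start, start + kernel) if 0 <= i < n]


n = 8
patch_merge = [receptive_indices(o, n, 2, 2, 0)
               for o in range(conv_output_size(n, 2, 2, 0))]
overlapped = [receptive_indices(o, n, 3, 2, 1)
              for o in range(conv_output_size(n, 3, 2, 1))]
```

Both variants halve the resolution, but only the overlapped one lets neighboring outputs share input positions (here positions 1, 3, and 5), which is one way information can cross window boundaries.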

4.2 Block-Level Design

Block-level design defines the topology of operators, i.e., their positions and how they are connected. Before the advent of vision transformers, most methods (e.g., HaloNet [vaswani2021scaling]) follow the block design of the standard ResNet [he2016deep], which adopts a single skip connection and combines the spatial token mixing with the channel feature transformation (refer to the HaloNet block design in Fig. 2). The transformer block [vaswani2017attention, dosovitskiy2020image] separates the STM and the channel feature transformation into two residual connections, each with a normalization layer (refer to the unified block design in Fig. 2). The transformer-style block design can be traced back to the vanilla Transformer [vaswani2017attention] and ViT [dosovitskiy2020image]. This block design has been shown to be key to the success of Transformers [DBLP:conf/cvpr/YuLZSZWFY22].

Hence, we adopt a standard transformer block as shown in Fig. 2, and replace the attention with different types of STMs. As shown in the right part of Table 2, our unified block design substantially improves the performance of HaloNet (+2.5 points, from 80.5% to 83.0%), which originally uses the ResNet-style block design. However, compared with ConvNeXt, the well-optimized implementation of depth-wise convolution, using the transformer block design does not bring any benefit. This suggests that the transformer block may not necessarily be a golden rule. In our work, to keep the architecture uniform for comparison, we adopt this unified block design for all the compared STMs, including depth-wise convolution. We further compare the ConvNeXt block and the transformer block in Sec. 5.3 on different STMs.
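The two block topologies can be written down schematically. The sketch below treats every operator as a function on a flat list of floats and places normalization before each branch (pre-norm, as in ViT); both choices are assumptions for illustration, and the toy operators stand in for real layers.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]


def transformer_style_block(x, stm, ffn, norm1, norm2):
    """Two residual branches: spatial mixing first, channel transformation second."""
    x = add(x, stm(norm1(x)))
    x = add(x, ffn(norm2(x)))
    return x


def single_skip_block(x, body):
    """One residual branch whose body combines spatial mixing and channel transforms."""
    return add(x, body(x))


# Toy stand-ins for the real operators.
identity = lambda v: v
double = lambda v: [2.0 * t for t in v]
negate = lambda v: [-t for t in v]

y = transformer_style_block([1.0, 2.0], stm=double, ffn=negate,
                            norm1=identity, norm2=identity)
z = single_skip_block([1.0, 2.0], body=lambda v: negate(double(v)))
```

In the transformer-style block the STM is an isolated, swappable argument, which is what makes it a convenient host for comparing different mixers.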

| STM (stage design) | Original | Ours | STM (block design) | Original | Ours |
| --- | --- | --- | --- | --- | --- |
| SW-Attn | 81.1 | 82.3 | Halo-Attn | 80.5 | 83.0 |
| DW-Conv | 82.1 | 82.2 | DW-Conv | 82.2 | 82.2 |

Table 2: The ablation study of block and stage design.

4.3 Model Configuration and Training Strategies

Model configuration.

For a comprehensive comparison, we instantiate each STM on models with four different scales: micro (4.5M), tiny (30M), small (50M), and base (90M). Parameters and MACs of different models can be found in Table 3. Note that micro models only contain around 4.5M parameters, a scale that is rarely discussed for standard backbones. As different STMs vary in complexity, the blocks per stage and the channel number are different.


We survey the training techniques of recent top-performing networks [liu2021swin, liu2022convnet] and merge their training strategies into a unified one. All models are trained on ImageNet-1k [deng2009imagenet] with the AdamW optimizer [loshchilov2017decoupled]. We use gradient clipping, weight decay, layer scale [touvron2021going] with an initial value of 1e-6, label smoothing [szegedy2016rethinking], and stochastic depth [huang2016deep]. For data augmentation, following common practice [liu2021swin, liu2022convnet], we use RandAugment [cubuk2020randaugment] with Mixup [zhang2017mixup], CutMix [yun2019cutmix], Random Erasing [zhong2020random], and color jitter. The strategies along with the hyper-parameters are shared across all models, except for the batch size and drop path rate. We use a larger batch size for ConvNeXt and InternImage than for Swin Transformer, HaloNet, and PVT, since we find that a larger batch size favors convolution STMs. For the drop path rate, we simply follow the original settings in the respective papers. The learning rate follows a linear warm-up and then a cosine decay.
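The learning-rate schedule described above (linear warm-up followed by cosine decay) can be sketched as a function of the epoch index. The values used in the example are hypothetical placeholders and are not the paper's hyper-parameters; updating once per epoch rather than per iteration is also an assumption.

```python
import math


def lr_at_epoch(epoch, base_lr, warmup_epochs, total_epochs):
    """Linear warm-up to base_lr, then cosine decay towards zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, chosen only to make the shape of the curve easy to check.
schedule = [lr_at_epoch(e, base_lr=1.0, warmup_epochs=5, total_epochs=25)
            for e in range(26)]
```

The schedule rises linearly during the first five epochs, peaks at the base rate, and reaches half of it midway through the decay phase.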

Object detection.

We use object detection as the downstream task and fit the models into Mask R-CNN [he2017mask] for evaluation. We train and evaluate models on COCO train2017 and val2017 with single-scale training and evaluation. Images are resized to a fixed shorter side, with an upper limit on the longer side. All models are trained using AdamW [loshchilov2017decoupled], and the learning rate is dropped at the eighth and tenth epochs. Drop path rates are set separately for the micro, tiny, small, and base models.

5 Results and Discussion

5.1 Image Classification

| Scale | Method | #Params | FLOPs | ImageNet-1k Acc |
| --- | --- | --- | --- | --- |
| Micro | U-HaloNet | 4.4M | 0.65G | 75.8% |
| | U-PVT | 4.3M | 0.57G | 72.8% |
| | U-Swin Transformer | 4.4M | 0.71G | 74.4% |
| | U-ConvNeXt | 4.4M | 0.65G | 75.1% |
| | U-InternImage | 4.3M | 0.65G | 75.3% |
| Tiny | U-HaloNet | 31.5M | 4.75G | 83.0% |
| | U-PVT | 30.8M | 4.56G | 82.1% |
| | U-Swin Transformer | 31.5M | 4.91G | 82.3% |
| | U-ConvNeXt | 31.9M | 5.01G | 82.2% |
| | U-InternImage | 29.9M | 4.83G | 83.3% |
| Small | U-HaloNet | 52.8M | 8.92G | 84.0% |
| | U-PVT | 50.5M | 7.33G | 83.2% |
| | U-Swin Transformer | 52.9M | 9.18G | 83.3% |
| | U-ConvNeXt | 54.4M | 9.40G | 83.1% |
| | U-InternImage | 50.1M | 8.24G | 84.1% |
| Base | U-HaloNet | 93.3M | 15.84G | 84.6% |
| | U-PVT | 91.1M | 12.73G | 83.4% |
| | U-Swin Transformer | 93.4M | 16.18G | 83.7% |
| | U-ConvNeXt | 95.8M | 16.64G | 83.7% |
| | U-InternImage | 97.5M | 16.08G | 84.5% |

Table 3: Classification accuracy on ImageNet-1k. “U-” denotes that the models are re-implemented in our unified architecture.

We compare different STMs on ImageNet-1k classification in Table 3 (implementations under our unified architecture are denoted with the prefix “U-”). U-InternImage achieves the best performance on the tiny and small models, but falls behind U-HaloNet on the micro and base scales. Halo attention shows performance comparable to DCNv3 and outperforms Swin attention significantly, which suggests that “haloing” is a practical window information transfer design for local-attention STMs when performance takes priority. This shows that an early STM, e.g., halo attention, can yield a significant performance gain with advanced engineering techniques. Simply using depth-wise convolution attains performance comparable to Swin attention and spatial reduction attention, and surpasses Swin attention on micro models. U-PVT, with spatial reduction (global) attention as its STM, shows an obvious disadvantage on micro models, while quickly becoming comparable with U-Swin Transformer on other scales.

5.2 Object Detection

We use object detection as the down-stream task to evaluate different STMs. Table 4 reports the detection results, where we find that: (i) STMs with high accuracy (halo attention and DCNv3) retain the advantages on object detection and surpass the other STMs by large margins; (ii) global attention (U-PVT) performs the worst on micro models but becomes on par with swin attention on other scales; (iii) depth-wise convolution shows better performance on the micro model but worse performance on other scales.

Considering the results from both upstream and downstream tasks, we conclude that both local-attention- and convolution-based models with a locality inductive bias perform better than the model using global spatial reduction attention.

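| Scale | Method | Box AP | Box AP_S | Box AP_M | Box AP_L | Mask AP | Mask AP_S | Mask AP_M | Mask AP_L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Micro | U-HaloNet | 40.3 | 24.5 | 43.0 | 53.7 | 37.3 | 18.7 | 39.7 | 55.2 |
| | U-PVT | 35.9 | 21.2 | 38.4 | 47.2 | 34.2 | 16.8 | 36.1 | 50.2 |
| | U-Swin Transformer | 36.6 | 22.3 | 39.1 | 48.0 | 34.6 | 17.2 | 36.7 | 50.1 |
| | U-ConvNeXt | 39.2 | 23.0 | 42.6 | 51.0 | 36.4 | 17.8 | 39.2 | 52.7 |
| | U-InternImage | 39.5 | 22.8 | 42.9 | 52.7 | 36.6 | 17.4 | 39.4 | 53.6 |
| Tiny | U-HaloNet | 46.9 | 30.0 | 50.2 | 61.4 | 42.4 | 22.6 | 45.3 | 60.9 |
| | U-PVT | 44.2 | 27.5 | 47.4 | 59.1 | 40.6 | 21.1 | 43.4 | 59.5 |
| | U-Swin Transformer | 44.3 | 28.0 | 47.5 | 57.8 | 40.5 | 21.6 | 43.0 | 58.2 |
| | U-ConvNeXt | 44.3 | 27.8 | 48.0 | 57.5 | 40.5 | 20.7 | 43.9 | 58.6 |
| | U-InternImage | 47.2 | 30.4 | 51.3 | 61.3 | 42.5 | 23.3 | 45.9 | 60.9 |
| Small | U-HaloNet | 48.2 | 31.2 | 52.5 | 63.2 | 43.3 | 23.4 | 47.0 | 62.3 |
| | U-PVT | 46.1 | 28.8 | 49.3 | 61.2 | 41.9 | 21.9 | 44.6 | 61.2 |
| | U-Swin Transformer | 46.4 | 28.2 | 50.0 | 61.5 | 42.1 | 21.8 | 45.2 | 60.7 |
| | U-ConvNeXt | 45.6 | 28.5 | 49.8 | 59.8 | 41.2 | 21.9 | 44.7 | 59.4 |
| | U-InternImage | 47.8 | 30.3 | 51.8 | 62.4 | 43.0 | 23.1 | 46.3 | 61.7 |
| Base | U-HaloNet | 49.0 | 31.8 | 53.0 | 63.7 | 43.8 | 24.4 | 47.0 | 62.8 |
| | U-PVT | 46.4 | 29.0 | 49.8 | 62.2 | 42.3 | 22.3 | 45.1 | 61.8 |
| | U-Swin Transformer | 47.0 | 30.8 | 50.6 | 62.2 | 42.2 | 23.5 | 45.1 | 61.7 |
| | U-ConvNeXt | 46.7 | 30.2 | 50.3 | 61.4 | 42.2 | 22.8 | 45.1 | 60.7 |
| | U-InternImage | 48.7 | 33.1 | 52.5 | 63.4 | 43.8 | 25.0 | 46.9 | 62.3 |

Table 4: Object detection results on COCO.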
STM A B C D E Original
Halo-Attn 84.0 84.3 84.4 83.8 83.8
SR-Attn 83.2 83.4 83.4 78.6 79.4 81.7
SW-Attn 83.3 83.5 83.7 82.8 82.9 83.0
DW-Conv 83.1 83.1 83.3 83.2 83.6 83.1
DCNv3 84.1 84.1 84.0 83.8 83.6
Table 5: Ablation study of the unified network architecture. Experiments are conducted on small-level models. We gradually modify our unified architecture (denoted by A) into the architecture of ConvNeXt (denoted by E). “Original” results are the reported results in their papers.

5.3 Ablation Study of the Unified Architecture

In this section, we perform the ablation study on the unified architecture design for STMs. As shown in Fig. 1, although our unified architecture obtains significant improvement for most STMs compared with their original implementations, the performance of depth-wise convolution (DW-Conv) is only comparable to that of ConvNeXt [liu2022convnet], which uses DW-Conv as the STM. To investigate the performance gain from each modification, we set up five models, which show the evolutionary process from (A), our unified architecture, to (E), the ConvNeXt architecture. For the stage-level design, B moves the layer normalization (LN) after the global average pooling in the decoder head based on A, and C removes the LN after each stage based on B. For the block-level design, D adopts the block design of ConvNeXt (Fig. 2(e)) based on A. Note that we keep our stem and transition layer (Fig. 3) in all models.

From the results in Table 5, we can see that: (i) for most of the STMs, using LN after the global average pooling (B) and removing the stage norm (C) consistently improve the performance; (ii) except for the DW-Conv used in ConvNeXt, the STMs perform worse with the ConvNeXt-style block design, as seen by comparing (A) and (D). In contrast, the accuracy of DW-Conv is boosted by the ConvNeXt-style block design. This suggests that this block design is more suitable for a simple STM such as depth-wise convolution. Hence, we retain the transformer-style block design for comparison. The stage normalization is also kept, as the additional normalization helps to stabilize the training process when the model scales up [liu2022swin, 2023intern]; we leave this to future work.

6 Inductive Bias and Invariant Analysis

In this section, we investigate how the inductive bias of different STMs affects the characteristics of learned models by analyzing their effective receptive fields, translation invariance, rotation invariance, and scaling invariance.

6.1 Effective Receptive Field

The effective receptive field (ERF) [luo2016understanding] of a unit (a feature vector from 2D feature maps) in vision backbones denotes the region whose pixels affect the output of this unit. Although the inductive bias injected into the different STMs determines the theoretical upper bound of the receptive field, the real ERFs are quite different. We use the ERF toolbox [luo2016understanding] to evaluate the ERF of the center point in feature maps at multiple stages. We adopt models trained on ImageNet-1k to compute the ERF maps. The ERF tool returns a gradient norm map, whose values represent the impact of different locations on the center point. A gradually expanding square is placed at the center. When the sum of the gradient norm within the square reaches 50% of that of the full picture, the ratio of the square length to the input size is used as the quantitative metric following [ding2022scaling], termed ERF@50.
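The ERF@50 computation described above can be sketched in a few lines. The sketch assumes the gradient norm map is a square 2D list with an odd side length and the same resolution as the input; `erf_at` and `ring_map` are illustrative names, and this is not the ERF toolbox itself.

```python
def erf_at(grad_map, fraction=0.5):
    """Grow a centered square until it holds at least `fraction` of the total
    gradient norm, then return its side length divided by the map side length."""
    size = len(grad_map)
    center = size // 2
    total = sum(sum(row) for row in grad_map)
    for half in range(center + 1):
        lo, hi = center - half, center + half + 1
        inside = sum(sum(row[lo:hi]) for row in grad_map[lo:hi])
        if inside >= fraction * total:
            return (2 * half + 1) / size
    return 1.0


def ring_map(values):
    """Toy map whose value depends only on the Chebyshev distance to the center."""
    radius = len(values) - 1
    size = 2 * radius + 1
    return [[values[max(abs(r - radius), abs(c - radius))] for c in range(size)]
            for r in range(size)]


spread = ring_map([4.0, 1.0, 0.25])   # gradient mass spread over the map
peaked = ring_map([10.0, 0.0, 0.0])   # gradient mass concentrated at the center
erf_spread = erf_at(spread)
erf_peaked = erf_at(peaked)
```

On the spread-out map the square must grow to 3 of 5 pixels before it covers half of the mass, while on the peaked map the single center pixel already suffices.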

Figure 4: The relation between the detection accuracy and the effective receptive field. We plot the box AP for small, medium, and large objects on the COCO validation set against the ERF@50 of micro and base models.

We examine the relation between ERF and downstream tasks in Fig. 4, where we plot the box accuracy of the small, medium, and large objects with the ERF@50 and obtain the following observations: (i) For micro models, STMs with larger ERF attain higher performance on medium and large objects, while the performance of small objects is comparable; (ii) for base models, ERF seems to be uncorrelated with the performance. This suggests that, on the detection task, larger ERFs are more important for the model with limited depth and parameters. However, for base models, the effective receptive field may have grown to a practical size. The performance differences are mainly from the design of STMs, i.e., how STMs determine the aggregation weights. We argue that ERF should not be a “metric” to evaluate the quality of STMs. Smaller ERF does not mean the STM is “inferior” to STM with larger ERFs.

We also observe that the design of STMs affects the “shape” of the effective receptive field. Fig. 5 shows the visualized ERFs of Swin attention and halo attention, where the non-overlapped window partitions lead to asymmetric ERFs. The shifted “windows” are clearly visible in the ERF of Swin attention. Besides, Swin attention tends to aggregate features from the right side of the center point, which is consistent with the direction in which the window shifts. Halo attention shows a more symmetrical ERF with the proposed “halo” window information transfer.

Figure 5: The design of spatial token mixer affects the “shape” of ERF. We visualize the ERF map at the last two stages of U-Swin Transformer and U-HaloNet. The asymmetric ERF of swin attention manifests the shifted windows.

6.2 Invariance Analysis

We evaluate the robustness of different STMs under geometric transformations in this subsection. We consider translation, rotation, and scaling. The convolution-based STMs are translation-equivariant by design (locality and weight sharing [Goodfellow-et-al-2016]), while local-attention STMs only maintain window-level translation equivariance. Translation equivariance is closely related to translation invariance in our case, since the feature map is average-pooled before being fed into the classifier. Neither convolution- nor attention-based STMs impose inductive biases related to rotation or scaling. However, different types of invariance can be learned from training samples or from data augmentation.
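The link between a weight-shared local operator and invariance after average pooling can be checked numerically. The sketch below is a toy under stated assumptions: it uses a single-channel map, circular (wrap-around) shifts and padding so that the property holds exactly, and a hypothetical helper `circular_conv`; real image translations with cropping or zero padding satisfy it only approximately.

```python
import numpy as np

def circular_conv(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Weight-shared local filter with wrap-around padding (hypothetical helper)."""
    k = kernel.shape[0] // 2
    out = np.zeros_like(x, dtype=float)
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out += kernel[dy + k, dx + k] * np.roll(x, shift=(-dy, -dx), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))
shift = (2, 3)

# Equivariance: shifting then filtering equals filtering then shifting.
shift_then_conv = circular_conv(np.roll(x, shift, axis=(0, 1)), kernel)
conv_then_shift = np.roll(circular_conv(x, kernel), shift, axis=(0, 1))
equivariant = np.allclose(shift_then_conv, conv_then_shift)

# Invariance: global average pooling removes the remaining shift.
invariant_after_pool = np.isclose(shift_then_conv.mean(), circular_conv(x, kernel).mean())
```

Both flags evaluate to true in this idealized setting, which is why equivariant feature maps followed by global average pooling yield translation-invariant classifier inputs.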

Translation invariance.

Translation invariance refers to the ability of a model to retain its original outputs when the input images are translated. We evaluate the translation invariance on the classification task by jittering the images over a range of pixel offsets. Following [zhang2019making], the invariance is measured by the probability that the model predicts the same label when the same input images are translated.
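A prediction-consistency score of this kind can be sketched as follows. The sketch makes several assumptions that are not taken from the paper: consistency is computed against the prediction on the untranslated image, translation is emulated with a circular `np.roll`, and the two toy classifiers (`mean_sign_classifier`, `corner_pixel_classifier`) are hypothetical stand-ins for trained models.

```python
import numpy as np

def prediction_consistency(predict, images, transforms):
    """Average fraction of images whose predicted label is unchanged
    by each transform (hypothetical helper)."""
    base = predict(images)
    per_transform = [np.mean(predict(t(images)) == base) for t in transforms]
    return float(np.mean(per_transform))

def mean_sign_classifier(images):
    # Toy "globally pooled" model: label depends only on the sum over all pixels.
    return (images.sum(axis=(1, 2)) > 0).astype(int)

def corner_pixel_classifier(images):
    # Toy position-dependent model: label depends on a single fixed pixel.
    return (images[:, 0, 0] > 0).astype(int)

rng = np.random.default_rng(0)
images = rng.integers(-5, 6, size=(32, 8, 8))  # integer pixels keep sums exact
shifts = [lambda x, s=s: np.roll(x, s, axis=2) for s in range(1, 5)]

pooled_score = prediction_consistency(mean_sign_classifier, images, shifts)
corner_score = prediction_consistency(corner_pixel_classifier, images, shifts)
```

The pooled toy model is perfectly consistent under these shifts, whereas the position-dependent one generally is not. The same function can be reused for rotation by passing rotation transforms instead of shifts.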

The first row in Fig. 6 shows the translation invariance of different STMs, where we find that: (i) Convolution-based STMs (DCNv3 and depth-wise convolution) show better translation invariance (the first row in Fig. 6 (a)). (ii) Global attention used in U-PVT performs the worst in terms of translation invariance, possibly due to the lack of translation equivariance. (iii) DCNv3 and halo attention still achieve the best performance even when the images are translated (the first row in Fig. 6 (b)). (iv) Halo attention, depth-wise convolution, and DCNv3 show superior translation invariance at the beginning, while DCNv3 learns better translation invariance as the training proceeds (the first row in Fig. 6 (c)).

Figure 6: Translation, rotation, and scaling invariance of STMs. (a) Prediction consistency. (b) Classification (detection) accuracy under input translation, rotation, and scaling. (c) Average prediction consistency at different training epochs.

Rotation invariance.

We evaluate the rotation invariance on the classification task by rotating the images over a range of angles at a fixed step. Similar to the translation invariance, the prediction consistency under different rotation angles is used to evaluate the rotation invariance of different STMs.

From the second row in Fig. 6, we find that: (i) DCNv3 with the input-dependent sampling strategy shows strong robustness against rotation compared with other STMs. (ii) The other STMs show comparable performance, and halo attention performs slightly better at larger rotation angles. (iii) The rotation invariance of all STMs is comparable at the beginning of the training process, except for halo attention, which shows better rotation invariance. However, DCNv3 quickly learns strong rotation invariance from the training data.

Scaling invariance.

The scaling invariance is evaluated on object detection. The input images are rescaled by a range of scale factors at a fixed step size. We define box consistency as an invariance metric on the detection task: the predicted boxes on the scaled images are first transformed back to the original resolution, and then the boxes predicted at the original resolution are used as ground-truth boxes to calculate the box mAP.
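The two steps of the box-consistency idea (mapping boxes back to the original resolution, then scoring them against the original-resolution predictions) can be sketched as below. This is a deliberately simplified proxy, not the box mAP used in the paper: it ignores confidence scores and one-to-one matching, assumes boxes in `[x1, y1, x2, y2]` format and at least one prediction per image, and all function names are hypothetical.

```python
import numpy as np

def rescale_boxes(boxes, scale):
    """Map boxes predicted on an image resized by `scale` back to original coordinates."""
    return np.asarray(boxes, dtype=float) / scale

def iou_matrix(a, b):
    """Pairwise IoU between two sets of [x1, y1, x2, y2] boxes."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def matched_fraction(reference_boxes, scaled_boxes, scale, iou_thr=0.5):
    """Fraction of original-resolution boxes recovered from the scaled image."""
    restored = rescale_boxes(scaled_boxes, scale)
    ious = iou_matrix(reference_boxes, restored)
    return float((ious.max(axis=1) >= iou_thr).mean())

# Illustrative numbers only: three reference boxes, two predictions at 2x scale.
reference = [[10, 10, 50, 50], [60, 60, 100, 120], [200, 200, 220, 220]]
predicted_at_2x = [[20, 20, 100, 100], [130, 130, 210, 250]]
score = matched_fraction(reference, predicted_at_2x, scale=2.0)
```

In this toy example two of the three reference boxes are recovered at IoU of at least 0.5, so the score is 2/3. A faithful reproduction would instead feed the restored boxes and the original-resolution predictions into a standard COCO-style mAP evaluator.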

From the third row in Fig. 6, we observe that: (i) All STMs are sensitive to downscaling and show comparable invariance on small inputs. (ii) DCNv3 performs better when the image is upscaled; both its box consistency and box mAP are better than those of the others. (iii) Halo attention shows comparable invariance to depth-wise convolution at the early training stage, but learns better invariance in the last three epochs.

Considering the results on translation, rotation, and scaling invariance, we conclude that STMs with strong performance show relatively high invariance. However, the design of STMs also contributes to the invariance, e.g., depth-wise convolution attains translation invariance comparable to that of halo attention. Besides, we find that DCNv3, which adaptively learns to generate the sampling point set conditioned on the input, achieves the best invariance among all the STMs. This suggests the generalization and robustness potential of DCNv3 compared with other STMs that use a fixed sampling set.

7 Conclusion and Limitations

In this paper, we compare recent popular spatial token mixers in vision backbones under a unified setting, where the impact of engineering techniques is eliminated. We first distill the successful engineering techniques from existing models and elaborate a unified architecture through various ablation studies. Five representative spatial token mixers (STMs) are fitted into the unified architecture for comparison, and we conduct a series of experiments on both upstream and downstream tasks to compare STMs and further analyze how inductive bias influences the learning of deep models. From the results, we find that: (i) engineering design contributes significantly to the performance gain; (ii) some early STMs can yield state-of-the-art performance when equipped with modern design principles.

We hope our work can provide a well-optimized platform for developing new vision backbones, especially advanced spatial feature aggregation mechanisms. The downstream task results and the detailed analyses can further help practitioners choose suitable backbones. There are some limitations that we leave to future work: (i) we plan to evaluate the STMs by embedding them into larger models with more training data; (ii) we will further evaluate the performance on more downstream tasks.