
CAT: Cross Attention in Vision Transformer

by Hezheng Lin, et al.

Since Transformer has found widespread use in NLP, its potential in CV has been recognized and has inspired many new approaches. However, replacing word tokens with image patches after tokenizing the image requires vast computation (e.g., ViT), which bottlenecks model training and inference. In this paper, we propose a new attention mechanism in Transformer termed Cross Attention, which alternates attention within each image patch instead of across the whole image to capture local information, with attention between image patches divided from single-channel feature maps to capture global information. Both operations require less computation than standard self-attention in Transformer. By alternately applying attention within patches and between patches, we implement cross attention, maintaining performance at a lower computational cost, and build a hierarchical network called Cross Attention Transformer (CAT) for other vision tasks. Our base model achieves state-of-the-art results on ImageNet-1K and improves the performance of other methods on COCO and ADE20K, illustrating that our network has the potential to serve as a general backbone. The code and models are available at <>.






1 Introduction

With the development of deep learning and the application of convolutional neural networks LeCun et al. (1995), computer vision tasks have improved tremendously. Since 2012, CNN has long dominated CV, serving as a crucial feature extractor in various vision tasks and as a task-branch encoder in others. A variety of CNN-based networks Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Szegedy et al. (2015); Ioffe and Szegedy (2015); Howard et al. (2017); Zhang et al. (2018); He et al. (2016); Xie et al. (2017); Gao et al. (2019); Tan and Le (2019) offer different improvements and applications, and various downstream tasks likewise have multiple methods, such as object detection Ren et al. (2015); Lin et al. (2017b, a); Tian et al. (2019); Ultralytics (2021); Bochkovskiy et al. (2020); Redmon and Farhadi (2018); Liu et al. (2016); Ghiasi et al. (2019) and semantic segmentation Zhao et al. (2017); Sun et al. (2019a); Yuan et al. (2019); Yu and Koltun (2015); Chen et al. (2014, 2017a, 2017b, 2018).

Lately, Transformer Vaswani et al. (2017), as a new network structure, has achieved significant results in NLP. Benefiting from its remarkable ability to extract global information, it also solves the problem that sequence models such as RNN Zaremba et al. (2014) and LSTM Hochreiter and Schmidhuber (1997) are hard to parallelize, making the development of NLP take an essential leap forward and also inspiring computer vision tasks.

Recent works Wang et al. (2021); Dosovitskiy et al. (2020); Chen et al. (2021); Wu et al. (2021); Liu et al. (2021); Han et al. (2021); Touvron et al. (2020); Zhou et al. (2021); Touvron et al. (2021); Yuan et al. (2021b) introduce Transformer into computer vision as an image feature extractor. However, while the length of the text sequence is fixed in NLP, the resolution of inputs varies across vision tasks, which reduces the Transformer's ability to process images. In processing images with Transformer, one naive approach is to treat each pixel as a token for global attention, similar to word tokens. iGPT Chen et al. (2020) demonstrates that the computation this brings is tremendous. Some works (e.g., ViT, iGPT) take a set of pixels in a region as a token, which reduces the computation to a certain extent. However, the computational complexity still increases dramatically as the input size increases (Formula 1), and the feature maps generated by these methods all have the same shape (Figure 1(b)), making them unsuitable as backbones for subsequent tasks.

In this paper, inspired by the local feature extraction capabilities of CNN, we adopt attention between pixels within one patch to simulate the characteristics of CNN, reducing the computation, which grows quadratically with the input size, to a cost that is quadratic only in the patch size. Meanwhile, as shown in Figure 3, to account for information extraction and exchange over the whole picture, we devise a method of performing attention on single-channel feature maps. Compared with attention over all channels, this significantly reduces computation, as Formulas 1 and 3 demonstrate. Cross attention is performed by alternating attention inside each patch with attention over single-channel feature maps. With Cross Attention, we can build a powerful backbone that generates feature maps of different scales, satisfying the requirements for features of different granularity in downstream tasks, as shown in Figure 1. We introduce global attention with no or only a small increase in computation, which is a more reasonable way to combine the strengths of Transformer and CNN.

Our base model achieves 82.8% top-1 accuracy on ImageNet-1K, comparable with current state-of-the-art CNN-based and Transformer-based networks. Meanwhile, in other vision tasks, CAT as the backbone in object detection and semantic segmentation methods improves their performance.

The features of Transformer and CNN complement each other, and combining them more efficiently and completely to take advantage of both is our long-term goal. Our proposed CAT is a step in that direction, and hopefully there will be further developments along it.

Figure 1: Hierarchical networks. (a) Hierarchical networks based on CNN: different stages generate features at different scales. (b) Hierarchical networks based on Transformer (e.g., ViT): all features have the same shape. (c) Hierarchical networks of CAT (ours), with the characteristics of a CNN hierarchical network.

2 Related work

CNN/ CNN-based network

CNN has the characteristics of shared weights, translation and rotation invariance, and locality, which have brought great achievements in computer vision in place of the multi-layer perceptron and made it the standard network in vision tasks over the last decade. As the first CNN to achieve great success in computer vision, AlexNet laid the foundation for the later development of CNN-based networks, and networks that improve performance Simonyan and Zisserman (2014); He et al. (2016); Xie et al. (2017); Gao et al. (2019); Tan and Le (2019); Brock et al. (2021) have become common choices as backbones in vision tasks. The Inceptions Szegedy et al. (2015); Ioffe and Szegedy (2015); Szegedy et al. (2016, 2017); Chollet (2017), MobileNets Howard et al. (2017); Sandler et al. (2018); Howard et al. (2019), and ShuffleNets Zhang et al. (2018); Ma et al. (2018), which improve efficiency, are also alternatives in tasks that require fast inference.

Global attention in Transformer-based network

Transformer was proposed in NLP for machine translation, where the core multi-head self-attention (MSA Vaswani et al. (2017)) mechanism is vital in extracting the relationships between words at multiple levels. As the first Transformer-based backbones, ViT Dosovitskiy et al. (2020) and DeiT Touvron et al. (2020) divide the image into patches (with a patch size of 16×16); each patch is flattened into a token, and a CLS token Devlin et al. (2018) is introduced for classification. Both CvT Wu et al. (2021) and CeiT Yuan et al. (2021a) introduce convolutional layers to replace the linear projection of QKV Vaswani et al. (2017). CrossViT Chen et al. (2021) integrates global features of different granularity by dividing images into patches of two different sizes for two branches. However, these methods put all patches together for MSA, focusing only on the relationships between different patches, and as the input size increases, the computational complexity increases dramatically, as demonstrated in Formula 1, making them difficult to apply to vision tasks requiring large-resolution inputs.

Local attention in Transformer-based network

The relationships among the internal information of a patch are vital in vision Lowe (1999); Brendel and Bethge (2019). Recently, TNT Han et al. (2021) divides each patch into smaller patches; through the proposed TNT block, both the global information and the information inside each patch are captured. Swin Liu et al. (2021) treats each patch as a window to extract the internal relevancy of the patch, and a shifted window is used to catch more features. However, both methods have their problems. First, combining global and local information interaction brings a non-negligible increase in computation in Han et al. (2021). Second, the interaction between local information and adjacent patches in Liu et al. (2021) lacks global information exchange. We propose a cross-patch self-attention block that effectively maintains global information interaction while avoiding the enormous growth of computation with the resolution of the inputs.

Hierarchy networks and downstream tasks

Transformer has been used successfully in vision tasks Carion et al. (2020); Zhu et al. (2020); Zheng et al. (2020); Wang et al. (2020b) and NLP tasks Devlin et al. (2018); Yang et al. (2019); Liu et al. (2019); Lan et al. (2019); Sun et al. (2019b); Joshi et al. (2020); Liu et al. (2020). However, because the input and output of a typical Transformer have the same shape, it is difficult to achieve a hierarchical structure similar to CNN-based networks He et al. (2016); Xie et al. (2017); Liu et al. (2016); Howard et al. (2019), which is significant for downstream tasks. FPNs Lin et al. (2017a); Liu et al. (2018); Ghiasi et al. (2019); Tan et al. (2020) combined with ResNet He et al. (2016) have become the standard paradigm in object detection. In semantic segmentation Zhao et al. (2017); Sun et al. (2019a); Yuan et al. (2019); Chen et al. (2018), pyramidal features are used to improve performance. The recent PVT Wang et al. (2021) and Swin Liu et al. (2021) reduce the feature resolution in different stages, similar to ResNet He et al. (2016), which is also the method we use.

3 Method

3.1 Overall architecture

Our method aims to combine attention within a patch and attention between patches, and to build a hierarchical network by stacking basic blocks that can be simply applied to other vision tasks. As shown in Figure 2, we first reduce the input image to 1/4 of its original height and width (the down-sample rate R = 4 in Table 1) and increase the number of channels with the patch embedding layer, referring to the patch processing in ViT Dosovitskiy et al. (2020). Then several CAT layers are used for feature extraction at different scales.

After the pre-processing above, the input enters the first stage, where the shape of each patch is N×N (N being the patch size after the patch embedding layer). Then, in the second stage, a patch projection layer performs a space-to-depth operation: each 2×2 block of pixels is folded into the channel dimension, after which a linear projection layer maps the quadrupled channels down to double the original. After passing through several cross attention blocks in a stage, the generated feature map has half the height and width and double the dimension of the previous stage, similar to the operation in ResNet He et al. (2016), which is also the practice in Swin Liu et al. (2021). After the four stages, we obtain four feature maps of different scales and dimensions. Like typical CNN-based networks He et al. (2016); Xie et al. (2017), feature maps of different granularity can thus be provided to downstream vision tasks.
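The shape bookkeeping above can be sketched at the tensor-shape level. A minimal sketch in Python, where the stage dimensions follow the CAT-S column of Table 1 and `patch_embed`, `patch_project`, and `cat_feature_pyramid` are hypothetical helper names:

```python
# Shape-level sketch of CAT's hierarchical pipeline (our reading of the
# text and Table 1; CAT-S dims assumed: 96 -> 192 -> 384 -> 768).

def patch_embed(h, w, c, rate=4, dim=96):
    """Patch embedding: down-sample by `rate`, project channels to `dim`."""
    return h // rate, w // rate, dim

def patch_project(h, w, c):
    """Space-to-depth: 2x2 pixel blocks fold into channels (4c),
    then a linear layer projects 4c -> 2c."""
    return h // 2, w // 2, 2 * c

def cat_feature_pyramid(h, w):
    """Shapes of the four feature maps produced by the four stages."""
    shapes = []
    h, w, c = patch_embed(h, w, 3)        # stage 1: H/4 x W/4 x 96
    shapes.append((h, w, c))
    for _ in range(3):                    # stages 2-4 halve h, w; double c
        h, w, c = patch_project(h, w, c)
        shapes.append((h, w, c))
    return shapes

print(cat_feature_pyramid(224, 224))
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

The per-stage halving of resolution and doubling of channels is exactly the ResNet-style pyramid referenced in the text.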

Figure 2: (a) CAT architecture; at the third stage, the number of CABs varies with the size of the model. (b) Cross Attention Block (CAB), stacking IPSA and CPSA, each with LN Ba et al. (2016), MLP, and shortcut He et al. (2016).

3.1.1 Inner-Patch Self-Attention Block

In computer vision, each pixel needs specific channels to represent its different semantic features. By analogy with word tokens in NLP, the ideal would be to take each pixel of the feature map as a token (e.g., ViT, DeiT), but the computational cost is too enormous. As Formula 1 shows, the computational complexity increases quadratically with the resolution of the input image. For instance, in the conventional R-CNN series of methods Girshick et al. (2014); Girshick (2015); Ren et al. (2015); Lin et al. (2017b), the short edge of the input is at least 800 pixels, while the YOLO series Bochkovskiy et al. (2020); Wang et al. (2020a); Ultralytics (2021) also needs images of more than 500 pixels. Most semantic segmentation methods Zhao et al. (2017); Sun et al. (2019a); Chen et al. (2014) also need images with side lengths of 512 pixels. The computational cost is then at least 5 times higher than with the 224-pixel inputs of the pre-training phase.


Inspired by the local feature extraction of CNN, we introduce the locality of convolution into Transformer, conducting per-pixel self-attention within each patch; we call this Inner-Patch Self-Attention (IPSA), as shown in Figure 3. We treat a patch, rather than the whole picture, as the attention scope. At the same time, Transformer can generate different attention maps according to its input, a significant advantage over CNN with its fixed parameters; this resembles dynamic parameters in convolution and is proved gainful in Tian et al. (2020). Han et al. (2021) has revealed that attention between pixels is also vital. Our approach significantly reduces computation while still modeling the relationships between pixels within a patch. The computation is as follows:

Ω(IPSA) = 4hwC^2 + 2N^2·hwC,  (2)

where N is the patch size in IPSA, h and w are the height and width of the feature map, and C is its dimension. Compared with MSA in a standard Transformer, whose cost Ω(MSA) = 4hwC^2 + 2(hw)^2·C (Formula 1) is quadratic in the resolution hw, the attention term decreases from a quadratic correlation with hw to a linear one. Under a stage-1-like setting of h = w = 56, C = 96, and N = 7, Formula 1 gives about 2.0 GFLOPs while Formula 2 gives about 0.15 GFLOPs, which is much fewer.
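The saving can be checked numerically. The sketch below assumes the Swin-style cost accounting Ω(MSA) = 4hwC² + 2(hw)²C and Ω(IPSA) = 4hwC² + 2N²hwC, with illustrative values h = w = 56, C = 96, N = 7 (assumptions for the sake of the example, not necessarily the paper's exact numbers):

```python
# Hedged FLOPs accounting for MSA vs. IPSA under the standard
# window-attention cost model; h, w, C, N below are assumed values.

def flops_msa(h, w, c):
    # 4hwC^2 for QKV/output projections + 2(hw)^2 C for the attention itself
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c

def flops_ipsa(h, w, c, n):
    # attention restricted to N x N patches: 2 N^2 hw C replaces 2 (hw)^2 C
    return 4 * h * w * c ** 2 + 2 * n ** 2 * h * w * c

h = w = 56
c, n = 96, 7                     # assumed stage-1-like setting
print(flops_msa(h, w, c) / flops_ipsa(h, w, c, n))  # roughly an order of magnitude
```

Because only the attention term changes, the gap widens further as the input resolution grows.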

3.1.2 Cross-Patch Self-Attention Block

Adding an attention mechanism between pixels only ensures that the interrelationships between pixels inside one patch are caught; information exchange across the whole picture is also quite crucial. In CNN-based networks, stacked convolution kernels are generally used to expand the receptive field, dilated/atrous convolution Yu and Koltun (2015) was proposed for a larger receptive field, and in practice the final receptive field is expected to expand to the whole picture. Transformer is naturally capable of capturing global information, but efforts like ViT Dosovitskiy et al. (2020) and DeiT Touvron et al. (2020) are ultimately not the best solution.

Each single-channel feature map naturally carries global spatial information. We propose Cross-Patch Self-Attention (CPSA), which separates the feature map by channel, divides each single-channel map into patches, and performs self-attention over them to capture global information across the whole feature map. This is similar to the depth-wise separable convolution used in Xception Chollet (2017) and MobileNet Howard et al. (2017). The computation of our method is as follows:

Ω(CPSA) = 4hwN^2·C + 2(hw)^2·C/N^2,  (3)

where N is the patch size in CPSA, and h and w represent the height and width of the feature map, respectively. This cost is lower than that of ViT (Formula 1) and other global-attention-based methods. Meanwhile, as shown in Figure 2, following the design of MobileNet Howard et al. (2017), we stack IPSA and CPSA blocks to extract and integrate features between pixels within one patch and between patches within one feature map. Compared with the shifted window in Swin Liu et al. (2021), which is manually designed, difficult to implement, and has limited ability to capture global information, ours is principled and easier to comprehend. Under the same setting as the previous section, Formula 3 gives about 0.1 GFLOPs for CPSA, which is much fewer than the roughly 2.0 GFLOPs of MSA.
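The tokenization behind CPSA can be summarized at the shape level. This is our reading of the scheme, with `cpsa_tokens` a hypothetical helper and the concrete numbers illustrative:

```python
# Shape-level sketch of CPSA tokenization (our reading of the method):
# each of the C single-channel h x w maps is split into N x N patches, and
# every patch is flattened into an N^2-dim token, so self-attention runs
# over (h/N)*(w/N) tokens per channel, independently for each channel.

def cpsa_tokens(h, w, c, n):
    tokens_per_channel = (h // n) * (w // n)
    token_dim = n * n
    return c, tokens_per_channel, token_dim   # C independent attention calls

print(cpsa_tokens(56, 56, 96, 7))  # (96, 64, 49)
```

Because each of the 64 tokens spans the whole channel's spatial extent, attention among them is global, yet the per-channel token count stays small.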

Figure 3: The pipeline of IPSA and CPSA. (a) IPSA: unfold the all-channel input into patches and stack them; after the IPSA block, reshape back to the original shape. (b) CPSA: unfold each single-channel input into patches and stack them; after the CPSA block, reshape back to the original shape.

The multi-head self-attention mechanism was proposed in Vaswani et al. (2017). In NLP, each head can notice different semantic information between words; in computer vision, each head can notice different semantic information between image patches, similar to channels in CNN-based networks. In CPSA, we set the number of heads equal to the patch size, making the dimension of one head equal to the patch size, but this does not help performance, as presented in Table 5. So a single head is the default setting in our experiments.

Position encoding

We adopt relative position encoding in IPSA, referring to Hu et al. (2018); Bao et al. (2020); Liu et al. (2021). For CPSA, which conducts self-attention on the complete single-channel feature map, we instead add absolute position encoding to the features embedded by the patch embedding layer, which can be formed as follows:

x = Patch.Emb(img) + ab.pos.,

where ab.pos. indicates the absolute position encoding and Patch.Emb indicates the patch embedding layer in Table 1. Absolute position encoding is useful in CPSA for improving performance; the results are reported in Table 6.

3.1.3 Cross Attention based Transformer

Cross Attention Block (CAB) consists of two inner-patch self-attention blocks and a cross-patch self-attention block, as shown in Figure 2. A CAT layer is composed of several CABs, and each stage of the network is composed of a patch embedding (or projection) layer and a different number of layers, as shown in Figure 2. The pipeline of a CAB is as follows:

x̂_1 = IPSA(LN(x_0)) + x_0,
x_1 = MLP(LN(x̂_1)) + x̂_1,
x̂_2 = CPSA(LN(x_1)) + x_1,
x_2 = MLP(LN(x̂_2)) + x̂_2,
x̂_3 = IPSA(LN(x_2)) + x_2,
x_3 = MLP(LN(x̂_3)) + x̂_3,

where each x̂_i or x_i is the output of one block (e.g., IPSA, MLP) with LN. For the patch embedding layer, we compare convolution as in Dosovitskiy et al. (2020), where the convolution kernel size is set to P and the stride is also P, with slicing the inputs as in Ultralytics (2021); the results reported in Table 5 show that both have the same performance, and our default setting is the former. According to the number of CABs in stage 3 and the dimension of the patch projection layer, three models of different computational complexity are designed, namely CAT-T, CAT-S, and CAT-B, with 2.8G, 5.9G, and 8.9G FLOPs of computation, respectively. Table 1 details the configurations.
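The alternation can be sketched in code. The IPSA→CPSA→IPSA ordering is our reading of Figure 2 and the ablation wording ("second IPSA block" as the third block), and `ipsa1`, `cpsa`, `ipsa2`, `mlp`, `ln` are stand-in callables, not the real sub-modules:

```python
# Minimal sketch of one Cross Attention Block (CAB): three pre-norm
# attention units (IPSA, CPSA, IPSA), each followed by a pre-norm MLP,
# all with residual shortcuts, matching the pipeline equations above.

def attn_unit(x, attn, mlp, ln):
    x = x + attn(ln(x))   # x_hat = Attn(LN(x)) + x
    x = x + mlp(ln(x))    # x     = MLP(LN(x_hat)) + x_hat
    return x

def cab(x, ipsa1, cpsa, ipsa2, mlp, ln):
    x = attn_unit(x, ipsa1, mlp, ln)   # local: attention inside each patch
    x = attn_unit(x, cpsa,  mlp, ln)   # global: attention per channel map
    x = attn_unit(x, ipsa2, mlp, ln)   # local again
    return x

identity = lambda v: v                  # toy stand-ins for the sub-modules
print(cab(1.0, identity, identity, identity, identity, identity))  # 64.0
```

With identity stand-ins, each residual pair doubles the input twice, so the toy output is 1.0 × 4³ = 64.0; the structure, not the arithmetic, is the point.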


          down. rate  CAT-T               CAT-S               CAT-B
stage 1   4           Patch Embedding     Patch Embedding     Patch Embedding
                      R=4, Dim=64         R=4, Dim=96         R=4, Dim=96
                      1 CAB               1 CAB               1 CAB
stage 2   8           Patch Projection    Patch Projection    Patch Projection
                      R=2, Dim=128        R=2, Dim=192        R=2, Dim=192
                      1 CAB               1 CAB               1 CAB
stage 3   16          Patch Projection    Patch Projection    Patch Projection
                      R=2, Dim=256        R=2, Dim=384        R=2, Dim=384
                      3 CABs              3 CABs              6 CABs
stage 4   32          Patch Projection    Patch Projection    Patch Projection
                      R=2, Dim=512        R=2, Dim=768        R=2, Dim=768
                      1 CAB               1 CAB               1 CAB

Table 1: Detailed configurations of CATs. down. rate indicates the down-sample rate at each stage; R indicates the down-sample rate at the specific layer.

4 Experiment

We conduct image classification, object detection, and semantic segmentation experiments on ImageNet-1K Deng et al. (2009), COCO 2017 Lin et al. (2014), and ADE20K Zhou et al. (2017), respectively. In the following, we compare the CAT architecture with state-of-the-art architectures on the three tasks, and then report ablation experiments on some of the designs adopted in CAT.

4.1 Image Classification


For image classification, we report the top-1 accuracy with a single crop on ImageNet-1K Deng et al. (2009), which contains 1.28M training images and 50K validation images from 1000 categories. The settings in our experiments mostly follow Touvron et al. (2020). We employ a batch size of 1024, an initial learning rate of 0.001, and a weight decay of 0.05. We train the models for 300 epochs with the AdamW Loshchilov and Hutter (2018) optimizer, a cosine-decay learning-rate scheduler, and a linear warm-up of 20 epochs. Stochastic depth Huang et al. (2016) is used in training, with rates of 0.1, 0.2, and 0.3 for the three architecture variants respectively, and dropout Srivastava et al. (2014) is adopted in the self-attention of CAB with a rate of 0.2 to avoid overfitting. We use most of the regularization strategies and augmentations in Touvron et al. (2020), similar to Liu et al. (2021), to make our results more comparable and convincing.


Model Params(M) FLOPs(B) Top-1(%)


CNN-based networks
ResNet50He et al. (2016) 26 4.1 76.6
ResNet101He et al. (2016) 45 7.9 78.2
X50-32x4dXie et al. (2017) 25 4.3 77.9
x101-32x4dXie et al. (2017) 44 8.0 78.7
EfficientNet-B4Tan and Le (2019) 19 4.2 82.9
EfficientNet-B5Tan and Le (2019) 30 9.9 83.6
EfficientNet-B6Tan and Le (2019) 43 19.0 84.0
RegNetY-4GRadosavovic et al. (2020) 21 4.0 80.0
RegNetY-8GRadosavovic et al. (2020) 39 8.0 81.7
RegNetY-16GRadosavovic et al. (2020) 84 16.0 82.9


Transformer-based networks
ViT-B/16Dosovitskiy et al. (2020) 86 55.4 77.9
ViT-L/16Dosovitskiy et al. (2020) 307 190.7 76.5
TNT-SHan et al. (2021) 24 5.2 81.3
TNT-BHan et al. (2021) 66 14.1 82.8
CrossViT-15Chen et al. (2021) 27 5.8 81.5
CrossViT-18Chen et al. (2021) 44 9.0 82.5
PVT-SWang et al. (2021) 24.5 3.8 79.8
PVT-MWang et al. (2021) 44.2 6.7 81.2
PVT-LWang et al. (2021) 61.4 9.8 81.7
Swin-T (w/o shifted window)Liu et al. (2021) 29 - 80.2
Swin-TLiu et al. (2021) 29 4.5 81.3
Swin-BLiu et al. (2021) 88 15.4 83.3


CAT-T(ours) 17 2.8 80.3
CAT-S(ours) 37 5.9 81.8
CAT-B(ours) 52 8.9 82.8


Table 2: The comparison of CAT with other networks on ImageNet-1K. Swin-T is also reported without the shifted window (top-1 of 80.2).

In Table 2, we present our experimental results, which demonstrate that CAT-T achieves 80.3% top-1 accuracy with 65% fewer FLOPs than ResNet101 He et al. (2016). Meanwhile, the top-1 accuracy of CAT-S and CAT-B at a resolution of 224×224 is 81.8% and 82.8%, respectively, comparable with the state-of-the-art results in the table. For instance, compared with Swin-T Liu et al. (2021), which has similar computation, CAT-S improves by 0.5%. In particular, our method has a much stronger ability to catch the relationships between patches than the shifted operation in Swin Liu et al. (2021): the shifted window improves Swin-T by 1.1% top-1 accuracy, while CAT-S surpasses the non-shifted baseline by 1.6%.

4.2 Object detection


For object detection, we conduct experiments on COCO 2017 Lin et al. (2014) with the mAP metric; the dataset consists of 118k training, 5k validation, and 20k test images from 80 categories. We experiment on several frameworks to evaluate our architecture. A batch size of 16, an initial learning rate of 1e-4, and a weight decay of 0.05 are used in our experiments. The AdamW Loshchilov and Hutter (2018) optimizer, a 1x schedule, and NMS Neubeck and Van Gool (2006) are employed. Other settings are the same as MMDetection Chen et al. (2019). Note that a stochastic depth Huang et al. (2016) rate of 0.2 is used to avoid overfitting. For the multi-scale strategy, we train by randomly selecting a shorter side from 480 to 800, spaced by 32, while keeping the longer side below 1333, the same as Carion et al. (2020); Sun et al. (2020).


Method Backbone AP(box) AP50(box) AP75(box) AP(mask) AP50(mask) AP75(mask) Params(M) FLOPs(G)


Mask R-CNNHe et al. (2017) ResNet50He et al. (2016) 38.0 58.6 41.4 34.4 55.1 36.7 44 260
ResNet101He et al. (2016) 40.4 61.1 44.2 36.4 57.7 38.8 63 336
CAT-S(ours) 41.6 65.1 45.4 38.6 62.2 41.0 57 295
CAT-B(ours) 41.8 65.4 45.2 38.7 62.3 41.4 71 356


Method Backbone AP AP50 AP75 APS APM APL Params(M) FLOPs(G)


FCOSTian et al. (2019) ResNet50He et al. (2016) 36.6 56.0 38.8 21.0 40.6 47.0 32 201
ResNet101He et al. (2016) 39.1 58.3 42.1 22.7 43.3 50.3 51 277
CAT-S(ours) 40.0 60.7 42.6 24.5 42.7 52.4 45 245
CAT-B(ours) 41.0 62.0 43.2 25.7 43.5 53.8 59 303
ATSSZhang et al. (2020b) ResNet50He et al. (2016) 39.4 57.6 42.8 23.6 42.9 50.3 32 205
ResNet101He et al. (2016) 41.5 59.9 45.2 24.2 45.9 53.3 51 281
CAT-S(ours) 42.0 61.6 45.3 26.4 44.6 54.9 45 243
CAT-B(ours) 42.5 62.4 45.8 27.8 45.2 56.0 59 303
RetinaNetLin et al. (2017b) ResNet50He et al. (2016) 36.3 55.3 38.6 19.3 40.0 48.8 38 234
ResNet101He et al. (2016) 38.5 57.8 41.2 21.4 42.6 51.1 57 315
CAT-S(ours) 40.1 61.0 42.6 24.9 43.6 52.8 47 276
CAT-B(ours) 41.4 62.9 43.8 24.9 44.6 55.2 62 337
Cascade R-CNNCai and Vasconcelos (2018) ResNet50He et al. (2016) 40.4 58.9 44.1 22.8 43.7 54.0 69 245
ResNet101He et al. (2016) 42.3 60.8 46.1 23.8 46.2 56.4 88 311
CAT-S(ours) 44.1 64.3 47.9 28.2 46.9 58.2 82 270
CAT-B(ours) 44.8 64.9 48.8 27.7 47.4 59.7 96 330
CAT-S(ours, multi-scale) 45.2 65.6 49.2 30.2 48.6 58.2 82 270
CAT-B(ours, multi-scale) 46.3 66.8 49.9 30.8 49.5 59.7 96 330


Table 3: The comparison of CAT with other backbones under various methods on COCO detection. The last two Cascade R-CNN rows are trained with the multi-scale strategy. FLOPs are evaluated at the detection input resolution.

As demonstrated in Table 3, we used CAT-S and CAT-B as backbones in several anchor-based and anchor-free frameworks; both achieve better performance with comparable or fewer computations. CAT-S improves FCOS Tian et al. (2019) by 3.4%, RetinaNet Lin et al. (2017b) by 3.7%, and Cascade R-CNN Cai and Vasconcelos (2018) by 4.8% with the multi-scale strategy. For instance segmentation, we use the Mask R-CNN He et al. (2017) framework, and the mask mAP improves by 4.2% with CAT-S. All methods we experimented on perform better than the originals, demonstrating that CAT is the better feature extractor.

4.3 Semantic Segmentation


For semantic segmentation, we experiment on ADE20K Zhou et al. (2017), which has 20k images for training, 2k for validation, and 3k for testing. The settings are as follows: the initial learning rate is 6e-5, the batch size is 16 for a total of 160k or 80k iterations, the weight decay is 0.05, and the warm-up lasts 1500 iterations. We conduct experiments in the Semantic FPN Kirillov et al. (2019) framework, using the basic settings in MMSegmentation Contributors (2020). Note that a stochastic depth Huang et al. (2016) rate of 0.2 is used in CAT during training.


Method Backbone Params(M) FLOPs(G) mIoU


DANetFu et al. (2019) ResNet101He et al. (2016) 69 1119 45.0
OCRNetYuan et al. (2019) ResNet101He et al. (2016) 56 923 44.1
OCRNetYuan et al. (2019) HRNet-w48Sun et al. (2019a) 71 664 45.7
DeeplabV3+Chen et al. (2018) ResNet101He et al. (2016) 63 1021 44.1
DeeplabV3+Chen et al. (2018) ResNeSt-101Zhang et al. (2020a) 66 1051 46.9
UperNetXiao et al. (2018) ResNet101He et al. (2016) 86 1029 44.9
SETRZheng et al. (2020) 308 - 50.3
Semantic FPNKirillov et al. (2019) ResNet50He et al. (2016) 29 183 39.1
ResNet101He et al. (2016) 48 260 40.7
CAT-S(ours) 41 214 42.8
CAT-B(ours) 55 276 44.9
ResNet50He et al. (2016) 29 183 36.7
ResNet101He et al. (2016) 48 260 38.8
CAT-S(ours) 41 214 42.1
CAT-B(ours) 55 276 43.6


Table 4: Semantic segmentation performance on ADE20K. SETR is pre-trained on ImageNet-22K; the last four rows are trained with 80k iterations (the others with 160k). FLOPs are evaluated at the segmentation input resolution.

As shown in Table 4, we employ CAT-S and CAT-B as backbones in the Semantic FPN Kirillov et al. (2019) framework. Semantic FPN achieves better performance with CAT-S and CAT-B; in particular, we achieve 44.9% mIoU with 160k iterations and CAT-B, a 4.2% improvement over ResNet101 He et al. (2016) as the backbone, giving Semantic FPN performance comparable to other methods. For 80k iterations, the result is enhanced by 4.8%, which illustrates that our architecture is a more powerful backbone than ResNet He et al. (2016).

4.4 Ablation Study

In this section, we report the results of ablation experiments on some designs made in the architecture and in conducting the experiments, on ImageNet-1K Deng et al. (2009), COCO 2017 Lin et al. (2014), and ADE20K Zhou et al. (2017).

Patch Embedding function

We compare the embedding functions in the patch embedding layer: the convolutional method and the slicing method in Ultralytics (2021). The former applies a convolutional layer with a kernel size of 4 and a stride of 4 to reduce the input resolution to 1/4 of the original; the latter slices the input from H×W×C to (H/S)×(W/S)×(S²·C), where S in ours is 4 so that it matches the former. The results in Table 5 show that the two methods have the same performance. To better compare with other work Liu et al. (2021), we choose the convolutional method as the default setting.
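The slicing route can be illustrated on a tiny single-channel image. `slice_input` is a hypothetical helper that moves each s×s block of pixels into the channel dimension (shown with s = 2 for brevity; the text uses S = 4, and the exact pixel ordering in Ultralytics (2021) uses strided slices instead, though the shape arithmetic is identical):

```python
# Space-to-depth slicing on a nested-list "image" (H x W, single channel):
# each s x s block of pixels becomes one spatial position with s*s channels,
# so H x W x C -> (H/s) x (W/s) x (s*s*C), matching the text's shapes.

def slice_input(img, s=2):
    h, w = len(img), len(img[0])
    return [
        [[img[i * s + di][j * s + dj] for di in range(s) for dj in range(s)]
         for j in range(w // s)]
        for i in range(h // s)
    ]

img = [[1, 2],
       [3, 4]]
print(slice_input(img))  # [[[1, 2, 3, 4]]]
```

A learned linear projection over the s²·C channels then yields the target embedding dimension, which is why the two routes can match in accuracy.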

Multi-head and shifted window

Multi-head attention was proposed in Vaswani et al. (2017) to represent different semantic features among words. We set the number of heads equal to the patch size in each CPSA, which does not help performance, as presented in Table 5. To study the shifted window of Swin Liu et al. (2021), we also experimented with and without the shifted window in the third block of CAB; the result shows that the shifted operation does not perform better in our architecture.

Table 5: Ablation study on multi-head in CPSA, the shifted window in the second IPSA block of CAB, and the slice vs. convolutional method in the patch embedding layer, using the CAT-S architecture on ImageNet-1K.

  Setting      Top-1(%)
  multi-head   81.7
  shifted      81.6
  slice        81.8
  conv.        81.8

Table 6: Ablation study on the absolute position encoding and dropout in the self-attention of CPSA on three benchmarks with the CAT-S architecture. FCOS Tian et al. (2019) with a 1x schedule on COCO 2017 and Semantic FPN Kirillov et al. (2019) with 80k iterations on ADE20K are used. attn.d: dropout in self-attention. abs.pos.: absolute position encoding.

               ImageNet       COCO 2017           ADE20K
  Setting      top-1  top-5   AP    AP50   AP75   mIoU
  no attn.d    81.5   95.2    39.8  60.5   43.0   42.0
  attn.d 0.2   81.8   95.6    40.0  60.7   43.2   42.1
  no abs.pos.  81.6   95.3    39.6  60.2   42.9   41.8
  abs.pos.     81.8   95.6    40.0  60.7   43.2   42.1
Absolute position and dropout in self-attention of CPSA

We conduct an ablation study on absolute position encoding for CPSA, and it improves the performance on all three benchmarks. For better training, we adopt dropout Srivastava et al. (2014) in the self-attention of CPSA with rates of 0.0 and 0.2; the rate of 0.2 achieves the best performance, illustrating that there is slight overfitting in CPSA. All results are reported in Table 6.

5 Conclusion

In this paper, Cross Attention is proposed to better combine the virtue of local feature extraction in CNN with the virtue of global information extraction in Transformer, and to build with it a robust backbone, CAT. It can generate features at different scales similar to most CNN-based networks, and it can also adapt to different input sizes for other vision tasks. CAT achieves state-of-the-art performance on various vision task datasets (e.g., ImageNet-1K Deng et al. (2009), COCO 2017 Lin et al. (2014), ADE20K Zhou et al. (2017)). The key is that we alternate attention within feature-map patches and attention on single-channel feature maps, capturing local and global information without greatly increasing the computation. We hope that our work will be a step in the direction of integrating CNN and Transformer to create a multi-domain approach.


  • [1] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: Figure 2.
  • [2] H. Bao, L. Dong, F. Wei, W. Wang, N. Yang, X. Liu, Y. Wang, J. Gao, S. Piao, M. Zhou, et al. (2020) Unilmv2: pseudo-masked language models for unified language model pre-training. In International Conference on Machine Learning, pp. 642–652. Cited by: §3.1.2.
  • [3] A. Bochkovskiy, C. Wang, and H. M. Liao (2020) Yolov4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934. Cited by: §1, §3.1.1.
  • [4] W. Brendel and M. Bethge (2019) Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. arXiv preprint arXiv:1904.00760. Cited by: §2.
  • [5] A. Brock, S. De, S. L. Smith, and K. Simonyan (2021) High-performance large-scale image recognition without normalization. arXiv preprint arXiv:2102.06171. Cited by: §2.
  • [6] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162. Cited by: §4.2, Table 3.
  • [7] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: §2, §4.2.
  • [8] C. Chen, Q. Fan, and R. Panda (2021) Crossvit: cross-attention multi-scale vision transformer for image classification. arXiv preprint arXiv:2103.14899. Cited by: §1, §2, Table 2.
  • [9] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §4.2.
  • [10] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062. Cited by: §1, §3.1.1.
  • [11] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §1.
  • [12] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §1.
  • [13] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §1, §2, Table 4.
  • [14] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever (2020) Generative pretraining from pixels. In International Conference on Machine Learning, pp. 1691–1703. Cited by: §1.
  • [15] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258. Cited by: §2, §3.1.2.
  • [16] M. Contributors (2020) MMSegmentation: openmmlab semantic segmentation toolbox and benchmark. Note: Cited by: §4.3.
  • [17] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.1, §4.4, §4, §5.
  • [18] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2, §2.
  • [19] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §1, §2, §3.1.2, §3.1.3, §3.1, Table 2.
  • [20] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3146–3154. Cited by: Table 4.
  • [21] S. Gao, M. Cheng, K. Zhao, X. Zhang, M. Yang, and P. H. Torr (2019) Res2net: a new multi-scale backbone architecture. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §2.
  • [22] G. Ghiasi, T. Lin, and Q. V. Le (2019) Nas-fpn: learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7036–7045. Cited by: §1, §2.
  • [23] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §3.1.1.
  • [24] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §3.1.1.
  • [25] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang (2021) Transformer in transformer. arXiv preprint arXiv:2103.00112. Cited by: §1, §2, §3.1.1, Table 2.
  • [26] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §4.2, Table 3.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §2, §2, Figure 2, §3.1, §4.1, §4.3, Table 2, Table 3, Table 4.
  • [28] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
  • [29] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1, §2, §3.1.2.
  • [30] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324. Cited by: §2, §2.
  • [31] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei (2018) Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597. Cited by: §3.1.2.
  • [32] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016) Deep networks with stochastic depth. In European conference on computer vision, pp. 646–661. Cited by: §4.1, §4.2, §4.3.
  • [33] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. Cited by: §1, §2.
  • [34] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2020) Spanbert: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8, pp. 64–77. Cited by: §2.
  • [35] A. Kirillov, R. Girshick, K. He, and P. Dollár (2019) Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6399–6408. Cited by: §4.3, Table 4.
  • [36] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: §1.
  • [37] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §2.
  • [38] Y. LeCun, Y. Bengio, et al. (1995) Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361 (10), pp. 1995. Cited by: §1.
  • [39] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §1, §2.
  • [40] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §1, §3.1.1, §4.2, Table 3.
  • [41] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.2, §4.4, §4, §5.
  • [42] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia (2018) Path aggregation network for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8759–8768. Cited by: §2.
  • [43] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1, §2.
  • [44] W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, and P. Wang (2020) K-bert: enabling language representation with knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 2901–2908. Cited by: §2.
  • [45] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2.
  • [46] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030. Cited by: §1, §2, §2, §3.1.2, §3.1.2, §3.1, §4.1, §4.1, §4.4, §4.4, Table 2.
  • [47] I. Loshchilov and F. Hutter (2018) Fixing weight decay regularization in adam. Cited by: §4.1, §4.2.
  • [48] D. G. Lowe (1999) Object recognition from local scale-invariant features. In Proceedings of the seventh IEEE international conference on computer vision, Vol. 2, pp. 1150–1157. Cited by: §2.
  • [49] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pp. 116–131. Cited by: §2.
  • [50] A. Neubeck and L. Van Gool (2006) Efficient non-maximum suppression. In 18th International Conference on Pattern Recognition (ICPR’06), Vol. 3, pp. 850–855. Cited by: §4.2.
  • [51] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár (2020) Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10428–10436. Cited by: Table 2.
  • [52] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §1.
  • [53] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497. Cited by: §1, §3.1.1.
  • [54] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §2.
  • [55] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §2.
  • [56] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §4.1, §4.4.
  • [57] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703. Cited by: §1, §2, §3.1.1, Table 4.
  • [58] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang, et al. (2020) Sparse r-cnn: end-to-end object detection with learnable proposals. arXiv preprint arXiv:2011.12450. Cited by: §4.2.
  • [59] Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, and H. Wu (2019) Ernie: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223. Cited by: §2.
  • [60] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: §2.
  • [61] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1, §2.
  • [62] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §2.
  • [63] M. Tan and Q. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. Cited by: §1, §2, Table 2.
  • [64] M. Tan, R. Pang, and Q. V. Le (2020) Efficientdet: scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10781–10790. Cited by: §2.
  • [65] Z. Tian, C. Shen, H. Chen, and T. He (2019) Fcos: fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636. Cited by: §1, §4.2, §4.4, Table 3.
  • [66] Z. Tian, C. Shen, and H. Chen (2020) Conditional convolutions for instance segmentation. arXiv preprint arXiv:2003.05664. Cited by: §3.1.1.
  • [67] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou (2020) Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877. Cited by: §1, §2, §3.1.2, §4.1.
  • [68] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou (2021) Going deeper with image transformers. arXiv preprint arXiv:2103.17239. Cited by: §1.
  • [69] Ultralytics (2021) YOLOv5. Note: Cited by: §1, §3.1.1, §3.1.3, §4.4.
  • [70] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §1, §2, §3.1.2, §4.4.
  • [71] C. Wang, A. Bochkovskiy, and H. M. Liao (2020) Scaled-yolov4: scaling cross stage partial network. arXiv preprint arXiv:2011.08036. Cited by: §3.1.1.
  • [72] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L. Chen (2020) MaX-deeplab: end-to-end panoptic segmentation with mask transformers. arXiv preprint arXiv:2012.00759. Cited by: §2.
  • [73] W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122. Cited by: §1, §2, Table 2.
  • [74] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang (2021) Cvt: introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808. Cited by: §1, §2.
  • [75] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun (2018) Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434. Cited by: §4.3, §4.4, Table 4.
  • [76] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §1, §2, §2, §3.1, Table 2.
  • [77] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §2.
  • [78] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §1, §3.1.2.
  • [79] K. Yuan, S. Guo, Z. Liu, A. Zhou, F. Yu, and W. Wu (2021) Incorporating convolution designs into visual transformers. arXiv preprint arXiv:2103.11816. Cited by: §2.
  • [80] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. Tay, J. Feng, and S. Yan (2021) Tokens-to-token vit: training vision transformers from scratch on imagenet. arXiv preprint arXiv:2101.11986. Cited by: §1.
  • [81] Y. Yuan, X. Chen, and J. Wang (2019) Object-contextual representations for semantic segmentation. arXiv preprint arXiv:1909.11065. Cited by: §1, §2, Table 4.
  • [82] W. Zaremba, I. Sutskever, and O. Vinyals (2014) Recurrent neural network regularization. arXiv preprint arXiv:1409.2329. Cited by: §1.
  • [83] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He, J. Mueller, R. Manmatha, et al. (2020) Resnest: split-attention networks. arXiv preprint arXiv:2004.08955. Cited by: Table 4.
  • [84] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li (2020) Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9759–9768. Cited by: Table 3.
  • [85] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6848–6856. Cited by: §1, §2.
  • [86] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §1, §2, §3.1.1.
  • [87] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, et al. (2020) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2012.15840. Cited by: §2.
  • [88] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641. Cited by: §4.3, §4.4, §4, §5.
  • [89] D. Zhou, B. Kang, X. Jin, L. Yang, X. Lian, Q. Hou, and J. Feng (2021) Deepvit: towards deeper vision transformer. arXiv preprint arXiv:2103.11886. Cited by: §1.
  • [90] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2020) Deformable detr: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159. Cited by: §2.