Dense prediction vision tasks, e.g., semantic segmentation and object detection, are critical workloads on modern intelligent computing platforms, e.g., AR/VR devices. Convolutional neural networks (CNNs) have rapidly evolved with significant improvements in dense prediction tasks [18, 16, 4, 21, 24, 1]. Beyond classical CNNs, vision transformers (ViTs) have attracted extensive interest and shown competitive performance in vision tasks [2, 10, 33, 15, 35, 23, 17, 5, 29, 9, 26, 27, 3, 30]. Benefiting from self-attention operations, ViTs embrace strong expressivity with long-distance information interaction. However, ViTs produce single-scale and low-resolution representations, which are not compatible with dense prediction workloads that require high position sensitivity and fine-grained image details.
Recently, various ViT backbones have been proposed to adapt to dense prediction tasks. Prior ViT backbones proposed various efficient global/local self-attention mechanisms to extract hierarchical features [25, 17, 32, 26, 5, 29, 9]. A multi-scale ViT (MViT) has been proposed to learn a hierarchy that progressively expands the channel capacity while reducing the spatial resolution. However, these backbones still follow a classification-like network topology with a sequential or series architecture. For complexity considerations, they gradually downsample the feature maps to extract higher-level low-resolution (LR) representations and directly feed each stage's output to the downstream framework. Such sequential structures lack enough cross-scale interaction and thus cannot generate high-quality high-resolution (HR) representations.
HRNet  was proposed to enhance the cross-resolution interaction with a multi-branch architecture that maintains all resolutions throughout the network. Multi-resolution features are extracted in parallel and fused repeatedly to generate high-quality HR representations with richer semantic information. Such a design concept has achieved great success in various dense prediction tasks. Nevertheless, its expressivity is limited by small receptive fields and strong inductive bias from cascaded convolution operations. Later, a slimmed Lite-HRNet  was put forward with efficient shuffle blocks and channel weighting operators. HR-NAS  inserted a lightweight transformer path into the residual blocks to extract global information and applied the neural architecture search to remove the channel/head redundancies. However, those improved HRNet designs are still mainly based on the convolutional building blocks, and the demonstrated performance of their tiny models is still far behind the SoTA scores of ViT counterparts.
Migrating the success of HRNet to ViT designs is non-trivial. Given the high complexity of multi-branch HR architectures and self-attention operations, simply replacing all residual blocks in HRNet with transformer blocks will encounter severe scalability issues. The inherited powerful representability will be overwhelmed by the prohibitive hardware cost without careful architecture-block co-optimization.
To enhance ViTs with stronger representability to generate semantically-rich and position-precise features, in this work, we present HRViT, an efficient multi-scale high-resolution vision transformer backbone specifically optimized for high-resolution dense prediction tasks. Our goal is to facilitate efficient multi-scale representation learning for vision transformers. HRViT
is different from prior sequential ViTs in several aspects: 1) our multi-branch HR architecture extracts multi-scale features in parallel with cross-resolution fusion to enhance the multi-scale representability of ViTs; 2) our augmented local self-attention removes redundant keys and values for better efficiency and enhances expressivity with extra convolution paths, additional nonlinearity, and auxiliary shortcuts that improve feature diversity; 3) we adopt mixed-scale convolutional feedforward networks to fortify the multi-scale feature extraction; 4) our HR convolutional stem and efficient patch embedding layers maintain more low-level fine-grained features at reduced hardware cost. Also, distinguished from the HRNet family, our HRViT follows a unique heterogeneous branch design to balance efficiency and performance; it is not simply an improved HRNet but a new topology of pure ViTs mainly constructed from self-attention operators. Our main contributions are as follows:
We deeply investigate the multi-scale representation learning in ViTs and integrate high-resolution architecture with vision transformers for high-performance dense prediction vision tasks.
To enable scalable HR-ViT integration with better performance and efficiency trade-off, we leverage the redundancy in transformer blocks and perform joint optimization on key components of HRViT with heterogeneous branch designs.
The proposed HRViT achieves 50.20% mIoU on ADE20K val and 83.16% mIoU on Cityscapes val for semantic segmentation tasks, outperforming state-of-the-art (SoTA) MiT and CSWin with +1.78 higher mIoU, 28% fewer parameters, and 21% lower FLOPs on average.
2 Proposed HRViT Architecture
Compared with the surge in sophisticated attention operator innovations, the multi-scale representation learning of ViTs is much less explored, which is far behind the recent advance in their CNN counterparts. New topology designs create another dimension to unleash the potential of ViTs with even stronger vision expressivity. An important question that remains to be answered is whether the success of HRNet can be efficiently migrated to ViT backbones to consolidate their leading position in high-resolution dense prediction tasks.
In this section, we delve into the multi-scale representation learning in ViTs and introduce a hardware-efficient integration of the HR architecture and ViTs.
2.1 Architecture overview
We illustrate the architecture of HRViT in Figure 1. It consists of a convolutional stem to reduce spatial dimensions while extracting low-level features. Then we construct four progressive transformer stages, where the i-th stage contains i parallel multi-scale transformer branches. Each stage can have one or more modules. Each module starts with a lightweight dense fusion layer to achieve cross-resolution interaction and an efficient patch embedding block for local feature extraction, followed by repeated augmented local self-attention blocks (HRViTAttn) and mixed-scale convolutional feedforward networks (MixCFN). Unlike sequential ViT backbones that progressively reduce the spatial dimension to generate pyramid features, we maintain the HR features throughout the network to strengthen the quality of HR representations via cross-resolution fusion.
2.2 Efficient HR-ViT integration with heterogeneous branch design
We design a heterogeneous multi-branch architecture for hardware-efficient multi-scale high-resolution ViTs. A straightforward choice is to replace all convolutions in HRNet with self-attentions. However, given the high complexity of multi-branch HRNet and self-attention operators, this brute-force combination quickly causes an explosion in memory footprint, parameter size, and computational cost. The real challenge is that we want to leverage both the superior multi-scale representability of HR architectures and the strong modeling capacity of transformers, while overcoming the enormous complexity and making the integration even more hardware-efficient than both. Hence, careful architecture and block co-design is critical to a scalable and efficient HR-ViT integration.
Heterogeneous branch configuration. The first question is how to configure each branch for a scalable HRViT design. Simply assigning the same number of blocks with the same local self-attention window size to each module would make the design intractably costly. We give a detailed analysis of the functionality and cost of each branch in Table 1, based on which we summarize a simple design heuristic.
|Feature/Arch.|HR (i=1,2)|MR (i=3)|LR (i=4)|
|---|---|---|---|
|Eff. on class.|Not quite useful|Important|Important|
|Window size|Narrow (s=1,2)|Wide (s=7)|Wide (s=7)|
|Depth|Shallow (5-6)|Deep (20-30)|Shallow (4)|
We analyze the parameter count of the HRViTAttn and MixCFN blocks on the i-th branch (i=1,2,3,4), as well as the corresponding amount of floating-point operations (FLOPs): the parameter count grows quadratically with the branch channel width, while the FLOPs additionally scale with the spatial resolution of the branch.
The first and second HR branches (i=1,2) can barely generate useful high-level features for classification but incur high memory and computational costs. On the other hand, they are parameter-efficient and can provide fine-grained detail calibration in segmentation tasks. Thus we use a narrow attention window and a minimum number of blocks on the two HR paths.
The most important branch is the third one with a medium resolution (MR). Given its medium hardware cost, we can afford a deep branch with a large window size on the MR path to provide large receptive fields and well-extracted high-level features.
The lowest-resolution (LR) branch contains the most parameters and is very useful to provide high-level features as coarse segmentation maps. However, its small spatial size loses too many image details. Therefore, we only put a few blocks with a large window size on the LR branch to improve high-level feature quality under the parameter budget.
Nearly-even block assignment. Once we decide the total branch depth, a unique question, which does not exist in the sequential ViT variants, is how to assign those blocks to each module. In our example HRViT, we need to assign 20 blocks to 4 modules on the 3rd path. To maximize the average depth of the network ensemble and help the input/gradient flow through the deep transformer branch, we prefer a nearly-even partitioning, e.g., 6-6-6-2, to an extremely unbalanced assignment, e.g., 17-1-1-1.
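For illustration, a strictly even split can be computed as below (hypothetical helper, not from the paper; the paper hand-picks nearly-even assignments such as 6-6-6-2 rather than enforcing an exact rule):

```python
def nearly_even_assignment(total_blocks, num_modules):
    """Split `total_blocks` transformer blocks across `num_modules` modules
    as evenly as possible, front-loading the remainder.  This is only a
    baseline for the heuristic; the paper's hand-tuned 6-6-6-2 deviates
    slightly from a strictly even 5-5-5-5 split of 20 blocks."""
    base, rem = divmod(total_blocks, num_modules)
    return [base + 1 if m < rem else base for m in range(num_modules)]
```

Any nearly-even assignment keeps the average path depth of the deep branch high, which is the property the heuristic cares about.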
2.3 Efficient HRViT component design
We now give a detailed introduction to the optimized building blocks and key features of HRViT.
Augmented cross-shaped local self-attention. To achieve high performance with improved efficiency, a hardware-efficient self-attention operator is necessary. We adopt the efficient cross-shaped self-attention of CSWin as our baseline attention operator and, based on it, design our augmented cross-shaped local self-attention, HRViTAttn. This attention has the following advantages. (1) Fine-grained attention: compared with globally-downsampled attentions [25, 29], it performs fine-grained feature aggregation that preserves detailed information. (2) Approximated global view: by using two parallel orthogonal local attentions, it can collect global information. (3) Scalable complexity: one dimension of each window is fixed, which avoids complexity quadratic in the image size.
To balance performance and hardware efficiency, we introduce our augmented version, denoted as HRViTAttn, with several key optimizations.
Figure 2: (a) HRViTAttn: augmented cross-shaped local self-attention with a parallel convolution path and an efficient diversity-enhanced shortcut. (b) Window zero-padding with attention map masking.
As shown in Figure 2(a), we follow the cross-shaped window partitioning approach in CSWin: the input is split into two halves along the channel dimension, where one half is partitioned into disjoint horizontal windows and the other half is chunked into vertical windows. Each window has a fixed size s along one spatial dimension and spans the full image along the other. Within each window, the patches are chunked into heads, and a local self-attention y_k = softmax(Q_k K_k^T / √d) V_k is applied per head,
where W^Q_k, W^K_k, and W^V_k are the projection matrices that generate the query Q_k, key K_k, and value V_k tensors for the k-th head, W^O is the output projection matrix, and the nonlinearity is a Hardswish activation. If the image size is not a multiple of the window size, we apply zero-padding to the inputs to complete the last window, as shown in Figure 2(b). The padded region in the attention map is then masked to 0 to avoid incoherent semantic correlation.
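The padding and masking logic can be sketched as follows (illustrative NumPy helpers, not the paper's implementation; the actual code masks attention scores before the softmax so padded positions receive zero weight):

```python
import numpy as np

def pad_len(length, s):
    """Spatial length after zero-padding so it becomes a multiple of the
    window size s, allowing a complete last window."""
    return (length + s - 1) // s * s

def attention_mask(length, s):
    """Boolean mask over the padded positions of one spatial axis:
    True for real tokens, False for zero-padded ones."""
    padded = pad_len(length, s)
    mask = np.zeros(padded, dtype=bool)
    mask[:length] = True
    return mask
```

For example, an axis of 10 tokens with window size 7 is padded to 14, and the mask marks the 4 padded tokens so they contribute nothing to the attention map.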
The original QKV linear layers are quite costly in computation and parameters. In HRViTAttn, we therefore share the linear projections for the key and value tensors, i.e., keys and values are generated by the same projection matrix, to save computation and parameters.
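A toy single-head sketch of this key-value sharing (NumPy; the real block adds window partitioning, multiple heads, Hardswish, BatchNorm, the parallel convolution path, and an output projection, so every name here is illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_kv_attention(x, w_q, w_kv):
    """Single-head self-attention where one projection `w_kv` produces a
    tensor used as both key and value, replacing two separate K/V layers."""
    q = x @ w_q
    kv = x @ w_kv  # one linear layer serves as both key and value
    scale = 1.0 / np.sqrt(x.shape[-1])
    attn = softmax(q @ kv.T * scale)
    return attn @ kv
```

Compared with separate K and V projections, this removes one of the three input projections, roughly a third of the QKV cost.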
Besides, we introduce an auxiliary path with parallel depth-wise convolution to inject inductive bias to facilitate training. Different from the local positional encoding in CSWin, our parallel path is nonlinear and applied on the entire 4-D feature map without window-partitioning. This path can be treated as an inverted residual module sharing point-wise convolutions with the linear projection layers in self-attention. This shared path can effectively inject inductive bias and reinforce local feature aggregation with marginal hardware overhead.
As a performance compensation for the above key-value sharing, we introduce an extra Hardswish function to improve the nonlinearity. We also append a BatchNorm (BN) layer, initialized to an identity projection, to stabilize the distribution for better trainability. Recent studies revealed that different transformer layers tend to have very similar features, where the shortcut plays a critical role. Inspired by the augmented shortcut
, we add a channel-wise projector as a diversity-enhanced shortcut (DES). The main difference is that our shortcut has higher nonlinearity and does not depend on hardware-unfriendly Fourier transforms. The projection matrix in our DES is approximated by a Kronecker decomposition to minimize parameter cost. We then fold the input feature accordingly and apply the two small Kronecker factors separately to save computation, and we insert a Hardswish after the projection to increase the nonlinearity.
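To illustrate why the Kronecker-factored projector is cheap, the sketch below applies P = A ⊗ B to a channel vector without ever materializing the full C×C matrix (NumPy sketch; the factor shapes and the exact folding are our assumptions, following the paper only in spirit):

```python
import numpy as np

def hardswish(x):
    """Hardswish activation: x * clip(x + 3, 0, 6) / 6."""
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

def des_projection(x, a, b):
    """Apply P = kron(A, B) to a channel vector x with C = C1*C2 entries
    using the identity kron(A, B) @ x == (A @ X @ B.T).reshape(-1),
    where X = x.reshape(C1, C2) (row-major).  Cost drops from O(C^2)
    to O(C*(C1 + C2)).  Hardswish placement is illustrative."""
    c1, c2 = a.shape[0], b.shape[0]
    X = x.reshape(c1, c2)
    return hardswish((a @ X @ b.T).reshape(-1))
```

For C = 512 with C1 = C2 = √C, the projector needs 2·C parameters instead of C², which is why the DES overhead is negligible.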
Mixed-scale convolutional feedforward network. Inspired by the MixFFN in MiT  and multi-branch inverted residual blocks in HR-NAS , we design a mixed-scale convolutional FFN (MixCFN) by inserting two multi-scale depth-wise convolution paths between two linear layers.
After LayerNorm, we expand the channels by a ratio of r and split them into two branches. 3×3 and 5×5 depth-wise convolutions (DWConv) are used to increase the multi-scale local information extraction of HRViT. For efficiency, we exploit channel redundancy by reducing the MixCFN expansion ratio from 4 [29, 17] to 2 or 3 with marginal performance loss on medium to large models.
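A minimal NumPy sketch of the MixCFN data flow (LayerNorm, activation, and biases omitted; the naive convolution loop and all names/shapes are illustrative only):

```python
import numpy as np

def depthwise_conv2d(x, k):
    """Naive 'same'-padded depth-wise 2D convolution.
    x: (C, H, W) feature map, k: (C, kh, kw) per-channel kernels."""
    c, h, w = x.shape
    kh, kw = k.shape[1:]
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.empty_like(x)
    for ci in range(c):
        for i in range(h):
            for j in range(w):
                out[ci, i, j] = np.sum(xp[ci, i:i + kh, j:j + kw] * k[ci])
    return out

def mixcfn(x, w_in, k3, k5, w_out):
    """MixCFN sketch: expand channels (w_in: (E, C)), split into a 3x3
    and a 5x5 depth-wise path, concatenate, project back (w_out: (C, E))."""
    h = np.einsum('ec,chw->ehw', w_in, x)   # channel expansion
    h3, h5 = np.split(h, 2, axis=0)         # two multi-scale branches
    h = np.concatenate([depthwise_conv2d(h3, k3),
                        depthwise_conv2d(h5, k5)], axis=0)
    return np.einsum('ce,ehw->chw', w_out, h)
```

The two kernel sizes give the FFN two local receptive fields at once, which is the "mixed-scale" part of the design.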
Downsampling stem. In dense prediction tasks, images are of high resolution, e.g., 1024×1024. Self-attention operators are known to be expensive, as their complexity is quadratic in the image size. To address the scalability issue when processing large images, we down-sample the inputs by 4× before feeding them into the main body of HRViT. We do not use attention operations in the stem, since early convolutions are more effective than self-attention at extracting low-level features [12, 28]. On the other hand, instead of simply using a stride-4 convolution as in prior ViTs [29, 5, 17], we follow the design in HRNet and use two stride-2 CONV-BN-ReLU blocks as a stronger downsampling stem that extracts features with more information maintained.
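Assuming the common 3×3, stride-2, padding-1 configuration for each CONV-BN-ReLU block (our assumption; the text only specifies the strides), the stem's total 4× downsampling can be checked with simple shape arithmetic:

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Output spatial size of a convolution: floor((size + 2p - k)/s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Two stride-2 blocks halve the resolution twice: 1024 -> 512 -> 256.
stem_out = conv_out_size(conv_out_size(1024))
```

The intermediate 512×512 feature map is what lets the two-step stem retain more information than a single stride-4 convolution of the same output size.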
Efficient patch embedding. Before the transformer blocks in each module, we place a patch embedding block (CONV-LayerNorm) on each branch. It is used to match channels and extract patch information with enhanced inter-patch communication. Unlike sequential architectures, which only have 4 embedding layers, we found that the patch embedding layers incur a non-trivial hardware cost in the HR architecture, since each module at stage i has i embedding blocks. We slim them down with a blueprint convolution, i.e., a point-wise CONV followed by a depth-wise CONV.
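The parameter saving of the blueprint convolution can be sketched with a simple count (biases and norm layers omitted; the channel numbers below are illustrative, not taken from the paper):

```python
def conv_params(c_in, c_out, k):
    """Parameter count of a standard k x k convolution."""
    return c_in * c_out * k * k

def blueprint_conv_params(c_in, c_out, k):
    """Blueprint convolution: point-wise CONV (c_in -> c_out)
    followed by a depth-wise k x k CONV over c_out channels."""
    return c_in * c_out + c_out * k * k
```

For a 3×3 layer with 64 input and output channels, the standard form costs 36,864 parameters while the blueprint form costs 4,672, roughly a 7.9× reduction, which is why the embedding layers stop dominating the HR architecture's budget.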
Cross-resolution fusion layer. The cross-resolution fusion layer is critical for HRViT to learn high-quality HR representations, shown in Figure 4. To impose more cross-resolution interaction, we borrow the idea from HRNet [24, 31] to insert repeated cross-resolution fusion layers at the beginning of each module.
To help LR features maintain more image details and precise position information, we merge them with down-sampled HR features. Instead of using a progressive convolution-based downsampling path to match tensor shapes [24, 31], we employ a direct down-sampling path to minimize hardware overhead. In the down-sampling path between the i-th input and the j-th output (i<j), we use a depth-wise separable convolution with a stride of 2^(j-i) to shrink the spatial dimension and match the output channels. The kernel size used in the DWConv is set to (2^(j-i)+1) to create patch overlaps. These HR paths inject more image information into the LR path to mitigate information loss and fortify gradient flows during backpropagation, facilitating the training of LR transformer blocks.
On the other hand, the receptive field is usually limited in the HR blocks, as we minimize the window size and branch depth on HR paths. Hence, we merge LR representations into HR paths to help them obtain higher-level features with a larger receptive field. Specifically, in the up-scaling path (i>j), we first increase the number of channels with a point-wise convolution and up-scale the spatial dimension via nearest-neighbor interpolation with a rate of 2^(i-j). When i=j, we directly pass the features to the output as a skip connection. Note that in HR-NAS, the dense fusion is simplified by a sparse fusion module where only neighboring resolutions are merged. This technique is not adopted in HRViT since it saves marginal hardware cost but leads to a noticeable accuracy drop, as shown in the ablation study later.
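A small sketch of the fusion-path hyper-parameters, assuming the down-sampling stride is 2^(j-i) with a DWConv kernel of 2^(j-i)+1 and the up-scaling rate is 2^(i-j) (our reading of the text; branch indices grow as resolution shrinks):

```python
def down_path(i, j):
    """Stride and DWConv kernel size for fusing HR branch i into LR
    branch j (i < j): stride 2**(j-i), kernel 2**(j-i)+1 for overlap."""
    assert i < j
    stride = 2 ** (j - i)
    return stride, stride + 1

def up_scale(i, j):
    """Nearest-neighbor up-sampling rate for fusing LR branch i
    into HR branch j (i > j)."""
    assert i > j
    return 2 ** (i - j)
```

For example, fusing branch 1 into branch 3 uses a single stride-4, 5×5 depth-wise separable convolution instead of a chain of stride-2 convolutions.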
2.4 Architectural variants
Different HRViT variants scale both in network depth and width. Table 2 summarizes detailed branch designs of 3 variants.
|Variant|Architecture design|Window|MixCFN ratio|Channel|Head dim|
|---|---|---|---|---|---|
We follow the aforementioned design guidance: 1 transformer block per module on the HR branches, 20-24 blocks on the MR branch, and 4-6 blocks on the LR branch. Window sizes are set to 1, 2, 7, and 7 for the 4 branches. We use relatively large MixCFN expansion ratios in small variants for performance and reduce the ratio to 2 on larger variants for efficiency. Following the scaling rule from CSWin, we gradually increase the basic channel width of the highest-resolution branch from 32 to 64. The numbers of blocks and channels can be flexibly tuned on the 3rd/4th branches to match a specific hardware cost.
3 Experiments

We pretrain all models on ImageNet-1K and conduct experiments on ADE20K and Cityscapes for semantic segmentation. We compare the performance and efficiency of our HRViT with SoTA ViT backbones, i.e., Swin, Twins, MiT, and CSWin.
3.1 Semantic segmentation on ADE20K and Cityscapes
On semantic segmentation tasks, HRViT achieves the best performance-efficiency Pareto front, surpassing the SoTA MiT and CSWin under different settings. HRViT-b1 to -b3 outperform the previous SoTA SegFormer-MiT (B1-B3) with +3.68, +2.26, and +0.80 higher mIoU on ADE20K val, and +3.13, +1.81, and +1.46 higher mIoU on Cityscapes val.
ImageNet-1K pre-training. All ViT models are pre-trained on ImageNet-1K. We follow the same pre-training settings as DeiT  and other ViTs [17, 29, 9]. We adopt stochastic depth  for all HRViT variants with the max drop rate of 0.1. The drop rate is gradually increased on the deepest 3rd branch, and other shallow branches follow the rate of the 3rd branch within the same module. We use the HRNetV2  classification head in HRViT on ImageNet-1K pre-training. The pre-training results are in Table 3.
Settings. We evaluate HRViT for semantic segmentation on the Cityscapes and ADE20K datasets. We employ a lightweight SegFormer head based on the mmsegmentation framework. We follow the training settings of prior work [29, 9]. The training image sizes for ADE20K and Cityscapes are 512×512 and 1024×1024, respectively. We use an AdamW optimizer for 160k iterations with a 'poly' learning rate schedule, 1,500 steps of linear warm-up, an initial learning rate of 6e-5, a mini-batch size of 16, and a weight decay rate of 0.01. The test image sizes for ADE20K and Cityscapes are set to 512×2048 and 1024×2048, respectively. We run inference on Cityscapes with a sliding-window test by cropping 1024×1024 patches.
Results on ADE20K. We evaluate different ViT backbones in single-scale mean intersection-over-union (mIoU), #Params, and GFLOPs. Figure 5 plots the Pareto curves in the #params and FLOPs space. On ADE20K val, HRViT outperforms other ViTs with better performance and efficiency trade-off. For example, with the SegFormer head, HRViT-b1 outperforms MiT-B1 with 3.68% higher mIoU, 40% fewer parameters, and 8% less computation. Our HRViT-b3 achieves a higher mIoU than the best CSWin-S but saves 23% parameters and 13% FLOPs. Compared with the convolutional HRNetV2+OCR, our HRViT shows considerable performance advantages with significant hardware efficiency boost.
SegFormer Head:

|Backbone|#Param. (M)|GFLOPs|mIoU (%)|
|---|---|---|---|
Results on Cityscapes. In Table 4, our small model HRViT-b1 outperforms MiT-B1 and CSWin-Ti by +3.13 and +2.47 higher mIoU, which shows the larger effective width of HR architectures is especially effective on slim networks.
When training HRViT-b3 on Cityscapes, we set the multi-branch window sizes to 1-2-9-9. HRViT-b3 outperforms MiT-B4 with +0.86 higher mIoU, 55.4% fewer parameters, and 30.7% lower FLOPs. Compared with two SoTA ViT backbones, i.e., MiT and CSWin, HRViT achieves on average +2.16 higher mIoU with 30.7% fewer parameters and 22.3% less computation.
3.2 Ablation studies
In Table 5, we independently remove each technique from HRViT and evaluate on ImageNet and Cityscapes.
Sharing key-value. When removing key-value sharing, i.e., using independent keys and values, HRViT-b1 shows the same ImageNet-1K accuracy but lower Cityscapes segmentation mIoU, with 9% more parameters and 4% more computation.
|Ablated design|#Param. (M)|GFLOPs|Top-1 (%)|mIoU (%)|
|---|---|---|---|---|
|Eff. patch embed|9.9|16.5|80.19|81.18|
|Parallel CONV path|8.1|14.0|80.06|80.82|
|All block opt.|10.1|16.3|79.79|80.45|
Patch embedding. We change our efficient patch embedding to the CONV-based overlapped patch embedding. We observe 22% more parameters and 17% more FLOPs without accuracy/mIoU benefits.
MixCFN. Removing the mixed-scale convolutional feedforward block directly leads to a 0.66% ImageNet accuracy drop and a 0.11% Cityscapes mIoU loss with only marginal efficiency improvement. The MixCFN block is thus an important contributor to our performance.
Parallel CONV path. The embedded inverted residual path in the attention block is very lightweight but contributes 0.46% higher ImageNet accuracy and 0.81% higher mIoU.
Additional nonlinearity/BN. The extra Hardswish and BN introduce negligible overhead but boost expressivity and trainability, bringing 0.15% higher ImageNet-1K accuracy and 0.51% higher mIoU on Cityscapes val.
Dense vs. sparse fusion layers. The sparse fusion is not effective in HRViT, as it saves only marginal hardware cost (1%) but leads to a 0.57% accuracy drop and a 0.37% mIoU loss.
Diversity-enhanced shortcut. The nonlinear shortcut (DES) helps improve the feature diversity and effectively improves the performance to a higher level on multiple tasks. Negligible hardware cost is introduced due to the high efficiency of the Kronecker decomposition-based projector.
Naive HRNet-ViT vs. HRViT.
In Table 6, we directly replace the residual blocks in HRNetV2 with transformer blocks as a naive baseline. When comparing HRNet-MiT with the sequential MiT, we notice the HR variants have comparable mIoUs while significantly saving hardware cost. This shows that the multi-branch architecture is indeed helpful to boost the multi-scale representability. However, the naive HRNet-ViT overlooks the expensive cost of transformers; it is not scalable, as the hardware cost quickly outweighs the performance gain. In contrast, our heterogeneous branches and optimized components achieve good control of the hardware cost, enhance the model representability, and maintain good scalability.
4 Related Work
Multi-scale representation learning. Previous CNNs and ViTs progressively down-sample the feature map to compute the LR representations [18, 4, 10] and recover the HR features via up-sampling, e.g., SegNet, UNet, Hourglass. HRNet maintains the HR representations throughout the network with cross-resolution fusion. Lite-HRNet proposes conditional channel weighting blocks to exchange information across resolutions. HR-NAS searches the channel/head settings for inverted residual blocks and the auxiliary transformer branches. HRFormer improves HRNetV2 by replacing residual blocks with Swin transformer blocks. Different from the HRNet family, HRViT is a pure ViT backbone with a novel multi-branch topology that benefits from both HR architectures and self-attention. Besides, we explore heterogeneous branch designs and block optimization to boost hardware efficiency.
Multi-scale ViT backbones. Several multi-scale ViTs adopt hierarchical architectures to generate progressively down-sampled pyramid features, but they still follow the design concept of classification networks with a sequential topology, e.g., PVT , CrossViT , Swin , Twins , SegFormer , MViT , CSWin . However, there is no information flow from LR to HR features inside the ViT backbone, and the HR features are still very shallow ones of relatively low quality. In contrast, HRViT adopts a multi-branch network topology with enhanced multi-scale representability and improved efficiency.
5 Conclusion

In this paper, we delve into the multi-scale representation learning in vision transformers and present an efficient multi-scale high-resolution ViT backbone design, named HRViT. To fully exploit the potential of ViTs in dense prediction tasks, we enhance ViT backbones with a multi-branch architecture to enable high-quality HR representations and cross-scale interaction. To scale up HRViT, we jointly optimize the key building blocks with efficient embedding layers, augmented cross-shaped attentions, and mixed-scale convolutional feedforward networks. Our architecture-block co-design pushes the performance-efficiency Pareto front to a new level. Extensive experiments show that HRViT outperforms state-of-the-art vision transformer backbone designs with significant performance improvement at lower hardware cost.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, page 2481–2495, 2017.
-  Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-End object detection with transformers. In Proc. ECCV, 2020.
-  Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proc. ICCV, 2021.
-  Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proc. ECCV, 2018.
-  Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. In Proc. NeurIPS, 2021.
-  MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
-  Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. CVPR, 2016.
-  Mingyu Ding, Xiaochen Lian, Linjie Yang, Peng Wang, Xiaojie Jin, Zhiwu Lu, and Ping Luo. HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers. In Proc. CVPR, 2021.
-  Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, 2021.
-  A. Dosovitskiy, L. Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, M. Dehghani, Matthias Minderer, G. Heigold, S. Gelly, Jakob Uszkoreit, and N. Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proc. ICLR, 2021.
-  Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. Arxiv preprint 2104.11227, 2021.
-  Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference. Arxiv preprint 2104.01136, 2021.
-  Daniel Haase and Manuel Amthor. Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets. In Proc. CVPR, pages 14588–14597, 2020.
-  Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In Proc. ECCV, 2016.
-  Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.
-  Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In Proc. ICLR, 2015.
-  Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proc. ICCV, 2021.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, 2015.
-  A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In Proc. ECCV, page 483–499, 2016.
-  Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do Vision Transformers See Like Convolutional Neural Networks? Arxiv preprint 2108.08810, 2021.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), May 2015.
-  Yehui Tang, Kai Han, Chang Xu, An Xiao, Yiping Deng, Chao Xu, and Yunhe Wang. Augmented Shortcuts for Vision Transformers. In Proc. NeurIPS, 2021.
-  Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers: distillation through attention. In Proc. ICML, pages 10347–10357, 2021.
-  Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3349–3364, 2021.
-  Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proc. ICCV, 2021.
-  Wenxiao Wang, Lu Yao, Long Chen, Deng Cai, Xiaofei He, and Wei Liu. CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention. arXiv preprint arXiv:2108.00154, 2021.
-  Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In Proc. CVPR, 2021.
-  Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early Convolutions Help Transformers See Better. Arxiv preprint 2106.14881, 2021.
-  Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proc. NeurIPS, 2021.
-  Weijian Xu, Yifan Xu, Tyler Chang, and Zhuowen Tu. Co-scale conv-attentional image transformers. In Proc. ICCV, 2021.
-  Changqian Yu, Bin Xiao, Changxin Gao, Lu Yuan, Lei Zhang, Nong Sang, and Jingdong Wang. Lite-HRNet: A Lightweight High-Resolution Network. In Proc. CVPR, 2021.
-  Qihang Yu, Yingda Xia, Yutong Bai, Yongyi Lu, Alan Yuille, and Wei Shen. Glance-and-Gaze Vision Transformer. arXiv preprint arXiv:2106.02277, 2021.
-  Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token ViT: Training vision transformers from scratch on imagenet. arXiv preprint arXiv:2101.11986, 2021.
-  Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. HRFormer: High-Resolution Transformer for Dense Prediction. In Proc. NeurIPS, 2021.
-  Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, and Jianfeng Gao. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. arXiv preprint arXiv:2103.15358, 2021.
-  Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proc. CVPR, 2017.