HRViT: Multi-Scale High-Resolution Vision Transformer

by Jiaqi Gu et al.
The University of Texas at Austin

Vision transformers (ViTs) have attracted much attention for their superior performance on computer vision tasks. To address their limitation of single-scale low-resolution representations, prior work adapts ViTs to high-resolution dense prediction tasks with hierarchical architectures that generate pyramid features. However, multi-scale representation learning is still under-explored on ViTs, given their classification-like sequential topology. To enhance ViTs with the capability to learn semantically-rich and spatially-precise multi-scale representations, in this work we present an efficient integration of high-resolution multi-branch architectures with vision transformers, dubbed HRViT, pushing the Pareto front of dense prediction tasks to a new level. We explore heterogeneous branch designs, reduce the redundancy in linear layers, and augment the model nonlinearity to balance model performance and hardware efficiency. The proposed HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes for semantic segmentation, surpassing state-of-the-art MiT and CSWin backbones with an average of +1.78 mIoU improvement, 28% parameter saving, and 21% FLOPs reduction, demonstrating the potential of HRViT as a strong vision backbone.




1 Introduction

Dense prediction vision tasks, e.g., semantic segmentation and object detection, are critical workloads on modern intelligent computing platforms, e.g., AR/VR devices. Convolutional neural networks (CNNs) have rapidly evolved with significant improvement in dense prediction tasks [18, 16, 4, 21, 24, 1]. Beyond classical CNNs, vision transformers (ViTs) have attracted extensive interest and shown competitive performance in vision tasks [2, 10, 33, 15, 35, 23, 17, 5, 29, 9, 26, 27, 3, 30]. Benefiting from self-attention operations, ViTs embrace strong expressivity with long-distance information interaction. However, ViTs produce single-scale, low-resolution representations, which are not compatible with dense prediction workloads that require high position sensitivity and fine-grained image details.

Recently, various ViT backbones have been proposed to adapt to dense prediction tasks. Prior ViT backbones introduced various efficient global/local self-attention mechanisms to extract hierarchical features [25, 17, 32, 26, 5, 29, 9]. A multi-scale ViT (MViT) [11] learns a hierarchy that progressively expands the channel capacity while reducing the spatial resolution. However, these models still follow a classification-like network topology with a sequential or series architecture. For complexity reasons, they gradually downsample the feature maps to extract higher-level low-resolution (LR) representations and directly feed each stage's output to the downstream framework. Such sequential structures lack sufficient cross-scale interaction and thus cannot generate high-quality high-resolution (HR) representations.

HRNet [24] was proposed to enhance cross-resolution interaction with a multi-branch architecture that maintains all resolutions throughout the network. Multi-resolution features are extracted in parallel and fused repeatedly to generate high-quality HR representations with richer semantic information. Such a design concept has achieved great success in various dense prediction tasks. Nevertheless, its expressivity is limited by small receptive fields and the strong inductive bias of cascaded convolution operations. Later, a slimmed Lite-HRNet [31] was put forward with efficient shuffle blocks and channel weighting operators. HR-NAS [8] inserted a lightweight transformer path into the residual blocks to extract global information and applied neural architecture search to remove channel/head redundancies. However, those improved HRNet designs are still mainly based on convolutional building blocks, and the demonstrated performance of their tiny models is still far behind the SoTA scores of their ViT counterparts.

Migrating the success of HRNet to ViT designs is non-trivial. Given the high complexity of multi-branch HR architectures and self-attention operations, simply replacing all residual blocks in HRNet with transformer blocks will encounter severe scalability issues. The inherited powerful representability will be overwhelmed by the prohibitive hardware cost without careful architecture-block co-optimization.

To enhance ViTs with stronger representability to generate semantically-rich and position-precise features, in this work we present HRViT, an efficient multi-scale high-resolution vision transformer backbone specifically optimized for high-resolution dense prediction tasks. Our goal is to facilitate efficient multi-scale representation learning for vision transformers. HRViT differs from prior sequential ViTs in several aspects: 1) our multi-branch HR architecture extracts multi-scale features in parallel with cross-resolution fusion to enhance the multi-scale representability of ViTs; 2) our augmented local self-attention removes redundant keys and values for better efficiency and enhances expressivity with extra convolution paths, additional nonlinearity, and auxiliary shortcuts that improve feature diversity; 3) we adopt mixed-scale convolutional feedforward networks to fortify multi-scale feature extraction; 4) our HR convolutional stem and efficient patch embedding layers maintain more low-level fine-grained features at reduced hardware cost. Also, distinguished from the HRNet family, HRViT follows a unique heterogeneous branch design to balance efficiency and performance; it is not simply an improved HRNet but a new topology of pure ViTs mainly constructed from self-attention operators. Our main contributions are as follows:

  • We deeply investigate the multi-scale representation learning in ViTs and integrate high-resolution architecture with vision transformers for high-performance dense prediction vision tasks.

  • To enable scalable HR-ViT integration with better performance and efficiency trade-off, we leverage the redundancy in transformer blocks and perform joint optimization on key components of HRViT with heterogeneous branch designs.

  • The proposed HRViT achieves 50.20% mIoU on ADE20K val and 83.16% mIoU on Cityscapes val for semantic segmentation tasks, outperforming state-of-the-art (SoTA) MiT and CSWin backbones with +1.78 higher mIoU, 28% fewer parameters, and 21% lower FLOPs, on average.

2 Proposed HRViT Architecture

Figure 1: The overall architecture of our proposed HRViT. It progressively expands to 4 branches. Each stage has multiple modules. Each module contains multiple transformer blocks.

Compared with the surge in sophisticated attention operator innovations, the multi-scale representation learning of ViTs is much less explored, which is far behind the recent advance in their CNN counterparts. New topology designs create another dimension to unleash the potential of ViTs with even stronger vision expressivity. An important question that remains to be answered is whether the success of HRNet can be efficiently migrated to ViT backbones to consolidate their leading position in high-resolution dense prediction tasks.

In this section, we delve into the multi-scale representation learning in ViTs and introduce a hardware-efficient integration of the HR architecture and ViTs.

2.1 Architecture overview

We illustrate the architecture of HRViT in Figure 1. It consists of a convolutional stem to reduce spatial dimensions while extracting low-level features. Then we construct four progressive transformer stages where the i-th stage contains i parallel multi-scale transformer branches. Each stage can have one or more modules. Each module starts with a lightweight dense fusion layer to achieve cross-resolution interaction and an efficient patch embedding block for local feature extraction, followed by repeated augmented local self-attention blocks (HRViTAttn) and mixed-scale convolutional feedforward networks (MixCFN). Unlike sequential ViT backbones that progressively reduce the spatial dimension to generate pyramid features, we maintain the HR features throughout the network to strengthen the quality of HR representations via cross-resolution fusion.
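As a concrete sketch of the four-branch layout, the following computes the feature shapes each branch maintains; the base channel of 32 and channel doubling per branch are illustrative assumptions borrowed from the smallest variant described later, not a specification of all variants.

```python
def branch_shapes(h, w, base_ch=32, num_branches=4):
    """Feature shapes kept in parallel by the 4-branch HR architecture.

    The convolutional stem downsamples the input by 4; branch i then runs
    at an extra 2**i downsampling. base_ch=32 and channel doubling per
    branch are illustrative assumptions.
    """
    h, w = h // 4, w // 4          # stem: two stride-2 CONV-BN-ReLU blocks
    return [(h >> i, w >> i, base_ch << i) for i in range(num_branches)]

# branches sit at 1/4, 1/8, 1/16, and 1/32 of the input resolution
assert branch_shapes(1024, 1024) == [
    (256, 256, 32), (128, 128, 64), (64, 64, 128), (32, 32, 256)]
```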

2.2 Efficient HR-ViT integration with heterogeneous branch design

We design a heterogeneous multi-branch architecture for hardware-efficient multi-scale high-resolution ViTs. A straightforward choice is to replace all convolutions in HRNet with self-attention. However, given the high complexity of multi-branch HRNet and self-attention operators, this brute-force combination quickly causes an explosion in memory footprint, parameter size, and computational cost. The real challenge is that we want to leverage both the superior multi-scale representability of HR architectures and the strong modeling capacity of transformers; meanwhile, we have to overcome their enormous complexity and make the combination even more hardware-efficient than either of them. Hence, careful architecture and block co-design is critical to a scalable and efficient HR-ViT integration.

Heterogeneous branch configuration.  The first question is how to configure each branch for a scalable HRViT design. Simply assigning the same number of blocks with the same local self-attention window size to each module would make it intractably costly. We give a detailed analysis of the functionality and cost of each branch in Table 1, based on which we summarize a simple design heuristic.

Feature/Arch.    HR (branches 1-2)   MR (branch 3)   LR (branch 4)
Memory cost      High                Medium          Low
Computation      Heavy               Moderate        Light
#Params          Small               Medium          Large
Eff. on class.   Not quite useful    Important       Important
Granularity      Fine                Medium          Coarse
Receptive field  Local               Region          Global
Window size      Narrow (s=1,2)      Wide (s=7)      Wide (s=7)
Depth            Shallow (5-6)       Deep (20-30)    Shallow (4)
Table 1: Qualitative cost and functionality analysis. Window sizes and depths are given per branch.

We give the parameter count of the HRViTAttn and MixCFN blocks on the i-th branch (i=1,2,3,4): the parameters grow quadratically with the channel width, which doubles toward lower-resolution branches. The amount of floating-point operations (FLOPs) additionally scales with the number of tokens on the branch, which shrinks toward lower resolutions.
The first and second HR branches (i=1,2) can barely generate useful high-level features for classification but have high memory and computational cost. On the other hand, they are parameter-efficient and can provide fine-grained detail calibration in segmentation tasks. Thus we use a narrow attention window size and a minimum number of blocks on the two HR paths.

The most important branch is the third one with a medium resolution (MR). Given its medium hardware cost, we can afford a deep branch with a large window size on the MR path to provide large receptive fields and well-extracted high-level features.

The lowest resolution (LR) branch contains the most parameters and is very useful for providing high-level features as coarse segmentation maps. However, its small spatial size loses too many image details. Therefore, we only place a few blocks with a large window size on the LR branch to improve high-level feature quality under the parameter budget.
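The trends in Table 1 can be reproduced with a back-of-the-envelope cost model. This is an illustrative sketch with an assumed per-block projection count, not the paper's exact cost equations:

```python
def rough_branch_cost(h, w, c, depth):
    """Illustrative (not the paper's exact) per-branch cost model:
    projection parameters scale with C^2 only, projection FLOPs scale
    with tokens * C^2, and activation memory scales with tokens * C.
    So HR branches are compute/memory-heavy but parameter-light, and
    the LR branch is the opposite."""
    tokens = h * w
    params = depth * 4 * c * c      # q/kv/out/FFN projections, roughly
    flops = tokens * params         # every token passes the projections
    memory = tokens * c             # activation footprint
    return params, flops, memory

hr = rough_branch_cost(256, 256, 32, 6)    # high-resolution branch
lr = rough_branch_cost(32, 32, 256, 4)     # low-resolution branch
assert hr[0] < lr[0]                       # HR: few parameters
assert hr[1] > lr[1] and hr[2] > lr[2]     # HR: heavy compute and memory
```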

Nearly-even block assignment.  Once we decide the total branch depth, a unique question, which does not exist in sequential ViT variants, is how to assign those blocks to each module. In our example HRViT, we need to assign 20 blocks to 4 modules on the 3rd path. To maximize the average depth of the network ensemble and help the input/gradient flow through the deep transformer branch, we prefer a nearly-even partitioning, e.g., 6-6-6-2, to an extremely unbalanced assignment, e.g., 17-1-1-1.

2.3 Efficient HRViT component design

We now give a detailed introduction to the optimized building blocks and key features of HRViT.

Augmented cross-shaped local self-attention. To achieve high performance with improved efficiency, a hardware-efficient self-attention operator is necessary. We adopt the SoTA efficient cross-shaped self-attention (CSWin) [9] as our baseline attention operator. Based on that, we design our augmented cross-shaped local self-attention, HRViTAttn. This attention has the following advantages. (1) Fine-grained attention: compared with globally-downsampled attentions [25, 29], it performs fine-grained feature aggregation that preserves detailed information. (2) Approximated global view: by using two parallel orthogonal local attentions, it can collect global information. (3) Scalable complexity: one dimension of the window is fixed, which avoids complexity quadratic in image size.

To balance performance and hardware efficiency, we introduce our augmented version, denoted as HRViTAttn, with several key optimizations.

Figure 2: (a) HRViTAttn: augmented cross-shaped local self-attention with a parallel convolution path and an efficient diversity-enhanced shortcut. (b) Window zero-padding with attention-map masking.

In Figure 2(a), we follow the cross-shaped window partitioning approach in CSWin that separates the input X into two halves {X^H, X^V} along the channel dimension. X^H is partitioned into disjoint horizontal windows, and the other half X^V is chunked into vertical windows. The window is set to s x W or H x s. Within each window, the patches are chunked into d-dimensional heads, then a local self-attention is applied,

    HRViTAttn(X) = sigma(BN(Concat(Y_1, ..., Y_K) W^O)),
    Y_k = Softmax(Q_k K_k^T / sqrt(d)) V_k,

where W_k^Q, W_k^K, W_k^V are the projection matrices that generate the query Q_k, key K_k, and value V_k tensors for the k-th head, W^O is the output projection matrix, and sigma is the Hardswish activation. If the image size is not a multiple of the window size, e.g., H mod s != 0, we apply zero-padding to the inputs to complete the last window, as shown in Figure 2(b). The padded region in the attention map is then masked to 0 to avoid incoherent semantic correlation.
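A minimal numpy sketch of the horizontal-window partitioning with zero-padding and a padding mask (vertical windows are the symmetric case). This is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def horizontal_windows(x, s):
    """Partition an (H, W, C) map into disjoint s-by-W horizontal windows,
    zero-padding the last window when H is not a multiple of s. The padded
    positions are reported in a mask so their attention scores can be
    masked out before the softmax."""
    h, w, c = x.shape
    pad = (-h) % s
    xp = np.pad(x, ((0, pad), (0, 0), (0, 0)))
    mask = np.ones(h + pad, bool)
    if pad:
        mask[-pad:] = False                  # padded rows: mask out
    wins = xp.reshape((h + pad) // s, s, w, c)
    return wins, mask.reshape(-1, s)

x = np.arange(5 * 4 * 2, dtype=float).reshape(5, 4, 2)
wins, mask = horizontal_windows(x, s=2)      # 5 rows -> 3 windows of height 2
assert wins.shape == (3, 2, 4, 2) and not mask[-1, -1]
# un-partitioning recovers the original (drop the padded row first)
assert np.array_equal(wins.reshape(6, 4, 2)[:5], x)
```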

The original QKV linear layers are quite costly in computation and parameters. We share the linear projection for the key and value tensors in HRViTAttn to save computation and parameters as follows,

    K_k = V_k = X W_k^{KV},

where a single projection matrix W_k^{KV} replaces the separate key and value projections.
Besides, we introduce an auxiliary path with a parallel depth-wise convolution to inject inductive bias and facilitate training. Different from the local positional encoding in CSWin, our parallel path is nonlinear and is applied to the entire 4-D feature map without window partitioning. This path can be treated as an inverted residual module that shares its point-wise convolutions with the linear projection layers in self-attention. This shared path effectively injects inductive bias and reinforces local feature aggregation with marginal hardware overhead.
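A toy parameter count for the key-value sharing (bias terms omitted; an illustrative count only, not the paper's exact bookkeeping):

```python
def qkv_params(c, share_kv):
    """Parameters in the attention input projections for embedding dim c.
    Standard attention uses three c-by-c projections (Q, K, V); sharing
    one projection between K and V (K = V = X @ W_kv) drops one of them.
    Bias terms omitted for brevity."""
    return (2 if share_kv else 3) * c * c

c = 256
saved = qkv_params(c, share_kv=False) - qkv_params(c, share_kv=True)
assert saved == c * c            # one full projection matrix saved
```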

As performance compensation for the above key-value sharing, we introduce an extra Hardswish function to improve the nonlinearity. We also append a BatchNorm (BN) layer, initialized to an identity projection, to stabilize the distribution for better trainability. Recent studies revealed that different transformer layers tend to have very similar features, where the shortcut plays a critical role [20]. Inspired by the augmented shortcut [22], we add a channel-wise projector as a diversity-enhanced shortcut (DES). The main difference is that our shortcut has higher nonlinearity and does not depend on hardware-unfriendly Fourier transforms. The projection matrix in our DES is approximated by a Kronecker decomposition into two small factor matrices to minimize parameter cost; the input is folded accordingly so that the Kronecker product can be computed as two small matrix multiplications to save computation. We further insert Hardswish after the projection to increase the nonlinearity.
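The Kronecker identity behind the DES projection can be verified numerically. The factor shapes below (p = 8 for C = 64) are illustrative assumptions, not the paper's exact configuration; the point is only that the folded computation matches the full Kronecker-product projection at a fraction of the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
C, p = 64, 8                      # channel count and factor size (C = p*p)
A = rng.standard_normal((p, p))
B = rng.standard_normal((p, p))
x = rng.standard_normal(C)        # one token's C-dim feature

# Full projection: a C x C matrix built as a Kronecker product of two
# p x p factors -- p*p + p*p = 128 parameters instead of C*C = 4096.
full = np.kron(A, B) @ x

# Folded computation: reshape x to p x p and use two small matmuls,
# since (A kron B) vec(X) == vec(A X B^T) under row-major flattening.
folded = (A @ x.reshape(p, p) @ B.T).reshape(-1)

assert np.allclose(full, folded)
assert A.size + B.size == 128 and C * C == 4096
```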
Mixed-scale convolutional feedforward network.  Inspired by the MixFFN in MiT [29] and the multi-branch inverted residual blocks in HR-NAS [8], we design a mixed-scale convolutional FFN (MixCFN) by inserting two multi-scale depth-wise convolution paths between the two linear layers.

Figure 3: MixCFN with multiple depth-wise convolution paths to extract multi-scale local information.

After LayerNorm, we expand the channels by an expansion ratio r, then split them into two branches. 3x3 and 5x5 depth-wise convolutions (DWConv) are used to increase the multi-scale local information extraction of HRViT. For efficiency, we exploit the channel redundancy by reducing the MixCFN expansion ratio from 4 [29, 17] to 2 or 3 with marginal performance loss on medium to large models.
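A rough parameter estimate of one MixCFN block under these assumptions (two linear layers plus the two depth-wise paths on the split channels; biases and norms ignored, so this is an illustrative estimate, not the paper's formula):

```python
def mixcfn_params(c, r):
    """Rough parameter count of a MixCFN block with expansion ratio r:
    two linear layers (c -> r*c and r*c -> c) plus a 3x3 and a 5x5
    depth-wise convolution, each applied to half of the expanded
    channels. Biases and normalization layers omitted."""
    expand = c * (r * c)                       # first linear layer
    dw = (r * c // 2) * 3 * 3 + (r * c // 2) * 5 * 5
    contract = (r * c) * c                     # second linear layer
    return expand + dw + contract

# Reducing the expansion ratio from 4 to 2 roughly halves the block,
# since every term above is linear in r.
c = 256
assert mixcfn_params(c, 2) < 0.51 * mixcfn_params(c, 4)
```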

Downsampling stem.  In dense prediction tasks, images are of high resolution, e.g., 1024x1024. Self-attention operators are known to be expensive as their complexity is quadratic in image size. To address the scalability issue when processing large images, we down-sample the inputs by 4x before feeding them into the main body of HRViT. We do not use attention operations in the stem since early convolutions are more effective than self-attention at extracting low-level features [12, 28]. On the other hand, instead of simply using a stride-4 convolution as in prior ViTs [29, 5, 17], we follow the design in HRNet and use two stride-2 CONV-BN-ReLU blocks as a stronger downsampling stem that extracts features with more information maintained.

Efficient patch embedding.  Before the transformer blocks in each module, we put a patch embedding block (CONV-LayerNorm) on each branch. It is used to match channels and extract patch information with enhanced inter-patch communication. Unlike sequential architectures that only have 4 embedding layers, we found that the patch embedding layers have a non-trivial hardware cost in the HR architecture, since each module at stage i has i embedding blocks. We slim them down with a blueprint convolution [13], i.e., a point-wise CONV followed by a depth-wise CONV.
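The savings from the blueprint convolution can be illustrated with a hypothetical 3x3 embedding layer from 64 to 128 channels (the channel numbers here are examples, not the paper's configuration):

```python
def conv_params(c_in, c_out, k):
    """Standard k x k convolution parameter count (no bias)."""
    return c_in * c_out * k * k

def blueprint_params(c_in, c_out, k):
    """Blueprint separable convolution: a point-wise (1x1) convolution
    to mix channels, followed by a k x k depth-wise convolution with one
    filter per output channel."""
    return c_in * c_out + c_out * k * k

# e.g. a 3x3 embedding layer from 64 to 128 channels
assert conv_params(64, 128, 3) == 73728
assert blueprint_params(64, 128, 3) == 64 * 128 + 128 * 9   # 9344
```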

Figure 4: Channel matching, up-scaling, and down-sampling in the fusion layer.

Cross-resolution fusion layer.  The cross-resolution fusion layer is critical for HRViT to learn high-quality HR representations, shown in Figure 4. To impose more cross-resolution interaction, we borrow the idea from HRNet [24, 31] to insert repeated cross-resolution fusion layers at the beginning of each module.

To help LR features maintain more image details and precise position information, we merge them with down-sampled HR features. Instead of using a progressive convolution-based downsampling path to match tensor shapes [24, 31], we employ a direct down-sampling path to minimize hardware overhead. In the down-sampling path between the i-th input and the j-th output (i < j), we use a depth-wise separable convolution with a stride of 2^(j-i) to shrink the spatial dimension and match the output channels. The kernel size of the DWConv is (2^(j-i)+1) to create patch overlaps. Those HR paths inject more image information into the LR path to mitigate information loss and fortify gradient flow during backpropagation, facilitating the training of LR transformer blocks.

On the other hand, the receptive field is usually limited in the HR blocks since we minimize the window size and branch depth on HR paths. Hence, we merge LR representations into HR paths to help them obtain higher-level features with a larger receptive field. Specifically, in the up-scaling path (i > j), we first increase the number of channels with a point-wise convolution and up-scale the spatial dimension via nearest-neighbor interpolation at a rate of 2^(i-j). When i = j, we directly pass the features to the output as a skip connection. Note that in HR-NAS [8], the dense fusion is simplified into a sparse fusion module where only neighboring resolutions are merged. This technique is not adopted in HRViT since it saves marginal hardware cost but leads to a noticeable accuracy drop, as shown in the ablation study.

2.4 Architectural variants

Different HRViT variants scale both in network depth and width. Table 2 summarizes detailed branch designs of 3 variants.

Variant | Architecture design | Window | MixCFN ratio | Channel | Head dim
Table 2: Architecture variants of HRViT. The number of blocks is marked in each module; window sizes, MixCFN ratios, channels, and head dimensions are listed per branch.

We follow the aforementioned design guidance and put 1 transformer block on the HR branches, 20-24 blocks on the MR branch, and 4-6 blocks on the LR branch. Window sizes are set to 1, 2, 7, 7 for the 4 branches. We use relatively large MixCFN expansion ratios in small variants for performance and reduce the ratio to 2 on larger variants for efficiency. We follow the scaling rule from CSWin [9] to gradually increase the base channel count of the highest-resolution branch from 32 to 64. The numbers of blocks and channels can be flexibly tuned on the 3rd/4th branches to match a specific hardware cost.

3 Experiments

We pretrain all models on ImageNet-1K and conduct experiments on ADE20K [36] and Cityscapes [7] for semantic segmentation. We compare the performance and efficiency of our HRViT with SoTA ViT backbones, i.e., Swin [17], Twins [5], MiT [29], and CSWin [9].

3.1 Semantic segmentation on ADE20K and Cityscapes

Variant    Image Size  #Params (M)  GFLOPs  top-1 acc. (%)
HRViT-b1   224         19.7         2.7     80.5
HRViT-b2   224         32.5         5.1     82.3
HRViT-b3   224         37.9         5.7     82.8
Table 3: ImageNet-1K pre-training results. GFLOPs are measured at an image size of 224x224. #Params includes the classification head as used in HRNetV2 [24].

On semantic segmentation tasks, HRViT achieves the best performance-efficiency Pareto front, surpassing the SoTA MiT and CSWin under different settings. HRViT-b1/b2/b3 outperform the previous SoTA SegFormer-MiT-B1/B2/B3 [29] with +3.68, +2.26, and +0.80 higher mIoU on ADE20K val, and +3.13, +1.81, and +1.46 higher mIoU on Cityscapes val.

ImageNet-1K pre-training.  All ViT models are pre-trained on ImageNet-1K. We follow the same pre-training settings as DeiT [23] and other ViTs [17, 29, 9]. We adopt stochastic depth [14] for all HRViT variants with the max drop rate of 0.1. The drop rate is gradually increased on the deepest 3rd branch, and other shallow branches follow the rate of the 3rd branch within the same module. We use the HRNetV2 [24] classification head in HRViT on ImageNet-1K pre-training. The pre-training results are in Table 3.

Settings.  We evaluate HRViT for semantic segmentation on the Cityscapes and ADE20K datasets. We employ a lightweight SegFormer [29] head based on the mmsegmentation framework [6]. We follow the training settings of prior work [29, 9]. The training image sizes for ADE20K and Cityscapes are 512x512 and 1024x1024, respectively. We use an AdamW optimizer for 160k iterations with a 'poly' learning rate schedule, 1,500 steps of linear warm-up, an initial learning rate of 6e-5, a mini-batch size of 16, and a weight decay rate of 0.01. The test image sizes for ADE20K and Cityscapes are set to 512x2048 and 1024x2048, respectively. We run inference on Cityscapes with a sliding-window test by cropping 1024x1024 patches.
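The learning rate schedule above can be sketched as follows. Note the poly power of 0.9 is the common mmsegmentation default and is an assumption here, not stated in the text:

```python
def poly_lr(step, base_lr=6e-5, total=160_000, warmup=1_500, power=0.9):
    """'Poly' decay with linear warm-up, matching the stated settings
    (6e-5 initial LR, 160k iterations, 1,500 warm-up steps). power=0.9
    is the usual mmsegmentation default, assumed rather than stated."""
    if step < warmup:
        return base_lr * step / warmup            # linear warm-up
    frac = (step - warmup) / (total - warmup)
    return base_lr * (1 - frac) ** power          # polynomial decay to 0

assert poly_lr(0) == 0.0
assert abs(poly_lr(1_500) - 6e-5) < 1e-12         # warm-up reaches base LR
assert poly_lr(80_000) < poly_lr(20_000)          # monotonically decaying
```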

Figure 5: HRViT achieves the best Pareto front compared with other models on ADE20K val.

Results on ADE20K.  We evaluate different ViT backbones in terms of single-scale mean intersection-over-union (mIoU), #Params, and GFLOPs. Figure 5 plots the Pareto curves in the #Params and FLOPs spaces. On ADE20K val, HRViT outperforms other ViTs with a better performance-efficiency trade-off. For example, with the SegFormer head, HRViT-b1 outperforms MiT-B1 with 3.68% higher mIoU, 40% fewer parameters, and 8% less computation. Our HRViT-b3 achieves a higher mIoU than the best CSWin-S while saving 23% parameters and 13% FLOPs. Compared with the convolutional HRNetV2+OCR, HRViT shows considerable performance advantages with a significant hardware-efficiency boost.

SegFormer Head [29]
Backbone       #Param. (M)  GFLOPs  mIoU (%)
MiT-B0 [29]    3.8          8.4     76.20
MiT-B1 [29]    13.7         15.9    78.50
CSWin-Ti [9]   5.9          11.4    79.16
HRViT-b1       8.1          14.1    81.63
MiT-B2 [29]    27.5         62.4    81.00
CSWin-T [9]    22.4         28.3    81.56
HRViT-b2       20.8         27.4    82.81
MiT-B3 [29]    47.3         79.0    81.70
MiT-B4 [29]    64.1         95.7    82.30
CSWin-S [9]    37.3         78.1    82.58
HRViT-b3       28.6         66.3    83.16
Avg. improv.   -30.7%       -22.3%  +2.16
Table 4: Comparison of ViT backbones on the Cityscapes val segmentation dataset. CSWin-Ti is a slimmed CSWin-T with half the channels (64→32). FLOPs are measured at an image size of 512x512.

Results on Cityscapes.  In Table 4, our small model HRViT-b1 outperforms MiT-B1 and CSWin-Ti by +3.13 and +2.47 higher mIoU, which shows the larger effective width of HR architectures is especially effective on slim networks.

When training HRViT-b3 on Cityscapes, we set the multi-branch window sizes to 1-2-9-9. HRViT-b3 outperforms MiT-B4 with +0.86 higher mIoU, 55.4% fewer parameters, and 30.7% lower FLOPs. Compared with the two SoTA ViT backbones, MiT and CSWin, HRViT achieves an average of +2.16 higher mIoU with 30.7% fewer parameters and 22.3% less computation.

3.2 Ablation studies

In Table 5, we independently remove each technique from HRViT and evaluate on ImageNet and Cityscapes.

Sharing key-value.  When removing key-value sharing, i.e., using independent keys and values, HRViT-b1 shows the same ImageNet-1K accuracy but at the cost of lower Cityscapes segmentation mIoU, 9% more parameters, and 4% more computations.

Setting                 #Param. (M)  GFLOPs  top-1 acc. (%)  mIoU (%)
HRViT-b1                8.1          14.1    80.52           81.63
w/o key-value sharing   8.8          14.7    80.52           81.00
w/o eff. patch embed    9.9          16.5    80.19           81.18
w/o MixCFN              7.9          13.6    79.86           80.52
w/o parallel CONV path  8.1          14.0    80.06           80.82
w/o nonlinearity/BN     8.1          14.1    80.37           81.12
w/o dense fusion        8.0          14.0    79.95           81.26
w/o DES                 8.1          14.0    80.36           81.38
w/o all block opt.      10.1         16.3    79.79           80.45
Table 5: Ablation on the proposed techniques. Each entry removes one technique independently; the last entry removes all block optimization techniques.

Patch embedding.  We change our efficient patch embedding to the CONV-based overlapped patch embedding. We observe 22% more parameters and 17% more FLOPs without accuracy/mIoU benefits.

MixCFN.  Removing the mixed-scale convolutional feedforward block directly leads to a 0.66% ImageNet accuracy drop and a 1.11% Cityscapes mIoU loss with only marginal efficiency improvement. We can observe that the MixCFN block is an important technique for guaranteeing our performance.

Parallel CONV path.  The embedded inverted residual path in the attention block is very lightweight but contributes 0.46% higher ImageNet accuracy and 0.81% higher mIoU.

Additional nonlinearity/BN.  The extra Hardswish and BN introduce negligible overhead but boost expressivity and trainability, bringing 0.15% higher ImageNet-1K accuracy and 0.51% higher mIoU on Cityscapes val.

Dense vs. sparse fusion layers.  The sparse fusion [8] is not effective in HRViT as it saves tiny hardware cost (1%) but leads to 0.57% accuracy drop and 0.37% mIoU loss.

Diversity-enhanced shortcut.  The nonlinear shortcut (DES) helps improve the feature diversity and effectively improves the performance to a higher level on multiple tasks. Negligible hardware cost is introduced due to the high efficiency of the Kronecker decomposition-based projector.

Naive HRNet-ViT vs. HRViT.

Backbone        #Param. (M)  GFLOPs  top-1 acc. (%)  mIoU (%)
HRNet18-MiT     8.4          29.3    79.3            80.30
HRNet18-CSWin   8.1          22.3    79.5            80.95
HRViT-b1        8.1          14.1    80.5            81.63
HRNet32-MiT     24.4         52.4    81.1            82.05
HRNet32-CSWin   23.9         42.2    81.1            82.11
HRViT-b2        20.8         27.4    82.3            82.81
HRNet40-MiT     40.1         108.0   82.3            82.10
HRNet40-CSWin   39.5         96.3    82.4            82.38
HRViT-b3        28.6         66.3    82.8            83.16
Avg. improv.    -14.4%       -38.5%  +0.92           +0.89
Table 6: Comparison of naive HRNet-ViT variants with HRViT on ImageNet-1K and Cityscapes val. Heterogeneous branch designs with optimized blocks make HRViT more efficient and scalable than naive HRNet-ViT counterparts.

In Table 6, we directly replace the residual blocks in HRNetV2 with transformer blocks as a naive baseline. When comparing HRNet-MiT with the sequential MiT, we notice that the HR variants have comparable mIoU while significantly saving hardware cost. This shows that the multi-branch architecture is indeed helpful for boosting the multi-scale representability. However, the naive HRNet-ViT overlooks the expensive cost of transformers; it is thus not scalable, as the hardware cost quickly outweighs the performance gain. In contrast, our heterogeneous branches and optimized components keep the hardware cost under control, enhance the model representability, and maintain good scalability.

4 Related Work

Multi-scale representation learning.  Previous CNNs and ViTs progressively down-sample the feature map to compute the LR representations [18, 4, 10], and recover the HR features via up-sampling, e.g., SegNet [1], UNet [21], Hourglass [19]. HRNet [24] maintains the HR representations throughout the network with cross-resolution fusion. Lite-HRNet [31] proposes conditional channel weighting blocks to exchange information across resolutions. HR-NAS [8] searches the channel/head settings for inverted residual blocks and the auxiliary transformer branches. HRFormer [34] improves HRNetV2 by replacing residual blocks with Swin transformer blocks. Different from the HRNet-family, HRViT is a pure ViT backbone with a novel multi-branch topology that benefits both from HR architectures and self-attentions. Besides, we explore heterogeneous branch design and block optimization to boost the hardware efficiency.

Multi-scale ViT backbones.  Several multi-scale ViTs adopt hierarchical architectures to generate progressively down-sampled pyramid features, but they still follow the design concept of classification networks with a sequential topology, e.g., PVT [25], CrossViT [3], Swin [17], Twins [5], SegFormer [29], MViT [11], CSWin [9]. However, there is no information flow from LR to HR features inside the ViT backbone, and the HR features are still very shallow ones of relatively low quality. In contrast, HRViT adopts a multi-branch network topology with enhanced multi-scale representability and improved efficiency.

5 Conclusion

In this paper, we delve into multi-scale representation learning in vision transformers and present an efficient multi-scale high-resolution ViT backbone design, named HRViT. To fully exploit the potential of ViTs in dense prediction tasks, we enhance ViT backbones with a multi-branch architecture that enables high-quality HR representations and cross-scale interaction. To scale up HRViT, we jointly optimize the key building blocks with efficient embedding layers, augmented cross-shaped attention, and mixed-scale convolutional feedforward networks. Our architecture-block co-design pushes the performance-efficiency Pareto front to a new level. Extensive experiments show that HRViT outperforms state-of-the-art vision transformer backbones with significant performance improvement at lower hardware cost.