
HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling

by Xiaosong Zhang, et al.

Recently, masked image modeling (MIM) has offered a new methodology of self-supervised pre-training of vision transformers. A key idea of efficient implementation is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), even though hierarchical vision transformers (e.g., Swin Transformer) have potentially better properties in formulating vision inputs. In this paper, we offer a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) that enjoys both high efficiency and good performance in MIM. The key is to remove the unnecessary "local inter-unit operations", deriving structurally simple hierarchical vision transformers in which mask-units can be serialized like in plain vision transformers. For this purpose, we start with Swin Transformer and (i) set the masking unit size to be the token size in the main stage of Swin Transformer, (ii) switch off inter-unit self-attentions before the main stage, and (iii) eliminate all operations after the main stage. Empirical studies demonstrate the advantageous performance of HiViT in terms of fully-supervised, self-supervised, and transfer learning. In particular, in running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a nearly 2x speed-up over Swin-B, and the performance gain generalizes to downstream tasks of detection and segmentation. Code will be made publicly available.




1 Introduction

Deep neural networks have been the foundation of deep learning and have advanced the research fields of computer vision, natural language processing, etc., in the past decade. Recently, the computer vision community has witnessed the emergence of vision transformers ViT2021 ; Swin2021 ; wang2021pvtv2 ; zhou2021deepvit ; dai2021coatnet ; li2021efficient , transplanted from language models Attention2017 ; devlin2019bert , which have replaced the dominance of convolutional neural networks AlexNet2012 ; Resnet2016 ; EfficientNet2019 . They have the ability to formulate long-range feature dependencies, which naturally benefits visual recognition, especially when long-range relationships are important.

There are mainly two families of vision transformers, namely, the plain vision transformers ViT2021 ; DeiT2021 and the hierarchical vision transformers Swin2021 ; wang2021pvtv2 ; dong2021cswin ; chen2021crossvit , differing from each other in whether multi-resolution feature maps are used. The latter are believed to better capture the nature of vision signals (most convolution-based models have used the hierarchical configuration), but they rely on spatially local operations (e.g., early-stage self-attentions with shifted windows). These models can encounter difficulties when the tokens need to be flexibly manipulated. A typical example lies in masked image modeling (MIM), a recent methodology of pre-training vision transformers bao2021beit ; MAE2021 ; xie2021simmim – a random part of image patches is hidden from the input, and it is difficult for the hierarchical models, unlike the plain models, to determine whether each pair of tokens needs to communicate. Essentially, this is because hierarchical vision transformers use non-global operations (e.g., window attentions) between the masking units (the minimum size of the masked pixels when executing MIM is defined as the masking unit; for example, when the input image size is 224x224, the masking unit size for MAE MAE2021 is 16x16). Hence, unlike the plain vision transformers that can serialize all tokens for acceleration, the hierarchical vision transformers must maintain the two-dimensional structure, keeping the dummy (masked) tokens throughout the encoder. Consequently, as shown in xie2021simmim , training hierarchical transformers for MIM is slower than training plain transformers, and very few works chose to follow this direction.

In this paper, we start with categorizing the operations in hierarchical vision transformers into ‘intra-unit operations’, ‘global inter-unit operations’, and ‘local inter-unit operations’. We note that plain vision transformers only contain ‘intra-unit operations’ (i.e., patch embedding, layer normalization, MLP) and ‘global inter-unit operations’ (i.e., global self-attentions), hence the units’ spatial coordinates can be discarded and the units can be serialized for efficient computation, as in MAE MAE2021 . For hierarchical vision transformers, by contrast, it is the ‘local inter-unit operations’ (i.e., shifting-window self-attentions, patch merging) that call for extra judgment based on the units’ spatial coordinates, obstructing both the serialization and the removal of the masked units.
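The serialization idea can be illustrated with a minimal NumPy sketch (variable names are ours, not the paper's): since intra-unit and global inter-unit operations never consult spatial coordinates, the visible units can be packed into a dense list and processed as-is.

```python
import numpy as np

# A 14x14 grid of masking units, each already embedded as a 32-dim vector.
rng = np.random.default_rng(0)
grid = rng.standard_normal((14, 14, 32))

# Pick 25% of the units as visible and serialize them into a dense list.
flat = grid.reshape(-1, 32)                      # drop the 2-D layout
visible_ids = rng.permutation(flat.shape[0])[:49]
serialized = flat[visible_ids]                   # (49, 32): what the encoder sees

# Intra-unit ops (LayerNorm, MLP) and global self-attention act on this
# (49, 32) array directly; a window attention, in contrast, would need the
# original (row, col) of every unit to decide which pairs may interact.
```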

A key observation of this paper is that ‘local inter-unit operations’ do not contribute much to recognition performance – what really matters is the hierarchical design (i.e., multi-scale feature maps) itself. Hence, to fit hierarchical vision transformers to MIM, we remove the ‘local inter-unit operations’, resulting in a simple hierarchical vision transformer that absorbs both the flexibility of ViT ViT2021 and the superiority of Swin Transformer Swin2021 . There are usually 4 stages of different resolutions in hierarchical vision transformers; the 3rd stage has the largest number of layers and we call it the main stage. We remove the last stage of Swin and switch off all local inter-unit window attentions, only keeping the global attention between tokens in the main stage. In practice, the last stage is merged into the main stage (to keep the model FLOPs unchanged) and the local window attentions in the early stages are replaced by intra-unit multi-layer perceptrons with the same FLOPs. With these minimal modifications, we remove all redundant ‘local inter-unit operations’ in hierarchical vision transformers, keeping only the simplest hierarchical structure. Compared to the plain ViTs, our model only adds several spatial merge operations and MLP layers before the main stage.

The resulting architecture is named HiViT (short for Hierarchical ViT), which has the ability of modeling hierarchical visual signals while all tokens remain maximally individual and flexible for manipulation. Meanwhile, HiViT maintains the ViT paradigm, which is very simple to implement compared to other hierarchical vision transformers.

Figure 1: Self-supervised pre-training of HiViT is significantly faster than Swin with SimMIM xie2021simmim and the result is better than ViT trained with MAE MAE2021 and BEiT bao2021beit . Circle size denotes memory requirement. All the models are in base scale.

We perform fully-supervised classification experiments on ImageNet-1K to validate the superiority of HiViT. Lying between ViT and Swin Transformer, HiViT enjoys consistent accuracy gains over both competitors, e.g., HiViT-B reports an 83.8% top-1 accuracy, which is 2.0% over ViT-B and 0.3% over Swin-B. With extensive ablation studies, we find that removing the ‘local inter-unit operations’ does not harm the recognition performance, while the hierarchical structure and relative positional encoding (not ‘local inter-unit operations’) slightly but consistently improve the classification accuracy. This makes HiViT applicable to a wide range of visual recognition scenarios.

Continuing to MIM, the advantages of HiViT become clearer. With 800 epochs of MIM-based pre-training and 100 epochs of fine-tuning, HiViT-B reports an 84.2% top-1 accuracy on ImageNet-1K, which is 0.6% over ViT-B (using MAE MAE2021 , pre-training for 1,600 epochs) and 0.2% over Swin-B (using SimMIM xie2021simmim ). More importantly, HiViT enjoys the efficient implementation that discards all masked patches (or tokens) at the input stage, and hence the training speed is nearly as fast as that of MAE and much faster than that of SimMIM, since the original Swin Transformer must forward-propagate the full token set (Fig. 1). The advantages persist in other visual recognition tasks, including linear probing (71.3% top-1 accuracy) on ImageNet-1K, semantic segmentation (48.3 mIoU) on the ADE20K dataset zhou2017scene , and object detection (49.5 box AP) and instance segmentation (43.8 mask AP) on the COCO dataset lin2014microsoft (1x training schedule). These results validate that removing ‘local inter-unit operations’ does not harm generic visual recognition.

The core contribution of this paper is HiViT, a hierarchical vision transformer architecture that is off-the-shelf for a wide range of vision applications. In particular, with masked image modeling being a popular self-supervised learning paradigm, HiViT has the potential of being directly plugged into many existing algorithms to improve their effectiveness and efficiency in learning visual representations from large-scale, unlabeled data.

2 Related Work

2.1 Vision Transformers

Vision transformers ViT2021 were adapted from the natural language processing (NLP) transformers Attention2017 ; devlin2019bert , opening a new direction of designing visual recognition models with weak inductive biases mlp-mixer . Early vision transformers ViT2021 mainly adopted the plain configuration, for which efficient training methods were strongly required DeiT2021 . To cater to vision-friendly priors, Swin Transformer Swin2021 proposed a hierarchical architecture that contains multi-level feature maps and validated good performance on many vision problems. Since then, various efforts have emerged in improving hierarchical vision transformers, including borrowing design experience from convolutional neural networks wang2021pvtv2 ; wu2021cvt ; vaswani2021scaling , adjusting the geometry of self-attention dong2021cswin ; yang2021focal , and designing hybrid architectures that integrate convolution and transformer modules srinivas2021bottleneck ; container ; dai2021coatnet ; Conformer2021 .

Essentially, there is a tradeoff between plain and hierarchical vision transformers – in terms of whether a strong inductive bias is to be introduced. As we shall see later, increasing the inductive bias may weaken the flexibility, and thus the efficiency, of applying vision transformers to particular scenarios (e.g., masked image modeling). In this paper, we design a hierarchical vision transformer that maximally discards inductive bias, achieving both high efficiency and good performance.

2.2 Self-Supervised Learning and Masked Image Modeling

In the context of computer vision, self-supervised learning aims to learn compact visual representations from unlabeled data. The key to this goal is to design a pretext task that sets a natural constraint for the target model to achieve by tuning its weights. The existing pretext tasks are roughly partitioned into three categories, namely, geometry-based proxies that are built upon the spatial relationship of image contents wei2019iterative ; jigsaw ; rotation , contrast-based proxies that assume that different views of an image shall produce related visual features he2020momentum ; chen2020simple ; grill2020bootstrap ; caron2021emerging ; swav ; pixpro ; sage , and generation-based proxies that require visual representations to be capable of recovering the original image contents colorization ; inpainting ; MAE2021 ; bao2021beit ; beyond . After the self-supervised learning (a.k.a. pre-training) stage, the target model is often evaluated by fine-tuning on a few downstream recognition tasks – popular examples include image classification on ImageNet-1K ImageNet2009 , semantic segmentation on ADE20K zhou2017scene , and object detection and instance segmentation on COCO COCO2014 .

We are interested in a particular generation-based method named masked image modeling (MIM) bao2021beit ; MAE2021 . The flowchart is straightforward: some image patches (corresponding to tokens) are discarded, the target model receives the incomplete input, and the goal is to recover the original image contents. MIM is strongly related to the masked language modeling (MLM) task in NLP. BEiT bao2021beit transferred the task to the computer vision community by masking image patches and recovering the tokens produced by a pre-trained model (known as the tokenizer). MAE MAE2021 improved the MIM framework by only taking the visible tokens as input and computing the loss at the pixel level – the former change largely accelerated the training procedure as the computational costs of the encoder went down. The follow-up works explored different recovery targets wei2021masked , more complicated model designs CIM ; chen2020simple , and other pretext tasks.

It is worth noting that MIM matches plain vision transformers very well because each token is an individual unit and only the unmasked tokens are necessary during the pre-training process. These properties do not hold for hierarchical vision transformers, making it difficult for them to inherit the good properties (e.g., training efficiency). Although SimMIM xie2021simmim tried to combine Swin Transformer with MIM, it requires all tokens, including those corresponding to the masked patches, to be preserved throughout the encoder, incurring much heavier computational costs. In this paper, we present a hierarchical vision transformer that is free of such burden.

3 Hierarchical Vision Transformer for Masked Image Modeling

3.1 Preliminaries

Masked image modeling (MIM) is an emerging paradigm of self-supervised visual representation learning. The flowchart involves feeding a partially masked image to the target model and training the model to recover it. Mathematically, let the target model be f(·; θ), where θ denotes the learnable parameters. Given a training image, x, it is first partitioned into a few patches, {x_1, x_2, …, x_N}, where N is the number of patches. Then, MIM randomly chooses a subset S ⊂ {1, 2, …, N}, feeds the patches with IDs in S (denoted as x_S) into the target model (a.k.a., the encoder), and appends a decoder to it, aiming at recovering the original image contents, either tokenized features bao2021beit or pixels MAE2021 , at the end of the decoder. If f(·; θ) is able to solve the problem, it is believed that the parameters θ have been well trained to extract compact visual features.
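The flowchart above can be sketched in a few lines of NumPy; the helper names (`patchify`, `random_masking`) are our own, and the 224x224 input with 16x16 patches follows the MAE setting referenced in the text.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into N = (H/patch)*(W/patch) flattened patches."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

def random_masking(patches, mask_ratio, rng):
    """Keep a random subset S of patch IDs; return the visible patches x_S and S."""
    N = patches.shape[0]
    n_keep = int(N * (1 - mask_ratio))
    ids = rng.permutation(N)[:n_keep]
    return patches[ids], ids

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
p = patchify(img, 16)                     # 196 patches, each 16*16*3 = 768-dim
vis, ids = random_masking(p, 0.75, rng)   # 49 visible patches at 75% masking
```

The encoder then sees only `vis`, and the decoder is trained to reconstruct the dropped patches.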

An efficient vision model that fits MIM is the vanilla vision transformer, abbreviated as ViT ViT2021 . In ViT, each image patch is transformed into a token (i.e., a feature vector), and the tokens are propagated through a few transformer blocks for visual feature extraction. Let there be L blocks; the l-th block takes the token set U_{l-1} as input and outputs U_l, for l = 1, 2, …, L. The main part of each block is self-attention, for which three intermediate features are computed upon U_{l-1}, namely the query, key, and value, denoted as Q, K, and V, respectively. Based on these quantities, the self-attention output is computed by softmax(QK^T / sqrt(d)) · V, where sqrt(d) is a scaling factor. Auxiliary operations, including layer normalization, multi-layer perceptron, and skip-layer connections, are applied after the self-attention computation. ViT has been applied to a series of vision problems, but we emphasize its particular efficiency on MIM, which lies in that the tokens not in S can be discarded at the beginning of the encoder, decreasing the complexity of the pre-training process by a large factor (e.g., only 25% of the tokens are kept in the regular setting of MAE MAE2021 ).
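The self-attention computation above can be written directly over a serialized token list; this NumPy sketch (single head, illustrative weight shapes) shows that nothing in it depends on which tokens were dropped.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(U, Wq, Wk, Wv):
    """Single-head self-attention over a token set U of shape (N, d).
    Masked tokens can simply be absent from U: no spatial coordinates
    are consulted anywhere in the computation."""
    Q, K, V = U @ Wq, U @ Wk, U @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))   # (N, N) attention weights
    return A @ V

rng = np.random.default_rng(0)
W = [rng.standard_normal((64, 64)) / 8 for _ in range(3)]
full = self_attention(rng.standard_normal((196, 64)), *W)  # all 196 tokens
kept = self_attention(rng.standard_normal((49, 64)), *W)   # 25% visible subset
```

Running on the 49-token subset costs roughly (49/196)^2 of the attention FLOPs, which is the source of MAE-style acceleration.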

Intuitively, hierarchical vision transformers (e.g., Swin Transformer) are better at capturing multi-level visual features. Such a model has three major differences from ViT: (i) the architecture is partitioned into a few stages, and the spatial resolution, rather than being fixed, gradually shrinks throughout the forward propagation; (ii) to handle relatively large token maps, the self-attention computation is constrained within a grid of windows, and the window partition is shifted across layers; (iii) global positional encoding is replaced by relative positional encoding – this is to fit the window attention mechanism. Although hierarchical vision transformers report higher visual recognition accuracy, these models are not as efficient as ViT in terms of MIM, for reasons revealed in the next part. Consequently, few prior works have tried the combination – as an example, SimMIM xie2021simmim fed the entire image (with the masked patches replaced by learnable mask tokens) into the encoder, resulting in heavier computational costs in time and memory.
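Difference (ii) can be sketched as follows (illustrative NumPy, not Swin's actual implementation): both the window partition and the cyclic shift index tokens by their 2-D coordinates, which is exactly why masked tokens cannot simply be dropped from the list.

```python
import numpy as np

def window_partition(tokens_2d, win):
    """Split an (H, W, d) token map into (H//win * W//win) windows of win*win tokens."""
    H, W, d = tokens_2d.shape
    x = tokens_2d.reshape(H // win, win, W // win, win, d)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, d)

def shift(tokens_2d, s):
    """Cyclic shift of the token map, as applied between consecutive Swin layers."""
    return np.roll(tokens_2d, shift=(-s, -s), axis=(0, 1))

tmap = np.arange(8 * 8 * 1.0).reshape(8, 8, 1)     # an 8x8 token map
wins = window_partition(tmap, 4)                    # 4 windows of 16 tokens
shifted = window_partition(shift(tmap, 2), 4)       # shifted-window partition
```

Self-attention is then computed independently inside each window, so which tokens may interact depends on their (row, col) positions.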

Figure 2: Comparison of the architectures of Swin Transformer, ViT, and the proposed HiViT.

3.2 HiViT: Efficient Hierarchical Transformer for MIM

We pursue the efficient implementation of MAE MAE2021 , i.e., only the active (unmasked) tokens are fed into the encoder – mathematically, we are always dealing with a squeezed list of tokens. The major difficulty of integrating it with hierarchical vision transformers (e.g., Swin Transformers) lies in the ‘local inter-unit operations’, which make it difficult to serialize the tokens and abandon the inactive (masked) ones. To remove them, we first set the masking unit size to be the token size at the main stage – for Swin Transformers, this is the 3rd stage, which forms the major part of the model (e.g., for Swin-B, the 3rd stage has 18 blocks while the entire architecture has 24 blocks). The masking unit is thus 16x16 pixels, which aligns with the constant token size of ViT. Then, we adjust the model as follows:

  • For the operations after the main stage, we do not allow the patch merging that mixes active and inactive patches. For simplicity, we directly remove the last (4th) stage of the Swin Transformer, which has only two blocks, and append the same number of blocks to the 3rd stage. Since the 3rd stage has a smaller token dimensionality, this operation saves trainable parameters, while changing the ‘local inter-unit operations’ in the 3rd stage to ‘global inter-unit operations’. As a result, the model’s recognition performance becomes better (Fig. 2).

  • For the operations prior to the main stage, we do not allow window attention in the first 2 stages. That said, we remove the shifted windows of Swin and do not introduce any other ‘local inter-unit operations’ such as window attentions or convolutions. As an alternative, we only employ MLP blocks (replacing the self-attention with another MLP layer) in the first 2 stages. Surprisingly, as demonstrated in Fig. 2, this modification brings a performance improvement without bells and whistles. Compared to the plain ViT shown in Fig. 2, the derived architecture possesses the hierarchical property and only requires two extra MLP blocks, yet enjoys much better performance on both self- and fully-supervised learning.

The above procedure produces an architecture between Swin Transformer (hierarchical) and ViT (plain). We illustrate the procedure in Figure 2. The resulting architecture, named HiViT (short for Hierarchical ViT), is structurally simple and admits an efficient implementation for MIM. Specifically, HiViT abandons all the ‘local inter-unit operations’ in the entire architecture, therefore the masked image patches can be discarded at the input layer and their computation can be eliminated in all stages. As a result, HiViT enjoys both the effectiveness of hierarchical vision transformers in capturing visual representations (i.e., the recognition accuracy is much higher than ViT) and the efficiency of plain vision transformers in the masked image modeling task (i.e., the efficient implementation of MAE MAE2021 can be directly transplanted, making HiViT nearly 2x faster than Swin Transformer in MIM). Detailed results are provided in the experimental part.
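Under the standard strides (a 4x patch embedding followed by two 2x mergings, so each main-stage token covers 4·2·2 = 16 pixels per side), the forward pass can be sketched at the shape level in NumPy. Random weights stand in for learned parameters, and all names are illustrative rather than taken from any released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, d_out):
    # Stand-in for a learned MLP block (no attention, hence intra-unit only).
    W = rng.standard_normal((x.shape[-1], d_out)) / np.sqrt(x.shape[-1])
    return np.maximum(x @ W, 0.0)

def merge_2x2(x):
    """Patch merging inside each masking unit: (n, h, w, d) -> (n, h/2, w/2, 4d)."""
    n, h, w, d = x.shape
    x = x.reshape(n, h // 2, 2, w // 2, 2, d)
    return x.transpose(0, 1, 3, 2, 4, 5).reshape(n, h // 2, w // 2, 4 * d)

def hivit_encoder(units, dims=(128, 256, 512)):
    """units: (n_visible, 4, 4, d0) - each 16x16-pixel masking unit holds a
    4x4 grid of stage-1 tokens. Stages 1-2 are intra-unit MLP stages; after
    two mergings, each unit becomes a single main-stage token, ready for
    plain-ViT-style global self-attention over the serialized list."""
    x = mlp(units, dims[0])            # stage 1: (n, 4, 4, 128)
    x = mlp(merge_2x2(x), dims[1])     # stage 2: (n, 2, 2, 256)
    x = mlp(merge_2x2(x), dims[2])     # entering main stage: (n, 1, 1, 512)
    return x.reshape(x.shape[0], dims[2])

tokens = hivit_encoder(rng.standard_normal((49, 4, 4, 96)))  # (49, 512)
```

Because every operation above stays inside a masking unit, the visible units can be fed as a squeezed list of any length, exactly as in MAE.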

Model | Depth (stages 1/2/3) | Dim | Heads | Params (M) | FLOPs (G)
HiViT-T (Tiny) | 1 / 1 / 10 | 96 / 192 / 384 | - / - / 6 | 19.2 | 4.6
HiViT-S (Small) | 2 / 2 / 20 | 96 / 192 / 384 | - / - / 6 | 37.5 | 9.1
HiViT-B (Base) | 2 / 2 / 20 | 128 / 256 / 512 | - / - / 8 | 66.4 | 15.9
Table 1: Configurations for HiViT variants.
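Transcribed into code, Table 1 reads as follows (the dict layout and key names are our own):

```python
# HiViT variants from Table 1: per-stage depths, dims, and main-stage heads.
# Stages 1-2 use MLP blocks, so attention heads exist only in the 3rd stage.
HIVIT_CONFIGS = {
    "HiViT-T": {"depth": (1, 1, 10), "dim": (96, 192, 384),  "heads": 6},
    "HiViT-S": {"depth": (2, 2, 20), "dim": (96, 192, 384),  "heads": 6},
    "HiViT-B": {"depth": (2, 2, 20), "dim": (128, 256, 512), "heads": 8},
}
```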

4 Experiments

We first conduct fully supervised experiments with the proposed HiViT on the ImageNet dataset ImageNet2009 . Then, the HiViT models are tested with masked image modeling (MIM) self-supervised pre-training MAE2021 . The self-supervised pre-trained models are also transferred to downstream tasks, including object detection on the COCO dataset COCO2014 and semantic segmentation on ADE20K zhou2017scene . We also provide ablation studies of our method on both fully- and self-supervised learning.

4.1 ImageNet Classification with Labels

Training Settings

We first evaluate our models with fully supervised learning on ImageNet-1K ImageNet2009 , which contains 1.28M training images and 50K validation images divided into 1,000 categories. We follow Swin Swin2021 and use the same training settings without other tricks. Specifically, we use the AdamW optimizer adamw with an initial learning rate of 0.001, a weight decay of 0.05, a batch size of 1024, a cosine decay learning rate scheduler, and a linear warm-up for 20 epochs. All the models are trained for 300 epochs with the augmentation and regularization strategies of Swin2021 and the exponential moving average (EMA) technique. The input size is 224x224 by default. The output feature of the 3rd stage is followed by an average pooling layer and then a classifier layer. We adopt increasing drop path rates for HiViT-T/S/B.

Model Params FLOPs Top-1
(M) (G) (%)
DeiT-S/16 DeiT2021 22.1 4.5 79.8
PVT-S wang2021pyramid 24.5 3.8 79.8
Swin-T liu2021swin 28.3 4.5 81.2
CvT-13 wu2021cvt 20.0 4.5 81.6
CaiT-XS-24 touvron2021going 26.6 5.4 81.8
HiViT-T (Ours) 19.2 4.6 82.1
CvT-21 wu2021cvt 32.0 7.1 82.5
UFO-ViT-M song2021ufo 37.0 7.0 82.8
Swin-S liu2021swin 49.6 8.7 83.1
ViL-M zhang2021multi 39.7 9.1 83.3
CaiT-S36 touvron2021going 68.0 13.9 83.3
HiViT-S (Ours) 37.5 9.1 83.5
Model Params FLOPs Top-1
(M) (G) (%)
ResNet-152 Resnet2016 60.0 11.0 78.3
PVT-L wang2021pvtv2 61.4 9.8 81.7
DeiT-B/16 DeiT2021 86.7 17.4 81.8
CrossViT-B chen2021crossvit 104.7 21.2 82.2
T2T-ViT-24 yuan2021tokens 64.1 14.1 82.3
CPVT-B chu2021conditional 88.0 17.6 82.3
TNT-B han2021transformer 65.6 14.1 82.8
ViL-B zhang2021multi 55.7 13.4 83.2
UFO-ViT-B song2021ufo 64.0 11.9 83.3
CaiT-M24 touvron2021going 185.9 36.0 83.4
Swin-B liu2021swin 87.8 15.4 83.5
HiViT-B (Ours) 66.4 15.9 83.8
Table 2: Comparison of image classification on ImageNet-1K for different models. All models are trained and evaluated at 224x224 resolution on ImageNet-1K by default, unless otherwise noted.

Model Configurations

Three models, HiViT-T/S/B, are tested with fully supervised learning, and their configurations are shown in Tab. 1. "Depth" represents the number of blocks in the 1st, 2nd, and 3rd stages, respectively. "Dim" and "Heads" represent the dimension and the number of attention heads of the 3 stages. We align our models with competitors on FLOPs, and our parameter counts are lower than those of other works. We report training efficiency in Tab. 5, measured on V100 GPUs with the same script.

ImageNet Results

The fully supervised training results are shown in Tab. 2. Compared to vanilla ViT models, all the HiViT models report dominant results: HiViT-T/B surpass the DeiT-S/B models by 2.3% and 2.0% respectively, with similar FLOPs and fewer parameters. Compared to the follow-ups, our models still show competitive results. In particular, HiViT-T/S/B beat Swin-T/S/B by 0.9%, 0.4%, and 0.3% respectively, with similar complexities and fewer parameters. All the results are obtained without any tricks beyond those of Swin Swin2021 . In addition, all our models are parameter-friendly; for example, compared to the Swin-T/S/B models, HiViT-T/S/B enjoy 32%, 24%, and 24% fewer parameters. We note that our models are structurally simple, which offers a promising baseline for future research.

Ablation Studies

We conduct fully supervised ablation studies to show the advantages of our method. In this part, we inherit the training settings above, and the results are shown in Tab. 3. 'Setting' lists the modules we remove, from top to bottom. 'Dim' is the token dimension at the main-stage resolution. 'Depth' lists the block numbers of the corresponding stages. 'RPE' is the relative position embedding, and 'Win. Att.' denotes whether there are window attentions in the model. As shown in Tab. 3, removing the last stage (Stage4) from Swin-B (while using global attention for stage 3 simultaneously) brings a performance improvement of 0.1%, which implies that the last stage is unnecessary. Replacing window attention with MLP blocks in the first 2 stages (Win. Att.) boosts the performance to 83.8%, which demonstrates that window attention is unnecessary in the early stages. The RPE is important, and getting rid of it (RPE) harms the performance by about 0.3%. If we abandon the first 2 stages and down-sample using a patch embedding like plain ViT while increasing the block number to 24 (Hierarchical), the performance decreases from 83.8% to 82.9%. However, that is still higher than the 81.8% of plain ViT (Deep), which implies that the hierarchical input module is important and a deeper architecture is much better than a shallow one.

Model Setting Dim Depth RPE Win. Att. Params FLOPs Top-1
(M) (G) (%)
Swin-B - 512 2 2 18 2 88.0 15.4 83.5
Stage4 512 2 2 20 - 66.3 16.0 83.6
HiViT-B Win. Att. 512 2 2 20 - 66.4 15.9 83.8
RPE 512 2 2 20 - 66.3 15.9 83.5
Hierarchical 512 - - 24 - 76.7 15.8 82.9
ViT-B Deep 768 - - 12 - 86.6 17.5 81.8
Table 3: Ablations of fully-supervised training on ImageNet-1K.

4.2 Self-Supervised Learning Results

Experimental Details

For self-supervised pre-training, we use the ImageNet-1K training set without labels, and evaluate the pre-trained models with fine-tuning and linear probing metrics on the validation set. We inherit the pre-training settings of MAE MAE2021 . Specifically, we set the mask ratio to 75% by default, and the normalized-target trick is also adopted. We use the AdamW optimizer adamw with a weight decay of 0.05, and the learning rate follows a cosine decay schedule with a linear warm-up. The batch size is set to 4096 and the input size is 224x224. The overall pipeline is an encoder-decoder framework; the decoder has 6 transformer layers followed by a reshape operation that casts the features back to the image resolution. As for data augmentation, we only employ random cropping and random horizontal flipping. We test the HiViT-B model in this part; the model is pre-trained for 300 and 800 epochs and then evaluated with fine-tuning and linear-probing metrics. For fine-tuning, we inherit the training settings from MAE2021 : all models are trained for 100 epochs using the AdamW optimizer with a warm-up, a weight decay of 0.05, and an input size of 224x224. We use a layer-wise learning rate decay of 0.65 and a batch size of 1024. For linear probing, we train all models for 100 epochs using the LARS huo2021large optimizer with a batch size of 16,384 and a learning rate of 0.1.
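The cosine decay schedule with linear warm-up mentioned above can be written as a small helper; this is a generic sketch of the MAE-style schedule, and the base rate and warm-up length passed in below are placeholders, not values from the paper.

```python
import math

def lr_at(step, total, warmup, base_lr):
    """Linear warm-up for `warmup` steps, then cosine decay to zero at `total`."""
    if step < warmup:
        return base_lr * step / max(warmup, 1)
    t = (step - warmup) / max(total - warmup, 1)   # progress in [0, 1]
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

For example, `lr_at(step, total=1000, warmup=40, base_lr=1e-3)` rises linearly to 1e-3 over the first 40 steps and then decays smoothly to zero.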


Fine-tuning

The fine-tuning and linear probing results are provided in Tab. 4; only the encoder part is used for evaluation. As shown in Tab. 4, the HiViT-B (300e) version achieves a 0.2% improvement over the MAE model pre-trained for 1,600 epochs. Our longer training schedule (800e) attains the dominant result of 84.2%, which outperforms MAE (1600e) by 0.6% and SimMIM (Swin-B) by 0.2%. Compared to other methods including CAE, BEiT, and MaskFeat, the HiViT-B model also shows superior results: +1.0% over BEiT, +0.6% over CAE, and +0.2% over MaskFeat.

Linear Probing

We evaluate the pre-trained models with the linear probing metric, where all parameters of the encoder are frozen except for a learnable classifier layer. From Tab. 4, we can see that the HiViT-B model achieves a good result of 71.3%, the best among all the MIM-based methods. For example, the 800e model surpasses MAE (1600e) by 3.3% with fewer pre-training epochs and the same training settings. The result also outperforms SimMIM (ViT-B) by 2.6% and CAE (ViT-B) by 3.0%.

Method Network Params Supervision Encoder Epochs FT (%) LIN (%)
BEiT bao2021beit ViT-B 86 DALLE 100% 400 83.2 -
CAE chen2022context ViT-B 86 DALLE 100% 800 83.6 68.3
MaskFeat wei2021masked ViT-B 86 HOG 100% 800 84.0 -
SimMIM xie2021simmim ViT-B 86 Pixel 100% 800 83.8 68.7
SimMIM xie2021simmim Swin-B 86 Pixel 100% 800 84.0 -
MAE MAE2021 ViT-B 86 Pixel 25% 1600 83.6 68.0
Ours HiViT-B 66 Pixel 25% 300 83.8 -
Ours HiViT-B 66 Pixel 25% 800 84.2 71.3
Table 4: Self-supervised learning results: fine-tuning (FT) and linear probing (LIN) top-1 accuracy on ImageNet-1K.
Input Size ViT-B Swin-B HiViT-B
(MAE) (SimMIM) (Ours)
7.2 14.2 7.4
9.5 18.4 9.7
Table 5: Pre-training efficiency comparison with different input sizes.

Training Efficiency

HiViT only requires the active tokens as inputs, so our method enjoys high efficiency during MIM pre-training. As shown in Tab. 5, we report the pre-training speed of MAE (ViT-B), SimMIM (Swin-B), and our HiViT-B with different input sizes. All the results represent the pre-training time (in minutes) of 1 epoch on V100 GPUs. With the smaller input size, HiViT-B only takes 7.4 minutes per epoch, which is about 1.9x faster than SimMIM and comparable with MAE. With the larger input size, HiViT-B takes about 9.7 minutes, again about 1.9x faster than SimMIM and comparable with MAE. We note that we use 6 decoder blocks with a dimension of 512 by default. Decreasing the decoder block number or dimension further accelerates pre-training without affecting the results; for example, setting the decoder block number to 4 and the dimension to 384 only requires about 8 minutes per epoch and achieves comparable performance.

ID Depth Params FLOPs FT
(M) (G) (%)
#0 2 2 20 66.4 15.9 83.8
#1 1 1 22 71.8 15.9 83.9
#2 2 0 22 71.1 15.9 83.7
#3 0 2 22 72.3 15.9 83.6
#4 0 0 24 77.1 16.0 83.6
Table 6: Ablations by pre-training for 300 epochs and fine-tuning for 100 epochs (FT). The three Depth columns denote the block numbers of the 1st, 2nd, and 3rd stages, respectively. ID #0 is the default setting.

Ablation Studies

We perform experiments to ablate our method (see Tab. 6); all results are attained by pre-training for 300 epochs. As shown, we test self-supervised pre-training with different block numbers for stages 1, 2, and 3. The default setting (#0) achieves 83.8% with the 2-2-20 block configuration. Decreasing the block numbers of stages 1 and 2 while increasing that of stage 3 (#1) brings more parameters and a better performance of 83.9%, which is comparable to the SimMIM result (84.0%) obtained by pre-training Swin-B for 800 epochs. We note that the 71.8M parameters are still much lower than the 88.0M of Swin-B. Removing stage 1 (#3) or stage 2 (#2) harms the performance, to 83.6% and 83.7% respectively, which demonstrates that the hierarchical architecture before the main stage is important and brings a performance improvement. In addition, the fact that #3 is lower than #2 suggests that the first stage is more important than the second. Removing both of the first 2 stages (#4) attains a result of 83.6%, which further verifies the importance of the hierarchical architecture.

4.3 Transfer to Dense Prediction Tasks

Experimental Details

We transfer the self-supervised pre-trained models above to object detection on COCO and semantic segmentation on ADE20K, following the common practice for both tasks. For the COCO experiments, we use the Mask R-CNN MaskRCNN2017 head implemented in the MMDetection library MMdet2019 . We use the AdamW optimizer adamw with a learning rate that decays by 10x after the 9th and 11th epochs. The layer-wise decay rate is set to 0.75 and the 1x training schedule (12 epochs) is adopted. We also apply the multi-scale training strategy with single-scale testing. For ADE20K, we use the UperNet xiao2018unified head following BEiT bao2021beit . We also choose the AdamW optimizer. We train the model for 160K iterations with a batch size of 16. The input resolution is 512x512, without multi-scale testing.

Object Detection on COCO.

We adopt the same settings as CAE chen2022context to evaluate our model on MS-COCO, choosing the 5th, 9th, 13th, and 19th blocks as inputs to the subsequent FPN. In Tab. 7, we compare our performance with state-of-the-art methods. Compared to BEiT bao2021beit (results borrowed from CAE chen2022context ), HiViT-B shows superior box AP and mask AP. MAE MAE2021 (1600e) falls below our results on both metrics, and CAE chen2022context , despite improving upon them by pre-training for 800 epochs and using the image tokenizer described in DALL-E ramesh2021zero , still remains below our results.

Semantic Segmentation on ADE20K.

The results on ADE20K are shown in Tab. 7. We do not introduce any extra tricks and test all models under the same settings, reporting the mean intersection over union (mIoU). As shown, MoCo-v3 reports a lower mIoU than ours, and BEiT bao2021beit and MAE MAE2021 also fall behind, even when MAE is pre-trained for 1600 epochs. Compared to these state-of-the-art methods, HiViT-B, pre-trained for 800 epochs, reports a higher result than all of them apart from CAE chen2022context , which uses the tokenizer of DALL-E ramesh2021zero .

Method Network Params (M) Pre-train data COCO (AP^box / AP^mask) ADE20K (mIoU)
Supervised MAE2021 ViT-B 86 IN1K w/ labels 47.9 42.9 47.0
MoCo v3 chen2021empirical ViT-B 86 IN1K 45.5 40.5 47.3
BEiT bao2021beit ViT-B 86 IN1K+DALLE 42.1 37.8 47.1
CAE chen2022context ViT-B 86 IN1K+DALLE 49.2 43.3 48.8
MAE MAE2021 (1600e) ViT-B 86 IN1K 48.4 42.6 48.1
Ours (800e) HiViT-B 66 IN1K 49.5 43.8 48.3
Table 7: Downstream task fine-tuning results transferred from self-supervised pre-training.

5 Conclusions

This paper presents a hierarchical vision transformer named HiViT. Starting from Swin Transformer, we remove redundant operations that cross the borders of tokens in the main stage, and show that such modifications do not harm, but slightly improve, the model's performance in both fully-supervised and self-supervised visual representation learning. HiViT shows a clear advantage in integrating with masked image modeling, to which the efficient implementation designed for ViT can be directly transplanted, substantially accelerating training. We expect HiViT to become an off-the-shelf replacement for ViT and Swin Transformer in future research.


Despite the improvements observed in the experiments, our method has some limitations. The most important one is that the masking unit size is fixed, which implies that we need to choose a single 'main stage'. Fortunately, the third stage of Swin Transformer contributes most of the parameters and computation, so it is the natural choice; however, the method may encounter difficulties in scenarios where no dominant stage exists. In addition, we look forward to more flexible architecture designs that go beyond this constraint: a possible solution lies in modifying low-level code (e.g., CUDA) to support arbitrary and variable grouping of tokens.

Societal Impacts

Our research focuses on (i) designing efficient architectures for deep neural networks and (ii) self-supervised learning. Both topics have been widely studied in the community, and our work does not have societal impacts beyond those of prior work in these areas.


Appendix A Architecture Comparison of Fully-Supervised Learning on ImageNet-1K

We compare different transformer architectures, including plain, hierarchical, and hybrid ones, on fully-supervised learning using ImageNet-1K; the results are shown in Tab. 8. Plain transformers usually do not contain local inter-unit operations, which makes them structurally simple, but they generally perform worse than hierarchical ones. To equip transformers with hierarchical properties, researchers introduce complex local inter-unit operations (such as spatial-reduction attention and window attention, as shown in the table) into plain transformers, so that the models better cater to visual priors and attain stronger results. Our HiViT, as shown, enjoys the hierarchical attributes without local inter-unit operations, which keeps the model structurally simple and MIM-friendly; in addition, HiViT obtains better results without bells and whistles. Hybrid transformers add convolutional layers (beyond patch embedding) to transformers and usually attain more dominant results than the former two types, but they are unfriendly to MIM self-supervision because of the sliding-window prior of convolutional layers. Moreover, our self-supervised HiViT-B still achieves a competitive result.

Model type Model Local inter-unit operations Params (M) FLOPs (G) Top-1 (%)
ViT-B/16 DeiT2021 - 86.7 17.4 81.8
CrossViT-B chen2021crossvit - 104.7 21.2 82.2
CaiT-M24 touvron2021going - 185.9 36.0 83.4
T2T-ViT-24 yuan2021tokens - 64.1 14.1 82.3
TNT-B han2021transformer - 65.6 14.1 82.8
PVT-L wang2021pvtv2 spatial-reduction attention 61.4 9.8 81.7
ViL-B zhang2021multi window attention 55.7 13.4 83.2
Swin-B liu2021swin shifted window attention 87.8 15.4 83.5
HiViT-B (Ours) - 66.4 15.9 83.8
ConformerConformer2021 CNN branch 83.3 23.3 84.1
CoAtNet-2dai2021coatnet depth-wise convolution 75.0 15.7 84.1
CSwin-B dong2021cswin cross-shaped window attention 78.0 15.0 84.2
Self-supervised HiViT-B (Ours) - 66.4 15.9 84.2
Table 8: ImageNet-1K results of different transformer architectures.

Appendix B Improved Results on COCO and ADE20K

In dense prediction tasks such as object detection and semantic segmentation, the feature pyramid is a key component. Our experiments in Tab. 7 use the same settings as the self-supervised pre-trained ViT models in MAE2021 ; chen2022context : intermediate features are extracted and up-sampled/down-sampled by deconvolution/convolution to generate the feature pyramid, which works for plain transformers but does not take advantage of our hierarchical structure.

In order to fully exploit the capabilities of our hierarchical vision transformer, we adopt an improved experimental setting following liu2021swin . We use the three feature resolutions (with strides of 4, 8, and 16) generated by stages 1/2/3, and add a stride-32 feature down-sampled from the stage-3 output to align with the standard feature pyramid generation. With this change, HiViT achieves improved results on COCO and ADE20K, as shown in Tab. 9. In particular, it achieves 51.2% box AP and 44.2% mask AP on COCO and 51.2% mIoU on ADE20K.
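The pyramid construction above can be illustrated with a small shape calculation. This is a sketch under the assumption of a square input; the helper name `pyramid_shapes` is invented for illustration and only tracks spatial sizes, not actual features.

```python
# Sketch of the improved feature-pyramid setting: stages 1/2/3 give
# stride-4/8/16 features, and an extra stride-32 level is down-sampled
# from the stage-3 output. Only spatial side lengths are computed here;
# the helper name is hypothetical.
def pyramid_shapes(img_size):
    strides = [4, 8, 16]                      # strides of stages 1, 2, 3
    sides = [img_size // s for s in strides]  # side length of each feature map
    sides.append(sides[-1] // 2)              # extra stride-32 level
    return sides

print(pyramid_shapes(224))  # a 224x224 input gives side lengths [56, 28, 14, 7]
```

This yields the standard four-level pyramid expected by FPN-style detection and segmentation heads.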

Method Network Params (M) Pre-train data COCO (AP^box / AP^mask) ADE20K (mIoU)
Supervised MAE2021 ViT-B 86 IN1K w/ labels 47.9 42.9 47.0
MoCo v3 chen2021empirical ViT-B 86 IN1K 45.5 40.5 47.3
BEiT bao2021beit ViT-B 86 IN1K+DALLE 42.1 37.8 47.1
CAE chen2022context ViT-B 86 IN1K+DALLE 49.2 43.3 48.8
MAE MAE2021 (1600e) ViT-B 86 IN1K 48.4 42.6 48.1
Ours HiViT-B 66 IN1K 49.5 43.8 48.3
Ours (improved) HiViT-B 66 IN1K 51.2 44.2 51.2
Table 9: Improved downstream task fine-tuning results on COCO and ADE20K.

Appendix C The Processing Pipeline of HiViT in MIM Pre-training

In MIM pre-training, HiViT uses the encoder to process only the visible tokens of the input image, as shown in Fig. 3. Since we have multi-resolution image patches, we call the smallest unit that may be masked a mask-unit. For an input image, we first apply patch embedding to obtain a feature tensor of shape N×4×4×128, where N is the total number of mask-units. Then, we randomly select a certain proportion of mask-units to obtain a tensor of shape n×4×4×128, where n is the number of visible mask-units, which is often much smaller than N. HiViT processes the features of the visible mask-units stage by stage, and finally produces 512-dimensional vectors as the output of the encoder, which are fed to the decoder for pixel restoration.
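The random selection of visible mask-units can be sketched as below. The function name, the mask ratio of 0.75, and the unit count of 196 are illustrative assumptions (typical MAE-style values), not values fixed by this paper.

```python
import random

# Sketch of the MIM masking step: from N serialized mask-units, randomly
# keep a fraction (1 - mask_ratio) as the visible units fed to the encoder.
# The function name and the example values are hypothetical.
def select_visible_units(num_units, mask_ratio, seed=0):
    rng = random.Random(seed)
    num_visible = int(num_units * (1 - mask_ratio))
    indices = list(range(num_units))
    rng.shuffle(indices)
    return sorted(indices[:num_visible])  # indices of visible mask-units

visible = select_visible_units(num_units=196, mask_ratio=0.75)
# With a 75% mask ratio, only 49 of 196 mask-units enter the encoder.
```

The encoder then gathers only these indices from the patch-embedded features, which is what keeps pre-training cost proportional to the visible fraction.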

During the pre-training process, all three stages only need to process visible mask-units – this is why our model preserves the hierarchical structure while keeping self-supervised pre-training efficient.

Figure 3: Pipeline of HiViT in MIM pre-training.