
Architecture-Agnostic Masked Image Modeling – From ViT back to CNN

by Siyuan Li, et al.

Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers (ViT). Its underlying idea is simple: a portion of the input image is randomly masked out and then reconstructed via the pre-text task. However, why MIM works well is not well explained, and previous studies insist that MIM primarily works for the Transformer family but is incompatible with CNNs. In this paper, we first study interactions among patches to understand what knowledge is learned and how it is acquired via the MIM task. We observe that MIM essentially teaches the model to learn better middle-level interactions among patches and extract more generalized features. Based on this observation, we propose an Architecture-Agnostic Masked Image Modeling framework (A^2MIM), which is compatible with not only Transformers but also CNNs in a unified way. Extensive experiments on popular benchmarks show that A^2MIM learns better representations and endows the backbone model with a stronger capability to transfer to various downstream tasks for both Transformers and CNNs.





1 Introduction

Supervised deep learning with large-scale annotated data has witnessed an explosion of success in computer vision (CV) NIPS2012_c399862d; he2016deep and natural language processing (NLP). However, large amounts of high-quality annotations are not always available in real-world applications. Learning representations without supervision by leveraging pre-text tasks has become increasingly popular. In CV, early self-supervised learning approaches eccv2016coloring; iccv2015relativeloc; iclr2018rotation aim to capture invariant features by predicting transformations applied to the same image. However, these methods rely on vision-specific ad-hoc heuristics, and the representations thus learned are less generic for downstream tasks. Recently, contrastive learning-based approaches to self-supervised learning have made significant progress, even outperforming supervised methods on several downstream tasks. Despite different contrasting mechanisms, contrastive-based methods learn generic representations by minimizing the distance between two augmented views of the same image in the embedding space.

More recently, inspired by masked autoencoding in NLP such as GPT radford2018improving and BERT devlin2018bert, Masked Image Modeling (MIM) methods he2021masked; wei2021masked; xie2021simmim have brought about new advances in self-supervised pre-training for CV tasks. The transition from human language understanding to NLP masked autoencoding is quite natural because filling in missing words in a sentence requires relatively comprehensive semantic understanding. By analogy, humans can understand and imagine masked content by visually filling in the missing structures of an image containing occluded parts. The underlying idea of MIM is thus to randomly mask out a proportion of the image and then recover the masked patches. Unlike contrastive learning, which yields a clustering effect from pre-training by pulling similar samples together and pushing dissimilar samples apart, MIM pre-training methods have not been extensively explored in terms of what knowledge is expected to be learned or how this knowledge is acquired. Moreover, the success of existing MIM methods is largely confined to Vision Transformer (ViT) structures dosovitskiy2020image. Current MIM frameworks are not generic in terms of network architectures because it is not straightforward to directly apply the mask token devlin2018bert and positional embedding to convolutional neural networks (CNNs).

In this work, we carry out systematic experiments and show that MIM as a pre-training task essentially teaches the model to better learn middle-level interactions between patches for more generalized feature extraction, regardless of the underlying network structure. Compared to the local texture features learned by low-level interactions between patches, more complex features such as shape and edge can be extracted via middle-level interactions among patches. The interaction of patches can be considered as information fusion via both the convolution operation of a CNN and the self-attention mechanism of a Transformer. That is to say, CNNs and Transformers should both benefit from better middle-level interactions with MIM as the pre-text task. To bridge the gap of MIM in terms of network architectures, based on our extensive experimental analysis, we propose an Architecture-Agnostic Masked Image Modeling framework (A^2MIM) that focuses on enhancing the middle-level interaction capabilities of the network. Specifically, we mask the input image with the mean RGB value and place the mask token at intermediate feature maps of the network. In addition, we propose a loss in the Fourier domain to further enhance the middle-level interaction capability of the network. Our contributions are summarized as follows:

  • We conducted systematic experiments and showed that the essence of MIM is to better learn middle-level interactions between patches rather than to improve reconstruction quality.

  • We proposed a novel MIM-based framework dubbed A^2MIM that bridges the gap between CNNs and Transformers. To the best of our knowledge, we are the first to carry out MIM on CNNs in a way that outperforms contrastive learning counterparts.

  • Extensive experiments with both Transformers and CNNs on ImageNet-1K and public benchmarks for various downstream tasks show that our method achieves significant improvements in pre-trained representation quality over state-of-the-art methods.

2 Related Work

Learning representations without supervision by leveraging pre-text tasks has become increasingly popular. This section briefly discusses two types of popular self-supervised vision pre-training approaches: contrastive learning and autoregressive modeling.

2.1 Contrastive Learning

Contrastive learning learns instance-level discriminative representations by extracting invariant features over distorted views of the same data. MoCo cvpr2020moco adopted a large momentum memory bank to introduce enough negative samples, while SimCLR chen2020simple replaced the vanilla memory bank cvpr2018npid with a larger batch size. BYOL nips2020byol and its variants chen2020simsiam; chen2021empirical further eliminate the requirement of negative samples, using various techniques to avoid representation collapse. Besides pairwise contrasting, SwAV caron2020unsupervised clusters the data while enforcing consistency between multi-augmented views of the same image. Barlow Twins zbontar2021barlow proposed to measure the cross-correlation matrix of distorted views of the same image to avoid representation collapse. Meanwhile, some efforts have been made on top of contrastive methods to improve pre-training quality for specific downstream tasks iccv2021detco; xiao2021region; cvpr2021casting; wu2021align. With the recent emergence of Transformers as an alternative to CNNs for CV tasks, MoCoV3 chen2021empirical and DINO iccv2021dino adopted ViT dosovitskiy2020image in self-supervised pre-training to replace CNN backbones.

2.2 Autoregressive Modeling

Autoencoders are a typical type of neural network architecture that allows representation learning with no annotation requirement. A standard autoencoder contains an encoder that maps the input to a representation and a decoder that outputs a reconstructed version of the input using the learned representation. By forcing a denoising property onto the learned representations, denoising autoencoders vincent2008extracting; vincent2010stacked are a family of autoencoders that reconstruct the uncorrupted input signal given a corrupted version of the signal as input. In addition to directly reconstructing the corrupted image, CIM fang2022corrupted also proposed to predict whether each visual token is replaced by a generator sample or not. Generalizing the notion of denoising autoregressive modeling, masked prediction has attracted the attention of both the NLP and vision communities. BERT devlin2018bert performs masked language modeling, where the task is to predict the randomly masked input tokens. Representations learned by BERT pre-training generalize well to various downstream tasks.

For CV, the inpainting task pathak2016context, which predicts large missing regions using convolutional networks, was proposed to learn representations without supervision. Restoring the original color of images with removed color channels was also proposed to learn generic representations eccv2016coloring. With the introduction of the Vision Transformer (ViT), iGPT chen2020generative predicts succeeding pixels given a sequence of pixels as input. MAE he2021masked and BEiT bao2021beit mask out random patches of the input image and reconstruct the missing patches with ViT. Compared to MAE, MaskFeat wei2021masked and SimMIM xie2021simmim adopt linear layers as the prediction head instead of another Transformer. MaskFeat proposed to use HOG as the prediction target instead of the RGB values used in MAE.

Some methods el2021large; zhou2021ibot; assran2022masked combine the idea of contrastive learning with MIM. SplitMask el2021large proposed to use half of the image pixels to predict the other half while applying the InfoNCE loss van2018representation across the corresponding latent representations. MSN assran2022masked matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. Similarly, iBOT zhou2021ibot adopts the Siamese framework to combine self-distillation with MIM. Moreover, data2vec baevski2022data2vec proposed a framework that applies the masked prediction idea to speech, NLP, or CV. In this work, we focus on MIM itself, and thus the combination of MIM with contrastive learning is beyond the scope of this paper.

Figure 1: (a)(b)(c)(d): Robustness against different occlusion ratios of images is studied for both ViT-S and ResNet-50 under different experimental settings (see Section 3.1). (e)(f)(g)(h): Distributions of the interaction strength are explored for both ViT-S and ResNet-50 under different experimental settings. The label indicates the pre-training method and the fine-tuning augmentation used; "random" stands for random weight initialization. Appendix B provides more results.

3 Intriguing Properties of Masked Image Modeling

3.1 Does MIM Bring Occlusion Robustness?

The higher degree of freedom of the receptive field of vision Transformers induced by the self-attention mechanism is believed to be the reason for Transformers outperforming CNNs. Compared to CNNs, Transformers gain tremendous performance improvement with carefully designed image augmentation techniques such as RandAug cubuk2020randaugment and CutMix yun2019cutmix. Random erasing zhong2020random randomly removes part of the image and replaces it with Gaussian noise, while CutMix randomly removes part of the image and replaces the corresponding region with a patch from another image. Similarly, in most MIM pre-training tasks, some patches of the image are masked out and replaced with a learnable mask token. We hypothesize that MIM as a pre-training task, as well as similar data augmentations, enhances the network's robustness towards occlusion and thus endows the network with a more generalized feature extraction ability.

To verify our hypothesis, we design an occlusion robustness test. Let $x \in \mathbb{R}^{3 \times H \times W}$ be an input image and $y \in \mathbb{R}^{C}$ be its corresponding label, where $C$ is the class number. Consider a classification task $y = f(x)$, where $f$ denotes a neural network. The network is considered robust if it outputs the correct label given an occluded version of the image $x'$, namely $y = f(x')$. For occlusion, we consider the patch-based random masking adopted in most MIM works he2021masked; xie2021simmim; wei2021masked. In particular, we split the image of size $H \times W$ into patches of size $p \times p$ and randomly mask $N_{mask}$ patches out of the total number of $N_{total}$ patches. The occlusion ratio is then defined as $N_{mask} / N_{total}$.

We conduct experiments on ImageNet-100 (IN-100) krizhevsky2012imagenet with both Transformers and CNNs under different settings. Without loss of generality, we choose ViT-S dosovitskiy2020image and ResNet-50 he2016deep as the network architectures. We compare robustness under the following settings: (a) random weight initialization with no image augmentation applied, (b) random weight initialization with different image augmentations applied, (c) MIM pre-training as weight initialization with and without image augmentations applied, and (d) contrastive learning pre-training as weight initialization with and without image augmentations applied. We report the average top-1 accuracy across five runs trained with different settings under various occlusion ratios in Figure 1.

The results in Figure 1 show that, for ViT-S, MIM pre-training is significantly more robust than random weight initialization. MIM pre-trained models give comparable accuracy up to an 80% occlusion ratio, whereas the accuracy of models with random weight initialization drops drastically. It can also be seen that applying the augmentations used in DeiT touvron2021training further improves model robustness. For random weight initialization, we notice that all patch-removing augmentations boost robustness, as the accuracy curve shows a convex trend with augmentations. A similar phenomenon can be observed for the CNN (ResNet-50) in Figure 1. Compared with the concave curve of the model with random weight initialization, MIM pre-trained CNN models demonstrate a convex curve, indicating a large robustness gain. Figure 1 also shows that, compared to contrastive learning, MIM yields a more convex accuracy curve, which indicates better occlusion robustness.
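The occlusion test described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names (`random_patch_mask`, `occlusion_ratio`, `occlude`, `is_robust`), the zero fill value, and the toy 8x8 single-channel image are hypothetical and are not taken from the paper's code.

```python
import random


def random_patch_mask(height, width, patch, num_masked, seed=0):
    """Return a patch-level grid of 0/1 flags (1 = masked)."""
    rows, cols = height // patch, width // patch
    rng = random.Random(seed)
    masked = set(rng.sample(range(rows * cols), num_masked))
    return [[1 if r * cols + c in masked else 0 for c in range(cols)]
            for r in range(rows)]


def occlusion_ratio(mask):
    """Ratio of masked patches to all patches."""
    flat = [flag for row in mask for flag in row]
    return sum(flat) / len(flat)


def occlude(image, mask, patch, fill=0.0):
    """Overwrite masked patches of a single-channel H x W image with `fill`."""
    return [[fill if mask[y // patch][x // patch] else pixel
             for x, pixel in enumerate(row)]
            for y, row in enumerate(image)]


def is_robust(model, occluded_image, label):
    """The prediction counts as robust if the occluded input keeps its label."""
    return model(occluded_image) == label


# Toy usage (assumed sizes): 8x8 image of ones, 2x2 patches, 4 of 16 masked.
image = [[1.0] * 8 for _ in range(8)]
mask = random_patch_mask(8, 8, patch=2, num_masked=4)
occluded = occlude(image, mask, patch=2)
ratio = occlusion_ratio(mask)
```

Sweeping `num_masked` from zero to the total number of patches and averaging `is_robust` over a validation set would give an accuracy-versus-occlusion curve of the kind reported in Figure 1.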

Figure 2: (a) Four patches interact with each other and form a contour or edge pattern of the fox for image categorization. (b) Image with 30% masking ratio. Masked patches interact with neighboring patches to predict the missing patches. (c) Image with 50% masking ratio. Masked patches are forced to rely on middle-level interactions for the MIM task. (d) Image with 70% masking ratio. A masked patch interacts with longer-range patches, forming an edge pattern. (e) A typical masking pattern for existing inpainting tasks.

3.2 Middle-level Interactions for Generalized Feature Extraction

In the previous subsection, we showed that the MIM task improves occlusion robustness for both Transformers and CNNs. However, it remains unclear how occlusion robustness relates to the knowledge learned via MIM. Note that existing MIM works adopt a medium or high masking ratio xie2021simmim; he2021masked (e.g., 60% or 70%, see Figure 2) during pre-training, and in these settings the pairwise interactions between patches occur under a middle-size context. This suggests that MIM lets the model learn to encode interactions of a certain complexity, for instance, interactions of intermediate complexity. To verify this claim, we resort to the tool of multi-order interactions introduced by deng2021discovering; zhang2020interpreting, and investigate whether MIM makes the model more sensitive to interactions of some particular orders. Specifically, the multi-order interaction $I^{(m)}(i,j)$ measures the level of interaction between variables $i$ and $j$. We define $I^{(m)}(i,j)$ to be the average interaction utility between variables $i$ and $j$ on all contexts consisting of $m$ variables, where $m$ indicates the level of contextual complexity of the interaction. Formally, given an input image $x$ with a set of $n$ variables $N = \{1, \dots, n\}$ (e.g., an image with $n$ pixels), the multi-order interaction $I^{(m)}(i,j)$ is defined as:

$$I^{(m)}(i,j) = \mathbb{E}_{S \subseteq N \setminus \{i,j\},\, |S| = m} \left[ \Delta f(i,j,S) \right], \tag{1}$$

where $\Delta f(i,j,S) = f(S \cup \{i,j\}) - f(S \cup \{i\}) - f(S \cup \{j\}) + f(S)$. Here $f(S)$ indicates the score of the output with variables in $S$ kept unchanged and variables in $N \setminus S$ replaced with the baseline value ancona2019explaining, where the context $S \subseteq N$. To measure the interaction complexity of the neural network, we measure the relative interaction strength $J^{(m)}$ of the encoded $m$-th order interaction as follows:

$$J^{(m)} = \frac{\mathbb{E}_{x \in \Omega} \mathbb{E}_{i,j} \left| I^{(m)}(i,j \mid x) \right|}{\mathbb{E}_{m'} \mathbb{E}_{x \in \Omega} \mathbb{E}_{i,j} \left| I^{(m')}(i,j \mid x) \right|}, \tag{2}$$

where $\Omega$ is the set of all samples and $0 \le m \le n - 2$. $J^{(m)}$ is the average value over all possible pairs of variables of the input samples, normalized by the average value of all interaction strengths. The distribution of $J^{(m)}$ (the area under the curve sums to one) indicates the level of interactions of the network. In this work, we use $J^{(m)}$ as the metric to evaluate and analyze the interaction levels of the network with MIM pre-training.

We conduct experiments on IN-100 and use ViT-S dosovitskiy2020image and ResNet-50 he2016deep as the network architectures. We consider each image patch as an input variable. For the computation of $J^{(m)}$, we adopt the sampling solution following previous works deng2021discovering; zhang2020interpreting. As can be seen from Figure 1, ViT-S with random weight initialization tends to learn simple interactions involving few patches, while MIM pre-trained models show stronger interactions at relatively middle levels. Similarly, MIM pre-trained ResNet-50 enhances the middle-level interactions compared to randomly initialized models. As Figure 1 shows, the MIM pre-training task yields better middle-level interactions than contrastive learning pre-training methods for both Transformers and CNNs. Stronger middle-level interactions form more complex features such as shape and edge, compared to the local texture features learned from low-level interactions naseer2021intriguing.
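To make the multi-order interaction concrete, the following is a minimal Monte Carlo sketch of the sampling-based estimate. It assumes the model is exposed as a function `f` that takes the set of variables kept unchanged (all others being replaced with a baseline) and returns a scalar score; the function names and the toy score functions are hypothetical, and the normalization is shown for a single sample rather than a dataset.

```python
import random


def delta_f(f, i, j, context):
    """f(S + {i,j}) - f(S + {i}) - f(S + {j}) + f(S) for a context S."""
    s = frozenset(context)
    return f(s | {i, j}) - f(s | {i}) - f(s | {j}) + f(s)


def multi_order_interaction(f, n, i, j, m, num_samples=200, seed=0):
    """Estimate I^(m)(i, j): mean of delta_f over random contexts S with
    |S| = m drawn from the n variables excluding i and j."""
    rng = random.Random(seed)
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for _ in range(num_samples):
        total += delta_f(f, i, j, rng.sample(others, m))
    return total / num_samples


def relative_strength(f, n, pairs, orders, num_samples=200):
    """J^(m) for one sample: mean |I^(m)| over pairs, divided by the
    average of that quantity over all orders considered."""
    strength = {
        m: sum(abs(multi_order_interaction(f, n, i, j, m, num_samples))
               for i, j in pairs) / len(pairs)
        for m in orders
    }
    mean = sum(strength.values()) / len(strength)
    return {m: (s / mean if mean else 0.0) for m, s in strength.items()}


# Toy score functions over n = 4 variables (illustrative only).
def additive(kept):
    return len(kept)                      # no interactions at all


def pairwise(kept):
    return 1.0 if {0, 1} <= kept else 0.0  # variables 0 and 1 interact


i_additive = multi_order_interaction(additive, 4, 0, 1, m=1)
i_pairwise = multi_order_interaction(pairwise, 4, 0, 1, m=1)
j_pairwise = relative_strength(pairwise, 4, pairs=[(0, 1)], orders=[0, 1, 2])
```

An additive score yields zero interaction at every order, while the toy pairwise score yields the same interaction at every order, so its relative strength is flat across orders.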

3.3 From Inpainting to MIM on CNN

Figure 3: (a) Fourier transformed feature maps. The vertical axis is the relative log amplitude and the horizontal axis is the normalized depth of the network. The blue columns indicate the pooling layers while the white columns indicate convolution layers. (b) Feature map variance. The vertical axis is the average variance of the feature maps. DeiT (Sup.) is supervised pre-training.

It is worth noting that early CNN-based inpainting work pathak2016context resembles MIM. However, inpainting as a pre-training task on CNNs has attracted little attention because its performance is much inferior to that of contrastive learning methods. We study the similarities and differences between MIM and inpainting on CNNs from the feature map perspective. We first plot the log magnitude of the Fourier transformed feature maps of ResNet-50 with different pre-training methods on IN-1K park2022vision. As shown in Figure 3, inpainting and MIM show a similar low-pass filtering effect at convolution layers compared to contrastive learning. This indicates that inpainting and MIM reduce the noise and uncertainty induced by high-frequency features. However, inpainting as pre-training adopts the masking strategy illustrated in Figure 2(e) and focuses on reconstruction performance instead of patch interactions. We argue that the reconstruction performance of MIM mainly results from low-level or high-level patch interactions, while middle-level interactions lead to informative features deng2021discovering. Also, reconstruction performance is not directly related to the quality of the learned representation. Figure 3 shows the feature variance of each layer of ResNet-50 with different pre-training methods on IN-1K. It indicates that MIM tends to reduce the feature map variance, whereas supervised training, inpainting, and contrastive learning on CNNs tend to increase it. Compared to MIM, which learns better middle-level interactions, the inpainting task fails to filter out low-level interactions and thus leads to higher variance. To conclude, MIM enhances middle-level interactions and reduces feature map uncertainty for generalized and stabilized feature extraction park2022blur.

4 Approach

Considering our insight that the essence of MIM is to learn better middle-level interactions for generalized feature extraction, we propose a generic MIM framework following two design rules: (a) the masking should happen where middle-level interactions occur; (b) the model should be guided to learn targets that are initially neglected during training. An overall illustration of our proposed framework in comparison to the existing MIM framework is given in Figure 4.

Mask Token.

For the masking strategy, we follow the common practice of existing works el2021large; he2021masked; xie2021simmim; wei2021masked, where the input image is divided into non-overlapping patches and a random subset of patches is masked. MAE utilizes a Transformer as the decoder and feeds only the visible patches into the encoder. Mask tokens are appended at the decoder to reconstruct the masked patches. SimMIM xie2021simmim and MaskFeat wei2021masked utilize a fully connected layer as the decoder and feed the mask token into the encoder together with the visible patches. The mask token devlin2018bert is a token-shared learnable parameter that indicates the presence of a missing patch to be predicted. Despite different choices of decoder structures, the mask token is placed either at the input to the encoder or at the decoder. Mathematically, the masking process of MIM is defined as $x_{mask} = x \odot (1 - M) + T \odot M$, where $M$ is the random occlusion mask and $T$ represents the learnable mask token. We argue that by masking the image patches as done in current works, the network picks up low-level interactions. As discussed earlier, we wish to enhance the middle-level interactions of the network, and we thus propose to mask intermediate features of the network, where information is fused to a certain degree. More concretely, our masking operation is defined as $z^{l}_{mask} = z^{l} + T \odot D(M)$, where $z^{l}$ is the intermediate feature map of $x_{mask}$ at layer-$l$ in the Transformer encoder (or at stage-$l$ in CNNs) and $D(\cdot)$ is the corresponding down-sampling function of the occlusion mask. We use the average RGB value to fill the masked patches in the input to the encoder. It is worth noting that existing works directly replace the occluded patches with the mask token in the input space or after the patch embedding dosovitskiy2020image; bao2021beit; xie2021simmim. In contrast, we replace the occluded patches with the mean of the RGB channels of the image and add the mask token onto the intermediate feature maps of the encoder.
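The two masking schemes contrasted above can be sketched on toy data. This is a simplified illustration, not the authors' implementation: features are single-channel grids of floats, the mask token is a scalar, the mask is resized to the feature resolution with nearest-neighbour indexing (one possible choice for the mask resampling step), and all function names are hypothetical.

```python
def replace_with_token(patches, mask, token):
    """Existing MIM: x * (1 - M) + T * M on a sequence of patch values."""
    return [x * (1 - m) + token * m for x, m in zip(patches, mask)]


def fill_with_mean(image, mask, patch):
    """Input-space masking with the image mean instead of a mask token.
    For an RGB image this would be applied per channel."""
    h, w = len(image), len(image[0])
    mean = sum(sum(row) for row in image) / (h * w)
    return [[mean if mask[y // patch][x // patch] else image[y][x]
             for x in range(w)] for y in range(h)]


def resize_mask(mask, out_h, out_w):
    """Nearest-neighbour resampling of the patch mask to a feature map size
    (an assumed stand-in for the mask resampling function)."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)] for y in range(out_h)]


def add_mask_token(feature, mask, token):
    """Feature-space masking: z + T * D(M), added rather than replaced."""
    return [[z + token * m for z, m in zip(z_row, m_row)]
            for z_row, m_row in zip(feature, mask)]


# Toy usage: 4x4 image, 2x2 patches, top-left patch masked.
image = [[float(4 * y + x) for x in range(4)] for y in range(4)]
mask = [[1, 0], [0, 0]]
filled = fill_with_mean(image, mask, patch=2)          # encoder input
feature = [[0.0, 0.0], [0.0, 0.0]]                     # stand-in feature map
masked_feature = add_mask_token(feature, resize_mask(mask, 2, 2), token=0.5)
replaced = replace_with_token([1.0, 2.0], [1, 0], token=9.0)
```

The addition in `add_mask_token` leaves the underlying feature intact and only shifts masked positions, whereas `replace_with_token` discards the original content at masked positions.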

Figure 4: A comparison between the existing MIM framework and our proposed framework. In the existing MIM framework, the input image is patchified into a sequence of non-overlapping patches, with masked patches replaced by learnable mask tokens. The sequence is then input to the Transformer encoder. The spatial loss $\mathcal{L}_{spa}$ is applied between the ground truth patches and the reconstructed patches from the decoder in the spatial domain. Our proposed framework uses the mean RGB value of the image instead of the mask token in the input space. We then add a learnable mask token onto the intermediate feature map at layer-$l$ (or stage-$l$) of the encoder instead of replacing patches in the input space. The encoder can be of either the Transformer or the CNN family. In addition to $\mathcal{L}_{spa}$, we adopt a frequency loss $\mathcal{L}_{freq}$ in the Fourier domain to encourage the encoder to learn more middle-level interactions. Specifically, we apply the DFT to both the ground truth image and the predicted image and then use the mean square error (MSE) to measure the difference.

Prediction Target.

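Currently proposed works el2021large; he2021masked; xie2021simmim adopt raw RGB values as the prediction target. However, raw pixels are heavily redundant and often contain low-level statistics bao2021beit; wei2021masked; zhou2021ibot. Following MAE, MaskFeat wei2021masked proposed to use the Histogram of Oriented Gradients (HOG) as the prediction target, outperforming MAE. HOG is a descriptor that captures shape features based on middle-level interactions. Such high-frequency features resulting from middle-level interactions can be well portrayed in the phase component of the Fourier spectrum. Given an RGB image $x \in \mathbb{R}^{3 \times H \times W}$, the discrete Fourier transform (DFT) of each channel is defined by:

$$F(u,v) = \sum_{h=1}^{H} \sum_{w=1}^{W} x(h,w)\, e^{-2\pi j \left( \frac{uh}{H} + \frac{vw}{W} \right)}. \tag{3}$$

In addition to the common MIM loss in the spatial domain $\mathcal{L}_{spa}$, we propose a frequency loss $\mathcal{L}_{freq}$ to enforce better middle-level interactions in the Fourier domain. $\mathcal{L}_{freq}$ is defined as:

$$\mathcal{L}_{freq} = \sum_{c=1}^{3} \sum_{u=1}^{H} \sum_{v=1}^{W} \omega(u,v) \left\| \mathrm{DFT}\left( x^{pred}_{c} \odot M + \mathrm{det}(x^{pred}_{c}) \odot (1 - M) \right) - \mathrm{DFT}(x_{c}) \right\|, \tag{4}$$

where $x^{pred}$ is the predicted image, $M$ is the random occlusion mask, $\mathrm{det}(\cdot)$ represents the gradient detachment operation, and $\omega(u,v)$ is the frequency weighting matrix as in Focal Frequency Loss jiang2021focal. $\omega(u,v)$ is defined as:

$$\omega(u,v) = \left\| \mathrm{DFT}\left( x^{pred}_{c} \odot M + \mathrm{det}(x^{pred}_{c}) \odot (1 - M) \right) - \mathrm{DFT}(x_{c}) \right\|^{\alpha}, \tag{5}$$

where $\alpha$ is the scaling factor for flexibility ($\alpha = 1$ in our experiments). It is worth noting that for a real-valued 2D signal (one channel of the image), the DFT is conjugate symmetric, which implies that half of the DFT contains the global information. This reduces the prediction target size by half compared to directly predicting the missing information in the spatial domain. Moreover, DFT computation is cheap and introduces negligible overhead when using fast Fourier transform (FFT) algorithms (which only require $O(n \log n)$ complexity) that take advantage of the symmetry and periodicity properties of the Fourier transform. The overall loss function of our proposed framework is then defined as:

$$\mathcal{L} = \mathcal{L}_{spa} + \lambda \mathcal{L}_{freq}, \tag{6}$$

where $\mathcal{L}_{spa}$ is the reconstruction loss in the spatial domain and $\lambda$ is a weight parameter to balance the two losses. We set $\lambda$ to 0.5 by default in this work.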
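The frequency-domain loss can be illustrated with a forward-only sketch in plain Python. Several simplifications are assumptions of this example and not statements about the original implementation: the DFT is the naive quadratic version with zero-based indices (an FFT would be used in practice), the loss is averaged rather than summed, the spatial loss is taken to be an L1 error over masked pixels, and a single channel is used. Because no gradients are computed here, the gradient detachment mentioned in the text has no effect on the values.

```python
import cmath


def dft2(x):
    """Naive 2D DFT of an H x W real-valued grid (zero-based indices)."""
    h, w = len(x), len(x[0])
    return [[sum(x[a][b] * cmath.exp(-2j * cmath.pi * (u * a / h + v * b / w))
                 for a in range(h) for b in range(w))
             for v in range(w)] for u in range(h)]


def frequency_loss(pred, target, alpha=1.0):
    """Mean over frequencies of w(u,v) * |F_pred - F_target|, with the
    focal weight w = |F_pred - F_target| ** alpha treated as a constant."""
    fp, ft = dft2(pred), dft2(target)
    dists = [abs(p - t) for rp, rt in zip(fp, ft) for p, t in zip(rp, rt)]
    return sum((d ** alpha) * d for d in dists) / len(dists)


def spatial_loss(pred, target, mask):
    """Assumed L1 error averaged over masked pixels (mask entries 0/1)."""
    errs = [abs(p - t)
            for rp, rt, rm in zip(pred, target, mask)
            for p, t, m in zip(rp, rt, rm) if m]
    return sum(errs) / len(errs) if errs else 0.0


def total_loss(pred, target, mask, lam=0.5):
    """Spatial loss plus lam times the frequency loss."""
    return spatial_loss(pred, target, mask) + lam * frequency_loss(pred, target)


# Toy usage: one wrong pixel inside the masked region.
target = [[1.0, 2.0], [3.0, 4.0]]
pred = [[2.0, 2.0], [3.0, 4.0]]
mask = [[1, 0], [0, 0]]
loss = total_loss(pred, target, mask)

# Conjugate symmetry of the DFT of a real signal: F(u,v) = conj(F(-u,-v)).
spectrum = dft2([[1.0, 2.0, 0.0], [3.0, 4.0, 5.0]])
```

In the toy case a single-pixel error spreads evenly over all frequencies, which is why a frequency-domain term couples every output location to the error at a masked position.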

5 Experiments

5.1 Pre-training Setup.

We adopt ResNet-50 he2016deep and Vision Transformer dosovitskiy2020image (ViT-S/16 and ViT-B/16) as the backbones. We pre-train on the ImageNet-1K (IN-1K) training set with the AdamW iclr2019AdamW optimizer, a base learning rate adjusted by a cosine learning rate scheduler, and a batch size of 2048. The input image size is [...] with a patch size of [...]. We use a random masking ratio of 60%. By default, the learnable mask tokens are placed at stage-3 in ResNet-50 and at layer-5/layer-8 in ViT-S/ViT-B, respectively. We adopt a linear prediction head as the decoder. Our experiments are implemented in PyTorch and conducted on a workstation with NVIDIA V100 GPUs. We report the average results of 3 trials for all experiments, and use bold and underline to indicate the best and the second-best performance. See Appendix A for detailed pre-training settings.
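For readers unfamiliar with the cosine learning rate scheduler mentioned above, a minimal sketch follows. The base learning rate and step counts in the usage lines are placeholder values chosen for illustration; the actual pre-training values are in the paper's appendix and are not reproduced here.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values, for illustration only.
start = cosine_lr(0, 100, base_lr=1.0)
middle = cosine_lr(50, 100, base_lr=1.0)
end = cosine_lr(100, 100, base_lr=1.0)
```

Schedules of this kind are often combined with a short linear warm-up, which is omitted here because the text does not describe one.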

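| Method | Fast pre-training: Epochs | Fast pre-training: Lin. | Fast pre-training: FT (A3) | Longer pre-training: Epochs | Longer pre-training: FT (A3) | Longer pre-training: FT (A2) |
|---|---|---|---|---|---|---|
| PyTorch (Sup.) | 90 | 76.6 | 78.8 | 300 | 78.9 | 79.8 |
| Inpainting | 70 | 40.1 | 78.4 | - | - | - |
| Relative-Loc | 70 | 38.8 | 77.8 | - | - | - |
| Rotation | 70 | 48.1 | 77.7 | - | - | - |
| SimCLR | 100 | 64.4 | 78.5 | 800 | 78.8 | 79.9 |
| MoCoV2 | 100 | 66.8 | 78.5 | 800 | 78.8 | 79.8 |
| BYOL | 100 | 68.4 | 78.7 | 400 | 78.9 | 80.0 |
| SwAV† | 100 | 71.9 | 78.9 | 400 | 79.0 | 80.2 |
| Barlow Twins | 100 | 67.2 | 78.5 | 300 | 78.8 | 79.9 |
| MoCoV3 | 100 | 68.9 | 78.7 | 300 | 79.0 | 80.1 |
| SimMIM‡ | 100 | 47.5 | 78.2 | 300 | 78.1 | 79.7 |
| CIM | - | - | - | 300 | 78.6 | 80.4 |
| A^2MIM | 100 | 48.1 | 78.8 | 300 | 79.0 | 80.5 |

Table 1: ImageNet-1K linear probing (Lin.) and fine-tuning (FT) top-1 accuracy (%) of ResNet-50. †Multi-crop augmentation. ‡Our modified version for CNN.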
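
| Method | ViT-S: Epochs | ViT-S: FT | ViT-B: Epochs | ViT-B: FT |
|---|---|---|---|---|
| DeiT (Sup.) | 300 | 79.9 | 300 | 81.8 |
| DINO | 300 | 80.7 | 300 | 82.8 |
| MoCoV3 | 300 | 81.4 | 300 | 83.2 |
| BEiT | - | - | 800 | 83.2 |
| iBOT | 800 | 82.3 | 400 | 84.0 |
| PeCo | - | - | 800 | 84.5 |
| MAE | - | - | 1600 | 83.6 |
| MaskFeat | - | - | 800 | 84.0 |
| SimMIM | 300 | 81.7 | 800 | 83.8 |
| CAE | 300 | 81.8 | 800 | 83.6 |
| CIM | 300 | 81.6 | 300 | 83.3 |
| A^2MIM | 300 | 82.2 | 800 | 84.2 |

Table 2: ImageNet-1K fine-tuning (FT) top-1 accuracy (%) of ViT-S and ViT-B models.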

5.2 Image Classification on ImageNet-1K

Evaluation Protocols.

We first evaluate the learned representation with the end-to-end fine-tuning (FT) and linear probing (Lin.) protocols on IN-1K. For evaluation on CNNs, we adopt the RSB A2/A3 wightman2021rsb training settings for fine-tuning ResNet-50, which employ the LAMB iclr2020lamb optimizer with a cosine scheduler for 300/100 epochs. For the linear probing setting on ResNet-50, we freeze the backbone features and train a linear classifier with an initial learning rate of 30 and a batch size of 256, following MoCo cvpr2020moco. For evaluation on Transformers, we employ the same fine-tuning as MAE he2021masked, which uses the DeiT touvron2021training augmentation setting and an AdamW optimizer for 100-epoch training, and adopts a layer-wise learning rate decay of 0.65 following bao2021beit. See Appendix A for detailed evaluation configurations.
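The layer-wise learning rate decay used for Transformer fine-tuning can be sketched as follows. The indexing convention (index 0 for the patch embedding, 1 to `num_layers` for the Transformer blocks, and the last index for the classification head) is an assumption modelled on common fine-tuning recipes, and the base learning rate in the usage line is a placeholder.

```python
def layerwise_lrs(base_lr, num_layers, decay=0.65):
    """Per-layer learning rates: the head keeps base_lr and each earlier
    layer is scaled down by a further factor of `decay`."""
    return [base_lr * decay ** (num_layers + 1 - i)
            for i in range(num_layers + 2)]


# Placeholder base learning rate with a 12-block encoder.
lrs = layerwise_lrs(base_lr=1.0, num_layers=12)
```

With a decay of 0.65, early layers are updated far more gently than the head, which helps preserve the pre-trained low-level features during fine-tuning.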


We compare the proposed A^2MIM with classical self-supervised learning methods (Inpainting pathak2016context, Relative-Loc iccv2015relativeloc, and Rotation iclr2018rotation), contrastive learning methods, and MIM methods with various pre-training epochs. As shown in Table 1, our approach achieves competitive performance with state-of-the-art contrastive-based methods under 100-epoch RSB A3 fine-tuning. Note that MIM methods see fewer training samples per epoch than contrastive learning methods (40% vs. 200% of patches) and usually require more pre-training epochs. Under the longer fine-tuning evaluation using RSB A2, our method (300-epoch) outperforms contrastive-based methods with even fewer pre-training epochs. Meanwhile, our approach also improves on the baseline SimMIM xie2021simmim (+0.8%) and CIM fang2022corrupted (+0.4%) in terms of RSB A3 fine-tuning for the longer pre-training. We also report the linear probing accuracy in the fast pre-training setting for reference, although our main focus is to learn representations with better fine-tuning performance. Although the linear probing performance of our method is lower than that of contrastive-based methods, it still improves on the baseline by 0.6%.


We then compare A^2MIM with recently proposed contrastive-based methods and MIM methods based on ViT-S and ViT-B in Table 2. Our approach outperforms current state-of-the-art methods, e.g., iBOT zhou2021ibot (MIM combined with contrastive learning) and MaskFeat wei2021masked, and improves on the baseline SimMIM by 0.5% and 0.4% based on ViT-S and ViT-B, respectively. Although the performance of our method is slightly lower than that of PeCo dong2021peco (which requires an online tokenizer) based on ViT-B, our method is applicable to both CNN and Transformer architectures.

| Method | Epochs | COCO AP^box | COCO AP^mask | ADE-20K mIoU |
|---|---|---|---|---|
| PyTorch (Sup.) | 120 | 38.2 | 33.3 | 36.1 |
| SimCLR | 800 | 37.9 | 33.3 | 37.6 |
| MoCoV2 | 400 | 39.2 | 34.3 | 37.5 |
| BYOL | 400 | 38.9 | 34.2 | 37.2 |
| SwAV | 800 | 38.4 | 33.8 | 37.3 |
| SimSiam | 400 | 39.2 | 34.4 | 37.2 |
| Barlow Twins | 800 | 39.2 | 34.3 | 37.3 |
| CIM | 300 | - | - | 38.0 |
| A^2MIM | 300 | 39.8 | 34.9 | 38.3 |

Table 3: Performance of object detection and segmentation tasks based on ResNet-50 on COCO and ADE-20K.

| Method | Epochs | COCO AP^box | COCO AP^mask | ADE-20K mIoU |
|---|---|---|---|---|
| DeiT (Sup.) | 300 | 47.9 | 42.9 | 47.0 |
| MoCoV3 | 300 | 47.9 | 42.7 | 47.3 |
| DINO | 400 | 46.8 | 41.5 | 47.2 |
| BEiT | 300 | 43.1 | 38.2 | 47.1 |
| PeCo | 300 | 43.9 | 39.8 | 46.7 |
| MAE | 1600 | 48.5 | 42.7 | 48.1 |
| SimMIM | 800 | 48.9 | 43.0 | 48.4 |
| CAE | 800 | 49.2 | 43.3 | 48.8 |
| A^2MIM | 800 | 49.3 | 43.5 | 49.0 |

Table 4: Performance of object detection and segmentation tasks based on ViT-B on COCO and ADE-20K.

5.3 Transfer Learning Experiments

Object detection and segmentation on COCO.

To verify the transfer abilities of self-supervised methods, we benchmark contrastive learning and MIM methods on object detection and segmentation with COCO eccv2014MSCOCO. For evaluation on CNNs, we follow the setup in MoCo cvpr2020moco, which fine-tunes Mask R-CNN 2017iccvmaskrcnn with a ResNet-50-C4 backbone using the 2× schedule on COCO train2017 and evaluates on COCO val2017. Results in Table 3 indicate that our approach (300-epoch) significantly outperforms contrastive-based methods with longer pre-training (+0.7% AP^box and +0.6% AP^mask). For evaluation on Transformers, we follow MAE he2021masked, which efficiently fine-tunes Mask R-CNN with a ViT-B backbone using the 1× schedule. In Table 4, our approach (800-epoch) is superior to popular contrastive-based and MIM methods, e.g., it improves on MAE (1600-epoch) by 0.8% AP^box and 0.8% AP^mask. The significant performance gain in downstream tasks might be related to capturing more global information during pre-training (see Figure 5).

Semantic segmentation on ADE20K.

We experiment on ADE-20K ijcv2019ADE20K using UperNet eccv2018upernet, following MAE he2021masked, and fine-tune the model for 100 epochs with a batch size of 16. Based on ResNet-50, results in Table 3 show that our method outperforms contrastive learning methods by at least 0.9% mIoU and improves CIM (which requires an extra pre-trained BEiT bao2021beit) by 0.3% mIoU. Based on ViT-B, Table 4 shows that our approach consistently improves MIM methods (e.g., it improves MAE and SimMIM by 0.9% and 0.6% mIoU). These observations are consistent with those on COCO.

Figure 5: Visualizations of predicted results from SimMIM (middle) and our A^2MIM (right) based on ViT-S pre-trained for 300 epochs on IN-1K. The mask token is added at the optimal layer (layer-5) in ViT-S. We ablate the proposed components by adding them to the baseline. Compared to results from SimMIM, reconstruction with the RGB mean mask relieves grid-like artifacts; adding the mask token further improves the smoothness; using the proposed frequency-domain loss helps the model capture more informative details and contours.

5.4 Ablation Study

To investigate the effectiveness of the proposed components, we conduct ablation studies on ResNet-50 and ViT-S on IN-100 using the fine-tuning protocol. Based on the modified baseline SimMIM, we first compare different mask token mechanisms: Replacing denotes the original approach in most MIM methods, and Addition denotes our proposed approach, which adds the mask token to intermediate feature maps of the backbone. As shown in Figure 6, adding the mask token to the middle stages (stage-3) or layers (layer-5) yields the best performance. Notice that replacing masked patches in input images with the RGB channel mean slightly improves the baseline SimMIM, especially for ResNet-50 (88.14 vs. 87.75). Then, we verify the proposed frequency-domain loss in Table 5. We find that simply using this loss without the adaptive re-weighting (Eqn. 5) brings some improvement by acting as a frequency constraint, while employing the re-weighting further enhances the performance by helping the model learn more informative frequency components. Additionally, we visualize reconstruction results in Figure 5 to demonstrate the improvements brought by our proposed components (see more visualization results in Appendix B).
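The two masking mechanisms compared above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the use of the per-image channel mean as the "RGB mean", and the nearest-neighbour resizing of the patch mask to the feature resolution are all hypothetical choices made for the example.

```python
import numpy as np


def fill_masked_patches_with_mean(image, patch_mask, patch_size):
    """Replace masked input patches with the RGB channel mean.

    image: float array of shape (H, W, 3).
    patch_mask: bool array of shape (H // patch_size, W // patch_size), True = masked.
    Assumption: the mean is taken per image; a dataset-level mean would also fit the text.
    """
    out = image.copy()
    rgb_mean = image.reshape(-1, image.shape[-1]).mean(axis=0)
    # Expand the patch-level mask to pixel resolution.
    pixel_mask = np.repeat(np.repeat(patch_mask, patch_size, axis=0), patch_size, axis=1)
    out[pixel_mask] = rgb_mean
    return out


def add_mask_token(features, patch_mask, mask_token):
    """Add a (learnable) mask token to an intermediate feature map at masked positions.

    features: float array of shape (C, h, w) from some intermediate stage or layer.
    mask_token: float array of shape (C,).
    Assumption: h and w are integer multiples of the patch-mask grid size.
    """
    _, h, w = features.shape
    ry, rx = h // patch_mask.shape[0], w // patch_mask.shape[1]
    feature_mask = np.repeat(np.repeat(patch_mask, ry, axis=0), rx, axis=1)
    # Unmasked positions are left untouched; masked positions receive the token.
    return features + mask_token[:, None, None] * feature_mask[None, :, :]
```

In a real model the token would be a trainable parameter and the addition would happen inside the backbone's forward pass; the sketch only shows where the mask enters in each variant.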

ResNet-50 ViT-S
88.19 85.17
88.47 86.05
88.73 86.41
88.86 86.62
Table 5: Ablation of the proposed frequency-domain loss on IN-100, including a variant that removes the re-weighting term in the loss and a variant that adds the mask token to the optimal layer.
Figure 6: Ablation of the mask token in various stages (S) or layers (L) based on SimMIM (without the proposed loss) on IN-100.

6 Conclusion

In this paper, we delved deep into MIM and answered the question of what exactly is learned during MIM pre-training. We adopted multi-order interactions to study the interaction level among image patches. We discovered that MIM essentially teaches the network to learn middle-level interactions among image patches for more complex feature extraction, regardless of the network architecture. Based on our findings, we further proposed a general framework, A^2MIM, that is compatible with both Transformers and CNNs for MIM tasks and aims at enhancing patch interactions during self-supervised pre-training. Besides a different mask token mechanism, we proposed a loss in the Fourier domain to better learn the middle-level interactions. Experimental results have shown that our proposed framework improves the representations learned for both CNNs and Transformers, yielding performance superior to state-of-the-art methods on various downstream tasks.


Appendix A Details of Comparison Experiments

This section provides experimental details for Sec. 5, e.g., pre-training and evaluation on ImageNet-1K and transfer learning settings on downstream tasks.

A.1 ImageNet-1K Experiments


The default settings of A^2MIM for ResNet-50 and ViTs are provided in Table A1, following SimMIM xie2021simmim. We use the AdamW iclr2019AdamW optimizer with the cosine scheduler and the linear learning rate scaling rule 2017msgd: lr = base_lr × batchsize / 256. Similar to current MIM methods, we only use RandomResizedCrop for augmentation and do not require other complex augmentations (e.g., Rand Augment cubuk2020randaugment, mixups yun2019cutmix, or stochastic depth) during pre-training.
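As a worked example of the scaling rule, the sketch below computes the effective learning rate for the ResNet-50 setting in Table A1 (base learning rate 1.5e-4, batch size 2048) and pairs it with a cosine schedule. The function names are hypothetical, and the linear shape of the warmup is an assumption: Table A1 gives only the number of warmup epochs.

```python
import math


def scaled_lr(base_lr, batch_size):
    """Linear scaling rule: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256


def lr_at_epoch(epoch, total_epochs, peak_lr, warmup_epochs=10, min_lr=0.0):
    """Cosine decay with warmup (assumed linear here), evaluated per epoch."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


peak = scaled_lr(1.5e-4, 2048)  # 1.5e-4 * 2048 / 256 = 1.2e-3
```

With these settings the peak learning rate is 8 times the base value, since 2048 / 256 = 8.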

End-to-end fine-tuning.

Our fine-tuning follows common practices of supervised image classification on ImageNet-1K. As shown in Table A2, we fine-tune pre-trained ViTs for 100 epochs using the DeiT touvron2021training training recipe, which employs the AdamW iclr2019AdamW optimizer with the cross-entropy (CE) loss; we fine-tune pre-trained ResNet-50 for 100/300 epochs using the RSB A3/A2 wightman2021rsb settings, which employ the LAMB iclr2020lamb optimizer with the binary cross-entropy (BCE) loss. Additionally, we use layer-wise learning rate decay as in bao2021beit for fine-tuning ViT models.
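Layer-wise learning rate decay assigns smaller learning rates to layers closer to the input. A minimal sketch of one common formulation is given below; the function name, the depth indexing (0 for the embedding, num_layers + 1 for the head), and the decay value used in the example are assumptions for illustration and are not taken from the paper.

```python
def layerwise_lrs(base_lr, num_layers, decay):
    """Return one learning rate per depth index.

    Index 0 is the patch embedding, indices 1..num_layers are the Transformer
    blocks, and index num_layers + 1 is the task head, which keeps base_lr.
    Each step towards the input multiplies the rate by `decay`.
    """
    last = num_layers + 1
    return [base_lr * decay ** (last - depth) for depth in range(last + 1)]


# Hypothetical values: a 12-block ViT with a decay factor of 0.65.
lrs = layerwise_lrs(base_lr=1e-3, num_layers=12, decay=0.65)
```

Each entry of the returned list would typically be attached to the parameter group of the corresponding block when building the optimizer.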

Configuration ResNet-50 ViTs
Pre-training resolution 224 224
Mask patch size 32 32
Optimizer AdamW AdamW
Base learning rate 1.5e-4 1e-4
Weight decay 0.05 0.05
Optimizer momentum
Batch size 2048 2048
Learning rate schedule cosine decay cosine decay
Warmup epochs 10 10
Rand Augment
Stochastic Depth
Gradient Clipping 5
Table A1: ImageNet-1K A^2MIM pre-training settings for ResNet-50 and ViT models.
Configuration ViTs (DeiT) ResNet-50 (RSB A2) ResNet-50 (RSB A3)
FT epochs 100 300 100
Training resolution 224 224 160
Testing resolution 224 224 224
Testing crop ratio 0.875 0.95 0.95
Optimizer AdamW LAMB LAMB
Base learning rate
Weight decay 0.05 0.02 0.02
Batch size 1024 2048 2048
Learning rate schedule cosine decay cosine decay cosine decay
Warmup epochs 5 5 5
Label smoothing 0.1
Stochastic depth 0.1 0.05
Gradient clipping 5.0
Rand Augment (9, 0.5) (7, 0.5) (6, 0.5)
Mixup alpha 0.8 0.1 0.1
CutMix alpha 1.0 1.0 1.0
Loss function CE loss BCE loss BCE loss
Table A2: ImageNet-1K fine-tuning recipes for ResNet-50 (RSB A2/A3) and ViTs (DeiT).

A.2 Object Detection and Segmentation on COCO

We adopt the Mask R-CNN 2017iccvmaskrcnn framework to perform transfer learning to object detection and segmentation on COCO eccv2014MSCOCO in Detectron2. For evaluation on ResNet-50, we follow MoCo cvpr2020moco and fine-tune Mask R-CNN with the pre-trained ResNet-50-C4 backbone using the 2× schedule (24 epochs). For evaluation of ViTs, we follow MAE he2021masked, which employs the pre-trained ViT backbone and an FPN neck cvpr2017fpn in Mask R-CNN, and fine-tune the model using the 1× schedule (12 epochs). For a fair comparison, we follow bao2021beit; xie2021simmim and turn on relative position bias in ViT dosovitskiy2020image during both pre-training and transfer learning, initialized as zero.

A.3 Semantic Segmentation on ADE-20K

We adopt UperNet eccv2018upernet to perform transfer learning to semantic segmentation on ADE-20K and use the semantic segmentation implementation in MMSegmentation. We initialize UperNet with the backbones (ResNet-50 or ViTs) pre-trained on ImageNet-1K and fine-tune end-to-end for 100 epochs with a batch size of 16. We search for the optimal learning rate for all competitors. Similar to the fine-tuning settings on COCO, we use relative position bias in ViT dosovitskiy2020image during both pre-training and transfer learning, as in bao2021beit; xie2021simmim.

Appendix B Empirical Experiments

B.1 Occlusion Robustness

In Sec. 3.1, we analyze the robustness against occlusion of fine-tuned models on ImageNet-100 (a subset of ImageNet-1K divided by eccv2020CMC) using the official implementation provided by naseer2021intriguing. Both MIM and contrastive-based methods are pre-trained for 400 epochs on ImageNet-100 using their pre-training settings on ImageNet-1K. We adopt the DeiT fine-tuning recipe in Table A2 and use the same setting (100-epoch) for both ViT-S and ResNet-50. Note that we use the modified SimMIM for ResNet-50 (replacing masked patches in the input image with the RGB mean) in all experiments. As shown in Figure 1, we compare MIM pre-trained models with supervised methods using various augmentations and with contrastive learning pre-trained methods in terms of top-1 accuracy under various occlusion ratios. Note that the occlusion ratio is the ratio of dropped patches to total patches, and we plot the mean accuracy across 3 runs. We conclude that MIM pre-trained models are more robust against occlusion than supervised and contrastive-based methods.
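The definition of the occlusion ratio can be made concrete with a small sketch that randomly selects which patches to drop. This is not the official implementation from naseer2021intriguing; the function names and the 14×14 patch grid in the example are hypothetical and serve only to illustrate the ratio of dropped to total patches.

```python
import random


def sample_dropped_patches(grid_h, grid_w, occlusion_ratio, seed=0):
    """Return a grid_h x grid_w grid of booleans; True marks a dropped patch."""
    if not 0.0 <= occlusion_ratio <= 1.0:
        raise ValueError("occlusion_ratio must lie in [0, 1]")
    num_patches = grid_h * grid_w
    num_dropped = round(occlusion_ratio * num_patches)
    rng = random.Random(seed)
    dropped = set(rng.sample(range(num_patches), num_dropped))
    return [[(r * grid_w + c) in dropped for c in range(grid_w)] for r in range(grid_h)]


def measured_occlusion_ratio(mask):
    """Ratio of dropped patches to total patches in a mask grid."""
    total = sum(len(row) for row in mask)
    return sum(cell for row in mask for cell in row) / total


# Hypothetical example: drop half of the patches on a 14 x 14 grid.
mask = sample_dropped_patches(14, 14, occlusion_ratio=0.5)
```

Accuracy under occlusion would then be measured by zeroing out the pixels of the dropped patches before feeding the image to the fine-tuned model, repeated over several seeds.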

B.2 Multi-order Interaction

In Sec. 3.2, we interpret what is learned by MIM through multi-order interactions deng2021discovering; zhang2020interpreting. The interaction complexity can be represented by the multi-order interaction (defined in Eqn. 1), which measures the average interaction utility between two variables over all contexts consisting of a given number of variables. Notice that this number, the order, reflects the contextual complexity of the interaction. For example, a low-order interaction means a relatively simple collaboration between the two variables, while a high-order interaction corresponds to a complex collaboration. As shown in the representation bottleneck deng2021discovering, deep neural networks (DNNs) are more likely to encode both low-order and high-order interactions, but often fail to learn middle-order interactions. We hypothesize that MIM helps models learn more middle-level interactions since MIM has a natural advantage in cases where some parts of the image are masked out. In Figure 1, we calculate the interaction strength (defined in Eqn. 2) for fine-tuned models on ImageNet-100 using the official implementation provided by deng2021discovering. Specifically, we calculate the interaction strength on a grid of image patches, and the model output for a masked sample is derived from the probability of classifying the masked sample to its ground-truth category. In addition to Sec. 3.2, we further provide occlusion robustness results and interaction strength for ResNet-50 on ImageNet-1K in Figure A1. These observations are consistent with those in Sec. 3.
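To make the notion of an order-m interaction more tangible, the sketch below estimates it by Monte Carlo sampling. It assumes the formulation commonly used in deng2021discovering, where the interaction between variables i and j is the average of f(S ∪ {i, j}) − f(S ∪ {i}) − f(S ∪ {j}) + f(S) over contexts S of size m that exclude i and j; Eqn. 1 of the paper should be consulted for the exact definition. The function name and the toy scoring functions are hypothetical.

```python
import random


def estimate_interaction(f, n, i, j, m, num_samples=200, seed=0):
    """Monte Carlo estimate of the order-m interaction between variables i and j.

    f: callable mapping a frozenset of kept variable indices to a scalar output.
    n: total number of variables (e.g. image patches).
    m: context size, i.e. the order of the interaction.
    """
    others = [k for k in range(n) if k not in (i, j)]
    if not 0 <= m <= len(others):
        raise ValueError("m must lie in [0, n - 2]")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        context = frozenset(rng.sample(others, m))
        # Marginal utility of the pair beyond the sum of its individual effects.
        total += f(context | {i, j}) - f(context | {i}) - f(context | {j}) + f(context)
    return total / num_samples


# Toy scoring functions standing in for a model output on a masked sample.
def additive(kept):
    return float(len(kept))  # no collaboration between variables


def pairwise(kept):
    return 1.0 if {0, 1} <= kept else 0.0  # only the pair (0, 1) collaborates
```

For the additive toy function every estimate is zero, whereas the pairwise function yields an interaction of one between variables 0 and 1 at any order, which matches the intuition that the measure isolates collaboration between the pair.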

B.3 Analysis of Feature Maps

In Sec. 3.3, we perform Fourier and variance analysis of feature maps in pre-trained ResNet-50 on ImageNet-1K. Following park2022vision, we first convert feature maps into the frequency domain and represent them on the normalized frequency domain (where the highest-frequency components lie at the boundaries). In Figure 3, we report the amplitude ratio of high-frequency components and find that inpainting and MIM reduce high-frequency components at convolution layers compared to supervised and contrastive learning. Then, following park2022vision; park2022blur, we provide the standard deviation of feature maps by block depth in Figure 3, which shows that MIM tends to reduce feature map variances compared to other pre-training methods. Notice that we also plot results of the randomly initialized network in Figure 3 for reference. Therefore, we conclude that MIM learns features with less uncertainty than supervised and contrastive learning methods.
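The two statistics discussed above can be approximated for a single feature-map channel as follows. This is a simplified sketch rather than the procedure of park2022vision: the function names, the cutoff of 0.5 (in units of π), and the use of a plain amplitude share instead of the paper's exact amplitude measure are assumptions made for illustration.

```python
import numpy as np


def high_frequency_ratio(feature_map, cutoff=0.5):
    """Share of Fourier amplitude at normalized frequencies >= cutoff (units of pi)."""
    h, w = feature_map.shape
    amplitude = np.abs(np.fft.fft2(feature_map))
    # fftfreq returns cycles per sample in [-0.5, 0.5); scaling by 2 maps Nyquist to 1.0.
    fy = np.abs(np.fft.fftfreq(h)) * 2.0
    fx = np.abs(np.fft.fftfreq(w)) * 2.0
    high = np.maximum(fy[:, None], fx[None, :]) >= cutoff
    return float(amplitude[high].sum() / (amplitude.sum() + 1e-12))


def feature_std(feature_map):
    """Standard deviation of a feature map, used as a simple variance measure."""
    return float(np.std(feature_map))


# Two extreme toy inputs: a flat map and a checkerboard oscillating at the Nyquist rate.
flat = np.ones((8, 8))
checker = np.indices((8, 8)).sum(axis=0) % 2 * 2.0 - 1.0
```

The flat map carries all of its amplitude at zero frequency, while the checkerboard concentrates it at the highest representable frequency, so the two bracket the range of values the ratio can take.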

Appendix C Visualization Experimental Details

In addition to Sec. 5.4, we provide more visualization results of A^2MIM. Similar to Figure 5, we ablate the proposed components in A^2MIM based on ResNet-50 in Figure A2, which demonstrates that A^2MIM helps ResNet-50 learn more spatial details, i.e., more middle-level interactions. Moreover, we study the effects of the mask token in both ViTs and CNNs in Figure A3.

Figure A1: (a)(b): Robustness against different occlusion ratios of images is studied for ResNet-50 under various experimental settings on ImageNet-1K. (c)(d): Distributions of the interaction strength are explored for ResNet-50 under various experimental settings. The label indicates the pre-training method and the fine-tuning setting used; random stands for random weight initialization.
Figure A2: Visualizations of predicted results from SimMIM (middle) and our A^2MIM (right) based on ResNet-50 pre-trained for 100 epochs on ImageNet-1K. The mask token is added at the optimal stage in ResNet-50. We ablate the proposed components by adding them to the baseline SimMIM: replacing the zero mask with the RGB mean mask and adding the mask token relieve grid-like artifacts in predicted results; adding the proposed frequency-domain loss helps the model capture more informative details.
Figure A3: Visualizations of predicted results with and without the mask token on ImageNet-1K. Notice that mask tokens are adopted in the pre-trained models based on ViT-S (300-epoch) or ResNet-50 (100-epoch). Based on ViT-S, removing the mask token corrupts both the contents of masked patches and the overall colors in SimMIM, while only corrupting the masked contents in A^2MIM. Based on ResNet-50, removing the mask token slightly affects spatial details in the masked patches and causes grid-like artifacts in the unmasked patches. The different effects of the mask token in ViT-S and ResNet-50 might arise because the two architectures use different spatial-mixing operators and normalization layers. As for ViTs, the self-attention operation captures informative details from unmasked patches, but the non-overlapping patch embedding and layer normalization make each patch isolated. The mask token learns the mean templates (contents) of masked patches and gathers spatial details from unmasked patches through the self-attention operation. As for CNNs, each patch shares the same contents from the batch normalization, and the convolution operation extracts features from unmasked and masked patches equally. The mask token learns more high-frequency and informative details.