and natural language processing (NLP) vaswani2017attention. However, large amounts of high-quality annotations are not always available in real-world applications. Learning representations without supervision by leveraging pre-text tasks has become increasingly popular. In CV, early self-supervised learning approaches eccv2016coloring; iccv2015relativeloc; iclr2018rotation
aim to capture invariant features by predicting transformations applied to the same image. However, these methods rely on vision-specific ad-hoc heuristics, and the representations thus learned are less generic for downstream tasks. Recently, contrastive learning-based approaches to self-supervised learning have made significant progress, even outperforming supervised methods on several downstream tasks. Despite their different contrasting mechanisms, contrastive-based methods learn generic representations by minimizing the distance between two augmented views of the same image in the embedding space. More recently, inspired by masked autoencoding in NLP such as GPT radford2018improving and BERT devlin2018bert, Masked Image Modeling (MIM) methods he2021masked; wei2021masked; xie2021simmim have brought new advances in self-supervised pre-training for CV tasks. The transition from human language understanding to NLP masked autoencoding is quite natural because filling in the missing words of a sentence requires relatively comprehensive semantic understanding. By analogy, humans can understand and imagine masked content by visually filling in the missing structures of an image containing occluded parts. The underlying idea of MIM is thus to randomly mask out a proportion of the image and then recover the masked patches. In contrast to contrastive learning, which yields a clustering effect during pre-training by pulling similar samples together and pushing dissimilar samples apart, MIM pre-training has not been extensively explored in terms of what knowledge is learned or how this knowledge is acquired. Moreover, the success of existing MIM methods is largely confined to Vision Transformer (ViT) structures dosovitskiy2020image. Current MIM frameworks are not generic in terms of network architectures because it is not straightforward to directly apply the mask token devlin2018bert and positional embedding to convolutional neural networks (CNNs).
In this work, we carry out systematic experiments and show that MIM as a pre-training task essentially teaches the model to better learn middle-level interactions between patches for more generalized feature extraction, regardless of the underlying network structure. Compared to the local texture features learned through low-level interactions between patches, more complex features such as shape and edge can be extracted via middle-level interactions among patches. The interaction of patches can be considered as information fusion via both the convolution operation of a CNN and the self-attention mechanism of a Transformer. That is to say, CNNs and Transformers should both benefit from better middle-level interactions with MIM as the pre-text task. To bridge the gap of MIM in terms of network architectures, and based on our extensive experimental analysis, we propose an Architecture-Agnostic Masked Image Modeling framework (AMIM) that focuses on enhancing the middle-level interaction capabilities of the network. Specifically, we mask the input image with the mean RGB value and place the mask token at intermediate feature maps of the network. In addition, we propose a loss in the Fourier domain to further enhance the middle-level interaction capability of the network. Our contributions are summarized as follows:
We conduct systematic experiments and show that the essence of MIM is to better learn middle-level interactions between patches rather than to improve reconstruction quality.
We propose a novel MIM-based framework, dubbed AMIM, that bridges the gap between CNNs and Transformers. To the best of our knowledge, we are the first to carry out MIM on CNNs in a way that outperforms contrastive learning counterparts.
Extensive experiments with both Transformers and CNNs on ImageNet-1K and public benchmarks for various downstream tasks show that our method significantly improves the quality of pre-trained representations over state-of-the-art methods.
2 Related Work
Learning representations without supervision by leveraging pre-text tasks has become increasingly popular. This section briefly discusses two types of popular self-supervised vision pre-training approaches: contrastive learning and autoregressive modeling.
2.1 Contrastive Learning
Contrastive learning learns instance-level discriminative representations by extracting invariant features over distorted views of the same data. MoCo cvpr2020moco adopted a large momentum memory bank to introduce enough negative samples, while SimCLR chen2020simple replaced the vanilla memory bank cvpr2018npid with a larger batch size. BYOL nips2020byol and its variants chen2020simsiam; chen2021empirical further eliminate the requirement of negative samples, using various techniques to avoid representation collapse. Besides pairwise contrasting, SwAV caron2020unsupervised clusters the data while enforcing consistency between multiple augmented views of the same image. Barlow Twins zbontar2021barlow proposed to measure the cross-correlation matrix of distorted views of the same image to avoid representation collapse. Meanwhile, some efforts have been made on top of contrastive methods to improve pre-training quality for specific downstream tasks iccv2021detco; xiao2021region; cvpr2021casting; wu2021align. With the recent emergence of Transformers as an alternative to CNNs for CV tasks, MoCo v3 chen2021empirical and DINO iccv2021dino adopted ViT dosovitskiy2020image in self-supervised pre-training in place of CNN backbones.
2.2 Autoregressive Modeling
Autoencoders are a typical type of neural network architecture that allows representation learning with no annotation requirement hinton1993autoencoders. A standard autoencoder contains an encoder that maps the input to a representation and a decoder that outputs a reconstructed version of the input using the learned representation. By forcing a denoising property onto the learned representations, denoising autoencoders vincent2008extracting; vincent2010stacked form a family of autoencoders that reconstruct the uncorrupted input signal from a corrupted version of the signal. In addition to directly reconstructing the corrupted image, CIM fang2022corrupted also proposed to predict whether each visual token is replaced by a generator sample or not. Generalizing the notion of denoising autoregressive modeling, masked prediction has attracted the attention of both the NLP and vision communities. BERT devlin2018bert performs masked language modeling, where the task is to predict the randomly masked input tokens. Representations learned by BERT during pre-training generalize well to various downstream tasks. For CV, the inpainting task pathak2016context, which predicts large missing regions using convolutional networks, was proposed to learn representations without supervision. Recovering the original color of images with removed color channels was also proposed to learn generic representations eccv2016coloring. With the introduction of the Vision Transformer (ViT), iGPT chen2020generative predicts succeeding pixels given a sequence of pixels as input. MAE he2021masked and BEiT bao2021beit mask out random patches of the input image and reconstruct the missing patches with ViT. Compared to MAE, MaskFeat wei2021masked and SimMIM xie2021simmim adopt linear layers as the prediction head instead of another Transformer. MaskFeat proposed to use HOG as the prediction target instead of the RGB values used in MAE. Some methods el2021large; zhou2021ibot; assran2022masked combine the idea of contrastive learning with MIM.
SplitMask el2021large proposed to use half of the image pixels to predict the other half while applying InfoNCE loss van2018representation across the corresponding latent representations. MSN assran2022masked matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. Similarly, iBOT zhou2021ibot adopts the Siamese framework to combine self-distillation with MIM. Moreover, data2vec baevski2022data2vec proposed a framework that applies the masked prediction idea for either speech, NLP, or CV. In this work, we focus on MIM itself, and thus, the combination of MIM with contrastive learning is beyond the scope of this paper.
3 Intriguing Properties of Masked Image Modeling
3.1 Does MIM Bring Occlusion Robustness?
The higher degree of freedom of the receptive field of vision Transformers induced by the self-attention mechanism is believed to be the reason for Transformers outperforming CNNs. Compared to CNNs, Transformers gain tremendous performance improvement from carefully designed image augmentation techniques such as RandAug cubuk2020randaugment and CutMix yun2019cutmix. Random erasing zhong2020random randomly removes part of the image and replaces it with Gaussian noise, while CutMix randomly removes part of the image and replaces the corresponding region with a patch from another image. Similarly, in most MIM pre-training tasks, some patches of the image are masked out and replaced with a learnable mask token. We hypothesize that MIM as a pre-training task, as well as similar data augmentations, enhances the network's robustness towards occlusion and thus equips the network with a more generalized feature extraction ability. To verify our hypothesis, we design an occlusion robustness test. Let $x \in \mathbb{R}^{3 \times H \times W}$ be an input image and $y \in \mathbb{R}^{C}$ be its corresponding label, where $C$ is the class number. Consider a classification task $y = f(x)$, where $f$ denotes a neural network; the network is considered robust if it outputs the correct label given an occluded version of the image $x'$, namely $y = f(x')$. For occlusion, we consider the patch-based random masking adopted in most MIM works he2021masked; xie2021simmim; wei2021masked. In particular, we split the image of size $H \times W$ into patches of size $p \times p$ and randomly mask $K_m$ patches out of the total number of $K$ patches. The occlusion ratio is then defined as $K_m / K$. We conduct experiments on ImageNet-100 (IN-100) krizhevsky2012imagenet with both Transformer and CNN under different settings. Without loss of generality, we choose ViT-S dosovitskiy2020image and ResNet-50 he2016deep as the network architectures. We compare robustness under the following settings: (a) random weight initialization with no image augmentation applied, (b) random weight initialization with different image augmentations applied, (c) MIM pre-training as weight initialization with and without image augmentations applied, and (d) contrastive learning pre-training as weight initialization with and without image augmentations applied. We report the average top-1 accuracy across five runs trained with different settings under various occlusion ratios in Figure 1. The results reported in Figure 1 show that MIM pre-training is significantly more robust than random weight initialization for ViT-S. MIM pre-trained models are robust and give comparable accuracy up to an 80% occlusion ratio, whereas the accuracy of models with random weight initialization drops drastically. It can also be seen that applying the augmentations used in DeiT touvron2021training further improves the model robustness. For random weight initialization, we notice that all patch-removing augmentations boost the robustness, as the accuracy curve shows a convex trend with augmentations. A similar phenomenon can be observed for CNN (ResNet-50) in Figure 1. Compared with the concave shape of the curve for the model with random weight initialization, MIM pre-trained CNN models demonstrate a convex curve, indicating a huge robustness gain. Figure 1 also shows that, compared to contrastive learning, MIM yields a more convex accuracy curve, which indicates better occlusion robustness.
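The occlusion procedure described above can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the evaluation code used in the experiments: the function name `occlude_patches`, the zero fill value, and the image and patch sizes in the usage note are all hypothetical choices made for the example.

```python
import numpy as np

def occlude_patches(image, patch_size, num_masked, fill_value=0.0, rng=None):
    """Randomly mask `num_masked` non-overlapping patches of a (C, H, W) image.

    Returns the occluded copy and the occlusion ratio (masked / total patches).
    The fill value is an assumption; the input image is left untouched.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, height, width = image.shape
    assert height % patch_size == 0 and width % patch_size == 0
    rows, cols = height // patch_size, width // patch_size
    total = rows * cols
    assert 0 <= num_masked <= total

    # Choose which patches to drop, without replacement.
    dropped = rng.choice(total, size=num_masked, replace=False)
    occluded = image.copy()
    for index in dropped:
        r, c = divmod(int(index), cols)
        occluded[:, r * patch_size:(r + 1) * patch_size,
                 c * patch_size:(c + 1) * patch_size] = fill_value
    return occluded, num_masked / total
```

For instance, calling `occlude_patches(np.ones((3, 32, 32)), patch_size=8, num_masked=4)` masks 4 of the 16 patches, so the returned occlusion ratio is 0.25. A robustness curve would then be obtained by evaluating top-1 accuracy on occluded inputs over a range of ratios.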
3.2 Middle-level Interactions for Generalized Feature Extraction
In the previous subsection, we showed that the MIM task improves occlusion robustness for both Transformers and CNNs. However, it remains unclear how occlusion robustness relates to the knowledge learned via MIM. Note that existing MIM works adopt a medium or high masking ratio xie2021simmim; he2021masked (e.g., 60% or 70%, see Figure 2) during pre-training, and in these settings, the pairwise interactions between patches are under a middle-size context. This implies a possibility that MIM lets the model learn to encode interactions of a certain complexity, for instance, interactions of intermediate complexity. To verify this claim, we resort to the tool of multi-order interactions introduced by deng2021discovering; zhang2020interpreting, and investigate whether MIM makes the model more sensitive to interactions of some particular orders. Specifically, the multi-order interaction $I^{(m)}(i,j)$ measures the level of interaction between variables $i$ and $j$. We define $I^{(m)}(i,j)$ to be the average interaction utility between variables $i$ and $j$ over all contexts consisting of $m$ variables. $m$ indicates the level of contextual complexity of the interaction. Formally, given an input image $x$ with a set of $n$ variables $N = \{1, \dots, n\}$ (e.g., an image with $n$ pixels), the multi-order interaction $I^{(m)}(i,j)$ is defined as:
$$I^{(m)}(i,j) = \mathbb{E}_{S \subseteq N \setminus \{i,j\},\, |S| = m} \left[ \Delta f(i,j,S) \right], \quad (1)$$
where $\Delta f(i,j,S) = f(S \cup \{i,j\}) - f(S \cup \{i\}) - f(S \cup \{j\}) + f(S)$. $f(S)$ indicates the score of the output with variables in $S$ kept unchanged and the remaining variables in $N \setminus S$ replaced with the baseline value ancona2019explaining, where the context $S \subseteq N$. To measure the interaction complexity of the neural network, we measure the relative interaction strength $J^{(m)}$ of the encoded $m$-th order interaction as follows:
$$J^{(m)} = \frac{\mathbb{E}_{x \in \Omega} \mathbb{E}_{i,j} \left| I^{(m)}(i,j \mid x) \right|}{\mathbb{E}_{m'} \mathbb{E}_{x \in \Omega} \mathbb{E}_{i,j} \left| I^{(m')}(i,j \mid x) \right|}, \quad (2)$$
where $\Omega$ is the set of all samples and $0 \le m \le n - 2$. $J^{(m)}$ is the average value over all possible pairs of variables of the input samples, normalized by the average value of all interaction strengths. The distribution of $J^{(m)}$ (the area under the curve sums to one) indicates the level of interactions of the network. In this work, we use $J^{(m)}$ as the metric to evaluate and analyze the interaction levels of the network with MIM pre-training. We conduct experiments on IN-100 and use ViT-S dosovitskiy2020image and ResNet-50 he2016deep as the network architectures. We consider each image patch as an input variable. For the computation of $J^{(m)}$, we adopt the sampling solution following previous works deng2021discovering; zhang2020interpreting. As can be seen from Figure 1, ViT-S with random weight initialization tends to learn simple interactions with few patches, while MIM pre-trained models show stronger interactions at relatively middle levels. Similarly, as observed from Figure 1, MIM pre-trained ResNet-50 enhances the middle-level interactions compared to randomly initialized models. As Figure 1 shows, the MIM pre-training task demonstrates better middle-level interactions than contrastive learning pre-training methods for both Transformer and CNN. Stronger middle-level interactions form more complex features such as shape and edge, compared to the local texture features learned from low-level interactions naseer2021intriguing.
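To make the multi-order interaction concrete, the following sketch estimates the order-m interaction between two variables by sampling random contexts, in the spirit of the sampling solution mentioned above. It is an illustration under assumptions rather than the implementation used for the reported curves: the function names, the toy score functions in the usage note, and the convention that `f` receives the set of kept variable indices (and applies the baseline to all others internally) are hypothetical.

```python
import numpy as np

def delta_f(f, i, j, context):
    """Interaction utility of variables i and j under a given context set."""
    s = frozenset(context)
    return f(s | {i, j}) - f(s | {i}) - f(s | {j}) + f(s)

def multi_order_interaction(f, n, i, j, order, num_samples=200, rng=None):
    """Monte Carlo estimate of the order-`order` interaction between i and j.

    `f` maps a set of kept variable indices to a scalar score; variables
    outside the set are assumed to be replaced by a baseline inside `f`.
    """
    rng = np.random.default_rng() if rng is None else rng
    others = [k for k in range(n) if k not in (i, j)]
    assert 0 <= order <= len(others)
    total = 0.0
    for _ in range(num_samples):
        # Sample a context of exactly `order` variables excluding i and j.
        context = rng.choice(others, size=order, replace=False)
        total += delta_f(f, i, j, context.tolist())
    return total / num_samples
```

As a sanity check, a purely additive score such as `lambda s: float(len(s))` has zero interaction at every order, whereas a score that fires only when both variables 0 and 1 are present yields an interaction of one between them regardless of context. The relative strength at each order would then be obtained by averaging absolute interactions over pairs and samples and normalizing across orders.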
3.3 From Inpainting to MIM on CNN
Figure 3: (a) Fourier transformed feature maps. The vertical axis is the relative log amplitude and the horizontal axis is the normalized depth of the network. The blue columns indicate the pooling layers while the white columns indicate convolution layers. (b) Feature map variance. The vertical axis is the average variance of the feature maps. DeiT (Sup.) is supervised pre-training.
It is worth noticing that early inpainting work based on CNN pathak2016context resembles MIM. However, the inpainting task as pre-training on CNNs attracts little attention due to the much inferior performance than contrastive learning methods. We study the resemblance and difference of MIM and Inpainting on CNNs from the feature map perspective. We first plot the log magnitude of Fourier transformed feature maps of ResNet-50 with different pre-training methods on IN-1K park2022vision. As shown in Figure 3, inpainting and MIM show similar low-pass filtering effect at convolution layers as compared to contrastive learning. This indicates that inpainting and MIM reduce noise and uncertainty induced by high-frequency features. However, inpainting as pre-training adopts the masking strategy as illustrated in 1 and focuses on the reconstruction performance instead of the patch interactions. We argue that the reconstruction performance of MIM is mainly the patch interaction results of the low or high level while middle-level interactions lead to informative features deng2021discovering. Also, the reconstruction performance is not directly related to the learned representation quality. Figure 3 shows the feature variance of each layer of ResNet-50 with different pre-training methods on IN-1K. This figure indicates that MIM tends to reduce the feature map variance, and conversely, supervised training, inpainting, and contrastive learning on CNN tend to increase variance. Compared to MIM, which learns better middle-level interactions, inpainting task fails to filter out low-level interactions and thus leads to higher variance. To conclude, MIM enhances middle-level interactions and reduces the feature map uncertainty for a generalized and stabilized feature extraction park2022blur.
Considering our insight that the essence of MIM is to learn better middle-level interactions for generalized feature extraction, we propose a generic MIM framework following two design rules: (a) the masking should happen where middle-level interactions occur; (b) the model should be guided to learn targets that are initially neglected during training. An overall illustration of our proposed framework in comparison to existing MIM frameworks is given in Figure 4.
For the masking strategy, we follow the common practice of existing works el2021large; he2021masked; xie2021simmim; wei2021masked, where the input image is divided into non-overlapping patches and a random subset of patches is masked. MAE utilizes a Transformer as the decoder and feeds only the visible patches into the encoder. Mask tokens are appended at the decoder to reconstruct the masked patches. SimMIM xie2021simmim and MaskFeat wei2021masked utilize a fully connected layer as the decoder and feed the mask token into the encoder together with the visible patches. The mask token devlin2018bert is a token-shared learnable parameter that indicates the presence of a missing patch to be predicted. Despite different choices of decoder structures, the mask token is placed either at the input to the encoder or at the decoder. Mathematically, the masking process of MIM is defined as $x_{mask} = x \odot (1 - M) + T \odot M$, where $M$ is the random occlusion mask and $T$ represents the learnable mask token. We argue that by masking the image patches as done in current works, the network picks up low-level interactions. As discussed earlier, we wish to enhance the middle-level interactions of the network, and we thus propose to mask intermediate features of the network, where information is fused to a certain degree. More concretely, our masking operation is defined as $z^{l}_{mask} = z^{l} + T \odot D(M)$, where $z^{l}$ is the intermediate feature map of $x$ at layer-$l$ in the Transformer encoder (or at stage-$l$ in CNNs) and $D(\cdot)$ is the corresponding down-sampling function of the occlusion mask. We use the average RGB value to fill the masked patches at the input to the encoder. It is worth noting that existing works directly replace the occluded patches with the mask token in the input space or after the patch embedding dosovitskiy2020image; bao2021beit; xie2021simmim. In contrast, we replace the occluded patches with the mean of the RGB channels of the image and add the mask token onto the intermediate feature maps of the encoder.
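The two-step masking described above (mean-RGB filling at the input, then adding the mask token to an intermediate feature map) can be sketched with NumPy as follows. This is a simplified illustration under assumptions, not the framework's implementation: all function names are hypothetical, the mean is taken per channel, the mask is resampled by nearest-neighbour repetition or striding, and the mask token is a fixed vector rather than a learned parameter.

```python
import numpy as np

def patch_mask(grid, ratio, rng):
    """Binary (grid, grid) mask with 1 = masked, covering `ratio` of patches."""
    num_masked = int(round(ratio * grid * grid))
    flat = np.zeros(grid * grid)
    flat[rng.choice(grid * grid, size=num_masked, replace=False)] = 1.0
    return flat.reshape(grid, grid)

def upsample_mask(mask, factor):
    """Repeat each mask cell `factor` times along both spatial axes."""
    return np.kron(mask, np.ones((factor, factor)))

def mask_input_with_mean(image, mask, patch_size):
    """Fill masked patches of a (C, H, W) image with its per-channel mean."""
    pixel_mask = upsample_mask(mask, patch_size)[None]    # (1, H, W)
    mean_rgb = image.mean(axis=(1, 2), keepdims=True)     # (C, 1, 1)
    return image * (1.0 - pixel_mask) + mean_rgb * pixel_mask

def add_mask_token(features, mask, token):
    """Add a (D,) mask token to a (D, h, w) feature map at masked positions."""
    _, h, _ = features.shape
    grid = mask.shape[0]
    # Resample the patch-level mask to the feature-map resolution.
    if h >= grid:
        resized = upsample_mask(mask, h // grid)
    else:
        stride = grid // h
        resized = mask[::stride, ::stride]
    return features + token[:, None, None] * resized[None]
```

In a training loop, `mask_input_with_mean` would be applied before the encoder and `add_mask_token` at the chosen intermediate layer or stage, with the token being a learnable parameter of the model. Unmasked positions of the feature map are left unchanged, which is the difference from replacement-style masking.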
Currently proposed works el2021large; he2021masked; xie2021simmim adopt raw RGB values as the prediction target. However, raw pixels are heavily redundant and often contain low-level statistics bao2021beit; wei2021masked; zhou2021ibot. Following MAE, MaskFeat wei2021masked proposed to use the Histogram of Oriented Gradients (HOG) as the prediction target, outperforming MAE. HOG is a descriptor that captures shape features based on middle-level interactions. Such high-frequency features resulting from middle-level interactions can be well portrayed in the phase component of the Fourier spectrum. Given an RGB image $x \in \mathbb{R}^{3 \times H \times W}$, the discrete Fourier transform (DFT) of each channel is defined by:
$$\mathcal{F}(u, v) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} x(h, w)\, e^{-2\pi j \left( \frac{uh}{H} + \frac{vw}{W} \right)}. \quad (3)$$
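In addition to the common MIM loss in the spatial domain $\mathcal{L}_{spa}$, we propose to enforce better middle-level interactions in the Fourier domain with a frequency loss $\mathcal{L}_{freq}$, defined as:
$$\mathcal{L}_{freq} = \sum_{c=1}^{3} \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} \omega(u,v) \left\| \mathrm{DFT}\big( x^{pred}_{c} \odot M + \mathrm{det}(x^{pred}_{c}) \odot (1 - M) \big) - \mathrm{DFT}(x_{c}) \right\|, \quad (4)$$
where $x^{pred}$ is the predicted image, $M$ is the occlusion mask, $\mathrm{det}(\cdot)$ represents the gradient detachment operation and $\omega(u,v)$ is the frequency weighting matrix as in Focal Frequency Loss jiang2021focal. $\omega(u,v)$ is defined as:
$$\omega(u,v) = \left\| \mathrm{DFT}\big( x^{pred}_{c} \odot M + \mathrm{det}(x^{pred}_{c}) \odot (1 - M) \big) - \mathrm{DFT}(x_{c}) \right\|^{\alpha}, \quad (5)$$
where $\alpha$ is the scaling factor for flexibility ($\alpha = 1$ in our experiments). It is worth noting that for a real-valued 2D signal (one channel of the image), its DFT is conjugate symmetric, which implies that half of the DFT contains the global information. This reduces the prediction target size by half compared to directly predicting missing information in the spatial domain. Moreover, DFT computation is cheap and introduces negligible overhead when using fast Fourier transform (FFT) algorithms, which require only $\mathcal{O}(HW \log(HW))$ complexity per channel by taking advantage of the symmetry and periodicity properties of the Fourier transform. The overall loss function of our proposed framework is then defined as:
$$\mathcal{L} = \mathcal{L}_{spa} + \lambda \mathcal{L}_{freq}, \quad (6)$$
where $\mathcal{L}_{spa}$ is the spatial-domain MIM loss and $\lambda$ is a weight parameter to balance the two losses. We set $\lambda$ to 0.5 by default in this work.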
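A frequency-weighted loss of the kind described above can be sketched with NumPy's FFT routines. This is a simplified illustration under assumptions and not the training loss itself: the function names are hypothetical, the spatial term is assumed to be an L1 loss over masked pixels, the orthonormal FFT scaling is a choice made here, and gradient detachment is only indicated in a comment because NumPy has no automatic differentiation.

```python
import numpy as np

def frequency_loss(pred, target, alpha=1.0):
    """Focal-frequency-style loss between two (C, H, W) images.

    The weight is the per-frequency spectrum distance raised to `alpha`, so
    frequencies that are reconstructed poorly contribute more to the loss.
    """
    # 2D DFT over the spatial axes of every channel.
    pred_freq = np.fft.fft2(pred, axes=(-2, -1), norm="ortho")
    target_freq = np.fft.fft2(target, axes=(-2, -1), norm="ortho")
    distance = np.abs(pred_freq - target_freq)   # per-frequency error
    weight = distance ** alpha   # would be detached from the graph in training
    return float(np.mean(weight * distance))

def total_loss(pred, target, mask, lam=0.5, alpha=1.0):
    """Spatial L1 loss on masked pixels plus the weighted frequency loss."""
    masked_count = max(mask.sum() * pred.shape[0], 1.0)
    spatial = np.sum(np.abs(pred - target) * mask) / masked_count
    return float(spatial + lam * frequency_loss(pred, target, alpha))
```

Here `mask` is an (H, W) array with ones at masked pixels, and `lam` plays the role of the weight that balances the two terms. On the conjugate symmetry mentioned above: for a real-valued channel of shape (H, W), `np.fft.rfft2` returns an array of shape (H, W // 2 + 1), i.e., only the non-redundant half of the spectrum.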
5.1 Pre-training Setup.
We adopt ResNet-50 he2016deep and Vision Transformer dosovitskiy2020image (ViT-S/16 and ViT-B/16) as the backbone. We pre-train on ImageNet-1K (IN-1K) training set with AdamW iclr2019AdamW optimizer with the basic learning rate adjusted by a cosine learning rate scheduler and a batch size of 2048. The input image size is with a patch size of
. We use a random masking ratio of 60%. By default, the learnable mask tokens are placed at stage-3 in ResNet-50 and at layer-5/layer-8 in ViT-S/ViT-B, respectively. We adopt a linear prediction head as the decoder. Our experiments are implemented in PyTorch and conducted on a workstation with NVIDIA V100 GPUs. We report the average results of 3 trials for all experiments, and use bold and underline to indicate the best and the second-best performance. See Appendix A for detailed pre-training settings.
5.2 Image Classification on ImageNet-1K
We first evaluate the learned representation by end-to-end fine-tuning (FT) and linear probing (Lin.) protocols on IN-1K. For evaluation on CNN, we adopt the RSB A2/A3 wightman2021rsb training settings for fine-tuning ResNet-50, which employ the LAMB iclr2020lamb optimizer with a cosine scheduler for 300/100 epochs. For the linear probing setting on ResNet-50, we freeze the backbone features and train a linear classifier with an initial learning rate of 30 and a batch size of 256 following MoCo cvpr2020moco. For evaluation on Transformer, we employ the fine-tuning protocol of MAE he2021masked, which uses the DeiT touvron2021training augmentation setting and an AdamW optimizer for 100-epoch training, and adopt a layer-wise learning rate decay of 0.65 following bao2021beit. See Appendix A for detailed evaluation configurations.
We compare the proposed AMIM with classical self-supervised learning methods (Inpainting pathak2016context, Relative-Loc iccv2015relativeloc, and Rotation iclr2018rotation), contrastive learning, and MIM methods with various pre-training epochs. As shown in Table 1, our approach achieves competitive performance with state-of-the-art contrastive-based methods under 100-epoch RSB A3 fine-tuning. Note that MIM methods see fewer training samples per epoch than contrastive learning methods (40% vs. 200% of patches) and usually require more pre-training epochs. Based on a longer fine-tuning evaluation using RSB A2, our method (300-epoch) outperforms contrastive-based methods with even fewer training epochs. Meanwhile, our approach also improves over the baseline SimMIM xie2021simmim (+0.8%) and CIM fang2022corrupted (+0.4%) in terms of RSB A3 fine-tuning for the longer pre-training. Besides, we also report the linear probing accuracy in the fast pre-training setting for reference, although our main focus is to learn representations with better fine-tuning performance. Although the linear probing performance of our method is lower than that of contrastive-based methods, it still improves over the baseline by 0.6%.
We then compare AMIM with recently proposed contrastive-based methods and MIM methods based on ViT-S and ViT-B in Table 2. Our approach outperforms current state-of-the-art methods, e.g., iBOT zhou2021ibot (MIM combined with contrastive learning) and MaskFeat wei2021masked, and improves over the baseline SimMIM by 0.5% and 0.4% based on ViT-S and ViT-B, respectively. Although the performance of our method is slightly lower than that of PeCo dong2021peco (which requires an online tokenizer) based on ViT-B, our method is applicable to both CNN and Transformer architectures.
Table 3: Performance of object detection and segmentation tasks based on ResNet-50 on COCO and ADE20K.
5.3 Transfer Learning Experiments
Object detection and segmentation on COCO.
To verify the transfer abilities of self-supervised methods, we benchmark contrastive learning and MIM methods on object detection and segmentation with COCO eccv2014MSCOCO. For evaluation on CNN, we follow the setup in MoCo cvpr2020moco, which fine-tunes Mask R-CNN 2017iccvmaskrcnn with a ResNet-50-C4 backbone using the 2× schedule on COCO train2017 and evaluates on COCO val2017. Results in Table 3 indicate that our approach (300-epoch) significantly outperforms contrastive-based methods with longer pre-training (+0.7% AP$^{box}$ and +0.6% AP$^{mask}$). For evaluation on Transformer, we follow MAE he2021masked, which efficiently fine-tunes Mask R-CNN with a ViT-B backbone using the 1× schedule. In Table 4, our approach (800-epoch) is superior to popular contrastive-based and MIM methods, e.g., it improves over MAE (1600-epoch) by 0.8% AP$^{box}$ and 0.8% AP$^{mask}$. The significant performance gain in downstream tasks might be related to capturing more global information during pre-training (see Figure 5).
Semantic segmentation on ADE20K.
We experiment on ADE-20K ijcv2019ADE20K using UperNet eccv2018upernet, following MAE he2021masked to fine-tune the model for 100 epochs with a batch size of 16. Based on ResNet-50, results in Table 3 show that our method outperforms contrastive learning methods by at least 0.9% mIoU and improves over CIM (which requires an extra pre-trained BEiT bao2021beit) by 0.3% mIoU. Based on ViT-B, Table 2 shows that our approach consistently improves over MIM methods (e.g., over MAE and SimMIM by 0.9% and 0.6% mIoU). These observations are consistent with those on COCO.
5.4 Ablation Study
To investigate the effectiveness of the proposed components, we conduct ablation studies on ResNet-50 and ViT-S on IN-100 using the fine-tuning protocol. Based on the modified baseline SimMIM, we first compare different mask token mechanisms: Replacing denotes the original approach in most MIM methods, and Addition denotes our proposed approach that adds the mask token to intermediate feature maps of the backbone. As shown in Figure 6, adding the mask token to the medium stages (stage-3) or layers (layer-5) yields the best performance. Notice that replacing masked patches in input images with the RGB channel mean slightly improves over the baseline SimMIM, especially for ResNet-50 (88.14 vs. 87.75). Then, we verify the proposed Fourier-domain loss $\mathcal{L}_{freq}$ in Table 5. We find that simply using $\mathcal{L}_{freq}$ without the adaptive re-weighting $\omega$ (Eqn. 5) brings some improvements as a frequency constraint added to the spatial-domain loss $\mathcal{L}_{spa}$, while employing $\omega$ further enhances the performance by helping the model learn more informative frequency components. Additionally, we visualize reconstruction results in Figure 5 to demonstrate the improvements brought by our proposed components (see more visualization results in Appendix C).
In this paper, we delved deep into MIM and answered the question of what exactly is learned during MIM pre-training. We adopted multi-order interactions to study the interaction level among image patches. We discovered that MIM essentially teaches the network to learn middle-level interactions among image patches for more complex feature extraction, regardless of the network architecture. Based on our findings, we further proposed a general framework, AMIM, that is compatible with both Transformers and CNNs for MIM tasks and aims at enhancing patch interactions during self-supervised pre-training. Besides a different mask token mechanism, we proposed a loss in the Fourier domain to better learn middle-level interactions. Experimental results have shown that our proposed framework improves the representations learned for both CNNs and Transformers, yielding performance superior to state-of-the-art methods on various downstream tasks.
Appendix A Details of Comparison Experiments
This section provides experimental details for Sec. 5, e.g., pre-training and evaluation on ImageNet-1K and transfer learning settings on downstream tasks.
A.1 ImageNet-1K Experiments
The default settings of AMIM for ResNet-50 and ViTs are provided in Table A1, following SimMIM xie2021simmim. We use the AdamW iclr2019AdamW optimizer with the cosine scheduler and the linear learning rate scaling rule 2017msgd: lr = base_lr × batchsize / 256. Similar to current MIM methods, we only use RandomResizedCrop and do not require other complex augmentations (e.g., RandAugment cubuk2020randaugment, mixups yun2019cutmix, or stochastic depth) during pre-training.
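The linear scaling rule and the cosine schedule mentioned above amount to two short formulas, sketched below in plain Python. The function names are hypothetical, the base learning rate in the usage note is an arbitrary illustrative value, and warmup is omitted for brevity; the actual settings are those listed in Table A1.

```python
import math

def scaled_lr(base_lr, batch_size, reference_batch=256):
    """Linear scaling rule: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / reference_batch

def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from `peak_lr` at step 0 to `min_lr` at `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With a batch size of 2048, the rule multiplies the base learning rate by 8; for example, a hypothetical base value of 1e-4 becomes roughly 8e-4, which then serves as `peak_lr` for the cosine decay.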
Our fine-tuning follows common practices of supervised image classification on ImageNet-1K. As shown in Table A2, we fine-tune pre-trained ViTs for 100 epochs using the DeiT touvron2021training training recipe, which employs the AdamW iclr2019AdamW optimizer with the cross-entropy (CE) loss; we fine-tune pre-trained ResNet-50 for 100/300 epochs using the RSB A3/A2 wightman2021rsb settings, which employ the LAMB iclr2020lamb optimizer with the binary cross-entropy (BCE) loss. Additionally, we use layer-wise learning rate decay as in bao2021beit for fine-tuning ViT models.
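Layer-wise learning rate decay assigns smaller learning rates to earlier layers. One common way to compute the per-layer multipliers is sketched below; the exact indexing (embeddings as layer 0, the classification head as the last entry) is an assumption about the convention and may differ from the configuration used in the experiments, and the function name is hypothetical.

```python
def layerwise_lr_scales(num_layers, decay):
    """Learning-rate multipliers for the embeddings (index 0), each of the
    `num_layers` Transformer blocks (indices 1..num_layers), and the head
    (last index). Deeper layers receive larger multipliers."""
    return [decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]
```

For a 12-block ViT with a decay of 0.65, the head keeps the full learning rate (multiplier 1), the last block is scaled by 0.65, and each earlier block is scaled by a further factor of 0.65.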
A.2 Object Detection and Segmentation on COCO
We adopt the Mask R-CNN 2017iccvmaskrcnn framework to perform transfer learning to object detection and segmentation on COCO eccv2014MSCOCO in Detectron2 (https://github.com/facebookresearch/detectron2). For evaluation on ResNet-50, we follow MoCo cvpr2020moco and fine-tune Mask R-CNN with the pre-trained ResNet-50-C4 backbone using the 2× schedule (24 epochs). For evaluation of ViTs, we follow MAE he2021masked, which employs the pre-trained ViT backbone and an FPN neck cvpr2017fpn in Mask R-CNN, and fine-tune the model using the 1× schedule (12 epochs). For a fair comparison, we follow bao2021beit; xie2021simmim and turn on relative position bias in ViT dosovitskiy2020image during both pre-training and transfer learning, initialized as zero.
A.3 Semantic Segmentation on ADE-20K
We adopt UperNet eccv2018upernet to perform transfer learning to semantic segmentation on ADE-20K and use the semantic segmentation implementation in MMSegmentation (https://github.com/open-mmlab/mmsegmentation). We initialize UperNet using the backbones (ResNet-50 or ViTs) pre-trained on ImageNet-1K and fine-tune end-to-end for 100 epochs with a batch size of 16. We search for the optimal learning rate for all competitors. Similar to the fine-tuning settings on COCO, we use relative position bias in ViT dosovitskiy2020image during both pre-training and transfer learning, as in bao2021beit; xie2021simmim.
Appendix B Empirical Experiments
B.1 Occlusion Robustness
In Sec. 3.1, we analyze robustness against occlusion of fine-tuned models on ImageNet-100 (a subset of ImageNet-1K defined by eccv2020CMC) using the official implementation (https://github.com/Muzammal-Naseer/Intriguing-Properties-of-Vision-Transformers) provided by naseer2021intriguing. Both MIM and contrastive-based methods are pre-trained for 400 epochs on ImageNet-100 using their pre-training settings on ImageNet-1K. We adopt the DeiT fine-tuning recipe in Table A2 and use the same setting (100-epoch) for both ViT-S and ResNet-50. Note that we use the modified SimMIM for ResNet-50 (replacing masked patches in the input image with the RGB mean) in all experiments. As shown in Figure 1, we compare MIM pre-trained models with supervised methods using various augmentations and with contrastive learning pre-trained methods in terms of top-1 accuracy under various occlusion ratios. Note that the occlusion ratio is the ratio of dropped patches to total patches, and we plot the mean accuracy across 3 runs. We conclude that MIM pre-trained models have stronger robustness against occlusion than supervised and contrastive-based methods.
B.2 Multi-order Interaction
In Sec. 3.2, we interpret what is learned by MIM using multi-order interactions deng2021discovering; zhang2020interpreting. The interaction complexity can be represented by $I^{(m)}(i,j)$ (defined in Eqn. 1), which measures the average interaction utility between variables $i, j$ over all contexts consisting of $m$ variables. Notice that the order $m$ reflects the contextual complexity of the interaction $I^{(m)}(i,j)$. For example, a low-order interaction (small $m$) means relatively simple collaboration between variables $i, j$, while a high-order interaction (large $m$) corresponds to complex collaboration. As pointed out in the representation bottleneck deng2021discovering, deep neural networks (DNNs) are more likely to encode both low-order interactions and high-order interactions, but often fail to learn middle-order interactions. We hypothesize that MIM helps models learn more middle-level interactions since MIM has a natural advantage in cases where some parts of the image are masked out. In Figure 1, we calculate the interaction strength $J^{(m)}$ (defined in Eqn. 2) for fine-tuned models on ImageNet-100 using the official implementation (https://github.com/Nebularaid2000/bottleneck) provided by deng2021discovering. Specifically, we calculate $J^{(m)}$ on a grid of image patches, treating each grid cell as one input variable. We set the model output as $f(x_S) = \log \frac{P(\hat{y} = y \mid x_S)}{1 - P(\hat{y} = y \mid x_S)}$ given the masked sample $x_S$, where $y$ denotes the ground-truth label and $P(\hat{y} = y \mid x_S)$ denotes the probability of classifying the masked sample $x_S$ into the true category. In addition to Sec. 3.2, we further provide occlusion robustness results and interaction strength for ResNet-50 on ImageNet-1K in Figure A1. These observations are consistent with those in Sec. 3.
B.3 Analysis of Feature Maps
In Sec. 3.3, we perform Fourier and variance analysis of feature maps in pre-trained ResNet-50 on ImageNet-1K. Following park2022vision, we first convert feature maps into the frequency domain and represent them on the normalized frequency domain. In Figure 3, we report the amplitude ratio of high-frequency components using the relative log amplitude, and find that inpainting and MIM reduce high-frequency components at convolution layers compared to supervised and contrastive learning. Then, we provide the standard deviation of feature maps by block depth as in park2022vision; park2022blur in Figure 3, which shows that MIM tends to reduce feature map variances compared to other pre-training methods. Notice that we also plot results of the randomly initialized network in Figure 3 for reference. Therefore, we conclude that MIM learns features with less uncertainty than supervised and contrastive learning methods.
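The two feature-map statistics discussed above can be approximated with a few lines of NumPy. This sketch rests on assumptions and is not the analysis code behind Figure 3: the function names are hypothetical, the relative log amplitude is taken here as the log amplitude at the highest frequency minus that at the zero frequency, and the second function reports the variance rather than the standard deviation.

```python
import numpy as np

def relative_log_amplitude(feature_map):
    """Log amplitude at the highest frequency relative to the zero frequency.

    `feature_map` has shape (C, H, W); amplitudes are averaged over channels.
    """
    spectrum = np.fft.fft2(feature_map, axes=(-2, -1))
    # Move the zero-frequency component to the center of the spectrum.
    spectrum = np.fft.fftshift(spectrum, axes=(-2, -1))
    amplitude = np.abs(spectrum).mean(axis=0) + 1e-12   # avoid log(0)
    log_amp = np.log(amplitude)
    h, w = log_amp.shape
    center = log_amp[h // 2, w // 2]    # zero-frequency (DC) component
    highest = log_amp[0, 0]             # highest frequency on both axes
    return float(highest - center)

def feature_variance(feature_map):
    """Average per-channel spatial variance of a (C, H, W) feature map."""
    return float(feature_map.var(axis=(1, 2)).mean())
```

A spatially constant feature map gives a strongly negative relative log amplitude (no high-frequency content), while a checkerboard pattern gives a positive one. Applying both functions to the output of each block and plotting the results against normalized depth would produce curves of the kind described for Figure 3.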
Appendix C Visualization Experimental Details
In addition to Sec. 5.4, we provide more visualization results of AMIM. Similar to Figure 5, we ablate the proposed components in AMIM based on ResNet-50 in Figure A2, which demonstrates that AMIM helps ResNet-50 learn more spatial details, i.e., more middle-level interactions. Moreover, we study the effects of the mask token in both ViTs and CNNs in Figure A3.