Revealing the Dark Secrets of Masked Image Modeling

05/26/2022
by Zhenda Xie, et al.

Masked image modeling (MIM) as pre-training has been shown to be effective for numerous vision downstream tasks, but how and where MIM works remains unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, visualization and experiment, to uncover their key representational differences. From the visualizations, we find that MIM brings a locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers but more globally at higher layers. This may be the reason why MIM helps Vision Transformers, which have a very large receptive field, to optimize. With MIM, the model maintains large diversity across attention heads in all layers, but for supervised models the diversity of attention heads nearly disappears in the last three layers, and this reduced diversity harms fine-tuning performance. From the experiments, we find that MIM models perform significantly better than their supervised counterparts on geometric and motion tasks with weak semantics, as well as on fine-grained classification tasks. Without bells and whistles, a standard MIM pre-trained SwinV2-L achieves state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). For semantic understanding datasets whose categories are sufficiently covered by the supervised pre-training, MIM models can still achieve highly competitive transfer performance. With this deeper understanding of MIM, we hope that our work can inspire new and solid research in this direction.
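The abstract's two diagnostics, locality of attention and diversity across attention heads, can be probed directly from a Vision Transformer's attention maps. The paper's exact metric definitions are not given in this abstract, so the sketch below is an assumption: it measures locality as the attention-weighted mean distance (in patch units) between query and key positions, and head diversity as the average pairwise symmetric KL divergence between the attention distributions of different heads in the same layer. The function names and the input layout (a softmaxed `[batch, heads, tokens, tokens]` tensor over patch tokens) are hypothetical conventions for illustration only.

```python
# Minimal sketch (not the authors' code) of two per-layer attention diagnostics.
import torch

def avg_attention_distance(attn: torch.Tensor, grid_size: int) -> torch.Tensor:
    """attn: [batch, heads, tokens, tokens] softmaxed attention over patch tokens
    (class token removed, tokens == grid_size**2). Returns the mean query-to-key
    distance per head, in patch units; smaller values mean more local attention."""
    coords = torch.stack(
        torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij"),
        dim=-1,
    ).reshape(-1, 2).float()                      # [tokens, 2] patch coordinates
    dist = torch.cdist(coords, coords)            # [tokens, tokens] pairwise distances
    # Weight each query-key distance by its attention probability, then average
    # over batch and query positions, keeping one value per head.
    return (attn * dist).sum(dim=-1).mean(dim=(0, 2))   # [heads]

def head_diversity(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Average symmetric KL divergence between attention maps of different heads
    in the same layer. Lower values mean the heads attend to nearly the same
    positions, i.e. less diversity."""
    b, h, n, _ = attn.shape
    p = attn.clamp_min(eps)
    # kl[b, i, j, q] = KL(head_i's attention at query q || head_j's at query q)
    kl = (p.unsqueeze(2) * (p.unsqueeze(2).log() - p.unsqueeze(1).log())).sum(-1)
    sym = 0.5 * (kl + kl.transpose(1, 2))          # symmetrize over head pairs
    off_diag = ~torch.eye(h, dtype=torch.bool)     # ignore each head vs. itself
    return sym.mean(dim=(0, 3))[off_diag].mean()   # scalar diversity score
```

Applied layer by layer to attention maps captured with forward hooks, these scores would reproduce the qualitative comparison described above: an MIM-pretrained model should show small attention distances and a non-trivial diversity score at every depth, while a supervised model should show growing attention distances with depth and near-zero diversity in the last few layers.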


