MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning

05/27/2022
by Jihao Liu, et al.

In this study, we propose Mixed and Masked Image Modeling (MixMIM), a simple but efficient MIM method that is applicable to various hierarchical Vision Transformers. Existing MIM methods replace a random subset of input tokens with a special [MASK] symbol and aim to reconstruct the original image tokens from the corrupted image. However, we find that using the [MASK] symbol greatly slows down training and causes a training-finetuning inconsistency, due to the large masking ratio (e.g., 40% in BEiT). In contrast, we replace the masked tokens of one image with the visible tokens of another image, i.e., we create a mixed image. We then perform dual reconstruction, recovering both original images from the mixed input, which significantly improves efficiency. While MixMIM can be applied to various architectures, this paper explores a simpler but stronger hierarchical Transformer and scales it up as MixMIM-B, -L, and -H. Empirical results demonstrate that MixMIM learns high-quality visual representations efficiently. Notably, MixMIM-B with 88M parameters achieves 85.1% top-1 accuracy on ImageNet-1K after pretraining for 600 epochs, setting a new record among MIM methods for neural networks of comparable model size (e.g., ViT-B). Moreover, its transfer performance on 6 other datasets shows that MixMIM achieves a better FLOPs/performance tradeoff than previous MIM methods.
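To make the mixing and dual-reconstruction idea concrete, here is a minimal PyTorch sketch, assuming patch tokens are already embedded and using a 50% mixing ratio. The names (mix_tokens, dual_reconstruction_loss) and the per-sample random mask are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of MixMIM-style token mixing and dual reconstruction.
# Hypothetical names and conventions; not the official MixMIM code.
import torch

def mix_tokens(tokens_a, tokens_b, mask_ratio=0.5):
    """Mix the patch tokens of two images with a random binary mask.

    tokens_a, tokens_b: (B, N, D) patch embeddings of two images.
    Returns the mixed tokens and the mask (1 = token taken from image b).
    """
    B, N, _ = tokens_a.shape
    num_masked = int(N * mask_ratio)
    # Per-sample random permutation of token positions; the first
    # `num_masked` positions take image b's tokens, the rest keep image a's.
    noise = torch.rand(B, N, device=tokens_a.device)
    ids = noise.argsort(dim=1)
    mask = torch.zeros(B, N, device=tokens_a.device)
    mask.scatter_(1, ids[:, :num_masked], 1.0)
    mask = mask.unsqueeze(-1)  # (B, N, 1)
    mixed = tokens_a * (1.0 - mask) + tokens_b * mask
    return mixed, mask

def dual_reconstruction_loss(pred_a, pred_b, target_a, target_b, mask):
    """Each image is reconstructed only at the positions where its own
    tokens were replaced in the mixed input (mean-squared error)."""
    err_a = ((pred_a - target_a) ** 2).mean(dim=-1, keepdim=True)
    err_b = ((pred_b - target_b) ** 2).mean(dim=-1, keepdim=True)
    inv = 1.0 - mask  # positions where image b's tokens were dropped
    loss_a = (err_a * mask).sum() / mask.sum()
    loss_b = (err_b * inv).sum() / inv.sum()
    return loss_a + loss_b
```

Under this convention, no [MASK] tokens enter the encoder: every position in the mixed input carries a real token from one of the two images, and both images are reconstructed from a single shared forward pass, which is the source of the efficiency gain described above.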


Related research

03/08/2023 · Centroid-centered Modeling for Efficient Vision Transformer Pre-training
Masked Image Modeling (MIM) is a new self-supervised vision pre-training...

06/05/2020 · Visual Transformers: Token-based Image Representation and Processing for Computer Vision
Computer vision has achieved great success using standardized image repr...

10/03/2022 · Visual Prompt Tuning for Generative Transfer Learning
Transferring knowledge from an image synthesis model trained on a large ...

11/16/2022 · AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders
Masked Autoencoders (MAEs) learn generalizable representations for image...

03/25/2023 · Masked Diffusion Transformer is a Strong Image Synthesizer
Despite its success in image synthesis, we observe that diffusion probab...

06/21/2021 · TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
In this paper, we introduce a novel visual representation learning which...

01/28/2022 · DynaMixer: A Vision MLP Architecture with Dynamic Mixing
Recently, MLP-like vision models have achieved promising performances on...
