CAE v2: Context Autoencoder with CLIP Target

11/17/2022
by Xinyu Zhang, et al.

Masked image modeling (MIM) learns visual representations by masking and reconstructing image patches. Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM. However, how CLIP supervision in MIM influences performance is still under-explored. To investigate strategies for refining CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio, and reveal two interesting perspectives with our simple pipeline, context autoencoder with CLIP target (CAE v2). Firstly, we observe that supervision on visible patches achieves remarkable performance, even better than supervision on masked patches, the standard format in existing MIM methods. Secondly, the optimal mask ratio positively correlates with the model size; that is, the smaller the model, the lower the mask ratio needs to be. Driven by these two discoveries, our simple and concise approach, CAE v2, achieves superior performance on a series of downstream tasks. For example, a vanilla ViT-Large model pre-trained for 300 epochs achieves 81.7% accuracy on ImageNet-1K linear probing and fine-tuning, and 55.9 mIoU on ADE20K semantic segmentation. We hope our findings can serve as helpful guidelines for pre-training in the MIM area, especially for small-scale models.
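To make the two findings concrete, below is a minimal sketch of a CLIP-targeted MIM training step with the loss placed on visible patches. It is an illustration under assumptions, not the authors' implementation: the `TinyMIMEncoder` stand-in, the random tensors standing in for patch embeddings and frozen CLIP features, and the `MASK_RATIO_BY_MODEL` numbers are all hypothetical.

```python
# Sketch of CLIP-targeted masked image modeling with supervision on VISIBLE
# patches (CAE v2's first finding). Names and numbers here are illustrative
# assumptions, not the authors' code.
import torch
import torch.nn as nn


def random_mask(num_patches: int, mask_ratio: float, device=None) -> torch.Tensor:
    """Boolean mask over patches: True = masked, False = visible.

    For simplicity the same mask is shared across the batch.
    """
    num_masked = int(num_patches * mask_ratio)
    ids = torch.rand(num_patches, device=device).argsort()
    mask = torch.zeros(num_patches, dtype=torch.bool, device=device)
    mask[ids[:num_masked]] = True
    return mask


class TinyMIMEncoder(nn.Module):
    """Toy stand-in for a ViT encoder; a real setup would use a full ViT."""

    def __init__(self, dim: int = 192, depth: int = 2, heads: int = 3, clip_dim: int = 512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, clip_dim)  # project into the CLIP feature space

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.blocks(patch_tokens))


def training_step(encoder, patch_tokens, clip_targets, mask_ratio):
    """One MIM step with the loss computed on visible patches only.

    patch_tokens: (B, N, D) patch embeddings of the input image
    clip_targets: (B, N, clip_dim) per-patch features from a frozen CLIP encoder
    """
    _, N, _ = patch_tokens.shape
    visible = ~random_mask(N, mask_ratio, device=patch_tokens.device)

    # Encode only the visible patches, then align their predictions with the
    # CLIP features of the SAME patches (supervision on visible patches),
    # rather than reconstructing the masked ones.
    pred = encoder(patch_tokens[:, visible])
    target = clip_targets[:, visible]
    return 1 - nn.functional.cosine_similarity(pred, target, dim=-1).mean()


# Second finding: smaller models prefer lower mask ratios. These numbers are
# a hypothetical schedule, not values from the paper.
MASK_RATIO_BY_MODEL = {"vit-tiny": 0.15, "vit-base": 0.40, "vit-large": 0.55}

if __name__ == "__main__":
    enc = TinyMIMEncoder()
    tokens = torch.randn(2, 196, 192)      # 14 x 14 patches, embed dim 192
    clip_feats = torch.randn(2, 196, 512)  # stand-in for frozen CLIP features
    loss = training_step(enc, tokens, clip_feats, MASK_RATIO_BY_MODEL["vit-tiny"])
    print(loss.item())
```

Note that the only structural change from a standard masked-patch objective is where the loss is applied; the masking, encoder, and CLIP target are unchanged, which is why the mask ratio remains the main knob to tune per model size.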

Related research

11/28/2022
Good helper is around you: Attention-driven Masked Image Modeling
It has been witnessed that masked image modeling (MIM) has shown a huge ...

02/07/2022
Corrupted Image Modeling for Self-Supervised Visual Pre-Training
We introduce Corrupted Image Modeling (CIM) for self-supervised visual p...

11/23/2022
Integrally Pre-Trained Transformer Pyramid Networks
In this paper, we present an integral pre-training framework based on ma...

05/28/2022
SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners
Self-supervised Masked Autoencoders (MAE) are emerging as a new pre-trai...

04/12/2023
Hard Patches Mining for Masked Image Modeling
Masked image modeling (MIM) has attracted much research attention due to...

02/28/2023
Efficient Masked Autoencoders with Self-Consistency
Inspired by masked language modeling (MLM) in natural language processin...

01/30/2023
Advancing Radiograph Representation Learning with Masked Record Modeling
Modern studies in radiograph representation learning rely on either self...
