Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking

03/09/2023
by Peng Gao, et al.

Masked Autoencoders (MAE) have been popular paradigms for large-scale vision representation pre-training. However, MAE solely reconstructs the low-level RGB signals after the decoder and lacks supervision upon high-level semantics for the encoder, thus suffering from sub-optimal learned representations and long pre-training epochs. To alleviate this, previous methods simply replace the pixel reconstruction targets of the 75% masked tokens with the features from pre-trained image-image (DINO) or image-language (CLIP) contrastive learning. Different from those efforts, we propose to Mimic before Reconstruct for Masked Autoencoders, named as MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE employs a mimic loss over the 25% visible tokens from the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit the reconstruction loss in MAE to predict RGB pixel values for the 75% masked tokens after the decoder. As MR-MAE applies high-level and low-level targets respectively at different partitions, the learning conflicts between them can be naturally overcome and contribute to superior visual representations for various downstream tasks. On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by +2.2% and the state-of-the-art BEiT V2 base by +0.3%. Code and pre-trained models will be released at https://github.com/Alpha-VL/ConvMAE.
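The abstract describes two objectives applied to disjoint token partitions: a feature-mimicking loss on the roughly 25% visible tokens against frozen CLIP/DINO features, and MAE's pixel reconstruction loss on the 75% masked tokens. Below is a minimal PyTorch-style sketch of that loss partitioning, based only on the abstract. It assumes pre-patchified inputs, hypothetical `encoder` and `decoder` callables, and precomputed teacher features whose dimension matches the encoder's; these names and signatures are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def mr_mae_loss(patches, encoder, decoder, teacher_feats, mask_ratio=0.75):
    """Sketch of the MR-MAE joint objective (assumed interfaces).

    patches:       (B, N, P) flattened RGB patches of the input image
    encoder:       callable, (B, N_vis, P) -> (B, N_vis, D)
    decoder:       callable, (B, N_vis, D), masked indices -> (B, N_mask, P)
    teacher_feats: (B, N, D) frozen CLIP/DINO features at all patch positions
    """
    B, N, P = patches.shape

    # Randomly partition tokens into ~25% visible and ~75% masked (disjoint sets).
    perm = torch.randperm(N)
    n_vis = int(N * (1 - mask_ratio))
    vis_idx, mask_idx = perm[:n_vis], perm[n_vis:]

    # The encoder sees only the visible tokens, as in MAE.
    vis_feats = encoder(patches[:, vis_idx])

    # High-level mimic loss on the visible tokens: match the frozen teacher's
    # features at the same positions (no gradient flows to the teacher).
    mimic_loss = F.mse_loss(vis_feats, teacher_feats[:, vis_idx].detach())

    # Low-level reconstruction loss on the masked tokens: predict raw RGB
    # pixel values after the decoder, inherited from MAE.
    pred = decoder(vis_feats, mask_idx)
    recon_loss = F.mse_loss(pred, patches[:, mask_idx])

    # The two objectives act on disjoint token partitions, so they do not
    # compete over the same tokens.
    return mimic_loss + recon_loss
```

In the paper's setting the teacher features would come from a frozen CLIP or DINO visual encoder run on the full image; here they are simply passed in precomputed to keep the sketch self-contained.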


Related research

12/13/2022
Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders
Pre-training by numerous image data has become de-facto for robust 2D re...

06/01/2023
DeepFake-Adapter: Dual-Level Adapter for DeepFake Detection
Existing deepfake detection methods fail to generalize well to unseen or...

07/31/2023
Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training
Harnessing the power of pre-training on large-scale datasets like ImageN...

12/19/2021
On Efficient Transformer and Image Pre-training for Low-level Vision
Pre-training has marked numerous state of the arts in high-level compute...

12/13/2022
FastMIM: Expediting Masked Image Modeling Pre-training for Vision
The combination of transformers and masked image modeling (MIM) pre-trai...

04/02/2023
DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks
In this paper, we study masked autoencoder (MAE) pretraining on videos f...

10/20/2022
i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?
Masked image modeling (MIM) has been recognized as a strong and popular ...
