SdAE: Self-distillated Masked Autoencoder

07/31/2022
by Yabo Chen et al.

With the development of generative self-supervised learning (SSL) approaches like BeiT and MAE, learning good representations by masking random patches of the input image and reconstructing the missing information has attracted growing attention. However, BeiT and PeCo require a "pre-pretraining" stage to produce a discrete codebook for representing the masked patches. MAE does not need such a codebook, but using raw pixels as reconstruction targets may introduce an optimization gap between pre-training and downstream tasks, since good reconstruction quality does not always translate into high descriptive capability for the model. Considering the above issues, in this paper we propose a simple Self-distillated masked AutoEncoder network, namely SdAE. SdAE consists of a student branch with an encoder-decoder structure that reconstructs the missing information, and a teacher branch that produces latent representations of the masked tokens. We also analyze, from the perspective of the information bottleneck, how to build good views for the teacher branch to produce latent representations. Based on this analysis, we propose a multi-fold masking strategy that provides multiple masked views with balanced information to boost performance, and which also reduces computational complexity. Our approach generalizes well: with only 300 epochs of pre-training, a vanilla ViT-Base model achieves 84.1% fine-tuning accuracy on ImageNet-1k classification, 48.6 mIoU on ADE20K segmentation, and 48.9 mAP on COCO detection, surpassing other methods by a considerable margin. Code is available at https://github.com/AbrahamYabo/SdAE.
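The abstract only describes the architecture in words; the following is a minimal PyTorch sketch of what a single student-teacher training step with multi-fold masking could look like. The tiny Transformer stand-ins, the feature-space MSE loss, the EMA momentum of 0.996, and all sizes are illustrative assumptions rather than the paper's actual configuration; see the linked repository for the real implementation.

```python
# Minimal sketch of an SdAE-style training step (toy sizes, assumptions noted in comments).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_patches, num_folds, batch = 64, 16, 4, 2  # not the paper's ViT-Base setup

def tiny_vit(dim):
    # Stand-in for a ViT encoder/decoder: a single Transformer layer (assumption).
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=1)

student_enc, student_dec, teacher_enc = tiny_vit(dim), tiny_vit(dim), tiny_vit(dim)
teacher_enc.load_state_dict(student_enc.state_dict())  # teacher starts as a copy of the student
for p in teacher_enc.parameters():
    p.requires_grad = False  # teacher branch receives no gradients

mask_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
params = list(student_enc.parameters()) + list(student_dec.parameters()) + [mask_token, pos_embed]
opt = torch.optim.AdamW(params, lr=1.5e-4)

patches = torch.randn(batch, num_patches, dim)        # patch embeddings of one image batch
folds = torch.randperm(num_patches).chunk(num_folds)  # multi-fold masking: disjoint patch groups
visible_idx, masked_folds = folds[0], folds[1:]       # one fold visible, the rest are masked views

# Student branch: encode visible patches, then decode mask tokens at the masked positions.
enc_out = student_enc(patches[:, visible_idx] + pos_embed[:, visible_idx])

loss = 0.0
for masked_idx in masked_folds:
    queries = mask_token.expand(batch, len(masked_idx), dim) + pos_embed[:, masked_idx]
    dec_out = student_dec(torch.cat([enc_out, queries], dim=1))
    pred = dec_out[:, -len(masked_idx):]              # student predictions for masked tokens
    with torch.no_grad():                             # teacher branch: latent targets for this view
        target = teacher_enc(patches[:, masked_idx] + pos_embed[:, masked_idx])
    loss = loss + F.mse_loss(pred, target)            # feature-space reconstruction (assumed loss)

opt.zero_grad()
loss.backward()
opt.step()

with torch.no_grad():  # self-distillation: EMA update of the teacher from the student encoder
    for pt, ps in zip(teacher_enc.parameters(), student_enc.parameters()):
        pt.mul_(0.996).add_(ps, alpha=0.004)          # 0.996 momentum is an assumption
```

The structural points mirrored here are that the student encoder only sees one fold of visible patches, the decoder predicts latent features (rather than pixels) for the remaining folds, and the teacher that provides those latent targets is a momentum copy of the student that receives no gradients.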

Related research

Masked Autoencoders Are Scalable Vision Learners (11/11/2021)
This paper shows that masked autoencoders (MAE) are scalable self-superv...

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training (05/28/2022)
Masked Autoencoders (MAE) have shown great potentials in self-supervised...

GiGaMAE: Generalizable Graph Masked Autoencoder via Collaborative Latent Space Reconstruction (08/18/2023)
Self-supervised learning with masked autoencoders has recently gained po...

Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers (03/27/2022)
The past year has witnessed a rapid development of masked image modeling...

Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders (12/13/2022)
Pre-training by numerous image data has become de-facto for robust 2D re...

PatchGame: Learning to Signal Mid-level Patches in Referential Games (11/02/2021)
We study a referential game (a type of signaling game) where two agents ...

DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions (09/07/2023)
As it is empirically observed that Vision Transformers (ViTs) are quite ...
