Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers

03/27/2022
by Yunjie Tian, et al.

The past year has witnessed rapid development of masked image modeling (MIM). MIM is mostly built upon vision transformers and suggests that self-supervised visual representation learning can be performed by masking parts of the input image while requiring the target model to recover the missing contents. MIM has demonstrated promising results on downstream tasks, yet we are interested in whether there exist other effective ways to 'learn by recovering missing contents'. In this paper, we investigate this topic by designing five other learning objectives that follow the same procedure as MIM but degrade the input image in different ways. With extensive experiments, we summarize a few design principles for token-based pre-training of vision transformers. In particular, the best practice is obtained by keeping the original image style and enriching spatial masking with spatial misalignment – this design achieves superior performance over MIM on a series of downstream recognition tasks without extra computational cost. The code is available at https://github.com/sunsmarterjie/beyond_masking.
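The abstract does not spell out the exact corruption operators, but the general recipe (degrade an image's patch tokens, e.g. by masking some positions and spatially misaligning others, then train the model to recover the original content) can be sketched roughly as follows. This PyTorch snippet is an illustrative assumption, not the authors' implementation; the function name `corrupt_patches`, the ratios, and the zero mask token are hypothetical.

```python
import torch

def corrupt_patches(patches, mask_ratio=0.4, shuffle_ratio=0.2, mask_token=None):
    """Corrupt a sequence of patch tokens by masking some positions and
    spatially misaligning (shuffling) others. A model can then be trained
    to recover the original tokens, in the spirit of token-based pre-training.

    patches: (B, N, D) tensor of patch embeddings.
    """
    B, N, D = patches.shape
    if mask_token is None:
        # Hypothetical choice: a zero vector stands in for the learnable mask token.
        mask_token = torch.zeros(D, dtype=patches.dtype, device=patches.device)

    corrupted = patches.clone()
    for b in range(B):
        perm = torch.randperm(N)
        n_mask = int(N * mask_ratio)
        n_shuf = int(N * shuffle_ratio)

        # 1) Spatial masking: replace a random subset of tokens with the mask token.
        mask_idx = perm[:n_mask]
        corrupted[b, mask_idx] = mask_token

        # 2) Spatial misalignment: permute another subset of tokens so their
        #    contents no longer match their spatial positions.
        shuf_idx = perm[n_mask:n_mask + n_shuf]
        corrupted[b, shuf_idx] = patches[b, shuf_idx[torch.randperm(n_shuf)]]

    return corrupted


if __name__ == "__main__":
    # Example: a ViT-B/16 style token sequence (batch 2, 196 patches, dim 768).
    x = torch.randn(2, 196, 768)
    x_corrupted = corrupt_patches(x)
    print(x_corrupted.shape)  # torch.Size([2, 196, 768])
```

In a pre-training loop, the corrupted tokens would be fed to the vision transformer and a reconstruction loss computed against the uncorrupted patches; the specific loss and decoder design are not specified in this abstract.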


Related research

05/28/2022  A Closer Look at Self-supervised Lightweight Vision Transformers
Self-supervised learning on large-scale Vision Transformers (ViTs) as pr...

11/02/2021  PatchGame: Learning to Signal Mid-level Patches in Referential Games
We study a referential game (a type of signaling game) where two agents ...

09/18/2023  FactoFormer: Factorized Hyperspectral Transformers with Self-Supervised Pre-Training
Hyperspectral images (HSIs) contain rich spectral and spatial informatio...

05/27/2022  Architecture-Agnostic Masked Image Modeling – From ViT back to CNN
Masked image modeling (MIM), an emerging self-supervised pre-training me...

09/07/2023  DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions
As it is empirically observed that Vision Transformers (ViTs) are quite ...

07/31/2022  SdAE: Self-distillated Masked Autoencoder
With the development of generative-based self-supervised learning (SSL) ...

11/10/2022  Demystify Transformers & Convolutions in Modern Image Deep Networks
Recent success of vision transformers has inspired a series of vision ba...
