BEiT: BERT Pre-Training of Image Transformers

06/15/2021
by Hangbo Bao, et al.

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels) and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.
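
The pre-training objective described above can be summarized in a short piece of code. Below is a minimal PyTorch sketch of the masked image modeling loss, assuming stand-in `encoder` (a ViT backbone returning per-patch features) and `tokenizer` (a frozen image tokenizer returning discrete token ids) callables; the names, shapes, and defaults here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MaskedImageModeling(nn.Module):
    """Sketch of a BEiT-style masked image modeling objective.

    `encoder` and `tokenizer` are hypothetical stand-ins: BEiT uses a ViT
    backbone and the discrete VAE from DALL-E as a frozen tokenizer.
    """

    def __init__(self, encoder, tokenizer, embed_dim=768, vocab_size=8192):
        super().__init__()
        self.encoder = encoder        # ViT backbone: (B, N, D) -> (B, N, D)
        self.tokenizer = tokenizer    # frozen tokenizer: images -> (B, N) token ids
        self.mask_embed = nn.Parameter(torch.zeros(1, 1, embed_dim))  # learnable [MASK]
        self.head = nn.Linear(embed_dim, vocab_size)  # predicts visual token ids

    def forward(self, images, patch_embeds, mask):
        # patch_embeds: (B, N, D) patch embeddings; mask: (B, N) bool, True = masked
        with torch.no_grad():
            targets = self.tokenizer(images)  # (B, N) ground-truth visual tokens

        # Replace masked patch embeddings with the shared [MASK] embedding,
        # then let the Transformer contextualize the corrupted sequence.
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_embed.expand_as(patch_embeds),
                        patch_embeds)
        feats = self.encoder(x)  # (B, N, D)

        # Cross-entropy only on masked positions: recover the original tokens.
        logits = self.head(feats[mask])  # (num_masked, vocab_size)
        return nn.functional.cross_entropy(logits, targets[mask])
```

Per the paper, the tokenizer vocabulary has 8192 visual tokens and roughly 40% of the patches are masked with a blockwise strategy; the loss is computed only at masked positions.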

Related research

02/07/2022
Corrupted Image Modeling for Self-Supervised Visual Pre-Training
We introduce Corrupted Image Modeling (CIM) for self-supervised visual p...

11/24/2021
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers
This paper explores a better codebook for BERT pre-training of vision tr...

06/29/2023
Learning Nuclei Representations with Masked Image Modelling
Masked image modelling (MIM) is a powerful self-supervised representatio...

06/12/2023
Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training
The use of self-supervised pre-training has emerged as a promising appro...

03/08/2023
Centroid-centered Modeling for Efficient Vision Transformer Pre-training
Masked Image Modeling (MIM) is a new self-supervised vision pre-training...

08/12/2022
BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
Masked image modeling (MIM) has demonstrated impressive results in self-...

05/30/2022
HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling
Recently, masked image modeling (MIM) has offered a new methodology of s...
