iBOT: Image BERT Pre-Training with Online Tokenizer

11/15/2021
by Jinghao Zhou, et al.

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework, iBOT, that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline in which the tokenizer must be pre-trained beforehand. We show the prominence of iBOT by achieving an 81.6% linear probing accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which help the models obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
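To make the objective concrete, the sketch below illustrates the kind of self-distillation the abstract describes: the student predicts, at masked patch positions, the token distributions produced by the teacher (acting as the online tokenizer), with an additional class-token distillation term. This is a minimal PyTorch sketch, not the authors' released code; the function names, the backbone interface, and the temperature values are assumptions made for illustration.

```python
# Minimal sketch (assumed interface, not the official iBOT implementation).
import torch
import torch.nn.functional as F


def distill_loss(student_logits, teacher_logits, student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the sharpened teacher and student token distributions."""
    t = F.softmax(teacher_logits / teacher_temp, dim=-1).detach()  # online tokenizer targets
    s = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(t * s).sum(dim=-1)  # per-token loss


def ibot_style_step(student, teacher, images, mask):
    """One training step sketch.

    images: (B, C, H, W) augmented view
    mask:   (B, N) boolean, True where a patch is blocked out for the student
    student/teacher: assumed to return (cls_logits, patch_logits) with
                     cls_logits (B, K) and patch_logits (B, N, K)
    """
    # Teacher sees the full image and acts as the online tokenizer.
    with torch.no_grad():
        t_cls, t_patch = teacher(images)

    # Student sees the masked image and predicts the teacher's token distributions.
    s_cls, s_patch = student(images, mask=mask)

    # MIM term: self-distillation on masked patch tokens only.
    loss_mim = (distill_loss(s_patch, t_patch) * mask).sum() / mask.sum().clamp(min=1)

    # Class-token term: self-distillation for image-level semantics
    # (the paper pairs two crops; a single view is used here for brevity).
    loss_cls = distill_loss(s_cls, t_cls).mean()

    return loss_mim + loss_cls
```

In practice the teacher would be an exponential moving average of the student, which is what makes the tokenizer "online" and jointly learnable with the MIM objective.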


