PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

11/24/2021
by Xiaoyi Dong, et al.

This paper explores a better codebook for BERT pre-training of vision transformers. The recent work BEiT successfully transfers BERT pre-training from NLP to the vision field. It directly adopts a simple discrete VAE as the visual tokenizer, but does not consider the semantic level of the resulting visual tokens. By contrast, discrete tokens in the NLP field are naturally highly semantic. This difference motivates us to learn a perceptual codebook, and we surprisingly find one simple yet effective idea: enforcing perceptual similarity during dVAE training. We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings, and subsequently help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive method BEiT by +1.3 with the same number of pre-training epochs. It also improves object detection and segmentation on COCO val by +1.3 box AP and +1.0 mask AP, and semantic segmentation on ADE20K by +1.0 mIoU. The code and models will be available at <https://github.com/microsoft/PeCo>.
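The core idea above, enforcing perceptual similarity during dVAE training, amounts to adding a feature-space reconstruction term to the usual pixel-space loss. Below is a minimal, hypothetical sketch of that loss shape in plain Python: `extract_features` is a toy stand-in for the deep feature extractor the paper uses (PeCo compares multi-scale deep features, not this linear map), and `perceptual_weight` is an assumed name for the balancing coefficient.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def extract_features(pixels):
    # Hypothetical stand-in for a deep feature extractor: pools adjacent
    # pixels. In PeCo this would be intermediate activations of a network.
    return [sum(pixels[i:i + 2]) for i in range(0, len(pixels), 2)]

def dvae_loss(original, reconstruction, perceptual_weight=1.0):
    """Pixel reconstruction loss plus a weighted perceptual (feature) loss.

    The perceptual term penalizes reconstructions whose *features* differ
    from the original, pushing the codebook toward semantic content.
    """
    pixel_loss = mse(original, reconstruction)
    perceptual_loss = mse(extract_features(original),
                          extract_features(reconstruction))
    return pixel_loss + perceptual_weight * perceptual_loss
```

With `perceptual_weight=0` this reduces to a plain pixel-level dVAE objective; the extra term is what distinguishes the perceptual codebook.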


Related research

- BEiT: BERT Pre-Training of Image Transformers (06/15/2021)
  We introduce a self-supervised vision representation model BEiT, which s...
- Bootstrapped Masked Autoencoders for Vision BERT Pretraining (07/14/2022)
  We propose bootstrapped masked autoencoders (BootMAE), a new approach fo...
- Exploring Long-Sequence Masked Autoencoders (10/13/2022)
  Masked Autoencoding (MAE) has emerged as an effective approach for pre-t...
- Reversible Column Networks (12/22/2022)
  We propose a new neural network design paradigm Reversible Column Networ...
- Kaleido-BERT: Vision-Language Pre-training on Fashion Domain (03/30/2021)
  We present a new vision-language (VL) pre-training model dubbed Kaleido-...
- Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals (02/11/2021)
  Being able to learn dense semantic representations of images without sup...
- Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves (03/02/2023)
  Formula-driven supervised learning (FDSL) has been shown to be an effect...
