AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+

03/14/2023
by Xiao Wang, et al.

Unsupervised learning of vision transformers seeks to pretrain an encoder via pretext tasks without labels. Among them is Masked Image Modeling (MIM), which aligns with the pretraining of language transformers by predicting masked patches as the pretext task. A criterion in unsupervised pretraining is that the pretext task needs to be sufficiently hard to prevent the transformer encoder from learning trivial low-level features that do not generalize well to downstream tasks. For this purpose, we propose an Adversarial Positional Embedding (AdPE) approach: it distorts local visual structures by perturbing the position encodings so that the learned transformer cannot simply rely on locally correlated patches to predict the missing ones. We hypothesize that this forces the transformer encoder to learn more discriminative features in a global context, with stronger generalizability to downstream tasks. We consider both absolute and relative positional encodings, where adversarial positions can be imposed in both the embedding mode and the coordinate mode. We also present a new MAE+ baseline that, together with AdPE, brings MIM pretraining performance to a new level. Experiments demonstrate that our approach improves the fine-tuning accuracy of MAE by 0.8% and 0.4% when pretraining ViT-B and ViT-L for 1600 epochs on ImageNet-1K. For transfer learning, it outperforms MAE with the ViT-B backbone by 2.6% in mIoU on ADE20K, and by 3.2% in AP^bbox and 1.6% in AP^mask on COCO, respectively. These results are obtained with AdPE as a pure MIM approach that uses no extra models or external datasets for pretraining. The code is available at https://github.com/maple-research-lab/AdPE.
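To make the core idea concrete, the following is a minimal PyTorch-style sketch of one training step of adversarial positional embeddings in the embedding mode with absolute positions: the positional embedding is perturbed by a single gradient-ascent (FGSM-like) step that maximizes the masked-reconstruction loss, and the model is then updated on the perturbed positions. The model interface (`pos_embed`, `masked_reconstruction_loss`) and the step size `alpha` are illustrative assumptions for this sketch, not the authors' actual implementation.

```python
import torch

def adpe_step(model, images, optimizer, alpha=1e-2):
    """One illustrative AdPE-style training step (embedding mode, absolute positions).

    Assumes a hypothetical MAE-style model exposing:
      - model.pos_embed: learnable positional embedding of shape (1, num_patches + 1, dim)
      - model.masked_reconstruction_loss(images, pos_embed): masked-patch reconstruction loss
    """
    pos_embed = model.pos_embed

    # 1) Inner maximization: one FGSM-like ascent step on the positional embeddings,
    #    distorting positions so the reconstruction task becomes harder.
    delta = torch.zeros_like(pos_embed, requires_grad=True)
    loss = model.masked_reconstruction_loss(images, pos_embed + delta)
    loss.backward()
    with torch.no_grad():
        delta_adv = alpha * delta.grad.sign()

    # 2) Outer minimization: standard descent on the model parameters,
    #    using the adversarially perturbed positions as input.
    optimizer.zero_grad()
    loss_adv = model.masked_reconstruction_loss(images, pos_embed + delta_adv)
    loss_adv.backward()
    optimizer.step()
    return loss_adv.item()
```

The min-max structure is the point of the sketch: the perturbation is chosen to break the usefulness of local positional cues, while the encoder is trained to reconstruct masked patches despite the distorted positions.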

