DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions

09/07/2023
by Haochen Wang, et al.

It is empirically observed that Vision Transformers (ViTs) are quite insensitive to the order of input tokens, which makes the need for an appropriate self-supervised pretext task that enhances the location awareness of ViTs increasingly evident. To address this, we present DropPos, a novel pretext task designed to reconstruct Dropped Positions. The formulation of DropPos is simple: we first drop a large random subset of positional embeddings, and the model then classifies the actual position of each non-overlapping patch among all possible positions, based solely on its visual appearance. To avoid trivial solutions, we increase the difficulty of the task by keeping only a subset of patches visible. Additionally, since different patches may have similar visual appearances, we propose position smoothing and attentive reconstruction strategies to relax this classification problem; in such cases it is not necessary to reconstruct the exact positions. Empirical evaluations show that DropPos is a strong pre-training objective: it outperforms supervised pre-training and achieves competitive results against state-of-the-art self-supervised alternatives on a wide range of downstream benchmarks. This suggests that explicitly encouraging spatial reasoning, as DropPos does, indeed contributes to the improved location awareness of ViTs. The code is publicly available at https://github.com/Haochen-Wang409/DropPos.
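To make the formulation concrete, below is a minimal PyTorch sketch of the objective as the abstract describes it: each visible patch whose positional embedding was dropped must classify its own index among all N grid positions, and position smoothing relaxes the one-hot target into a distribution weighted by 2D grid distance. The names (smoothed_targets, droppos_loss, head) and the Gaussian form of the smoothing are illustrative assumptions rather than the paper's exact implementation, and the attentive reconstruction term is omitted for brevity.

    # A minimal, self-contained sketch of a DropPos-style objective. Assumed toy
    # setup: `features` are per-patch ViT outputs (B, M, D) for M visible patches,
    # `true_pos` holds each patch's ground-truth index among N = grid_size^2
    # positions, and `head` is a hypothetical linear classifier over positions.
    import torch
    import torch.nn.functional as F

    def smoothed_targets(true_pos, grid_size, sigma=1.0):
        """Soft targets: weight every position by its 2D grid distance to the true one
        (a Gaussian relaxation, assumed here for illustration)."""
        N = grid_size * grid_size
        coords = torch.stack(torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
        ), dim=-1).reshape(N, 2).float()                      # (N, 2) grid coordinates
        true_xy = coords[true_pos]                            # (B, M, 2)
        dist2 = ((coords[None, None] - true_xy[:, :, None]) ** 2).sum(-1)  # (B, M, N)
        targets = torch.exp(-dist2 / (2 * sigma ** 2))
        return targets / targets.sum(-1, keepdim=True)        # normalize to a distribution

    def droppos_loss(features, true_pos, head, grid_size, sigma=1.0):
        """Classify each visible patch's position among all N possible slots,
        using a soft cross-entropy against the smoothed targets."""
        logits = head(features)                               # (B, M, N) position scores
        targets = smoothed_targets(true_pos, grid_size, sigma)
        log_probs = F.log_softmax(logits, dim=-1)
        return -(targets * log_probs).sum(-1).mean()

    # Toy usage: a 14x14 grid (196 positions), 49 visible patches, 256-d features.
    B, M, D, G = 2, 49, 256, 14
    head = torch.nn.Linear(D, G * G)
    features = torch.randn(B, M, D)
    true_pos = torch.randint(0, G * G, (B, M))
    loss = droppos_loss(features, true_pos, head, G)
    loss.backward()

Setting sigma toward zero recovers plain position classification with one-hot targets; larger sigma tolerates small localization errors between visually similar neighboring patches, which is the motivation the abstract gives for position smoothing.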

Related research

06/12/2023 · Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training
The use of self-supervised pre-training has emerged as a promising appro...

09/18/2023 · FactoFormer: Factorized Hyperspectral Transformers with Self-Supervised Pre-Training
Hyperspectral images (HSIs) contain rich spectral and spatial informatio...

03/27/2022 · Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers
The past year has witnessed a rapid development of masked image modeling...

06/08/2023 · Improving Visual Prompt Tuning for Self-supervised Vision Transformers
Visual Prompt Tuning (VPT) is an effective tuning method for adapting pr...

05/23/2022 · PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
Vision-language pre-training (VLP) has shown impressive performance on a...

07/31/2022 · SdAE: Self-distillated Masked Autoencoder
With the development of generative-based self-supervised learning (SSL) ...

12/15/2022 · FlexiViT: One Model for All Patch Sizes
Vision Transformers convert images to sequences by slicing them into pat...
