Vision Transformer with Progressive Sampling

08/03/2021
by Xiaoyu Yue, et al.

Transformers with powerful global relation modeling abilities have recently been introduced to fundamental computer vision tasks. As a typical example, the Vision Transformer (ViT) applies a pure transformer architecture directly to image classification by splitting images into tokens of a fixed length and employing transformers to learn relations between these tokens. However, such naive tokenization can destroy object structures, assign grids to uninteresting regions such as background, and introduce interference signals. To mitigate these issues, in this paper we propose an iterative and progressive sampling strategy to locate discriminative regions. At each iteration, embeddings of the current sampling step are fed into a transformer encoder layer, and a group of sampling offsets is predicted to update the sampling locations for the next step. The progressive sampling is differentiable. When combined with the Vision Transformer, the resulting PS-ViT network can adaptively learn where to look. The proposed PS-ViT is both effective and efficient. When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy, with about 4× fewer parameters and 10× fewer FLOPs. Code is available at https://github.com/yuexy/PS-ViT.

Related research

01/28/2021 · Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Transformers, which are popular for language modeling, have been explore...

04/19/2022 · Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
Vision transformers have achieved great successes in many computer visio...

06/19/2023 · RaViTT: Random Vision Transformer Tokens
Vision Transformers (ViTs) have successfully been applied to image class...

03/28/2022 · Automated Progressive Learning for Efficient Training of Vision Transformers
Recent advances in vision Transformers (ViTs) have come with a voracious...

03/17/2021 · ENCONTER: Entity Constrained Progressive Sequence Generation via Insertion-based Transformer
Pretrained using large amount of data, autoregressive language models ar...

03/26/2023 · Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers
Vision transformers have recently shown strong global context modeling c...

05/28/2022 · MDMLP: Image Classification from Scratch on Small Datasets with MLP
The attention mechanism has become a go-to technique for natural languag...