Vision Transformers with Mixed-Resolution Tokenization

04/01/2023
by Tomer Ronen et al.

Vision Transformer models process input images by dividing them into a spatially regular grid of equal-size patches. In contrast, Transformers were originally introduced over natural language sequences, where each token represents a subword, i.e., a chunk of raw data of arbitrary size. In this work, we apply this approach to Vision Transformers by introducing a novel image tokenization scheme, replacing the standard uniform grid with a mixed-resolution sequence of tokens, where each token represents a patch of arbitrary size. Using the Quadtree algorithm and a novel saliency scorer, we construct a patch mosaic in which low-saliency areas of the image are processed at low resolution, routing more of the model's capacity to important image regions. Using the same architecture as vanilla ViTs, our Quadformer models achieve substantial accuracy gains on image classification when controlling for the computational budget. Code and models are publicly available at https://github.com/TomerRonen34/mixed-resolution-vit .
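The core idea, quadtree tokenization, can be sketched in a few lines: start from coarse patches and recursively split any patch whose saliency exceeds a threshold, until a minimum patch size is reached. The sketch below is a minimal illustration, not the paper's implementation; in particular, the variance-based `saliency` function is a hypothetical stand-in for the paper's learned saliency scorer, and the function names and parameters are assumptions.

```python
import numpy as np

def saliency(patch):
    # Hypothetical saliency scorer: pixel-intensity variance.
    # The paper proposes a novel scorer; variance is only a stand-in here.
    return float(np.var(patch))

def quadtree_tokenize(image, min_size=16, max_size=64, threshold=100.0):
    """Split an image into variable-size square patches via a quadtree.

    High-saliency regions are recursively split down to `min_size`
    patches (fine resolution); low-saliency regions remain as large
    patches (coarse resolution). Returns a list of (y, x, size) tuples
    describing the resulting patch mosaic.
    """
    patches = []

    def split(y, x, size):
        patch = image[y:y + size, x:x + size]
        if size > min_size and saliency(patch) > threshold:
            half = size // 2
            for dy in (0, half):           # recurse into the four quadrants
                for dx in (0, half):
                    split(y + dy, x + dx, half)
        else:
            patches.append((y, x, size))   # keep this patch as one token

    h, w = image.shape[:2]
    for y in range(0, h, max_size):        # tile the image with coarse cells
        for x in range(0, w, max_size):
            split(y, x, max_size)
    return patches
```

Each returned patch would then be resized to the model's base patch resolution and embedded as a single token, so the sequence length depends on image content rather than image size alone.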


Related research

- 04/22/2021 · Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet
  This paper provides a strong baseline for vision transformers on the Ima...

- 04/19/2022 · Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer
  Vision transformers have achieved great successes in many computer visio...

- 07/05/2023 · MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
  The input tokens to Vision Transformers carry little semantic meaning as...

- 05/31/2021 · MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens
  Transformers have offered a new methodology of designing neural networks...

- 11/06/2022 · ViT-CX: Causal Explanation of Vision Transformers
  Despite the popularity of Vision Transformers (ViTs) and eXplainable AI ...

- 05/23/2022 · Super Vision Transformer
  We attempt to reduce the computational costs in vision transformers (ViT...

- 08/10/2022 · PatchDropout: Economizing Vision Transformers Using Patch Dropout
  Vision transformers have demonstrated the potential to outperform CNNs i...
