FlexiViT: One Model for All Patch Sizes

12/15/2022
by Lucas Beyer, et al.

Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision
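The core idea in the abstract, randomizing the patch size per training step by resizing the patch-embedding kernel, can be sketched in a few lines of numpy. This is a hedged illustration, not the paper's implementation: the function names (`resize_patch_embed`, `tokenize`) and the base kernel shape are made up for this example, and the nearest-neighbor resize is a simplification of the pseudoinverse-based PI-resize that FlexiViT actually uses for the embedding weights.

```python
import numpy as np

def resize_patch_embed(w, new_p):
    """Resize a patch-embedding kernel (p, p, c_in, d) to (new_p, new_p, c_in, d).
    Nearest-neighbor resampling for brevity; FlexiViT uses PI-resize instead."""
    p = w.shape[0]
    idx = (np.arange(new_p) * p / new_p).astype(int)
    return w[idx][:, idx]

def tokenize(img, w, new_p):
    """Embed an (H, W, C) image at patch size new_p; returns (num_tokens, d)."""
    wp = resize_patch_embed(w, new_p)
    h, wd, c = img.shape
    gh, gw = h // new_p, wd // new_p
    # Cut the image into a gh x gw grid of new_p x new_p patches.
    patches = img[:gh * new_p, :gw * new_p].reshape(gh, new_p, gw, new_p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    # Project each flattened patch with the resized embedding weights.
    return patches @ wp.reshape(-1, wp.shape[-1])

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 32, 3, 64))   # one underlying 32x32 kernel
img = rng.normal(size=(240, 240, 3))
for p in (8, 16, 30, 40):              # patch size sampled anew each step
    print(p, tokenize(img, w, p).shape)  # token count shrinks as p grows
```

A single set of weights thus serves every patch size: only the sequence length (and hence compute) changes, which is what lets one model cover a range of deployment budgets.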


Related research

08/19/2022 · Accelerating Vision Transformer Training via a Patch Sampling Schedule
We introduce the notion of a Patch Sampling Schedule (PSS), that varies ...

07/18/2023 · FlexiAST: Flexibility is What AST Needs
The objective of this work is to give patch-size flexibility to Audio Sp...

04/28/2023 · Pre-processing training data improves accuracy and generalisability of convolutional neural network based landscape semantic segmentation
In this paper, we trialled different methods of data preparation for Con...

06/27/2023 · Structured State Space Models for Multiple Instance Learning in Digital Pathology
Multiple instance learning is an ideal mode of analysis for histopatholo...

08/31/2023 · Towards Optimal Patch Size in Vision Transformers for Tumor Segmentation
Detection of tumors in metastatic colorectal cancer (mCRC) plays an esse...

10/12/2020 · On the Minimal Recognizable Image Patch
In contrast to human vision, common recognition algorithms often fail on...

09/07/2023 · DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions
As it is empirically observed that Vision Transformers (ViTs) are quite ...
