Accelerating Vision Transformer Training via a Patch Sampling Schedule

08/19/2022
by Bradley McDanel, et al.

We introduce the notion of a Patch Sampling Schedule (PSS), which varies the number of Vision Transformer (ViT) patches used per batch during training. Since all patches are not equally important for most vision objectives (e.g., classification), we argue that less important patches can be used in fewer training iterations, leading to shorter training time with minimal impact on performance. Additionally, we observe that training with a PSS makes a ViT more robust to a wider patch sampling range during inference. This allows for a fine-grained, dynamic trade-off between throughput and accuracy at inference time. We evaluate PSSs on ViTs for ImageNet, both trained from scratch and pre-trained using a reconstruction loss function. For the pre-trained model, we achieve a 0.26% reduction in classification accuracy for a 31% reduction in training time compared to using all patches each iteration. Code, model checkpoints, and logs are available at https://github.com/BradMcDanel/pss.
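The abstract leaves the mechanics implicit, so here is a minimal PyTorch-style sketch of what varying the patch count per batch could look like. The linear schedule shape, the L2-norm importance score, and the names `patch_keep_ratio` and `sample_patches` are illustrative assumptions for exposition, not the paper's actual method; see the linked repository for the real implementation.

```python
# Illustrative sketch of a Patch Sampling Schedule (PSS) for ViT training.
# The schedule shape and the scoring heuristic below are assumptions, not
# the paper's method; see https://github.com/BradMcDanel/pss for the code.
import torch


def patch_keep_ratio(epoch: int, total_epochs: int,
                     low: float = 0.5, high: float = 1.0) -> float:
    """Linearly ramp the fraction of patches kept from `low` to `high`."""
    t = epoch / max(total_epochs - 1, 1)
    return low + (high - low) * t


def sample_patches(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the top `keep_ratio` fraction of patch tokens per image.

    tokens: (B, N, D) patch embeddings (excluding the [CLS] token).
    Uses the L2 norm of each embedding as a stand-in importance score.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    scores = tokens.norm(dim=-1)              # (B, N) importance per patch
    idx = scores.topk(n_keep, dim=1).indices  # (B, n_keep) kept patch ids
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))


# Per-batch usage inside a hypothetical training loop:
#   ratio = patch_keep_ratio(epoch, total_epochs)
#   x = sample_patches(patch_embeddings, ratio)  # fewer tokens, faster step
#   logits = vit_encoder(x)
```

Because the encoder sees varying token counts during training, the same keep-ratio knob can then be turned at inference time to trade throughput for accuracy, as the abstract describes.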


