AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

11/30/2021
by Lingchen Meng, et al.

Built on top of self-attention mechanisms, vision transformers have recently demonstrated remarkable performance on a variety of vision tasks. While achieving excellent results, they still incur a computational cost that grows drastically as the number of patches, self-attention heads, and transformer blocks increases. In this paper, we argue that due to the large variations among images, their needs for modeling long-range dependencies between patches differ. To this end, we introduce AdaViT, an adaptive computation framework that learns to derive usage policies on which patches, self-attention heads, and transformer blocks to use throughout the backbone on a per-input basis, aiming to improve the inference efficiency of vision transformers with a minimal drop in accuracy for image recognition. A light-weight decision network is attached to the backbone and optimized jointly with it in an end-to-end manner to produce decisions on the fly. Extensive experiments on ImageNet demonstrate that our method obtains more than a 2x improvement in efficiency compared to state-of-the-art vision transformers with only a 0.8% drop in accuracy, achieving good efficiency/accuracy trade-offs conditioned on different computational budgets. We further conduct quantitative and qualitative analysis of the learned usage policies and provide more insights on the redundancy in vision transformers.
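To make the mechanism concrete, below is a minimal PyTorch sketch of the kind of light-weight decision head the abstract describes: it looks at a block's input tokens and emits keep/drop masks for individual patches, self-attention heads, and the block itself, using straight-through Gumbel-Softmax so the hard binary decisions remain differentiable and trainable end-to-end with the backbone. The class name `DecisionHead`, the pooling scheme, and the dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecisionHead(nn.Module):
    """Hypothetical light-weight decision head for one transformer block.

    Given the block's input tokens, it emits keep/drop masks for
    (a) each patch token, (b) each self-attention head, and (c) the
    block itself. Gumbel-Softmax keeps the binary decisions
    differentiable so the head can be trained end-to-end.
    """

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.patch_logits = nn.Linear(dim, 2)             # per-patch keep/drop
        self.head_logits = nn.Linear(dim, num_heads * 2)  # per-head keep/drop
        self.block_logits = nn.Linear(dim, 2)             # whole-block keep/drop
        self.num_heads = num_heads

    def forward(self, x: torch.Tensor, tau: float = 1.0):
        # x: (batch, num_patches, dim) tokens entering the block.
        pooled = x.mean(dim=1)  # (batch, dim) summary for head/block decisions

        # Straight-through Gumbel-Softmax: hard 0/1 samples in the forward
        # pass, soft gradients in the backward pass.
        patch_keep = F.gumbel_softmax(
            self.patch_logits(x), tau=tau, hard=True)[..., 0]
        head_keep = F.gumbel_softmax(
            self.head_logits(pooled).view(-1, self.num_heads, 2),
            tau=tau, hard=True)[..., 0]
        block_keep = F.gumbel_softmax(
            self.block_logits(pooled), tau=tau, hard=True)[..., 0]
        return patch_keep, head_keep, block_keep  # masks with values in {0, 1}


# Toy usage with ViT-S-like shapes (hypothetical sizes).
tokens = torch.randn(4, 196, 384)
head = DecisionHead(dim=384, num_heads=6)
patch_keep, head_keep, block_keep = head(tokens)
print(patch_keep.shape, head_keep.shape, block_keep.shape)
# torch.Size([4, 196]) torch.Size([4, 6]) torch.Size([4])
```

In a full model, such masks would gate patch tokens, zero out attention heads, and skip blocks entirely, with a usage loss steering the policy toward a target computational budget as described in the abstract.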


Related research

01/31/2022  BOAT: Bilateral Local Attention Vision Transformer
Vision Transformers achieved outstanding performance in many computer vi...

06/23/2021  IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers
The self-attention-based model, transformer, is recently becoming the le...

01/06/2023  Exploring Efficient Few-shot Adaptation for Vision Transformers
The task of Few-shot Learning (FSL) aims to do the inference on novel ca...

05/28/2021  An Attention Free Transformer
We introduce Attention Free Transformer (AFT), an efficient variant of T...

07/08/2022  VidConv: A modernized 2D ConvNet for Efficient Video Recognition
Since being introduced in 2020, Vision Transformers (ViT) has been stead...

03/01/2021  OmniNet: Omnidirectional Representations from Transformers
This paper proposes Omnidirectional Representations from Transformers (O...

12/21/2021  MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation
ViTs are often too computationally expensive to be fitted onto real-worl...
