AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

by   Lingchen Meng, et al.

Built on top of self-attention mechanisms, vision transformers have recently demonstrated remarkable performance on a variety of vision tasks. While achieving excellent performance, they still require relatively intensive computational cost that scales up drastically as the numbers of patches, self-attention heads and transformer blocks increase. In this paper, we argue that due to the large variations among images, their need for modeling long-range dependencies between patches differs. To this end, we introduce AdaViT, an adaptive computation framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use throughout the backbone on a per-input basis, aiming to improve the inference efficiency of vision transformers with a minimal drop of accuracy for image recognition. Optimized jointly with the transformer backbone in an end-to-end manner, a lightweight decision network is attached to the backbone to produce decisions on the fly. Extensive experiments on ImageNet demonstrate that our method obtains more than a 2x improvement in efficiency compared to state-of-the-art vision transformers with only a 0.8% drop of accuracy, achieving good efficiency/accuracy trade-offs conditioned on different computational budgets. We further conduct quantitative and qualitative analysis on the learned usage policies and provide more insights on the redundancy in vision transformers.
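To make the idea of a per-input usage policy concrete, here is a minimal, hypothetical sketch of the block-gating part of such a decision network: a single linear layer maps a pooled patch embedding to one keep/drop logit per transformer block, so different images activate different subsets of blocks. All names and sizes here are illustrative assumptions, not the paper's implementation (AdaViT also gates patches and attention heads, and trains the hard decisions with a differentiable relaxation, which is omitted).

```python
import math
import random

random.seed(0)

NUM_BLOCKS = 12   # transformer blocks in the backbone (illustrative)
EMBED_DIM = 16    # token embedding width (illustrative)

# Hypothetical lightweight decision head: a single linear layer mapping the
# mean-pooled patch embedding of an image to one keep/drop logit per block.
W = [[random.gauss(0.0, 0.5) for _ in range(NUM_BLOCKS)] for _ in range(EMBED_DIM)]
b = [0.0] * NUM_BLOCKS

def block_usage_policy(patch_embeddings):
    """Return a 0/1 keep mask over transformer blocks for one image.

    patch_embeddings: list of patch vectors, each of length EMBED_DIM.
    Inference uses a hard 0/1 decision; training would require a
    differentiable relaxation of this thresholding, omitted here.
    """
    n = len(patch_embeddings)
    pooled = [sum(p[d] for p in patch_embeddings) / n for d in range(EMBED_DIM)]
    logits = [sum(pooled[d] * W[d][k] for d in range(EMBED_DIM)) + b[k]
              for k in range(NUM_BLOCKS)]
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]   # sigmoid
    return [1 if p > 0.5 else 0 for p in probs]

# Two different images can receive different usage policies,
# so easy inputs may skip more blocks than hard ones.
img_a = [[random.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(196)]
img_b = [[random.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(196)]
mask_a = block_usage_policy(img_a)
mask_b = block_usage_policy(img_b)
print("blocks kept for image A:", sum(mask_a), "of", NUM_BLOCKS)
print("blocks kept for image B:", sum(mask_b), "of", NUM_BLOCKS)
```

The efficiency gain comes from executing only the blocks whose mask entry is 1; skipped blocks pass their input through unchanged.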




BOAT: Bilateral Local Attention Vision Transformer

Vision Transformers achieved outstanding performance in many computer vi...

IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers

The self-attention-based model, transformer, is recently becoming the le...

Exploring Efficient Few-shot Adaptation for Vision Transformers

The task of Few-shot Learning (FSL) aims to do the inference on novel ca...

An Attention Free Transformer

We introduce Attention Free Transformer (AFT), an efficient variant of T...

VidConv: A modernized 2D ConvNet for Efficient Video Recognition

Since being introduced in 2020, Vision Transformers (ViT) have been stead...

OmniNet: Omnidirectional Representations from Transformers

This paper proposes Omnidirectional Representations from Transformers (O...

MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation

ViTs are often too computationally expensive to be fitted onto real-worl...
