AdaViT: Adaptive Tokens for Efficient Vision Transformer

12/14/2021
by Hongxu Yin, et al.

We introduce AdaViT, a method that adaptively adjusts the inference cost of a vision transformer (ViT) for images of different complexity. AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds. We reformulate Adaptive Computation Time (ACT) for this task, extending halting to discard redundant spatial tokens. The appealing architectural properties of vision transformers enable our adaptive token reduction mechanism to speed up inference without modifying the network architecture or inference hardware. We demonstrate that AdaViT requires no extra parameters or sub-network for halting, as we base the learning of adaptive halting on the original network parameters. We further introduce a distributional prior regularization that stabilizes training compared to prior ACT approaches. On the ImageNet-1K image classification task, we show that the proposed AdaViT is highly effective at retaining informative spatial features while cutting down the overall compute, improving the throughput of DeiT-Tiny by 62% with only a small drop in accuracy, outperforming prior art by a large margin.
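To make the halting mechanism concrete, below is a minimal PyTorch sketch of ACT-style per-token halting in a ViT. It assumes standard, unmodified transformer blocks; the class name AdaptiveTokenViT, the scalar parameters gamma and beta, and the threshold eps are illustrative assumptions, not the authors' released implementation. The point it illustrates matches the abstract: the halting score is read off the existing token embeddings (here, the first channel), so no dedicated halting sub-network is introduced.

```python
import torch
import torch.nn as nn

class AdaptiveTokenViT(nn.Module):
    """ViT wrapper with ACT-style per-token halting (illustrative sketch)."""

    def __init__(self, blocks: nn.ModuleList, eps: float = 0.01):
        super().__init__()
        self.blocks = blocks   # standard, unmodified ViT blocks
        self.eps = eps         # a token halts once its score reaches 1 - eps
        # Scalar scale/shift for the halting score; no halting sub-network.
        self.gamma = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (batch, tokens, dim)
        B, N, _ = x.shape
        cum_score = x.new_zeros(B, N)  # cumulative halting score per token
        active = torch.ones(B, N, dtype=torch.bool, device=x.device)
        for blk in self.blocks:
            # Update only still-active tokens (masked here for clarity;
            # a real implementation would drop halted tokens instead).
            x = torch.where(active.unsqueeze(-1), blk(x), x)
            # Halting probability derived from the first embedding channel,
            # reusing the existing token representation.
            h = torch.sigmoid(self.gamma * x[..., 0] + self.beta)
            cum_score = cum_score + h * active
            active = active & (cum_score < 1.0 - self.eps)
            if not active.any():  # all tokens halted: stop early
                break
        return x
```

Note that the sketch masks halted tokens for clarity; the actual speedup comes from physically removing halted tokens from the sequence, so that subsequent attention and MLP layers process fewer tokens.

The distributional prior regularization might look like the following sketch: a KL term that pulls the layer-wise distribution of halting mass toward a target prior, here a Gaussian over layer indices. The Gaussian form and the names halting_prior_loss, target_layer, and sigma are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def halting_prior_loss(halting_scores: torch.Tensor,
                       target_layer: float = 6.0,
                       sigma: float = 1.0) -> torch.Tensor:
    """halting_scores: (layers, batch, tokens) per-layer halting probs."""
    L = halting_scores.shape[0]
    # Empirical distribution of halting mass over layers, averaged over
    # batch and tokens, normalized to sum to one.
    p = halting_scores.mean(dim=(1, 2))
    p = p / p.sum().clamp_min(1e-8)
    # Target prior: a Gaussian over layer indices, centered on target_layer.
    idx = torch.arange(L, dtype=p.dtype, device=p.device)
    q = torch.exp(-0.5 * ((idx - target_layer) / sigma) ** 2)
    q = q / q.sum()
    # KL(q || p): penalize halting distributions far from the prior.
    return F.kl_div(p.clamp_min(1e-8).log(), q, reduction="sum")
```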


Related research

11/30/2021
ATS: Adaptive Token Sampling For Efficient Vision Transformers
While state-of-the-art vision transformer models achieve promising resul...

08/03/2021
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
Vision transformers have recently received explosive popularity, but the...

05/31/2021
Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length
Vision Transformers (ViT) have achieved remarkable success in large-scal...

05/22/2022
Dynamic Query Selection for Fast Visual Perceiver
Transformers have been matching deep convolutional networks for vision a...

09/21/2021
DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers
Dynamic networks have shown their promising capability in reducing theor...

12/21/2021
MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation
ViTs are often too computationally expensive to be fitted onto real-worl...

03/23/2023
MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer
Mobile monocular 3D object detection (Mono3D) (e.g., on a vehicle, a dro...
