Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

08/03/2021
by Yifan Xu, et al.

Vision transformers have recently received explosive popularity, but their huge computational cost remains a severe issue. Recent efficient designs for vision transformers follow two pipelines, namely, structural compression based on local spatial prior and non-structural token pruning. However, token pruning breaks the spatial structure that is indispensable to the local spatial prior. To take advantage of both pipelines, this work seeks to dynamically identify uninformative tokens for each instance and trim down both the training and inference complexity while maintaining complete spatial structure and information flow. To achieve this goal, we propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers. Specifically, we conduct unstructured instance-wise token selection by taking advantage of the global class attention that is unique to vision transformers. Then, we propose to update the informative tokens and the placeholder tokens that contribute little to the final prediction with different computational priorities, namely, slow-fast updating. Thanks to the slow-fast updating mechanism that guarantees information flow and spatial structure, our Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that the proposed method significantly reduces the computational cost of vision transformers while maintaining comparable performance on image classification. For example, our method accelerates DeiT-S by over 60% throughput while only sacrificing 0.4% top-1 accuracy.
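To make the slow-fast updating concrete, below is a minimal PyTorch sketch of one Evo-ViT-style block. It is an illustration under stated assumptions, not the authors' implementation: the module name SlowFastTokenBlock, the keep_ratio of 0.5, and the mean-pooled representative token are placeholders for the paper's actual selection and summarization scheme, and the class-attention scores cls_attn are assumed to be carried over from the previous block.

```python
import torch
import torch.nn as nn


class SlowFastTokenBlock(nn.Module):
    """Illustrative slow-fast token evolution block (a sketch, not official Evo-ViT code)."""

    def __init__(self, dim, num_heads=6, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio  # fraction of patch tokens treated as informative (assumed)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, cls_attn):
        # x: (B, 1 + N, C), class token at index 0.
        # cls_attn: (B, N), attention of the class token over the N patch
        # tokens, assumed to come from the previous block (a stand-in for
        # the paper's global class attention score).
        B, L, C = x.shape
        n_keep = int((L - 1) * self.keep_ratio)

        # Unstructured, instance-wise token selection by class-attention rank.
        order = cls_attn.argsort(dim=1, descending=True)
        keep_idx = order[:, :n_keep] + 1   # +1: skip the class token
        drop_idx = order[:, n_keep:] + 1

        def gather(idx):
            return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))

        informative = gather(keep_idx)     # slow-path tokens
        placeholder = gather(drop_idx)     # fast-path tokens

        # Fast path: summarize placeholders into one representative token.
        rep = placeholder.mean(dim=1, keepdim=True)

        # Slow path: full MSA + MLP over [cls, informative, representative].
        slow = torch.cat([x[:, :1], informative, rep], dim=1)
        h = self.norm1(slow)
        slow = slow + self.attn(h, h, h, need_weights=False)[0]
        slow = slow + self.mlp(self.norm2(slow))

        # Broadcast the representative token's update back to every
        # placeholder, so no token (and no spatial position) is lost.
        placeholder = placeholder + (slow[:, -1:] - rep)

        # Scatter all tokens back to their original positions.
        out = x.clone()
        out[:, :1] = slow[:, :1]
        out.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, C), slow[:, 1:-1])
        out.scatter_(1, drop_idx.unsqueeze(-1).expand(-1, -1, C), placeholder)
        return out


# Usage on DeiT-S-like shapes (384-dim, 14x14 patches + class token):
blk = SlowFastTokenBlock(dim=384)
x = torch.randn(2, 1 + 196, 384)
cls_attn = torch.rand(2, 196)   # assumed scores from the previous block
out = blk(x, cls_attn)          # (2, 197, 384), all positions preserved
```

The property the sketch tries to preserve is the one the abstract emphasizes: placeholder tokens are never discarded. They receive only a cheap residual update via the representative token, so the complete token grid, and hence the spatial structure, survives into the next block.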


Related research

01/28/2021 - Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Transformers, which are popular for language modeling, have been explore...

04/21/2023 - Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers
Although vision transformers (ViTs) have shown promising results in vari...

11/24/2021 - Self-slimmed Vision Transformer
Vision transformers (ViTs) have become the popular structures and outper...

12/14/2021 - AdaViT: Adaptive Tokens for Efficient Vision Transformer
We introduce AdaViT, a method that adaptively adjusts the inference cost...

05/17/2023 - CageViT: Convolutional Activation Guided Efficient Vision Transformer
Recently, Transformers have emerged as the go-to architecture for both v...

06/12/2023 - Revisiting Token Pruning for Object Detection and Instance Segmentation
Vision Transformers (ViTs) have shown impressive performance in computer...

11/30/2021 - A Unified Pruning Framework for Vision Transformers
Recently, vision transformer (ViT) and its variants have achieved promis...
