Multi-Tailed Vision Transformer for Efficient Inference

03/03/2022
by Yunke Wang, et al.

Recently, the Vision Transformer (ViT) has achieved promising performance in image recognition and increasingly serves as a powerful backbone for various vision tasks. To satisfy the Transformer's sequential input format, the tail of ViT first splits each image into a sequence of visual tokens of a fixed length. The following self-attention layers then construct global relationships between the tokens to produce useful representations for downstream tasks. Empirically, representing an image with more tokens leads to better performance, but the computational complexity of the self-attention layers, which is quadratic in the number of tokens, severely limits the efficiency of ViT's inference. To reduce computation, a few pruning methods progressively discard uninformative tokens inside the Transformer encoder, while leaving the number of tokens fed into the Transformer untouched. In fact, feeding the Transformer encoder fewer tokens directly reduces all of the subsequent computational cost. In this spirit, we propose the Multi-Tailed Vision Transformer (MT-ViT) in this paper. MT-ViT adopts multiple tails that produce visual sequences of different lengths for the following Transformer encoder, and a tail predictor is introduced to decide which tail is the most efficient choice for producing an accurate prediction for a given image. Both modules are optimized in an end-to-end fashion with the Gumbel-Softmax trick. Experiments on ImageNet-1K demonstrate that MT-ViT achieves a significant reduction in FLOPs without degrading accuracy and outperforms the compared methods in both accuracy and FLOPs.
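As a rough illustration of the idea described in the abstract (not the authors' implementation), the sketch below shows how multiple patch-embedding tails with different patch sizes, a lightweight tail predictor, and a Gumbel-Softmax gate could be wired together in PyTorch. All module names, patch sizes, and the predictor architecture are assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTailedViT(nn.Module):
    # Illustrative sketch only: several patch-embedding "tails" with different
    # patch sizes produce token sequences of different lengths; a lightweight
    # predictor selects one tail per image via the Gumbel-Softmax trick so the
    # discrete choice remains differentiable during training.
    def __init__(self, encoder, embed_dim=768, patch_sizes=(16, 32)):
        super().__init__()
        self.tails = nn.ModuleList(
            nn.Conv2d(3, embed_dim, kernel_size=p, stride=p) for p in patch_sizes
        )
        # Tiny predictor scoring which tail (i.e., which sequence length) to use.
        self.tail_predictor = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),
            nn.Flatten(),
            nn.Linear(3 * 8 * 8, len(patch_sizes)),
        )
        self.encoder = encoder  # any module mapping (B, N, D) -> (B, N, D)

    def forward(self, x, tau=1.0):
        logits = self.tail_predictor(x)                      # (B, num_tails)
        gate = F.gumbel_softmax(logits, tau=tau, hard=True)  # one-hot, differentiable
        pooled = []
        for tail in self.tails:
            tokens = tail(x).flatten(2).transpose(1, 2)      # (B, N_i, D)
            pooled.append(self.encoder(tokens).mean(dim=1))  # (B, D)
        # Training: run every tail so gradients flow through the gate.
        # Inference: one would run only the tail selected by argmax(logits).
        return sum(g.unsqueeze(1) * f for g, f in zip(gate.unbind(dim=1), pooled))

A minimal usage example with a standard PyTorch Transformer encoder (hyperparameters are placeholders, not the paper's configuration):

enc_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
model = MultiTailedViT(nn.TransformerEncoder(enc_layer, num_layers=12))
features = model(torch.randn(2, 3, 224, 224))  # (2, 768) pooled features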

Related research

Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length (05/31/2021)
Vision Transformers (ViT) have achieved remarkable success in large-scal...

CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer (06/17/2022)
Transformer has achieved great successes in learning vision and language...

MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation (12/21/2021)
ViTs are often too computationally expensive to be fitted onto real-worl...

Transkimmer: Transformer Learns to Layer-wise Skim (05/15/2022)
Transformer architecture has become the de-facto model for many machine ...

Discrete Representations Strengthen Vision Transformer Robustness (11/20/2021)
Vision Transformer (ViT) is emerging as the state-of-the-art architectur...

X-ViT: High Performance Linear Vision Transformer without Softmax (05/27/2022)
Vision transformers have become one of the most important models for com...

CabViT: Cross Attention among Blocks for Vision Transformer (11/14/2022)
Since the vision transformer (ViT) has achieved impressive performance i...
