PSViT: Better Vision Transformer via Token Pooling and Attention Sharing

08/07/2021
by Boyu Chen, et al.

In this paper, we observe two levels of redundancy when applying vision transformers (ViT) to image recognition. First, fixing the number of tokens throughout the whole network produces redundant features at the spatial level. Second, the attention maps of different transformer layers are redundant. Based on these observations, we propose PSViT: a ViT with token Pooling and attention Sharing that reduces the redundancy, effectively enhancing the feature representation ability and achieving a better speed-accuracy trade-off. Specifically, in our PSViT, token pooling is defined as the operation that decreases the number of tokens at the spatial level, while attention sharing is built between neighboring transformer layers to reuse the attention maps, which are strongly correlated across adjacent layers. A compact set of the possible combinations of token pooling and attention sharing mechanisms is then constructed. Based on this compact set, the number of tokens in each layer and the choice of layers that share attention can be treated as hyper-parameters learned automatically from data. Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification compared with the DeiT baseline.
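To make the two mechanisms concrete, the sketch below shows one possible realization in PyTorch: a token pooling module that keeps the top-k tokens ranked by a learned score, and a self-attention layer that can reuse an attention map computed by a neighboring layer. The module names (TokenPooling, SharedAttentionLayer), the score-based top-k pooling rule, and all hyper-parameters are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TokenPooling(nn.Module):
    """Illustrative token pooling: keep the top-k tokens ranked by a learned score.
    (The abstract only defines token pooling as reducing the number of tokens at the
    spatial level; the score-based top-k rule here is an assumption.)"""
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                             # x: (batch, tokens, dim)
        k = max(1, int(x.shape[1] * self.keep_ratio))
        scores = self.score(x).squeeze(-1)            # (batch, tokens)
        idx = scores.topk(k, dim=1).indices           # indices of the kept tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        return torch.gather(x, 1, idx)                # (batch, k, dim)

class SharedAttentionLayer(nn.Module):
    """Illustrative self-attention layer that can reuse ("share") the attention
    map produced by a neighboring layer instead of recomputing it."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, shared_attn=None):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, heads, N, head_dim)
        if shared_attn is None:
            attn = (q @ k.transpose(-2, -1)) * self.scale
            attn = attn.softmax(dim=-1)               # compute a fresh attention map
        else:
            attn = shared_attn                        # reuse the neighboring layer's map
            # (a real implementation could skip computing q and k entirely here)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out), attn

# Toy usage: layer 2 shares layer 1's attention map, then the tokens are pooled.
x = torch.randn(2, 196, 384)                          # batch of 14x14 patch tokens
layer1, layer2 = SharedAttentionLayer(384), SharedAttentionLayer(384)
pool = TokenPooling(384, keep_ratio=0.5)
y, attn = layer1(x)
y, _ = layer2(y, shared_attn=attn)                    # attention sharing
y = pool(y)                                           # token pooling: 196 -> 98 tokens
```

In the full PSViT, which layers share attention and how many tokens each stage keeps are not fixed by hand as above; they are treated as hyper-parameters searched over the constructed compact set and learned from data.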

Related research

03/20/2023 · Robustifying Token Attention for Vision Transformers
Despite the success of vision transformers (ViTs), they still suffer fro...

09/04/2023 · One Wide Feedforward is All You Need
The Transformer architecture has two main non-embedding components: Atte...

07/05/2022 · Efficient Representation Learning via Adaptive Context Pooling
Self-attention mechanisms model long-range context by using pairwise att...

01/25/2006 · Fast Lexically Constrained Viterbi Algorithm (FLCVA): Simultaneous Optimization of Speed and Memory
Lexical constraints on the input of speech and on-line handwriting syste...

10/08/2021 · Token Pooling in Vision Transformers
Despite the recent success in many applications, the high computational ...

06/19/2023 · Vision Transformer with Attention Map Hallucination and FFN Compaction
Vision Transformer (ViT) is now dominating many vision tasks. The drawbac...

10/01/2022 · CAST: Concurrent Recognition and Segmentation with Adaptive Segment Tokens
Recognizing an image and segmenting it into coherent regions are often t...
