Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers

03/24/2023
by   Cong Wei, et al.

Vision Transformers (ViTs) have shown competitive performance compared to convolutional neural networks (CNNs), though they often come with high computational costs. To this end, previous methods explore different attention patterns that restrict each token to a fixed number of spatially nearby tokens to accelerate the ViT's multi-head self-attention (MHSA) operations. However, such structured attention patterns limit token-to-token connections to their spatial relevance, disregarding the learned semantic connections of a full attention mask. In this work, we propose a novel approach to learn instance-dependent attention patterns by devising a lightweight connectivity predictor module that estimates the connectivity score of each pair of tokens. Intuitively, two tokens have a high connectivity score if their features are considered relevant either spatially or semantically. Since each token attends to only a small number of other tokens, the binarized connectivity masks are very sparse by nature and therefore provide the opportunity to accelerate the network via sparse computations. Equipped with the learned unstructured attention pattern, the sparse-attention ViT (Sparsifiner) produces a superior Pareto-optimal trade-off between FLOPs and top-1 accuracy on ImageNet compared to token sparsity. Our method reduces MHSA FLOPs by 48% while the accuracy drop is within 0.4%, and combining attention and token sparsity reduces ViT FLOPs by over 60%.
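To make the mechanism concrete, below is a minimal PyTorch sketch of instance-dependent sparse attention under stated assumptions: the low-rank connectivity predictor, the top-k binarization rule, and all names (connectivity_scores, sparse_attention, k_connections, rank) are illustrative choices, not the paper's exact formulation. The dense masking shown here only mimics the sparsity pattern; in practice the speedup comes from sparse kernels that skip the masked entries entirely.

```python
import torch
import torch.nn.functional as F

def connectivity_scores(q, k, rank=16):
    """Cheap low-rank estimate of token-to-token connectivity.
    Illustrative stand-in: the paper uses a learned lightweight
    predictor module, not this fixed random projection."""
    d = q.shape[-1]
    # Down-project queries/keys to a small rank before the dot product,
    # so the score matrix costs O(N^2 * rank) instead of O(N^2 * d).
    proj = torch.randn(d, rank, device=q.device) / d ** 0.5
    return (q @ proj) @ (k @ proj).transpose(-2, -1)

def sparse_attention(q, k, v, k_connections=32):
    """Each query token attends only to its top-k most 'connected'
    tokens. A dense additive mask is used here for clarity."""
    scores = connectivity_scores(q, k)                    # (B, H, N, N)
    topk = scores.topk(k_connections, dim=-1).indices     # keep k per query
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk, 0.0)                          # binarized connectivity mask
    attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    attn = F.softmax(attn + mask, dim=-1)                 # masked pairs get zero weight
    return attn @ v

# Usage on ViT-S-like shapes: 6 heads, 14x14 patches + CLS token.
B, H, N, d = 1, 6, 197, 64
q = k = v = torch.randn(B, H, N, d)
out = sparse_attention(q, k, v)                           # (1, 6, 197, 64)
```

With k_connections much smaller than the number of tokens N, each row of the attention matrix has only k nonzero entries, which is the sparsity that dedicated sparse-computation kernels can exploit.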



Related research

06/03/2021 · DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Attention is sparse in vision transformers. We observe the final predict...

10/07/2022 · Breaking BERT: Evaluating and Optimizing Sparsified Attention
Transformers allow attention between all pairs of tokens, but there is r...

06/08/2021 · Chasing Sparsity in Vision Transformers: An End-to-End Exploration
Vision transformers (ViTs) have recently received explosive popularity, ...

09/15/2022 · Hydra Attention: Efficient Attention with Many Heads
While transformers have begun to dominate many tasks in vision, applying...

06/09/2022 · Extreme Masking for Learning Instance and Distributed Visual Representations
The paper presents a scalable approach for learning distributed represen...

10/08/2021 · Token Pooling in Vision Transformers
Despite the recent success in many applications, the high computational ...

06/08/2021 · On Improving Adversarial Transferability of Vision Transformers
Vision transformers (ViTs) process input images as sequences of patches ...
