Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

09/28/2022
by Xiangcheng Liu, et al.

The vision transformer (ViT) has emerged as a new paradigm in computer vision, delivering excellent performance at a high computational cost. Image token pruning is one of the main approaches to ViT compression, because the complexity of self-attention is quadratic in the number of tokens and many tokens contain only background regions that do not truly contribute to the final prediction. Existing works either rely on additional modules to score the importance of individual tokens or apply a fixed pruning ratio to all input instances. In this work, we propose an adaptive sparse token pruning framework with minimal overhead. Our approach is based on learnable thresholds and leverages Multi-Head Self-Attention to evaluate token informativeness with few additional operations. Specifically, we first propose an inexpensive class attention scoring mechanism weighted by attention head importance. Then, learnable parameters are inserted into the ViT as thresholds to distinguish informative tokens from unimportant ones. By comparing token attention scores with these thresholds, we can discard useless tokens hierarchically and thus accelerate inference. The learnable thresholds are optimized with budget-aware training to balance accuracy and complexity, producing pruning configurations tailored to each input instance. Extensive experiments demonstrate the effectiveness of our approach. For example, our method improves the throughput of DeiT-S by 50% with only a negligible drop in top-1 accuracy, achieving a better trade-off between accuracy and latency than previous methods.
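To make the scoring-and-pruning step concrete, below is a minimal PyTorch sketch of the idea described in the abstract, not the authors' implementation: the [CLS]-to-token attention from Multi-Head Self-Attention is aggregated with per-head importance weights, and tokens whose score falls below a learnable threshold are discarded. The names token_keep_mask, head_importance, and threshold, as well as the example shapes, are illustrative assumptions.

import torch

def token_keep_mask(attn, head_importance, threshold):
    """Boolean mask over non-[CLS] tokens to keep at one pruning stage.

    attn:            (B, H, N, N) softmax attention maps from MHSA
    head_importance: (H,) per-head importance weights (illustrative)
    threshold:       learnable scalar threshold for this layer
    """
    # Class attention: how strongly the [CLS] token attends to each patch token.
    cls_attn = attn[:, :, 0, 1:]                        # (B, H, N-1)
    # Aggregate heads with normalized importance weights into one score per token.
    w = torch.softmax(head_importance, dim=0)           # (H,)
    scores = (cls_attn * w.view(1, -1, 1)).sum(dim=1)   # (B, N-1)
    # Hard decision shown here (inference); training would use a soft,
    # differentiable relaxation so the threshold can be learned under a
    # budget-aware objective.
    return scores >= threshold

# Example: prune the tokens of a single image before the next block.
B, H, N, D = 1, 6, 197, 384                             # DeiT-S-like shapes (assumed)
x = torch.randn(B, N, D)                                # token embeddings, x[:, 0] is [CLS]
attn = torch.softmax(torch.randn(B, H, N, N), dim=-1)   # stand-in attention maps
head_importance = torch.nn.Parameter(torch.zeros(H))
threshold = torch.nn.Parameter(torch.tensor(0.005))

keep = token_keep_mask(attn, head_importance, threshold)             # (1, 196)
x_pruned = torch.cat([x[:, :1], x[:, 1:][keep].unsqueeze(0)], dim=1)
print(x_pruned.shape)                                   # (1, num_kept + 1, 384)

In a hierarchical setting, a separate threshold would be learned per pruning stage, so later layers can keep progressively fewer tokens for easy inputs and more for hard ones.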


research
08/20/2021

Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer

Transformer has achieved great success in NLP. However, the quadratic co...
research
06/08/2023

Multi-Scale And Token Mergence: Make Your ViT More Efficient

Since its inception, Vision Transformer (ViT) has emerged as a prevalent...
research
11/13/2022

WR-ONE2SET: Towards Well-Calibrated Keyphrase Generation

Keyphrase generation aims to automatically generate short phrases summar...
research
11/15/2022

HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers

While vision transformers (ViTs) have continuously achieved new mileston...
research
12/17/2020

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

The attention mechanism is becoming increasingly popular in Natural Lang...
research
05/24/2023

Predicting Token Impact Towards Efficient Vision Transformer

Token filtering to reduce irrelevant tokens prior to self-attention is a...
research
01/30/2022

Fast Monte-Carlo Approximation of the Attention Mechanism

We introduce Monte-Carlo Attention (MCA), a randomized approximation met...
