Learned Token Pruning for Transformers

by Sehoon Kim, et al.

A major challenge in deploying transformer models is their prohibitive inference cost, which scales quadratically with the input sequence length. This makes it especially difficult to use transformers for processing long sequences. To address this, we present a novel Learned Token Pruning (LTP) method that reduces redundant tokens as the data passes through the different layers of the transformer. In particular, LTP prunes tokens with an attention score below a threshold value, which is learned during training. Importantly, our threshold-based method avoids algorithmically expensive operations such as top-k token selection, which are used in prior token pruning methods, and also leads to structured pruning. We extensively test the performance of our approach on multiple GLUE tasks and show that our learned threshold-based method consistently outperforms the prior state-of-the-art top-k token-based method by up to ~2% in accuracy. Our preliminary results show up to 1.4x and 1.9x throughput improvement on Tesla T4 GPU and Intel Haswell CPU, respectively, with less than 1% accuracy drop (and up to 2.1x FLOPs reduction). Our code has been developed in PyTorch and has been open-sourced.
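The contrast between threshold-based pruning and top-k selection can be illustrated with a minimal sketch in plain Python. This is not the authors' PyTorch implementation. It assumes a token's score is the mean attention probability it receives, averaged over heads and query positions, and it uses a fixed threshold, whereas in LTP the threshold is learned during training. The function names `token_importance`, `prune_by_threshold`, and `prune_top_k` are hypothetical.

```python
def token_importance(attn):
    """Score each token by the mean attention probability it receives.

    attn[h][i][j] is the attention probability from query token i to
    key token j in head h (assumed layout for this sketch).
    """
    num_heads = len(attn)
    num_tokens = len(attn[0])
    scores = [0.0] * num_tokens
    for head in attn:
        for row in head:
            for j, prob in enumerate(row):
                scores[j] += prob
    # Average over heads and query positions.
    return [s / (num_heads * num_tokens) for s in scores]


def prune_by_threshold(scores, threshold):
    """Keep tokens whose score reaches the threshold: one comparison per token."""
    return [j for j, s in enumerate(scores) if s >= threshold]


def prune_top_k(scores, k):
    """Baseline for comparison: keep the k highest-scoring tokens (needs a ranking)."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# One head, three tokens; each row is a softmax output and sums to 1.
attn = [[
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]]
scores = token_importance(attn)            # roughly [0.6, 0.3, 0.1]
kept = prune_by_threshold(scores, 0.2)     # hypothetical fixed threshold
kept_top_k = prune_top_k(scores, 2)
```

With a threshold, the number of surviving tokens depends on the input, while top-k always keeps exactly k tokens and has to rank the scores first.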




Make A Long Image Short: Adaptive Token Length for Vision Transformers

The vision transformer splits each image into a sequence of tokens with ...

Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Vision transformers have recently received explosive popularity, but the...

A Study on Token Pruning for ColBERT

The ColBERT model has recently been proposed as an effective BERT based ...

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

The attention mechanism is becoming increasingly popular in Natural Lang...

Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

Vision transformer has emerged as a new paradigm in computer vision, sho...

DoT: An efficient Double Transformer for NLP tasks with tables

Transformer-based approaches have been successfully used to obtain state...

PSViT: Better Vision Transformer via Token Pooling and Attention Sharing

In this paper, we observe two levels of redundancies when applying visio...

Code Repositories


Learned Token Pruning for Transformers



zero-vocab or low-vocab embeddings
