Learned Token Pruning for Transformers

07/02/2021
by Sehoon Kim, et al.

A major challenge in deploying transformer models is their prohibitive inference cost, which scales quadratically with the input sequence length. This makes it especially difficult to use transformers for processing long sequences. To address this, we present a novel Learned Token Pruning (LTP) method that removes redundant tokens as the data passes through the different layers of the transformer. In particular, LTP prunes tokens whose attention score falls below a threshold value that is learned during training. Importantly, our threshold-based method avoids algorithmically expensive operations such as the top-k token selection used in prior token pruning methods, and it also leads to structured pruning. We extensively test the performance of our approach on multiple GLUE tasks and show that our learned threshold-based method consistently outperforms the prior state-of-the-art top-k token based method by up to 2% higher accuracy. Furthermore, our preliminary results show up to 1.4x and 1.9x throughput improvement on a Tesla T4 GPU and an Intel Haswell CPU, respectively, with less than a 1% accuracy drop (and up to 2.1x FLOPs reduction). Our code has been developed in PyTorch and open-sourced.
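As a rough illustration of the thresholding step, the sketch below shows how a learned per-layer threshold could be applied to prune tokens in PyTorch. This is a minimal sketch under simplifying assumptions rather than the released LTP implementation: token importance is taken as the attention each token receives, averaged over heads and query positions; a fixed threshold stands in for the learned per-layer value; and the helper names token_importance and prune_by_threshold are hypothetical.

```python
import torch


def token_importance(attn_probs: torch.Tensor) -> torch.Tensor:
    """Importance of each token: the attention it receives, averaged over
    heads and query positions.  attn_probs has shape
    (batch, heads, seq_len, seq_len) after softmax."""
    return attn_probs.mean(dim=(1, 2))  # -> (batch, seq_len)


def prune_by_threshold(hidden: torch.Tensor, attn_probs: torch.Tensor,
                       threshold: float):
    """Drop tokens whose importance falls below the layer threshold.
    hidden has shape (batch, seq_len, dim); returns the pruned hidden
    states together with the boolean keep mask."""
    scores = token_importance(attn_probs)   # (batch, seq_len)
    keep = scores >= threshold              # single comparison, no top-k sort
    keep[:, 0] = True                       # always retain the [CLS] token
    # With batch size 1 (or equal keep counts per example) we can gather directly.
    pruned = hidden[keep].view(hidden.size(0), -1, hidden.size(-1))
    return pruned, keep


if __name__ == "__main__":
    torch.manual_seed(0)
    hidden = torch.randn(1, 8, 4)                          # 1 example, 8 tokens
    attn = torch.softmax(torch.randn(1, 2, 8, 8), dim=-1)  # 2 attention heads
    pruned, keep = prune_by_threshold(hidden, attn, threshold=0.12)
    print("kept", keep.sum().item(), "of", hidden.size(1), "tokens")
```

Because each token is compared against a scalar threshold, the keep mask comes from a single elementwise comparison, avoiding the sort or top-k selection that fixed per-layer token budgets require.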



Related Research

12/03/2021 · Make A Long Image Short: Adaptive Token Length for Vision Transformers
The vision transformer splits each image into a sequence of tokens with ...

08/03/2021 · Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
Vision transformers have recently received explosive popularity, but the...

12/13/2021 · A Study on Token Pruning for ColBERT
The ColBERT model has recently been proposed as an effective BERT based ...

12/17/2020 · SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
The attention mechanism is becoming increasingly popular in Natural Lang...

09/28/2022 · Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention
Vision transformer has emerged as a new paradigm in computer vision, sho...

06/01/2021 · DoT: An efficient Double Transformer for NLP tasks with tables
Transformer-based approaches have been successfully used to obtain state...

08/07/2021 · PSViT: Better Vision Transformer via Token Pooling and Attention Sharing
In this paper, we observe two levels of redundancies when applying visio...

Code Repositories

LTP: Learned Token Pruning for Transformers


embeddings: zero-vocab or low-vocab embeddings

