Learned Token Pruning for Transformers

07/02/2021
by Sehoon Kim, et al.

A major challenge in deploying transformer models is their prohibitive inference cost, which scales quadratically with the input sequence length. This makes it especially difficult to use transformers for processing long sequences. To address this, we present a novel Learned Token Pruning (LTP) method that reduces redundant tokens as the data passes through the different layers of the transformer. In particular, LTP prunes tokens with an attention score below a threshold value, which is learned during training. Importantly, our threshold based method avoids algorithmically expensive operations such as top-k token selection which are used in prior token pruning methods, and also leads to structured pruning. We extensively test the performance of our approach on multiple GLUE tasks and show that our learned threshold based method consistently outperforms the prior state-of-the-art top-k token based method by up to 2% higher accuracy with the same amount of FLOPs. Furthermore, our preliminary results show up to 1.4x and 1.9x throughput improvement on Tesla T4 GPU and Intel Haswell CPU, respectively, with less than 1% accuracy drop (and up to 2.1x FLOPs reduction). Our code has been developed in PyTorch and has been open-sourced.
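To make the thresholding idea concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: it assumes the importance of a token is the attention probability it receives, averaged over heads and query positions, and keeps a token only if that score exceeds a threshold theta. The function names (token_importance, threshold_prune_mask) and the value of theta are illustrative assumptions.

# Minimal sketch of threshold-based token pruning (illustrative, not the
# authors' released code). A token survives a layer only if the attention
# it receives, averaged over heads and query positions, exceeds a
# threshold theta; in LTP that threshold is learned during training.

import torch

def token_importance(attn_probs: torch.Tensor) -> torch.Tensor:
    """Average attention each token receives.

    attn_probs: (batch, heads, seq_len, seq_len) softmax attention probabilities.
    Returns:    (batch, seq_len) importance score per token.
    """
    # Mean over heads, then over query positions; what remains is indexed
    # by the attended-to (key) token.
    return attn_probs.mean(dim=1).mean(dim=1)

def threshold_prune_mask(attn_probs: torch.Tensor, theta: float) -> torch.Tensor:
    """Hard pruning mask used at inference: keep tokens scoring above theta."""
    return token_importance(attn_probs) > theta  # (batch, seq_len) boolean

# Toy usage: batch of 2 sequences, 4 heads, 8 tokens.
if __name__ == "__main__":
    scores = torch.randn(2, 4, 8, 8)
    attn_probs = scores.softmax(dim=-1)
    mask = threshold_prune_mask(attn_probs, theta=0.12)
    print(mask)          # which tokens survive this layer
    print(mask.sum(-1))  # number of retained tokens per sequence

Because the decision is a simple comparison against a threshold, no sorting or top-k selection is needed, and the number of surviving tokens can differ across sequences and layers. The hard mask above corresponds to inference-time pruning; during training the paper learns the per-layer thresholds through a differentiable soft version of this mask.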


Related research

06/26/2023  Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference
Deploying pre-trained transformer models like BERT on downstream tasks i...

08/09/2023  Which Tokens to Use? Investigating Token Reduction in Vision Transformers
Since the introduction of the Vision Transformer (ViT), researchers have...

05/27/2023  Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers
Deployment of Transformer models on the edge is increasingly challenging...

04/04/2023  Attention Map Guided Transformer Pruning for Edge Device
Due to its significant capability of modeling long-range dependencies, v...

10/17/2022  Token Merging: Your ViT But Faster
We introduce Token Merging (ToMe), a simple method to increase the throu...

12/13/2021  A Study on Token Pruning for ColBERT
The ColBERT model has recently been proposed as an effective BERT based ...

12/17/2020  SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
The attention mechanism is becoming increasingly popular in Natural Lang...
