Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning

10/30/2021
by Xuanli He, et al.

Pre-training followed by fine-tuning of large language models is commonly used to achieve state-of-the-art performance in natural language processing (NLP) tasks. However, most pre-trained models suffer from low inference speed, which makes deploying such large models in applications with latency constraints challenging. In this work, we focus on accelerating inference via conditional computation. To this end, we propose a novel idea, Magic Pyramid (MP), which reduces both width-wise and depth-wise computation via token pruning and early exiting for Transformer-based models, particularly BERT. The former saves computation by removing non-salient tokens, while the latter reduces computation by terminating inference before the final layer once the exit condition is met. Our empirical studies demonstrate that, compared to previous state-of-the-art methods, MP not only achieves speed-adjustable inference but also surpasses both token pruning and early exiting, reducing up to 70% of giga floating-point operations (GFLOPs) with less than a 0.5% accuracy drop. Token pruning and early exiting each express distinctive preferences for sequences of different lengths; MP, however, achieves an average 8.06x speedup on two popular text classification tasks regardless of input length.
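To make the two savings concrete, below is a minimal PyTorch-style sketch of an encoder that combines a depth-wise early exit (stop once an intermediate classifier is confident) with width-wise token pruning (drop the least salient non-[CLS] tokens at each layer). Names such as `exit_threshold`, `keep_ratio`, and the hidden-norm saliency score are illustrative assumptions, not the paper's exact criteria.

```python
# Hedged sketch: early exiting + token pruning in one encoder forward pass.
# The exit rule (softmax confidence) and saliency proxy (hidden-state norm)
# are stand-ins for illustration only, not the Magic Pyramid implementation.
import torch
import torch.nn as nn


class MagicPyramidSketch(nn.Module):
    def __init__(self, dim=768, num_layers=12, num_classes=2,
                 exit_threshold=0.9, keep_ratio=0.7):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
            for _ in range(num_layers)
        )
        # One lightweight classifier per layer enables depth-wise early exits.
        self.exit_heads = nn.ModuleList(nn.Linear(dim, num_classes)
                                        for _ in range(num_layers))
        self.exit_threshold = exit_threshold   # confidence needed to stop early
        self.keep_ratio = keep_ratio           # fraction of tokens kept per layer

    def forward(self, hidden):                 # hidden: (1, seq_len, dim)
        for layer, head in zip(self.layers, self.exit_heads):
            hidden = layer(hidden)

            # Depth-wise saving: exit as soon as the [CLS] prediction is confident.
            logits = head(hidden[:, 0])
            probs = torch.softmax(logits, dim=-1)
            if probs.max() >= self.exit_threshold:
                return logits

            # Width-wise saving: keep only the most salient non-[CLS] tokens,
            # preserving their original order.
            saliency = hidden[:, 1:].norm(dim=-1)          # (1, seq_len - 1)
            k = max(1, int(saliency.size(1) * self.keep_ratio))
            keep = saliency.topk(k, dim=1).indices.sort(dim=1).values
            body = torch.gather(hidden[:, 1:], 1,
                                keep.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
            hidden = torch.cat([hidden[:, :1], body], dim=1)

        return logits                          # fell through to the final layer


# Example usage with random embeddings standing in for a 128-token input.
model = MagicPyramidSketch().eval()
with torch.no_grad():
    logits = model(torch.randn(1, 128, 768))
```

The two mechanisms compound: easy inputs exit at a shallow layer, while long inputs that run deeper are processed over progressively fewer tokens, which is what makes the speedup largely insensitive to sequence length.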


research
05/21/2023

Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model

The prevalence of Transformer-based pre-trained language models (PLMs) h...
research
06/26/2023

Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference

Deploying pre-trained transformer models like BERT on downstream tasks i...
research
03/27/2022

Pyramid-BERT: Reducing Complexity via Successive Core-set based Token Selection

Transformer-based language models such as BERT have achieved the state-o...
research
05/28/2021

Accelerating BERT Inference for Sequence Labeling via Early-Exit

Both performance and efficiency are crucial factors for sequence labelin...
research
05/25/2021

TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference

Existing pre-trained language models (PLMs) are often computationally ex...
research
07/01/2021

Elbert: Fast Albert with Confidence-Window Based Early Exit

Despite the great success in Natural Language Processing (NLP) area, lar...
research
04/07/2022

Accelerating Attention through Gradient-Based Learned Runtime Pruning

Self-attention is a key enabler of state-of-art accuracy for various tra...
