Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

05/25/2023
by Sotiris Anagnostidis, et al.

Autoregressive Transformers used in Large Language Models (LLMs) are hard to scale to long sequences. Although several works have tried to reduce their computational cost, most LLMs still compute attention between all pairs of tokens in the sequence, incurring a quadratic cost. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, reducing memory and computational requirements during inference. Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context at any point during generation. In doing so, our approach not only addresses performance concerns but also enhances interpretability, providing valuable insight into the model's decision-making process. Our technique can be applied to existing pre-trained models through a straightforward fine-tuning process, and the pruning strength can be specified by a sparsity parameter. Notably, our empirical findings demonstrate that we can effectively prune up to 80% of the context without significant performance degradation on downstream tasks, offering a valuable tool for mitigating inference costs. Our reference implementation achieves up to a 2× increase in inference throughput and even greater memory savings.
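
To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of the kind of KV-cache pruning the abstract describes: cached tokens are scored for importance and only the top fraction, controlled by a sparsity-like knob, is retained for subsequent decoding steps. The function name `prune_kv_cache`, the `keep_ratio` parameter, and the generic `scores` tensor are illustrative assumptions, not the paper's actual mechanism, which learns the dropping decision during fine-tuning.

```python
import torch

def prune_kv_cache(keys, values, scores, keep_ratio=0.2):
    """Illustrative KV-cache pruning (not the paper's exact method).

    keys, values: (seq_len, d) cached projections for one attention head
    scores:       (seq_len,) stand-in for a learned per-token importance signal
    keep_ratio:   fraction of the cached context to retain (sparsity knob)
    """
    seq_len = keys.shape[0]
    k = max(1, int(seq_len * keep_ratio))
    # Indices of the k most important cached tokens, kept in original order
    top = torch.topk(scores, k).indices.sort().values
    return keys[top], values[top]

# Toy usage: prune a random cache of 10 tokens down to ~20% of its length
keys = torch.randn(10, 64)
values = torch.randn(10, 64)
scores = torch.randn(10)  # placeholder for a learned importance score
k_pruned, v_pruned = prune_kv_cache(keys, values, scores, keep_ratio=0.2)
print(k_pruned.shape)     # torch.Size([2, 64])
```

The sketch only shows where a learned importance signal would plug into the decoding loop; in the paper's setting the sparsity parameter trades inference throughput and memory against downstream accuracy.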


Related research:

03/29/2022 - Fine-tuning Image Transformers using Learnable Memory
In this paper we propose augmenting Vision Transformer models with learn...

05/27/2023 - Fine-Tuning Language Models with Just Forward Passes
Fine-tuning language models (LMs) has yielded success on diverse downstr...

09/21/2023 - LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
We present LongLoRA, an efficient fine-tuning approach that extends the ...

04/28/2022 - Towards Flexible Inference in Sequential Decision Problems via Bidirectional Transformers
Randomly masking and predicting word tokens has been a successful approa...

02/24/2022 - Learning to Merge Tokens in Vision Transformers
Transformers are widely applied to solve natural language understanding ...

03/16/2022 - AdapLeR: Speeding up Inference by Adaptive Length Reduction
Pre-trained language models have shown stellar performance in various do...

11/30/2022 - Fast Inference from Transformers via Speculative Decoding
Inference from large autoregressive models like Transformers is slow - d...
