SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference

07/05/2023
by Luciano Del Corro, et al.

Autoregressive large language models (LLMs) have made remarkable progress in various natural language generation tasks. However, they incur high computation cost and latency resulting from autoregressive token-by-token generation. To address this issue, several approaches have been proposed to reduce computational cost using early-exit strategies, which enable faster text generation by not applying the full computation graph to each token. While existing token-level early-exit methods show promising results for online inference, they cannot be readily applied to batch inferencing and Key-Value (KV) caching, because they have to wait until the last token in a batch exits before they can stop computing. This severely limits the practical application of such techniques. In this paper, we propose SkipDecode, a simple and effective token-level early-exit method designed to work seamlessly with batch inferencing and KV caching. It overcomes prior constraints by setting up a single exit point for every token in a batch at each sequence position, and it guarantees a monotonic decrease in exit points, eliminating the need to recompute KV caches for preceding tokens. Rather than terminating computation prematurely as in prior work, our approach bypasses the lower to middle layers, devoting most of the computational resources to the upper layers and allowing later tokens to benefit from the compute expenditure of earlier tokens. Our experimental results show that SkipDecode obtains 2x to 5x inference speedups with negligible regression across a variety of tasks, using OPT models of 1.3 billion and 6.7 billion parameters, while remaining directly compatible with batching and KV caching optimizations.
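To make the mechanism concrete, below is a minimal Python sketch of the kind of position-based layer schedule the abstract describes. The linear decay, the parameter names (max_active, min_active), and the specific values are illustrative assumptions rather than the paper's exact configuration. The key properties it demonstrates are that the set of active layers depends only on the sequence position (so every sequence in a batch stays aligned at a single shared exit point) and that the layer budget shrinks monotonically while always covering the top of the network (so the KV caches written by earlier, deeper-computed tokens cover every layer a later token attends over).

    def skipdecode_layers(position: int, max_position: int,
                          num_layers: int = 24,
                          max_active: int = 24, min_active: int = 8) -> range:
        """Return the decoder layers to run for the token at `position`.

        Hypothetical linear decay: the per-token layer budget falls from
        `max_active` at position 0 to `min_active` at `max_position`.
        Lower layers are skipped; the budget is spent on the top layers.
        """
        frac = min(position / max_position, 1.0)
        budget = round(max_active - frac * (max_active - min_active))
        return range(num_layers - budget, num_layers)

    if __name__ == "__main__":
        for pos in (0, 256, 512, 1024, 2048):
            layers = skipdecode_layers(pos, max_position=2048)
            print(f"position {pos:4d}: layers {layers.start}..{layers.stop - 1} "
                  f"({len(layers)} of 24)")

Running the demo shows the budget decaying from all 24 layers at position 0 to the top 8 layers at position 2048; because every position's active layers are a suffix of the previous position's, no preceding token's KV cache ever needs recomputation.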


