Big Little Transformer Decoder

02/15/2023
by Sehoon Kim, et al.

The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, since the model must run iteratively to generate tokens sequentially, without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that improves inference efficiency and latency for a wide range of text generation applications. The BiLD framework consists of two models of different sizes that collaboratively generate text. The small model runs autoregressively to generate text at a low inference cost, and the large model is only invoked occasionally to refine the small model's inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) a fallback policy that determines when to hand control over to the large model, and (2) a rollback policy that determines when the large model needs to review and correct the small model's inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, summarization on CNN/DailyMail, and language modeling on WikiText-2. On an NVIDIA Titan Xp GPU, our framework achieves a speedup of up to 2.13x without any performance drop, and up to 2.38x speedup with only ~1 point of degradation. Furthermore, our framework is fully plug-and-play, as it requires no training or modifications to the model architectures. Our code will be open-sourced.
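The fallback and rollback policies lend themselves to a compact decoding loop. The sketch below is not the authors' released implementation: `small_step`, `large_distributions`, and the probability thresholds are hypothetical stand-ins introduced here for illustration, and the exact policies in the paper may differ in detail. What the sketch preserves is the core idea from the abstract: the small model generates token by token, and the large model is consulted only on low-confidence steps, scoring the whole draft in a single non-autoregressive pass.

```python
def bild_decode(prompt, small_step, large_distributions,
                fallback_threshold=0.5, rollback_threshold=0.3, max_new=128):
    """Minimal BiLD-style fallback/rollback decoding sketch (illustrative only).

    small_step(seq) -> (token, prob): the small model's greedy next token for
        `seq` and its probability, from one autoregressive step.
    large_distributions(seq) -> list of len(seq) + 1 dicts; entry i maps tokens
        to the large model's probability of the token at position i given
        seq[:i] (the last entry is the next-token distribution for all of seq).
        All entries come from a single parallel, non-autoregressive pass.
    The threshold values here are placeholders, not the paper's settings.
    """
    seq = list(prompt)
    verified = len(seq)  # positions before this index were checked by the large model

    while len(seq) - len(prompt) < max_new:
        token, prob = small_step(seq)

        # Fallback policy: if the small model is not confident enough,
        # hand control over to the large model before accepting this token.
        if prob < fallback_threshold:
            dists = large_distributions(seq)

            # Rollback policy: rewind to the first unverified small-model token
            # that the large model finds unlikely; if none, keep the full draft.
            cut = len(seq)
            for i in range(verified, len(seq)):
                if dists[i].get(seq[i], 0.0) < rollback_threshold:
                    cut = i
                    break

            # The large model supplies the token at the cut point, and
            # everything up to it is now considered verified.
            best = max(dists[cut], key=dists[cut].get)
            seq = seq[:cut] + [best]
            verified = len(seq)
        else:
            seq.append(token)

    return seq
```

In this setup most decoding steps cost only a small-model forward pass, while the large model's occasional parallel pass verifies (and, when needed, rewinds) the draft. That division of labor is what lets the overall latency approach that of the small model while keeping the generation quality anchored to the large model.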
