SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

08/31/2023
by Amey Agrawal, et al.

Large Language Model (LLM) inference consists of two distinct phases: a prefill phase, which processes the input prompt, and a decode phase, which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute even at small batch sizes, the decode phase yields low compute utilization because it generates one token at a time per request. The varying prefill and decode times also cause imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to pipeline bubbles. We present SARATHI to address these challenges. SARATHI employs chunked-prefills, which splits a prefill request into equal-sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and fills the remaining slots with decodes. During inference, the prefill chunk saturates GPU compute, while the decode requests 'piggyback' and cost up to an order of magnitude less than a decode-only batch. Chunked-prefills allows constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback. Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware. For the LLaMA-13B model on an A6000 GPU, SARATHI improves decode throughput by up to 10x and accelerates end-to-end throughput by up to 1.33x. For LLaMA-33B on an A100 GPU, we achieve 1.25x higher end-to-end throughput and up to 4.25x higher decode throughput. When used with pipeline parallelism on GPT-3, SARATHI reduces bubbles by 6.29x, resulting in an end-to-end throughput improvement of 1.91x.
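
The batch-construction scheme can be illustrated with a short sketch. The Python snippet below is a minimal illustration only, assuming a hypothetical `Request` record and `build_batches` scheduler; the names, chunk size, and batch size are illustrative choices, not the paper's implementation or tuned values.

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_len: int      # total prompt tokens to prefill
    prefilled: int = 0   # prompt tokens already prefilled

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len

def build_batches(prefill: Request, decode_ids: list[int],
                  chunk_size: int, batch_size: int) -> list[dict]:
    """Split one prefill request into equal-sized chunks and pair each
    chunk with decode requests that fill the remaining batch slots."""
    batches = []
    while prefill.in_prefill:
        # Take the next chunk (the last chunk may be smaller).
        chunk = min(chunk_size, prefill.prompt_len - prefill.prefilled)
        prefill.prefilled += chunk
        # One prefill chunk saturates GPU compute; the same in-flight
        # decodes "piggyback" in every batch, each advancing one token.
        batches.append({
            "prefill_chunk": (prefill.rid, chunk),
            "decodes": decode_ids[: batch_size - 1],
        })
    return batches

if __name__ == "__main__":
    prefill = Request(rid=0, prompt_len=1024)
    decode_ids = list(range(1, 18))  # 17 in-flight decode requests
    for b in build_batches(prefill, decode_ids, chunk_size=256, batch_size=18):
        print(b["prefill_chunk"], len(b["decodes"]), "piggybacked decodes")
```

Under these assumed sizes, a 1024-token prompt yields four decode-maximal batches of 256 prefill tokens each, so every in-flight decode advances four tokens at marginal cost; the equal-sized chunks also give each micro-batch a uniform compute footprint, which is what shrinks pipeline bubbles.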


