S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput

06/09/2023
by   Yunho Jin, et al.

Generating text with a large language model (LLM) consumes massive amounts of memory. Beyond the already-large model parameters, the key/value (KV) cache, which holds information about previous tokens in a sequence, can grow to be even larger than the model itself. The problem is exacerbated by current LLM serving frameworks, which, not knowing the output sequence length in advance, reserve memory for the maximum sequence length to guarantee that a complete sequence can be generated. This forces a smaller batch size, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence length can mitigate this problem. To this end, we propose S^3, which predicts the output sequence length, schedules generation queries based on the prediction to increase device resource utilization and throughput, and handles mispredictions. Our proposed method achieves 6.49× the throughput of systems that assume the worst case for the output sequence length.
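To make the batching argument concrete, here is a minimal sketch of length-aware batch packing. The constants (an OPT-13B-like layer count and hidden size, a 24 GB KV-cache budget) and the greedy packing policy are illustrative assumptions, not the paper's actual predictor or scheduler; the point is that reserving memory for a *predicted* total length, rather than the model's maximum sequence length, lets more queries fit in one batch.

    import math

    # Hypothetical model/hardware constants, chosen for illustration only
    # (roughly OPT-13B-like); not values from the paper.
    NUM_LAYERS = 40
    HIDDEN_SIZE = 5120
    BYTES_PER_VALUE = 2                   # fp16
    KV_BUDGET_BYTES = 24 * 1024**3        # GPU memory left for the KV cache

    def kv_cache_bytes(seq_len: int) -> int:
        """Per-sequence KV cache: 2 (K and V) x layers x hidden x tokens x bytes."""
        return 2 * NUM_LAYERS * HIDDEN_SIZE * seq_len * BYTES_PER_VALUE

    def pack_batch(queries: list[dict]) -> list[dict]:
        """Greedily pack queries into one batch, reserving memory for each
        query's predicted total length (prompt + predicted output) instead
        of the maximum sequence length. A misprediction (a sequence that
        outgrows its reservation) would have to be preempted and rescheduled."""
        batch, used = [], 0
        for q in sorted(queries, key=lambda q: q["predicted_len"]):
            need = kv_cache_bytes(q["predicted_len"])
            if used + need <= KV_BUDGET_BYTES:
                batch.append(q)
                used += need
        return batch

    queries = [
        {"id": 0, "predicted_len": 256},
        {"id": 1, "predicted_len": 2048},
        {"id": 2, "predicted_len": 512},
    ]
    print([q["id"] for q in pack_batch(queries)])

Under these assumptions a 2,048-token reservation costs about 3.4 GB, so a worst-case-length policy fits only a handful of sequences in the budget, while predicted-length reservations admit many short queries alongside them.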


Related research

03/13/2023 · FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
06/24/2023 · H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
08/31/2023 · SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
12/15/2022 · A Study on the Intersection of GPU Utilization and CNN Inference
05/26/2023 · Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
10/09/2020 · TurboTransformers: An Efficient GPU Serving System For Transformer Models
06/06/2023 · Injecting knowledge into language generation: a case study in auto-charting after-visit care instructions from medical dialogue
