Fast Distributed Inference Serving for Large Language Models

05/10/2023
by Bingyang Wu, et al.

Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low job completion time (JCT) for model inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long JCT. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize JCT with a novel skip-join Multi-Level Feedback Queue (MLFQ) scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arriving job to join. Queues with higher priority than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate states between GPU memory and host memory for LLM inference. We build a system prototype of FastServe based on NVIDIA FasterTransformer. Experimental results show that compared to the state-of-the-art solution Orca, FastServe improves the average and tail JCT by up to 5.1× and 6.4×, respectively.
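The skip-join MLFQ idea can be illustrated with a small sketch. The Python below is a toy model, not FastServe's actual implementation: costs are measured in abstract token units, the `SkipJoinMLFQ` and `Job` names and the doubling level quanta (`base_quantum * 2**level`) are assumptions made for illustration, and a job's first-token cost is approximated by its input length.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Job:
    job_id: int
    input_length: int      # known at arrival; proxy for first-token cost
    remaining_tokens: int  # NOT known to the scheduler (the agnostic part)
    queue_level: int = 0   # current MLFQ priority level (0 = highest)

class SkipJoinMLFQ:
    """Toy skip-join multi-level feedback queue (illustrative only).

    Level i has a demotion quantum of base_quantum * 2**i token units.
    A new job skips levels whose quantum is smaller than its first-token
    cost, so it is not pointlessly demoted through them.
    """
    def __init__(self, num_levels: int = 4, base_quantum: int = 2):
        self.num_levels = num_levels
        self.base_quantum = base_quantum
        self.queues: List[List[Job]] = [[] for _ in range(num_levels)]

    def quantum(self, level: int) -> int:
        return self.base_quantum * (2 ** level)

    def submit(self, job: Job) -> None:
        # Skip-join: pick the first level whose quantum covers the
        # job's (input-length-proportional) first-token cost.
        level = 0
        while (level < self.num_levels - 1
               and self.quantum(level) < job.input_length):
            level += 1
        job.queue_level = level
        self.queues[level].append(job)

    def step(self) -> Optional[Job]:
        """Run the highest-priority job for one quantum and demote it if
        unfinished. A real token-level-preemptive scheduler would re-check
        the queues after every generated token; this sketch compresses a
        whole quantum into one step."""
        for level, q in enumerate(self.queues):
            if not q:
                continue
            job = q.pop(0)
            tokens = min(self.quantum(level), job.remaining_tokens)
            job.remaining_tokens -= tokens  # "generate" tokens output tokens
            if job.remaining_tokens == 0:
                return job                  # job completed
            # Demote the unfinished job one level (bounded by lowest queue).
            job.queue_level = min(level + 1, self.num_levels - 1)
            self.queues[job.queue_level].append(job)
            return None
        return None

if __name__ == "__main__":
    # Toy usage: a long job arrives first, then a short one. The long
    # job skip-joins a low-priority queue, so the short job finishes
    # first instead of being blocked behind it.
    sched = SkipJoinMLFQ()
    sched.submit(Job(job_id=1, input_length=16, remaining_tokens=40))
    sched.submit(Job(job_id=2, input_length=2, remaining_tokens=4))
    while any(sched.queues):
        done = sched.step()
        if done:
            print(f"job {done.job_id} finished")
```

Under these assumptions the short job completes before the long one, which is the head-of-line-blocking relief the abstract describes; the paper's scheduler additionally handles GPU memory pressure by offloading preempted jobs' intermediate state to host memory.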


Related research

- Aryl: An Elastic Cluster Scheduler for Deep Learning (02/16/2022)
  Companies build separate training and inference GPU clusters for deep le...
- Analysis of the Symmetric Join the Shortest Orbit Queue (05/06/2020)
  This work introduces the join the shortest queue policy in the retrial s...
- SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (05/16/2023)
  The high computational and memory requirements of generative large langu...
- Scheduling in Parallel Finite Buffer Systems: Optimal Decisions under Delayed Feedback (09/17/2021)
  Scheduling decisions in parallel queuing systems arise as a fundamental ...
- The generalized join the shortest orbit queue system: Stability, exact tail asymptotics and stationary approximations (04/16/2021)
  We introduce the generalized join the shortest queue model with retrials...
- Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models (07/05/2023)
  Large language models (LLMs) such as ChatGPT have demonstrated unprecede...
- TurboTransformers: An Efficient GPU Serving System For Transformer Models (10/09/2020)
  The transformer is the most critical algorithm innovation of the Nature ...
