SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification

05/16/2023
by Xupeng Miao, et al.

The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them quickly and cheaply. This paper introduces SpecInfer, an LLM serving system that accelerates generative LLM inference with speculative inference and token tree verification. A key insight behind SpecInfer is to combine various collectively boost-tuned small language models to jointly predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified by the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality.
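The abstract's core mechanism can be sketched in miniature: small draft models expand a token tree of candidate continuations, and the target LLM verifies the tree, accepting every speculated token that matches its own greedy choice. This is a toy illustration only; the model functions, the two-branch proposal width, and the per-node verification loop are all placeholder assumptions (a real system such as SpecInfer scores the entire tree in a single tree-parallel decoding pass).

```python
# Toy sketch of speculative inference with token-tree verification.
# small_model and large_model are hypothetical stand-ins, not real LLMs.

def small_model(prefix):
    """Hypothetical draft model: proposes two candidate next tokens."""
    h = sum(prefix) % 7
    return [h, (h + 1) % 7]          # two speculative branches per step

def large_model(prefix):
    """Hypothetical target LLM: the greedy next token that must be matched."""
    return (sum(prefix) * 3 + 1) % 7

def build_tree(prefix, depth):
    """Expand the draft model's proposals into a token tree (a trie)."""
    if depth == 0:
        return {}
    return {t: build_tree(prefix + [t], depth - 1)
            for t in small_model(prefix)}

def verify(prefix, tree):
    """Walk the tree, accepting speculated tokens that agree with the LLM.

    A real serving system verifies every tree node in one parallel
    decoding pass; here we emulate that by querying large_model per node.
    """
    accepted = []
    node = tree
    while True:
        t = large_model(prefix + accepted)
        if t in node:                 # speculation verified: accept for free
            accepted.append(t)
            node = node[t]
        else:                         # mismatch: take the LLM's token, stop
            accepted.append(t)
            return accepted

prompt = [1, 2]
token_tree = build_tree(prompt, depth=3)
print(verify(prompt, token_tree))    # → [3, 5]
```

In this run one speculated token (3) is accepted without an extra sequential LLM step, and the mismatching position is repaired with the LLM's own token (5), which is how speculative decoding preserves the target model's output exactly.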


Related research

- LLMCad: Fast and Scalable On-device Large Language Model Inference (09/08/2023). Generative tasks, such as text generation and question answering, hold a...
- Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models (07/05/2023). Large language models (LLMs) such as ChatGPT have demonstrated unprecede...
- Frustratingly Short Attention Spans in Neural Language Modeling (02/15/2017). Neural language models predict the next token using a latent representat...
- Fast Distributed Inference Serving for Large Language Models (05/10/2023). Large language models (LLMs) power a new generation of interactive AI ap...
- Forward-Backward Reasoning in Large Language Models for Verification (08/15/2023). Chain-of-Thought (CoT) prompting has shown promising performance in vario...
- Efficiently Scaling Transformer Inference (11/09/2022). We study the problem of efficient generative inference for Transformer m...
- Systematic Rectification of Language Models via Dead-end Analysis (02/27/2023). With adversarial or otherwise normal prompts, existing large language mo...
