Efficiently Scaling Transformer Inference

11/09/2022
by Reiner Pope, et al.

We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. A better understanding of the engineering tradeoffs of inference for large Transformer-based models is important, as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoff for 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that, with appropriate partitioning, the lower memory requirements of multiquery attention (i.e., multiple query heads share a single key/value head) enable scaling up to 32x larger context lengths. Finally, we achieve a low-batch-size latency of 29 ms per token during generation (using int8 weight quantization) and a 76% MFU while supporting a long 2048-token context length on the PaLM 540B parameter model.
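Two of the quantities discussed above, model FLOPS utilization (MFU) and the key/value cache that multiquery attention shrinks, are easy to estimate on the back of an envelope. The Python sketch below is a minimal illustration rather than the paper's methodology: the chip count, peak per-chip FLOP/s, batch size, and model dimensions are placeholder assumptions; only the 29 ms/token latency and the 2048-token context length are taken from the abstract.

```python
# Back-of-envelope sketch. All hardware and model numbers below are
# placeholder assumptions, not the paper's measured configuration.

def decode_mfu(n_params, batch, latency_s_per_token, n_chips, peak_flops_per_chip):
    """Model FLOPS utilization: achieved FLOP/s divided by peak FLOP/s.

    During autoregressive decoding, each generated token costs roughly
    2 * n_params matmul FLOPs (one multiply and one add per parameter).
    """
    achieved_flops_per_s = 2 * n_params * batch / latency_s_per_token
    return achieved_flops_per_s / (n_chips * peak_flops_per_chip)


def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values during decoding.

    Multiquery attention keeps a single key/value head (n_kv_heads=1),
    so the cache shrinks by the number of query heads relative to
    standard multi-head attention.
    """
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical operating point: 64 chips at ~275 TFLOP/s peak, batch 64,
    # and the 29 ms/token generation latency quoted in the abstract.
    mfu = decode_mfu(n_params=540e9, batch=64, latency_s_per_token=0.029,
                     n_chips=64, peak_flops_per_chip=275e12)
    print(f"approximate decode MFU at this point: {mfu:.1%}")

    # Hypothetical 500B-class dimensions: 118 layers, 48 query heads, head_dim 256.
    common = dict(batch=32, seq_len=2048, n_layers=118, head_dim=256)
    mha = kv_cache_bytes(n_kv_heads=48, **common)   # one K/V head per query head
    mqa = kv_cache_bytes(n_kv_heads=1, **common)    # shared K/V head (multiquery)
    print(f"KV cache, multi-head attention: {mha / 1e9:.0f} GB")
    print(f"KV cache, multiquery attention: {mqa / 1e9:.0f} GB")
```

The point of the comparison is that MFU at a fixed per-token latency scales with batch size, while the key/value cache (not the weights) is what limits batch size and context length; sharing a single key/value head cuts that cache by the number of query heads, which is what makes the much longer contexts reported above feasible.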
