DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

06/30/2022
by   Reza Yazdani Aminabadi, et al.

The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: model sizes vary drastically, with the largest reaching hundreds of billions of parameters; model characteristics differ due to the sparsity introduced by Mixture-of-Experts; target application scenarios can be latency-critical or throughput-oriented; and deployment hardware ranges from single- to multi-GPU systems with different types of memory and storage. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extremely challenging. In this paper, we present DeepSpeed Inference, a comprehensive system solution for transformer model inference that addresses the above challenges. DeepSpeed Inference consists of (1) a multi-GPU inference solution that minimizes latency while maximizing throughput for both dense and sparse transformer models when they fit in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU and NVMe memory in addition to GPU memory and compute to enable high-throughput inference for large models that do not fit in aggregate GPU memory. DeepSpeed Inference reduces latency by up to 7.3x over the state of the art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios. Moreover, it enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can serve models 25x larger than GPU-only solutions while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
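To make the multi-GPU path described above concrete, the following is a minimal sketch of wrapping a Hugging Face transformer with DeepSpeed's Python inference API. The model name is a small placeholder, and the exact keyword arguments (e.g., mp_size, replace_with_kernel_inject) may differ across DeepSpeed versions; this illustrates the general usage pattern rather than the paper's exact setup.

```python
# Minimal sketch: multi-GPU transformer inference with DeepSpeed Inference.
# Assumes the deepspeed and transformers packages; argument names may vary by version.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder; the paper targets far larger transformers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the model with DeepSpeed's inference engine:
#  - mp_size shards the model across visible GPUs (tensor parallelism)
#  - replace_with_kernel_inject swaps transformer layers for fused inference kernels
engine = deepspeed.init_inference(
    model,
    mp_size=torch.cuda.device_count(),
    dtype=torch.half,
    replace_with_kernel_inject=True,
)

prompt = "DeepSpeed Inference enables"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = engine.module.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Run under the DeepSpeed launcher (e.g., deepspeed --num_gpus 4 script.py) so each rank holds one shard of the model. The heterogeneous path described in the abstract instead offloads model weights to CPU or NVMe memory when the model does not fit in aggregate GPU memory.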
