Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference

05/22/2023
by Jinghan Yao, et al.

In the rapidly evolving field of deep learning, the performance of model inference has become a pivotal concern as models grow more complex and are deployed in diverse applications. Among these, autoregressive models stand out due to their state-of-the-art performance in numerous generative tasks. These models, by design, harness a temporal dependency structure in which the probability distribution of the current token is conditioned on the preceding tokens. This inherently sequential characteristic adheres to the Markov chain assumption and lacks temporal parallelism, which poses unique challenges. The absence of parallelism is especially pronounced in industrial settings, where inference requests arrive following a Poisson time distribution and demand diverse response lengths. Existing solutions, such as dynamic batching and concurrent model instances, come with severe overheads and a lack of flexibility; these coarse-grained methods fall short of achieving optimal latency and throughput. To address these shortcomings, we propose Flover, a temporal fusion framework for efficient inference in autoregressive models that eliminates the need for heuristic settings and applies to a wide range of inference scenarios. By providing finer-grained parallelism on the temporality of requests and employing an efficient memory shuffle algorithm, Flover achieves up to 11x faster inference on GPT models compared to the cutting-edge solutions provided by NVIDIA Triton FasterTransformer. Crucially, by leveraging advanced tensor parallelism, Flover proves efficacious across diverse computational landscapes, from single-GPU setups to multi-node scenarios, offering robust performance optimization that transcends hardware boundaries.
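The central idea of fusing requests along the time dimension of generation, rather than batching them up front or spawning separate model instances, can be illustrated with a minimal sketch. Everything below (the `FusedDecoder` class, its method names, and the per-step model callable) is a hypothetical illustration under assumed interfaces, not Flover's actual API: new requests arriving over time are fused into the in-flight decoding loop at iteration granularity, and finished requests are retired by compacting the per-request state, standing in loosely for the paper's memory shuffle step.

```python
import queue

class FusedDecoder:
    """Hypothetical sketch of iteration-level ("temporal") request fusion.

    Assumes `model` is a callable taking the list of last tokens for all
    active requests and returning one next token per request.
    """

    def __init__(self, model, max_batch=32):
        self.model = model              # per-step forward pass (assumed interface)
        self.max_batch = max_batch
        self.active = []                # in-flight requests: tokens so far + budget
        self.pending = queue.Queue()    # requests arriving over time (e.g. Poisson)

    def submit(self, prompt_tokens, max_new_tokens):
        """Enqueue a request; it joins the running loop at the next iteration."""
        self.pending.put((list(prompt_tokens), max_new_tokens))

    def step(self):
        """Run one decoding iteration over the fused batch."""
        # Fuse newly arrived requests into the in-flight batch, no waiting.
        while not self.pending.empty() and len(self.active) < self.max_batch:
            prompt, budget = self.pending.get()
            self.active.append({"tokens": prompt, "left": budget})
        if not self.active:
            return
        # One forward step for all active requests at once.
        last = [r["tokens"][-1] for r in self.active]
        next_tokens = self.model(last)
        for r, t in zip(self.active, next_tokens):
            r["tokens"].append(t)
            r["left"] -= 1
        # Compact state by retiring finished requests, so newly arriving
        # ones can take their slots at the next iteration.
        self.active = [r for r in self.active if r["left"] > 0]

# Example usage with a dummy "model" that always emits token 0:
#   dec = FusedDecoder(model=lambda last: [0] * len(last))
#   dec.submit([1, 2, 3], max_new_tokens=4)
#   for _ in range(4):
#       dec.step()
```

The contrast with dynamic batching is that a request never waits for a batch to form or for the longest response in its batch to finish; it enters and leaves the shared decoding loop independently, which is what makes the approach robust to Poisson arrivals and diverse response lengths.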


