TurboTransformers: An Efficient GPU Serving System For Transformer Models

10/09/2020
by Jiarui Fang, et al.

The transformer is the most critical algorithmic innovation in the Natural Language Processing (NLP) field in recent years. Unlike RNN models, transformers can process along the sequence-length dimension in parallel, which leads to better accuracy on long sequences. However, deploying them efficiently for online services in GPU-equipped data centers is not easy. First, the extra computation introduced by transformer structures makes it harder to meet the latency and throughput constraints of serving. Second, NLP tasks take in sentences of variable length, and this variability of input dimensions poses a severe problem for efficient memory management and serving optimization. To address these challenges, this paper presents a transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework. Three innovative features make it stand out from similar works. First, an efficient parallel algorithm is proposed for GPU-based batch reduction operations, such as Softmax and LayerNorm, which are the major hot spots besides BLAS routines. Second, a memory allocation algorithm designed for variable-length inputs better balances memory footprint against allocation/free efficiency. Third, a serving framework equipped with a new dynamic-programming batching scheduler achieves optimal throughput on variable-length requests. The system achieves state-of-the-art transformer serving performance on GPU platforms and can be integrated into existing PyTorch code with only a few lines.
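The abstract names the batch reductions but not the kernel design. As a rough illustration of the pattern such kernels build on, here is a minimal sketch in plain Python: a log-step tree reduction, the shape in which GPU threads combine partial results, used twice for a numerically stable Softmax (a max reduction, then a sum reduction; LayerNorm's mean/variance reductions follow the same shape). The function names and the sequential simulation are illustrative, not the paper's algorithm.

```python
import math

def tree_reduce(values, op):
    """Combine values in O(log n) steps, mirroring how GPU threads in a
    warp/block merge partial results in parallel (simulated sequentially)."""
    vals = list(values)
    stride = 1
    while stride < len(vals):
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
    return vals[0]

def softmax_row(row):
    # Numerically stable softmax needs two reductions per row:
    # a max reduction, then a sum reduction. These (plus LayerNorm's
    # mean/variance) are the batch-reduction hot spots the paper targets.
    m = tree_reduce(row, max)
    exps = [math.exp(x - m) for x in row]
    s = tree_reduce(exps, lambda a, b: a + b)
    return [e / s for e in exps]

print(softmax_row([1.0, 2.0, 3.0, 4.0]))
```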
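The abstract gives only the allocator's goal, not its mechanism. Below is a minimal sketch of one common way to strike that footprint-versus-speed balance, assuming a caching scheme that reuses freed buffers within a size tolerance instead of returning them to the device; the class, the `slack` parameter, and the cost model are hypothetical, not the paper's design.

```python
class CachingAllocator:
    """Toy caching allocator: frees go to a pool; an allocation reuses
    the smallest cached buffer within `slack` of the request, otherwise
    it falls back to a fresh (expensive) device allocation."""

    def __init__(self, slack=1.25):
        self.slack = slack          # tolerated over-allocation factor
        self.pool = []              # sizes of cached (free) buffers

    def alloc(self, size):
        candidates = [s for s in self.pool if size <= s <= size * self.slack]
        if candidates:
            best = min(candidates)  # best fit limits wasted footprint
            self.pool.remove(best)
            return best             # reuse: no device malloc needed
        return size                 # cache miss: stands in for a device malloc

    def free(self, size):
        self.pool.append(size)      # keep for reuse instead of freeing

allocator = CachingAllocator()
buf = allocator.alloc(512)   # fresh allocation
allocator.free(buf)
buf2 = allocator.alloc(480)  # reuses the 512-unit buffer (within slack)
```

A larger `slack` reuses more aggressively (fewer device allocations, bigger footprint); a smaller one does the opposite, which is exactly the trade-off the paper says its algorithm balances for variable-length inputs.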
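For the batching scheduler, the abstract states that dynamic programming is used but not the cost model. A minimal sketch, assuming requests are sorted by length, batches are contiguous runs padded to their longest member, and each batch pays a fixed launch overhead (all assumptions of mine), looks like this:

```python
def schedule_batches(lengths, batch_overhead=8.0):
    """Partition length-sorted requests into contiguous batches so that
    total padded work (batch_size * max_len) plus a fixed per-batch
    launch overhead is minimized. O(n^2) dynamic programming."""
    lengths = sorted(lengths)
    n = len(lengths)
    INF = float("inf")
    best = [0.0] + [INF] * n   # best[i] = min cost of batching first i
    split = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            # requests j..i-1 form one batch, padded to lengths[i-1]
            cost = best[j] + (i - j) * lengths[i - 1] + batch_overhead
            if cost < best[i]:
                best[i], split[i] = cost, j
    # recover the batch boundaries
    batches, i = [], n
    while i > 0:
        batches.append(lengths[split[i]:i])
        i = split[i]
    return best[n], batches[::-1]

cost, batches = schedule_batches([5, 7, 30, 32, 33, 6])
print(cost, batches)   # short and long requests end up in separate batches
```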
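Finally, the claimed PyTorch integration "with a few lines of code" would look roughly like the sketch below. The `turbo_transformers` module and the `from_torch` conversion call reflect the project's public examples as I understand them, but treat the exact names as assumptions rather than a confirmed API.

```python
import torch
import transformers
import turbo_transformers  # the paper's runtime; API names assumed here

# Load a stock HuggingFace BERT, convert it to the TurboTransformers
# runtime, then run inference with the same call shape as before.
torch_model = transformers.BertModel.from_pretrained("bert-base-uncased")
torch_model.eval()
tt_model = turbo_transformers.BertModel.from_torch(torch_model)  # assumed name

input_ids = torch.randint(0, 30522, (1, 64), dtype=torch.long)
with torch.no_grad():
    output = tt_model(input_ids)
```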

Related research

10/06/2022 · ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs
08/07/2022 · A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining
09/06/2022 · EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models
09/18/2021 · Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem
10/28/2018 · SoaAlloc: A Lock-free Hierarchical Bitmap-based Object Allocator for GPUs
06/09/2023 · S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput
05/10/2023 · Fast Distributed Inference Serving for Large Language Models
