Easy and Efficient Transformer: A Scalable Inference Solution for Large NLP Models

by Gongzheng Li, et al.

Ultra-large-scale pre-trained models can effectively improve performance on a variety of tasks, but they also impose a heavy computational burden at inference time. This paper introduces a series of optimization methods for ultra-large-scale pre-trained models that combine algorithmic characteristics with GPU hardware characteristics, and on this basis proposes an inference engine, Easy and Efficient Transformer (EET), which delivers a significant performance improvement over existing schemes. First, we introduce a pre-padding decoding mechanism that improves token parallelism for generation tasks. Second, we design highly optimized kernels that remove sequence masks and make the computation of padding tokens cost-free, while also supporting long sequences and large embedding sizes. Third, we introduce a user-friendly inference system with a simple service pipeline, which greatly reduces the difficulty of engineering deployment while sustaining high throughput. Compared to Faster Transformer's implementation of GPT-2 on A100, EET achieves a state-of-the-art speedup of 1.5-15x, varying with context length. EET is available at https://github.com/NetEase-FuXi/EET.
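The pre-padding idea can be illustrated with a minimal sketch (this is not EET's implementation; `PAD_ID` and `pre_pad` are hypothetical names). Variable-length prompts are padded on the left so that every sequence's last real token sits at the same position; during incremental decoding, all sequences then generate their next token at the same index, allowing the new tokens to be computed in parallel without per-sequence masking of trailing padding.

```python
PAD_ID = 0  # hypothetical pad-token id

def pre_pad(batch, pad_id=PAD_ID):
    """Left-pad a batch of token-id lists to a common length."""
    max_len = max(len(seq) for seq in batch)
    return [[pad_id] * (max_len - len(seq)) + seq for seq in batch]

# Three prompts of different lengths:
batch = [[11, 12, 13], [21, 22], [31]]
padded = pre_pad(batch)
# Every row now ends at the same column with its last real token,
# so decoding step t appends to the same position for all rows.
```

With conventional right-padding, each sequence would finish at a different position, forcing masked or serialized handling of the generation step; left-padding aligns the "growing edge" of the whole batch.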






