Easy and Efficient Transformer: A Scalable Inference Solution for Large NLP Models

by Gongzheng Li, et al.

Ultra-large-scale pre-trained models can effectively improve performance on a variety of tasks, but they also impose a heavy computational burden at inference time. This paper introduces a series of optimization methods for ultra-large-scale pre-trained models that combine algorithmic characteristics with GPU hardware characteristics, and on this basis proposes an inference engine, Easy and Efficient Transformer (EET), which delivers a significant performance improvement over existing schemes. First, we introduce a pre-padding decoding mechanism that improves token parallelism for generation tasks. Second, we design highly optimized kernels that remove sequence masks and make the computation of padding tokens cost-free, while also supporting long sequences and large embedding sizes. Third, we introduce a user-friendly inference system with a simple service pipeline, which greatly reduces the difficulty of engineering deployment while sustaining high throughput. Compared to Faster Transformer's implementation of GPT-2 on A100, EET achieves a state-of-the-art speedup of 1.5-15x, varying with context length. EET is available at https://github.com/NetEase-FuXi/EET.
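The pre-padding idea can be illustrated with a minimal sketch (this is not EET's implementation; `PAD_ID` and `pre_pad` are hypothetical names). Variable-length prompts are padded on the left so that every sequence's last real token sits at the same position; during incremental decoding, all sequences then generate their next token at the same index, allowing the new tokens to be computed in parallel without per-sequence masking of trailing padding.

```python
PAD_ID = 0  # hypothetical pad-token id

def pre_pad(batch, pad_id=PAD_ID):
    """Left-pad a batch of token-id lists to a common length."""
    max_len = max(len(seq) for seq in batch)
    return [[pad_id] * (max_len - len(seq)) + seq for seq in batch]

# Three prompts of different lengths:
batch = [[11, 12, 13], [21, 22], [31]]
padded = pre_pad(batch)
# Every row now ends at the same column with its last real token,
# so decoding step t appends to the same position for all rows.
```

With conventional right-padding, each sequence would finish at a different position, forcing masked or serialized handling of the generation step; left-padding aligns the "growing edge" of the whole batch.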






