Sequence processing and generation tasks, such as machine translation, summarization and language modeling, have long been central to natural language processing (Luong et al., 2015; Yan et al., 2020; Dai et al., 2019). Since the introduction of the Transformer model (Vaswani et al., 2017), pre-trained language models such as BERT and GPT have also been widely used in these tasks (Devlin et al., 2018; Radford et al., 2019).
However, these models have very large parameter and operation counts, which causes high inference latency and makes deployment challenging. In machine translation, for example, the best current model takes several seconds to translate a sentence, which is unacceptable in both academia and industry (Edunov et al., 2018). A high-performance sequence inference library is therefore needed. Several open-source inference libraries for Transformer models already exist, such as Faster Transformer (https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer) and Turbo Transformers (https://github.com/Tencent/TurboTransformers). However, they support only a few models and features, and their inference speed still leaves room for improvement.
Table 1 (column headers): Inference Libraries | Transformer | GPT | VAE | Beam Search | Diverse Beam Search | Sampling
In this paper, we propose LightSeq, a high performance inference library for sequence processing and generation implemented in CUDA. This library is efficient, functional and convenient. LightSeq has many advantages compared to other open-source inference libraries.
Efficient. LightSeq has better inference performance. In machine translation, LightSeq achieves up to 14 times speedup compared with TensorFlow and 1.4 times speedup compared with Faster Transformer.
Functional. LightSeq supports many models, such as BERT, GPT, Transformer and VAE. It also supports several decoding methods, such as beam search, diverse beam search and sampling. Table 1 compares Faster Transformer, Turbo Transformers and LightSeq in text generation tasks.
Convenient. LightSeq is an out-of-the-box end-to-end model server based on TensorRT Inference Server and flexibly supports multi-level reuse. Users can easily deploy efficient model services or develop their own model architectures with only minor code modifications.
The source code of LightSeq is available at https://github.com/bytedance/lightseq.
Taking Transformer models as an example, the inference process of a machine translation or text generation model consists of two parts: the feature calculation of the encoder and the auto-regressive decoding algorithm. The feature calculation part centers on the self-attention mechanism and feature transformation. It involves calculation-intensive operations such as matrix multiplication, accompanied by a large number of IO-intensive operations such as Element-wise (e.g. Reshape) and Reduce (e.g. Layer Normalization). The decoding algorithm part contains Softmax, beam filtering and cache refreshing, and it involves dynamic shapes, which are more complex to handle. This brings several challenges to model inference:
Fine-grained kernel function calls for IO-intensive computation cause many redundant video memory reads and writes, which become the bottleneck of feature calculation performance.
Complex dynamic shapes make computational graphs hard to optimize and lead to many dynamic video memory allocations during model inference, which consume a lot of time.
The logic of each decoding step is complicated, which makes it difficult to parallelize and to fully exploit the hardware.
LightSeq uses a number of core technologies to address these challenges: fusion of multiple computing operations to reduce IO overhead, reuse of video memory to avoid dynamic allocation, and hierarchical optimization of decoding algorithms to improve inference speed. These methods are introduced in detail below.
2.1 Operation fusion
In recent years, thanks to its efficient feature extraction, the Transformer encoder-decoder structure has been widely used in various NLP tasks, including pre-training without large amounts of labeled text. Most deep learning frameworks, such as TensorFlow and PyTorch, implement encoder-decoder computation by calling kernel functions in basic libraries. These kernel functions tend to be fine-grained, so a single component usually requires multiple kernel function calls.
Taking layer normalization as implemented by TensorFlow as an example, even with compilation optimization techniques (automatically merging broadcast and element-wise operations), there are still three kernel launches (two for reduce_mean operations and one for calculation of the final result) and two read-write operations of intermediate results (mean and variance) in video memory. With CUDA, we can write a custom kernel function dedicated to layer normalization that keeps the two intermediate results in registers. This requires only one kernel launch and no read-write operations of intermediate results in video memory, which greatly reduces computational overhead.
Based on this idea, LightSeq implements the Transformer layers by combining general matrix multiply (GEMM) provided by cuBLAS (https://developer.nvidia.com/cublas) with custom kernel functions. Figure 2 shows the detailed structure of the optimized Transformer layers. All operations between GEMM operations are implemented with dedicated custom kernel functions.
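The layer normalization example above can be illustrated with a small Python analogy. This is only a sketch of the idea and not LightSeq's CUDA kernel: the dictionary `mem` stands in for video memory, each `kernel_*` function stands in for one kernel launch, and local variables play the role of registers. All function names here are hypothetical.

```python
import math

EPS = 1e-5  # assumed epsilon; the paper does not state a value


def kernel_mean(mem):
    # Launch 1: reduce to the mean and write it back to "video memory".
    mem["mean"] = sum(mem["x"]) / len(mem["x"])


def kernel_var(mem):
    # Launch 2: read the mean back, reduce to the variance, write it back.
    m = mem["mean"]
    mem["var"] = sum((v - m) ** 2 for v in mem["x"]) / len(mem["x"])


def kernel_normalize(mem):
    # Launch 3: read both intermediates back and compute the final result.
    m = mem["mean"]
    inv_std = 1.0 / math.sqrt(mem["var"] + EPS)
    mem["y"] = [g * (v - m) * inv_std + b
                for v, g, b in zip(mem["x"], mem["gamma"], mem["beta"])]


def layer_norm_unfused(mem):
    # Three launches, two intermediates stored in "video memory".
    for kernel in (kernel_mean, kernel_var, kernel_normalize):
        kernel(mem)
    return mem["y"]


def layer_norm_fused(mem):
    # One launch: mean and variance live only in local variables.
    x = mem["x"]
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + EPS)
    mem["y"] = [g * (v - mean) * inv_std + b
                for v, g, b in zip(x, mem["gamma"], mem["beta"])]
    return mem["y"]


inputs = {"x": [1.0, 2.0, 3.0, 4.0], "gamma": [1.0] * 4, "beta": [0.0] * 4}
mem_unfused, mem_fused = dict(inputs), dict(inputs)
y_unfused = layer_norm_unfused(mem_unfused)
y_fused = layer_norm_fused(mem_fused)
```

Both versions produce the same output, but only the unfused one leaves `mean` and `var` entries behind in `mem`, which mirrors the two intermediate read-write operations that the fused kernel avoids.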
2.2 Dynamic video memory reuse
To reduce video memory occupancy and avoid allocating or releasing video memory during computation, LightSeq pre-defines the maximum of each dynamic shape, such as the maximum sequence length. When the service starts, each intermediate result in the calculation process is allocated video memory for its maximum size. In addition, intermediate results that do not depend on each other share the same video memory.
With this memory reuse strategy, we can deploy up to 8 Transformer big models at the same time on a single T4 graphics card (under a configuration of batch size 8, sequence length 256, beam size 4 and vocabulary size 30000), which improves graphics card utilization in low-frequency or peak-shifting scenarios.
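The reuse strategy can be sketched as a planning step that runs once at service start. The following is a minimal illustration under stated assumptions, not LightSeq's actual allocator: `plan_buffers` is a hypothetical helper, the tensor names and lifetimes are invented, and the hidden size (1024), head count (16) and feed-forward size (4096) are assumed values. Only the batch size and sequence length come from the configuration quoted above.

```python
def plan_buffers(tensors):
    """Assign each tensor an offset in one pre-allocated arena.

    tensors: list of (name, max_size, first_step, last_step).
    Tensors whose live ranges do not overlap may share a slot.
    Returns (offsets_by_name, total_arena_size).
    """
    slots = []        # each slot is [size, last_step_of_current_occupant]
    assignment = {}   # tensor name -> slot index
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        for i, slot in enumerate(slots):
            if slot[1] < first:               # previous occupant is dead
                slot[0] = max(slot[0], size)  # grow slot to the larger need
                slot[1] = last
                assignment[name] = i
                break
        else:
            slots.append([size, last])
            assignment[name] = len(slots) - 1
    offsets, total = [], 0
    for size, _ in slots:
        offsets.append(total)
        total += size
    return {name: offsets[i] for name, i in assignment.items()}, total


# Maximum shapes are fixed up front, so sizes never change at run time.
MAX_BATCH, MAX_SEQ = 8, 256            # from the configuration above
HIDDEN, HEADS, FFN = 1024, 16, 4096    # assumed model dimensions

tensors = [
    ("qkv",         MAX_BATCH * MAX_SEQ * 3 * HIDDEN,      0, 1),
    ("attn_scores", MAX_BATCH * HEADS * MAX_SEQ * MAX_SEQ, 1, 2),
    ("attn_out",    MAX_BATCH * MAX_SEQ * HIDDEN,          2, 3),
    ("ffn_inner",   MAX_BATCH * MAX_SEQ * FFN,             3, 4),
]
offsets, total = plan_buffers(tensors)
naive_total = sum(size for _, size, _, _ in tensors)
```

In this toy plan, `attn_out` reuses the slot of `qkv` and `ffn_inner` reuses the slot of `attn_scores`, so the arena is smaller than the sum of all buffers and no allocation happens after startup.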
2.3 Hierarchical decoding optimization
The most complicated part of the auto-regressive sequence generation task is decoding. LightSeq currently supports a variety of decoding methods, such as beam search, diverse beam search and top-k/top-p sampling. These can be used with Transformer and GPT to achieve a several-fold speedup. Here we take beam search, the most widely used method, as an example to introduce how LightSeq optimizes the decoding process.
In a deep learning framework, each decoding step needs to pick the tokens with the top-k probabilities. This requires Softmax calculations and video memory read-write operations on output matrices whose size is proportional to the vocabulary size. The vocabulary usually contains tens of thousands of entries, and this cost is incurred at every single decoding step. As a result, the decoding module accounts for a high proportion of cumulative latency (more than 30%) in auto-regressive sequence generation tasks.
Taking GPU computing characteristics into account and drawing on the two-stage retrieve and re-rank strategy used in recommender systems, LightSeq designs a hierarchical output retrieval function, which avoids the calculation of Softmax and the sorting of hundreds of thousands of elements.
Figure 4 gives a detailed example of this two-stage strategy. In the original beam search method, all sixteen elements must be sorted, which is expensive. With the hierarchical decoding method, we only need to pick a few elements in each beam and then sort those elements.
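The retrieve and re-rank idea can be sketched in a few lines of Python. This is a simplified illustration rather than LightSeq's GPU implementation: it keeps the k best candidates per beam and then sorts only those survivors, and it does not model how the Softmax is avoided. The function names and the score values are hypothetical.

```python
import heapq


def beam_topk_flat(scores, k):
    # Baseline: sort every (score, beam, token) candidate.
    flat = [(s, b, t) for b, row in enumerate(scores)
            for t, s in enumerate(row)]
    return sorted(flat, reverse=True)[:k]


def beam_topk_two_stage(scores, k):
    # Stage 1 (retrieve): keep only the k best candidates of each beam.
    candidates = []
    for b, row in enumerate(scores):
        candidates.extend(
            heapq.nlargest(k, ((s, b, t) for t, s in enumerate(row))))
    # Stage 2 (re-rank): sort the few survivors instead of everything.
    return sorted(candidates, reverse=True)[:k]


# Four beams with four candidate tokens each: sixteen elements in total.
scores = [
    [0.10, 0.70, 0.05, 0.15],
    [0.30, 0.20, 0.45, 0.05],
    [0.25, 0.25, 0.40, 0.10],
    [0.60, 0.10, 0.20, 0.10],
]
flat_result = beam_topk_flat(scores, k=2)
two_stage_result = beam_topk_two_stage(scores, k=2)
```

Any element of the global top k must also be in the top k of its own beam, so the two-stage version returns the same candidates while the final sort touches at most k elements per beam.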
In this section, we show the improvements achieved by LightSeq with the above optimizations.
3.1 Latency analysis
We analyze the performance of LightSeq's inference process using a GPU profiling tool. Figure 3 shows the latency proportion of each module under float32 and float16 precision.
After optimization, we can find that:
GEMM operations in cuBLAS account for 82% and 88% of the latency respectively, which is most of the inference time. In the original TensorFlow model, by contrast, GEMM operations account for only 25%. This shows that beam search optimization has achieved good results.
Cache refreshing accounts for 10% and 6% of the latency respectively. This is a relatively high proportion, but it is hard to optimize. Possible solutions include reducing the amount of cache, for example by reducing the number of decoder layers or lowering cache precision.
Other operations total 8% and 6%, including layer normalization, beam search, video memory read-write of intermediate results, etc.
All of the above shows that LightSeq has been optimized to a very good extent and greatly increases the speed of inference.
We test the performance of LightSeq on the latest generation of NVIDIA inference graphics cards, Tesla P4 and T4, and use the TensorFlow and Faster Transformer implementations for comparison.
Both Faster Transformer and LightSeq achieve more than 10 times speedup over TensorFlow in the small batch size scenario. As batch size increases, the speedup of both decreases because GEMM operations take up a growing proportion of the time. However, the decrease for LightSeq is relatively flat, and in the large batch size scenario LightSeq is up to 1.4 times faster than Faster Transformer.
This also provides guidance for future inference optimization work. In the small batch size scenario, high speedups can be achieved simply by optimizing the fusion of non-computation-intensive operators. In the large batch size scenario, however, we must continue to optimize computation-intensive operators such as GEMM.
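The dependence of the speedup on the GEMM proportion follows the familiar Amdahl's law pattern. The sketch below is an illustration under a simplifying assumption that is not stated in the paper: GEMM time is left unchanged while all other operators are accelerated by a constant factor. The function name and the fractions used are hypothetical, not measured values.

```python
def end_to_end_speedup(gemm_fraction, fused_speedup):
    """Overall speedup when only the non-GEMM part gets faster.

    gemm_fraction: share of baseline latency spent in GEMM (0 < f <= 1).
    fused_speedup: factor by which the non-GEMM operators are accelerated.
    """
    non_gemm_fraction = 1.0 - gemm_fraction
    return 1.0 / (gemm_fraction + non_gemm_fraction / fused_speedup)


# Hypothetical baselines: GEMM is a small share at small batch sizes
# and a large share at large batch sizes.
small_batch = end_to_end_speedup(gemm_fraction=0.25, fused_speedup=20.0)
large_batch = end_to_end_speedup(gemm_fraction=0.75, fused_speedup=20.0)
```

Under this assumption the overall speedup can never exceed 1 / gemm_fraction, which is why operator fusion alone yields less as batch size grows and GEMM itself becomes the target.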
On the WMT14 fr-en benchmark, we also compare the implementations on NVIDIA Tesla P4 and T4. Experiments show that LightSeq achieves speedups of 6.41x and 13.06x respectively, which further verifies the effectiveness of LightSeq.
In the text generation scenario, sampling is often used to improve the diversity of results. Figure 7 shows the performance comparison of Transformer base with top-k/top-p sampling.
With sampling, LightSeq outperforms Faster Transformer in most configurations and achieves up to 1.4 times additional speedup.
In the cloud service scenario, we measure the change in latency when a GPT model service is migrated from TensorFlow to LightSeq. We observe a 3 to 5 times reduction in pct99 latency, as shown in Figure 8.
In this paper, we propose LightSeq, a functional, convenient and efficient inference library. In the future, we will continue to optimize it and apply it to more research and industry applications.
- Dai et al. (2019). Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 2978–2988.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
- Edunov et al. (2018). Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), pp. 489–500.
- Luong et al. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton (Eds.), pp. 1412–1421.
- Radford et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6000–6010.
- Yan et al. (2020). ProphetNet: predicting future n-gram for sequence-to-sequence pre-training. CoRR abs/2001.04063.