LightSeq2: Accelerated Training for Transformer-based Models on GPUs

Transformer-based models have proven to be powerful in many natural language, computer vision, and speech recognition applications. It is expensive to train these types of models due to variable input length, complex computation, and large numbers of parameters. Existing systems either focus only on efficient inference or optimize only BERT-like encoder models. In this paper, we present LightSeq2, a system for efficient training of Transformer-based models on GPUs. We propose a series of GPU optimization techniques tailored to computation flow and memory access patterns of neural layers in Transformers. LightSeq2 supports a variety of network architectures, including BERT (encoder-only), GPT (decoder-only), and Transformer (encoder-decoder). Our experiments on GPUs with varying models and datasets show that LightSeq2 is 1.4-3.5x faster than previous systems. In particular, it gains a 308% training speedup over existing systems on a large public machine translation benchmark (WMT14 English-German).




1. Introduction

Figure 1. Model size (number of parameters) and estimated training cost for popular Transformer-based models. Here MT represents Transformer-big models on WMT14 machine translation task.

Transformers (Vaswani et al., 2017) have become a prevailing neural architecture in many natural language processing (NLP), computer vision (CV), and automatic speech recognition (ASR) applications (Devlin et al., 2019; Radford et al., 2018; Yang et al., 2019; Zhai et al., 2021; Gulati et al., 2020). Variants of Transformer prove to achieve state-of-the-art accuracy in text classification, question answering, machine translation, and visual object recognition tasks (Devlin et al., 2019; Pan et al., 2021; Dosovitskiy et al., 2021). Transformer models typically require large model size and training data to perform well. For example, a GPT-3 model requires 3.1 million hours of training on modern GPUs and it is estimated to cost $4.6 million to complete a single trial (Brown et al., 2020). Figure 1 shows sizes and estimated training cost for several popular Transformer models. The training cost increases roughly in proportion to the number of model parameters. With the ever-growing model size, it becomes expensive to train them. There is a critical need for optimizing the computation for Transformers.

Libraries    Sequence Length
DeepSpeed    Multiples of 16
LightSeq2    Arbitrary
Table 1. Comparing LightSeq2 and DeepSpeed. The columns cover training components (Embedding, Encoder, Decoder, Criterion, Trainer), sequence length, and DL frameworks (PyTorch, TensorFlow).

Existing approaches for accelerating Transformers are limited to either inference-only or models with only encoder layers. LightSeq (Wang et al., 2021) and TurboTransformers (Fang et al., 2021) are two recent systems targeting the serving of Transformers. However, neither supports Transformer training, since training involves additional backward computation that is more complex than the forward pass.

DeepSpeed is a recent system for optimizing Transformer training (Rasley et al., 2020; Rajbhandari et al., 2020). It uses specially designed CUDA kernels, as well as quantized gradient update across multiple GPUs. However, DeepSpeed only supports the Transformer encoder layer and thus can only be used to train BERT-like models (Devlin et al., 2019). The full Transformer includes additional modules such as the decoder, criterion layer for calculating generation loss, shared embedding, etc. These involve more complex computation flow, for example, the cross attention computation between decoder and encoder layers. Enabling accelerated training for the full Transformer is therefore nontrivial.

There is an additional line of research on general computation acceleration for neural networks, including automatic hardware-oriented compilation and quantized computation. TVM (Chen et al., 2018) uses automatic compilation technology to search for candidate optimizations in the neural network according to specific patterns and automatically merge operators. However, most automatic compilation works only support fixed-length input, and they cannot easily handle variable-length input such as natural sentences for Transformer. Jacob et al. (2018) and Apex improve the data I/O and calculation speed by reducing the precision using quantization. These are beneficial but should be used carefully, since accuracy decreases to a certain extent and accuracy is important in training.

In this paper, we focus on accelerating the training for Transformer models on modern GPUs. We intend to provide a general system-level solution that works for all kinds of models based on Transformer, and all kinds of training algorithms such as stochastic gradient descent (SGD) and adaptive gradient methods. We aim to tackle these challenges, namely optimizing the backward computation in addition to the forward pass, supporting all Transformer layers including encoder and decoder, and supporting variable-length input. The challenge for accelerating training is that inference only requires forward computation, while training also requires back propagation of errors, computing gradients, synchronization among possible multiple GPUs, and updating the parameters. Among them, back-propagation and parameter update require relatively high calculation precision. It is also challenging for full Transformer due to its decoder’s cross attention on its encoder layers.

In this paper, we propose LightSeq2, an efficient library for both training and serving Transformer models. It provides system-level optimization without sacrificing accuracy or changing any training behavior (learning rate, convergence rate, initialization, numeric stability, etc.). LightSeq2 includes three techniques for speedup, namely layer-specific kernels to increase GPU utilization, a fine-grained mixed-precision trainer, and an improved strategy for efficient GPU memory management. LightSeq2 is the first to accelerate the whole process of Transformer training. Table 1 lists the differences between our proposed LightSeq2 and an existing accelerated Transformer training library. In summary, LightSeq2 enjoys the following advantages:

  • Highly efficient. LightSeq2 is fast and memory efficient for training Transformers, as validated in multiple experiments. Notably, LightSeq2 obtains up to 308% speedup and requires only 65% of the GPU memory on eight NVIDIA Ampere A100 GPUs on the WMT14 English-German machine translation task compared to PyTorch.

  • Supporting rich models in Transformer family. LightSeq2 provides comprehensive efficient custom operators, including embedding, encoder layer, decoder layer, and criterion. These enable BERT (encoder-only), GPT (decoder-only), full Transformer with encoder-decoder, etc. It is therefore suitable for almost all NLP tasks such as text classification, generation, summarization, and machine translation.

  • Flexible usage. In addition to manually integrating the custom layers in model codes, the users can also use LightSeq2 in popular training libraries without code modification. The library provides seamless integration with PyTorch and TensorFlow.

2. Background

Figure 2. Transformer architecture for German-English machine translation.

2.1. Transformer Models

The critical idea of Transformer is using multi-head attention to map the representation of the tokens in the sequence to different feature spaces. In Figure 2, we take the machine translation task as an example. The encoder first calculates the attention between any two input tokens and then obtains the attention outputs. Additionally, it uses a feed-forward network to enhance the representations. For the decoder, most calculations are the same as the encoder, except that each token only computes attention on previous tokens, and there is a cross attention layer after the self-attention. To obtain the final embedding of each token, the model also needs an embedding layer that incorporates its position embedding. After obtaining the outputs of the decoder, the model requires an output layer to generate the probability of each token.

The Transformer-based models have variable-length intermediate tensors, which may lead to frequent allocation and release of GPU memory. There are many ways to deal with this problem. TurboTransformers (Fang et al., 2021) uses a sequence-length-aware allocator in their inference library to maximize the sharing of non-dependent GPU memory and reduce memory usage. However, this still dynamically allocates and releases GPU memory, which slows down model inference. LightSeq (Wang et al., 2021) allocates the maximal GPU memory in advance to prevent dynamic allocation and release during training, so the speed is faster.

Figure 3. Data parallel training on two GPUs.

2.2. Model Training

As shown in Figure 3, there are four stages during each iteration of data parallel training.

  1. The model receives input data and then performs forward propagation, getting the final loss data.

  2. The model performs backward propagation using the loss calculated after forward propagation, generating the gradients of all parameters.

  3. Gradients are gathered from all devices and then the averaged gradients are computed and broadcast to each device. There are two major families to complete this process, all-reduce (Patarasuk and Yuan, 2009) and Parameter Server (PS) (Li et al., 2014).

  4. All parameters in each device are updated using the averaged gradients. Since the initial state and gradient of the parameters on each device are the same, the parameters remain the same after one update stage.
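The four stages can be illustrated on a toy problem. The following is a minimal single-process simulation in Python, not LightSeq2 code: the two "devices" are simply list entries, the model is the scalar map y = w * x with squared loss, and the helper names (forward_backward, all_reduce_mean, train_step) are hypothetical.

```python
# Toy data-parallel training of y = w * x with squared loss.
# Two simulated "devices" hold identical weights and different data shards.

def forward_backward(w, shard):
    """Stages 1-2: mean loss and gradient dL/dw on one device's shard."""
    n = len(shard)
    loss = sum((w * x - y) ** 2 for x, y in shard) / n
    grad = sum(2 * (w * x - y) * x for x, y in shard) / n
    return loss, grad

def all_reduce_mean(grads):
    """Stage 3: every device receives the average of all local gradients."""
    avg = sum(grads) / len(grads)
    return [avg] * len(grads)

def train_step(weights, shards, lr=0.1):
    local = [forward_backward(w, s) for w, s in zip(weights, shards)]
    synced = all_reduce_mean([g for _, g in local])
    # Stage 4: identical weights + identical gradients => identical updates.
    return [w - lr * g for w, g in zip(weights, synced)]

shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]  # data from y = 2x
weights = [0.0, 0.0]  # one replica of w per device
for _ in range(50):
    weights = train_step(weights, shards)
```

Because every replica starts from the same value and applies the same averaged gradient, the replicas stay identical without ever exchanging the weights themselves.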

The speed bottleneck of the first and second stages is mainly in computing, which depends on fast CUDA kernels. The last two stages, however, require better memory management, which speeds up the copies of parameters and gradients. Many works are devoted to accelerating these four stages, such as DeepSpeed, Apex, and DDP (Li et al., 2020b). However, no existing work accelerates the complete training process. Figure 4 is a visualization of the four training stages. Model computing and parameter updates account for a large proportion of the time. With LightSeq2, the time of the three stages is greatly reduced, especially the parameter updates.

Different from model inference which only has the forward propagation stage, model training is more challenging. First, model training requires higher accuracy. Otherwise, the error will be magnified after constant parameter updates. Second, model training requires better memory management due to the need for maintaining gradients and activation checkpointing used by backward propagation. Finally, model training requires a fast trainer for parameter updates. However, for Transformer models, the training does not need to deal with incremental length in auto regressive decoding, which is simpler than model inference.

Figure 4. Time cost for PyTorch and LightSeq2 based on WMT-14 English-German machine translation task, using Transformer-big (batch size=232, seq. length=30).

3. The LightSeq2 System

This section will introduce the three techniques used in LightSeq2 in detail, including kernel fusion, memory-efficient mixed-precision trainer, and memory management.

3.1. Kernel Fusion

Kernel fusion focuses on speeding up the CUDA kernels, thereby accelerating the computing process of training. We optimize all kernels in Transformer layers (i.e., encoder and decoder layer), embedding layer, and criterion layer.

3.1.1. Transformer Layers

Existing training libraries (e.g., DeepSpeed) only fuse kernels for the encoder, thus can only be used in models like BERT. To expand the scope of use, we extend to the decoder to accelerate models requiring a Transformer decoder (e.g., GPT).

There are two types of kernels. One is GEMM kernels, including linear transformation and scaled dot product. The other is non-GEMM kernels, such as Dropout, layer normalization, and Softmax. As GEMM has already been handled by the cuBLAS library efficiently, we focus on non-GEMM kernels here. Further subdivided, there are two categories of non-GEMM kernels. One is element-wise kernels (e.g., Dropout, ReLU, Reshape and bias adding). The independence between processing any two different tensor elements enables explicit parallelism and multi-kernel fusion. The other involves batch reduction operations, such as LayerNorm and Softmax, which require synchronization between threads.

Our optimized Transformer structure is depicted in Figure 5, where yellow boxes represent GEMM kernels and blue boxes represent custom non-GEMM kernels. Adjacent fine-grained element-wise kernels are fused into one coarse-grained kernel, resulting in fewer kernel launches and intermediate results. For example, the last kernel of the self-attention layer implements bias adding, dropout, and residual kernels with only one kernel launch.
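The effect of fusing element-wise kernels can be mimicked on the CPU. The sketch below is only an analogy written in Python, with hypothetical function names: each list comprehension stands in for one kernel launch, and the dropout mask is assumed to be given as 0/1 values. The unfused version makes three passes and materializes two intermediate lists, while the fused version makes a single pass.

```python
def bias_add(x, b):
    return [xi + bi for xi, bi in zip(x, b)]

def dropout(x, mask, scale):
    return [xi * mi * scale for xi, mi in zip(x, mask)]

def residual(x, r):
    return [xi + ri for xi, ri in zip(x, r)]

def unfused(x, b, mask, r, p):
    # Three passes over the data, two intermediate results ("three launches").
    scale = 1.0 / (1.0 - p)
    return residual(dropout(bias_add(x, b), mask, scale), r)

def fused(x, b, mask, r, p):
    # One pass, no intermediate results ("one launch").
    scale = 1.0 / (1.0 - p)
    return [(xi + bi) * mi * scale + ri
            for xi, bi, mi, ri in zip(x, b, mask, r)]

x = [0.5, -1.0, 2.0, 3.5]
b = [0.1, 0.1, 0.1, 0.1]
mask = [1, 0, 1, 1]          # 0 = dropped, 1 = kept
r = [1.0, 1.0, 1.0, 1.0]     # residual input
assert unfused(x, b, mask, r, 0.1) == fused(x, b, mask, r, 0.1)
```

On a GPU the saving comes from reading and writing global memory once instead of three times, which this analogy does not measure.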

Figure 5. The structure of optimized transformer with pre-LayerNorm

In the following, we focus on two batch reduction kernels that take a long time in the training, including LayerNorm and Softmax.

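LayerNorm Recall that the LayerNorm normalizes the inputs using

$$y_i = w_i \frac{x_i - \mu(x)}{\sigma(x)} + b_i,$$

where $\mu(x)$ and $\sigma(x)$ stand for the mean and standard deviation of $x$ respectively. Both are batch reductions. While warp-level parallelism provided by CUDA allows inter-thread communication for batch reduction over $x$, it requires thread synchronization, which hampers instruction-level parallelism.

A naive implementation of forward calculation introduces two sequential thread synchronizations for $\mu(x)$ and $\sigma(x)$ respectively. To avoid the dependence between reductions, TurboTransformers proposed a new formula that runs two synchronizations in parallel:

$$\sigma(x) = \sqrt{\mu(x^2) - \mu(x)^2}.$$

This inspires us to rearrange the equation for backward calculation, which parallels synchronizations as well:

$$\nabla x_i = \frac{w_i \nabla y_i}{\sigma(x)} + \alpha \hat{x}_i + \beta, \qquad \hat{x}_i = \frac{x_i - \mu(x)}{\sigma(x)},$$

where $\alpha$ and $\beta$ are coefficients that can be solved in parallel:

$$\alpha = -\frac{1}{m\,\sigma(x)} \sum_{j} w_j \nabla y_j \hat{x}_j, \qquad \beta = -\frac{1}{m\,\sigma(x)} \sum_{j} w_j \nabla y_j.$$

Here $m$ is the dimension of $x$, and $\nabla x_i$ and $\nabla y_i$ are the gradients of the $i$-th elements of input and output tensors respectively. The two batch reductions $\sum_j w_j \nabla y_j \hat{x}_j$ and $\sum_j w_j \nabla y_j$ can be executed in parallel.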

As LayerNorm is sensitive to floating-point precision, we keep the parameters stored in 16-bit floating point (FP16) and cast them to 32-bit floating point (FP32) for computation, which avoids additional I/O cost.
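The idea of independent reductions can be sketched in plain Python. This is an illustration under stated assumptions, not the LightSeq2 kernel: it assumes the standard LayerNorm definition y_i = w_i (x_i - mu) / sigma + b_i without an epsilon term, computes sigma from the identity sigma^2 = mean(x^2) - mean(x)^2, and writes the backward pass as dx_i = w_i dy_i / sigma + alpha * xhat_i + beta, where alpha and beta each depend on one reduction. The function names are hypothetical.

```python
import math

def layernorm_forward(x, w, b):
    m = len(x)
    # Two reductions with no dependence between them: sum(x) and sum(x^2).
    s1 = sum(x)
    s2 = sum(xi * xi for xi in x)
    mu = s1 / m
    sigma = math.sqrt(s2 / m - mu * mu)
    y = [wi * (xi - mu) / sigma + bi for xi, wi, bi in zip(x, w, b)]
    return y, mu, sigma

def layernorm_backward(x, w, dy, mu, sigma):
    m = len(x)
    xhat = [(xi - mu) / sigma for xi in x]
    g = [wi * dyi for wi, dyi in zip(w, dy)]
    # Again two independent reductions.
    r1 = sum(g)
    r2 = sum(gi * xh for gi, xh in zip(g, xhat))
    alpha = -r2 / (m * sigma)
    beta = -r1 / (m * sigma)
    return [gi / sigma + alpha * xh + beta for gi, xh in zip(g, xhat)]

x = [0.5, -1.2, 3.0, 0.7]
w = [1.0, 0.5, -2.0, 1.5]
b = [0.1, 0.2, 0.3, 0.4]
dy = [0.3, -0.7, 1.1, 0.2]   # upstream gradient
y, mu, sigma = layernorm_forward(x, w, b)
dx = layernorm_backward(x, w, dy, mu, sigma)
```

The single-pass variance can lose accuracy through cancellation when the mean is large relative to the spread, which is one reason to carry out the arithmetic in higher precision than the storage format.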

Softmax The next one is Softmax. The forward process of it can be expressed as

$$y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}.$$

For numerical stability, especially for mixed-precision training, it takes three steps to avoid overflow:

  1. Find the maximal element of $x$, denoted as $x_{\max}$.

  2. Subtract $x_{\max}$ from each element in $x$ so that the exponential never overflows, and then calculate the partition function $Z = \sum_j \exp(x_j - x_{\max})$.

  3. Calculate Softmax using $y_i = \exp(x_i - x_{\max}) / Z$.

Both steps 1 and 2 are reduction operations. We use the CUB library for efficiency, which weakens the dependence between elements using an extra small buffer.
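The three steps can be written out directly. The following Python sketch is a sequential reference, not the CUDA kernel; the function name stable_softmax is hypothetical, and the max and sum calls play the role of the two reductions.

```python
import math

def stable_softmax(x):
    x_max = max(x)                               # step 1: max reduction
    exps = [math.exp(xi - x_max) for xi in x]    # step 2: shift, exponentiate
    z = sum(exps)                                #         sum reduction
    return [e / z for e in exps]                 # step 3: normalize

# math.exp(1000.0) alone would raise OverflowError; the shifted form does not.
probs = stable_softmax([1000.0, 1001.0, 1002.0])
```

Shifting by the maximum leaves the result unchanged mathematically, so these inputs give the same probabilities as [0.0, 1.0, 2.0].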

We tried different combinations of block size, grid size, and buffer size for various sequence lengths to find the best settings. During backward, we allocate four warps in each block to run synchronizations in parallel. This tolerates a higher latency of the fadd instruction, whose result is a source register required by the next shfl_down instruction (Fang et al., 2021).

In general, we have implemented all CUDA kernels using the float4 data type, increasing the bandwidth of data I/O. At the same time, we also optimize the kernels for commonly used input sizes. Compared with the implementation of DeepSpeed, FasterTransformer, and TurboTransformers, our kernels are much faster.

3.1.2. Embedding Layer

Figure 6. The forward computation of an embedding layer.

The embedding layer is widely used in most deep learning tasks to obtain the distributed representation of one sentence or image. Given a token embedding lookup table $E$ and positional embedding lookup table $P$, we can get the representation $w^{(p)}$ of one token with index $t$ and position $p$:

$$w^{(p)} = \mathrm{Dropout}(s \cdot E_t + P_p),$$

where $s$ is the embedding scale. Here we use sinusoidal positional embedding which does not require training.

Considering an input sentence $t_0, \dots, t_{l-1}$ with length $l$, let $M^{(i)}$ denote the Dropout mask generated in the forward propagation for position $i$. We can efficiently compute the gradient of token $t$ in the embedding table:

$$\nabla E_t = s \cdot \sum_{0 \le i < l,\; t_i = t} M^{(i)} \odot \nabla w^{(i)},$$

where $\odot$ represents element-wise product. This means summing all gradients of tokens in the different positions in the sentence, which can be implemented in parallel by atomicAdd operation in CUDA to avoid interference from other threads.
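The gradient accumulation into the embedding table is a scatter-add, which the following Python sketch spells out sequentially. It is an illustration only: the function name and argument layout are hypothetical, and the mask entries are assumed to already contain the Dropout keep/drop factor for each element.

```python
def embedding_backward(token_ids, grad_out, masks, scale, vocab_size):
    """Sum the masked, scaled output gradients of every position into the
    row of the token that occupies that position."""
    dim = len(grad_out[0])
    grad_table = [[0.0] * dim for _ in range(vocab_size)]
    for t, g, m in zip(token_ids, grad_out, masks):
        for d in range(dim):
            # On a GPU, positions run in parallel and this += would be an
            # atomicAdd, since several positions may hold the same token.
            grad_table[t][d] += scale * m[d] * g[d]
    return grad_table

token_ids = [2, 0, 2]                              # token 2 appears twice
grad_out = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]    # one gradient row per position
masks = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
grad_table = embedding_backward(token_ids, grad_out, masks, scale=2.0, vocab_size=3)
```

Row 2 of grad_table receives contributions from positions 0 and 2, which is exactly the case where concurrent threads would collide without atomic updates.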

3.1.3. Criterion Layer

The criterion layer is used to compute the loss between the model output and ground truth. First, let $h$, a vector of length $V$, denote the output of the decoder for one token, where $V$ is the vocabulary size, and let $y$ denote the one-hot vector of length $V$ representing the ground truth. The prevalent cross-entropy loss with label smoothing can be formulated as

$$\mathcal{L} = -\sum_{i} p_i \log q_i,$$

where $p = (1 - \alpha)\, y + \frac{\alpha}{V} \cdot \mathbf{1}$, $q = \mathrm{Softmax}(h)$, and $\alpha$ is the smoothing parameter.

By plugging $p$ and $q$ into the loss function and calculating the partial derivatives with respect to $h_i$, we get the gradient of decoder output:

$$\nabla h_i = q_i - p_i = \begin{cases} q_i - \frac{\alpha}{V} - (1 - \alpha), & \text{if token } i \text{ is the ground truth,} \\ q_i - \frac{\alpha}{V}, & \text{otherwise,} \end{cases}$$

which is an element-wise kernel of $q$ and can be executed in parallel. Recall that our Softmax kernel takes three steps to compute. We can slightly modify the last step with additional logarithmic operations and bias adding for forward and backward calculation respectively.
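A small reference implementation makes the element-wise structure of the gradient visible. The sketch below is written under assumptions: it uses the common formulation of label smoothing in which the target distribution is p = (1 - alpha) * onehot + alpha / V and q = Softmax(h), so that the gradient with respect to h is q - p. The function name is hypothetical.

```python
import math

def label_smoothed_ce(h, target, alpha):
    """Return (loss, gradient w.r.t. h) for one token."""
    V = len(h)
    h_max = max(h)
    # log-sum-exp with the same max-shift used by the stable Softmax
    log_z = h_max + math.log(sum(math.exp(hi - h_max) for hi in h))
    log_q = [hi - log_z for hi in h]
    p = [alpha / V + (1.0 - alpha) * (1.0 if i == target else 0.0)
         for i in range(V)]
    loss = -sum(pi * lq for pi, lq in zip(p, log_q))
    grad = [math.exp(lq) - pi for lq, pi in zip(log_q, p)]  # q_i - p_i
    return loss, grad

h = [1.0, 2.0, 0.5, -1.0]
loss, grad = label_smoothed_ce(h, target=1, alpha=0.1)
```

Each gradient entry needs only its own q_i once the Softmax reductions are done, which is why the backward step maps onto an element-wise kernel.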

Figure 7. Mixed-precision trainer in LightSeq2. Here orange and blue blocks represent the weights and gradients respectively. Dotted blocks represent symbolic tensor links, which have no actual memory storage. We abandon the FP32 copies of weights/gradients and directly update the FP16 workspace with on-the-fly conversion.

3.2. Memory-Efficient Mixed-Precision Trainer

In mixed precision training (Micikevicius et al., 2018), weights and gradients are in FP16 during forward and backward propagation. Since the updated value, which is the product of learning rate and gradient, is often tiny, the trainer will maintain FP32 copies of weights and gradients to complete the update in FP32. As shown in the left part of Figure 7, each piece of gradients/weights in the model will be copied to/from its FP32 partner in one training step. The trainer kernel will load the FP32 gradient to update the FP32 weight. This mechanism introduces two defects:

  1. Numerous pieces of gradients/weights lead to multiple fast-returning GPU kernels like copying and updating, which reduce GPU utilization.

  2. Redundant memory footprints are caused by the FP32 copy of gradients/weights.

We alleviate them with a symbolic tensor link and an on-the-fly conversion kernel. As shown in the right part of Figure 7, during the initialization of the trainer, we copy all pieces of weights/gradients in order into one tensor called workspace. Then we reset and link them as fragments of workspace. During each training step, we only need to execute the trainer kernel once to update the workspace, which avoids launching a huge number of fragmented GPU kernels on every piece of weights/gradients.

Our trainer kernel loads the FP16 weight/gradient from workspace to register and converts it on-the-fly to FP32. Then the weight on register is updated as usual. Finally, the weight is converted on-the-fly to FP16 and saved to workspace. Accessing memory with FP16 instead of FP32 reduces the data movement by half and avoids the FP32 copies of weights and gradients.
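The workspace idea can be mimicked with the Python standard library, whose struct module supports the IEEE half-precision format code "e". The sketch below is an assumption-laden illustration, not the LightSeq2 trainer: the class and method names are hypothetical, plain SGD replaces the actual optimizer, and Python floats (64-bit) stand in for the FP32 registers.

```python
import struct

class Fp16Workspace:
    """One contiguous FP16 buffer; each parameter is an (offset, length) link."""

    def __init__(self, named_params):
        self.links = {}
        values = []
        for name, vals in named_params.items():
            self.links[name] = (len(values), len(vals))
            values.extend(vals)
        self.size = len(values)
        self.weights = bytearray(struct.pack(f"<{self.size}e", *values))
        self.grads = bytearray(2 * self.size)   # zero-initialized FP16 gradients

    def read(self, buf, name):
        off, n = self.links[name]
        return list(struct.unpack_from(f"<{n}e", buf, 2 * off))

    def write_grad(self, name, vals):
        off, _ = self.links[name]
        struct.pack_into(f"<{len(vals)}e", self.grads, 2 * off, *vals)

    def sgd_step(self, lr):
        # A single pass over the whole workspace instead of one per parameter.
        for i in range(self.size):
            (w,) = struct.unpack_from("<e", self.weights, 2 * i)  # FP16 -> float
            (g,) = struct.unpack_from("<e", self.grads, 2 * i)
            w -= lr * g                                   # update in high precision
            struct.pack_into("<e", self.weights, 2 * i, w)        # float -> FP16

ws = Fp16Workspace({"a": [1.0, 2.0], "b": [0.5]})
ws.write_grad("a", [0.5, 1.0])
ws.write_grad("b", [-1.0])
ws.sgd_step(lr=0.5)
```

No separate high-precision copy of the weights or gradients is ever stored; the only persistent state is the two half-precision buffers, two bytes per element each.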

The cooperation between the symbolic tensor link and the on-the-fly conversion kernel leads to both memory savings and latency reduction without hurting accuracy. Experimental results on the Transformer-big model show that the proposed trainer reduces the memory usage by 2GB and the runtime by 54.9% compared to the Fairseq trainer with high kernel fusion from Apex.

Figure 8. Memory manager reuses pre-allocated memory as much as possible: tensors in each column on the right side share the same pre-allocated memory block.

3.3. Memory Management

Recent studies demonstrated that training with a large batch leads to fast convergence and higher accuracy. However, large batch training requires more GPUs, gradient accumulation, or memory offload due to the memory limit, which increases the time cost or demand for hardware resources.

We reduce the memory footprint by compacting memory with fewer allocations and releases at no extra cost. Like DeepSpeed, we divide the GPU memory into permanent memory with a fixed size, storing parameters and their gradients, and temporary memory with variable size to store intermediate states. To avoid the frequent allocation of temporary memory, we scan the training set and estimate the upper bound of the capacity. Thus temporary memory is allocated once with the maximal size before training starts, reused for different batches, and released after training finishes.
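The allocate-once strategy can be sketched as follows. This Python example is a simplified illustration with hypothetical names: each batch is described by the list of its sequence lengths, the capacity estimate covers a single padded B x L x H activation, and a CPU array stands in for GPU memory.

```python
from array import array

def temp_capacity(batches, hidden):
    """Upper bound, in elements, of one padded B x L x H tensor over all batches."""
    return max(len(b) * max(b) for b in batches) * hidden

class TempBuffer:
    def __init__(self, capacity):
        self.storage = array("f", [0.0]) * capacity   # allocated once, up front

    def view(self, n_elements):
        if n_elements > len(self.storage):
            raise MemoryError("batch exceeds pre-allocated capacity")
        return memoryview(self.storage)[:n_elements]  # a window, not a copy

batches = [[5, 9, 7], [30, 12], [4, 4, 4, 4]]   # sequence lengths per batch
hidden = 8
buf = TempBuffer(temp_capacity(batches, hidden))
for b in batches:
    act = buf.view(len(b) * max(b) * hidden)    # reuse the same storage
    act[0] = 1.5                                # write through the view
```

Every batch works inside a window of the same storage, so the footprint stays constant however the sequence lengths vary.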

Figure 8 shows the temporary memory cost in the self-attention backward process. The left side describes the steps of backpropagation. Each row on the right side lists the memory occupancy of temporary tensors in one step. Tensors in the same column reuse the same memory block. The sizes of the orange and purple tensors are and , respectively. The tensors in the dashed memory blocks are not updated in that step, and those in the solid blocks are updated. We can see that only (the first three blocks) (the last block) bytes of memory are required, where denote the batch size, hidden dimension, max sequence length and the number of heads respectively. In contrast, without the shared memory block strategy, a total of bytes of memory is required.

(a) 6e6d on V100.
(b) 12e12d on V100.
(c) 24e24d on V100.
(d) 6e6d on A100.
(e) 12e12d on A100.
(f) 24e24d on A100.
Figure 9. Training speed comparison on machine translation using different numbers of Transformer layers on both V100 and A100 GPUs. 6e6d denotes 6 encoder layers and 6 decoder layers.

4. Experiments

We evaluate the performance of LightSeq2 with different Transformer sizes on both NVIDIA Tesla V100 and Ampere A100, based on both PyTorch and TensorFlow. See each subsection for specific experimental configurations.

4.1. Example Usage

Figure 10. An example of using LightSeq2.

We provide C++ and Python APIs for convenient usage. Figure 10 is a code snippet that creates a PyTorch-based encoder layer using LightSeq2. Lines 2-8 provide the configuration information of the encoder layer. The encoder layer is then created from this configuration in Line 9.

4.2. Main Results

We run experiments on three different tasks to evaluate the speed of LightSeq2. The first is machine translation to show the ability in NLP tasks. The second is image classification to show the ability in CV tasks. The third is BERT fine-tuning to show the ability of encoders in NLP tasks.

4.2.1. Performance on Machine Translation

We compare LightSeq2 with the PyTorch implementation and Apex optimization on the WMT14 English-German machine translation task using 8 V100 and A100 GPUs. Here we compare the speeds of three structures with different numbers of layers (e.g., 6e6d represents six encoder and six decoder layers) of Transformer (Vaswani et al., 2017). We use Fairseq (Ott et al., 2019) as the PyTorch baseline and enable the Apex optimization for further speedup.

The results are shown in Figure 9. On both V100 and A100, LightSeq2 is much faster than PyTorch and Apex. The speedup ratio decreases as the batch token size increases. Overall, LightSeq2 obtains a speedup of 1.4-2.8x on V100 and 1.5-3.5x on A100. Larger models obtain larger speedups in our case because the more layers a model has, the higher the proportion of model calculation time and the greater the room for LightSeq2 optimization. However, as the number of layers increases, the GPU memory occupied by the model increases too. Thanks to the efficient memory management of LightSeq2, such models can still be trained under several large batch token sizes. Comparing V100 and A100, we find that A100 obtains a higher speedup ratio due to its new Ampere architecture. Finally, Apex slightly improves the performance of PyTorch but still has a big gap with LightSeq2.

Additionally, we compare the speedup on both PyTorch and TensorFlow using different numbers of V100 GPUs. We use Fairseq as the PyTorch baseline and NeurST as the TensorFlow baseline, enabling XLA optimization. We train the same machine translation task as above using standard Transformer-big models.

As shown in Figure 11, speedup ratios on 8 GPUs are lower than on 1 GPU, mainly due to the gradient synchronization between multiple GPUs. As the batch token size increases, the gap gradually narrows. The speedup of TensorFlow is slightly lower than PyTorch under different batch token sizes because we only integrate the encoder and decoder into NeurST.

(a) PyTorch.
(b) TensorFlow.
Figure 11. LightSeq2 speedup on different numbers of GPUs based on both PyTorch and TensorFlow.

4.2.2. Performance on Image Classification

We compare LightSeq2 with the PyTorch implementation of Vision Transformer (ViT) (Dosovitskiy et al., 2021) on the image classification task using the CIFAR-10 dataset and 8 V100 GPUs. We use two different model architectures with patch size 32. ViT-B-32 and ViT-L-32 represent the base and large ViT models with patch size 32, respectively. The resolutions of the images are all $224 \times 224$. Thus the sequence lengths of the inputs of the ViT are all 50 (49 patches plus one class token).
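The sequence length of 50 follows from simple arithmetic, assuming 224 x 224 inputs (the resolution implied by a length of 50 at patch size 32) and one extra class token as in the standard ViT design. The helper name below is hypothetical.

```python
def vit_sequence_length(image_size, patch_size, class_token=True):
    """Number of input tokens for a square image split into square patches."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if class_token else 0)

seq_len = vit_sequence_length(224, 32)   # 7 * 7 patches + 1 class token = 50
```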

(a) ViT-B-32.
(b) ViT-L-32.
Figure 12. LightSeq2 speedup on image classification compared with PyTorch.

The results are shown in Figure 12. In all cases, LightSeq2 outperforms the PyTorch implementation. As the batch size increases, the speedup ratio decreases, consistent with the trend on NLP tasks. With a batch size of 16 using ViT-B-32, LightSeq2 obtains the highest speedup ratio of 1.7x.

4.2.3. Performance of Training BERT

We compare LightSeq2 with DeepSpeed on the Microsoft Research Paraphrase Corpus (MRPC) task of the General Language Understanding Evaluation (GLUE) benchmark using the Hugging Face Transformers BERT examples, because DeepSpeed only supports Transformer encoders.

The results are shown in Table 2. From this table, we can draw a few conclusions:

  • The speedup of BERT-Base is higher than BERT-Large, mainly due to the smaller proportion of matrix multiplication.

  • LightSeq2 is much more helpful for FP16 training than FP32 training, because FP16 training can make full use of the latest features of the V100 GPUs.

  • LightSeq2 is much faster than DeepSpeed due to our better optimized CUDA kernels and memory-efficient trainer. In the typical (BERT-Base, 8 GPUs, FP16) case, LightSeq2 can obtain a speedup of 1.64x compared to original Hugging Face BERT.

We do not integrate the LightSeq2 embedding, criterion, and trainer in this experiment for a fair comparison. With them integrated, training would be even faster.

Models      nGPUs  Libraries  FP32  FP16
BERT-Base   1      PyTorch     124   224
BERT-Base   1      DeepSpeed   136   276
BERT-Base   1      LightSeq2   148   380
BERT-Base   8      PyTorch     928  1582
BERT-Base   8      DeepSpeed  1019  1804
BERT-Base   8      LightSeq2  1097  2601
BERT-Large  1      PyTorch      41    95
BERT-Large  1      DeepSpeed    44   114
BERT-Large  1      LightSeq2    45   136
BERT-Large  8      PyTorch     322   680
BERT-Large  8      DeepSpeed   344   838
BERT-Large  8      LightSeq2   354  1074
Table 2. Speed comparison on MRPC task using different BERT structures and different numbers of GPUs. Here we use samples per second to measure the speed.
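The speedup ratios quoted in the text can be recomputed from the throughput numbers in Table 2. The short Python script below only restates the table and divides; the dictionary layout and the helper name speedup are choices made for this illustration.

```python
# Samples per second from Table 2, stored as (FP32, FP16).
THROUGHPUT = {
    ("BERT-Base", 1):  {"PyTorch": (124, 224),  "DeepSpeed": (136, 276),   "LightSeq2": (148, 380)},
    ("BERT-Base", 8):  {"PyTorch": (928, 1582), "DeepSpeed": (1019, 1804), "LightSeq2": (1097, 2601)},
    ("BERT-Large", 1): {"PyTorch": (41, 95),    "DeepSpeed": (44, 114),    "LightSeq2": (45, 136)},
    ("BERT-Large", 8): {"PyTorch": (322, 680),  "DeepSpeed": (344, 838),   "LightSeq2": (354, 1074)},
}
PRECISION = {"FP32": 0, "FP16": 1}

def speedup(model, n_gpus, precision, library, baseline="PyTorch"):
    """Throughput of `library` divided by throughput of `baseline`."""
    row = THROUGHPUT[(model, n_gpus)]
    col = PRECISION[precision]
    return row[library][col] / row[baseline][col]

typical = speedup("BERT-Base", 8, "FP16", "LightSeq2")             # 2601 / 1582
vs_deepspeed = speedup("BERT-Base", 8, "FP16", "LightSeq2", "DeepSpeed")
```

The first ratio rounds to the 1.64x figure given for the typical (BERT-Base, 8 GPUs, FP16) case.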

4.3. Kernel Efficiency

Figure 13. Comparison between different implementations of LayerNorm. The x-axis coordinate values are all base-2 logarithms (e.g., (12, 8) represents batch token size 4096 and hidden dimension 256).
(a) Dropout.
(b) Softmax.
Figure 14. Comparisons between different implementations of Dropout and Softmax.
(a) Forward.
(b) Backward.
Figure 15. LightSeq2 speedup of different layers.

We provide convenient secondary development tools to evaluate the running time and correctness of custom CUDA kernels and layers. We evaluate four different implementations of three common operations: element-wise kernel Dropout, batch reduction kernels Softmax and LayerNorm on one V100 GPU.

First, in Figure 13, we compare the speedup of LayerNorm. LightSeq2 keeps a speedup of about 4x regardless of the batch token size and hidden dimension. However, as the batch token size or hidden dimension increases, the acceleration ratio of DeepSpeed drops significantly. This is because we use a faster CUDA implementation, which can process more elements in parallel. If the number of elements is huge, the speed of DeepSpeed is not even as good as PyTorch. On the other hand, TensorFlow is not as fast as PyTorch in most cases, except when the number of elements is very large.

Then in Figure 14(a), we compare the speedup of Dropout. As the number of elements increases, both DeepSpeed and LightSeq2 become slower. When the number of elements is greater than 5 million, DeepSpeed becomes slower than PyTorch. The gap between TensorFlow and PyTorch becomes smaller, but LightSeq2 still has a speedup of 1.2x to 1.5x.

Finally, in Figure 14(b), we compare the speedup of Softmax. Unlike the other kernels, as the batch size and sequence length increase, the speedup of LightSeq2 becomes larger, mainly due to the specific optimization for different input shapes. The trends of DeepSpeed and TensorFlow are similar to the other kernels.

4.4. Layer Speed

We compare the layer speeds of LightSeq2 with PyTorch (Fairseq) implementations on one V100 GPU. We use batch size 32 in all cases and use only one layer for encoder and decoder. The hidden dimensions are all 1024, which are the same as in Transformer-big models.

The results of forward and backward propagation are shown in Figure 15. We can draw several conclusions from it:

  • LightSeq2 can obtain higher speedup in forward propagation than in backward. This may be because the time of backpropagation includes the gradient copies.

  • The speedup ratios of the encoder and decoder decrease rapidly as the sequence length becomes larger. However, the speedups of embedding and criterion are stable. This is mainly due to the relatively small overall calculation of embedding and criterion, so that the GPU can process more elements for long sequences in parallel.

In all cases, LightSeq2 layers are faster than PyTorch, especially when the sequence length is small. We provide Python wrappers of these layers for convenient usage, and the users can flexibly create and train them.

(a) Transformer-base.
(b) Transformer-big.
Figure 16. Comparison of GPU memory on machine translation task with one V100.
(a) Transformer-base.
(b) Transformer-big.
Figure 17. Comparison of GPU utilization on machine translation task with one V100.

4.5. Memory Usage

We compare the GPU memory and utilization between LightSeq2 and PyTorch on WMT14 English-German machine translation task using the same V100 GPU. We use Fairseq as the code base of PyTorch, and both use a batch token size of 8192. All experiments ran for 40 minutes to fairly compare the GPU situation.

Figure 16 illustrates the GPU memory occupation of both Transformer-base (6e6d, 512d, 8 heads) and Transformer-big (6e6d, 1024d, 16 heads) models. PyTorch consumes about 6 GB more GPU memory than LightSeq2 in both cases. For example, PyTorch-based Transformer-base models cannot run on a GPU with only 16 GB of memory. The GPU memory used by PyTorch also grows gradually as training proceeds. This is because PyTorch dynamically allocates and releases GPU memory when sequence lengths differ, so it must request additional GPU memory whenever a longer sequence is fed into the model. In contrast, LightSeq2 allocates the maximum GPU memory in advance, so memory usage does not change during training, and the time spent on allocation and release is saved.
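The contrast between the two strategies can be sketched with a toy model in Python. This is only an illustration under simplifying assumptions: it does not reproduce PyTorch's actual allocator or LightSeq2's memory manager, and the function names and per-step footprints are hypothetical.

```python
from typing import Iterable, List


def dynamic_reserved(step_needs: Iterable[int]) -> List[int]:
    """Reserved memory of a toy allocator that grows only when a step
    needs more than has been reserved so far."""
    reserved = 0
    trace = []
    for need in step_needs:
        reserved = max(reserved, need)  # grow on a new peak, never shrink
        trace.append(reserved)
    return trace


def static_reserved(step_needs: Iterable[int], max_need: int) -> List[int]:
    """Reserved memory when the worst case is allocated once up front."""
    trace = []
    for need in step_needs:
        if need > max_need:
            raise ValueError("step exceeds the pre-allocated capacity")
        trace.append(max_need)
    return trace


# Hypothetical per-step footprints in MB, not values from the paper.
steps = [300, 520, 410, 760, 640, 900]
print(dynamic_reserved(steps))               # rises as longer batches arrive
print(static_reserved(steps, max_need=900))  # flat from the first step
```

In this toy model the dynamic trace keeps rising until the longest batch has been seen, whereas the static trace is constant and requires no allocation calls during training.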

Figure 17 illustrates the GPU utilization of both Transformer-base and Transformer-big models. Throughout training, LightSeq2 keeps a utilization rate of about 99% in both cases. For PyTorch, however, the utilization of the Transformer-base model is very unstable: the lowest value is only 80%, and most of the time it fluctuates between 87% and 93%, mainly due to frequent memory allocation and release. The utilization of the Transformer-big model is much more stable, but its highest value is only 95%.

If the batch token size is smaller, the gap between the two implementations becomes more obvious. For example, when the batch token size is reduced to 4096, the memory utilization of the Transformer-base model based on PyTorch is only 73%, while that of LightSeq2 is as high as 96%.

5. Related Work

Many approaches have been proposed to improve the training efficiency of deep models. They can be divided into two categories: algorithm-specific and algorithm-agnostic.

Algorithm-specific methods accelerate the training process by improving the model architectures (Vyas et al., 2020; Ying et al., 2021; Wang et al., 2020; Fan et al., 2020; Zhang and He, 2020; Peng et al., 2021; Choromanski et al., 2021), training strategies (Liu et al., 2020; Gong et al., 2019; Li et al., 2020a), optimization algorithms (You et al., 2019; Yao et al., 2020), data precision (Zhang et al., 2020; Sun et al., 2019; Wang et al., 2018), and others (Li et al., 2021, 2020c). Vyas et al. (2020) use clustered attention to group queries into clusters and compute attention just for the centroids, resulting in a linear complexity model. Gong et al. (2019) propose a stacking algorithm to transfer knowledge from a shallow model to a deep model, then apply stacking progressively to accelerate BERT training. Yao et al. (2020) propose ADAHESSIAN, a second-order stochastic optimization algorithm that dynamically incorporates the curvature of the loss function via adaptive estimates of the Hessian matrix. These techniques can speed up model training to a certain extent, but they may change the model structure and affect model quality, so they do not generalize well.

Algorithm-agnostic optimization may solve this problem. Apex provides a set of commonly used GPU kernels written in C++ and CUDA, including LayerNorm, Softmax, and the Adam optimizer. It supports automatic mixed precision computation and distributed training. Unlike previous works that change the training behavior, engineering-level optimization strictly follows the training algorithm and has no impact on anything other than speed. LightSeq2 further enhances the performance of the trainer with memory-efficient mixed precision computation without sacrificing accuracy.

DeepSpeed (Rasley et al., 2020) integrates these small kernels into Transformer encoders, which boosts the training throughput on a single GPU and scales well on multiple GPUs. However, DeepSpeed has several limitations that hinder its usage in NLP, CV, and ASR tasks, especially in sequence generation tasks. First, DeepSpeed only optimizes the Transformer encoder and is thus not suitable for tasks requiring decoding modules (e.g., machine translation). Second, DeepSpeed does not optimize other modules such as embedding and criterion, which prevents it from achieving higher performance. Third, DeepSpeed requires the input length to be an integer multiple of 16 due to the implementation of some kernels, which introduces unnecessary padding and computation. In contrast, LightSeq2 supports inputs of arbitrary shape. Fourth, DeepSpeed does not support TensorFlow, which is also widely used in practice.
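The cost of the multiple-of-16 constraint can be estimated with simple arithmetic. The following Python sketch is only an illustration: the helper names `padded_length` and `padding_overhead` are hypothetical, and the overhead is counted in sequence positions rather than in measured run time.

```python
def padded_length(seq_len: int, multiple: int = 16) -> int:
    """Round seq_len up to the nearest multiple."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    return ((seq_len + multiple - 1) // multiple) * multiple


def padding_overhead(seq_len: int, multiple: int = 16) -> float:
    """Fraction of extra positions introduced by rounding up."""
    return (padded_length(seq_len, multiple) - seq_len) / seq_len


for n in (16, 17, 33, 100):
    print(n, padded_length(n), f"{padding_overhead(n):.1%}")
```

The relative overhead is largest for short sequences that fall just above a multiple of 16 and shrinks as sequences grow longer.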

LightSeq (version 1.2) (Wang et al., 2021), TurboTransformers (version 0.5) (Fang et al., 2021), and FasterTransformer (version 4.0) are three recent systems targeting the serving of Transformers. All three exploit manually written CUDA kernels to accelerate the forward computation of layers in a Transformer. They also improve the serving throughput with enhanced batch decoding strategies on GPUs. However, none of them supports Transformer training, since training requires additional backward computation that is more complex than the forward pass.

TVM (Chen et al., 2018) is a compiler that exposes graph-level and operator-level optimizations to provide performance portability for deep learning models. However, because it is limited to fixed sequence lengths, it is difficult to apply to Transformer-based models.

6. Conclusion

In this paper, we describe a series of engineering-based GPU optimization techniques for fast Transformer training. Unlike approaches that alter the training algorithm, our system strictly follows the standard training algorithm and therefore preserves the quality and reproducibility of existing models. We systematically compare our work with existing state-of-the-art systems under various settings and analyze the performance of each component, demonstrating the soundness and scalability of the contribution. Compared to PyTorch and TensorFlow implementations, LightSeq2 can obtain a speedup of up to 3x under different configurations.

In the future, we will unify the training and inference libraries to simplify the process from model training to deployment. In addition, we will apply padding removal, a more efficient memory management strategy, and other acceleration techniques to achieve further speedup.


  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Proc. of NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §1.
  • T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Q. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy (2018) TVM: an automated end-to-end optimizing compiler for deep learning. In Proc. of OSDI, A. C. Arpaci-Dusseau and G. Voelker (Eds.), pp. 578–594. Cited by: §1, §5.
  • K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. J. Colwell, and A. Weller (2021) Rethinking attention with performers. In Proc. of ICLR, Cited by: §5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT, pp. 4171–4186. Cited by: §1, §1.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In Proc. of ICLR, Cited by: §1, §4.2.2.
  • A. Fan, E. Grave, and A. Joulin (2020) Reducing transformer depth on demand with structured dropout. In Proc. of ICLR, Cited by: §5.
  • J. Fang, Y. Yu, C. Zhao, and J. Zhou (2021) TurboTransformers: an efficient GPU serving system for transformer models. In Proc. of PPoPP, J. Lee and E. Petrank (Eds.), pp. 389–402. Cited by: §1, §2.1, §3.1.1, §5.
  • L. Gong, D. He, Z. Li, T. Qin, L. Wang, and T. Liu (2019) Efficient training of BERT by progressively stacking. In Proc. of ICML, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 2337–2346. Cited by: §5.
  • A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100. Cited by: §1.
  • B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. G. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proc. of CVPR, pp. 2704–2713. Cited by: §1.
  • B. Li, Z. Wang, H. Liu, Q. Du, T. Xiao, C. Zhang, and J. Zhu (2021) Learning light-weight translation models from deep transformer. In Proc. of AAAI, pp. 13217–13225. Cited by: §5.
  • B. Li, Z. Wang, H. Liu, Y. Jiang, Q. Du, T. Xiao, H. Wang, and J. Zhu (2020a) Shallow-to-deep training for neural machine translation. In Proc. of EMNLP, pp. 995–1005. Cited by: §5.
  • M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su (2014) Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 583–598. Cited by: item 3.
  • S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania, and S. Chintala (2020b) PyTorch distributed: experiences on accelerating data parallel training. Proc. VLDB Endow. 13 (12), pp. 3005–3018. Cited by: §2.2.
  • Z. Li, E. Wallace, S. Shen, K. Lin, K. Keutzer, D. Klein, and J. Gonzalez (2020c) Train big, then compress: rethinking model size for efficient training and inference of transformers. In Proc. of ICML, Proceedings of Machine Learning Research, Vol. 119, pp. 5958–5968. Cited by: §5.
  • L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2020) On the variance of the adaptive learning rate and beyond. In Proc. of ICLR, Cited by: §5.
  • P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. García, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu (2018) Mixed precision training. In Proc. of ICLR, Cited by: §3.2.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proc. of NAACL-Demonstrations, pp. 48–53. Cited by: §4.2.1.
  • X. Pan, L. Wu, M. Wang, and L. Li (2021) Contrastive learning for many-to-many multilingual neural machine translation. In Proc. of ACL, Cited by: §1.
  • P. Patarasuk and X. Yuan (2009) Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing 69 (2), pp. 117–124. Cited by: item 3.
  • H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. Smith, and L. Kong (2021) Random feature attention. In Proc. of ICLR, Cited by: §5.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2018) Language models are unsupervised multitask learners. Cited by: §1.
  • S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He (2020) ZeRO: memory optimizations toward training trillion parameter models. In Proc. of SC, C. Cuicchi, I. Qualters, and W. T. Kramer (Eds.), pp. 20. Cited by: §1.
  • J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He (2020) DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proc. of KDD, R. Gupta, Y. Liu, J. Tang, and B. A. Prakash (Eds.), pp. 3505–3506. Cited by: §1, §5.
  • X. Sun, J. Choi, C. Chen, N. Wang, S. Venkataramani, V. Srinivasan, X. Cui, W. Zhang, and K. Gopalakrishnan (2019) Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. In Proc. of NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 4901–4910. Cited by: §5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. of NeurIPS, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. Cited by: §1, §4.2.1.
  • A. Vyas, A. Katharopoulos, and F. Fleuret (2020) Fast transformers with clustered attention. In Proc. of NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §5.
  • N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan (2018) Training deep neural networks with 8-bit floating point numbers. In Proc. of NeurIPS, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 7686–7695. Cited by: §5.
  • S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma (2020) Linformer: self-attention with linear complexity. CoRR abs/2006.04768. External Links: 2006.04768 Cited by: §5.
  • X. Wang, Y. Xiong, Y. Wei, M. Wang, and L. Li (2021) LightSeq: a high performance inference library for transformers. In Proc. of NAACL-HLT: Industry Papers, pp. 113–120. Cited by: §1, §2.1, §5.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Proc. of NeurIPS, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 5754–5764. Cited by: §1.
  • Z. Yao, A. Gholami, S. Shen, K. Keutzer, and M. W. Mahoney (2020) ADAHESSIAN: an adaptive second order optimizer for machine learning. CoRR abs/2006.00719. External Links: 2006.00719 Cited by: §5.
  • C. Ying, G. Ke, D. He, and T. Liu (2021) LazyFormer: self attention with lazy update. CoRR abs/2102.12702. External Links: 2102.12702 Cited by: §5.
  • Y. You, J. Li, J. Hseu, X. Song, J. Demmel, and C. Hsieh (2019) Reducing BERT pre-training time from 3 days to 76 minutes. CoRR abs/1904.00962. External Links: 1904.00962 Cited by: §5.
  • X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer (2021) Scaling vision transformers. CoRR abs/2106.04560. External Links: 2106.04560 Cited by: §1.
  • M. Zhang and Y. He (2020) Accelerating training of transformer-based language models with progressive layer dropping. In Proc. of NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Cited by: §5.
  • X. Zhang, S. Liu, R. Zhang, C. Liu, D. Huang, S. Zhou, J. Guo, Q. Guo, Z. Du, T. Zhi, and Y. Chen (2020) Fixed-point back-propagation training. In Proc. of CVPR, pp. 2327–2335. Cited by: §5.