1. Introduction
Transformers (Vaswani et al., 2017)
have become a prevailing neural architecture in many natural language processing (NLP), computer vision (CV), and automatic speech recognition (ASR) applications
(Devlin et al., 2019; Radford et al., 2018; Yang et al., 2019; Zhai et al., 2021; Gulati et al., 2020). Variants of the Transformer achieve state-of-the-art accuracy in text classification, question answering, machine translation, and visual object recognition tasks (Devlin et al., 2019; Pan et al., 2021; Dosovitskiy et al., 2021). Transformer models typically require large model sizes and large training data to perform well. For example, a GPT-3 model requires 3.1 million hours of training on modern GPUs, and a single trial is estimated to cost $4.6 million
(Brown et al., 2020). Figure 1 shows the sizes and estimated training costs of several popular Transformer models. The training cost increases roughly in proportion to the number of model parameters. With ever-growing model sizes, training becomes increasingly expensive, so there is a critical need for optimizing the computation of Transformers.

Libraries  | Training Components (Embedding / Encoder / Decoder / Criterion / Trainer) | Sequence Length | DL Frameworks
DeepSpeed  | -- / yes / -- / -- / yes | Multiples of 16 | PyTorch
LightSeq2  | yes / yes / yes / yes / yes | Arbitrary | PyTorch, TensorFlow
Existing approaches for accelerating Transformers are limited to either inference only or models with only encoder layers. LightSeq (Wang et al., 2021) and TurboTransformers (Fang et al., 2021) are two recent systems targeting the serving of Transformers. However, neither supports Transformer training, since training involves additional backward computation, which is more complex than the forward pass.
DeepSpeed is a recent system for optimizing Transformer training (Rasley et al., 2020; Rajbhandari et al., 2020). It uses specially designed CUDA kernels, as well as quantized gradient updates across multiple GPUs. However, DeepSpeed only supports the Transformer encoder layer, so it can only be used to train BERT-like models (Devlin et al., 2019). The full Transformer includes additional modules such as the decoder, the criterion layer for calculating the generation loss, the shared embedding, etc. These involve more complex computation flows, for example, the cross attention between the decoder and encoder layers. It is non-trivial to enable accelerated training for full Transformers.
There is an additional line of research on general computation acceleration for neural networks, including automatic hardware-oriented compilation and quantized computation. TVM
(Chen et al., 2018) uses automatic compilation technology to search for candidate optimizations in the neural network according to specific patterns and to merge operators automatically. However, most automatic compilation works only support fixed-length inputs, and they struggle with variable-length inputs such as the natural sentences fed to Transformers. Jacob et al. (2018) and Apex (https://github.com/NVIDIA/apex) improve data I/O and computation speed by reducing the precision via quantization. These techniques are beneficial but should be used carefully, since they decrease accuracy to a certain extent, and accuracy is critical in training.

In this paper, we focus on accelerating the training of Transformer models on modern GPUs. We intend to provide a general system-level solution that works for all kinds of Transformer-based models and all kinds of training algorithms, such as stochastic gradient descent (SGD) and adaptive gradient methods. The challenge is that inference only requires forward computation, while training also requires back-propagation of errors, gradient computation, synchronization among multiple GPUs, and parameter updates. Among these, back-propagation and parameter updates require relatively high calculation precision. Supporting the full Transformer is also challenging due to the decoder's cross attention over the encoder layers. We therefore aim to optimize the backward computation in addition to the forward pass, to support all Transformer layers including the encoder and decoder, and to support variable-length inputs.
In this paper, we propose LightSeq2, an efficient library for both training and serving Transformer models. It provides system-level optimization without sacrificing accuracy or changing any training behavior (learning rate, convergence rate, initialization, numeric stability, etc.). LightSeq2 includes three techniques for speedup: layer-specific kernels to increase GPU utilization, a fine-grained mixed-precision trainer, and an improved strategy for efficient GPU memory management. LightSeq2 is the first to accelerate the whole process of Transformer training. Table 1 lists the differences between our proposed LightSeq2 and existing accelerated Transformer training libraries. In summary, LightSeq2 enjoys the following advantages:

Highly efficient. LightSeq2 is fast and memory-efficient for training Transformers, as validated in multiple experiments. Notably, LightSeq2 obtains up to a 308% speedup and requires only 65% of the GPU memory on eight NVIDIA Ampere A100 GPUs on the WMT14 English-German machine translation task compared to PyTorch.

Supporting rich models in the Transformer family. LightSeq2 provides comprehensive efficient custom operators, including the embedding, encoder layer, decoder layer, and criterion. These enable BERT (encoder-only), GPT (decoder-only), the full Transformer with encoder and decoder, etc. Therefore it is suitable for almost all NLP tasks, such as text classification, generation, summarization, and machine translation.

Flexible usage. In addition to manually integrating the custom layers in model codes, the users can also use LightSeq2 in popular training libraries without code modification. The library provides seamless integration with PyTorch and TensorFlow.
2. Background
2.1. Transformer Models
The critical idea of the Transformer is using multi-head attention to map the representations of the tokens in the sequence to different feature spaces. In Figure 2, we take the machine translation task as an example. The encoder first calculates the attention between any two input tokens and then obtains the attention outputs. Additionally, it uses a feed-forward network to enhance the representations. For the decoder, most calculations are the same as in the encoder, except that each token only computes attention over previous tokens, and there is a cross attention layer after the self-attention. To obtain the final embedding of each token, an embedding layer is also needed to add its position embedding. After obtaining the outputs of the decoder, the model requires an output layer to generate the probability of each token.
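To make the attention computation concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention with an optional causal mask for the decoder (the function names and shapes are our own illustration, not part of any library):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: subtract the row-wise max first
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal=False):
    """Scaled dot-product attention for one head; q, k, v have shape (seq, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (seq_q, seq_k) similarity scores
    if causal:
        # decoder self-attention: token i may only attend to tokens j <= i
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ v
```

The cross attention in the decoder follows the same pattern, with q coming from the decoder states and k, v from the encoder outputs.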
Transformer-based models have variable-length intermediate tensors, which may lead to frequent allocation and release of GPU memory. There are several ways to deal with this problem. TurboTransformers
(Fang et al., 2021) uses a sequence-length-aware allocator in its inference library to maximize the sharing of non-dependent GPU memory and reduce memory usage. However, it still dynamically allocates and releases GPU memory, which slows down model inference. LightSeq (Wang et al., 2021) allocates the maximal GPU memory in advance to prevent dynamic allocation and release during inference, so it is faster.

2.2. Model Training
As shown in Figure 3, there are four stages during each iteration of data parallel training.

The model receives input data and performs forward propagation, obtaining the final loss.

The model performs backward propagation using the loss calculated after forward propagation, generating the gradients of all parameters.

The gradients are synchronized across devices (e.g., averaged by an all-reduce operation), so that every device holds the same averaged gradients.

All parameters on each device are then updated using the averaged gradients. Since the initial states and gradients of the parameters on each device are the same, the parameters remain identical after one update stage.
The speed bottleneck of the first two stages is mainly in computing, which depends on fast CUDA kernels. The last two stages instead require better memory management, which speeds up the copies of parameters and gradients. Many works are devoted to accelerating these four stages, such as DeepSpeed, Apex, and DDP (Li et al., 2020b). However, no existing work accelerates the complete training process. Figure 4 is a visualization of the four training stages. It can be seen that model computation and parameter updates account for a large proportion of the time. After training with LightSeq2, the time of three of the stages is greatly reduced, especially the parameter updates.
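The four stages can be sketched as a toy NumPy simulation, with two "devices" holding replicas of a one-parameter linear model (the model and numbers are purely illustrative, not the actual GPU implementation):

```python
import numpy as np

# Replicas of a scalar model y = w * x under squared loss, one per "device".
w = np.array([2.0, 2.0])                              # identical replicas
x = np.array([1.0, 3.0]); y = np.array([3.0, 5.0])    # one sample per device

# Stage 1: forward propagation -> loss on each device
pred = w * x
loss = 0.5 * (pred - y) ** 2
# Stage 2: backward propagation -> local gradients
grad = (pred - y) * x
# Stage 3: gradient synchronization (all-reduce average across devices)
avg_grad = np.full_like(grad, grad.mean())
# Stage 4: parameter update with the same averaged gradient everywhere
w -= 0.1 * avg_grad
assert w[0] == w[1]                                    # replicas stay identical
```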
Different from model inference, which only involves forward propagation, model training is more challenging. First, it requires higher precision; otherwise, errors will be magnified by the constant parameter updates. Second, it requires better memory management due to the need to maintain gradients and the activation checkpointing used in backward propagation. Finally, it requires a fast trainer for parameter updates. On the other hand, Transformer training does not need to deal with the incremental lengths of auto-regressive decoding, which makes it simpler than inference in this one respect.
3. The LightSeq2 System
This section introduces the three techniques used in LightSeq2 in detail: kernel fusion, the memory-efficient mixed-precision trainer, and memory management.
3.1. Kernel Fusion
Kernel fusion focuses on speeding up the CUDA kernels, thereby accelerating the computation in training. We optimize all kernels in the Transformer layers (i.e., the encoder and decoder layers), the embedding layer, and the criterion layer.
3.1.1. Transformer Layers
Existing training libraries (e.g., DeepSpeed) only fuse kernels for the encoder and thus can only be used for models like BERT. To expand the scope of use, we extend the fused kernels to the decoder, accelerating models that require a Transformer decoder (e.g., GPT).
There are two types of kernels. One is GEMM kernels, including linear transformation and scaled dot product. The other is non-GEMM kernels, such as
Dropout, layer normalization, and Softmax. As GEMM is already handled efficiently by the cuBLAS library, we focus on non-GEMM kernels here. These can be further subdivided into two categories. One is element-wise kernels (e.g., Dropout, ReLU, Reshape, and bias adding). The independence between any two tensor elements enables explicit parallelism and multi-kernel fusion. The other involves batch reduction operations, such as LayerNorm and Softmax, which require synchronization between threads.

Our optimized Transformer structure is depicted in Figure 5, where yellow boxes represent GEMM kernels and blue boxes represent custom non-GEMM kernels. Adjacent fine-grained element-wise kernels are fused into one coarse-grained kernel, resulting in fewer kernel launches and fewer intermediate results. For example, the last kernel of the self-attention layer implements bias adding, dropout, and the residual connection with only one kernel launch.
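As an illustration of the fused bias-add, dropout, and residual kernel, here is a NumPy analogue that performs the three element-wise steps in one pass (our own sketch; on the GPU this corresponds to a single kernel launch rather than three):

```python
import numpy as np

def bias_dropout_residual(x, bias, residual, p, rng):
    """Fused element-wise step: one pass instead of three separate kernels."""
    keep = rng.random(x.shape) >= p
    # inverted dropout: scale kept elements by 1/(1-p) so E[output] is unchanged
    return (x + bias) * keep / (1.0 - p) + residual
```

Because every output element depends only on the corresponding input elements, the three operations can be fused without any synchronization between threads.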
In the following, we focus on two batch reduction kernels that take a long time in the training, including LayerNorm and Softmax.
LayerNorm. Recall that LayerNorm normalizes the inputs using

    y = γ ⊙ (x − μ) / σ + β,

where μ and σ stand for the mean and standard deviation of x respectively. Both are batch reductions. While the warp-level parallelism provided by CUDA allows inter-thread communication for batch reduction over x, it requires thread synchronization, which hampers instruction-level parallelism.

A naive implementation of the forward calculation introduces two sequential thread synchronizations, for μ and σ respectively. To avoid the dependence between the two reductions, TurboTransformers proposed a new formula that runs the two synchronizations in parallel:

    σ² = E[x²] − (E[x])²,

so that the reductions for E[x] and E[x²] no longer depend on each other. This inspires us to rearrange the equation for the backward calculation, which parallelizes the synchronizations as well:

    ∂L/∂xᵢ = (1/σ) · (γᵢ ∂L/∂yᵢ − (c₁ + c₂ x̂ᵢ) / n),

where c₁ and c₂ are coefficients that can be solved in parallel:

    c₁ = Σⱼ γⱼ ∂L/∂yⱼ,    c₂ = Σⱼ γⱼ ∂L/∂yⱼ x̂ⱼ.

Here n is the dimension of x, x̂ᵢ = (xᵢ − μ)/σ, and ∂L/∂xᵢ and ∂L/∂yᵢ are the gradients of the i-th elements of the input and output tensors respectively. The two batch reductions c₁ and c₂ can be executed in parallel.
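The parallel-reduction forward pass and the rearranged backward pass can be checked with a small NumPy sketch (our own illustrative code, using σ² = E[x²] − (E[x])² and the two independent reductions c1 = Σ γ·dy and c2 = Σ γ·dy·x̂):

```python
import numpy as np

def layernorm_fwd(x, gamma, beta):
    # E[x] and E[x^2] are independent reductions that can run in parallel
    mu = x.mean()
    var = (x * x).mean() - mu * mu        # sigma^2 = E[x^2] - E[x]^2
    sigma = np.sqrt(var)
    xhat = (x - mu) / sigma
    return gamma * xhat + beta, (xhat, sigma)

def layernorm_bwd(dy, gamma, cache):
    xhat, sigma = cache
    n = xhat.size
    # c1 and c2 are the two independent batch reductions
    c1 = np.sum(gamma * dy)
    c2 = np.sum(gamma * dy * xhat)
    return (gamma * dy - (c1 + c2 * xhat) / n) / sigma
```

A finite-difference check confirms that the rearranged backward formula matches the true gradient.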
As LayerNorm is sensitive to floating-point precision, we keep the parameters stored in 16-bit floating point (FP16) and cast them to 32-bit floating point (FP32) for computation, avoiding additional I/O cost.
Softmax. The next one is Softmax. Its forward process can be expressed as

    softmax(x)ᵢ = exp(xᵢ) / Σⱼ exp(xⱼ).

For numerical stability, especially in mixed-precision training, it takes three steps to avoid overflow:

1. Find the maximal element of x, denoted m.

2. Subtract m from each element of x so that the exponential never overflows, then calculate the partition function Z = Σⱼ exp(xⱼ − m).

3. Calculate softmax(x)ᵢ = exp(xᵢ − m) / Z.
Both steps 1 and 2 are reduction operations. We use the CUB library (https://nvlabs.github.io/cub/) for efficiency, which weakens the dependence between elements using a small extra buffer.
We tried different combinations of block size, grid size, and buffer size for various sequence lengths to find the best settings. During the backward pass, we allocate four warps in each block to run the synchronizations in parallel. This tolerates the higher latency of the fadd instruction, which produces a source register required by the subsequent shfl_down instruction (Fang et al., 2021).
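The three-step numerically stable Softmax can be sketched in NumPy as follows (illustrative only; the actual kernel performs the two reductions warp-wise on the GPU):

```python
import numpy as np

def stable_softmax(x):
    m = x.max()                  # step 1: reduction for the maximum
    e = np.exp(x - m)            # x - m <= 0, so the exponential never overflows
    z = e.sum()                  # step 2: reduction for the partition function
    return e / z                 # step 3: element-wise normalization

# logits this large would overflow a naive exp() in FP16 or FP32
p = stable_softmax(np.array([1000.0, 1001.0, 1002.0]))
```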
In general, we have implemented all CUDA kernels using the float4 data type, increasing the bandwidth of data I/O. At the same time, we also optimize the kernels for commonly used input sizes. Compared with the implementation of DeepSpeed, FasterTransformer, and TurboTransformers, our kernels are much faster.
3.1.2. Embedding Layer
The embedding layer is widely used in most deep learning tasks to obtain the distributed representation of a sentence or image. Given a token embedding lookup table T and a positional embedding lookup table P, we can get the representation v of a token with index i at position j:

    v = s · Tᵢ + Pⱼ,

where s is the embedding scale (typically √h for hidden dimension h). Here we use sinusoidal positional embeddings, which do not require training.

Consider an input sentence of length l, and let mⱼ denote the Dropout mask generated at position j in the forward propagation. We can efficiently compute the gradient of token k in the embedding table:

    ∂L/∂Tₖ = s · Σ_{j : position j holds token k} mⱼ ⊙ gⱼ,

where gⱼ is the output gradient at position j and ⊙ represents the element-wise product. This sums the gradients of a token over all positions where it appears in the sentence, which can be implemented in parallel with the atomicAdd operation in CUDA to avoid interference between threads.
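A NumPy sketch of this gradient accumulation, with np.add.at standing in for CUDA's atomicAdd (the function name and shapes are our own illustration):

```python
import numpy as np

def embedding_grad(token_ids, grad_out, dropout_mask, scale, vocab_size):
    """Accumulate d(loss)/d(token table). A token that appears at several
    positions sums its position-wise gradients; np.add.at performs an
    unbuffered scatter-add, the CPU analogue of CUDA's atomicAdd."""
    d_table = np.zeros((vocab_size, grad_out.shape[-1]))
    np.add.at(d_table, token_ids, scale * dropout_mask * grad_out)
    return d_table
```

Note that a plain `d_table[token_ids] += ...` would silently drop repeated indices, which is exactly the race that atomicAdd avoids on the GPU.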
3.1.3. Criterion Layer
The criterion layer computes the loss between the model output and the ground truth. Let z, a vector of length V, denote the output of the decoder for one token, where V is the vocabulary size, and let y denote the one-hot vector of length V representing the ground truth. The prevalent cross-entropy loss with label smoothing can be formulated as

    L = − Σᵢ ŷᵢ log pᵢ,

where

    p = softmax(z),    ŷ = (1 − ε) y + ε / V,

and ε is the smoothing parameter.

By plugging p and ŷ into the loss function and calculating the partial derivatives with respect to z, we get the gradient of the decoder output:

    ∂L/∂zᵢ = pᵢ − ŷᵢ,

which is an element-wise function of p and ŷ and can be executed in parallel. Recall that our Softmax kernel takes three steps to compute. We can slightly modify the last step with additional logarithmic operations and bias adding for the forward and backward calculations respectively.
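The element-wise gradient p − ŷ can be verified with a short NumPy sketch (our own illustration, using the smoothing ŷ = (1 − ε)y + ε/V as above):

```python
import numpy as np

def ls_xent_grad(logits, target, eps):
    """Gradient of label-smoothed cross-entropy w.r.t. the decoder output z:
    simply p - y_smooth once the softmax p is known."""
    V = logits.size
    p = np.exp(logits - logits.max())
    p /= p.sum()
    y_smooth = np.full(V, eps / V)       # eps/V mass on every class
    y_smooth[target] += 1.0 - eps        # remaining mass on the true class
    return p - y_smooth
```

Since both p and ŷ sum to one, the gradient components sum to zero, a quick sanity check for the kernel.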
3.2. MemoryEfficient MixedPrecision Trainer
In mixed-precision training (Micikevicius et al., 2018), weights and gradients are kept in FP16 during forward and backward propagation. Since the update value, the product of the learning rate and the gradient, is often tiny, the trainer maintains FP32 copies of the weights and gradients and completes the update in FP32. As shown in the left part of Figure 7, each piece of gradients/weights in the model is copied to/from its FP32 partner in every training step, and the trainer kernel loads the FP32 gradient to update the FP32 weight. This mechanism introduces two defects:

Numerous pieces of gradients/weights lead to many fast-returning GPU kernels for copying and updating, which reduces GPU utilization.

Redundant memory footprints are caused by the FP32 copy of gradients/weights.
We alleviate both problems with a symbolic tensor link and an on-the-fly conversion kernel. As shown in the right part of Figure 7, during the initialization of the trainer, we copy all pieces of weights/gradients in order into one contiguous tensor called the workspace. Then we reset the original tensors and link them as fragments of the workspace. During each training step, we only need to launch the trainer kernel once to update the whole workspace, which avoids launching a huge number of small GPU kernels for every piece of weights/gradients.
Our trainer kernel loads the FP16 weights/gradients from the workspace into registers and converts them on the fly to FP32. The weights in registers are then updated as usual. Finally, the weights are converted on the fly back to FP16 and saved to the workspace. Accessing memory in FP16 instead of FP32 halves the data movement and avoids the FP32 copies of weights and gradients.
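A simplified NumPy sketch of the workspace update with on-the-fly conversion (our own illustration of the data movement only; the real kernel also handles optimizer state and rounding with more care):

```python
import numpy as np

def fused_sgd_step(workspace_w, workspace_g, lr):
    """One trainer-kernel launch over the whole contiguous workspace.
    Weights and gradients live in FP16; each value is upcast to FP32
    'registers', updated there, and downcast on the way back, so no
    persistent FP32 master copies are kept in memory."""
    w32 = workspace_w.astype(np.float32)   # on-the-fly conversion to FP32
    g32 = workspace_g.astype(np.float32)
    w32 -= lr * g32                        # update in FP32 precision
    workspace_w[:] = w32.astype(np.float16)  # downcast back into the workspace
```

Because the whole parameter set is one flat FP16 tensor, a single launch replaces the per-tensor copy and update kernels of the conventional scheme.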
The cooperation between the symbolic tensor link and the on-the-fly conversion kernel leads to both memory savings and latency reduction without hurting accuracy. Experimental results on the Transformer-big model show that the proposed trainer reduces memory usage by 2 GB and the runtime by 54.9% compared to the Fairseq trainer with the highly fused kernels from Apex.
3.3. Memory Management
Recent studies demonstrated that training with a large batch leads to fast convergence and higher accuracy. However, large batch training requires more GPUs, gradient accumulation, or memory offload due to the memory limit, which increases the time cost or demand for hardware resources.
We reduce the memory footprint by compacting memory with fewer allocations and releases, at no extra cost. Like DeepSpeed, we divide the GPU memory into permanent memory of fixed size, which stores the parameters and their gradients, and temporary memory of variable size, which stores intermediate states. To avoid frequent allocation of temporary memory, we scan the training set and estimate an upper bound on the required capacity. The temporary memory is thus allocated once, with the maximal size, before training starts; it is reused across batches and released after training finishes.
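The pre-allocation strategy can be sketched as follows (a hypothetical helper of our own, not the LightSeq2 API; it shows the idea of sizing the buffer once from a scan of the data):

```python
import numpy as np

def plan_temporary_memory(batch_lengths, hidden, bytes_per_el=2):
    """Scan the training set once, take the worst-case activation size,
    and allocate the temporary buffer a single time before training.
    Every batch then reuses this buffer instead of reallocating."""
    max_tokens = max(batch_lengths)            # upper bound over all batches
    nbytes = max_tokens * hidden * bytes_per_el  # e.g., 2 bytes per FP16 element
    return np.empty(nbytes, dtype=np.uint8)

# worst case here is the 4096-token batch; smaller batches reuse the same buffer
buf = plan_temporary_memory([512, 4096, 1024], hidden=1024)
```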
Figure 8
shows the temporary memory cost in the selfattention backward process. The left side describes the steps of backpropagation. Each row on the right side lists the memory occupancy of temporary tensors in one step. Tensors in the same column reuse the same memory block. The sizes of the orange and purple tensors are
and , respectively. The tensors in the dashed memory blocks are not updated in that step, and those in the solid blocks are updated. We can see that only (the first three blocks) (the last block) bytes of memory are required, where denote the batch size, hidden dimension, max sequence length and the number of heads respectively. In contrast, if not use the shared memory block strategy, a total of bytes of memory is required.4. Experiments
We evaluate the performance of LightSeq2 with different Transformer sizes on both NVIDIA Tesla V100 and Ampere A100, based on both PyTorch and TensorFlow. See each subsection for specific experimental configurations.
4.1. Example Usage
We provide C++ and Python APIs for convenient usage. Figure 10 shows a code snippet that creates a PyTorch-based encoder layer using LightSeq2. Lines 2-8 provide the configuration information of the encoder layer, and line 9 creates the encoder layer from this configuration.
4.2. Main Results
We run experiments on three different tasks to evaluate the speed of LightSeq2. The first is machine translation, showing the capability on NLP tasks. The second is image classification, showing the capability on CV tasks. The third is BERT fine-tuning, showing the capability of the encoder implementation on NLP tasks.
4.2.1. Performance on Machine Translation
We compare LightSeq2 with the PyTorch implementation and Apex optimization on the WMT14 English-German machine translation task using 8 V100 and A100 GPUs. We compare the speeds of three Transformer (Vaswani et al., 2017) configurations with different numbers of layers (e.g., 6e6d represents six encoder and six decoder layers). We use Fairseq (Ott et al., 2019) as the PyTorch baseline and enable Apex optimization for further speedup.
The results are shown in Figure 9. On both V100 and A100, LightSeq2 is much faster than PyTorch and Apex. The speedup ratio decreases as the batch token size increases; overall, LightSeq2 obtains a speedup of 1.4-2.8x on V100 and 1.5-3.5x on A100. Larger models obtain larger speedups because with more layers, model computation takes a higher proportion of the time, leaving more room for LightSeq2's optimization. However, as the number of layers increases, the GPU memory occupied by the model also increases. Thanks to the efficient memory management of LightSeq2, the models can still be trained with large batch token sizes. Comparing V100 and A100, A100 obtains a higher speedup ratio due to its new Ampere architecture. Finally, Apex slightly improves the performance of PyTorch but still has a big gap with LightSeq2.
Additionally, we compare the speedups on both PyTorch and TensorFlow using different numbers of V100 GPUs. We use Fairseq as the PyTorch baseline and NeurST (https://github.com/bytedance/neurst) as the TensorFlow baseline, enabling XLA (https://www.tensorflow.org/xla) optimization. We train the same machine translation task as above using standard Transformer-big models.
As shown in Figure 11, the speedup ratios on 8 GPUs are lower than on 1 GPU, mainly due to the gradient synchronization between multiple GPUs. As the batch token size increases, the gap gradually narrows. The speedup on TensorFlow is slightly lower than on PyTorch under different batch token sizes because we only integrate the encoder and decoder into NeurST.
4.2.2. Performance on Image Classification
We compare LightSeq2 with the PyTorch implementation of the Vision Transformer (ViT) (Dosovitskiy et al., 2021) on the image classification task using the CIFAR-10 dataset and 8 V100 GPUs. We use two model architectures with patch size 32: ViT-B32 and ViT-L32 represent the base and large ViT models respectively. The input images all have resolution 224x224, so the input sequence length of the ViT is always 50.

The results are shown in Figure 12. In all cases, LightSeq2 outperforms the PyTorch implementation. As the batch size increases, the speedup ratio decreases, matching the trends on the NLP tasks. With a batch size of 16 on ViT-B32, LightSeq2 obtains the highest speedup ratio of 1.7x.
4.2.3. Performance of Training BERT
We compare LightSeq2 with DeepSpeed on the Microsoft Research Paraphrase Corpus (MRPC) task of the General Language Understanding Evaluation (GLUE) benchmark using the Hugging Face Transformers BERT examples (https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification), because DeepSpeed only supports Transformer encoders.

The results are shown in Table 2. From this table, we can draw a few conclusions:

The speedup of BERT-Base is higher than that of BERT-Large, mainly due to the smaller proportion of matrix multiplication.

LightSeq2 helps FP16 training much more than FP32 training, because FP16 training can make full use of the latest features of the V100 GPUs.

LightSeq2 is much faster than DeepSpeed due to our better-optimized CUDA kernels and memory-efficient trainer. In the typical (BERT-Base, 8 GPUs, FP16) case, LightSeq2 obtains a speedup of 1.64x over the original Hugging Face BERT.
For a fair comparison, we do not integrate the LightSeq2 embedding, criterion, and trainer in this experiment; with them, LightSeq2 would be even faster.
Models      nGPUs  Libraries  FP32  FP16
BERT-Base   1      PyTorch     124   224
                   DeepSpeed   136   276
                   LightSeq2   148   380
            8      PyTorch     928  1582
                   DeepSpeed  1019  1804
                   LightSeq2  1097  2601
BERT-Large  1      PyTorch      41    95
                   DeepSpeed    44   114
                   LightSeq2    45   136
            8      PyTorch     322   680
                   DeepSpeed   344   838
                   LightSeq2   354  1074
4.3. Kernel Efficiency
We provide convenient secondary development tools to evaluate the running time and correctness of custom CUDA kernels and layers. We evaluate four different implementations of three common operations on one V100 GPU: the element-wise kernel Dropout and the batch reduction kernels Softmax and LayerNorm.
First, in Figure 13, we compare the speedup of LayerNorm. LightSeq2 keeps a speedup of about 4x regardless of the batch token size and hidden dimension. However, as the batch token size or hidden dimension increases, the acceleration ratio of DeepSpeed drops significantly. This is because we use a faster CUDA implementation that can process more elements in parallel. When the number of elements is huge, DeepSpeed is not even as fast as PyTorch. On the other hand, TensorFlow is slower than PyTorch in most cases, except when there are very many elements.
Then, in Figure 14(a), we compare the speedup of Dropout. As the number of elements increases, both DeepSpeed and LightSeq2 become relatively slower. When the number of elements exceeds 5 million, DeepSpeed becomes slower than PyTorch. The gap between TensorFlow and PyTorch also narrows, but LightSeq2 still keeps a speedup of 1.2x to 1.5x.
Finally, in Figure 14(b), we compare the speedup of Softmax. Unlike the other kernels, as the batch size and sequence length increase, the speedup of LightSeq2 becomes larger, mainly due to the specific optimization for different input shapes. The trends of DeepSpeed and TensorFlow are similar to the other kernels.
4.4. Layer Speed
We compare the layer speeds of LightSeq2 with the PyTorch (Fairseq) implementations on one V100 GPU. We use batch size 32 in all cases and only one layer each for the encoder and decoder. The hidden dimension is 1024 in all cases, the same as in Transformer-big models.
The results of forward and backward propagation are shown in Figure 15. We can draw several conclusions from it:

LightSeq2 obtains a higher speedup in forward propagation than in backward propagation. This may be because the backward time includes the gradient copies.

The speedup ratios of the encoder and decoder decrease rapidly as the sequence length grows, while the speedups of the embedding and criterion layers remain stable. This is mainly because the overall computation of the embedding and criterion layers is relatively small, so the GPU can still process the elements of long sequences in parallel.
In all cases, LightSeq2 layers are faster than PyTorch, especially when the sequence length is small. We provide Python wrappers of these layers for convenient usage, and the users can flexibly create and train them.
4.5. Memory Usage
We compare the GPU memory usage and utilization of LightSeq2 and PyTorch on the WMT14 English-German machine translation task using the same V100 GPU. We use Fairseq as the PyTorch code base, and both use a batch token size of 8192. All experiments run for 40 minutes to fairly compare the GPU behavior.
Figure 16 illustrates the GPU memory occupation of both the Transformer-base (6e6d, 512d, 8 heads) and Transformer-big (6e6d, 1024d, 16 heads) models. PyTorch consumes about 6 GB more GPU memory than LightSeq2 in both cases. For example, Transformer-base models based on PyTorch cannot run on a GPU with only 16 GB of memory. Another observable phenomenon is that the GPU memory of PyTorch gradually increases as training runs. This is because PyTorch dynamically allocates and releases GPU memory when the sequence lengths differ, so it needs to request additional GPU memory when a long sequence enters the model. In contrast, LightSeq2 allocates the maximal GPU memory in advance, so there is no memory change during training, and the time for allocation and release is also saved.
Figure 17 illustrates the GPU utilization of both the Transformer-base and Transformer-big models. Over the whole training process, LightSeq2 keeps a utilization rate of about 99% in both cases. For PyTorch, however, the utilization of the Transformer-base model is very unstable: the lowest is only 80%, and most of the time it fluctuates between 87% and 93%, mainly due to frequent memory allocation and release. The utilization of the Transformer-big model is much more stable, but the highest is only 95%.
With a smaller batch token size, the gap between the two implementations is even more obvious. For example, when the batch token size is reduced to 4096, the GPU utilization of the Transformer-base model based on PyTorch is only 73%, while that of LightSeq2 is as high as 96%.
5. Related Work
Many approaches have been proposed to improve the training efficiency for deep models, which can be divided into two categories, algorithmspecific and algorithmagnostic.
Algorithm-specific methods accelerate the training process by improving the model architectures (Vyas et al., 2020; Ying et al., 2021; Wang et al., 2020; Fan et al., 2020; Zhang and He, 2020; Peng et al., 2021; Choromanski et al., 2021), training strategies (Liu et al., 2020; Gong et al., 2019; Li et al., 2020a), optimization algorithms (You et al., 2019; Yao et al., 2020), data precision (Zhang et al., 2020; Sun et al., 2019; Wang et al., 2018), and others (Li et al., 2021, 2020c). Vyas et al. (2020) use clustered attention to group queries into clusters and compute attention only for the centroids, resulting in a linear-complexity model. Gong et al. (2019) propose a stacking algorithm to transfer knowledge from a shallow model to a deep model, then apply stacking progressively to accelerate BERT training. Yao et al. (2020) propose AdaHessian, a second-order stochastic optimization algorithm that dynamically incorporates the curvature of the loss function via adaptive estimates of the Hessian matrix. These techniques can speed up model training to a certain extent but may affect the model structure and accuracy, so their generality is limited.
Algorithm-agnostic optimization can avoid this problem. Apex (https://github.com/NVIDIA/apex) provides a set of commonly used GPU kernels implemented in C++ and CUDA, including LayerNorm, Softmax, and the Adam optimizer. It supports automatic mixed-precision computation and distributed training. Unlike the previous works that change the training behavior, this engineering-level optimization strictly follows the training algorithm and has no impact on anything other than speed. LightSeq2 further enhances the performance of the trainer with memory-efficient mixed-precision computation without sacrificing accuracy.
DeepSpeed (Rasley et al., 2020) integrates such small kernels into Transformer encoders, which boosts the training throughput on a single GPU and scales well on multiple GPUs. However, DeepSpeed has several limitations that hinder its usage in NLP, CV, and ASR tasks, especially sequence generation tasks. First, DeepSpeed only optimizes the Transformer encoder, so it is not suitable for tasks requiring decoding modules (e.g., machine translation). Second, DeepSpeed does not optimize other modules such as the embedding and criterion, which prevents it from achieving higher performance. Third, DeepSpeed requires the input length to be an integer multiple of 16 due to the implementation of some kernels, which introduces unnecessary padding and computation; in contrast, LightSeq2 supports inputs of arbitrary shape. Fourth, DeepSpeed does not support TensorFlow, which is also widely used in practice.
LightSeq (version 1.2) (Wang et al., 2021), TurboTransformers (version 0.5) (Fang et al., 2021), and FasterTransformer (version 4.0) (https://github.com/NVIDIA/FasterTransformer) are three recent systems targeting the serving of Transformers. All of them exploit manually written CUDA kernels for accelerated forward computation of the layers in a Transformer, and they improve serving throughput with enhanced batch decoding strategies on GPUs. However, none of them supports Transformer training, since training involves additional backward computation, which is more complex than the forward pass.
TVM (Chen et al., 2018) is a compiler that exposes graph-level and operator-level optimizations to provide performance portability for deep learning models. However, because it is limited to fixed sequence lengths, it is difficult to apply to Transformer-based models.
6. Conclusion
In this paper, we describe a series of engineering-based GPU optimization techniques for fast Transformer training. Unlike existing approaches, our system strictly follows the standard training algorithm and therefore guarantees the quality and reproducibility of existing models. We systematically compare our work with existing state-of-the-art systems under various settings and analyze each component's performance, demonstrating the soundness and scalability of the contribution. Compared to the PyTorch and TensorFlow implementations, LightSeq2 obtains a speedup of up to 3x under different configurations.
In the future, we will unify the training and inference libraries to simplify the process from model training to deployment. Besides, we will apply padding removal (https://github.com/bytedance/effective_transformer), a more efficient memory management strategy, and other acceleration techniques to achieve further speedups.
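The padding-removal idea referenced above can be sketched in a few lines: pack only the valid tokens of a padded batch into one contiguous buffer before the dense layers (a hypothetical numpy illustration, not the effective_transformer implementation):

```python
import numpy as np

def remove_padding(batch, lengths):
    # Pack the valid tokens of a padded [batch, time, hidden] tensor
    # into a contiguous [total_tokens, hidden] tensor, skipping pads.
    return np.concatenate([seq[:n] for seq, n in zip(batch, lengths)])

batch = np.ones((2, 8, 4))        # 2 sequences, each padded to length 8
packed = remove_padding(batch, [3, 5])
print(packed.shape)               # (8, 4): only the 8 real tokens remain
```

Token-wise layers such as the feed-forward network can then operate on the packed tensor, and the original layout is restored only where attention requires it.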
References
Language models are few-shot learners. In Proc. of NeurIPS.
TVM: an automated end-to-end optimizing compiler for deep learning. In Proc. of OSDI, pp. 578–594.
Rethinking attention with performers. In Proc. of ICLR.
BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT, pp. 4171–4186.
An image is worth 16x16 words: transformers for image recognition at scale. In Proc. of ICLR.
Reducing transformer depth on demand with structured dropout. In Proc. of ICLR.
TurboTransformers: an efficient GPU serving system for transformer models. In Proc. of PPoPP, pp. 389–402.

Efficient training of BERT by progressively stacking. In Proc. of ICML, Vol. 97, pp. 2337–2346.
Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.
Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proc. of CVPR, pp. 2704–2713.
Learning lightweight translation models from deep transformer. In Proc. of AAAI, pp. 13217–13225.

Shallow-to-deep training for neural machine translation. In Proc. of EMNLP, pp. 995–1005.
Scaling distributed machine learning with the parameter server. In Proc. of OSDI, pp. 583–598.
PyTorch distributed: experiences on accelerating data parallel training. Proc. VLDB Endow. 13 (12), pp. 3005–3018.
Train big, then compress: rethinking model size for efficient training and inference of transformers. In Proc. of ICML, Vol. 119, pp. 5958–5968.
On the variance of the adaptive learning rate and beyond. In Proc. of ICLR.
Mixed precision training. In Proc. of ICLR.
Fairseq: a fast, extensible toolkit for sequence modeling. In Proc. of NAACL (Demonstrations), pp. 48–53.
Contrastive learning for many-to-many multilingual neural machine translation. In Proc. of ACL.
Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing 69 (2), pp. 117–124.
Random feature attention. In Proc. of ICLR.
Language models are unsupervised multitask learners.
ZeRO: memory optimizations toward training trillion parameter models. In Proc. of SC, p. 20.
DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proc. of KDD, pp. 3505–3506.
Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. In Proc. of NeurIPS, pp. 4901–4910.
Attention is all you need. In Proc. of NeurIPS, pp. 5998–6008.
Fast transformers with clustered attention. In Proc. of NeurIPS.
Training deep neural networks with 8-bit floating point numbers. In Proc. of NeurIPS, pp. 7686–7695.
Linformer: self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
LightSeq: a high performance inference library for transformers. In Proc. of NAACL-HLT (Industry Papers), pp. 113–120.
XLNet: generalized autoregressive pretraining for language understanding. In Proc. of NeurIPS, pp. 5754–5764.
ADAHESSIAN: an adaptive second order optimizer for machine learning. arXiv preprint arXiv:2006.00719.
LazyFormer: self attention with lazy update. arXiv preprint arXiv:2102.12702.
Reducing BERT pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962.
Scaling vision transformers. arXiv preprint arXiv:2106.04560.
Accelerating training of transformer-based language models with progressive layer dropping. In Proc. of NeurIPS.
Fixed-point backpropagation training. In Proc. of CVPR, pp. 2327–2335.