substation
Research and development for optimizing transformers
view repo
Transformers have become widely used for language modeling and sequence learning tasks, and are one of the most important machine learning workloads today. Training one is a very computeintensive task, often taking days or weeks, and significant attention has been given to optimizing transformers. Despite this, existing implementations do not efficiently utilize GPUs. We find that data movement is the key bottleneck when training. Due to Amdahl's Law and massive improvements in compute performance, training has now become memorybound. Further, existing frameworks use suboptimal data layouts. Using these insights, we present a recipe for globally optimizing data movement in transformers. We reduce data movement by up to 22.91 1.30x performance improvement over stateoftheart frameworks when training BERT. Our approach is applicable more broadly to optimizing deep neural networks, and offers insight into how to tackle emerging performance bottlenecks.
READ FULL TEXT VIEW PDFResearch and development for optimizing transformers
Transformers [vaswani2017attention] are a class of deep neural network architecture for sequence transduction [graves2012sequence]
, similar to recurrent neural networks
[rumelhart1986learning] and LSTMs [hochreiter1997long]. They have recently had a major impact on natural language processing (NLP), including language modeling
[radford2018improving, wang2018glue, wang2019superglue], questionanswering [rajpurkar2018know], translation [vaswani2017attention], and many other applications. The significant improvement in accuracy brought by transformers to NLP tasks is comparable to the improvement brought to computer vision by AlexNet
[krizhevsky2012imagenet]and subsequent convolutional neural networks. Transformers have also begun to be applied to domains beyond NLP where RNNs would previously have been used, including speech recognition
[yeh2019transformer][parisotto2019stabilizing], molecular property prediction [maziarka2020molecule], and symbolic mathematics [lample2019deep].Training transformers is very computeintensive, often taking days on hundreds of GPUs or TPUs [devlin2018bert, yang2019xlnet, liu2019roberta, keskar2019ctrl, shoeybi2019megatron, lan2019albert, raffel2019exploring]. Further, generalization only improves with model size [radford2019language, shoeybi2019megatron, raffel2019exploring, microsoft2020turingnlg], number of training samples [raffel2019exploring, liu2019roberta], and total iterations [liu2019roberta, kaplan2020scaling]. These all significantly increase compute requirements. Indeed, transformers are becoming the dominant task for machine learning compute where training a model can cost tens of thousands to millions of dollars and even cause environmental concerns [strubell2019energy]. These trends will only accelerate with pushes toward models with tens of billions to trillions of parameters [microsoft2020turingnlg, microsoft2020zero], their corresponding compute requirements [openai2018aiandcompute], and increasing corporate investment towards challenges such as artificial general intelligence [openai2019microsoft]. Thus, improving transformer performance has been in the focus of numerous research and industrial groups.
Significant attention has been given to optimizing transformers: local and fixedwindow attention [bahdanau2014neural, luong2015effective, shen2018bi, parmar2018image, tay2019simple], more general structured sparsity [child2019generating], learned sparsity [correia2019adaptively, sukhbaatar2019adaptive, tay2020sparse], and other algorithmic techniques [lan2019albert, kitaev2020reformer]
improve the performance of transformers. Major hardware efforts, such as Tensor Cores and TPUs
[jouppi2017datacenter] have accelerated tensor operations like matrixmatrix multiplication (MMM), a core transformer operation. Despite this, existing implementations do not efficiently utilize GPUs. Even optimized implementations such as Megatron [shoeybi2019megatron] report achieving only 30% of peak GPU flop/s.

We find that the key bottleneck when training transformers is data movement. Improvements in compute performance have reached the point that, due to Amdahl’s Law and the acceleration of tensor contractions, training is now memorybound. Over a third (37%) of the runtime in a BERT training iteration is spent in memorybound operators: While tensor contractions account for over 99% of the flop performed, they are only 61% of the runtime. By optimizing these, we show that the overhead of data movement can be reduced by up to 22.91%. Further, while MMM is highly tuned by BLAS libraries and hardware, we also find that existing frameworks use suboptimal data layouts. Using better layouts enables us to speed up MMM by up to 52%. Combining these insights requires moving beyond peepholestyle optimizations and globally optimizing data movement, as selecting a single layout is insufficient. Overall, we achieve at least performance improvements in training over generalpurpose deep learning frameworks, and over DeepSpeed [deepspeed], the state of the art manuallytuned implementation of BERT. For robustly training BERT [liu2019roberta]
, this translates to a savings of over $85,000 on AWS using PyTorch. For the GPT3 transformer model
[brown2020language] with a training cost of $12M [venture], our optimizations could save $3.6M and more than 120 MWh energy. To do this, we develop a recipe for systematically optimizing data movement in DNN training.Our approach constructs a dataflow graph for the training process, which we use to identify operator dependency patterns and data volume. With this representation, we identify opportunities for data movement reduction to guide optimization. We aim to maximize data reuse using various forms of fusion. Then we select performant data layouts, which is particularly impactful for normalization and tensor contraction operators, where it provides opportunities for vectorization and different tiling strategies. The performance data gathered is then used to find operator configurations that produce a fast endtoend optimized implementation of training.
We evaluate these implementations first for multihead attention, a core primitive within transformers and one that has significant applications beyond transformers [bello2019attention, parmar2019stand, cordonnier2019relationship]. We then consider the encoder layer from BERT [devlin2018bert], a widelyused transformer architecture. In each case, we compare against existing highly optimized implementations to provide strong baselines. Using this recipe, we are able to demonstrate significant performance improvements in both microbenchmarks and endtoend training, outperforming PyTorch [paszke2019pytorch]
, TensorFlow+XLA
[tensorflow2015whitepaper], cuDNN [chetlur2014cudnn], and DeepSpeed [deepspeed]. While we focus our work on particular transformer models, our approach is generally applicable to other DNN models and architectures. We summarize our contributions as follows:We find transformer training to be memorybound and significantly underperforming on GPUs.
We develop a generic recipe for optimizing training using dataflow analyses.
Using this recipe, we systematically explore the performance of operators and the benefits of different optimizations.
We demonstrate significant performance improvements, reducing data movement overheads by up to 22.91% over existing implementations, and achieving at least performance improvements over specialized libraries, and over generalpurpose frameworks.
We make our code available at https://github.com/spcl/substation.
Here we provide a brief overview of our terminology, transformers, and datacentric programming. We assume the reader is generally familiar with training deep neural networks (see Goodfellow et al. [Goodfellowetal2016] for an overview).
We use the standard minibatch dataparallel approach to training, wherein a minibatch of samples is partitioned among many GPUs. During backpropagation, we distinguish between two stages: computing the gradients with respect to a layer’s input (
), and computing the gradients with respect to the layer’s parameters (). Note that the second stage is relevant only for layers with learnable parameters.The transformer architecture [vaswani2017attention], originally developed for machine translation, is a neural network architecture for sequence transduction, or transforming an input sequence into an output sequence. Transformers build upon a long sequence of work within natural language processing, most relevantly beginning with word embeddings [mikolov2013linguistic, mikolov2013efficient, mikolov2013distributed]
[kalchbrenner2013recurrent, cho2014learning], and sequencetosequence learning [sutskever2014sequence]. A key component is attention [bahdanau2014neural, luong2015effective], which enables a model to learn to focus on particular parts of a sequence.The transformer makes two key contributions. First, it generalizes attention to multihead attention, which we discuss below. Second, the transformer neglects recurrent or convolutional mechanisms for processing sequences, and relies entirely on attention. Critically, this enables significantly more parallelism during training, as the model can process every element of a sequence simultaneously, instead of having a serial dependence on the prior element.
Multihead attention (MHA) generalizes attention, and uses h attention “heads” in parallel to attend to different learned projections of a sequence. We provide Python code and an illustration of MHA forward propagation in Fig. 1.
Each attention head is an instance of scaled dotproduct attention, and takes three input tensors: queries (q), keys (k), and values (v). Conceptually, attention finds values corresponding to the keys closest to the input queries. Heads are also augmented with learned linear layers that project their inputs into a lowerdimensional space. The three inputs are first multiplied by weight tensors wq, wk, wv, respectively, as a learned input projection (we use Einsteinnotation sums, or einsums, for tensor contractions). The query and key tensors are subsequently multiplied together and scaled (stored in beta), followed by applying the softmax operation in order to weight and select the most important results. This is then multiplied with vv to produce the perhead output (gamma). The outputs of all the heads are finally concatenated and linearly projected back to the input dimension size (i).
The respective dataflow graph (Fig. 0(b)) immediately exposes coarse (whole graph) and finegrained (within rectangular nodes) parallelism, as well as data reuse. As every edge represents exact data movement, their characteristics (access sets and movement volume) can be inspected for guided bottlenecks and potential solution analysis.
There are three broad classes of MHA, distinguished by their inputs. General attention uses distinct tensors as the queries, keys, and values. Encoder/decoder attention uses the same tensor for both keys and values (typically produced by an encoder layer). Selfattention uses the same tensor for all three inputs. MHA may also have a masking step, which is used during training to prevent a model from “seeing the future” and using information from a later part of a sequence.
BERT [devlin2018bert] is a widelyused transformer for NLP tasks. Fig. 2
illustrates the forward and backward pass for a single BERT encoder layer. The layer essentially consists of MHA (as selfattention) followed by a feedforward neural network. The feedforward network consists of two linear layers with bias and ReLU activations after the first layer. Dropout
[srivastava2014dropout], layer normalization [ba2016layer], and residual connections
[he2016deep] are also used.Transformers also incorporate several other layers that we will not discuss in detail: embedding layers for input sequences and various output layers, depending on the task. Other transformer architectures, such as the original Transformer and GPT2/3
[radford2019language, brown2020language] have very similar architectures.As DNN processing is among the most popular computeintensive applications today, considerable efforts have been made to optimize its core operators [ddlsurvey]. This has driven the field to the point where optimization today is almost exclusively performed beyond the individual operator, either on the whole network [xla, torchscript] or repeated modules.
Performance optimization on modern architectures consists of mutations to the original code, sometimes algorithmic [chellapilla2006high, mathieu2013fast, lavin2016fast] but mostly relating to hardwarespecific mapping of computations and caching schemes. This includes tiling computations for specific memory hierarchies, using specialized units (e.g., Tensor Cores) for bulkprocessing of tiles, modifying data layout to enable parallel reductions, hiding memory latency via multiple buffering, pipelining, and using vectorized operations. It is thus apparent that all current optimization techniques revolve around careful tuning of data movement and maximizing data reuse.
The DataCentric (DaCe) parallel programming framework [dace] enables performance optimization on heterogeneous architectures by defining a development workflow that enforces a separation between computation and data movement. The core concept powering program transformation is the Stateful Dataflow multiGraph (SDFG), which is a graph intermediate representation that defines containers and computation as nodes, and data movement as edges. DaCe takes input code written in Python or DSLs, and outputs corresponding SDFGs. Subsequently, programmers can mutate the dataflow via userextensible graphrewriting transformations to change the schedule of operations, the layout of data containers, mapping of data and computation to certain processing elements, or any adaptation to the data movement that does not change the underlying computation.
As opposed to traditional optimizing compilers and deep learning frameworks (e.g., XLA, TorchScript), DaCe promotes a whitebox approach for performance optimization. The framework provides an API to programmatically instrument and explore, e.g., different layouts and kernel fusion strategies, all without modifying the original code. DaCe was shown to map applications to different hardware architectures, including CPUs, GPUs, and FPGAs [dace], enabling both wholeprogram and microoptimizations of nontrivial applications to stateoftheart performance [daceomen].
The combination of the separation of the algorithm from the transformed representation and whitebox approach for optimization enables us to inspect and optimize the data movement characteristics of Transformer networks. As we shall show in the next sections, this drives a methodical approach to optimizing a complex composition of linear algebraic operations beyond the current state of the art.
We now apply our recipe to optimize data movement in training, using a BERT encoder layer as an example. We focus on a single encoder layer, since these are simply repeated throughout the network, and other components (e.g., embedding layers) are not a significant component of the runtime. In this section, we discuss dataflow and our operator classification. Sections IV and V discuss our optimizations and Section VI presents endtoend results for transformers.
At a high level, our recipe consists of the following steps:
Construct a dataflow graph for the training process and analyze the computation to identify common operator classes.
Identify opportunities for data movement reduction within each operator class using data reuse as a guide.
Systematically evaluate the performance of operators with respect to data layout to find nearoptimal layouts.
Find the best configurations to optimize endtoend performance of the training process.
We use SDFGs and the DaCe environment to construct and analyze dataflow graphs. Fig. 2 provides a simplified representation of dataflow in a transformer encoder layer. Each node represents an operator, which is a particular computation along with its associated input and output, which may vary in size. An operator may be implemented as multiple compute kernels, but is logically one operation for our analysis. To produce an SDFG, all that is required is a simple implementation using NumPy. As the goal of this stage is to understand the dataflow, we do not need to optimize this implementation: It is simply a specification of the computations and data movement.
Using DaCe, we can easily estimate data access volume and the number of floating point operations (flop) required for each computation. Fig.
2 is annotated with the number of flop and the ratio of flop to data volume, and we provide a precise comparison with PyTorch in Tab. III. The key observation is that the ratio of data movement to operations performed varies significantly among operators. In many cases, the runtime of an operator is dominated by data movement, rather than useful computation, and this should be the target for optimization.Operator class  % flop  % Runtime 

Tensor contraction  99.80  61.0 
Stat. normalization  0.17  25.5 
Elementwise  0.03  13.5 
With this analysis, we can now identify highlevel patterns that allow us to classify operators. We base our classification both on the data movement to operation ratio and the structure of the computations. This classification is useful as it allows us to consider optimizations at a more general level, as opposed to working on each operator (or kernel) individually. For transformers, we find three classes: tensor contractions, statistical normalizations, and elementwise operations. The border of each operator in Fig.
2 indicates its class and Tab. I gives the proportion of flop and runtime for a BERT encoder layer for each class.These are matrixmatrix multiplications (MMMs), batched MMMs, and in principle could include arbitrary tensor contractions. We consider only MMMs and batched MMMs for simplicity, as these are efficiently supported by cuBLAS. In transformers, these are linear layers and components of MHA. These operations are the most computeintensive part of training a transformer. For good performance, data layout and algorithm selection (e.g., tiling strategy) are critical.
These are operators such as softmax and layer normalization. These are less computeintensive than tensor contractions, and involve one or more reduction operation, the result of which is then applied via a map. This compute pattern means that data layout and vectorization is important for operator performance.
These are the remaining operators, and include biases, dropout, activations, and residual connections. These are the least computeintensive operations.
The MUE metric [fuhrer2018near] provides a measure of the memory efficiency of an operation, both with respect to its implementation and achieved memory performance. This provides another method for understanding performance beyond flop/s that is particularly relevant for applications that are bottlenecked by data movement. MUE evaluates the efficiency of a particular implementation by comparing the amount of memory moved () to the theoretical I/O lower bound [jia1981complexity] () and the ratio of achieved () to peak () memory bandwidth:
If an implementation both performs the minimum number of operations and fully utilizes the memory bandwidth, it achieves . This can be thought of as similar to the percent of peak memory bandwidth.
All our results were produced on the Lassen supercomputer [lassen], which consists of 795 nodes, each with two IBM Power9 CPUs and four Nvidia V100 GPUs with NVLINK2 and 16 GB of memory. We use CUDA 10.1.243 and GCC 7.3.1 to build our code. For comparison, we use cuDNN 7.6.5, PyTorch 1.5.0 (PT) (builtin transformer implementation), TensorFlow 2.1 from IBM PowerAI with XLA enabled (transformer adapted from [wolf2019huggingfacests]) (TF+XLA), and a recent development version of DeepSpeed (DS). Unless noted, our results use mixedprecision training [micikevicius2018mixed], with FP16 data and accumulations performed at FP32. In PyTorch, we use Apex [apex] for mixedprecision; in TensorFlow and DeepSpeed, we use the builtin automatic mixedprecision. Results are the mean of 100 measurements. When we compute the percent of peak performance, we use the 125 Gflop/s Tensor Core peak on our V100s for tensor contraction operations, and the 31.4 Gflop/s halfprecision peak for other operations.
Our running example is training BERTlarge. We use a minibatch size of , sequence length , embedding size , heads, and a projection size of .
A significant portion of the runtime in existing transformer implementations is in statistical normalization and elementwise operators (Tab. I
). Thus, fusion is a major opportunity for promoting data reuse. We find fusion opportunities with a combination of standard performance engineering heuristics and by using DaCe to identify conformant iteration spaces.
We develop a set of fusion rules applicable to any application with operators similar to the three here. The process consists of two steps: detecting which operations can be fused and then deciding whether it is beneficial to fuse them.
To discover fusion opportunities, we analyze their iteration spaces. Every operator has independent dimensions. Statistical normalization operators also contain reduction dimensions. Tensor contractions additionally have reduction dimensions, and special independent dimensions for the two input tensors.
The type of iteration space determines which tools are used to implement them. Independent dimensions can be implemented using GPU block or thread parallelism, or with "for" loops within a single thread. Reduction dimensions use these techniques but also specify how the reduction is to be performed: accumulating to the same memory location ("for" loops), or as grid, block, or warpreductions.
Two operators can be fused if their iteration space implementations are compatible: They are either the same or the only difference is that one operator performs a reduction. The order and size of dimensions and the implementation for each must match. If the first operator produces output the second uses as input, partial fusion can be done: The outermost independent dimensions can be shared, but the innermost iteration spaces are put in sequential order inside.
When a fusion opportunity is detected, we take it in two cases: First, if we can perform fewer kernel launches by merging iteration spaces. Second, if we achieve less data movement by avoiding loads and stores between operators. Theoretically, the first case could increase memory pressure, but we observe it provides benefits in practice.
We attempt to fuse maximally. There are four structural patterns (Fig. 3) in the dataflow graph for the encoder layer when fusion rules explained above are applied to a pair of nontensor contraction operators. Using the SDFG, we fuse two adjacent operators whenever we detect these patterns and continue until we cannot fuse further. This means we fuse until either a reduction dimension or iteration space changes. As a further constraint, we only fuse simple elementwise scaling operations into tensor contraction operators (see Sec. IVC).
We implement each fused operator as a single custom CUDA kernel and specialize it for a specific data layout using templates to maximize opportunities for compiler optimization. To correctly handle data dependencies, if a reduction is the first operator in a fusion kernel, it is implemented with two loops: the first computes the reduction and the second uses it. Otherwise, each kernel is implemented as a single loop.
Reduction operations in statistical normalizations use a warp allreduce among all threads in a warp, implemented with CUDA intrinsics. If the number of elements to be reduced exceeds the warp size, we perform a sequential reduction over smaller chunks first. Layoutpermuting, we use vectorized loads, stores, and arithmetic within a single thread, and fall back to wordwise implementations otherwise. For dropout operators, which must generate a random mask, we use cuRAND for random number generation.
After fusion, we have the following fused elementwise and normalization operations. Fig. 3 illustrates several cases.
aib: Attention input bias.
baob: Backward attention output bias.
baib: Backward attention input bias.
sm: Softmax with scaling and dropout.
brd: Bias, ReLU, and dropout.
bdrln: Bias, dropout, residual connection, and layernorm.
bsb: Backward layernorm scale and bias.
blnrd: Backward layernorm dX and dropout, saving the intermediate result for the residual connection.
bdrb: Backward dropout, ReLU, and bias.
ebsb: Backward residual and layernorm scale and bias.
bs: Backward dropout and softmax with scaling.
bei: Backward encoder input residual connection.
Unfused  fused  fused  

Forward  345  294  275 
Backward  342  312  291 
Input  Output  PyTorch  Ours  
Operator  Gflop  (1e6)  (1e6)  Gflop  Time (s)  % peak  Time (s)  % peak  MUE  Speedup  Kernel  
Forward  Q, K, V  24  7.3  12.5  24.012  333  56.2  306  61.2  12  1.08  — 
Input bias  0.012  12.5  12.5  0.023  90  0.4  66  0.5  78  1.35  }1*[aib]  
4  8.3  33.5  4.031  189  16.5  143  21.8  50  1.32  —  
Scaled softmax  0.188  33.5  100.6  0.89  453  1.3  433  1.3  32  1.04  }1*[sm]  
Gamma  4  37.7  4.1  8.008  142  21.9  160  19.4  6  0.88  —  
Out  8  5.2  4.1  8.09  136  45.9  120  52  10  1.13  —  
Output bias  0.004  4.1  4.1  0.008  34  0.4  102  0.1  42  1.68  }4*[drln]  
Dropout  0.004  4.1  8.3  0.013  37  0.3  
Residual  0.004  8.3  4.1  0.008  36  0.3  
LayerNorm  0.027  4.1  4.1  0.048  63  1.3  
Linear  32  8.3  16.7  32.016  451  55.4  402  62.1  12  1.12  —  
Bias  0.016  16.7  16.7  0.031  116  0.4  183  0.3  76  1.90  }3*[brd]  
ReLU  —  16.7  16.7  —  112  0  
Dropout  0.016  16.7  33.5  0.048  120  0.4  
Linear  32  20.9  4.1  32.09  449  55.6  369  67.6  6  1.21  —  
Bias  0.004  4.1  4.1  0.008  35  0.3  101  0.1  43  1.70  }4*[bdrln]  
Dropout  0.004  4.1  8.3  0.013  37  0.3  
Residual  0.004  8.3  4.1  0.008  37  0.3  
LayerNorm  0.027  8.3  4.1  0.048  63  1.3  
Backward  LayerNorm dW  0.016  8.3  <0.01  0.02  184  0.3  150  0.3  6  1.22  }1*[bsb] 
LayerNorm dX  0.035  8.3  4.1  0.06  78  1.4  71  1.5  37  1.58  }2*[blnrd]  
Dropout dX  0.004  8.3  4.1  0.008  34  0.4  
Linear+Bias dX  32  8.3  16.7  32.016  427  58.4  414  60.3  5  1.03  —  
Linear dW  32  20.9  4.1  32.027  424  58.9  378  66  13  1.11  —  
Bias dW  0.004  4.1  <0.1  0.005  24  0.5  362  <0.1  38  1.05  }4*[bdrb]  
Dropout dX  0.016  33.5  16.7  0.031  129  0.4  
ReLU dX  —  33.5  16.7  —  166  0  
Bias dW  0.016  16.7  <0.1  0.02  61  0.8  
Linear+Bias dX  32  20.9  4.1  32.027  417  59.9  398  62.7  6  1.04  —  
Linear dW  32  20.9  4.1  32.09  437  57.2  372  67.2  6  1.17  —  
Residual  0.004  8.3  4.1  0.008  36  0.3  250  <0.1  17  0.89  }2*[ebsb]  
LayerNorm dW  0.016  8.3  <0.1  0.02  186  0.3  
LayerNorm dX  0.035  8.3  4.1  0.06  80  1.4  69  1.6  37  1.64  }2*[blnrd]  
Dropout dX  0.004  8.3  4.1  0.008  34  0.4  
Output bias dW  0.004  4.1  <0.1  0.005  23  0.5  38  0.3  22  0.60  }1*[baob]  
Out dX  8  4.3  4.1  8.044  131  47.6  119  52.2  10  1.09  —  
Out dW  8  8.3  1.0  8.09  136  45.9  113  54.8  5  1.19  —  
Gamma dX1  4  8.3  33.5  8.008  136  22.8  147  21.2  7  0.93  —  
Gamma dX2  4  67.1  33.5  4.031  188  16.6  123  25.2  8  1.52  —  
Scaled softmax dX  0.156  12.5  4.1  0.199  790  0.6  426  1.1  49  1.85  }1*[bs]  
dX1  4  37.7  4.1  4.004  135  23.1  155  20  7  0.86  —  
dX2  4  37.7  4.1  8.008  139  22.3  115  26.9  9  1.20  —  
Q, K, V dX  24  15.7  4.1  24.027  344  54.4  274  68.2  6  1.25  —  
Q, K, V dW  24  20.4  1.0  24.132  329  57  293  64  6  1.12  —  
Input bias dW  0.012  12.5  <0.1  0.015  52  0.7  39  0.9  66  1.32  }1*[baib]  
Residual  0.004  8.3  4.1  0.008  35  0.3  31  0.4  83  1.14  }1*[bei]  
Tensor contractions  312  —  —  324.75  4951  43.1  4411  48.5  —  1.12  
Stat. normalizations  0.535  —  —  1.389  2063  0.9  1591  0.6  —  1.29  
Elementwise  0.098  —  —  0.223  1096  0.3  735  0.1  —  1.49  
Total  312.633  —  —  326.362  8110  31.1  6739  35  —  1.20 
Tab. III presents our comprehensive results, including operator fusion. In this, we show a detailed breakdown of the required and observed flop, data movement, runtime, and MUE for each operator within the encoder layer, for both PyTorch and our implementation, with our fused operators marked. We can easily observe that while the vast majority of flop are in tensor contractions, much of the runtime is in statistical normalization and elementwise operators (see also Tab. I). These operators are indeed memorybound.
In forward propagation, every fused operator outperforms PyTorch’s. In backpropagation, this trend generally holds, but ebsb and baob are slower due to our configuration selection algorithm choosing a layout that is suboptimal for some operators to optimize the overall performance (see Sec VI).
By studying the MUE and flop/s, we can reason about the bottlenecks behind each operator. For the fused operators, we see that high MUE rates are often achieved. In fact, the MUE from Tab. III and the theoretical flop/IO ratio from Fig. 2 are highly correlated across operators. We say that a kernel is memory bound if its MUE is larger than the achieved peak flop/s, and compute bound otherwise. This insight aids in analyzing the bottlenecks of general DNNs and automated tuning of operators, prior to measuring their performance. We note that for our operators, which involve multiple tensors of different shapes, 100% MUE is potentially unattainable as achieving peak memory bandwidth requires a specific, highly regular access pattern into DRAM.
As for the tensor contraction results, we see that the attained MUE is consistently under 50%. This is acceptable, since the underlying matrix multiplications are generally computebound. However, we note that some cases, such as , are both low in flop/s and MUE. A more indepth look into the dimensions of the contraction reveals that the dimensions are small, which then indicates that the tensor cores are underutilized. This may result from insufficient scheduled threads, or low memory throughput to compute ratio. We thus try to increase hardware utilization by fusing additional operators into the contractions next.
Because cuBLAS does not support fusing arbitrary operators into (batched) MMMs, we evaluated CUTLASS [cutlass] version 2.1 as an alternative, which does support fusing elementwise operators. We conducted a simple benchmark comparing cuBLAS with a separate bias kernel to CUTLASS for the first linear layer of BERT. We found that the batched MMM in CUTLASS is approximately 40 s slower than cuBLAS. The reduction in runtime by fusing the bias is less than this. Hence, we only consider cuBLAS for tensor contractions. cuBLAS does support simple scaling operations, which we use to implement the scaling for the scaled softmax operator.
There is an additional opportunity for fusion among certain tensor contraction operators. Using domain knowledge and the dataflow graph, we can identify some operations that can be combined into a single algebraically equivalent operation. For example, there are several different ways to perform the input projections (batched MMMs) in selfattention, since the input queries, keys, and values are the same tensor, :
can be multiplied by each of the projection matrices: , , and .
The and projection matrices can be stacked and two multiplications performed: and . Similarly, the and matrices can be stacked.
All three can be stacked: .
In backpropagation, the and computations for each of the projection matrices can be similarly fused: and , respectively.
There are different tradeoffs to these approaches, which must be determined empirically. Performing separate MMMs may enable task parallelism. On the other hand, this stacking enables data reuse, since is used only once.
Tab. II presents results for each of these cases. Fully fusing this batched MMM performs the best. Unfortunately, cuBLAS launches kernels that utilize the entire GPU, so task parallelism is not profitable. This specific example can also be adapted to fuse keys and values in encoder/decoder attention.
We now consider data layout selection, which enables efficient access patterns, vectorization, and tiling strategies for tensor contraction and statistical normalization operators. To study this systematically, for each operator, including the fused operators produced in the prior step, we benchmark every feasible data layout to measure its performance, as well as varying other parameters depending on the specific operator. The best parameterization of an operator is highly dependent on the GPU model and tensor sizes, and may not be obvious a priori; hence it is important to empirically consider this.
Because there are a myriad of potential configurations for each operator, we summarize the distribution of runtimes over all configurations using violin plots. The width of the violin represents the relative number of configurations with the same runtime. This allows us to see not only the best runtime, but to see how sensitive operators are to layouts, an important factor when globally optimizing the layout in Step 4 of our recipe.
Using the Einsum notation for tensor contractions, we consider all equivalent permutations of the summation string. The einsum is then mapped to a single cuBLAS MMM or batched MMM call. While this notation allows us to express arbitrary tensor contractions, as cuBLAS does not support all configurations, we limit ourselves to these two types.
In addition, we consider every possible cuBLAS algorithm for each layout, as we have observed that the heuristic selection provided by its default algorithm is not always optimal. We use the cublasGemmEx API to manually select algorithms. We support both regular and Tensor Core operations, and perform all accumulations at singleprecision, in line with best practices for mixedprecision training [micikevicius2018mixed].
Fig. 4 presents performance distributions over all data layouts for every tensor contraction in the encoder layer training, including algebraic fusion variants. Since input matrices for cuBLAS MMs can be easily swapped, results for both orders are merged into the same tile of the figure. All tiles are labeled with . As expected, in many cases using tensor cores offers significant performance advantages, but interestingly, in several cases (where one of the matrix dimensions is 64) the performance is quite close to the regular floating point units, due to a failure to saturate the tensor cores. Among the tensor core results, we can typically see there are several modes in the performance distributions; these correspond to particularly important axes for data layout. Indeed, for many configurations, one of these is near or contains the bestperforming configuration, indicating that many slightly different data layouts could be used with little impact on performance depending on the needs of our global optimization pass. However, this does not mean that any data layout is acceptable; in every case, the majority of the mass does not perform well, illustrating the importance of careful tuning.
We also investigated how well the builtin cuBLAS algorithm heuristics compare to the bestperforming configuration. On half precision, we found that the algorithm chosen by cuBLAS’s heuristic was up to 14.24% worse than the best algorithm (in dX1). We also investigated the performance at singleprecision and found similar results with up to 7.18% worse performance. This demonstrates the importance of carefully tuning for the particular hardware and workload.
For our fused operators, we consider all combinations of input and output layout permutations. This enables us to include transposes of the output data as part of the operator, should the next operator perform better in a different layout than the input. The data layout is especially critical for statistical normalization operators, where the appropriate data layout can enable vectorization opportunities, especially vectorized loads and stores for more efficient memory access. Where relevant, we therefore also consider vectorization dimensions, the mapping of dimensions to GPU warps, etc. Our implementation takes advantage of these layouts when feasible.
Fig. 5 presents the runtime distribution for all configurations of our fused operators (note some are used twice). The distributions here are qualitatively similar to those in Fig. 4, except these have much longer tails: A bad configuration is, relatively, much worse, potentially by orders of magnitude.
All kernels support changing the layouts of tensors they use. This change is done via template parameterization, so the compiler can generate efficient code. All kernels support selecting different vectorization dimensions. The brd and bei kernels can select the dimension used for CUDA thread distribution; bsb, ebsb, and bdrb can select the warp reduction dimension, as they reduce over two dimensions.
The most noticeable performance improvement is made by layouts that enable vectorized memory access, showing that the main performance limitation is the amount of accessed data volume. The second notable category are layouts with the same reduce and vector dimensions. Joining these dimensions decreases the number of registers required to store partially reduced values from the vector size (eight at FP16) to one.
We can expect to get good performance restricting ourselves to configurations in the two groups described above. Usually, the best layout discovered follows them. For example, the sm kernel has the same warp and reduction dimensions, and these dimensions are the last and sequential ones for involved arrays. However, this intuition is not always correct. Examining the results in Fig. 5, we find that there are kernel configurations that both satisfy these intuitive rules yet are very slow. For example, the best configuration of aib takes 65 s, and the worst "intuitively good" configuration takes 771 s.
Configurations discovered through exhaustive search can enhance our intuitive understanding of requirements that should be satisfied to achieve better performance. The brd
kernel provides an example of that. It uses four 3D tensors for which we can choose one among six possible layouts. According to our intuitive rules, we want to put the vectorized dimension last to make it sequential for all of them. However, the best configuration has only two out of four arrays vectorized while the others are not vectorized. With this information, we can refine our intuition. Probably the factor that limits vectorization over all arrays is excessive GPU register usage. The important point here is that even this new intuitive knowledge doesn’t help to find the exact number of tensors that should be vectorized, but the exhaustive search does.
The final step is to assemble fused components and select data layouts for each operator to yield a complete implementation. This is the culmination of the prior steps performing dataflow analysis, fusion, and layout selection. From these, we have performance data for all data layouts as well as algebraic fusion strategies. One cannot simply pick a single data layout a priori, as the benefit of running two operators in different layouts may outweigh the overhead of transposing data.
Our assembled implementation is structured using the SDFGs produced in Step 1. We integrate it into PyTorch [paszke2019pytorch] via its C++/CUDA operator API.
We develop an automatic configuration selection algorithm to globally optimize our implementation using the performance data. We construct a directed graph based on the dataflow graph of the operators. Beginning from the input data and proceeding in the order given by a preorder depthfirst search, we add a node to the graph for each input and output data layout of the operator. An edge is added from the input to the output layout, weighted with the minimum runtime of any configuration with that layout. Determining this simply requires a linear search over the matching performance data. This allows us to select both the best data layout and any other operator knobs (e.g., vectorization dimension). To minimize the size of the graph, we only add a configuration from an operator when it has at least one input and output edge. We then run a singlesource shortestpath (SSSP) algorithm from the input to the output in the graph; the resulting path gives our final configuration. Because this graph is a DAG and the number of feasible input/output configurations for each operator is small, SSSP takes linear time asymptotically and seconds for BERT. We illustrate the graph used in Fig. 6.
To simplify the implementation, we make two assumptions on the global configuration space. First, we omit residual connections. Second, we run SSSP only for the forward propagation dataflow graph, and then infer the layouts of the backpropagation operators from those selected (including weight and gradient layouts). Both of these assumptions could be relaxed in a future version of this algorithm. Although this means we are not guaranteed to find a globally optimal data layout, the runtime of our configuration is nevertheless within 4% of the sum of the best possible configuration of each operator (which ignores data layout constraints).
We first analyze the performance of multihead selfattention. While it is a key primitive in BERT, MHA is also used outside of transformers, so understanding its performance in isolation can inform other models too. Tab. IV compares our globallyoptimized implementation with PyTorch, TensorFlow+XLA, and cuDNN. cuDNN’s (experimental) MHA implementation (cudnnMultiHeadAttnForward and related) supports six different data layouts; we report the fastest.
cuDNN’s performance is orders of magnitude worse than the others; as it is a black box, our ability to understand it is limited. However, profiling shows that its implementation launches very large numbers of softmax kernels, which dominate the runtime, indicating additional fusion would be profitable. TensorFlow+XLA finds several fusion opportunities for softmax. However, its implementation does not perform algebraic fusion for the queries, keys, and values, and it uses subpar data layouts for tensor contractions.
Our performance results in Tab. III illustrate the source of our performance advantage over PyTorch: Our data layout and algorithm selection results in faster tensor contractions overall. This is despite the Gamma stage actually being slower than PyTorch’s: Sometimes locally suboptimal layouts need to be selected to improve performance globally.
TF+XLA  PT  cuDNN  Ours  

Forward (ms)  1.60  1.90  131  1.25 
Backward (ms)  2.25  2.77  652  1.86 
PT  TF+XLA  DS  Ours  

Forward (ms)  3.45  3.2  2.8  2.63 
Backward (ms)  5.69  5.2  4.8  4.38 
We present overall performance results for the encoder layer in Tab. V. For forward and backpropagation combined, we are faster than PyTorch and faster than TensorFlow+XLA, including unoptimized framework overheads. At a high level, this is because we perform a superset of the optimizations used by both frameworks, and globally combine them to get all the advantages while minimizing drawbacks. As a general guideline, we use flop and MUE rates as proxies for which operators require the most attention and their corresponding bottlenecks. This ensures a guided optimization rather than tuning all operators aggressively.
We also include performance results from DeepSpeed, which we are faster than. This is despite DeepSpeed being a manuallyoptimized library tuned specifically for BERT. Note also that DeepSpeed modifies some operations, e.g., to be reversible or to exploit output sparsity, and so is not always strictly comparable to the other implementations. This also provides it opportunities for optimization that we do not consider.
The total data movement reduction we attain is 22.91% over the standard implementation. We obtain this information from Tab. III, where for each fused kernel we omit the interim outputs and inputs that are not part of the overall I/O that the fused kernels perform. TensorFlow+XLA’s automatic kernel fusion reduces data movement similarly to ours. However, the data layouts used for tensor contractions are not optimal, and its BERT encoder implementation does not use algebraic fusion in MHA. PyTorch’s data layouts enable faster tensor contractions and it implements the algebraic fusion, but it has higher overheads for other operators.
Our fusion pass finds all the opportunities that TF+XLA does, plus several additional ones; for example, we implement layernorm as a single fused kernel. Our data layout selection picks better layouts than PyTorch in nearly every individual instance; when it does not, this is because the layout change enables greater performance gains downstream. In Tab. III
, we also see that PyTorch performs more flop than predicted. Some of this is due to padding in cuBLAS operations, and generic methods performing excess operations. However, we also discovered that some cuBLAS GEMM algorithms, including ones called by PyTorch, incorrectly perform twice as many FLOPs as necessary; our recipe avoids these cases automatically.
We also briefly considered another configuration for training BERT, where we change the batch size to and the sequence length to , and retuned our layout selection. In this case, forward and backpropagation for a single encoder layer takes 18.43 ms in PyTorch, 16.19 ms in DeepSpeed, and 16.22 ms in our implementation. We significantly outperform PyTorch, and match the performance of DeepSpeed in this case (even with its additional optimizations). We believe that with further improvements to our layout selection algorithm, the performance of our implementation will improve further.
Beyond BERT, other transformers have very similar layers, such as decoder layers in GPT2/3. With very few changes, our recipe and implementations are directly applicable to these. Our implementation can also be extended to support a full training pipeline by stacking our optimized layers.
There has been significant work optimizing both transformers in particular and deep learning in general. For a recent comprehensive overview, see BenNun & Hoefler [ddlsurvey]. To help guide training regimes for transformers, recent work has provided empirical recommendations on model size, batch size, and so on [li2020train, kaplan2020scaling, rosenfeld2019constructive]. Many of the subsequent techniques we review are complementary to our work.
Most directly relevant are other approaches specifically to accelerate transformer training. Distributedmemory techniques, such as ZeRO [rajbhandari2019zero], Megatron [shoeybi2019megatron], and MeshTensorFlow [shazeer2018mesh] scale training to many GPUs to accelerate it. MeshTensorFlow also presents a classification of operators similar to ours. Large batches have also been used to accelerate training via LAMB [you2019large] or NVLAMB [nvlamb]. None of these directly address the performance of a single GPU as done in this paper. DeepSpeed [deepspeed], which we compare with in Section VIC, is closest to our work, but performs all optimizations and layout selections manually.
Transformer architectures to enable improved training have also been the subject of significant recent work. ALBERT [lan2019albert] used a combination of weight sharing and factorized embedding layers to reduce memory requirements; however compute times are unaffected. TransformerXL [dai2019transformer] caches prior hidden states to learn long sequences. RevNets [gomez2017reversible], a variant of ResNets which allow activations to be reconstructed during backpropagation, have been applied to transformers. Notably, Reformer [kitaev2020reformer] combines reversible residual layers with localitysensitive hashing to improve the efficiency of multihead attention. Sparsity optimizations for attention [bahdanau2014neural, luong2015effective, shen2018bi, parmar2018image, tay2019simple, child2019generating, correia2019adaptively, sukhbaatar2019adaptive, tay2020sparse] reduce memory and compute requirements. We view these as orthongal to our work: the same principles of dataflow analysis can be applied to optimize for sparsity and reuse.
There has also been much work on optimizing deep learning in general. Many frameworks provide implementations of transformers or their components, such as PyTorch [paszke2019pytorch], TensorFlow [tensorflow2015whitepaper], cuDNN [chetlur2014cudnn], and others built atop these [ott2019fairseq, wolf2019huggingfacests]. Optimizing frameworks can also be applied to transformers [xla, frostig2018compiling, jax2018github, torchscript, rotem2018glow, jia2019beyond, jia2019optimizing, jia2019taso, chen2018tvm, nnvm, sivathanu2019astra, mirhoseini2017device, cyphers2018intel, baghdadi2019tiramisu, vasilache2018tensor, lethin2019polyhedral, wei2017dlvm, truong2016latte, venkat2019swirl, dong2019acorns, elango2018diesel]. None of these frameworks provide all the optimizations or the systematic study of data movement and its impact on performance. We have specifically compared against some of the most popular production frameworks: PyTorch, TensorFlow with XLA, and cuDNN. Beyond these, TASO [jia2019taso] targets similar optimizations to ours by using graph substitutions, but considers only inference and does not exhaustively explore the search space.
Other optimizations, including model parallelism [van2015lbann, jia2019beyond, gholami2018integrated, dean2012large, chilimbi2014project, shazeer2018mesh, buchlovsky2019tf], pipeline parallelism [chen2012pipelined, li2018pipe, narayanan2019pipedream, huang2019gpipe], microbatching [oyama2018accelerating], and recomputation for memory reduction [chen2016training, jain2019checkmate] are all also applicable. Communication can also be a major bottleneck for training transformers, due to the large model size [shoeybi2019megatron, shazeer2018mesh]. Frameworks for inference, including TensorRT [tensorrt], Caffe2 [caffe2], and the ONNX Runtime [onnxruntime], all help to enable a suite of optimizations primarily applicable during inference. Pruning [shazeer2019fast, voita2019analyzing] and distillation [sanh2019distilbert] has also been used to accelerate inference.
Neural network architecturespecific optimizations have a long history outside of transformers, and have primarily targeted CNNs [krizhevsky2012imagenet, coates2013deep, goyal2017accurate, akiba2017extremely, you2018imagenet, mikami2018imagenet, ying2018image, dryden2019improving, dryden2019channel]. Notably, Li et al. [li2016optimizing] optimized data layouts for convolution.
In general, data movement reduction is a core component of highlevel optimization [tpdslocality]. Optimizing compilers, most notably components that specialize in polyhedral programs [polly, pluto]
, apply loop transformations (e.g., tiling, skewing) that belong to the class of data movement optimization. Other whitebox approaches for separation of program definition from data optimization passes include Halide
[ragan2013halide], JAX [frostig2018compiling, jax2018github], Legion [bauer2012legion], Lift [lift], and MLIR [lattner2020mlir]. The datacentric approach proposed here enables userextensible coarse and finegrained data movement optimization via the flat (yet parametric) dataflow graph representation [dace]. This allows us to perform and tune complex data layout and fusion transformations that span multiple granularity levels, surpassing the optimization capabilities of the aforementioned tools and achieving stateoftheart performance.The recipe we propose in this paper can be directly adopted in other DNN architectures. Additional transformer networks, such as MegatronLM [shoeybi2019megatron] and GPT3 [brown2020language], only differ by dimensions and minor aspects in the encoder and decoder blocks (e.g., dropout position, biases). Once a datacentric graph is constructed from them, the recipe remains unchanged.
The classification of operators into three groups covers a wide variety of operators that span beyond transformers.
Large tensors and their contraction are ubiquitous in modern DNNs. For fully connected networks (MLPs) and recurrent neural networks (RNNs), there is little change, as the core operator types are essentially the same. Convolutions, pooling, and other local spatial operators can be treated similarly to tensor contractions, owing to their arithmetic intensity properties and abundance of optimized implementations. Therefore, the same considerations we take here are just as critical in CNNs. However, as opposed to contractions (see Section IVC), further fusion is typically considered for convolutions.
Statistical normalization also takes different forms in DNNs. This includes a variety of reductions, as well as Instance, Group, and Batch Normalization, where the latter constitutes the second largest computation in ResNets after convolutions. When varying data layouts, these operators share properties (normalizing a dimension) and are optimized in exactly the same way. Lastly, elementwise operators always exist in DNNs and benefit from the same fusion and bulk data movement optimizations as we perform here. For graph neural networks
[bronstein2017geometric], capsule neural networks [sabour2017dynamic], and other emerging architectures, the operators change more significantly, but the basic procedure applies.Due to the prohibitively large search space of configurations in transformers, writing manuallyoptimized kernels becomes infeasible. Instead, each of our datacentric implementations chooses an optimization scheme (e.g., tiling, vectorization, warpaggregated reductions) automatically, according to the input data layout and the operator type. Combined with automated configuration selection (Section VIA), we rely only on the dataflow graph structure to choose the best feasible data layout configuration.
As many networks are bound by data movement rather than compute performance [li2016optimizing]. This leads us to believe that our recipe, regardless of the objective (e.g., classification, regression, RL) and the constituent operators, is generically applicable for optimizing any neural network architecture.
The implications of data movement reduction extend beyond software. Given that the highest performance for different operators is achieved with different data layouts, there would be significant benefits if future machine learning accelerators included builtin support for fast data layout changes. We confirm this in our MUE results (Tab. III), showing that even the most computeintensive tensor contractions are bounded by the hardware’s capability of transferring data to Tensor Cores.
Hardware trends indicate a similar situation. New architectures are moving towards reducing data format conversion (e.g., TF32 [tf32]), increased onchip memory and lowlatency interconnects [graphcore], and coarsegrained spatial hardware [cerebras]. For the latter two, the recipe and analysis provided here is crucial to maintain high utilization in pipelined DNN execution.
Despite the importance of transformer neural networks, training them is memorybound and underutilizes GPUs. Using our recipe for data movement analysis, we identified bottlenecks and optimizations, yielding improved implementations that outperform the already highly tuned stateoftheart. As training transformers is already a major compute workload that will only grow larger, our improvements offer significant realworld impacts for both research and industry.
Our approach is applicable more broadly to deep learning; many neural networks easily fit within our operator classification. This is especially important for guiding the optimization of emerging or novel model architectures, which do not benefit from existing acceleration libraries. Our results also highlight the importance of considering data movement at every level of the training stack—from the application down to hardware.
This project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 programme (grant agreements DAPP, No. 678880, and EPiGRAMHS, No. 801039). N.D. is supported by the ETH Postdoctoral Fellowship. T.B.N. is supported by the Swiss National Science Foundation (Ambizione Project #185778). Experiments were performed at the Livermore Computing facility.