Recurrent neural networks (RNNs), including long short-term memory (LSTM) and gated recurrent unit (GRU) networks, used to be the best solutions in natural language processing (NLP). This changed when the Transformer model was introduced in 2017 and outperformed previous RNN models on multiple tasks. By avoiding recurrent calculations and taking full advantage of the attention mechanism, the Transformer and Transformer-based pre-trained language models (such as BERT, ALBERT, T5, ERNIE, and StructBERT) have achieved state-of-the-art accuracy in various NLP tasks.
Despite this progress in related fields, the high computational complexity and huge memory requirements of these powerful Transformer networks make them hard to deploy on mobile devices or embedded systems. More and more researchers are paying attention to this problem, and one way to address it is model compression. Several techniques have been used to compress these networks, including data quantization, pruning, knowledge distillation, and Architecture-Invariant Compression (AIC).
Recently, FPGA and ASIC hardware accelerators for deep neural networks (DNNs) have achieved great success in both academia and industry, which leads us to believe that designing efficient hardware architectures for Transformer networks is an important topic as well. By implementing them on hardware platforms, the inference systems of many NLP applications, such as machine translation, question answering, and sentiment analysis, can achieve higher speed, lower power consumption, or both. However, intense matrix computations, complicated data flow, and complex non-linear functions make it hard to design an efficient hardware architecture for the Transformer. To the best of our knowledge, we are the first to propose a specific hardware accelerator for the Transformer. In the open literature, the accelerator of Ham et al. is the only hardware architecture for accelerating the attention mechanism in various neural networks, and it is not specifically designed for the Transformer.
As noted in prior work, most of the trainable parameters and the computations are in the multi-head attention (MHA) ResBlock and the position-wise feed-forward network (FFN) ResBlock, as discussed in detail in Section II. In this work, we design a reconfigurable hardware architecture based on a systolic array (SA) for the MHA ResBlock and the FFN ResBlock, which are the two most complex layers in the Transformer.
The main contributions of this work can be summarized as follows:
We provide an efficient method to partition the huge matrices in the Transformer, which allows the MHA ResBlock and the FFN ResBlock to share most of the hardware resources.
We propose the first hardware architecture design that can complete the calculations for both of these ResBlocks. To ensure high hardware utilization of the SA, which is the biggest module in our design, the computation flow is carefully designed.
The two most complicated nonlinear functions, the scaled masked-softmax and layer normalization, are highly optimized to be more hardware-friendly. Because layer normalization is the bottleneck of the proposed architecture, its latency is reduced as much as possible.
After quantizing the Transformer base model (as distinguished from the Transformer big model) with 8-bit integers (INT8), we evaluate our design on the Xilinx xcvu13p-fhga2104-3-e FPGA with the max sequence length (denoted as $s$) equal to 64 and the batch size equal to 1. The hardware experimental results demonstrate a speed-up of 14.6× in the MHA ResBlock and a speed-up of 3.4× in the FFN ResBlock, compared to a GPU implementation on an NVIDIA V100.
The rest of this paper is organized as follows. Section II gives a brief review of the Transformer networks, and explains the importance of accelerating the MHA ResBlock and the FFN ResBlock. Section III presents the method of matrix partitioning. Section IV describes the proposed hardware architecture. Experimental results are given in Section V. Section VI concludes this paper.
II Background and Motivation
II-A The Model Architecture of the Transformer
The model architecture of the Transformer is described in Fig. 1, containing an encoder stack and a decoder stack. Notice that most of the trainable parameters and the computations are in these two stacks; other components besides the stacks, such as the embedding layers and the softmax output layer, are not considered in this work. As shown in Fig. 1, all the encoder layers and the decoder layers are composed of two kinds of ResBlocks, the MHA ResBlock and the FFN ResBlock.
Fig. 2 shows the structure of the MHA ResBlock. An MHA ResBlock has $h$ “Attention Heads”, and the input of each Head is the same as the input of the ResBlock, including three tensors: V (values), K (keys), and Q (queries). The Scaled Dot-Product Attention function in the MHA can be expressed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\mathrm{Mask}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)\right)V \qquad (1)$$
The Mask operation is used to mask out all values in the input of the softmax corresponding to illegal connections. The parameter $d_k$ is equal to 64 in both the Transformer base model and the Transformer big model. The parameter $h$ is equal to 8 in the base model and to 16 in the big model.
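As a rough illustration of the Scaled Dot-Product Attention with masking described above, the following NumPy sketch computes one Head in floating point. The function name, the boolean mask convention (True marks an illegal connection), the large negative fill value, and the toy sizes are assumptions made for this example; it is not the fixed-point data path of the accelerator.

```python
import numpy as np

def scaled_masked_attention(Q, K, V, mask=None):
    """Scaled dot-product attention for one Head.

    mask is a boolean array that is True where a connection is illegal
    (an assumed convention for this sketch).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaled scores, shape (s, s)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)  # push illegal positions toward zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
s, d_k = 4, 64                                 # toy sequence length; d_k = 64 as in the text
Q, K, V = (rng.standard_normal((s, d_k)) for _ in range(3))
causal_mask = np.triu(np.ones((s, s), dtype=bool), k=1)  # hide future positions
out, weights = scaled_masked_attention(Q, K, V, causal_mask)
```

With the causal mask used here, the first query can only attend to the first key, so the first output row equals the first row of V.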
The FFN ResBlock contains a fully connected feed-forward network, consisting of two linear sublayers and a ReLU activation between them:
$$\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)W_2 + b_2 \qquad (2)$$
II-B Transformer-Based Pre-Trained Models
An important pre-trained model is Bidirectional Encoder Representations from Transformers (BERT). Prior analyses also point out that in BERT the MHA and the FFN ResBlocks still occupy most of the storage space and have the highest numbers of FLOPs.
The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks. Recently, many Transformer-based pre-trained models have obtained top placements on the GLUE score list. Most of these models, such as T5, ERNIE, and StructBERT, have structures very similar to BERT. These facts underline the need for efficient hardware accelerators for the MHA and the FFN ResBlocks, which are two commonly used structures in these models.
III Partitioning Matrices in the FFN and the MHA
Considering the characteristics of the Transformer architecture, we believe that the proposed hardware accelerator should be able to accelerate not only the MHA ResBlock but also the FFN ResBlock. To make sure that the MHA ResBlock and the FFN ResBlock can reuse the hardware resources, we first analyze these two ResBlocks from the perspective of matrix operations, and then give a method to partition the matrices so that all the general matrix-matrix multiplications (GEMMs) can be done with one and the same systolic array (SA), the size of which is limited to $s \times 64$, where $s$ is the max sequence length.
Assuming that the input of the FFN is called X, the shape of the tensor X is the same as that of Q (one of the input tensors of the MHA). Additionally, Fig. 1 shows that the tensor K is always equal to the tensor V. In normal circumstances the sequence lengths of these tensors are equal, so all four tensors share the same shape. Supposing that the batch size is equal to 1, the computations of these two ResBlocks can be considered sets of matrix operations, which are represented in Fig. 3. Such an SA can support all the matrix multiplications in the Linear sublayers of all the Heads. However, how to complete the other multiplications, which involve the larger weight matrices of the MHA output linear sublayer and the two FFN linear sublayers, is another important issue to be considered.
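Table I shows that all of these Transformer networks satisfy $d_{model} = 64h$ and $d_{ff} = 4d_{model}$, where $d_{model}$ is the model dimension, $h$ is the number of Heads, and $d_{ff}$ is the inner dimension of the FFN. On the basis of this pattern, the three large weight matrices can be partitioned as shown in Fig. 4. Thus, most of the GEMMs can be done with the SA.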
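The idea of serving a large weight matrix with a 64-column SA can be sketched as follows. This is only one plausible partitioning, cutting the weight matrix into 64-column blocks; the exact scheme of Fig. 4 may differ. The helper names `sa_gemm` and `partitioned_gemm` and the Transformer base sizes (model dimension 512, FFN inner dimension 2048) are assumptions of this example.

```python
import numpy as np

SA_COLS = 64  # number of SA columns stated in the text

def sa_gemm(x, w_block):
    """Stand-in for one pass through the SA: (s x k) @ (k x 64) -> (s x 64)."""
    assert w_block.shape[1] == SA_COLS
    return x @ w_block

def partitioned_gemm(x, w):
    """Compute x @ w by feeding 64-column blocks of w to the SA one at a time."""
    assert w.shape[1] % SA_COLS == 0
    blocks = [sa_gemm(x, w[:, j:j + SA_COLS])
              for j in range(0, w.shape[1], SA_COLS)]
    return np.concatenate(blocks, axis=1)

rng = np.random.default_rng(0)
s, d_model, d_ff = 64, 512, 2048           # assumed Transformer base sizes
x = rng.standard_normal((s, d_model))
w1 = rng.standard_normal((d_model, d_ff))  # first FFN weight matrix
y = partitioned_gemm(x, w1)
num_passes = d_ff // SA_COLS               # SA passes needed for this matrix
```

Because both large dimensions are multiples of 64, every block fills all 64 SA columns and no pass leaves columns idle.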
The only one left is the operation of in each Head of the MHA. The ratio of the number of multiplications in this operation to the entire MHA ResBlock can be roughly calculated as follows:
Since is no smaller than 16,384 and is usually no bigger than 128, this ratio should be very small, which illustrates that the management of this single operation will not influence the overall hardware utilization much. If is smaller than 64, it can be done with the SA through zero padding to the. Otherwise by partitioning the , the SA can still support this operation with little impact on the utilization of the SA.
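To give a feel for how small this share is, the sketch below counts multiplications in one MHA ResBlock. The accounting is an assumption of this example (per-Head Q, K, and V projections, the per-Head product of queries and keys, the product of the softmax output with the values, and one output projection), as is the function name; it is not necessarily the exact expression used in the ratio above.

```python
def mha_multiplication_counts(s, d_model, h, d_k=64):
    """Count multiplications in one MHA ResBlock under the assumptions above."""
    linear_qkv = 3 * h * s * d_model * d_k  # per-Head Q, K, V linear sublayers
    qk_t = h * s * s * d_k                  # queries times transposed keys, all Heads
    attn_v = h * s * s * d_k                # softmax output times values, all Heads
    linear_out = s * (h * d_k) * d_model    # output linear sublayer
    total = linear_qkv + qk_t + attn_v + linear_out
    return qk_t, total

# Transformer base model with a max sequence length of 64
qk_t, total = mha_multiplication_counts(s=64, d_model=512, h=8)
ratio = qk_t / total
```

For these sizes the score product accounts for 1/34 of the multiplications, about 2.9%.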
IV Hardware Architecture Design for the Proposed Accelerator
Using the proposed method of partitioning these weight matrices, we design the complete hardware accelerator. The top-level architecture is illustrated in Fig. 5.
The SA is made up of a 2D array of processing elements (PEs), with rows and 64 columns. It is designed to output the product matrix column by column, so each column has elements. Connected to the SA output, adders are required to add the bias to the product matrix, and another adders are required to add the residual before calculating the layer normalization function. Overall, the SA module has the highest computational complexity, containing at least multipliers and adders. To increase the hardware utilization, we run the calculations of the Softmax module in parallel with (line 6 in Algorithm 1). Owing to the carefully designed computation flow of the entire system, the SA module hardly stops running until the LayerNorm module starts. As long as the Softmax module can give its output no later than the SA module finishes calculating “”, the latency of the entire system is determined by the SA module and the LayerNorm module. The architectures of these two nonlinear modules are introduced in detail as follows.
IV-A Scaled Masked-Softmax
The Softmax module in the proposed architecture is used to calculate the scaled masked-softmax function. For the convenience of discussion, we named the input matrix (refer to line 5 in Algorithm 1) as , the shape of which is . The output matrix is defined as , and the mask matrix is defined as . Therefore, the scaled masked-softmax function can be expressed as:
Although the computational complexity of the Softmax module is lower than that of the SA module, the exponentiation and division calculations are still quite expensive. Wang et al. proposed a high-speed and low-complexity hardware architecture for the softmax function by making good use of the log-sum-exp trick and designing algorithmic reduction strategies for the exponential function and the logarithmic function. These tricks and strategies are also used in this work to build an efficient architecture for the scaled masked-softmax. The division calculation and numerical underflow can be avoided by using the log-sum-exp trick:
According to Equation (5), the computation of this module can be broken into four different phases, as described in Fig. 6. The transformations of the exponential function and the logarithmic function allow us to build the Softmax module without using any regular multipliers or lookup tables. The detailed architectures of the EXP Unit and the LN Unit are the same as in that prior design.
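The log-sum-exp formulation can be illustrated in floating point as below. This is a minimal sketch of the standard trick, softmax(x) = exp(x - max - ln(sum(exp(x - max)))), and not the multiplier-free EXP and LN units of the proposed module. The function names are illustrative, and the correspondence of the steps to the four phases of Fig. 6 is an assumption.

```python
import numpy as np

def softmax_log_sum_exp(x):
    """Row-wise softmax computed without a division."""
    shifted = x - x.max(axis=-1, keepdims=True)                     # subtract the row maximum
    log_sum = np.log(np.exp(shifted).sum(axis=-1, keepdims=True))   # exponentiate, sum, take ln
    return np.exp(shifted - log_sum)                                # subtract in the log domain, exponentiate

def softmax_reference(x):
    """Conventional softmax with an explicit division, for comparison."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1000.0, 999.0]])  # large inputs stay finite after the shift
p = softmax_log_sum_exp(x)
```

The division is replaced by a subtraction inside the exponent, which is what makes the function cheaper to map to hardware.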
IV-B Layer Normalization
As discussed in Section II, both ResBlocks have to calculate the layer normalization function before the output starts. This means that the LayerNorm module is always on the critical path of the system latency. In this subsection, we propose a method to minimize its latency.
Unlike batch normalization, layer normalization does not impose any restriction on the size of a mini-batch, so it can be used in the pure online regime with the batch size equal to 1. The layer normalization function used in these two ResBlocks is:
where the constant is equal to , which is used to avoid the denominator from being zero. The variable is the mean value of all the elements in the -th row of matrix G ():
The variance of these elements is defined as:
According to the above equations, the straightforward way to calculate the layer normalization is described in Fig. 7. To calculate and , at least cycles are added to the whole system latency.
As shown in Fig. 7, there are two steps in our method of minimizing the delay of this module, and the key is to make the LayerNorm module start running in advance. The first step is to use accumulators to calculate , keeping them connected directly to the input of this module. The second step is to choose another way to calculate the variance:
As a result, very few cycles are required between the system finishing the calculation of all the elements of matrix G and starting the output, which also means that the latency of the entire system is further reduced. The architecture of the LayerNorm module is described in Fig. 8. The “x^(-0.5)” unit is implemented with a lookup table in our experiment.
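The two steps can be sketched in NumPy as follows, assuming that the alternative variance is the usual identity "mean of squares minus square of the mean". The accumulators are updated as each column of G arrives, so the mean and variance are ready as soon as the last column has been produced. The value of `eps`, the names `gamma` and `beta` for the scale and shift parameters, and the toy sizes are assumptions of this example.

```python
import numpy as np

def layer_norm_streaming(G, gamma, beta, eps=1e-8):
    """Layer normalization with row statistics accumulated column by column."""
    s, d = G.shape
    acc_sum = np.zeros(s)                 # running sum of each row
    acc_sq = np.zeros(s)                  # running sum of squares of each row
    for j in range(d):                    # columns arrive one at a time
        col = G[:, j]
        acc_sum += col
        acc_sq += col * col
    mean = acc_sum / d
    var = acc_sq / d - mean ** 2          # E[g^2] - (E[g])^2
    inv_std = (var + eps) ** -0.5         # the inverse square root unit
    return (G - mean[:, None]) * inv_std[:, None] * gamma + beta

def layer_norm_two_pass(G, gamma, beta, eps=1e-8):
    """Straightforward version: mean first, then variance in a second pass."""
    mean = G.mean(axis=1, keepdims=True)
    var = ((G - mean) ** 2).mean(axis=1, keepdims=True)
    return (G - mean) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 512))
gamma, beta = np.ones(512), np.zeros(512)
out_stream = layer_norm_streaming(G, gamma, beta)
out_ref = layer_norm_two_pass(G, gamma, beta)
```

Both versions give the same result up to rounding, but the streaming one needs no extra pass over G after the last column has been computed.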
V Experimental Results
V-A Quantization of Transformer Base Model
Before evaluating our complete design on the FPGA, we quantize a Transformer base model for a machine translation task (https://github.com/Kyubyong/transformer). This model has been trained and tested on the IWSLT 2016 German-English parallel corpus, and its test BLEU score is 23.88 on “tst2014”. As shown in prior work on 8-bit quantization, replacing FP32 with INT8 in the Transformer can greatly reduce the computational complexity with limited accuracy loss.
Since linear approximation is used in the exponential function and the logarithmic function of the Softmax module, the quantization process is divided into two steps. First, all the trainable variable matrices and activation matrices in Fig. 3 are quantized with INT8, while the internal calculations of the Scaled Masked-Softmax operation are still implemented with FP32. After this step the BLEU score drops to 23.48, showing that INT8 quantization is acceptable for this network. Second, the Softmax module is quantized based on the fixed-point model built in the first step, using the previously mentioned log-sum-exp trick and the transformations of the exponential function and the logarithmic function. The final BLEU score of the quantized Transformer base model is 23.57, which is even a little higher than 23.48. These results also show that using the simplified softmax architecture from prior work has little impact on the accuracy of this translation task.
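The first quantization step can be pictured with a generic INT8 scheme. The sketch below uses symmetric per-tensor quantization with 32-bit accumulation, which is an assumption of this example: the text does not state the scaling scheme used in the experiments, and the function names and random test matrices are illustrative only.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: x is approximated by scale * q, q in [-127, 127]."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(qa, sa, qb, sb):
    """Integer GEMM with 32-bit accumulation, rescaled back to floating point."""
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 512)).astype(np.float32)  # stand-in activation matrix
b = rng.standard_normal((512, 64)).astype(np.float32)  # stand-in weight block
qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)
approx = int8_matmul(qa, sa, qb, sb)
exact = a @ b
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
```

On random data like this the relative error of the product stays at the level of a few percent, which is consistent in spirit with the small BLEU change reported above, although the two measurements are not directly comparable.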
V-B Hardware Implementation Results
With the batch size set to 1 and the max sequence length set to 64, the proposed architecture is evaluated on the Xilinx xcvu13p-fhga2104-3-e FPGA using Vivado 2018.2. The simulation results show that it takes 21,344 cycles and 42,099 cycles to finish the calculation of the MHA ResBlock and the FFN ResBlock, respectively. The Vivado implementation results show that our design can run at up to 200 MHz, and the total on-chip power is 16.7 W (13.3 W dynamic power and 3.4 W device static power). The utilization report is presented in Table II.
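The cycle counts can be turned into wall-clock latency with a short calculation. The sketch assumes that the design is clocked at its reported maximum of 200 MHz; the variable names are illustrative.

```python
CLOCK_HZ = 200e6  # assumed operating frequency (the reported maximum)

cycles = {"MHA ResBlock": 21_344, "FFN ResBlock": 42_099}

# latency in microseconds = cycles / frequency * 1e6
latency_us = {name: n / CLOCK_HZ * 1e6 for name, n in cycles.items()}

for name, t in latency_us.items():
    print(f"{name}: {t:.3f} us")
```

Under this assumption the MHA ResBlock takes about 106.72 µs and the FFN ResBlock about 210.50 µs per inference.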
Using the same hyperparameters (batch size equal to 1 and max sequence length equal to 64), we also measure the latency of these two layers in a GPU implementation of the Transformer base model (https://github.com/jadore801120/attention-is-all-you-need-pytorch) on an NVIDIA V100. The comparison results are shown in Table III, showing that our design is able to accelerate Transformer inference on an FPGA platform.
[Table III: FPGA latency, GPU latency, and speed-up for the MHA ResBlock and the FFN ResBlock]
VI Conclusion and Future Work
In this work, we present the first hardware accelerator for the MHA ResBlock and the FFN ResBlock in the Transformer. The FPGA implementation shows promising results in terms of both speed and power, which demonstrates that this design can contribute to operating the Transformer network on mobile devices or embedded systems. In the future, we will build an FPGA or ASIC accelerator for the complete Transformer inference.
-  Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
-  Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi, Kushal Datta, and Vikram Saletore. Efficient 8-bit quantization of transformer neural machine language translation model. arXiv preprint arXiv:1906.00532, 2019.
-  Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
-  Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. Compressing large-scale transformer-based models: A case study on bert. arXiv preprint arXiv:2002.11985, 2020.
-  Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W Lee, et al. A^3: Accelerating attention mechanisms in neural networks with approximation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 328–341. IEEE, 2020.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations, 2019.
-  Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
-  Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010, 2017.
-  Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 2018.
-  Meiqi Wang, Siyuan Lu, Danyang Zhu, Jun Lin, and Zhongfeng Wang. A high-speed and low-complexity architecture for softmax function in deep learning. In 2018 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pages 223–226. IEEE, 2018.
-  Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Liwei Peng, and Luo Si. Structbert: Incorporating language structures into pre-training for deep language understanding. arXiv preprint arXiv:1908.04577, 2019.
-  Bo Yuan. Efficient hardware architecture of softmax layer in deep neural network. In 2016 29th IEEE International System-on-Chip Conference (SOCC), pages 323–326, 2016.