Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer

09/18/2020 ∙ by Siyuan Lu, et al. ∙ Nanjing University

Designing hardware accelerators for deep neural networks (DNNs) has been much desired. Nonetheless, most of these existing accelerators are built for either convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Recently, the Transformer model is replacing the RNN in the natural language processing (NLP) area. However, because of intensive matrix computations and complicated data flow being involved, the hardware design for the Transformer model has never been reported. In this paper, we propose the first hardware accelerator for two key components, i.e., the multi-head attention (MHA) ResBlock and the position-wise feed-forward network (FFN) ResBlock, which are the two most complex layers in the Transformer. Firstly, an efficient method is introduced to partition the huge matrices in the Transformer, allowing the two ResBlocks to share most of the hardware resources. Secondly, the computation flow is well designed to ensure the high hardware utilization of the systolic array, which is the biggest module in our design. Thirdly, complicated nonlinear functions are highly optimized to further reduce the hardware complexity and also the latency of the entire system. Our design is coded using hardware description language (HDL) and evaluated on a Xilinx FPGA. Compared with the implementation on GPU with the same setting, the proposed design demonstrates a speed-up of 14.6x in the MHA ResBlock, and 3.4x in the FFN ResBlock, respectively. Therefore, this work lays a good foundation for building efficient hardware accelerators for multiple Transformer networks.




I Introduction

Recurrent neural networks (RNNs), long short-term memory (LSTM)[7], and gated recurrent units (GRUs)[3] used to be the best solutions in the natural language processing (NLP) area. This situation changed when the Transformer model[11] was invented in 2017, outperforming previous RNN models in multiple tasks. By avoiding recurrent calculations and taking full advantage of the attention mechanism, the Transformer and Transformer-based pre-trained language models (such as BERT[4], ALBERT[8], T5[9], ERNIE[10], and StructBERT[14]) have achieved state-of-the-art accuracy in various NLP tasks.

In spite of the great progress in related fields, the high computational complexity and huge memory requirements of these powerful Transformer networks make them hard to deploy on mobile devices or embedded systems. More and more researchers are paying attention to this problem, and one way to solve it is through model compression[5]. Several techniques have been used to compress these networks, including data quantization[2], pruning, knowledge distillation, and Architecture-Invariant Compression (AIC)[8].

Recently, building FPGA or ASIC hardware accelerators for deep neural networks (DNNs) has achieved great success in both academia and industry, which makes us believe that designing efficient hardware architectures for these Transformer networks must be an important topic as well. By implementing them on hardware platforms, the inference systems of many NLP applications, such as machine translation, question answering, and sentiment analysis, are able to achieve higher speed, lower power consumption, or both. However, intense matrix computations, complicated data flow, and complex non-linear functions make it hard to design an efficient hardware architecture for the Transformer. To the best of our knowledge, we are the first to propose a specific hardware accelerator for the Transformer. In the open literature, [6] is the only hardware architecture for accelerating the attention mechanism in various neural networks, and it is not specifically designed for the Transformer.

As mentioned in [11] and [8], most of the trainable parameters and the computations are in the multi-head attention (MHA) ResBlock and the position-wise feed-forward network (FFN) ResBlock, which are discussed in Section II in detail. In this work, we design a reconfigurable hardware architecture based on a systolic array (SA) for the MHA ResBlock and the FFN ResBlock, which are the two most complex layers in the Transformer.

Main contributions of this work can be summarized as follows:

  • We provide an efficient method to partition the huge matrices in the Transformer, which allows the MHA ResBlock and the FFN ResBlock to share most of the hardware resources.

  • We propose the first hardware architecture design that can complete the calculations for both of these ResBlocks. To ensure the high hardware utilization of the SA, which is the biggest module in our design, the computation flow is well designed.

  • The two most complicated nonlinear functions, the scaled masked-softmax and the layer normalization, are highly optimized to become more hardware-friendly. As the "bottleneck" in the proposed architecture, the latency of the layer normalization is reduced as much as possible.

After quantizing the Transformer base model in [11] (as distinguished from the Transformer big model) with 8-bit integers (INT8), we also evaluate our design on the Xilinx xcvu13p-fhga2104-3-e FPGA, with the max sequence length equal to 64 and the batch size equal to 1. The hardware experimental results demonstrate a speed-up of 14.6× in the MHA ResBlock and a speed-up of 3.4× in the FFN ResBlock, compared to a GPU implementation on an NVIDIA V100.

The rest of this paper is organized as follows. Section II gives a brief review of the Transformer networks, and explains the importance of accelerating the MHA ResBlock and the FFN ResBlock. Section III presents the method of matrix partitioning. Section IV describes the proposed hardware architecture. Experimental results are given in Section V. Section VI concludes this paper.

II Background and Motivation

II-A The Model Architecture of the Transformer

Fig. 1: The model architecture of the Transformer.

The model architecture of the Transformer is described in Fig. 1, containing an encoder stack and a decoder stack. Notice that most of the trainable parameters and the computations are in these two stacks; other components besides the stacks, such as the embedding layers and the softmax output layer, are not taken into account in this work. As shown in Fig. 1, all the encoder layers and the decoder layers are composed of two kinds of ResBlocks, the MHA ResBlock and the FFN ResBlock.

Fig. 2: The structure of the MHA ResBlock.

Fig. 2 shows the structure of the MHA ResBlock. An MHA ResBlock has h "Attention Heads", and the input of each Head is the same as the input of the ResBlock, including three tensors: V (values), K (keys), and Q (queries). The Scaled Dot-Product Attention function in the MHA can be expressed as follows:

Attention(Q, K, V) = softmax(Mask(QK^T / √d_k)) V    (1)

The Mask operation is used to mask out all values in the input of the softmax corresponding to illegal connections. The parameter d_k is equal to 64 in both the Transformer base model and the Transformer big model, and the parameter h is equal to 8 in the base model, or 16 in the big model.
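As a software-level reference (not the hardware implementation), the scaled dot-product attention above can be sketched in a few lines of numpy; the function names and the toy sizes are ours:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None, d_k=64):
    # scores: [l_q, l_k]; the Mask step sets illegal positions to a large
    # negative value before the softmax, as in Equation (1).
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores, axis=-1) @ V

l, d_k = 8, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((l, d_k)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (8, 64)
```

In the full MHA ResBlock this function runs once per Head, on the Head's own 64-dimensional projections of Q, K, and V.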

The FFN ResBlock contains a fully connected feed-forward network, consisting of two linear sublayers and a ReLU activation between them:

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2    (2)
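The position-wise FFN applies the same two linear sublayers to every sequence position independently, which a row-wise numpy sketch makes explicit (the toy dimensions follow the Transformer base model; the scaling of the random weights is our assumption):

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    # Position-wise FFN: two linear sublayers with a ReLU in between,
    # applied identically to every sequence position (row of X).
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

l, d_model, d_ff = 8, 512, 2048
rng = np.random.default_rng(1)
X = rng.standard_normal((l, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)
print(ffn(X, W1, b1, W2, b2).shape)  # (8, 512)
```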
II-B Transformer-Based Pre-Trained Models

An important pre-trained model is Bidirectional Encoder Representations from Transformers (BERT). Analyses in [5] also point out that the MHA and the FFN ResBlocks still occupy most of the storage space and account for the highest numbers of FLOPs.

(a) The MHA ResBlock.
(b) The FFN ResBlock.
Fig. 3: Matrix Operations in the MHA and the FFN ResBlocks. Note that all the multiply operations marked in this figure are dealing with cross products.

The General Language Understanding Evaluation (GLUE) benchmark [12] is a collection of diverse natural language understanding tasks. Recently, many Transformer-based pre-trained models have obtained top placements on the GLUE score list. Most of these models, such as T5[9], ERNIE[10], and StructBERT[14], have structures very similar to BERT. These facts all prove the necessity of designing efficient hardware accelerators for the MHA and the FFN ResBlocks, which are two commonly used structures in these models.

III Partitioning Matrices in the FFN and the MHA

Considering the characteristics of the Transformer architecture, we believe that the proposed hardware accelerator should be able to accelerate not only the MHA ResBlock but also the FFN ResBlock. To make sure that the two ResBlocks can reuse the hardware resources, we first analyze them from the perspective of matrix operations, and then give a method to partition the matrices so that all the general matrix-matrix multiplications (GEMMs) can be done with one and the same systolic array (SA), the size of which is limited to 64 × 64.

Assuming that the input of the FFN is called X, the shape of the tensor X is the same as that of Q (one of the input tensors of the MHA), which is [l × d_model], where l denotes the max sequence length. Additionally, Fig. 1 shows that the tensor K is always equal to the tensor V, the shape of which is also [l × d_model]. In normal circumstances, the sequence length of K is equal to that of Q, so the shape of all these four tensors can be expressed as [l × d_model]. Supposing that the batch size is equal to 1, the computations of these two ResBlocks can be considered sets of matrix operations, which are represented in Fig. 3. Obviously, a 64 × 64 SA can support all the matrix multiplications in the Linear sublayers of all the Heads. However, how to complete the other multiplications involving larger matrices, including W^O, W_1, and W_2, is another important issue to be considered.

Model             d_model   d_ff   h
Transformer-base    512     2048    8
Transformer-big    1024     4096   16
BERT-base           768     3072   12
BERT-large         1024     4096   16
TABLE I: Variations on the Transformer and the BERT architectures.

Table I shows that in these Transformer networks, we all have d_ff = 4 × d_model and d_model = 64 × h. On the basis of this pattern, the three large weight matrices W^O, W_1, and W_2 can be partitioned as shown in Fig. 4. Thus, most of the GEMMs can be done with a 64 × 64 SA.

Fig. 4: Partitioning W^O, W_1, and W_2.

The only one left is the operation of Q_i × K_i^T in each Head of the MHA. The ratio of the number of multiplications in this operation to that of the entire MHA ResBlock can be roughly calculated as follows:

ratio ≈ (l² · d_k) / (4 · l · d_model² + 2 · l² · d_model) ≈ l / (4 · d_model² / d_k)    (3)

Since 4 · d_model² / d_k is no smaller than 16,384 and l is usually no bigger than 128, this ratio should be very small, which illustrates that the management of this single operation will not influence the overall hardware utilization much. If l is smaller than 64, this operation can be done with the 64 × 64 SA through zero padding to the matrices Q_i and K_i^T. Otherwise, by partitioning the matrix K_i^T, the SA can still support this operation with little impact on the utilization of the SA.
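The column-block partitioning described above can be checked in software: slicing a large weight matrix into 64-column blocks, multiplying block by block (as a 64-wide SA would), and concatenating the partial products reproduces the full GEMM exactly. This is a numpy sketch with names of our own choosing, not the hardware schedule itself:

```python
import numpy as np

def gemm_by_64col_blocks(X, W):
    # Feed one 64-column slice of W at a time (as the 64-wide SA would),
    # then concatenate the partial products to recover the full X @ W.
    blocks = [X @ W[:, j:j + 64] for j in range(0, W.shape[1], 64)]
    return np.concatenate(blocks, axis=1)

l, d_model, d_ff = 16, 512, 2048
rng = np.random.default_rng(2)
X = rng.standard_normal((l, d_model))
W1 = rng.standard_normal((d_model, d_ff))
assert np.allclose(gemm_by_64col_blocks(X, W1), X @ W1)
```

Since d_model and d_ff are multiples of 64 in all the networks of Table I, every weight matrix splits evenly into such blocks.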

 1 if Calculating MHA ResBlock then
 2     for i = 1 to h do
 3         Temp1 = Q × W_i^Q;
 4         Temp2 = K × W_i^K;
 5         Softmax Input = Temp1 × Temp2^T;
 6         Temp1 = Softmax output, Temp2 = V × W_i^V;
 7         P_i = Temp1 × Temp2;
 8     end for
 9     for j = 1 to d_model/64 do
10         G_j = P × W_j^O + b_j^O + Q_j;
11     end for
12     Output = LayerNorm(G);
13 end if
14 if Calculating FFN ResBlock then
15     for j = 1 to d_ff/64 do
16         H_j = ReLU(X × W_j^1 + b_j^1);
17     end for
18     for j = 1 to d_model/64 do
19         G_j = H × W_j^2 + b_j^2 + X_j;
20     end for
21     Output = LayerNorm(G);
22 end if
23 return Output
Algorithm 1: The Overall Computation Flow

IV Hardware Architecture Design for the Proposed Accelerator

Using the proposed method of partitioning these weight matrices, the complete hardware accelerator is designed. The top-level architecture is illustrated in Fig. 5.

The SA is made up of a 2D array of processing elements (PEs), with 64 rows and 64 columns. It is designed to output the product matrix column by column, so each column has 64 elements. Connected to the SA output, 64 adders are required to add the bias to the product matrix, and another 64 adders are required to add the residual before calculating the layer normalization function. Overall, the SA Module has the highest computational complexity, containing at least 64 × 64 = 4,096 multipliers and as many adders. To increase the hardware utilization, we make the calculations of the Softmax Module run parallel to the computation of V × W_i^V in the SA (line 6 in Algorithm 1). Owing to the carefully designed computation flow of the entire system, the SA Module hardly ever stops running until the LayerNorm Module starts. As long as the Softmax Module can give its output no later than the SA Module finishes calculating V × W_i^V, the latency of the entire system is determined by the SA Module and the LayerNorm Module. The architectures of these two nonlinear modules are introduced in detail as follows.
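As a software-level illustration only (not the HDL design, whose exact dataflow is described above), the sketch below models an output-stationary n × n systolic array with a skewed input schedule; the schedule and dataflow here are our assumptions, chosen to show how a PE grid finishes a GEMM in O(n) cycles with every PE doing one multiply-accumulate per cycle:

```python
import numpy as np

def systolic_matmul(A, B):
    # Cycle-level sketch of an n x n output-stationary systolic array:
    # A streams in from the left and B from the top, each skewed by one
    # cycle per row/column; every PE (i, j) multiply-accumulates into
    # its stationary output element C[i, j].
    n = A.shape[0]
    C = np.zeros((n, n))
    for t in range(3 * n - 2):          # total cycles until the last PE drains
        for i in range(n):
            for j in range(n):
                k = t - i - j           # skewed schedule: operand index at PE (i, j)
                if 0 <= k < n:
                    C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(6)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
assert np.allclose(systolic_matmul(A, B), A @ B)
```

Scaling the toy 4 × 4 grid to 64 × 64 gives the 4,096 multiply-accumulate units counted above.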

Fig. 5: The top-level architecture of our design.
Fig. 6: The architecture of the Softmax module. The ">>" denotes the right shift operation.

IV-A Scaled Masked-Softmax

The Softmax module in the proposed architecture is used to calculate the scaled masked-softmax function. For the convenience of discussion, we name the input matrix (refer to line 5 in Algorithm 1) as A, the shape of which is l × l. The output matrix is defined as B, and the mask matrix is defined as M, whose elements are 0 at legal positions and negative infinity at illegal ones. Therefore, the scaled masked-softmax function can be expressed as:

B_{i,j} = exp(A_{i,j}/√d_k + M_{i,j}) / Σ_{k=1}^{l} exp(A_{i,k}/√d_k + M_{i,k})    (4)
Although the computational complexity of this Softmax Module is lower than that of the SA Module, the exponentiation and division calculations are still quite expensive. In [13], by making good use of the log-sum-exp trick[15] and designing algorithmic reduction strategies for the exponential function and the logarithmic function, a high-speed and low-complexity hardware architecture for the softmax function was proposed. These tricks and strategies are also used in this work to build an efficient architecture for the scaled masked-softmax. The division calculation and numerical underflow can be avoided by using the log-sum-exp trick (denoting A'_{i,j} = A_{i,j}/√d_k + M_{i,j}):

B_{i,j} = exp(A'_{i,j} − ln Σ_{k=1}^{l} exp(A'_{i,k}))    (5)
According to Equation (5), the computation of this module can be broken into four different phases, which are described in Fig. 6. The transformations of the exponential function and the logarithmic function allow us to build the Softmax module without using any regular multipliers or lookup tables. The detailed architectures of the EXP Unit and the LN Unit are the same as in [13].
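A numerical sketch of Equation (5) in numpy (function name and the causal mask are our choices; the hardware uses the shift-and-add reductions of [13] rather than floating-point exp/log) shows that the log-sum-exp form needs no division and still produces a valid softmax:

```python
import numpy as np

def masked_softmax_lse(A, M, d_k=64):
    # Scaled masked-softmax via the log-sum-exp trick of Equation (5):
    # no division, no numerical underflow. M holds 0 at legal positions
    # and a large negative value at illegal ones.
    Ap = A / np.sqrt(d_k) + M
    m = Ap.max(axis=-1, keepdims=True)                # max search phase
    lse = m + np.log(np.exp(Ap - m).sum(axis=-1, keepdims=True))
    return np.exp(Ap - lse)                           # exp replaces the division

l = 8
rng = np.random.default_rng(3)
A = rng.standard_normal((l, l)) * 10
M = np.triu(np.full((l, l), -1e9), k=1)               # causal mask example
B = masked_softmax_lse(A, M)
```

Each output row sums to 1 and every masked position comes out as exactly zero, matching the direct form in Equation (4).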

IV-B Layer Normalization

Fig. 7: The method to minimize the latency of the LayerNorm module.

As discussed in Section II, both of these two ResBlocks have to calculate the layer normalization function before the output starts. This means that the LayerNorm module is always on the critical path of the system latency. In this subsection, we propose a method to minimize its latency.

Unlike batch normalization, layer normalization does not impose any restriction on the size of a mini-batch, so it is able to be used in the pure online regime with the batch size equal to 1 [1]. The layer normalization function used in these two ResBlocks is:

LayerNorm(G)_{i,j} = γ_j · (G_{i,j} − μ_i) / √(σ_i² + ε) + β_j    (6)

where the small constant ε is used to prevent the denominator from being zero. The variable μ_i is the mean value of all the elements in the i-th row of matrix G (1 ≤ i ≤ l):

μ_i = (1/d_model) · Σ_{j=1}^{d_model} G_{i,j}    (7)

The variance of these elements is defined as:

σ_i² = (1/d_model) · Σ_{j=1}^{d_model} (G_{i,j} − μ_i)²    (8)

According to the above equations, the straightforward way to calculate the layer normalization is described in Fig. 7. Calculating μ_i first and then σ_i² in this way adds extra cycles on the order of d_model to the whole system latency.

As is shown in Fig. 7, there are two steps in our method of minimizing the delay of this module, and the key is to make the LayerNorm module start running in advance. The first step is using accumulators to calculate Σ_j G_{i,j} and Σ_j G_{i,j}², keeping them connected directly to the input of this module. The second step is choosing another way to calculate the variance:

σ_i² = (1/d_model) · Σ_{j=1}^{d_model} G_{i,j}² − μ_i²    (9)
At last, only a few cycles are required between the system finishing the calculation of all the elements of matrix G and starting the output, which also means the latency of the entire system is further reduced. The architecture of the LayerNorm module is described in Fig. 8. The "x^(-0.5)" unit is implemented with a lookup table in our experiment.

Fig. 8: The architecture of LayerNorm module.
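The single-pass statistics of Equation (9) can be checked against the two-pass definition in a short numpy sketch (function name and ε value are our assumptions for illustration):

```python
import numpy as np

def layernorm_online(G, gamma, beta, eps=1e-6):
    # Single-pass statistics: accumulate the sum and the sum of squares
    # while elements arrive, then use Var[x] = E[x^2] - E[x]^2 (Eq. 9),
    # so normalization can start as soon as the last element is in.
    d = G.shape[-1]
    s = G.sum(axis=-1, keepdims=True)           # accumulator 1: sum
    sq = (G * G).sum(axis=-1, keepdims=True)    # accumulator 2: sum of squares
    mu = s / d
    var = sq / d - mu * mu
    return gamma * (G - mu) / np.sqrt(var + eps) + beta

l, d_model = 8, 512
rng = np.random.default_rng(4)
G = rng.standard_normal((l, d_model))
Y = layernorm_online(G, np.ones(d_model), np.zeros(d_model))
```

The result matches the two-pass computation of Equations (7) and (8), which is exactly why the hardware can fold both accumulations into the data-arrival phase.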

V Experimental Results

V-A Quantization of the Transformer Base Model

Before evaluating our complete design on FPGA, we quantize a Transformer base model for a machine translation task (https://github.com/Kyubyong/transformer). This model has been trained and tested with the IWSLT 2016 German-English parallel corpus, and the test BLEU score is 23.88 on "tst2014". As shown in [2], replacing FP32 with INT8 in the Transformer can greatly reduce the computational complexity with limited accuracy loss.

Since linear approximation is used in the exponential function and the logarithmic function of the Softmax module, the quantization process is divided into two steps. First, all the trainable variable matrices and activation matrices in Fig. 3 are quantized with INT8, while the internal calculations in the Scaled Masked-Softmax operation are still implemented with FP32. After that, the BLEU score drops to 23.48, proving that quantizing this network with INT8 is acceptable. Second, the Softmax module is quantized based on the fixed-point model built in the first step, using the previously mentioned log-sum-exp trick and the transformations of the exponential and logarithmic functions. The final BLEU score of the quantized Transformer base model is 23.57, even a little higher than 23.48. These results also show that using the simplified softmax architecture designed in [13] has little impact on the accuracy of this translation task.
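For illustration, a common INT8 scheme is symmetric per-tensor quantization; the sketch below is a generic example of this technique and is not taken from our quantization scripts (function names and the per-tensor granularity are assumptions):

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: scale so that the largest
    # magnitude maps to 127, round, and clip to the INT8 range.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(5)
W = rng.standard_normal((512, 64)).astype(np.float32)
q, s = quantize_int8(W)
err = np.abs(dequantize(q, s) - W).max()
print(err <= s / 2 + 1e-6)  # worst-case error is half a quantization step
```

With such a scheme the SA operates on INT8 products while scales are folded into the bias-add stage.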

V-B Hardware Implementation Results

By setting the batch size to 1 and the max sequence length to 64, the proposed architecture is evaluated on the Xilinx xcvu13p-fhga2104-3-e FPGA using Vivado 2018.2. The simulation results show that it takes 21,344 cycles and 42,099 cycles to finish the calculation of the MHA ResBlock and the FFN ResBlock, respectively. The Vivado implementation results show that our design can run at up to 200 MHz, and the total on-chip power is 16.7 W (13.3 W dynamic power and 3.4 W device static power). The utilization report is presented in TABLE II.

Module          LUT        FF         BRAM    DSP
Available       1728000    3456000    2688    12288
Top             471563     217859     498     129
64×64 SA        420867     173110     0       0
Softmax         21190      32623      0       0
LayerNorm       10551      5325       27.5    129
Weight Memory   3379       80         456     0
TABLE II: Utilization Report for the Proposed Hardware Accelerator and its Primary Modules

Using the same hyperparameters (batch size equal to 1 and max sequence length equal to 64), we also measure the latency of these two layers in a GPU implementation of the Transformer base model (https://github.com/jadore801120/attention-is-all-you-need-pytorch) on an NVIDIA V100. The comparison results are shown in TABLE III, proving that our design is able to accelerate the inference of the Transformer on an FPGA platform.

               FPGA Latency   GPU Latency   Speed-Up
MHA ResBlock   106.7 us       1557.8 us     14.6×
FFN ResBlock   210.5 us       713.4 us      3.4×
TABLE III: Comparisons between FPGA and GPU Latency Results
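As a quick cross-check, the FPGA latencies follow directly from the reported cycle counts and the 200 MHz clock (latency = cycles / f_clk):

```python
# Latency = cycle count / clock frequency at the reported 200 MHz clock;
# the simulated cycle counts reproduce the FPGA latencies in TABLE III.
f_clk = 200e6  # Hz
mha_us = 21344 / f_clk * 1e6  # MHA ResBlock latency in microseconds
ffn_us = 42099 / f_clk * 1e6  # FFN ResBlock latency in microseconds
print(mha_us, ffn_us)
```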

VI Conclusion and Future Work

In this work, we present the first hardware accelerator for the MHA ResBlock and the FFN ResBlock in the Transformer. The FPGA implementation shows promising results in terms of both speed and power, demonstrating that this design can help bring Transformer networks to mobile devices and embedded systems. In the future, we will build an FPGA or ASIC accelerator for the complete Transformer inference.