I Introduction
Recently, Transformer [30] has gained popularity and achieved record-breaking results on major natural language processing (NLP) tasks, including question answering, sentiment analysis, and language inference [32, 6, 29, 25]. Although state-of-the-art Transformer models offer great prediction accuracy, they have a large number of parameters. For example, the BERT-large model has 340M parameters [6], and DistilBERT, a compact model, has 67M parameters [26]. Moreover, with the ongoing democratization of machine learning [7], there is an increasing need to execute such giant models on embedded devices [17, 36], e.g., field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). Using these devices as an acceleration platform for Transformer is challenging because they offer limited on-chip memory and often limited off-chip bandwidth, both of which are critical for high performance. This restriction is particularly limiting for FPGAs due to their extremely small on-chip memory, approximately 5MB for a low-end FPGA (e.g., ZCU104) and 35MB for a high-end FPGA (e.g., Alveo U200). Therefore, when Transformer comes to embedded devices, the primary challenge is accommodating giant models on these devices while meeting the requirement of low inference latency.

Methods  Algorithm-Hardware  NAS  Hardware  Ours  

FTRANS[16]  [8]  HAT [33]  A^{3} [9]  
AC & LC  ✓  
Hardware Type  ✓  ✓  
Resource Uti.  ✓  ✓  ✓  
Compression  ✓  ✓  ✓ 
Three research trends have attracted enormous interest in improving the performance of Transformer, as Table I shows. The first trend is hardware acceleration on ASICs, e.g., A^{3} [9], where researchers mainly focus on hardware acceleration. The second trend is algorithm optimization on CPU and GPU, such as neural architecture search (NAS) and model compression algorithms, e.g., block structured pruning [14] and the lottery ticket hypothesis [5, 24, 33]. The third trend is the algorithm-hardware sequential design flow [16, 8], which compresses the model first and then deploys the compressed model to an existing device. This sequential design flow provides no hardware performance feedback to software optimization. In this paper, we propose an algorithm-hardware closed-loop framework, which trades off the sparsity ratio against hardware resources to achieve co-exploration of model compression and hardware acceleration. Our framework simultaneously considers hardware type, resource utilization, model compression, and the accuracy and latency constraints (AC and LC).
Moreover, as technology develops, more and more hardware devices are available to run a Transformer model, such as various types of mobile devices (e.g., Apple Bionic, Qualcomm Snapdragon, HiSilicon Kirin, Samsung Exynos, …), FPGAs (e.g., ZCU102, VC707, Alveo U200, Versal, …), ASICs, and so on. These devices have different computing power and storage capacities, which are critical to the performance of Transformer. In addition, a Transformer model with different constraint requirements (tight or loose) can be deployed onto devices with different computing power. However, in previous work, designers did not choose the best device among multiple candidates. Instead, they simply deployed the model on an existing device, which was not necessarily the best fit and could lead to underutilization of resources. Therefore, with the surge in device types and constraint requirements for models, it is becoming increasingly difficult for designers to select the best device for their application.
To address the deployment challenge of Transformer and the problem of selecting the best device, we propose, as a first attempt, an algorithm-hardware closed-loop framework that provides the best device under different constraints. Our framework trades off the sparsity ratio of the model against hardware resources to achieve co-exploration and accelerate Transformer inference. We use FPGAs to illustrate our design, which can also be applied to other hardware devices, such as mobile devices and ASICs.
The main contributions of this paper are:
(1) An algorithm-hardware closed-loop framework. We provide a co-exploration framework from constraints (LC, AC) to device. Given the constraints LC and AC, a backbone model, and a dataset as input, our framework outputs the best device on which to deploy the model while satisfying both constraints.
(2) A hardware-friendly Hierarchical Pruning (HP) technique. We propose HP, a novel sparsity pattern, which is a two-level pruning technique that takes advantage of two existing pruning techniques, block structured pruning (BP) and vector-wise pruning (VW). HP is hardware-friendly and can achieve a high sparsity ratio.
(3) A sparse matrix storage format optimization. We optimize a sparse weight format for our HP matrices in the FPGA implementation. Our format significantly reduces memory usage and performs better than commonly used formats.
(4) A sparsity-aware accelerator. We design an FPGA-based accelerator for HP and abstract a performance predictor that bridges software and hardware for efficient estimation of clock cycles and resource usage.
II Related Work
Transformer. Transformer has been highly optimized at the software level for CPU and GPU. One research trend is to modify the architecture of Transformer to improve performance on CPU and GPU [33, 4]. These works exploit Neural Architecture Search (NAS) to search for the best model architecture. However, the search cost is usually high, since massive computation and many neural network samples are required to find an optimized network architecture. In contrast, little work has been published on custom hardware acceleration for Transformer-based models, particularly on FPGAs. A^{3} [9] accelerates different parts of the Transformer model, the attention and fully-connected layers, to achieve efficient processing on ASIC. FTRANS [16] is the only currently published FPGA accelerator; it proposes an acceleration framework that enables compression on FPGA. This work first compresses the model and then deploys the compressed model on the FPGA. This sequential design flow provides no hardware performance feedback to software optimization and is therefore not optimal. In this paper, we trade off the sparsity ratio against hardware resources to achieve co-exploration of model compression and hardware acceleration.
Model Compression. [5, 24] applied the lottery ticket hypothesis to model compression on BERT, based on the observation that a subnetwork of a randomly-initialized network can replace the original network with the same performance. However, non-structured pruning is not hardware-friendly. For hardware-friendly weight pruning, [15] proposes a block structured pruning technique for Transformer. However, this technique results in a significant accuracy loss when the pruning ratio increases or the block size becomes larger. [19] proposes pattern pruning to strike a better balance between accuracy and pruning ratio, but this technique cannot be applied directly to hardware due to limited parallelism.
Sparse Matrix Compression Formats. A variety of sparse matrix representation formats have been proposed to compress sparse matrices. Prior works take two major approaches to designing such compression schemes. The first approach is to devise general compression formats, such as Compressed Sparse Row (CSR) [18] and Coordinate (COO) [1]. Both record the row/column indices of each non-zero element, which causes excessive memory usage. The second approach is to leverage a known structure in a given type of sparse matrix. For example, the DIA format [2] is highly efficient for matrices whose non-zero elements are concentrated along the diagonals. The CSB format [27] is devised for the CSB sparsity pattern proposed in the same work. Although these compression schemes are specific to certain types of matrices, they are the most efficient in both computation and storage. In our work, in order to be as efficient as possible in storage, we optimize a sparse matrix compression scheme for our sparsity pattern HP.
III The Algorithm-Hardware Closed-loop Acceleration Framework
To address the deployment challenge of Transformer and the problem of selecting the best device, we propose, as a first attempt, an algorithm-hardware closed-loop framework that provides the best device under different constraints. Our framework trades off the sparsity ratio against hardware resources to achieve co-exploration of model compression and hardware acceleration. Next, we use FPGAs to illustrate our design, which can also be applied to other devices, such as mobile devices and ASICs.
III-A Problem Definition and Overview
In this paper, we aim to develop an algorithm-hardware closed-loop acceleration framework that selects the best device under different LC and AC for Transformer. We define the problem as follows: given a specific dataset, a backbone model, a hardware pool, a latency constraint LC, and an accuracy constraint AC, the objective is to determine (i) a compressed model, including the sparsity of each layer, and (ii) the target hardware device, such that the compressed model can be deployed onto the target device while satisfying both LC and AC.
Figure 1 shows the overview of our framework and Algorithm 1 illustrates the whole process. First, we design a pruning technique and a sparsity-aware accelerator (components ① ②) and abstract a performance predictor to estimate hardware resource requirements (component ③). Then we use an RNN-based RL controller to guide the search process: (i) the controller predicts a sample; (ii) the performance predictor roughly estimates the resource requirements of the sample (component ③); (iii) the target device is selected from the hardware pool to meet the resource requirements (component ④); (iv) the resource allocation is fine-tuned and the latency is optimized under the constraints of the target device (component ⑤); (v) the model is trained to obtain its accuracy (component ⑥). Finally, the controller is updated based on the feedback (reward) from ④ ⑤ ⑥ and then predicts better samples. In the following, we introduce these components one by one.
III-B Network Compression
In order to accommodate Transformer models with enormous numbers of parameters in the on-chip memory of an FPGA, we need a pruning technique that is hardware-friendly and achieves a high sparsity ratio with a small accuracy loss. In this paper, we propose HP, a two-level pruning technique that combines the advantages of two existing pruning techniques, BP [14] and VW [3]. First, to remain hardware-friendly, we adopt BP to prune the model. However, BP is coarse-grained and cannot achieve a high sparsity ratio with a small accuracy loss. To reach a higher sparsity ratio, we then apply VW, a fine-grained pruning technique, on top of BP to prune further. In this way, we remain hardware-friendly and achieve high sparsity.
                   BP   VW   HP (ours)
Fine-grained       -    ✓    ✓
Coarse-grained     ✓    -    ✓
Flexibility        -    -    ✓
Hardware-friendly  ✓    ✓    ✓
High spar. & acc.  -    -    ✓
As Figure 2 shows, HP is a combination of BP and VW. First, we adopt BP as the first-level pruning: we divide the weight matrix into blocks and prune the unimportant columns in each block. We regard this BP model as the backbone model of HP. Its sparsity ratio (the backbone sparsity) determines the starting sparsity of the HP weights and can be adjusted flexibly. Then, starting from the BP backbone model, we adopt VW as the second-level pruning to remove unimportant elements in each unpruned column of the blocks. To keep the result balanced, we remove the same number of elements from each column of the blocks. HP thus combines coarse-grained pruning (BP) and fine-grained pruning (VW) to achieve a higher sparsity ratio with a small accuracy loss. We compare BP, VW, and HP in Table II: HP combines the best of both BP and VW and is more flexible and effective than either. VW keeps all vectors (columns), which is unnecessary because some vectors are important and some are not. HP first prunes the unimportant columns, which increases the sparsity ratio beyond that of VW to some extent. Moreover, the backbone sparsity can be adjusted flexibly to reach different sparsity ranges and accuracy levels.
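The two pruning levels described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `hierarchical_prune`, its parameters, and the importance criteria (column L2 norm for BP, element magnitude for VW) are assumptions chosen for the example.

```python
import numpy as np

def hierarchical_prune(weight, block_rows, col_keep_ratio, nz_per_column):
    """Two-level pruning sketch: BP drops whole columns per block, then VW
    keeps a fixed number of elements in every surviving column."""
    pruned = np.zeros_like(weight)
    n_rows, n_cols = weight.shape
    n_keep_cols = max(1, int(round(n_cols * col_keep_ratio)))
    for start in range(0, n_rows, block_rows):
        block = weight[start:start + block_rows]
        # Level 1 (BP): keep the columns with the largest L2 norm (assumed criterion).
        col_norms = np.linalg.norm(block, axis=0)
        kept_cols = np.sort(np.argsort(col_norms)[-n_keep_cols:])
        for c in kept_cols:
            column = block[:, c]
            # Level 2 (VW): keep the same number of largest-magnitude entries
            # in every kept column, which gives the balanced property.
            k = min(nz_per_column, column.size)
            kept_rows = np.argsort(np.abs(column))[-k:]
            pruned[start + kept_rows, c] = column[kept_rows]
    return pruned

# Example: two blocks of 10 rows, half of the columns kept, 7 non-zeros per column.
rng = np.random.default_rng(0)
w = rng.standard_normal((20, 8))
hp = hierarchical_prune(w, block_rows=10, col_keep_ratio=0.5, nz_per_column=7)
print("sparsity:", 1.0 - np.count_nonzero(hp) / hp.size)
```

In this sketch, `col_keep_ratio` plays the role of the backbone sparsity (a ratio of 0.5 corresponds to a 50% sparse BP backbone), and `nz_per_column` controls how far VW prunes beyond it.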
III-C Sparsity-aware Accelerator Design
In this section, we first introduce the optimized storage format for sparse weight matrices in the FPGA implementation. Then we introduce the accelerator design.
The Storage Format Optimization. In sparse matrices, the number of non-zero elements (NZ) is much smaller than the number of zero elements. In order to avoid unnecessarily 1) storing zero elements and 2) performing computations on them, we need an efficient scheme to compress the sparse matrix. Various sparse matrix storage formats have been proposed, e.g., COO [1], CSR [18], BCSR [23], Tile-Bitmap [38], and MBR [13]. In our work, we use a bitmap format similar to MBR and optimize it for our sparsity pattern HP to reduce memory usage further.
Figure 3 shows our format, WMark. We design two variants according to the sparsity ratio of the backbone model. When the backbone sparsity is not 0%, the weight is pruned by both BP and VW and we use SF1. When the backbone sparsity is 0%, the weight is pruned only by VW and we use SF2. There are three arrays in SF1: (i) a three-dimensional value array records all non-zero elements (NZ). The first dimension indexes the blocks, and the NZ of successive blocks are concatenated in column-major order and stored contiguously in the last two dimensions; (ii) a column index array stores the indices of the unpruned columns in each block; (iii) a bitmap array tracks which elements are NZ in each unpruned column. If a slot contains an NZ, we set its bit to "1", otherwise to "0". In SF2, the column index array is not needed and only two arrays remain. MBR [13] uses four arrays (value, col_Idx, row_Idx, and Bitmap, as listed in Table III). WMark differs from MBR [13] in two ways: 1) the row index array is not needed, because the row indices are easy to calculate thanks to the balanced property of the HP sparsity pattern; 2) the bitmap array in WMark records only the unpruned columns rather than all columns, which saves memory. We take a weight matrix with a 50% sparsity ratio and compare the memory usage of the five existing formats with ours. As Table III shows, our format uses the least memory.
            COO [1]  CSR [18]  BCSR [23]  Tile-Bitmap [38]  MBR [13]  WMark (ours)
value       1250     1250      2500       1250              1250      1250
col_Idx     3125     3125      312.5      312.5             312.5     312.5
row_Idx     3125     7.8       1.6        312.5             1.6       -
Index       -        -         -          351.6             -         -
Bitmap      -        -         -          625               625       312.5
Total (Kb)  7500     4382.8    2814.1     2851.6            2189.1    1875
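To make the three SF1 arrays concrete, the following sketch encodes a small HP-pruned matrix and decodes it again. It is an illustration under stated assumptions rather than the authors' code: the names `encode_sf1`, `decode_sf1`, `values`, `col_idx`, and `bitmap` are hypothetical, and the matrix is assumed to be balanced (every block keeps the same number of columns, at least one, and every kept column holds the same number of non-zeros).

```python
import numpy as np

def encode_sf1(weight, block_rows):
    """Encode a balanced HP-pruned matrix into (values, col_idx, bitmap)."""
    values, col_idx, bitmap = [], [], []
    for start in range(0, weight.shape[0], block_rows):
        block = weight[start:start + block_rows]
        kept = np.flatnonzero(np.any(block != 0, axis=0))  # unpruned columns
        sub = block[:, kept]
        mask = sub != 0
        col_idx.append(kept)
        bitmap.append(mask.T.astype(np.uint8))  # one bitmap row per kept column
        # Non-zeros of the kept columns, concatenated column by column.
        values.append(sub.T[mask.T].reshape(len(kept), -1))
    return np.stack(values), np.stack(col_idx), np.stack(bitmap)

def decode_sf1(values, col_idx, bitmap, n_cols):
    """Rebuild the dense matrix; row indices come from the bitmap alone."""
    n_blocks, n_kept, block_rows = bitmap.shape
    dense = np.zeros((n_blocks * block_rows, n_cols), dtype=values.dtype)
    for b in range(n_blocks):
        for j in range(n_kept):
            rows = np.flatnonzero(bitmap[b, j]) + b * block_rows
            dense[rows, col_idx[b, j]] = values[b, j]
    return dense

# Two blocks of two rows; each block keeps two columns with one non-zero each.
w = np.array([[1., 0., 0., 0.],
              [0., 0., 0., 2.],
              [0., 0., 4., 0.],
              [0., 3., 0., 0.]])
values, col_idx, bitmap = encode_sf1(w, block_rows=2)
restored = decode_sf1(values, col_idx, bitmap, n_cols=4)
```

Because every kept column holds the same number of non-zeros, the value array is a regular three-dimensional array and no separate row index array is required, which is the property the format relies on.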
Accelerator Design. Different from other FPGA accelerator designs [11, 10, 12, 3], we use weight pruning and quantization to fit all weights in the on-chip memory of the FPGA, so no data is moved between on-chip and off-chip memory. To realize low inference latency on a parallel FPGA, there are multiple challenges in designing an architecture that can exploit the benefits of HP. In previous work, [3] and [27] implement accelerators that exploit sparsity, but they are designed for RNN models (matrix-vector multiplication, MV) and cannot be applied to Transformer (matrix/vector-matrix multiplication, MM / VM). As Figure 4 shows, with generalized VM as in [22], there are two concurrent irregular memory access challenges, one for random reads of the input vector and the other for random writes to the result matrix, which can stall parallel execution. To solve these challenges, we change the memory access pattern. To avoid random writes to the result matrix, we multiply multiple rows of the input matrix by one column of the weight matrix, which makes the writes sequential. To handle random reads of the input, we allocate an input matrix row buffer and implement it with registers, which can be accessed randomly.
Figure 5 shows our computation engine. It consists of parallel processing elements (PEs) that concurrently compute dot products of distinct input matrix rows with one column of the weight matrix (one block) to exploit inter-row parallelism, while each PE is designed to exploit intra-row parallelism within a single dot product. Each PE contains an input matrix row buffer that holds the row of the input matrix currently being multiplied; this buffer is implemented with registers, which can be accessed randomly. The computation includes 5 steps: (1) the PE reads non-zero elements from the weight matrix memory and the corresponding elements, selected by the bitmap array, from the input row buffer; (2) the multipliers operate simultaneously to obtain the scalar products; (3) an adder tree sums the scalar products to calculate the dot product; (4) the PE reads the column index from the weight matrix; (5) the dot product result is written back to the result memory based on the column index. PEs are fully pipelined so that one operation can be processed per clock cycle.
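The five steps above can be mimicked in software as follows. This is a functional sketch only, with hypothetical names (`pe_dot`, `sparse_block_matmul`, `values`, `col_idx`, `bitmap`) and a hand-written compressed matrix; it models what each PE computes, not the pipelined hardware.

```python
import numpy as np

def pe_dot(input_row_buffer, col_values, col_bitmap):
    """One PE: gather the inputs selected by the bitmap, multiply, and sum."""
    gathered = input_row_buffer[np.flatnonzero(col_bitmap)]  # random read
    return float(np.dot(gathered, col_values))  # multipliers + adder tree

def sparse_block_matmul(x, values, col_idx, bitmap, n_cols):
    """Multiply dense x by a compressed weight matrix, one weight column at a time."""
    n_blocks, n_kept, block_rows = bitmap.shape
    result = np.zeros((x.shape[0], n_cols))
    for b in range(n_blocks):
        x_block = x[:, b * block_rows:(b + 1) * block_rows]
        for j in range(n_kept):
            c = col_idx[b, j]  # destination column of the result
            # Every input row uses the same weight column; in hardware each
            # row would be handled by its own PE in parallel.
            for r in range(x.shape[0]):
                result[r, c] += pe_dot(x_block[r], values[b, j], bitmap[b, j])
    return result

# Compressed form of a 4x4 matrix: 2 blocks, 2 kept columns each, 1 non-zero per column.
values = np.array([[[1.], [2.]], [[3.], [4.]]])
col_idx = np.array([[0, 3], [1, 2]])
bitmap = np.array([[[1, 0], [0, 1]], [[0, 1], [1, 0]]], dtype=np.uint8)
w_dense = np.array([[1., 0., 0., 0.],
                    [0., 0., 0., 2.],
                    [0., 0., 4., 0.],
                    [0., 3., 0., 0.]])
x = np.arange(8, dtype=float).reshape(2, 4)
result = sparse_block_matmul(x, values, col_idx, bitmap, n_cols=4)
```

Iterating over one weight column for all input rows means each result column is written in order, which mirrors the sequential-write access pattern described in the text.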
III-D Performance Predictor
We develop an FPGA performance predictor to roughly estimate the resource requirements (BRAM usage, DSP usage, and clock cycles) based on the software and hardware parameters predicted by the RL controller.
① We model the usage of Block RAMs (on-chip SRAM units, called BRAMs) using the formula in [39]. According to the on-chip buffer allocation, we calculate the number of BRAMs for each buffer from the quantization bit width of the weights, the number of NZ in the weights, and the configuration of the BRAM. The total BRAM usage is then obtained by adding up all buffers.
② The DSP usage is related to the multiply-accumulate (MAC) operations and the data type. The computation engine in Figure 5 executes multiple MAC operations in parallel. For 16-bit fixed point, each multiplication and add operation requires 1 DSP. For 32-bit floating point, each MAC requires 5 DSPs, which is the sum of 3 DSPs for one multiplication and 2 DSPs for one add operation [35]. The total DSP usage is the sum of the DSPs required by the PEs of all layers.
③ The clock cycles are related to the size of the PEs. After implementing the PEs in Vivado HLS, we try to make the pipeline interval equal to 1, so that the PEs output one result per clock cycle. Therefore, the clock cycles of one layer equal the number of times the PEs are invoked. The number of MAC operations in a sparse matrix multiplication follows from the dimensions of the two matrices and the sparsity ratio, and together with the PE size it gives the clock cycles of the layer. The total clock cycles are the sum over all layers.
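A rough version of the three estimates can be written as below. The exact formulas are not reproduced here, so this sketch rests on assumptions that are labeled in the comments: BRAM usage is taken as stored bits divided by the capacity of one BRAM, the DSP count is taken as DSPs per MAC times the PE size, and the cycle count is taken as the number of non-zero MACs divided by the PE size. All function and parameter names are hypothetical.

```python
import math

# DSPs per MAC as stated in the text: 1 for 16-bit fixed point,
# 5 (3 for the multiplication + 2 for the addition) for 32-bit floating point.
DSP_PER_MAC = {"fixed16": 1, "float32": 5}

def bram_blocks(nz_count, weight_bits, bram_width=18, bram_depth=1024):
    """BRAMs for one buffer. Assumed form: total bits / bits per BRAM, rounded up.
    The 18 x 1024 default is one possible BRAM configuration, used as an example."""
    return math.ceil(nz_count * weight_bits / (bram_width * bram_depth))

def dsp_usage(pe_sizes, dtype="fixed16"):
    """Total DSPs, assuming each layer's PE array performs pe_size MACs in parallel."""
    return sum(DSP_PER_MAC[dtype] * p for p in pe_sizes)

def layer_cycles(m, k, n, sparsity, pe_size):
    """Cycles of an (m x k) by (k x n) sparse product with pipeline interval 1."""
    macs = m * k * n * (1.0 - sparsity)  # only non-zero weights are multiplied
    return math.ceil(macs / pe_size)

def total_cycles(layers):
    """Sum of per-layer cycles; each layer is (m, k, n, sparsity, pe_size)."""
    return sum(layer_cycles(*layer) for layer in layers)

example_dsp = dsp_usage([16, 32], dtype="float32")
example_cycles = layer_cycles(4, 800, 800, 0.5, 64)
example_bram = bram_blocks(nz_count=320_000, weight_bits=4)
```

Estimates of this kind are cheap enough to evaluate for every sample the controller proposes, which is what allows hardware feedback inside the search loop.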
III-E Choose Device
Next, we introduce how to choose the best device from the hardware pool based on the estimated resource requirements. Figure 6 shows the process of selecting the best device, and Table IV shows our hardware pool. The process is as follows: (1) we sort the devices in the hardware pool by the number of BRAMs each device provides; (2) we perform a binary search to find the devices whose BRAM count is larger than the estimated BRAM requirement. There may be more than one such device, so we collect the candidates in a set; (3) for each device in the set, we calculate the latency from the estimated clock cycles and the running frequency of the device, and we also compute the resource utilization on that device; (4) we choose a device whose latency is smaller than LC. When more than one device qualifies, we choose the one with the largest resource utilization, which means selecting the device with the lower price and the higher resource utilization.
Devices     BRAM  DSP   LUT      FF
Alveo U200  4320  6840  1182240  2364480
VC709       2940  3600  433200   866400
VC707       2060  2800  303600   607200
ZCU102      1824  2520  274080   548160
ZCU104      624   1728  230400   460800
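The selection procedure can be sketched as follows, using the BRAM and DSP counts from Table IV. Several details are assumptions made for illustration: the per-device clock frequencies are hypothetical placeholders, the resource utilization is taken as the mean of BRAM and DSP utilization, and devices with too few DSPs are skipped. The function name `choose_device` is likewise hypothetical.

```python
from bisect import bisect_left

# (name, BRAM, DSP) from Table IV.
HARDWARE_POOL = [
    ("ZCU104", 624, 1728), ("ZCU102", 1824, 2520), ("VC707", 2060, 2800),
    ("VC709", 2940, 3600), ("Alveo U200", 4320, 6840),
]

def choose_device(bram_req, dsp_req, cycles, latency_limit_ms, freq_mhz):
    """Return (name, utilization, latency_ms) of the best device, or None."""
    pool = sorted(HARDWARE_POOL, key=lambda d: d[1])           # (1) sort by BRAM
    first = bisect_left([d[1] for d in pool], bram_req)        # (2) binary search
    best = None
    for name, bram, dsp in pool[first:]:
        if dsp < dsp_req:
            continue                                           # assumed extra check
        latency_ms = cycles / (freq_mhz[name] * 1e3)           # (3) cycles / frequency
        if latency_ms >= latency_limit_ms:
            continue                                           # (4) must beat LC
        utilization = (bram_req / bram + dsp_req / dsp) / 2    # assumed definition
        if best is None or utilization > best[1]:
            best = (name, utilization, latency_ms)
    return best

# Hypothetical frequencies: every device is assumed to run at 200 MHz.
freqs = {name: 200.0 for name, _, _ in HARDWARE_POOL}
picked = choose_device(2000, 1500, 4_000_000, 40.0, freqs)
too_big = choose_device(5000, 1500, 4_000_000, 40.0, freqs)
```

With these placeholder inputs, three devices provide enough BRAM and meet the latency limit, and the smallest of them is selected because it has the highest utilization.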
III-F Optimization
The optimization step fine-tunes the resource allocation scheme under the resource constraint of the target device to achieve the fewest clock cycles and to close the gap between the actual and estimated values. In this step, we specifically consider the parallelism of the Dot-Attention layer, which defaults to 1 in the performance predictor step. The target FPGA offers a certain amount of resources, including DSP slices (ALUs) and BRAMs. Our goal is to leverage these resources to exploit the compute capabilities of the FPGA and achieve reasonably high performance. This problem can be expressed as an optimization objective: minimize the total clock cycles subject to the resource constraint of the target device.
Here, the parameters are the PE size of each layer and the parallelism of the Dot-Attention layers. The objective also involves the clock cycles and the computation resources needed by one Dot-Attention layer, the available computation resources of the target device, and the number of heads of the Transformer-based model. Algorithm 2 illustrates our fine-tuned resource allocation scheme and solves the optimization objective. The algorithm takes as input the Transformer model architecture and the target device constraints, and it outputs the parallelism factor and the least latency.
III-G Reinforcement Learning (RL)
In our design, the search space is very large, so we use RL to carry out a guided search. The RL controller is implemented as an RNN [40]. In each episode, the controller first predicts a sample and obtains its reward based on the evaluation results from the environment (components ③ ④ ⑤ ⑥ in Figure 1). Then, we employ the Monte Carlo policy gradient algorithm [34, 37] to update the controller.
The update averages over the batch of episodes and sums over the steps in each episode. An exponential factor adjusts the reward at every step, and the baseline is the exponential moving average of the rewards.
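For reference, a commonly used form of the Monte Carlo policy gradient (REINFORCE) update with a baseline, which is consistent with the quantities described above, is given below. The notation is assumed for this illustration: $m$ is the batch size, $T$ the number of steps per episode, $\gamma$ the exponential factor, $R_k$ the reward of the $k$-th sample, $b$ the baseline, and $\pi_\theta$ the controller policy with parameters $\theta$.

```latex
\nabla J(\theta) = \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T}
  \gamma^{\,T-t} \, \nabla_{\theta} \log \pi_{\theta}\!\left(a_t \mid a_{(t-1):1}\right)
  \left(R_k - b\right)
```

Subtracting the baseline $b$ reduces the variance of the gradient estimate without changing its expectation.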
Our framework specifically takes hardware performance (latency and resource utilization) into consideration rather than just model accuracy. As Figure 7 shows, we integrate the software parameters (# sparsity ratio) and the accelerator design parameters (# parallelism factors) into the action space of the controller to realize a co-exploration of sparsity ratio and hardware resources. We therefore employ a reward function that takes the accuracy, the latency, the resource utilization, and the latency constraint as inputs to calculate the reward.
The reward function distinguishes two cases. First, if both the latency constraint and the accuracy constraint are met, the sample satisfies the constraints and we sum up the rewards for hardware performance and accuracy. Otherwise, the sample cannot satisfy the constraints and we directly return a negative value to the controller, which saves search time. Note that we return different negative rewards for the different violated constraints to guide the search.
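The two cases can be sketched as a small function. The concrete penalty values and the way the positive reward combines accuracy and utilization are not recoverable from the text, so the constants `lat_penalty` and `acc_penalty`, the order of the checks, and the simple sum used here are hypothetical choices for illustration only.

```python
def reward(accuracy, latency_ms, utilization, acc_limit, lat_limit_ms,
           lat_penalty=-1.0, acc_penalty=-0.5):
    """Constraint-aware reward sketch with hypothetical penalty values."""
    if latency_ms >= lat_limit_ms:
        return lat_penalty   # latency constraint violated
    if accuracy < acc_limit:
        return acc_penalty   # accuracy constraint violated
    # Both constraints met: combine accuracy and hardware performance
    # (assumed here to be a plain sum of accuracy and utilization).
    return accuracy + utilization

ok = reward(0.9684, 18.1, 0.5, acc_limit=0.96, lat_limit_ms=20.0)
too_slow = reward(0.97, 25.0, 0.5, acc_limit=0.96, lat_limit_ms=20.0)
inaccurate = reward(0.90, 18.0, 0.5, acc_limit=0.96, lat_limit_ms=20.0)
```

Returning early for infeasible samples matches the motivation in the text: the controller receives feedback without the cost of a full evaluation of a sample that cannot be deployed.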
IV Experiments
IV-A Experimental Settings
Baseline Models and Datasets. We test our method on a Transformer model using the WikiText-2 dataset [20] and on a TinyBERT model using the GLUE benchmark [31]. The Transformer model has 2 encoder layers and 1 decoder layer (the hidden size is 800, the feed-forward size is 200, and the number of heads is 4), and we use the accuracy of word prediction as the evaluation metric. TinyBERT has 4 encoder layers, 1 pooler layer, and 1 classifier layer.
Evaluation Platforms. We run the reinforcement learning framework, including the training of the Transformer model, on a server with 8× NVIDIA Quadro RTX 6000 GPUs (24 GB GPU memory). The experiments are performed with Python 3.6.10, GCC 7.3.0, PyTorch 1.4.0, and CUDA 10.1. The hardware accelerator is implemented with Vivado HLS, a commonly used high-level synthesis tool. This tool allows the accelerator to be implemented in C and exports the RTL as a Vivado IP core. The C code of our accelerator is parallelized by adding HLS-defined pragmas. Pre-synthesis resource reports are used for performance estimation.
IV-B Experimental Results
IV-B1 Pruning Strategy
We set the sparsity ratio of the backbone model to different values for different models. The TinyBERT model is relatively small and sensitive to pruning. Therefore, in order to maintain high accuracy, we set its backbone sparsity to 0%. For the Transformer model, we set the backbone sparsity to 50% based on experiments, which ensures a high sparsity ratio with an acceptable accuracy loss. As Table V shows, the Transformer model pruned by HP reduces the model size from 52M to 6M with a 2.37% accuracy loss, and the TinyBERT model incurs accuracy losses of 0.70% and 2.23% on the MRPC task and the SST-2 task, respectively.
Accuracy. To evaluate the benefit of HP, we compare it with BP [14], VW [3], block-wise pruning (BW) [21], and irregular pruning on the Transformer model, using a fixed block size for BW and BP and a fixed vector size for VW. As Figure 9 shows, HP, VW, and irregular pruning achieve the same model accuracy when the sparsity is below 70%. HP achieves better accuracy than irregular pruning at around 82% sparsity. When the sparsity exceeds 92% (the intersection of the HP and irregular curves), HP performs worse than irregular pruning due to the large sparsity of the backbone model. VW can only reach a limited sparsity of 90% at its vector size, whereas HP can reach 99% sparsity. These results demonstrate that HP is almost as effective as irregular sparsity and outperforms BW, BP, and VW in terms of achievable accuracy or sparsity during pruning.
            Transformer          TinyBERT (MRPC)      TinyBERT (SST-2)
            Base     HP (50%)    Base     HP (0%)     Base     HP (0%)
model size  52M      6M          14.5M    10.9M       14.5M    8.6M
sparsity    0.00%    89.85%      0.00%    25.0%       0.00%    41.00%
accuracy    98.50%   96.13%      86.45%   85.75%      92.60%   90.37%
(The percentage after HP is the sparsity ratio of the backbone model.)
Visualization. We visualize the weight matrices of the Transformer model after HP, VW, and irregular pruning. Figure 8 visualizes the three sparse weight matrices for a submatrix randomly selected from the whole weight matrix. Pink grid cells indicate non-zero parameters, and the intensity of the pink indicates the magnitude of the absolute value. Figure 8(a) shows the two steps of HP. In our HP matrices, there are two blocks (above and below the dashed line), and each vector (column) in a block has 7 NZ. The heat map shows that HP prunes some unimportant columns while retaining most of the important weights, as irregular pruning does. Although irregular sparsity retains some weights in a column that HP removes entirely, these weights are relatively small (as the pink intensity in Figure 8 shows) and their removal has no significant impact on accuracy. Most of the important weights are retained by HP, which preserves accuracy.
IV-B2 Overhead Comparison of Sparse Weight Format
We compare the overhead of CSR [18], Tile-Bitmap [38], MBR [13], and our WMark format, using memory usage as the metric. Figure 10 shows the results: our optimized format WMark needs the least memory of all of them, and it reduces memory usage compared with MBR [13]. The reason is that WMark has the balanced property, so we do not need a row index array to calculate the start index of each row. In addition, our bitmap array only masks the non-zero columns rather than all columns. Therefore, WMark uses less memory than MBR [13].
Models            (LC, AC)      sparsity  accuracy  est. latency  BRAM / Uti  DSP / Uti   LUT / Uti     FF / Uti       Target device
Transformer       (40ms, 92%)   92.00%    94.45%    35.70ms       2492 / 85%  1644 / 46%  303879 / 70%  268065 / 30%   VC709
Transformer       (20ms, 96%)   86.42%    96.84%    18.10ms       3311 / 77%  5040 / 74%  908833 / 77%  1102880 / 47%  Alveo U200
TinyBERT (MRPC)   (180ms, 85%)  0%        86.45%    175ms         1602 / 87%  1027 / 40%  262248 / 95%  131542 / 23%   ZCU102
TinyBERT (MRPC)   (45ms, 85%)   0%        86.45%    42.1ms        2194 / 74%  1928 / 53%  417058 / 96%  254178 / 29%   VC709
TinyBERT (MRPC)   (18ms, 85%)   0%        86.45%    15.8ms        4204 / 97%  4145 / 60%  936545 / 79%  504293 / 21%   Alveo U200
TinyBERT (MRPC)   (50ms, 80%)   27.00%    84.95%    47.33ms       1530 / 83%  991 / 39%   254543 / 92%  158817 / 28%   ZCU102
TinyBERT (SST-2)  (45ms, 90%)   25.00%    90.83%    40ms          1674 / 91%  1056 / 42%  264443 / 96%  189035 / 34%   ZCU102
TinyBERT (SST-2)  (30ms, 90%)   41.00%    90.37%    25.1ms        2504 / 85%  2028 / 56%  316028 / 73%  235177 / 27%   VC709
IV-B3 Validation On FPGA
We use FPGA devices to validate our approach, and Table VI shows the results. Our approach finds different devices under different sets of LC and AC. For Transformer, we use two sets of constraints to choose a device: (40ms, 92%) as loose constraints and (20ms, 96%) as tight constraints. For (40ms, 92%), the latency and accuracy restrictions are loose, so it is possible to reach a higher sparsity ratio and deploy to a mid-end FPGA, VC709. For (20ms, 96%), the latency and accuracy restrictions are relatively tight, so the sparsity ratio is kept relatively small to preserve accuracy, and the model is deployed to a device with strong computing power, Alveo U200, to achieve very low latency.
For the MRPC task of the TinyBERT model, we first set up three sets of constraints that share the same AC but have different LC. The experimental results show that constraints with a smaller LC are deployed on FPGAs with greater computing power, e.g., (180ms, 85%) to ZCU102 and (45ms, 85%) to VC709. Then we set a constraint with a lower AC, (50ms, 80%), for which the searched sparsity ratio is 25% and the target device is ZCU102. Thanks to compression, this constraint can also be mapped to the low-end FPGA (ZCU102), the same device as for (180ms, 85%), while achieving a latency reduction. Therefore, through compression the same device can satisfy different sets of constraints and can be applied to different application scenarios. The SST-2 task of the TinyBERT model shows experimental results similar to those of Transformer and the MRPC task.
IV-B4 Cross-platform Comparison
Research on Transformer models mainly focuses on the software level for CPU and GPU, such as Trans [30], Evolved Transformer [28], and HAT [33], and little work has been published on custom hardware acceleration on FPGAs apart from FTRANS [16]. We compare the efficiency of our design with these works. Since they use different models and datasets, we use floating-point operations per second (FLOPS) as the metric to provide a fair comparison. As Table VII shows, our FPGA implementation achieves speedups over Trans [30], Evolved Transformer [28], and HAT [33] on CPU, as well as over HAT [33] on GPU and FTRANS [16] on FPGA.
IV-B5 Search Space Exploration
We collect the samples explored by RL to form the search space exploration results for Transformer in Figure 11. In this figure, the x-axis and y-axis stand for latency and accuracy, respectively. We show the search result for the constraint (26ms, 96%). The points are mainly concentrated in the vicinity of (26ms, 96%). This is due to the guided search of RL, which drives the sampled points closer to the solution. Two points in Figure 11, A and B, satisfy the constraints, which may indicate that there are two devices to choose from. In this case, we use a third metric, resource utilization, to select the best solution. The device with the largest resource utilization is the more suitable one, and it is also the cheaper one. Therefore, our approach can choose the best device.
IV-B6 Ablation Study
In this section, we investigate the influence of the sparsity of the backbone model in HP. We set the number of rows in a block to 10 in our experiment. The first feature of HP is that it reaches different sparsity ranges when combined with different backbone models: each backbone sparsity value yields its own achievable sparsity range, whereas VW has a limited maximum sparsity that it cannot exceed. For accuracy, as Figure 12 shows, under the same sparsity ratio the accuracy decreases as the backbone sparsity increases. That is, at a given overall sparsity, a backbone model with lower sparsity achieves higher accuracy than backbone models with higher sparsity. This suggests keeping the backbone sparsity small in order to obtain better accuracy.
V Conclusion
In this paper, we propose an algorithm-hardware closed-loop acceleration framework to address the challenge of efficient deployment and the device selection problem. Our framework maps constraints (LC, AC) to a device. To achieve a high sparsity ratio, we propose HP to reduce the model size. To further reduce memory usage, we optimize the sparse matrix storage format based on the HP sparsity pattern. Experiments show that our framework can find different devices for various LC and AC, ranging from low-end devices to high-end devices.
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China (NSFC) 61972154 and Shanghai Science and Technology Commission Project 20511101600. We sincerely thank Prof. Caiwen Ding at UConn and Prof. Weiwen Jiang at George Mason University for the intensive discussions and constructive suggestions.
References
 [1] (1995) Templates for solution of linear systems: building blocks for iterative methods. SIAM. Cited by: §II, §III-C, TABLE III.
 [2] (2009) Pattern-based sparse matrix representation for memory-efficient smvm kernels. In International Conference on Supercomputing, pp. 100. Cited by: §II.
 [3] (2019) Efficient and effective sparse lstm on fpga with bank-balanced sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 63–72. Cited by: §III-B, §III-C, §IV-B1.
 [4] (2018) The best of both worlds: combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849. Cited by: §II.
 [5] (2020) The lottery ticket hypothesis for pre-trained bert networks. arXiv preprint arXiv:2007.12223. Cited by: §I, §II.
 [6] (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §I.
 [7] (2018) A framework for evaluating barriers to the democratization of artificial intelligence. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §I.
 [8] (2020) Accelerating sparse dnn models without hardware-support via tile-wise sparsity. arXiv preprint arXiv:2008.13006. Cited by: TABLE I, §I.
 [9] (2020) A^{3}: accelerating attention mechanisms in neural networks with approximation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 328–341. Cited by: TABLE I, §I, §II.
 [10] (2019) Achieving super-linear speedup across multi-fpga for real-time dnn inference. ACM Transactions on Embedded Computing Systems (TECS) 18 (5s), pp. 1–23. Cited by: §III-C.
 [11] (2019) Accuracy vs. efficiency: achieving both through fpga-implementation aware neural architecture search. In Proceedings of the 56th Annual Design Automation Conference 2019, pp. 1–6. Cited by: §III-C.
 [12] (2019) Xfer: a novel design to achieve super-linear performance on multiple fpgas for real-time ai. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 305–305. Cited by: §III-C.
 [13] (2013) Efficient sparse matrix multiple-vector multiplication using a bitmapped format. In 20th Annual International Conference on High Performance Computing, pp. 286–294. Cited by: §III-C, §III-C, TABLE III, §IV-B2.
 [14] (2020) Efficient transformer-based large scale language representations using hardware-friendly block structured pruning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §I, §III-B, §IV-B1.
 [15] (2020) Efficient transformer-based large scale language representations using hardware-friendly block structured pruning. arXiv preprint arXiv:2009.08065. Cited by: §II.
 [16] (2020) FTRANS: energy-efficient acceleration of transformers using fpga. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 175–180. Cited by: TABLE I, §I, §II, §IV-B4, TABLE VII.
 [17] (2019) Edge ai: on-demand accelerating deep neural network inference via edge computing. IEEE Transactions on Wireless Communications 19 (1), pp. 447–457. Cited by: §I.
 [18] (2015) CSR5: an efficient storage format for cross-platform sparse matrix-vector multiplication. In The 29th ACM International Conference on Supercomputing (ICS ’15), Cited by: §II, §III-C, TABLE III, §IV-B2.
 [19] (2020) PCONV: the missing but desirable sparsity in dnn weight pruning for real-time execution on mobile devices. In AAAI, pp. 5117–5124. Cited by: §II.
 [20] (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843. Cited by: §IV-A.
 [21] (2017) Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782. Cited by: §IV-B1.
 [22] (2018) OuterSPACE: an outer product based sparse matrix multiplication accelerator. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Cited by: §III-C.
 [23] (1999) Improving performance of sparse matrix-vector multiplication. In SC’99: Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, pp. 30–30. Cited by: §III-C, TABLE III.
 [24] (2020-11) When BERT Plays the Lottery, All Tickets Are Winning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 3208–3229. Cited by: §I, §II.
 [25] (2015) Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664. Cited by: §I.
 [26] (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: §I.
 [27] (2020) CSB-rnn: a faster-than-realtime rnn acceleration framework with compressed structured blocks. In Proceedings of the 34th ACM International Conference on Supercomputing, pp. 1–12. Cited by: §II, §III-C.
 [28] (2019) The evolved transformer. In International Conference on Machine Learning, pp. 5877–5886. Cited by: §IV-B4, TABLE VII.
 [29] (2015) End-to-end memory networks. arXiv preprint arXiv:1503.08895. Cited by: §I.
 [30] (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §I, §IV-B4, TABLE VII.
 [31] (2018) Glue: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: §IV-A.
 [32] (2016) Inner attention based recurrent neural networks for answer selection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1288–1297. Cited by: §I.
 [33] (2020) Hat: hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187. Cited by: TABLE I, §I, §II, §IV-B4, TABLE VII.
 [34] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §III-G.
 [35] Introduction to fpga design with vivado high-level synthesis. Note: https://www.xilinx.com/support/documentation/sw_manuals/ug998vivadointrofpgadesignhls.pdf Cited by: §III-D.
 [36] (2018) Scaling for edge inference of deep neural networks. Nature Electronics 1 (4), pp. 216–222. Cited by: §I.
 [37] (2020) Co-exploration of neural architectures and heterogeneous asic accelerator designs targeting multiple tasks. In 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. Cited by: §III-G.
 [38] (2020) Accelerating sparse matrix–matrix multiplication with gpu tensor cores. Computers & Electrical Engineering 88, pp. 106848. Cited by: §III-C, TABLE III, §IV-B2.
 [39] (2019) Performance modeling and directives optimization for high level synthesis on fpga. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. Cited by: §III-D.
 [40] (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §III-G.