Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

by Panjie Qi, et al.

State-of-the-art Transformer-based models, with their gigantic parameter counts, are difficult to accommodate on resource-constrained embedded devices. Moreover, with the development of technology, more and more embedded devices are available to run a Transformer model. A Transformer model with different constraints (tight or loose) can be deployed onto devices with different computing power. However, in previous work, designers did not choose the best device among multiple devices. Instead, they simply used an existing device to deploy the model, which was not necessarily the best fit and may have led to underutilization of resources. To address the deployment challenge of Transformer and the problem of selecting the best device, we propose an algorithm-hardware closed-loop acceleration framework. Given a dataset, a model, a latency constraint LC and an accuracy constraint AC, our framework can provide a best device satisfying both constraints. In order to generate a compressed model with a high sparsity ratio, we propose a novel pruning technique, hierarchical pruning (HP). We optimize the sparse matrix storage format for HP matrices to further reduce memory usage in the FPGA implementation. We design an accelerator that takes advantage of HP to solve the problem of concurrent random access. Experiments on Transformer and TinyBert models show that our framework can find different devices for various LC and AC, covering low-end to high-end devices. Our HP can achieve a higher sparsity ratio and is more flexible than other sparsity patterns. Our framework achieves 37x, 1.9x and 1.7x speedup compared to CPU, GPU and FPGA, respectively.








I Introduction

Recently, Transformer [30] has gained popularity and achieved record-breaking results on major natural language processing (NLP) tasks, including question answering, sentiment analysis and language inference [32, 6, 29, 25]. Although state-of-the-art Transformer models offer great prediction accuracy, they have a large number of parameters. For example, the BERT model has 340M parameters [6], and DistilBERT, a compact model, has 67M parameters [26]. Moreover, with the ongoing democratization of machine learning [7], there is an increasing need to execute such giant models on embedded devices [17, 36], e.g., field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). Using these devices as an acceleration platform for Transformer is challenging, as they offer limited on-chip memory and often possess limited off-chip bandwidth, both of which are critical for high performance. This restriction is particularly limiting for FPGAs due to their extremely small on-chip memory, approximately 5MB for a low-end FPGA (e.g., ZCU104) and 35MB for a high-end FPGA (e.g., Alveo U200). Therefore, when deploying Transformer on embedded devices, the primary challenge lies in accommodating giant models on these devices while meeting the requirement of low inference latency.

TABLE I: Comparison of acceleration frameworks for Transformer models (algorithm-hardware: FTRANS [16], [8]; NAS: HAT [33]; hardware: A3 [9]; and ours) across hardware type, resource utilization, AC and LC. Our framework distinguishes itself from other works by considering both AC and LC.

Three research trends have attracted enormous interest in improving the performance of Transformer, as Table I shows. The first trend is hardware acceleration on ASIC, e.g., A3 [9], where researchers mainly focus on hardware acceleration. The second trend is algorithm optimization on CPU and GPU, such as neural architecture search (NAS) and model compression algorithms, e.g., block structured pruning [14] and the lottery ticket hypothesis [5, 24, 33]. The third trend is the algorithm-hardware sequential design flow [16, 8], which compresses the model first and then implements the compressed model on an existing device. This sequential design flow has no hardware performance feedback to software optimization. In this paper, we propose an algorithm-hardware closed-loop framework, which can trade off between the sparsity ratio and hardware resources to achieve co-exploration of model compression and hardware acceleration. Our framework simultaneously considers hardware type, resource utilization, model compression, AC and LC.

Moreover, with the development of technology, more and more hardware devices are available to run a Transformer model, such as various types of mobile devices (e.g., Apple Bionic, Qualcomm Snapdragon, HiSilicon Kirin, Samsung Exynos, …), FPGAs (e.g., ZCU102, VC707, Alveo U200, Versal, …), ASICs and so on. These devices have different computing power and storage capacities, which are critical to the performance of Transformer. Moreover, a Transformer model with different constraint requirements (tight or loose) can be deployed onto devices with different computing power. However, in previous work, designers did not choose the best one among multiple devices; instead, they simply used an existing device to deploy the model, which was not necessarily the best fit and may have led to underutilization of resources. Therefore, with the surge of device types and constraint requirements for models, it is becoming increasingly difficult for designers to select the best device for their application.

To address the deployment challenge of Transformer and the problem of selecting the best device, as a first attempt, we propose an algorithm-hardware closed-loop framework, which can provide the best device under different constraints. Our framework makes a tradeoff between the sparsity ratio of the model and the hardware resources to achieve co-exploration for accelerating Transformer inference. We use FPGA to illustrate our design; it can also be applied to other hardware devices, such as mobile devices and ASICs.

The main contributions of this paper are: (1) An algorithm-hardware closed-loop framework. We provide a co-exploration framework from constraints (LC, AC) to device: users input the constraints LC and AC, a backbone model and a dataset, and our framework outputs the best device on which to deploy the model while satisfying both constraints. (2) A hardware-friendly hierarchical pruning (HP) technique. We propose HP, a novel sparsity pattern, which is a two-level pruning technique that takes advantage of two existing pruning techniques, block structured pruning (BP) and vector-wise pruning (VW). HP is hardware-friendly and can achieve a high sparsity ratio. (3) A sparse matrix storage format optimization. We optimize a sparse weight format for our HP matrices in the FPGA implementation. Our format significantly reduces memory usage and performs better than commonly used formats. (4) A sparsity-aware accelerator. We design an FPGA-based accelerator for HP and abstract a performance predictor that builds a bridge between software and hardware for efficient clock cycle and resource usage estimation.

Fig. 1: Overview of the proposed framework

II Related Work

Transformer. Transformer has been highly optimized at the software level for CPU and GPU. One research trend is to modify the architecture of Transformer to improve its performance on CPU and GPU [33, 4]. These works exploit neural architecture search (NAS) to search for the best model architecture. However, the cost of the search process is usually high, since massive computation and many neural network samples are required to find an optimized network architecture. Meanwhile, little work has been published on custom hardware acceleration for Transformer-based models, particularly on FPGAs.

A3 [9] has been proposed to accelerate different parts of the Transformer model, the attention and fully-connected layers, to achieve efficient processing on ASIC. FTRANS [16] is the only currently published FPGA accelerator; it proposes an acceleration framework that enables compression on FPGA. This work first compresses the model and then deploys the compressed model on the FPGA. This sequential design flow has no hardware performance feedback to software optimization and is not optimal. In this paper, we trade off between the sparsity ratio and hardware resources to achieve co-exploration of model compression and hardware acceleration.

Model Compression. [5, 24] applied the lottery ticket hypothesis to model compression on BERT, based on the observation that a subnetwork of a randomly-initialized network can replace the original network with the same performance. However, such non-structured pruning is not hardware-friendly. For hardware-friendly weight pruning, [15] proposes a block structured pruning technique for Transformer; however, this technique results in a significant accuracy loss when the pruning ratio increases or the block size is large. [19] proposes pattern pruning to strike a better balance between accuracy and pruning ratio, but this technique cannot be applied directly to hardware due to parallelism limits.

Sparse Matrix Compression Formats. A variety of sparse matrix representation formats have been proposed to compress sparse matrices. Prior works take two major approaches to designing such compression schemes. The first approach is to devise general compression formats, such as Compressed Sparse Row (CSR) [18] and Coordinate (COO) [1]. Both record the row/column indices of each non-zero element, which causes excessive memory usage. The second approach is to leverage a certain known structure in a given type of sparse matrix. For example, the DIA format [2] is highly efficient for matrices whose non-zero elements are concentrated along the diagonals. The CSB format [27] is devised for the proposed CSB sparsity pattern. Though these compression schemes are specific to certain types of matrices, they are the most efficient in both computation and storage. In our work, in order to be most efficient in storage, we optimize a sparse matrix compression scheme for our sparsity pattern HP.

III The Algorithm-Hardware Closed-Loop Acceleration Framework

To address the deployment challenge of Transformer and the problem of selecting the best device, as a first attempt, we propose an algorithm-hardware closed-loop framework, which can provide the best device under different constraints. Our framework makes a tradeoff between the sparsity ratio and hardware resources to achieve co-exploration of model compression and hardware acceleration. Next, we use FPGA to illustrate our design; it can also be applied to other devices, such as mobile devices and ASICs.

III-A Problem Definition and Overview

In this paper, we aim to develop an algorithm-hardware closed-loop acceleration framework that selects the best device under different LC and AC for Transformer. We define the problem as follows: given a specific dataset, a backbone model, a hardware pool, a latency constraint LC and an accuracy constraint AC, the objective is to determine: (i) a compressed model, including the sparsity of each layer; (ii) the target hardware device; such that the compressed model can be deployed onto the target device while satisfying both constraints LC and AC.

Figure 1 shows the overview of our framework and Algorithm 1 illustrates the whole process. First, we design a pruning technique, conduct the sparsity-aware accelerator design (components 1⃝ 2⃝) and abstract a performance predictor to estimate hardware resource requirements (component 3⃝). Then we use the RNN-based RL controller to guide the search process: (i) the controller predicts a sample; (ii) the performance predictor roughly estimates the resource requirements of the sample (component 3⃝); (iii) the target device is selected from the hardware pool to meet the resource requirements (component 4⃝); (iv) the resource allocation is exactly fine-tuned and the latency optimized under the target device constraint (component 5⃝); (v) the model is trained to obtain its accuracy (component 6⃝). At last, the controller is updated based on the feedback (reward) from 4⃝ 5⃝ 6⃝ and then predicts better samples. In the following text, we introduce these components one by one.

Input: the backbone model; a specific data set; the latency constraint LC; the accuracy constraint AC; a hardware pool
Output: a compressed model; the target hardware device
1:  for each episode do
2:     the RL controller predicts a sample
3:     the performance predictor roughly predicts the clock cycles, the number of block RAMs (BRAMs) and DSPs of the sample
4:     choose a device from the hardware pool based on the predicted cycles, BRAMs and DSPs
5:     estimate the maximum latency
6:     if a proper device is found and the latency satisfies LC then
7:        calculate the resource utilization and choose the best-fit device
8:        fine-tune the resource allocation to get the minimal latency
9:        accuracy = Prune_Train(model, data set)
10:    else
11:       assign negative values to the reward
12:    end if
13:    Reward = accuracy + norm(utilization) + norm(latency)
14:    use the Monte Carlo algorithm to update the controller
15: end for
Algorithm 1 Acceleration Framework.

III-B Network Compression

In order to accommodate Transformer models with enormous parameters in the on-chip memory of an FPGA, a pruning technique is necessary that is hardware-friendly and can achieve a high sparsity ratio with a small accuracy loss. In this paper, we propose HP, a two-level pruning technique that combines the advantages of two existing pruning techniques, BP [14] and VW [3]. First, to stay hardware-friendly, we adopt BP to prune the model. However, BP is coarse-grained and cannot achieve a high sparsity ratio with a small accuracy loss. To push the sparsity higher, we then apply VW, a fine-grained pruning technique, on top of BP to prune further. In this way, we remain hardware-friendly and achieve high sparsity.

TABLE II: Pruning technique comparison among BP, VW and our HP in terms of hardware friendliness and achievable sparsity and accuracy. Our HP combines coarse-grained and fine-grained pruning and can achieve a higher sparsity ratio than BP and VW.
Fig. 2: The Proposed Pruning Technique, Hierarchical Pruning (HP).

As Figure 2 shows, our HP is a combination of BP and VW. First, we adopt BP as the first-level pruning: we divide the weight matrix into blocks and prune unimportant columns in each block. We regard this BP model as the backbone model of HP and call its sparsity ratio the backbone sparsity. The backbone sparsity determines the starting sparsity of the HP weight and can be adjusted flexibly. Then, based on the BP backbone model, we adopt VW as the second level to remove unimportant elements in each unpruned column of the blocks. To keep the workload balanced, we remove the same number of elements in each such column. By combining coarse-grained (BP) and fine-grained (VW) pruning, HP achieves a higher sparsity ratio while ensuring a small accuracy loss. We compare BP, VW and HP in Table II: our HP combines the best of both and is more flexible and effective. VW keeps all vectors (columns), which is unnecessary because some vectors are important and some are not. With HP, we can first prune some unimportant columns, which increases the sparsity ratio over VW to some extent, and we can flexibly adjust the backbone sparsity to reach different sparsity ranges and accuracies.
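The two-level procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name, the L2-norm column-importance score and the parameter names (number of blocks, column sparsity, elements kept per column) are our assumptions.

```python
import numpy as np

def hierarchical_prune(weight, num_blocks, col_sparsity, elem_per_col):
    """Sketch of Hierarchical Pruning (HP): BP first, then VW."""
    rows, cols = weight.shape
    mask = np.zeros_like(weight, dtype=bool)
    block_rows = rows // num_blocks
    for b in range(num_blocks):
        block = weight[b * block_rows:(b + 1) * block_rows, :]
        # Level 1 (BP): keep the most important columns by L2 norm.
        keep = int(round(cols * (1 - col_sparsity)))
        col_scores = np.linalg.norm(block, axis=0)
        kept_cols = np.argsort(col_scores)[-keep:]
        # Level 2 (VW): within each surviving column, keep the same
        # number of largest-magnitude elements (balanced per column).
        for c in kept_cols:
            top = np.argsort(np.abs(block[:, c]))[-elem_per_col:]
            mask[b * block_rows + top, c] = True
    return weight * mask, mask
```

The balanced per-column count in level 2 is what later keeps the hardware indexing simple.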

III-C Sparsity-aware Accelerator Design

In this section, we first introduce the optimized sparse weight matrix storage format used in the FPGA implementation, and then introduce the accelerator design.

The Storage Format Optimization. In sparse matrices, the number of non-zero elements (NZ) is much smaller than the number of zero elements. In order to avoid 1) unnecessarily storing zero elements and 2) performing computations on them, we need an efficient scheme to compress the sparse matrix. Various sparse matrix storage formats have been proposed, e.g., COO [1], CSR [18], BCSR [23], Tile-Bitmap [38] and MBR [13]. In our work, we use a bitmap format similar to MBR and optimize it based on our sparsity pattern HP to further reduce memory usage.

Fig. 3: The Optimized Sparse Weight Matrix Storage Format: WMark. (a) SF1, used when the backbone sparsity ratio is greater than 0%. (b) SF2, used when the backbone sparsity ratio is 0%. The backbone sparsity ratio is the sparsity ratio of the backbone (BP) model.

Figure 3 shows our format, WMark. We design two formats according to the sparsity ratio of the backbone model. When the backbone sparsity is not equal to 0%, the weight is pruned by both BP and VW and we use SF1; when it is equal to 0%, the weight is only pruned by VW and we use SF2. There are three arrays in SF1: (i) the three-dimensional value array records all non-zero elements (NZ). The first dimension indexes the blocks, and the NZ of successive blocks are concatenated (in column-major order) and stored contiguously in the last two dimensions; (ii) the col_Idx array stores the indices of the unpruned columns in each block; (iii) in order to track which elements are NZ in each unpruned column, we use a Bitmap array: if a slot contains an NZ, we set its bit to "1", otherwise to "0". In SF2, the col_Idx array is not needed and there are only two arrays. MBR [13] uses four arrays: value, col_Idx, row_Idx and Bitmap. The differences between our WMark and the MBR [13] format are: 1) the row_Idx array is not needed, because the row indices are easy to calculate thanks to the balanced property of the HP sparsity pattern; 2) the Bitmap array in WMark only records the unpruned columns, not all columns, which saves memory. We take a weight matrix with 50% sparsity ratio and compare the memory usage of the five formats with ours. As Table III shows, our format performs best.
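The SF1 layout can be sketched as follows for a balanced HP matrix. The array names mirror Table III; the function name and the list-of-lists layout (rather than packed hardware memories) are our assumptions.

```python
import numpy as np

def encode_wmark_sf1(weight, num_blocks):
    """Encode an HP-pruned matrix into a WMark-like SF1 layout:
    `value`   - non-zeros, column-major within each block,
    `col_idx` - indices of unpruned columns per block,
    `bitmap`  - one bit per slot of each unpruned column."""
    rows, cols = weight.shape
    block_rows = rows // num_blocks
    value, col_idx, bitmap = [], [], []
    for b in range(num_blocks):
        block = weight[b * block_rows:(b + 1) * block_rows, :]
        blk_vals, blk_cols, blk_bits = [], [], []
        for c in range(cols):
            column = block[:, c]
            if not column.any():      # column pruned by BP: skip entirely
                continue
            blk_cols.append(c)
            blk_bits.append((column != 0).astype(np.uint8))
            blk_vals.extend(column[column != 0])
        value.append(blk_vals)
        col_idx.append(blk_cols)
        bitmap.append(blk_bits)
    return value, col_idx, bitmap
```

Note that no row indices are stored: because HP keeps the same number of NZ per unpruned column, each value's row can be recovered from the bitmap alone, which is exactly why WMark can drop MBR's row_Idx array.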

COO [1] CSR[18] BCSR [23] Tile-Bitmap [38] MBR [13] WMark(ours)
value 1250 1250 2500 1250 1250 1250
col_Idx 3125 3125 312.5 312.5 312.5 312.5
row_Idx 3125 7.8 1.6 312.5 1.6 -
Index - - - 351.6 - -
Bitmap - - - 625 625 312.5
Total (Kb) 7500 4382.8 2814.1 2851.6 2189.1 1875
TABLE III: Memory Usage Comparison among Sparse matrix formats.

Accelerator Design. Different from other FPGA accelerator designs [11, 10, 12, 3], we fit all weights in the on-chip memory of the FPGA and avoid moving data between on-chip and off-chip memory, thanks to weight pruning and quantization. To realize low inference latency on a parallel FPGA, there are multiple challenges in designing an architecture that can exploit the benefits of HP. In previous work, [3] and [27] implement accelerators with sparsity, but they are designed for RNN models (matrix-vector multiplication, MV) and cannot be applied to Transformer (matrix-matrix / vector-matrix multiplication, MM/VM). As Figure 4 shows, with generalized VM as in [22], there are two concurrent irregular memory access challenges, one for random reads of the input vector and the other for random writes to the result matrix, which can stall parallel execution. To solve these challenges, we change the memory access pattern. To avoid random writes to the result matrix, we multiply multiple rows of the input matrix by one column of the weight matrix, which achieves sequential writing. To solve the challenge of random reads of the input, we assign an input matrix row buffer and implement it with registers, which can be randomly accessed.

Figure 5 shows our computation engine. It consists of parallel processing elements (PEs) that compute dot products of distinct input matrix rows and one column of the weight matrix (one block) concurrently to exploit inter-row parallelism, while each PE is designed to exploit intra-row parallelism within a single dot product operation. Each PE contains an input matrix row buffer that holds one row of the input matrix being multiplied; this buffer is implemented with registers, which can be randomly accessed. The computation includes 5 steps: (1) the PE reads elements from the weight matrix memory and the corresponding elements, selected via the Bitmap array, from the input row buffer; (2) the multipliers operate simultaneously to obtain the scalar products; (3) an adder tree sums the scalar products to compute the dot product; (4) the PE reads the next column from the weight matrix; (5) the dot product result is written back to the result memory according to the col_Idx array. PEs are fully pipelined so that one operation can be processed per clock cycle.
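A software model of this access pattern makes the scheme concrete: one unpruned weight column at a time is multiplied against several input rows, so writes to the result land sequentially in one output column, while the bitmap drives the random reads that the register row buffer absorbs in hardware. This is a functional sketch under our assumed SF1-style layout (value / col_idx / bitmap lists per block), not cycle-accurate hardware behavior.

```python
import numpy as np

def sparse_mm(inputs, value, col_idx, bitmap, num_blocks, out_cols):
    """Model of the MM scheme in Fig. 4(b): sequential writes per
    output column, bitmap-gathered reads from the input rows."""
    n_rows, k = inputs.shape
    block_rows = k // num_blocks
    out = np.zeros((n_rows, out_cols))
    for b in range(num_blocks):
        base = b * block_rows
        vals = np.asarray(value[b], dtype=float)
        ptr = 0
        for c, bits in zip(col_idx[b], bitmap[b]):
            idx = np.nonzero(bits)[0]          # random-read positions
            w = vals[ptr:ptr + len(idx)]
            ptr += len(idx)
            # every "PE" handles one input row; all rows accumulate
            # into the same output column -> sequential write pattern
            out[:, c] += inputs[:, base + idx] @ w
    return out
```

The result matches a dense matmul against the decompressed weight, which is a convenient correctness check for the format.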

Fig. 4: The sparse MM parallelism scheme. (a) generalized VM and parallelism. (b) our MM design with HP. We exploit the multi-row of input matrix to avoid random write to result matrix.
Fig. 5: Computation Engine.

III-D Performance Predictor

We develop an FPGA performance predictor to roughly analyze the resource requirements (clock cycles, BRAMs and DSPs) based on the software and hardware parameters predicted by the RL controller.

1⃝ We model the Block RAM (on-chip SRAM units, called BRAMs) usage using the formula in [39]. According to the on-chip buffer allocation, the BRAMs for the i-th buffer can be calculated as B_i = ceil(q / W_B) × ceil(nz_i / D_B), where q represents the quantization bits of the weights, nz_i represents the number of NZ of the weights, and W_B and D_B represent the width and depth configuration of one BRAM. We then obtain the total BRAM count by adding up all buffers: B = Σ_i B_i.

2⃝ The DSP usage is related to the multiply-accumulate (MAC) operations and the data type. According to the computation engine in Figure 5, it can execute P MAC operations in parallel, where P is the PE size. For 16-bit fixed point, this requires P DSPs, since each multiply and add operation requires 1 DSP. For 32-bit floating point, it requires 5P DSPs, where 5 is the sum of 3 DSPs for one multiplication and 2 DSPs for one addition [35]. Suppose the total number of layers is L and the PE size of the l-th layer is P_l; the total DSP count is then Σ_{l=1}^{L} P_l for fixed point, or Σ_{l=1}^{L} 5·P_l for floating point.

3⃝ The clock cycles are related to the size of the PEs. After implementing the PEs in Vivado HLS, we try to make the pipeline interval equal 1, meaning that the PEs output one result per clock cycle. Therefore, the clock cycles of one layer equal the number of times the PEs are invoked. The sparse multiplication of an M×K input matrix with a K×N weight matrix of sparsity ratio s requires M·K·N·(1−s) MACs. With PE size P_l, the clock cycles of the l-th layer are C_l = M·K·N·(1−s_l) / P_l. Therefore, for L layers in total, the total clock cycle count is C = Σ_{l=1}^{L} C_l.
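The three estimates above can be combined into a small predictor. This is a hedged reconstruction of Sec. III-D: the BRAM geometry (36-bit wide, 512-entry deep units) and the per-layer dictionary layout are our assumptions, not values from the paper.

```python
from math import ceil

def predict(layers, bits=16, bram_width=36, bram_depth=512):
    """Rough resource predictor. Each layer is a dict with matmul
    dims M, K, N, sparsity s, PE size P and on-chip non-zeros nnz."""
    brams = dsps = cycles = 0
    for l in layers:
        # 1) BRAMs: pack nnz weights of `bits` bits into W x D units
        brams += ceil(bits / bram_width) * ceil(l["nnz"] / bram_depth)
        # 2) DSPs: 1 per MAC for 16-bit fixed, 5 per MAC for fp32
        dsps += l["P"] * (1 if bits == 16 else 5)
        # 3) Cycles: MACs surviving pruning, P finished per clock
        cycles += ceil(l["M"] * l["K"] * l["N"] * (1 - l["s"]) / l["P"])
    return brams, dsps, cycles
```

These coarse numbers are only used to shortlist devices; the exact allocation is refined later in the optimization step.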

III-E Choose Device

Next, we introduce how to choose the best device from the hardware pool based on the predicted clock cycles, BRAMs and DSPs. Figure 6 shows the process of selecting the best device and Table IV shows our hardware pool. The process is as follows: (1) first, we sort the devices in the hardware pool according to the number of BRAMs each device provides; (2) we perform a binary search to find the devices whose BRAMs are larger than required. There might be more than one such device, so we use a set to denote the alternative devices; (3) for each device in the set we calculate the latency as latency = cycles / frequency, where the frequency is the running frequency of the device, and we also compute the resource utilization of each device; (4) we choose the devices whose latency is smaller than LC. Specifically, when there are two or more devices to choose from, we choose the one with the largest resource utilization, meaning that we select the device with the lower price and higher resource utilization.

Device     BRAM  DSP   LUT      FF
Alveo U200 4320  6840  1182240  2364480
VC709      2940  3600  433200   866400
VC707      2060  2800  303600   607200
ZCU102     1824  2520  274080   548160
ZCU104     624   1728  230400   460800
TABLE IV: Hardware Pool
Fig. 6: Choose target device
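The selection flow can be sketched as below. The pool tuples (name, BRAMs, DSPs, frequency in MHz), the linear scan standing in for the sorted binary search, and the use of BRAM utilization as the tie-breaking "price" proxy are our simplifying assumptions.

```python
def choose_device(pool, brams_needed, dsps_needed, cycles, lc_ms):
    """Pick the device that fits the predicted resources, meets LC,
    and has the highest utilization (i.e. the cheapest that works)."""
    feasible = []
    for name, brams, dsps, freq_mhz in pool:
        if brams < brams_needed or dsps < dsps_needed:
            continue                              # does not fit
        latency_ms = cycles / (freq_mhz * 1e3)    # cycles/(MHz*1e3) = ms
        if latency_ms <= lc_ms:
            feasible.append((brams_needed / brams, name, latency_ms))
    if not feasible:
        return None                               # no device satisfies LC
    util, name, lat = max(feasible)               # highest utilization wins
    return name, lat, util
```

Returning None triggers the negative-reward branch of Algorithm 1, so infeasible samples are penalized without training.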

III-F Optimization

The optimization step exactly fine-tunes the resource allocation scheme under the resource constraint of the target device to achieve the fewest clock cycles and close the gap between the actual and estimated values. In this step, we specifically consider the parallelism of the Dot-Attention layer, which defaults to 1 in the performance predictor step. The target FPGA offers a certain number of resources, including DSP slices (ALUs) and BRAMs, and our goal is to leverage these resources to exploit the compute capabilities of the FPGA and achieve reasonably high performance. This problem can be expressed as the following optimization objective:

minimize C = Σ_l C_l(P_l) + (h / F) · C_att,  subject to Σ_l P_l + F · R_att ≤ R_avail

Here, the parameters P_l are the PE sizes of the l-th layer and F is the parallelism of the Dot-Attention layers. C_att and R_att are the clock cycles and computation resources needed by one Dot-Attention layer. R_avail is the available computation resource of the target device, and h is the number of heads of the Transformer-based model. Algorithm 2 illustrates our fine-tuned resource allocation scheme and solves this optimization objective. The algorithm takes as input the Transformer model architecture and the target device constraints, and outputs the parallelism factor F and the least latency.

Input: the target device constraints; the Transformer model architecture; the sparsity ratios of all layers
Output: the optimized cycles; the parallelism factor F
1:  initialize the execution cycles
2:  set the available computation resource: total DSPs
3:  compute the computation complexity of each layer and the total computation complexity of all layers
4:  for each candidate parallelism factor f of the Dot-Attention layer do
5:     for each layer do
6:        allocate PEs to the layer in proportion to its share of the total computation complexity
7:        adjust the layer's parallelism factor based on the remaining resources
8:     end for
9:     calculate the cycles under this allocation
10:    if cycles < the best cycles so far then
11:       update the best cycles
12:       record F = f
13:    end if
14: end for
Algorithm 2 Fine-tuned Resource Allocation Scheme
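A compact Python rendering of this scheme is given below. The proportional DSP split, the 64-DSP cost per attention unit and the cycle model (heads processed in ceil(h/f) waves) are illustrative assumptions standing in for the paper's exact cost functions.

```python
from math import ceil

def allocate(layer_macs, att_macs, heads, total_dsps, att_dsps_per_unit=64):
    """Sweep the Dot-Attention parallelism f, split the remaining DSPs
    across layers in proportion to their MAC counts, and keep the
    allocation with the fewest total cycles."""
    best = (float("inf"), None)
    total = sum(layer_macs)
    for f in range(1, heads + 1):
        budget = total_dsps - f * att_dsps_per_unit   # DSPs left for layers
        if budget < len(layer_macs):
            break                                     # cannot give each layer a PE
        pes = [max(1, budget * m // total) for m in layer_macs]
        cycles = sum(ceil(m / p) for m, p in zip(layer_macs, pes))
        # f attention units process the heads in ceil(heads/f) waves
        cycles += ceil(heads / f) * ceil(att_macs / att_dsps_per_unit)
        if cycles < best[0]:
            best = (cycles, f)
    return best
```

The sweep exposes the tradeoff the objective encodes: larger f shortens the attention term but starves the other layers of DSPs.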

III-G Reinforcement Learning (RL)

In our design, the search space is very large, so we exploit RL to carry out a guided search. The RL controller is implemented based on an RNN [40]. In each episode, the controller first predicts a sample and gets its reward based on the evaluation results from the environment (components 3⃝ 4⃝ 5⃝ 6⃝ in Figure 1). Then, we employ the Monte Carlo policy gradient algorithm [34, 37] to update the controller:

∇_θ J(θ) = (1/m) Σ_{k=1}^{m} Σ_{t=1}^{T} γ^{T−t} ∇_θ log π_θ(a_t | a_{1:(t−1)}) (R_k − b)

where m is the batch size and T is the number of steps in each episode. The exponential factor γ is used to adjust the reward at every step, R_k is the reward of the k-th sample, and the baseline b is the exponential moving average of rewards.

Our framework specifically takes hardware performance (latency and resource utilization) into consideration rather than just model accuracy. As Figure 7 shows, we integrate the software parameters (#sparsity ratio) and accelerator design parameters (#parallelism factors) into the action space of the controller to realize a co-exploration of sparsity ratio and hardware resources. Therefore, we employ a reward function that takes the accuracy, the latency, the resource utilization, and the constraints LC and AC to calculate the reward. There are two cases. First, if the latency satisfies LC and the accuracy satisfies AC, the sample meets the constraints, and we sum up the reward of hardware performance and accuracy. Otherwise, the sample cannot satisfy the constraints, and we directly return a negative value to the controller, which saves search time. Note that we return different negative rewards to guide the search: one value when LC is violated and another when AC is violated.
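The two-case reward can be sketched as follows. The specific penalty magnitudes are our assumptions (the paper only states that the two violations receive different negative values), as is the simple sum of accuracy and normalized utilization in the feasible case.

```python
def reward(acc, lat_ms, util, lc_ms, ac, penalty_lc=-1.0, penalty_ac=-0.5):
    """Two-case reward: quality + hardware use when feasible,
    a distinct negative constant per violated constraint otherwise."""
    if lat_ms <= lc_ms and acc >= ac:
        return acc + util          # feasible: reward accuracy and HW use
    if lat_ms > lc_ms:
        return penalty_lc          # latency constraint violated
    return penalty_ac              # accuracy constraint violated
```

Distinct penalties let the controller learn which constraint it is failing, steering it toward either higher sparsity (for LC) or lower sparsity (for AC).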

Fig. 7: Reinforcement Learning. We integrate the software parameters (#sparsity ratio) and hardware parameters (#parallelism factors) into the action space to realize a co-exploration of sparsity ratio and hardware resource. The environment is made up of components 3⃝ 4⃝ 5⃝ 6⃝ in Figure 1.

IV Experiments

IV-A Experimental Settings

Baseline Models and Datasets.

We test our method on the Transformer model using the WikiText-2 dataset [20] and on the TinyBERT model using the GLUE benchmark [31]. The Transformer model has 2 encoder layers and 1 decoder layer (the hidden size is 800, the feed-forward size is 200 and the head number is 4), and we use word prediction accuracy as our evaluation metric. TinyBERT has 4 encoder layers, 1 pooler layer and 1 classifier layer.

Evaluation Platforms.

We run the reinforcement learning framework and the training of the Transformer model on an 8× NVIDIA Quadro RTX 6000 GPU server (24 GB GPU memory). The experimental environment uses Python 3.6.10, GCC 7.3.0, PyTorch 1.4.0, and CUDA 10.1. The hardware accelerator is implemented with Vivado HLS, the commonly used high-level synthesis tool, which enables implementing the accelerator in C and exports the RTL as a Vivado IP core. The C code of our accelerator is parallelized by adding HLS-defined pragmas. Pre-synthesis resource reports are used for performance estimation.

IV-B Experimental Results

IV-B1 Pruning Strategy

We set the backbone sparsity to different values for different models. The TinyBert model is relatively small and sensitive to pruning; therefore, in order to maintain high accuracy, we set its backbone sparsity to 0%. For the Transformer model, through experiments we set the backbone sparsity to 50%, which ensures a high sparsity ratio while maintaining an acceptable accuracy loss. As Table V shows, the Transformer model pruned by HP reduces the model size by 8.7× (from 52M to 6M) with a 2.37% accuracy loss, and the TinyBert model achieves 0.7% and 2.23% accuracy loss on the MRPC and SST-2 tasks, respectively.

Accuracy. To evaluate the benefit of our HP, we compare it with BP [14], VW [3], block-wise pruning (BW) [21], and irregular pruning on the Transformer model; the block sizes of BW and BP and the vector size of VW are kept fixed across sparsity ratios. As Figure 9 shows, HP, VW and irregular pruning achieve the same model accuracy when the sparsity is smaller than 70%. HP achieves better accuracy than irregular pruning at around 82% sparsity. When the sparsity is larger than 92% (the intersection of HP and irregular), HP performs worse than irregular pruning due to the large sparsity of the backbone model. VW can only reach a limited 90% sparsity for its vector size, whereas our HP can achieve 99% sparsity. These experimental results demonstrate that HP has almost the same effectiveness as irregular sparsity and outperforms BW, BP and VW in terms of achievable accuracy and sparsity during pruning.

Transformer TinyBert(MRPC) TinyBert(SST-2)
Base HP (backbone 50%) Base HP (backbone 0%) Base HP (backbone 0%)
model size 52M 6M 14.5M 10.9M 14.5M 8.6M
sparsity 0.00% 89.85% 0.00% 25.0% 0.00% 41.00%
accuracy 98.50% 96.13% 86.45% 85.75% 92.60% 90.37%
TABLE V: Pruning Strategy
Fig. 8: Weight heat map visualization after pruning with (a) HP, (b) VW and (c) irregular pruning. These weight heat maps are sub-matrices of the whole matrix of feed-forward layer 2 in the encoder.

Visualization. We visualize the weight matrices after HP, VW and irregular pruning on the Transformer model. Figure 8 shows the three sparse weight matrices for a sub-matrix randomly selected from the whole weight matrix. Pink grids indicate non-zero parameters, and the pink level indicates the magnitude of the absolute value. Figure 8(a) shows the two steps of HP. In our HP matrices, there are two blocks (above and below the dashed line) and each vector (column) in the blocks has 7 NZ. We can see that the heat map of the HP weight prunes some unimportant columns and maintains the most important weights, as irregular pruning does. Although irregular sparsity retains some weights in a column that our HP removes entirely, these weights are relatively small (as seen from the pink level in Figure 8) and their removal has no significant impact on accuracy. Instead, most of the important weights are retained by our HP to ensure accuracy.

Fig. 9: Accuracy comparison of Transformer model on WikiText-2 dataset with various pruning patterns.

IV-B2 Overhead Comparison of Sparse Weight Format

We compare the overhead of CSR [18], Tile-Bitmap [38], MBR [13], and our WMark format, using memory usage as the metric. Figure 10 shows the results: our optimized WMark format needs the least memory of all, and reduces memory usage relative to MBR [13]. The reason is twofold. First, because WMark is balanced, we do not need an array to compute the start index of each row. Second, our mask array covers only the non-zero columns rather than all columns. WMark therefore uses less memory than MBR [13].
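A rough accounting of the index overhead illustrates why the savings arise. The sketch below reflects our reading of the formats (a per-column survival mask plus bitmaps over only the kept columns, with no row-pointer array thanks to the balanced NZ counts); the exact WMark layout may differ.

```python
def csr_bits(rows, nnz, idx_bits=16):
    """Index overhead (bits) of CSR: one column index per non-zero plus a
    row-pointer array with rows + 1 entries."""
    return nnz * idx_bits + (rows + 1) * idx_bits

def masked_bitmap_bits(rows, cols, kept_cols):
    """Index overhead (bits) of a WMark-style format, as we understand it:
    one bit per column marking whether it survives pruning, plus one bit per
    entry of the kept columns only. Balanced non-zeros per column mean no
    pointer array is needed to locate each column's start."""
    return cols + kept_cols * rows
```

For example, a hypothetical 64x64 tile with 16 surviving columns of 7 non-zeros each (nnz = 112) needs 2832 index bits in CSR but only 1088 bits in the masked-bitmap scheme, since the zeroed columns contribute a single mask bit each.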

Fig. 10: Overhead comparison of sparse weight matrix formats among CSR, Tile-Bitmap, MBR and our WMark.
Models            (LC, AC)       sparsity  accuracy  est. latency  BRAM / Util   DSP / Util   LUT / Util     FF / Util       Target device
Transformer       (40ms, 92%)    92.00%    94.45%    35.70ms       2492 / 85%    1644 / 46%   303879 / 70%   268065 / 30%    VC709
Transformer       (20ms, 96%)    86.42%    96.84%    18.10ms       3311 / 77%    5040 / 74%   908833 / 77%   1102880 / 47%   Alveo U200
TinyBert (MRPC)   (180ms, 85%)   0.00%     86.45%    175ms         1602 / 87%    1027 / 40%   262248 / 95%   131542 / 23%    ZCU102
TinyBert (MRPC)   (45ms, 85%)    0.00%     86.45%    42.1ms        2194 / 74%    1928 / 53%   417058 / 96%   254178 / 29%    VC709
TinyBert (MRPC)   (18ms, 85%)    0.00%     86.45%    15.8ms        4204 / 97%    4145 / 60%   936545 / 79%   504293 / 21%    Alveo U200
TinyBert (MRPC)   (50ms, 80%)    27.00%    84.95%    47.33ms       1530 / 83%    991 / 39%    254543 / 92%   158817 / 28%    ZCU102
TinyBert (SST-2)  (45ms, 90%)    25.00%    90.83%    40ms          1674 / 91%    1056 / 42%   264443 / 96%   189035 / 34%    ZCU102
TinyBert (SST-2)  (30ms, 90%)    41.00%    90.37%    25.1ms        2504 / 85%    2028 / 56%   316028 / 73%   235177 / 27%    VC709
TABLE VI: Validation On FPGA

IV-B3 Validation On FPGA

We use FPGA devices to validate our approach; Table VI shows the results. Our approach finds different devices under different sets of LC and AC. For Transformer, we set two sets of constraints: (40ms, 92%) as loose constraints and (20ms, 96%) as tight constraints. For (40ms, 92%), the latency and accuracy restrictions are loose, so a higher sparsity ratio is possible and the model is deployed on a mid-end FPGA, the VC709. For (20ms, 96%), the restrictions are relatively tight, so the sparsity ratio is kept small to preserve accuracy and the model is deployed on a device with strong computing power, the Alveo U200, to achieve very low latency.

For the MRPC task of the TinyBERT model, we first set up three sets of constraints with the same AC (85%) but different LC. The experimental results show that constraints with a smaller LC are deployed on FPGAs with greater computing power, e.g., (180ms, 85%) on the ZCU102 and (45ms, 85%) on the VC709. We then set a constraint with a lower AC, (50ms, 80%); the searched sparsity ratio is 27% and the target device is the ZCU102. This constraint maps to the same low-end FPGA (ZCU102) as (180ms, 85%) because compression reduces the latency. The same device can therefore satisfy different sets of constraints through compression and serve different application scenarios. The SST-2 task of TinyBERT shows experimental results similar to Transformer and the MRPC task.

IV-B4 Cross-platform Comparison

Research on Transformer models has mainly focused on the software level for CPUs and GPUs, e.g., Trans [30], Evolved Transformer [28], and HAT [33]; little work has been published on custom hardware acceleration on FPGAs apart from FTRANS [16]. We compare our efficiency with these works. Since they use different models and datasets, we use floating-point operations per second (FLOPS) as the metric to provide a fair comparison. As Table VII shows, our FPGA implementation achieves speedups over Trans [30], Evolved Transformer [28], and HAT [33] on CPU, as well as over HAT [33] on GPU and FTRANS [16] on FPGA.

               Trans [30]  Evolved [28]  HAT [33]  HAT [33]  FTRANS [16]  Ours Trans(84%)  Ours TinyBert(0%)
Platform       CPU         CPU           CPU       GPU       FPGA         FPGA             FPGA
Operations(G)  1.5         2.9           1.1       1.1       0.284        0.09             1.2
latency(s)     3.3         7.6           2.1       0.147     0.034        0.00645          0.0158
FLOPS(G)       0.45        0.38          0.52      7.48      8.35         14.14            75.94
Impro.         base        0.84x         1.16x     16.62x    18.5x        31.42x           168.75x
TABLE VII: Comparison among CPU, GPU and FPGA

IV-B5 Search Space Exploration

We collect the results explored by RL to form the search space exploration plot for Transformer in Figure 11, where the x-axis and y-axis are latency and accuracy. We show the search result for the constraint (26ms, 96%). The points are concentrated in the vicinity of (26ms, 96%) because the guided search of RL draws the samples toward the solution. Two points, A and B, satisfy the constraints in Figure 11, which may mean there are two devices to choose from. In this case, we use a third metric, resource utilization, to select the best solution: the device with the largest resource utilization is the more suitable one, and at the same time the cheaper one. With our approach, we can therefore choose the best device.
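The final selection rule can be written down compactly. This is a sketch of the decision logic only, with illustrative field names; the framework's actual candidate records come from the RL search and the FPGA resource estimator.

```python
def select_device(candidates, lc_ms, ac):
    """Among candidate design points, keep those that satisfy both the
    latency constraint LC and the accuracy constraint AC, then pick the
    one with the highest resource utilization (the tie-breaker used when
    several devices are feasible). Returns None if nothing is feasible."""
    feasible = [d for d in candidates
                if d["latency_ms"] <= lc_ms and d["accuracy"] >= ac]
    return max(feasible, key=lambda d: d["utilization"], default=None)
```

For example, with two feasible points like A and B in Figure 11, the one with the larger utilization wins even if the other has slightly lower latency, since high utilization indicates the smallest (and cheapest) device that still meets the constraints.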

Fig. 11: The RL search results for Transformer under constraint (26ms, 96%). We select the device with the larger resource utilization from A and B.

IV-B6 Ablation Study

In this section, we investigate the influence of the backbone-model sparsity in our HP. We set the number of rows per block to 10 in this experiment. The first feature of HP is that it covers a different achievable sparsity range when combined with backbone models of different sparsity: a sparser backbone shifts the whole achievable range toward higher sparsity. VW, in contrast, has a hard upper limit on sparsity that it cannot exceed. For accuracy, as Figure 12 shows, under the same overall sparsity ratio, accuracy decreases as the backbone sparsity increases; the densest backbone achieves the highest accuracy. This tells us that we should keep the backbone sparsity small in order to obtain better accuracy.
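Under our reading of the scheme, the overall sparsity composes multiplicatively from the two HP steps, which is why the achievable range shifts with the backbone. The formula below is an illustrative model, not taken verbatim from the paper:

```python
def hp_overall_sparsity(backbone_sparsity, nz_per_col, block_rows):
    """Overall HP sparsity: a fraction backbone_sparsity of columns is
    removed entirely, and each surviving column keeps nz_per_col out of
    block_rows weights per block. The kept fraction is the product of the
    two survival rates, so a sparser backbone raises the overall floor."""
    kept = (1.0 - backbone_sparsity) * (nz_per_col / block_rows)
    return 1.0 - kept
```

For instance, with 10-row blocks and 1 NZ per column, a dense backbone gives 90% overall sparsity, while a 50%-sparse backbone gives 95%; raising nz_per_col lowers the floor of the range instead.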

Fig. 12: The accuracy comparison of HP with different backbone models.

V Conclusion

In this paper, we propose an algorithm-hardware closed-loop acceleration framework to solve the challenges of efficient deployment and device selection. Given the constraints (LC, AC), our framework finds the best device. To achieve a high sparsity ratio, we propose HP to reduce model size. To further reduce memory usage, we optimize the sparse matrix storage format based on the HP sparsity pattern. Experiments show that our framework can find different devices for various LC and AC, covering low-end to high-end devices.


This work is partially supported by National Natural Science Foundation of China (NSFC) 61972154 and Shanghai Science and Technology Commission Project 20511101600. We sincerely thank Prof. Caiwen Ding at UConn and Prof. Weiwen Jiang at George Mason University for the intensive discussions and constructive suggestions.


  • [1] R. Barrett, et al. (1995) Templates for solution of linear systems: building blocks for iterative methods. SIAM. Cited by: §II, §III-C, TABLE III.
  • [2] M. Belgin, G. Back, and C. J. Ribbens (2009) Pattern-based sparse matrix representation for memory-efficient smvm kernels. In International Conference on Supercomputing, pp. 100. Cited by: §II.
  • [3] S. Cao, C. Zhang, Z. Yao, W. Xiao, L. Nie, D. Zhan, Y. Liu, M. Wu, and L. Zhang (2019) Efficient and effective sparse lstm on fpga with bank-balanced sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 63–72. Cited by: §III-B, §III-C, §IV-B1.
  • [4] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar, M. Schuster, Z. Chen, et al. (2018) The best of both worlds: combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849. Cited by: §II.
  • [5] T. Chen, J. Frankle, S. Chang, S. Liu, Y. Zhang, Z. Wang, and M. Carbin (2020) The lottery ticket hypothesis for pre-trained bert networks. arXiv preprint arXiv:2007.12223. Cited by: §I, §II.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §I.
  • [7] C. Garvey (2018) A framework for evaluating barriers to the democratization of artificial intelligence. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §I.
  • [8] C. Guo, B. Y. Hsueh, J. Leng, Y. Qiu, Y. Guan, Z. Wang, X. Jia, X. Li, M. Guo, and Y. Zhu (2020) Accelerating sparse dnn models without hardware-support via tile-wise sparsity. arXiv preprint arXiv:2008.13006. Cited by: TABLE I, §I.
  • [9] T. J. Ham, S. J. Jung, S. Kim, Y. H. Oh, Y. Park, Y. Song, J. Park, S. Lee, K. Park, J. W. Lee, et al. (2020) A^ 3: accelerating attention mechanisms in neural networks with approximation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 328–341. Cited by: TABLE I, §I, §II.
  • [10] W. Jiang, E. H. Sha, X. Zhang, L. Yang, Q. Zhuge, Y. Shi, and J. Hu (2019) Achieving super-linear speedup across multi-fpga for real-time dnn inference. ACM Transactions on Embedded Computing Systems (TECS) 18 (5s), pp. 1–23. Cited by: §III-C.
  • [11] W. Jiang, X. Zhang, E. H. Sha, L. Yang, Q. Zhuge, Y. Shi, and J. Hu (2019) Accuracy vs. efficiency: achieving both through fpga-implementation aware neural architecture search. In Proceedings of the 56th Annual Design Automation Conference 2019, pp. 1–6. Cited by: §III-C.
  • [12] W. Jiang, X. Zhang, E. H. Sha, Q. Zhuge, L. Yang, Y. Shi, and J. Hu (2019) Xfer: a novel design to achieve super-linear performance on multiple fpgas for real-time ai. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 305–305. Cited by: §III-C.
  • [13] R. Kannan (2013) Efficient sparse matrix multiple-vector multiplication using a bitmapped format. In 20th Annual International Conference on High Performance Computing, pp. 286–294. Cited by: §III-C, §III-C, TABLE III, §IV-B2.
  • [14] B. Li, Z. Kong, T. Zhang, J. Li, Z. Li, H. Liu, and C. Ding (2020) Efficient transformer-based large scale language representations using hardware-friendly block structured pruning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §I, §III-B, §IV-B1.
  • [15] B. Li, Z. Kong, T. Zhang, J. Li, Z. Li, H. Liu, and C. Ding (2020) Efficient transformer-based large scale language representations using hardware-friendly block structured pruning. arXiv preprint arXiv:2009.08065. Cited by: §II.
  • [16] B. Li, S. Pandey, H. Fang, Y. Lyv, J. Li, J. Chen, M. Xie, L. Wan, H. Liu, and C. Ding (2020) FTRANS: energy-efficient acceleration of transformers using fpga. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 175–180. Cited by: TABLE I, §I, §II, §IV-B4, TABLE VII.
  • [17] E. Li, L. Zeng, Z. Zhou, and X. Chen (2019) Edge ai: on-demand accelerating deep neural network inference via edge computing. IEEE Transactions on Wireless Communications 19 (1), pp. 447–457. Cited by: §I.
  • [18] W. Liu and B. Vinter (2015) CSR5: an efficient storage format for cross-platform sparse matrix-vector multiplication. In The 29th ACM International Conference on Supercomputing (ICS ’15), Cited by: §II, §III-C, TABLE III, §IV-B2.
  • [19] X. Ma, F. Guo, W. Niu, X. Lin, J. Tang, K. Ma, B. Ren, and Y. Wang (2020) PCONV: the missing but desirable sparsity in dnn weight pruning for real-time execution on mobile devices.. In AAAI, pp. 5117–5124. Cited by: §II.
  • [20] S. Merity, C. Xiong, J. Bradbury, and R. Socher (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843. Cited by: §IV-A.
  • [21] S. Narang, E. Undersander, and G. Diamos (2017) Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782. Cited by: §IV-B1.
  • [22] S. Pal, J. Beaumont, D. H. Park, A. Amarnath, and R. Dreslinski (2018) OuterSPACE: an outer product based sparse matrix multiplication accelerator. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Cited by: §III-C.
  • [23] A. Pinar and M. T. Heath (1999) Improving performance of sparse matrix-vector multiplication. In SC’99: Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, pp. 30–30. Cited by: §III-C, TABLE III.
  • [24] S. Prasanna, A. Rogers, and A. Rumshisky (2020-11) When BERT Plays the Lottery, All Tickets Are Winning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 3208–3229. External Links: Link, Document Cited by: §I, §II.
  • [25] T. Rocktäschel, E. Grefenstette, K. M. Hermann, T. Kočiskỳ, and P. Blunsom (2015) Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664. Cited by: §I.
  • [26] V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: §I.
  • [27] R. Shi, P. Dong, T. Geng, Y. Ding, X. Ma, H. K. So, M. Herbordt, A. Li, and Y. Wang (2020) CSB-rnn: a faster-than-realtime rnn acceleration framework with compressed structured blocks. In Proceedings of the 34th ACM International Conference on Supercomputing, pp. 1–12. Cited by: §II, §III-C.
  • [28] D. So, Q. Le, and C. Liang (2019) The evolved transformer. In International Conference on Machine Learning, pp. 5877–5886. Cited by: §IV-B4, TABLE VII.
  • [29] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus (2015) End-to-end memory networks. arXiv preprint arXiv:1503.08895. Cited by: §I.
  • [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §I, §IV-B4, TABLE VII.
  • [31] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) Glue: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: §IV-A.
  • [32] B. Wang, K. Liu, and J. Zhao (2016) Inner attention based recurrent neural networks for answer selection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1288–1297. Cited by: §I.
  • [33] H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, and S. Han (2020) Hat: hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187. Cited by: TABLE I, §I, §II, §IV-B4, TABLE VII.
  • [34] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §III-G.
  • [35] Xilinx. Introduction to FPGA design with Vivado High-Level Synthesis. Cited by: §III-D.
  • [36] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi (2018) Scaling for edge inference of deep neural networks. Nature Electronics 1 (4), pp. 216–222. Cited by: §I.
  • [37] L. Yang, Z. Yan, M. Li, H. Kwon, L. Lai, T. Krishna, V. Chandra, W. Jiang, and Y. Shi (2020) Co-exploration of neural architectures and heterogeneous asic accelerator designs targeting multiple tasks. In 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. Cited by: §III-G.
  • [38] O. Zachariadis, N. Satpute, J. Gómez-Luna, and J. Olivares (2020) Accelerating sparse matrix–matrix multiplication with GPU tensor cores. Computers & Electrical Engineering 88, pp. 106848. Cited by: §III-C, TABLE III, §IV-B2.
  • [39] J. Zhao, L. Feng, S. Sinha, W. Zhang, Y. Liang, and B. He (2019) Performance modeling and directives optimization for high level synthesis on fpga. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. Cited by: §III-D.
  • [40] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §III-G.