I Introduction
Convolutional Neural Networks (CNNs) have achieved impressive progress over the past few years in various domains, such as image classification [1], object detection [2], and semantic segmentation [3]. At the same time, neural network models are becoming deeper and heavier to achieve better performance. The huge number of weights increases the hardware resource requirements and the energy consumption of inference, making it challenging to implement CNNs on embedded systems. In recent works, pruning techniques have been studied to address this limitation [4, 5]. It is observed that contemporary neural networks tend to be overparameterized, and a large portion of the weights can be removed to reduce the model complexity and the heavy cost of inference. This pruning process has only a marginal impact on accuracy after fine-tuning the remaining weights. For example, 89% of the weights in the AlexNet [1] model were pruned without accuracy loss in [4].
Although great progress has been made in network pruning, such a high pruning rate does not directly lead to the same degree of energy saving and throughput improvement in hardware accelerators. Many state-of-the-art CNN accelerators employ systolic arrays to implement the intensive multiply-accumulate computations (MACs) during inference [6, 7]. As one of the most popular architectures for implementing CNN models, the systolic array has the advantages of regular structure, parallel computation, and data reuse capability. Therefore, it can execute the intrinsic matrix multiplications of CNNs with high energy efficiency and throughput. However, it is not easy for the regular structure of the systolic array to take full advantage of the fine-grained sparsity in the pruned network models. Since the nonzero weights have an irregular distribution in the weight matrix, the size of the weight matrix of each layer cannot be reduced efficiently by unstructured pruning. Therefore, most of the zero weights will still be mapped to the systolic array nodes, making it difficult to improve the throughput and energy efficiency. As one potential way to tackle this problem, structured pruning techniques have been explored [8, 9, 10], where the pruning is performed at a larger granularity, such as channel-wise and kernel-wise. However, structured pruning usually tends to have a higher accuracy loss than fine-grained unstructured pruning when the pruned models are of the same size [11, 7].
In this work, we propose a tight compression method that compresses the unstructurally-pruned CNN model into a small and dense format, so that the systolic array can be fully utilized in implementing the pruned neural network. Compared to the state-of-the-art model compression techniques, the proposed method can achieve a higher compression rate of the weight matrix, which leads to significant improvements in throughput and energy efficiency. In summary, this work makes the following contributions:

A compression method is proposed to implement CNN models efficiently in the systolic array system. The unstructured pruning is carried out first. Then, the sparse weight matrix is partitioned according to the size of the systolic array, and the rows and columns are permuted to facilitate packing. Simulated annealing (SA) is used during the permutation process to obtain the optimal arrangement of the weight matrix for the final compression step. After permutation, the sparse weight matrix can be compressed to a small and dense format to make full use of the hardware resources.

Two pruning granularities are explored for weight permutation and matrix compression. The design based on unstructured weight pruning is presented first (i.e. weight-level compression). Then, the model is further pruned at the subword level to exploit the fine-grained subword sparsity for improving the compression performance (i.e. subword-level compression).

The hardware structure of the systolic array is taken into account during the compression process. Instead of only considering the pruning rate, we adopt the compression rate of the weight matrix as the optimization objective for the simulated annealing-based weight permutation.

A systolic array architecture is designed for implementing the compressed CNN models. Experimental results show that the proposed compression method can prune over 93% of the weights and compress the size of the weight matrices by 14.13 times. As a result, significant improvements can be achieved in throughput and energy efficiency.
This work is an extension of our prior paper [12]. The extended materials include 1) a more fine-grained subword-level compression method to further improve the compression performance, and 2) the hardware architecture to support the subword-level compressed models.
II Preliminaries
II-A CNN Model and Systolic Array
A convolutional neural network (CNN) is a machine learning model inspired by the brain. It is organized as a stack of layers that extract features hierarchically, as shown in Fig. 1. Convolutional (CONV) layers are the core layers for feature processing and thus account for most of the arithmetic computations. At each neuron, the weighted sum of the input activations is computed. Then, the multiplication and accumulation (MAC) result is sent to the activation function to calculate the output activation. Since the intrinsic computation at each layer can be regarded as a matrix multiplication between weights and input activations, CNN models are highly suitable to be implemented using systolic arrays.
Fig. 2 (a) shows the general architecture of a systolic array. Each node in the systolic array is a processing unit connected with its four neighboring nodes. In the widely used weight-stationary computation flow, the weight matrix is stored in the local registers of the nodes and does not move during the matrix multiplication [6, 7]. Each row in the systolic array is mapped with the weights of the same neuron. The input activations are sent into the array from the bottom and gradually propagate to the top. As the input activations propagate across the nodes, the weight stored in each node is continually multiplied with the input activations received from the bottom node. The product is accumulated with the corresponding partial sum received from the left node and then propagates to the right. All the weight-activation products related to the same output are summed up along the row and sent out of the systolic array from the right.
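The weight-stationary flow described above can be illustrated with a small functional model. This is only a sketch under simplifying assumptions: it ignores timing, skew, and bit widths, and the function name `systolic_matvec` is hypothetical rather than taken from the paper.

```python
def systolic_matvec(weights, activations):
    """Functional (not cycle-accurate) model of a weight-stationary array.

    weights[r][c] is the weight held in node (r, c); activations[c] is the
    input that propagates through column c. Partial sums travel left to
    right along each row, so every row produces one output.
    """
    outputs = []
    for row in weights:                  # one array row per output neuron
        partial_sum = 0                  # value entering the row from the left
        for w, x in zip(row, activations):
            partial_sum += w * x         # multiply-accumulate at one node
        outputs.append(partial_sum)      # value leaving the row on the right
    return outputs


print(systolic_matvec([[1, 0, 2], [0, 3, 0]], [4, 5, 6]))
```

Note that the nodes holding zero weights still occupy array positions in this model, which is the inefficiency that the compression method targets.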
II-B Related Works
II-B1 Pruning
Network pruning seeks to minimize the number of nonzero weights in the overparameterized CNN model to reduce the model complexity and the cost of inference. In [4], Han et al. propose to prune the weights according to their magnitude. Weights with small magnitude are considered to contribute relatively little to the model quality and thus can be removed. After pruning, the remaining weights are retrained to recover the accuracy. The method can achieve a pruning rate of 89% in AlexNet [1]. In [5], an automated pruning algorithm is proposed to gradually prune the small-magnitude weights to a preset level of sparsity with minimal retraining requirement. Considerable pruning rates have been achieved in previous works. However, it is difficult to fully exploit the unstructured sparsity in regular hardware architectures such as systolic arrays. Since the zero weights are randomly spread across the matrix, the size of the weight matrix cannot be reduced efficiently by unstructured pruning, and thus most zero weights still have to be allocated to the nodes of the systolic array. The nodes mapped with zero weights do not perform effective computations, which limits the overall improvements in throughput and energy efficiency.
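As a rough illustration of magnitude-based pruning, the sketch below zeroes the smallest-magnitude entries of a flat weight list until a target pruning rate is reached. It is a simplified stand-in, not the procedure of [4] or [5]: retraining is omitted, and the function name and the flat-list representation are assumptions made for the example.

```python
def magnitude_prune(weights, pruning_rate):
    """Zero the smallest-magnitude entries of a flat list of weights.

    pruning_rate is the fraction of entries to remove (0.0 to 1.0).
    Retraining of the surviving weights is not modeled here.
    """
    n_prune = int(len(weights) * pruning_rate)
    # Indices ordered from the smallest to the largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


print(magnitude_prune([0.5, -0.1, 0.05, -0.9], 0.5))
```

The surviving nonzeros keep their original positions, so the resulting sparsity pattern is irregular.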
II-B2 Structured Pruning
Structured pruning techniques have been explored in recent works to implement sparse neural networks in the existing regular architectures [8, 9, 10]. Instead of pruning individual weights, structured pruning is performed at a larger granularity, such as an entire row or column of the weight matrix. After pruning, the model has a structured sparsity that can be mapped to the systolic array efficiently. However, the number of nonzero weights after structured pruning is usually larger than after fine-grained pruning if the same level of accuracy is maintained [11, 7]. For example, 4.1M nonzero weights are preserved after the unstructured pruning [7] on the ImageNet dataset [13], whereas 23.2M nonzero weights are kept after the structured pruning [8].
II-B3 Column Packing and Conflict Pruning
Another way to map sparse networks efficiently to the systolic array is to reduce the size of the sparse weight matrix by packing the columns. A novel method is proposed in [7]. A moderate unstructured pruning is carried out first. Then, different weight columns are grouped, so that later the entire group can be mapped to a single column of the systolic array to save energy and execution time. Inside the group, only one nonzero weight is allowed at each row position to make sure that each node only needs to store one weight and handle one MAC operation at a time. To perform the matrix multiplication, the input activations also need to be grouped and sent to the corresponding columns of the systolic array. At each node, only one activation is selected to multiply with the weight. Since the partial results are accumulated along the rows, column packing does not cause any error or hardware overhead in the accumulation process. However, compared to conventional unstructured pruning, extra accuracy loss is induced by a second-time pruning called conflict pruning. Since there are usually hundreds or thousands of rows in the weight matrix, most columns very likely have some nonzeros at the same row positions. These nonzeros are called conflicts between the columns. Directly packing the weight matrix has only a limited benefit in matrix compression, as it is difficult to find a set of columns that has no conflict. To make the packing more efficient, a specific number of conflicts are allowed in each column group in [7]. The model then undergoes a second pruning step, where all the conflicts except the one with the largest magnitude at each row are pruned in every column group. This compression method is illustrated in Fig. 2 (b). Two pairs of conflicts exist in the column group, and the nonzero weights 3 and 5 will be pruned during conflict pruning. After mapping the column group to the systolic array, the appropriate input is selected at each node to compute with the weight, as shown in Fig. 2 (b).
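The conflict-pruning step can be sketched as follows: within one column group, each row keeps only its largest-magnitude nonzero and records which column supplied it. The data layout (`columns[c][r]`) and the function name are assumptions for illustration, and the example values are made up rather than taken from Fig. 2 (b).

```python
def pack_with_conflict_pruning(columns):
    """Merge one column group into a single packed column.

    columns[c][r] is the weight at row r of the c-th column in the group.
    At each row only the largest-magnitude nonzero survives; the others
    are the conflicts that get pruned. Returns (packed, source), where
    source[r] is the in-group index of the surviving column (None if the
    row holds no nonzero at all).
    """
    n_rows = len(columns[0])
    packed, source = [], []
    for r in range(n_rows):
        best = None
        for c, col in enumerate(columns):
            if col[r] != 0 and (best is None or abs(col[r]) > abs(columns[best][r])):
                best = c
        packed.append(columns[best][r] if best is not None else 0)
        source.append(best)
    return packed, source


group = [[7, 0, 3, 0], [0, -2, -8, 0], [0, 5, 0, 0]]
print(pack_with_conflict_pruning(group))
```

In this made-up group, the weights -2 and 3 are the pruned conflicts, and `source` plays the role of the per-node input selection.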
II-B4 Mixed-Precision Computation
In addition to the pruning techniques, mixed-precision computation is another way to reduce the computational complexity [14, 15]. Since most weights have small values, and only a small portion of weights are orders of magnitude larger, a smaller bit-length (e.g. 8-bit) can be achieved by using mixed precision to represent the weights. In [14], two different scale factors are used for quantizing the weights at a layer. For weights with small magnitude, a narrow quantization range with high resolution is used. In contrast, for weights with large magnitude, a wide quantization range with low resolution is adopted. A smaller bit-length (e.g. 6-bit) is used for all weights. In [15], the large weights are called outliers and represented using a large bit-length (e.g. 16-bit). The majority of weights with small magnitude adopt a 4-bit representation to reduce the computations. This mixed-precision computation usually requires a special hardware design for implementation. For example, unlike the normal computations that are handled by 4-bit MAC units, the outliers require dedicated high-precision processing elements in [15].
III Weight-Level Tight Compression and Accelerator Design
III-A Overview
To make the column packing efficient, the number of conflicts allowed for each group in [7] is set according to the number of rows in the weight matrix. When the number of rows is 1k, 1750 conflicts are allowed per group. Such a large amount of conflict pruning has a nontrivial impact on the model quality. As a result, more nonzero weights have to be preserved to maintain high accuracy, compared to the efficient unstructured pruning. For example, 0.3M nonzero weights are preserved after conflict pruning with an accuracy of 92.9% on the CIFAR-10 dataset [16]. However, the efficient unstructured pruning can reduce the number of nonzero weights to 0.13M in the same neural network with an accuracy of 93.10%.
Since the weight matrix usually has a much larger size than the systolic array, it needs to be partitioned into smaller weight tiles and mapped to the systolic array multiple times to finish the entire matrix multiplication. In this work, we leverage this weight tiling and propose a novel weight permutation method to avoid conflicts in the potential groups and facilitate column packing without sacrificing accuracy. Different from [7], no conflict pruning is performed, and thus a more aggressive pruning rate is achieved with high accuracy. Through permutation, the size of the weight matrix of each layer can be significantly reduced compared to conflict pruning. As a result, fewer weight tiles need to be mapped to the systolic array for computation, and thus higher throughput and energy efficiency can be achieved.
The high-level flow of the proposed compression method is summarized in Algorithm 1. In the first stage, the network model is gradually pruned up to a preset pruning rate over the training epochs. Before training, a scheduling subroutine is invoked in Line 2 to determine the intermediate pruning epochs and the corresponding pruning rates. At each pruning epoch, a pruning subroutine is invoked to prune the small-magnitude weights in each layer to the scheduled percentage. To make an apples-to-apples comparison with the conflict pruning-based compression [7], we adopt the same pruning schedule proposed in [5]. The pruning rate gradually increases in the first half of the training process. After that, the neural network is trained without further pruning to recover accuracy. After training, the compression enters the second stage, where the sparse weight matrix of each layer is further compressed to a small and dense format through weight permutation. To obtain the optimal permutation result, we adopt the simulated annealing algorithm for optimization. Details are discussed in the following subsections.
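To make the first stage more concrete, the sketch below generates a list of pruning epochs and rates that ramps up during the first half of training and then stops, as the text describes. The polynomial ramp (`power=3`), the evenly spaced pruning epochs, and the function name are assumptions made purely for illustration; the actual schedule of [5] may differ.

```python
def pruning_schedule(final_rate, total_epochs, n_steps, power=3):
    """Return [(epoch, rate), ...] ramping up to final_rate.

    All pruning steps fall in the first half of training; the second half
    is left for retraining without further pruning. The polynomial ramp is
    an assumed shape, not necessarily the one used in the paper.
    """
    ramp_end = total_epochs // 2
    schedule = []
    for k in range(1, n_steps + 1):
        epoch = round(k * ramp_end / n_steps)
        rate = final_rate * (1 - (1 - k / n_steps) ** power)
        schedule.append((epoch, rate))
    return schedule


for epoch, rate in pruning_schedule(0.9, 100, 5):
    print(epoch, round(rate, 4))
```

With this shape, most of the pruning happens early, and the last steps only remove a small additional fraction of weights.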
III-B Weight Permutation and Matrix Compression
As mentioned in Section II-B, directly grouping the columns without conflict pruning cannot compress the weight matrix efficiently, since any conflicts among the columns will make the packing invalid. The intuition of weight permutation is to compress the weight matrix by partitioning it into several submatrix sections according to the size of the systolic array and permuting the rows and columns across different submatrices to avoid conflicts as much as possible, so that the columns can be packed efficiently in each submatrix section. As the weight matrix is usually much larger than the systolic array, we need to divide it into tiles and map each tile onto the array for computation. The weights of a column group of a tile are loaded into the corresponding column of the array. Therefore, we only need to make sure there is no conflict among the group columns of each tile instead of the group columns of the whole original weight matrix. This gives much higher flexibility in carrying out the matrix compression. An illustrating example is shown in Fig. 3. The weight matrix originally has a size of , and we assume the size of the target systolic array is . For clarity, only the nonzero weights are shown in the figure. Since each pair of columns has at least one conflict, the columns cannot be grouped directly. For weight permutation, the matrix is first divided into row sections. The height of each row section is equal to the number of rows in the systolic array, so that different row sections will not be mapped to the systolic array at the same time. In this case, the columns in each section can be grouped independently to maximize the overall compression rate.
To explain how permutation works, a one-step row swapping is performed between the two row sections, as shown in Fig. 3. Then the permuted matrix is compressed using Algorithm 2. In the beginning, each column group in each row section has only one column entry. The packing starts from the first group and searches the other columns to find the one that has no conflicts with the current group and, at the same time, achieves the densest format (i.e. with the minimum number of zeros) after merging with it (Line 7). If such a column exists, the two groups are merged into a new group in Line 9. If multiple candidates with the same density exist, the first one is combined with the current group. This process repeats until the current group can no longer be combined with any column, either because of conflicts or because the number of columns in the group reaches the upper bound that can be supported by the systolic array. Then the next column group is processed (Line 11). After compression, the number of zero weights is reduced from 17 to 5 in Fig. 3, and the total size of the weight matrix is reduced by 37.5%. The second row section can still be further compressed. Since the packing starts from the left, a different column order will lead to different compression results. To exploit this phenomenon, the columns in the row sections are permuted to search for the optimal order for packing. For example, a one-step column permutation is performed on the second row section in Fig. 3. After compression, the second row section has only 2 column groups. The number of zero weights is reduced from 17 to 1, and the size of the weight matrix is reduced by 50%.
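The greedy packing described for Algorithm 2 can be sketched as below for a single row section. This is a reading of the prose description, not the authors' pseudo code: the data layout (`columns[c][r]`), the function name, and the parameter name `max_cols_per_group` are assumptions.

```python
def pack_row_section(columns, max_cols_per_group):
    """Greedy conflict-free column packing for one row section.

    columns[c][r] is the weight at row r of column c (0 means pruned).
    Returns a list of groups, each a list of original column indices.
    """
    # Every column starts as its own group; remember the rows it occupies.
    pending = [(c, {r for r, w in enumerate(col) if w != 0})
               for c, col in enumerate(columns)]
    packed = []
    while pending:
        first, rows = pending.pop(0)          # leftmost unprocessed group
        members = [first]
        while len(members) < max_cols_per_group:
            best = None
            for i, (_, cand_rows) in enumerate(pending):
                if rows & cand_rows:           # conflict: a row would hold two nonzeros
                    continue
                # Densest merge = candidate with the most nonzeros;
                # the strict '>' keeps the first candidate on ties.
                if best is None or len(cand_rows) > len(pending[best][1]):
                    best = i
            if best is None:
                break                          # nothing left that fits
            cand, cand_rows = pending.pop(best)
            members.append(cand)
            rows |= cand_rows
        packed.append(members)
    return packed


section = [[1, 0, 0, 0], [2, 3, 0, 0], [0, 0, 4, 5], [0, 6, 0, 0]]
print(pack_row_section(section, max_cols_per_group=16))
```

Because the search always starts from the leftmost pending column, reordering `section` changes the result, which is what the column permutation exploits.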
Fig. 4 shows the result after multiple steps of the mixed permutation in a small weight matrix with two row sections. The columns in each row section can be ordered independently, and each row can be permuted to either row section. Since the row permutation is across different sections, the column order of the permuted row has to follow that of the destination row section. Through permutation, the weight matrix can be compressed efficiently, reducing the number of weight tiles from 4 to 2 (each tile has a size of ).
III-C Simulated Annealing Based Permutation
Weight permutation is an effective way to improve the matrix compression rate without accuracy loss. However, finding the optimal order of rows and columns that maximizes the compression is computationally intractable. Even the small matrix shown in Fig. 4 already has a large number of possible states. Since the complexity increases explosively with the matrix size, it is impractical to search through the huge state space exhaustively. In this work, we adopt simulated annealing (SA) [17, 18] to find a close-to-optimal solution efficiently.
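For readers unfamiliar with simulated annealing, a generic annealing loop with the standard Metropolis acceptance rule is sketched below. It is not the paper's Algorithm 3: the energy function, the neighbor move, the geometric cooling, and all parameter names and values are placeholders, and the toy problem (reordering a list) merely stands in for the row and column permutation.

```python
import math
import random


def simulated_annealing(state, energy, neighbor, t_init, t_min,
                        cooling, steps_per_temp, rng=None):
    """Generic SA loop; returns the best state seen and its energy."""
    rng = rng or random.Random(0)
    current, current_e = state, energy(state)
    best, best_e = current, current_e
    t = t_init
    while t > t_min:
        for _ in range(steps_per_temp):
            candidate = neighbor(current, rng)
            delta = energy(candidate) - current_e
            # Improvements are always accepted; worse states are accepted
            # with probability exp(-delta / t), which shrinks as t drops.
            if delta < 0 or rng.random() < math.exp(-delta / t):
                current, current_e = candidate, current_e + delta
                if current_e < best_e:
                    best, best_e = current, current_e
        t *= cooling                      # geometric cooling (assumed)
    return best, best_e


# Toy stand-ins for "matrix size after packing" and "one-step permutation".
def displacement(order):
    return sum(abs(v - i) for i, v in enumerate(order))


def swap_two(order, rng):
    i, j = rng.sample(range(len(order)), 2)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return new


print(simulated_annealing([4, 3, 2, 1, 0], displacement, swap_two,
                          t_init=10.0, t_min=0.01, cooling=0.9,
                          steps_per_temp=15))
```

In the setting of this paper, the state would be the row and column arrangement, and the energy would be derived from the size of the packed matrix.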
The pseudo code of the SA-based weight permutation algorithm is shown in Algorithm 3. The optimization starts from a high temperature where any proposed solution is likely to be accepted. A subroutine is invoked in Line 6 to propose a neighbor state by a one-step random permutation (either row or column). Since the optimal orders of rows and columns are correlated, it is important to mix the row swapping and column permutation during the optimization, instead of optimizing one first. Then, the neighbor state is packed and evaluated in Lines 7 and 8. The energy difference in Line 8 is calculated by:
$\Delta E = \Delta S + \Delta N_{\text{tile}} \times S_{\text{array}}$ (1)
where the first term ($\Delta S$) is the difference in the matrix size after packing, and the second term is the difference in the number of weight tiles ($\Delta N_{\text{tile}}$) multiplied by the size of the systolic array ($S_{\text{array}}$). An example is shown in Fig. 5. Since a slight change in the matrix size may not result in a difference in the number of weight tiles to map to the systolic array, $\Delta N_{\text{tile}}$ will be zero in most cases. Once a relatively large difference is made, or the change is at the boundary of a tile (which reduces or increases the number of weight tiles), an extra reward or penalty will be given. If $\Delta E < 0$, the new state is better and will always be accepted. Otherwise, it will have a probability of $\exp(-\Delta E / T)$ to get accepted, where $T$ is the current temperature. At high temperatures, the optimizer has a larger chance to accept worse solutions, whereas it will become more conservative as the temperature decreases. The temperature is multiplied by () after every steps to simulate the cooling process. In this work, and are set to 0.01 and 15, respectively. The empirical value of the initial temperature is from 1000 to 3000, depending on the size of the weight matrix. For the early layers with a small size, an initial temperature of 1000 is enough. For large layers, a higher temperature is needed. is set to for all layers.
III-D Hardware Design
The hardware architecture to implement the compressed models is shown in Fig. 6 (a). We adopt the weight-stationary flow described in Section II-A. The overall systolic array architecture is similar to that of [7] (in which bit-serial computation is used) with the following differences. Before computation, one or more row sections are loaded from the off-chip memory to the weight buffer. Assume the size of the systolic array is . To perform the matrix multiplication, a weight tile in the order of the permuted and packed matrix is read from the buffer and stored in the nodes of the systolic array. Each column in the tile can pack up to 16 sparse columns of the original weight matrix. Therefore, the input activations also need to be grouped accordingly before being sent to the systolic array. Since the order of the columns and the group composition are different for each row section, the corresponding input activations have to be read from the input buffer and aligned accordingly. The indices of the input channels corresponding to each packed column are known from the compression results and are stored in an address look-up table (LUT). According to the index, the data of the input channel is read from the input buffer and stored in the register array in a way that is aligned with the column packing and order. We use double buffering to prefetch the input data for the 32 systolic array columns to keep the systolic array fully occupied during the computations.
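The role of the address LUT and the per-node input selection can be illustrated with a small functional model. It is a sketch under assumptions: the helper names, the `columns[c][r]` layout, and the use of plain Python lists for the LUT are all hypothetical, and timing and bit-serial details are ignored.

```python
def build_packed_tile(columns, groups):
    """Pack conflict-free column groups into a dense tile.

    columns[c][r] is the original sparse weight; groups lists, for each
    packed column, the original column indices it contains. Returns
    (weights, select): weights[r][g] is the weight stored at node (r, g)
    and select[r][g] picks which input of group g that node uses.
    """
    n_rows = len(columns[0])
    weights = [[0] * len(groups) for _ in range(n_rows)]
    select = [[0] * len(groups) for _ in range(n_rows)]
    for g, members in enumerate(groups):
        for k, c in enumerate(members):
            for r in range(n_rows):
                if columns[c][r] != 0:
                    assert weights[r][g] == 0, "conflict inside a group"
                    weights[r][g] = columns[c][r]
                    select[r][g] = k
    return weights, select


def packed_matvec(weights, select, lut, x):
    """lut[g] holds the input-channel indices grouped for packed column g."""
    outputs = []
    for w_row, s_row in zip(weights, select):
        acc = 0
        for g, (w, k) in enumerate(zip(w_row, s_row)):
            acc += w * x[lut[g][k]]       # node selects one grouped input
        outputs.append(acc)
    return outputs


cols = [[1, 0, 0, 0], [2, 3, 0, 0], [0, 0, 4, 5], [0, 6, 0, 0]]
groups = [[0, 2, 3], [1]]                  # also serves as the LUT contents
w, s = build_packed_tile(cols, groups)
print(packed_matvec(w, s, groups, [1, 2, 3, 4]))
```

The packed result matches the product with the original sparse matrix, while the tile has only two columns instead of four.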
The matrix multiplication is performed in a bit-serial manner similar to [7]. At each cycle, 1 bit of each input activation is shifted into the array. The 16 input bits (16 inputs that are packed to the same column) propagate along the systolic array column, and the input data of adjacent columns have a one-cycle skew. At each node, 1 bit is selected from the 16 input bits to multiply with the 8-bit weight. The partial sum is accumulated along the row and sent to the output buffer. There is a one-cycle skew between the partial sums of adjacent rows. Similar to [7], we adopt an 8-bit representation for the weights and activations and a 32-bit representation for the partial sums. Since it takes 8 cycles to shift in the 8-bit activations and 32 cycles to accumulate the partial sums, each node has 4 MAC units to receive the input data in an interleaved manner to maximize the throughput. In each MAC unit (Fig. 6 (b)), the main part for the bit-serial multiplication contains 8 full adders (FAs). The 4 MAC units share the same weight and index information.
The systolic array is composed of four subarrays that can be configured to work in two independent groups to maximize the hardware utilization. In the case that a row section has too few column groups to fill up the entire array, it can be folded with another row section and mapped together to the array to save the overall execution time and energy consumption. For example, after computing the first two tiles of row section 1 in Fig. 6 (a), the last 7 column groups can be mapped to the first subarray, and the other subarrays can be mapped with the column groups of row section 2. After the weight tile finishes computing with the data in the input buffer, the next tile is loaded into the systolic array for computation. This process continues until all the tiles in the weight buffer finish the computations with the data in the input buffer. Then, new data is loaded to the input buffer to continue the computations. Once all the computations related to the current weight tiles are completed, the weight buffer can be updated to compute other output channels. This flow guarantees that each weight in the model is read only once from the off-chip memory. Moreover, since the input buffer finishes computing with the entire row section before loading new data, all the input channels loaded on-chip can be fully utilized for computations.
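The arithmetic behind a bit-serial multiply-accumulate can be sketched as follows. The sketch assumes unsigned activations supplied least-significant bit first and models only the numerical result; the real MAC unit in Fig. 6 (b) is organized differently, and the function name is hypothetical.

```python
def bit_serial_mac(weight, activation, partial_sum=0, n_bits=8):
    """Multiply-accumulate with the activation supplied one bit per cycle.

    Assumes an unsigned n_bits activation fed LSB first; the weight may be
    signed. Each cycle contributes the weight scaled by the significance
    of the incoming bit.
    """
    assert 0 <= activation < (1 << n_bits)
    acc = partial_sum
    for cycle in range(n_bits):
        bit = (activation >> cycle) & 1   # the 1-bit input seen this cycle
        if bit:
            acc += weight << cycle        # shift reflects the bit position
    return acc


print(bit_serial_mac(-3, 5, partial_sum=10))
```

After all 8 cycles the accumulated value equals the partial sum plus the full product of weight and activation.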
IV Subword-Level Tight Compression: Towards Higher Compression Rate
In Section III, we have proposed a weight-level compression method. After compression, the model is uniformly quantized to 8-bit for deployment in the systolic array system. In this section, we propose a subword-level compression method that leverages the mixed-precision representation to further enhance the compression performance and reduce the computational complexity. The compressed mixed-precision weight matrix can be easily supported by the systolic array with minor modifications to the MAC units.
IV-A Subword Pruning
For the subword-level tight compression, there is an extra step between the training stage (with unstructured weight pruning) and the compression stage (with weight permutation), where the sparse weight matrix of each layer is quantized to 8-bit and further pruned at the subword level. The intuition of subword pruning is to represent the sparse weight matrix using mixed precisions. In Section II-B, two types of mixed-precision representation are introduced, as shown in Fig. 7 (a). The first type uses the same bit-length but two different scale factors to represent the weights. The second type uses the same scale factor but two different bit-lengths. Unlike the previous works [14, 15], in subword pruning, both the bit-length and the scale factor are changed for the weights according to their magnitude. Fig. 7 (b) illustrates this mixed-precision representation. The nonzero weights at each layer are in three formats after the subword pruning: the Subword L, the Subword H, and the 8-bit full precision. For the weights with small magnitude, the most significant bits (i.e. Subword H) in the 8-bit representation are zero and thus can be removed without incurring accuracy loss. In contrast, for the weights with large magnitude, it is important to preserve a large range, but it may not be necessary to have such a high resolution. In this case, the least significant bits (i.e. Subword L) in the 8-bit representation can be removed with negligible accuracy loss. This operation can be interpreted as a fine-grained subword-level pruning, where either Subword H or Subword L is pruned for most nonzero weights. A small fraction of weights is allowed to preserve the 8-bit full precision to limit the accuracy loss. Specifically, if the deviation after the subword pruning is larger than a threshold value (e.g. 25%), full precision is used for the weight. A larger threshold value results in fewer full-precision weights, which is beneficial for the subword-level compression. However, the accuracy loss induced by subword pruning increases at the same time. The neural network needs to be retrained with subword pruning for several epochs to recover the accuracy.
An illustrating example is shown in Fig. 8. In this example, the bit-length is 4-bit for the subwords, and the allowable deviation for subword pruning is set to 25%. Most nonzero weights have only one nonzero subword after the pruning. A small fraction of weights (e.g. the one with a value of 23) preserves the 8-bit precision. Later, this fine-grained subword-level sparsity will be exploited to improve the compression performance. It is worth mentioning that the bit-lengths of Subword L and Subword H may vary over different layers but are consistent within a layer for efficient model compression. Details for determining the bit-lengths are discussed in Section IV-B.
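The classification of a quantized weight into the three formats can be sketched as below. Several details are assumptions made for the example rather than statements from the paper: the sign is handled separately from the magnitude, dropping Subword L is done by truncation, and the deviation is taken as the relative error with respect to the original magnitude.

```python
def subword_prune(weight, l_bits=4, threshold=0.25, total_bits=8):
    """Classify one quantized integer weight; returns (kind, pruned_value).

    kind is 'zero', 'L' (only Subword L kept), 'H' (only Subword H kept)
    or 'full' (8-bit full precision kept).
    """
    if weight == 0:
        return "zero", 0
    sign = -1 if weight < 0 else 1
    mag = abs(weight)
    assert mag < (1 << total_bits)
    high = mag >> l_bits                  # Subword H: most significant bits
    if high == 0:
        return "L", weight                # Subword H is already zero: lossless
    kept = high << l_bits                 # value left after dropping Subword L
    if (mag - kept) / mag <= threshold:   # assumed relative-deviation test
        return "H", sign * kept
    return "full", weight


for w in (5, 23, -100):
    print(w, subword_prune(w))
```

With 4-bit subwords and a 25% threshold, the value 23 would lose about 30% of its magnitude if Subword L were dropped, so it keeps full precision, which agrees with the example mentioned in the text.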
IV-B Exploiting the Subword Sparsity for Compression
To illustrate the subword-level compression, the small weight matrix shown in Fig. 8 is first compressed by column packing. (Weight permutation is not applied here for simplicity.) The compressed weight matrix is shown in Fig. 9 (a). In a column group, only one nonzero weight is allowed at each row position. Any conflicts among the columns will make the packing invalid. A simple way to exploit the subword sparsity for compression is to divide the original weight matrix into two matrices, one for Subword L and the other for Subword H. Then, each subword matrix can be compressed individually, as shown in Fig. 9 (a). Although the overall matrix size after compression is the same as with the original column packing, each node of the systolic array can be less complex due to the smaller bit-length of the subwords. However, this naive compression method brings some problems. Since a column in the original weight matrix is now split into two columns, each input activation needs to be grouped and input twice into the systolic array to finish the computations. This increases the energy consumption for input preparation and hence induces nontrivial overhead. Besides, the address LUT (Fig. 6) that stores the indices for aligning the input activations is doubled, since the column group compositions for the two subword matrices are different.
A more efficient compression method is shown in Fig. 9 (b), where the entire weight matrix is compressed as a whole. Two nonzero weights at the same row are allowed to be merged during the compression, as long as their nonzero subwords are not in the same position. Later, the merged weights are mapped to the same node of the systolic array and computed as a whole. The density of the weight matrix in Fig. 9 (b) is calculated as the number of nonzero weights divided by the size of the compressed weight matrix. Originally, the weight matrix has a small density of 26%. In the first step, column 3 is selected and grouped with column 0. The two weights at the first row are merged and become an 8-bit entry holding both a Subword H and a Subword L. After grouping the two columns, the density increases to 31%. This process continues until the first column group can no longer be combined with any column, either because of conflicts or because it reaches the upper bound of columns per group (a parameter of Algorithm 2). Then the next column group is processed. After compression, the weight matrix is reduced in size, and the density increases from 26% to 93%. Compared to the previous solution in Fig. 9 (a), no extra overhead is caused by input preparation or the address LUT. This example shows how subword sparsity can help to improve the compression performance. It can be combined with the simulated annealing-based weight permutation proposed in Section III for a complete subword-level compression.
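The relaxed conflict rule can be expressed compactly if every matrix entry is described by the subword slots it occupies. In the sketch below, each entry is one of the strings `''`, `'L'`, `'H'`, or `'HL'` (the last one covering both a full-precision weight and two merged weights); this encoding and the function names are assumptions for illustration.

```python
def can_merge(col_a, col_b):
    """True if no row has the same subword slot occupied in both columns.

    Each entry is '', 'L', 'H' or 'HL': the slots a weight occupies.
    """
    return all(not (set(a) & set(b)) for a, b in zip(col_a, col_b))


def merge(col_a, col_b):
    """Row-wise union of the occupied slots of two compatible columns."""
    assert can_merge(col_a, col_b)
    return ["".join(sorted(set(a) | set(b))) for a, b in zip(col_a, col_b)]


a = ["L", "", "H", "HL"]
b = ["H", "L", "", ""]
print(can_merge(a, b), merge(a, b))
print(can_merge(a, ["L", "", "", ""]))
```

Under weight-level packing, columns `a` and `b` would conflict at the first row; with the subword rule, they merge into a single column whose first entry holds both subwords.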
To obtain the optimal compression performance, it is preferred to have half of the nonzero weights pruned to Subword L and the other half pruned to Subword H. In this case, the nonzeros at the same row have a larger chance to be merged, which avoids conflicts. The ratio of Subword L to Subword H depends on how we partition the 8-bit weight. For example, Table I shows the bit-lengths of the subwords and the proportion of each subword type at a layer of the benchmark on CIFAR-10 [16]. The optimal bit-length is 4-bit for both subword types in this example. In the experiment, three optimal combinations are observed for the layers of all the benchmarks, including {3-bit, 5-bit}, {4-bit, 4-bit}, and {5-bit, 3-bit} (Fig. 7 (b)). The hardware can be configured to support different combinations.
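Table I. Proportion of each subword type at one layer of the CIFAR-10 benchmark for different subword bit lengths.

| Nonzero Weights | Bit Lengths { , } = {2,6} | {3,5} | {4,4} | {5,3} | {6,2} |
| --- | --- | --- | --- | --- | --- |
| Subword (%) | 99.9 | 97.6 | 57.7 | 18.4 | 6.0 |
| Subword (%) | 0.1 | 2.3 | 37.6 | 76.4 | 90.7 |
| 8-Bit Full Precision (%) | 0 | 0.1 | 4.7 | 5.2 | 3.3 |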
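As a rough illustration of how a bit-length split could be chosen from the figures in Table I, the sketch below picks the split whose two subword shares are closest to each other. Using the absolute difference of the two shares as the balance measure is an assumption made here for illustration; the selection criterion actually used in the experiments is not stated in this section.

```python
# Shares (%) of nonzero weights per bit-length split, copied from Table I.
# Keys are the (first, second) subword bit lengths; values are the shares of
# weights keeping only the first subword, only the second, or full precision.
table_i = {
    (2, 6): (99.9, 0.1, 0.0),
    (3, 5): (97.6, 2.3, 0.1),
    (4, 4): (57.7, 37.6, 4.7),
    (5, 3): (18.4, 76.4, 5.2),
    (6, 2): (6.0, 90.7, 3.3),
}


def imbalance(shares):
    """Hypothetical balance measure: gap between the two single-subword shares."""
    first, second, _full = shares
    return abs(first - second)


best_split = min(table_i, key=lambda split: imbalance(table_i[split]))
print(best_split)  # (4, 4)
```

Under this measure the {4,4} split is selected, which agrees with the 4-bit optimum reported for this layer.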
IV-C Hardware Modification for Subword-Level Compression
The top-level hardware architecture for implementing the subword-level compressed model is similar to that in Fig. 6 (a). Each column group in the weight tile can pack up to 16 sparse columns of the original weight matrix, and the input preparation part is the same as before. The major difference from the previous design is the MAC unit at each node of the systolic array. Fig. 10 shows the MAC unit modified to support subword-level compression. Each nonzero entry of the weight tile can be a Subword , a Subword , or an 8-bit {Subword , Subword } when two weights are merged or full precision is used for one weight. Therefore, up to 2 bits ( and ) are selected to multiply with the subwords stored at each node. An extra full adder is used to invert the input activation for Subword if the two subwords have opposite signs. As mentioned in Section IV-B, there are three bit-length combinations for the subwords. All three are supported by the MAC unit and selected by the 2-bit signal . In addition, extra index information needs to be stored for the compressed weight matrix. Originally, a 4-bit index is needed at each node to select a 1-bit input from the 16 input bits. After subword-level compression, two 4-bit indices are needed, one for Subword and the other for Subword . When an 8-bit full-precision weight is stored at the node, the two indices have the same value and select the same input bit (=). Although the modification introduces energy and area overhead, the overall energy efficiency and area efficiency improve because of the better compression performance. A detailed analysis is given in the experimental results.
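The node behavior described above can be approximated by a small functional model in Python. It is a sketch under several assumptions that are not taken from Fig. 10: the two subwords are called `upper` and `lower`, the upper subword is weighted by a left shift of `lower_bits`, and signs are modeled as multipliers of +1 or -1 instead of the full-adder-based inversion.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Node:
    upper: int            # magnitude of the upper subword (0 if absent)
    lower: int            # magnitude of the lower subword (0 if absent)
    idx_upper: int        # 4-bit index: which of the 16 input bits feeds `upper`
    idx_lower: int        # 4-bit index: which of the 16 input bits feeds `lower`
    sign_upper: int = 1   # +1 or -1
    sign_lower: int = 1   # +1 or -1
    lower_bits: int = 4   # 3, 4, or 5 for the three supported splits


def node_product(node: Node, input_bits: Sequence[int]) -> int:
    """Partial product of one node for one set of 16 one-bit inputs."""
    assert len(input_bits) == 16
    hi = node.sign_upper * input_bits[node.idx_upper] * (node.upper << node.lower_bits)
    lo = node.sign_lower * input_bits[node.idx_lower] * node.lower
    return hi + lo


bits = [0] * 16
bits[3] = 1

# Full-precision weight 0xA5: both indices select the same input bit.
full = Node(upper=0xA, lower=0x5, idx_upper=3, idx_lower=3)
print(node_product(full, bits))    # 165

# Two merged weights from different columns: each subword has its own index.
merged = Node(upper=0xA, lower=0x5, idx_upper=3, idx_lower=7)
print(node_product(merged, bits))  # 160
```

The model also makes the index overhead visible: each node carries two 4-bit indices instead of one.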
V Experimental Results
In this section, we evaluate the performance of the proposed compression method and compare it with previous compression techniques. The weight-level compression results are presented first in Section V-B. Then, the models are further pruned and compressed at the subword level to improve the compression results in Section V-C. After that, we implement and synthesize the hardware architectures presented in the previous sections, and report the throughput and energy efficiency of the implementations in Section V-D. The buffers are implemented in SRAM and modeled in CACTI 7.0 [19] using a 45 nm process node to estimate energy and area. Other components, including the systolic array, the accumulation units, and the input register array, are synthesized using Synopsys Design Compiler with the Nangate 45 nm Open Cell Library [20]. For comparison with [7], we also implement the models compressed by conflict pruning in a baseline systolic array system similar to that of [7].
V-A Benchmarks
We evaluate the compression method on three popular datasets: the CIFAR-10 dataset [16] with color images of 10 classes, the CIFAR-100 dataset [16] with images of 100 classes, and the ImageNet dataset [13] with images of 1000 classes. We adopt the same network architecture as [7] to make a fair comparison with the conflict pruning method. The neural networks are composed of 19 convolutional layers and 1 fully connected layer [21]. Stochastic gradient descent is used for training. Similar to [7], the Nesterov momentum is set to 0.9. The learning rate is initialized to 0.1 and gradually decays to 0. L1 regularization is used with the strength equal to . The neural networks on the three datasets are trained for 300, 300, and 120 epochs, respectively, with a batch size of 128. After training, the models with unstructured weight sparsity are obtained. For subword pruning, the models are further retrained for 20, 20, and 8 epochs, respectively, to recover the accuracy.
V-B Performance of Weight-Level Tight Compression
The initial number of weights without pruning is 1.91M in the CIFAR-10 benchmark. During training, 93.3% of the weights can be pruned with negligible accuracy loss. The top-1 accuracy after pruning is 93.10%. Similarly, the CIFAR-100 and ImageNet benchmarks initially have 3.33M and 4.49M weights, respectively. The pruning rates in the two benchmarks are 90.4% and 88.6%, respectively. After pruning, 75.09% top-1 accuracy is obtained on CIFAR-100, and 56.616% top-1 accuracy is obtained on ImageNet. Fig. 11 shows the row sections of a typical weight matrix (, CONV15) in the CIFAR-10 benchmark before and after compression. Each row section has a size of . In this matrix, 93.3% of the weights are pruned; they are represented by black dots, and the remaining weights by white dots.
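The weight counts and pruning rates quoted above determine how many nonzero weights remain in each model. The short Python check below reproduces that arithmetic; the dictionary layout and variable names are only illustrative.

```python
# (initial number of weights, pruning rate) as quoted in the text.
benchmarks = {
    "CIFAR-10": (1.91e6, 0.933),
    "CIFAR-100": (3.33e6, 0.904),
    "ImageNet": (4.49e6, 0.886),
}

# Remaining nonzero weights = total * (1 - pruning rate).
remaining = {
    name: total * (1.0 - rate) for name, (total, rate) in benchmarks.items()
}

for name, count in remaining.items():
    print(f"{name}: {count / 1e6:.2f}M nonzero weights")
```

The results, about 0.13M, 0.32M, and 0.51M, agree with the nonzero-weight counts listed for tight compression in Table II.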
Directly packing the columns of the entire weight matrix without conflict pruning achieves only a limited compression rate of 2.4. The first row section after column packing is shown in Fig. 11. The compressed weight matrix has a size of , and only 17% of its weights are nonzero. In this case, the high sparsity cannot be fully utilized to improve energy efficiency and throughput. After partitioning the weight matrix and packing each row section independently (without permutation), a better compression result with a nonzero weight density of 53% is obtained. Fig. 11 shows the row section [0][5] in the compressed weight matrix. The density can be further improved by weight permutation. Through weight-level tight compression, the 512 columns in the first row section can be packed into 61 groups, compressing the row section by 8.4 times. In the compressed row section, 78% of the weights are nonzero, which is much denser than the original sparse format. Similar improvement is observed for the other row sections, as shown in Fig. 11. The compressed weight matrix tends to have denser column groups on the left. This is because the packing starts from the left and searches the remaining columns to form the densest group at each step. The weight permutation causes no accuracy loss.
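The relation between column groups, compression rate, and density in a row section can be written out in a few lines of Python. The function names and the toy nonzero count are illustrative assumptions; only the 512 columns and 61 groups come from the text.

```python
def compression_rate(n_columns: int, n_groups: int) -> float:
    """Columns of the original row section per column group after packing."""
    return n_columns / n_groups


def packed_density(nonzeros: int, n_rows: int, n_columns: int) -> float:
    """Fraction of matrix positions that hold a nonzero weight."""
    return nonzeros / (n_rows * n_columns)


rate = compression_rate(512, 61)
print(f"{rate:.1f}")  # 8.4

# With the nonzero count and row count unchanged, packing scales the density
# by the same factor as the compression rate (toy numbers, not measured data).
before = packed_density(nonzeros=100, n_rows=10, n_columns=512)
after = packed_density(nonzeros=100, n_rows=10, n_columns=61)
print(after / before)
```

This is why a higher compression rate of a row section translates directly into a denser compressed matrix.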
Table II. Pruning rate, matrix compression rate, and accuracy after compression.

| Benchmark | Performance | Conflict Pruning [7] | Weight-Level Tight Compression | Subword-Level Tight Compression |
| --- | --- | --- | --- | --- |
| CIFAR-10 | Pruning Rate (%) | 84.7 | 93.3 | 93.3 |
| | Number of Nonzero Weights | 0.3M | 0.13M | 0.13M (4.64% in full precision) |
| | Matrix Compression Rate | 5.88 | 10.28 | 14.13 |
| | Top-1 Accuracy (%) | 92.9 | 93.1 | 92.16 |
| CIFAR-100 | Pruning Rate (%) | 80.5 | 90.4 | 90.4 |
| | Number of Nonzero Weights | 0.65M | 0.32M | 0.32M (10.33% in full precision) |
| | Matrix Compression Rate | 4.60 | 8.30 | 11.10 |
| | Top-1 Accuracy (%) | 75.08 | 75.09 | 74.32 |
| ImageNet | Pruning Rate (%) | 66.6 | 88.6 | 88.6 |
| | Number of Nonzero Weights | 1.5M | 0.51M | 0.51M (17.90% in full precision) |
| | Matrix Compression Rate | 2.69 | 6.52 | 8.66 |
| | Top-1 Accuracy (%) | 55.99 | 56.61 | 55.70 |
The pruning rate, matrix compression rate, and accuracy after compression are summarized in Table II. Compared to the conflict pruning method [7], weight-level tight compression achieves a larger pruning rate. This is because conflict pruning introduces accuracy loss, so more weights have to be kept to reach the same level of accuracy. In the CIFAR-10 benchmark, the number of nonzero weights after tight compression is 2.3 times smaller than with conflict pruning. The matrix compression rate improves from 5.88 to 10.28, which leads to significant improvements in throughput and energy efficiency (shown in Section V-D). Similarly, on the CIFAR-100 dataset, the pruning rate of tight compression is about 10 percentage points higher than that of conflict pruning when the two models have the same accuracy. As a result, the number of nonzero weights after tight compression is 2.0 times smaller than with conflict pruning, and the matrix compression rate improves from 4.60 to 8.30. On the ImageNet dataset, the number of nonzero weights after tight compression is 2.9 times smaller than with conflict pruning, and the matrix compression rate improves from 2.69 to 6.52.
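The reduction factors quoted in this paragraph follow directly from the entries of Table II. The Python snippet below recomputes them; the tuple layout is an illustrative choice, and the gain in compression rate is a derived ratio that the text does not report explicitly.

```python
# Per benchmark, from Table II:
# (nonzeros with conflict pruning [7], nonzeros with tight compression,
#  compression rate with [7], compression rate with tight compression)
table_ii = {
    "CIFAR-10": (0.30e6, 0.13e6, 5.88, 10.28),
    "CIFAR-100": (0.65e6, 0.32e6, 4.60, 8.30),
    "ImageNet": (1.50e6, 0.51e6, 2.69, 6.52),
}

nonzero_reduction = {}
rate_gain = {}
for name, (nz_cp, nz_tc, rate_cp, rate_tc) in table_ii.items():
    nonzero_reduction[name] = nz_cp / nz_tc   # "times smaller" in the text
    rate_gain[name] = rate_tc / rate_cp       # derived, not reported
    print(f"{name}: {nonzero_reduction[name]:.1f}x fewer nonzeros, "
          f"{rate_gain[name]:.2f}x higher compression rate")
```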
We also compare tight compression with some structured pruning methods in terms of the model size and accuracy. The results are shown in Table III. Compared to structured pruning, tight compression can achieve a smaller matrix size with the same level of accuracy.
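Table III. Comparison of tight compression with structured pruning methods.

| Performance | Network Slimming [8] | Filter Pruning [9] | Filter Pruning [10] | Tight Compression (Weight-Level) |
| --- | --- | --- | --- | --- |
| Pruning Rate (%) | 88.5 | 64.0 | 86.5 | 93.3 |
| Weight Matrix Size after Compression | 2.30M | 5.4M | 2.02M | 0.18M |
| Top-1 Accuracy (%) | 93.8 | 93.4 | 90.9 | 93.1 |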
V-C Performance of Subword-Level Tight Compression
For better compression performance, the sparse model is further pruned and compressed at the subword level using the method presented in Section IV. The maximum allowable deviation () for the subword pruning is set to 0.3 for the CIFAR-10 benchmark. The weight pruning rate is the same as before (i.e., 93.3%). However, 95.36% of the nonzero weights have only one subword (either Subword or Subword ). A small fraction of the nonzero weights (4.64%) preserves 8-bit precision to limit the accuracy loss. The accuracy loss induced by subword pruning, relative to the weight-level model, is 0.94 percentage points. After the subword-level compression, the total size of the weight matrices is reduced from 1.91M to 0.135M, and a large matrix compression rate of 14.13 is achieved for the model. The results are summarized in Table II. Compared to conflict pruning and weight-level tight compression, the matrix compression rate is significantly improved by the fine-grained subword-level compression. Fig. 11 shows the row section [0][5] in the compressed weight matrix (CONV15 in the CIFAR-10 benchmark). A large compression rate of is achieved at the layer, which is close to the upper bound (parameter in Algorithm 2). Compared to weight-level tight compression, the density of nonzeros in the weight matrix improves from 78% to 101%; it exceeds 100% because of the merging of subwords.
Similar improvement is achieved in the CIFAR-100 benchmark. The maximum allowable deviation () for the subword pruning is set to 0.27, and only 10.33% of the nonzero weights preserve 8-bit full precision. The others have only one nonzero subword. The accuracy loss induced by subword pruning, relative to the weight-level model, is 0.77 percentage points. Compared to weight-level tight compression, the total size of the weight matrices is reduced from 0.40M to 0.30M, so the compression rate of the weight matrix improves from 8.30 to 11.10. In the ImageNet benchmark, is set to 0.23, and 17.90% of the nonzero weights preserve full precision after subword pruning. Compared to weight-level tight compression, the compression rate of the weight matrix is further improved from 6.52 to 8.66 at the cost of an accuracy loss of 0.91 percentage points.
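The accuracy losses and compression rates quoted for subword-level compression can be cross-checked against Table II with a few lines of Python. The check treats each quoted loss as the difference between the weight-level and subword-level top-1 accuracies, which is an interpretation on our part, though it reproduces the three quoted figures. The compression rates are recomputed from the rounded sizes given in the text, so they match the reported values only approximately.

```python
# (weight-level top-1 %, subword-level top-1 %) from Table II.
accuracy = {
    "CIFAR-10": (93.10, 92.16),
    "CIFAR-100": (75.09, 74.32),
    "ImageNet": (56.61, 55.70),
}

# Accuracy loss in percentage points caused by subword pruning.
drops = {name: round(w - s, 2) for name, (w, s) in accuracy.items()}
print(drops)

# Compression rate = uncompressed weights / compressed matrix size (millions).
cifar10_subword_rate = 1.91 / 0.135
cifar100_weight_rate = 3.33 / 0.40
cifar100_subword_rate = 3.33 / 0.30
print(cifar10_subword_rate, cifar100_weight_rate, cifar100_subword_rate)
```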
V-D Analysis of Throughput and Energy Efficiency
Since conflict pruning induces accuracy loss, uneven pruning is performed in [7] to maintain high accuracy while reducing the model size as much as possible. Specifically, less aggressive pruning and column packing (1-4 columns per group) are performed in the early layers, which are relatively small and have less capacity for compression, and more aggressive pruning is performed in the subsequent large layers. Although the size of the large layers can be reduced, the improvement in throughput is limited because the first few layers have higher computational complexity (due to the large input volume) and take more time to process. In the proposed tight compression method, uniform pruning is performed, so all the layers can be compressed efficiently to improve the throughput and energy efficiency. Fig. 12 shows the execution time of each layer in the CIFAR-10 benchmark. Compared to conflict pruning, tight compression substantially reduces the overall execution time. Fig. 12 also shows that the fine-grained subword-level tight compression achieves a higher throughput than the weight-level tight compression.
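Table IV. Throughput, energy efficiency, and area efficiency of the different compression methods.

| Benchmark | Performance | Conflict Pruning [7] | Weight-Level Tight Compression: Value | Weight-Level: Improvement over [7] | Subword-Level Tight Compression: Value | Subword-Level: Improvement over [7] |
| --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | Throughput (frames/s) | 563.2 | 1196.8 | 2.12× | 1550.5 | 2.75× |
| | Energy Efficiency (frames/J) | 2440.4 | 3827.4 | 1.57× | 4537.4 | 1.86× |
| | Area Efficiency (frames/s/mm²) | 332.3 | 602.9 | 1.81× | 695.6 | 2.09× |
| | Top-1 Accuracy (%) | 93.0 | 93.0 | 0 | 92.14 | -0.86% |
| CIFAR-100 | Throughput (frames/s) | 257.4 | 460.0 | 1.79× | 581.1 | 2.26× |
| | Energy Efficiency (frames/J) | 1066.6 | 1441.7 | 1.35× | 1687.2 | 1.58× |
| | Area Efficiency (frames/s/mm²) | 151.8 | 231.7 | 1.53× | 260.7 | 1.72× |
| | Top-1 Accuracy (%) | 74.80 | 75.06 | +0.26% | 74.17 | -0.63% |
| ImageNet | Throughput (frames/s) | 53.8 | 101.9 | 1.89× | 133.4 | 2.48× |
| | Energy Efficiency (frames/J) | 238.6 | 336.5 | 1.41× | 405.0 | 1.70× |
| | Area Efficiency (frames/s/mm²) | 31.8 | 51.3 | 1.61× | 59.8 | 1.88× |
| | Top-1 Accuracy (%) | 51.28 | 55.86 | +4.58% | 55.49 | +4.21% |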
The comparison of throughput, energy efficiency, and area efficiency among the different compression methods is summarized in Table IV. Compared to conflict pruning, weight-level tight compression improves the throughput and energy efficiency by 2.12 and 1.57 times, respectively, on the CIFAR-10 dataset. Fig. 13 shows the synthesis results for the different implementations. The power consumption and area of the systolic array are slightly larger than in the baseline case. Since more columns can be packed into one group through tight compression, the input register array is larger than in the baseline implementation for conflict pruning, which supports at most eight columns per group. The address LUT also adds some area overhead. As a result, the area of the implementation for weight-level tight compression is 20% larger than that of the baseline implementation for conflict pruning. However, because the throughput increases, the overall area efficiency improves by 1.81 times. In the CIFAR-100 benchmark, the throughput, energy efficiency, and area efficiency improve by 1.79, 1.35, and 1.53 times, respectively. Similarly, in the ImageNet benchmark, the throughput, energy efficiency, and area efficiency improve by 1.89, 1.41, and 1.61 times, respectively. On the ImageNet dataset, the top-1 accuracies of the implementations for weight-level tight compression and conflict pruning are 55.86% and 51.28%, respectively. These accuracies differ from those in Table II because the activations are also quantized.
As discussed in Section IV-C, supporting subword-level compression introduces extra energy and area overhead. The overhead has two parts. The first part comes from the modification of the MAC units at each node of the systolic array. The power and area of the systolic array increase by 11% and 17%, respectively, compared to the implementation for weight-level tight compression. The second part comes from the extra index information for the weight matrix. Since the energy consumption and area of the weight buffer are much smaller (10%) than those of the systolic array, the overall overhead is mainly determined by the first part. Because subword-level compression yields better compression performance, the overall throughput, energy efficiency, and area efficiency are still significantly improved, as shown in Table IV.
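The improvement factors in Table IV can be recomputed from the raw values, which also shows the net gain of subword-level over weight-level compression once the hardware overhead is included. The Python sketch below does this; the dictionary keys are illustrative names, and the subword-over-weight-level ratio is a derived quantity that the table does not report.

```python
# Rows copied from Table IV: (conflict pruning [7], weight-level, subword-level).
table_iv = {
    "CIFAR-10": {
        "throughput": (563.2, 1196.8, 1550.5),
        "energy_eff": (2440.4, 3827.4, 4537.4),
        "area_eff": (332.3, 602.9, 695.6),
    },
    "CIFAR-100": {
        "throughput": (257.4, 460.0, 581.1),
        "energy_eff": (1066.6, 1441.7, 1687.2),
        "area_eff": (151.8, 231.7, 260.7),
    },
    "ImageNet": {
        "throughput": (53.8, 101.9, 133.4),
        "energy_eff": (238.6, 336.5, 405.0),
        "area_eff": (31.8, 51.3, 59.8),
    },
}


def gains(row):
    """Return (weight-level / baseline, subword / baseline, subword / weight-level)."""
    base, weight_level, subword_level = row
    return weight_level / base, subword_level / base, subword_level / weight_level


for bench, metrics in table_iv.items():
    for metric, row in metrics.items():
        w_over_base, s_over_base, s_over_w = gains(row)
        print(f"{bench:9s} {metric:10s} "
              f"{w_over_base:.2f}x  {s_over_base:.2f}x  {s_over_w:.2f}x")
```

In every row the last ratio is above 1, so the better compression outweighs the added MAC and index overhead in these measurements.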
VI Conclusions
In this paper, we have proposed a compression method that fully utilizes unstructured sparsity to implement neural networks efficiently on the regular systolic array architecture. Two pruning granularities are explored for weight matrix compression. Besides weight pruning, we further prune the model at the subword level to exploit fine-grained subword sparsity and improve the compression performance. After pruning, the sparse weight matrix of each layer is compressed into a small and dense format by permuting the weights to avoid conflicts and facilitate column packing. Through weight permutation, the matrix compression rate is significantly improved compared to state-of-the-art compression techniques. As a result, the throughput and energy efficiency are improved by up to 2.75 and 1.86 times, respectively.
References
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. 25th Int. Conf. Neural Inf. Process. Syst., Dec. 2012, pp. 1097–1105.
[2] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788, 2016.
[3] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015.
[4] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” CoRR, vol. abs/1506.02626, 2015. [Online]. Available: http://arxiv.org/abs/1506.02626
[5] M. Zhu and S. Gupta, “To prune, or not to prune: Exploring the efficacy of pruning for model compression,” CoRR, vol. abs/1710.01878, 2017.
[6] N. P. Jouppi, C. Young, N. Patil et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA ’17. New York, NY, USA: ACM, 2017, pp. 1–12. [Online]. Available: http://doi.acm.org/10.1145/3079856.3080246
[7] H. Kung, B. McDanel, and S. Q. Zhang, “Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization,” in Proc. 24th Int. Conf. Archit. Support for Program. Lang. & Oper. Syst. (ASPLOS). New York, NY, USA: ACM, 2019, pp. 821–834.
[8] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” CoRR, vol. abs/1708.06519, 2017.
[9] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” CoRR, vol. abs/1608.08710, 2016.
[10] Q. Huang, S. K. Zhou, S. You, and U. Neumann, “Learning to prune filters in convolutional neural networks,” CoRR, vol. abs/1801.07365, 2018. [Online]. Available: http://arxiv.org/abs/1801.07365
[11] T. Gale, E. Elsen, and S. Hooker, “The state of sparsity in deep neural networks,” CoRR, vol. abs/1902.09574, 2019.
[12] X. Chen, J. Zhu, J. Jiang, and C. Y. Tsui, “Tight compression: Compressing cnn model tightly through unstructured pruning and simulated annealing based permutation,” in 2020 57th ACM/IEEE Design Automation Conference (DAC), 2020, pp. 1–6.
[13] O. Russakovsky, J. Deng, H. Su et al., “Imagenet large scale visual recognition challenge,” Int. J. Comput. Vis. (IJCV), vol. 115, no. 3, pp. 211–252, Dec. 2015.
[14] S. Jain, S. Venkataramani, V. Srinivasan, J. Choi, K. Gopalakrishnan, and L. Chang, “Biscaled-dnn: Quantizing long-tailed datastructures with two scale factors for deep neural networks,” in 2019 56th ACM/IEEE Design Automation Conference (DAC), 2019, pp. 1–6.
[15] E. Park, D. Kim, and S. Yoo, “Energy-efficient neural network accelerator based on outlier-aware low-precision computation,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018, pp. 688–698.
[16] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Univ. of Toronto, Toronto, Canada, Tech. Rep., 2009.
[17] R. Rao and S. Iyengar, “Bin-packing by simulated annealing,” Comput. Math. Appl., vol. 27, no. 5, pp. 71–82, 1994.
[18] D. F. Wong, H. W. Leong, and C. L. Liu, Simulated Annealing for VLSI Design. Norwell, MA, USA: Kluwer Academic Publishers, 1988.
[19] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, “Cacti 7: New tools for interconnect exploration in innovative off-chip memories,” ACM Trans. Archit. Code Optim., vol. 14, no. 2, pp. 14:1–14:25, Jun. 2017.
[20] J. Knudsen, “Nangate 45nm open cell library,” in CDNLive EMEA, 2008.
[21] B. Wu, A. Wan, X. Yue et al., “Shift: A zero flop, zero parameter alternative to spatial convolutions,” CoRR, vol. abs/1711.08141, 2017.