Tight Compression: Compressing CNN Through Fine-Grained Pruning and Weight Permutation for Efficient Implementation

The unstructured sparsity after pruning poses a challenge to the efficient implementation of deep learning models in existing regular architectures like systolic arrays. On the other hand, coarse-grained structured pruning is suitable for implementation in regular architectures but tends to have higher accuracy loss than unstructured pruning when the pruned models are of the same size. In this work, we propose a model compression method based on a novel weight permutation scheme to fully exploit the fine-grained weight sparsity in the hardware design. Through permutation, the optimal arrangement of the weight matrix is obtained, and the sparse weight matrix is further compressed to a small and dense format to make full use of the hardware resources. Two pruning granularities are explored. In addition to the unstructured weight pruning, we also propose a more fine-grained subword-level pruning to further improve the compression performance. Compared to the state-of-the-art works, the matrix compression rate is significantly improved from 5.88x to 14.13x. As a result, the throughput and energy efficiency are improved by 2.75 and 1.86 times, respectively.


I Introduction

Convolutional Neural Networks (CNNs) have achieved impressive progress over the past few years in various domains, such as image classification [1], object detection [2], and semantic segmentation [3]. At the same time, neural network models are becoming deeper and heavier to achieve better performance. The huge number of weights increases the hardware resource requirements and the energy consumption of inference, making it challenging to implement CNNs on embedded systems. In recent works, pruning techniques have been studied to address this limitation [4, 5]. It is observed that contemporary neural networks tend to be over-parameterized, and a large portion of the weights can be removed to reduce the model complexity and the heavy cost of inference. After the remaining weights are fine-tuned, this pruning process has only a marginal impact on accuracy. For example, 89% of the weights in the AlexNet [1] model were pruned without accuracy loss in [4].

Although great progress has been made in network pruning, such a high pruning rate does not directly lead to the same degree of energy saving and throughput improvement in hardware accelerators. Many state-of-the-art CNN accelerators employ systolic arrays to implement the intensive multiply-accumulate (MAC) computations during inference [6, 7]. As one of the most popular architectures for implementing CNN models, the systolic array has the advantages of a regular structure, parallel computation, and data reuse capability. Therefore, it can execute the intrinsic matrix multiplications of CNNs with high energy efficiency and throughput. However, it is not easy for the regular structure of the systolic array to take full advantage of the fine-grained sparsity in pruned network models. Since the nonzero weights are irregularly distributed in the weight matrix, unstructured pruning cannot efficiently reduce the size of the weight matrix of each layer. Therefore, most of the zero weights are still mapped to the systolic array nodes, making it difficult to improve the throughput and energy efficiency. As one potential way to tackle this problem, structured pruning techniques have been explored [8, 9, 10], where the pruning is performed at a larger granularity, such as channel-wise or kernel-wise. However, structured pruning tends to have higher accuracy loss than fine-grained unstructured pruning when the pruned models are of the same size [11, 7].

In this work, we propose a tight compression method to compress the unstructurally-pruned CNN model to a small and dense format, so that the systolic array can be fully utilized in implementing the pruned neural network. Compared to the state-of-the-art model compression techniques, the proposed method can achieve a higher compression rate of the weight matrix, which leads to significant improvements in throughput and energy efficiency. In summary, this work makes the following contributions:

  • A compression method is proposed to implement CNN models efficiently in the systolic array system. The unstructured pruning is carried out first. Then, the sparse weight matrix is partitioned according to the size of the systolic array, and the rows and columns are permuted to facilitate packing. Simulated annealing (SA) is used during the permutation process to obtain the optimal arrangement of the weight matrix for the final compression step. After permutation, the sparse weight matrix can be compressed to a small and dense format to make full use of the hardware resources.

  • Two pruning granularities are explored for weight permutation and matrix compression. The design based on the unstructured weight pruning is presented first (i.e. weight-level compression). Then, the model is further pruned at the subword level to exploit the fine-grained subword sparsity for improving the compression performance (i.e. subword-level compression).

  • The hardware structure of the systolic array is taken into account during the compression process. Instead of only considering the pruning rate, we adopt the compression rate of the weight matrix as the optimization objective for the simulated annealing-based weight permutation.

  • A systolic array architecture is designed for implementing the compressed CNN models. Experimental results show that the proposed compression method can prune over 93% of the weights and compress the size of the weight matrices by 14.13 times. As a result, significant improvements can be achieved in throughput and energy efficiency.

This work is an extension of our prior paper [12]. The extended materials include 1) a more fine-grained subword-level compression method to further improve the compression performance, and 2) the hardware architecture to support the subword-level compressed models.

II Preliminaries

Fig. 1: CNN Model and the MAC Operation at One Neuron.

II-A CNN Model and Systolic Array

A convolutional neural network (CNN) is a machine learning model inspired by the brain. It is organized as a stack of layers to extract features hierarchically, as shown in Fig. 1. Convolutional (CONV) layers are the core layers for feature processing and thus account for most of the arithmetic computations. At each neuron, the weighted sum of the input activations is computed. Then, the multiplication and accumulation (MAC) result is sent to the activation function to calculate the output activation. Since the intrinsic computation at each layer can be regarded as a matrix multiplication between weights and input activations, CNN models are highly suitable to be implemented using systolic arrays.

Fig. 2 (a) shows the general architecture of a systolic array. Each node in the systolic array is a processing unit connected with the four neighboring nodes. In the widely used weight-stationary computation flow, the weight matrix is stored in the local registers of the nodes and will not move during the matrix multiplication [6, 7]. Each row in the systolic array is mapped with the weights of the same neuron. The input activations are sent into the array from the bottom and gradually propagate to the top. As the input activations propagate across the nodes, the weight stored in each node is continually multiplied with the input activations received from the bottom node. The product is accumulated with the corresponding partial sum received from the left node and then propagates to the right. All the weight-activation products related to the same output will be summed up along the row and sent out of the systolic array from the right.
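The weight-stationary flow described above can be illustrated with a purely functional sketch. It ignores cycle timing and skew, and the function names (`weight_stationary_matvec`, `idle_nodes`) are hypothetical helpers introduced only for this illustration. It shows that each array row produces one output by accumulating products along the row, and that every zero weight in a tile still occupies a node.

```python
def weight_stationary_matvec(tile, activations):
    """Functional model of one pass through a weight-stationary array.

    tile[r][c] is the weight held by the node at row r, column c.
    activations[c] is the input that propagates along column c.
    """
    outputs = []
    for row in tile:
        partial_sum = 0
        # The partial sum travels along the row, collecting one product per node.
        for weight, activation in zip(row, activations):
            partial_sum += weight * activation
        outputs.append(partial_sum)
    return outputs


def idle_nodes(tile):
    """Count nodes that hold a zero weight and therefore do no useful work."""
    return sum(1 for row in tile for weight in row if weight == 0)
```

For a sparse tile, `idle_nodes` grows with the sparsity, which is the inefficiency that the compression methods discussed later try to remove.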

II-B Related Works

II-B1 Pruning

Network pruning seeks to minimize the number of nonzero weights in the over-parameterized CNN model to reduce the model complexity and the cost of inference. In [4], Han et al. propose to prune the weights according to their magnitude. Weights with small magnitude values are considered to have a relatively small contribution to the model quality and thus can be removed. After pruning, the remaining weights are retrained to recover the accuracy. The method can achieve a pruning rate of 89% in AlexNet [1]. In [5], an automated pruning algorithm is proposed to gradually prune the small-magnitude weights to a preset level of sparsity with minimal retraining requirement. Considerable pruning rates have been achieved in previous works. However, it is difficult to fully exploit the unstructured sparsity in the regular hardware architectures like systolic arrays. Since the zero weights are randomly spread across the matrix, the size of the weight matrix cannot be reduced efficiently by unstructured pruning, and thus most zero weights still have to be allocated to the nodes of the systolic array. The nodes mapped with zero weights will not perform effective computations, thus affecting the overall improvements in throughput and energy efficiency.
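As a minimal sketch of magnitude-based pruning in the spirit of [4] (not the implementation used in that work), the function below zeroes the smallest-magnitude entries of a small weight matrix. The name `magnitude_prune` and the list-of-lists representation are assumptions made for illustration, and the retraining step is omitted.

```python
def magnitude_prune(weights, pruning_rate):
    """Zero out the smallest-magnitude entries of a 2-D list of weights.

    pruning_rate is the fraction of all weights to remove (0.0 to 1.0).
    """
    # Rank every weight position by the magnitude of its weight.
    ranked = sorted(
        (abs(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    )
    num_pruned = int(len(ranked) * pruning_rate)
    pruned = [list(row) for row in weights]
    for _, r, c in ranked[:num_pruned]:
        pruned[r][c] = 0.0
    return pruned
```

The surviving nonzeros stay at their original positions, which is why the resulting sparsity is irregular.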

II-B2 Structured Pruning

Structured pruning techniques have been explored in recent works to implement sparse neural networks in the existing regular architectures [8, 9, 10]. Instead of pruning individual weights, structured pruning is performed at a larger granularity, such as an entire row or column in the weight matrix. After pruning, the model has a structured sparsity that can be mapped to the systolic array efficiently. However, the number of nonzero weights after structured pruning is usually larger than that after fine-grained pruning if the same level of accuracy is maintained [11, 7]. For example, 4.1M nonzero weights are preserved after the unstructured pruning [7] on the ImageNet dataset [13], whereas 23.2M nonzero weights are kept after the structured pruning [8].

Fig. 2: (a) Matrix Multiplication in the Systolic Array and (b) the Conflict Pruning-Based Compression [7].

II-B3 Column Packing and Conflict Pruning

Another way to map sparse networks efficiently to the systolic array is to reduce the size of the sparse weight matrix by packing the columns. A novel method is proposed in [7]. Moderate unstructured pruning is carried out first. Then, different weight columns are grouped, so that later the entire group can be mapped to a single column of the systolic array to save energy and execution time. Inside the group, only one nonzero weight is allowed at each row position to make sure that each node only needs to store one weight and handle one MAC operation at a time. To perform matrix multiplication, the input activations also need to be grouped and sent to the corresponding columns of the systolic array. At each node, only one activation is selected to multiply with the weight. Since the partial results are accumulated along the rows, column packing does not cause any error or hardware overhead in the accumulation process. However, compared to conventional unstructured pruning, extra accuracy loss is induced by a second pruning step called conflict pruning. Since there are usually hundreds or thousands of rows in the weight matrix, most columns are very likely to have some nonzeros at the same row positions. These nonzeros are called conflicts between the columns. Directly packing the weight matrix has only a limited benefit in matrix compression, as it is difficult to find a set of columns that has no conflict. To make the packing more efficient, a specific number of conflicts are allowed in each column group in [7]. The model then undergoes a second pruning step, where all the conflicts except the one with the largest magnitude at each row are pruned in every column group. This compression method is illustrated in Fig. 2 (b). Two pairs of conflicts exist in the column group, and the nonzero weights 3 and 5 will be pruned during conflict pruning. After mapping the column group to the systolic array, the appropriate input is selected at each node to compute with the weight, as shown in Fig. 2 (b).
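The conflict pruning step can be sketched as follows. This is an illustrative reading of the description above rather than the code of [7]; `count_conflicts` and `conflict_prune` are hypothetical names, and the example group is made up rather than taken from Fig. 2 (b).

```python
def count_conflicts(columns):
    """Count conflicts in a column group.

    columns is a list of columns, each a list of weights indexed by row.
    Every nonzero beyond the first one at a row position is a conflict.
    """
    conflicts = 0
    for row_values in zip(*columns):
        nonzeros = sum(1 for v in row_values if v != 0)
        conflicts += max(0, nonzeros - 1)
    return conflicts


def conflict_prune(columns):
    """Pack a column group into one column, keeping the largest magnitude per row.

    Returns (packed, source), where source[r] is the index (within the group)
    of the column whose weight survives at row r, or None for an all-zero row.
    The source index is what selects the matching input activation at a node.
    """
    packed, source = [], []
    for row_values in zip(*columns):
        best = max(range(len(row_values)), key=lambda i: abs(row_values[i]))
        if row_values[best] == 0:
            packed.append(0)
            source.append(None)
        else:
            packed.append(row_values[best])
            source.append(best)
    return packed, source
```

Every conflict counted by `count_conflicts` corresponds to one nonzero weight that `conflict_prune` discards, which is the source of the extra accuracy loss.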

II-B4 Mixed-Precision Computation

In addition to the pruning techniques, mixed-precision computation is another way to reduce the computational complexity [14, 15]. Since most weights have small values, and only a small portion of weights are orders of magnitude larger, a smaller bit-length (e.g. 8-bit) can be achieved by using mixed precision to represent the weights. In [14], two different scale factors are used for quantizing the weights at a layer. For weights with small magnitude, a narrow quantization range with high resolution is used. On the contrary, for weights with large magnitude, a wide quantization range with low resolution is adopted. A smaller bit-length (e.g. 6-bit) is used for all weights. In [15], the large weights are called outliers and are represented using a large bit-length (e.g. 16-bit). The majority of weights with small magnitude adopt a 4-bit representation to reduce the computations. This mixed-precision computation usually requires special hardware design for implementation. For example, unlike the normal computations that are handled by 4-bit MAC units, the outliers require dedicated high-precision processing elements in [15].

III Weight-Level Tight Compression and Accelerator Design

III-A Overview

To make the column packing efficient, the number of conflicts allowed for each group in [7] is set according to the number of rows in the weight matrix. When the number of rows is equal to 1k, 1750 conflicts are allowed per group. Such a large amount of conflict pruning has a non-trivial impact on the model quality. As a result, more nonzero weights have to be preserved to maintain high accuracy, compared to the efficient unstructured pruning. For example, 0.3M nonzero weights are preserved after conflict pruning with an accuracy of 92.9% on the CIFAR-10 dataset [16]. However, the efficient unstructured pruning can reduce the number of nonzero weights to 0.13M in the same neural network with an accuracy of 93.10%.

Since the weight matrix usually has a much larger size than the systolic array, it needs to be partitioned into smaller weight tiles and mapped to the systolic array multiple times to finish the entire matrix multiplication. In this work, we leverage this weight tiling and propose a novel weight permutation method to avoid conflicts in the potential groups and facilitate column packing without sacrificing accuracy. Different from [7], no conflict pruning is performed, and thus a more aggressive pruning rate is achieved with high accuracy. Through permutation, the size of the weight matrix of each layer can be significantly reduced compared to conflict pruning. As a result, fewer weight tiles need to be mapped to the systolic array for computation, and thus higher throughput and energy efficiency can be achieved.

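Algorithm 1 Tight Compression Overview
Input: Network model; number of training epochs; target pruning rate; max. number of columns per group
Output: Compressed network model
1   Stage 1: Unstructured Pruning
2   (pruning epochs, pruning rates) ← prune-schedule(number of epochs, target pruning rate);
3   ;
4   for each training epoch do
5       if the epoch is a scheduled pruning epoch then
6           magnitude-prune(model, scheduled pruning rate);
        end if
7       train(model);
    end for
8   Stage 2: Weight Matrix Compression
9   for each layer do
10      sparse weight matrix ← matrix-extract(model, layer);
11      matrix-compress(sparse weight matrix, max. number of columns per group);
    end for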

The high-level flow of the proposed compression method is summarized in Algorithm 1. In the first stage, the network model is gradually pruned up to a preset pruning rate over the training epochs. Before training, the subroutine prune-schedule is invoked in Line 2 to schedule the intermediate pruning epochs and the corresponding pruning rates. At each pruning epoch, the subroutine magnitude-prune is invoked to prune the small-magnitude weights in each layer to the scheduled pruning rate. To make an apples-to-apples comparison with the conflict pruning-based compression [7], we adopt the same pruning schedule proposed in [5]. The pruning rate gradually increases in the first half of the training process. After that, the neural network is trained without further pruning to recover accuracy. After training, the compression enters the second stage, where the sparse weight matrix of each layer is further compressed to a small and dense format through weight permutation (matrix-compress). To search for a close-to-optimal permutation, we adopt the simulated annealing algorithm for optimization. Details are discussed in the following subsections.
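To make the first stage concrete, the sketch below builds a gradual pruning schedule that is confined to the first half of training. The cubic ramp, the number of pruning steps, and the name `prune_schedule` are assumptions chosen for illustration; the exact schedule of [5] is not reproduced here.

```python
def prune_schedule(num_epochs, target_rate, num_steps=4):
    """Return {pruning_epoch: pruning_rate} for a gradual pruning schedule.

    All pruning epochs fall in the first half of training, so the second
    half is left for recovering accuracy without further pruning.
    The cubic ramp is an assumed shape, not a value taken from the paper.
    """
    last_prune_epoch = num_epochs // 2
    schedule = {}
    for step in range(1, num_steps + 1):
        epoch = round(step * last_prune_epoch / num_steps)
        progress = step / num_steps
        # Prune quickly at first, then more gently as the target is approached.
        schedule[epoch] = target_rate * (1.0 - (1.0 - progress) ** 3)
    return schedule
```

With 40 epochs and a target rate of 0.9, the pruning epochs are 5, 10, 15, and 20, and the rate reaches the target at epoch 20.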

III-B Weight Permutation and Matrix Compression

As mentioned in Section II-B, directly grouping the columns without conflict pruning cannot compress the weight matrix efficiently, since any conflicts among the columns will make the packing invalid. The intuition of weight permutation is to compress the weight matrix by partitioning it into several sub-matrix sections according to the size of the systolic array and permuting the rows and columns across different sub-matrices to avoid conflicts as much as possible for efficient column packing in each sub-matrix section. As the weight matrix is usually much larger than the systolic array, we need to divide it into tiles and map each tile onto the array for computation. The weights of a column group of a tile are loaded into the corresponding column of the array. Therefore, we only need to make sure there is no conflict among the group columns of each tile instead of the group columns of the whole original weight matrix. It gives much higher flexibility to carry out the matrix compression. An illustrating example is shown in Fig. 3. The weight matrix originally has a size of , and we assume the size of the target systolic array is . For clarity, only the nonzero weights are shown in the figure. Since each pair of the columns has at least one conflict, the columns cannot be grouped directly. For weight permutation, the matrix is firstly divided into row sections. The height of each row section is equal to the number of rows in the systolic array, so that different row sections will not be mapped to the systolic array at the same time. In this case, the columns in each section can be grouped independently to maximize the overall compression rate.

Fig. 3: An Illustrating Example of Weight Permutation.

To explain how permutation works, a one-step row swap is performed between the two row sections, as shown in Fig. 3. Then the permuted matrix is compressed using Algorithm 2. In the beginning, each column group in each row section has only one column entry. The packing starts from the first group and searches the other columns to find the one that has no conflicts with the current group and, at the same time, achieves the densest format (i.e. with the minimum number of zeros) after merging with it (Line 7). If such a column exists, the two groups are merged into a new group in Line 9. If multiple candidates with the same density exist, the first one is combined with the current group. This process repeats until the current group can no longer be combined with any column, either because of conflicts or because the number of columns in the group reaches the upper bound that can be supported by the systolic array. Then the next column group is processed (Line 11). After compression, the number of zero weights is reduced from 17 to 5 in Fig. 3, and the total size of the weight matrix is reduced by 37.5%. The second row section can still be further compressed. Since the packing starts from the left, a different column order leads to different compression results. To exploit this, the columns in the row sections are permuted to search for the optimal order for packing. For example, a one-step column permutation is performed on the second row section in Fig. 3. After compression, the second row section has only 2 column groups. The number of zero weights is reduced from 17 to 1, and the size of the weight matrix is reduced by 50%.

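Algorithm 2 Packing the Permuted Weight Matrix
Input: Permuted weight matrix; max. number of columns per group
Output: Packed weight matrix
1   initialize the packed weight matrix;
2   for each row section do
3       section ← section-extract(permuted weight matrix, row section index);
4       start from the first column group of the section;
5       while the current group index < width(section) do
6           current group ← group-extract(section, current group index);
7           candidate ← find-densest-group(section, current group, max. number of columns per group);
8           if a candidate exists then
9               merge(current group, candidate);
10          else
11              move to the next column group;
            end if
        end while
    end for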
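The greedy packing of one row section can be sketched in a few lines. This is a minimal interpretation of the procedure described above, not the authors' code: `pack_section` is a hypothetical name, a section is a list of rows, and a group is represented by the list of original column indices it contains.

```python
def pack_section(section, max_cols_per_group):
    """Greedily pack the columns of one row section into conflict-free groups.

    Returns a list of groups; each group lists the column indices it merges.
    """
    num_rows = len(section)
    num_cols = len(section[0])
    groups = [[c] for c in range(num_cols)]  # every column starts as its own group

    def occupied(group):
        # Row positions that hold a nonzero weight somewhere in the group.
        return {r for r in range(num_rows) for c in group if section[r][c] != 0}

    i = 0
    while i < len(groups):
        base = occupied(groups[i])
        best_j, best_fill = None, -1
        for j in range(i + 1, len(groups)):
            if len(groups[i]) + len(groups[j]) > max_cols_per_group:
                continue  # merged group would exceed the hardware limit
            candidate = occupied(groups[j])
            if base & candidate:
                continue  # two nonzeros at the same row position: conflict
            fill = len(base | candidate)
            if fill > best_fill:  # strict comparison keeps the first of equal candidates
                best_j, best_fill = j, fill
        if best_j is not None:
            groups[i] = groups[i] + groups.pop(best_j)
        else:
            i += 1  # nothing left to merge; move to the next group
    return groups
```

Because the search always starts from the leftmost group, calling `pack_section` on the same section with its columns reordered can return a different number of groups, which is what the column permutation exploits.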

Fig. 4 shows the result after multiple steps of the mixed permutation in a small weight matrix with two row sections. The columns in each row section can be ordered independently, and each row can be permuted to either row section. Since the row permutation is across different sections, the column order of the permuted row has to follow that of the destination row section. Through permutation, the weight matrix can be compressed efficiently, reducing the number of weight tiles from 4 to 2 (each tile has a size of ).

Fig. 4: Compression Result of a Weight Matrix with 2 Row Sections.

III-C Simulated Annealing Based Permutation

Weight permutation is an effective way to improve the matrix compression rate without accuracy loss. However, finding the optimal order of rows and columns so that the compression can be optimized is computationally intractable. For example, there are possible states in the small matrix shown in Fig. 4. Since the complexity increases explosively with the matrix size, it is impractical to search through the huge state space exhaustively. In this work, we adopt simulated annealing (SA) [17, 18] to find a close-to-optimal solution efficiently.

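Algorithm 3 SA-Based Weight Permutation
Input: Weight matrix; max. number of columns per group; initial and final temperatures; cooling factor; number of iterations at each temperature
Output: Compressed weight matrix after optimization
1   current state ← weight matrix;
2   current packed matrix ← pack(current state);  // Algorithm 2
3   temperature ← initial temperature;
4   iteration counter ← 0;
5   while temperature > final temperature do
6       neighbor state ← neighbor-state(current state);  // Permutation
7       neighbor packed matrix ← pack(neighbor state);
8       energy difference ← delta-energy(neighbor packed matrix, current packed matrix);
9       if random() < exp(−energy difference / temperature) then
10          current state ← neighbor state;
11          current packed matrix ← neighbor packed matrix;
        end if
12      iteration counter ← iteration counter + 1;
13      if iteration counter = number of iterations at each temperature then
14          temperature ← cooling factor × temperature;  // Cooling
15          iteration counter ← 0;
        end if
    end while
16  compressed weight matrix ← current packed matrix;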

The pseudocode of the SA-based weight permutation algorithm is shown in Algorithm 3. The optimization starts from a high temperature where any proposed solution is likely to be accepted. The subroutine neighbor-state is invoked in Line 6 to propose a neighbor state by a one-step random permutation (either row or column). Since the optimal orders of rows and columns are correlated, it is important to mix row swapping and column permutation during the optimization, instead of optimizing one first. Then, the neighbor state is packed and evaluated in Lines 7 and 8. The energy difference in Line 8 is calculated by:

$$\Delta E = \Delta S + \Delta N_{tile} \times S_{array} \qquad (1)$$

where the first term ($\Delta S$) is the difference in the matrix size after packing, and the second term is the difference in the number of weight tiles ($\Delta N_{tile}$) multiplied by the size of the systolic array ($S_{array}$). An example is shown in Fig. 5. Since a slight change in the matrix size may not result in a difference in the number of weight tiles to map to the systolic array, $\Delta N_{tile}$ will be zero in most cases. Once a relatively large difference is made, or the change is at the boundary of a tile (which reduces or increases the number of weight tiles), an extra reward ($\Delta N_{tile} < 0$) or penalty ($\Delta N_{tile} > 0$) will be given. If $\Delta E < 0$, the new state is better and will always be accepted. Otherwise, it will be accepted with a probability of $e^{-\Delta E / T}$, where $T$ is the current temperature. At high temperatures, the optimizer has a larger chance to accept worse solutions, whereas it will become more conservative as the temperature decreases. The temperature is multiplied by the cooling factor after every fixed number of steps (the number of iterations at each temperature) to simulate the cooling process. In this work, and are set to 0.01 and 15, respectively. The empirical value of the initial temperature is from 1000 to 3000, depending on the size of the weight matrix. For the early layers with a small size (e.g. ), an initial temperature of 1000 is enough. For large layers, a higher temperature is needed. is set to for all layers.
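The annealing loop can be sketched end to end on a small matrix. This is an illustrative sketch under several assumptions, not the authors' implementation: the energy is taken to be the packed matrix size plus the number of weight tiles times the array size, the state is a row order plus one column order per row section, the best state seen so far is tracked for convenience, and all function and parameter names are hypothetical. The matrix is assumed to have at least two rows and two columns.

```python
import math
import random


def pack_count(section, max_cols):
    """Greedy packing of one row section; returns the number of column groups."""
    rows = range(len(section))
    groups = [{r for r in rows if section[r][c] != 0} for c in range(len(section[0]))]
    sizes = [1] * len(groups)
    i = 0
    while i < len(groups):
        best_j, best_fill = None, -1
        for j in range(i + 1, len(groups)):
            if sizes[i] + sizes[j] > max_cols or groups[i] & groups[j]:
                continue  # too many columns, or a conflict
            fill = len(groups[i] | groups[j])
            if fill > best_fill:
                best_j, best_fill = j, fill
        if best_j is None:
            i += 1
        else:
            groups[i] |= groups.pop(best_j)
            sizes[i] += sizes.pop(best_j)
    return len(groups)


def energy(matrix, state, array_rows, array_cols, max_cols):
    """Packed matrix size plus (number of tiles x array size); lower is better."""
    row_order, col_orders = state
    size = tiles = 0
    for s, cols in enumerate(col_orders):
        rows = row_order[s * array_rows:(s + 1) * array_rows]
        section = [[matrix[r][c] for c in cols] for r in rows]
        num_groups = pack_count(section, max_cols)
        size += num_groups * len(rows)
        tiles += math.ceil(num_groups / array_cols)
    return size + tiles * array_rows * array_cols


def neighbor(state, rng):
    """One random step: swap two rows, or swap two columns in one row section."""
    row_order, col_orders = list(state[0]), [list(c) for c in state[1]]
    if rng.random() < 0.5:
        a, b = rng.sample(range(len(row_order)), 2)
        row_order[a], row_order[b] = row_order[b], row_order[a]
    else:
        cols = rng.choice(col_orders)
        a, b = rng.sample(range(len(cols)), 2)
        cols[a], cols[b] = cols[b], cols[a]
    return row_order, col_orders


def anneal(matrix, array_rows, array_cols, max_cols,
           t_init, t_final, cooling, iters_per_temp, seed=0):
    rng = random.Random(seed)
    num_sections = math.ceil(len(matrix) / array_rows)
    state = (list(range(len(matrix))),
             [list(range(len(matrix[0]))) for _ in range(num_sections)])
    cur_e = energy(matrix, state, array_rows, array_cols, max_cols)
    best_state, best_e = state, cur_e
    t = t_init
    while t > t_final:
        for _ in range(iters_per_temp):
            cand = neighbor(state, rng)
            cand_e = energy(matrix, cand, array_rows, array_cols, max_cols)
            delta = cand_e - cur_e
            # Better states are always accepted; worse ones with probability exp(-delta / t).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                state, cur_e = cand, cand_e
                if cur_e < best_e:
                    best_state, best_e = state, cur_e
        t *= cooling  # cooling step
    return best_state, best_e
```

Because each section applies its own column order to whichever rows it currently holds, a row that moves to another section automatically follows the column order of its destination section.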

Fig. 5: The Energy Difference for SA.

III-D Hardware Design

The hardware architecture to implement the compressed models is shown in Fig. 6 (a). We adopt the weight-stationary flow described in Section II-A. The overall systolic array architecture is similar to that of [7] (in which bit-serial computation is used) with the following differences. Before computation, one or more row sections are loaded from the off-chip memory to the weight buffer. Assume the size of the systolic array is . To perform matrix multiplication, a weight tile in the order of the permuted and packed matrix is read from the buffer and stored in the nodes of the systolic array. Each column in the tile can pack up to 16 sparse columns of the original weight matrix. Therefore, the input activations also need to be grouped accordingly before being sent to the systolic array. Since the order of the columns and the group composition are different for each row section, the corresponding input activations have to be read from the input buffer and aligned accordingly. The indices of the input channels corresponding to each packed column are known from the compression results and are stored in an address look-up table (LUT). According to the index, the data of each input channel is read from the input buffer and stored in the register array in a way that is aligned with the column packing and order. We use double buffering to prefetch the input data for the 32 systolic array columns to keep the systolic array fully occupied during the computations.

The matrix multiplication is performed in a bit-serial manner similar to [7]. At each cycle, 1 bit of each input activation is shifted into the array. The 16 input bits (from the 16 inputs that are packed into the same column) propagate along the systolic array column, and the input data of adjacent columns have a one-cycle skew. At each node, 1 bit is selected from the 16 input bits to multiply with the 8-bit weight. The partial sum is accumulated along the row and sent to the output buffer. There is a one-cycle skew between the partial sums of adjacent rows. Similar to [7], we adopt an 8-bit representation for the weights and activations and a 32-bit representation for the partial sums. Since it takes 8 cycles to shift in the 8-bit activations and 32 cycles to accumulate the partial sums, each node has 4 MAC units to receive the input data in an interleaved manner to maximize the throughput. In each MAC unit (Fig. 6 (b)), the main part for the bit-serial multiplication contains 8 full adders (FAs). The 4 MAC units share the same weight and index information.
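The arithmetic performed at one node can be illustrated with a functional sketch of bit-serial multiplication with input selection. The sketch assumes unsigned activations that arrive least significant bit first; both are assumptions made for illustration, as is the name `bit_serial_mac`. Timing, skew, and the interleaving of the 4 MAC units are not modeled.

```python
def bit_serial_mac(weight, inputs, select, partial_sum=0, act_bits=8):
    """Multiply a stored weight by one selected activation, one bit per cycle.

    inputs holds the activations packed into this array column, and select is
    the stored index that picks the activation matching the node's weight.
    """
    activation = inputs[select]
    product = 0
    for bit_pos in range(act_bits):
        bit = (activation >> bit_pos) & 1  # the single input bit of this cycle
        if bit:
            product += weight << bit_pos  # shift-and-add step
    return partial_sum + product
```

After `act_bits` cycles the accumulated value equals `partial_sum + weight * inputs[select]`, so the selection logic adds no arithmetic beyond a multiplexer in front of the bit-serial multiplier.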

Fig. 6: (a) The Systolic Array Architecture and (b) a MAC Unit.

The systolic array is composed of four subarrays that can be configured to work in two independent groups to maximize the hardware utilization. In the case that a row section has too few column groups to fill up the entire array, it can be folded with another row section and mapped together to the array to save the overall execution time and energy consumption. For example, after computing the first two tiles of row section 1 in Fig. 6 (a), the last 7 column groups can be mapped to the first subarray, and the other subarrays can be mapped with the column groups of row section 2. After the weight tile finishes computing with the data in the input buffer, the next tile will be loaded into the systolic array for computation. This process will continue until all the tiles in the weight buffer finish the computations with the data in the input buffer. Then, new data will be loaded to the input buffer to continue the computations. Once all the computations related to the current weight tiles are completed, the weight buffer can be updated to compute other output channels. This flow guarantees that each weight in the model will only be read once from the off-chip memory. Moreover, since the input buffer will finish computing with the entire row section before loading new data, all the input channels loaded on-chip can be fully utilized for computations.

IV Subword-Level Tight Compression: Towards Higher Compression Rate

In Section III, we have proposed a weight-level compression method. After compression, the model is uniformly quantized to 8-bit for the deployment in the systolic array system. In this section, we will propose a subword-level compression method which leverages the mixed-precision representation to further enhance compression performance and reduce computational complexity. The compressed mixed-precision weight matrix can be easily supported by the systolic array with minor modification in the MAC units.

Fig. 7: Mixed-Precision Weights (a) in Previous Works [14, 15] and (b) after the Proposed Subword Pruning.

IV-A Subword Pruning

For the subword-level tight compression, there is an extra step between the training stage (with unstructured weight pruning) and the compression stage (with weight permutation): the sparse weight matrix of each layer is quantized to 8 bits and further pruned at the subword level. The intuition of subword pruning is to represent the sparse weight matrix using mixed precisions. In Section II-B, two types of mixed-precision representation are introduced, as shown in Fig. 7 (a). The first type uses the same bit-length but two different scale factors to represent the weights. The second type uses the same scale factor but two different bit-lengths. Unlike the previous works [14, 15], subword pruning changes both the bit-length and the scale factor of a weight according to its magnitude. Fig. 7 (b) illustrates this mixed-precision representation. The nonzero weights at each layer are in three formats after the subword pruning: the Subword L, the Subword H, and the 8-bit full precision. For the weights with small magnitude, the most significant bits (i.e. Subword H) in the 8-bit representation are zero and thus can be removed without incurring accuracy loss. In contrast, for the weights with large magnitude, it is important to preserve a large range, but such high resolution may not be necessary. In this case, the least significant bits (i.e. Subword L) in the 8-bit representation can be removed with negligible accuracy loss. This operation can be interpreted as a fine-grained subword-level pruning, where either the Subword H or the Subword L is pruned for most nonzero weights. A small fraction of weights is allowed to preserve the 8-bit full precision to limit the accuracy loss. Specifically, if the deviation of a weight after the subword pruning is larger than a threshold value (e.g. 25%), full precision is used for that weight. A larger threshold value results in fewer full-precision weights, which benefits the subword-level compression, but it also increases the accuracy loss induced by subword pruning. The neural network needs to be retrained with subword pruning for several epochs to recover the accuracy.
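The pruning rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: weights are treated as unsigned 8-bit magnitudes, the deviation is measured relative to the original weight, and Subword L is dropped by truncation. The function name `subword_prune` and its parameters are hypothetical and are not taken from the paper.

```python
def subword_prune(w, h_bits=4, l_bits=4, max_dev=0.25):
    """Classify one unsigned 8-bit magnitude after subword pruning.

    Returns (kind, value), where kind is 'zero', 'L', 'H', or 'full'.
    Assumptions (not from the paper): truncation when Subword L is
    dropped, and deviation measured relative to the original weight.
    """
    assert h_bits + l_bits == 8 and 0 <= w < 256
    if w == 0:
        return "zero", 0
    if w < (1 << l_bits):
        # Subword H is all zeros, so dropping it is lossless.
        return "L", w
    # Drop Subword L: keep only the h_bits most significant bits.
    approx = (w >> l_bits) << l_bits
    if (w - approx) / w <= max_dev:
        return "H", approx
    # Deviation too large: keep the full 8-bit weight.
    return "full", w


for weight in (5, 23, 100):
    print(weight, subword_prune(weight))
```

With 4-bit subwords and a 25% threshold, the weight 23 stays in full precision in this sketch, because dropping its low subword would leave 16, a deviation of about 30%.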

An illustrative example is shown in Fig. 8. In this example, the bit-length is 4-bit for the subwords, and the allowable deviation for subword pruning is set to 25%. Most nonzero weights only have one nonzero subword after the pruning. A small fraction of weights (e.g. the one with a value of 23) preserves 8-bit precision. Later, this fine-grained subword-level sparsity will be exploited to improve the compression performance. It is worth mentioning that the bit-lengths of the Subword H and Subword L may vary over different layers but are consistent within a layer for efficient model compression. Details for determining the bit-lengths will be discussed shortly.

Fig. 8: Weight Matrix after the Subword Pruning.

IV-B Exploiting the Subword Sparsity for Compression

To illustrate the subword-level compression, the small weight matrix shown in Fig. 8 is first compressed by column packing. (Weight permutation is not applied here for simplicity.) The compressed weight matrix is shown in Fig. 9 (a). In a column group, only one nonzero weight is allowed at each row position. Any conflict among the columns makes the packing invalid. A simple way to exploit the subword sparsity for compression is to divide the original weight matrix into two matrices, one for Subword H and the other for Subword L. Then, each subword matrix can be compressed individually, as shown in Fig. 9 (a). Although the overall matrix size after compression is the same as with the original column packing, each node of the systolic array can be less complex due to the smaller bit-length of the subwords. However, this naive compression method causes some problems. Since a column in the original weight matrix is now split into two columns, each input activation needs to be grouped and fed into the systolic array twice to finish the computations. This increases the energy consumption for input preparation and hence induces non-trivial overhead. Besides, the address LUT (Fig. 6) that stores the indices for aligning the input activations is doubled, since the column group compositions for the two subword matrices are different.

Fig. 9: (a) A Simple Way of Subword-Level Compression and (b) the Proposed Subword-Level Compression.

A more efficient compression method is shown in Fig. 9 (b), where the entire weight matrix is compressed as a whole. Two nonzero weights at the same row are allowed to be merged during the compression, as long as their nonzero subwords are not in the same position. Later, the merged weights will be mapped to the same node of the systolic array and computed as a whole. The density of the weight matrix in Fig. 9 (b) is calculated as the number of nonzero weights divided by the size of the compressed weight matrix. Originally, the weight matrix has a small density of 26%. In the first step, column 3 is selected and grouped with column 0. The two weights at the first row are merged and become an 8-bit {Subword H, Subword L}. After grouping the two columns, the density increases to 31%. This process continues until the first column group can no longer combine with any columns, either due to conflicts or because it has reached the upper bound of columns per group (a parameter of Algorithm 2). Then the next column group is processed. After compression, the weight matrix is much smaller, and its density increases from 26% to 93%. Compared to the previous solution in Fig. 9 (a), no extra overhead is caused by input preparation or the address LUT. This example shows how subword sparsity can help to improve the compression performance. It can be combined with the simulated annealing-based weight permutation proposed in Section III for a complete subword-level compression.
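The merge rule above can be illustrated with a short Python sketch. It is a simplified first-fit packing, not the paper's Algorithm 2: columns are tried in index order rather than by picking the densest candidate, and no weight permutation is applied. The entry labels (`None`, `"L"`, `"H"`, `"full"`) and the names `pack_columns` and `density` are assumptions made for this example.

```python
# Subword positions occupied by each entry type (labels are hypothetical).
OCC = {
    None: frozenset(),        # pruned (zero) weight
    "L": frozenset("L"),      # only Subword L is nonzero
    "H": frozenset("H"),      # only Subword H is nonzero
    "full": frozenset("HL"),  # 8-bit full-precision weight
}


def pack_columns(matrix, max_cols=16):
    """Greedy first-fit column packing with subword merging.

    Two entries in the same row may share a slot only if their
    occupied subword positions do not overlap.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    remaining = list(range(n_cols))
    groups = []
    while remaining:
        seed = remaining.pop(0)
        used = [set(OCC[matrix[r][seed]]) for r in range(n_rows)]
        group = [seed]
        for c in list(remaining):
            if len(group) == max_cols:
                break
            col = [OCC[matrix[r][c]] for r in range(n_rows)]
            if all(not (used[r] & col[r]) for r in range(n_rows)):
                for r in range(n_rows):
                    used[r] |= col[r]
                group.append(c)
                remaining.remove(c)
        groups.append(group)
    return groups


def density(matrix, groups):
    """Nonzero weights divided by the size of the compressed matrix."""
    nnz = sum(e is not None for row in matrix for e in row)
    return nnz / (len(matrix) * len(groups))


example = [["L", "H", None],
           [None, "full", "L"]]
groups = pack_columns(example)
print(groups, density(example, groups))
```

In this toy matrix, columns 0 and 1 merge because their subwords never overlap, while column 2 conflicts with the full-precision weight in the second row and starts a new group.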

To obtain the optimal compression performance, it is preferred to have half of the nonzero weights pruned to Subword H and the other half pruned to Subword L. In this case, the nonzeros at the same row have a larger chance of being merged instead of conflicting. The ratio of Subword H to Subword L depends on how we partition the 8-bit weight. For example, Table I shows the bit-lengths of the subwords and the proportion of each subword type at a layer of the benchmark on CIFAR-10 [16]. The optimal bit-length is 4-bit for both subword types in this example. In the experiment, three optimal combinations are observed for the layers of all the benchmarks: {3-bit Subword H, 5-bit Subword L}, {4-bit Subword H, 4-bit Subword L}, and {5-bit Subword H, 3-bit Subword L} (Fig. 7 (b)). The hardware can be configured to support different combinations.

| Nonzero Weights | Bit-Lengths {Subword H, Subword L} | | | | |
|---|---|---|---|---|---|
| | {2,6} | {3,5} | {4,4} | {5,3} | {6,2} |
| Subword L (%) | 99.9 | 97.6 | 57.7 | 18.4 | 6.0 |
| Subword H (%) | 0.1 | 2.3 | 37.6 | 76.4 | 90.7 |
| 8-Bit Full Precision (%) | 0 | 0.1 | 4.7 | 5.2 | 3.3 |

TABLE I: Bit-Lengths of the Subwords and the Proportion of Each Subword Type at a Layer (CIFAR-10)
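As a small worked check of the preference for a balanced split, the following Python sketch picks the bit-length partition in Table I whose two single-subword shares are closest to each other. The selection criterion (minimum absolute difference between the two shares) and the name `most_balanced_split` are assumptions for illustration; the paper does not state how the optimum was selected.

```python
# Shares (%) of nonzero weights per format, copied from Table I.
# Key: the pair of subword bit-lengths; value: (first single-subword
# share, second single-subword share, 8-bit full-precision share).
TABLE_I = {
    (2, 6): (99.9, 0.1, 0.0),
    (3, 5): (97.6, 2.3, 0.1),
    (4, 4): (57.7, 37.6, 4.7),
    (5, 3): (18.4, 76.4, 5.2),
    (6, 2): (6.0, 90.7, 3.3),
}


def most_balanced_split(table):
    """Return the bit-length pair with the most even subword shares."""
    return min(table, key=lambda k: abs(table[k][0] - table[k][1]))


print(most_balanced_split(TABLE_I))  # (4, 4)
```

Under this criterion the {4,4} partition is selected, which agrees with the 4-bit optimum reported for this layer.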

IV-C Hardware Modification for Subword-Level Compression

The top-level hardware architecture for implementing the subword-level compressed model is similar to that in Fig. 6 (a). Each column group in the weight tile can pack up to 16 sparse columns of the original weight matrix, and the input preparation part is the same as before. One major difference from the previous design is the MAC unit at each node of the systolic array. Fig. 10 shows the MAC unit modified to support the subword-level compression. Each nonzero entry of the weight tile can be a Subword H, a Subword L, or an 8-bit {Subword H, Subword L} in the case that two weights are merged or the full precision is used for one weight. Therefore, up to 2 input bits are selected to multiply with the subwords stored at each node. An extra full adder is used to invert the input activation for one of the subwords if the two subwords have opposite signs. As mentioned in Section IV-B, there are three bit-length combinations for the subwords. All three are supported by the MAC unit and selected by a 2-bit control signal. Besides, extra index information needs to be stored for the compressed weight matrix. Originally, a 4-bit index is needed at each node to select a 1-bit input from the 16 input bits. After subword-level compression, two sets of the 4-bit index are needed, one for Subword H and the other for Subword L. In the case that an 8-bit full-precision weight is stored at the node, the two indices have the same value so that they select the same input bit. Although energy and area overhead is induced by the modification, the overall energy efficiency and area efficiency are improved due to better compression performance. A detailed analysis will be given in the experimental results.
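The behavior of the modified node can be approximated by the following functional model in Python. It is a hedged sketch rather than a description of the circuit in Fig. 10: signs are passed as explicit +1/-1 factors instead of being realized by inverting the input activation, and the function name `mac_node` and all of its parameters are hypothetical.

```python
def mac_node(acc, input_bits, sub_h, sub_l, idx_h, idx_l,
             l_bits=4, sign_h=1, sign_l=1):
    """Functional model of one bit-serial MAC step at a node.

    input_bits: the 16 one-bit inputs available to the node.
    sub_h, sub_l: unsigned subword magnitudes stored at the node
                  (0 if that subword is absent).
    idx_h, idx_l: 4-bit indices selecting the input bit for each subword.
    """
    x_h = input_bits[idx_h]
    x_l = input_bits[idx_l]
    # Subword H carries the upper bits, so it is weighted by 2**l_bits.
    return acc + sign_h * x_h * (sub_h << l_bits) + sign_l * x_l * sub_l


# Full-precision weight 23 = {H=1, L=7}: both indices select the same bit.
bits = [0] * 16
bits[3] = 1
print(mac_node(0, bits, sub_h=1, sub_l=7, idx_h=3, idx_l=3))  # 23

# Two merged weights from different columns with opposite signs.
bits = [0] * 16
bits[2] = 1
bits[9] = 1
print(mac_node(10, bits, sub_h=6, sub_l=5, idx_h=2, idx_l=9, sign_l=-1))  # 101
```

The first call models an 8-bit full-precision weight, and the second models a merged entry whose two subwords come from different original columns.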

V Experimental Results

In this section, we evaluate the performance of the proposed compression method and compare it with the previous compression techniques. The weight-level compression results are presented first in Section V-B. Then, the models are further pruned and compressed at the subword level to improve the compression results in Section V-C. After that, we implement and synthesize the hardware architectures presented in the previous sections, and the throughput and energy efficiency of the implementations are reported in Section V-D. The buffers are implemented by SRAM and modeled in CACTI 7.0 [19] using a 45nm process node to estimate the energy and area. Other components, including the systolic array, the accumulation units, and the input register array, are synthesized using the Synopsys Design Compiler with the Nangate 45nm Open Cell Library [20]. To make a comparison with [7], we also implement the models compressed by conflict pruning in a baseline systolic array system similar to [7].

Fig. 10: The MAC Unit for Subword-Level Compression.

V-A Benchmarks

We evaluate the compression method using three popular datasets: the CIFAR-10 dataset [16] with color images of 10 classes, the CIFAR-100 dataset [16] with images of 100 classes, and the ImageNet dataset [13] with images of 1000 classes. We adopt the same network architecture as [7] to make a fair comparison with the conflict pruning method. The neural networks are composed of 19 convolutional layers and 1 fully connected layer [21]. Stochastic gradient descent is used for training. Similar to [7], the Nesterov momentum is set to 0.9. The learning rate is initialized as 0.1 and gradually decays to 0. L1 regularization is used. The neural networks on the three datasets are trained for 300, 300, and 120 epochs, respectively, with a batch size of 128. After training, the models with unstructured weight sparsity are obtained. For subword pruning, the models are further retrained for 20, 20, and 8 epochs, respectively, to recover the accuracy.

V-B Performance of Weight-Level Tight Compression

The initial number of weights without pruning is 1.91M in the benchmark of CIFAR-10. During training, 93.3% of the weights can be pruned with a negligible accuracy loss. The top-1 accuracy after pruning is 93.10%. Similarly, the benchmarks of CIFAR-100 and ImageNet initially have 3.33M and 4.49M weights, respectively. The pruning rates in the two benchmarks are 90.4% and 88.6%, respectively. After pruning, 75.09% top-1 accuracy is obtained on CIFAR-100, and 56.616% top-1 accuracy is obtained on ImageNet. Fig. 11 shows the row sections of a typical weight matrix (CONV15) in the benchmark of CIFAR-10 before and after compression. Each row section has 512 columns. In this matrix, 93.3% of the weights are pruned and are represented by the black dots. The other weights are represented by the white dots.

Directly packing the columns of the entire weight matrix without conflict pruning can only achieve a limited compression rate of 2.4. The first row section after column packing is shown in Fig. 11. Only 17% of the weights in the compressed weight matrix are nonzero. In this case, the high sparsity cannot be fully utilized to improve energy efficiency and throughput. After partitioning the weight matrix and packing each row section independently (without permutation), a better compression result with a nonzero weight density of 53% is obtained. Fig. 11 shows the row section [0][5] in the compressed weight matrix. The compression density can be further improved by weight permutation. Through the weight-level tight compression, the 512 columns in the first row section can be packed in 61 groups, thus effectively compressing the row section by 8.4 times. 78% of weights are nonzero in the compressed row section, which is much denser than the original sparse format. Similar improvement has been observed for other row sections, as shown in Fig. 11. The compressed weight matrix tends to have denser column groups on the left. This is because the packing starts from the left and searches the remaining columns to form the densest group at each step. No accuracy loss is caused by the weight permutation.

Fig. 11: Weight Matrix Compressed Using Different Methods (Black: Zero Weights, White: Nonzero Weights).

| Benchmark | Performance | Conflict Pruning [7] | Weight-Level Tight Compression | Subword-Level Tight Compression |
|---|---|---|---|---|
| CIFAR-10 | Pruning Rate (%) | 84.7 | 93.3 | 93.3 |
| | Number of Nonzero Weights | 0.3M | 0.13M | 0.13M (4.64% in Full-Precision) |
| | Matrix Compression Rate | 5.88 | 10.28 | 14.13 |
| | Top-1 Accuracy (%) | 92.9 | 93.1 | 92.16 |
| CIFAR-100 | Pruning Rate (%) | 80.5 | 90.4 | 90.4 |
| | Number of Nonzero Weights | 0.65M | 0.32M | 0.32M (10.33% in Full-Precision) |
| | Matrix Compression Rate | 4.60 | 8.30 | 11.10 |
| | Top-1 Accuracy (%) | 75.08 | 75.09 | 74.32 |
| ImageNet | Pruning Rate (%) | 66.6 | 88.6 | 88.6 |
| | Number of Nonzero Weights | 1.5M | 0.51M | 0.51M (17.90% in Full-Precision) |
| | Matrix Compression Rate | 2.69 | 6.52 | 8.66 |
| | Top-1 Accuracy (%) | 55.99 | 56.61 | 55.70 |

TABLE II: Performance of Tight Compression and the Comparison with Conflict Pruning [7]

The pruning rate, matrix compression rate, and accuracy after compression are summarized in Table II. Compared to the conflict pruning method [7], the weight-level tight compression can achieve a larger pruning rate. This is because conflict pruning introduces accuracy loss, so more weights have to be kept to reach the same level of accuracy. The number of nonzero weights after tight compression is 2.3 times smaller than with conflict pruning in the benchmark of CIFAR-10. The matrix compression rate is effectively improved from 5.88 to 10.28, and thus significant improvements can be achieved in throughput and energy efficiency (which will be shown in Section V-D). Similarly, on the CIFAR-100 dataset, the pruning rate in tight compression is about 10 percentage points larger than with the conflict pruning method when the two models are of the same accuracy. As a result, the number of nonzero weights after tight compression is 2.0 times smaller than with the conflict pruning method. The matrix compression rate can be improved from 4.60 to 8.30 by the weight-level tight compression. On the ImageNet dataset, the number of nonzero weights after tight compression is 2.9 times smaller than with the conflict pruning method, and the matrix compression rate can be improved from 2.69 to 6.52.

We also compare tight compression with some structured pruning methods in terms of the model size and accuracy. The results are shown in Table III. Compared to structured pruning, tight compression can achieve a smaller matrix size with the same level of accuracy.

| Performance | Network Slimming [8] | Filter Pruning [9] | Filter Pruning [10] | Tight Compression (Weight-Level) |
|---|---|---|---|---|
| Pruning Rate (%) | 88.5 | 64.0 | 86.5 | 93.3 |
| Weight Matrix Size after Compression | 2.30M | 5.4M | 2.02M | 0.18M |
| Top-1 Accuracy (%) | 93.8 | 93.4 | 90.9 | 93.1 |

TABLE III: Comparison between Tight Compression and Structured Pruning on CIFAR-10

V-C Performance of Subword-Level Tight Compression

For better compression performance, the sparse model is further pruned and compressed at the subword level using the method presented in Section IV. The maximum allowable deviation for the subword pruning is set to 0.3 for the benchmark of CIFAR-10. The weight pruning rate is the same as before (i.e. 93.3%). However, 95.36% of the nonzero weights now have only one subword (either Subword H or Subword L). A small fraction of the nonzero weights (4.64%) preserves 8-bit precision to limit the accuracy loss. The relative accuracy loss induced by subword pruning is 0.94%. After the subword-level compression, the total size of the weight matrices is reduced from 1.91M to 0.135M, and a large matrix compression rate of 14.13 is achieved for the model. The results are summarized in Table II. Compared to conflict pruning and the weight-level tight compression, the matrix compression rate can be significantly improved by the fine-grained subword-level compression. Fig. 11 shows the row section [0][5] in the compressed weight matrix (CONV15 in the benchmark of CIFAR-10). The compression rate achieved at this layer is close to the upper bound on columns per group (a parameter of Algorithm 2). Compared to the weight-level tight compression, the density of nonzeros in the weight matrix is improved from 78% to 101% (above 100% because merged subwords share a single entry).

Similar improvement has been achieved in the benchmark of CIFAR-100. The maximum allowable deviation for the subword pruning is set to 0.27, and only 10.33% of nonzero weights preserve 8-bit full precision. The others have only one nonzero subword. The relative accuracy loss induced by subword pruning is 0.77%. Compared to the weight-level tight compression, the total size of the weight matrices is reduced from 0.40M to 0.30M. Therefore, the compression rate of the weight matrix is improved from 8.30 to 11.10. In the benchmark of ImageNet, the maximum allowable deviation is set to 0.23, and 17.90% of nonzero weights preserve full precision after subword pruning. Compared to the weight-level tight compression, the compression rate of the weight matrix is further improved from 6.52 to 8.66 at the cost of 0.91% of relative accuracy loss.
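The compressed matrix sizes quoted above follow from the initial weight counts and the matrix compression rates. The short Python check below assumes that the matrix compression rate is the initial number of weights divided by the size of the compressed weight matrices, which is consistent with the CIFAR-10 figures in the text. The helper name `compressed_size_m` is hypothetical.

```python
def compressed_size_m(initial_weights_m, compression_rate):
    """Compressed matrix size in millions of entries (assumed definition)."""
    return initial_weights_m / compression_rate


# (benchmark, initial weights in M, weight-level rate, subword-level rate)
benchmarks = [
    ("CIFAR-10", 1.91, 10.28, 14.13),
    ("CIFAR-100", 3.33, 8.30, 11.10),
    ("ImageNet", 4.49, 6.52, 8.66),
]

for name, n_m, rate_weight, rate_subword in benchmarks:
    print(f"{name}: weight-level {compressed_size_m(n_m, rate_weight):.3f}M, "
          f"subword-level {compressed_size_m(n_m, rate_subword):.3f}M")
```

Under this assumption, the CIFAR-100 sizes come out at roughly 0.40M and 0.30M, and the CIFAR-10 subword-level size at roughly 0.135M, matching the values reported in the text.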

V-D Analysis of Throughput and Energy Efficiency

Since accuracy loss will be induced by conflict pruning, uneven pruning is performed in [7] to maintain high accuracy while reducing the model size as much as possible. Specifically, less aggressive pruning and column packing (1-4 columns per group) are performed in the early layers that are relatively small and have less capacity for compression, and more aggressive pruning is performed in the subsequent large layers. Although the size of the large layers can be reduced, the improvement in throughput is limited since the first few layers have higher computational complexity (due to the large input volume) and take more time to process. In the proposed tight compression method, uniform pruning is performed and thus all the layers can be compressed efficiently to improve the throughput and energy efficiency. Fig. 12 shows the execution time of each layer in the benchmark of CIFAR-10. Compared to conflict pruning, the overall execution time can be efficiently reduced through tight compression. Besides, Fig. 12 also shows that the fine-grained subword-level tight compression can achieve a higher throughput than the weight-level tight compression.

Fig. 12: Execution Time of Each Layer in the Benchmark of CIFAR-10 (X Axis: Layer Index, Y Axis: Time (us)).

| Benchmark | Performance | Conflict Pruning [7] | Weight-Level Tight Compression: Value | Weight-Level Tight Compression: Improvement over [7] | Subword-Level Tight Compression: Value | Subword-Level Tight Compression: Improvement over [7] |
|---|---|---|---|---|---|---|
| CIFAR-10 | Throughput (frames/s) | 563.2 | 1196.8 | 2.12 | 1550.5 | 2.75 |
| | Energy Efficiency (frames/J) | 2440.4 | 3827.4 | 1.57 | 4537.4 | 1.86 |
| | Area Efficiency (frames/s/mm^2) | 332.3 | 602.9 | 1.81 | 695.6 | 2.09 |
| | Top-1 Accuracy (%) | 93.0 | 93.0 | 0 | 92.14 | -0.86% |
| CIFAR-100 | Throughput (frames/s) | 257.4 | 460.0 | 1.79 | 581.1 | 2.26 |
| | Energy Efficiency (frames/J) | 1066.6 | 1441.7 | 1.35 | 1687.2 | 1.58 |
| | Area Efficiency (frames/s/mm^2) | 151.8 | 231.7 | 1.53 | 260.7 | 1.72 |
| | Top-1 Accuracy (%) | 74.80 | 75.06 | 0.26% | 74.17 | -0.63% |
| ImageNet | Throughput (frames/s) | 53.8 | 101.9 | 1.89 | 133.4 | 2.48 |
| | Energy Efficiency (frames/J) | 238.6 | 336.5 | 1.41 | 405.0 | 1.70 |
| | Area Efficiency (frames/s/mm^2) | 31.8 | 51.3 | 1.61 | 59.8 | 1.88 |
| | Top-1 Accuracy (%) | 51.28 | 55.86 | 4.58% | 55.49 | 4.21% |

TABLE IV: Throughput and Energy Efficiency of the Implementations for Tight Compression and Conflict Pruning

The comparison of throughput, energy efficiency, and area efficiency among different compression methods is summarized in Table IV. Compared to conflict pruning, the weight-level tight compression improves the throughput and energy efficiency by 2.12 and 1.57 times, respectively, on the CIFAR-10 dataset. Fig. 13 shows the synthesis results for different implementations. The power consumption and area of the systolic array are slightly larger than in the baseline case. Since more columns can be packed in one group through tight compression, the input register array is larger than in the baseline implementation for conflict pruning, which supports at most eight columns per group. Besides, the address LUT also has some area overhead. Therefore, the area of the implementation for the weight-level tight compression is 20% larger than the baseline implementation for conflict pruning. However, as the throughput is increased, the overall area efficiency is improved by 1.81 times. In the benchmark of CIFAR-100, the throughput, energy efficiency, and area efficiency are improved by 1.79, 1.35, and 1.53 times, respectively. Similarly, in the benchmark of ImageNet, the throughput, energy efficiency, and area efficiency are improved by 1.89, 1.41, and 1.61 times, respectively. The top-1 accuracies in the implementations for weight-level tight compression and conflict pruning are 55.86% and 51.28%, respectively, on the ImageNet dataset. The accuracies are different from those in Table II since the activations are also quantized.
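The improvement factors for CIFAR-10 can be reproduced from the raw values in Table IV with a few lines of Python. This is only a consistency check: each factor is taken as the tight-compression value divided by the conflict-pruning value, and the dictionary layout and the name `improvement` are choices made for this sketch.

```python
# CIFAR-10 rows of Table IV:
# (conflict pruning, weight-level, subword-level)
CIFAR10 = {
    "throughput (frames/s)": (563.2, 1196.8, 1550.5),
    "energy efficiency (frames/J)": (2440.4, 3827.4, 4537.4),
    "area efficiency": (332.3, 602.9, 695.6),
}


def improvement(baseline, value):
    """Improvement factor of a method over the conflict-pruning baseline."""
    return value / baseline


for metric, (base, weight_level, subword_level) in CIFAR10.items():
    print(f"{metric}: weight-level x{improvement(base, weight_level):.3f}, "
          f"subword-level x{improvement(base, subword_level):.3f}")
```

The computed ratios agree with the reported factors (2.12 and 2.75 for throughput, 1.57 and 1.86 for energy efficiency, 1.81 and 2.09 for area efficiency) to within rounding.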

As discussed in Section IV-C, extra energy and area overhead is induced to support the subword-level compression. The overhead contains two parts. The first part is related to the modification in the MAC units at each node of the systolic array. The power and area of the systolic array are increased by 11% and 17%, respectively, compared to the implementation for the weight-level tight compression. The second part of the overhead is related to the extra index information for the weight matrix. Since the energy consumption and area of the weight buffer are much smaller (10%) than the systolic array, the overall overhead is mainly determined by the first part. Since better compression performance is obtained through the subword-level compression, the overall throughput, energy efficiency, and area efficiency are significantly improved, as shown in Table IV.

Fig. 13: Synthesis Results for Different Implementations.

VI Conclusions

In this paper, we have proposed a compression method to fully utilize the unstructured sparsity for implementing neural networks efficiently on the regular systolic array architecture. Two pruning granularities are explored for weight matrix compression. Besides weight pruning, we further prune the model at the subword level to exploit the fine-grained subword sparsity for improving the compression performance. After pruning, the sparse weight matrix of each layer is compressed to a small and dense format by permuting the weights to avoid conflicts and facilitate column packing. Through weight permutation, the matrix compression rate can be significantly improved, compared to the state-of-the-art compression techniques. As a result, the throughput and energy efficiency can be improved by 2.75 and 1.86 times, respectively.

References