I Introduction
Convolutional Neural Networks (CNNs) have been developing rapidly in many applications, such as image classification [15], object detection [8], and natural language processing [12]. As neural networks achieve higher accuracy with increasing computation and parameters [11][2], many accelerators [4][5][16][6] implemented on ASICs or FPGAs with abundant parallel units have been proposed to deploy neural networks on edge devices. Although these accelerators improve throughput, it remains very challenging to handle the tremendous computation and to transfer large amounts of data from DRAM to the on-chip memory. Weight pruning is an effective approach to reduce model size. However, the methods proposed in early works [10][22] prune weights randomly (named irregular pruning). Irregular pruning needs to store weights in Compressed Sparse Column (CSC) format [9], which incurs considerable extra indices to represent the weights. Besides, the imbalance of workload among different computing units in a highly parallel architecture causes resource underutilization, which prevents the hardware from fully leveraging the advantages of weight pruning.
Contrary to irregular pruning, regular pruning is more hardware-friendly. Different granularities have been explored in previous works, including irregular sparsity (0-D), vector-level sparsity (1-D), kernel-level sparsity (2-D), and filter-level sparsity (3-D) [21][1][24]. However, accuracy drops as pruning regularity increases [21]. Therefore, it is imperative to find a new granularity that achieves regular pruning with high accuracy.
In this paper, we first propose the Sparsity Pattern Mask (SPM), a novel kernel-level format to index the nonzero weights of each kernel. Based on SPM, we present PCNN, a fine-grained regular 1-D pruning method with an identical number of nonzero weights in each kernel of one layer. In this case, the computation workload of different convolution windows can be balanced with a small number of kernel patterns in each layer.
To make PCNN even more regular, we further leverage a multiple-knapsack framework to distill the patterns (i.e., keep fewer of them). In this way, we can employ fewer bits to encode the SPM.
Based on PCNN, we implement a pattern-based architecture in a 55 nm process. A specialized memory optimization is proposed to map the PCNN-based workload with SPM codes in hardware. Leveraging the benefits of PCNN, a sparsity-aware PE array is designed to achieve highly parallel computation with a delicate pipeline. Moreover, the sparsity-aware PE array can process sparse weights and activations simultaneously.
In experiments, combined with other pruning methods such as channel pruning, the PCNN algorithm achieves up to 34.4× reduction of model size with negligible accuracy loss, which demonstrates its orthogonality. Powered by PCNN, the proposed pattern-aware architecture fully leverages its benefits and shows up to 9× speedup and 28.39 TOPS/W efficiency with only 3.1% memory overhead for indices.
II The Proposed PCNN Framework
II-A Descriptions of PCNN
Generally, there are some zeros in each convolution kernel. To avoid storing zeros in memory, we propose a kernel-level index format called the Sparsity Pattern Mask (SPM) to encode sparse weights. As shown in Figure 1, the nonzero weights in a kernel are distributed in a specific pattern, which can be encoded with an SPM index. As a result, only the nonzero values and the SPM index need to be stored. Different from the Compressed Sparse Column (CSC) format [9], which assigns one index to each weight, we only apply one SPM index to each kernel, so the index overhead is much smaller.
However, there are in total $2^9 = 512$ patterns in $3\times 3$ kernels, indicating that the bit-width of the SPM index is 9. Therefore, in order to reduce the bit-width overhead of SPM encoding and simultaneously maintain a balanced workload for parallel computation, we propose PCNN, which keeps identical sparsity in each kernel of one layer (i.e., the numbers of zeros in different kernels are the same). In this way, the number of patterns is reduced to $\binom{9}{n}$ for $n$ nonzero weights per kernel. As shown in Figure 2, there are even some redundant patterns when we apply PCNN, which means that the number of patterns can be further reduced to some extent.
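As an illustration, the SPM encoding above can be sketched in a few lines. This is a hypothetical NumPy rendition for 3×3 kernels; the pattern-table layout (lexicographic enumeration) is our own assumption, not the paper's implementation:

```python
from itertools import combinations

import numpy as np

def spm_table(n, k=3):
    """Enumerate all patterns with exactly n nonzeros in a k x k kernel."""
    return [frozenset(c) for c in combinations(range(k * k), n)]

def encode(kernel, table):
    """Encode a kernel as (SPM index, nonzero values in raster order)."""
    flat = kernel.flatten()
    pattern = frozenset(np.flatnonzero(flat).tolist())
    idx = table.index(pattern)          # position in the pattern table = SPM index
    values = flat[sorted(pattern)]      # only the nonzero weights are stored
    return idx, values

table = spm_table(4)                    # n = 4 nonzeros per 3x3 kernel
print(len(table))                       # 126 patterns -> a 7-bit SPM index suffices
```

For n = 4 the table holds $\binom{9}{4} = 126$ patterns, so a 7-bit index already beats the 9-bit full mask even before pattern distillation.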
Based on the above considerations, we expect to find the optimal pruning manner to deploy PCNN with appropriate sparsity and number of patterns. Consequently, we establish a framework to describe PCNN with the following terminologies. The set $N = \{n_1, \dots, n_L\}$ gives the kernel sparsity of each convolution layer, determined by $s_i = 1 - n_i / k^2$, where $n_i$ denotes the number of nonzero values and $k \times k$ is the kernel size for each layer.
Next, the set $\Omega$ is a full collection of pattern sets for each layer, and the extracted set $P$ is derived from $\Omega$ without redundant patterns, as shown in Figure 2. Thus, in the PCNN learning framework, we intend to explore the appropriate $n_i$ and $P_i$ for each layer, aiming to achieve a smaller model with fewer patterns.
II-B KP-Based Pattern Distillation
Problem modeling. As mentioned above, we intend to further reduce the number of patterns in the PCNN framework. Thus, we employ pattern distillation to choose the dominant ones. Pattern distillation means selecting a finite set of valuable patterns from a candidate set, which corresponds perfectly with the core idea of the knapsack problem.
With a given Convolutional Neural Network containing $L$ convolution layers, $W_i$ denotes the weights in the $i$-th convolution layer and the set $W = \{W_1, \dots, W_L\}$ is the weight collection of all $L$ layers. $n$ denotes the number of nonzero weights in each kernel. We formulate pattern distillation as:
$$\max_{x_1,\dots,x_S}\ \sum_{j=1}^{m_i} \big\| f(W_i^j, P_i) \big\|_2^2 \quad \text{s.t.}\ \ P_i = \{\, p_k \in \Omega_n \mid x_k = 1 \,\},\ \ \sum_{k=1}^{S} x_k = K,\ \ x_k \in \{0, 1\} \qquad (1)$$
where $W_i^j$ is the $j$-th kernel in $W_i$ and $m_i$ is the number of kernels in $W_i$. $\Omega_n$ denotes the full candidate set of patterns with uniform sparsity $n$, and $S$ denotes the total number of patterns in $\Omega_n$ (i.e., $S = \binom{9}{n}$). $P_i$ is the set of patterns selected for the $i$-th layer, derived from $\Omega_n$. The hyperparameter $K$ is the desired number of selected patterns ($K \le S$). The $k$-th pattern in $\Omega_n$ is added to $P_i$ when $x_k = 1$, and vice versa. $f$ is a projection function that matches $W_i^j$ to the nearest pattern in $P_i$ by keeping the top-$n$ absolute values.
The pattern distillation problem is similar to the knapsack problem (KP). If we regard each kernel as a single knapsack, then the capacity of each knapsack is 1. In other words, we can only choose one pattern from the candidate set for each kernel. Representing all kernels with the selected patterns, the problem can be regarded as a multiple knapsack problem (MKP). Particularly, since all capacities are 1, KP-based pattern selection is a multiple knapsack problem with identical capacities (MKP-I).
Solution with a greedy algorithm. For our special case, we propose an efficient greedy algorithm to solve the KP-based pattern optimization. For each layer, we first select the most valuable pattern for each kernel and then count how often each pattern has been chosen. Finally, we keep the patterns with the highest frequency (the top-$K$ patterns). Details are shown in Algorithm 1. As a result, the set $P_i$ contains fewer patterns.
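The greedy selection can be sketched as follows. This is an illustrative Python rendition of our reading of Algorithm 1; the function names and the magnitude-based "value" of a pattern are our assumptions:

```python
from collections import Counter

import numpy as np

def distill_patterns(kernels, n, K):
    """Greedy MKP-I heuristic: keep the K patterns chosen most often.

    kernels: array of shape (m, 9); n: nonzeros per kernel;
    K: desired number of patterns for the layer.
    """
    votes = Counter()
    for w in kernels:
        # The most valuable pattern for a kernel keeps its top-n magnitudes.
        best = frozenset(np.argsort(np.abs(w))[-n:].tolist())
        votes[best] += 1
    return [p for p, _ in votes.most_common(K)]

def project(w, patterns):
    """Match a kernel to the nearest selected pattern (max retained magnitude)."""
    best = max(patterns, key=lambda p: sum(abs(w[i]) for i in p))
    out = np.zeros_like(w)
    for i in best:
        out[i] = w[i]
    return out
```

After distillation, every kernel is projected onto one of the K surviving patterns, which is what the ADMM fine-tuning stage then recovers accuracy from.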
III The Architecture for PCNN-Based Computing
III-A Mapping PCNN in Memory with SPM
Figure (a) shows the overall architecture for PCNN-based computation. The Pattern Config (PaC) provides the kernel sparsity information and supplies the SPM mapping table for the decoder. In the pattern pruning framework, a kernel is denoted by a corresponding SPM code and a nonzero sequence, stored separately in the Pattern SRAM and the Weight SRAM. For a 3×3 kernel, the length of the nonzero sequence ranges from 1 to 9. Therefore, the kernel and SPM registers are sized at 60 words, which can integrally store kernels that contain 1 to 6 nonzero weights. For other sparsities, we pad zeros to align the memory. A kernel with $n$ nonzero weights is fetched into the register, and the corresponding SPM code is simultaneously decoded into a 9-bit weight mask. After generating weight pointers based on the mask, the pattern-aware PE group fetches the proper weights from the kernel register using sparsity pointers, which are presented in the next subsection.
Figure 3 shows the memory layout for PCNN. In different sparsity cases, weights are stored in order in the same manner. The host controller can delicately access memory, fetching data to fill the registers according to the kernel sparsity configured in PaC.
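A minimal sketch of this packing scheme, assuming 60-word register loads and zero padding at load boundaries. The exact padding policy of the host controller is not specified in the text, so this is an illustration only:

```python
def pack_layer(nonzero_seqs, n, reg_words=60):
    """Pack per-kernel nonzero sequences into fixed-size register loads.

    For n in 1..6 the 60-word register holds an integral number of kernels;
    otherwise each load is zero-padded up to the register boundary.
    """
    per_load = reg_words // n            # kernels per 60-word load
    loads = []
    for i in range(0, len(nonzero_seqs), per_load):
        chunk = [v for seq in nonzero_seqs[i:i + per_load] for v in seq]
        chunk += [0] * (reg_words - len(chunk))   # zero padding for alignment
        loads.append(chunk)
    return loads
```

With n = 4, one 60-word load carries 15 kernels, so a layer of 31 kernels occupies three loads, the last one mostly padding.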
III-B Sparsity-Aware Processing Element Group
A detailed overview of the pattern-aware PE group is shown in Figure 4, where the sparsity I/O generates pointers based on the weight mask from the SPM decoder to fetch weights properly. In our design, we implement 64 PEs with 4 MAC units each. Consequently, our architecture can perform at most 256 MACs per cycle. Besides weight sparsity, we also leverage activation sparsity to further improve computing efficiency. Therefore, we employ a shared-activation datapath to balance the workload from activations.
Sparsity pointer generation is shown in Figure (b). Both the weight mask and the activation mask are transferred to the sparsity I/O, where the sparsity mask is generated. Then, pointer offsets are obtained from the sparsity mask (Figure (c)). An adder–AND chain produces nine offsets, each of which denotes the distance to the nearest zero. The offsets are computed as follows: first, a NOT operation is applied to each bit of the sparsity mask; second, the pointer offsets are obtained by accumulating the number of zeros between every two nonzero weights. With the pointer offsets, we attain the corresponding pointers to fetch the needed weights from the kernel register. With the help of PCNN and the shared-activation dataflow, the workloads of weights and activations in different PEs are identical, contributing to higher resource utilization and better parallelism.
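Functionally, the pointer generation can be modeled as below. This is a behavioral Python sketch, not the RTL; it assumes the kernel register stores nonzero weights compactly in mask order, so the pointer for an effectual position is the count of weight-mask ones before it:

```python
def effectual_pointers(weight_mask, act_mask):
    """Pointers into the compacted kernel register for effectual MACs.

    weight_mask/act_mask: 9-bit masks given as lists of 0/1. A MAC is
    effectual only where both the weight and the activation are nonzero.
    """
    sparsity_mask = [w & a for w, a in zip(weight_mask, act_mask)]
    pointers, ones_before = [], 0
    for i in range(len(weight_mask)):
        if sparsity_mask[i]:
            pointers.append(ones_before)   # register index of weight i
        ones_before += weight_mask[i]      # prefix count of stored weights
    return pointers
```

The prefix count plays the role of the accumulated offsets in the adder chain: it skips over zeros so that only effectual weight/activation pairs are routed to the MAC units.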
Figure 5 shows the pipeline strategy in the proposed pattern-aware PE. To achieve high throughput, all operations are pipelined. The first stage is data preprocessing: weights are restored to their original kernel positions according to the SPM indices, activations in a convolution window are loaded into the registers, and activation sparsity masks are generated simultaneously. In the second stage, nonzero pointers are generated from the calculated offsets, which select the effectual workload for the MACs in the next stage. Finally, when all partial sums from the various input channels have been accumulated, ReLU is applied to attain the final result.
TABLE I: Pruning results of different n for VGG16 on CIFAR-10.

Benchmark | Top-1 acc | Top-1 acc loss | CONV FLOPs | FLOPs pruned | CONV params | Compression | Actual compression
VGG16, baseline | 93.54% | - | 3.13 | - | 1.47 | - | -
VGG16, n = 4 | 93.79% | +0.25% | 1.39 | 56.5% | 0.65 | 2.3× | 2.2×
VGG16, n = 3 | 93.58% | +0.04% | 1.04 | 66.7% | 0.49 | 3.0× | 2.9×
VGG16, n = 2 | 93.52% | -0.02% | 0.70 | 77.8% | 0.33 | 4.5× | 4.1×
VGG16, n = 1 | 93.07% | -0.21% | 0.35 | 88.9% | 0.16 | 9.0× | 8.4×
VGG16, various setting | 93.33% | -0.21% | 0.35 | 88.8% | 0.16 | 9.0× | 8.4×
n in the various-setting layers: 2-1-1-1-1-1-1-1-1-1-1-1-1, with 32 patterns in n = 2 layers and 8 patterns in n = 1 layers.
TABLE II: Pruning results of different n for ResNet18 on CIFAR-10.

Benchmark | Top-1 acc | Top-1 acc loss | CONV FLOPs | FLOPs pruned | CONV params | Compression | Actual compression
ResNet18, baseline | 96.58% | - | 5.55 | - | 1.12 | - | -
ResNet18, n = 4 | 96.58% | +0.06% | 2.50 | 54.5% | 0.51 | 2.2× | 2.1×
ResNet18, n = 3 | 96.38% | -0.20% | 1.89 | 65.5% | 0.38 | 3.0× | 2.8×
ResNet18, n = 2 | 96.15% | -0.43% | 1.28 | 76.7% | 0.26 | 4.3× | 4.0×
ResNet18, n = 1 | 95.55% | -1.03% | 0.67 | 88.0% | 0.14 | 7.9× | 7.3×
ResNet18, various setting | 95.83% | -0.75% | 0.86 | 84.5% | 0.14 | 7.9× | 7.3×
n in the various-setting layers: 2-2-2-1-1-1-1-1-1-1-1-1-1-1-1-1, with 32 patterns in n = 2 layers and 8 patterns in n = 1 layers.
TABLE III: Pruning rate and accuracy of different n for VGG16 on ImageNet.

Benchmark | Top-1 acc | Top-1 acc loss | CONV FLOPs | FLOPs pruned | CONV params | Compression | Actual compression
VGG16, baseline | 92.10% | - | 6.82 | - | 1.47 | - | -
VGG16, n = 5 | 92.47% | +0.37% | 0.85 | 44.4% | 0.82 | 1.8× | 1.7×
VGG16, n = 4 | 92.45% | +0.35% | 0.68 | 56.5% | 0.65 | 2.3× | 2.2×
IV Evaluation
IV-A Methodology
Setup for evaluating the proposed PCNN. We summarize our PCNN results for CNN model compression on a series of benchmarks, including VGG16 [2] on CIFAR-10 [14] and ImageNet [15], and ResNet18 [11] on CIFAR-10. We initialize our learning framework with pretrained models in PyTorch, followed by the pattern distillation. After that, the Alternating Direction Method of Multipliers (ADMM) [3] is employed to fine-tune the model. Considering that convolution layers dominate the computation of modern CNNs, we mainly focus on convolution layers.

Setup for evaluating the PCNN-based architecture. Based on PCNN encoded with SPM, we implement the pattern-aware architecture in RTL and evaluate VGG16 under PCNN with VCS to obtain the performance. The area and power of our architecture are obtained with Design Compiler in a UMC 55 nm standard power CMOS process.
IV-B Evaluating the Kernel Sparsity and the Number of Patterns
In this part, we study different choices of kernel sparsity for VGG16 and ResNet18. In ResNet18, we only process the layers with 3×3 filters and ignore the 1×1 ones, which are too accuracy-sensitive. We set n to 1, 2, 3, and 4 in all layers, with at most 8, 32, 32, and 32 patterns, respectively.
Table I shows the pattern pruning results of various n for VGG16 on CIFAR-10. PCNN leads to less than 0.5% accuracy loss even when only one weight is left in each kernel. When we apply a different sparsity setting over the layers, accuracy can be improved with similar compression rate and speedup.
Similar results are achieved for ResNet18 on CIFAR-10, as shown in Table II. Within 0.5% accuracy loss, pattern pruning achieves compression rates ranging from 1.78× to 4.31× with the unified sparsity settings (n = 4, 3, and 2). Also, when various sparsity settings are applied, we achieve better accuracy than the unified counterpart with a similar compression rate of 9×. Furthermore, the results for VGG16 on ImageNet are shown in Table III, where we achieve a 1.8× compression rate and 2.25× speedup with no harm to accuracy.
Note that the last columns of Tables I–III contain the actual compression rate considering the overhead of indices. With PCNN, there are only marginal compression-rate drops due to the kernel-level SPM indices. On the contrary, for irregular pruning (taking VGG16 with n = 4 as an example), the actual compression rate is only 2.0×, and the compression loss from per-weight indices is about three times as large as ours.
Later, we further restrict the number of patterns in each layer to study how regularity impacts accuracy. As Table IV shows, we evaluate n = 4 and n = 2 with full patterns and with 32, 16, 8, and 4 patterns. We use full patterns as baselines, with 93.79% and 93.52% accuracy for n = 4 and n = 2 respectively, and weight compression rates of 2.3× and 4.5×. The results show no obvious accuracy drop with fewer patterns at n = 4, which helps save index-storage overhead, while in the higher-sparsity case the accuracy is more sensitive to the decrease in patterns. In fact, in the cases with high sparsity we do not need to focus much on the number of patterns, because the compression loss with SPM in PCNN is already small. On average, 16 patterns are enough to maintain accuracy with less index overhead.
TABLE IV: Relative accuracy and actual compression with restricted numbers of patterns (VGG16 on CIFAR-10).

Setting | Relative acc | Actual compression
n = 4, full patterns | - | 2.14×
n = 4, 32 patterns | +0.32% | 2.18×
n = 4, 16 patterns | +0.10% | 2.20×
n = 4, 8 patterns | -0.05% | 2.21×
n = 4, 4 patterns | -0.17% | 2.23×
n = 2, full patterns | - | 4.08×
n = 2, 32 patterns | +0.00% | 4.13×
n = 2, 16 patterns | -0.22% | 4.19×
n = 2, 8 patterns | -0.54% | 4.26×
n = 2, 4 patterns | -0.71% | 4.32×
IV-C Comparison to Other Regular Compression Methods
In this part, we compare PCNN to pruning methods from other works for VGG16 and ResNet18 on CIFAR-10. Since different works use different baselines, we report the accuracy loss relative to each respective baseline. The comparison for VGG16 is shown in Table V. Our method can remarkably compress parameters and reduce FLOPs simultaneously with negligible accuracy loss. As for ResNet18, shown in Table VI, PCNN also achieves better performance than other coarse-grained pruning methods; in particular, PCNN enjoys higher FLOPs reduction.
TABLE V: Comparison with other regular pruning methods for VGG16 on CIFAR-10.

Benchmark | Relative acc | FLOPs pruned | Compression
PCNN | +0.01% | 66.7% | 3.0×
PCNN | -0.21% | 88.8% | 9.0×
Filter pruning [18] | +0.15% | 33.3% | 2.8×
Network slimming [20] | +0.14% | 51.0% | 8.7×
Try-and-learn (b = 1) [13] | -1.10% | 82.7% | 2.2×
IKR [25] | -0.90% | 84.7% | 4.3×
Note: a VGG16-inspired CNN containing 6 convolution and 2 FC layers.
IV-D The Orthogonality to Other Compression Methods
TABLE VII: PCNN combined with kernel-level pruning (VGG16 on ImageNet).

Benchmark | Relative acc | Compression
PCNN, n = 5 | +0.38% | 1.8×
PCNN, n = 5 + 2.4× kernel pruning | +0.28% | 4.4×
PCNN, n = 5 + 4.1× kernel pruning | -0.27% | 7.3×
TABLE VIII: PCNN combined with channel-level pruning.

Benchmark | Relative acc | Compression
PCNN + channel pruning | -0.02% | 34.4×
PCNN + channel pruning | -0.46% | 50.3×
Structured ADMM [23] | -0.60% | 50.0×
SNIP [17] | -0.45% | 20.0×
Synaptic Strength [19] | +0.43% | 25.0×
Combined with kernel-level pruning. As shown in Table VII, we perform experiments for VGG16 on ImageNet with n = 5. We apply 2.4× and 4.1× kernel pruning on top of the 1.8× compression rate from PCNN to achieve fused pruning. The results show that PCNN is orthogonal to kernel pruning, and their combination achieves promising results.
Combined with channel-level pruning. Channel-level pruning has been shown to be more regular with a higher compression rate but less friendly to accuracy [21]. As the results in Table VIII show, the fused compression achieves a 34.4× compression rate, contributed by 3.75× PCNN compression and 9× channel pruning. Therefore, PCNN is also orthogonal to channel pruning.
IV-E Evaluation on the Pattern-Aware Architecture
Overhead evaluation. The layout of our pattern-aware architecture is shown in Figure 6, and the overhead of each component is listed in Table IX. The power consumption is measured at a 300 MHz on-chip frequency and a 1 V supply voltage. Contrary to irregular pruning, PCNN requires an SPM index for each kernel rather than for each weight. With PCNN, a 128 KB weight SRAM is employed, holding up to 32768 kernels of 3×3 size with 4 nonzeros under 8-bit quantization for common cases. In this case, a 4 KB pattern SRAM is enough for a workload with 16 patterns in each layer; as analyzed in Section IV-B, 16 patterns are sufficient to maintain accuracy. Consequently, the architecture introduces only 3.1% memory overhead to store indices. The pattern SRAM takes up only 2.4% of the area and 1.9% of the power of the whole chip, according to Table IX. In comparison, an irregular-pruning-based architecture like EIE [9] needs a 64 KB index SRAM to index 128K weights.
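Taking the stated sizes at face value, the index overhead can be checked with a few lines of arithmetic (a sanity check on the quoted figures, not a claim about the exact SRAM organization):

```python
# 128 KB weight SRAM holds 32768 kernels of 4 nonzeros at 8-bit each,
# while a 4 KB pattern SRAM carries the SPM side of the encoding.
weight_sram = 32768 * 4 * 1          # bytes: kernels * nonzeros * 1 byte each
pattern_sram = 4 * 1024              # bytes
overhead = pattern_sram / weight_sram
print(f"{overhead:.1%}")             # -> 3.1%
```

The 4 KB / 128 KB ratio reproduces the 3.1% index overhead quoted in the text, versus 64 KB of index SRAM for 128 KB of weights (50%) in the EIE-style irregular case.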
Performance evaluation. We evaluate VGG16 under PCNN with sparsities of 55.6% (n = 4), 66.7% (n = 3), 77.8% (n = 2), and 88.9% (n = 1). The average activation sparsity is 0.8. The results show that we achieve 2.3×, 3.1×, 4.5×, and 9.0× speedup compared to the dense counterpart. With 256 MAC units running at a 300 MHz frequency and 1 V, our pattern-aware architecture achieves 3.15 TOPS/W (no sparsity) to 28.39 TOPS/W (88.9% sparsity) power efficiency. It can be observed that our pattern-aware architecture fully leverages the strengths of the PCNN method.
TABLE IX: Area and power breakdown of the pattern-aware architecture.

Component | Area (mm²) | Share | Power (mW) | Share
Overall | 8.00 | 100% | 48.7 | 100%
Data SRAM | 3.25 | 40.6% | 13.7 | 28.2%
Weight SRAM | 2.48 | 31.0% | 15.6 | 32.1%
Pattern SRAM | 0.19 | 2.4% | 0.9 | 1.9%
Register file | 1.58 | 19.8% | 13.6 | 27.4%
PE group | 0.50 | 6.2% | 4.9 | 10.0%
V Conclusion
We present PCNN, a novel fine-grained regular pruning method that uses SPM to encode sparsity. Contrary to irregular pruning, the sparsity of every kernel in a layer is identical, which achieves regularity while maintaining fine granularity. Experiments show that an 8×–9× compression rate and computing speedup can be achieved with less than 1% accuracy loss. Additionally, pattern pruning can be easily combined with coarse-grained pruning methods, achieving a 34.4× compression ratio with negligible accuracy loss. With SPM, indices are deployed at the kernel level rather than the weight level, which saves a great amount of memory overhead. For computation, with only 3.1% memory overhead for indices, the proposed architecture achieves full parallelism and obtains up to 9× speedup and 28.39 TOPS/W efficiency based on PCNN.
References
[1] (2017) Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC) 13(3), pp. 32:1–32:18.
[2] Y. Bengio and Y. LeCun (Eds.) (2015) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.
[3] (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1), pp. 1–122.
[4] (2014) DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, pp. 269–284.
[5] (2017) Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52(1), pp. 127–138.
[6] (2019) REQ-YOLO: a resource-aware, efficient quantization framework for object detection on FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2019, pp. 33–42.
[7] (2019) Band-limited training and inference for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, pp. 1745–1754.
[8] (2013) Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR abs/1311.2524.
[9] (2016) EIE: efficient inference engine on compressed deep neural network. In 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, pp. 243–254.
[10] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In 4th International Conference on Learning Representations, ICLR 2016, Conference Track Proceedings.
[11] (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[12] (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29(6), pp. 82–97.
[13] (2018) Learning to prune filters in convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, pp. 709–718.
[14] (2009) Learning multiple layers of features from tiny images. Technical report.
[15] (2017) ImageNet classification with deep convolutional neural networks. Communications of the ACM 60(6), pp. 84–90.
[16] (2019) UNPU: an energy-efficient deep neural network accelerator with fully variable weight bit precision. IEEE Journal of Solid-State Circuits 54(1), pp. 173–185.
[17] (2019) SNIP: single-shot network pruning based on connection sensitivity. In 7th International Conference on Learning Representations, ICLR 2019.
[18] (2017) Pruning filters for efficient ConvNets. In 5th International Conference on Learning Representations, ICLR 2017, Conference Track Proceedings.
[19] (2018) Synaptic strength for convolutional neural network. In Advances in Neural Information Processing Systems 31, NeurIPS 2018, pp. 10170–10179.
[20] (2017) Learning efficient convolutional networks through network slimming. In IEEE International Conference on Computer Vision, ICCV 2017, pp. 2755–2763.
[21] (2017) Exploring the granularity of sparsity in convolutional neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, pp. 1927–1934.
[22] (2015) Data-free parameter pruning for deep neural networks. In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, pp. 31.1–31.12.
[23] (2019) Non-structured DNN weight pruning considered harmful. CoRR abs/1907.02124.
[24] (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems 29, NIPS 2016, pp. 2074–2082.
[25] (2018) Efficient hardware realization of convolutional neural networks using intra-kernel regular pruning. In 48th IEEE International Symposium on Multiple-Valued Logic, ISMVL 2018, pp. 180–185.