I. Introduction
Large-scale DNNs achieve significant improvements on many challenging problems, such as image classification [1], speech recognition [2] and natural language processing [3]. However, as both the number of layers and the layer sizes expand, the resulting intensive computation and storage bring challenges to the traditional Von Neumann architecture [4], such as the computing wall, massive data movement, and high power consumption [5, 6]. Furthermore, the deep structure and large model size make DNNs prohibitive for embedded systems and IoT devices, where low power consumption is required. To address these challenges, spin-orbit-torque magnetic random-access memory (SOT-MRAM) has been used to reduce the power consumption of DNNs, since it offers near-zero standby power, high density, and non-volatility [7]. Combined with the in-memory computing technique [8, 9], a SOT-MRAM-based processing-in-memory (PIM) engine can perform arithmetic and logic computations in parallel. Therefore, the most intensive operation, matrix multiplication and accumulation (MAC) in both convolutional (CONV) and fully-connected (FC) layers, can be implemented using bitwise parallel AND, bitcount, bitshift, etc.
Compared to SRAM and DRAM, SOT-MRAM has higher write latency and energy [10, 11, 7], which decreases its popularity as an energy-efficient DNN accelerator. To mitigate these drawbacks at the software level, weight quantization has been introduced to SOT-MRAM PIM. By using a binarized weight representation, [10] efficiently processes data within SOT-MRAM, greatly reducing power-hungry, long-distance data communication. However, weight binarization causes accuracy degradation, especially for large-scale DNNs. In reality, many scenarios, e.g., self-driving cars, require accuracy as high as possible; thus a binarized weight representation is not favorable. In this work, we propose an ultra-energy-efficient framework that applies model compression techniques [12, 13, 14, 15, 16], including weight pruning and quantization, at the software level while considering the architecture of SOT-MRAM PIM. To further guarantee solution feasibility and satisfy SOT-MRAM hardware constraints while providing high solution quality (no obvious accuracy degradation after model compression and hardware mapping), we incorporate the alternating direction method of multipliers (ADMM) into the training phase. As a result, we reduce the footprint and power consumption of SOT-MRAM PIM and improve the overall system throughput, making our proposed ADMM-based SOT-MRAM PIM more energy efficient and suitable for embedded systems and IoT devices.
In the remainder of this paper, we first illustrate how ADMM connects to our model compression techniques to achieve deeply compressed models tailored to SOT-MRAM PIM designs. Then we introduce the architecture and mechanism of SOT-MRAM PIM and how to map our compressed model onto it. Finally, we evaluate the proposed framework on different networks. The experimental results show that the accuracy and compression rate of our framework consistently outperform the baseline works, and that the efficiency (area and power) and throughput of the SOT-MRAM PIM engine can be significantly improved.
II. A Unified and Systematic Model Compression Framework for Efficient SOT-MRAM Based PIM Engine Design
In this section, we propose a unified and systematic framework for DNN model compression, which simultaneously and efficiently achieves DNN weight pruning and quantization. By reformulating the pruning and quantization problems as optimization problems, the proposed framework solves the structured pruning and low-bit quantization problems iteratively and analytically by extending the powerful ADMM algorithm [17].
Meanwhile, our structured pruning model has a unique (i.e., tiny and regular) spatial property which naturally fits SOT-MRAM-based processing-in-memory engine utilization, thereby bridging the gap between large-scale DNN inference and compute-limited platforms. After mapping the compressed DNN model onto the SOT-MRAM-based processing-in-memory engine, bitwise convolution can be executed with consistently high efficiency.
II-A. DNN Weight Pruning using ADMM
Consider an $N$-layer DNN of interest, where the first $N_1$ layers are CONV layers and the rest are FC layers. The weights and biases of the $i$-th layer are denoted by $\mathbf{W}_i$ and $\mathbf{b}_i$, respectively, and the loss function associated with the DNN is denoted by $f(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N)$; see [18]. In this paper, $\{\mathbf{W}_i\}$ and $\{\mathbf{b}_i\}$ respectively characterize the collection of weights and biases from layer $1$ to layer $N$. Our objective is to implement structured pruning on the DNN. In the following discussion we focus on the CONV layers because they have the highest computation requirements. More specifically, we minimize the loss function subject to specific structured sparsity constraints on the weights in the CONV layers, i.e.,
(1)  $\underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\text{minimize}} \;\; f(\{\mathbf{W}_i\},\{\mathbf{b}_i\})$
subject to $\mathbf{W}_i \in \mathcal{S}_i, \;\; i = 1, \dots, N_1,$
where $\mathcal{S}_i$ is the set of $\mathbf{W}_i$ with the desired "structure". Next we introduce constraint sets corresponding to different types of structured sparsity that facilitate SOT-MRAM PIM engine implementation.
The collection of weights in the $i$-th CONV layer is a four-dimensional tensor, i.e., $\mathbf{W}_i \in \mathbb{R}^{A_i \times B_i \times H_i \times W_i}$, where $A_i$, $B_i$, $H_i$, and $W_i$ are respectively the number of filters, the number of channels in a filter, the height of the filter, and the width of the filter in layer $i$.
Filter-wise structured sparsity: When we train a DNN with sparsity at the filter level, the constraint on the weights in the $i$-th CONV layer is given by $\mathcal{S}_i = \{\mathbf{W}_i \mid \text{the number of nonzero filters in } \mathbf{W}_i \text{ is at most } \alpha_i\}$. Here, a nonzero filter means that the filter contains some nonzero weight.
Channel-wise structured sparsity: When we train a DNN with sparsity at the channel level, the constraint on the weights in the $i$-th CONV layer is given by $\mathcal{S}_i = \{\mathbf{W}_i \mid \text{the number of nonzero channels in } \mathbf{W}_i \text{ is at most } \beta_i\}$. Here, we call the $b$-th channel nonzero if $\mathbf{W}_i(:,b,:,:)$ contains some nonzero element.
Kernel-wise structured sparsity: When we train a DNN with sparsity at the kernel level, the constraint on the weights in the $i$-th CONV layer is given by $\mathcal{S}_i = \{\mathbf{W}_i \mid \text{the number of nonzero kernels } \mathbf{W}_i(a,b,:,:) \text{ is at most } \gamma_i\}$.
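The three structured-sparsity notions above can be made concrete with a short sketch. This is our own illustration, not the paper's code; the function names and the toy tensor are hypothetical, but the axis conventions match the $(A_i, B_i, H_i, W_i)$ layout described above.

```python
import numpy as np

def nonzero_filters(W):
    # a filter W[a, :, :, :] is nonzero if any of its weights is nonzero
    return int(np.sum(np.any(W != 0, axis=(1, 2, 3))))

def nonzero_channels(W):
    # a channel W[:, b, :, :] is nonzero if any weight across all filters is nonzero
    return int(np.sum(np.any(W != 0, axis=(0, 2, 3))))

def nonzero_kernels(W):
    # each (filter, channel) pair W[a, b, :, :] is one kernel
    return int(np.sum(np.any(W != 0, axis=(2, 3))))

# toy CONV tensor: 4 filters, 3 channels, 3x3 kernels, single nonzero weight
W = np.zeros((4, 3, 3, 3))
W[0, 1, 0, 0] = 1.0
assert nonzero_filters(W) == 1
assert nonzero_channels(W) == 1
assert nonzero_kernels(W) == 1
```

A weight tensor satisfies, e.g., the filter-wise constraint when `nonzero_filters(W)` does not exceed the layer's budget $\alpha_i$.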
To solve this problem, we propose a systematic framework of dynamic ADMM regularization followed by masked mapping and retraining steps. Through this integration, we can guarantee solution feasibility (satisfying all constraints) and provide high solution quality.
II-B. Solution to the DNN Pruning Problem
Corresponding to every set $\mathcal{S}_i$, we define the indicator function
$g_i(\mathbf{W}_i) = \begin{cases} 0, & \text{if } \mathbf{W}_i \in \mathcal{S}_i \\ +\infty, & \text{otherwise.} \end{cases}$
Furthermore, we incorporate auxiliary variables $\mathbf{Z}_i$, $i = 1, \dots, N_1$. The original problem (1) is then equivalent to
(2)  $\underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\text{minimize}} \;\; f(\{\mathbf{W}_i\},\{\mathbf{b}_i\}) + \sum_{i=1}^{N_1} g_i(\mathbf{Z}_i)$
subject to $\mathbf{W}_i = \mathbf{Z}_i, \;\; i = 1, \dots, N_1.$
By adopting the augmented Lagrangian [19] on (2), ADMM regularization decomposes problem (2) into two subproblems and solves them iteratively until convergence.
The first subproblem is
(3)  $\underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\text{minimize}} \;\; f(\{\mathbf{W}_i\},\{\mathbf{b}_i\}) + \sum_{i=1}^{N_1} \frac{\rho_i}{2} \|\mathbf{W}_i - \mathbf{Z}_i^k + \mathbf{U}_i^k\|_F^2,$
where $\mathbf{U}_i^k$ is the dual variable, updated in each ADMM iteration as $\mathbf{U}_i^k = \mathbf{U}_i^{k-1} + \mathbf{W}_i^k - \mathbf{Z}_i^k$.
The first term in the objective of (3) is the differentiable loss function of the DNN, and the second term is a quadratic regularization of the $\mathbf{W}_i$'s, which is differentiable and convex. As a result, (3) can be solved by standard SGD. Although we cannot guarantee global optimality, this is due to the non-convexity of the DNN loss function rather than the quadratic term introduced by our method. Please note that this subproblem and its solution are the same for all types of structured sparsity.
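The regularized objective in (3) and the dual update are simple enough to sketch directly. The code below is an illustrative NumPy stand-in, not the paper's implementation; `task_loss`, `rho`, and the list-of-tensors calling convention are our own assumptions.

```python
import numpy as np

def admm_objective(task_loss, weights, Z, U, rho=1e-3):
    # Eq. (3): task loss plus (rho/2)||W - Z + U||_F^2 summed over layers
    reg = sum(0.5 * rho * np.sum((W - Zi + Ui) ** 2)
              for W, Zi, Ui in zip(weights, Z, U))
    return task_loss + reg

def dual_update(weights, Z, U):
    # ADMM dual-variable step: U <- U + W - Z
    return [Ui + W - Zi for W, Zi, Ui in zip(weights, Z, U)]

# when W == Z and U == 0, the penalty vanishes
assert admm_objective(1.5, [np.ones(3)], [np.ones(3)], [np.zeros(3)]) == 1.5
```

In practice the penalty term would be added to the training loss and minimized by SGD, with `dual_update` applied after each ADMM iteration.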
The second subproblem is
(4)  $\underset{\{\mathbf{Z}_i\}}{\text{minimize}} \;\; \sum_{i=1}^{N_1} g_i(\mathbf{Z}_i) + \sum_{i=1}^{N_1} \frac{\rho_i}{2} \|\mathbf{W}_i^{k+1} - \mathbf{Z}_i + \mathbf{U}_i^k\|_F^2.$
Note that $g_i$ is the indicator function of $\mathcal{S}_i$, thus this subproblem can be solved analytically and optimally [19]. For $i = 1, \dots, N_1$, the optimal solution is the Euclidean projection of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ onto $\mathcal{S}_i$. The set $\mathcal{S}_i$ differs across types of structured sparsity, and the corresponding Euclidean projections are described next.
Solving the second subproblem for different structured sparsities: For filter-wise structured sparsity constraints, we first calculate
$O_a = \|\mathbf{W}_i^{k+1}(a,:,:,:) + \mathbf{U}_i^k(a,:,:,:)\|_F^2$
for $a = 1, \dots, A_i$, where $\|\cdot\|_F$ denotes the Frobenius norm. We then keep the elements corresponding to the $\alpha_i$ largest values in $\{O_a\}$ and set the rest to zero.
For channel-wise structured sparsity, we first calculate
$O_b = \|\mathbf{W}_i^{k+1}(:,b,:,:) + \mathbf{U}_i^k(:,b,:,:)\|_F^2$
for $b = 1, \dots, B_i$. We then keep the elements corresponding to the $\beta_i$ largest values in $\{O_b\}$ and set the rest to zero.
For kernel-wise structured sparsity, we first calculate
$O_{a,b} = \|\mathbf{W}_i^{k+1}(a,b,:,:) + \mathbf{U}_i^k(a,b,:,:)\|_F^2$
for $a = 1, \dots, A_i$, $b = 1, \dots, B_i$. We then keep the elements corresponding to the $\gamma_i$ largest values in $\{O_{a,b}\}$ and set the rest to zero.
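The filter-wise projection above can be sketched as follows. This is our own minimal NumPy version (function name and `alpha` argument are illustrative); the channel- and kernel-wise projections differ only in the axes the squared norms are summed over.

```python
import numpy as np

def project_filterwise(W, alpha):
    # one squared-Frobenius-norm score per filter (axis 0)
    norms = np.sum(W ** 2, axis=(1, 2, 3))
    # indices of the alpha filters with the largest scores
    keep = np.argsort(norms)[::-1][:alpha]
    out = np.zeros_like(W)
    out[keep] = W[keep]      # keep those filters, zero the rest
    return out

# 2 filters, 1 channel, 2x2 kernels; filter 1 has the larger norm
W = np.arange(8, dtype=float).reshape(2, 1, 2, 2)
P = project_filterwise(W, alpha=1)
assert np.all(P[0] == 0) and np.all(P[1] == W[1])
```

Because each score is computed over a whole filter, the result is exactly the Euclidean projection onto the filter-wise constraint set: the retained entries are unchanged and all others are zero.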
II-C. SOT-MRAM Processing-In-Memory Engine for DNN
The main purpose of the SOT-MRAM PIM engine is to convert the MAC operations in convolutional layers into a bitwise convolution format and perform them in memory. Bitwise convolution involves four main steps: parallel AND, bitcount, bitshift, and accumulation, formulated as Eqn. (5), where $M$ and $N$ stand for the bit-lengths of inputs and weights, respectively; $\mathbf{I}_m$ represents the $m$-th bit of all inputs in $\mathbf{I}$, and $\mathbf{W}_n$ contains the $n$-th bit of all weights in $\mathbf{W}$.
Consider a CONV layer with a 2×2 kernel, where $\mathbf{W}$ contains the 4 weights of a kernel and $\mathbf{I}$ contains the 4 current inputs covered by this kernel. We assume both weights and inputs use a 3-bit representation. As Figure 1 shows, inputs and weights are mapped to two different subarrays. During computation, two rows are selected, one from each subarray, and the parallel AND results are obtained from the sense amplifiers [20]. Each row in one subarray conducts a parallel AND operation with all rows from the other subarray, and every AND result goes through the bitcount and shifter units, then is accumulated with the other results to produce a bitwise convolution result [11].
(5)  $\mathbf{I} * \mathbf{W} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} 2^{m+n} \, \mathrm{bitcount}\big(\mathrm{AND}(\mathbf{I}_m, \mathbf{W}_n)\big)$
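The bitwise MAC of Eqn. (5) can be checked in plain Python. The sketch below is a software stand-in for the in-memory AND/bitcount/shift hardware, with illustrative names; it assumes unsigned inputs and weights within the stated bit-lengths.

```python
def bitwise_mac(inputs, weights, m_bits=3, n_bits=3):
    acc = 0
    for m in range(m_bits):
        # m-th bit plane of every input
        i_plane = [(x >> m) & 1 for x in inputs]
        for n in range(n_bits):
            # n-th bit plane of every weight
            w_plane = [(w >> n) & 1 for w in weights]
            # parallel AND of the two rows, then bitcount, shift, accumulate
            and_bits = [a & b for a, b in zip(i_plane, w_plane)]
            acc += sum(and_bits) << (m + n)
    return acc

# matches the ordinary dot product for unsigned 3-bit operands
assert bitwise_mac([1, 2, 3, 4], [4, 3, 2, 1]) == 1*4 + 2*3 + 3*2 + 4*1
```

Each inner iteration mirrors one subarray row pair being selected: the AND comes from the sense amplifiers, the `sum` is the bitcount, and the shift by $2^{m+n}$ aligns the partial product before accumulation.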
II-D. Framework Mapping
In our proposed framework, based on the architecture of the SOT-MRAM-based PIM engine, we incorporate structured pruning and quantization using ADMM to effectively reduce the PIM engine's area and power. Quantization can be integrated into the same ADMM-based framework with different constraints; we omit the details of ADMM quantization due to space limits. Please note that our ADMM-based framework can achieve weight pruning and quantization simultaneously or separately. Moreover, the overall throughput of the SOT-MRAM-based DNN computing system can be improved as well. The SOT-MRAM-based PIM engine contains several processing elements (PEs), and each PE consists of a column decoder, a row decoder, one computing set, and multiple SOT-MRAM subarrays, as shown in Figure 2(a), which also shows how we map the inputs and weights to the PIM engine. In each PE, the inputs are mapped onto one subarray, and every other subarray accommodates the weights of a different filter from the same input channel. For example, a given PE computes the convolution of the inputs stored in its input subarray with all the filter weights stored in its weight subarrays. The number of columns and rows in each subarray depends on the kernel size of the network and the bit-lengths of the inputs and weights, respectively. All PEs work in parallel and independently.
Figure 2(b) gives examples of the structured pruning types used in our proposed framework and shows how each facilitates the reduction of PIM engine size. For filter pruning, all the subarrays on the same row (i.e., storing the weights from the same filter) can be removed, reducing the number of subarrays contained in each PE. For channel pruning, since the pruned channels are no longer needed, the number of required PEs can be reduced without decreasing the throughput. Since each subarray stores the weights of one channel of one filter, i.e., one convolution kernel, kernel pruning removes all the weights on a subarray. With filter pruning and channel pruning, all the pruned subarrays or PEs can be physically removed directly. Removing subarrays via kernel pruning, however, may leave different PEs with uneven numbers of subarrays. But since all subarrays have the same size and a subarray is the basic computing unit in bitwise convolution, the control overhead of handling uneven subarray counts across PEs is negligible. An alternative is to use a lookup table (LUT) to mark the pruned subarrays and skip them during computation instead of removing them physically.
Each pruning type has its own advantages. Filter pruning and channel pruning have a propagation property: filter (channel) pruning removes not only the pruned weights but also the corresponding output (input) channels. Consequently, the corresponding channels (filters) in the next (previous) layer become redundant and can also be removed. Kernel pruning is especially tailored to the SOT-MRAM-based DNN computing system; compared to filter and channel pruning, it provides higher pruning flexibility, meaning it is easier to maintain network accuracy at the same pruning rate. None of the three incurs complicated control logic.
ADMM-based quantization is also used in our proposed framework. In each weight subarray, the number of rows equals the bit-length used to represent the weights, so quantizing the weights to fewer bits evenly reduces the number of rows in each subarray.
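To make the row-reduction effect concrete, here is a generic uniform quantizer, our own illustration rather than the paper's ADMM quantization rule: mapping weights onto $2^{\text{bits}}$ evenly spaced levels means each weight needs only `bits` rows in its subarray.

```python
import numpy as np

def quantize_uniform(W, bits, w_max=1.0):
    # map weights in [-w_max, w_max] onto 2**bits evenly spaced levels
    levels = 2 ** bits - 1
    Wc = np.clip(W, -w_max, w_max)
    # scale to [0, levels], round to an integer level, scale back
    return np.round((Wc + w_max) / (2 * w_max) * levels) / levels * (2 * w_max) - w_max

# 3-bit quantization uses at most 8 distinct weight values
q = quantize_uniform(np.linspace(-1, 1, 100), bits=3)
assert len(np.unique(q)) <= 8
```

Dropping from 8-bit to 3-bit representation here would shrink each weight subarray from 8 rows to 3, at the cost of coarser weight resolution.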
TABLE I: Accuracy and compression rate compared with baseline pruning works (ImageNet accuracies are top-1/top-5).

| Dataset (Network) | Method | Original Acc. | Pruned Acc. | Comp. Rate |
|---|---|---|---|---|
| MNIST (LeNet-5) | Group Scissor [21] | 99.15% | 99.14% | 4.2× |
| MNIST (LeNet-5) | Ours | 99.16% | 99.12% | 81.3× |
| CIFAR-10 (ResNet-18) | DCP [22] | 88.9% | 87.6% | 2.0× |
| CIFAR-10 (ResNet-18) | AMC [23] | 90.5% | 90.2% | 2.0× |
| CIFAR-10 (ResNet-18) | Ours | 94.1% | 93.2% | 59.8× |
| CIFAR-10 (VGG-16) | Iterative Pruning [24, 25] | 92.5% | 92.2% | 2.0× |
| CIFAR-10 (VGG-16) | 2PFPCE [26] | 92.9% | 92.8% | 4.0× |
| CIFAR-10 (VGG-16) | Ours | 93.7% | 93.3% | 44.8× |
| ImageNet (AlexNet) | Deep Compression [27] | 57.2/82.2% | 57.2/80.3% | 2.7× |
| ImageNet (AlexNet) | NeST [28] | 57.2/82.2% | 57.2/80.3% | 4.2× |
| ImageNet (AlexNet) | Ours | 57.4/82.4% | 57.3/82.2% | 5.2× |
| ImageNet (ResNet-18) | Network Slimming [29] | 68.9/88.7% | 67.2/87.4% | 1.4× |
| ImageNet (ResNet-18) | DCP [22] | 69.6/88.9% | 69.2/88.8% | 3.3× |
| ImageNet (ResNet-18) | Ours | 69.9/89.1% | 69.1/88.4% | 3.0× |
| ImageNet (ResNet-50) | Soft Filter Pruning [30] | 76.1/92.8% | 74.6/92.1% | 1.7× |
| ImageNet (ResNet-50) | ThiNet [31] | 72.9/91.1% | 68.4/88.3% | 3.3× |
| ImageNet (ResNet-50) | Ours | 76.0/92.8% | 75.5/92.3% | 2.7× |
III. Experimental Results
In our experiments, the compressed models are based on four widely used network structures, LeNet-5 [32], AlexNet [1], VGG-16 [33], and ResNet-18/50 [34], and are trained on a server with eight NVIDIA RTX 2080 Ti GPUs using PyTorch [35]. For the hardware results, we choose a 32nm CMOS technology for the peripheral circuits. CACTI 7 [36] is utilized to compute the energy and area of buffers and on-chip interconnects, and the NVSim platform [37] with a modified SOT-MRAM configuration is used to model the SOT-MRAM subarrays. The power and area results of the ADC are taken from [38]. Several groups of experiments are performed; for each dataset and network we only show the result achieving the highest compression rate with minor accuracy degradation. Our results are based on 8-bit quantization and a combined pruning scheme (all three pruning types used simultaneously). Table I shows that on the MNIST dataset using LeNet-5 we achieve 81.3× compression without accuracy degradation, which is 19.4× higher than Group Scissor [21]. On the CIFAR-10 dataset, our compression rates reach 59.8× and 44.8× on the ResNet-18 and VGG-16 networks with minor accuracy degradation. On ImageNet, our compression rates for AlexNet, ResNet-18, and ResNet-50 are 5.2×, 3.0×, and 2.7×, respectively.
By applying our framework, the power and area of the SOT-MRAM PIM engine can be significantly reduced and the overall system throughput improved compared to the uncompressed design, as shown in Figure 3. We observe that channel pruning usually contributes more power and area savings than filter and kernel pruning, since it can remove an entire PE along with its peripheral circuits. On the other hand, filter and kernel pruning reduce the computing iterations between subarrays, which improves the overall throughput.
IV. Conclusion
In this paper, we propose an ultra-energy-efficient framework using model compression techniques, including weight pruning and quantization, at the algorithm level while considering the architecture of SOT-MRAM PIM. We incorporate ADMM into the training phase to guarantee solution feasibility and satisfy SOT-MRAM hardware constraints. The experimental results show that the accuracy and pruning rate of our framework consistently outperform the baseline works. Consequently, the area and power consumption of SOT-MRAM PIM can be significantly reduced, while the overall system throughput is also improved dramatically.
References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in NeurIPS, 2012, pp. 1097–1105.
[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in ICML, 2016, pp. 173–182.
[3] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 160–167.
[4] G. Yuan, X. Ma, C. Ding, S. Lin, T. Zhang, Z. S. Jalali, Y. Zhao, L. Jiang, S. Soundarajan, and Y. Wang, "An ultra-efficient memristor-based DNN framework with structured weight pruning and quantization using ADMM," in 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED). IEEE, 2019, pp. 1–6.
[5] G. Yuan, C. Ding, R. Cai, X. Ma, Z. Zhao, A. Ren, B. Yuan, and Y. Wang, "Memristor crossbar-based ultra-efficient next-generation baseband processors," in 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, 2017, pp. 1121–1124.
[6] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan et al., "CirCNN: Accelerating and compressing deep neural networks using block-circulant weight matrices," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 395–408.
[7] S. Umesh and S. Mittal, "A survey of spintronic architectures for processing-in-memory and neural networks," Journal of Systems Architecture, 2018.
[8] X. Ma, Y. Zhang, G. Yuan, A. Ren, Z. Li, J. Han, J. Hu, and Y. Wang, "An area and energy efficient design of domain-wall memory-based deep convolutional neural networks using stochastic computing," in 2018 19th International Symposium on Quality Electronic Design (ISQED). IEEE, 2018, pp. 314–321.
[9] Y. Wang, C. Ding, G. Yuan, S. Liao, Z. Li, X. Ma, B. Yuan, X. Qian, J. Tang, Q. Qiu, and X. Lin, "Towards ultra-high performance and energy efficiency of deep learning systems: An algorithm-hardware co-optimization framework," in AAAI Conference on Artificial Intelligence (AAAI-18). AAAI, 2018.
[10] S. Angizi, Z. He, F. Parveen, and D. Fan, "IMCE: Energy-efficient bit-wise in-memory convolution engine for deep neural network," in Proceedings of the 23rd Asia and South Pacific Design Automation Conference. IEEE Press, 2018, pp. 111–116.
[11] A. Roohi, S. Angizi, D. Fan, and R. F. DeMara, "Processing-in-memory acceleration of convolutional neural networks for energy-efficiency and power-intermittency resilience," in 20th ISQED. IEEE, 2019.
[12] S. Lin, X. Ma, S. Ye, G. Yuan, K. Ma, and Y. Wang, "Toward extremely low bit and lossless accuracy in DNNs with progressive ADMM," arXiv preprint arXiv:1905.00789, 2019.
[13] X. Ma, G. Yuan, S. Lin, Z. Li, H. Sun, and Y. Wang, "ResNet can be pruned 60x: Introducing network purification and unused path removal (P-RM) after weight pruning," arXiv preprint arXiv:1905.00136, 2019.
[14] C. Ding, A. Ren, G. Yuan, X. Ma, J. Li, N. Liu, B. Yuan, and Y. Wang, "Structured weight matrices-based hardware accelerators in deep neural networks: FPGAs and ASICs," in Proceedings of the 2018 Great Lakes Symposium on VLSI. ACM, 2018, pp. 353–358.
[15] X. Ma, F.-M. Guo, W. Niu, X. Lin, J. Tang, K. Ma, B. Ren, and Y. Wang, "PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices," arXiv preprint arXiv:1909.05073, 2019.
[16] N. Liu, X. Ma, Z. Xu, Y. Wang, J. Tang, and J. Ye, "AutoSlim: An automatic DNN structured pruning framework for ultra-high compression rates," arXiv preprint arXiv:1907.03141, 2019.
[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[18] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang, "A systematic DNN weight pruning framework using alternating direction method of multipliers," in Proceedings of ECCV, 2018, pp. 184–199.
[19] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[20] D. Fan and S. Angizi, "Energy efficient in-memory binary deep neural network accelerator with dual-mode SOT-MRAM," in Proceedings of ICCD 2017, 2017.
[21] Y. Wang, W. Wen, B. Liu, D. Chiarulli, and H. Li, "Group Scissor: Scaling neuromorphic computing design to large neural networks," in DAC. IEEE, 2017.
[22] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu, "Discrimination-aware channel pruning for deep neural networks," in Advances in NeurIPS, 2018, pp. 875–886.
[23] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, "AMC: AutoML for model compression and acceleration on mobile devices," in ECCV. Springer, 2018, pp. 815–832.
[24] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in NeurIPS, 2015.
[25] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," arXiv preprint arXiv:1810.05270, 2018.
[26] C. Min, A. Wang, Y. Chen, W. Xu, and X. Chen, "2PFPCE: Two-phase filter pruning based on conditional entropy," arXiv preprint arXiv:1809.02220, 2018.
[27] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[28] X. Dai, H. Yin, and N. K. Jha, "NeST: A neural network synthesis tool based on a grow-and-prune paradigm," arXiv preprint arXiv:1711.02017, 2017.
[29] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2736–2744.
[30] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, "Soft filter pruning for accelerating deep convolutional neural networks," in International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 2234–2240.
[31] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5058–5066.
[32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[33] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[34] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on CVPR, 2016.
[35] A. Paszke, S. Gross, S. Chintala, and G. Chanan, "PyTorch," 2017.
[36] R. Balasubramonian, A. B. Kahng, et al., "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM TACO, 2017.
[37] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," IEEE TCAD, vol. 31, no. 7, 2012.
[38] B. Murmann, "ADC performance survey 1997–2019," [Online]. Available: http://web.stanford.edu/~murmann/adcsurvey.html