To address these challenges, spin-orbit-torque magnetic random-access memory (SOT-MRAM) has been adopted to reduce the power consumption of DNNs, since it offers near-zero standby power, high density, and non-volatility. Combined with in-memory computing techniques [8, 9], a SOT-MRAM-based processing-in-memory (PIM) engine can perform arithmetic and logic computations in parallel. Therefore, the most intensive operation, the matrix multiplication and accumulation (MAC) in both convolutional (CONV) and fully-connected (FC) layers, can be implemented using bit-wise parallel AND, bit-count, bit-shift, etc.
However, SOT-MRAM PIM also has drawbacks that decrease its popularity as an energy-efficient DNN accelerator. To mitigate these drawbacks at the software level, weight quantization has been introduced to SOT-MRAM PIM. By using a binarized weight representation, data can be processed efficiently within SOT-MRAM, greatly reducing power-hungry, long-distance data communication. However, weight binarization causes accuracy degradation, especially for large-scale DNNs. In reality, many scenarios, e.g., self-driving cars, require accuracy as high as possible. Thus, a binarized weight representation is not favorable.
In this work, we propose an ultra-energy-efficient framework using model compression techniques [12, 13, 14, 15, 16], including weight pruning and quantization, at the software level, considering the architecture of SOT-MRAM PIM. To further guarantee solution feasibility and satisfy the SOT-MRAM hardware constraints while providing high solution quality (no obvious accuracy degradation after model compression and hardware mapping), we incorporate the alternating direction method of multipliers (ADMM) into the training phase. As a result, we can reduce the footprint and power consumption of SOT-MRAM PIM and improve the overall system throughput, making our proposed ADMM-based SOT-MRAM PIM more energy efficient and suitable for embedded systems or IoT devices.
In the remainder of this paper, we first illustrate how ADMM connects to our model compression techniques to achieve deeply compressed models tailored to SOT-MRAM PIM designs. Then we introduce the architecture and mechanism of SOT-MRAM PIM and how to map our compressed model onto it. Finally, we evaluate our proposed framework on different networks. The experimental results show that the accuracy and compression rate of our framework consistently outperform the baseline works, and that the efficiency (area and power) and throughput of the SOT-MRAM PIM engine can be significantly improved.
II. A Unified and Systematic Model Compression Framework for Efficient SOT-MRAM-Based PIM Engine Design
In this section, we propose a unified and systematic framework for DNN model compression, which simultaneously and efficiently achieves DNN weight pruning and quantization. By reformulating the pruning and quantization problems as optimization problems, the proposed framework solves the structured pruning and low-bit quantization problems iteratively and analytically by extending the powerful ADMM algorithm.
In the meantime, our structured pruning model has a unique (i.e., tiny and regular) spatial property that naturally fits the utilization of the SOT-MRAM-based processing-in-memory engine, thereby successfully bridging the gap between large-scale DNN inference and platforms with limited computing ability. After mapping the compressed DNN model onto the SOT-MRAM-based processing-in-memory engine, bit-wise convolution can be executed with consistently high efficiency.
II-A. DNN Weight Pruning using ADMM
Consider an $N$-layer DNN of interest, where the first $N_1$ layers are CONV layers and the rest are FC layers. The weights and biases of the $i$-th layer are respectively denoted by $\mathbf{W}_i$ and $\mathbf{b}_i$, and the loss function associated with the DNN is denoted by $f(\{\mathbf{W}_i\}, \{\mathbf{b}_i\})$. In this paper, $\{\mathbf{W}_i\}$ and $\{\mathbf{b}_i\}$ respectively characterize the collection of weights and biases from layer $1$ to layer $N$.
In this paper, our objective is to implement structured pruning on the DNN. In the following discussion we focus on the CONV layers because they have the highest computation requirements. More specifically, we minimize the loss function subject to specific structured sparsity constraints on the weights in the CONV layers, i.e.,
$$\min_{\{\mathbf{W}_i\}, \{\mathbf{b}_i\}} f(\{\mathbf{W}_i\}, \{\mathbf{b}_i\}), \quad \text{subject to } \mathbf{W}_i \in \mathcal{S}_i, \; i = 1, \ldots, N_1, \tag{1}$$
where $\mathcal{S}_i$ is the set of $\mathbf{W}_i$ with the desired "structure". Next we introduce constraint sets corresponding to different types of structured sparsity to facilitate the SOT-MRAM PIM engine implementation.
The collection of weights in the $i$-th CONV layer is a four-dimensional tensor, i.e., $\mathbf{W}_i \in \mathbb{R}^{A_i \times B_i \times H_i \times W_i}$, where $A_i$, $B_i$, $H_i$, and $W_i$ are respectively the number of filters, the number of channels in a filter, the height of the filter, and the width of the filter in layer $i$.
Filter-wise structured sparsity: When we train a DNN with sparsity at the filter level, the constraint on the weights in the $i$-th CONV layer is given by $\mathcal{S}_i := \{\mathbf{W}_i \mid \text{the number of nonzero filters in } \mathbf{W}_i \text{ is at most } \alpha_i\}$. Here, a nonzero filter means that the filter contains some nonzero weight.
Channel-wise structured sparsity: When we train a DNN with sparsity at the channel level, the constraint on the weights in the $i$-th CONV layer is given by $\mathcal{S}_i := \{\mathbf{W}_i \mid \text{the number of nonzero channels in } \mathbf{W}_i \text{ is at most } \beta_i\}$. Here, we call the $j$-th channel nonzero if $\mathbf{W}_i(:, j, :, :)$ contains some nonzero element.
Kernel-wise structured sparsity: When we train a DNN with sparsity at the kernel level, the constraint on the weights in the $i$-th CONV layer is given by $\mathcal{S}_i := \{\mathbf{W}_i \mid \text{the number of nonzero kernels in } \mathbf{W}_i \text{ is at most } \gamma_i\}$, where the $(j, k)$-th kernel $\mathbf{W}_i(j, k, :, :)$ is called nonzero if it contains some nonzero element.
To solve the problem, we propose a systematic framework of dynamic ADMM regularization with masked mapping and retraining steps. Through this integration, we can guarantee solution feasibility (satisfying all constraints) while providing high solution quality.
II-B. Solution to the DNN Pruning Problem
Corresponding to every set $\mathcal{S}_i$, we define the indicator function
$$g_i(\mathbf{W}_i) = \begin{cases} 0 & \text{if } \mathbf{W}_i \in \mathcal{S}_i, \\ +\infty & \text{otherwise.} \end{cases}$$
Furthermore, we incorporate auxiliary variables $\mathbf{Z}_i$, $i = 1, \ldots, N_1$. The original problem (1) is then equivalent to
$$\min_{\{\mathbf{W}_i\}, \{\mathbf{b}_i\}} f(\{\mathbf{W}_i\}, \{\mathbf{b}_i\}) + \sum_{i=1}^{N_1} g_i(\mathbf{Z}_i), \quad \text{subject to } \mathbf{W}_i = \mathbf{Z}_i, \; i = 1, \ldots, N_1. \tag{2}$$
Applying ADMM, problem (2) decomposes into two subproblems that are solved iteratively. The first subproblem is
$$\min_{\{\mathbf{W}_i\}, \{\mathbf{b}_i\}} f(\{\mathbf{W}_i\}, \{\mathbf{b}_i\}) + \sum_{i=1}^{N_1} \frac{\rho_i}{2} \|\mathbf{W}_i - \mathbf{Z}_i^{k} + \mathbf{U}_i^{k}\|_F^2, \tag{3}$$
where $\mathbf{U}_i^{k}$ is the dual variable at iteration $k$. The first term in the objective function of (3) is the differentiable loss function of the DNN, and the second term is a quadratic regularization of the $\mathbf{W}_i$'s, which is differentiable and convex. As a result, (3) can be solved by standard SGD. Although we cannot guarantee global optimality, this is due to the non-convexity of the DNN loss function rather than the quadratic term introduced by our method. Please note that this subproblem and its solution are the same for all types of structured sparsity.
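As a minimal sketch of how the quadratic term in (3) enters SGD, the gradient seen by the optimizer is simply the loss gradient plus $\rho(\mathbf{W} - \mathbf{Z}^k + \mathbf{U}^k)$. The function name and the toy loss below are illustrative, not the authors' code:

```python
import numpy as np

def admm_regularized_grad(W, Z, U, loss_grad, rho=1e-3):
    """Gradient of f(W) + (rho/2)*||W - Z + U||_F^2 with respect to W.

    loss_grad(W) returns the gradient of the (non-convex) DNN loss;
    the quadratic ADMM term contributes the extra rho*(W - Z + U),
    which is differentiable and convex, so plain SGD applies.
    """
    return loss_grad(W) + rho * (W - Z + U)

# Toy example with f(W) = 0.5*||W||_F^2, so loss_grad(W) = W.
W = np.ones((2, 2))
Z = np.zeros((2, 2))
U = np.zeros((2, 2))
g = admm_regularized_grad(W, Z, U, loss_grad=lambda w: w, rho=0.5)
# g = W + 0.5*(W - Z + U) = 1.5 * W
```

In a full training loop, one would alternate this SGD step with the $\mathbf{Z}$-update (the projection described below) and the dual update $\mathbf{U}^{k+1} = \mathbf{U}^k + \mathbf{W}^{k+1} - \mathbf{Z}^{k+1}$.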
The second subproblem is
$$\min_{\{\mathbf{Z}_i\}} \sum_{i=1}^{N_1} g_i(\mathbf{Z}_i) + \sum_{i=1}^{N_1} \frac{\rho_i}{2} \|\mathbf{W}_i^{k+1} - \mathbf{Z}_i + \mathbf{U}_i^{k}\|_F^2. \tag{4}$$
Note that $g_i$ is the indicator function of $\mathcal{S}_i$; thus this subproblem can be solved analytically and optimally. For $i = 1, \ldots, N_1$, the optimal solution is the Euclidean projection of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k}$ onto $\mathcal{S}_i$. The set $\mathcal{S}_i$ differs when we apply different types of structured sparsity, and the corresponding Euclidean projections are described next.
Solving the second subproblem for different structured sparsities: For filter-wise structured sparsity constraints, we first calculate
$$O_j = \|\mathbf{W}_i^{k+1}(j, :, :, :) + \mathbf{U}_i^{k}(j, :, :, :)\|_F^2$$
for $j = 1, \ldots, A_i$, where $\|\cdot\|_F$ denotes the Frobenius norm. We then keep the elements corresponding to the $\alpha_i$ largest values in $\{O_j\}$ and set the rest to zero.
For channel-wise structured sparsity, we first calculate
$$O_j = \|\mathbf{W}_i^{k+1}(:, j, :, :) + \mathbf{U}_i^{k}(:, j, :, :)\|_F^2$$
for $j = 1, \ldots, B_i$. We then keep the elements corresponding to the $\beta_i$ largest values in $\{O_j\}$ and set the rest to zero.
For kernel-wise structured sparsity, we first calculate
$$O_{j,k'} = \|\mathbf{W}_i^{k+1}(j, k', :, :) + \mathbf{U}_i^{k}(j, k', :, :)\|_F^2$$
for $j = 1, \ldots, A_i$ and $k' = 1, \ldots, B_i$. We then keep the elements corresponding to the $\gamma_i$ largest values in $\{O_{j,k'}\}$ and set the rest to zero.
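The three Euclidean projections can be sketched in a few lines of NumPy. This is an illustration under the paper's notation, not the authors' implementation; `project_structured` and its arguments are hypothetical names:

```python
import numpy as np

def project_structured(W, num_keep, mode="filter"):
    """Euclidean projection of a 4-D weight tensor W
    (filters x channels x height x width) onto a structured-sparsity set:
    keep the `num_keep` groups with the largest Frobenius norm, zero the rest.
    """
    A, B, H, Wd = W.shape
    out = np.zeros_like(W)
    if mode == "filter":
        norms = np.array([np.linalg.norm(W[j]) for j in range(A)])
        keep = np.argsort(norms)[-num_keep:]        # indices of largest filters
        out[keep] = W[keep]
    elif mode == "channel":
        norms = np.array([np.linalg.norm(W[:, j]) for j in range(B)])
        keep = np.argsort(norms)[-num_keep:]        # indices of largest channels
        out[:, keep] = W[:, keep]
    elif mode == "kernel":
        norms = np.linalg.norm(W.reshape(A, B, -1), axis=2)  # (A, B) kernel norms
        flat = norms.ravel()
        mask = np.zeros_like(flat, dtype=bool)
        mask[np.argsort(flat)[-num_keep:]] = True   # largest kernels survive
        out[mask.reshape(A, B)] = W[mask.reshape(A, B)]
    return out
```

In the ADMM iteration, this projection would be applied to $\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k}$ to obtain $\mathbf{Z}_i^{k+1}$.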
II-C. SOT-MRAM Processing-In-Memory Engine for DNN
The main purpose of the SOT-MRAM PIM engine is to convert the MAC operations in convolutional layers into a bit-wise convolution format and perform them in memory. Bit-wise convolution includes four main steps: parallel AND, bit-count, bit-shift, and accumulation. They can be formulated as
$$\sum \mathbf{I} \cdot \mathbf{W} = \sum_{m=0}^{B_I - 1} \sum_{n=0}^{B_W - 1} 2^{m+n} \, \mathrm{bitcount}\big(\mathrm{AND}(\mathbf{I}_m, \mathbf{W}_n)\big), \tag{5}$$
where $B_I$ and $B_W$ stand for the bit-lengths of inputs and weights respectively, $\mathbf{I}_m$ represents the $m$-th bit of all inputs in $\mathbf{I}$, and $\mathbf{W}_n$ contains the $n$-th bit of all weights in $\mathbf{W}$.
Consider a CONV layer with a 2×2 kernel size, where $\mathbf{W}$ contains the 4 weights of a kernel and $\mathbf{I}$ contains the 4 current inputs covered by this kernel. We assume both weights and inputs use a 3-bit representation. From Figure 1 we can see that inputs and weights are mapped to two different sub-arrays. During the computation, two rows are selected from the two sub-arrays each time, and the parallel AND results are obtained from the sense amplifiers. Each row in one sub-array conducts a parallel AND operation with all the rows of the other sub-array. Every AND result then goes through the bit-count and shifter units and is accumulated with the other results to form a bit-wise convolution result.
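The four bit-wise steps can be checked in software with a small sketch (a pure-Python illustration with a hypothetical function name, not a hardware model):

```python
def bitwise_mac(inputs, weights, n_bits=3):
    """Bit-wise MAC as in Eqn. (5): decompose unsigned n_bits values into
    bit-planes, AND the planes, bit-count, shift by the combined bit
    significance, and accumulate."""
    acc = 0
    for m in range(n_bits):              # input bit position
        for n in range(n_bits):          # weight bit position
            i_plane = [(x >> m) & 1 for x in inputs]    # m-th bit of all inputs
            w_plane = [(w >> n) & 1 for w in weights]   # n-th bit of all weights
            and_res = [a & b for a, b in zip(i_plane, w_plane)]  # parallel AND
            acc += sum(and_res) << (m + n)              # bit-count, shift, accumulate
    return acc

# A 2x2 kernel with 3-bit inputs and weights (hypothetical values):
inputs = [1, 2, 3, 4]
weights = [5, 6, 7, 0]
assert bitwise_mac(inputs, weights) == sum(i * w for i, w in zip(inputs, weights))
```

The final assertion confirms that summing bit-counted AND planes with the proper shifts reproduces the ordinary dot product.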
II-D. Framework Mapping
In our proposed framework, based on the architecture of the SOT-MRAM-based PIM engine, we incorporate structured pruning and quantization using ADMM to effectively reduce the PIM engine's area and power. Quantization can be integrated into the same ADMM-based framework with different constraints; we omit the details of ADMM quantization due to the space limit. Please note that our ADMM-based framework can achieve weight pruning and quantization simultaneously or separately. Moreover, the overall throughput of the SOT-MRAM-based DNN computing system can be improved as well. The SOT-MRAM-based PIM engine contains several processing elements (PEs), and each PE consists of a column decoder, a row decoder, one computing set, and multiple SOT-MRAM sub-arrays, as shown in Figure 2(a), which also shows how we map the inputs and weights onto the PIM engine. In each PE, the inputs are mapped onto one sub-array, and every other sub-array accommodates the weights of a different filter from the same input channel. For example, the computing set computes the convolution of the inputs in the input sub-array with the weights in each of the weight sub-arrays. The number of columns and rows in each sub-array depends on the kernel size of the network and on the bit-lengths of the inputs and weights, respectively. All PEs are able to work in parallel and independently.
In Figure 2(b), examples show the structured pruning types used in our proposed framework and how they facilitate the reduction of the PIM engine size. For filter pruning, all the sub-arrays on the same row (i.e., storing the weights from the same filter) can be removed; thus, the number of sub-arrays contained in each PE is reduced. For channel pruning, since the pruned channels are no longer needed, the number of required PEs can be reduced without decreasing the throughput. Since each sub-array stores the weights from one channel of one filter, which is also considered the weights of one convolution kernel, kernel pruning removes all the weights on a sub-array. By applying filter pruning and channel pruning, all the pruned sub-arrays or PEs can be physically removed directly. However, removing sub-arrays through kernel pruning may cause uneven sizes across PEs. Since all the sub-arrays have the same size and a sub-array is considered a basic computing unit in bit-wise convolution, the control overhead for addressing uneven sub-array numbers in PEs is negligible. An alternative is to use a look-up table (LUT) to mark the pruned sub-arrays and skip them during computation instead of removing them physically.
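The LUT-based skipping alternative can be sketched as follows. This is a software illustration with hypothetical names; in the real engine, the skipping would be done by control logic rather than Python:

```python
def convolve_with_lut(pe_subarrays, pruned_lut, input_subarray, mac):
    """Compute per-sub-array MAC results for one PE, consulting a LUT
    that marks kernel-pruned weight sub-arrays so they are skipped
    instead of computed. `mac` is any MAC routine over two vectors."""
    results = {}
    for idx, weights in pe_subarrays.items():
        if pruned_lut.get(idx, False):   # sub-array pruned by kernel pruning
            continue                      # skip it entirely
        results[idx] = mac(input_subarray, weights)
    return results

# Hypothetical PE with two weight sub-arrays; sub-array 1 is kernel-pruned.
dot = lambda i, w: sum(a * b for a, b in zip(i, w))
out = convolve_with_lut({0: [1, 0, 1, 0], 1: [2, 2, 2, 2]},
                        {1: True}, [1, 2, 3, 4], dot)
# out == {0: 4}
```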
Each pruning type has its own advantages. Filter pruning and channel pruning have a propagation property: filter pruning (channel pruning) not only removes the pruned weights but also removes the corresponding output channels (input channels). Taking advantage of this, the corresponding channels (filters) in the next (previous) layer become redundant and can also be removed. Kernel pruning is especially tailored to the SOT-MRAM-based DNN computing system. Compared to filter and channel pruning, kernel pruning provides higher pruning flexibility, which means it is easier to maintain network accuracy under the same pruning rate. None of them incurs complicated control logic.
ADMM-based quantization is also used in our proposed framework. In each weight sub-array, the number of rows equals the bit-length used to represent the weights. The number of rows in each sub-array can be evenly reduced by quantizing the weights to fewer bits.
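As an illustration of how fewer bits shrink the sub-arrays, here is a minimal uniform quantizer (hypothetical name; the actual ADMM quantization constraints are omitted, as in the text). With `n_bits` levels per weight, each weight sub-array would need only `n_bits` rows:

```python
import numpy as np

def quantize_uniform(W, n_bits, w_max=None):
    """Uniformly quantize non-negative weights to 2**n_bits levels.
    In the PIM engine, each weight sub-array stores one row per bit,
    so reducing n_bits removes rows proportionally."""
    if w_max is None:
        w_max = np.abs(W).max()
    levels = 2 ** n_bits - 1
    q = np.round(np.clip(W, 0, w_max) / w_max * levels)  # integer codes
    return q * w_max / levels                            # dequantized values

W = np.array([0.0, 0.3, 0.7, 1.0])
Wq = quantize_uniform(W, n_bits=2)  # 4 levels: 0, 1/3, 2/3, 1
```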
|Method||Baseline Accuracy||Compressed Accuracy||Compression Rate|
|Group Scissor||99.15%||99.14%||4.2×|
|Iterative Pruning [24, 25]||92.5%||92.2%||2.0×|
|Deep Compression||57.2/82.2%||57.2/80.3%||2.7×|
|Network Slimming||68.9/88.7%||67.2/87.4%||1.4×|
|Soft Filter Pruning||76.1/92.8%||74.6/92.1%||1.7×|
III. Experimental Results
Our networks are trained on a server with eight NVIDIA RTX-2080Ti GPUs using PyTorch. For the hardware results, we choose a 32nm CMOS technology for the peripheral circuits. Cacti 7 is utilized to compute the energy and area of buffers and on-chip interconnects. The NVSim platform with a modified SOT-MRAM configuration is used to model the SOT-MRAM sub-arrays. The power and area results of the ADC are taken from Murmann's ADC performance survey.
Several groups of experiments were performed; we show only one result for each dataset and network, the one that achieves the highest compression rate with minor accuracy degradation. Our results are based on 8-bit quantization, and we use a combined pruning scheme (i.e., all three pruning types are used simultaneously). Table I shows that our result on the MNIST dataset using LeNet-5 achieves 81.3× compression without accuracy degradation, which is 19.4× higher than Group Scissor. On the CIFAR-10 dataset, our compression rates reach 59.8× and 44.8× on the ResNet-18 and VGG-16 networks with minor accuracy degradation. On ImageNet, our compression rates for AlexNet, ResNet-18, and VGG-16 are 5.2×, 3.0×, and 2.7×, respectively.
By applying our framework, the power and area of the SOT-MRAM PIM engine can be significantly reduced, and the overall system throughput can also be improved compared to the uncompressed design, as shown in Figure 3. From our observations, channel pruning usually contributes more power and area savings than filter and kernel pruning, since it can remove an entire PE together with its peripheral circuits. On the other hand, filter and kernel pruning reduce the computing iterations between sub-arrays, which improves the overall throughput.
In this paper, we propose an ultra-energy-efficient framework using model compression techniques, including weight pruning and quantization, at the algorithm level, considering the architecture of SOT-MRAM PIM. We incorporate ADMM into the training phase to further guarantee solution feasibility and satisfy the SOT-MRAM hardware constraints. The experimental results show that the accuracy and pruning rate of our framework consistently outperform the baseline works. Consequently, the area and power consumption of SOT-MRAM PIM can be significantly reduced, while the overall system throughput is also improved dramatically.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in NeurIPS, 2012, pp. 1097–1105.
-  D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in ICML, 2016, pp. 173–182.
-  R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 160–167.
-  G. Yuan, X. Ma, C. Ding, S. Lin, T. Zhang, Z. S. Jalali, Y. Zhao, L. Jiang, S. Soundarajan, and Y. Wang, “An ultra-efficient memristor-based dnn framework with structured weight pruning and quantization using admm,” in 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED). IEEE, 2019, pp. 1–6.
-  G. Yuan, C. Ding, R. Cai, X. Ma, Z. Zhao, A. Ren, B. Yuan, and Y. Wang, “Memristor crossbar-based ultra-efficient next-generation baseband processors,” in 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, 2017, pp. 1121–1124.
-  C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan et al., “CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 395–408.
-  S. Umesh and S. Mittal, “A survey of spintronic architectures for processing-in-memory and neural networks,” Journal of Systems Architecture, 2018.
-  X. Ma, Y. Zhang, G. Yuan, A. Ren, Z. Li, J. Han, J. Hu, and Y. Wang, “An area and energy efficient design of domain-wall memory-based deep convolutional neural networks using stochastic computing,” in 2018 19th International Symposium on Quality Electronic Design (ISQED). IEEE, 2018, pp. 314–321.
-  Y. Wang, C. Ding, G. Yuan, S. Liao, Z. Li, X. Ma, B. Yuan, X. Qian, J. Tang, Q. Qiu, and X. Lin, “Towards ultra-high performance and energy efficiency of deep learning systems: an algorithm-hardware co-optimization framework,” in AAAI Conference on Artificial Intelligence (AAAI-18). AAAI, 2018.
-  S. Angizi, Z. He, F. Parveen, and D. Fan, “Imce: energy-efficient bit-wise in-memory convolution engine for deep neural network,” in Proceedings of the 23rd Asia and South Pacific Design Automation Conference. IEEE Press, 2018, pp. 111–116.
-  A. Roohi, S. Angizi, D. Fan, and R. F. DeMara, “Processing-in-memory acceleration of convolutional neural networks for energy-efficiency, and power-intermittency resilience,” in 20th ISQED. IEEE, 2019.
-  S. Lin, X. Ma, S. Ye, G. Yuan, K. Ma, and Y. Wang, “Toward extremely low bit and lossless accuracy in dnns with progressive admm,” arXiv preprint arXiv:1905.00789, 2019.
-  X. Ma, G. Yuan, S. Lin, Z. Li, H. Sun, and Y. Wang, “Resnet can be pruned 60x: Introducing network purification and unused path removal (p-rm) after weight pruning,” arXiv preprint arXiv:1905.00136, 2019.
-  C. Ding, A. Ren, G. Yuan, X. Ma, J. Li, N. Liu, B. Yuan, and Y. Wang, “Structured weight matrices-based hardware accelerators in deep neural networks: Fpgas and asics,” in Proceedings of the 2018 on Great Lakes Symposium on VLSI. ACM, 2018, pp. 353–358.
-  X. Ma, F.-M. Guo, W. Niu, X. Lin, J. Tang, K. Ma, B. Ren, and Y. Wang, “Pconv: The missing but desirable sparsity in dnn weight pruning for real-time execution on mobile devices,” arXiv preprint arXiv:1909.05073, 2019.
-  N. Liu, X. Ma, Z. Xu, Y. Wang, J. Tang, and J. Ye, “Autoslim: An automatic dnn structured pruning framework for ultra-high compression rates,” arXiv preprint arXiv:1907.03141, 2019.
-  S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
-  T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang, “A systematic dnn weight pruning framework using alternating direction method of multipliers,” in Proceedings of ECCV, 2018, pp. 184–199.
-  D. Fan and S. Angizi, “Energy efficient in-memory binary deep neural network accelerator with dual-mode SOT-MRAM,” in Proceedings, ICCD 2017, 2017.
-  Y. Wang, W. Wen, B. Liu, D. Chiarulli, and H. Li, “Group scissor: Scaling neuromorphic computing design to large neural networks,” in DAC. IEEE, 2017.
-  Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu, “Discrimination-aware channel pruning for deep neural networks,” in Advances in NeurIPS, 2018, pp. 875–886.
-  Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” in ECCV. Springer, 2018, pp. 815–832.
-  S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in NeurIPS, 2015.
-  Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the value of network pruning,” arXiv preprint arXiv:1810.05270, 2018.
-  C. Min, A. Wang, Y. Chen, W. Xu, and X. Chen, “2pfpce: Two-phase filter pruning based on conditional entropy,” arXiv preprint arXiv:1809.02220, 2018.
-  S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
-  X. Dai, H. Yin, and N. K. Jha, “Nest: a neural network synthesis tool based on a grow-and-prune paradigm,” arXiv preprint arXiv:1711.02017, 2017.
-  Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2736–2744.
-  Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, “Soft filter pruning for accelerating deep convolutional neural networks,” in International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 2234–2240.
-  J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5058–5066.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on CVPR, 2016.
-  A. Paszke, S. Gross, S. Chintala, and G. Chanan, “Pytorch,” 2017.
-  R. Balasubramonian, A. B. Kahng, and et al., “Cacti 7: New tools for interconnect exploration in innovative off-chip memories,” ACM TACO, 2017.
-  X. Dong, C. Xu, S. Member, Y. Xie, S. Member, and N. P. Jouppi, “Nvsim: A circuit-level performance, energy, and area model for emerging nonvolatile memory,” IEEE TCAD, vol. 31, no. 7, 2012.
-  B. Murmann, “ADC performance survey 1997–2019,” [Online]. Available: http://web.stanford.edu/~murmann/adcsurvey.html.