1 Introduction
Structured weight pruning [1, 2, 3] and weight quantization [4, 5, 6] techniques have been developed to facilitate weight compression and computation acceleration, addressing the high demand for parallel computation and storage resources [7, 8, 9]. Even with compressed models, computation complexity still burdens the overall performance of state-of-the-art CMOS hardware applications.
To mitigate the bottleneck caused by CMOS-based DNN architectures, next-generation device/circuit technologies [10, 11] surpass CMOS in their non-volatility, high energy efficiency, in-memory computing capability and high scalability. The memristor crossbar device has shown its potential for bearing all of these characteristics, which makes it intrinsically suitable for large DNN hardware architecture design. A memristor crossbar can perform matrix-vector multiplication in the analog domain, and the computation is in O(1) time complexity [12, 13].
Motivated by the fact that there is no precedent model that is structurally pruned and quantized while also satisfying memristor hardware constraints, in this work a memristor-based ADMM regularized optimization method is applied to both structured pruning and weight quantization in order to mitigate the accuracy degradation during extreme model compression. A structured pruned model can potentially benefit high-parallelism implementation in crossbar architectures. Furthermore, quantized weights can reduce hardware imprecision during the read/write procedure, and save hardware footprint because fewer peripheral circuits are needed to support fewer bits. However, to achieve an ultra-high compression ratio, an ADMM pruning method [3, 14] alone cannot fully exploit all the redundancy in a neural network model. As a result, we design a hardware-software co-optimization framework in which we investigate Network Purification and Unused Path Removal after the procedure of structured weight pruning with ADMM. Moreover, we utilize distilled knowledge from a software model to guide our memristor hardware-constrained quantization. To the best of our knowledge, we are the first to combine extreme structured weight pruning and weight quantization in a unified and systematic memristor-based framework. We are also the first to discover the redundant weights and unused paths in a structured pruned DNN model, and we design a sophisticated co-optimization framework to boost the model compression rate while maintaining high network accuracy. By incorporating memristor hardware constraints into our model, our framework is guaranteed feasible for a real memristor crossbar device. The contributions of this paper include:

We adopt ADMM for efficiently optimizing the non-convex problem, and utilize this method for structured weight pruning.

We systematically investigate weight quantization on a pruned model with memristor hardware constraints.

We design a software-hardware co-optimization framework in which Network Purification and Unused Path Removal are first proposed.
We evaluate our proposed memristor framework on different networks. We conclude that the structured pruning method with memristor-based ADMM regularized optimization achieves a high compression ratio and desirable high accuracy. Software and hardware experimental results show that our memristor framework is very energy efficient and saves a great amount of hardware footprint.
2 Related Works
Heuristic weight pruning methods [15] are widely used in neuromorphic computing designs to reduce weight storage and computing delay [16]. The work in [16] implemented weight pruning techniques on a neuromorphic computing system using irregular pruning, which caused unbalanced workloads, greater circuit overheads and extra memory requirements for indices. To overcome these limitations, [17] proposed group connection deletion, which structurally prunes connections to reduce routing congestion between memristor crossbar arrays.
Weight quantization can mitigate hardware imperfections of memristors, including state drift and process variations, caused by the imperfect fabrication process or by the device features themselves [4, 5]. [18] presented a technique to reduce the overhead of Digital-to-Analog Converters (DACs)/Analog-to-Digital Converters (ADCs) in resistive random-access memory (ReRAM) neuromorphic computing systems. They first normalized the data, and then quantized the intermediary data to 1-bit values. These values can be directly used as the analog input for a ReRAM crossbar and, hence, avoid the need for DACs.
3 Background on Memristors
3.1 Memristor Crossbar Model
A memristor [10] crossbar is an array structure consisting of memristors, horizontal wordlines and vertical bitlines, as shown in Figure 1. Due to its outstanding performance in computing matrix-vector multiplications (MVM), memristor crossbars are widely used as dot-product accelerators in recent neuromorphic computing designs [19]. By programming the conductance state (also known as “memristance”) of each memristor, the weight matrix can be mapped onto the memristor crossbar. Given the input voltage vector, the MVM output current vector can be obtained in O(1) time complexity.
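The mapping and the analog dot product can be sketched numerically. The following is a minimal illustration, not the paper's actual device model: weights are linearly scaled into an assumed conductance range, and each output current is a sum of voltage-conductance products along a bitline (Kirchhoff's current law). The conductance bounds g_min and g_max are hypothetical placeholders.

```python
import numpy as np

def weights_to_conductance(W, g_min=1e-6, g_max=1e-4):
    """Linearly map a weight matrix onto a conductance range [g_min, g_max].
    The bounds are illustrative, not values of a specific device."""
    w_min, w_max = W.min(), W.max()
    return g_min + (W - w_min) * (g_max - g_min) / (w_max - w_min)

def crossbar_mvm(G, v):
    """Each output current i_k = sum_j G[k, j] * v[j] along a bitline,
    i.e. the crossbar computes an analog matrix-vector product."""
    return G @ v
```

Because every bitline accumulates its currents simultaneously, the whole matrix-vector product is produced in one analog step, which is the source of the O(1) time complexity claim.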
3.2 Challenges in Memristor Crossbars Implementation and Mitigation Techniques
Different from software-based designs, hardware imperfection is one of the key issues that cause non-ideal hardware behaviors, and it needs to be considered in memristor-based designs. The hardware imperfections of memristor devices mainly come from the imperfect fabrication process and the memristor features.
Process Variation. Process variation is one major hardware imperfection caused by fluctuations in the fabrication process. It mainly comes from line-edge roughness, oxide thickness fluctuations, and random dopant variations [20]. Inevitably, process variation plays an increasingly significant role as process technology scales down to the nanometer level. In a DNN hardware design, the non-ideal behaviors caused by process variations may lead to accuracy degradation.
State Drift. State drift is the phenomenon that the memristance changes after several reading operations [21]. It is known that a memristor is a thin-film device constructed from a region highly doped with oxygen vacancies and an undoped region. By nature, when an electric field is applied across the memristor over a period of time, the oxygen vacancies migrate in the direction of the electric field, which leads to the (memristance) state drift. Consequently, an error will occur when the state of a memristor drifts to another state level.
It has been shown that applying quantization in memristor-based designs can mitigate the undesired impacts caused by hardware imperfections [22].
4 A MemristorBased Highly Compressed DNN Framework
The memristor crossbar structure has shown its potential for neuromorphic computing systems compared to CMOS technologies [16]. Due to the great number of weights and computations involved in networks, an efficient and high-performance framework is needed to overcome the memory storage and energy consumption problems. We propose a unified memristor-based framework including memristor-based ADMM regularized optimization and masked mapping.
4.1 Problem Formulation
ADMM [23] is an advanced optimization technique which decomposes an original problem into subproblems that can be solved separately and iteratively. By adopting memristor-based ADMM regularized optimization, the framework can guarantee solution feasibility (satisfying memristor hardware constraints) while providing high solution quality (no obvious accuracy degradation after pruning).
First, the memristor-based ADMM regularized optimization starts from a pretrained full-size DNN model without compression. Consider an N-layer DNN, where the sets of weights and biases of the i-th (CONV or FC) layer are denoted by W_i and b_i, and the loss function associated with the DNN is denoted by f({W_i}, {b_i}). The overall problem is defined by

(1)  minimize_{ {W_i}, {b_i} }  f({W_i}, {b_i})
     subject to  W_i ∈ S_i,  W_i ∈ S'_i,  i = 1, ..., N.

Given the value of α_i, the memristor-based constraint sets are S_i = {W_i | the number of nonzero structures (filters, channels, or columns) in W_i is no more than α_i} and S'_i = {the weights in the i-th layer are mapped to the quantization values}, where α_i are predefined hyperparameters. The general constraint can be extended to structured pruning such as filter pruning, channel pruning and column pruning, which facilitates high-parallelism implementation in hardware.
Similarly, for weight quantization, elements in S'_i are the solutions of the projection onto the available memristor state values. Assume Q_i = {Q_{i,1}, ..., Q_{i,M_i}} is the set of available memristor state values, where M_i denotes the number of available quantization levels in layer i. Suppose Q_{i,j} indicates the j-th quantization level in layer i, which gives

(2)  Q_{i,j} = R_min + (j − 1) · (R_max − R_min) / (M_i − 1),  j = 1, ..., M_i,

where R_min and R_max are the minimum and maximum memristance values of a specified memristor device.
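A small sketch of this level construction, assuming the states of Eq. (2) are uniformly spaced between the minimum and maximum memristance; the numeric values used below are illustrative, not parameters of a real device.

```python
import numpy as np

def quantization_levels(r_min, r_max, m):
    """M uniformly spaced memristance levels between r_min and r_max,
    following the uniform-spacing form of Eq. (2)."""
    return r_min + np.arange(m) * (r_max - r_min) / (m - 1)

def quantize(w, levels):
    """Snap each weight to its nearest available state level."""
    idx = np.abs(w[..., None] - levels).argmin(axis=-1)
    return levels[idx]
```

For example, with m = 5 levels on [0, 1] the available states are {0, 0.25, 0.5, 0.75, 1}, and a weight of 0.6 snaps to 0.5.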
4.2 Memristor-Based ADMM Regularized Optimization Step
Corresponding to every memristor-based constraint set S_i and S'_i, indicator functions are utilized to incorporate S_i and S'_i into the objective function:

g_i(W_i) = 0 if W_i ∈ S_i, +∞ otherwise;   h_i(W_i) = 0 if W_i ∈ S'_i, +∞ otherwise,

for i = 1, ..., N. Substituting into (1), we get

(3)  minimize_{ {W_i}, {b_i} }  f({W_i}, {b_i}) + Σ_{i=1}^{N} g_i(Z_i) + Σ_{i=1}^{N} h_i(Y_i)
     subject to  W_i = Z_i,  W_i = Y_i,  i = 1, ..., N.

We incorporate auxiliary variables Z_i and Y_i, dual variables U_i and V_i, and the augmented Lagrangian formation of problem (3) is

(4)  L_ρ = f({W_i}, {b_i}) + Σ_{i=1}^{N} (ρ_i/2) ||W_i − Z_i + U_i||_F² + Σ_{i=1}^{N} (ρ_i/2) ||W_i − Y_i + V_i||_F².

We can see that the first term in problem (4) is the original DNN loss function, and the second and third terms are differentiable and convex. As a result, subproblem (4) can be solved by stochastic gradient descent [24] as in the original DNN training. The standard ADMM algorithm [23] proceeds by repeating, for k = 0, 1, ..., the following subproblem iterations:
(5)  {W_i^{k+1}, b_i^{k+1}} := argmin_{ {W_i}, {b_i} } L_ρ({W_i}, {b_i}, {Z_i^k}, {Y_i^k}, {U_i^k}, {V_i^k})
(6)  {Z_i^{k+1}, Y_i^{k+1}} := argmin_{ {Z_i}, {Y_i} } L_ρ({W_i^{k+1}}, {b_i^{k+1}}, {Z_i}, {Y_i}, {U_i^k}, {V_i^k})
(7)  U_i^{k+1} := U_i^k + W_i^{k+1} − Z_i^{k+1};   V_i^{k+1} := V_i^k + W_i^{k+1} − Y_i^{k+1}.
The optimal solution of subproblem (6) is the Euclidean projection (masked mapping) of W_i^{k+1} + U_i^k onto S_i and of W_i^{k+1} + V_i^k onto S'_i. Namely, the structures (e.g., columns) with the smallest magnitudes in the solution are set to zero so that the sparsity constraint is satisfied. In the meantime, the kept elements are quantized to the closest valid memristor state value.
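The projection step of (6) can be sketched as follows, assuming column-wise structured pruning that keeps the α largest-norm columns, followed by nearest-level quantization of the surviving weights; the matrix layout, α, and the level set are illustrative.

```python
import numpy as np

def project_columns(W, alpha):
    """Euclidean projection onto the structured-sparsity set S:
    keep the alpha columns with the largest L2 norm, zero the rest."""
    norms = np.linalg.norm(W, axis=0)
    keep = np.argsort(norms)[-alpha:]
    Z = np.zeros_like(W)
    Z[:, keep] = W[:, keep]
    return Z

def project_quantize(W, levels):
    """Projection onto the quantization set S': snap nonzero entries
    to the nearest memristor state level, leave pruned entries at zero."""
    Y = W.copy()
    nz = Y != 0
    idx = np.abs(Y[nz][:, None] - levels).argmin(axis=1)
    Y[nz] = levels[idx]
    return Y
```

Applying both projections in sequence produces a weight matrix that is simultaneously structured-sparse and representable with valid memristor states.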
4.3 MemristorBased Structured Weight Pruning
In order to accommodate high-parallelism implementation in hardware, we use a structured pruning method [1] instead of the irregular pruning method [15] to reduce the size of the weight matrix while avoiding extra memory storage requirements for indices. Figure 2 shows the different types of structured sparsity, which include filter-wise sparsity, channel-wise sparsity and shape-wise sparsity.
Figure 3 (a) shows the general matrix multiplication (GEMM) view of the DNN weight matrix and the different structured weight pruning methods. Structured pruning corresponds to removing rows (filter-wise), columns (shape-wise), or a combination of them. We can see that after structured weight pruning, the remaining weight matrix is still regular and requires no extra indices.
Figure 3 (b) illustrates the memristor crossbar schematic size reduction from the corresponding structured weight pruning, and Figure 3 (c) shows the physical view of the memristor crossbar blocks. A CONV layer consists of a set of filters and channels, comprising a total number of columns in the GEMM view. Due to the increasing reading/writing errors caused by expanding the memristor crossbar size, we limit our design to multiple 128×64 [25] crossbars for all DNN layers. In Figure 3 (c), each crossbar has a fixed number of columns and rows, where the rows correspond to inputs and the columns correspond to the column number shown in Figure 3 (a). One can derive the number of crossbars required to store one filter's weights as a block unit, and hence the total number of blocks required to store the whole weight matrix. Within each block, the outputs of each crossbar are propagated through an ADC, and we then column-wisely sum the intermediate results of all crossbars.
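One plausible way to compute the crossbar count for this tiling, assuming the GEMM weight matrix is split row-wise over the 128 crossbar rows and column-wise over the 64 crossbar columns; the layer dimensions in the usage note are hypothetical, not taken from the paper.

```python
import math

def crossbar_count(num_filters, channels, kernel_h, kernel_w,
                   xb_rows=128, xb_cols=64):
    """Number of fixed-size crossbars needed to hold one CONV layer in
    GEMM view: channels*kernel_h*kernel_w rows (inputs) by num_filters
    columns (filters), tiled over xb_rows x xb_cols crossbars."""
    rows = channels * kernel_h * kernel_w
    cols = num_filters
    return math.ceil(rows / xb_rows) * math.ceil(cols / xb_cols)
```

For instance, a hypothetical layer with 512 filters of size 256×3×3 has 2304 GEMM rows, so it tiles into 18 row-groups times 8 column-groups, i.e. 144 crossbars; structured pruning shrinks both GEMM dimensions and therefore the crossbar count directly.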
5 Software-Hardware Co-Optimization
Due to the non-optimality of the ADMM process and the accuracy degradation problem of quantizing a sparse DNN, a software-hardware co-optimization framework is desired. In this section we propose: (i) network purification and unused path removal to efficiently remove redundant channels or filters; (ii) memristor model quantization using distilled knowledge from a software helper model.
5.1 Network Purification and Unused Path Removal
Weight pruning with memristor-based ADMM regularized optimization can significantly reduce the number of weights while maintaining high accuracy. However, does the pruning process really remove all unnecessary weights?
From our analysis of the DNN data flow, we find that if a whole filter is pruned, then after general matrix multiplication (GEMM), the feature maps generated by this filter will be “blank”. If those “blank” feature maps are passed as input to the next layer, the corresponding unused input channel weights become removable. By the same token, a pruned channel also makes the corresponding filter in the previous layer removable. Figure 4 illustrates this correspondence between pruned filters/channels and the resulting unused channels/filters.
To better exploit the unused path removal effect discussed above, we derive an emptiness ratio parameter to define what can be treated as an empty channel. Suppose c_i is the number of columns per channel in layer i, and j is the channel index. We have

(8)  r_{i,j} = (number of pruned columns in channel j of layer i) / c_i.

If r_{i,j} exceeds a predefined threshold, we can assume that this channel is empty and thus actually prune every column in it. However, if we remove all columns in every channel that satisfies (8), a dramatic accuracy drop will occur, and it will be hard to recover by retraining because some relatively “important” weights might be removed. To mitigate this problem, we design a Network Purification algorithm to deal with the non-optimality problem of the ADMM process. We set up a criterion constant T_{i,j} to represent channel j's importance score, which is derived from an accumulation procedure over the magnitudes of the channel's remaining columns:

(9)  T_{i,j} = Σ_{k ∈ remaining columns of channel j} ||[W_i]_{:,k}||.

One can think of this process as collecting evidence for whether each channel that contains one or several remaining columns needs to be removed. A channel can only be treated as empty when both criteria (8) and (9) are satisfied. Network Purification also works on purifying the remaining filters and thus removes more unused paths in the network. Algorithm 1 shows our generalized PRM method, where the thresholds are hyperparameter values.
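A compact sketch of the combined test in Eqs. (8) and (9), assuming a GEMM-view weight matrix whose columns are grouped by channel; the threshold values are hypothetical hyperparameters, not those used in the paper.

```python
import numpy as np

def purify_channels(W, c_i, ratio_thr=0.9, score_thr=1e-2):
    """W: 2-D GEMM-view weight matrix with c_i columns per channel.
    A channel is removed only when (a) its fraction of all-zero columns
    meets ratio_thr (the emptiness ratio test of Eq. 8) and (b) the
    accumulated magnitude of its remaining columns stays below
    score_thr (the importance-score test of Eq. 9)."""
    n_ch = W.shape[1] // c_i
    keep = []
    for j in range(n_ch):
        block = W[:, j * c_i:(j + 1) * c_i]
        col_norms = np.linalg.norm(block, axis=0)
        empty_ratio = np.mean(col_norms == 0)
        score = col_norms.sum()
        if empty_ratio >= ratio_thr and score < score_thr:
            W[:, j * c_i:(j + 1) * c_i] = 0  # prune the whole channel
        else:
            keep.append(j)
    return W, keep
```

Requiring both conditions is what prevents the accuracy drop described above: a mostly-empty channel whose few surviving columns still carry large magnitudes fails the score test and is kept.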
Table 1: Structured weight pruning and quantization results on MNIST, CIFAR-10, and ImageNet.

Method | Original Model Accuracy | Compression Rate Without PRM | Accuracy Without PRM | Compression Rate With PRM | Accuracy With PRM | Weight Quantization Accuracy (8-bit)

MNIST
Group Scissor [17] | 99.15% | 4.16× | 99.14% | N/A | N/A | N/A
our LeNet-5 | 99.17% | 23.18× | 99.20% | 39.23× | 99.20% | 99.16%
 | | 34.46× | 99.06% | *87.93× | 99.06% | 99.04%
 | | 45.54× | 98.48% | 231.82× | 98.48% | 98.05%
*number of parameters reduced: 25.2K

CIFAR-10
Group Scissor [17] | 82.01% | 2.35× | 82.09% | N/A | N/A | N/A
our ConvNet | 84.41% | 2.35× | 84.55% | N/A | N/A | 84.33%
 | | *2.93× | 84.53% | N/A | N/A | 83.93%
 | | 5.88× | 83.58% | N/A | N/A | 83.01%
our VGG-16 | 93.70% | 20.16× | 93.36% | 44.67× | 93.36% | 93.04%
 | | | | *50.02× | 92.73% | 92.46%
our ResNet-18 | 94.14% | 5.83× | 93.79% | 52.07× | 93.79% | 93.71%
 | | 15.14× | 93.20% | *59.84× | 93.22% | 93.27%
*number of parameters reduced on ConvNet: 102.30K, VGG-16: 14.42M, ResNet-18: 10.97M

ImageNet ILSVRC-2012
SSL [1] AlexNet | 80.40% | 1.40× | 80.40% | N/A | N/A | N/A
our AlexNet | 82.40% | 4.69× | 81.76% | 5.13× | 81.76% | 80.45%
our ResNet-18 | 89.07% | 3.02× | 88.41% | 3.33× | 88.36% | 88.47%
our ResNet-50 | 92.86% | 2.00× | 92.26% | 2.70× | 92.27% | 92.20%
number of parameters reduced on AlexNet: 1.66M, ResNet-18: 7.81M, ResNet-50: 14.77M
5.2 Memristor Weight Quantization
Traditionally, a DNN in software is composed of 32-bit weights. On a memristor device, however, the weights of a neural network are represented by the memristance of the memristors (i.e., the memristance range constraint in the ADMM process). Due to the limited memristance range of memristor devices, weight values exceeding the memristance range cannot be represented precisely. Meanwhile, a mismatch between the written value and the exact value when mapping weights onto the memristor crossbar will also cause a reading mismatch if the amount of the value shift exceeds the state level range.
In order to mitigate the memristance range limitation and the mapping mismatch, larger gaps between state levels are needed, which means fewer bits are used to represent the weights. To better maintain accuracy, we use a pretrained high-accuracy teacher model to provide a distillation loss that is added to the loss of our memristor model (referred to as the student model) for better training performance.
(10)  L = α · H(softmax(z_s), y) + (1 − α) · T² · KL( softmax(z_s / T) || softmax(z_t / T) )

The first term in (10) is the memristor model (student) loss, and the second term is the distillation loss between student and teacher. z_s and z_t are the outputs of the student and teacher, and y is the ground-truth label. α is a balancing parameter, and T is the temperature parameter.
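A sketch of this objective in PyTorch, assuming the standard Hinton-style softened-softmax form of the distillation term; the α and T values are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=4.0):
    """Alpha-weighted student cross-entropy plus a temperature-softened
    KL term against the teacher, scaled by T^2 so gradients keep a
    comparable magnitude across temperatures."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1 - alpha) * soft
```

During training, the teacher runs in evaluation mode with gradients disabled, so only the student (memristor) model is updated by this loss.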
6 Experimental Results
In this section, we show the experimental results of our proposed memristor-based DNN framework, in which structured weight pruning and quantization with memristor-based ADMM regularized optimization are included. Our software-hardware co-optimization framework (i.e., Network Purification and Unused Path Removal (PRM)) is also thoroughly compared. We test the MNIST dataset on LeNet-5 and the CIFAR-10 dataset using ConvNet (4 CONV layers and 1 FC layer), VGG-16 and ResNet-18, and we also show our ImageNet results on AlexNet, ResNet-18 and ResNet-50. The accuracies of the pruned and quantized models are tested on our software models incorporating memristor hardware constraints. Models are trained on a server with eight NVIDIA GTX 2080Ti GPUs using the PyTorch API. Our memristor model is built in MATLAB, and NVSim [26] is used to calculate the power consumption and area cost of the memristors and memristor crossbars. The 1R crossbar structure is used in our design, and we choose a memristor device with specified minimum and maximum memristance values. The memristor precision is 4-bit, which indicates that 16 state levels can be represented by a single memristor device, and two memristors are combined to represent an 8-bit weight in our framework. For the peripheral circuits, the power and area are calculated based on 45nm technology, and H-tree distribution networks are used to access all the memristor crossbars. As shown in Table 1, we report groups of different prune ratios and 8-bit quantization accuracies for each network structure. Figure 5 supports our earlier argument that ADMM's non-optimality exists in a structured pruned model and that PRM can further optimize the loss function. Please note that all of the results are based on a non-retraining process. Below are some result highlights on the different datasets and network structures.
MNIST. With the LeNet-5 network, compared to the original accuracy (99.17%), our proposed PRM framework achieves 231.82× compression with minor accuracy loss, while our lower compression ratios are lossless. No accuracy loss is observed after quantization on the 39.23× and 87.93× models, and only a 0.4% accuracy drop on the 231.82× model. In comparison, Group Scissor [17] only achieves a 4.16× compression rate.
CIFAR-10. The ConvNet structure is relatively shallow, so ADMM reaches a relatively optimal local minimum and post-processing is not necessary. Even so, we outperform Group Scissor [17] in accuracy (84.55% vs. 82.09%) at the same compression rate (2.35×). For larger networks, when a minor accuracy loss is allowed, our proposed PRM method improves the compression ratio to 50.02× and 59.84× on VGG-16 and ResNet-18 respectively, with no obvious accuracy loss after quantization on the pruned models.
ImageNet. Our AlexNet model outperforms SSL [1] in both compression rate (4.69× vs. 1.40×) and network accuracy (81.76% vs. 80.40%), with or without PRM. Our ResNet-18 and ResNet-50 models also achieve unprecedented 3.33× compression with 88.36% accuracy and 2.70× with 92.27% accuracy, respectively. No accuracy loss is observed after quantization on the pruned ResNet-18/50 models, and around 1% accuracy loss on the 5.13× compressed AlexNet model.
Table 2 shows our highlighted memristor crossbar power and area comparisons for the ResNet-18 and VGG-16 models. By using our proposed PRM method, the area and power of the ResNet-18 model are reduced from 0.235 (0.117) and 3.359 (1.622) to 0.042 (0.041) and 0.585 (0.556), respectively, without any accuracy loss. For the VGG-16 model, after using our PRM method, the area and power are reduced from 0.113 and 1.611 to 0.056 (0.053) and 0.824 (0.754), where a compression ratio of 44.67× (50.02×) is achieved with 0% (0.63%) accuracy degradation.
7 Conclusion
In this paper, we designed a unified memristor-based DNN framework which is tiny in overall hardware footprint and accurate in test performance. We incorporated ADMM into structured weight pruning and quantization to reduce the model size in order to fit our designed tiny framework. We identified the non-optimality of the ADMM solution and designed Network Purification and Unused Path Removal in our software-hardware co-optimization framework, which achieves better results than Group Scissor [17] and SSL [1]. On AlexNet, VGG-16 and ResNet-18/50, after structured weight pruning and 8-bit quantization, the model size, power and area are significantly reduced with negligible accuracy loss.
References
 [1] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in NeurIPS, 2016, pp. 2074–2082.
 [2] X. Ma, G. Yuan, S. Lin, Z. Li, H. Sun, and Y. Wang, “Resnet can be pruned 60x: Introducing network purification and unused path removal (prm) after weight pruning,” arXiv preprint arXiv:1905.00136, 2019.
 [3] T. Zhang, K. Zhang, S. Ye, J. Li, J. Tang, W. Wen, X. Lin, M. Fardad, and Y. Wang, “Adam-admm: A unified, systematic framework of structured weight pruning for dnns,” arXiv preprint arXiv:1807.11091, 2018.
 [4] E. Park, J. Ahn, and S. Yoo, “Weightedentropybased quantization for deep neural networks,” in CVPR, 2017.
 [5] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, “Quantized convolutional neural networks for mobile devices,” in CVPR, 2016.
 [6] S. Lin, X. Ma, S. Ye, G. Yuan, K. Ma, and Y. Wang, “Toward extremely low bit and lossless accuracy in dnns with progressive admm,” arXiv preprint arXiv:1905.00789, 2019.
 [7] W. Niu, X. Ma, Y. Wang, and B. Ren, “26ms inference time for resnet50: Towards realtime execution of all dnns on smartphone,” arXiv preprint arXiv:1905.00571, 2019.
 [8] H. Li, N. Liu, X. Ma, S. Lin, S. Ye, T. Zhang, X. Lin, W. Xu, and Y. Wang, “Admm-based weight pruning for real-time deep learning acceleration on mobile devices,” in Proceedings of the 2019 Great Lakes Symposium on VLSI, 2019.
 [9] C. Ding, A. Ren, G. Yuan, X. Ma, J. Li, N. Liu, B. Yuan, and Y. Wang, “Structured weight matrices-based hardware accelerators in deep neural networks: Fpgas and asics,” in Proceedings of the 2018 Great Lakes Symposium on VLSI, 2018.
 [10] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The missing memristor found,” nature, vol. 453, no. 7191, p. 80, 2008.
 [11] X. Ma, Y. Zhang, G. Yuan, A. Ren, Z. Li, J. Han, J. Hu, and Y. Wang, “An area and energy efficient design of domainwall memorybased deep convolutional neural networks using stochastic computing,” in ISQED. IEEE, 2018.
 [12] L. Chua, “Memristorthe missing circuit element,” IEEE Transactions on circuit theory, vol. 18, no. 5, pp. 507–519, 1971.
 [13] G. Yuan, C. Ding, R. Cai, X. Ma, Z. Zhao, A. Ren, B. Yuan, and Y. Wang, “Memristor crossbarbased ultraefficient nextgeneration baseband processors,” in MWSCAS, 2017.
 [14] S. Ye, X. Feng, T. Zhang, X. Ma, S. Lin, Z. Li, K. Xu, W. Wen, S. Liu, J. Tang et al., “Progressive dnn compression: A key to achieve ultrahigh weight pruning and quantization rates using admm,” arXiv preprint arXiv:1903.09769, 2019.
 [15] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in NeurIPS, 2015.
 [16] A. Ankit, A. Sengupta, and K. Roy, “Trannsformer: Neural network transformation for memristive crossbar based neuromorphic system design,” in Proceedings of ICCD, 2017.
 [17] Y. Wang, W. Wen, B. Liu, D. Chiarulli, and H. Li, “Group scissor: Scaling neuromorphic computing design to large neural networks,” in DAC. IEEE, 2017.
 [18] L. Xia, T. Tang, W. Huangfu, M. Cheng, X. Yin, B. Li, Y. Wang, and H. Yang, “Switched by input: power efficient structure for rrambased convolutional neural network,” in DAC. ACM, 2016, p. 125.
 [19] A. Shafiee, A. Nag, N. Muralimanohar, and et.al, “ISAAC: A Convolutional Neural Network Accelerator with InSitu Analog Arithmetic in Crossbars,” in ISCA 2016.
 [20] S. Kaya, A. R. Brown, A. Asenov, D. Magot, T. Linton, and C. Tsamis, “Analysis of statistical fluctuations due to line edge roughness in sub-0.1 μm mosfets,” 2001.
 [21] J. J. Yang, M. D. Pickett, X. Li, D. A. Ohlberg, D. R. Stewart, and R. S. Williams, “Memristive switching mechanism for metal/oxide/metal nanodevices,” Nature Nanotechnology, 2008.
 [22] C. Song, B. Liu, W. Wen, H. Li, and Y. Chen, “A quantizationaware regularized learning method in multilevel memristorbased neuromorphic computing system,” in 2017 NVMSA. IEEE, 2017.
 [23] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, 2011.
 [24] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [25] M. Hu, C. E. Graves, C. Li, Y. Li et al., “Memristor-based analog computation and neural network classification with a dot product engine,” Advanced Materials, 2018.
 [26] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “Nvsim: A circuit-level performance, energy, and area model for emerging nonvolatile memory,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.