1 Introduction
To address the high demand for computation and storage resources of DNN applications, weight pruning (Han et al., 2015; Wen et al., 2016) has been developed to facilitate weight compression and computation acceleration. In this work, a structured pruning technique is utilized to compress DNN models, which reduces weight storage and computation; compared with irregular pruning, structured weight matrix storage also has potential advantages for high-parallelism hardware implementation because it eliminates the required weight indices (Ding et al., 2018; Zhang et al., 2018).
However, accuracy loss in structured pruning is inevitable. By adopting ADMM (Boyd et al., 2011), the original weight pruning problem is decomposed into two subproblems: one solved by stochastic gradient descent, as in original DNN training, and the other solved optimally and analytically via Euclidean projection (Zhang et al., 2018; Ye et al., 2019). The ADMM method achieves some of the state-of-the-art structured weight pruning results: 40x weight reduction on LeNet-5 (LeCun et al., 1998) with MNIST (LeCun et al., 2015), 20x on VGG-16 (Simonyan & Zisserman, 2014) with CIFAR-10 (Krizhevsky & Hinton, 2009), and 4.7x on AlexNet (Krizhevsky et al., 2012) with ImageNet (Deng et al., 2009), all without post-processing optimization. During the post-processing procedure, we find that after model retraining some weights contribute less to the network performance. This phenomenon stems from a shortcoming of the ADMM technique: it lacks a guarantee of solution feasibility (non-optimality) due to the non-convex nature of the objective (loss) function. We propose a novel algorithm to detect and remove the redundant weights that slip through ADMM pruning. We are also the first to discover the unused paths in a structured-pruned DNN model, and we design an optimization framework to further boost the compression rate while maintaining high network accuracy. The contributions of this paper include:

We adopt ADMM for efficiently optimizing the non-convex problem and successfully apply this method to structured weight pruning.

We design a novel Network Purification and Unused Path Removal (PRM) algorithm focused on post-processing an ADMM structured-pruned model, boosting the compression rate while maintaining accuracy.
2 ADMM model compression
Consider an $N$-layer DNN in which the collections of weights and biases of the $i$-th (CONV or FC) layer are denoted by $W_i$ and $b_i$, respectively, and the loss function associated with the DNN is denoted by $f(\{W_i\}_{i=1}^N, \{b_i\}_{i=1}^N)$. In this paper, $\{W_i\}_{i=1}^N$ and $\{b_i\}_{i=1}^N$ characterize the sets of weights and biases from layer $1$ to layer $N$. The overall problem is defined by

(1)   $\underset{\{W_i\},\{b_i\}}{\text{minimize}} \; f(\{W_i\}_{i=1}^N, \{b_i\}_{i=1}^N)$,
      subject to $W_i \in S_i, \; i = 1, \ldots, N$.
Given the value of $\alpha_i$, the constraint set is denoted by $S_i = \{W_i \mid \mathrm{card}(\mathrm{supp}(W_i)) \le \alpha_i\}$, where "card" refers to cardinality and "supp" refers to the support set. Elements in $S_i$ are solutions $W_i$ in which the number of non-zero elements is limited by $\alpha_i$ for layer $i$. This general constraint can be extended to structured pruning schemes such as filter pruning, channel pruning, and column pruning.
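For column pruning, the Euclidean projection onto such a cardinality-constrained set amounts to keeping the columns with the largest norms and zeroing out the rest. A minimal NumPy sketch (the function name and shapes are our own illustration, not from the paper):

```python
import numpy as np

def project_column_sparse(W, num_keep):
    """Euclidean projection of a 2D (GEMM-view) weight matrix onto the set
    of matrices with at most `num_keep` non-zero columns: keep the columns
    with the largest L2 norms and zero out all others."""
    col_norms = np.linalg.norm(W, axis=0)      # one norm per column
    keep = np.argsort(col_norms)[-num_keep:]   # indices of the strongest columns
    Z = np.zeros_like(W)
    Z[:, keep] = W[:, keep]
    return Z
```

The projection is optimal for the Frobenius distance because zeroing the smallest-norm columns discards the least total weight energy.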
The standard ADMM regularized optimization steps are as follows. Consider an indicator function that is utilized to incorporate $S_i$ into the objective function:

(2)   $g_i(W_i) = \begin{cases} 0 & \text{if } W_i \in S_i, \\ +\infty & \text{otherwise.} \end{cases}$
Then the original problem (1) can be equivalently rewritten as

(3)   $\underset{\{W_i\},\{b_i\}}{\text{minimize}} \; f(\{W_i\}_{i=1}^N, \{b_i\}_{i=1}^N) + \sum_{i=1}^{N} g_i(Z_i)$,
      subject to $W_i = Z_i, \; i = 1, \ldots, N$.
Auxiliary variables $Z_i$ and dual variables $U_i$ are introduced. ADMM decomposes problem (3) into simpler subproblems and solves the subproblems iteratively until convergence. The augmented Lagrangian of problem (3), in scaled form and up to a constant, is

(4)   $L_\rho = f(\{W_i\}_{i=1}^N, \{b_i\}_{i=1}^N) + \sum_{i=1}^{N} g_i(Z_i) + \sum_{i=1}^{N} \frac{\rho_i}{2} \|W_i - Z_i + U_i\|_F^2$.
The first term in problem (4) is the differentiable loss function of the DNN, and the quadratic regularization term $\frac{\rho_i}{2}\|W_i - Z_i + U_i\|_F^2$ is differentiable and convex, where $\|\cdot\|_F$ denotes the Frobenius norm. As a result, with $\{Z_i\}$ and $\{U_i\}$ fixed, this subproblem can be solved by a stochastic gradient descent algorithm (Kingma & Ba, 2014), as in original DNN training.
The standard ADMM algorithm (Boyd et al., 2011) proceeds by repeating, for iteration $k = 0, 1, \ldots$, the following subproblem iterations:

(5)   $\{W_i^{k+1}, b_i^{k+1}\} := \underset{\{W_i\},\{b_i\}}{\arg\min} \; L_\rho(\{W_i\}, \{b_i\}, \{Z_i^k\}, \{U_i^k\})$

(6)   $\{Z_i^{k+1}\} := \underset{\{Z_i\}}{\arg\min} \; L_\rho(\{W_i^{k+1}\}, \{b_i^{k+1}\}, \{Z_i\}, \{U_i^k\})$

(7)   $U_i^{k+1} := U_i^k + W_i^{k+1} - Z_i^{k+1}$

where (5) is the proximal step, (6) is the projection step (solved analytically as the Euclidean projection of $W_i^{k+1} + U_i^k$ onto $S_i$), and (7) is the dual-variable update. However, because in our method the non-convexity comes from the DNN loss function rather than the quadratic term, global optimality cannot be guaranteed.
Figure 1 illustrates the combined structured pruning techniques in the General Matrix Multiplication (GEMM) view. We adopt filter pruning and column pruning together to reduce the matrix dimensions. As Figure 1(c) shows, the weight matrix size is reduced drastically compared with the original one, while the shape of the weight matrix remains regular.
3 Network Purification and Unused Path Removal (PRM)
ADMM weight pruning can significantly reduce the number of weights while maintaining high accuracy. However, does the pruning process really remove all unnecessary weights?
From our observation and analysis of the data flow through a network, we find that if a whole filter is pruned, then after GEMM the feature maps generated by this filter will be all "blank". If those blank feature maps are input to the next layer, then no matter what values are in the corresponding channel for those feature maps, the GEMM result will be zero. As a result, that channel becomes an unused channel that can be removed. By the same token, if a channel is pruned, then no matter what values are in the previous layer's corresponding filter, the feature maps generated through this channel will be all zeros, which makes that corresponding filter an unused one. Figure 2 gives a clear illustration of the correspondence between ADMM-pruned filters/columns and the resulting unused channels/filters.
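The removal rule described above can be sketched for two consecutive layers in a simplified 2D view (a hypothetical helper of our own; real CONV layers would first be reshaped into this GEMM view, and the pass could be repeated across all layer pairs to propagate removals):

```python
import numpy as np

def remove_unused_paths(W1, W2):
    """W1 has shape (filters1, channels1) and W2 has shape (filters2, filters1),
    so filter f of layer 1 produces the feature map read by channel f of layer 2.
    A pruned (all-zero) filter in layer 1 makes the matching channel of layer 2
    unused; a pruned channel in layer 2 makes the matching filter of layer 1
    unused. Zero both out so the unused paths can be removed."""
    W1, W2 = W1.copy(), W2.copy()
    dead_filters = ~W1.any(axis=1)     # all-zero rows (filters) of layer 1
    W2[:, dead_filters] = 0.0          # -> unused channels of layer 2
    dead_channels = ~W2.any(axis=0)    # all-zero columns (channels) of layer 2
    W1[dead_channels, :] = 0.0         # -> unused filters of layer 1
    return W1, W2
```

A single pass can expose new dead filters in earlier layers, so in a deep network the procedure would be iterated until no further paths are removed.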
We further generalize the detection of unused filters caused by empty channels by defining a criterion for the "emptiness" of a channel. Suppose $m_i$ is the number of columns per channel in layer $i$, and $\theta_{i,j}$ is the emptiness ratio of channel $j$. We have

(8)   $\theta_{i,j} = \dfrac{\mathrm{card}\{k \in \text{channel } j : [W_i]_{:,k} = 0\}}{m_i}$.

If $\theta_{i,j}$ exceeds a predefined threshold, we can assume that this channel is empty. But this indiscriminate criterion has a limitation: after pruning, the remaining columns were retained for a reason, namely that they are relatively "important" to the whole network. If we removed every column in channels satisfying the emptiness criterion, a disastrous accuracy drop would occur that is hard to recover by retraining.
To make our previous assumption workable, we design a unified algorithm called "Network Purification", which targets the non-optimality problem of the ADMM process; solving this problem validates the above assumption simultaneously. We add a criterion constraint to compare the importance of the remaining columns channel-wise and to help decide which columns can be sacrificed and which cannot. We set up a criterion constant $\gamma_{i,j}$ to represent the importance score of channel $j$ in layer $i$, derived from an accumulation procedure over the channel's remaining columns, for example

(9)   $\gamma_{i,j} = \sum_{k \in \text{channel } j} \|[W_i]_{:,k}\|_2^2$.

One can think of this process as collecting evidence for whether each channel that contains one or several columns needs to be removed. Network Purification also purifies the remaining filters, and thus removes more unused paths in the network. The combined effect of Network Purification and Unused Path Removal (PRM) is that the network achieves an extremely high compression rate without any accuracy drop. Algorithm 1 shows our generalized PRM method, where the threshold values are hyperparameters.
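A minimal sketch of the purification rule for one layer, combining the emptiness ratio of Eq. (8) with an accumulated importance score in the spirit of Eq. (9); the function, thresholds, and exact score definition are our illustrative assumptions:

```python
import numpy as np

def purify_channels(W, cols_per_channel, empty_thresh=0.75, imp_thresh=0.1):
    """Hypothetical sketch of Network Purification on one layer.
    W: 2D GEMM-view weight matrix whose columns are grouped channel by
    channel, `cols_per_channel` columns each. A channel is removed only if
    it is mostly empty (emptiness ratio above `empty_thresh`, cf. Eq. (8))
    AND its accumulated importance score (cf. Eq. (9)) is below
    `imp_thresh` -- so relatively important surviving columns are spared."""
    W = W.copy()
    num_channels = W.shape[1] // cols_per_channel
    for j in range(num_channels):
        block = W[:, j * cols_per_channel:(j + 1) * cols_per_channel]
        col_norms = np.linalg.norm(block, axis=0)
        emptiness = np.mean(col_norms == 0)    # fraction of zero columns
        importance = np.sum(col_norms ** 2)    # accumulated channel importance
        if emptiness > empty_thresh and importance < imp_thresh:
            block[:] = 0.0                     # remove near-empty, low-impact channel
    return W
```

Requiring both conditions is the point of the importance score: a nearly empty channel whose few surviving columns are strong is kept, while one whose survivors are weak is purged.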
4 Experimental Results
Figure 3 shows that ADMM's non-optimality exists in a structured-pruned model: by purifying the redundant weights, we can further reduce the loss. All of these results are based on a Network Purification process without retraining. The purification, together with the removal of unused paths (PRM), provides a large compression boost when the network is deep enough.
Table 1: Structured weight pruning statistics.

MNIST
| Method | Base acc. | Prune rate | Acc. | Prune rate (+PRM) | Acc. (+PRM) |
| SSL | | 26.10x | 99.00% | N/A | N/A |
| ours (LeNet-5) | 99.17% | 23.18x | 99.20% | 39.23x | 99.20% |
| ours (LeNet-5) | 99.17% | 34.46x | 99.06% | *87.93x | 99.06% |
| ours (LeNet-5) | 99.17% | 45.54x | 98.48% | 231.82x | 98.48% |
(* number of parameters reduced: 25.2K)

CIFAR-10
| Method | Base acc. | Prune rate | Acc. | Prune rate (+PRM) | Acc. (+PRM) |
| 2PFPCE | 92.98% | 4.00x | 92.76% | N/A | N/A |
| ours (VGG-16) | 93.70% | 20.16x | 93.36% | 44.67x | 93.36% |
| ours (VGG-16) | 93.70% | | | *50.02x | 92.73% |
| AMC | 93.53% | 1.70x | 93.55% | N/A | N/A |
| ours (ResNet-18) | 94.14% | 5.83x | 93.79% | 52.07x | 93.79% |
| ours (ResNet-18) | 94.14% | 15.14x | 93.20% | *60.11x | 93.22% |
(* number of parameters reduced; VGG-16: 14.42M, ResNet-18: 10.97M)

ImageNet ILSVRC-2012
| Method | Base acc. | Prune rate | Acc. | Prune rate (+PRM) | Acc. (+PRM) |
| SSL (AlexNet) | 80.40% | 1.40x | 80.40% | N/A | N/A |
| ours (AlexNet) | 82.40% | 4.69x | 81.76% | 5.13x | 81.76% |
| ours (ResNet-18) | 89.07% | 3.02x | 88.41% | 3.33x | 88.47% |
| ours (ResNet-50) | 92.86% | 2.00x | 92.26% | 2.70x | 92.27% |
(number of parameters reduced; AlexNet: 1.66M, ResNet-18: 7.81M, ResNet-50: 14.77M)
Table 1 shows our experimental results of network pruning on LeNet-5, VGG-16, AlexNet, and ResNet-18/50. The accuracy and pruning-ratio results of the SSL (Wen et al., 2016) method are compared with our structured-pruned LeNet-5 and AlexNet models, and 2PFPCE (Min et al., 2018) and AMC (He et al., 2018, on ResNet-50) are compared with our VGG-16 and ResNet-18 results on CIFAR-10. By using ADMM structured pruning together with Network Purification and Unused Path Removal (PRM), LeNet-5 achieves a 39x compression rate without accuracy drop, 88x with negligible accuracy drop, and 232x with only a 0.7% accuracy drop. On the CIFAR-10 dataset, our compressed VGG-16 model achieves 44x compression without accuracy degradation and 50x with a 1% accuracy drop, and our ResNet-18 achieves 52x compression without noticeable accuracy loss and 60x with a 0.9% accuracy loss.
On the ImageNet dataset, we increase the AlexNet compression rate from 4.69x to 5.13x, ResNet-18 from 3.02x to 3.33x, and ResNet-50 from 2.00x to 2.70x. None of these compression-rate boosts causes noticeable accuracy degradation.
5 Conclusion
In this paper, we present an ADMM-regularized method that achieves highly compressed DNN models by combining different weight pruning structures while maintaining high network accuracy. We further investigate post-processing of ADMM pruning to address the non-optimal solutions caused by the non-convex DNN loss function. Our proposed Network Purification and Unused Path Removal significantly increase the model compression rate.
References
Boyd et al. (2011) Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 2011.

Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. 2009.

Ding et al. (2018) Ding, C., Ren, A., Yuan, G., Ma, X., Li, J., Liu, N., Yuan, B., and Wang, Y. Structured weight matrices-based hardware accelerators in deep neural networks: FPGAs and ASICs. In Proceedings of the 2018 Great Lakes Symposium on VLSI, pp. 353-358. ACM, 2018.

Han et al. (2015) Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NeurIPS, 2015.

He et al. (2018) He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784-800, 2018.

Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1097-1105, 2012.

LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.

Min et al. (2018) Min, C., Wang, A., Chen, Y., Xu, W., and Chen, X. 2PFPCE: Two-phase filter pruning based on conditional entropy. arXiv preprint arXiv:1809.02220, 2018.

Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Wen et al. (2016) Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In NeurIPS, pp. 2074-2082, 2016.

Ye et al. (2019) Ye, S., Feng, X., Zhang, T., Ma, X., Lin, S., Li, Z., Xu, K., Wen, W., Liu, S., Tang, J., et al. Progressive DNN compression: A key to achieve ultra-high weight pruning and quantization rates using ADMM. arXiv preprint arXiv:1903.09769, 2019.

Zhang et al. (2018) Zhang, T., Zhang, K., Ye, S., Li, J., Tang, J., Wen, W., Lin, X., Fardad, M., and Wang, Y. ADAM-ADMM: A unified, systematic framework of structured weight pruning for DNNs. arXiv preprint arXiv:1807.11091, 2018.