To address the high computation and storage demands of DNN applications, weight pruning (Han et al., 2015; Wen et al., 2016) has been developed to facilitate weight compression and computation acceleration. In this work, a structured pruning technique is utilized to compress DNN models, which reduces weight storage and computation; structured weight matrix storage also has potential advantages for high-parallelism hardware implementation, since it eliminates the weight indices required by irregular pruning (Ding et al., 2018; Zhang et al., 2018).
However, accuracy loss in structured pruning is inevitable. By adopting ADMM (Boyd et al., 2011), the original weight pruning problem is decomposed into two subproblems: one is solved using stochastic gradient descent, as in original DNN training, while the other is solved optimally and analytically via Euclidean projection (Zhang et al., 2018; Ye et al., 2019). The ADMM method achieves some of the state-of-the-art structured weight pruning results without post-processing optimization: 40× weight reduction on LeNet-5 (LeCun et al., 1998) with MNIST (LeCun et al., 2015), 20× on VGG-16 (Simonyan & Zisserman, 2014) with CIFAR-10 (Krizhevsky & Hinton, 2009), and 4.7× on AlexNet (Krizhevsky et al., 2012) with ImageNet (Deng et al., 2009).
During the post-processing procedure, we find that after model retraining, some weights contribute less to network performance. This phenomenon is caused by the fact that the ADMM technique lacks a guarantee of solution feasibility (non-optimality) due to the non-convex nature of the objective (loss) function. We propose a novel algorithm to detect and remove the redundant weights that slip past ADMM pruning. We are also the first to discover unused paths in a structured pruned DNN model, and we design a sophisticated optimization framework to further boost the compression rate while maintaining high network accuracy. The contributions of this paper include:
We adopt ADMM to efficiently optimize the non-convex problem and successfully apply this method to structured weight pruning.
We design a novel Network Purification and Unused Path Removal (P-RM) algorithm, focused on post-processing an ADMM structured pruned model, to boost the compression rate while maintaining accuracy.
2 ADMM Model Compression
Consider an $N$-layer DNN, where the sets of weights and biases of the $i$-th (CONV or FC) layer are denoted by $\mathbf{W}_i$ and $\mathbf{b}_i$, respectively, and the loss function associated with the DNN is denoted by $f(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N)$. In this paper, $\{\mathbf{W}_i\}_{i=1}^N$ and $\{\mathbf{b}_i\}_{i=1}^N$ characterize the sets of weights and biases from layer $1$ to layer $N$. The overall problem is defined by

$$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \ f(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N), \quad \text{subject to} \ \mathbf{W}_i \in \mathbf{S}_i, \ i = 1, \dots, N. \tag{1}$$
Given the value of $\alpha_i$, the constraint set is denoted by $\mathbf{S}_i = \{\mathbf{W}_i \mid \mathrm{card}(\mathrm{supp}(\mathbf{W}_i)) \le \alpha_i\}$, where "card" refers to cardinality and "supp" refers to the support set. Elements in $\mathbf{S}_i$ are solutions of $\mathbf{W}_i$ in which the number of non-zero elements is limited by $\alpha_i$ for layer $i$. This general constraint extends to structured pruning schemes such as filter pruning, channel pruning, and column pruning.
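As an illustration, the cardinality constraint instantiated for column pruning (count the non-zero columns) can be checked with a short NumPy sketch; the helper name and the budget `alpha` are ours, standing in for the per-layer budget $\alpha_i$:

```python
import numpy as np

def satisfies_column_constraint(W, alpha):
    """True if W lies in the constraint set S for column pruning:
    at most `alpha` columns of W are non-zero. Hypothetical helper;
    `alpha` plays the role of the per-layer budget alpha_i."""
    nonzero_cols = int(np.count_nonzero(np.linalg.norm(W, axis=0)))
    return nonzero_cols <= alpha

W = np.zeros((4, 6))
W[:, [0, 2]] = 1.0  # only columns 0 and 2 are non-zero
assert satisfies_column_constraint(W, 3)      # 2 non-zero columns <= 3
assert not satisfies_column_constraint(W, 1)  # budget of 1 is violated
```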
The standard ADMM regularized optimization steps are as follows. Consider an indicator function, utilized to incorporate $\mathbf{S}_i$ into the objective function:

$$g_i(\mathbf{W}_i) = \begin{cases} 0 & \text{if } \mathbf{W}_i \in \mathbf{S}_i, \\ +\infty & \text{otherwise.} \end{cases} \tag{2}$$
Then the original problem (1) can be equivalently rewritten as

$$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \ f(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N) + \sum_{i=1}^{N} g_i(\mathbf{Z}_i), \quad \text{subject to} \ \mathbf{W}_i = \mathbf{Z}_i, \ i = 1, \dots, N. \tag{3}$$
Auxiliary variables $\mathbf{Z}_i$ and dual variables $\mathbf{U}_i$ are introduced. ADMM decomposes problem (3) into simpler subproblems and solves the subproblems iteratively until convergence. The augmented Lagrangian formation of problem (3), in scaled form, yields the first subproblem

$$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \ f(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N) + \sum_{i=1}^{N} \frac{\rho_i}{2}\,\big\|\mathbf{W}_i - \mathbf{Z}_i + \mathbf{U}_i\big\|_F^2. \tag{4}$$
The first term in problem (4) is the differentiable loss function of the DNN, and the second term is a quadratic regularization term on $\mathbf{W}_i$, which is differentiable and convex; $\|\cdot\|_F$ denotes the Frobenius norm. As a result, subproblem (4) can be solved by a stochastic gradient descent algorithm (Kingma & Ba, 2014), as in original DNN training.
The standard ADMM algorithm (Boyd et al., 2011) proceeds by repeating, for iterations $k = 0, 1, \dots$, the following subproblem iterations:

$$\{\mathbf{W}_i^{k+1}\} := \arg\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \ f(\{\mathbf{W}_i\}, \{\mathbf{b}_i\}) + \sum_{i=1}^{N} \frac{\rho_i}{2}\,\big\|\mathbf{W}_i - \mathbf{Z}_i^{k} + \mathbf{U}_i^{k}\big\|_F^2, \tag{5}$$

$$\mathbf{Z}_i^{k+1} := \Pi_{\mathbf{S}_i}\big(\mathbf{W}_i^{k+1} + \mathbf{U}_i^{k}\big), \tag{6}$$

$$\mathbf{U}_i^{k+1} := \mathbf{U}_i^{k} + \mathbf{W}_i^{k+1} - \mathbf{Z}_i^{k+1}, \tag{7}$$

where (5) is the proximal step, (6) is the projection step ($\Pi_{\mathbf{S}_i}$ denotes Euclidean projection onto $\mathbf{S}_i$), and (7) is the dual variable update. However, due to the non-convexity of the DNN loss function (unlike the quadratic term in our method), global optimality cannot be guaranteed.
Figure 1 illustrates the combined structured pruning techniques in the General Matrix Multiply (GEMM) view. We adopt filter pruning and column pruning together to reduce the matrix dimensions. As Figure 1(c) shows, the weight matrix size is reduced drastically compared with the original, while the shape of the weight matrix remains regular.
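In code, shrinking the GEMM-form matrix after filter (row) and column pruning amounts to dropping the all-zero rows and columns; a NumPy sketch, with the function name ours:

```python
import numpy as np

def shrink_gemm_matrix(W):
    """Drop all-zero rows (pruned filters) and all-zero columns
    (pruned columns) from a GEMM-form weight matrix, leaving a
    smaller but still dense, regular matrix."""
    keep_rows = np.any(W != 0, axis=1)
    keep_cols = np.any(W != 0, axis=0)
    return W[np.ix_(keep_rows, keep_cols)]

W = np.zeros((4, 6))
W[0, 1] = 1.0
W[2, 3] = 2.0
assert shrink_gemm_matrix(W).shape == (2, 2)  # 4x6 shrinks to 2x2
```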
3 Network Purification and Unused Path Removal (P-RM)
ADMM weight pruning can significantly reduce weights while maintaining high accuracy. However, does the pruning process really remove all unnecessary weights?
From our observation and analysis of the data flow through a network, we find that if a whole filter is pruned, then after GEMM, the feature maps generated by this filter will be all "blank". If those "blank" feature maps are input to the next layer, then no matter what values are in the corresponding channel for those feature maps, the GEMM result will be zero. As a result, that channel becomes an unused channel which can be removed. By the same token, if a channel is pruned, then no matter what values are in the previous layer's corresponding filter, the GEMM contribution of the feature maps passing through this channel will be all zeros, which makes that corresponding filter unused. Figure 2 gives a clear illustration of the correspondence between ADMM-pruned filters/columns and the resulting unused channels/filters.
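This correspondence can be sketched as follows, assuming the common (filters, channels, kh, kw) CONV weight layout; the function name and layout choice are ours:

```python
import numpy as np

def find_unused_paths(W_curr, W_next):
    """W_curr, W_next: CONV weights of two consecutive layers in
    (filters, channels, kh, kw) layout. A fully pruned filter in the
    current layer produces "blank" feature maps, so the matching
    channel of the next layer is unused; a fully pruned channel in the
    next layer makes the matching filter of the current layer unused."""
    pruned_filters = np.flatnonzero(np.abs(W_curr).sum(axis=(1, 2, 3)) == 0)
    pruned_channels = np.flatnonzero(np.abs(W_next).sum(axis=(0, 2, 3)) == 0)
    unused_next_channels = pruned_filters   # removable in the next layer
    unused_curr_filters = pruned_channels   # removable in the current layer
    return unused_curr_filters, unused_next_channels
```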
We further generalize the detection of unused filters caused by empty channels by creating a criterion that defines the "emptiness" of a channel. Suppose $n$ is the number of columns per channel in layer $i$, and $\gamma$ is the emptiness ratio. We have

$$\gamma = \frac{\text{number of all-zero columns in the channel}}{n}.$$
If $\gamma$ exceeds a pre-defined threshold, we can assume that the channel is empty. But this indiscriminate criterion has a limitation: after pruning, the remaining columns remain for a reason, namely that they are relatively "important" to the whole network. If we remove all columns in every channel that satisfies the emptiness criterion, a disastrous accuracy drop will occur that is hard to recover by retraining.
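The emptiness ratio by itself can be computed per channel as below; the slicing layout (each channel owning a contiguous block of $n$ GEMM columns) is an assumption for illustration:

```python
import numpy as np

def emptiness_ratio(W, channel, n):
    """Fraction of all-zero columns among the n GEMM columns that
    belong to `channel` in this layer's weight matrix W (channels
    assumed to occupy contiguous column blocks)."""
    block = W[:, channel * n:(channel + 1) * n]
    empty_cols = int(np.sum(~np.any(block != 0, axis=0)))
    return empty_cols / n
```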
To make our previous assumption work, we design a unified algorithm called "Network Purification", which targets the non-optimality problem of the ADMM process. By solving this problem, the above assumption can be validated simultaneously. We add a criterion constraint that compares the importance of the remaining columns channel-wise and helps decide which columns can be sacrificed and which cannot. We set up a criterion constant $\delta_j$ to represent channel $j$'s importance score, which is derived from an accumulation procedure over the magnitudes of the remaining columns in channel $j$:
One can think of this process as collecting evidence on whether each channel that contains one or several remaining columns should be removed. Network Purification also purifies the remaining filters and thus removes more unused paths in the network. The effect of combining Network Purification and Unused Path Removal (P-RM) is that the network achieves an extremely high compression rate without any accuracy drop. Algorithm 1 shows our generalized P-RM method, where the thresholds are hyper-parameter values.
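Since Algorithm 1 is not reproduced here, the following sketch conveys only the flavor of per-channel purification: the importance score (sum of remaining column L2 norms) and both thresholds are illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

def purify_layer(W, n, gamma_th, delta_th):
    """Zero out a whole channel only when it is nearly empty
    (emptiness ratio > gamma_th) AND its accumulated importance
    score (assumed here: sum of its column L2 norms) stays below
    delta_th, so relatively important columns are never sacrificed."""
    W = W.copy()
    n_channels = W.shape[1] // n
    for j in range(n_channels):
        block = W[:, j * n:(j + 1) * n]        # view into the copy
        col_norms = np.linalg.norm(block, axis=0)
        gamma = float(np.mean(col_norms == 0))  # emptiness ratio
        delta = float(col_norms.sum())          # importance score (assumed)
        if gamma > gamma_th and delta < delta_th:
            block[:] = 0.0                      # purge the near-empty channel
    return W
```

A nearly empty channel whose surviving columns carry little weight is purged; a dense or high-scoring channel is left untouched.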
4 Experimental Results
Figure 3 shows that ADMM's non-optimality exists in a structured pruned model: by purifying the redundant weights, we can further optimize the loss function. All of these results are based on the Network Purification process without retraining. The purification along with removal of unused paths (P-RM) has a strong compression-boosting effect when the network is deep enough.
|Structured Weight Pruning Statistics|
|Number of parameters reduced: 25.2K|
|Number of parameters reduced on CIFAR-10: VGG-16: 14.42M, ResNet-18: 10.97M|
|Number of parameters reduced on ImageNet: AlexNet: 1.66M, ResNet-18: 7.81M, ResNet-50: 14.77M|
Table 1 shows our experimental results of network pruning on LeNet-5, VGG-16, AlexNet, and ResNet-18/50. The accuracy and pruning-ratio results of the SSL (Wen et al., 2016) method are compared with our structured pruned LeNet-5 and AlexNet models, and the 2PFPCE (Min et al., 2018) and AMC (He et al., 2018) (ResNet-50) methods are compared with our VGG-16 and ResNet-18 results on CIFAR-10. By using ADMM structured pruning with Network Purification and Unused Path Removal (P-RM), LeNet-5 achieves a 39× compression rate without accuracy drop, 88× with negligible accuracy drop, and 232× with only a 0.7% accuracy drop. On the CIFAR-10 dataset, our VGG-16 compressed model achieves 44× compression without accuracy degradation and 50× with a 1% accuracy drop, and our ResNet-18 achieves 52× compression without noticeable accuracy loss and 60× with a 0.9% accuracy loss.
On the ImageNet dataset, we increase the AlexNet compression rate from 4.69× to 5.13×, ResNet-18 from 3.02× to 3.33×, and ResNet-50 from 2× to 2.7×. None of our compression-rate boosts causes noticeable accuracy degradation.
In this paper, we provide an ADMM regularized method to achieve highly compressed DNN models with a combination of different weight pruning structures while maintaining network accuracy at a high level. We further investigate post-processing of ADMM pruning to address the non-optimal solutions caused by the non-convex DNN loss function. We propose Network Purification and Unused Path Removal (P-RM), which increase our model compression rate significantly.
- Boyd et al. (2011) Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 2011.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
- Ding et al. (2018) Ding, C., Ren, A., Yuan, G., Ma, X., Li, J., Liu, N., Yuan, B., and Wang, Y. Structured weight matrices-based hardware accelerators in deep neural networks: FPGAs and ASICs. In Proceedings of the 2018 Great Lakes Symposium on VLSI, pp. 353–358. ACM, 2018.
- Han et al. (2015) Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In NeurIPS, 2015.
- He et al. (2018) He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800, 2018.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105, 2012.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.
- Min et al. (2018) Min, C., Wang, A., Chen, Y., Xu, W., and Chen, X. 2pfpce: Two-phase filter pruning based on conditional entropy. arXiv preprint arXiv:1809.02220, 2018.
- Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Wen et al. (2016) Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In NeurIPS, pp. 2074–2082, 2016.
- Ye et al. (2019) Ye, S., Feng, X., Zhang, T., Ma, X., Lin, S., Li, Z., Xu, K., Wen, W., Liu, S., Tang, J., et al. Progressive dnn compression: A key to achieve ultra-high weight pruning and quantization rates using admm. arXiv preprint arXiv:1903.09769, 2019.
- Zhang et al. (2018) Zhang, T., Zhang, K., Ye, S., Li, J., Tang, J., Wen, W., Lin, X., Fardad, M., and Wang, Y. Adam-admm: A unified, systematic framework of structured weight pruning for dnns. arXiv preprint arXiv:1807.11091, 2018.