1 Introduction
Recently, deep Convolutional Neural Networks (CNNs) have achieved remarkable success in computer vision tasks by leveraging large-scale networks trained on large amounts of data. However, CNNs usually incur massive computation and storage costs, which hinders their deployment on mobile and embedded devices. To solve this problem, many research works focus on compressing CNNs. Parameter pruning is a promising approach for CNN compression and acceleration, which aims at eliminating redundant model parameters at a tolerable performance loss. To avoid hardware-unfriendly irregular sparsity, structured pruning has been proposed for CNN acceleration AnwSun16 ; SzeCheYanEme17 . In the im2col implementation ChePurSim06 ; CheWooVan14 of convolution, weight tensors are expanded into matrices, so there are generally two kinds of structured sparsity, i.e., row sparsity (or filter-wise sparsity) and column sparsity (or shape-wise sparsity) WenWuWan16 ; Wang2018Structured .
There are mainly two categories of structured pruning. One is importance-based methods, which prune weights in groups based on some established importance criterion LiKadDurEtAl17 ; MolTyrKar17 ; Wang2018Structured . The other is regularization-based methods, which add group regularization terms to learn structured sparsity WenWuWan16 ; VadLem16 ; He2017Channel . Existing group regularization approaches mainly focus on the regularization form (e.g., Group LASSO Yuan2006Model ) to learn structured sparsity, while ignoring the influence of the regularization factor. In particular, they tend to use a single large, constant regularization factor for all weight groups in the network WenWuWan16 ; VadLem16 , which has two problems. First, this 'one-size-fits-all' regularization scheme rests on a hidden assumption that the weights in different groups are equally important, which does not hold true, since weights with larger magnitude tend to be more important than those with smaller magnitude. Second, few works have noticed that the expressiveness of CNNs is so fragile yosinski2014transferable during pruning that it cannot withstand a large penalty term from the beginning, especially for large pruning ratios and compact networks (like ResNet HeZhaRenSun16 ). AFP DinDinHanTan18 was proposed to solve the first problem, but ignored the second one. In this paper, we propose a new regularization-based method named IncReg to incrementally learn structured sparsity.
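The two sparsity patterns can be made concrete with a minimal NumPy sketch of the im2col weight layout (our own illustration, not code from the paper): each filter becomes one row of the expanded weight matrix, so filter-wise pruning removes rows and shape-wise pruning removes columns.

```python
import numpy as np

# Hypothetical sizes: 4 filters, 3 input channels, 3x3 kernels.
N, C, H, W = 4, 3, 3, 3
kernel = np.random.randn(N, C, H, W)

# im2col view of the weights: one row per filter, one column per
# (channel, y, x) kernel position shared by all filters.
weight_matrix = kernel.reshape(N, C * H * W)   # shape (4, 27)

# Row sparsity (filter-wise): zeroing row i removes filter i entirely.
weight_matrix[1, :] = 0.0
# Column sparsity (shape-wise): zeroing column j removes the same kernel
# position from every filter.
weight_matrix[:, 5] = 0.0

assert np.all(weight_matrix[1] == 0)       # a whole filter is gone
assert np.all(weight_matrix[:, 5] == 0)    # a whole shape column is gone
```

Either pattern keeps the expanded matrix dense after removing the zeroed rows/columns, which is why structured sparsity translates into actual speedup on standard GEMM-based convolution.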
Apart from pruning, there are also other kinds of methods for CNN acceleration, such as low-rank decomposition LebYarRakOseLem16 ; ZhaZouHeSun16 , quantization CheWilTyrWeiChe15 ; CouBen16 ; LinCouMemBen16 ; RasOrdRedFar16 , knowledge distillation Hinton2015Distilling and architecture redesign IanMosAsh16 ; Howard2017MobileNets ; Zhang2017ShuffleNet ; zhong2018shift . Our method is orthogonal to these methods.
2 The Proposed Method
Given a conv kernel, modeled by a 4-D tensor $W^{(l)} \in \mathbb{R}^{N_l \times C_l \times H_l \times W_l}$, where $N_l$, $C_l$, $H_l$ and $W_l$ are the dimensions of the $l$th ($1 \le l \le L$) weight tensor along the axes of filter, channel, height and width, respectively, our proposed objective function for regularization can be formulated as
$E(W) = L(W) + \lambda R(W) + \sum_{l=1}^{L} \sum_{g=1}^{G^{(l)}} \lambda_g^{(l)} R\big(W_g^{(l)}\big),$
where $W$ denotes the collection of all weights in the CNN;
$L(W)$ is the loss function for prediction;
$R(W)$ is non-structured regularization on every weight, i.e., weight decay in this paper; $R(W_g^{(l)})$ is the structured sparsity regularization term on group $g$ of layer $l$, and $G^{(l)}$ is the number of weight groups in layer $l$. In VadLem16 ; WenWuWan16 , the authors used the same $\lambda_g$ for all groups and adopted Group LASSO Yuan2006Model for $R(W_g^{(l)})$. In this work, since we emphasize that the key problem of group regularization lies in the regularization factor rather than the regularization form, we use the most common regularization form, weight decay, for $R(W_g^{(l)})$, but we vary the regularization factors $\lambda_g^{(l)}$ for different weight groups and at different iterations. In particular, we propose a theorem (Theorem 1 and its proof in the Appendix) to show that we can gradually compress the magnitude of a parameter by adjusting its regularization factor.
Our method prunes all the conv layers simultaneously and independently. For simplicity, we omit the layer notation $l$ in the following description. All the $\lambda_g$'s are initialized to zero. At each iteration, $\lambda_g$ is increased by $\Delta\lambda_g$. Like AFP DinDinHanTan18 , we agree that unimportant weights should be punished more, so we propose a decreasing piecewise-linear punishment function (Eqn.(1), Fig.1) to determine $\Delta\lambda_g$. Note that the regularization increment is negative (i.e., actually a reward) when the ranking $r_g$ is above the threshold ranking $RG$, since being above the threshold means these weights are expected to stay in the end. Regularization on these important weights is not only unnecessary but also, as our experiments confirm, very harmful.
$\Delta\lambda_g = \begin{cases} A\left(1 - \dfrac{r_g}{RG}\right), & \text{if } r_g \le RG, \\[4pt] -\dfrac{A}{G - RG}\,(r_g - RG), & \text{otherwise}, \end{cases}$   (1)
where $R$ is the pre-assigned pruning ratio for a layer, $G$ is the number of weight groups, $A$ is a hyper-parameter in our method describing the maximum penalty increment (set to half of the original weight decay by default), and $r_g$ is the ranking obtained by sorting the groups in ascending order based on a proposed importance criterion, which is essentially an averaged ranking over time, defined as $r_g = \frac{1}{N}\sum_{n=1}^{N} r_g^{(n)}$, where $r_g^{(n)}$ is the ranking by $L_1$-norm at the $n$th iteration and $N$ is the number of passed iterations. This averaging serves as smoothing for a more stable pruning process.
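The punishment function can be sketched as follows (a minimal sketch of Eqn.(1) under our own reading; the function name and argument names are illustrative, not from a released implementation). Groups ranked below the threshold $RG$ receive a positive increment of at most $A$, the group at the threshold receives zero, and groups ranked above it receive a negative increment down to $-A$.

```python
def reg_increment(r, G, R, A):
    """Piecewise-linear penalty increment for the group ranked r
    (ascending importance) among G groups, with pruning ratio R
    (0 < R < 1) and maximum increment A.  Positive below the
    threshold ranking R*G, zero at it, negative (a 'reward') above it."""
    thresh = R * G
    if r <= thresh:
        return A * (1.0 - r / thresh)
    return -A * (r - thresh) / (G - thresh)
```

For example, with G = 100 groups and R = 0.5, the least important group (r = 0) gets the full increment A, the group at r = 50 gets 0, and the most important group (r = 100) gets -A.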
As training proceeds, the regularization factors of different weight groups increase gradually, pushing the weights towards zero little by little. When the magnitude of a weight group falls below some threshold, the weights are permanently removed from the network, leading to increased structured sparsity. When the sparsity of a layer reaches its pre-assigned pruning ratio $R$, that layer automatically stops structured regularization. Finally, when all conv layers reach their pre-assigned pruning ratios, pruning is over, followed by a retraining process to regain accuracy.
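The per-layer schedule described above can be sketched as a simplified single-layer simulation (our own toy sketch under stated assumptions: group importance is plain magnitude, the only weight update is the decay term itself, and the removal threshold, learning rate and clipping of $\lambda_g$ at zero are illustrative choices, not the paper's exact settings):

```python
import numpy as np

def incremental_prune(groups, R, A=5e-4, lr=0.1, eps=1e-4, max_iter=20000):
    """groups: 1-D array of group magnitudes (stand-ins for L1 norms).
    Each iteration: rank alive groups by magnitude (ascending), add the
    ranking-based increment to each group's regularization factor lambda_g,
    shrink each group by its own weight decay, and permanently remove groups
    whose magnitude falls below eps, until the fraction of removed groups
    reaches the target ratio R (assumes 0 < R < 1)."""
    w = np.array(groups, dtype=float)
    lam = np.zeros_like(w)
    alive = np.ones(len(w), dtype=bool)
    G = len(w)
    target = int(round(R * G))
    thresh = R * G                       # threshold ranking RG
    for _ in range(max_iter):
        if (~alive).sum() >= target:
            break                        # layer reached its pruning ratio
        # rank alive groups by magnitude, ascending; dead groups go last
        order = np.argsort(np.abs(w) + np.where(alive, 0.0, np.inf))
        ranks = np.empty(G)
        ranks[order] = np.arange(G)
        # piecewise-linear increment: positive below RG, negative above
        inc = np.where(ranks <= thresh,
                       A * (1 - ranks / thresh),
                       -A * (ranks - thresh) / (G - thresh))
        lam = np.maximum(lam + np.where(alive, inc, 0.0), 0.0)  # clip at 0
        w = np.where(alive, w * (1.0 - lr * lam), w)  # weight-decay shrinkage
        newly_dead = alive & (np.abs(w) < eps)
        w[newly_dead] = 0.0
        alive &= ~newly_dead
    return w, alive
```

With `groups = [0.01, 0.02, 1.0, 1.2]` and `R = 0.5`, the two small groups accumulate regularization and are driven to zero over a few hundred iterations, while the two large ones keep a near-zero factor and survive, mirroring the gradual, ranking-driven removal described above.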
3 Experiments
3.1 Analysis with ConvNet on CIFAR10
We first compare our proposed IncReg with two other group regularization methods, i.e., SSL WenWuWan16 and AFP DinDinHanTan18 , with ConvNet KriSutHin12 on CIFAR-10 KriHin09 , where both row sparsity and column sparsity are explored. Caffe JiaSheDonEtAl14 is used for all of our experiments. Experimental results are shown in Tab.1. We can see that IncReg consistently achieves higher speedups and accuracies than the two constant regularization schemes. Notably, even though AFP achieves performance similar to our method under relatively small speedup ratios, when the speedup ratio is large, our method outperforms AFP by a large margin. We argue that this is because the incremental regularization gives the network more time to adapt during pruning, which is especially important in the face of large pruning ratios.
3.2 VGG-16 and ResNet-50 on ImageNet
We further evaluate our method with VGG-16 Simonyan2014Very and ResNet-50 HeZhaRenSun16 on ImageNet DenDonSocEtAl09 . We download the open-sourced caffemodels as our pre-trained models and use their single-view top-5 accuracies on the ImageNet validation set as baselines. For VGG-16, following SPP Wang2018Structured , the proportions of remaining ratios of low layers (conv1_x to conv3_x), middle layers (conv4_x) and high layers (conv5_x) are set to the same values as in SPP for easy comparison. For ResNet-50, a constant pruning ratio is adopted for all conv layers. Pruning is conducted with a fixed learning rate, followed by retraining. Experimental results are shown in Tab.2. On VGG-16, our method is slightly better than CP and SPP, and outperforms FP by a significant margin. Notably, since we use the same pruning ratios as SPP, the only explanation for the performance improvement is a better pruning process itself, guided by our incremental regularization scheme. On ResNet-50, our method is significantly better than CP and SPP, demonstrating the effectiveness of IncReg when pruning compact networks. Moreover, to confirm the actual speedup, we also evaluate our method with VGG-16 on CPU and GPU. The result is shown in Tab.3.
Table 3: Inference time of VGG-16 on CPU (baseline: 1815 ms) and GPU (baseline: 5.159 ms), comparing CP He2017Channel and our method.
4 Conclusion
We propose a new structured pruning method based on an incremental way of regularization, which helps CNNs transfer their expressiveness to the remaining parts during pruning by increasing the regularization factors of unimportant weight groups little by little. Our method proves effective on popular CNNs compared with state-of-the-art methods, especially in the face of large pruning ratios and compact networks.
Acknowledgment. This work is supported by the Fundamental Research Funds for the Central Universities under Grant 2017FZA5007, Natural Science Foundation of Zhejiang Province under Grant LY16F010004 and Zhejiang Public Welfare Research Program under Grant 2016C31062.
References
 (1) S. Anwar and W. Sung. Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639, 2016.
 (2) K. Chellapilla, S. Puri, and P. Simard. High performance convolutional neural networks for document processing. In International Workshop on Frontiers in Handwriting Recognition, 2006.
 (3) W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
 (4) S. Chetlur, C. Woolley, and P. Vandermersch. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
 (5) M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 (6) J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 (7) X. Ding, G. Ding, J. Han, and S. Tang. Auto-balanced filter pruning for efficient convolutional neural networks. In AAAI, 2018.
 (8) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 (9) Y. He, J. Lin, Z. Liu, H. Wang, L.J. Li, and S. Han. AMC: Automl for model compression and acceleration on mobile devices. In ECCV, 2018.
 (10) Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
 (11) G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 (12) A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 (13) F. Iandola, M. Moskewicz, and K. Ashraf. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
 (14) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 (15) A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 (16) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 (17) V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
 (18) V. Lebedev and V. Lempitsky. Fast convnets using groupwise brain damage. In CVPR, 2016.
 (19) H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In ICLR, 2017.
 (20) Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2016.
 (21) P. Molchanov, S. Tyree, and T. Karras. Pruning convolutional neural networks for resource efficient inference. In ICLR, 2017.
 (22) M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
 (23) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 (24) V. Sze, Y. H. Chen, T. J. Yang, and J. Emer. Efficient processing of deep neural networks: A tutorial and survey. arXiv preprint arXiv:1703.09039, 2017.
 (25) H. Wang, Q. Zhang, Y. Wang, and H. Hu. Structured probabilistic pruning for convolutional neural network acceleration. In BMVC, 2018.
 (26) W. Wen, C. Wu, and Y. Wang. Learning structured sparsity in deep neural networks. In NIPS, 2016.
 (27) J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.

 (28) M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, 68(1):49–67, 2006.
 (29) X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
 (30) X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. PAMI, 38(10):1943–1955, 2016.
 (31) H. Zhong, X. Liu, Y. He, Y. Ma, and K. Kitani. Shift-based primitives for efficient convolutional neural networks. arXiv preprint arXiv:1809.08458, 2018.
Appendix: The Proposed Theorem and Proof
Theorem 1
Considering the objective function
$F(w, \lambda) = L(w) + \frac{\lambda}{2} w^2,$   (2)
if there exists a tuple $(w_0, \lambda_0)$ which satisfies the following three properties:

$\lambda_0 \ge 0$;

$L(w)$ has a second derivative at $w_0$;

$w_0$ is a local minimum of the function $F(w, \lambda_0)$,
then there exists a $\delta > 0$ such that for any $\lambda_1 \in (\lambda_0, \lambda_0 + \delta)$, we can find a $w_1$ which satisfies:

$w_1$ is a local minimum of the function $F(w, \lambda_1)$;

$|w_1| \le |w_0|$.
Proof of Theorem 1: For a given $\lambda$, the $w^*$ which is a local minimum of the function $F(w, \lambda)$ should satisfy $\partial F / \partial w \,|_{w^*} = 0$, which gives:
$L'(w^*) + \lambda w^* = 0.$   (3)
In this situation, we can calculate the derivative of $w^*$ with respect to $\lambda$ by implicitly differentiating Eqn.(3):
$\frac{d w^*}{d \lambda} = -\frac{w^*}{L''(w^*) + \lambda}.$   (4)
Since $w^*$ is a local minimum of the function $F(w, \lambda)$, it should satisfy that $\partial^2 F / \partial w^2 \,|_{w^*} \ge 0$, which yields
$L''(w^*) + \lambda \ge 0,$   (5)
and, excluding the degenerate case in which Eqn.(4) is undefined,
$L''(w^*) + \lambda > 0.$   (6)
If we take Eqn.(6) into (4), we can obtain
$\mathrm{sign}\left(\frac{d w^*}{d \lambda}\right) = -\,\mathrm{sign}(w^*).$   (7)
By taking Eqn.(7) together with (4) and (6), we can conclude that for $w^* \ne 0$,
$\frac{d |w^*|}{d \lambda} = \mathrm{sign}(w^*)\,\frac{d w^*}{d \lambda} = -\frac{|w^*|}{L''(w^*) + \lambda} < 0.$   (8)
In other words, when $w^*$ is greater than zero, a small increment of $\lambda$ will decrease the value of $w^*$; and when $w^*$ is less than zero, a small increment of $\lambda$ will increase the value of $w^*$. In both cases, when $\lambda$ increases, $|w^*|$ will decrease at the new local minimum of $F(w, \lambda)$.
This completes the proof of Theorem 1.
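As a quick numerical sanity check of Theorem 1 (our own toy example, not from the paper): take $L(w) = (w - 1)^2$, so $F(w, \lambda) = (w-1)^2 + \frac{\lambda}{2} w^2$ has a unique minimum at $w^*(\lambda) = 2/(2+\lambda)$, whose magnitude strictly decreases as $\lambda$ grows.

```python
def w_star(lam):
    # Closed-form minimizer of F(w) = (w - 1)**2 + (lam / 2) * w**2:
    # F'(w) = 2*(w - 1) + lam*w = 0  =>  w* = 2 / (2 + lam)
    return 2.0 / (2.0 + lam)

lams = [0.0, 0.5, 1.0, 2.0, 8.0]
mags = [abs(w_star(l)) for l in lams]
# |w*| shrinks monotonically toward 0 as lambda grows (1.0, 0.8, 2/3, 0.5, 0.2),
# exactly as Theorem 1 predicts for an increasing regularization factor.
assert all(a > b for a, b in zip(mags, mags[1:]))
```

This also illustrates why the incremental schedule works: raising $\lambda_g$ step by step moves the group's optimum smoothly toward zero rather than forcing it there at once.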