1 Introduction
In recent years, deep convolutional neural networks (CNNs) have achieved remarkable success in computer vision tasks such as classification, detection and segmentation by training large-scale networks on large amounts of data. However, CNNs incur massive computation and storage costs, which hinders their deployment on mobile and embedded devices. To reduce the computation cost, much research has been carried out on compressing CNNs, including designing compact network architectures, parameter quantization, matrix decomposition and parameter pruning.
Parameter pruning is a promising approach for CNN compression and acceleration, which aims at eliminating redundant model parameters with tolerable performance loss. One problem of parameter pruning is that it often produces unstructured and random connections, which are hard to exploit for speedup on general hardware platforms [Han et al.2016]. Even with sparse matrix kernels, the speedup is very limited [Wen et al.2015]. To solve this problem, many works focus on structured pruning, which shrinks a network into a thinner one so that the pruned network can be implemented efficiently [Anwar and Sung2016, Sze et al.2017].
There are two categories of structured pruning. The first is importance-based methods, which prune weights in groups based on some established importance criteria. The second is regularization-based methods, which add group regularization terms to the objective function and prune the weights by optimizing the objective function. In this paper, we propose a new regularization-based method to learn group sparsity, which also incorporates the idea of importance-based methods.
Existing group regularization approaches tend to use the same regularization parameter for all weight groups in the network [Wen et al.2015, Lebedev and Lempitsky2016]. One hidden assumption of assigning regularization parameters equally is that the weights in different groups are equally important. However, intuitively, weights with greater magnitude tend to be more important during pruning than those with smaller magnitude. Thus, we need a scheme to distinguish the importance of weights and treat them with different regularization strengths. Compared with previous works, the main novelty of our proposed method is to assign different regularization parameters to different weight groups based on their estimated importance.
One advantage of the proposed method is that it is theoretically sound. We prove that the proposed scheme of varying regularization parameters zeroes out unimportant weights and finally makes the network converge to a pruned one. Considering that many parameter pruning methods are heuristic, the proposed method stands out for its concrete theoretical basis.
Our method accelerates AlexNet for theoretical speedup while improving its performance, and VGG16 for theoretical speedup with accuracy loss. The proposed method can also be applied to compact multi-branch networks such as ResNet50, on which we achieve speedup and suffer only accuracy loss.
2 Related Work
Parameter pruning has a long history in the development of neural networks [Reed1993], and can be categorized into importance-based and regularization-based methods. Importance-based methods prune weights in groups based on some established importance criteria. For example, Optimal Brain Damage (OBD) [LeCun et al.1990] and Optimal Brain Surgeon (OBS) [Hassibi and Stork1993] proposed importance criteria based on the second-order derivatives of the loss function, derived from Taylor expansions. Deep Compression [Han et al.2015a, Han et al.2015b] pruned small-magnitude weights and obtained parameter reduction on AlexNet and VGG16. Taylor Pruning [Molchanov et al.2017] also derived a new importance criterion based on Taylor expansion, but used the first-order derivatives, which are easy to access in backpropagation. Their method was shown to be effective for pruning filters of CNNs on transfer learning tasks. [Li et al.2017] used the norm of filters to guide one-shot filter pruning, which proved effective on CIFAR10 and ImageNet with VGG16 and ResNet, but the reported speedup is very limited. Channel Pruning [He et al.2017] alternately uses LASSO-regression-based channel selection and feature map reconstruction to prune filters, and achieves the state-of-the-art result on VGG16.
Among regularization-based methods, [Weigend et al.1991] added a penalty term to eliminate redundant weights for better generalization. Group-wise Brain Damage [Lebedev and Lempitsky2016] and Structured Sparsity Learning [Wen et al.2015] embedded Group LASSO [Yuan and Lin2006] into CNN regularization and obtained regular-shape sparsity.
Apart from parameter pruning, there are other methods for CNN model compression and acceleration, including compact architecture design, parameter quantization and matrix decomposition. Compact architecture design targets more efficient and compact neural network architectures. For example, SqueezeNet [Iandola et al.2016] stacks compact blocks and uses about 50x fewer parameters than the original AlexNet. MobileNet [Howard et al.2017] and ShuffleNet [Zhang et al.2017] leveraged new convolutional implementations to design networks for mobile applications.
Parameter quantization methods reduce CNN storage by quantizing the weights and using fewer representation bits. [Chen et al.2015] proposed a hash function to group the weights of each CNN layer into different hash buckets for parameter sharing. As the extreme form of quantization, binarized networks [Courbariaux and Bengio2016, Lin et al.2016, Rastegari et al.2016] learn binary weights or activations. Quantization reduces floating-point computational complexity, but the actual speedup depends heavily on the specific hardware implementation.
Matrix decomposition decomposes large matrices into several small matrices to reduce computation. [Denton et al.2014] showed that the weight matrix of a fully-connected layer can be compressed via truncated SVD. Tensor decomposition was later proposed and obtained better compression results than SVD [Novikov et al.2014]. Several methods based on low-rank decomposition of convolutional kernel tensors were also proposed to accelerate convolutional layers [Lebedev et al.2016, Zhang et al.2016].
3 The Proposed Method
Suppose the weights of the convolutional layers in a CNN form a sequence of 4D tensors. Each weight tensor has four dimensions, along the axes of filter, channel, spatial height and spatial width, respectively. Our proposed objective function for regularization can be formulated as:
(1) 
The objective in Eqn.(1) is defined over the collection of all weights in the CNN and combines three terms: the loss function caused by data; a non-structured regularization on every weight, which is a norm penalty in this paper; and a structured sparsity regularization on each layer. In [Wen et al.2015, Lebedev and Lempitsky2016], the authors used the same regularization parameter for all groups and adopted Group LASSO for the structured term because it can effectively zero out weights in groups. In this paper, we use the squared norm for the structured term and vary the regularization parameters for different groups.
The finally learned ‘structure’ is decided by the way the weights are split into groups. Normally, there are filter-wise, channel-wise and shape-wise sparsity, with different sizes of weight groups [Wen et al.2015]. The sparsity terms are represented as
In the im2col implementation, weight tensors are expanded into matrices in which shapes are represented as columns and filters are represented as rows. Thus, shape-wise and filter-wise sparsity are equivalent to column and row sparsity, as shown in Fig.1.
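The correspondence between the 4D weight tensor and the row and column groups can be illustrated with a minimal sketch. It assumes the tensor is stored as nested Python lists indexed by (filter, channel, height, width); the helper names `to_matrix`, `row_groups` and `column_groups` are hypothetical and not part of the original work.

```python
def to_matrix(weights):
    """Flatten a (filter, channel, height, width) tensor into a 2D matrix.

    Each filter becomes one row; each (channel, height, width) position
    becomes one column, as in an im2col-style weight layout.
    """
    return [
        [w for channel in filt for kernel_row in channel for w in kernel_row]
        for filt in weights
    ]


def row_groups(matrix):
    """Filter-wise groups: every row of the matrix is one weight group."""
    return [list(row) for row in matrix]


def column_groups(matrix):
    """Shape-wise groups: every column of the matrix is one weight group."""
    return [list(col) for col in zip(*matrix)]


# Toy tensor with 2 filters, 2 channels and 1x2 kernels (values are arbitrary).
weights = [
    [[[1.0, 2.0]], [[3.0, 4.0]]],
    [[[5.0, 6.0]], [[7.0, 8.0]]],
]
matrix = to_matrix(weights)
rows = row_groups(matrix)
cols = column_groups(matrix)
```

Pruning a row group removes a whole filter, while pruning a column group removes the same (channel, height, width) position from every filter.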
3.1 Theoretical Analysis
In the proposed method, the regularization parameter is assigned differently to different weight groups. We find that if we slightly increase the regularization parameter of a weight group and then train the network through backpropagation until the objective function, i.e. Eqn.(1), reaches a local minimum, the norm of that group decreases. This phenomenon is formally summarized by the following theorem.
Theorem 1
Considering the objective function
(2) 
if there exists a tuple which satisfies the following three properties:

;

has the second derivative at ;

is the local minimum of function ,
then there exists an that for any , we can find an which satisfies:

is the local minimum of function ;

.
Theorem 1, which is proved in Appendix A, indicates that we can slightly increase the regularization parameter to compress the magnitude of weights toward zero. By Eqn.(12), the larger the increase of the regularization parameter, the more the magnitude of the weights is compressed.
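The effect described by the theorem can be checked numerically on a toy scalar problem. The sketch below assumes a quadratic data loss 0.5*(w - target)^2 and a squared-norm penalty 0.5*lam*w^2; both the loss and the names `minimizer` and `gradient_descent` are illustrative assumptions, not the objective of an actual CNN.

```python
def minimizer(target, lam):
    """Closed-form minimizer of E(w) = 0.5*(w - target)**2 + 0.5*lam*w**2."""
    return target / (1.0 + lam)


def gradient_descent(target, lam, w=0.0, lr=0.1, steps=500):
    """Minimize the toy objective by plain gradient descent."""
    for _ in range(steps):
        grad = (w - target) + lam * w  # dE/dw for the assumed objective
        w -= lr * grad
    return w


# Increasing the regularization parameter moves the minimum toward zero.
lams = [0.0, 0.5, 1.0, 2.0]
minima = [gradient_descent(target=3.0, lam=lam) for lam in lams]
```

In this toy setting the minimum is target / (1 + lam), so its magnitude shrinks monotonically as the regularization parameter grows, which is the behaviour the theorem states for a local minimum.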
3.2 Method Description
Theorem 1 guarantees that we can modify the norm of weight groups by increasing or decreasing their corresponding regularization parameters. Thus, we can assign different regularization parameters to weight groups based on their importance to the network. In this paper, we use the norms as the importance criterion for weight groups. Note that our method can be easily applied to other criteria such as norm and Taylor expansions.
Normalization of the importance criteria is necessary because the values of the norms vary hugely across different networks, layers and weight groups. Unlike previous works, our normalization is based on the ranks of the weight groups within the same layer. Rank-based normalization has two advantages: (1) compared to other normalization methods such as max/min normalization, the range of ranks is fixed by the total number of weight groups in the layer; (2) for the pruning task, we need to set a pruning ratio for each layer, which specifies the fraction of lowest-ranked weight groups to be pruned when pruning is finished. Normalization by ranks makes the pruning process controllable, since it works directly toward the goal of pruning.
Specifically, we sort the weight groups by their norms in ascending order. To mitigate the oscillation of ranks within a single training iteration, we average the rank of each group over several training iterations. For a weight group, its average rank over these iterations is defined as
(3) 
Here the averaged quantity is the rank of the group at each iteration. The final average rank is obtained by sorting the averaged values of the different weight groups in ascending order, so that it again spans the fixed range of ranks.
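The rank averaging step can be sketched as follows. This is a minimal illustration that assumes 0-based ranks and a plain arithmetic mean over the recorded iterations; the function names and the toy norm values are hypothetical.

```python
def ranks_by_value(values):
    """Return the 0-based rank of each entry under ascending order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def averaged_ranks(norm_history):
    """norm_history[t][g] is the norm of weight group g at iteration t."""
    num_groups = len(norm_history[0])
    per_iteration = [ranks_by_value(norms) for norms in norm_history]
    mean_rank = [
        sum(ranks[g] for ranks in per_iteration) / len(per_iteration)
        for g in range(num_groups)
    ]
    # Sort the mean ranks again so the result spans 0 .. num_groups - 1.
    return ranks_by_value(mean_rank)


# Three groups observed over three iterations; groups 1 and 2 swap once.
history = [
    [0.9, 0.1, 0.5],
    [0.8, 0.6, 0.5],
    [0.9, 0.2, 0.4],
]
final_ranks = averaged_ranks(history)
```

Group 1 is ranked lowest in two of the three iterations, so the averaging keeps it at the bottom despite the temporary swap.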
Our aim is to assign an increment to each weight group, so that its regularization parameter is gradually updated through the pruning process.
(4) 
Following the above idea, the increment of each group is assigned according to its average rank by a piecewise linear function, as shown in Eqn.(5).
(5) 
Fig.2 depicts this function. It can be seen from Fig.2 that for weight groups whose norms are small, i.e., whose average ranks are below the threshold set by the pruning ratio, we need to increase their regularization parameters to further decrease their norms; for those with greater norms and ranks above the threshold, we need to decrease their regularization parameters to further increase their norms. After obtaining the increment and the updated regularization parameter by Eqn.(5) and (4), we threshold the latter at zero to prevent negative values of regularization:
(6) 
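The update of a regularization parameter, including the thresholding at zero, can be sketched as below. The piecewise linear `increment` function is a hypothetical stand-in that only reproduces the qualitative shape described in the text (positive below the rank threshold, negative above it); it is not the exact form of Eqn.(5), and the parameter names `prune_ratio` and `step` are assumptions.

```python
def increment(rank, num_groups, prune_ratio, step):
    """Hypothetical piecewise linear increment of a regularization parameter.

    Assumes 0 < prune_ratio < 1 and 0-based ranks in ascending norm order.
    """
    threshold = prune_ratio * num_groups
    if rank < threshold:
        # Positive increment, largest for the lowest-ranked group.
        return step * (1.0 - rank / threshold)
    # Non-positive increment, growing in magnitude toward the highest rank.
    return -step * (rank - threshold) / (num_groups - threshold)


def update_lambda(lam, rank, num_groups, prune_ratio, step):
    """Add the increment and threshold at zero to keep the value non-negative."""
    return max(lam + increment(rank, num_groups, prune_ratio, step), 0.0)


# Four groups, half of which should be pruned.
new_lams = [update_lambda(0.0, rank, 4, 0.5, 0.1) for rank in range(4)]
```

With these toy settings the two lowest-ranked groups receive positive regularization, while the two highest-ranked groups stay at zero because of the thresholding.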
After updating the regularization parameters, the weights of the CNN are trained through backpropagation derived from Eqn.(1).
(7) 
After training the weights for several iterations, we recalculate the increments, and the training process continues until convergence. Since we decrease the norms of the weight groups whose ranks are below the threshold and increase the norms of those whose ranks are above it, the targeted proportion of weight groups should be pruned at the convergence point. In Eqn.(5), a hyperparameter controls the speed of convergence; a greater value results in faster convergence. Finally, we summarize the proposed algorithm in Algorithm 1.
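The overall loop of alternating parameter updates and weight training can be illustrated on a toy problem. The sketch below is not Algorithm 1 itself: it assumes a quadratic data loss that pulls each weight group toward a fixed target, a simplified linear increment, no rank averaging, and hypothetical names and hyperparameter values throughout.

```python
import math


def group_norm(group):
    return math.sqrt(sum(w * w for w in group))


def ranks_ascending(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def prune_toy(targets, prune_ratio=0.5, step=0.5, rounds=200,
              inner_steps=20, lr=0.01, eps=0.05):
    """Toy objective: 0.5*sum||w_g - t_g||^2 + sum (lam_g/2)*||w_g||^2."""
    groups = [list(t) for t in targets]  # start at the unregularized optimum
    lams = [0.0] * len(groups)
    threshold = prune_ratio * len(groups)
    for _ in range(rounds):
        # Re-rank the groups by norm and update each regularization parameter.
        ranks = ranks_ascending([group_norm(g) for g in groups])
        for g, rank in enumerate(ranks):
            delta = step * (threshold - rank) / threshold  # assumed increment
            lams[g] = max(lams[g] + delta, 0.0)
        # Train the weights for a few gradient steps on the toy objective.
        for _ in range(inner_steps):
            for g, group in enumerate(groups):
                for i, w in enumerate(group):
                    grad = (w - targets[g][i]) + lams[g] * w
                    group[i] = w - lr * grad
    pruned = [g for g, group in enumerate(groups) if group_norm(group) < eps]
    return pruned, lams


targets = [[0.2, 0.1], [2.0, 1.0], [0.5, 0.3], [1.5, -1.0]]
pruned, lams = prune_toy(targets)
```

In this toy run the two groups with the smallest initial norms accumulate regularization and are driven close to zero, while the other two keep a zero regularization parameter and stay at their targets.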
3.3 Discussions
The difference between Algorithm 1 and Theorem 1 is that Theorem 1 requires the weights to reach a local minimum of the objective function (Eqn.1) before each update of the regularization parameters, whereas in Algorithm 1 we only train the network through backpropagation for several iterations. This compromise is mainly due to time complexity: obtaining a local minimum by gradually lowering the learning rate is time-consuming. In addition, because the objective functions of CNNs are very complex, it is also difficult to decide whether a local minimum has been reached. Experiments indicate that iterations are enough for convergence of the method.
A notable property of the proposed method is that it automatically adjusts its search steps without any knowledge of the objective function itself. By Eqn.(12), the increment of a weight is inversely proportional to the second derivative of the objective function produced by the CNN. This makes the modification of the weight slower when the objective function reaches steeper areas, even without knowing the exact form of the objective, which makes a refined search for the local minimum possible. We believe that the good performance of the proposed method is partly due to this automatic adjustment of search steps.
4 Experiments
We first compare fixed and varying regularization with ConvNet on the CIFAR10 dataset [Krizhevsky and Hinton2009]. Then we evaluate the proposed method with large networks on large-scale datasets. All of our experiments are conducted with Caffe. We set the weight decay factor the same as the baselines, and set the hyperparameter as half of . Since we focus on acceleration in this paper, we only compress weights in convolutional layers and keep fully-connected layers unchanged. Methods for comparison include Taylor Pruning (TP) [Molchanov et al.2017], Filter Pruning (FP) [Li et al.2017], Structured Sparsity Learning (SSL) [Wen et al.2015], Channel Pruning (CP) [He et al.2017] and Structured Probabilistic Pruning (SPP) [Wang et al.2017]. For all experiments, speedup is calculated by GFLOPs reduction.
4.1 Analysis with ConvNet on CIFAR10
We first compare our proposed method with Group LASSO [Yuan and Lin2006], where the regularization parameter is fixed for all weights. Group LASSO is widely applied to generate group sparsity [Lebedev and Lempitsky2016, Wen et al.2015]. The test network is ConvNet, a small CNN with three convolutional layers and one fully-connected layer. CIFAR10 is a 10-class dataset of tiny images, among which images are used for training, images for validation and the other images for testing.
We first trained a baseline model with test accuracy . Then the proposed method is applied to learn structured sparsity, where both row sparsity and column sparsity are explored. For comparison, we employ the Group LASSO with fixed regularization parameters ( for row sparsity and for column sparsity).
Experimental results are shown in Tab.1.
Method  Row pruning speedup  Row pruning accuracy  Column pruning speedup  Column pruning accuracy
SSL  
Ours 
We can see that varying regularization consistently achieves higher speedups and accuracies than fixed regularization. Fig.3 illustrates how the norms of the columns in the conv1 layer change with training iterations, from which we can see that while fixed regularization gradually suppresses unimportant weights, many important weights are also unnecessarily suppressed. With varying regularization, the magnitude of some weights increases dramatically when training starts, while that of others decreases at the same time. Thus, the pruning process converges much faster. In addition, with varying regularization, the final weights are driven into two distinct groups: important weights with very high norms and unimportant weights with close-to-zero norms.
We also find that under similar speedup ratios, column pruning yields better accuracy than row pruning. This is because a row typically consists of many more weights than a column, so pruning rows may have more severe side effects on accuracy. In the following experiments, we only choose columns as our sparsity groups to obtain better accuracies.
We further compare our proposed method with more recent pruning methods, and the results are shown in Tab.2. Under different speedup ratios, our method consistently outperforms other pruning methods.
Method  Increased err. (%)  

TP (our impl.)  
FP (our impl.)  
SSL  
SPP    
Ours   
4.2 AlexNet on ImageNet
We apply the proposed method to AlexNet [Krizhevsky et al.2012], which is composed of five convolutional layers and three fully-connected layers. We download an open caffemodel from the Caffe model zoo as our pretrained model. The baseline single-view top-5 accuracy on the ImageNet 2012 validation dataset is . All images are rescaled to size , then a patch is randomly cropped from each scaled image and randomly mirrored for data augmentation. For testing, the patches are cropped from the center of the scaled images.
Intuitively, different layers have different sensitivity to pruning, but there are few theories that quantify the redundancy of different layers in deep neural networks. Most pruning methods set the pruning ratios of different layers empirically [Li et al.2017, Molchanov et al.2017, He et al.2017, Wang et al.2017]. Following these works, we empirically set the proportion of remaining columns of the five convolutional layers as . We train the baseline model with batch size for about epochs until reaching the target pruning ratio. We use a small learning rate and keep it fixed during training. Then the pruned model is retrained with batch size to regain accuracy.
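Because speedup is measured by the reduction in floating-point operations, the per-layer proportions of remaining columns determine the theoretical speedup. A minimal sketch is given below, assuming that the cost of a convolutional layer scales linearly with the number of remaining columns of its weight matrix; the layer shapes and ratios are hypothetical toy values, not those of AlexNet or of the reported experiments.

```python
def conv_flops(num_filters, channels, kernel_h, kernel_w, out_h, out_w):
    """Multiply-accumulate count of one convolutional layer."""
    return num_filters * channels * kernel_h * kernel_w * out_h * out_w


def theoretical_speedup(layers, remaining_ratios):
    """Ratio of original to pruned cost over all convolutional layers.

    layers: list of argument tuples for conv_flops.
    remaining_ratios: fraction of columns kept in each layer.
    """
    original = sum(conv_flops(*layer) for layer in layers)
    pruned = sum(
        conv_flops(*layer) * ratio
        for layer, ratio in zip(layers, remaining_ratios)
    )
    return original / pruned


# Hypothetical two-layer network, for illustration only.
toy_layers = [
    (16, 3, 3, 3, 32, 32),
    (32, 16, 3, 3, 16, 16),
]
speedup = theoretical_speedup(toy_layers, [0.5, 0.25])
uniform = theoretical_speedup(toy_layers, [0.5, 0.5])
```

Layers that dominate the total cost contribute most to the speedup, which is one reason the remaining ratios are usually chosen per layer rather than uniformly.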
Experimental results are shown in Tab.3.
Method  Increased err. (%)  

TP  
FP (SPP’s impl.)  
SSL  
SPP    
Proposed     
The proposed method and SPP are consistently better than the other three methods, and the proposed method is slightly better than SPP on average. With speedup, our method can even improve the accuracy while SPP degrades the accuracy in this situation.
4.3 VGG16 on ImageNet
We further demonstrate our method on VGG16 [Simonyan and Zisserman2014], which has 13 convolutional layers and three fully-connected layers. We download the open caffemodel as our pretrained model, whose single-view top-5 accuracy on the ImageNet 2012 validation dataset is . As with AlexNet, the resized images are randomly cropped to the input size and mirrored for data augmentation in training.
Previous works [He et al.2017, Wang et al.2017] found that lower layers were more redundant in VGG16. Therefore, the remaining ratios of the low layers (conv1_x to conv3_x), middle layers (conv4_x) and high layers (conv5_x) are set to , the same as [Wang et al.2017]. The first and last convolutional layers (conv1_1 and conv5_3) are not pruned because both of them account for a very small share of the GFLOPs. We first train the baseline model with batch size and with a fixed learning rate . Pruning is finished after around epochs. Then the network is fine-tuned with batch size to regain accuracy.
Experimental results are shown in Tab.4. Our method is slightly better than CP and SPP, and beats TP and FP by a significant margin.
Method  Increased err. (%)  

TP  
FP (CP’s impl.)  
CP  
SPP  
Ours 
4.4 ResNet50 on ImageNet
Unlike AlexNet and VGG16, ResNet50 is a multi-branch deep neural network, which has convolutional layers and no fully-connected layers. An open pretrained caffemodel is adopted as our baseline, whose single-view top-5 accuracy on the ImageNet 2012 validation dataset is . The images are augmented in the same way as in the VGG16 experiment (Sec.4.3).
For simplicity, we adopt the same pruning ratio for all convolutional layers. The training settings are similar to those of VGG16. The pruning process stops after less than epochs before retraining.
As shown in Tab.5, our method achieves much better results than two recent methods, CP and SPP. This is probably because the proposed method imposes regularization gradually, letting the network adapt little by little in the parameter space, while both the matrix decomposition in CP and the direct pruning in SPP may introduce more modification than a network as compact as ResNet50 can endure.
Method  Increased err. (%) 

CP (enhanced)  
SPP  
Proposed 
5 Conclusion
We propose structured sparsity pruning with varying regularization parameters for CNN acceleration, which assigns different regularization parameters to weight groups according to their importance to the network. Theoretical analysis guarantees the convergence of our method. The effectiveness of the proposed method is demonstrated by comparison with state-of-the-art methods on popular CNN architectures.
Appendix A – Proof of Theorem 1
Proof: For a given , the which is the local minimum of the function should satisfy , which gives:
(8) 
In this situation, we can calculate the derivative of by using Eqn.(8):
(9) 
Since is the local minimum of the function , it should satisfy that
which yields
(10)  
(11) 
Substituting Eqn.(10) into (9), we obtain
(12) 
Applying Eqn.(11) to (12), we can conclude that
(13) 
In other words, when the weight is greater than zero, a small increment of the regularization parameter decreases its value; and when the weight is less than zero, a small increment of the regularization parameter increases its value. In both cases, when the regularization parameter increases, the magnitude of the weight decreases at the new local minimum.
This completes the proof of Theorem 1.
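For reference, the argument can be sketched in the scalar case. The block below is a hedged reconstruction, not the original equations: it assumes the objective has the form of a data loss plus a squared-norm penalty, that the second-order condition at the local minimum holds strictly, and the symbols E, L, w and \lambda are names chosen here for illustration.

```latex
% Scalar sketch under an assumed objective; not the original Eqns.(8)-(13).
\begin{align*}
  & E(w,\lambda) = L(w) + \frac{\lambda}{2}\,w^{2}
    && \text{(assumed form of the objective)} \\
  & L'(w) + \lambda w = 0
    && \text{(stationarity at the local minimum)} \\
  & L''(w)\,\frac{dw}{d\lambda} + w + \lambda\,\frac{dw}{d\lambda} = 0
    && \text{(differentiating with respect to } \lambda\text{)} \\
  & \frac{dw}{d\lambda} = -\frac{w}{L''(w) + \lambda}
    && \text{(solving for the derivative)} \\
  & L''(w) + \lambda > 0
    && \text{(strict second-order condition)} \\
  & w\,\frac{dw}{d\lambda} = -\frac{w^{2}}{L''(w) + \lambda} < 0
    && \text{(for } w \neq 0\text{)}
\end{align*}
```

The last line says that the weight and its derivative with respect to the regularization parameter have opposite signs, so the magnitude of the weight shrinks as the regularization parameter increases, and the shrinkage is smaller where the second derivative is larger.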
References
 [Anwar and Sung2016] S. Anwar and W. Sung. Compact deep convolutional neural networks with coarse pruning. arXiv preprint, arXiv:1610.09639, 2016.

 [Chen et al.2015] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In Proceedings of the International Conference on Machine Learning, ICML, pages 1–10, Lille, France, 2015.
 [Courbariaux and Bengio2016] M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint, arXiv:1602.02830, 2016.
 [Denton et al.2014] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, NIPS, Montréal, Canada, 2014.
 [Han et al.2015a] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint, arXiv:1510.00149, 2015.
 [Han et al.2015b] S. Han, J. Pool, and J. Tran. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, NIPS, pages 1135–1143, Montréal, Canada, 2015.
 [Han et al.2016] S. Han, X. Liu, and H. Mao. EIE: efficient inference engine on compressed deep neural network. ACM Sigarch Computer Architecture News, 44(3):243–254, 2016.
 [Hassibi and Stork1993] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, NIPS, pages 164–171, Denver, CO, 1993.
 [He et al.2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. 2017.
 [Howard et al.2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. 2017.
 [Iandola et al.2016] F. Iandola, M. Moskewicz, and K. Ashraf. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5MB model size. arXiv preprint, arXiv:1602.07360, 2016.
 [Krizhevsky and Hinton2009] A. Krizhevsky and G. E. Hinton. Learning multiple layers of features from tiny images. 2009.
 [Krizhevsky et al.2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, NIPS, pages 1097–1105, Lake Tahoe, CA, 2012.

 [Lebedev and Lempitsky2016] V. Lebedev and V. Lempitsky. Fast convnets using group-wise brain damage. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, CVPR, pages 2554–2564, Las Vegas, NV, 2016.
 [Lebedev et al.2016] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint, arXiv:1412.6553, 2016.
 [LeCun et al.1990] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, NIPS, pages 598–605, Denver, CO, 1990.
 [Li et al.2017] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, ICLR, 2017.
 [Lin et al.2016] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. arXiv preprint, arXiv:1510.03009, 2016.
 [Molchanov et al.2017] P. Molchanov, S. Tyree, and T. Karras. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations, ICLR, 2017.
 [Novikov et al.2014] A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov. Tensorizing neural networks. arXiv preprint, arXiv:1509.06569, 2014.
 [Rastegari et al.2016] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, ECCV, pages 525–542, Amsterdam, Netherlands, 2016.
 [Reed1993] R. Reed. Pruning algorithms – a survey. IEEE Transactions on Neural Networks, 4(5):740–747, 1993.
 [Simonyan and Zisserman2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. Computer Science, 2014.
 [Sze et al.2017] V. Sze, Y. H. Chen, T. J. Yang, and J. Emer. Efficient processing of deep neural networks: A tutorial and survey. arXiv preprint, arXiv:1703.09039, 2017.
 [Wang et al.2017] Huan Wang, Qiming Zhang, Yuehai Wang, and Roland Hu. Structured probabilistic pruning for deep convolutional neural network acceleration. CoRR, abs/1709.06994, 2017.
 [Weigend et al.1991] Andreas S. Weigend, David E. Rumelhart, and Bernardo A. Huberman. Generalization by weight elimination with application to forecasting. In Advances in Neural Information Processing Systems, pages 875–882, 1991.
 [Wen et al.2015] W. Wen, C. Wu, and Y. Wang. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, NIPS, pages 2074–2082, Barcelona, Spain, 2015.
 [Yuan and Lin2006] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, 68(1):49–67, 2006.
 [Zhang et al.2016] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, 2016.
 [Zhang et al.2017] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. 2017.