Structured Pruning for Efficient ConvNets via Incremental Regularization

11/20/2018 ∙ by Huan Wang, et al. ∙ Zhejiang University

Parameter pruning is a promising approach for CNN compression and acceleration that eliminates redundant model parameters with tolerable performance loss. Despite their effectiveness, existing regularization-based parameter pruning methods usually drive weights towards zero with large and constant regularization factors, neglecting the fact that the expressiveness of CNNs is fragile and calls for a gentler regularization scheme so that the networks can adapt during pruning. To solve this problem, we propose a new regularization-based pruning method (named IncReg) that incrementally assigns different regularization factors to different weight groups based on their relative importance; its effectiveness is demonstrated on popular CNNs in comparison with state-of-the-art methods.


1 Introduction

Recently, deep Convolutional Neural Networks (CNNs) have achieved remarkable success in computer vision tasks by leveraging large-scale networks trained on large amounts of data. However, CNNs usually entail massive computation and storage consumption, hindering their deployment on mobile and embedded devices. To solve this problem, many research works focus on compressing the scale of CNNs. Parameter pruning is a promising approach for CNN compression and acceleration, which aims at eliminating redundant model parameters with tolerable performance loss. To avoid hardware-unfriendly irregular sparsity, structured pruning has been proposed for CNN acceleration AnwSun16 ; SzeCheYanEme17 . In the im2col implementation ChePurSim06 ; CheWooVan14 of convolution, weight tensors are expanded into matrices, so there are generally two kinds of structured sparsity, i.e., row sparsity (or filter-wise sparsity) and column sparsity (or shape-wise sparsity) WenWuWan16 ; Wang2018Structured , as illustrated by the sketch below.
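To make the two sparsity patterns concrete, here is a minimal NumPy sketch (illustrative only, not the authors' code) of a 4-D conv kernel viewed as an im2col matrix: row sparsity removes whole filters, while column sparsity removes the same (channel, height, width) position from every filter. The shapes and masks are arbitrary examples.

```python
# Illustrative sketch of row (filter-wise) vs. column (shape-wise) sparsity
# in the im2col view of a conv layer.
import numpy as np

N, C, H, W = 4, 3, 3, 3                       # filters, channels, kernel height/width
kernel = np.random.randn(N, C, H, W)

# im2col-style view: each filter becomes one row of an (N, C*H*W) matrix.
mat = kernel.reshape(N, -1)                   # shape (4, 27)

# Row (filter-wise) sparsity: remove whole filters, i.e., whole rows.
row_mask = np.array([1, 1, 0, 1], dtype=bool) # drop filter 2
row_pruned = mat[row_mask, :]                 # shape (3, 27)

# Column (shape-wise) sparsity: remove the same (c, h, w) position from
# every filter, i.e., whole columns of the matrix.
col_l1 = np.abs(mat).sum(axis=0)              # importance per column
col_mask = col_l1 > np.percentile(col_l1, 30) # keep roughly the top 70% columns
col_pruned = mat[:, col_mask]                 # roughly shape (4, 19)

print(row_pruned.shape, col_pruned.shape)
```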

There are mainly two categories of structured pruning. One is importance-based methods, which prune weights in groups based on some established importance criterion LiKadDurEtAl17 ; MolTyrKar17 ; Wang2018Structured . The other is regularization-based methods, which add group regularization terms to learn structured sparsity WenWuWan16 ; VadLem16 ; He2017Channel . Existing group regularization approaches mainly focus on the regularization form (e.g., Group LASSO Yuan2006Model ) used to learn structured sparsity, while ignoring the influence of the regularization factor. In particular, they tend to use a large and constant regularization factor for all weight groups in the network WenWuWan16 ; VadLem16 , which has two problems. First, this 'one-size-fits-all' regularization scheme carries a hidden assumption that all weights in different groups are equally important, which does not hold true, since weights of larger magnitude tend to be more important than those of smaller magnitude. Second, few works have noticed that the expressiveness of CNNs is so fragile yosinski2014transferable during pruning that it cannot withstand a large penalty term from the beginning, especially for large pruning ratios and compact networks (like ResNet HeZhaRenSun16 ). AFP DinDinHanTan18 was proposed to solve the first problem, but ignored the second one. In this paper, we propose a new regularization-based method named IncReg to incrementally learn structured sparsity.

Apart from pruning, there are also other kinds of methods for CNN acceleration, such as low-rank decomposition LebYarRakOseLem16 ; ZhaZouHeSun16 , quantization CheWilTyrWeiChe15 ; CouBen16 ; LinCouMemBen16 ; RasOrdRedFar16 , knowledge distillation Hinton2015Distilling and architecture re-design IanMosAsh16 ; Howard2017MobileNets ; Zhang2017ShuffleNet ; zhong2018shift . Our method is orthogonal to these methods.

2 The Proposed Method

Given a conv kernel, modeled by a 4-D tensor $\mathbf{W}^{(l)} \in \mathbb{R}^{N_l \times C_l \times H_l \times W_l}$, where $N_l$, $C_l$, $H_l$ and $W_l$ are the dimensions of the $l$-th ($1 \le l \le L$) weight tensor along the axes of filter, channel, height and width, respectively, our proposed objective function for regularization can be formulated as

$$E(\mathbf{W}) = L(\mathbf{W}) + \lambda R(\mathbf{W}) + \sum_{l=1}^{L} \sum_{g=1}^{G^{(l)}} \lambda_g^{(l)}\, R\big(\mathbf{W}_g^{(l)}\big),$$

where $\mathbf{W}$ denotes the collection of all weights in the CNN; $L(\mathbf{W})$ is the loss function for prediction; $R(\mathbf{W})$ is non-structured regularization on every weight, i.e., weight decay in this paper; $R(\mathbf{W}_g^{(l)})$ is the structured sparsity regularization term on group $g$ of layer $l$; and $G^{(l)}$ is the number of weight groups in layer $l$. In VadLem16 ; WenWuWan16 , the authors used the same $\lambda_g$ for all groups and adopted Group LASSO Yuan2006Model for $R(\mathbf{W}_g^{(l)})$. In this work, since we argue that the key problem of group regularization lies in the regularization factor rather than the regularization form, we use the most common regularization form, weight decay, for $R(\mathbf{W}_g^{(l)})$, but we vary the regularization factors $\lambda_g^{(l)}$ for different weight groups and at different iterations. In particular, we propose a theorem (Theorem 1, with its proof in the Appendix) showing that we can gradually compress the magnitude of a parameter by increasing its regularization factor.
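As a rough illustration of the objective above, the following NumPy sketch assembles the three terms for a single conv layer. The function name, the `filter_wise` switch and the 1/2 scaling are our own assumptions for clarity, not the paper's implementation.

```python
# A hedged sketch of the regularized objective: task loss + ordinary weight
# decay + per-group weight decay with group-specific factors lambda_g.
import numpy as np

def group_regularized_objective(task_loss, kernel, weight_decay, group_lambdas,
                                filter_wise=True):
    """kernel: (N, C, H, W) conv weights.
    group_lambdas: one factor per weight group (length N for filter-wise
    groups, C*H*W for shape-wise groups)."""
    mat = kernel.reshape(kernel.shape[0], -1)   # im2col view: (N, C*H*W)
    groups = mat if filter_wise else mat.T      # rows = weight groups
    # Non-structured term: plain weight decay on every weight.
    r_all = 0.5 * weight_decay * np.sum(kernel ** 2)
    # Structured term: per-group weight decay scaled by its own lambda_g.
    r_groups = 0.5 * np.sum(group_lambdas * np.sum(groups ** 2, axis=1))
    return task_loss + r_all + r_groups
```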

Our method prunes all the conv layers simultaneously and independently. For simplicity, we omit the layer notation $(l)$ in the following description. All the $\lambda_g$'s are initialized to zero. At each iteration, $\lambda_g$ is increased by $\Delta\lambda_g$. Like AFP DinDinHanTan18 , we agree that unimportant weights should be punished more, so we propose a decreasing piece-wise linear punishment function (Eqn.(1), Fig.1) to determine $\Delta\lambda_g$. Note that the regularization increment is negative (i.e., actually a reward) when the ranking $r$ is above the threshold ranking $RG$, since being above the threshold means these weights are expected to stay in the end. Regularization on these important weights is not only unnecessary but also, as our experiments confirm, very harmful.

$$\Delta\lambda_g = \begin{cases} A\left(1 - \dfrac{r}{RG}\right), & \text{if } r \le RG,\\[6pt] -\dfrac{A\,(r - RG)}{G - RG}, & \text{otherwise,} \end{cases} \qquad (1)$$

where $R$ is the pre-assigned pruning ratio for a layer, $G$ is the number of weight groups in that layer, $A$ is a hyper-parameter of our method describing the maximum penalty increment (set to half of the original weight decay by default), and $r$ is the ranking obtained by sorting the weight groups in ascending order of a proposed importance criterion, which is essentially an averaged ranking over time, defined as $r = \frac{1}{N}\sum_{n=1}^{N} r_n$, where $r_n$ is the ranking by $L_1$-norm at the $n$-th iteration and $N$ is the number of passed iterations. This averaging acts as smoothing and leads to a more stable pruning process.
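Under the reconstruction of Eqn.(1) above, the punishment function can be sketched as follows. The helper name `delta_lambda` and the input guard are ours, and the exact slopes are only as reliable as the reconstruction itself.

```python
# Sketch of the decreasing piece-wise linear punishment function of Eqn.(1):
# +A for the least important group, zero at the threshold ranking RG, and a
# negative increment (a mild "reward") for groups ranked above the threshold.
def delta_lambda(r, G, R, A):
    """r: averaged ranking of a weight group (0 = least important),
    G: number of groups in the layer, R: pre-assigned pruning ratio,
    A: maximum penalty increment (e.g., half the original weight decay)."""
    assert 0 < R < 1, "pruning ratio must be strictly between 0 and 1"
    threshold = R * G
    if r <= threshold:
        # Groups expected to be pruned: positive increment, largest for the
        # least important group, shrinking to zero at the threshold.
        return A * (1.0 - r / threshold)
    # Groups expected to survive: negative increment.
    return -A * (r - threshold) / (G - threshold)
```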

As training proceeds, the regularization factors of different weight groups increase gradually, pushing the weights towards zero little by little. When the magnitude of a weight group falls below a small predefined threshold, the group is permanently removed from the network, thus leading to increased structured sparsity. When the sparsity of a layer reaches its pre-assigned pruning ratio $R$, that layer automatically stops structured regularization. Finally, when all conv layers reach their pre-assigned pruning ratios, pruning is over, followed by a retraining process to regain accuracy.
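Putting the pieces together, the following simplified sketch shows what one such update for a single layer might look like under the description above (it reuses the `delta_lambda` helper from the previous sketch). It deliberately omits the SGD step in which the per-group factors actually shrink the weights, and all names and thresholds (`eps`, `rank_history`) are illustrative assumptions rather than the authors' code.

```python
# One illustrative pruning update for a single layer (simplified).
import numpy as np

def prune_layer_step(mat, lambdas, rank_history, R, A, eps=1e-4):
    """mat: (G, K) array, one weight group per row; lambdas: (G,) per-group
    factors; rank_history: list of past rankings; R: target pruning ratio;
    A: maximum penalty increment."""
    G = mat.shape[0]
    # 1. Rank groups by L1-norm (ascending) and average the ranking over time.
    order = np.argsort(np.abs(mat).sum(axis=1))
    ranks = np.empty(G)
    ranks[order] = np.arange(G)              # rank 0 = least important group
    rank_history.append(ranks)
    avg_rank = np.mean(rank_history, axis=0)
    # 2. Increment each group's regularization factor via the punishment
    #    function sketched above; keep factors non-negative.
    lambdas = np.maximum(
        lambdas + np.array([delta_lambda(r, G, R, A) for r in avg_rank]), 0.0)
    # 3. Permanently zero out groups whose magnitude fell below a small threshold.
    dead = np.abs(mat).sum(axis=1) < eps
    mat[dead] = 0.0
    # 4. Report whether this layer has reached its target pruning ratio.
    done = dead.mean() >= R
    return mat, lambdas, done
```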

3 Experiments

3.1 Analysis with ConvNet on CIFAR-10

We first compare our proposed IncReg with two other group regularization methods, SSL WenWuWan16 and AFP DinDinHanTan18 , using ConvNet KriSutHin12 on CIFAR-10 KriHin09 , where both row sparsity and column sparsity are explored. Caffe JiaSheDonEtAl14 is used for all of our experiments. Experimental results are shown in Tab.1. We can see that IncReg consistently achieves higher speedups and accuracies than the other two constant regularization schemes. Notably, even though AFP achieves performance similar to our method under relatively small speedups, our method outperforms AFP by a large margin when the speedup ratio is large. We argue that this is because the incremental regularization gives the network more time to adapt during pruning, which is especially important in the face of large pruning ratios.

Figure 1: Illustration of our proposed punishment function $\Delta\lambda_g$ (Eqn.(1)).

Table 1: Comparison of our method with SSL WenWuWan16 and AFP DinDinHanTan18 with ConvNet on CIFAR-10, reporting speedup and accuracy under row sparsity and column sparsity for SSL, AFP (our impl.), and Ours.

3.2 VGG-16 and ResNet-50 on ImageNet

We further evaluate our method with VGG-16 Simonyan2014Very and ResNet-50 HeZhaRenSun16 on ImageNet DenDonSocEtAl09 . We download the open-sourced caffemodels as our pre-trained models; their single-view top-5 accuracies on the ImageNet validation set serve as the baselines. For VGG-16, following SPP Wang2018Structured , the remaining ratios of low layers (conv1_x to conv3_x), middle layers (conv4_x) and high layers (conv5_x) are set in the same proportion as in SPP for easy comparison. For ResNet-50, a constant pruning ratio is adopted for all conv layers. Pruning is conducted with a fixed learning rate, followed by retraining. Experimental results are shown in Tab.2. On VGG-16, our method is slightly better than CP and SPP, and outperforms FP by a significant margin. Notably, since we use the same pruning ratios as SPP, the only explanation for the performance improvement is a better pruning process itself, guided by our incremental regularization scheme. On ResNet-50, our method is significantly better than CP and SPP, demonstrating the effectiveness of IncReg when pruning compact networks. Moreover, to confirm the actual speedup, we also evaluate our method with VGG-16 on CPU and GPU; the results are shown in Tab.3.

Table 2: Acceleration of VGG-16 (left) and ResNet-50 (right) on ImageNet. The values are the increase in single-view top-5 error (%). Methods compared for VGG-16: TP MolTyrKar17 , FP LiKadDurEtAl17 (CP's impl.), CP He2017Channel , SPP Wang2018Structured , AMC he2018amc , and Ours; for ResNet-50: CP (enhanced) He2017Channel , SPP Wang2018Structured , and Ours.
Table 3: Inference time of the conv layers of CP He2017Channel and our method on VGG-16 (baselines: 1815 ms on CPU, 5.159 ms on GPU). Evaluation is carried out with batch size 10 and averaged over 50 runs on RGB images. CPU: Intel Xeon(R) E5-2620 v4 @ 2.10GHz, single thread; GPU: GeForce GTX 1080Ti, without cuDNN. Open-sourced models of CP are used for this evaluation.

4 Conclusion

We propose a new structured pruning method based on incremental regularization, which helps CNNs transfer their expressiveness to the remaining parts during pruning by increasing the regularization factors of unimportant weight groups little by little. Our method is shown to be competitive with state-of-the-art methods on popular CNNs, especially in the face of large pruning ratios and compact networks.

Acknowledgment. This work is supported by the Fundamental Research Funds for the Central Universities under Grant 2017FZA5007, Natural Science Foundation of Zhejiang Province under Grant LY16F010004 and Zhejiang Public Welfare Research Program under Grant 2016C31062.


Appendix: The Proposed Theorem and Proof

Theorem 1

Consider the objective function

$$E(w, \lambda) = L(w) + \frac{1}{2}\,\lambda\, w^2, \qquad (2)$$

where $w$ is a single parameter, $L$ is the prediction loss and $\lambda$ is the regularization factor of $w$. If there exists a tuple $(w_0, \lambda_0)$ which satisfies the following three properties:

  1. $w_0 \neq 0$;

  2. $L$ has a second derivative at $w_0$;

  3. $w_0$ is a local minimum of the function $E(w, \lambda_0)$,

then there exists an $\epsilon > 0$ such that, for any $\Delta\lambda \in (0, \epsilon)$, we can find a $\Delta w$ which satisfies:

  1. $w_0 + \Delta w$ is a local minimum of the function $E(w, \lambda_0 + \Delta\lambda)$;

  2. $|w_0 + \Delta w| < |w_0|$.

Proof of Theorem 1: For a given $\lambda$, a $w$ that is a local minimum of the function $E(w, \lambda)$ should satisfy $\partial E / \partial w = 0$, which gives:

$$L'(w) + \lambda w = 0. \qquad (3)$$

In this situation, regarding $w$ as an implicit function of $\lambda$ and differentiating Eqn.(3) with respect to $\lambda$, we can calculate the derivative of $w$:

$$\frac{dw}{d\lambda} = -\frac{w}{L''(w) + \lambda}. \qquad (4)$$

Since $w_0$ is a local minimum of the function $E(w, \lambda_0)$, it should satisfy the first-order and second-order optimality conditions, which yield

$$L'(w_0) + \lambda_0 w_0 = 0, \qquad (5)$$

$$L''(w_0) + \lambda_0 > 0. \qquad (6)$$

If we take Eqn.(5) into (4), i.e., evaluate Eqn.(4) at the point $(w_0, \lambda_0)$, we obtain

$$\left.\frac{dw}{d\lambda}\right|_{(w_0, \lambda_0)} = -\frac{w_0}{L''(w_0) + \lambda_0}. \qquad (7)$$

By taking Eqn.(6) into (7), we can conclude that

$$w_0 \cdot \left.\frac{dw}{d\lambda}\right|_{(w_0, \lambda_0)} < 0. \qquad (8)$$

In other words, when $w_0$ is greater than zero, a small increment of $\lambda$ will decrease the value of $w$; and when $w_0$ is less than zero, a small increment of $\lambda$ will increase the value of $w$. In both cases, when $\lambda$ increases, $|w|$ will decrease at the new local minimum of $E(w, \lambda)$.

Thus, we finish the proof of Theorem 1. ∎
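As a quick numerical sanity check of Theorem 1 (not part of the paper), the following NumPy snippet minimizes $E(w, \lambda) = L(w) + \frac{1}{2}\lambda w^2$ for a toy loss $L(w) = (w - 2)^2$ over increasing values of $\lambda$; the magnitude of the minimizer shrinks monotonically as $\lambda$ grows, in line with Eqn.(8). The toy loss and grid search are our own choices for illustration.

```python
# Numerical illustration of Theorem 1 with a toy loss L(w) = (w - 2)^2.
import numpy as np

L = lambda w: (w - 2.0) ** 2                 # toy prediction loss
w_grid = np.linspace(-3.0, 3.0, 600001)      # fine grid for the 1-D search

for lam in [0.0, 0.5, 1.0, 2.0, 4.0]:
    E = L(w_grid) + 0.5 * lam * w_grid ** 2  # objective of Eqn.(2)
    w_star = w_grid[np.argmin(E)]            # approximate minimizer
    print(f"lambda = {lam:.1f} -> |w*| = {abs(w_star):.4f}")

# Closed form for this L: w* = 4 / (2 + lambda), so |w*| decreases
# monotonically as lambda grows: 2.0000, 1.6000, 1.3333, 1.0000, 0.6667.
```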