Structured Deep Neural Network Pruning by Varying Regularization Parameters

04/25/2018 ∙ by Huan Wang, et al. ∙ Zhejiang University

Convolutional Neural Networks (CNNs) are restricted by their massive computation and high storage requirements. Parameter pruning is a promising approach for CNN compression and acceleration, which aims at eliminating redundant model parameters with tolerable performance loss. Despite their effectiveness, existing regularization-based parameter pruning methods usually assign a fixed regularization parameter to all weights, which neglects the fact that different weights may have different importance to the CNN. To solve this problem, we propose a theoretically sound regularization-based pruning method that incrementally assigns different regularization parameters to different weights based on their importance to the network. On AlexNet and VGG-16, our method achieves 4x theoretical speedup with accuracies similar to the baselines. For ResNet-50, the proposed method also achieves 2x acceleration and only suffers 0.1




1 Introduction

In recent years, deep convolutional neural networks (CNNs) have achieved remarkable success in computer vision tasks such as classification, detection and segmentation by leveraging large-scale networks that learn from large amounts of data. However, CNNs incur massive computation and storage costs, which hinders their deployment on mobile and embedded devices. To reduce the computation cost, many research works have been carried out to compress CNNs, including designing compact network architectures, parameter quantization, matrix decomposition and parameter pruning.

Parameter pruning is a promising approach for CNN compression and acceleration, which aims at eliminating redundant model parameters with tolerable performance loss. One problem of parameter pruning is that it often produces unstructured, random connections that are hard to exploit for speedup on general hardware platforms [Han et al. 2016]. Even with sparse matrix kernels, the speedup is very limited [Wen et al. 2015]. To solve this problem, many works focus on structured pruning, which shrinks a network into a thinner one so that the pruned network can be implemented efficiently [Anwar and Sung 2016, Sze et al. 2017].

There are two categories for structured pruning. The first is importance-based methods, which prune weights in groups based on some established importance criteria. Another category is regularization-based methods, which add group regularization terms to the objective function and prune the weights by optimizing the objective function. In this paper, we propose a new regularization-based method to learn group sparsity which also combines the idea of importance-based methods.

Existing group regularization approaches tend to use the same regularization parameter for all weight groups in the network [Wen et al. 2015, Lebedev and Lempitsky 2016]. One hidden assumption of assigning regularization parameters equally is that all weights in different groups are equally important. However, intuitively, weights with greater magnitude tend to be more important during pruning than those with smaller magnitude. Thus, we need a scheme to distinguish the importance of weights and treat them with different regularization strengths. Compared with previous works, the main novelty of our proposed method is to assign different regularization parameters to different weight groups based on their estimated importance.

One advantage of the proposed method is that it is theoretically sound. In this paper, we prove that the proposed scheme of varying regularization parameters zeroes out unimportant weights and finally makes the network converge to a pruned one. Considering that many parameter pruning methods are heuristic, the proposed method stands out for its concrete theoretical basis.

Our method accelerates AlexNet by 4× theoretical speedup while improving its performance, and VGG-16 by 4× theoretical speedup with a small accuracy loss. The proposed method can also be applied to compact multi-branch networks such as ResNet-50, on which we achieve 2× speedup with only a slight accuracy loss.

2 Related Work

Parameter pruning has a long history in the development of neural networks [Reed 1993], and existing methods can be categorized into importance-based and regularization-based methods. Importance-based methods prune weights in groups based on some established importance criteria. For example, Optimal Brain Damage (OBD) [LeCun et al. 1990] and Optimal Brain Surgeon (OBS) [Hassibi and Stork 1993] proposed importance criteria based on the second-order derivatives of the loss function, derived from Taylor expansions. Deep Compression [Han et al. 2015a, Han et al. 2015b] pruned small-magnitude weights and obtained parameter reduction on AlexNet and VGG-16. Taylor Pruning [Molchanov et al. 2017] also derived a new importance criterion based on Taylor expansion, but used the first-order derivatives, which are easy to access in back-propagation. Their method was shown to be effective for pruning filters of CNNs on transfer learning tasks. [Li et al. 2017] used the $\ell_1$ norm to guide one-shot filter pruning, which proved effective on CIFAR-10 and ImageNet with VGG-16 and ResNet, but the reported speedup is very limited. Channel Pruning [He et al. 2017] alternately uses LASSO-regression-based channel selection and feature map reconstruction to prune filters, and achieves the state-of-the-art result on VGG-16. Among regularization-based methods, [Weigend et al. 1991] added a penalty term to eliminate redundant weights for better generalization. Group-wise Brain Damage [Lebedev and Lempitsky 2016] and Structured Sparsity Learning [Wen et al. 2015] embedded Group LASSO [Yuan and Lin 2006] into CNN regularization and obtained regular-shape sparsity.

Apart from parameter pruning, there are other methods for CNN model compression and acceleration, including compact architecture design, parameter quantization and matrix decomposition. Compact architecture design methods aim at more efficient and compact neural network architectures. For example, SqueezeNet [Iandola et al. 2016] stacks compact blocks and uses about 50× fewer parameters than the original AlexNet. MobileNet [Howard et al. 2017] and ShuffleNet [Zhang et al. 2017] leveraged new convolutional implementations to design networks for mobile applications.

Parameter quantization methods reduce CNN storage by quantizing the weights and using fewer representation bits. [Chen et al. 2015] proposed a hash function to group the weights of each CNN layer into different hash buckets for parameter sharing. As the extreme form of quantization, binarized networks [Courbariaux and Bengio 2016, Lin et al. 2016, Rastegari et al. 2016] proposed to learn binary weights or activations. Quantization reduces floating-point computational complexity, but the actual speedup may depend strongly on the specific hardware implementation.

Matrix decomposition decomposes large matrices into several small matrices to reduce computation. [Denton et al. 2014] showed that the weight matrix of a fully-connected layer can be compressed via truncated SVD. Tensor decomposition was proposed and obtained better compression results than SVD [Novikov et al. 2014]. Several methods based on low-rank decomposition of convolutional kernel tensors were also proposed to accelerate convolutional layers [Lebedev et al. 2016, Zhang et al. 2016].

3 The Proposed Method

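Suppose the weights of the convolutional layers in a CNN form a sequence of 4-D tensors $\mathbf{W}^{(l)} \in \mathbb{R}^{N_l \times C_l \times H_l \times W_l}$. Here $N_l$, $C_l$, $H_l$ and $W_l$ are the dimensions of the $l$th ($1 \le l \le L$) weight tensor along the axes of filter, channel, spatial height and spatial width, respectively. Our proposed objective function for regularization can be formulated as:

$$E(\mathbf{W}) = \mathcal{L}(\mathbf{W}) + \frac{\lambda}{2} R(\mathbf{W}) + \sum_{l=1}^{L} \sum_{g=1}^{G^{(l)}} \frac{\lambda_g}{2} R(\mathbf{W}_g^{(l)}). \quad (1)$$

Here $\mathbf{W}$ represents the collection of all weights in the CNN; $\mathcal{L}(\mathbf{W})$ is the loss function caused by data; $R(\mathbf{W})$ is the non-structured regularization on every weight, which is the $\ell_2$ norm in this paper; and $R(\mathbf{W}_g^{(l)})$ is the structured sparsity regularization on the $g$th of the $G^{(l)}$ weight groups in layer $l$. In [Wen et al. 2015, Lebedev and Lempitsky 2016], the authors used the same $\lambda_g$ for all groups and adopted Group LASSO for $R(\mathbf{W}_g)$ because it can effectively zero out weights in groups, i.e., $R(\mathbf{W}_g) = \|\mathbf{W}_g\|_2$. In this paper, we use the squared $\ell_2$ norm for $R(\mathbf{W}_g)$, i.e., $R(\mathbf{W}_g) = \|\mathbf{W}_g\|_2^2$, and vary the regularization parameters $\lambda_g$ for different groups.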
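As a rough illustration of how such an objective is assembled, the following sketch evaluates a data loss plus weight decay plus a per-group squared $\ell_2$ penalty on plain Python lists. The function name `objective`, the factor 1/2 on both penalties and the toy values are assumptions made for the example, not details confirmed by the text.

```python
def objective(data_loss, groups, lam, lam_g):
    """E = L + (lam/2) * sum(w^2) + sum_g (lam_g[g]/2) * ||w_g||_2^2.

    groups: list of weight groups, each a list of floats.
    lam:    shared weight-decay parameter (assumed 1/2 convention).
    lam_g:  one regularization parameter per group.
    """
    # Non-structured penalty applied to every weight.
    weight_decay = 0.5 * lam * sum(w * w for g in groups for w in g)
    # Structured penalty: each group has its own strength.
    group_penalty = sum(0.5 * lg * sum(w * w for w in g)
                        for g, lg in zip(groups, lam_g))
    return data_loss + weight_decay + group_penalty


# Toy example: two groups, only the first one is penalized as a group.
toy_groups = [[0.1, -0.2], [1.0, 2.0]]
toy_value = objective(0.5, toy_groups, lam=0.01, lam_g=[0.3, 0.0])
print(toy_value)
```

Setting every entry of `lam_g` to the same value would correspond to the fixed-parameter setting that the paragraph above contrasts with.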

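The finally learned "structure" is decided by how the weights are split into groups. Normally, there are filter-wise, channel-wise and shape-wise sparsity, with different sizes of weight groups [Wen et al. 2015]. Indexing the weight tensor $\mathbf{W}^{(l)}$ of the $l$th layer by (filter, channel, height, width), the sparsity term of a single group is represented as

$$R(\mathbf{W}_g^{(l)}) = \begin{cases} \|\mathbf{W}^{(l)}_{n,:,:,:}\|_2^2 & \text{(filter-wise)} \\ \|\mathbf{W}^{(l)}_{:,c,:,:}\|_2^2 & \text{(channel-wise)} \\ \|\mathbf{W}^{(l)}_{:,c,h,w}\|_2^2 & \text{(shape-wise)} \end{cases}$$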

In the im2col implementation, weight tensors are expanded into matrices in which shapes are represented as columns and filters are represented as rows. Thus, shape-wise and filter-wise sparsity are equivalent to column sparsity and row sparsity, respectively, as shown in Fig.1.

Figure 1: The im2col implementation of CNN is to expand tensors into matrices, so that convolutional operations are transformed to matrix multiplication. The weights at the blue squares are to be pruned. (a) Pruning a row of weight matrix is equivalent to pruning a filter in convolutional kernel. (b) Pruning a column of weight matrix is equivalent to pruning all the weights at the same position in different filters. (c) Pruning a channel is equivalent to pruning several adjacent columns in the expanded weight matrix.
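The correspondence described in Fig.1 can be sketched with nested Python lists. The helper names (`to_matrix`, `prune_column`, `prune_row`, `channel_columns`) and the channel-major column ordering are assumptions chosen for illustration; an actual framework may order the columns differently.

```python
def to_matrix(weights):
    """Flatten a 4-D tensor weights[n][c][h][w] into an N x (C*H*W) matrix:
    one row per filter, one column per (channel, height, width) position."""
    return [[w for channel in filt for row in channel for w in row]
            for filt in weights]


def prune_column(matrix, col):
    """Zero one column, i.e. the same (c, h, w) position in every filter."""
    for row in matrix:
        row[col] = 0.0


def prune_row(matrix, r):
    """Zero one row, i.e. an entire filter."""
    matrix[r] = [0.0] * len(matrix[r])


def channel_columns(c, height, width):
    """Indices of the height*width adjacent columns that make up channel c."""
    return list(range(c * height * width, (c + 1) * height * width))


# Toy tensor: 2 filters, 2 channels, 2x2 kernels, all weights 1.0.
N, C, H, W = 2, 2, 2, 2
weights = [[[[1.0] * W for _ in range(H)] for _ in range(C)] for _ in range(N)]
m = to_matrix(weights)
prune_column(m, 5)                 # position (c=1, h=0, w=1) in both filters
cols_of_channel_1 = channel_columns(1, H, W)
```

With this layout, pruning a whole channel amounts to calling `prune_column` on every index returned by `channel_columns`.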

3.1 Theoretical Analysis

For the proposed method, the regularization parameter $\lambda_g$ is assigned differently to different weight groups. We find that if we slightly increase $\lambda_g$ of a weight group and then train the network through back-propagation until the objective function, i.e., Eqn.(1), reaches a local minimum, the norm of that group also decreases. This phenomenon is formally summarized by the following theorem.

Theorem 1

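Considering the objective function

$$E(w, \lambda) = \mathcal{L}(w) + \frac{\lambda}{2} w^2, \quad (2)$$

if there exists a tuple $(w_0, \lambda_0)$ which satisfies the following three properties:

  1. $w_0 \neq 0$;

  2. $\mathcal{L}(w)$ has the second derivative at $w_0$;

  3. $w_0$ is the local minimum of function $E(w, \lambda_0)$,

then there exists an $\epsilon > 0$ such that for any $\lambda_1 \in (\lambda_0, \lambda_0 + \epsilon)$, we can find a $w_1$ which satisfies:

  1. $w_1$ is the local minimum of function $E(w, \lambda_1)$;

  2. $|w_1| < |w_0|$.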

This theorem, which is proved in Appendix A, indicates that we can slightly increase the regularization parameter to compress the magnitude of weights toward zero. By Eqn.(12), the magnitude of a weight is compressed more when the regularization parameter increases more.
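The behaviour stated by the theorem can be checked numerically on a toy case. The sketch below assumes a quadratic data loss $\frac{a}{2}(w-b)^2$ and a penalty $\frac{\lambda}{2}w^2$, for which the minimizer has a closed form; the loss, the constants and the name `minimizer` are illustrative assumptions and do not come from the paper.

```python
def minimizer(a, b, lam):
    """Argmin over w of E(w, lam) = 0.5*a*(w - b)**2 + 0.5*lam*w**2 (a > 0).
    Setting dE/dw = a*(w - b) + lam*w = 0 gives w = a*b / (a + lam)."""
    return a * b / (a + lam)


a, b = 2.0, -1.5                       # toy curvature and unregularized optimum
lams = [0.0, 0.1, 0.5, 1.0, 5.0]
ws = [minimizer(a, b, lam) for lam in lams]
print(ws)                              # magnitudes shrink as lam grows

# Finite-difference check of the sensitivity dw/dlam = -w / (L''(w) + lam),
# where L''(w) = a for this quadratic loss.
lam, h = 0.5, 1e-6
numeric = (minimizer(a, b, lam + h) - minimizer(a, b, lam - h)) / (2 * h)
analytic = -minimizer(a, b, lam) / (a + lam)
```

In this toy case the weight is negative, so increasing the regularization parameter raises its value while its magnitude still decreases.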

3.2 Method Description

Theorem 1 guarantees that we can modify the norm of weight groups by increasing or decreasing their corresponding regularization parameters. Thus, we can assign different regularization parameters to weight groups based on their importance to the network. In this paper, we use the norms of the weight groups as the importance criterion. Note that our method can easily be applied to other criteria, such as a different norm or Taylor expansions.

Normalization of the importance criteria is necessary because the values of norms vary hugely across different networks, layers and weight groups. In contrast to previous works, our normalization is based on the ranks of the weight groups within the same layer. Rank-based normalization has two advantages: (1) compared with other normalization methods such as max/min normalization, the range of ranks is fixed and depends only on $N$, the total number of weight groups in the layer; (2) for the pruning task, we need to set a pruning ratio $R$ for each layer, meaning that the fraction $R$ of weight groups that are ranked lowest should have been pruned when pruning is finished. Normalization by ranks makes the pruning process controllable, since it directly targets the goal of pruning.

Specifically, we sort weight groups by their norms in ascending order. Meanwhile, to mitigate the oscillation of ranks across training iterations, we average the rank of each group over training iterations. For a weight group, its average rank over $T$ iterations is defined as

$$\bar{r} = \frac{1}{T} \sum_{t=1}^{T} r_t. \quad (3)$$

Here $r_t$ is the rank at the $t$th iteration. The final average rank $r$ is obtained by sorting $\bar{r}$ of the different weight groups in ascending order, so that it again covers the full range of ranks in the layer.
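The rank averaging described above can be sketched as follows. This is only an illustration: 0-based ranks, the helper names `ranks` and `averaged_ranks`, and the toy norm history are assumptions.

```python
def ranks(values):
    """0-based rank of each entry when sorted in ascending order."""
    order = sorted(range(len(values)), key=lambda g: values[g])
    result = [0] * len(values)
    for rank, g in enumerate(order):
        result[g] = rank
    return result


def averaged_ranks(norm_history):
    """Average each group's rank over the recorded iterations, then re-rank
    the averages so the final ranks are again distinct integers."""
    per_iteration = [ranks(norms) for norms in norm_history]
    n_groups = len(norm_history[0])
    mean = [sum(r[g] for r in per_iteration) / len(per_iteration)
            for g in range(n_groups)]
    return ranks(mean)


# Norms of four weight groups recorded at three consecutive iterations.
history = [
    [0.2, 0.9, 0.5, 1.4],
    [0.6, 0.8, 0.5, 1.5],   # groups 0 and 2 briefly swap order here
    [0.3, 1.0, 0.4, 1.3],
]
final = averaged_ranks(history)
print(final)   # [0, 2, 1, 3]
```

Group 0 temporarily overtakes group 2 in the second iteration, but the averaged rank still places it lowest, which is the smoothing effect the averaging is meant to provide.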

Our aim is to assign an increment $\Delta\lambda_g$ to each weight group, so that its regularization parameter $\lambda_g$ is gradually updated through the pruning process:

$$\lambda_g^{\text{new}} = \lambda_g + \Delta\lambda_g. \quad (4)$$
Following the above idea, the increment $\Delta\lambda_g$ of each group is assigned according to its average rank $r$ with a piecewise linear function, as defined in Eqn.(5).

Figure 2: The function $\Delta\lambda_g(r)$, as defined in Eqn.(5).

Fig.2 depicts $\Delta\lambda_g(r)$. It is seen from Fig.2 that for weight groups whose norms are small, i.e., whose average ranks are less than $RN$ (the number of weight groups to be pruned in the layer), we need to increase their regularization parameters to further decrease their norms; for those with greater norms and ranks above $RN$, we need to decrease their regularization parameters to further increase their norms. After obtaining $\Delta\lambda_g$ and $\lambda_g^{\text{new}}$ by Eqn.(5) and (4), we threshold the result at zero to prevent negative regularization parameters:

$$\lambda_g^{\text{new}} = \max(\lambda_g^{\text{new}}, 0). \quad (6)$$
After updating $\lambda_g$, the weights of the CNN are trained through back-propagation, with gradients deduced from Eqn.(1).

After training the weights for several iterations, we recalculate $\Delta\lambda_g$, and the training process continues until convergence. Since we decrease the norms of weight groups whose ranks are less than $RN$ (the pruning ratio $R$ times the number of weight groups $N$ in the layer) and increase the norms of weight groups whose ranks are greater than $RN$, there should be $RN$ pruned weight groups at the convergence point. In Eqn.(5), $A$ is a hyper-parameter that controls the speed of convergence: a greater value of $A$ results in faster convergence. Finally, we summarize the proposed algorithm in Algorithm 1.
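To make the update rule concrete, here is a minimal sketch of one piecewise linear increment that matches the qualitative description above: positive for ranks below the pruning boundary, zero at the boundary and negative above it. The exact breakpoints and slopes (from `+A` at rank 0 to `-A` at the highest rank), as well as the names `delta_lambda` and `update_lambda`, are assumptions for illustration and may differ from the paper's Eqn.(5).

```python
def delta_lambda(rank, n_groups, ratio, A):
    """Assumed piecewise linear increment for a group with 0-based rank.

    ratio * n_groups is the boundary rank: groups below it should be pruned.
    """
    boundary = ratio * n_groups
    if rank < boundary:
        # Linearly from +A (rank 0) down to 0 (boundary).
        return A * (1.0 - rank / boundary)
    span = (n_groups - 1) - boundary
    if span <= 0:
        return 0.0
    # Linearly from 0 (boundary) down to -A (highest rank).
    return -A * (rank - boundary) / span


def update_lambda(lam_g, rank, n_groups, ratio, A):
    """Add the increment, then threshold at zero so the parameter stays non-negative."""
    return max(lam_g + delta_lambda(rank, n_groups, ratio, A), 0.0)


# Toy setting: 10 groups, 40% to be pruned, A = 0.1.
increments = [delta_lambda(r, 10, 0.4, 0.1) for r in range(10)]
print(increments)
```

Under these assumptions the lowest-ranked group receives the full increment `A` at every update, while groups ranked above the boundary have their parameters pushed back toward zero.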

1: Input the training set, the original pre-trained model, the non-structural regularization parameter $\lambda$, the pruning threshold and the target pruning ratio $R$.
2: Initialize $\lambda_g$ for each weight group.
3: Initialize the iteration number.
4: repeat
5:     for each weight group in each layer do
6:         Obtain the average rank of the group by Eqn.(3).
7:         Update $\lambda_g$ by Eqn.(5) and (6).
8:     end for
9:     Update the weights in each weight group through back-propagation by Eqn.(7).
10:     for each weight group in each layer do
11:         if the norm of this group is less than the pruning threshold then
12:              prune this weight group permanently.
13:         end if
14:     end for
15:     Increment the iteration number.
16: until the ratio of pruned weight groups reaches $R$.
17: Retrain the pruned CNN for several epochs.
18: Output the pruned CNN model.
Algorithm 1 The Proposed Algorithm
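The overall loop of Algorithm 1 can be imitated on a toy problem. The sketch below is not the authors' implementation: each "group" is a single scalar weight, the data loss is an assumed quadratic pulling each weight toward a target, the instantaneous rank replaces the averaged rank, the increment function is an assumed piecewise linear form, and the final retraining step is omitted. All names and constants are hypothetical.

```python
def delta_lambda(rank, n, ratio, A):
    """Assumed increment: +A at rank 0, 0 at rank ratio*n, -A at rank n-1."""
    boundary = ratio * n
    if rank < boundary:
        return A * (1.0 - rank / boundary)
    return -A * (rank - boundary) / max(n - 1 - boundary, 1e-12)


def prune(targets, lam=0.1, A=0.05, ratio=0.5, threshold=0.05,
          lr=0.05, inner_steps=5, max_iters=5000):
    n = len(targets)
    w = list(targets)          # "pre-trained" weights: minimizer of the data loss
    lam_g = [0.0] * n          # per-group regularization parameters
    pruned = [False] * n
    for _ in range(max_iters):
        if sum(pruned) >= ratio * n:
            break              # target pruning ratio reached
        # Rank groups by magnitude, ascending; pruned groups have magnitude 0.
        order = sorted(range(n), key=lambda g: abs(w[g]))
        for rank, g in enumerate(order):
            lam_g[g] = max(lam_g[g] + delta_lambda(rank, n, ratio, A), 0.0)
        # A few gradient steps on 0.5*(w-t)^2 + 0.5*(lam + lam_g)*w^2.
        for _ in range(inner_steps):
            for g in range(n):
                if not pruned[g]:
                    grad = (w[g] - targets[g]) + (lam + lam_g[g]) * w[g]
                    w[g] -= lr * grad
        # Permanently prune groups whose magnitude fell below the threshold.
        for g in range(n):
            if not pruned[g] and abs(w[g]) < threshold:
                w[g], pruned[g] = 0.0, True
    return w, pruned, lam_g


w, pruned, lam_g = prune([0.1, 0.5, 1.0, 1.5, 2.0, 3.0])
print(pruned)   # the three smallest weights end up pruned
```

In this toy run the three smallest weights are driven below the threshold one after another, while the three largest keep a zero group parameter and settle at the minimizer of the loss plus plain weight decay.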

3.3 Discussions

The difference between Algorithm 1 and Theorem 1 is that Theorem 1 requires the weights to reach a local minimum of the objective function (Eqn.(1)) before each update of the regularization parameters, whereas in Algorithm 1 we only train the network through back-propagation for several iterations. This compromise is mainly motivated by time complexity: obtaining a local minimum by gradually lowering the learning rate is time-consuming. In addition, because the objective functions of CNNs are very complex, it is also difficult to decide whether a local minimum has been reached. Experiments indicate that a limited number of iterations is enough for the method to converge.

A notable property of the proposed method is that it automatically adjusts its search steps without any knowledge of the objective function itself. By Eqn.(12), the increment of a weight $\Delta w$ is inversely proportional to the second derivative of the objective function $E$ produced by the CNN. This makes the modification of $w$ slower when the objective function $E$ reaches steeper areas, even without knowing the exact form of $E$, which makes a refined search for the local minimum possible. We believe that the good performance of the proposed method is partly due to this automatic adjustment of search steps.

4 Experiments

We first compare fixed and varying regularization with ConvNet on the CIFAR-10 dataset [Krizhevsky and Hinton 2009]. Then we evaluate the proposed method with large networks on large-scale datasets. All of our experiments are conducted with Caffe. We set the weight decay factor $\lambda$ to be the same as in the baselines, and set the hyper-parameter $A$ to half of $\lambda$. Since we focus on acceleration in this paper, we only compress weights in convolutional layers and keep fully-connected layers unchanged. Methods for comparison include Taylor Pruning (TP) [Molchanov et al. 2017], Filter Pruning (FP) [Li et al. 2017], Structured Sparsity Learning (SSL) [Wen et al. 2015], Channel Pruning (CP) [He et al. 2017] and Structured Probabilistic Pruning (SPP) [Wang et al. 2017]. For all experiments, speedup is calculated by GFLOPs reduction.
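As a hedged illustration of how a theoretical speedup can be derived from a reduction in operations, the sketch below counts multiply-accumulates of convolutional layers computed as im2col matrix products and compares the totals before and after column pruning. The layer shapes, the helper names `conv_macs` and `speedup`, and the simplification that pruning a layer does not change the cost of its neighbours are all assumptions for the example.

```python
def conv_macs(n_filters, channels, kh, kw, out_h, out_w, kept_columns=1.0):
    """Multiply-accumulates of one conv layer viewed as a matrix product of an
    (n_filters x columns) weight matrix with a (columns x out_h*out_w) input.
    kept_columns is the fraction of weight-matrix columns left after pruning."""
    columns = channels * kh * kw
    return n_filters * columns * kept_columns * out_h * out_w


def speedup(layers, kept):
    """Ratio of total operations before pruning to total operations after."""
    before = sum(conv_macs(*layer) for layer in layers)
    after = sum(conv_macs(*layer, kept_columns=k)
                for layer, k in zip(layers, kept))
    return before / after


# Hypothetical two-layer network: (filters, channels, kh, kw, out_h, out_w).
toy_layers = [(16, 3, 3, 3, 32, 32), (32, 16, 3, 3, 16, 16)]
print(speedup(toy_layers, kept=[0.25, 0.25]))   # 4.0
print(speedup(toy_layers, kept=[0.5, 0.25]))    # between 2 and 4
```

Because the second toy layer dominates the operation count, the overall figure is governed mostly by its remaining ratio, which is one reason per-layer pruning ratios matter for the reported speedup.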

4.1 Analysis with ConvNet on CIFAR-10

We first compare our proposed method with Group LASSO [Yuan and Lin 2006], where the regularization parameter is fixed for all weights. Group LASSO is widely applied to generate group sparsity [Lebedev and Lempitsky 2016, Wen et al. 2015]. The test network is ConvNet, a small CNN with three convolutional layers and one fully-connected layer. CIFAR-10 is a 10-class dataset of tiny images, which are split into sets for training, validation and testing.

We first trained a baseline model with test accuracy . Then the proposed method is applied to learn structured sparsity, where both row sparsity and column sparsity are explored. For comparison, we employ the Group LASSO with fixed regularization parameters ( for row sparsity and for column sparsity).

Experimental results are shown in Tab.1.

Method Row pruning Column pruning
speedup accuracy speedup accuracy
Table 1: Comparison of varying and fixed regularization with ConvNet on CIFAR-10. The baseline test accuracy is .

We can see that varying regularization consistently achieves higher speedups and accuracies than fixed regularization. Fig.3 illustrates how the norms of the columns in the conv1 layer change with training iterations. While fixed regularization gradually suppresses unimportant weights, many important weights are also unnecessarily suppressed. With varying regularization, the magnitude of some weights increases dramatically when training starts, while that of others decreases at the same time. Thus, the pruning process converges much faster. In addition, with varying regularization, the final weights are driven into two distinct groups: important weights with very high norms and unimportant weights with close-to-zero norms.

We also find that, under similar speedup ratios, column pruning yields better accuracy than row pruning. This is because a row typically consists of many more weights than a column, so pruning rows may cause more severe side effects on accuracy. In the following experiments, we choose only columns as our sparsity groups to obtain better accuracies.

Figure 3: Comparing the training process of varying regularization with that of fixed regularization. Each line illustrates the norm of a column in the conv1 layer ( columns in total).

We further compare our proposed method with more recent pruning methods, and the results are shown in Tab.2. Under different speedup ratios, our method consistently outperforms other pruning methods.

Method Increased err. (%)
TP (our impl.)
FP (our impl.)
Ours -
Table 2: The increased error of different pruning methods when accelerating ConvNet on CIFAR-10. The baseline test accuracy is . Minus means the test accuracy is improved.

4.2 AlexNet on ImageNet

We apply the proposed method to AlexNet [Krizhevsky et al.2012], which is composed of convolutional layers and fully-connected layers. We download an open caffemodel from Caffe model zoo as our pre-trained model. The baseline single view top-5 accuracy on ImageNet 2012 validation dataset is . All images are rescaled to size , then a patch is randomly cropped from each scaled image and randomly mirrored for data augmentation. For testing, the patches are cropped from the center of the scaled images.

Intuitively, different layers have different sensitivity to pruning, but there are few theories to quantify the redundancy of different layers in deep neural networks. Most pruning methods empirically set pruning ratio for different layers [Li et al.2017, Molchanov et al.2017, He et al.2017, Wang et al.2017]. Following these works, we empirically set the proportion of remaining columns of the five convolutional layers as . We train the baseline model with batch size for about epochs until reaching the target pruning ratio. We use a small learning rate  and fix it during training. Then the pruned model is retrained with batch size to regain accuracy.

Experimental results are shown in Tab.3.

Method Increased err. (%)
FP (SPP’s impl.)
Proposed - -
Table 3: Accelerating AlexNet on ImageNet. The baseline top-5 accuracy of the original network is .

The proposed method and SPP are consistently better than the other three methods, and the proposed method is slightly better than SPP on average. With speedup, our method can even improve the accuracy while SPP degrades the accuracy in this situation.

4.3 VGG-16 on ImageNet

We further demonstrate our method on VGG-16 [Simonyan and Zisserman2014], which has convolutional layers and fully-connected layers. We download the open caffemodel as our pre-trained model, whose single-view top-5 accuracy on ImageNet 2012 validation dataset is . Similar to AlexNet, the resized images are randomly cropped to input size  and mirrored for data augmentation in training.

Previous works [He et al.2017, Wang et al.2017] found that lower layers were more redundant in VGG-16. Therefore, the proportion of remaining ratios of low layers (conv1_x to conv3_x), middle layers (conv4_x) and high layers (conv5_x) are set to , the same as [Wang et al.2017]. The first and last convolutional layer (conv1_1 and conv5_3) are not pruned because both of them require very small amount of GFLOPs calculation. We first train the baseline model with batch size and with a fixed learning rate . Pruning is finished after around  epochs. Then the network is fine-tuned with batch size to regain accuracy.

Experimental results are shown in Tab.4. Our method is slightly better than CP and SPP, and beats TP and FP by a significant margin.

Method Increased err. (%)
FP (CP’s impl.)
Table 4: Acceleration of VGG-16 on ImageNet. The values are increased single-view top-5 error on ImageNet. The baseline top-5 accuracy of the original network is .

4.4 ResNet-50 on ImageNet

Unlike AlexNet and VGG-16, ResNet-50 is a multi-branch deep neural network, which has convolutional layers and no fully-connected layers. Open pre-trained caffemodel is adopted as our baseline, whose single view top-5 accuracy on ImageNet 2012 validation dataset is . The images are augmented the same way as in the experiment of VGG-16 (Sec.4.3).

For simplicity, we adopt the same pruning ratio  for all convolutional layers. The training settings are similar to that of VGG-16. The pruning process stops after less than epochs before retraining.

From Tab.5, our method achieves much better results than two recent methods, CP and SPP. This is probably because the proposed method imposes regularization gradually, letting the network adapt little by little in the parameter space, while both the matrix decomposition in CP and the direct pruning in SPP may introduce more modification than a network as compact as ResNet-50 can endure.

Method Increased err. (%)
CP (enhanced)
Table 5: Acceleration of ResNet-50 on ImageNet. The baseline top-5 accuracy of the original network is .

5 Conclusion

We propose structured sparsity pruning with varying regularization parameters for CNN acceleration, which assigns different regularization parameters to weight groups according to their importance to the network. Theoretical analysis guarantees the convergence of our method. The effectiveness of the proposed method is demonstrated by comparison with state-of-the-art methods on popular CNN architectures.

Appendix A – Proof of Theorem 1

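Proof: For a given $\lambda$, the $w$ which is the local minimum of the function $E(w, \lambda) = \mathcal{L}(w) + \frac{\lambda}{2} w^2$ should satisfy $\partial E / \partial w = 0$, which gives:

$$\mathcal{L}'(w) + \lambda w = 0. \quad (8)$$

In this situation, we can calculate the derivative of $w$ with respect to $\lambda$ by differentiating both sides of Eqn.(8):

$$\mathcal{L}''(w) \frac{dw}{d\lambda} + w + \lambda \frac{dw}{d\lambda} = 0, \quad \text{i.e.,} \quad \frac{dw}{d\lambda} = -\frac{w}{\mathcal{L}''(w) + \lambda}. \quad (9)$$

Since $w$ is the local minimum of the function $E(w, \lambda)$, it should satisfy that

$$\frac{\partial^2 E}{\partial w^2} > 0, \quad (10)$$

which yields

$$\mathcal{L}''(w) + \lambda > 0. \quad (11)$$

Since the denominator in Eqn.(9) is exactly $\partial^2 E / \partial w^2$, we can obtain

$$\frac{dw}{d\lambda} = -\frac{w}{\partial^2 E / \partial w^2}. \quad (12)$$

By taking Eqn.(11) into (12), we can conclude that

$$w \cdot \frac{dw}{d\lambda} < 0 \quad \text{for } w \neq 0. \quad (13)$$

In other words, when $w$ is greater than zero, a small increment of $\lambda$ will decrease the value of $w$; and when $w$ is less than zero, a small increment of $\lambda$ will increase the value of $w$. In both cases, when $\lambda$ increases, $|w|$ will decrease at the new local minimum of $E(w, \lambda)$.

This completes the proof of Theorem 1.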