Towards thinner convolutional neural networks through Gradually Global Pruning

03/29/2017 ∙ by Zhengtao Wang, et al.

Deep network pruning is an effective method to reduce the storage and computation cost of deep neural networks when applying them to resource-limited devices. Among the many pruning granularities, neuron level pruning removes redundant neurons and filters from the model, resulting in thinner networks. In this paper, we propose a gradually global pruning scheme for neuron level pruning. In each pruning step, a small percentage of neurons is selected and dropped across all layers of the model. We also propose a simple method to eliminate the biases in evaluating the importance of neurons, which makes the scheme feasible. Compared with the layer-wise pruning scheme, our scheme avoids the difficulty of determining the redundancy in each layer and is more effective for deep networks. Our scheme automatically finds a thinner sub-network of the original network under a given performance target.







1 Introduction

CNNs have achieved great success in various pattern recognition tasks, especially in large-scale image classification [1, 2, 3]. However, these deep learning models often contain dozens of layers and millions or even billions of parameters. For example, the AlexNet [1] network contains about 60 million parameters, while the VGG network contains about 144 million parameters. The memory and computation costs of such models are so high that it is difficult to deploy them on resource-limited devices such as mobile phones. On the other hand, it turns out that deep learning models are usually over-parameterized [4], which means that the huge number of connections in a deep model can be properly pruned and compressed to reduce the storage and computation cost. This need has driven the development of research on deep learning model compression, also known as Deep Compression.

Researchers have proposed various methods for deep compression, which we roughly group into three categories. (1) Approximation [5, 4, 6, 7, 8]: weight matrices and tensors in the deep model are approximated using tensor decomposition techniques, thereby reducing the storage cost. (2) Quantization [9, 10, 11, 12, 13, 14, 15]: by searching for or constructing a finite set of candidate parameters, one maps parameters from real numbers to a few candidates; these candidates can later be encoded for further compression. The extreme case of quantization is the binary network, in which the parameters have only two possible values. (3) Pruning [16, 17, 18, 19, 20]: methods in this category aim to remove redundant connections, neurons or entire layers of the model. Pruning can be conducted at different granularities, resulting in reductions in model depth, width or number of connections. Approaches that directly train sparse networks are not deemed pruning methods because they do not actually prune any network. Compared with training a network from scratch, deep compression makes the most of existing pre-trained models, which were carefully trained by experts and are efficient at extracting features.

Compared with approximation and quantization, model pruning directly changes the structure of the model: depending on the pruning granularity, the pruned model has sparse connections, fewer layers or fewer neurons. In practical terms, model pruning at the neuron level essentially selects a sub-network of the original network, keeping its regularity unchanged. By comparison, pruning at the connection level usually results in sparsely connected networks, which not only need extra representation effort but also do not fit well with parallel computation [21].

In this paper, we mainly focus on neuron level pruning, whose aim is to reduce the width of the layers in the model. To avoid confusion, we use the term "neuron" to refer to a single neuron in a fully-connected layer or a filter in a convolutional layer. A typical neuron level pruning scheme contains three steps: (1) select neurons to be removed; (2) drop the redundant neurons; (3) fine-tune the model to recover the performance. This process is usually done layer-wise. The main disadvantages of the layer-wise scheme are that it is time-consuming and that, for a given performance target, it is hard to determine how many neurons should be dropped in each layer. We try to solve these problems by pruning the network globally: in each pruning step, all neurons in the network are taken into consideration at the same time.

The main contributions of this paper are: (1) we propose a gradually global pruning scheme for neuron level model pruning, which automatically finds a near-optimal structure for a given performance target; (2) we propose a simple method to make contribution scores comparable across different layers when selecting redundant neurons. In our scheme, the redundant neurons are selected globally to avoid the difficulty of determining the number of redundant neurons in each layer. In each pruning step, only a small percentage of neurons is selected, keeping the model as close to the original local optimum as possible, so that the performance can be recovered quickly through just a few epochs of fine-tuning. As a result, one can find a near-optimal structure for a specific task and obtain a thinner network, which is particularly suitable for resource-limited devices. Note that the size of the pruned model can be further compressed by other deep compression methods.

In Section 2.1, we introduce several neuron contribution evaluation methods and adjust them to be compatible with global neuron selection. The gradually global pruning scheme is given in Section 2.2. Our experimental results with different contribution score evaluation methods, together with contrast experiments on layer-wise pruning, are shown in Section 3. Finally, in Section 4 we give a brief summary of the work and provide some insights for future research.

Figure 1: The score distribution of different metrics across layers. Panels (a)-(c): the σ score, the average response score and the AAWS score; panels (d)-(f): the same scores after the per-layer modification.

2 Gradually global pruning scheme

2.1 Redundant neurons selection

Selecting unimportant or redundant neurons in CNNs is of prime importance in neuron level pruning. Unimportant neurons contribute little to the model performance, so just a few epochs of fine-tuning can compensate for the degradation caused by removing them. Owing to the huge number of neurons in a deep learning model, neuron selection by trial and error is extremely time-consuming.

We denote x^l as the output of the l-th layer for a sample. For convolutional layers, x^l is a 3D tensor of shape W × H × C, where W, H and C are the feature map width, height and number of channels, respectively. We define the mean value of each channel of x^l as the response of the corresponding filter; the response of the l-th layer
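The response computation described above can be sketched as follows (a minimal NumPy sketch; the function name and the channels-last layout are our own illustrative assumptions):

```python
import numpy as np

def layer_response(activations):
    """Response of each neuron: for a conv layer, the mean activation over
    the spatial dimensions; for a fully-connected layer, the raw output.

    `activations` is assumed to have shape (W, H, C) for a convolutional
    layer, or shape (units,) for a fully-connected layer.
    """
    if activations.ndim == 3:           # conv layer: average over W and H
        return activations.mean(axis=(0, 1))
    return activations                   # fully-connected: output is the response

# A toy 4x4 feature map with 2 channels
acts = np.stack([np.ones((4, 4)), 2 * np.ones((4, 4))], axis=-1)
print(layer_response(acts))  # -> [1. 2.]
```

The resulting vector has one entry per filter, matching the description above.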


We use three contribution score metrics in our scheme. The σ score is a metric we generalized from [22]: the contribution score of a neuron is defined as the standard deviation of its responses over the training set (or a subset sampled from it). The idea behind this metric is that a neuron is unimportant if it produces nearly the same output for all training samples. Similarly, the average response intensity can also be viewed as a contribution metric if we assume that neurons with low average response intensity are unimportant. The last metric is generalized from [23]: we use the average absolute weight sum (AAWS) of a neuron as the contribution score. For the i-th filter of a convolutional layer, with parameters W_i ∈ R^{C×K×K}, the AAWS score is defined as:

AAWS_i = (1 / (C·K·K)) · Σ_{c,p,q} |W_i(c, p, q)|

For fully-connected layers, the AAWS score of a neuron is the mean of the absolute weights over all connections starting from that neuron.
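The AAWS score amounts to a mean of absolute weight values, which can be sketched as follows (a minimal NumPy sketch; the function name is our own):

```python
import numpy as np

def aaws_score(weights):
    """Average absolute weight sum (AAWS) of one filter or neuron.

    For a conv filter, `weights` is assumed to be its kernel tensor of shape
    (C, K, K); for a fully-connected neuron, its vector of outgoing weights.
    The score is the mean of the absolute weight values.
    """
    return np.abs(np.asarray(weights)).mean()

# A 3x3 kernel with a single input channel
kernel = np.array([[[1.0, -1.0, 2.0],
                    [0.0,  0.0, 0.0],
                    [2.0, -1.0, 1.0]]])
print(aaws_score(kernel))  # mean |w| = 8/9 ≈ 0.889
```

Because AAWS depends only on the weights, it needs no forward passes over the training set.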

These metrics can be used directly in a layer-wise pruning scheme because scores within a layer are comparable. However, if we want to conduct a global neuron selection across different layers, they fail to evaluate neuron contribution properly. Fig.1(a) and Fig.1(b) show the score distributions across different layers under the σ and average response metrics, respectively. We find that if we take all layers into consideration when selecting redundant neurons globally, neurons in low layers such as conv1_1 would be treated as the most unimportant neurons and dropped first. However, without the low-level information extracted by conv1_1, it is impossible for the network to make inferences. The AAWS scores in the VGG-16 network are shown in Fig.1(c); in this case, the scores in higher layers are more likely to be low.

According to our observation, the biases shown in Fig.1 are introduced by the position of the layers in the model. To make global neuron selection feasible, we make a slight modification to these metrics: we divide the score of each neuron by the average score of its layer. After this adjustment, the scores in different layers are mixed up and comparable to each other, so the biases between layers are eliminated. The modified contribution score of neuron i in the l-th layer is

ŝ_i = s_i / ( (1/N_l) · Σ_{j=1}^{N_l} s_j ),

where N_l is the number of neurons in the l-th layer. Fig.1(d) to Fig.1(f) show the score distributions after this adjustment.
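The per-layer adjustment can be sketched as follows (a minimal NumPy sketch; the function name and the dict layout are our own illustrative assumptions):

```python
import numpy as np

def normalize_scores(layer_scores):
    """Divide each neuron's score by the average score of its layer.

    `layer_scores` is assumed to map layer name -> 1D array of raw
    contribution scores; the result is comparable across layers because
    every layer's normalized scores have mean 1.
    """
    return {name: np.asarray(s) / np.mean(s) for name, s in layer_scores.items()}

# Two layers whose raw scores differ by two orders of magnitude
raw = {"conv1_1": np.array([0.1, 0.3]), "conv5_3": np.array([10.0, 30.0])}
norm = normalize_scores(raw)
print(norm["conv1_1"], norm["conv5_3"])  # -> [0.5 1.5] [0.5 1.5]
```

After normalization the two layers produce identical score distributions, so a global sort no longer favors one layer over another.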

2.2 Gradually global pruning scheme

With the neuron selection method proposed in Sec.2.1, we now introduce our gradually global pruning scheme. Our scheme is "global" because in each pruning step all neurons in the model, rather than those in a single layer, are taken into consideration; the redundant neurons we select are the most unimportant ones in the whole network, not just in a single layer. Global neuron selection brings two benefits. First, we do not need to determine how many neurons should be dropped in each layer, which is difficult because the redundancy may vary across layers. Second, the number of fine-tuning rounds is controlled by the given pruning ratio rather than by the depth of the model, which is particularly useful for pruning deep networks.

On the other hand, our scheme is "gradual" because in each pruning step only a small percentage of neurons is dropped. This keeps the convergence point after pruning as close to the original one as possible, so that we can recover the performance through just a few epochs of fine-tuning. By gradually pruning the network, we approach a near-optimal network under a given performance target.

In practice, the proposed gradually global pruning scheme prunes a trained network through a "select-prune-fine-tune" loop, summarized in Algorithm 1. Dropping neurons and updating the network (line 6) can be realized either by zeroing the corresponding parameters and maintaining a mask that prevents the dropped parameters from being updated during fine-tuning, or by directly extracting the sub-network that contains only the important neurons. In our experiments we implement model updating in the second way, which additionally provides model checkpoints. All experiments were conducted on the open source deep learning framework Keras with TensorFlow as the back-end. We will upload our source code to GitHub after clean-up for reproduction and future work.
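The mask-based alternative mentioned above can be sketched as follows (a minimal NumPy sketch on a single weight matrix; the names and the column-per-neuron layout are our own illustrative assumptions):

```python
import numpy as np

# Weights of one layer; each column plays the role of one neuron here
weights = np.arange(12, dtype=float).reshape(3, 4)
pruned = [1, 3]                        # indices of neurons selected for dropping

# Zero the pruned neurons' parameters and remember a mask
mask = np.ones_like(weights)
mask[:, pruned] = 0.0
weights *= mask

# During fine-tuning, gradients are masked so pruned weights stay at zero
grad = np.ones_like(weights)           # stand-in for a real gradient
weights -= 0.1 * grad * mask           # masked SGD step
print(weights[:, pruned])              # pruned columns remain exactly zero
```

Extracting the sub-network instead (the approach we use) avoids carrying the mask around and yields a genuinely smaller model at each checkpoint.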

Input:  A trained model: M
        Given performance target: P*
        Contribution score evaluator: E
        Pruning ratio generator: R
        Training set: D_train
        Validation set: D_val
Output: A thinner model: M
1:  Compute the performance P of M using D_val
2:  while P ≥ P* do
3:     Compute the contribution scores of all neurons in M with evaluator E
4:     Sort the scores
5:     Select the r·N lowest-scoring neurons to be pruned, where r = R() is the pruning ratio and N is the number of neurons in the current model
6:     Drop the selected neurons in the network to get M', and update M by M'
7:     Fine-tune M with training set D_train
8:     Update P by the performance of M over D_val
9:  end while
10: return M
Algorithm 1 Gradually global pruning scheme.
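The global selection step of Algorithm 1 (lines 3-5) can be sketched as follows (a minimal NumPy sketch; scores are assumed to be already normalized per layer as in Sec.2.1, and the names are our own):

```python
import numpy as np

def select_redundant(scores, ratio):
    """Globally select the lowest-scoring fraction of neurons.

    `scores` maps layer name -> 1D array of per-layer-normalized scores;
    returns a dict mapping layer name -> indices of neurons to drop.
    """
    # Flatten all scores together with their (layer, index) coordinates
    flat = [(s, name, i) for name, arr in scores.items()
            for i, s in enumerate(arr)]
    flat.sort(key=lambda t: t[0])                 # ascending by score
    n_drop = int(ratio * len(flat))               # a small global percentage
    selected = {}
    for _, name, i in flat[:n_drop]:
        selected.setdefault(name, []).append(i)
    return selected

scores = {"conv1": np.array([1.2, 0.3, 1.5]), "fc1": np.array([0.1, 2.0])}
print(select_redundant(scores, 0.4))  # -> {'fc1': [0], 'conv1': [1]}
```

Note that the layers contribute unequal numbers of dropped neurons; this is exactly the per-layer redundancy that a layer-wise scheme would have to guess in advance.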

3 Experiments

We build a VGG-like network for CIFAR-10 image classification and train it from scratch. The converged model reaches an accuracy of 87.32%. The structure of our model follows VGG-16, with an extra BatchNormalization layer after each convolutional layer and after the first fully-connected layer, and with only two fully-connected layers before the softmax layer.

In our experiments the pruning ratio is set to 0.05. The pruning ratio is a trade-off parameter that controls both the residual redundancy of the pruned model and the speed of the algorithm. A higher pruning ratio removes more neurons in each step, reducing the number of steps before the algorithm terminates. Conversely, a small pruning ratio removes only a few neurons per step, so more steps are needed to approach the optimal structure; in the extreme case, one could remove just one neuron per step. The pruning ratio could also be determined in an adaptive way.
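A constant ratio generator like the one used here (0.05), together with a simple adaptive variant, might look as follows (an illustrative sketch; the decaying rule is our own assumption, not part of the paper):

```python
def constant_ratio(r=0.05):
    """Yield the same pruning ratio every step (as in our experiments)."""
    while True:
        yield r

def decaying_ratio(r0=0.10, decay=0.5, r_min=0.01):
    """Illustrative adaptive variant: prune aggressively at first,
    then shrink the ratio as the model approaches the target."""
    r = r0
    while True:
        yield max(r, r_min)
        r *= decay

gen = decaying_ratio()
print([round(next(gen), 4) for _ in range(4)])  # -> [0.1, 0.05, 0.025, 0.0125]
```

Any such generator plugs into Algorithm 1 as R without changing the rest of the scheme.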

In the first experiment, we evaluate the performance of the different neuron contribution metrics. We force the scheme to conduct 7 rounds of pruning without considering the performance degradation. The experiment results are shown in Fig.2(a), and the model structures after 7 rounds of pruning and fine-tuning are shown in Table 1. According to Fig.2(a), the AAWS score has a clear advantage over the other two metrics. In addition, the AAWS score is a data-independent metric, so we do not need to compute statistics over the training set in each pruning step. The model structures shown in Table 1 indicate that the AAWS metric tends to give higher scores to intermediate layers and lower scores to neurons in the lowest and highest layers.

Figure 2: The performance of different metrics and schemes

We compare our global neuron selection strategy with the layer-wise neuron selection strategy under the AAWS score metric. In our second experiment, the neurons are selected at each layer proportionally, while all other experiment settings are kept unchanged. The results are shown in Fig.2(b): our global selection scheme under the AAWS score outperforms the layer-wise scheme in every pruning step. The resulting network structure is shown in the last column of Table 1.

layer    org.   σ      avg.   AAWS   Prop.
conv1_1  64     35     3      33     45
conv1_2  64     52     14     34     45
conv2_1  128    85     70     83     89
conv2_2  128    72     70     128    89
conv3_1  256    93     168    254    179
conv3_2  256    173    194    256    179
conv3_3  256    169    218    256    179
conv4_1  512    257    314    486    357
conv4_2  512    405    395    500    357
conv4_3  512    490    382    448    357
conv5_1  512    468    452    321    357
conv5_2  512    436    434    276    357
conv5_3  512    398    397    229    357
fc1      512    177    199    6      357
total    4736   3310   3310   3310   3304
acc.     87.32% 84.35% 81.88% 86.76% 86.54%
Table 1: Model structures (neurons remaining per layer) after 7 rounds of pruning under the σ, average response and AAWS metrics, and under the layer-wise proportional (Prop.) scheme.

In the last experiment, we compare our scheme with a layer-wise fine-tuning scheme under the same average pruning ratio. To be precise, in each pruning step we drop about 30.11% of the neurons of one layer and fine-tune the model immediately. The accuracy of the model is 86.48% after 14 rounds of fine-tuning. Note that in this setting the number of pruning steps equals the depth of the model, and the optimal pruning ratio for each layer cannot be known in advance. We argue that a corresponding "gradually layer-wise pruning" experiment need not be conducted because its number of fine-tuning rounds would be extremely large: if the number of layers in the model is L and the number of pruning rounds is T, "gradually layer-wise pruning" requires L·T rounds of fine-tuning, while our scheme requires just T rounds.

4 Conclusion

In this paper, we proposed a gradually global pruning scheme for model compression. By selecting neurons globally in each pruning step, we are able to gradually search for a near-optimal structure. As a result, we obtain thinner networks, which are especially suitable for deep learning applications on resource-limited devices. The pruned model inherits the regularity of the original model and is thus compatible with other deep compression algorithms. The modification of the neuron contribution scores is proposed to make the global pruning scheme applicable and is not necessarily optimal; future work on more suitable contribution score evaluation methods would be very valuable.