
Structured Probabilistic Pruning for Convolutional Neural Network Acceleration

Although deep Convolutional Neural Networks (CNNs) have shown better performance in various computer vision tasks, their application is restricted by a significant increase in storage and computation. Among CNN simplification techniques, parameter pruning is a promising approach which aims at reducing the number of weights of various layers without severely harming the original accuracy. In this paper, we propose a novel progressive parameter pruning method, named Structured Probabilistic Pruning (SPP), which effectively prunes weights of convolutional layers in a probabilistic manner. Specifically, unlike existing deterministic pruning approaches, where unimportant weights are permanently eliminated, SPP introduces a pruning probability for each weight, and pruning is guided by sampling from the pruning probabilities. A mechanism is designed to increase and decrease pruning probabilities based on importance criteria during the training process. Experiments show that, with 4x speedup, SPP can accelerate AlexNet with only 0.3% loss of top-5 accuracy and VGG-16 with 0.8% loss of top-5 accuracy. Moreover, SPP can be directly applied to accelerate multi-branch CNN networks, such as ResNet, without specific adaptations. Our 2x speedup ResNet-50 only suffers 0.8% loss of top-5 accuracy on ImageNet. We further prove the effectiveness of our method on a transfer learning task on the Flower-102 dataset with AlexNet.

1 Introduction

Convolutional Neural Networks (CNNs) have obtained better performance in classification, detection and segmentation tasks than traditional computer-vision methods. However, CNNs incur massive computation and storage consumption, hindering their deployment on mobile and embedded devices. Previous research indicates that CNN acceleration falls into four categories: designing compact network architectures, parameter quantization, matrix decomposition and parameter pruning. Our work belongs to the last category.

Figure 1: The main idea of probabilistic pruning. We assign different pruning probabilities to different neurons based on some importance criterion. In each training iteration, we apply Monte Carlo sampling to decide whether a neuron is pruned or not. As shown in (b), the second neuron has a pruning probability less than 1, so it may be pruned or retained according to that probability, while the third neuron has a pruning probability of 1, so it is pruned permanently from the network. The pruning probability changes according to the importance criterion during training, producing a competing mechanism among all neurons in the network. Finally, when the target pruning ratio is reached, e.g. for the second layer in (c), the pruning process terminates.

Pruning is a promising way for CNN acceleration which aims at eliminating model parameters based on a performance loss function. However, unstructured pruning leads to irregular sparsity, which is hard to exploit for speedup on general hardware platforms [7]. Even with sparse matrix kernels, the speedup is very limited [28]. To solve this problem, many works focus on structured pruning (Fig.2),

Figure 2: A common implementation of CNN is to expand tensors into matrices, so that convolution is transformed into matrix multiplication. Here $l$ is the layer number and the weights at the blue squares are to be pruned. (a) Pruning a row of the weight matrix is equivalent to pruning a filter of the convolutional kernel of layer $l$. (b) Pruning a column of the weight matrix is equivalent to pruning all the weights at the same position in different filters. (c) Pruning an input channel of layer $l$ is equivalent to pruning the corresponding filter of layer $l-1$, and also equivalent to pruning several adjacent columns in the weight matrix of layer $l$.

which can shrink a network into a thinner one so that the implementation of the pruned network is efficient [1, 26]. For example, [19] proposed a one-shot pruning method that prunes the less important filters based on their $\ell_1$ norms. [23] proposed a progressive pruning method that prunes filters based on a novel importance criterion derived from a Taylor expansion. Recently, [11] achieved the state-of-the-art pruning results on VGG-16, via an alternative filter pruning method using LASSO-regression-based channel selection and least-squares reconstruction.

However, existing pruning approaches mainly have three problems:

  1. Training-based methods prune unimportant weights based on some importance criterion and never recover them in the following training process. Given that the importance criterion is either simple, such as the commonly used $\ell_1$ and $\ell_2$ norms [9, 19], or derived under very strong assumptions, such as the i.i.d. parameter assumption in [23], it is probable that some pruned weights would have become important later had they been kept throughout training. It is necessary to design recovery mechanisms for pruned weights to correct misjudgments made during early training stages.

  2. Reconstruction-based methods prune and reconstruct the network layer by layer, so the time complexity of pruning grows linearly with the number of layers. This is time-consuming considering that current CNNs such as ResNet use very deep architectures. In this sense, it is better to prune parameters simultaneously from all layers than to prune layer by layer.

  3. Many structured pruning methods target whole filters. Since filters are large, coarse-grained units, the accuracy is likely to drop dramatically after filter-level pruning.

To solve the above problems, we propose Structured Probabilistic Pruning (SPP) for CNN acceleration. Firstly, SPP prunes weights in a probabilistic manner, as shown in Fig.1. Specifically, we assign a pruning probability $p$ to each weight. When a weight falls below the importance threshold and would otherwise be pruned, we only increase its pruning probability rather than eliminating it outright. Only when $p$ reaches $1$ is the weight permanently eliminated from the network. We also design a mechanism to decrease the pruning probability if a weight becomes more important during training, thus correcting previous misjudgments. Secondly, SPP prunes the whole network at the same time instead of layer-wise, so the time complexity stays controllable as networks become deeper. Thirdly, the basic pruning units of SPP are columns of model parameters. Compared with filter-level pruning, this structured unit is smaller and the results are more robust against inaccurate importance criteria.

With 4x speedup, SPP can accelerate AlexNet with only 0.3% loss of top-5 accuracy and VGG-16 with 0.8% loss of top-5 accuracy in ImageNet classification. Moreover, SPP can be directly applied to accelerate multi-branch CNN networks, such as ResNet, without specific adaptations. Our 2x speedup ResNet-50 only suffers 0.8% loss of top-5 accuracy on ImageNet. We further prove the effectiveness of our method on a transfer learning task on the Flower-102 dataset with AlexNet.

2 Related Work

Intensive research has been carried out on CNN acceleration, which is normally categorized into four groups: designing compact network architectures, parameter quantization, matrix decomposition and parameter pruning.

Compact architecture design methods use small and compact architectures to replace big and redundant ones. For example, VGG [25] and GoogLeNet [27] used 3x3 kernels to replace larger convolutional kernels of size 5x5 and 7x7. ResNet [10] used 1x1 kernels to build compact bottleneck blocks, saving computation. SqueezeNet [12] was proposed to stack compact blocks, which decreased the number of parameters to a small fraction of the original AlexNet.

Parameter quantization reduces CNN storage by vector quantization in the parameter space. [8] and [29] used vector quantization over parameters to reduce redundancy. [2] proposed a hash function to group the weights of each CNN layer into hash buckets for parameter sharing. As the extreme form of quantization, binarized networks were proposed to learn binary values of weights or activations during CNN training and testing [4, 20, 24]. Quantization reduces floating-point computational complexity, but the actual speedup depends heavily on the hardware implementation.

Matrix decomposition factorizes weights into smaller components to reduce computation. [6] showed that the weight matrix of a fully-connected layer can be compressed via truncated SVD. Tensor decomposition was then proposed and obtained better compression results than SVD [22]. Several methods based on low-rank decomposition of the convolutional kernel tensor were also proposed to accelerate convolutional layers [6, 13, 17].

Parameter pruning was pioneered in the early development of neural networks. Optimal Brain Damage [18] leveraged a second-order Taylor expansion to select parameters for deletion, using pruning as regularization to improve training and generalization. Deep Compression [8] removed close-to-zero connections and quantized the remaining weights for further compression. Although these pruning methods achieve remarkable reductions in storage, the induced irregular sparsity is hard to exploit for acceleration. Structured pruning was proposed to overcome this problem; it prunes structured units of parameters (e.g. rows and columns of the weight matrix), so that it can accelerate CNN computation without special implementations or hardware modification [1, 26]. Structured Sparsity Learning [28] uses group LASSO regularization to prune weight rows or columns, accelerating AlexNet at the cost of some top-1 accuracy. Taylor Pruning [23] uses a Taylor-expansion-based importance criterion to prune filters, which was reported effective on transfer learning tasks with AlexNet, but is less impressive on large datasets like ImageNet. Filter Pruning [19] is a one-shot pruning method using the $\ell_1$ norm to prune filters, shown effective on CIFAR-10 and ImageNet with VGG-16 and ResNet, although the reported speedup is very limited. Recently, Channel Pruning [11] alternates LASSO-regression-based channel selection with feature-map reconstruction to prune filters, and achieves the state-of-the-art result on VGG-16. The authors also verified their method on ResNet and Xception [3].

3 The Proposed Method

Suppose that we have a dataset $\mathcal{D}$ consisting of $N$ inputs $x_i$ and their corresponding labels $y_i$:

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}.$$

The parameters of a CNN with $L$ convolutional layers are represented by

$$W = \{W^{(1)}, W^{(2)}, \dots, W^{(L)}\},$$

which are learned to minimize the discrepancy, i.e. the loss function $\mathcal{L}$, between the network outputs and the labels. The common loss function for classification tasks is the negative log-likelihood of the Softmax output $z$, defined as

$$\mathcal{L}(W) = -\frac{1}{N}\sum_{i=1}^{N} \log z_i^{(y_i)}, \quad (1)$$

where $z_i^{(y_i)}$ denotes the $y_i$th element of the Softmax output for the $i$th input.

The aim of parameter pruning is to find a simpler network $W^{*}$ with fewer convolutional parameters based on the original network $W$, such that the increase of the loss is minimized. This minimization problem is defined by Eqn.(2):

$$W^{*} = \arg\min_{W'} \; \mathcal{L}(W') - \mathcal{L}(W). \quad (2)$$
Normally in a CNN, the input tensor $X^{(l)}$ of a convolutional layer is first convolved with the weight tensor $W^{(l)}$; then a non-linear activation function $f$, usually the Rectified Linear Unit (ReLU), is applied. The output is then passed as input to the next layer. For pruning, a mask $M \in \{0, 1\}$ is introduced for every weight, which determines whether that weight is used in the network. Thus, the output of the $l$th layer is described as

$$Y^{(l)} = f\big((M^{(l)} \odot W^{(l)}) * X^{(l)}\big), \quad (3)$$

where $\odot$ denotes element-wise multiplication and $*$ denotes the convolution operation. Note that masked weights are not updated during back-propagation either. Since SPP prunes columns of weights, we assign the same $M$ to all weights in the same column, so the weights of each column are pruned or retained simultaneously in each iteration. We choose columns to prune because they are the smallest structured granularity in a CNN, which gives us more freedom in selecting pruning components.
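As a concrete illustration of the column-structured mask of Eqn.(3), a NumPy sketch is given below (the function and variable names are ours, not from the paper): pruning a column amounts to zeroing one column of the expanded weight matrix, i.e. the weights at one (channel, row, col) kernel position across all filters.

```python
import numpy as np

def apply_column_mask(weights, column_mask):
    """Zero out pruned columns of a convolutional layer's weights.

    weights: (num_filters, channels, kh, kw) tensor.
    column_mask: binary vector of length channels * kh * kw;
                 0 marks a column as pruned for this iteration.
    """
    n_filters = weights.shape[0]
    flat = weights.reshape(n_filters, -1)   # rows = filters, columns = (c, i, j) positions
    return (flat * column_mask).reshape(weights.shape)

# A tiny example: 2 filters, 1 input channel, 2x2 kernels; prune the first column.
w = np.ones((2, 1, 2, 2))
mask = np.array([0.0, 1.0, 1.0, 1.0])
pruned = apply_column_mask(w, mask)
```

Because the mask multiplies the weights before the convolution, the same masked tensor is naturally used in both the forward pass and the gradient computation.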

In traditional pruning methods, once weights are pruned they are never reused in the network, which we call deterministic pruning. On the contrary, SPP prunes weights in a probabilistic manner by assigning a pruning probability $p$ to each weight. For example, $p = 0.5$ means that there is a $50\%$ likelihood that the mask $M$ of the corresponding weight is set to zero. During the training process, we increase or decrease all $p$'s based on the importance criteria of the weights. Only when $p$ is increased to $1$ is the corresponding weight permanently pruned from the network. Obviously, deterministic pruning can be regarded as a special case of probabilistic pruning in which $p$ is increased from $0$ to $1$ in a single iteration.

For SPP training, all $p$'s are updated by the algorithm described in Sec.3.1; then the mask $M$ is generated by Monte Carlo sampling according to $p$:

$$M = \begin{cases} 0, & \text{with probability } p,\\ 1, & \text{with probability } 1 - p. \end{cases} \quad (4)$$

After $M$ is obtained for each weight, pruning is applied via Eqn.(3).
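The sampling step can be sketched as follows (our NumPy rendering, not the paper's code): each column's mask is drawn independently from its current pruning probability.

```python
import numpy as np

def sample_masks(prune_prob, rng):
    """Monte Carlo sampling of Eqn.(4): M = 0 with probability p, else 1."""
    return (rng.random(prune_prob.shape) >= prune_prob).astype(float)

rng = np.random.default_rng(0)
p = np.array([0.0, 0.5, 1.0])       # p = 0: always kept; p = 1: always pruned
m = sample_masks(p, rng)

# Over many draws, a column with p = 0.5 is kept about half the time.
frac_kept = sample_masks(np.full(100000, 0.5), rng).mean()
```

Note that a column with $p < 1$ may be pruned in one iteration and restored in the next; only $p = 1$ makes the pruning permanent.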

3.1 How to update $p$

Assume that a convolutional layer consists of $N$ columns. Our aim is to prune $R \cdot N$ columns, where $R \in (0, 1)$ is the pruning ratio, indicating the fraction of columns to be pruned at the final stage of SPP.

SPP updates $p$ through a competition mechanism according to some weight importance criterion. In this paper, we choose the importance criterion to be the $\ell_1$ norm of each column: the bigger the $\ell_1$ norm, the more important that column is. Experiments have shown that the $\ell_1$ and $\ell_2$ norms have similar performance as pruning criteria [9, 19]. There are also other importance criteria, such as Taylor expansions, to guide pruning [18, 23]. In this paper, we choose the $\ell_1$ norm for simplicity; our method generalizes easily to other criteria.
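For a convolutional layer stored as a 4-D tensor, the column $\ell_1$ norms and their ascending ranks can be computed as follows (a sketch; the names are ours):

```python
import numpy as np

def column_l1_ranks(weights):
    """Return the L1 norm of each column of the expanded weight matrix
    and each column's ascending rank (rank 0 = smallest norm = least important).
    weights: (num_filters, channels, kh, kw)."""
    flat = np.abs(weights.reshape(weights.shape[0], -1))
    l1 = flat.sum(axis=0)
    ranks = np.empty(l1.size, dtype=int)
    ranks[np.argsort(l1)] = np.arange(l1.size)  # invert argsort to get ranks
    return l1, ranks

# Columns with larger weights get higher ranks and are pruned later.
w = np.zeros((2, 1, 1, 3))
w[:, 0, 0, 0] = 5.0   # column 0: largest L1 norm
w[:, 0, 0, 1] = 1.0   # column 1: smallest
w[:, 0, 0, 2] = 2.0
l1, ranks = column_l1_ranks(w)
```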

The increment $\Delta p$ of $p$ is a function $f(r)$ of the rank $r$, where the rank is obtained by sorting the $\ell_1$ norms of the columns in ascending order. The function $f(r)$ should satisfy the following three properties:

  1. $f(r)$ is a strictly decreasing function, because a higher rank means a greater $\ell_1$ norm. In this situation the increment of the pruning probability should be smaller, since the weights are more important under the $\ell_1$-norm importance criterion.

  2. The integral of $f(r)$ with $r$ ranging from $0$ to $N$ should be positive. The total sum of pruning probabilities then increases after each update, so that we tend to prune more and more weights until reaching the final stage of the algorithm.

  3. $f(r)$ should be zero when $r = R \cdot N$. Since we aim at pruning $R \cdot N$ columns at the final stage, we need to increase the pruning probability of weights whose ranks are below $R \cdot N$, and decrease the pruning probability of weights whose ranks are above $R \cdot N$. By doing this, we can ensure that exactly $R \cdot N$ columns are pruned at the final stage of the algorithm.

The simplest form satisfying these properties is a linear function, as shown in Fig.3,

Figure 3: The functional relationship between the pruning probability increment $\Delta p$ and the rank $r$. We illustrate four functions that satisfy the above three properties. The first is a linear function and the other three are symmetric exponential functions as defined in Eqn.(5) with different values of the hyper-parameter $u$.

a line passing through the points $(0, A)$ and $(R \cdot N, 0)$, where $A$ is a hyper-parameter indicating the increment of pruning probability for the worst-ranked column.

However, experimental results are not very good if we take $f(r)$ to be a linear function (refer to Sec.4.7). The reason is that we use the $\ell_1$ norm to measure weight importance, but the $\ell_1$ norm is not uniformly distributed. In Fig.4, we plot the $\ell_1$-norm histograms of convolutional layers in AlexNet, VGG-16 and ResNet-50.

Figure 4: The $\ell_1$-norm histograms of convolutional layers in AlexNet, VGG-16 and ResNet-50. Because VGG-16 and ResNet-50 have many layers, we only show four representative layers in this figure.

It is observed that the $\ell_1$ norms of each layer follow a Gaussian-like distribution in which the vast majority of values are concentrated within a very small range. If we set $\Delta p$ linearly, the variation of the increments would be huge for middle-ranked columns even though their actual $\ell_1$ values are very close. Intuitively, we need to make $\Delta p$ similar for middle-ranked columns and make $f(r)$ steeper at both ends, in correspondence with the distribution of $\ell_1$ norms. In this paper, we propose a center-symmetric exponential function to achieve this goal, as shown in Eqn.(5):

$$f(r) = \operatorname{sign}(R N - r)\, \beta \left(e^{\alpha |R N - r|} - 1\right). \quad (5)$$

Here $u$ is a hyper-parameter controlling the flatness of the function: a smaller $u$ makes $\Delta p$ flatter for middle-ranked columns, as shown in Fig.3. Note that $\alpha$ is the decay parameter of the exponential, set according to $u$, and the function is center-symmetric about the point $(R N, 0)$. Since we need $f(r)$ to pass through $(0, A)$ and $(R N, 0)$, the constraint $f(0) = A$ gives $\beta = A / (e^{\alpha R N} - 1)$, while $f(R N) = 0$ holds by construction.
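The increment function can be sketched as below. Since the exact constants of the paper's Eqn.(5) are only described qualitatively here, this reconstruction is pinned down just by the three stated properties (strictly decreasing, antisymmetric about $(RN, 0)$, passing through $(0, A)$), and `alpha` stands in for the flatness hyper-parameter:

```python
import numpy as np

def delta_p(rank, n_cols, ratio, A, alpha):
    """Center-symmetric exponential increment of the pruning probability.

    Strictly decreasing in the rank, equal to A at rank 0, zero at
    rank = ratio * n_cols, and antisymmetric about that center point.
    `alpha` controls how flat the middle of the curve is."""
    center = ratio * n_cols
    beta = A / np.expm1(alpha * center)        # from the constraint f(0) = A
    return np.sign(center - rank) * beta * np.expm1(alpha * np.abs(center - rank))

# Increments for 100 columns with a 50% target pruning ratio.
inc = delta_p(np.arange(0, 101), n_cols=100, ratio=0.5, A=0.05, alpha=0.1)
```

Columns far below the center get increments close to $A$, columns far above get decrements close to $-A$, and middle-ranked columns barely move, mirroring the concentrated $\ell_1$-norm distribution.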

For each column, its pruning probability is updated by

$$p^{(k+1)} = \min\!\big(\max\big(p^{(k)} + f(r),\, 0\big),\, 1\big), \quad (6)$$

where $k$ denotes the $k$th update of $p$, and the min/max ensures that $p$ stays within the range $[0, 1]$. During the $k$th update, for a specific column, if its rank $r$ is less than $R N$, $f(r)$ is positive, making $p$ increase; if its rank is greater than $R N$, $f(r)$ is negative, making $p$ decrease. The pruning actions are then sampled by Eqn.(4) and back-propagation is applied to update the remaining weights. Under this mechanism, the weights compete to survive pruning, and are gradually eliminated until the pruning ratio $R$ is reached.
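The update itself is a clipped addition (our sketch):

```python
import numpy as np

def update_probabilities(p, increments):
    """Eqn.(6): add the rank-based increment and clip p into [0, 1].
    A column whose p reaches 1 is permanently pruned."""
    return np.clip(p + increments, 0.0, 1.0)

p = np.array([0.95, 0.30, 0.05])
p_new = update_probabilities(p, np.array([0.10, 0.02, -0.10]))
```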

3.2 When to update $p$

Another question is when to update $p$. A common practice is to prune weights at a fixed interval [23]. SPP keeps this simple rule: $p$ is updated every $T$ iterations of training, where $T$ is a fixed number. Note that $p$ is monotonically increasing when $r < R N$ and decreasing otherwise. As stated above, the increased portion is larger than the decreased portion at each update, so the pruning ratio is guaranteed to be reached within finitely many iterations. Finally, after the pruning ratio $R$ is reached, we stop the pruning process and retrain the pruned model for several iterations to recover accuracy. The whole SPP algorithm is summarized in Algorithm 1.

1:Input the training set $\mathcal{D}$, the original pre-trained CNN model $W$ and the target pruning ratio $R$.
2:Set hyper-parameters $A$, $u$ and $T$.
3:Set the update number $k = 0$.
4:For each column of all convolutional layers, initialize its pruning probability $p = 0$.
5:Initialize the iteration number $t = 0$.
6:repeat
7:     If $t \bmod T = 0$, then update $p$ by Eqn.(5) and (6), and set $k \leftarrow k + 1$.
8:     For each column, obtain $M$ by Monte Carlo sampling based on $p$, as shown in Eqn.(4).
9:     Prune the network by Eqn.(3).
10:     Train the pruned network, updating the weights by back-propagation.
11:     $t \leftarrow t + 1$.
12:until the ratio of permanently pruned columns (those with $p = 1$) reaches $R$.
13:Retrain the pruned CNN for several iterations.
14:Output the pruned CNN model $W^{*}$.
Algorithm 1 The SPP Algorithm
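To see the competition mechanism end to end, the pruning dynamics of Algorithm 1 can be simulated on a random weight matrix, leaving out the actual network training step (hyper-parameter values and function names are illustrative, not the paper's defaults):

```python
import numpy as np

def spp_prune(weights, ratio, A=0.05, alpha=0.1, max_updates=5000):
    """Toy simulation of Algorithm 1's pruning dynamics (no training step).

    weights: (num_filters, num_columns) expanded weight matrix.
    Returns a binary keep-mask over columns (1 = kept) once the target
    fraction `ratio` of columns has been permanently pruned (p = 1).
    """
    n_cols = weights.shape[1]
    target = int(round(ratio * n_cols))
    center = ratio * n_cols
    beta = A / np.expm1(alpha * center)
    p = np.zeros(n_cols)                               # pruning probabilities
    for _ in range(max_updates):
        l1 = np.abs(weights).sum(axis=0) * (p < 1.0)   # pruned columns contribute nothing
        ranks = np.empty(n_cols, dtype=int)
        ranks[np.argsort(l1)] = np.arange(n_cols)      # ascending L1-norm ranks
        delta = np.sign(center - ranks) * beta * np.expm1(alpha * np.abs(center - ranks))
        p = np.clip(p + delta, 0.0, 1.0)               # Eqn.(6)
        if (p >= 1.0).sum() >= target:
            break
    return (p < 1.0).astype(int)

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 40))
keep = spp_prune(w, ratio=0.5)
```

Because no training perturbs the weights here, the columns that end up permanently pruned are exactly the ones with the smallest $\ell_1$ norms; in real SPP training, the Monte Carlo masks and weight updates let columns change ranks and recover before $p$ reaches 1.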

4 Experiments

The hyper-parameters of SPP are $A$, $u$ and $T$. We use the same values in all experiments. Settings such as weight decay, momentum and dropout are unchanged from the baseline models. In the pruning and retraining process, we only adjust the learning rate and batch size.

On the small-scale CIFAR-10 dataset [15], we first evaluate our method on a shallow single-branch model, ConvNet, and a deep multi-branch model, ResNet-56. Then, on the large-scale ImageNet-2012 dataset [5], we evaluate our method with three state-of-the-art models: AlexNet [16], VGG-16 [25] and ResNet-50 [10]. We also test our method on a transfer learning task on the Oxford Flower-102 dataset [21]. We use Caffe [14] for all of our experiments.

Methods for comparison include four structured pruning approaches which were proposed in recent years: Structured Sparsity Learning (SSL) [28], Taylor Pruning (TP) [23], Filter Pruning (FP) [19] and Channel Pruning (CP) [11].

4.1 ConvNet on CIFAR-10

The CIFAR-10 dataset contains 10 classes, with 50,000 images for training and 10,000 for testing. We hold out part of the training set as a validation set. ConvNet was first proposed by [16] for classification on CIFAR-10 and is composed of 3 convolutional layers and 1 fully connected layer.

The performance of the methods is shown in Fig.5. We carefully vary the pruning ratio $R$ to obtain the same speedup ratios, and compare the accuracies of these methods. When the speedup is small, SPP can even improve the performance; we argue that modest pruning regularizes the objective function and increases accuracy. This phenomenon has also been observed for other pruning methods [28]. When the speedup ratio grows larger, SPP performs much better than the other three methods.

Figure 5: Accelerating ConvNet on CIFAR-10. We carefully tune the speedup ratios to be the same, and compare the accuracies of SPP, TP, FP and SSL. SPP is significantly better than the other three methods at all speedup ratios.

The second experiment studies the fraction of weights that are unimportant at the beginning but become important by the final stage of training. We calculate the fraction of columns whose ranks are below $R \cdot N$ at the beginning and above $R \cdot N$ at the final stage. These weights would be pruned by many one-shot pruning methods, but they are finally retained by SPP. Tab.1 shows this fraction, termed the 'recovery ratio', for the three convolutional layers of ConvNet.

Table 1: Recovery ratios for the three convolutional layers of ConvNet, with different speedup ratios.

Because the first layer (conv1) is very small, consisting of relatively few columns, it lacks the dynamics for SPP to perform well. However, for conv2 and conv3, which contain many more columns, the recovery ratios are very prominent, indicating that SPP effectively exploits the dynamics of the model parameters. This may explain why SPP is more robust at large pruning ratios than the other three methods.
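The recovery ratio of Tab.1 can be computed from the column ranks at the start and end of training (a sketch with hypothetical inputs):

```python
import numpy as np

def recovery_ratio(initial_l1, final_l1, ratio):
    """Fraction of columns ranked below R*N initially (one-shot pruning
    would remove them) but above R*N at the end (SPP retains them)."""
    n = initial_l1.size
    cut = int(round(ratio * n))
    below_start = set(np.argsort(initial_l1)[:cut])
    above_end = set(np.argsort(final_l1)[cut:])
    return len(below_start & above_end) / max(cut, 1)

# If the norm ordering fully reverses during training, every initially
# "unimportant" column recovers; if it never changes, none do.
full_flip = recovery_ratio(np.array([1., 2., 3., 4.]), np.array([4., 3., 2., 1.]), 0.5)
no_change = recovery_ratio(np.array([1., 2., 3., 4.]), np.array([1., 2., 3., 4.]), 0.5)
```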

4.2 ResNet-56 on CIFAR-10

ResNet-56 is a deep multi-branch residual network composed of 55 convolutional layers. We trained it from scratch to obtain the baseline model.

The pruning results are shown in Tab.2. Our method prunes ResNet-56 with a smaller error increase than both FP and CP at the same speedup ratio.

Method Increased err. (%)
FP [19] ([11]’s impl.)
CP [11]
Table 2: Accelerating ResNet-56 on CIFAR-10. We compare the error increases of FP, CP and SPP at the same speedup ratio.

4.3 AlexNet on ImageNet

We further verify our method on ImageNet-2012, a large-scale dataset of 1,000 classes, containing 1.28M images for training, 100k for testing and 50k for validation. AlexNet [16] is composed of 5 convolutional layers and 3 fully connected layers. We adopt CaffeNet, an open re-implementation of AlexNet in Caffe, as the pre-trained model, and measure its baseline top-5 accuracy on the ImageNet-2012 validation set.

AlexNet, VGG-16 and ResNet-50 are much deeper CNN models, for which a constant pruning ratio across all layers is not optimal. Usually the pruning ratios of different layers are determined according to their sensitivity. There are mainly two ways to evaluate sensitivity: (1) fix the other layers and prune one layer, using the accuracy of the pruned model as the sensitivity measure of that layer [19]; (2) apply PCA to each layer and take the reconstruction error as the sensitivity measure [11, 28]. We take the PCA approach because it is simple and not time-consuming.

The PCA analysis of AlexNet is shown in Fig.6.

Figure 6: The normalized reconstruction errors with different remaining PCA ratios for convolutional layers in AlexNet.

We plot the normalized reconstruction errors with different remaining principal-component ratios. Under the same remaining ratio, the normalized reconstruction errors of the upper layers (like conv5) are greater than those of the bottom layers (like conv1), which means the upper layers are less redundant. Thus, we assign the five layers increasing remaining ratios (one minus the pruning ratio) from bottom to top.
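The paper describes the PCA sensitivity measure only at a high level; one plausible rendering (ours, not necessarily the paper's exact procedure) projects a layer's expanded weight matrix onto its top principal components and measures the normalized reconstruction error:

```python
import numpy as np

def pca_reconstruction_error(weights, remaining_ratio):
    """Normalized error when keeping only a fraction of principal components.

    A layer whose error stays small at low remaining ratios is redundant
    and can tolerate a larger pruning ratio.
    weights: (num_filters, channels, kh, kw)."""
    flat = weights.reshape(weights.shape[0], -1)
    flat = flat - flat.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(flat, full_matrices=False)
    k = max(1, int(round(remaining_ratio * s.size)))
    approx = (u[:, :k] * s[:k]) @ vt[:k]               # rank-k reconstruction
    return np.linalg.norm(flat - approx) / np.linalg.norm(flat)

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8, 3, 3))                     # a stand-in conv layer
err_low = pca_reconstruction_error(w, 0.25)
err_high = pca_reconstruction_error(w, 0.75)
```

Keeping more components always lowers the error; comparing curves across layers at the same remaining ratio gives the per-layer sensitivity ordering used above.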

Accuracy comparison of TP, FP, CP and SPP is shown in Tab.3.

TP [23]
SSL [28]
FP [19] (our impl.)
Table 3: Accelerating AlexNet on ImageNet. We carefully tune the speedup ratios to be the same, and compare the top-5 error increases (%) of SPP, TP, FP and SSL.

SPP outperforms the other three methods by a large margin: it accelerates AlexNet by 4x with only a 0.3% increase of top-5 error. Similar to the ConvNet results on CIFAR-10, when the speedup ratio is small, SPP can even improve the performance. Previous work [28] reported similar improvements, but theirs were smaller and obtained under lower speedup settings.

4.4 VGG-16 on ImageNet

VGG-16 is a deep single-branch convolutional neural network with 13 convolutional layers. We use the openly released pre-trained model (vgg/research/very_deep/) and measure its single-view top-5 accuracy as the baseline.

As in the AlexNet experiment, we first use PCA to explore the redundancy of different layers. We again find that deeper layers have less redundancy; [11] reports similar results. According to the sensitivity analysis, we assign increasing remaining ratios to the shallow layers (conv1_x to conv3_x), middle layers (conv4_x) and top layers (conv5_x). Because layers conv1_1 and conv5_3 are very small and contribute little to the total computation, we choose not to prune these two layers.

We first prune the network, and then retrain the pruned model. Tab.4

TP [23]
FP [19] ([11]’s impl.)
CP [11]
Table 4: Acceleration of VGG-16 on ImageNet. We carefully tune the speedup ratios to be the same, and compare the top-5 error increases (%) of TP, FP, CP and SPP.

shows the accuracy comparison of TP, FP, CP and SPP. Generally, SPP and CP are much better than FP and TP. At a small speedup ratio, both SPP and CP achieve zero increase of top-5 error. At greater speedup ratios, the performance of SPP is comparable with CP, with each method better at one of the two settings.

We further evaluate the actual speedup on CPU with Caffe. Results are averaged over 50 runs. From Tab.5,

Method CPU time (ms)
VGG-16 baseline
SPP ()
SPP ()
SPP ()
Table 5: Actual speedup of VGG-16. The CPU time is for the forward processing of one 224x224 image. Experiments are conducted on an Intel Xeon(R) CPU E5-2603.

we can see that the model pruned by SPP achieves actual acceleration on a general platform with off-the-shelf libraries. The discrepancy between the theoretical speedup (measured by GFLOPs reduction) and the actual speedup (measured by inference time) is mainly due to memory access, the unpruned fully connected layers, and the non-weight layers such as pooling and ReLU.

4.5 ResNet-50 on ImageNet

Unlike the single-branch AlexNet and VGG-16, ResNet-50 is a more compact, multi-branch CNN composed of 49 convolutional layers. We use the openly released pre-trained caffemodel and measure its top-5 accuracy on the ImageNet-2012 validation set as the baseline.

The PCA analysis of ResNet-50 is shown in Fig.7.

Figure 7: The normalized reconstruction errors with different remaining PCA ratios for representative convolutional layers in ResNet-50. In each bottleneck residual block, we choose the 3x3 (branch2b) kernels to draw the figure. Layers of the same stage are plotted with the same color and marker.

Unlike AlexNet and VGG-16, the redundancy of different layers in ResNet-50 is not strongly related to their positions in the network. For example, the bottom layers res2c_branch2b, res3b_branch2b and res3d_branch2b share similar sensitivity with the top layers res5x. Thus, we set the pruning ratio of all convolutional layers of ResNet-50 to the same value in this experiment.

We first prune the network and then retrain the pruned model.

The result is shown in Tab.6.

Method Increased err. (%)
CP (enhanced) [11]
Table 6: Acceleration of ResNet-50 on ImageNet. We carefully tune the speedup ratio to 2x, and compare the top-5 error increases of CP and SPP.

It can be seen that our method achieves a better result than CP at the same speedup ratio. Note that the implementation of SPP on ResNet-50 is very simple, exactly the same as in the previous experiments, whereas CP needs an additional multi-branch enhancement procedure to generalize to ResNet because of ResNet's special structure.

4.6 Transfer Learning

Finally, we apply SPP to transfer learning, where a well-trained model is fine-tuned with data from another knowledge domain. We fine-tune a pre-trained AlexNet model on the Oxford Flower-102 dataset [21]. This 102-class dataset is composed of 8,189 images, split into training, validation and test sets. We use the openly released pre-trained caffemodel as the baseline. We compare the performance of our method with TP [23], which was reported to be very effective for transfer learning.

As in the above experiments, we first prune the pre-trained model with SPP and then fine-tune it to regain accuracy. For simplicity, we use a constant pruning ratio in this experiment. Tab.7

TP [23] (our impl.)
Table 7: Comparison of SPP and TP on the Flower-102 dataset. The results are test error increases (%).

shows that SPP consistently outperforms TP at all speedup ratios. Note that in [23], the authors claim that the Taylor-based criterion is significantly better than the $\ell_1$ norm for pruning in the transfer learning task, while our results show that SPP with the $\ell_1$ norm outperforms TP with the Taylor-based criterion, which demonstrates the effectiveness of SPP in transfer learning tasks.

4.7 Comparison of Exponential and Linear $f(r)$ Functions

We experimentally compare our proposed symmetric exponential function $f(r)$ with the linear one (Sec.3.1), using ConvNet on CIFAR-10. The results are shown in Tab.8.

SPP (exp)
SPP (lin)
Table 8: Comparison of our proposed symmetric exponential function $f(r)$ with the linear function. The results are test accuracies (%) at different speedup ratios.

We can see that, at relatively small speedup ratios, the linear function performs similarly to the exponential one. However, as the speedup ratio grows, the proposed exponential function significantly outperforms the linear one, indicating that it is more resistant to harsh pruning.

5 Conclusions

We proposed Structured Probabilistic Pruning (SPP) for CNN acceleration, which prunes the weights of a CNN in a probabilistic manner and is able to correct misjudgments of importance made in early training stages. The effectiveness of SPP is demonstrated by comparison with state-of-the-art methods on popular CNN architectures. We also show that SPP can be applied to transfer learning tasks and achieves better results than previous methods.