DropPruning for Model Compression

12/05/2018
by   Haipeng Jia, et al.

Deep neural networks (DNNs) have achieved great success on a variety of challenging tasks. However, most successful DNNs are structurally complex, leading to heavy storage requirements and many floating-point operations. This paper proposes a novel technique, named Drop Pruning, to compress DNNs by pruning the weights of a dense high-accuracy baseline model without accuracy loss. Drop Pruning follows the standard iterative prune-retrain procedure, with a drop strategy applied at each pruning step: drop out, which stochastically deletes some unimportant weights, and drop in, which stochastically recovers some pruned weights. Drop out and drop in are designed to handle the two drawbacks of traditional pruning methods: local importance judgment and the irretrievable pruning process, respectively. Suitably chosen drop probabilities decrease the model size during the pruning process and lead it to the target sparsity. Drop Pruning also shares a similar spirit with dropout, with a stochastic algorithm for Integer Optimization, and with the Dense-Sparse-Dense training technique. Drop Pruning can significantly reduce overfitting while compressing the model. Experimental results demonstrate that Drop Pruning achieves state-of-the-art performance on many benchmark pruning tasks, about 11.1× compression of VGG-16 on CIFAR10 and 14.3× compression of LeNet-5 on MNIST without accuracy loss, which may provide new insights into model compression.

1 Introduction

In recent years, various kinds of deep neural networks (DNNs) have dramatically improved the accuracy of many computer vision tasks, from the basic image classification challenge (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014b; He et al., 2016) to more advanced applications, e.g., object detection (Liu et al., 2016) and semantic segmentation (Noh et al., 2015). However, these networks generally contain tens of millions of parameters, leading to heavy storage requirements and many floating-point operations, which makes it difficult to deploy DNNs on mobile platforms with limited memory and processing units (Canziani et al., 2016; Cheng et al., 2017).

One way to address the above issue is model compression, because the models are typically heavily over-parameterized (Ullrich et al., 2017; Molchanov et al., 2017). Various approaches have been proposed to compress models, including quantization (Wu et al., 2016), parameter sharing (Chen et al., 2015), pruning (Han et al., 2015b), low-rank factorization (Lebedev et al., 2014) and knowledge distillation (Ba & Caruana, 2014; Hinton et al., 2015). Among these methods, pruning stands out because it can achieve a high compression ratio without accuracy loss. Han et al. (Han et al., 2015b) reduced the model size without accuracy loss by removing all the weights whose magnitudes fall below a threshold and retraining the sparse model. Specifically, as shown in Figure 1 (a)(b), starting from a baseline model, i.e., the uncompressed model (denoted by a vector), a traditional pruning process first deletes some unimportant weights (entries of the vector) and then retrains the model. After deleting and retraining several times, the pruning process outputs the pruned model (a smaller vector). Taking a glance at the pruning process in Figure 1 (a)(b), two issues arise during pruning:

  • Which weights are unimportant? A common way is to determine each weight's importance at every pruning step, for instance, by its magnitude (Han et al., 2015b). However, since the interconnections among the weights are complicated, a weight's importance may change dramatically during pruning, i.e., the importance is only a local judgment, which means weights that seem less important now may become more important later.

  • Once pruned, there is no chance to come back. If we view the pruning problem as an Integer Optimization problem, i.e., deciding whether each weight should be deleted or not, the pruning process in Figure 1 (a)(b) forces the optimization domain to become smaller and smaller, i.e., the pruned weights have no chance to come back, so the optimization process has no chance to escape from a local minimum (here, the local minimum is a pruning strategy, i.e., a local solution of the corresponding Integer Optimization problem).

Figure 1: (a): A schematic of the model accuracy varying during the pruning process; (a)(b): the pruning method of (Han et al., 2015b); and (a)(c): the proposed Drop Pruning. Here the vector denotes the baseline model with 6 weights. In the pruning process (a)(b), we iteratively delete some unimportant weights and then retrain the model, in which the current pruned model is a subset of the previous one. In the proposed Drop Pruning (a)(c), we iteratively drop out some unimportant weights, drop in some pruned weights and retrain, in which the current pruned model is smaller than the previous one. We use white characters on a black background to represent the weights that will be dropped out at the next pruning step and black characters on a white background to represent the weights that are dropped in at this pruning step.

To address the above two issues, this paper proposes a novel pruning strategy, named Drop Pruning, which introduces stochastic optimization into pruning, i.e., pruning the weights with some probability. Drop Pruning also follows the standard iterative prune-retrain procedure, with a drop strategy applied at each pruning step: we delete unimportant weights with some probability, named drop out, and recover some of the deleted weights with some probability, named drop in. For instance, as shown in Figure 1 (a)(c), at the second and third pruning steps we both drop out some unimportant weights and drop in some previously pruned weights. Finally, Drop Pruning also outputs a pruned model of the same size as that in Figure 1 (a)(b), but with weights at different locations. Clearly, in Drop Pruning, drop out reduces the influence of only locally judging the weights' importance by their magnitudes, i.e., it addresses the first issue, while drop in gives the deleted weights a chance to come back, i.e., it addresses the second issue. Figure 2 clearly shows the improvement of the proposed Drop Pruning (denoted by Drop out Pruning and Drop Pruning) over the pruning method proposed by (Han et al., 2015b) (denoted by Traditional Pruning).

(a) VGG-16 on CIFAR10. (b) LeNet-5 on MNIST.
Figure 2: The performance of Traditional Pruning (Han et al., 2015b), Drop out Pruning (Drop Pruning only with drop out) and Drop Pruning for (a): VGG-16 on CIFAR10 and (b): LeNet-5 on MNIST. These results clearly demonstrate the individual contributions of drop out and drop in within Drop Pruning.

In conclusion, the contributions of this paper can be summarized as follows:

  • A novel pruning strategy, named Drop Pruning, is proposed to handle the two key issues of traditional pruning methods (Han et al., 2015b), i.e., local importance judgment and the irretrievable pruning process, so that it achieves a better pruned model, e.g., better accuracy at the same model size, or a smaller model size at the same accuracy, as shown in Figure 2.

  • An idea similar to dropout (Srivastava et al., 2014) is introduced into pruning but with a different intention: dropout trains many different "thinned" networks and averages their outputs at inference, while Drop Pruning tries to directly find the best "thinned" network and uses that one at inference. Drop Pruning can also be seen as a technique to handle over-fitting, although it should start from a high-accuracy baseline model.

  • By formulating the pruning problem as an Integer Optimization problem, we introduce randomness into the solving procedure, which improves the performance of the proposed Drop Pruning. A similar idea has been shown to be effective in solving another Integer Optimization problem (Sun et al., 2018).

  • Compared with the Dense-Sparse-Dense (DSD) training technique (Han et al., 2016), Drop Pruning also introduces a similar Sparse-Dense action, named drop in, to let the model escape from a local minimum (here, a local minimum is a model, i.e., a group of weights, that locally minimizes the training loss of the DNN) with some probability, which may improve the performance of DNNs even during the pruning process, as shown in Figure 2.

The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 introduces the implementation details of drop out and drop in; in particular, it details the relations with dropout, Integer Optimization and Dense-Sparse-Dense training. Section 4 experimentally analyses Drop Pruning and Section 5 draws the conclusions.

2 Related work

Pruning the weights of a neural network is a very straightforward approach to reduce its complexity. An early pruning method was Biased Weight Decay (Hanson & Pratt, 1989), which tried to choose minimal representations during back-propagation. Then the Optimal Brain Damage (LeCun et al., 1990) and the Optimal Brain Surgeon (Hassibi & Stork, 1993) were proposed, which used the Hessian information of the loss function to prune connections and suggested that using the Hessian yields higher accuracy than magnitude-based pruning. Since the Hessian information is computationally intensive, especially for large networks, Han et al. (Han et al., 2015b) proposed to reduce the network size by magnitude-based pruning and introduced a retraining technique. They also combined this pruning scheme with quantization and Huffman coding to achieve a higher compression ratio (Han et al., 2015a). Guo et al. (Guo et al., 2016) proposed dynamic network surgery, which incorporates weight splicing into the whole pruning process to avoid incorrect pruning.

In recent years, group-wise brain damage (Lebedev & Lempitsky, 2016a) and a layer-wise brain surgeon (Dong et al., 2017) were also proposed to compress deep network structures. Li et al. (Li et al., 2016) used the ℓ1-norm of the filters to prune unimportant filters. Luo et al. (Luo et al., 2017) presented ThiNet, which determines whether a filter can be pruned by the outputs of its next layer. A first-order gradient-based strategy was proposed for pruning convolutional neural networks (Molchanov et al., 2016), which is a computationally efficient procedure verified by transfer-learning experiments. Other filter pruning methods for convolutional neural networks can be found in (Mathieu et al., 2013; Lavin & Gray, 2016; Li et al., 2016).

Another growing interest in pruning is directly training compact DNNs with sparsity constraints. The work in (Lebedev & Lempitsky, 2016b) imposed a sparsity constraint on the filters to prune the convolution kernels in a group-wise fashion. A group-sparse regularizer was also introduced in (Zhou et al., 2016) to learn compact filters during training. Recently, an ℓ0 regularizer (Louizos et al., 2017) was proposed to prune the model during training by encouraging weights to become exactly zero. Golub et al. (Golub et al., 2018) proposed a pruning strategy applied both during and after training that constrains the total number of weights, using gradients rather than magnitudes.

Drop Pruning is proposed to directly prune the weights of a dense high-accuracy model, but with a different pruning strategy. The proposed approach also follows the standard iterative prune-retrain procedure (Han et al., 2015b). The detailed differences will be discussed in the next section.

3 Drop Pruning

3.1 Notations

Denote by $w$ a vector containing the weights of a DNN model and by $m$ a binary vector of the same size as $w$. Denote by $w \odot m$ a pruned model, in which the entries of $m$ indicate the states (kept or pruned) of the weights in $w$. Denote by $\mathbf{1}$ the ones vector of the same size as $w$ and by $|v|$ the dimensionality of a vector $v$. Let $S(m)$ be the set containing the locations of the ones in a binary vector $m$. Retraining a pruned model $w \odot m$ means that we only retrain the un-pruned weights of $w$, i.e., those indicated by the ones in $m$.

3.2 Drop out

Apparently, a normal way to do pruning is to delete the unimportant weights and keep the important ones. As discussed in Section 1, the key point is to find the unimportant weights, because the weights' importance may change dramatically during the pruning process. In the traditional pruning method (Han et al., 2015b), starting from the baseline model, i.e., setting $m_0 = \mathbf{1}$, each pruning step $t$ first finds the unimportant weights of the current model $w_t \odot m_t$,

(1)    $U_t = \{\, i : |(w_t)_i| < \epsilon,\ (m_t)_i = 1 \,\},$

and then updates $m_t$ by

(2)    $(m_{t+1})_i = \begin{cases} 0, & i \in U_t, \\ (m_t)_i, & \text{otherwise}, \end{cases}$

where $\epsilon$ is a predefined threshold that can vary across pruning steps and layers. After deleting the unimportant weights, we retrain the pruned model $w_t \odot m_{t+1}$ to obtain $w_{t+1}$. Then $m_t$ updates as shown in Figure 3 (a), and we can easily check that $S(m_{t+1}) \subseteq S(m_t)$, i.e., the pruned model is a subset of the previous one.
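For concreteness, here is a minimal NumPy sketch of the magnitude-based update (1)-(2); the flat-vector view of the model and the function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def traditional_prune_step(w, m, eps):
    """One step of traditional magnitude-based pruning, following (1)-(2).

    w   : flat weight vector of the current model
    m   : binary mask, 1 = kept, 0 = pruned
    eps : magnitude threshold (may vary per step and per layer)
    """
    # (1): unimportant weights are the currently kept weights below the threshold
    unimportant = (np.abs(w) < eps) & (m == 1)
    # (2): delete them permanently
    m_next = m.copy()
    m_next[unimportant] = 0
    return m_next
```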

Figure 3: The pruning process of (a): Traditional Pruning (Han et al., 2015b), (b): Drop out Pruning and (c): Drop Pruning. At each pruning step, Traditional Pruning follows Find-Delete-Retrain, Drop out Pruning follows Find-Drop out-Retrain and Drop Pruning follows Find-Drop out-Drop in -Retrain.

The drop out step of Drop Pruning introduces randomness into (2) when judging the importance of weights. The motivation is simple: "weights with smaller magnitudes are not necessarily less important, but they have a high probability of being less important." Similar to the idea of dropout (Srivastava et al., 2014), here we introduce a vector $r$ to update $m_t$ by

(3)    $(m_{t+1})_i = \begin{cases} (m_t)_i\,(1 - r_i), & i \in U_t, \\ (m_t)_i, & \text{otherwise}, \end{cases}$

where $r$ is a vector of independent Bernoulli random variables, each of which takes the value 1 with probability $p_{\mathrm{out}}$. At this point, drop out lets $m_t$ update as shown in Figure 3 (b). However, similar to Figure 3 (a), we still have $S(m_{t+1}) \subseteq S(m_t)$. Thus, if the pruning process falls into a local minimum, it has no chance to escape.

3.3 Drop in

The drop in step of Drop Pruning is proposed to overcome the problem of falling into a local minimum during the pruning process. Let $q$ be a vector of independent Bernoulli random variables, each of which takes the value 1 with probability $p_{\mathrm{in}}$. Then, while dropping out the unimportant weights by (3), we also drop in the pruned weights with some probability, that is,

(4)    $(m_{t+1})_i = \begin{cases} (m_t)_i\,(1 - r_i), & i \in U_t, \\ q_i, & (m_t)_i = 0, \\ (m_t)_i, & \text{otherwise}, \end{cases}$

which gives the model a chance to escape from a local minimum, like the Sparse-Dense action in DSD training (Han et al., 2016). As shown in Figure 3 (c), since drop in reloads some pruned weights into the model, we obviously no longer have $S(m_{t+1}) \subseteq S(m_t)$. In practice, we choose $p_{\mathrm{out}}$ and $p_{\mathrm{in}}$ so that $\|m_{t+1}\|_1 < \|m_t\|_1$, i.e., the pruned model is smaller than the previous one. Of course, it is also optional to let $p_{\mathrm{out}}$ and $p_{\mathrm{in}}$, like the threshold $\epsilon$, vary across pruning steps and layers.
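The two stochastic updates can be sketched in the same NumPy style as before; the probability names p_out and p_in and the random-generator handling are our own illustrative choices.

```python
import numpy as np

def drop_step(w, m, eps, p_out, p_in, rng=None):
    """One drop out / drop in mask update, following (3)-(4)."""
    rng = rng or np.random.default_rng()
    unimportant = (np.abs(w) < eps) & (m == 1)     # candidate set from (1)
    m_next = m.copy()
    # (3) drop out: delete each unimportant weight only with probability p_out
    r = rng.random(m.shape) < p_out
    m_next[unimportant & r] = 0
    # (4) drop in: recover each previously pruned weight with probability p_in
    q = rng.random(m.shape) < p_in
    m_next[(m == 0) & q] = 1
    return m_next
```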

In conclusion, the Drop Pruning algorithm is summarized in Algorithm 1. By the definition of $m$, the fraction of zeros $\|\mathbf{1} - m\|_1 / |m|$ represents the sparsity of the pruned model $w \odot m$. Given a baseline model and a target sparsity, the algorithm outputs a pruned model by iteratively performing Drop out-Drop in-Retrain until the target sparsity is reached.

Input: Baseline model $w_0$ with $m_0 = \mathbf{1}$, and target sparsity $s^*$.
Output: Pruned model $w_T \odot m_T$.

1:  Set drop probabilities $p_{\mathrm{out}}$, $p_{\mathrm{in}}$;
2:  Let $t = 0$;
3:  while $\|\mathbf{1} - m_t\|_1 / |m_t| < s^*$ do
4:     Find the unimportant weights $U_t$ by (1);
5:     Drop out some weights by (3);
6:     Drop in some weights by (4);
7:     Retrain $w_t \odot m_{t+1}$ to obtain $w_{t+1}$ and set $t = t + 1$;
8:  end while
Algorithm 1 Drop Pruning
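Putting the pieces together, the outer loop of Algorithm 1 could be sketched as follows, reusing drop_step from the sketch above; the retrain argument stands for any routine that fine-tunes only the un-pruned weights and is left as a placeholder assumption.

```python
import numpy as np

def drop_pruning(w, retrain, target_sparsity, eps, p_out, p_in):
    """Sketch of Algorithm 1: iterate Find-Drop out-Drop in-Retrain
    until the mask reaches the target sparsity (fraction of zeros in m)."""
    m = np.ones_like(w)
    while 1.0 - m.mean() < target_sparsity:
        m = drop_step(w, m, eps, p_out, p_in)   # steps 4-6: (1), (3), (4)
        w = retrain(w * m, m)                   # step 7: retrain only weights with m == 1
    return w * m, m
```

In practice the threshold and the two drop probabilities would be scheduled per layer and per step, as noted above.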

Dynamic network surgery (DNS) (Guo et al., 2016) also introduced a similar idea of reloading some pruned weights, which was named splicing in their paper. The importance they impose on the weights still depends on the magnitude, and they proposed two thresholds $a$ and $b$ with a small margin ($a < b$) to determine the state of each weight during pruning: pruned (magnitude less than $a$), persisted (magnitude greater than $b$), or keeping the same state as at the last pruning step (magnitude between $a$ and $b$). After pruning, DNS retrains the whole network, not only the un-pruned important weights. Thus, a pruned weight (whose magnitude was previously less than $a$) has a chance to come back if its magnitude exceeds $b$ after retraining. In the retraining procedure, DNS reloads all the pruned weights, and the splicing is almost deterministic, while drop in is stochastic. In addition, choosing the two thresholds in DNS (which vary across layers and pruning steps) is tricky, and DNS only performs the actual pruning at the last step, whereas Drop Pruning only needs two drop probabilities to make the pruning process slowly flow toward the pruned model with the desired sparsity. The splicing rule is sketched below for comparison.
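As a rough illustration only (the threshold names a and b are ours, and details differ from the original DNS paper), the three-state splicing rule described above could look like this:

```python
import numpy as np

def dns_mask_update(w, m, a, b):
    """Sketch of DNS-style splicing: prune below a, splice back above b,
    keep the previous state in between (a < b)."""
    m_next = m.copy()
    m_next[np.abs(w) < a] = 0   # pruned
    m_next[np.abs(w) > b] = 1   # persisted / spliced back
    return m_next               # magnitudes in [a, b] keep their previous state
```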

Remark. The importance judgment at each pruning step, i.e., obtaining the set $U_t$, can be improved from two aspects: (1) considering the gradient rather than the magnitude; (2) considering a group of weights, like the kernels in CNNs. These modifications deserve deeper experimental investigation in the future.

3.4 Relation with Dropout

Dropout (Hinton et al., 2012; Srivastava et al., 2014; Bouthillier et al., 2015) or dropconnect (Wan et al., 2013) is a simple but efficient way to prevent DNNs from over-fitting. The main idea of dropout or dropconnect is to randomly drop units or connections while training a neural network. As shown in Figure 4 (a), during training, starting from a random initial model, dropout samples from an exponential number of different "thinned" networks. At inference, see Figure 4 (b), it uses the whole network with scaled-down weights to approximate the effect of averaging the predictions of all these thinned networks.

Compared with dropout, Drop Pruning (see Figure 4 (c)), starting from a high-accuracy baseline model, permanently drops out weights during training, i.e., it chooses a "thinned" network. Since it only drops out some unimportant weights, the accuracy loss may be negligible, so the chosen "thinned" network is likely a good one. However, if we keep dropping out, the "thinned" network becomes smaller and smaller, which may severely affect performance. To have a chance of reaching another good "thinned" network, drop in is introduced. After dropping out and dropping in several times, we may eventually reach the most "thinned" network and use this sparse one at inference, as shown in Figure 4 (d).

Figure 4: The diagrams of Drop Pruning and dropout: (a), training process in dropout; (b), inference process after dropout; (c), pruning process in Drop Pruning; (d), inference process after Drop Pruning. Dropout trains different "thinned" networks (colored entries) and averages their outputs at inference, while Drop Pruning tries to directly find the best "thinned" network and uses that one at inference.

Targeted Dropout. Recently, a novel dropout strategy, named targeted dropout, was proposed by Gomez et al. (Gomez et al., 2018). We were unaware at the time we developed this work that Gomez et al. were also working on a similar project combining dropout and pruning. Targeted dropout is a strategy for post hoc pruning of neural network weights and units that builds the pruning mechanism directly into learning. The excellent performance in their experiments strongly supports the proposed idea. At each weight update, targeted dropout first selects the weights with the lowest magnitudes and then drops these entries with some drop rate, which is mathematically almost the same strategy as the drop out in Drop Pruning, i.e., (3). The main difference is that in Drop Pruning we permanently prune the weights, while in targeted dropout the dropped weights come back at the next weight update. Thus targeted dropout still follows the training process in Figure 4 (a) with different "thinned" networks.

We can also observe that the objective of targeted dropout is to train a network robust to pruning, i.e., starting from a random initial model, while the objective of Drop Pruning is to directly prune a high-accuracy model, i.e., starting from a learned dense model. In a way, targeted dropout is a kind of "pruning-based dropout" aiming at a more prunable dense model, while Drop Pruning is a kind of "dropout-based pruning" aiming at an exactly sparse model. The effectiveness of targeted dropout supports the meaningfulness of combining pruning and dropout, and Drop Pruning involves a similar idea but in a different direction. An interesting research direction is to first learn a dense model by targeted dropout and then prune it by Drop Pruning, which may be considered in the future.

3.5 Relation with Integer Optimization

Integer and mixed-integer constrained optimization problems (Karlof, 2005) are NP-complete problems in optimization, in which some or all of the variables are restricted to be integers. Here we introduce a related work by Sun et al. (Sun et al., 2018) for solving the minimal weighted vertex cover (MWVC) problem, which is indeed an Integer Optimization problem. They introduced randomness into the optimization process and achieved state-of-the-art experimental performance, together with some theoretical results. Their work shows that stochastic optimization is not only a powerful technique in many other optimization settings, e.g., Genetic Algorithms (Banzhaf et al., 1998), stochastic PCA or SVD (Shamir, 2015), Stochastic Gradient Descent (Ruder, 2016) and Random Coordinate Descent (Nesterov, 2012), but also a better choice for Integer Optimization problems. Randomness is always a powerful technique, even when only applied to generating initial weights (He et al., 2018). We remark that Genetic Algorithms (Banzhaf et al., 1998) can also be applied to Integer Optimization, but they need a huge number of samples per generation, making it almost impossible to handle a huge system, such as the MWVC problem considered in (Sun et al., 2018) or pruning a large neural network.

Figure 5: Minimal weighted vertex cover (MWVC) problem: to find a minimal weighted set of vertices (solid ones) that touch all the edges in a given graph.

As shown in Figure 5, the well-known MWVC problem is: given a graph $G = (V, E)$ with vertex weights $c_i$, find a minimum-weight set of vertices that touches all the edges of the graph (Taoka & Watanabe, 2012). The minimization problem can be formulated as

(5)    $\min_{x \in \{0,1\}^{|V|}} \; f(x) = \sum_{i \in V} c_i x_i + \lambda \sum_{(i,j) \in E} (1 - x_i)(1 - x_j),$

where $f(x)$ is a global objective function combining the total weight of the selected vertices and a penalty on uncovered edges. When the vertices have equal weights, the problem degrades into the so-called minimal vertex cover (MVC) problem. The MWVC problem has found practical significance in computational biology, network security, large-scale integration design and wireless communication.

In (Sun et al., 2018), a distributed algorithm was proposed for solving the MWVC problem, where each player (each vertex in the graph) simultaneously updates its action (0 or 1, i.e., whether the desired subset takes this vertex or not) by obeying a relaxed greedy rule followed by a mutation with some probability, i.e., a mutated action is randomly drawn from the memory (the history of actions). They found that if each player chooses the deterministic best response, the algorithm converges to a local minimum that depends on the initial state, whereas if each player chooses a random action from its memory, the algorithm converges to a better solution with high probability. The effectiveness and theoretical analysis of their algorithm demonstrate that stochastic optimization is also an effective technique for handling Integer Optimization problems.
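As an illustration only, a toy penalty-based local search with memory-based mutation in the spirit described above (not the exact distributed algorithm of Sun et al.) might look as follows:

```python
import numpy as np

def mwvc_objective(x, weights, edges, lam=10.0):
    """Penalized objective as in (5): vertex-weight sum plus a penalty on uncovered edges."""
    uncovered = sum((1 - x[i]) * (1 - x[j]) for i, j in edges)
    return float(np.dot(weights, x)) + lam * uncovered

def stochastic_local_search(weights, edges, steps=2000, p_mut=0.1, rng=None):
    """Greedy single-vertex flips, occasionally replaced by a mutation drawn from memory."""
    rng = rng or np.random.default_rng(0)
    n = len(weights)
    x = np.ones(n, dtype=int)                    # start from the trivial full cover
    memory = [x.copy()]
    for _ in range(steps):
        i = rng.integers(n)
        if rng.random() < p_mut:                 # mutation: replay a random past action of vertex i
            x[i] = memory[rng.integers(len(memory))][i]
        else:                                    # greedy best response: flip vertex i if it helps
            y = x.copy()
            y[i] = 1 - y[i]
            if mwvc_objective(y, weights, edges) < mwvc_objective(x, weights, edges):
                x = y
        memory.append(x.copy())
    return x
```

On a small graph such as a path with weights [1, 2, 1] and edges [(0, 1), (1, 2)], this typically recovers a minimum-weight cover; the random mutations are what allow the search to leave poor local optima on larger instances.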

Here the pruning problem can also be formulated as an Integer Optimization problem: given an architecture with weights $w$, find a minimal set of weights, i.e., a pruned model $w \odot m$, that still achieves the same accuracy as the baseline model. The minimization problem can be formulated as

(6)    $\min_{m \in \{0,1\}^{|w|}} \; g(m) = L(w \odot m) + \lambda \, \|m\|_1,$

where $g(m)$ is a global objective function combining the training loss $L$ of the pruned model $w \odot m$ and its size. As also mentioned in (Srivastava et al., 2014), a neural network with $n$ weights can be seen as a collection of $2^n$ possible thinned neural networks, so searching for the best one is exactly an NP-hard Integer Optimization problem. The pruning process in Figure 1 (a)(c) can then be seen as an optimization process for solving this Integer Optimization problem. Similar to the idea of introducing randomness in (Sun et al., 2018), Drop Pruning imposes randomness by dropping out and dropping in the weights with some probability. In particular, the idea of drop in has the same spirit as acting from the player's memory in the distributed algorithm of (Sun et al., 2018).

3.6 Relation with DSD

To overcome the overfitting problem in training large DNNs, Han et al. (Han et al., 2016) proposed a Dense-Sparse-Dense (DSD) training flow to regularize DNNs. They first train a dense network (Dense), then prune the unimportant weights and retrain (Sparse), and finally re-initialize the pruned weights to zero and retrain the whole network (Dense). Their experiments show that the DSD training process can improve the performance of a wide range of DNNs.

As shown in Figure 6 (a), during the training process, the pruning and retraining (Sparse) step may let the model escape from a local minimum and flow to a better one after re-initializing the pruned weights and retraining (Dense). However, in DSD the model keeps its size (see the size of the red nodes) and escapes only once. Compared with DSD, the proposed pruning strategy can be seen as a similar training flow that simultaneously reduces the model size. In Drop Pruning, as shown in Figure 6 (b), drop out corresponds to the Dense-Sparse action, while drop in corresponds to the Sparse-Dense action. Combined with the imposed randomness, the iterative process of Drop Pruning has a high probability of leading the model to escape from a local minimum to a better one, possibly even the global one.

Figure 6: Comparison between (a): Dense-Sparse-Dense (DSD) training (Han et al., 2016) and (b): the proposed Drop Pruning. In (a) DSD, the Sparse step may let the model escape from a local minimum and then flow to a better one after re-initializing the pruned weights and retraining; the model size remains unchanged (the size of the red nodes). In (b) Drop Pruning, drop out may let the model escape from a local minimum and drop in may let it flow to a better one; the iterative process with imposed randomness gives the model a high probability of flowing to the global one, while the model size is reduced.

4 Experiments

In this section, we experimentally analyse the proposed method and apply it to some popular neural networks, e.g., VGG-16 (Simonyan & Zisserman, 2014a). All the experiments were run on a GPU cluster with 16 NVIDIA Tesla V100 GPUs (16GB) for training the baseline models and evaluating the performance of Drop Pruning, i.e., each GPU ran an individual Drop Pruning process. In this version, we only evaluate the performance of Drop Pruning for VGG-16 on CIFAR10 and LeNet-5 on MNIST. We will keep updating the results in the future.

Off-line pruning. Drop Pruning is a stochastic pruning strategy, and each trial leads to a pruned model with a different test accuracy. The following results are the best ones over multiple trials for VGG-16 and for LeNet-5 (fewer trials for VGG-16, since it costs much more than LeNet-5). Here we use the best one to represent the ability of the proposed algorithm, because pruning can be done off-line. Once we obtain a pruned model with the target sparsity, applying the pruned model on-line is deterministic.

Target Sparsity. Here we simply impose that each layer of the pruned model has the same target sparsity. Of course, this setting affects the performance at high target sparsities, since most of the weights are located in the fully connected layers. We believe that adjusting the target sparsity per layer, for instance, using a higher target sparsity for the fully connected layers, could further improve the performance; a per-layer magnitude threshold for a given sparsity can be computed as sketched below.
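A minimal sketch of this per-layer setting, assuming each layer's threshold is chosen as the corresponding quantile of that layer's weight magnitudes:

```python
import numpy as np

def per_layer_thresholds(layers, target_sparsity):
    """For each layer's weight array, return the magnitude below which a
    fraction `target_sparsity` of that layer's weights would be pruned."""
    return [np.quantile(np.abs(w), target_sparsity) for w in layers]
```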

Basic comparison. Starting from the baseline model, we first test the performance of the following three pruning processes: (a) the straightforward magnitude-based pruning (Han et al., 2015b), i.e., replacing drop out and drop in in Algorithm 1 by deleting all the unimportant weights, denoted by Traditional Pruning; (b) the pruning process with drop out only, i.e., no drop in in Algorithm 1, denoted by Drop out Pruning; (c) the full Drop Pruning, i.e., Algorithm 1.

4.1 VGG-16 on CIFAR10

As shown in Figure 1 (a), we first need to train a high-accuracy baseline model; we train it for a number of epochs to obtain its baseline test accuracy. In the pruning process, the batch size, learning rate and learning policy are set the same as in the baseline training process.

All the basic comparison results are shown in Table 1 and Figure 2 (a). These results clearly show that both drop out and drop in in Drop Pruning contribute significantly to achieving a better pruned model, that is, better accuracy at the same model size, or a smaller model size at the same accuracy. In addition, we can clearly see the accuracy improvement during the Drop Pruning process, especially at small target sparsities, which demonstrates that Drop Pruning can also be used as a technique to handle overfitting, like dropout (Srivastava et al., 2014) or DSD (Han et al., 2016).

Target sparsity
Compression rate
VGG-16 Traditional Pruning
Drop out Pruning
Drop Pruning
LeNet-5 Traditional Pruning
Drop out Pruning
Drop Pruning
Table 1: The test accuracy (%) under varying target sparsities of Traditional Pruning (Han et al., 2015b), Drop out Pruning (Drop Pruning only with drop out) and Drop Pruning for VGG-16 on CIFAR10 and LeNet-5 on MNIST. The results of Drop out Pruning and Drop Pruning are the best ones over multiple trials. The best results at each fixed target sparsity are in bold and the results with no accuracy loss relative to the baseline model are underlined.

Next, we compare Drop Pruning with some state-of-the-art pruning methods, as shown in Table 2. To achieve the same test accuracy as the baseline model, Drop Pruning can compress the model by about 11.1× (see Table 2 for the corresponding result of Drop out Pruning). In addition, to reach a higher test accuracy, compared with the compression achieved by the filter pruning in (Li et al., 2016), Drop Pruning and Drop out Pruning still achieve considerable compression (Table 2).

Methods Accuracy Params. Compression
Baseline
Traditional Pruning (Han et al., 2015b)
Variational dropout (Kingma et al., 2015)
Slimming (Liu et al., 2017)
DropBack (Golub et al., 2018)
Drop out Pruning
Drop Pruning
Filter pruning in (Li et al., 2016)
Drop out Pruning
Drop Pruning
Table 2: The comparison of different compressed models in terms of test accuracy, number of parameters and final compression rate for VGG-16 on CIFAR10.
Methods Accuracy Params. Compression
Baseline
Traditional Pruning (Han et al., 2015b)
Network trimming (Hu et al., 2016)
Drop out Pruning
Drop Pruning
Neuron Pruning (Rueda et al., 2017)
Drop Pruning
Density-diversity Penalty (Wang et al., 2016)
Coarse Pruning (Anwar & Sung, 2016)
Drop out Pruning
Drop Pruning
Table 3: The comparison of different compressed models in terms of test accuracy, number of parameters and final compression rate for LeNet-5 on MNIST.

4.2 LeNet-5 on MNIST

Now we evaluate Drop Pruning on the MNIST handwritten digits using LeNet-5. The baseline model was trained for a number of epochs to reach its baseline test accuracy. In Table 1, we also compare the performance of Traditional Pruning, Drop out Pruning and Drop Pruning across varying target sparsities. We also compare Drop Pruning with other state-of-the-art pruning methods in Table 3. We find that Drop Pruning behaves similarly to the VGG-16 on CIFAR10 case.

Next, we perform a statistical analysis of the test accuracy over multiple Drop Pruning runs, as shown in Figure 7. These results are preliminary, but we can still estimate that, under a low target sparsity, we have a higher probability of obtaining a better pruned model than under a high target sparsity.

Figure 7: Statistical analysis of the test accuracy over multiple Drop Pruning runs under different target sparsities for LeNet-5 on MNIST. Under a low target sparsity, we have a higher probability of obtaining a better pruned model than under a high target sparsity.

5 Conclusion

This paper proposed a novel network pruning strategy, named Drop Pruning. Drop Pruning consists of three steps: drop out some unimportant weights, drop in some pruned weights and retrain. Drop out reduces the influence of locally judging the weights' importance by their magnitudes, while drop in gives the deleted weights a chance to come back. Drop Pruning shares a similar spirit with dropout, with a stochastic algorithm for Integer Optimization and with the DSD training technique, as discussed above. Drop Pruning can significantly reduce overfitting while compressing the network. Experimental results show that Drop Pruning can outperform state-of-the-art pruning methods on many benchmark pruning tasks, which may provide new insights into model compression. This research is at an early stage. First, the experimental results deserve deeper investigation, including different pruning tasks and different importance judgments. Second, the idea of drop out and drop in may also be useful in quantization. Third, Drop Pruning may provide insights into Integer Optimization, which also deserves both numerical and theoretical analysis.

6 Acknowledgements

This work was supported by the Innovation Foundation of Qian Xuesen Laboratory of Space Technology. The authors thank Dr. Wuyang Li and Dr. Haidong Xie for their careful proofreading.

References