In recent years, various kinds of deep neural networks (DNNs) have dramatically improved accuracy in many computer vision tasks, from basic image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014b; He et al., 2016) to more advanced applications, e.g., object detection (Liu et al., 2016) and semantic segmentation (Noh et al., 2015). However, these networks generally contain tens of millions of parameters, leading to large storage requirements and many floating-point operations, which makes it difficult to deploy DNNs on mobile platforms with limited memory and processing units (Canziani et al., 2016; Cheng et al., 2017).
One way to address this issue is model compression, since such models are typically heavily overparameterized (Ullrich et al., 2017; Molchanov et al., 2017). Various approaches have been proposed to compress the model, including quantization (Wu et al., 2016), parameter sharing (Chen et al., 2015), pruning (Han et al., 2015b), low-rank factorization (Lebedev et al., 2014) and knowledge distillation (Ba & Caruana, 2014; Hinton et al., 2015). Among these methods, pruning stands out because it can avoid accuracy loss even at a high compression ratio. As reported in (Han et al., 2015b), Han et al. reduced the model size without accuracy loss by removing all the weights below a threshold and retraining the sparse model. Specifically, as shown in Figure 1 (a)(b), starting from a baseline model, i.e., the uncompressed model (denoted by a vector), a traditional pruning process first deletes some unimportant weights (entries of the vector) and then retrains the model. After deleting and retraining several times, the pruning process outputs the pruned model (a smaller vector). The pruning process in Figure 1 (a)(b) raises two issues:
Which weights are unimportant? A common approach is to determine each weight’s importance at each pruning step, for instance by its magnitude (Han et al., 2015b). However, since the interconnections among the weights are complicated, a weight’s importance may change dramatically during pruning. In other words, the importance is only a local judgment: weights that are less important now may become more important later.
Once pruned, there is no chance to come back. If we view pruning as an Integer Optimization problem, i.e., deciding whether each weight should be deleted or not, the pruning process in Figure 1 (a)(b) forces the search domain to become smaller and smaller: the pruned weights have no chance to come back, so the optimization process has no chance to escape from a local minimum. (Here, a local minimum is a pruning strategy, i.e., the one obtained when solving the corresponding Integer Optimization problem.)
To address the above two issues, this paper proposes a novel pruning strategy, named Drop Pruning, which introduces stochastic optimization into pruning, i.e., it prunes the weights with some probability. Drop Pruning also follows the standard iterative prune-retrain procedure, with a drop strategy at each pruning step: we delete the unimportant weights with some probability, named drop out, and recover some of the deleted weights with some probability, named drop in. For instance, as shown in Figure 1 (a)(c), at the second and at the third pruning step, some weights are dropped out while a previously deleted weight is dropped in. In the end, Drop Pruning outputs a pruned model with the same number of weights as the pruning process in Figure 1 (a)(b), but at different locations. In Drop Pruning, drop out reduces the influence of judging the weights’ importance only locally by magnitude, which addresses the first issue, while drop in gives the deleted weights a chance to come back, which addresses the second issue. Figure 2 shows the improvement of the proposed method (denoted by Drop out Pruning and Drop Pruning) over the method proposed in (Han et al., 2015b) (denoted by Traditional Pruning).
Figure 2: (a) VGG-16 on CIFAR10. (b) LeNet-5 on MNIST.
In conclusion, the contributions of this paper can be summarized as follows:
A novel pruning strategy, named Drop Pruning, is proposed to handle two key issues of traditional pruning methods (Han et al., 2015b), namely the local importance judgment and the irretrievable pruning process, so that it achieves a better pruned model, i.e., better accuracy at the same model size or a smaller model at the same accuracy, as shown in Figure 2.
An idea similar to dropout (Srivastava et al., 2014) is introduced into pruning, but with a different intention: dropout trains different “thinned” networks and uses their averaged outputs at inference, while Drop Pruning tries to directly find the best “thinned” network and uses that one at inference. Drop Pruning can also be seen as a technique to handle over-fitting, although it must start from a high-accuracy baseline model.
By formulating the pruning problem as an Integer Optimization problem, we introduce randomness into the solving procedure, which improves the performance of the proposed Drop Pruning. A similar idea has proved effective in solving another Integer Optimization problem (Sun et al., 2018).
Compared with the Dense-Sparse-Dense (DSD) training technique (Han et al., 2016), Drop Pruning also introduces a similar Sparse-Dense action, named drop in, which leads the model to escape from a local minimum with some probability (here, a local minimum is a model, i.e., a group of weights, that minimizes the loss in training DNNs). This may improve the performance of DNNs even during the pruning process, as shown in Figure 2.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the implementation details of drop out and drop in and, in particular, details the relations with dropout, Integer Optimization and Dense-Sparse-Dense training. Section 4 experimentally analyses Drop Pruning and Section 5 draws the conclusions.
2 Related work
Pruning the weights of a neural network is a straightforward approach to reducing its complexity. An early pruning method, Biased Weight Decay (Hanson & Pratt, 1989), tried to choose minimal representations during back-propagation. Optimal Brain Damage (LeCun et al., 1990) and Optimal Brain Surgeon (Hassibi & Stork, 1993) were then proposed; they used the Hessian of the loss function to prune connections and suggested that using the Hessian yields higher accuracy than magnitude-based pruning. Since the Hessian is computationally intensive, especially for large networks, Han et al. (Han et al., 2015b) proposed to reduce the network size by magnitude-based pruning and introduced the retraining technique. They also combined this pruning scheme with quantization and Huffman coding to achieve a higher compression ratio (Han et al., 2015a). Guo et al. (Guo et al., 2016) proposed dynamic network surgery, which incorporates weight splicing into the whole pruning process to avoid incorrect pruning.
In recent years, group-wise brain damage (Lebedev & Lempitsky, 2016a) and the layer-wise optimal brain surgeon (Dong et al., 2017) were also proposed to compress deep network structures. Li et al. (Li et al., 2016) used the ℓ1-norm of the filters to prune unimportant filters. Luo et al. (Luo et al., 2017) presented ThiNet, which determines whether a filter can be pruned from the outputs of its next layer. A first-order gradient-based strategy was proposed for pruning convolutional neural networks (Molchanov et al., 2016), which is a computationally efficient procedure verified by transfer learning experiments. Other filter pruning methods for convolutional neural networks can be found in (Mathieu et al., 2013; Lavin & Gray, 2016; Li et al., 2016).
Another growing line of work in pruning directly trains compact DNNs with sparsity constraints. The work in (Lebedev & Lempitsky, 2016b) imposed a sparsity constraint on the filters to prune the convolution kernels in a group-wise fashion. A group-sparse regularizer was also introduced in (Zhou et al., 2016) to learn compact filters during training. Recently, an L0 regularizer (Louizos et al., 2017) was proposed to prune the model during training by encouraging weights to become exactly zero. Golub et al. (Golub et al., 2018) proposed a pruning strategy that operates both during and after training by constraining the total number of weights using the gradients rather than the magnitudes.
Drop Pruning is proposed to directly prune the weights of a dense high-accuracy model, but with a different pruning strategy. The proposed approach also follows the standard iterative prune-retrain procedure (Han et al., 2015b). The detailed differences are discussed in the next section.
3 Drop Pruning
Denote a DNN model by a weight vector and let a binary vector of the same size record the state of each weight. A pruned model is described by the pair of them, in which the entries of the binary vector indicate the states of the model’s weights. We also use the all-ones vector of the same size, the dimensionality of a vector, and the set containing the locations of the ones in a binary vector. Retraining a pruned model means retraining only the un-pruned weights, i.e., those indicated by the ones of the binary vector.
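As a minimal sketch of this notation, assuming NumPy arrays: the names `w` (flattened weights), `mask` (binary vector), `kept` (locations of ones) and `masked_sgd_step` are hypothetical and chosen only for illustration, and the sparsity convention shown (fraction of zero entries) is one common choice rather than a definition taken from the text.

```python
import numpy as np

# Hypothetical names: `w` is the flattened weight vector, `mask` the binary state vector.
w = np.array([0.8, -0.05, 0.3, -0.6, 0.01])
mask = np.array([1, 0, 1, 1, 0])

pruned_w = w * mask                          # pruned model: pruned entries are zeroed
kept = set(np.flatnonzero(mask).tolist())    # locations of the ones in the mask
sparsity = 1.0 - mask.sum() / mask.size      # fraction of pruned weights (assumed convention)

def masked_sgd_step(w, mask, grad, lr=0.1):
    """One retraining step that only updates the un-pruned weights."""
    return (w - lr * grad) * mask
```

Multiplying by the mask after every update keeps pruned weights at zero, which is one simple way to retrain only the un-pruned part of the model.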
3.2 Drop out
A natural way to prune is to delete the unimportant weights and keep the important ones. As discussed in Section 1, the key point is finding the unimportant weights, because the weights’ importance may change dramatically during the pruning process. In the traditional pruning method (Han et al., 2015b), starting from the baseline model, i.e., with the binary vector set to the all-ones vector, each pruning step first finds the unimportant weights of the current model, i.e., those whose magnitude is below a predefined threshold (Equation (1)), and then updates the binary vector by setting its entries at those locations to zero (Equation (2)). The threshold can vary across pruning steps and layers. After deleting the unimportant weights, we retrain the pruned model. The binary vector then evolves as shown in Figure 3 (a), and it is easy to check that the set of kept weights at each step is contained in that of the previous step, i.e., the pruned model is a subset of the previous one.
The drop out of Drop Pruning introduces randomness into (2) when judging the importance of weights. The motivation is simple: “Weights with smaller magnitude are not necessarily less important, but are more likely to be less important.” Similar to the idea in dropout (Srivastava et al., 2014), we introduce a random vector into the update of the binary vector (Equation (3)). Its entries are independent Bernoulli random variables with a given drop out probability, so that each unimportant weight is deleted only with that probability. Drop out then lets the binary vector evolve as shown in Figure 3 (b). However, as in Figure 3 (a), the kept weights at each step are still a subset of those at the previous step. Thus, if the pruning process falls into a local minimum, it has no chance to escape.
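The drop out step described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function name `drop_out_step` and the parameter names `threshold` and `p_out` are hypothetical, and the convention that a Bernoulli draw of "success" means "delete" is chosen here for concreteness.

```python
import numpy as np

def drop_out_step(w, mask, threshold, p_out, rng):
    """Delete each currently kept, low-magnitude weight with probability p_out."""
    unimportant = (np.abs(w) < threshold) & (mask == 1)
    dropped = unimportant & (rng.random(w.shape) < p_out)
    new_mask = mask.copy()
    new_mask[dropped] = 0
    return new_mask

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
mask = np.ones(1000, dtype=int)

det = drop_out_step(w, mask, threshold=0.5, p_out=1.0, rng=rng)  # traditional pruning
sto = drop_out_step(w, mask, threshold=0.5, p_out=0.5, rng=rng)  # stochastic drop out
```

With `p_out=1.0` the step reduces to deterministic magnitude pruning. In both cases the new mask is entrywise no larger than the old one, which is the subset property noted in the text.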
3.3 Drop in
The drop in of Drop Pruning is proposed to overcome the problem of falling into a local minimum during the pruning process. We introduce a second vector of independent Bernoulli random variables with a given drop in probability. Then, when dropping out the unimportant weights by (3), we also drop in the pruned weights with that probability (Equation (4)), which gives the model a chance to escape from a local minimum, like the Sparse-Dense action in DSD training (Han et al., 2016). As shown in Figure 3 (c), since drop in reloads some pruned weights into the model, the kept weights at a step are in general no longer a subset of those at the previous step. In practice, we choose the drop out and drop in probabilities so that the number of kept weights still decreases, i.e., the pruned model is smaller than the previous one. As with the threshold, the two probabilities may also vary across pruning steps and layers.
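The update rules of this section can be written out explicitly. The following is one possible formalization and not the authors' original equations: the symbols $w^{(k)}$ (weights after step $k$), $m^{(k)}$ (binary vector), $\Omega(\cdot)$ (locations of ones), $U^{(k)}$ (unimportant set), $\lambda$ (threshold), $n$ (number of weights) and $p_{\text{out}}$, $p_{\text{in}}$ (drop probabilities) are all assumed here for illustration.

```latex
\begin{align*}
U^{(k)} &= \{\, i \in \Omega(m^{(k)}) : |w^{(k)}_i| < \lambda \,\}
  && \text{(unimportant weights)} \\
m^{(k+1)}_i &=
  \begin{cases} 0, & i \in U^{(k)} \\ m^{(k)}_i, & \text{otherwise} \end{cases}
  && \text{(traditional pruning)} \\
m^{(k+1)}_i &=
  \begin{cases} 1 - r_i, & i \in U^{(k)} \\ m^{(k)}_i, & \text{otherwise} \end{cases}
  \quad r_i \sim \mathrm{Bernoulli}(p_{\text{out}})
  && \text{(drop out)} \\
m^{(k+1)}_i &=
  \begin{cases} 1 - r_i, & i \in U^{(k)} \\ s_i, & i \notin \Omega(m^{(k)}) \\ 1, & \text{otherwise} \end{cases}
  \quad s_i \sim \mathrm{Bernoulli}(p_{\text{in}})
  && \text{(drop out and drop in)} \\[4pt]
\mathbb{E}\,\bigl|\Omega(m^{(k+1)})\bigr|
  &= \bigl|\Omega(m^{(k)})\bigr| - p_{\text{out}}\,\bigl|U^{(k)}\bigr|
     + p_{\text{in}}\,\bigl(n - \bigl|\Omega(m^{(k)})\bigr|\bigr)
\end{align*}
```

Under these assumptions the model shrinks in expectation whenever $p_{\text{in}}\,(n - |\Omega(m^{(k)})|) < p_{\text{out}}\,|U^{(k)}|$, which is one way to read the requirement that the two probabilities be chosen so that the pruned model is smaller than the previous one.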
The Drop Pruning algorithm is summarized in Algorithm 1. By the definition of the binary vector, the proportion of its zero entries represents the sparsity of the pruned model. Given a baseline model and a target sparsity, the algorithm outputs a pruned model by iterating the sequence drop out, drop in and retrain until the target sparsity is reached.
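The iterative procedure can be sketched on a toy problem. This is a sketch under stated assumptions rather than Algorithm 1 itself: the model is a linear least-squares fit instead of a DNN, retraining is plain gradient descent, dropped-in weights restart from zero, and all names (`retrain`, `drop_pruning`, `threshold`, `p_out`, `p_in`, `max_steps`) are hypothetical.

```python
import numpy as np

def retrain(w, mask, X, y, steps=200, lr=0.05):
    """Gradient descent on the mean squared error; pruned weights stay at zero."""
    w = w * mask
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def drop_pruning(w, X, y, target_sparsity, threshold, p_out, p_in, rng, max_steps=100):
    """Iterate drop out, drop in and retrain until the target sparsity is reached."""
    mask = np.ones(w.shape, dtype=int)
    for _ in range(max_steps):
        if 1.0 - mask.mean() >= target_sparsity:
            break
        kept = mask == 1
        unimportant = kept & (np.abs(w) < threshold)
        out = unimportant & (rng.random(w.shape) < p_out)  # drop out
        back = ~kept & (rng.random(w.shape) < p_in)        # drop in
        mask[out] = 0
        mask[back] = 1  # re-enters with value zero, then gets retrained
        w = retrain(w, mask, X, y)
    return w * mask, mask

# Toy baseline: least-squares fit where only the first 5 of 20 weights matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:5] = [2.0, -1.5, 1.0, -2.0, 1.5]
y = X @ w_true + 0.01 * rng.normal(size=200)
w_base = np.linalg.lstsq(X, y, rcond=None)[0]

w_pruned, mask = drop_pruning(w_base, X, y, target_sparsity=0.6,
                              threshold=0.5, p_out=0.5, p_in=0.05, rng=rng)
print("kept weights:", np.flatnonzero(mask))
```

Keeping `p_in` well below `p_out` is what makes the mask shrink on average here; with a larger `p_in` the loop may run until `max_steps` without reaching the target.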
Dynamic network surgery (DNS) (Guo et al., 2016) also introduced a similar idea of reloading some pruned weights, which was named splicing in their paper. The importance they assign to the weights still depends on magnitude, and they proposed two thresholds separated by a small margin to determine the state of each weight during pruning: pruned (magnitude below the lower threshold), persisted (magnitude above the upper threshold) or left in the same state as at the last pruning step (magnitude between the two thresholds). After pruning, DNS retrained the whole network, not only the un-pruned important weights. Thus, a pruned weight (whose magnitude was previously below the lower threshold) has a chance to come back if its magnitude exceeds the upper threshold after retraining. In the retraining procedure, DNS reloads all the pruned weights and the splicing is almost deterministic, while drop in is stochastic. In addition, choosing the two thresholds in DNS (which vary across layers and pruning steps) is tricky and DNS only performs the actual pruning at the last step, whereas Drop Pruning needs only two drop probabilities to move gradually toward a pruned model with the desired sparsity.
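For contrast with the stochastic drop in, the two-threshold rule described above can be sketched as a deterministic mask update. The function name `dns_mask_update` and the threshold names `a` (lower) and `b` (upper) are hypothetical, and this covers only the state rule as summarized in this section, not the full DNS training procedure.

```python
import numpy as np

def dns_mask_update(w, prev_mask, a, b):
    """Two-threshold rule: prune below a, keep above b, otherwise keep the previous state."""
    mag = np.abs(w)
    return np.where(mag < a, 0, np.where(mag > b, 1, prev_mask))

w = np.array([0.05, 0.25, 0.25, 0.9])
prev = np.array([1, 1, 0, 0])
new = dns_mask_update(w, prev, a=0.2, b=0.3)
print(new)
```

The last entry illustrates splicing: a previously pruned weight returns because its magnitude after retraining exceeds the upper threshold. The two middle entries have the same magnitude but different outcomes, since weights inside the margin keep their previous state.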
Remark. The importance judgment at each pruning step, i.e., obtaining the set of unimportant weights, can be improved in two ways: (1) consider the gradient rather than the magnitude; (2) consider a group of weights, like the kernels in CNNs. These modifications deserve deeper experimental investigation in the future.
3.4 Relation with Dropout
Dropout (Hinton et al., 2012; Srivastava et al., 2014; Bouthillier et al., 2015) or dropconnect (Wan et al., 2013) is a simple but efficient way to prevent DNNs from over-fitting. The main idea of dropout or dropconnect is to randomly drop units or connections while training a neural network. As shown in Figure 4 (a), during training, starting from a random initial model, dropout samples from an exponential number of different “thinned” networks. At inference (see Figure 4 (b)), it uses the whole network with smaller weights to approximate the effect of averaging the predictions of all these thinned networks.
Compared with dropout, Drop Pruning (see Figure 4 (c)) starts from a high-accuracy baseline model and actually drops out weights during training, i.e., it chooses a “thinned” network. Since it only drops out some unimportant weights, the accuracy loss may be negligible, so the chosen “thinned” network is likely a good one. However, if we keep dropping out, the “thinned” network becomes smaller and smaller, which may severely affect performance. Drop in is introduced to give the process a chance of reaching another good “thinned” network. After several rounds of drop out and drop in, we may reach the most “thinned” network and use only this sparse one at inference, as shown in Figure 4 (d).
Targeted Dropout. Recently, a novel dropout strategy, named targeted dropout, was proposed by Gomez et al. (Gomez et al., 2018). We were unaware at the time we developed this work that Gomez et al. were also working on a similar project combining dropout and pruning. Targeted dropout is a strategy for post hoc pruning of neural network weights and units that builds the pruning mechanism directly into learning. The strong experimental results reported there support the proposed idea. At each weight update, targeted dropout first selects the lowest-magnitude portion of the weights as candidates and then drops those entries with a given drop rate, which is almost mathematically the same strategy as the drop out in Drop Pruning, i.e., (3). The main difference is that in Drop Pruning the weights are actually pruned, while in targeted dropout the dropped weights come back at the next weight update. Thus targeted dropout still follows the training process in Figure 4 (a), with different “thinned” networks.
We can also observe that the objective of targeted dropout is to train a network that is robust to pruning, i.e., starting from a random initial model, while the objective of Drop Pruning is to directly prune a high-accuracy model, i.e., starting from a learned dense model. In a way, targeted dropout is a kind of “Pruning based Dropout” that aims at a more “sparse” dense model, while Drop Pruning is a kind of “Dropout based Pruning” that aims at an exactly sparse model. The effectiveness of targeted dropout supports the value of combining pruning and dropout, and Drop Pruning follows a similar idea in a different direction. An interesting research direction is to first learn a dense model by targeted dropout and then prune it by Drop Pruning, which may be considered in the future.
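The difference between a temporary and a persistent mask can be made concrete with a small sketch of the targeted dropout selection described above. It is simplified to a flat weight vector, and the names `targeted_dropout_mask`, `gamma` (targeted fraction) and `alpha` (drop rate) are assumptions for illustration, not the API of any library.

```python
import numpy as np

def targeted_dropout_mask(w, gamma, alpha, rng):
    """Temporary mask: the lowest-magnitude fraction gamma is dropped with rate alpha."""
    k = int(gamma * w.size)
    candidates = np.argsort(np.abs(w))[:k]
    mask = np.ones(w.shape, dtype=int)
    drop = rng.random(k) < alpha
    mask[candidates[drop]] = 0
    return mask

rng = np.random.default_rng(0)
w = rng.normal(size=100)

# A fresh mask is sampled at every weight update and applied to the full weights,
# so a weight dropped in one update is available again in the next.
m1 = targeted_dropout_mask(w, gamma=0.5, alpha=0.5, rng=rng)
m2 = targeted_dropout_mask(w, gamma=0.5, alpha=0.5, rng=rng)
```

In Drop Pruning, by contrast, the mask is carried from one pruning step to the next, so a dropped weight stays pruned unless drop in restores it.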
3.5 Relation with Integer Optimization
Integer and mixed integer constrained optimization problems (Karlof, 2005), in which some or all of the variables are restricted to be integers, are NP-hard in general. Here we introduce a related work (Sun et al., 2018), proposed by Sun et al. to solve the minimal weighted vertex cover (MWVC) problem, which is an Integer Optimization problem. They introduced randomness into the optimization process and achieved state-of-the-art performance in experiments, together with some theoretical results. Their work shows that stochastic optimization is not only a powerful technique in many other optimization problems, e.g., Genetic Algorithm (Banzhaf et al., 1998), Stochastic PCA or SVD (Shamir, 2015), Stochastic Gradient Descent (Ruder, 2016) and Random Coordinate Descent (Nesterov, 2012), but also a better choice in the Integer Optimization problem. Randomness is often a powerful technique, even when applied only to generating initial weights (He et al., 2018). We remark that Genetic Algorithm (Banzhaf et al., 1998) can also be applied to Integer Optimization, but it needs a huge number of samples for each generation, which makes it almost impossible to handle a huge system, like the MWVC problem considered in (Sun et al., 2018) or the pruning of a huge neural network.
As shown in Figure 5, the well-known MWVC problem is: given a graph, find a set of vertices of minimal total weight that touches all the edges of the graph (Taoka & Watanabe, 2012). The minimization problem is formulated with a global objective function that combines the weight sum of the selected vertices and a penalty on uncovered edges. When the nodes have equal weights, it reduces to the so-called minimal vertex cover (MVC) problem. The MWVC problem has practical significance in computational biology, network security, large scale integration design and wireless communication.
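An objective of this kind can be illustrated on a four-vertex path graph. The exact form used by Sun et al. is not given in this text, so the additive form below (weight sum plus a fixed penalty per uncovered edge) and the names `mwvc_cost`, `select`, `weights`, `edges` and `penalty` are assumptions. The brute-force search is only feasible for tiny graphs and is used here just to show the integer search space.

```python
from itertools import product

def mwvc_cost(select, weights, edges, penalty):
    """Weighted sum of selected vertices plus a penalty per uncovered edge."""
    cover_cost = sum(w for s, w in zip(select, weights) if s)
    uncovered = sum(1 for u, v in edges if not (select[u] or select[v]))
    return cover_cost + penalty * uncovered

weights = [1.0, 2.0, 1.0, 3.0]
edges = [(0, 1), (1, 2), (2, 3)]   # path graph 0 - 1 - 2 - 3
penalty = 10.0                      # larger than any vertex weight

# Enumerate all 2^4 binary selections and keep the cheapest one.
best = min(product([0, 1], repeat=len(weights)),
           key=lambda s: mwvc_cost(s, weights, edges, penalty))
print(best, mwvc_cost(best, weights, edges, penalty))
```

Choosing the penalty above the largest vertex weight ensures that covering an edge is always cheaper than leaving it uncovered, so the minimizer is a valid vertex cover.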
In (Sun et al., 2018), the authors proposed a distributed algorithm for solving the MWVC problem, where each player (each vertex in the graph) simultaneously updates its action (a binary choice indicating whether the desired subset includes this vertex or not) by obeying a relaxed greedy rule followed by a mutation with some probability, i.e., a mutative action is randomly drawn from the memory (the history of actions). They found that if each player chooses the deterministic best response, the algorithm converges to a local minimum that depends on the initial states, whereas if each player chooses a random action from its memory, the algorithm converges to a better solution with high probability. The effectiveness and theoretical analysis of their algorithm demonstrate that: “Stochastic optimization is also an effective technique in handling the Integer Optimization problems”.
The pruning problem can also be formulated as an Integer Optimization problem: given an architecture, find a minimal set of weights, i.e., a pruned model, that achieves the same accuracy as the baseline model. The minimization problem is formulated with a global objective function that combines the training loss and the size of the pruned model. As also mentioned in (Srivastava et al., 2014), a neural network with n weights can be seen as a collection of 2^n possible thinned neural networks, so searching for the best one is an NP-hard Integer Optimization problem. The pruning process, i.e., Figure 1 (a)(c), can then be seen as an optimization process that solves this Integer Optimization problem. Similar to the idea of introducing randomness in (Sun et al., 2018), Drop Pruning imposes randomness by dropping out and dropping in the weights with some probability. In particular, drop in has a similar spirit to acting from the player’s memory in the distributed algorithm of (Sun et al., 2018).
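One way to make this formulation concrete is the scalarized objective below. It is an assumed form for illustration, not the authors' equation: $w$ denotes the weights, $m$ the binary vector, $\mathcal{L}$ the training loss, $\odot$ the entrywise product and $\mu > 0$ a hypothetical trade-off parameter.

```latex
\min_{\,m \in \{0,1\}^n,\; w \in \mathbb{R}^n} \; F(w, m)
  \;=\; \mathcal{L}(w \odot m) \;+\; \mu \sum_{i=1}^{n} m_i
```

The feasible set for $m$ has $2^n$ elements, one per thinned network. In this notation, traditional pruning restricts each step to masks with $m^{(k+1)} \le m^{(k)}$ entrywise, so the search domain can only shrink, while drop in relaxes that restriction.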
3.6 Relation with DSD
To overcome the overfitting problem in training large DNNs, Han et al. (Han et al., 2016) proposed a Dense-Sparse-Dense (DSD) training flow to regularize DNNs. They first trained a dense network (Dense), then pruned the unimportant weights and retrained (Sparse). Then they re-initialized the pruned weights to zero and retrained the whole network (Dense). The reported experiments show that the DSD training process can improve performance for a wide range of DNNs.
As shown in Figure 6 (a), during the training process, pruning and retraining (Sparse) may lead the model to escape from a local minimum and reach a better one after re-initializing the pruned weights and retraining (Dense). However, in DSD the model keeps its size (see the size of the red node) and escapes only once. Compared with DSD, the proposed pruning strategy can be seen as a similar training flow that also reduces the model size. In Drop Pruning, as shown in Figure 6 (b), drop out corresponds to the Dense-Sparse action, while drop in corresponds to the Sparse-Dense action. Combined with the imposed randomness, the iterative process of Drop Pruning may lead the model, with high probability, to escape from a local minimum to a better one, possibly even the global one.
In this section, we experimentally analyse the proposed method and apply it to popular neural networks such as VGG-16 (Simonyan & Zisserman, 2014a). All the experiments were run on a GPU cluster with 16 NVIDIA Tesla V100 GPUs (16GB) for training the baseline models and evaluating the performance of Drop Pruning, i.e., each GPU ran an individual Drop Pruning process. In this version, we only evaluate Drop Pruning for VGG-16 on CIFAR10 and LeNet-5 on MNIST. We will keep updating the results in the future.
Off-line pruning. Drop Pruning is a stochastic pruning strategy and each trial leads to a pruned model with a different test accuracy. The following results are the best ones over repeated trials, with the number of trials chosen separately for VGG-16 and LeNet-5 because VGG-16 is much more costly than LeNet-5. We use the best result to represent the ability of the proposed algorithm because pruning can be done off-line: once a pruned model with the target sparsity is obtained, applying it on-line is deterministic.
Target Sparsity. Here we simply require each layer of the pruned model to have the same target sparsity. This setting affects the performance at high target sparsity, since most of the weights are located in the fully connected layers. We believe that adjusting the target sparsity per layer, for instance with a higher target sparsity for the fully connected layers, can further improve the performance.
Basic comparison. Starting from the baseline model, we first test the performance of the following three pruning processes: (a) the straightforward magnitude-based pruning (Han et al., 2015b), i.e., replacing drop out and drop in in Algorithm 1 by deleting all the unimportant weights, denoted by Traditional Pruning; (b) the pruning process with drop out only, i.e., without drop in in Algorithm 1, denoted by Drop out Pruning; (c) Drop Pruning, i.e., Algorithm 1.
4.1 VGG-16 on CIFAR10
As shown in Figure 1 (a), a high-accuracy baseline model must be trained first. We train for a fixed number of epochs to obtain the baseline model and record its test accuracy. In the pruning process, the batch size, learning rate and learning policy are the same as in the baseline training process.
All the basic comparison results are shown in Table 1 and Figure 2 (a). These results show that both drop out and drop in contribute to a better pruned model, that is, better accuracy at the same model size or a smaller model at the same accuracy. In addition, the accuracy improves during the Drop Pruning process, especially under small target sparsities, which demonstrates that Drop Pruning can also be used as a technique to handle overfitting, like dropout (Srivastava et al., 2014) or DSD (Han et al., 2016).
Table 1: Basic comparison of Traditional Pruning, Drop out Pruning and Drop Pruning under varying target sparsities.
Next, we compare Drop Pruning with some state-of-the-art pruning methods, as shown in Table 2. The table reports the compression ratios that Drop Pruning and Drop out Pruning reach when matching the test accuracy of the baseline model, and those they reach at a higher target accuracy, where they are compared with the filter pruning of (Li et al., 2016).
Table 2: Comparison with state-of-the-art pruning methods for VGG-16 on CIFAR10. Methods listed: Traditional Pruning (Han et al., 2015b); Variational dropout (Kingma et al., 2015); Slimming (Liu et al., 2017); DropBack (Golub et al., 2018); Drop out Pruning; Filter pruning in (Li et al., 2016); Drop out Pruning.
Table 3: Comparison with state-of-the-art pruning methods for LeNet-5 on MNIST. Methods listed: Traditional Pruning (Han et al., 2015b); Network trimming (Hu et al., 2016); Drop out Pruning; Neuron Pruning (Rueda et al., 2017); Density-diversity Penalty (Wang et al., 2016); Coarse Pruning (Anwar & Sung, 2016); Drop out Pruning.
4.2 LeNet-5 on MNIST
Now we evaluate Drop Pruning on MNIST handwritten digits using LeNet-5. The baseline model is trained for a fixed number of epochs and its test accuracy is taken as the reference. In Table 1, we also compare the performance of Traditional Pruning, Drop out Pruning and Drop Pruning under varying target sparsities. We also compare Drop Pruning with other state-of-the-art pruning methods in Table 3. We find that Drop Pruning behaves similarly to the VGG-16 case on CIFAR10.
Next, we analyse the distribution of the test accuracy over repeated runs of Drop Pruning, as shown in Figure 7. These results are preliminary, but they already suggest that a low target sparsity gives a higher probability of obtaining a better pruned model than a high target sparsity.
This paper proposed a novel network pruning strategy, named Drop Pruning. Drop Pruning consists of three steps: drop out some unimportant weights, drop in some pruned weights, and retrain. Drop out reduces the influence of judging the weights’ importance only locally by magnitude, while drop in gives the deleted weights a chance to come back. Drop Pruning shares ideas with dropout, with a stochastic algorithm in Integer Optimization and with the DSD training technique, and these relations were discussed in detail. Drop Pruning can reduce overfitting while compressing the network. Experimental results show that Drop Pruning can outperform state-of-the-art pruning methods on the evaluated benchmark pruning tasks, which may provide some new insights into model compression. This research is in its early stage. First, the experimental results deserve deeper investigation, including different pruning tasks and different importance judgments. Second, the idea of drop out and drop in may also be useful in quantization. Third, Drop Pruning may provide an insight into Integer Optimization, which also deserves both numerical and theoretical analysis.
This work was supported by the Innovation Foundation of Qian Xuesen Laboratory of Space Technology. The authors thank Dr. Wuyang Li and Dr. Haidong Xie for their careful proofreading.
- Anwar & Sung (2016) Anwar, S. and Sung, W. Coarse pruning of convolutional neural networks with random masks. 2016.
- Ba & Caruana (2014) Ba, J. and Caruana, R. Do deep nets really need to be deep? In Advances in neural information processing systems, pp. 2654–2662, 2014.
- Banzhaf et al. (1998) Banzhaf, W., Nordin, P., Keller, R. E., and Francone, F. D. Genetic programming: an introduction, volume 1. Morgan Kaufmann San Francisco, 1998.
- Bouthillier et al. (2015) Bouthillier, X., Konda, K., Vincent, P., and Memisevic, R. Dropout as data augmentation. arXiv preprint arXiv:1506.08700, 2015.
- Canziani et al. (2016) Canziani, A., Paszke, A., and Culurciello, E. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
- Chen et al. (2015) Chen, W., Wilson, J., Tyree, S., Weinberger, K., and Chen, Y. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pp. 2285–2294, 2015.
- Cheng et al. (2017) Cheng, Y., Wang, D., Zhou, P., and Zhang, T. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.
- Dong et al. (2017) Dong, X., Chen, S., and Pan, S. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4857–4867, 2017.
- Golub et al. (2018) Golub, M., Lemieux, G., and Lis, M. Dropback: Continuous pruning during training. arXiv preprint arXiv:1806.06949, 2018.
- Gomez et al. (2018) Gomez, A. N., Zhang, I., Swersky, K., Gal, Y., and Hinton, G. E. Targeted dropout. In 32nd Conference on Neural Information Processing Systems, 2018.
- Guo et al. (2016) Guo, Y., Yao, A., and Chen, Y. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pp. 1379–1387, 2016.
- Han et al. (2015a) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
- Han et al. (2015b) Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143, 2015b.
- Han et al. (2016) Han, S., Pool, J., Narang, S., Mao, H., Gong, E., Tang, S., Elsen, E., Vajda, P., Paluri, M., Tran, J., et al. Dsd: Dense-sparse-dense training for deep neural networks. arXiv preprint arXiv:1607.04381, 2016.
- Hanson & Pratt (1989) Hanson, S. J. and Pratt, L. Y. Comparing biases for minimal network construction with back-propagation. In Advances in neural information processing systems, pp. 177–185, 1989.
- Hassibi & Stork (1993) Hassibi, B. and Stork, D. G. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in neural information processing systems, pp. 164–171, 1993.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- He et al. (2018) He, K., Girshick, R., and Dollár, P. Rethinking imagenet pre-training. arXiv preprint arXiv:1811.08883, 2018.
- Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Hinton et al. (2012) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
- Hu et al. (2016) Hu, H., Peng, R., Tai, Y.-W., and Tang, C.-K. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
- Karlof (2005) Karlof, J. K. Integer programming: theory and practice. CRC Press, 2005.
- Kingma et al. (2015) Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583, 2015.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
- Lavin & Gray (2016) Lavin, A. and Gray, S. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4013–4021, 2016.
- Lebedev & Lempitsky (2016a) Lebedev, V. and Lempitsky, V. Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2554–2564, 2016a.
- Lebedev & Lempitsky (2016b) Lebedev, V. and Lempitsky, V. Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2554–2564, 2016b.
- Lebedev et al. (2014) Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., and Lempitsky, V. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
- LeCun et al. (1990) LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In Advances in neural information processing systems, pp. 598–605, 1990.
- Li et al. (2016) Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
- Liu et al. (2016) Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. SSD: Single shot multibox detector. In European conference on computer vision, pp. 21–37. Springer, 2016.
- Liu et al. (2017) Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning efficient convolutional networks through network slimming. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2755–2763. IEEE, 2017.
- Louizos et al. (2017) Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312, 2017.
- Luo et al. (2017) Luo, J.-H., Wu, J., and Lin, W. ThiNet: A filter level pruning method for deep neural network compression. arXiv preprint arXiv:1707.06342, 2017.
- Mathieu et al. (2013) Mathieu, M., Henaff, M., and LeCun, Y. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.
- Molchanov et al. (2017) Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
- Molchanov et al. (2016) Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.
- Nesterov (2012) Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
- Noh et al. (2015) Noh, H., Hong, S., and Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pp. 1520–1528, 2015.
- Ruder (2016) Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
- Rueda et al. (2017) Rueda, F. M., Grzeszick, R., and Fink, G. A. Neuron pruning for compressing deep networks using maxout architectures. In German Conference on Pattern Recognition, pp. 177–188. Springer, 2017.
- Shamir (2015) Shamir, O. A stochastic PCA and SVD algorithm with an exponential convergence rate. In International Conference on Machine Learning, pp. 144–152, 2015.
- Simonyan & Zisserman (2014a) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014a.
- Simonyan & Zisserman (2014b) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014b.
- Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Sun et al. (2018) Sun, C., Sun, W., Wang, X., and Zhou, Q. Potential game theoretic learning for the minimal weighted vertex cover in distributed networking systems. IEEE Transactions on Cybernetics, PP(99):1–11, 2018.
- Taoka & Watanabe (2012) Taoka, S. and Watanabe, T. Performance comparison of approximation algorithms for the minimum weight vertex cover problem. In Circuits and Systems (ISCAS), 2012 IEEE International Symposium on, pp. 632–635. IEEE, 2012.
- Ullrich et al. (2017) Ullrich, K., Meeds, E., and Welling, M. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.
- Wan et al. (2013) Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus, R. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pp. 1058–1066, 2013.
- Wang et al. (2016) Wang, S., Cai, H., Bilmes, J., and Noble, W. Training compressed fully-connected networks with a density-diversity penalty. 2016.
- Wu et al. (2016) Wu, J., Leng, C., Wang, Y., Hu, Q., and Cheng, J. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4820–4828, 2016.
- Zhou et al. (2016) Zhou, H., Alvarez, J. M., and Porikli, F. Less is more: Towards compact CNNs. In European Conference on Computer Vision, pp. 662–677. Springer, 2016.