Progressive Weight Pruning of Deep Neural Networks using ADMM

10/17/2018 · Shaokai Ye et al., Syracuse University

Deep neural networks (DNNs), although achieving human-level performance in many domains, have very large model sizes that hinder their broader application on edge computing devices. Extensive research has been conducted on DNN model compression or pruning. However, most of the previous work took heuristic approaches. This work proposes a progressive weight pruning approach based on ADMM (Alternating Direction Method of Multipliers), a powerful technique to deal with non-convex optimization problems with potentially combinatorial constraints. Motivated by dynamic programming, the proposed method reaches extremely high pruning rates by using partial prunings with moderate pruning rates. Therefore, it resolves the accuracy degradation and long convergence time problems when pursuing extremely high pruning ratios. It achieves up to a 34× pruning rate on the ImageNet dataset and a 167× pruning rate on the MNIST dataset, significantly higher than those reported in prior work. Under the same number of epochs, the proposed method also achieves faster convergence and higher compression rates. The codes and pruned DNN models are released at bit.ly/2zxdlss


1 Introduction

Deep neural networks (DNNs) have achieved human-level performance in many application domains such as image classification (Krizhevsky et al., 2012), object recognition (LeCun et al., 1998; He et al., 2016), and natural language processing (Hinton et al., 2012; Dahl et al., 2012). At the same time, networks are growing deeper and larger for higher classification/recognition performance (i.e., accuracy) (Simonyan & Zisserman, 2015). However, the very large DNN model size increases the computation time of the inference phase. To make matters worse, the large model size hinders the deployment of DNNs on edge computing devices, which provide ubiquitous application scenarios for DNNs beyond cloud computing applications.

As a result, extensive research efforts have been devoted to DNN model compression, of which DNN weight pruning is a representative technique. Han et al. (2015) presented the first DNN weight pruning method, which heuristically and iteratively prunes the weights with small magnitudes and retrains the network model. Since then, more sophisticated heuristics have been proposed for DNN weight pruning, e.g., incorporating both weight pruning and growing (Guo et al., 2016), regularization (Wen et al., 2016), and genetic algorithms (Dai et al., 2017). Other improvement directions of weight pruning include trading off accuracy against compression rate, e.g., energy-aware pruning (Yang et al., 2017), and incorporating regularity, e.g., channel pruning (He et al., 2017) and structured sparsity learning (Wen et al., 2016).

While the weight pruning technique explores the redundancy in the number of weights of a network model, there are other sources of redundancy in a DNN model. For example, the weight quantization (Leng et al., 2017; Park et al., 2017; Zhou et al., 2017; Lin et al., 2016; Wu et al., 2016; Rastegari et al., 2016; Hubara et al., 2016; Courbariaux et al., 2015) and clustering (Zhu et al., 2017; Han et al., 2016) techniques explore the redundancy in the number of bits for weight representation. The activation pruning technique (Jung et al., 2018; Sharify et al., 2018) leverages the redundancy in the intermediate results. While our work focuses on weight pruning as the major DNN model compression technique, it is orthogonal to the other model compression techniques and might be integrated under a single ADMM-based framework for achieving more compact network models.

The majority of prior work on DNN weight pruning takes heuristic approaches to reduce the number of weights as much as possible while preserving the expressive power of the DNN model. One may then ask: how can we push the DNN model to the utmost sparsity without hurting accuracy, and what is the maximum compression rate achievable by weight pruning? Towards this end, Zhang et al. (2018b) took a first step by proposing an optimization-based approach that leverages ADMM (Alternating Direction Method of Multipliers), a powerful technique to deal with non-convex optimization problems with potentially combinatorial constraints. This direct ADMM-based weight pruning technique can be perceived as a smart DNN regularization in which the regularization target is dynamically changed in each ADMM iteration. As a result, it achieves higher compression (pruning) rates than heuristic methods.

Inspired by Zhang et al. (2018b), in this paper we propose a progressive weight pruning approach that incorporates both an ADMM-based algorithm and masked retraining, and proceeds progressively toward extremely high compression (pruning) rates with negligible accuracy loss. The contributions of this work are summarized as follows:

  • We make a key observation that when pursuing extremely high compression rates (say 150× for LeNet-5 or 30× for AlexNet), the direct ADMM-based weight pruning approach (Zhang et al., 2018b) cannot produce exactly sparse models upon convergence, in that many weights to be pruned are close to zero but not exactly zero. Certain accuracy degradation will result from this phenomenon if we simply set these weights to zero.

  • We propose and implement the progressive weight pruning paradigm that reaches an extremely high compression rate through multiple partial prunings with progressive pruning rates. This progressive approach, motivated by dynamic programming, helps to mitigate the long convergence time of direct ADMM pruning.

  • Extensive experiments compare our progressive weight pruning framework against many state-of-the-art weight pruning approaches; it achieves the highest compression rates in the literature while keeping the accuracy loss negligible. Our method achieves up to a 34× pruning rate for the ImageNet data set and a 167× pruning rate for the MNIST data set, with virtually no accuracy loss. Under the same number of epochs, the proposed method achieves notably faster convergence and higher compression rates than prior iterative pruning and direct ADMM pruning methods.

We provide codes (both Caffe and TensorFlow versions) and pruned DNN models (both for the ImageNet and MNIST data sets) at the link: bit.ly/2zxdlss.

2 The Progressive Weight Pruning Framework of DNNs

This section introduces the proposed progressive weight pruning framework using ADMM. Section 2.1 describes the overall framework. Section 2.2 discusses the ADMM-based algorithm for DNN weight pruning (Zhang et al., 2018b), which we will improve and incorporate into the progressive weight pruning framework. Section 2.3 proposes a direct improvement of masked retraining to restore accuracy. Section 2.4 provides the motivations and details of the proposed progressive weight pruning framework.

2.1 The Overall Framework

Figure 1: The overall progressive weight pruning framework including masked ADMM-based algorithm, thresholding mask updating, and masked retraining steps.

The overall framework of progressive weight pruning is shown in Figure 1. It applies the ADMM-based pruning algorithm to a pre-trained (uncompressed) network model. It then defines thresholding masks, with which the weights smaller than the thresholds are forced to zero. To restore accuracy, a masked retraining step is applied, which only updates the nonzero weights specified by the thresholding masks. The ADMM-based algorithm, thresholding mask updating, and masked retraining steps are performed for several rounds, and each round is considered a partial pruning, progressively pushing the DNN model toward the utmost sparsity. Note that in our progressive weight pruning framework, we change the ADMM-based algorithm into a “masked” version that reuses the partially pruned model by masking the gradients of the pruned weights, thereby preventing them from recovering to nonzero values and thus accelerating convergence.
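The following minimal Python sketch illustrates how these three steps chain across rounds. The helper functions admm_prune, threshold_masks, and masked_retrain are hypothetical stand-ins for the steps described above (they are not the released Caffe/TensorFlow implementation); their contents are sketched in Sections 2.2 and 2.3.

```python
def progressive_pruning_rounds(model, pruning_rates):
    """One round per target rate: masked ADMM-based pruning, thresholding
    mask update, and masked retraining. Helper functions are hypothetical."""
    masks = None  # nothing is pruned yet in the pre-trained model
    for rate in pruning_rates:  # e.g. [15, 18, 21, 24, ...] (x compression)
        # Masked ADMM-based algorithm: gradients of already-pruned weights
        # are masked so those weights stay exactly zero.
        model = admm_prune(model, target_rate=rate, masks=masks)
        # Thresholding mask update: keep the largest-magnitude weights in
        # each layer (per-layer budgets follow the target rate), zero the rest.
        masks = threshold_masks(model, target_rate=rate)
        # Masked retraining: restore accuracy by updating nonzero weights only.
        model = masked_retrain(model, masks)
    return model, masks
```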

2.2 ADMM-based Pruning Algorithm

Our ADMM-based pruning algorithm takes a pre-trained network as the input and outputs a pruned network model satisfying sparsity constraints. Consider an $N$-layer DNN, where the collection of weights in the $i$-th (convolutional or fully-connected) layer is denoted by $W_i$ and the collection of biases in the $i$-th layer is denoted by $b_i$. The loss function associated with the DNN is denoted by $f\big(\{W_i\}_{i=1}^N, \{b_i\}_{i=1}^N\big)$.

The DNN weight pruning problem can be formulated as:

$$\underset{\{W_i\},\{b_i\}}{\text{minimize}} \;\; f\big(\{W_i\}_{i=1}^N, \{b_i\}_{i=1}^N\big), \qquad (1)$$
$$\text{subject to} \;\; W_i \in S_i, \quad i = 1, \dots, N,$$

where $S_i = \{W_i : \mathrm{card}(W_i) \le \alpha_i\}$ and $\alpha_i$ is the desired number of weights in the $i$-th layer of the DNN. It is clear that $S_1, \dots, S_N$ are nonconvex sets, and it is in general difficult to solve optimization problems with nonconvex constraints.

The problem can be equivalently rewritten in a format without constraints, namely

$$\underset{\{W_i\},\{b_i\}}{\text{minimize}} \;\; f\big(\{W_i\}_{i=1}^N, \{b_i\}_{i=1}^N\big) + \sum_{i=1}^N g_i(W_i), \qquad (2)$$

where $g_i(\cdot)$ is the indicator function of $S_i$, i.e.,

$$g_i(W_i) = \begin{cases} 0 & \text{if } \mathrm{card}(W_i) \le \alpha_i, \\ +\infty & \text{otherwise.} \end{cases} \qquad (3)$$

The ADMM technique (Boyd et al., 2011) can be applied to solve the weight pruning problem by introducing auxiliary variables $Z_i$ and formulating it as:

$$\underset{\{W_i\},\{b_i\},\{Z_i\}}{\text{minimize}} \;\; f\big(\{W_i\}_{i=1}^N, \{b_i\}_{i=1}^N\big) + \sum_{i=1}^N g_i(Z_i),$$
$$\text{subject to} \;\; W_i = Z_i, \quad i = 1, \dots, N.$$

Through the augmented Lagrangian, the ADMM technique decomposes the weight pruning problem into two subproblems, which are solved iteratively until convergence. The first subproblem is:

$$\underset{\{W_i\},\{b_i\}}{\text{minimize}} \;\; f\big(\{W_i\}_{i=1}^N, \{b_i\}_{i=1}^N\big) + \sum_{i=1}^N \frac{\rho_i}{2} \big\| W_i - Z_i^k + U_i^k \big\|_F^2, \qquad (4)$$

where $Z_i^k$ and $U_i^k$ denote the auxiliary variable and the (scaled) dual variable of the $i$-th layer at ADMM iteration $k$, and $\rho_i$ is the penalty parameter. This subproblem is equivalent to the original DNN training plus an $L_2$ regularization term, and can be effectively solved using stochastic gradient descent with the same complexity as the original DNN training. Note that we cannot prove global optimality of the solution to subproblem (4), just as we cannot prove optimality of the solution to the original DNN training problem.
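To make the regularization term concrete, the numpy sketch below (a minimal illustration under the formulation above, not the released Caffe/TensorFlow code; loss_grads is a hypothetical input holding the gradients of the original loss) adds the penalty gradient $\rho_i (W_i - Z_i^k + U_i^k)$ to each layer's weight gradient before a plain SGD update.

```python
import numpy as np

def admm_sgd_step(weights, Z, U, loss_grads, rho=1e-3, lr=1e-3):
    """One SGD step on subproblem (4).

    weights, Z, U : dicts mapping layer name -> numpy array of the same shape
    loss_grads    : dict of gradients of the original loss w.r.t. each layer
    rho, lr       : illustrative penalty parameter and learning rate
    """
    for name, W in weights.items():
        # Gradient of (rho/2) * ||W - Z^k + U^k||_F^2 with respect to W.
        penalty_grad = rho * (W - Z[name] + U[name])
        weights[name] = W - lr * (loss_grads[name] + penalty_grad)
    return weights
```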

On the other hand, the second subproblem is:

$$\underset{\{Z_i\}}{\text{minimize}} \;\; \sum_{i=1}^N g_i(Z_i) + \sum_{i=1}^N \frac{\rho_i}{2} \big\| W_i^{k+1} - Z_i + U_i^k \big\|_F^2.$$

Since $g_i(\cdot)$ is the indicator function of the set $S_i$, the globally optimal solution to this subproblem can be explicitly derived as in Boyd et al. (2011):

$$Z_i^{k+1} = \Pi_{S_i}\big( W_i^{k+1} + U_i^k \big), \qquad (5)$$

where $\Pi_{S_i}(\cdot)$ denotes the Euclidean projection onto the set $S_i$. Note that $S_i$ is a nonconvex set, and computing the projection onto a nonconvex set is a difficult problem in general. However, the special structure of $S_i$ allows us to express this Euclidean projection analytically. Namely, the optimal solution (5) keeps the $\alpha_i$ largest-magnitude elements of $W_i^{k+1} + U_i^k$ and sets the rest to zero (Boyd et al., 2011).

Finally, we update the dual variable as $U_i^{k+1} = U_i^k + W_i^{k+1} - Z_i^{k+1}$. This concludes one iteration of ADMM.
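The projection in Eqn. (5) and the dual update take only a few lines in numpy. The sketch below keeps the num_keep largest-magnitude entries and zeroes the rest (a minimal illustration of the cardinality projection under the formulation above; ties at the threshold may retain a few extra entries).

```python
import numpy as np

def project_cardinality(A, num_keep):
    """Euclidean projection of A onto S = {X : card(X) <= num_keep}:
    keep the num_keep largest-magnitude entries, zero out the rest."""
    flat = np.abs(A).ravel()
    if num_keep >= flat.size:
        return A.copy()
    threshold = np.partition(flat, -num_keep)[-num_keep]
    return np.where(np.abs(A) >= threshold, A, 0.0)

def admm_z_and_dual_update(W, U, num_keep):
    """Z-update via Eqn. (5) followed by the scaled dual-variable update."""
    Z_new = project_cardinality(W + U, num_keep)  # Z^{k+1} = Pi_S(W^{k+1} + U^k)
    U_new = U + W - Z_new                         # U^{k+1} = U^k + W^{k+1} - Z^{k+1}
    return Z_new, U_new
```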

In the context of deep learning, the ADMM-based algorithm for DNN weight pruning can be understood as a smart DNN regularization technique (see Eqn. (4)), in which the regularization target $Z_i^k - U_i^k$ in the regularization term is dynamically updated in each ADMM iteration. This is one reason that the ADMM-based algorithm for weight pruning achieves higher performance than heuristic methods, other regularization techniques (Wen et al., 2016), and the Projected Gradient Descent technique (Zhang et al., 2018a).

2.3 Masked Retraining Step

Applying the ADMM-based pruning algorithm alone has limitations for high compression rates. At convergence, the pruned DNN model will not be exactly sparse, in that many weights to be pruned will be close to zero instead of being exactly equal to zero. This is due to the non-convexity of Subproblem 1 in the ADMM-based algorithm. Certain accuracy degradation will result from this phenomenon if we simply set those weights to zero. This accuracy degradation will be non-negligible for high compression rates.

Instead of waiting for the full convergence of the ADMM-based algorithm, a masked retraining step is proposed, which (i) terminates the ADMM iterations early, (ii) keeps the largest weights (in terms of magnitude) and sets the other weights to zero, and (iii) retrains the nonzero weights (with the zero weights masked) using the training data set. More specifically, masks are applied to the gradients of the zero weights, preventing them from being updated. Essentially, the ADMM-based algorithm provides a good starting point, and the masked retraining step then lets the remaining nonzero weights recover the classification accuracy.
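A minimal numpy sketch of the thresholding masks and the masked update is given below (framework-agnostic; loss_grads is again a hypothetical input, and the released Caffe/TensorFlow code applies the same idea inside the training framework).

```python
import numpy as np

def build_masks(weights, keep_counts):
    """Per-layer binary masks: 1 for the keep_counts[name] largest-magnitude
    weights of each layer, 0 for the weights that are pruned."""
    masks = {}
    for name, W in weights.items():
        k = keep_counts[name]
        threshold = np.partition(np.abs(W).ravel(), -k)[-k]
        masks[name] = (np.abs(W) >= threshold).astype(W.dtype)
    return masks

def masked_retrain_step(weights, masks, loss_grads, lr=1e-4):
    """One retraining step: gradients of pruned weights are masked out,
    so zero weights stay exactly zero."""
    for name, W in weights.items():
        weights[name] = (W - lr * loss_grads[name] * masks[name]) * masks[name]
    return weights
```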

Integrating masked retraining after the ADMM-based algorithm, a good compression rate can be achieved with reasonable training time. For example, we can achieve a 21× pruning rate without accuracy loss for AlexNet using a total of 417 epochs, much faster than the iterative weight pruning method of Han et al. (2016), which achieves a 9× pruning rate in a total of 960 epochs. Translated into training time, our training takes 72 hours on a single NVIDIA GTX 1080 Ti GPU, whereas the reported training time in Han et al. (2016) is 173 hours.

2.4 Progressive Weight Pruning

Although the ADMM-based pruning algorithm in Section 2.2 and the masked retraining step in Section 2.3 together can achieve state-of-the-art model compression (pruning) rates for many network models, we find limitations to this approach at extremely high pruning rates, for example at a 150× pruning rate for LeNet-5 or a 30× pruning rate for AlexNet.

Specifically, with a very high weight pruning rate, it takes a relatively long time for the ADMM-based algorithm to choose which weights to prune. For example, it is difficult for the ADMM-based algorithm to converge at a 30× pruning rate on AlexNet, but easy at a 21× pruning rate.

To overcome this difficulty, we propose the progressive weight pruning method. This technique is motivated by dynamic programming, achieving a high weight pruning rate by using partial pruning models with moderate pruning rates. We use Figure 2 as an example to show the process used to achieve a 30× weight pruning rate on AlexNet without accuracy loss. In Figure 2 (a), we start from three partial pruning models, with 15×, 18×, and 21× pruning rates, which can be directly derived from the uncompressed DNN model via the ADMM-based algorithm with masked retraining. To achieve a 24× weight pruning rate, we start from these three models and check which one gives the highest accuracy (suppose it is the 15× one). Because we start from partial pruning models, convergence is fast. We then replace the 15× partial pruning model with the 24× model to derive the 27× model; see Figure 2 (b). In this way we always maintain three partial results and limit the total search time. Suppose that this time the 18× pruning model results in the highest accuracy; we then replace it with the 27× one. Finally, in Figure 2 (c), we find that the 24× model gives the highest accuracy when pushing to the 30× pruning rate.

Figure 2: Illustration of the progressive weight pruning algorithm.

Note that during progressive weight pruning, when we reuse a partial pruning model in the ADMM-based algorithm, we use “masked” training: the gradients of the already pruned weights are masked to prevent them from recovering to nonzero values. In this way, the algorithm is encouraged to focus on pruning the remaining nonzero weights.
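The selection loop described above can be summarized in a short sketch (hypothetical helper names: masked_admm_prune_and_retrain stands for one masked ADMM pass followed by thresholding and masked retraining, and validate returns validation accuracy), assuming a pool of three candidates is kept at every step.

```python
def progressive_prune(candidates, target_rates):
    """Progressively reach high pruning rates from a pool of partially
    pruned models (e.g., {15: m15, 18: m18, 21: m21}), in the
    dynamic-programming spirit of Figure 2.

    candidates   : dict {pruning_rate: model} of partial pruning results
    target_rates : increasing rates to reach, e.g., [24, 27, 30]
    """
    for rate in target_rates:
        # Start the next partial pruning from the most accurate candidate.
        best_rate = max(candidates, key=lambda r: validate(candidates[r]))
        new_model = masked_admm_prune_and_retrain(candidates[best_rate],
                                                  target_rate=rate)
        # Keep the pool size fixed: replace the candidate we started from.
        del candidates[best_rate]
        candidates[rate] = new_model
    return candidates[max(candidates)]
```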

Figure 3: The value of loss function associated with AlexNet versus retraining steps for (a) ADMM-based algorithm plus masked retraining and (b) proposed progressive pruning algorithm.

Figure 3 shows the value of the loss function associated with AlexNet versus retraining steps for (a) the ADMM-based algorithm with masked retraining and (b) the proposed progressive pruning. Both methods target a 30× pruning rate. The ADMM-based algorithm with masked retraining performs one-round pruning to 30×, while the proposed progressive pruning performs multiple partial prunings (15× to 24× to 30×). We apply the same total number of iterations to both methods for a fair comparison. The total number of epochs is 730 in both cases, which is still lower than the 960 epochs in Han et al. (2016). We observe in Figure 3 that by using multiple partial prunings we achieve faster convergence and a lower loss.

3 Experimental Results and Discussions

Method Top-5 Acc. No. Para. Rate
Uncompressed 80.27% 61.0M 1×
Network Pruning (Han et al., 2015) 80.3% 6.7M 9×
Optimal Brain Surgeon (Dong et al., 2017) 80.0% 6.7M 9.1×
Low Rank and Sparse Decomposition (Yu et al., 2017) 80.3% 6.1M 10×
Fine-Grained Pruning (Mao et al., 2017) 80.4% 5.1M 11.9×
NeST (Dai et al., 2017) 80.2% 3.9M 15.7×
Dynamic Surgery (Guo et al., 2016) 80.0% 3.4M 17.7×
ADMM Pruning (Zhang et al., 2018b) 80.2% 2.9M 21×
Progressive Weight Pruning (BVLC Model) 80.2% 2.02M 30×
Progressive Weight Pruning (BVLC Model) 80.0% 1.97M 31×
Progressive Weight Pruning (CaffeNet Model) 80.2% 2.02M 30×
Progressive Weight Pruning (CaffeNet Model) 80.0% 1.97M 31×
Table 1: Comparisons of weight pruning results on AlexNet for ImageNet data set.
Pruning Rate Direct ADMM Pruning Progressive Weight Pruning
18× 80.3% 80.9%
21× 80.2% 80.8%
30× 76.7% 80.2%
Table 2: Top-5 accuracy of direct ADMM pruning (Zhang et al., 2018b) and progressive pruning at different pruning rates on AlexNet for ImageNet data set.
Method Top-5 Acc. No. Para. Rate
Uncompressed 88.7% 138M 1×
Network Pruning (Han et al., 2015) 89.1% 10.6M 13×
Optimal Brain Surgeon (Dong et al., 2017) 89.0% 10.3M 13.3×
Low Rank and Sparse Decomposition (Yu et al., 2017) 89.1% 9.2M 15×
ADMM Pruning (Zhang et al., 2018b) 88.7% 7.26M 19.5×
Progressive Weight Pruning 88.7% 4.6M 30×
Progressive Weight Pruning 88.2% 4.1M 34×
Table 3: Comparisons of weight pruning results on VGG-16 for ImageNet data set.

Method Top-5 Acc. No. Para. Rate
Uncompressed 92.4% 25.6M 1×
Fine-grained Pruning (Mao et al., 2017) 92.3% 9.8M 2.6×
Progressive Weight Pruning 92.0% 2.8M 9.16×
Progressive Weight Pruning 91.5% 1.47M 17.43×

Table 4: Comparisons of weight pruning results on ResNet-50 for ImageNet data set.
Method Accuracy No. Para. Rate
Uncompressed 99.2% 431K 1×
Network Pruning (Han et al., 2015) 99.2% 36K 12.5×
ADMM Pruning (Zhang et al., 2018b) 99.2% 6.05K 71.2×
Optimal Brain Surgeon (Dong et al., 2017) 98.3% 3.88K 111×
Progressive Weight Pruning 99.0% 2.58K 167×
Table 5: Comparisons of weight pruning results on LeNet-5 for MNIST data set.
Method Acc. Loss No. Para. Conv No. bits FC No. bits Total data size / Compress rate Total size w. index / Compress rate
Uncompressed 0.0% 430.5K 32 32 1.7MB 1.7MB
Iterative pruning (Han et al., 2016) 0.1% 35.8K 8 5 24.2KB / 70.2× 52.1KB / 33×
Learning to share (Ullrich et al., 2017) 0.2% 10.4KB / 162×
Our Method 0.2% 2.57K 3 2 (3 for output layer) 0.89KB / 1,910× 2.73KB / 623×
Table 6: Comparisons of weight pruning with quantization results on LeNet-5 for MNIST data set.

3.1 Experimental Setups

We evaluate the proposed ADMM-based progressive weight pruning framework on the ImageNet ILSVRC-2012 data set (Deng et al., 2009) and the MNIST data set (LeCun et al., 1998), and use DNN weight pruning results from many previous works for comparison. For the ImageNet data set, we test a variety of DNN models, including AlexNet (both the BAIR/BVLC and CaffeNet models), VGG-16, and ResNet-50. For the MNIST data set, we test the LeNet-5 model. The accuracies of the uncompressed DNN models are reported in the tables for reference.

We implement our codes in Caffe (Jia et al., 2014). Experiments are run on 12 NVIDIA GTX 1080 Ti GPUs and 12 Tesla P100 GPUs. As the key hyperparameters of ADMM-based weight pruning, we set the ADMM penalty parameters $\rho_i$ for the masked ADMM-based algorithm, and adjust them for higher performance when targeting a high weight pruning rate. To eliminate the already pruned weights in partial pruning results from the masked ADMM-based algorithm, $\rho_i$ is forced to be zero if no more pruning is performed for a specific layer $i$. We use separate initial learning rates for the masked ADMM-based algorithm and for masked retraining.

We provide the codes (both Caffe and TensorFlow versions) and all pruned DNN models (both for the ImageNet and MNIST data sets) at the link: bit.ly/2zxdlss.

3.2 Comparison Results and Discussions

Table 1 presents the weight pruning comparison results on the AlexNet model between our proposed method and prior works. Our weight pruning results clearly outperform the prior work: we achieve a 31× weight reduction rate without loss of accuracy. Our progressive weight pruning also outperforms the direct ADMM weight pruning of Zhang et al. (2018b), which achieves a 21× compression rate. In addition, the CaffeNet model yields slightly higher accuracy than the BVLC AlexNet model. Table 2 presents further comparison with direct ADMM pruning. It can be observed that (i) at the same compression rate, our progressive weight pruning outperforms direct pruning in accuracy; (ii) direct ADMM weight pruning suffers a significant accuracy drop at high compression rates (say 30× for AlexNet); and (iii) at good compression rates (18× and 21×), our progressive weight pruning technique can even achieve higher accuracy than the original, uncompressed DNN model.

Table 3, Table 4, and Table 5 present the comparison results on the VGG-16, ResNet-50, and LeNet-5 (MNIST) models, respectively. Our weight pruning results clearly outperform prior work, consistently achieving the highest sparsity on these benchmark DNN models. On the VGG-16 model, we achieve 30× weight pruning with accuracy comparable to prior works, while the highest pruning rate in prior work is 19.5×. We also achieve 34× weight pruning with minor accuracy loss. For the ResNet-50 model, we have tested a 17.43× weight pruning rate and confirmed minor accuracy loss. In fact, there is limited prior work on ResNet weight pruning for the ImageNet data set, due to (i) the difficulty of weight pruning, since ResNet consists mainly of convolutional layers, and (ii) the slow training speed of ResNet. Our method, on the other hand, achieves a relatively high training speed, thereby allowing weight pruning to be tested on different large-scale DNN models.

For LeNet-5 model compression, we achieve a 167× weight reduction with almost no accuracy loss, which is much higher than prior work at the same accuracy. The prior work Optimal Brain Surgeon (Dong et al., 2017) also achieves a high pruning rate of 111×, but suffers from an accuracy drop of around 1% (already non-negligible for the MNIST data set).

For other types of DNN models, we have tested the proposed method on the facial recognition application with two representative DNN models (Krafka et al., 2016; Ho, 2016). We demonstrate over 10× weight pruning rates with 0.2% and 0.4% accuracy loss, respectively, compared with the original DNN models.

In summary, the experimental results demonstrate that our framework applies to a broad set of representative DNN models and consistently outperforms prior work. It also applies to DNN models that consist mainly of convolutional layers, which are difficult to prune with prior methods. These promising results will significantly contribute to energy-efficient implementations of DNNs in mobile and embedded systems and on various hardware platforms.

Finally, some recent work has focused on simultaneous weight pruning and weight quantization, as both contribute to the storage compression of DNN models. Weight pruning and quantization can be unified under the ADMM framework, and we demonstrate the comparison results in Table 6 using the LeNet-5 model as an illustrative example. As can be observed in the table, we can simultaneously achieve a 167× weight reduction while using 2-bit quantization for fully-connected layer weights and 3-bit quantization for convolutional layer weights. The overall accuracy is 99.0%. When we focus on the weight data storage alone, the compression rate is an unprecedented 1,910× compared with the original DNN model with floating-point representation. When the indices required by weight pruning are accounted for, the overall compression rate is 623×, which is still much higher than prior work. It is interesting to observe that the storage needed for indices is even larger than that for the actual weight data.
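As a rough sanity check on these storage numbers (our own back-of-the-envelope arithmetic, assuming the per-layer bit widths in Table 6 and an average of roughly 2.8 bits per remaining weight, which is an assumption rather than a reported figure):

```python
# Uncompressed LeNet-5: 430.5K weights stored as 32-bit floating point.
uncompressed_bytes = 430.5e3 * 32 / 8        # ~1.72 MB, matching the 1.7MB entry

# Pruned + quantized: 2.57K weights, mostly 2-bit (FC) and 3-bit (conv / output),
# i.e., roughly 2.8 bits per remaining weight on average (an assumption).
compressed_bytes = 2.57e3 * 2.8 / 8          # ~0.90 KB, close to the 0.89KB entry

print(uncompressed_bytes / compressed_bytes) # ~1.9e3, on the order of the 1,910x rate
```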

4 Related Work on DNN Weight Pruning/Model Compression

The pioneering work by Han et al. (2015) shows that DNN weights can be effectively pruned while maintaining the same accuracy after iterative retraining, achieving 9× pruning on AlexNet and 13× pruning on VGG-16. However, higher compression rates can hardly be obtained because the method remains highly heuristic and time-consuming. Extensions of this initial work apply algorithm-level improvements. For example, Guo et al. (2016) adopt a method that performs both pruning and growing of DNN weights, achieving a 17.7× pruning rate on AlexNet. Dai et al. (2017) apply an evolutionary algorithm that prunes and grows weights in a random manner, achieving a 15.7× pruning rate on AlexNet. The Optimal Brain Surgeon technique has been proposed by Dong et al. (2017), achieving minor improvements on AlexNet/VGGNet but a good pruning ratio of 111× with less than 1% accuracy degradation on MNIST. The regularization method of Wen et al. (2016) achieves 6× weight pruning in the convolutional layers of CaffeNet. Mao et al. (2017) use different versions of DNN weight pruning methods, from fine-grained pruning to channel-wise regular pruning. Recently, the direct ADMM weight pruning algorithm has been developed (Zhang et al., 2018b), which is a systematic weight pruning framework and achieves state-of-the-art performance on multiple DNN models.

The above weight pruning methods result in irregularity in weight storage, in that indices are needed to locate the next weight in sparse matrix representations. To mitigate the associated overheads, many recent works have proposed to incorporate regularity and structure into the weight pruning framework. Representative works include the channel pruning methods (He et al., 2017; Mao et al., 2017) and the row/column weight pruning method (Wen et al., 2016); the latter has been extended in a systematic way in Zhang et al. (2018c). These works can partially mitigate the overheads on GPUs, embedded systems, and hardware implementations, resulting in higher acceleration on these platforms, but typically cannot reach higher pruning ratios than unrestricted pruning. We will investigate the application of progressive weight pruning to regular/structured pruning as future work.

5 Conclusion

This work proposes a progressive weight pruning approach based on ADMM, a powerful technique to deal with non-convex optimization problems with potentially combinatorial constraints. Motivated by dynamic programming, the proposed method reaches extremely high pruning rates by using partial prunings with moderate pruning rates at each step. Therefore, it resolves the accuracy degradation and long convergence time problems when pursuing extremely high pruning ratios. It achieves up to a 34× pruning rate for the ImageNet data set and a 167× pruning rate for the MNIST data set, significantly higher than those reported in the existing literature. Under the same number of epochs, the proposed method also achieves faster convergence and higher compression rates.

Acknowledgments

Financial support from the National Science Foundation under awards CAREER CMMI-1750531 and ECCS-1609916 is gratefully acknowledged.

References

  • Boyd et al. (2011) Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
  • Courbariaux et al. (2015) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131, 2015.
  • Dahl et al. (2012) George E Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on audio, speech, and language processing, 20(1):30–42, 2012.
  • Dai et al. (2017) Xiaoliang Dai, Hongxu Yin, and Niraj K Jha. Nest: a neural network synthesis tool based on a grow-and-prune paradigm. arXiv preprint arXiv:1711.02017, 2017.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.
  • Dong et al. (2017) Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4857–4867, 2017.
  • Guo et al. (2016) Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pp. 1379–1387, 2016.
  • Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143, 2015.
  • Han et al. (2016) Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), 2016.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • He et al. (2017) Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 1398–1406. IEEE, 2017.
  • Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • Ho (2016) Jostine Ho. mememoji. https://github.com/JostineHo/mememoji, 2016.
  • Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pp. 4107–4115, 2016.
  • Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678. ACM, 2014.
  • Jung et al. (2018) Sangil Jung, Changyong Son, Seohyung Lee, Jinwoo Son, Youngjun Kwak, Jae-Joon Han, and Changkyu Choi. Joint training of low-precision neural network with quantization interval parameters. arXiv preprint arXiv:1808.05779, 2018.
  • Krafka et al. (2016) Kyle Krafka, Aditya Khosla, Petr Kellnhofer, Harini Kannan, Suchendra Bhandarkar, Wojciech Matusik, and Antonio Torralba. Eye tracking for everyone. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Leng et al. (2017) Cong Leng, Hao Li, Shenghuo Zhu, and Rong Jin. Extremely low bit neural network: Squeeze the last bit out with admm. arXiv preprint arXiv:1707.09870, 2017.
  • Lin et al. (2016) Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858, 2016.
  • Mao et al. (2017) Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.
  • Park et al. (2017) Eunhyeok Park, Junwhan Ahn, and Sungjoo Yoo. Weighted-entropy-based quantization for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7197–7205, 2017.
  • Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.
  • Sharify et al. (2018) Sayeh Sharify, Alberto Delmas Lascorz, Kevin Siu, Patrick Judd, and Andreas Moshovos. Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks. In Proceedings of the 55th Annual Design Automation Conference, pp. 20. ACM, 2018.
  • Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
  • Ullrich et al. (2017) Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.
  • Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
  • Wu et al. (2016) Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4820–4828, 2016.
  • Yang et al. (2017) Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6071–6079, 2017.
  • Yu et al. (2017) Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7370–7379, 2017.
  • Zhang et al. (2018a) Dejiao Zhang, Haozhu Wang, Mario Figueiredo, and Laura Balzano. Learning to share: Simultaneous parameter tying and sparsification in deep learning. In International Conference on Learning Representations (ICLR), 2018a.
  • Zhang et al. (2018b) Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. arXiv preprint arXiv:1804.03294, 2018b.
  • Zhang et al. (2018c) Tianyun Zhang, Kaiqi Zhang, Shaokai Ye, Jiayu Li, Jian Tang, Wujie Wen, Xue Lin, Makan Fardad, and Yanzhi Wang. Adam-admm: A unified, systematic framework of structured weight pruning for dnns. arXiv preprint arXiv:1807.11091, 2018c.
  • Zhou et al. (2017) Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. In International Conference on Learning Representations (ICLR), 2017.
  • Zhu et al. (2017) Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. In International Conference on Learning Representations (ICLR), 2017.