Introduction
Despite their significant success and wide application, Deep Neural Networks (DNNs) keep growing in model size, and with it in computation and storage overhead. DNN model compression techniques have therefore been widely investigated, including weight pruning [Han et al.2015, Wen et al.2016, Dai, Yin, and Jha2017, Guo, Yao, and Chen2016], weight clustering and quantization [Han, Mao, and Dally2016, Park, Ahn, and Yoo2017, Zhou et al.2017, Leng et al.2017], and low-rank approximation [Cheng et al.2015, Sindhwani, Sainath, and Kumar2015, Zhao et al.2017].
A pioneering work on weight pruning is the iterative pruning method [Han et al.2015], which prunes the weights by 12× in LeNet5 (MNIST dataset) and by 9× in AlexNet (ImageNet dataset) without accuracy degradation. The limitations of this work include (i) the limited capability of weight pruning in convolutional (CONV) layers, the most computationally intensive component in DNNs, and (ii) the irregularity in weight storage after pruning. To overcome these limitations, several recent works (i) use more sophisticated algorithms to reach a higher pruning ratio [Zhang et al.2018b, Ye et al.2018, Dai, Yin, and Jha2017, Guo, Yao, and Chen2016], (ii) strike a balance between higher pruning ratio and lower accuracy degradation [Yang, Chen, and Sze2016], and (iii) incorporate regularity in weight pruning and storage to facilitate hardware implementations [Wen et al.2016, Wen et al.2017, Zhang et al.2018c].
Weight clustering and quantization are equally important, if not more so, in DNN model compression. Weight clustering differs from quantization: the former requires the weights to be grouped into a predefined number of clusters, with all weights within a cluster sharing the same value; the latter requires the weights to take predefined, fixed values. Weight quantization is therefore a special case of general weight clustering. Due to its flexibility, weight clustering can reach higher accuracy and/or a higher compression ratio than quantization. On the other hand, weight quantization, especially equal-distance quantization, is more hardware friendly than weight clustering (and than weight pruning as well). Both weight clustering and quantization are considered in this paper in a unified way.
A large body of research is dedicated to weight clustering and quantization [Leng et al.2017, Zhou et al.2016]. Current methods are mainly iterative, alternating backpropagation training that assumes continuous weights with a procedure that maps the weights to discrete values. Notably, multiplications can even be eliminated through effective weight quantization [Leng et al.2017]: quantization into binary weights, ternary weights ($-1$, $0$, $+1$), or powers of 2, as in DoReFa-Net [Zhou et al.2016].
Weight pruning exploits the redundancy in the number of weights in DNNs, whereas weight clustering/quantization exploits the redundancy in weight representations. These two sources of redundancy are largely independent of each other, which makes it desirable to combine weight pruning and clustering/quantization to fully exploit both. Despite some early heuristic investigations [Han et al.2015, Han, Mao, and Dally2016], a systematic investigation of the best possible combination of DNN model compression techniques is lacking. This paper aims to overcome this limitation and shed some light on the highest possible DNN model compression through effective combinations.
This paper develops a unified, systematic framework of DNN weight pruning and weight clustering/quantization using the Alternating Direction Method of Multipliers (ADMM), a powerful optimization technique for nonconvex problems with potentially combinatorial constraints [Boyd et al.2011, Takapoui et al.2017]. The framework is based on a key observation: DNN weight pruning, weight clustering/quantization, and their combinations can all be solved in a unified manner using ADMM. In the solution, the original problem is decomposed into three (or two) subproblems, which are solved iteratively until convergence. Overall, the ADMM-based solution can be understood as a smart, dynamic regularization process in which the regularization target is updated in each iteration. As a result, it can outperform prior work on regularization [Wen et al.2016] or projected gradient descent [Zhang et al.2018a].
For further performance improvement in ADMM-based weight clustering/quantization, we propose multiple techniques, including an iterative weight quantization and retraining procedure, joint weight clustering training with adaptive centroid updating, and a weight clustering retraining process. The proposed unified framework using ADMM outperforms prior work in two aspects. First, for individual weight pruning and clustering/quantization methods, the proposed ADMM method outperforms prior work. For instance, we achieve 167× weight reduction in LeNet5, 24.7× in AlexNet, and 23.4× in VGGNet, without any accuracy loss, which clearly outperforms prior art. Second, for joint DNN weight pruning and clustering/quantization, we achieve 1,910× and 210× storage reduction of weight data on LeNet5 and AlexNet, respectively, without accuracy loss. These results significantly outperform the state-of-the-art results. Our codes and models are released at the link http://bit.ly/2D3F0np.
Discussions on the Combination of DNN Weight Pruning and Clustering/Quantization
As discussed before, weight pruning can be combined with weight clustering/quantization to fully exploit both sources of redundancy. Preliminary work in this direction [Han, Mao, and Dally2016] combines iterative weight pruning with K-means clustering. It simultaneously achieves 9× weight pruning in AlexNet and uses 8-bit CONV layer clustering and 5-bit FC layer clustering. This work does not target hardware implementation and only focuses on weight clustering.
Compared with weight clustering/quantization, weight pruning can often result in a higher DNN compression ratio [Han, Mao, and Dally2016]. There are two reasons. First, there is often a higher degree of redundancy in the number of weights than in the number of bits used for weight representations. For weight clustering/quantization, each single-bit reduction in weight representation can be perceived as doubling the imprecision. Weight pruning methods do not face this difficulty. Second, moderate weight pruning can often increase accuracy (by up to 2% in AlexNet in our ADMM framework), which leaves a larger margin for further weight reduction. This effect is not observed in weight clustering/quantization. As a result, weight pruning is often prioritized over weight clustering/quantization, despite the irregular weight storage and the associated hardware implementation overhead of the former.
Prior work lacks a systematic investigation of the best possible combination of DNN weight pruning and weight clustering/quantization methods. In this paper we fill this void in order to fully exploit both sources of redundancy. We provide a formulation that can either perform ADMM-based weight pruning and clustering/quantization simultaneously or give priority to weight pruning.
Fig. 1 illustrates weight quantization after weight pruning. Given the pruned weight matrix to be quantized in Fig. 1 (a), we use 2 bits for quantization with an interval of 0.5. The quantization levels then become $\{-1, -0.5, 0.5, 1\}$, without 0 because 0 represents pruned weights. Fig. 1 (b) shows the quantized weights, and Fig. 1 (c) displays the values that are actually stored in hardware along with the interval value 0.5.
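The quantization step described for Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `quantize_pruned` and `dequantize` and the sample weight values are hypothetical, and the stored value is assumed to be the signed integer multiple of the interval.

```python
def quantize_pruned(weights, interval, bits):
    """Map each nonzero weight to the nearest level k*interval, k in +-1..+-M/2."""
    half = (2 ** bits) // 2              # M/2 positive and M/2 negative levels
    codes = []
    for w in weights:
        if w == 0.0:                     # pruned weight: stays zero, not quantized
            codes.append(0)
            continue
        k = round(abs(w) / interval)     # nearest multiple of the interval
        k = min(max(k, 1), half)         # clamp to 1..M/2 so level 0 is never used
        codes.append(k if w > 0 else -k)
    return codes


def dequantize(codes, interval):
    """Recover the quantized weight values from the stored integer codes."""
    return [c * interval for c in codes]


# Made-up pruned weights (zeros are pruned positions); 2 bits, interval 0.5.
pruned = [0.0, 0.42, -0.97, 0.0, 0.61, -0.18, 1.3, 0.0]
codes = quantize_pruned(pruned, interval=0.5, bits=2)
values = dequantize(codes, 0.5)
print(codes)   # stored integer codes
print(values)  # quantized weights in {-1, -0.5, 0.5, 1} plus pruned zeros
```

Only the integer codes and the single interval value need to be kept per layer, which is what makes equal-distance quantization attractive for hardware.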
Fig. 2 illustrates weight clustering after weight pruning, which differs from weight quantization. Given the same pruned weight matrix, we again use 2 bits for weight clustering. Again 0 is not considered because the associated weights are already pruned. Fig. 2 (b) shows the weights after clustering, and Fig. 2 (c) shows the centroid values of the 4 clusters. Unlike in weight quantization, the centroid values are flexible in the weight clustering process.
The Unified Framework of ADMM-based Weight Pruning and Clustering/Quantization
Background of ADMM
Consider a nonconvex optimization problem that is difficult to solve directly. The ADMM method decomposes it into two subproblems that can be solved separately and efficiently. For example, the optimization problem
(1) $\min_{x} \; f(x) + g(x)$
lends itself to the application of ADMM if $f(x)$ is differentiable and $g(x)$ has some structure, such as the $\ell_1$/$\ell_2$-norm or the indicator function of a constraint set. The problem is first rewritten as
(2) $\min_{x, z} \; f(x) + g(z)$
subject to $x = z$.
By using the augmented Lagrangian [Boyd et al.2011], this problem is decomposed into two subproblems on $x$ and $z$. The first is $\min_{x} \; f(x) + q_1(x)$, where $q_1(x)$ is a quadratic function. As a result, the complexity of solving subproblem 1 (e.g., via stochastic gradient descent) is the same as that of minimizing $f(x)$. Subproblem 2 is $\min_{z} \; g(z) + q_2(z)$, where $q_2(z)$ is quadratic. When $g$ has special structure, exploiting its properties allows this problem to be solved analytically and optimally. In this way, ADMM lets us solve a problem that is difficult to solve directly.
Problem Formulation
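Consider an $N$-layer DNN. The collections of weights and biases of the $i$-th layer are denoted by $W_i$ and $b_i$, respectively, and the loss function of the $N$-layer DNN is denoted by $f(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N})$.
When we combine DNN weight pruning with clustering or quantization, the overall problem is defined by
(3) $\min_{\{W_i\}, \{b_i\}} \; f(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N})$
subject to $W_i \in S_i, \; W_i \in S'_i, \; i = 1, \ldots, N.$
The set $S_i$ reflects the constraint for the weight pruning problem, i.e., $S_i = \{W_i \mid \mathrm{card}(\mathrm{supp}(W_i)) \le \alpha_i\}$, where $\alpha_i$ is the desired number of weights after pruning in the $i$-th layer. When we combine weight pruning with weight clustering, $S'_i = \{W_i \mid \text{the nonzero elements of } W_i \text{ take at most } M_i \text{ distinct values}\}$. When we combine weight pruning with weight quantization, $S'_i = \{W_i \mid \text{the nonzero elements of } W_i \text{ take values in } \{Q_1, Q_2, \ldots, Q_{M_i}\}\}$. Here the $Q$ values are quantization levels, and we consider equal-distance quantization (the same distance between quantization levels) to facilitate hardware implementations. Besides, $M_i = 2^{n_i}$, where $n_i$ is the number of bits we use for weight clustering or quantization.
Both constraints $W_i \in S_i$ and $W_i \in S'_i$ need to be satisfied simultaneously in the joint problem of DNN weight pruning and weight clustering/quantization. In this way we make sure that most of the DNN weights are pruned (set to zero), while the remaining weights are clustered/quantized.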
The Unified ADMM-based Framework
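To apply ADMM, we define indicator functions that incorporate the combinatorial constraints into the objective function. The indicator functions are
$g_i(W_i) = \begin{cases} 0 & \text{if } W_i \in S_i, \\ +\infty & \text{otherwise,} \end{cases} \qquad h_i(W_i) = \begin{cases} 0 & \text{if } W_i \in S'_i, \\ +\infty & \text{otherwise,} \end{cases}$
for $i = 1, \ldots, N$.
We then incorporate auxiliary variables $Z_i$ and $Y_i$, and rewrite the original problem (3) as
(4) $\min_{\{W_i\}, \{b_i\}} \; f(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N}) + \sum_{i=1}^{N} g_i(Z_i) + \sum_{i=1}^{N} h_i(Y_i)$
subject to $W_i = Z_i, \; W_i = Y_i, \; i = 1, \ldots, N.$
Through ADMM [Boyd et al.2011], problem (4) can be decomposed into three subproblems. The overall problem of weight pruning and clustering/quantization is solved by solving the subproblems iteratively until convergence. The first subproblem is
(5) $\min_{\{W_i\}, \{b_i\}} \; f(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N}) + \sum_{i=1}^{N} \frac{\rho_i}{2} \| W_i - Z_i^{k} + U_i^{k} \|_F^2 + \sum_{i=1}^{N} \frac{\rho_i}{2} \| W_i - Y_i^{k} + V_i^{k} \|_F^2,$
where $U_i^{k}$ and $V_i^{k}$ are the dual variables updated in each ADMM iteration and $\rho_i$ is the penalty parameter. The first term in (5) is the differentiable loss function of the DNN, while the other terms are quadratic, and hence differentiable and convex. As a result, this subproblem can be solved by stochastic gradient descent (e.g., the Adam algorithm [Kingma and Ba2014]), and the complexity of solving it is the same as that of training the original DNN.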
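The second subproblem is
(6) $\min_{\{Z_i\}} \; \sum_{i=1}^{N} g_i(Z_i) + \sum_{i=1}^{N} \frac{\rho_i}{2} \| W_i^{k+1} - Z_i + U_i^{k} \|_F^2.$
As mentioned before, $g_i$ is the indicator function of $S_i$, thus the analytical solution of problem (6) is
(7) $Z_i^{k+1} = \Pi_{S_i}(W_i^{k+1} + U_i^{k}),$
where $\Pi_{S_i}(\cdot)$ is the Euclidean projection of $W_i^{k+1} + U_i^{k}$ onto the set $S_i$. In DNN weight pruning, $\alpha_i$ is the desired number of weights after pruning in the $i$-th layer. The Euclidean projection keeps the $\alpha_i$ elements of $W_i^{k+1} + U_i^{k}$ with the largest magnitude and sets the rest to zero [Boyd et al.2011, Zhang et al.2018b].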
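The projection used for weight pruning, keeping the largest-magnitude entries and zeroing the rest, can be sketched as follows. This is a minimal illustration on a flat list of numbers; the function name `project_prune` and the sample values are hypothetical, and a real layer would first be flattened from its tensor shape.

```python
def project_prune(values, alpha):
    """Keep the alpha largest-magnitude entries of a flat list, zero the rest."""
    if alpha <= 0:
        return [0.0] * len(values)
    # Indices ordered by decreasing magnitude; sorted() is stable, so ties
    # keep the entry that appears first.
    order = sorted(range(len(values)), key=lambda j: -abs(values[j]))
    keep = set(order[:alpha])
    return [v if j in keep else 0.0 for j, v in enumerate(values)]


# Made-up entries standing in for one layer's weights plus dual variable.
w_plus_u = [0.05, -1.2, 0.3, 0.9, -0.02, 0.4]
z = project_prune(w_plus_u, alpha=3)
print(z)
```

The surviving entries are left unchanged, which is what makes this the closest point (in Euclidean distance) among all vectors with at most `alpha` nonzeros.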
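The third subproblem has the same form as the second, with $h_i$, $Y_i$, and $V_i^{k}$ in place of $g_i$, $Z_i$, and $U_i^{k}$:
(8) $\min_{\{Y_i\}} \; \sum_{i=1}^{N} h_i(Y_i) + \sum_{i=1}^{N} \frac{\rho_i}{2} \| W_i^{k+1} - Y_i + V_i^{k} \|_F^2,$
and its solution is the Euclidean projection onto $S'_i$:
(9) $Y_i^{k+1} = \Pi_{S'_i}(W_i^{k+1} + V_i^{k}).$
For the weight quantization problem, the quantization levels $Q_1, Q_2, \ldots, Q_{M_i}$ are fixed. In fact, weight quantization is a special type of clustering in which the clustering centroids are predetermined and fixed. In weight quantization, the Euclidean projection maps every element of $W_i^{k+1} + V_i^{k}$ to the quantization level (centroid) closest to that element.
For weight clustering, the centroids of the clusters can be updated dynamically, and the constraint is on the number of clusters $M_i$ for the $i$-th layer. Suppose that the weights are already divided into clusters $C_1, C_2, \ldots, C_{M_i}$. Then the Euclidean projection sets every element of $W_i^{k+1} + V_i^{k}$ to the average value of its cluster. The details of how to divide the weights into clusters are discussed in the next section.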
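The two projections just described differ only in where the target values come from. The sketch below shows both on a flat list; the function names, the cluster assignment, and the sample values are hypothetical, and zero entries are treated as pruned and left untouched.

```python
def project_quantize(values, levels):
    """Map each nonzero entry to the closest fixed quantization level."""
    return [0.0 if v == 0.0 else min(levels, key=lambda q: abs(v - q))
            for v in values]


def project_cluster(values, assignment):
    """Replace each nonzero entry by the mean of the cluster it is assigned to."""
    sums, counts = {}, {}
    for v, c in zip(values, assignment):
        if v != 0.0:
            sums[c] = sums.get(c, 0.0) + v
            counts[c] = counts.get(c, 0) + 1
    return [0.0 if v == 0.0 else sums[c] / counts[c]
            for v, c in zip(values, assignment)]


values = [0.0, 0.4, 0.7, -0.9, -1.2, 0.0]
levels = [-1.0, -0.5, 0.5, 1.0]          # fixed levels (quantization)
assignment = [0, 0, 0, 1, 1, 0]          # given cluster index per entry (clustering)

quantized = project_quantize(values, levels)
clustered = project_cluster(values, assignment)
print(quantized)
print(clustered)
```

With fixed levels the result is constrained to the predefined grid, whereas the cluster means (0.55 and -1.05 in this made-up example) adapt to the data.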
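After solving the subproblems, we update the dual variables $U_i$ and $V_i$, which are given by
(10) $U_i^{k+1} = U_i^{k} + W_i^{k+1} - Z_i^{k+1},$
(11) $V_i^{k+1} = V_i^{k} + W_i^{k+1} - Y_i^{k+1}.$
This is one iteration of ADMM; we solve the subproblems and update the dual variables iteratively until ADMM converges. Namely, the following conditions need to be satisfied for a small tolerance $\epsilon_i$:
(12) $\| W_i^{k+1} - Z_i^{k+1} \|_F^2 \le \epsilon_i,$
(13) $\| W_i^{k+1} - Y_i^{k+1} \|_F^2 \le \epsilon_i.$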
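To make the iteration concrete, the following toy sketch runs ADMM with only the pruning constraint on a simple quadratic loss $\frac{1}{2}\|w - a\|^2$. It is an assumption-laden illustration: the quadratic loss replaces the DNN loss so that subproblem 1 has a closed form (the paper solves it with stochastic gradient descent), and the names `admm_prune`, `rho`, and `eps` are hypothetical.

```python
def project_prune(values, alpha):
    """Euclidean projection onto {at most alpha nonzeros}."""
    order = sorted(range(len(values)), key=lambda j: -abs(values[j]))
    keep = set(order[:alpha])
    return [v if j in keep else 0.0 for j, v in enumerate(values)]


def admm_prune(a, alpha, rho=1.0, eps=1e-8, max_iters=100):
    """ADMM for: minimize 0.5*||w - a||^2 subject to card(w) <= alpha."""
    z = project_prune(a, alpha)          # initial auxiliary variable
    u = [0.0] * len(a)                   # scaled dual variable
    w, residual, iters = list(a), float("inf"), 0
    for k in range(max_iters):
        # Subproblem 1: loss plus quadratic term (closed form for this toy loss).
        w = [(aj + rho * (zj - uj)) / (1.0 + rho) for aj, zj, uj in zip(a, z, u)]
        # Subproblem 2: Euclidean projection of w + u onto the pruning set.
        z = project_prune([wj + uj for wj, uj in zip(w, u)], alpha)
        # Dual update: u <- u + w - z.
        u = [uj + wj - zj for uj, wj, zj in zip(u, w, z)]
        # Stopping test on the squared primal residual ||w - z||^2.
        residual = sum((wj - zj) ** 2 for wj, zj in zip(w, z))
        iters = k + 1
        if residual <= eps:
            break
    return w, z, residual, iters


a = [3.0, -0.1, 0.2, -2.0, 0.05]
w, z, residual, iters = admm_prune(a, alpha=2)
print(z, residual, iters)
```

In this example the quadratic term steadily pulls the unconstrained variable `w` toward the sparse target `z`, which is the "dynamic regularization" behaviour the framework relies on.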
More Understanding about the ADMM-based Framework
The high performance of the ADMM-based framework can be interpreted as follows. The framework acts as a smart, dynamic DNN regularization technique (see Eqn. (5)), in which the regularization targets are updated in each ADMM iteration by solving subproblems 2 and 3. This dynamic characteristic is one of the key reasons that the ADMM-based framework outperforms much prior work on DNN model compression based on regularization (without updating of regularization targets) or on Projected Gradient Descent [Zhang et al.2018a].
Simplification for the Proposed Framework
The above formulation and iterative solution have high complexity. To address this issue, we present a method that first prunes the unimportant weights and then performs weight clustering or quantization. The reason for this order is the higher degree of redundancy in the number of weights than in the number of bits for weight representations (and therefore the higher gain from weight pruning than from weight clustering/quantization), as discussed before.
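In the first step, we only account for the constraints for DNN weight pruning. We update $W_i$ and $b_i$ according to
(14) $\min_{\{W_i\}, \{b_i\}} \; f(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N}) + \sum_{i=1}^{N} \frac{\rho_i}{2} \| W_i - Z_i^{k} + U_i^{k} \|_F^2.$
After weight pruning, we solve the DNN weight clustering or quantization problem. We consider the constraints for weight clustering or quantization on the pruned model (the remaining weights). To solve this problem, we update $W_i$ and $b_i$ according to
(15) $\min_{\{W_i\}, \{b_i\}} \; f(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N}) + \sum_{i=1}^{N} \frac{\rho_i}{2} \| W_i - Y_i^{k} + V_i^{k} \|_F^2,$
and update $Y_i$ and $V_i$ according to (9) and (11). Note that in weight clustering/quantization, we only update the nonzero elements. The pruned weights are fixed to zero.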
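The rule that only nonzero weights are updated during the clustering/quantization stage can be sketched as a masked gradient step on the loss plus the quadratic ADMM term. This is a minimal illustration with plain gradient descent; the function name `masked_admm_step`, the learning rate, and all numeric values (including the loss gradients) are made up.

```python
def masked_admm_step(weights, loss_grads, y, v, mask, rho, lr):
    """One gradient step on loss + rho/2 * (w - y + v)^2 for unpruned weights only."""
    updated = []
    for w, g, yj, vj, keep in zip(weights, loss_grads, y, v, mask):
        if not keep:
            updated.append(0.0)                  # pruned weight stays fixed at zero
            continue
        total_grad = g + rho * (w - yj + vj)     # loss gradient + quadratic-term gradient
        updated.append(w - lr * total_grad)
    return updated


weights = [0.0, 0.42, -0.97]                     # first weight was pruned
mask = [w != 0.0 for w in weights]               # True where the weight survived pruning
loss_grads = [0.3, 0.1, -0.2]                    # made-up gradients of the DNN loss
y = [0.0, 0.5, -1.0]                             # current quantized/clustered targets
v = [0.0, 0.0, 0.0]                              # dual variable
new_weights = masked_admm_step(weights, loss_grads, y, v, mask, rho=1.0, lr=0.1)
print(new_weights)
```

The mask is computed once after pruning and reused, so the sparsity pattern found in the first stage is never disturbed by the second stage.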
The overall algorithm is shown in Algorithm 1; the details of weight quantization/clustering are discussed in the next section. We start from trained DNNs (e.g., LeNet5, AlexNet, VGGNet). We use the weight pruning ratios $\alpha_i$ and the numbers of clustering/quantization levels $M_i$ from prior work [Han et al.2015, Han, Mao, and Dally2016] as starting points, and then further increase the pruning ratios and decrease the number of clustering/quantization levels. The rationale behind this procedure is that our framework is unified and systematic, and can achieve higher DNN model compression than the state of the art.
Details in Weight Quantization and Clustering
Weight clustering and quantization are used to further compress the weight representation after ADMM-based pruning. After weight pruning, the weight matrix contains many zeros. We use $n_i$ bits, which means there are $M_i = 2^{n_i}$ points representing different weights, to cluster or quantize the remaining nonzero weights (zero weights are already pruned). The number of bits and their representations can differ between layers of the DNN. The difference between weight clustering and quantization is that in weight clustering the centroids of the clusters are flexible, while in weight quantization the quantization levels (centroids) are fixed and predefined.
In the following we first discuss details of weight quantization, which can be perceived as a special case of weight clustering, and then the general weight clustering process.
Details in Weight Quantization
Parameter Initialization
In this work we use equal-distance quantization to facilitate hardware implementations. In each layer the number of quantization levels is $M_i$. We quantize the weights into a set of quantization levels $\{-\frac{M_i}{2} q_i, \ldots, -2q_i, -q_i, q_i, 2q_i, \ldots, \frac{M_i}{2} q_i\}$. The interval $q_i$ is the distance between nearby quantization levels, which may differ between layers. There is no need to quantize zero weights because they are already pruned.
The interval $q_i$ and the number of quantization levels ($M_i$) in weight quantization can be determined in an effective manner. To find a value of $q_i$, we denote by $w_j^i$ the $j$-th weight in layer $i$ and by $f_q(w_j^i)$ a quantization function that maps it to the closest quantization level. The total square error in a single quantization step is then given by $\sum_j |w_j^i - f_q(w_j^i)|^2$. In order to minimize the total square error, we use a binary search method to determine $q_i$. To decide a value of $M_i$, we reference prior work such as [Han et al.2015] and decrease $M_i$ accordingly. In [Han et al.2015], around 5 bits are used for quantization (which is actually a kind of clustering) in AlexNet, whereas our experimental results show that 3 to 4 bits on average suffice to quantize the weights in AlexNet without incurring any accuracy loss.
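The choice of interval can be sketched as a one-dimensional search over candidate values that minimizes the total square quantization error. Note the swap: the text uses binary search, whereas this sketch uses a plain grid search for simplicity. The function names, the number of grid steps, and the sample weights are hypothetical.

```python
def quant_error(weights, q, half):
    """Total square error when nonzero weights snap to levels +-q, ..., +-half*q."""
    err = 0.0
    for w in weights:
        if w == 0.0:                      # pruned weights are not quantized
            continue
        k = min(max(round(abs(w) / q), 1), half)
        err += (abs(w) - k * q) ** 2
    return err


def search_interval(weights, bits, steps=1000):
    """Grid search (a stand-in for binary search) for the interval q."""
    half = (2 ** bits) // 2
    w_max = max(abs(w) for w in weights)
    if w_max == 0.0:
        raise ValueError("no nonzero weights to quantize")
    best_q, best_err = None, float("inf")
    for s in range(1, steps + 1):
        q = w_max * s / steps             # candidate interval in (0, w_max]
        e = quant_error(weights, q, half)
        if e < best_err:
            best_q, best_err = q, e
    return best_q, best_err


# Made-up weights that sit exactly on a 0.5 grid, so the best interval is 0.5.
weights = [0.5, -0.5, 1.0, -1.0, 0.0]
best_q, best_err = search_interval(weights, bits=2)
print(best_q, best_err)
```

A grid search costs more evaluations than binary search but does not assume the error curve has a single minimum, which keeps the sketch safe for arbitrary weight distributions.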
Iterative Weight Quantization and Retraining
After the ADMM procedure, many weights are close to the quantization levels rather than exactly on them, which means that the weights are not yet strictly quantized. The reasons are twofold: the nonconvex nature of the optimization problem and the limited time available to finish the ADMM procedure. A straightforward remedy is to project all the weights to the nearest quantization levels. Although the change in each weight value is very small, because the weights are already close to the quantization levels, the huge number of weights makes the changes accumulate, causing around 1% overall accuracy degradation in our experiments (quantization to 3 bits).
To address this degradation, we present an iterative weight quantization method. In our method, we iteratively project a portion of the weights to the nearest quantization levels, fix these values (i.e., we quantize these weights), and retrain the rest. More specifically, after the ADMM procedure we quantize a fixed percentage of the weights, namely those closest to each quantization level, and then retrain the remaining weights. Quantizing the weights degrades the accuracy of the DNN, and the retraining step recovers it. After the retraining step, we again quantize the same percentage of weights closest to each quantization level and perform another retraining step. This quantization and retraining process is repeated until the number of unquantized weights is small enough. Finally, we quantize the small number of remaining weights, which does not incur accuracy loss.
The advantage of our proposed method is that we only quantize a portion of the weights in every iteration, and the retraining step gives the remaining weights an additional chance to be updated and recover the accuracy. This explains why the iterative quantization method works better than the straightforward method that quantizes all the weights directly. Algorithm 2 shows the overall ADMM-based weight quantization and iterative retraining process.
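The iterative quantize-and-retrain loop can be sketched as follows. This is a simplified illustration under stated assumptions: the fraction is applied to all unquantized weights at once rather than per quantization level, `retrain` is a hypothetical callback standing in for DNN retraining (it must leave fixed weights unchanged), and the sample weights are made up.

```python
def nearest_level(w, levels):
    """Quantization level closest to w."""
    return min(levels, key=lambda q: abs(w - q))


def iterative_quantize(weights, levels, fraction, retrain, min_left=1):
    """Quantize the closest `fraction` of free weights, retrain the rest, repeat."""
    weights = list(weights)
    fixed = [w == 0.0 for w in weights]          # pruned weights are already final
    while sum(not f for f in fixed) > min_left:
        free = [j for j, f in enumerate(fixed) if not f]
        # Free weights ordered by distance to their nearest level.
        free.sort(key=lambda j: abs(weights[j] - nearest_level(weights[j], levels)))
        n = max(1, int(len(free) * fraction))    # always quantize at least one weight
        for j in free[:n]:
            weights[j] = nearest_level(weights[j], levels)
            fixed[j] = True
        weights = list(retrain(weights, fixed))  # update only the non-fixed weights
    # Few weights remain: quantize them directly.
    return [w if f else nearest_level(w, levels) for w, f in zip(weights, fixed)]


levels = [-1.0, -0.5, 0.5, 1.0]
start = [0.0, 0.48, -0.93, 0.55, 1.2]
identity_retrain = lambda w, fixed: w            # placeholder: no actual retraining
result = iterative_quantize(start, levels, fraction=0.5, retrain=identity_retrain)
print(result)
```

Because each round fixes at least one weight, the loop is guaranteed to terminate regardless of what the retraining callback does to the free weights.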
Details in Weight Clustering
In weight clustering, we cluster the remaining weights (after weight pruning) into $M_i$ clusters, where the weights in each cluster share the same value. Unlike in weight quantization, the centroid value of each cluster is flexible and should be optimized along with the weight clustering procedure. In the following, we discuss the aspects of weight clustering that differ from quantization, including the weight clustering training and retraining processes. The initialization of the $M_i$ value is the same as in weight quantization and is not discussed in detail.
Weight Clustering Training
The clustering centroids need to be determined jointly during the training procedure. For initialization, we perform K-means clustering ($K = M_i$) and set each centroid to the average value of its associated weights. In each ADMM iteration of weight clustering training, we map the weights (Euclidean mapping) to the nearest centroid values and update the weights by solving Eqn. (15). Based on the updated weights, we perform K-means clustering again and update each centroid value to the average value of its associated weights. In this way we perform both weight clustering and centroid updating in an effective manner.
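The alternation between assigning weights to the nearest centroid and recomputing each centroid as the mean of its members is the standard one-dimensional K-means update, sketched below. The function names, the starting centroids, and the weight values are hypothetical, and only the nonzero (unpruned) weights are passed in.

```python
def assign(weights, centroids):
    """Index of the nearest centroid for each weight."""
    return [min(range(len(centroids)), key=lambda c: abs(w - centroids[c]))
            for w in weights]


def update_centroids(weights, labels, centroids):
    """Set each centroid to the mean of its members; keep empty clusters unchanged."""
    new = list(centroids)
    for c in range(len(centroids)):
        members = [w for w, l in zip(weights, labels) if l == c]
        if members:
            new[c] = sum(members) / len(members)
    return new


def kmeans_1d(weights, centroids, iters=10):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    for _ in range(iters):
        labels = assign(weights, centroids)
        centroids = update_centroids(weights, labels, centroids)
    return centroids, assign(weights, centroids)


nonzero_weights = [0.4, 0.6, -0.9, -1.1, 1.4, 1.6, -0.1, -0.3]
centroids, labels = kmeans_1d(nonzero_weights, [-1.0, -0.5, 0.5, 1.0])
print(centroids, labels)
```

Starting from an equal-distance grid, the centroids drift to where the weights actually concentrate, which illustrates the extra flexibility clustering has over quantization.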
Weight Clustering Retraining
After finishing the ADMM training of weight clustering, we perform a weight clustering retraining process to avoid accuracy degradation. This retraining process is not ADMM-based; it uses basic stochastic gradient descent. In the retraining process, we perform stochastic gradient descent only on the centroid value of each cluster. In this way all the weights in a cluster keep the same value (the centroid value). The retraining process only enhances accuracy rather than degrading it. This retraining flexibility of centroid values is the key reason that weight clustering has higher accuracy than quantization. Algorithm 3 illustrates the whole process of ADMM-based weight clustering and retraining.
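Retraining only the centroids can be sketched by noting that, when all weights in a cluster share one value, the gradient with respect to that centroid is the sum of the gradients of its member weights. The sketch below assumes plain gradient descent; the function names, the learning rate, and the gradient values are made up, and `None` marks a pruned position.

```python
def centroid_sgd_step(centroids, labels, weight_grads, lr):
    """Update each centroid with the summed gradient of the weights tied to it."""
    grad_c = [0.0] * len(centroids)
    for label, g in zip(labels, weight_grads):
        if label is not None:                # pruned weights contribute nothing
            grad_c[label] += g
    return [c - lr * g for c, g in zip(centroids, grad_c)]


def expand(centroids, labels):
    """Rebuild the weight vector: each weight takes its cluster's centroid value."""
    return [0.0 if label is None else centroids[label] for label in labels]


centroids = [-1.0, 0.5]
labels = [1, None, 0, 1]                     # cluster index per weight, None = pruned
weight_grads = [0.2, 0.9, -0.4, 0.1]         # made-up per-weight loss gradients
new_centroids = centroid_sgd_step(centroids, labels, weight_grads, lr=0.1)
new_weights = expand(new_centroids, labels)
print(new_centroids, new_weights)
```

Since only the centroid table changes, the number of distinct weight values and the cluster membership stay exactly as they were after ADMM training.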
Table 1: Model size compression results on LeNet5 (MNIST dataset).
Model | Accuracy degradation | No. of weights | CONV weight bits | FC weight bits | Total data size / compression ratio | Total model size (including index) / compression ratio
LeNet5 Baseline | 0.0% | 430.5K | 32 | 32 | 1.7MB | 1.7MB
Iterative pruning [Han, Mao, and Dally2016] | 0.1% | 35.8K | 8 | 5 | 24.2KB / 70.2× | 52.1KB / 33×
Our Method (Clustering) | 0.1% | 2.57K | 3 | 2 (3 for output layer) | 0.89KB / 1,910× | 2.73KB / 623×
Our Method (Quantization) | 0.2% | 2.57K | 3 | 2 (3 for output layer) | 0.89KB / 1,910× | 2.73KB / 623×
Table 2: Model size compression results on AlexNet (ImageNet dataset).
Model | Accuracy degradation | No. of weights | CONV weight bits | FC weight bits | Total data size / compression ratio | Total model size (including index) / compression ratio
AlexNet Baseline | 0.0% | 60.9M | 32 | 32 | 243.6MB | 243.6MB
Iterative pruning [Han, Mao, and Dally2016] | 0.0% | 6.7M | 8 | 5 | 5.4MB / 45× | 9.0MB / 27×
Binary quant. [Leng et al.2017] | 3.0% | 60.9M | 1 | 1 | 7.3MB / 32× | 7.3MB / 32×
Ternary quant. [Leng et al.2017] | 1.8% | 60.9M | 2 | 2 | 15.2MB / 16× | 15.2MB / 16×
Our Method (Clustering) | 0.1% | 2.47M | 5 | 3 | 1.16MB / 210× | 2.7MB / 90×
Our Method (Quantization) | 0.2% | 2.47M | 5 | 3 | 1.16MB / 210× | 2.7MB / 90×
Experimental Results and Discussions
In this section, we apply the proposed joint weight pruning and weight clustering/quantization framework to LeNet5 [LeCun et al.1998] on the MNIST dataset and to AlexNet [Krizhevsky, Sutskever, and Hinton2012] on the ImageNet dataset. We focus on the total compression ratio of the overall DNN model, which depends on the number of weights and the total number of bits for weight representations. We also compare our model compression results with representative works on DNN weight pruning and clustering/quantization. The comparisons show that we achieve a significant improvement in DNN model compression. We implement our LeNet5 experiments in TensorFlow [Abadi et al.2016] and our AlexNet (and VGGNet) experiments in Caffe [Jia et al.2014]. Our experiments are carried out on GeForce GTX 1080Ti and NVIDIA Tesla P100 GPUs.
We initialize ADMM with the pretrained models of LeNet5 and AlexNet. The penalty parameters $\rho_i$ are set separately for LeNet5 and for AlexNet, and on a given network we use the same penalty parameters for weight pruning and for weight clustering/quantization.
LeNet5 on MNIST Dataset
We first present the weight pruning and quantization/clustering results on the LeNet5 model. The overall results on model size compression are shown in Table 1, while the layer-wise results are shown in Table 3. For ADMM-based weight pruning alone, we achieve up to 167× weight reduction without accuracy loss, which is notably higher than prior work such as [Han et al.2015] (12×), [Zhang et al.2018a] (24.1×, but on a different model, LeNet-300-100), and [Aghasi et al.2017] (45.7×, with 0.5% accuracy degradation).
For ADMM-based joint weight pruning and quantization, we simultaneously achieve 88× weight reduction through pruning and use an average of 2.4 bits for quantization, without accuracy loss. In terms of weight data storage, the compression ratio reaches 1,910× compared with the original LeNet5. This is a clearly impressive result, considering that each MNIST sample has 784 pixels and even logistic regression has 7.84K weights (784 inputs × 10 classes). When indices (required in weight pruning) are accounted for, the whole model size reduction becomes 623×. We mainly compare with [Han et al.2016] because there is little prior work on joint weight pruning and quantization/clustering. We observe significant improvements in both weight pruning and quantization compared with the prior work, demonstrating the effectiveness of the ADMM framework.
When weight clustering is applied, we use the same number of bits for CONV and FC layers as in quantization, and an accuracy improvement is achieved. Note that further reduction in bit representation is difficult to achieve without accuracy degradation (because the FC layer is already quantized/clustered to 2 bits). Hence quantization will be sufficient if a 0.1% accuracy difference is not a design consideration.
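Table 3: Layer-wise weight pruning and quantization results on LeNet5.
Layer | No. of weights | No. of weights after pruning | Percentage of weights after pruning | Weight bits
conv1 | 0.5K | 0.1K | 20% | 5
conv2 | 25K | 1.33K | 5.3% | 3
fc1 | 400K | 0.8K | 0.2% | 2
fc2 | 5K | 0.35K | 7% | 3
Total | 430.5K | 2.58K | 0.6% | 2.4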
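The weight data size reported for LeNet5 can be checked with simple arithmetic from the layer-wise table: multiply each layer's remaining weights by its bit width and sum. The sketch below does this under the assumption that "K" means 1,000 and that the baseline uses 32-bit weights; the variable names are hypothetical.

```python
# (remaining weights, bits per weight) for each LeNet5 layer, from the layer-wise table.
layers = {
    "conv1": (100, 5),
    "conv2": (1330, 3),
    "fc1": (800, 2),
    "fc2": (350, 3),
}

total_bits = sum(n * b for n, b in layers.values())
data_bytes = total_bits / 8                  # compressed weight data, in bytes

baseline_bytes = 430_500 * 32 / 8            # 430.5K weights at 32 bits each
ratio = baseline_bytes / data_bytes

print(total_bits)                            # total bits of weight data
print(data_bytes / 1000)                     # size in KB
print(round(ratio))                          # compression ratio from unrounded sizes
```

The result is about 0.89KB, matching the table. The ratio computed from unrounded sizes is roughly 1,929×, and the reported 1,910× appears to follow from dividing the rounded figures 1.7MB and 0.89KB.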
AlexNet on ImageNet Dataset
In this section we present the weight pruning and quantization/clustering results on the AlexNet model. The overall results on model size compression are shown in Table 2, while the layer-wise results are shown in Table 4. For ADMM-based weight pruning alone, we achieve 24.7× weight reduction without accuracy loss, which is notably higher than prior work such as [Han et al.2015] (9×), [Yu et al.2017] (10×), and [Dai, Yin, and Jha2017] (15.7×, but starting from a smaller DNN than the original AlexNet). We also performed testing on VGGNet and achieved similar results on weight pruning and quantization/clustering. For example, the weight pruning ratio is 23.4× without accuracy loss. Further results are omitted due to space limitations.
For ADMM-based joint weight pruning and quantization, we simultaneously achieve 24.7× weight reduction through pruning and use an average of 3.4 bits (not accounting for the first and last layers, similar to prior work) for weight quantization, without accuracy loss. In terms of weight data storage, the compression ratio reaches 210× compared with the original AlexNet model, which is also a significant improvement. When indices in weight pruning are accounted for, the whole model size reduction becomes 90×. Compared with [Han et al.2015], we observe significant improvements in both weight pruning and quantization, demonstrating the effectiveness of the ADMM-based framework.
Finally, when it comes to weight clustering, we use the same number of bits for CONV and FC layers as in quantization, and an accuracy improvement is observed. Again, because further reduction in bit representation is difficult without accuracy loss, weight quantization will be sufficient for most application domains.
Table 4: Layer-wise weight pruning and quantization results on AlexNet.
Layer | No. of weights | No. of weights after pruning | Percentage of weights after pruning | Weight bits
conv1 | 34.8K | 28.19K | 81% | 8
conv2 | 307.2K | 61.44K | 20% | 5
conv3 | 884.7K | 168.09K | 19% | 5
conv4 | 663.5K | 132.7K | 20% | 5
conv5 | 442.4K | 88.48K | 20% | 5
fc1 | 37.7M | 0.75M | 2% | 3
fc2 | 16.8M | 0.91M | 5.4% | 3
fc3 | 4.1M | 0.33M | 8% | 8
Total | 60.9M | 2.47M | 4.06% | 3.4
Conclusion
In this paper, we present a unified framework of DNN weight pruning and weight clustering/quantization using ADMM. When we focus on weight pruning alone, we achieve 167× weight reduction in LeNet5, 24.7× in AlexNet, and 23.4× in VGGNet without accuracy loss. For the combination of DNN weight pruning and clustering/quantization, we achieve 1,910× and 210× storage reduction of weight data on LeNet5 and AlexNet, respectively, without accuracy loss.
References
[Abadi et al.2016] Abadi, M.; Agarwal, A.; Barham, P.; et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
[Aghasi et al.2017] Aghasi, A.; Abdi, A.; Nguyen, N.; and Romberg, J. 2017. Net-trim: Convex pruning of deep neural networks with performance guarantee. In Advances in Neural Information Processing Systems, 3177–3186.
[Boyd et al.2011] Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3(1):1–122.
[Cheng et al.2015] Cheng, Y.; Yu, F. X.; Feris, R. S.; Kumar, S.; Choudhary, A.; and Chang, S.-F. 2015. An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, 2857–2865.
[Dai, Yin, and Jha2017] Dai, X.; Yin, H.; and Jha, N. K. 2017. Nest: A neural network synthesis tool based on a grow-and-prune paradigm. arXiv preprint arXiv:1711.02017.
[Guo, Yao, and Chen2016] Guo, Y.; Yao, A.; and Chen, Y. 2016. Dynamic network surgery for efficient dnns. In Advances in Neural Information Processing Systems, 1379–1387.
[Han et al.2015] Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), 1135–1143.
[Han et al.2016] Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M. A.; and Dally, W. J. 2016. Eie: Efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, 243–254. IEEE.
[Han, Mao, and Dally2016] Han, S.; Mao, H.; and Dally, W. J. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations (ICLR).
[Jia et al.2014] Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, 675–678. ACM.
[Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.
[LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
[Leng et al.2017] Leng, C.; Li, H.; Zhu, S.; and Jin, R. 2017. Extremely low bit neural network: Squeeze the last bit out with admm. arXiv preprint arXiv:1707.09870.
[Park, Ahn, and Yoo2017] Park, E.; Ahn, J.; and Yoo, S. 2017. Weighted-entropy-based quantization for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Sindhwani, Sainath, and Kumar2015] Sindhwani, V.; Sainath, T.; and Kumar, S. 2015. Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems, 3088–3096.
[Takapoui et al.2017] Takapoui, R.; Moehle, N.; Boyd, S.; and Bemporad, A. 2017. A simple effective heuristic for embedded mixed-integer quadratic programming. International Journal of Control 1–11.
[Wen et al.2016] Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, 2074–2082.
[Wen et al.2017] Wen, W.; Xu, C.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2017. Coordinating filters for faster deep neural networks. CoRR, abs/1703.09746.
[Yang, Chen, and Sze2016] Yang, T.-J.; Chen, Y.-H.; and Sze, V. 2016. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128.
[Ye et al.2018] Ye, S.; Zhang, T.; Zhang, K.; Li, J.; Xu, K.; Yang, Y.; Yu, F.; Tang, J.; Fardad, M.; Liu, S.; et al. 2018. Progressive weight pruning of deep neural networks using admm. arXiv preprint arXiv:1810.07378.
[Yu et al.2017] Yu, X.; Liu, T.; Wang, X.; and Tao, D. 2017. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7370–7379.
[Zhang et al.2018a] Zhang, D.; Wang, H.; Figueiredo, M.; and Balzano, L. 2018a. Learning to share: Simultaneous parameter tying and sparsification in deep learning.
[Zhang et al.2018b] Zhang, T.; Ye, S.; Zhang, K.; Tang, J.; Wen, W.; Fardad, M.; and Wang, Y. 2018b. A systematic dnn weight pruning framework using alternating direction method of multipliers. arXiv preprint arXiv:1804.03294.
[Zhang et al.2018c] Zhang, T.; Zhang, K.; Ye, S.; Li, J.; Tang, J.; Wen, W.; Lin, X.; Fardad, M.; and Wang, Y. 2018c. Adam-admm: A unified, systematic framework of structured weight pruning for dnns. arXiv preprint arXiv:1807.11091.
[Zhao et al.2017] Zhao, L.; Liao, S.; Wang, Y.; Li, Z.; Tang, J.; Pan, V.; and Yuan, B. 2017. Theoretical properties for neural networks with weight matrices of low displacement rank. arXiv preprint arXiv:1703.00144.
[Zhou et al.2016] Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; and Zou, Y. 2016. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
[Zhou et al.2017] Zhou, A.; Yao, A.; Guo, Y.; Xu, L.; and Chen, Y. 2017. Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044.