1 Introduction
Deep neural networks (DNNs) are both computationally and storage intensive [1, 2]. A number of prior works have focused on developing model compression techniques for DNNs. These techniques, which are applied during the training phase of the DNN, aim to simultaneously reduce the model size (and thus the storage requirement) and accelerate inference computation, all with minor classification accuracy (or prediction quality) loss. Indeed, the accuracy of a DNN inference engine after model compression is typically higher than that of a shallow neural network with no compression [3, 4]. Two important categories of DNN model compression techniques are weight pruning and weight quantization.

An early work on weight pruning of DNNs was done by Han et al. [3]. It is an iterative heuristic method, achieving a 9× reduction in the number of weights of the AlexNet model (for the ImageNet dataset). This weight pruning method has been extended in [5, 6, 7, 8, 4, 9] to either use more sophisticated algorithms to achieve a higher weight pruning rate, or to incorporate certain regularity or "structures" into the weight pruning framework. Weight quantization of DNNs has also been investigated in many recent works [10, 11, 12, 13, 14, 15, 16, 17]. Both storage and computational requirements of DNNs have been greatly reduced with tolerable accuracy loss. Indeed, multiplication operations (which are costly) may be eliminated when using binary, ternary, or power-of-2 weight quantization [15, 16, 17].
To overcome the highly heuristic nature of prior weight pruning work, a recent work [18] developed a systematic framework of DNN weight pruning using the advanced optimization technique ADMM (Alternating Direction Method of Multipliers) [19, 20]. Through the adoption of ADMM, the original weight pruning problem is decomposed into two subproblems: one is effectively solved using stochastic gradient descent, as in original DNN training, while the other is solved optimally and analytically via Euclidean projection [18]. This method achieves state-of-the-art weight pruning results: 21× weight reduction in AlexNet and 71.2× in LeNet-5 without accuracy loss. However, the direct application of the ADMM technique lacks a guarantee on solution feasibility (satisfying all constraints) due to the non-convex nature of the objective (loss) function, and there is also room for improvement in solution quality (in terms of pruning rate under the same accuracy).
In this work, we first make the following extensions to the one-shot ADMM-based weight pruning of [18]: (i) we develop an integrated framework of dynamic ADMM regularization and masked mapping and retraining steps, thereby guaranteeing solution feasibility and providing high solution quality; (ii) we incorporate a multi-ρ updating technique for faster (and better) ADMM convergence; and (iii) we generalize to a unified framework applicable to both weight pruning and weight quantization. These extensions already provide higher performance than [18].
Beyond the above extensions, we observe the opportunity to perform further weight pruning starting from the results of the one-shot ADMM-based weight pruning framework. This is due to a special property of the ADMM regularization process. A similar observation applies to the weight quantization problem, and both suggest a progressive, multi-step model compression framework using ADMM. In the progressive framework, the pruning/quantization results from the previous step serve as intermediate results and the starting point for the subsequent step. This has the additional benefit of reducing the search space for weight pruning/quantization within each step. The detailed procedure and hyperparameter determination process have been carefully designed towards ultra-high weight pruning and quantization rates.
Extensive experimental results demonstrate that the proposed progressive framework consistently outperforms prior work. Some highlights: (i) we achieve 246×, 36×, and 8× weight pruning on the LeNet-5, AlexNet, and ResNet-50 models, respectively, with (almost) zero accuracy loss; (ii) even a significant 61× weight pruning rate on AlexNet (ImageNet) results in only minor degradation in actual accuracy compared with prior work; (iii) we are among the first to derive notable weight pruning results for ResNet and MobileNet models; (iv) we derive the first lossless, fully binarized (for all layers) LeNet-5 for MNIST and VGG-16 model for CIFAR-10; and (v) we derive the first fully binarized (for all layers) ResNet model for ImageNet with reasonable accuracy loss. Our models and sample code are released at https://bit.ly/2TYx7Za.
2 Related Work
Weight pruning. An early work on weight pruning is [3]. It uses a heuristic, iterative method to prune weights of small magnitudes and retrain the DNN. It achieves a 9× reduction in the number of weights on AlexNet for the ImageNet dataset without accuracy degradation. However, this work achieves a relatively low compression rate (2.7× for AlexNet) on CONV layers, which are the key computational part of state-of-the-art DNNs [21, 22]. Besides, indices are needed, at least one per weight, to encode the relative location of the next weight. This method has been extended in two directions. The first is to improve the reduction in the number of weights by using more sophisticated heuristics, e.g., incorporating both weight pruning and growing [7], using regularization [4], or a genetic algorithm [23]. The second is to enhance the actual implementation efficiency by deriving an effective trade-off between accuracy and compression rate, e.g., energy-aware pruning [24], or by incorporating regularity in weight pruning, e.g., the channel pruning [9] and structured sparsity learning [4] approaches.

Weight quantization. This method leverages the inherent redundancy in the number of bits for weight representation. Much of the prior work [10, 11, 12, 13, 14, 15, 16, 17] is directed at quantizing weights to binary values, ternary values, or powers of 2 to facilitate hardware implementations, with acceptable accuracy loss. The state-of-the-art techniques [17, 10] adopt an iterative quantization and retraining framework, with some degree of randomness incorporated into the quantization step. This method results in less than 3% accuracy loss on AlexNet for binary weight quantization [10].
3 Overall Framework of Progressive DNN Model Compression
Figure 1 illustrates the proposed progressive DNN weight pruning and weight quantization framework. The one-shot ADMM-based weight pruning or quantization is performed multiple times, each time as one step in the progressive framework. The pruning/quantization results from the previous step serve as intermediate results and the starting point for the subsequent step. As discussed before, the reasons to develop a progressive model compression framework are twofold: (i) the fact that many weights are close to zero after ADMM regularization enables further weight pruning (a similar observation applies to quantization); and (ii) the multi-step procedure reduces the search space for weight pruning/quantization within each step.
Through extensive investigations, we conclude that a two-step progressive procedure is in general sufficient for weight pruning and quantization, in which each step requires approximately the same number of training epochs as original DNN training. A further increase in the number of steps, or in the number of epochs in each step, results in only marginal improvement in the overall solution quality (e.g., 0.1%–0.2% accuracy improvement).
The detailed description of the proposed progressive framework is presented in Section 4 and Section 5. Section 4 presents the proposed single-step, ADMM-based weight pruning and quantization framework, as an extension of [18] that guarantees solution feasibility and generalizes to weight quantization as well. Section 5 presents the motivation, detailed procedure, and hyperparameter determination of the proposed progressive model compression framework, along with an illustration of why "progressive" is the key to ultra-high compression rates.
4 Single-Step, ADMM-Based Weight Pruning and Quantization
4.1 Optimization Problem Formulation
Consider an $N$-layer DNN with both CONV and FC layers. The weights and biases of the $i$-th layer are respectively denoted by $\mathbf{W}_i$ and $\mathbf{b}_i$, and the loss function associated with the DNN is denoted by $f(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N)$; see [18]. In this paper, $\{\mathbf{W}_i\}$ and $\{\mathbf{b}_i\}$ respectively characterize the collection of weights and biases from layer $1$ to layer $N$. Then DNN weight pruning or weight quantization is formulated as the following optimization problem:

$$\underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\text{minimize}}\ \ f(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N), \quad (1)$$

$$\text{subject to}\ \ \mathbf{W}_i \in \mathcal{S}_i,\ \ i = 1, \ldots, N.$$
For weight pruning, the constraint set is $\mathcal{S}_i = \{\mathbf{W}_i \mid \mathrm{card}(\mathrm{supp}(\mathbf{W}_i)) \le \alpha_i\}$, where 'card' refers to cardinality and 'supp' refers to the support set. Elements in $\mathcal{S}_i$ are solutions in which the number of nonzero elements in $\mathbf{W}_i$ is limited by $\alpha_i$ for layer $i$. The $\alpha_i$ values are hyperparameters, with a determination heuristic given in Section 5. Besides the general, non-structured weight pruning scenario, the constraint set can be extended to incorporate specific "structures" corresponding to structured pruning techniques such as filter pruning, channel pruning, column pruning, etc., with detailed discussions in [25]. Appropriate structured pruning will facilitate high-parallelism implementations in hardware. (The default weight pruning in this paper is the general, non-structured pruning. However, the proposed framework is also applicable to structured weight pruning, with results in the supplementary materials.)
For weight quantization, elements in the constraint set $\mathcal{S}_i$ are solutions in which the elements of $\mathbf{W}_i$ assume one of the values $q_{i,1}, q_{i,2}, \ldots, q_{i,M_i}$, where $M_i$ denotes the number of these fixed values. Here, the $q_{i,j}$ values are quantization levels of the weights of layer $i$ in increasing order, and we focus on equal-distance quantization (the same distance between adjacent quantization levels) to facilitate hardware implementations. For the combination of weight pruning and quantization for DNNs, it is common practice to perform weight pruning first, and then quantization on the remaining, nonzero weights.
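As a concrete illustration (our own sketch, not code from the paper), the two constraint sets can be expressed as simple membership tests on a flattened weight vector; here `alpha` and `levels` play the roles of $\alpha_i$ and the quantization levels $q_{i,j}$:

```python
def in_pruning_set(w, alpha):
    """True if w has at most alpha nonzero entries, i.e. card(supp(w)) <= alpha."""
    return sum(1 for x in w if x != 0.0) <= alpha

def in_quantization_set(w, levels):
    """True if every entry of w equals one of the fixed quantization levels."""
    return all(x in levels for x in w)
```

Both sets are combinatorial (non-convex), which is exactly why problem (1) cannot be handled by plain gradient descent and motivates the ADMM treatment in Section 4.2.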
4.2 A Unified Solution Framework using ADMM
In problem (1) the constraint is combinatorial. As a result, this problem cannot be solved directly by stochastic gradient descent methods as in original DNN training. However, the form of the combinatorial constraints on $\mathbf{W}_i$ is compatible with ADMM, which has recently been shown to be an effective method for dealing with such clustering-like constraints [20, 26].
Despite such compatibility, there is still a challenge in the direct application of ADMM due to the non-convexity of the objective function. To overcome this challenge, we extend [18] and develop a systematic framework of dynamic ADMM regularization together with masked mapping and retraining steps. Through this integration, we can guarantee solution feasibility (satisfying all constraints) and provide high solution quality. This framework is unified, applies to both weight pruning and weight quantization, and acts as one step in the progressive DNN weight pruning/quantization framework.
ADMM Regularization Step: Corresponding to every set $\mathcal{S}_i$, we define the indicator function

$$g_i(\mathbf{W}_i) = \begin{cases} 0 & \text{if } \mathbf{W}_i \in \mathcal{S}_i, \\ +\infty & \text{otherwise.} \end{cases}$$

Furthermore, we incorporate auxiliary variables $\mathbf{Z}_i$, $i = 1, \ldots, N$. The original problem (1) is then equivalent to

$$\underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\text{minimize}}\ \ f(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N) + \sum_{i=1}^N g_i(\mathbf{Z}_i), \quad (2)$$

$$\text{subject to}\ \ \mathbf{W}_i = \mathbf{Z}_i,\ \ i = 1, \ldots, N.$$
Through formation of the augmented Lagrangian [19], the ADMM regularization decomposes problem (2) into two subproblems, and solves them iteratively until convergence. (The details of ADMM are presented in [19, 18]; we omit them due to space limitations.) The first subproblem is

$$\underset{\{\mathbf{W}_i\},\{\mathbf{b}_i\}}{\text{minimize}}\ \ f(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N) + \sum_{i=1}^N \frac{\rho_i}{2}\,\big\|\mathbf{W}_i - \mathbf{Z}_i^k + \mathbf{U}_i^k\big\|_F^2, \quad (3)$$

where $\mathbf{U}_i^k$ denotes the dual variable and $k$ the ADMM iteration index. The first term in the objective function of (3) is the differentiable loss function of the DNN, and the second term is a quadratic regularization term on the $\mathbf{W}_i$'s, which is differentiable and convex. As a result, (3) can be solved by stochastic gradient descent, as in original DNN training. Although we cannot guarantee global optimality, this is due to the non-convexity of the DNN loss function rather than the quadratic term introduced by our method. Note that this first subproblem maintains the same form for the weight pruning and quantization problems.
On the other hand, the second subproblem is given by

$$\underset{\{\mathbf{Z}_i\}}{\text{minimize}}\ \ \sum_{i=1}^N g_i(\mathbf{Z}_i) + \sum_{i=1}^N \frac{\rho_i}{2}\,\big\|\mathbf{W}_i^{k+1} - \mathbf{Z}_i + \mathbf{U}_i^k\big\|_F^2. \quad (4)$$

Note that $g_i$ is the indicator function of $\mathcal{S}_i$; thus this subproblem can be solved analytically and optimally [19]. The optimal solution is the Euclidean projection of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ onto $\mathcal{S}_i$. For weight pruning, we can prove that the Euclidean projection amounts to keeping the $\alpha_i$ elements of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ with the largest magnitudes and setting the remaining weights to zero. For weight quantization, we can prove that the Euclidean projection amounts to mapping every element of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ to the quantization level closest to that element.
After both subproblems are solved, we update the dual variables according to the ADMM rule [19], $\mathbf{U}_i^{k+1} = \mathbf{U}_i^k + \mathbf{W}_i^{k+1} - \mathbf{Z}_i^{k+1}$, thereby completing one iteration of ADMM regularization.
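For intuition, the ADMM regularization loop can be sketched on a toy problem. The following pure-Python sketch is our illustration, not the paper's code: it replaces the DNN loss with the quadratic toy loss $0.5\|w - t\|^2$, whose regularized minimizer in subproblem (3) has a closed form (in the real framework this subproblem is solved by SGD), and uses the cardinality constraint $\mathrm{card}(\mathrm{supp}(w)) \le \alpha$:

```python
def prune_projection(v, alpha):
    # Euclidean projection onto {card(supp(v)) <= alpha}:
    # keep the alpha largest-magnitude entries, zero the rest
    keep = set(sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:alpha])
    return [v[j] if j in keep else 0.0 for j in range(len(v))]

def admm_prune(t, alpha, rho=1.0, iters=50):
    # toy loss f(w) = 0.5 * ||w - t||^2 stands in for the DNN loss
    n = len(t)
    w = list(t)
    z = prune_projection(w, alpha)
    u = [0.0] * n  # scaled dual variable
    for _ in range(iters):
        # subproblem (3): minimize f(w) + (rho/2) * ||w - z + u||^2
        # (closed form for the quadratic toy loss; SGD in the real framework)
        w = [(t[j] + rho * (z[j] - u[j])) / (1.0 + rho) for j in range(n)]
        # subproblem (4): Euclidean projection of w + u onto the constraint set
        z = prune_projection([w[j] + u[j] for j in range(n)], alpha)
        # dual update: u <- u + w - z
        u = [u[j] + w[j] - z[j] for j in range(n)]
    return w, z
```

On, say, `t = [3.0, 0.2, -0.1]` with `alpha = 1`, the iterates drive the small entries of `w` toward zero while `z` stays feasible, mirroring how ADMM regularization pushes soon-to-be-pruned DNN weights toward zero.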
Increasing $\rho_i$ in ADMM regularization: The $\rho_i$ values are the most critical hyperparameters in ADMM regularization. We start from small $\rho_i$ values and gradually increase them with ADMM iterations. This coincides with the theory of ADMM convergence [20, 26]. In general it takes 8–12 ADMM iterations to converge (more iterations for weight pruning and fewer for weight quantization), corresponding to 100–150 epochs in PyTorch. This convergence rate is comparable with original DNN training.
Masked mapping and retraining: After ADMM regularization, we obtain intermediate solutions. The subsequent masked mapping and retraining step guarantees solution feasibility and improves solution quality. For weight pruning, the procedure is straightforward. We first perform the aforementioned Euclidean projection (mapping) to guarantee that the pruning constraints are satisfied. Next, we mask the zero weights and retrain the DNN over the nonzero weights using the training set (while keeping the masked weights at zero). In this way, test accuracy (solution quality) can be (partially) restored, and solution feasibility (the constraints) is maintained.
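A minimal sketch of the masked retraining idea (ours, with a toy quadratic loss standing in for the DNN loss): the mask zeroes the gradient of every pruned weight, so gradient updates can never revive them and feasibility is preserved throughout retraining:

```python
def masked_retrain(w, target, mask, lr=0.1, steps=200):
    """Gradient descent on the toy loss 0.5*||w - target||^2, with pruned
    positions (mask[j] == 0) frozen at exactly zero."""
    w = [w[j] * mask[j] for j in range(len(w))]
    for _ in range(steps):
        grad = [w[j] - target[j] for j in range(len(w))]
        # apply the mask to the gradient so pruned weights stay zero
        w = [w[j] - lr * grad[j] * mask[j] for j in range(len(w))]
    return w
```

In a real DNN the same effect is obtained by multiplying each layer's gradient elementwise with its binary pruning mask before the optimizer step.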
For weight quantization, the procedure is more complicated. The reason is that the retraining process would affect the quantization results, and thereby solution feasibility. To deal with this issue, we first perform the Euclidean projection (mapping) of those weights that are close enough (as defined by a threshold value) to nearby quantization levels. Then we retrain the remaining, unquantized weights (with the quantized weights fixed) for accuracy improvement. Finally, we perform the Euclidean mapping on the remaining weights as well. In this way, solution feasibility is guaranteed.
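The two-pass mapping can be sketched as follows (our illustration; `eps` is the closeness threshold, and the retraining between the two passes is omitted from the sketch):

```python
def masked_quantize(w, levels, eps):
    out, free = list(w), []
    # pass 1: snap weights already within eps of their nearest quantization
    # level and fix them; the rest remain free
    for j, x in enumerate(w):
        q = min(levels, key=lambda l: abs(l - x))
        if abs(q - x) <= eps:
            out[j] = q
        else:
            free.append(j)
    # (the real framework retrains the free weights here, with the
    # quantized weights fixed, before the final mapping)
    # pass 2: map the remaining free weights to their nearest levels
    for j in free:
        out[j] = min(levels, key=lambda l: abs(l - out[j]))
    return out
```

After the second pass every weight equals a quantization level, so the feasibility constraint of Section 4.1 holds by construction.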
4.3 Explanation of Effectiveness in the Deep Learning Context
The proposed solution framework differs from the conventional utilization of ADMM, i.e., to accelerate the convergence of an originally convex problem [19, 27]. Rather, we integrate the ADMM framework with stochastic gradient descent. Aside from recent mathematical optimization results [20, 26] illustrating the advantage of ADMM with combinatorial constraints, the advantage of the proposed solution framework can be explained in the deep learning context as described below.
The proposed solution of (3) can be understood as a smart, dynamic regularization method, in which the regularization target changes judiciously and analytically in each iteration. In contrast, conventional regularization methods (based on the $\ell_1$ or $\ell_2$ norm or their combinations) use a fixed regularization target, and the penalty is applied to all weights. This inevitably causes accuracy degradation. More illustrations of ADMM-based dynamic regularization vs. conventional, fixed regularization are provided in Section 5.3.
5 Progressive DNN Model Compression Framework: Detailed Procedure
5.1 Motivation
During the implementation of the one-shot weight pruning framework described in Section 4, we observe that a number of unpruned weights have values very close to zero. The reason is the regularization nature of the ADMM regularization step, which tends to generate very small, nonzero weight values even for weights that are not pruned. As the remaining number of nonzero weights is already significantly reduced during weight pruning, simply mapping these small-value weights to zero would result in accuracy degradation. On the other hand, this motivates us to perform weight pruning (and quantization) in a multi-step, progressive manner. For weight pruning, the weights that have been pruned in the previous step are masked, and only the remaining, nonzero weights are considered in the subsequent step. For weight quantization, we perform quantization on the weights in a subset of layers, fix these quantization results, and quantize the remaining layers in the subsequent step.
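A procedural sketch of the progressive pruning masking (ours; the ADMM regularization and retraining between the steps, from which the real benefit comes, are omitted):

```python
def prune_topk(w, k):
    # keep the k largest-magnitude weights, zero (and mask) the rest
    keep = set(sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k])
    return [w[j] if j in keep else 0.0 for j in range(len(w))]

def progressive_prune(w, k1, k2):
    w = prune_topk(w, k1)      # step 1: prune down to k1 nonzero weights
    # (ADMM regularization and masked retraining of the survivors occur here,
    # pushing more of them toward zero before the next step)
    return prune_topk(w, k2)   # step 2: prune the survivors down to k2
```

Because step 2 only searches among the `k1` survivors rather than all original weights, the per-step search space shrinks, which is the second motivation below.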
A second motivation for the progressive framework is to reduce the search space for weight pruning/quantization within each step. After all, weight pruning and quantization problems are essentially combinatorial optimizations. Although recently demonstrated to generate superior results on this type of problem [20, 26], the ADMM-based solution still exhibits a superlinear increase of computational complexity as a function of the solution space. As a result, the complexity becomes very high at ultra-high compression rates (i.e., very large search spaces) beyond what can be achieved in prior work. The progressive framework, on the other hand, mitigates this limitation and reduces the total training time (to 2× or slightly more than the training time of the original DNN).

5.2 Detailed Procedure and Hyperparameter Determination
Through extensive investigations, we conclude that a two-step progressive procedure is in general sufficient for weight pruning and quantization, in which each step requires approximately the same number of training epochs as original DNN training. We have conducted experiments on the CIFAR-10 and ImageNet benchmarks (AlexNet and ResNet-18 models) on the relative accuracy of a two-step vs. a three-step procedure, in which each step uses 120 epochs of training in PyTorch. The results show that the three-step procedure yields only marginal improvement in the overall solution quality, i.e., an accuracy improvement no greater than 0.2%. This makes the additional training time not worthwhile.
Hyperparameter Determination and Sensitivity Analysis: A critical question is how to determine the hyperparameters in a highly efficient and reliable manner. This problem is challenging for weight pruning, because we need to determine both the target overall pruning rate and the specific pruning rate for each layer, both required in the ADMM-based solution. For quantization it is relatively straightforward, as the target number of quantization bits is typically prespecified (binary, ternary, 2-bit, etc.), and the same number of quantization bits for all layers is in general preferred in hardware. The objective is to minimize accuracy loss. As a result, the two-step procedure for weight quantization can be performed as follows: the first step quantizes all the weights except those of the first and last layers, while the second step quantizes these two layers. This is because quantization of these two layers has a more significant impact on the overall accuracy.
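The two-step layer ordering described above can be sketched as follows (our illustration, with hypothetical layer names):

```python
def quantization_schedule(layers):
    """Step 1 quantizes all layers except the first and last; step 2 then
    quantizes those two accuracy-sensitive layers (their weights fixed last)."""
    return [layers[1:-1], [layers[0], layers[-1]]]
```

Each sublist is one progressive step; the weights quantized in step 1 are frozen while the first and last layers are quantized in step 2.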
Let us focus again on the hyperparameter determination heuristic for the weight pruning problem. Experiments demonstrate that at least a 2× to 3× improvement in overall pruning rate can be achieved compared with the prior work [3], under the same accuracy or without accuracy loss. Likewise, at least a 50% improvement in pruning rate can be achieved compared with the prior one-shot ADMM-based weight pruning [18]. As a result, a simple but effective hyperparameter determination method is as follows. We set the target overall pruning rate in the first ADMM-based weight pruning step to be around 1.5× what can be achieved (without accuracy loss) in prior work [3], or slightly lower than the final result in [18]. The target overall pruning rate in the second step is doubled compared with the first step, or increased even further if there is still room for improvement. The per-layer pruning rates are inherited from the results in prior work and increased proportionally. According to our experiments, the above heuristic consistently generates higher pruning rates than prior work without accuracy loss.
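The heuristic above amounts to simple arithmetic; a sketch (ours), with the layer rates as illustrative numbers:

```python
def target_overall_rates(prior_rate):
    # step 1: ~1.5x the lossless pruning rate of prior work [3];
    # step 2: double the step-1 target (raised further if accuracy permits)
    step1 = 1.5 * prior_rate
    return step1, 2.0 * step1

def scale_per_layer(prior_layer_rates, prior_overall, target_overall):
    # per-layer rates are inherited from prior work and scaled proportionally
    s = target_overall / prior_overall
    return [r * s for r in prior_layer_rates]
```

For example, starting from the 9× lossless rate of [3] on AlexNet, the step-1 target is 13.5× and the step-2 target 27×.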
We have further conducted two experiments to demonstrate the stability of hyperparameter (per-layer pruning rate) selection. The detailed experimental setup and results are provided in the supplementary materials. The general conclusions are: (i) a certain degree of variation in the per-layer pruning rates has only minor impact on the overall accuracy under the ADMM-based framework; and (ii) for very deep DNNs such as ResNet-50, uniform pruning rates for all layers result in reasonably good overall pruning results. These results demonstrate the robustness of the hyperparameter determination process.
Although the above discussion is based on general, non-structured weight pruning, the hyperparameter determination method is also applicable to structured pruning.
5.3 Discussions and Illustration of Effectiveness through Weight Pruning
Using the AlexNet model on the ImageNet dataset as an example, Figure 3 shows the Top-5 accuracy loss vs. overall pruning rate for various methods, including our proposed progressive framework, our enhanced one-shot ADMM-based pruning, the iterative pruning and retraining reported in [3], and fixed regularization and projected gradient descent (PGD). Figure 4 shows the absolute Top-5 accuracy. Note that we use a baseline AlexNet model with 60.0% Top-1 accuracy and 82.2% Top-5 accuracy, both higher than prior work such as [3, 18] (57.2% Top-1 and 80.2% Top-5). This reflects recent advances in DNN training in PyTorch. As a result, our definition of accuracy loss (or lossless) is with respect to this enhanced accuracy. In other words, we aim to surpass the prior methods in both absolute accuracy and relative accuracy loss.
We can clearly observe the performance ranking of these techniques. The proposed progressive framework outperforms all other methods, followed by one-shot ADMM-based pruning, then the iterative pruning and retraining heuristic, and last the fixed regularization and PGD methods. We know from Section 4.3 that fixed regularization and PGD suffer from penalizing all weights, even those that are not pruned, thereby resulting in notable accuracy degradation. How, then, can the performance gap among the other techniques be explained?
To answer this question, we use Figure 5 as an illustration. The weight pruning problem can be understood as a partitioning problem, in which the weights are partitioned into two parts: one part is mapped entirely to zero, while the other part is utilized to restore accuracy. The straightforward iterative pruning method performs the partitioning based only on the absolute values of the weights, with smaller ones mapped to zero. The ADMM-based weight pruning method, on the other hand, derives the partitioning using effective mathematical optimization, thereby achieving higher pruning rates without accuracy loss. A new challenge then arises from the high complexity of deriving such a partitioning when the pruning rates become ultra-high, and this challenge is effectively mitigated by the progressive method, which reduces the search space within each step.
6 Experimental Results
In this section, we comprehensively evaluate the proposed progressive DNN model compression framework on the ImageNet ILSVRC-2012, CIFAR-10, and MNIST datasets, using the AlexNet [1], VGGNet [2], ResNet-18/ResNet-50 [22], MobileNet V2 [28], and LeNet-5 DNN models, and comparing with various prior methods including one-shot ADMM. Our implementations are based on PyTorch, and the baseline accuracies are in many cases higher than those utilized in prior work, such as AlexNet and ResNet-50 for ImageNet and VGGNet and MobileNet V2 for CIFAR-10. We maintain a fair comparison because we focus on the relative accuracy with respect to our baseline instead of the absolute accuracy (which would of course outperform prior work).
Thanks to the compatibility of the proposed framework with DNN training, directly training a DNN model using the proposed framework gives the same result as starting from a pretrained DNN model. When a pretrained DNN model is utilized, we limit the number of epochs in each of the two steps of the progressive framework to 120, similar to original DNN training in PyTorch and much lower than the iterative pruning heuristic [3]. We use the hyperparameter determination procedure discussed in Section 5.2. The training and model compression are performed in PyTorch using NVIDIA 1080Ti, 2080, and Tesla P100 GPUs.
Due to space limitations, in this section we only present results on general, non-structured weight pruning and sample results on binary quantization. More comprehensive results on structured weight pruning, the combination of weight pruning and quantization, and convergence analysis are provided in the supplementary materials.
6.1 Experimental Results on Weight Pruning
6.1.1 Results on ImageNet Dataset
Table 1: Weight pruning comparison on the overall AlexNet model (ImageNet). (Accuracy values lost in extraction are marked "—".)

Method  Top-5 accuracy  Relative accuracy loss  Overall pruning rate
SVD [29]  —  —  5.1×
Iter. prun. [3]  —  —  9.1×
NeST [5]  —  —  15.7×
Dyn. surg. [7]  —  —  17.7×
One-shot ADMM [18]  —  —  17.7×
Our one-shot  —  —  15×
Our one-shot  —  —  36×
Our method  —  —  36×
Our method  —  —  44×
Our method  —  —  61×
Table 2: Weight pruning comparison on the CONV layers of AlexNet (ImageNet). (Accuracy values lost in extraction are marked "—".)

Method  Top-5 accuracy  Relative accuracy loss  CONV pruning rate
Iter. prun. [3]  —  —  2.7×
Dyn. surg. [7]  —  —  3.1×
NeST [5]  —  —  3.2×
Fine-grained [30]  —  —  4.2×
Method of [4]  —  —  5.0×
Our method  —  —  8.6×
Our method  —  —  11.2×
AlexNet Results: Table 1 compares the overall pruning rate of the whole AlexNet model (CONV and FC layers) vs. accuracy for the proposed progressive framework and various prior methods. It can be clearly observed that the proposed framework outperforms prior methods, including the one-shot ADMM method [18]. With almost no Top-5 accuracy loss (note our high baseline accuracy), we achieve a 36× overall pruning rate. We achieve a notable 61× weight reduction with 79.7% Top-5 accuracy, just slightly below the baseline accuracy of prior work. We can clearly observe the advantage over the one-shot ADMM method: at the same accuracy, the progressive framework achieves 61× weight reduction while our extended one-shot method achieves "only" 36×. This 36× one-shot result was derived using the same number of total training epochs as the progressive framework.
Table 2 compares the pruning rates on the CONV layers vs. Top-5 accuracy, since the CONV layers are the most computationally intensive part of state-of-the-art DNNs. We achieve 8.6× pruning in the CONV layers with even a slight accuracy enhancement, and 11.2× pruning with minor accuracy loss, consistently outperforming prior work in CONV-layer weight pruning.
VGG-16 Results: We conduct experiments on VGG-16 for the ImageNet dataset, with results similar to AlexNet. We achieve a 34× overall weight reduction without accuracy loss, which is higher than the 13× using iterative pruning [3], the 15× in [31], or the 19.9× using our extended one-shot ADMM (no corresponding results are reported in [18]). The detailed table is omitted due to space limitations.
Table 3: Uniform weight pruning results on ResNet-50 (ImageNet).

Method  Top-5 acc. loss  Pruning rate
Uncompressed  0.0%  1×
Fine-grained [30]  0.1%  2.6×
Our one-shot  0.0%  4.5×
Our method  0.0%  8×
Our method  0.7%  17.4×
ResNet-18/ResNet-50 Results: We conduct experiments on the ResNet-18 and ResNet-50 models for the ImageNet dataset. As effective pruning results for these models were previously lacking, we conduct uniform weight pruning (the same pruning rate for all CONV and FC layers) to show the effectiveness even with less-optimized individual-layer pruning rates. The results are shown in Table 3. We achieve an 8× overall pruning rate (and thus also an 8× pruning rate on CONV layers) on ResNet-50, without accuracy loss. We also achieve a 6× overall pruning rate (and 6× on CONV layers) on ResNet-18, without accuracy loss. These results clearly outperform the prior work, which has a limited overall pruning rate and does not report the CONV-layer rate. They also outperform our one-shot ADMM-based method, which achieves 4.5× uniform weight pruning on all layers (CONV and FC) of ResNet-50.
6.1.2 Results on CIFAR10 Dataset
VGG-16 Results: We conduct experiments on VGG-16 using the CIFAR-10 dataset. The baseline accuracy is 93.7%, which is higher than those in prior work, e.g., 90.2% in [32] or 84.8% in [33]. We only present our own results due to the lack of prior work for a fair comparison. We achieve 11.5× overall weight pruning without accuracy loss, or 40.3× with an accuracy loss of 0.8%.
MobileNet V2 Results: We conduct experiments on MobileNet V2 using the CIFAR-10 dataset. The baseline accuracy is as high as 95.07% due to the adoption of the mixup technique. We present our results in Table 4 due to the lack of prior work for a fair comparison. We achieve 5× weight pruning with almost no accuracy loss, starting from the high-accuracy baseline. We achieve 10× weight pruning (which is highly challenging for MobileNet) with only a 1.3% accuracy loss.
Table 4: Weight pruning results on MobileNet V2 (CIFAR-10).

Method  Accuracy  Pruning rate
Uncompressed  95.07%  1×
Our method  95.49%  3.3×
Our method  94.90%  5×
Our method  94.70%  6.7×
Our method  93.75%  10×
6.1.3 Results on MNIST Dataset
Table 5 shows the comparison results on the LeNet-5 model using the MNIST dataset. Through the progressive framework, we achieve an unprecedented 246× overall weight reduction with almost no accuracy loss. It clearly outperforms one-shot ADMM (71.2× using the prior one-shot ADMM [18] and 85× using our extended one-shot ADMM) and other prior methods. Note that our extended one-shot ADMM-based method also slightly outperforms its prior counterpart [18].
6.2 Sample Results on Weight Quantization
Binary Weight Quantization Results on LeNet-5: To the best of the authors' knowledge, we achieve the first lossless, fully binarized LeNet-5 model, in which the weights in all layers are binarized. The accuracy is still 99.21%, lossless compared with the baseline. We do not list the comparison results due to limited space, but note that our method already achieves the highest reported accuracy. Achieving lossless binarization is challenging even for MNIST: for example, recent work [33] incurs 2.3% accuracy degradation on MNIST for full binarization, with a baseline accuracy of 98.66%.
Weight Quantization on CIFAR-10: We also achieve the first lossless, fully binarized VGG-16 for CIFAR-10, in which the weights in all layers (including the first and the last) are binarized. The accuracy is 93.53%. We would like to point out that fully ternarized quantization results in 93.66% accuracy. Table 6 shows our results and comparisons.
Table 6: Weight quantization results on VGG-16 (CIFAR-10).

Method  Accuracy  Num. of bits
Baseline of [33]  84.80%  32
8-bit [33]  84.07%  8
Binary [33]  81.56%  1
Our baseline  93.70%  32
Our ternary  93.66%  2 (ternary)
Our binary  93.53%  1
Binary Weight Quantization Results on ResNet for the ImageNet Dataset: The binarization of ResNet models on the ImageNet dataset is widely acknowledged as a very challenging task. As a result, there is very limited prior work (e.g., the one-shot ADMM method [10]) with binarization results on ResNet models. As [10] targets ResNet-18 (which is even more challenging than ResNet-50 or larger models), we make a fair comparison on the same model. Table 7 shows the comparison results (Top-5 accuracy loss). In prior work, the first and last layers are by default not quantized (or quantized to 8 bits), as these layers have a significant effect on the overall accuracy. When leaving the first and last layers unquantized, our framework is not progressive but an extended one-shot ADMM-based framework. We observe higher accuracy compared with the prior method under this circumstance (first and last layers unquantized, the rest binarized). The Top-1 accuracy shows a similar result: 3.8% degradation with our extended one-shot method vs. 4.3% in [10].
Table 7: Binary weight quantization comparison (Top-5 accuracy loss) on ResNet-18 for ImageNet.

Method | Relative Top-5 acc. loss | Num. of bits
Uncompressed | 0.0% | 32
One-shot ADMM quantization [10] | 2.9% | 1 (32 for the first and last)
Our method (one-shot) | 2.5% | 1 (32 for the first and last)
Our method | 5.8% | 1
Using the progressive framework, we derive a fully binarized ResNet-18, in which the weights in all layers are binarized. The accuracy degradation is 5.8%, which is noticeable and shows that full binarization of ResNet remains a challenging task even under the progressive framework. We found no prior work to compare against on this result.
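The progressive idea itself can be sketched in a few lines: each ADMM run warm-starts from the previous, less aggressive result rather than from the full-precision model. The `train_admm` callback and the bit schedule below are hypothetical stand-ins for one full ADMM quantization run and for whatever schedule the experiments actually used:

```python
def progressive_quantize(train_admm, weights, bit_schedule=(8, 4, 2, 1)):
    """Progressive (multi-step) quantization: repeatedly run
    ADMM-based quantization while lowering the bit-width, using
    each result to initialize the next, harder step.

    `train_admm(weights, num_bits) -> weights` is a hypothetical
    callback wrapping one one-shot ADMM quantization run.
    """
    for bits in bit_schedule:
        weights = train_admm(weights, bits)  # warm start from prior step
    return weights
```

The warm start is what distinguishes this from repeated one-shot runs: the hardest step (1 bit) never has to bridge the full gap from 32-bit weights in a single optimization.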
7 Conclusion
In this work, we extended the prior one-shot ADMM-based framework into a multi-step, progressive DNN weight pruning and quantization framework, which achieves further weight pruning/quantization and a faster convergence rate. We achieve 246×, 36×, and 8× weight pruning on the LeNet-5, AlexNet, and ResNet-50 models, respectively, with (almost) zero accuracy loss. We also derive the first lossless, fully binarized (all layers) LeNet-5 for MNIST and VGG-16 for CIFAR-10.
References

[1] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. (2012) 1097–1105
[2] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
[3] Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems. (2015) 1135–1143
[4] Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems. (2016) 2074–2082
[5] Dai, X., Yin, H., Jha, N.K.: NeST: A neural network synthesis tool based on a grow-and-prune paradigm. arXiv preprint arXiv:1711.02017 (2017)
[6] Yang, T.J., Chen, Y.H., Sze, V.: Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128 (2016)
[7] Guo, Y., Yao, A., Chen, Y.: Dynamic network surgery for efficient DNNs. In: Advances in Neural Information Processing Systems. (2016) 1379–1387
[8] Dong, X., Chen, S., Pan, S.: Learning to prune deep neural networks via layer-wise optimal brain surgeon. In: Advances in Neural Information Processing Systems. (2017) 4860–4874
[9] He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: International Conference on Computer Vision (ICCV). Volume 2. (2017) 6
[10] Leng, C., Li, H., Zhu, S., Jin, R.: Extremely low bit neural network: Squeeze the last bit out with ADMM. arXiv preprint arXiv:1707.09870 (2017)
[11] Park, E., Ahn, J., Yoo, S.: Weighted-entropy-based quantization for deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
[12] Zhou, A., Yao, A., Guo, Y., Xu, L., Chen, Y.: Incremental network quantization: Towards lossless CNNs with low-precision weights. arXiv preprint arXiv:1702.03044 (2017)
[13] Lin, D., Talathi, S., Annapureddy, S.: Fixed point quantization of deep convolutional networks. In: International Conference on Machine Learning. (2016) 2849–2858
[14] Wu, J., Leng, C., Wang, Y., Hu, Q., Cheng, J.: Quantized convolutional neural networks for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4820–4828
[15] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: European Conference on Computer Vision, Springer (2016) 525–542
[16] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Advances in Neural Information Processing Systems. (2016) 4107–4115
[17] Courbariaux, M., Bengio, Y., David, J.P.: BinaryConnect: Training deep neural networks with binary weights during propagations. In: Advances in Neural Information Processing Systems. (2015) 3123–3131
[18] Zhang, T., Ye, S., Zhang, K., Tang, J., Wen, W., Fardad, M., Wang, Y.: A systematic DNN weight pruning framework using alternating direction method of multipliers. European Conference on Computer Vision (ECCV) (2018)
[19] Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1) (2011) 1–122
[20] Hong, M., Luo, Z.Q., Razaviyayn, M.: Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization 26(1) (2016) 337–364
[21] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (ICLR). (2015)
[22] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 770–778
[23] Dai, X., Yin, H., Jha, N.K.: NeST: A neural network synthesis tool based on a grow-and-prune paradigm. arXiv preprint arXiv:1711.02017 (2017)
[24] Yang, T.J., Chen, Y.H., Sze, V.: Designing energy-efficient convolutional neural networks using energy-aware pruning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 6071–6079
[25] Zhang, T., Zhang, K., Ye, S., Li, J., Tang, J., Wen, W., Lin, X., Fardad, M., Wang, Y.: ADAM-ADMM: A unified, systematic framework of structured weight pruning for DNNs. arXiv preprint arXiv:1807.11091 (2018)
[26] Liu, S., Chen, J., Chen, P.Y., Hero, A.: Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. In: International Conference on Artificial Intelligence and Statistics. (2018) 288–297
[27] Hong, M., Luo, Z.Q.: On the linear convergence of the alternating direction method of multipliers. Mathematical Programming 162(1-2) (2017) 165–199
[28] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 4510–4520
[29] Denton, E.L., Zaremba, W., Bruna, J., LeCun, Y., Fergus, R.: Exploiting linear structure within convolutional networks for efficient evaluation. In: Advances in Neural Information Processing Systems. (2014) 1269–1277
[30] Mao, H., Han, S., Pool, J., Li, W., Liu, X., Wang, Y., Dally, W.J.: Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922 (2017)
[31] Yu, X., Liu, T., Wang, X., Tao, D.: On compressing deep models by low rank and sparse decomposition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 7370–7379
[32] Qin, Z., Yu, F., Liu, C., Chen, X.: Demystifying neural network filter pruning. arXiv preprint arXiv:1811.02639 (2018)
[33] Cheng, H.P., Huang, Y., Guo, X., Huang, Y., Yan, F., Li, H., Chen, Y.: Differentiable fine-grained quantization for deep neural network compression. arXiv preprint arXiv:1810.10351 (2018)
Supplementary Materials
Stability of Hyperparameter
We test the stability of our hyperparameter on VGG-16 for the CIFAR-10 data set. Figure 6 demonstrates that our method is stable with respect to ρ, the penalty parameter and major hyperparameter in ADMM regularization. In our experiment, we vary ρ from 0.0005 to 0.005 under the same pruning rate, and the achieved accuracy is close across these values.
Convergence Analysis
We test the convergence of our method on VGG-16 for the CIFAR-10 data set. Figure 7 demonstrates that our method (ADMM regularization) achieves a fast convergence rate: the gap ‖W − Z‖ between the weights W and the auxiliary variable Z converges to zero in around 7 ADMM iterations.
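The mechanism behind this convergence can be illustrated on a toy problem. Below is a minimal NumPy sketch of ADMM regularization for minimizing ‖W − w0‖² subject to W having at most k nonzeros (a stand-in for the DNN loss and sparsity constraint): the W-step has a closed form for this quadratic loss, the Z-step is the Euclidean projection, and the gap ‖W − Z‖ shrinks across iterations, mirroring Figure 7. The choice ρ = 0.5 is arbitrary here:

```python
import numpy as np

def admm_prune(w0, k, rho=0.5, iters=10):
    """Toy ADMM regularization: min ||W - w0||^2 s.t. W has at most
    k nonzeros. Returns the projected solution and the per-iteration
    primal gap ||W - Z||."""
    W, Z, U = w0.copy(), w0.copy(), np.zeros_like(w0)
    gaps = []
    for _ in range(iters):
        # W-step: proximal update, closed-form for the quadratic loss
        W = (w0 + rho * (Z - U)) / (1.0 + rho)
        # Z-step: Euclidean projection (keep the k largest magnitudes)
        V = W + U
        Z = np.where(np.abs(V) >= np.sort(np.abs(V))[-k], V, 0.0)
        # dual update
        U = U + W - Z
        gaps.append(np.linalg.norm(W - Z))
    return Z, gaps
```

In the actual framework the closed-form W-step is replaced by stochastic gradient descent on the DNN loss plus the quadratic ADMM penalty, but the alternation and the shrinking W–Z gap are the same.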
Structured Weight Pruning Results
Models for the following structured weight pruning results are available at the anonymous link https://bit.ly/2TYx7Za. These results significantly outperform prior art where comparable results exist. We focus on column pruning, except for MobileNet V2, which is better suited to filter pruning.
Table 8 shows our column pruning results on VGG-16 for the CIFAR-10 data set. We achieve a 29× structured pruning rate with 0.34% accuracy loss.
Table 9 shows our filter pruning results on MobileNet V2 for the CIFAR-10 data set. We achieve a 7.1× structured pruning rate (very difficult for MobileNet) with 0.2% accuracy loss.
Table 10 shows our column pruning results on LeNet-5 for the MNIST data set. We achieve a 37.1× structured pruning rate with 0.18% accuracy loss.
Table 11 shows our column pruning results on ResNet-18 for the ImageNet data set. We achieve a 3× structured pruning rate without any accuracy loss. The best prior work incurs at least 1% accuracy loss at a 2× structured pruning rate.
Table 8: Column pruning on VGG-16 for CIFAR-10.

Method | Prune rate / Top-1 accuracy loss
Baseline | 1× / 0%
Our method | 9.3× / 0.06%
Our method | 29.0× / 0.34%

Table 9: Filter pruning on MobileNet V2 for CIFAR-10.

Method | Prune rate / Top-1 accuracy loss
Baseline | 1× / 0%
Our method | 7.1× / 0.20%

Table 10: Column pruning on LeNet-5 for MNIST.

Method | Prune rate / Top-1 accuracy loss
Baseline | 1× / 0%
Our method | 17.7× / 0.05%
Our method | 37.1× / 0.18%
Our method | 105.5× / 0.87%

Table 11: Column pruning on ResNet-18 for ImageNet.

Method | Prune rate / Top-1 accuracy loss
Baseline | 1× / 0%
Our method | 3.0× / 0.0%
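The column pruning used above corresponds, in the GEMM view of a convolutional layer, to a Euclidean projection that keeps the columns with the largest L2 norms and zeros the rest. A minimal sketch; the keep-ratio interface is an assumption for illustration:

```python
import numpy as np

def prune_columns(W, keep_ratio):
    """Structured (column) pruning by Euclidean projection: keep the
    columns of the GEMM-form weight matrix with the largest L2 norms
    and zero the rest. keep_ratio is the fraction of columns kept."""
    norms = np.linalg.norm(W, axis=0)
    k = max(1, int(round(keep_ratio * W.shape[1])))
    keep = np.argsort(norms)[-k:]          # indices of the k largest-norm columns
    out = np.zeros_like(W)
    out[:, keep] = W[:, keep]
    return out
```

Because whole columns are removed, the resulting sparsity maps directly onto a smaller GEMM, unlike non-structured pruning, which needs sparse-matrix indexing at inference time.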
Combination of (Non-Structured) Weight Pruning and Quantization
Models for the following results are released at the anonymous link https://bit.ly/2TYx7Za. We found no prior work reporting combined non-structured weight pruning and quantization results on ResNet.
Table 12: Non-structured pruning plus quantization on ResNet-50 for ImageNet.

Method | Prune rate / Quantization bits | Accuracy loss
Baseline | 1× / 32 | 0%
Our method | 8× / 6 | 0.2%

Table 13: Non-structured pruning plus quantization on ResNet-18 for ImageNet.

Method | Prune rate / Quantization bits | Accuracy loss
Baseline | 1× / 32 | 0%
Our method | 5× / 5 | 0.0%
Table 12 shows the combination of non-structured pruning and quantization on ResNet-50 for the ImageNet data set: we achieve an 8× pruning rate and quantize the weights to 6 bits with 0.2% Top-5 accuracy loss, against a baseline Top-5 accuracy of 92.9%.
Table 13 shows the combination of non-structured pruning and quantization on ResNet-18 for the ImageNet data set: we achieve a 5× pruning rate and quantize the weights to 5 bits without accuracy loss (baseline Top-5 accuracy 89.1%).
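As a rough sketch of how such a combination acts on a weight tensor, the simplest composition is magnitude pruning followed by uniform symmetric quantization of the survivors. The actual framework interleaves both with ADMM training, and the uniform grid below is an assumption, not the paper's codebook:

```python
import numpy as np

def prune_and_quantize(w, prune_rate, bits):
    """Sketch: zero all but the top 1/prune_rate fraction of weights
    by magnitude, then snap the survivors to a symmetric uniform grid
    with 2^(bits-1) - 1 positive levels (an illustrative choice)."""
    k = max(1, int(round(w.size / prune_rate)))
    thr = np.sort(np.abs(w).ravel())[-k]          # k-th largest magnitude
    sparse = np.where(np.abs(w) >= thr, w, 0.0)   # non-structured pruning
    levels = max(2 ** (bits - 1) - 1, 1)
    scale = np.max(np.abs(sparse)) / levels       # symmetric signed grid
    return np.round(sparse / scale) * scale       # uniform quantization
```

The storage saving multiplies: an 8× pruning rate with 6-bit survivors stores roughly 8 × (32/6) ≈ 43× fewer weight bits than the dense 32-bit baseline, before index overhead.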