1 Introduction
Deep neural networks (DNNs) with very large model sizes are the key enabler for the recent success of deep learning. However, large models incur excessive DRAM accesses, which consume significantly more energy than arithmetic or SRAM operations. Thus,
model compression of DNNs has become an active and intensively studied research topic. These techniques, which are applied during the training phase of the DNNs, exploit the redundancy in weights. The aim is to simultaneously reduce the model size (thus, the storage requirement) and accelerate the computation for inference, all with minor classification accuracy loss. These techniques are of particular interest to the hardware acceleration of DNN inference engines [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70] since it is more challenging to achieve high processing throughput for the compressed models. Two important model compression techniques are weight pruning and weight quantization.

Weight pruning leverages the redundancy in the number of weights. The pioneering work [71]
used heuristic and iterative weight pruning to achieve considerable weight parameter reduction with negligible accuracy loss. It has been extended in
[72, 73, 74, 75] with more sophisticated heuristics. On the downside, such non-structured methods lead to irregular, sparse weight matrices (as shown in Figure 1 (a), arbitrary weights can be pruned), which rely on indices to be stored in a compressed format. As a result, they are less compatible with the data-parallel execution model of GPUs and multicore CPUs. This drawback is confirmed by the throughput degradation reported in recent works [76, 77]. To overcome the limitation of non-structured pruning, recent works [76, 78] proposed the idea of incorporating regularity or “structures” in weight pruning, such as filter pruning, channel pruning, and filter shape pruning, shown in Figure 1 (b). The structured approaches maintain a full matrix with reduced dimensions, and indices are no longer needed. As a result, they lead to much higher speedups on GPUs.

Weight quantization is an orthogonal compression technique that leverages the redundancy in the number of bits of the weight representation [79, 80, 81, 82, 83, 84, 85, 86]. Compared to weight pruning, weight quantization is inherently more hardware-friendly, since both storage and computation of DNNs are reduced proportionally to the weight precision, without additional overhead due to indices. Moreover, multiplication operations may be eliminated with binary, ternary, or power-of-2 weight quantizations [84, 85, 86]. Thanks to these advantages, weight quantization has been a “must-do” step for DNN inference engines. Besides FPGA and ASIC, it is also well supported in GPU, CPU, and mobile devices, e.g., [87, 88].
Given the pros and cons of non-structured/structured weight pruning and weight quantization, they need to be investigated jointly to fully understand the interactions between them. In particular, weight quantization is a must-do step, especially for FPGA and ASIC, so weight pruning will not be performed alone. The key open question is: with quantization, what kind of pruning (non-structured vs. structured) is most beneficial? The answer is far from obvious. Using LeNet-5 (for the MNIST data set) as an example, we achieve an unprecedented 348× (non-structured) weight reduction with 3-bit quantization, maintaining 99% accuracy. However, each index needs to be at least 9 bits on account of the 348× weight pruning. This makes index storage larger than that of weights (in addition, indices cannot be further quantized). In this example, non-structured weight pruning results in larger actual storage than structured pruning. Thus, we can see the importance of answering this question: it will determine the design aspects that we should really focus on to avoid diminishing returns of certain optimizations. As shown in Figure 2, we need clear answers for all platforms.
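The arithmetic behind this example can be made concrete with a short sketch (the weight count below is a placeholder chosen for round numbers, not LeNet-5's actual size; 3-bit weights and 9-bit relative indices follow the numbers above):

```python
def total_bits(n_weights, prune_rate, w_bits, idx_bits=0):
    """Bits needed to store the kept weights plus optional per-weight indices."""
    kept = n_weights / prune_rate
    return kept * (w_bits + idx_bits)

n = 348_000                                  # placeholder weight count (illustrative)
with_indices = total_bits(n, 348, w_bits=3, idx_bits=9)   # non-structured + indices
weights_only = total_bits(n, 348, w_bits=3)
assert with_indices == 12_000 and weights_only == 3_000
```

The 9-bit index alone triples the per-weight cost of the 3-bit weights, which is why structured pruning with a lower pruning rate can still win on total storage.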
Two recent works, ADMM-NN [89] and [79], which perform systematic joint weight pruning and quantization, are in the best position to perform this study. Using the advanced variable-splitting optimization method ADMM (Alternating Direction Method of Multipliers) [90, 91, 92], state-of-the-art results are achieved (e.g., 21× weight reduction in AlexNet [93]), outperforming heuristic counterparts. Unfortunately, the current framework is insufficient to perform such a study. First, ADMM-NN lacks the algorithmic mechanisms to enforce structured weight pruning and guarantee solution feasibility. Second, we lack the methodology to fairly and fundamentally compare non-structured and structured pruning in an “apple-to-apple” manner. This paper is the first study to provide the answer to the open question, with two key contributions.
The first contribution of the paper is the development of ADMM-NN-S by extending and enhancing ADMM-NN [89]. It is extended with algorithmic support for structured pruning. We achieve this by adjusting the constraints in each layer to express the structured requirements. For example, for filter pruning, the constraint for a layer can be specified as: the number of nonzero filters is less than or equal to a threshold. Moreover, we develop a systematic framework of dynamic ADMM regularization, masked mapping, and retraining to guarantee solution feasibility (satisfying all constraints) and provide high solution quality (ensuring high pruning and quantization rates under the same accuracy).
The second contribution is the methodology for the fair and fundamental comparison of non-structured and structured weight pruning with quantization in place. We focus on two metrics at the same accuracy: 1) total storage (weights + indices), which is computed based on both absolute and relative indices; 2) computation efficiency, which is captured by a new metric called the pruning-to-performance ratio (PPR). After pruning, suppose a weight reduction (pruning rate) of $P\times$ results in a speedup of $S\times$; the PPR value is then defined as $P/S$. Intuitively, the lower the PPR, the higher the computation efficiency: the same speedup can be achieved with a smaller pruning rate. For structured pruning, the PPR value is approximately 1 due to the absence of indices. For non-structured pruning, recent accelerators based on non-structured sparsity [94, 95, 96, 97] show PPR values larger than 2.7. We can therefore fairly compare non-structured and structured pruning by conservatively comparing PPR: non-structured pruning is more beneficial only if it can achieve a 2.7× or higher pruning rate relative to structured pruning. No prior work has conducted such a study, and the answer to the above comparison is unknown.
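As a quick illustration of the metric (function name is ours; numbers are hypothetical):

```python
def ppr(pruning_rate, speedup):
    """Pruning-to-performance ratio: pruning rate spent per unit of speedup.
    Lower is better: the same speedup is reached with a smaller pruning rate."""
    return pruning_rate / speedup

# Structured pruning: speedup roughly tracks the pruning rate, so PPR is ~1.
assert ppr(4.0, 4.0) == 1.0
# Hypothetical non-structured accelerator: 8x pruning yielding under 3x speedup.
assert ppr(8.0, 2.9) > 2.7
```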
The fairness of the proposed methodology is ensured for three reasons: 1) it is performed by our new ADMM-NN-S framework, which significantly outperforms prior art (in both non-structured and structured pruning); 2) the comparison of storage and computation is hardware-implementation-agnostic; 3) the comparison is performed at the same accuracy. We also strengthen weight quantization after non-structured pruning by selectively leveraging the state-of-the-art ternary quantization solution [98].
Based on the proposed ideas, we perform extensive and representative testing of our comparison framework with AlexNet, VGGNet, ResNet-18/50, MobileNet, and LeNet-5 models on the ImageNet, CIFAR-10, and MNIST data sets. Due to space limitation, we focus on the convolutional (CONV) layers, which are the most computationally intensive layers in DNNs and are also becoming the major storage in state-of-the-art ResNet and MobileNet models. We do observe similar (and more significant) effects on fully-connected (FC) layers and on RNNs. In the following, we highlight our results and findings.
First, the ADMM-NN-S framework guarantees solution feasibility while providing high solution quality. Our results consistently and significantly outperform prior art. This is the key to ensuring the credibility of our conclusions. Specifically, we 1) achieve unprecedented 348×, 36×, and 8× overall weight pruning on LeNet-5, AlexNet, and ResNet-50 models, respectively, with (almost) zero accuracy loss; 2) derive the first lossless, fully binarized (for all layers) LeNet-5 for MNIST and VGG-16 for CIFAR-10; and 3) derive the first fully binarized (for all layers) ResNet for ImageNet with reasonable accuracy loss.
Second, comparing non-structured and structured pruning, we find that the storage overhead of indices for non-structured pruning always exceeds its additional weight storage reduction; thus, the total storage for non-structured pruning is actually larger. In terms of computation efficiency, we find that, in all models, non-structured pruning achieves less than 2.7× the pruning rate of structured pruning. For the first time, our results show that, despite its higher flexibility and weight pruning rates, non-structured pruning is not competitive in terms of either storage or computation efficiency with quantization in place and at the same accuracy. In a few cases, the storage size of non-structured pruning is comparable to (or slightly better than) that of structured pruning; however, it is still not a desirable choice considering the additional complexity of hardware design to support non-structured sparsity. Moreover, we explain in detail (Section 8) that the conclusion is unlikely to change for different hardware platforms (e.g., GPUs, multicore CPUs, FPGA, or ASIC), application scenarios, or DNN types, and will still hold with potential pruning/quantization algorithm improvements. We therefore conclude that non-structured weight pruning is considered harmful, and we recommend not to continue investigating DNN inference engines using non-structured sparsity. We release the codes and all the models of this work at an anonymous link: http://bit.ly/2WMQSRi.
2 Model Compression Background
2.1 Weight Pruning
Non-structured weight pruning. The early work by Han et al. [71] achieved a 9× reduction in the number of parameters in AlexNet and 13× in VGG-16. However, most of the reduction is achieved in FC layers, and the 2.7× reduction achieved in CONV layers will not lead to an overall acceleration in GPUs [76]. Extensions of iterative weight pruning, such as [74] (dynamic network surgery), [72] (NeST), and [99], use more delicate algorithms such as selective weight growing and pruning. But the weight pruning rates on CONV layers are still limited, e.g., 3.1× in [74], 3.23× in [72], and 4.16× in [99] for AlexNet with no accuracy degradation. This level of non-structured weight pruning cannot guarantee sufficient speedups in GPUs. In fact, based on the enhanced ADMM-NN framework, we can achieve 11.2× non-structured weight pruning in CONV layers with almost no accuracy degradation. Ironically, it even results in a 20% speed degradation on an NVIDIA 1080Ti GPU.
Structured weight pruning. To overcome the limitation of non-structured, irregular weight pruning, SSL [76] proposes to learn structured sparsity at the levels of filters, channels, filter shapes, layer depth, etc. This work is among the first to report actually measured GPU accelerations. This is because CONV layers after structured pruning transform into a full matrix multiplication with reduced matrix size. However, the weight pruning rate is limited in the prior work on structured pruning. The average weight pruning rate on CONV layers of AlexNet is only 1.4× without accuracy loss. More recently, [78] achieved 2× channel pruning with 1% accuracy degradation on VGGNet. More importantly, structured weight pruning has never been evaluated with weight quantization.
2.2 Weight Quantization
Weight quantization. This method takes advantage of the inherent redundancy in the number of bits for weight representation. Many of the prior works [79, 80, 81, 82, 83, 84, 85, 86] focused on quantization of weights to binary values, ternary values, or powers of 2 to facilitate hardware implementation, with acceptable accuracy loss. The state-of-the-art techniques [86, 79] adopt an iterative quantization and retraining framework, with some degree of randomness incorporated into the quantization step. This method results in less than 3% accuracy loss on AlexNet for binary weight quantization [79].
Compared to weight pruning, weight quantization is the major DNN model compression technique utilized in industry, due to its “hardware-friendliness” and the proportional reduction of computation and storage. Thus, weight quantization has been a must-do step in FPGA and ASIC designs of DNN inference engines. Also, it is well supported in GPUs and mobile devices, e.g., PyTorch [88] on NVIDIA GPUs and TensorFlow Lite [87] for mobile devices.

2.3 ADMM for Weight Pruning/Quantization
Recent works [89, 79] have incorporated ADMM for DNN weight pruning and weight quantization, respectively. ADMM is a powerful tool for optimization; it decomposes an original problem into two subproblems that can be solved separately and efficiently. For example, consider the optimization problem $\min_{\mathbf{x}} f(\mathbf{x}) + g(\mathbf{x})$. In ADMM, this problem is decomposed into two subproblems on $\mathbf{x}$ and $\mathbf{z}$ (an auxiliary variable), which will be solved iteratively until convergence. The first subproblem derives $\mathbf{x}$ given $\mathbf{z}$: $\min_{\mathbf{x}} f(\mathbf{x}) + q_1(\mathbf{x})$. The second subproblem derives $\mathbf{z}$ given $\mathbf{x}$: $\min_{\mathbf{z}} g(\mathbf{z}) + q_2(\mathbf{z})$. Both $q_1(\mathbf{x})$ and $q_2(\mathbf{z})$ are quadratic functions.
ADMM is conventionally utilized to accelerate the convergence of convex optimization problems and to enable distributed optimization, in which the optimality and fast convergence rate have been proven [90, 92]. As a special property, ADMM can effectively deal with a subset of combinatorial constraints and yield optimal (or at least high-quality) solutions [100, 101]. Luckily, the constraints associated with DNN weight pruning and quantization belong to this subset of combinatorial constraints, making ADMM applicable to DNN model compression. However, due to the non-convex nature of the objective function for DNN training, there is still a lack of guarantee in the prior work [89, 79] on solution feasibility and solution quality. Moreover, [89] only supports non-structured pruning.
3 Non-Structured vs. Structured Weight Pruning
3.1 Non-Structured Pruning: Indexing Overhead
Indices are used to represent weight matrices in a sparse format, thereby achieving storage reduction in non-structured weight pruning. A representative sparse representation format is the compressed sparse row (CSR) format, which was also utilized in prior work [71, 6]. As shown in Figure 3 (a), it represents a matrix by three arrays, which respectively contain the nonzero (weight) values, the column indices, and the extents of rows. This representation requires $2a + n + 1$ numbers, where $a$ is the number of nonzero values and $n$ is the number of rows.
We call the above representation CSR with absolute indices. Instead of storing the absolute position, we can compute the index difference and store the indices as relative positions. This representation requires $2a$ numbers, where $a$ is the number of nonzero (weight) values. For further compression, one can restrict the number of bits (3 bits in this example) used to represent the relative position and add a dummy zero weight when the relative position exceeds the largest value (8 in this example) that can be represented; both cases are shown in Figure 3 (b). These cases are called CSR with relative indices.
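A sketch of the relative-index encoding (our own illustrative implementation; we assume gaps from 1 up to $2^{\text{bits}}$ are representable, with dummy zero weights padding larger gaps, as in the example above):

```python
import numpy as np

def csr_relative(flat_weights, index_bits=3):
    """Encode a flat sparse weight array as (values, relative gaps).
    Gaps are capped at 2**index_bits; larger gaps insert dummy zero weights."""
    max_gap = 2 ** index_bits          # 8 for the 3-bit example in the text
    values, gaps = [], []
    prev = -1
    for pos in np.flatnonzero(flat_weights):
        gap = int(pos) - prev
        while gap > max_gap:           # pad with dummy zeros to stay in range
            values.append(0.0)
            gaps.append(max_gap)
            gap -= max_gap
        values.append(float(flat_weights[pos]))
        gaps.append(gap)
        prev = int(pos)
    return values, gaps

w = np.zeros(20)
w[1], w[15] = 0.5, -0.2
vals, gaps = csr_relative(w, index_bits=3)
# The gap from position 1 to 15 is 14 > 8, so one dummy zero is inserted.
assert vals == [0.5, 0.0, -0.2] and gaps == [2, 8, 6]
```

Decoding simply accumulates the gaps to recover absolute positions, skipping the dummy zero entries.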
Comparing the two options, CSR with relative indices is better for compression [71], while CSR with absolute indices leads to better hardware acceleration [94, 96, 97]. In this work, we aim to allow the highest degree of freedom for non-structured pruning in storage and computation evaluations: we use CSR with relative indices in the storage calculation and CSR with absolute indices in the computation estimation for non-structured pruning.
3.2 Structured Pruning: Three Types
Wen et al. [76] introduced three types of structured pruning: filter pruning, channel pruning, and filter shape pruning, as shown in Figure 1 (b). Filter pruning removes whole filter(s); channel pruning removes whole channels; and filter shape pruning removes the weights at the same locations in all filters of one specific layer. Moreover, as shown in Figure 4, filter pruning and channel pruning are correlated: pruning a filter in layer $i$ is equivalent to pruning the corresponding channel in layer $i+1$, which is generated by this specific filter. As a result, filter pruning (and channel pruning) has a roughly quadratic effect on the weight parameter reduction (and the amount of computation) of the DNN.
The CONV operations in (one layer of) DNNs are commonly transformed into matrix multiplications by converting weight tensors and feature map tensors to matrices [52], named general matrix multiplication (GEMM), as shown in Figure 5. From Figure 5 (b), filter pruning corresponds to reducing one row, and is thus also termed row pruning. Filter shape pruning corresponds to reducing one column, and is thus also termed column pruning. Channel pruning corresponds to reducing multiple consecutive columns. The three structured pruning techniques, along with their combinations, reduce the dimensions in GEMM while maintaining a full matrix format. Thus, indices are not needed. This is why structured pruning techniques are in general more suitable for hardware acceleration.

On one hand, the major advantage of filter/channel pruning is its superlinear effect on storage/computation reduction, i.e., $p\times$ filter pruning on all layers results in over $p^2\times$ reduction in the number of weight parameters. On the other hand, column pruning has a higher degree of flexibility. These techniques can be combined in order to achieve the highest reductions in computation and storage, and an effective heuristic for finding the desirable combination is needed.
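The effect of the three structured pruning types on the GEMM view can be illustrated with NumPy (shapes are arbitrary; real frameworks use im2col or implicit GEMM):

```python
import numpy as np

W = np.random.randn(8, 4, 3, 3)       # 8 filters, 4 channels, 3x3 kernels
G = W.reshape(8, -1)                  # GEMM view: 8 rows x 36 columns

keep_filters = [0, 2, 5]              # filter (row) pruning keeps whole rows
keep_cols = np.arange(G.shape[1]) % 2 == 0   # filter-shape (column) pruning

G_pruned = G[keep_filters][:, keep_cols]
assert G_pruned.shape == (3, 18)      # still a full dense matrix: no indices

# Channel pruning removes a block of 3*3 = 9 consecutive columns per channel.
G_chan = np.delete(G, np.s_[9:18], axis=1)   # drop channel 1's columns
assert G_chan.shape == (8, 27)
```

Either way, the result is a smaller dense GEMM, which is exactly what commodity BLAS kernels are optimized for.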
4 ADMM-NN-S Framework
In this section, we build ADMM-NN-S, a unified solution framework for both non-structured and structured weight pruning, as well as weight quantization, by extending ADMM-NN, the state-of-the-art ADMM-based framework [89]. The differences between ADMM-NN-S and ADMM-NN are: 1) it supports structured pruning; 2) it can guarantee solution feasibility and provide high solution quality; and 3) we propose effective techniques for enhancing convergence.
4.1 Enforcing Structured Pruning
This section discusses the extension of ADMM-NN with structured pruning constraints. Consider an $N$-layer DNN with both CONV and FC layers. The weights and biases of the $i$-th layer are respectively denoted by $\mathbf{W}_i$ and $\mathbf{b}_i$, and the loss function associated with the DNN is denoted by $f(\{\mathbf{W}_i\}, \{\mathbf{b}_i\})$; see [93]. In our discussion, $\{\mathbf{W}_i\}$ and $\{\mathbf{b}_i\}$ respectively characterize the collection of weights and biases from layer $1$ to layer $N$. Then DNN weight pruning or weight quantization is formulated as the following optimization problem:

$$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \ f(\{\mathbf{W}_i\}, \{\mathbf{b}_i\}), \quad (1)$$

subject to $\mathbf{W}_i \in \mathbf{S}_i$, for $i = 1, \ldots, N$.
Next we introduce the constraint sets $\mathbf{S}_i$ corresponding to non-structured weight pruning, the different types of structured pruning, as well as weight quantization. We use CONV layers as the illustrative example since CONV layers are the most computationally intensive. The problem formulation can be well applied to FC layers [93].
The collection of weights in the $i$-th CONV layer is a four-dimensional tensor, i.e., $\mathbf{W}_i \in \mathbb{R}^{A_i \times B_i \times C_i \times D_i}$, where $A_i$, $B_i$, $C_i$, and $D_i$ are respectively the number of filters, the number of channels in a filter, the height of the filter, and the width of the filter, in layer $i$. In the following, if $\mathbf{W}$ denotes the weight tensor in a specific layer, let $(\mathbf{W})_{a,:,:,:}$ denote the $a$-th filter in $\mathbf{W}$, $(\mathbf{W})_{:,b,:,:}$ denote the $b$-th channel, and $(\mathbf{W})_{:,:,c,d}$ denote the collection of weights located at position $(c,d)$ in every filter of $\mathbf{W}$, as illustrated in Figure 1 (b).
Weight pruning: For non-structured weight pruning, the constraint on the weights in the $i$-th layer is: the number of nonzero elements in $\mathbf{W}_i$ is less than or equal to $\alpha_i$. For filter pruning (row pruning), the constraint in the $i$-th CONV layer becomes: the number of nonzero filters in $\mathbf{W}_i$ is less than or equal to $\beta_i$. For channel pruning, the constraint becomes: the number of nonzero channels in $\mathbf{W}_i$ is less than or equal to $\gamma_i$. Finally, for filter-shape pruning (column pruning), the constraint in the $i$-th CONV layer is: the number of nonzero vectors $(\mathbf{W}_i)_{:,:,c,d}$ is less than or equal to $\theta_i$. These $\alpha_i$, $\beta_i$, $\gamma_i$, and $\theta_i$ values are hyperparameters determined in advance, and the determination procedure will be discussed in Section 4.4.

Weight quantization: For weight quantization, elements in $\mathbf{W}_i$ assume one of the values $q_{i,1}, q_{i,2}, \ldots, q_{i,M_i}$, where $M_i$ denotes the number of these fixed values. Here, the $q_{i,j}$ values are quantization levels of the weights of layer $i$ in increasing order, and we focus on equal-distance quantization (the same distance between adjacent quantization levels) to facilitate hardware implementation.
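A minimal sketch of the equal-distance quantization constraint, projecting each weight to the closest of $M_i$ evenly spaced levels (the symmetric placement of levels around zero is our illustrative assumption, not necessarily the paper's exact level choice):

```python
import numpy as np

def project_to_levels(W, num_levels, w_max=None):
    """Map each weight to the closest of `num_levels` equal-distance levels
    placed symmetrically around zero."""
    w_max = np.abs(W).max() if w_max is None else w_max
    levels = np.linspace(-w_max, w_max, num_levels)
    idx = np.abs(W[..., None] - levels).argmin(axis=-1)
    return levels[idx]

W = np.array([-0.9, -0.1, 0.05, 0.7])
Q = project_to_levels(W, num_levels=3, w_max=1.0)   # levels: -1, 0, 1 (ternary)
assert np.array_equal(Q, [-1.0, 0.0, 0.0, 1.0])
```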
4.2 Guaranteeing Solution Feasibility and Providing High Solution Quality
In problem (1), the constraint is combinatorial. As a result, this problem cannot be solved directly by stochastic gradient descent methods as in original DNN training. However, the form of the combinatorial constraints on $\mathbf{W}_i$ is compatible with ADMM, which has recently been shown to be an effective method to deal with such clustering-like constraints [100, 101].

Despite such compatibility, it is still challenging to directly apply ADMM due to the non-convexity of the objective function. To overcome this challenge, we propose dynamic ADMM regularization, masked mapping, and retraining steps for both non-structured and structured pruning. By integrating these techniques, ADMM-NN-S can guarantee solution feasibility (satisfying all constraints) and provide high solution quality (a high pruning/quantization rate under the same accuracy). The procedure of ADMM-NN-S is shown in Figure 6.
ADMM Regularization Step: The ADMM regularization decomposes the original problem (1) into two subproblems by (i) defining the indicator function $g_i(\cdot)$ corresponding to every set $\mathbf{S}_i$, i.e., $g_i(\mathbf{W}_i) = 0$ if $\mathbf{W}_i \in \mathbf{S}_i$ and $+\infty$ otherwise; (ii) incorporating auxiliary variables $\mathbf{Z}_i$ and dual variables $\mathbf{U}_i$; and (iii) adopting the augmented Lagrangian [92]. (The details of ADMM are presented in [92, 93]; we omit them due to space limitation.) These decomposed subproblems will be solved iteratively until convergence. The first subproblem is

$$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \ f(\{\mathbf{W}_i\}, \{\mathbf{b}_i\}) + \sum_{i=1}^{N} \frac{\rho_i}{2} \left\| \mathbf{W}_i - \mathbf{Z}_i^k + \mathbf{U}_i^k \right\|_F^2, \quad (2)$$

where $\mathbf{U}_i^k$ is the dual variable and $\|\cdot\|_F$ denotes the Frobenius norm. The first term in the objective function of (2) is the differentiable loss function of the DNN, and the second term is a quadratic regularization term on the $\mathbf{W}_i$'s, which is differentiable and convex. As a result, (2) can be solved by stochastic gradient descent, as in original DNN training. Please note that this first subproblem maintains the same form and solution for (non-structured and structured) weight pruning and quantization problems.
On the other hand, the second subproblem is given by

$$\min_{\{\mathbf{Z}_i\}} \ \sum_{i=1}^{N} g_i(\mathbf{Z}_i) + \sum_{i=1}^{N} \frac{\rho_i}{2} \left\| \mathbf{W}_i^{k+1} - \mathbf{Z}_i + \mathbf{U}_i^k \right\|_F^2. \quad (3)$$
Note that $g_i(\cdot)$ is the indicator function of $\mathbf{S}_i$; thus this subproblem can be solved analytically and optimally [92]. For $i = 1, \ldots, N$, the optimal solution is the Euclidean projection of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ onto $\mathbf{S}_i$. For non-structured weight pruning, we can prove that the Euclidean projection results in keeping the $\alpha_i$ elements of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ with the largest magnitudes and setting the remaining weights to zero. For filter pruning, we first calculate $O_a = \|(\mathbf{W}_i^{k+1} + \mathbf{U}_i^k)_{a,:,:,:}\|_F^2$ for $a = 1, \ldots, A_i$, where $\|\cdot\|_F$ denotes the Frobenius norm. We then keep the elements of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ corresponding to the $\beta_i$ largest values in $\{O_a\}$ and set the rest to zero. For channel pruning, we first calculate $O_b = \|(\mathbf{W}_i^{k+1} + \mathbf{U}_i^k)_{:,b,:,:}\|_F^2$ for $b = 1, \ldots, B_i$. We then keep the elements corresponding to the $\gamma_i$ largest values in $\{O_b\}$ and set the rest to zero. The optimal solution of the second subproblem for filter-shape pruning is similar, and is omitted due to space limitation. For weight quantization, we can prove that the Euclidean projection results in mapping every element of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ to the quantization level closest to that element.
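The projections described above amount to simple magnitude selection; a NumPy sketch for the non-structured and filter-pruning cases (function names are ours):

```python
import numpy as np

def project_nonstructured(W, alpha):
    """Keep the `alpha` largest-magnitude weights, zero out the rest."""
    Z = np.zeros_like(W)
    keep = np.argsort(np.abs(W), axis=None)[-alpha:]
    Z.flat[keep] = W.flat[keep]
    return Z

def project_filters(W, beta):
    """Keep the `beta` filters with the largest squared Frobenius norm
    (W is 4-D: filters x channels x height x width); zero whole filters."""
    norms = (W ** 2).sum(axis=(1, 2, 3))
    keep = np.argsort(norms)[-beta:]
    Z = np.zeros_like(W)
    Z[keep] = W[keep]
    return Z

W = np.arange(2 * 2 * 2 * 2, dtype=float).reshape(2, 2, 2, 2) - 8
assert (project_nonstructured(W, 4) != 0).sum() == 4
# Filter 0 holds values -8..-1 (larger norm), so it is the one kept.
assert (project_filters(W, 1) != 0).sum(axis=(1, 2, 3)).tolist() == [8, 0]
```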
After both subproblems are solved, we update the dual variables according to the ADMM rule $\mathbf{U}_i^{k+1} = \mathbf{U}_i^k + \mathbf{W}_i^{k+1} - \mathbf{Z}_i^{k+1}$ [92], thereby completing one iteration of ADMM regularization. Overall, the ADMM regularization step can be understood as a smart, dynamic regularization, in which the regularization target changes judiciously and analytically in each iteration. On the other hand, conventional regularization methods (based on $\ell_1$, $\ell_2$ norms or their combinations) use a fixed regularization target, and the penalty is applied on all the weights. This inevitably causes accuracy degradation. Sample comparison results are provided in Section 5.
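The full loop (subproblem 1 via gradient descent, subproblem 2 via projection, then the dual update, with increasing $\rho$) can be demonstrated end-to-end on a toy least-squares problem with a cardinality constraint; all sizes, step sizes, and the $\rho$ schedule below are illustrative choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
x_true = np.zeros(10)
x_true[[2, 7]] = [1.5, -2.0]
b = A @ x_true                         # noiseless targets, sparse ground truth

def project_card(v, k):                # Euclidean projection onto {||x||_0 <= k}
    z = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    z[keep] = v[keep]
    return z

x = np.zeros(10); z = np.zeros(10); u = np.zeros(10)
rho = 0.1
for it in range(12):                   # ~8-12 ADMM iterations, as in the text
    for _ in range(200):               # subproblem 1: gradient descent
        grad = 2 * A.T @ (A @ x - b) + rho * (x - z + u)
        x -= 0.005 * grad
    z = project_card(x + u, k=2)       # subproblem 2: analytical projection
    u = u + x - z                      # dual variable update
    rho *= 1.5                         # gradually increase rho

x_final = project_card(x, 2)           # masked mapping to feasibility
assert np.allclose(x_final, x_true, atol=0.05)
```

The final projection plays the role of the masked-mapping step: it makes the iterate exactly feasible before any retraining.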
Masked mapping and retraining: After ADMM regularization, we obtain intermediate solutions. The subsequent step of masked mapping and retraining guarantees solution feasibility and improves solution quality. For non-structured and structured weight pruning, the procedure is straightforward. We first perform the aforementioned Euclidean projection (mapping) to guarantee that the pruning constraints are satisfied. Next, we mask the zero weights and retrain the DNN on the nonzero weights using the training set, while keeping the masked weights at zero. In this way, test accuracy (solution quality) can be (partially) restored, and solution feasibility (the constraints) will be maintained.
For weight quantization, the procedure is more complicated. The reason is that the retraining process will affect the quantization results, and thereby solution feasibility. To deal with this issue, we first perform the Euclidean projection (mapping) of weights that are close enough (as defined by a threshold value $\epsilon$) to nearby quantization levels. Then we retrain the remaining, unquantized weights (with the quantized weights fixed) for accuracy improvement. Finally, we perform the Euclidean mapping on the remaining weights as well. In this way, solution feasibility is guaranteed.
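The threshold-based masked mapping can be sketched as follows (the helper name and two-stage usage are ours; actual retraining of the still-trainable weights happens between the two calls):

```python
import numpy as np

def masked_quantization_step(W, levels, eps):
    """Freeze weights within `eps` of a quantization level; return the partially
    quantized weights and a mask of still-trainable (unquantized) entries."""
    levels = np.asarray(levels)
    nearest = levels[np.abs(W[:, None] - levels).argmin(axis=1)]
    frozen = np.abs(W - nearest) < eps
    W_out = np.where(frozen, nearest, W)
    return W_out, ~frozen

W = np.array([0.48, 0.1, -0.52, 0.9])
levels = [-0.5, 0.0, 0.5]
W1, trainable = masked_quantization_step(W, levels, eps=0.05)
assert np.allclose(W1, [0.5, 0.1, -0.5, 0.9])
assert trainable.tolist() == [False, True, False, True]

# ... retrain only W1[trainable] with the frozen entries fixed, then map the rest:
W_final = masked_quantization_step(W1, levels, eps=np.inf)[0]
assert np.allclose(W_final, [0.5, 0.0, -0.5, 0.5])
```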
4.3 Techniques for Enhancing Convergence
In this section we discuss two techniques for enhancing convergence (in both rate and result quality): increasing $\rho$ in ADMM regularization, and progressive weight pruning. We abandon the extra-gradient descent method in [79], as we did not find an advantage in convergence speed, not to mention the additional hyperparameters introduced by that method.
Increasing $\rho$ in ADMM regularization: The $\rho_i$ values are the most critical hyperparameters in ADMM regularization. We start from smaller $\rho_i$ values and gradually increase them with the ADMM iterations. This coincides with the theory of ADMM convergence [100, 101]. It in general takes 8 to 12 ADMM iterations to converge, corresponding to 100 to 150 epochs in PyTorch. This convergence rate is comparable with that of original DNN training.
Progressive weight pruning: The ADMM regularization is in essence a (dynamic) $\ell_2$ regularization. As a result, there is a large portion of very small weight values after one round of ADMM-based (non-structured or structured) weight pruning. This gives rise to the opportunity to perform a second round of weight pruning. In practice, we perform two rounds of ADMM-based weight pruning consecutively, where the weight pruning result of the first round is the starting point of the second round (weights that are already pruned to zero will not be recovered). This method has the additional benefit of reducing the search space in each step, thereby accelerating convergence.
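A minimal sketch of the two-round procedure (the ADMM regularization and retraining between rounds are omitted; the point illustrated is that zeros from round one are never recovered):

```python
import numpy as np

def prune_round(W, keep, frozen_zero=None):
    """One pruning round: keep the `keep` largest-magnitude weights.
    Weights pruned in an earlier round (frozen_zero) stay at zero."""
    W = W.copy()
    if frozen_zero is not None:
        W[frozen_zero] = 0.0           # earlier zeros are never recovered
    Z = np.zeros_like(W)
    idx = np.argsort(np.abs(W))[-keep:]
    Z[idx] = W[idx]
    return Z

W = np.array([0.9, -0.05, 0.4, -0.7, 0.01, 0.3])
round1 = prune_round(W, keep=4)
round2 = prune_round(round1, keep=2, frozen_zero=(round1 == 0))
assert (round2 != 0).sum() == 2
assert set(np.flatnonzero(round2)) <= set(np.flatnonzero(round1))
```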
4.4 Determining Hyperparameters
Hyperparameter determination mainly refers to the process of determining the pruning rate (e.g., the $\alpha_i$ values) and/or the number of quantization levels per layer of the DNN. This is in general a more challenging task for pruning than for quantization. For quantization, the same number of quantization levels is typically used for all (or most) layers, such as binarized or ternarized weights, which is preferred by hardware. For weight pruning, on the other hand, the pruning rate values are flexible and shall be judiciously determined.
As hyperparameter determination is not our primary focus, we use a heuristic method as follows. We observe that we can achieve at least 3× more weight pruning than prior, heuristic weight pruning methods without accuracy loss. Hence, we adopt the per-layer pruning rates reported in prior work and increase them proportionally. In the progressive pruning procedure, we set the target of the first round to be 1.5× the pruning rate of prior work, and the second round to double that. We will further increase the pruning rates if there is still margin for weight pruning without accuracy loss.
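The heuristic can be written down directly (layer names and the prior-work rates below are hypothetical):

```python
def progressive_targets(prior_rates, round1_scale=1.5, round2_scale=2.0):
    """Per-layer pruning-rate targets: round 1 at 1.5x the rates reported in
    prior work, round 2 doubling the round-1 targets."""
    round1 = {layer: r * round1_scale for layer, r in prior_rates.items()}
    round2 = {layer: r * round2_scale for layer, r in round1.items()}
    return round1, round2

r1, r2 = progressive_targets({"conv1": 2.0, "conv2": 4.0})
assert r1 == {"conv1": 3.0, "conv2": 6.0}
assert r2 == {"conv1": 6.0, "conv2": 12.0}
```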
5 Non-Structured DNN Weight Pruning and Quantization Results
In this section, we demonstrate the effectiveness of ADMM-NN-S for non-structured pruning and quantization, based on the ImageNet ILSVRC-2012, CIFAR-10, and MNIST data sets, using the AlexNet [102], VGGNet [103], ResNet-18/ResNet-50 [104], MobileNet V2 [105], and LeNet-5 DNN models. Due to space limitation, we only show the results on the overall DNN model (which has the most prior work for comparison) and on binarized quantization of DNNs. Our implementations are based on PyTorch, and the baseline accuracy results are in many cases higher than those utilized in prior work, which reflects recent training advances. For example, for the AlexNet model we utilize a baseline with 60.0% Top-1 accuracy and 82.2% Top-5 accuracy, both higher than prior work (57.2% Top-1 and 80.2% Top-5). Our comparison is nevertheless fair because we focus on the accuracy relative to our baseline instead of the absolute accuracy (which already outperforms prior work).
Thanks to the compatibility of ADMM-NN-S with DNN training, directly training a DNN model using the framework achieves the same result as starting from a pre-trained DNN model. When a pre-trained DNN model is utilized, we limit the number of epochs in both steps of the progressive framework to 120, which is similar to original DNN training in PyTorch and much lower than the iterative pruning heuristic [71].
5.1 Non-Structured Weight Pruning Results
AlexNet Results on the ImageNet Dataset: Table 1 compares the overall pruning rates of the whole AlexNet model (CONV and FC layers) vs. accuracy, between the proposed framework and various prior methods. We can clearly observe that the proposed framework outperforms prior methods, including the prior ADMM method [93]. With almost no accuracy loss, even relative to the high baseline accuracy, we achieve a 36× overall pruning rate. We further achieve a notable 61× weight reduction with 79.7% Top-5 accuracy, just slightly below the baseline accuracy of prior work.
Method  Top-5 accuracy  Relative accuracy loss  Overall prun. rate
Iter. prun. [71]  %  %  9.1×
NeST [72]  %  15.7×
Dyn. surg. [74]  %  17.7×
ADMM [93]  %  17.7×
Our method  %  36×
Our method  %  44×
Our method  %  61×
Figure 7 illustrates the absolute Top-5 accuracy for different pruning methods on the AlexNet model for the ImageNet dataset. These methods include our proposed solution, iterative pruning [71], fixed regularization techniques such as $\ell_1$ and $\ell_2$ regularization, and projected gradient descent (PGD). The results clearly show that the proposed method outperforms the others in both absolute accuracy and relative accuracy loss.
ResNet-50 Results on the ImageNet Dataset: Due to the lack of existing effective pruning results, we conduct uniform weight pruning, i.e., we use the same pruning rate for all CONV and FC layers. The results are shown in Table 2. We achieve an 8× overall pruning rate (also an 8× pruning rate on CONV layers) on ResNet-50 without accuracy loss. These results clearly outperform the prior work.
Method  Top-5 Acc. Loss  Pruning rate
Uncompressed  0.0%  1×
Fine-grained [99]  0.1%  2.6×
Our method  0.0%  8×
Our method  0.7%  17.4×
MobileNet V2 Results on the CIFAR-10 Dataset: The baseline accuracy is as high as 95.07% due to the adoption of the mixup technique. We present our results in Table 3, as there is a lack of prior work for a fair comparison. We achieve 5.7× weight pruning with almost no accuracy loss, starting from the high-accuracy baseline. We achieve 10× weight pruning (which is highly challenging for MobileNet) with only a 1.3% accuracy loss.
Method  Accuracy  Pruning rate
Uncompressed  95.07%  1×
Our method  94.95%  5.7×
Our method  94.70%  6.7×
Our method  93.75%  10×
5.2 Binary Weight Quantization Results
Due to space limitation, we mainly show the results on fully binarized DNN models (i.e., weights in all layers, including the first and the last, are binarized), which is a highly challenging task for prior work.
Weight Quantization Results on LeNet-5 and CIFAR-10: To the best of our knowledge, we achieve the first lossless, fully binarized LeNet-5 model. The accuracy is still 99.21%, lossless compared with the baseline. In prior works, achieving lossless compression is challenging even for MNIST. For example, a recent work [106] incurs a 2.3% accuracy degradation on MNIST for full binarization, with a baseline accuracy of 98.66%. We also achieve the first lossless, fully binarized VGG-16 for CIFAR-10, with an accuracy of 93.53%. We would like to point out that fully ternarized quantization results in 93.66% accuracy. Table 5 shows our results and comparisons.
Table 5: Binary weight quantization results on VGG-16 for CIFAR-10.
Method  Accuracy  Num. of Bits
Baseline of [106]  84.80%  32
Binary [106]  81.56%  1
Our baseline  93.70%  32
Our ternary  93.66%  2 (ternary)
Our binary  93.53%  1
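As a point of reference for the binary results above, the following is a minimal sketch of scaled-sign weight binarization (a common scheme; the per-layer scale α = mean|w| is our assumption, not necessarily the exact projection used by the ADMM-based method):

```python
import numpy as np

# Scaled-sign binarization sketch: every weight collapses to +alpha or -alpha,
# so each weight needs 1 bit plus one shared scaling factor per layer.
def binarize(w):
    alpha = np.abs(w).mean()   # per-layer scaling factor (assumed choice)
    return alpha * np.sign(w)

w = np.array([0.3, -0.8, 0.5, -0.2])
wb = binarize(w)
# Only a single magnitude (alpha = 0.45) remains after binarization.
assert np.allclose(np.abs(wb), 0.45)
```

Ternary quantization extends this with an explicit zero level, which is why it typically recovers slightly more accuracy (93.66% vs. 93.53% above).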
Binary Weight Quantization Results on ResNet for ImageNet: Binarization of ResNet models on the ImageNet dataset is widely acknowledged as an extremely challenging task. As a result, there is very limited prior work (e.g., the ADMM-based method [79]) with binarization results on ResNet models. As [79] targets ResNet-18, we make a fair comparison on the same model. Table 6 shows the comparison results (Top-5 accuracy loss). In prior work, by default the first and last layers are not binarized, as these layers have a significant effect on overall accuracy. When leaving the first and last layers unquantized, we observe higher accuracy than the prior method. The Top-1 accuracy shows a similar result: 3.8% degradation for our method versus 4.3% for [79].
Furthermore, we can derive a fully binarized ResNet-18, in which the weights in all layers are binarized. The accuracy degradation is 5.8%, which is noticeable and shows that full binarization of ResNet remains challenging even for the proposed framework. We did not find prior work to compare with this result.
Table 6: Binary weight quantization results on ResNet-18 for ImageNet.
Method  Relative Top-5 Acc. Loss  Num. of Bits
Uncompressed  0.0%  32
ADMM [79]  2.9%  1 (32 for the first and last)
Ours  2.5%  1 (32 for the first and last)
Ours  5.8%  1
Summary: The results presented in this section show that ADMM-NN-S achieves comparable or better results than the state of the art; in certain cases it achieves unprecedented weight reduction. These results provide a strong baseline for, and lend credibility to, our study.
6 Non-structured vs. Structured: The Comparison Methodology
A Motivating Example: The previous section has shown superior results on joint weight pruning and quantization. Using LeNet-5 (MNIST dataset) as an example, we achieve an unprecedented 348× non-structured weight reduction together with 3-bit quantization, maintaining 99% accuracy. When indices are not accounted for, the overall compression rate is an unprecedented 3,712× compared with the original, uncompressed LeNet-5 model. However, each index needs at least 9 bits to encode the gaps arising from 348× weight pruning. This makes the index storage even larger than the weight storage, and indices cannot be further quantized. As a result, non-structured weight pruning in fact results in larger actual storage than structured pruning.
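The arithmetic behind this example can be sketched as follows (a back-of-the-envelope estimate using the numbers above; 9 index bits follow from the average gap between nonzeros, 348 < 2^9 = 512):

```python
# Storage estimate for the LeNet-5 example: 348x non-structured pruning with
# 3-bit weights, versus the same surviving weights with no per-weight indices.
def storage_bits(total_weights, prune_rate, weight_bits, index_bits=0):
    remaining = total_weights / prune_rate
    return remaining * (weight_bits + index_bits)

TOTAL = 25.5e3  # LeNet-5 CONV weights (Table 12 baseline)

with_indices = storage_bits(TOTAL, 348, weight_bits=3, index_bits=9)
weights_only = storage_bits(TOTAL, 348, weight_bits=3)

# The 9-bit indices triple the cost of the 3-bit weights: total storage is
# 4x the weights alone, so the index overhead dominates.
assert with_indices == 4 * weights_only
```

Structured pruning at a lower rate, but with zero index overhead, can therefore win on actual storage.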
The fundamental phenomenon here is that, with quantization, the weight reduction achieved by non-structured pruning is offset by the extra index storage. This motivates us to study whether it is a common trend once weight quantization is in place. If the answer is yes, the value of non-structured weight pruning becomes even more doubtful: it is already less preferred for GPUs and multi-core CPUs [76, 77], and its only benefit is the potentially higher pruning rate afforded by greater pruning flexibility. If this benefit is also lost, there is nearly no merit left in non-structured sparsity for hardware acceleration of DNNs, considering its impacts on computation efficiency and degraded parallelism. Importantly, such a conclusion would also hold for FPGA and ASIC designs, and would guide us to the design aspects we should really focus on.
In this section, we conduct the first (to the best of our knowledge) comprehensive study of the value of non-structured and structured pruning, with quantization in place and at the same accuracy. It is worth noting that this study would not be possible without the ADMM-NN-S framework: we need a framework that achieves competitive results and can jointly perform both weight pruning and quantization.
A Hardware Implementation-Agnostic Comparison Methodology: We conduct a fair comparison between non-structured and structured weight pruning with quantization in place, based on the unified solution framework. Note that the comparison framework is more FPGA- and ASIC-oriented, as flexible weight quantization is assumed. However, we would like to point out that a moderate, fixed weight quantization, e.g., 8 bits, as supported on GPUs [88], TPUs [107], and mobile devices [87], results in a similar conclusion. Please refer to Section 8.4 for more discussion.
The key characteristic of our comparison framework is that it is hardware implementation-agnostic. Our intention is that the comparison results be independent of specific hardware implementations; as a result, the conclusion is unlikely to change with architectural advances for either type of pruning. Therefore, we directly compare the storage requirements and estimated computation efficiency of non-structured and structured weight pruning with quantization in place, which capture the fundamental trade-offs. Intuitively, storage is measured as the total weight and index storage with quantization in place. Storage of intermediate results is not considered, which favors non-structured pruning: structured filter/channel pruning would likely benefit more from the reduction in intermediate result storage.
On the other hand, computation efficiency is estimated using the pruning-to-performance ratio (PPR) values derived from prior work on non-structured sparsity accelerators [94, 95, 96, 97]. For structured pruning, a p× weight reduction results in around a p× speedup (slightly higher or lower depending on platform and problem), so the PPR value is approximately 1. For non-structured pruning, a p× weight reduction only results in a (p/PPR)× speedup, with PPR > 1. In the state-of-the-art tape-outs [94], the PPR value is close to 3 at a low pruning rate and higher than 4 at a high pruning rate. In synthesis results [95, 96, 97], the PPR value ranges from 2.7 to 3.5. We use the smallest value, 2.7, which favors non-structured pruning the most. In other words, if non-structured pruning achieves more than a 2.7× higher pruning rate than the structured one (or, equivalently, the structured pruning rate is less than 37% of the non-structured one) under the same accuracy and quantization level, the former is preferred in terms of computation; otherwise, the latter is preferred.
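This rule of thumb can be expressed as a simple model (our simplified reading of the PPR definition, not a cycle-accurate performance estimate):

```python
# Pruning-to-performance ratio (PPR) model: structured pruning converts a p-x
# weight reduction into roughly a p-x speedup (PPR ~ 1), while non-structured
# pruning loses a factor PPR > 1 to indexing and load-imbalance overheads.
PPR_NONSTRUCTURED = 2.7  # smallest reported value, most favorable to it

def effective_speedup(prune_rate, structured):
    return prune_rate if structured else prune_rate / PPR_NONSTRUCTURED

# Non-structured pruning wins on computation only when its pruning rate
# exceeds 2.7x the structured rate (structured rate < 37% of it).
assert effective_speedup(10.0, structured=False) < effective_speedup(4.0, structured=True)
assert effective_speedup(12.0, structured=False) > effective_speedup(4.0, structured=True)
```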
Maintaining the Same Accuracy for Comparison: The proposed comparison is performed at the same accuracy for non-structured and structured pruning with quantization in place. The precise accuracy control, which is challenging in prior work, is enabled by the unified solution framework. In most cases, we target (almost) no accuracy degradation compared with the baseline DNN model without pruning or quantization. For non-structured pruning, this is achieved in two steps: 1) perform weight pruning to the maximum extent such that there is no accuracy loss; and 2) perform weight quantization, hopefully without causing accuracy loss. For structured pruning, we give priority to column pruning and perform three steps: 1) perform column pruning to the maximum extent without accuracy loss; 2) perform filter pruning and reduce the corresponding redundant channels; and 3) perform weight quantization, hopefully without accuracy loss. Figure 8 illustrates the procedure for maintaining accuracy. The proposed framework is, of course, also applicable if a certain accuracy degradation is allowed. A larger margin of accuracy loss in general favors structured pruning: higher pruning rates can be achieved under both pruning schemes, but non-structured pruning then requires more bits for the (relative) indices.
There is further subtlety in the combination of non-structured pruning and quantization. If a weight is nonzero after pruning but quantized to zero, it can be added to the pruned list to achieve a higher pruning rate. Note that this phenomenon does not apply to structured pruning. To better exploit it and achieve even higher storage/computation reduction for non-structured pruning (plus quantization), we leverage the state-of-the-art ternary quantization technique [98] with dedicated optimizations. We apply this technique for weight quantization after non-structured pruning whenever it outperforms our proposed method, thereby giving non-structured weight pruning every opportunity and optimization.
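This interplay can be illustrated with a small sketch (the thresholds and ternary scale below are illustrative assumptions, not the dedicated optimizations of [98]):

```python
import numpy as np

# After non-structured pruning, ternary quantization maps small surviving
# weights to zero; those weights can be folded into the pruned set, raising
# the effective pruning rate. This folding has no structured counterpart.
rng = np.random.default_rng(0)
w = rng.normal(size=1000)
w[np.abs(w) < 0.5] = 0.0          # non-structured pruning (assumed threshold)
pruned_before = int((w == 0).sum())

t = 1.5                           # ternary threshold (assumed)
scale = np.abs(w[np.abs(w) >= t]).mean()
w_q = np.where(w >= t, scale, np.where(w <= -t, -scale, 0.0))
pruned_after = int((w_q == 0).sum())

assert pruned_after > pruned_before  # quantization enlarged the pruned set
```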
7 Comparison of Non-structured and Structured Weight Pruning
Due to space limitations, we focus on CONV layers, which are the most computationally intensive layers in DNNs and are becoming the major storage component as well in state-of-the-art ResNet and MobileNet models. We observe a similar (and more significant) effect on FC layers and on RNNs, with more discussion in Section 8.
As discussed in Section 5, our implementations are based on PyTorch with high baseline accuracies. We limit the number of epochs in both structured and non-structured pruning to 240 (much lower than the iterative pruning heuristic [71]) and the number of epochs in weight quantization to 120. We adopt the hyperparameter determination heuristic discussed in Section 4.4 for both structured and non-structured pruning.
For non-structured weight pruning, we show results for CSR with both relative indices and absolute indices. The former is more appropriate for storage reduction, while the latter achieves higher computation efficiency. For absolute indices we assume 4K blocks, which are reasonable for hardware [94]. Besides the comparison between the two pruning schemes, our results also consistently outperform prior work in terms of both non-structured and structured pruning, as well as their combination with weight quantization.
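For concreteness, here is a sketch of run-length relative indexing with dummy zeros (EIE-style [6]; the exact on-chip formats in the cited accelerators differ in details):

```python
# Relative-index CSR sketch: each nonzero stores the gap to the previous
# nonzero in a fixed number of bits; gaps that overflow the index width are
# bridged by inserting dummy zero entries, which themselves cost storage.
def encode_relative(values, index_bits):
    max_gap = (1 << index_bits) - 1
    entries, last = [], -1
    for i, v in enumerate(values):
        if v == 0:
            continue
        gap = i - last
        while gap > max_gap:          # bridge an over-long gap with a dummy zero
            entries.append((max_gap, 0.0))
            gap -= max_gap
        entries.append((gap, v))
        last = i
    return entries

row = [0, 0, 0, 2.0] + [0] * 20 + [1.5]
# The 21-step gap to 1.5 overflows 4-bit indices, so one dummy zero appears.
assert encode_relative(row, index_bits=4) == [(4, 2.0), (15, 0.0), (6, 1.5)]
```

Choosing the index width trades per-entry index cost against the number of dummy zeros; this per-network optimization is reflected in the "Index Bits" column of the tables below.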
7.1 Comparison Results on ImageNet Dataset
Table 7 and Table 8 show the comparison results using the AlexNet and ResNet-18 models on the ImageNet dataset. In these tables, “CONV Prune Rate” refers to the reduction ratio in the number of weights over all CONV layers, and “CONV No. of Weights” is the number of remaining weights. “CONV Quant Bits” refers to the number of bits used for equal-distance weight quantization, while “CONV Weight Store” is the storage required for the weights only (not accounting for indices). “Index Bits” refers to the number of bits per index in CSR with relative indexing; in our results, this value is already optimized to minimize the overall storage (accounting for the additional dummy zeros as well). The next two columns give the total storage size accounting for relative and absolute indices, respectively; for structured pruning, they equal the weight storage. The final column, “CONV Compress Rate”, refers to the storage compression rate relative to the original uncompressed baseline DNN model, assuming relative indices, which are more favorable to non-structured pruning. We use “N/A” if the specific prior work only performs weight pruning without quantization.
It can be observed that we achieve significant pruning rate gains for both non-structured and structured pruning. In particular, for structured pruning we achieve 5.1× and 2.5× weight pruning in the CONV layers of the AlexNet and ResNet-18 models, respectively, without accuracy loss, and 4.3× structured pruning with a minor accuracy loss of around 1%. For ResNet on the ImageNet dataset, it is difficult for prior work to achieve lossless structured pruning; for example, [78] incurs 1% accuracy loss with 2× structured pruning on the more redundant ResNet-50 model.
When comparing non-structured vs. structured pruning, the overall CONV compression rates are comparable for the AlexNet case and for the 1%-accuracy-loss case on ResNet-18. For the lossless case on ResNet-18, non-structured pruning is slightly better in storage, especially when relative indices are used. This is because the number of index bits is relatively small in this case, and the slight benefit diminishes if a certain accuracy loss is tolerable. Such an occasional gain cannot outweigh the difficulty of hardware support for non-structured sparsity: it would be hard to choose non-structured pruning over structured pruning even when the storage results are comparable.
Table 7: Comparison results on AlexNet for ImageNet (CONV layers).
Method  Top-5 Acc.  CONV Prune Rate  CONV No. of Weights  CONV Quant Bits  CONV Weight Store  Index Bits  Total Store (rel. idx)  Total Store (abs. idx)  CONV Compress Rate
Baseline AlexNet  82.2%  1.0×  2.3M  32  9.3MB  –  9.3MB  9.3MB  1.0×
Non-structured:
Han [108]  80.3%  2.7×  0.86M  8  0.86MB  4  1.3MB  N/A  7.1×
Dyn. surg. [74]  80.0%  3.1×  0.74M  N/A  N/A  N/A  N/A  N/A  N/A
NeST [72]  80.3%  3.23×  0.71M  N/A  N/A  N/A  N/A  N/A  N/A
Fine-grained [99]  80.3%  4.16×  0.55M  N/A  N/A  N/A  N/A  N/A  N/A
Ours  81.9%  11.2×  0.3M  7  0.26MB  6  0.51MB  0.61MB  25.5×
Structured:
SSL [76]  80.4%  1.4×  1.6M  N/A  N/A  –  N/A  N/A  N/A
Ours  81.8%  5.1×  0.65M  7  0.56MB  –  0.56MB  0.56MB  23.3×
Table 8: Comparison results on ResNet-18 for ImageNet (CONV layers).
Method  Top-5 Acc.  CONV Prune Rate  CONV No. of Weights  CONV Quant Bits  CONV Weight Store  Index Bits  Total Store (rel. idx)  Total Store (abs. idx)  CONV Compress Rate
Baseline ResNet-18  89.1%  1.0×  11.2M  32  44.7MB  –  44.7MB  44.7MB  1.0×
Non-structured (ours)  89.1%  6.4×  1.75M  6  1.32MB  5  2.47MB  3.11MB  18.1×
Non-structured (ours)  87.9%  8.9×  1.26M  6  0.94MB  5  1.89MB  2.29MB  23.6×
Structured (ours)  89.1%  2.5×  4.46M  6  3.34MB  –  3.34MB  3.34MB  13.4×
Structured (ours)  87.8%  4.3×  2.60M  6  1.95MB  –  1.95MB  1.95MB  22.9×
7.2 Comparison Results on CIFAR-10 Dataset
Table 9 and Table 10 show the comparison results using the VGG-16 and ResNet-18 models on the CIFAR-10 dataset. We observe that very significant pruning rates can be achieved compared with prior work (over 30× improvement in certain cases). Investigating deeper, we found that the underlying reason is the CIFAR-10 dataset itself, in that it is both “simple” and “difficult”. “Simple” means that the input image scale is small and there are only 10 classes, while “difficult” means that the input images are blurred and feature extraction is not straightforward. As a result, researchers tend to migrate large-scale DNN models originally designed for ImageNet, such as VGG-16 and ResNet-18 (prior work even used ResNet-50). Consequently, there is a significant margin for model compression, which can be exploited by the proposed systematic framework but is difficult for heuristic methods.
Another observation is that non-structured pruning has only a marginal gain in pruning rate (reduction in the number of weights) compared with structured pruning. Our hypothesis is that this is due to the large search space of non-structured pruning. Together with the large number of index bits caused by high pruning rates, non-structured pruning is not preferable to structured pruning in terms of total storage size. The storage gap becomes surprisingly large when absolute indices are used.
Table 11 shows the comparison results using the MobileNet V2 model on the CIFAR-10 dataset. MobileNet is already compact and relatively difficult to prune further, yet we still achieve about 5× structured pruning along with 4-bit quantization. Again, non-structured pruning shows only a minor gain in weight reduction and is not preferable considering the unavoidable indexing overheads.
Table 9: Comparison results on VGG-16 for CIFAR-10 (CONV layers).
Method  Accuracy  CONV Prune Rate  CONV No. of Weights  CONV Quant Bits  CONV Weight Store  Index Bits  Total Store (rel. idx)  Total Store (abs. idx)  CONV Compress Rate
Baseline VGG-16  93.7%  1.0×  14.7M  32  58.8MB  –  58.8MB  58.8MB  1.0×
Non-structured (ours)  93.1%  57.4×  0.26M  5  0.16MB  7  0.54MB  0.72MB  109×
Structured:
2PFPCE [109]  92.8%  –  3.7M  N/A  N/A  –  N/A  N/A  N/A
2PFPCE [109]  91.0%  8.3×  1.8M  N/A  N/A  –  N/A  N/A  N/A
Ours  93.1%  50.0×  0.29M  5  0.18MB  –  0.18MB  0.18MB  327×
Table 10: Comparison results on ResNet-18 for CIFAR-10 (CONV layers).
Method  Accuracy  CONV Prune Rate  CONV No. of Weights  CONV Quant Bits  CONV Weight Store  Index Bits  Total Store (rel. idx)  Total Store (abs. idx)  CONV Compress Rate
Baseline ResNet-18  93.9%  1.0×  11.2M  32  44.6MB  –  44.6MB  44.6MB  1.0×
Non-structured (ours)  93.3%  69.0×  0.16M  5  0.10MB  8  0.33MB  0.53MB  135×
Structured:
AMC [110]  93.5%  1.7×  N/A  N/A  N/A  –  N/A  N/A  N/A
Ours  93.3%  59.8×  0.19M  5  0.12MB  –  0.12MB  0.12MB  372×
Table 11: Comparison results on MobileNet V2 for CIFAR-10 (CONV layers).
Method  Accuracy  CONV Prune Rate  CONV No. of Weights  CONV Quant Bits  CONV Weight Store  Index Bits  Total Store (rel. idx)  Total Store (abs. idx)  CONV Compress Rate
Baseline MobileNet V2  95.1%  1.0×  2.2M  32  9.0MB  –  9.0MB  9.0MB  1.0×
Non-structured (ours)  94.9%  6.1×  0.37M  4  0.19MB  4  0.48MB  0.55MB  18.8×
Structured (ours)  95.1%  4.9×  0.45M  4  0.23MB  –  0.23MB  0.23MB  39.2×
7.3 Comparison Results on MNIST Dataset
Table 12 shows the comparison results using the LeNet-5 model on the MNIST dataset. MNIST is a simple dataset, and we achieve 87.9× structured pruning on CONV layers together with 3-bit quantization. Non-structured pruning is again not preferred, due to the large number of index bits and the marginal increase in weight reduction rate. Ironically, it results in multiple times the storage of structured pruning once weight quantization is in place.
Table 12: Comparison results on LeNet-5 for MNIST (CONV layers).
Method  Accuracy  CONV Prune Rate  CONV No. of Weights  CONV Quant Bits  CONV Weight Store  Index Bits  Total Store (rel. idx)  Total Store (abs. idx)  CONV Compress Rate
Baseline LeNet-5  99.2%  1.0×  25.5K  32  102KB  –  102KB  102KB  1.0×
Non-structured:
Han [108]  99.2%  7.7×  3.33K  8  3.33KB  5  7.0KB  N/A  14.5×
Ours  99.0%  114.3×  223  3  0.08KB  8  0.39KB  0.93KB  262×
Structured:
SSL [76]  99.0%  26.1×  975  N/A  N/A  –  N/A  N/A  N/A
Ours  99.0%  87.9×  290  3  0.11KB  –  0.11KB  0.11KB  944×
7.4 Comparison on Computation Efficiency
We have shown that non-structured pruning is not preferable in terms of storage, even assuming the storage-friendly CSR format with relative indices, let alone absolute indices. Based on our methodology, we find that computation efficiency shows a similar trend.
As discussed before, structured pruning has higher computation efficiency if its pruning rate is more than 37% of the non-structured pruning rate. In all our tests, the ratio between the pruning rates of structured vs. non-structured pruning ranges from 40% to 87%, with a large variation but consistently above 37%. Even in the 40% case, the choice is clear considering the difficulty of hardware design for non-structured sparsity.
8 Discussions
In this section, we discuss additional factors and variations across platforms, and explain why our conclusion is unlikely to change. We then draw the final conclusion that non-structured weight pruning is in general not preferred over structured pruning across different platforms, application scenarios, DNN types, etc.
8.1 Algorithm Improvement and Generalization Enhancement
We consider the following question: would our conclusion change given further algorithmic improvements (beyond the ADMM-based unified solution in this paper), or with recently proposed generalization enhancement techniques such as warmup, mixup, and cosine decay from the bag of tricks [111]? Mixup is already utilized in MobileNet V2 training in this work and notably enhances convergence and stability (the original MobileNet training is very difficult). We hypothesize that the conclusion is likely to remain unchanged, as these techniques would enhance the results of both non-structured and structured weight pruning. Moreover, as pruning rates increase, the number of bits for index representation also increases, so the results would likely favor structured pruning to an even greater extent.
8.2 Transfer Learning and Adversarial Robustness
In many critical applications of deep learning, such as autonomous driving and medical imaging, there is a lack of sufficient labeled training data compared with standard image classification tasks. As a result, the transfer learning technique [112, 113, 114] is widely applied by (i) pre-training a DNN model on a standard dataset (say, ImageNet); (ii) transferring it to the target application domain; and (iii) fine-tuning it on target-domain data. It has recently been shown [115] that a sufficient number of weight parameters is needed to maintain generality, i.e., the ability to transfer across domains. This coincides with the practice that VGGNet and deep ResNets, rather than MobileNet, are the major model types for transfer learning. On the DNN security side, recent work [116] shows that a sufficient number of parameters is also required to maintain the robustness of a DNN against adversarial attacks. We hypothesize that structured pruning may be preferred here as well, because the larger number of remaining weight parameters (compared with non-structured pruning) gives a higher probability of satisfying the generality and adversarial robustness requirements. We believe it will be a challenge to quantify such requirements and to derive the best combination of structured pruning and quantization that optimizes performance while satisfying them.
8.3 FC Layers and RNNs
The comparisons conducted in this paper focus on CONV layers, which constitute the major computation in DNNs. On the other hand, FC layers are not negligible in DNNs. Moreover, FC layers constitute the major computation in recurrent neural networks (RNNs), which are as important as convolutional neural networks [107]. Our preliminary investigation shows that the gain of structured pruning in FC layers and in RNNs is even higher. This is an intuitive result, because FC layers have a higher degree of redundancy, and non-structured pruning would require more bits for indices. It is also worth mentioning that a number of structured-matrix-based techniques, such as block-circulant matrices [117] and cyclic matrices [118], serve as good candidates for structured pruning of FC layers; superior results have already been demonstrated on FC layers using these methods.
8.4 Effects of Weight Quantization
In current industry practice, weight quantization is the major method of DNN model compression and is typically prioritized over weight pruning. As a result, weight pruning is unlikely to be conducted alone (especially for FPGA/ASIC systems) without quantization. However, such systems may use a fixed quantization level (or a set of levels) to accommodate different DNN models and applications; e.g., the TPU supports 8-bit and 16-bit computation. Such moderate, fixed weight quantization (e.g., 8 bits) is unlikely to change the general conclusion of this paper, especially accounting for the difficulty of developing dedicated hardware that supports non-structured sparsity. On GPUs, multi-core CPUs, and even mobile devices, 8-bit/16-bit weight quantization is already well supported, and structured pruning is known to be more suitable for such systems.
At the other extreme, researchers are investigating quantization-only solutions, including binary and ternary quantization. As pointed out in Section 5, binary/ternary quantization can be almost lossless in many cases. However, we observe that a large margin for structured pruning still remains, as shown in the compression results on CIFAR-10, and such compression rates cannot be achieved by weight quantization alone. We therefore recommend performing structured pruning in combination with weight quantization.
9 Conclusion
Non-structured and structured weight pruning and weight quantization are the major methods of model compression, but the interaction among these techniques has never been clearly understood. This paper is the first to investigate the value of non-structured and structured DNN weight pruning when weight quantization is in place. We build ADMM-NN-S, a joint weight pruning and quantization framework with algorithmic support for structured pruning, dynamic ADMM regulation, and masked mapping and retraining. To perform a fair and fundamental comparison between non-structured and structured pruning in a hardware implementation-agnostic manner, we propose a methodology that captures storage overhead and computation efficiency. We perform extensive and representative tests of ADMM-NN-S with the AlexNet, VGGNet, ResNet-18/50, MobileNet, and LeNet-5 models on the ImageNet, CIFAR-10, and MNIST datasets. We show that ADMM-NN-S can significantly outperform the state-of-the-art results for non-structured pruning with quantization. More importantly, for the first time we show that, with quantization in place and at the same accuracy, non-structured pruning is not preferable in terms of either storage overhead or computation efficiency. We also explain in detail why this conclusion is unlikely to change across hardware platforms, application scenarios, DNN types, etc. Thus, we recommend that the community not continue investigating DNN inference engines based on non-structured sparsity. We release the code and all models of this work at an anonymous link: http://bit.ly/2WMQSRi.
References
 [1] Youjie Li, Jongse Park, Mohammad Alian, Yifan Yuan, Zheng Qu, Peitian Pan, Ren Wang, Alexander Schwing, Hadi Esmaeilzadeh, and Nam Sung Kim. A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 175–188. IEEE, 2018.
 [2] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. From high-level deep neural models to FPGAs. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1–13. IEEE Computer Society, 2016.
 [3] Haiyu Mao, Mingcong Song, Tao Li, Yuting Dai, and Jiwu Shu. LerGAN: A zero-free, low data movement and PIM-based GAN architecture. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 669–681. IEEE, 2018.
 [4] Kartik Hegde, Rohit Agrawal, Yulun Yao, and Christopher W. Fletcher. Morph: Flexible acceleration for 3D CNN-based video understanding. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 933–946. IEEE, 2018.
 [5] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In ACM SIGARCH Computer Architecture News, volume 44, pages 27–39. IEEE Press, 2016.
 [6] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. EIE: Efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 243–254. IEEE, 2016.

 [7] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. ACM SIGARCH Computer Architecture News, 44(3):1–13, 2016.
 [8] Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, and Shaojun Wei. RANA: Towards efficient neural acceleration with refresh-optimized embedded DRAM. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 340–352. IEEE Press, 2018.
 [9] Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, and Reetuparna Das. Neural cache: Bit-serial in-cache acceleration of deep neural networks. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 383–396. IEEE Press, 2018.

 [10] Mark Buckler, Philip Bedoukian, Suren Jayasuriya, and Adrian Sampson. EVA²: Exploiting temporal redundancy in live computer vision. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pages 533–546. IEEE, 2018.
 [11] Amir Yazdanbakhsh, Kambiz Samadi, Nam Sung Kim, and Hadi Esmaeilzadeh. GANAX: A unified MIMD-SIMD acceleration for generative adversarial networks. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 650–661. IEEE Press, 2018.
 [12] Kartik Hegde, Jiyong Yu, Rohit Agrawal, Mengjia Yan, Michael Pellauer, and Christopher W. Fletcher. UCNN: Exploiting computational reuse in deep neural networks via weight repetition. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 674–687. IEEE Press, 2018.
 [13] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and Hadi Esmaeilzadeh. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 764–775. IEEE Press, 2018.
 [14] Chao Zhang, Tong Meng, and Guangyu Sun. PM3: Power modeling and power management for processing-in-memory. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 558–570. IEEE, 2018.
 [15] Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. HyPar: Towards hybrid parallelism for deep learning accelerator array. arXiv preprint arXiv:1901.02067, 2019.
 [16] Xiaowei Wang, Jiecao Yu, Charles Augustine, Ravi Iyer, and Reetuparna Das. Bit Prudent in-cache acceleration of deep convolutional neural networks. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 81–93. IEEE, 2019.

 [17] Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Teman, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. PuDianNao: A polyvalent machine learning accelerator. In ACM SIGARCH Computer Architecture News, volume 43, pages 369–381. ACM, 2015.
 [18] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. TETRIS: Scalable and efficient neural network acceleration with 3D memory. ACM SIGOPS Operating Systems Review, 51(2):751–764, 2017.
 [19] Ao Ren, Zhe Li, Caiwen Ding, Qinru Qiu, Yanzhi Wang, Ji Li, Xuehai Qian, and Bo Yuan. SC-DCNN: Highly-scalable deep convolutional neural network using stochastic computing. ACM SIGOPS Operating Systems Review, 51(2):405–418, 2017.
 [20] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 461–475. ACM, 2018.
 [21] Ruizhe Cai, Ao Ren, Ning Liu, Caiwen Ding, Luhao Wang, Xuehai Qian, Massoud Pedram, and Yanzhi Wang. VIBNN: Hardware acceleration of Bayesian neural networks. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 476–488. ACM, 2018.
 [22] Yu Ji, Youhui Zhang, Wenguang Chen, and Yuan Xie. Bridge the gap between neural networks and neuromorphic hardware with a neural network compiler. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 448–460. ACM, 2018.
 [23] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 161–170. ACM, 2015.
 [24] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 16–25. ACM, 2016.
 [25] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 26–35. ACM, 2016.
 [26] Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-Hau Lin, Mani Srivastava, Rajesh Gupta, and Zhiru Zhang. Accelerating binarized convolutional neural networks with software-programmable FPGAs. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 15–24. ACM, 2017.
 [27] Jialiang Zhang and Jing Li. Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 25–34. ACM, 2017.
 [28] Chi Zhang and Viktor Prasanna. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 35–44. ACM, 2017.
 [29] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 45–54. ACM, 2017.
 [30] Utku Aydonat, Shane O’Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. An OpenCL™ deep learning accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 55–64. ACM, 2017.
 [31] Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. FINN: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 65–74. ACM, 2017.
 [32] Chang Gao, Daniel Neil, Enea Ceolini, Shih-Chii Liu, and Tobi Delbruck. DeltaRNN: A power-efficient recurrent neural network accelerator. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 21–30. ACM, 2018.
 [33] Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, and Chunyuan Zhang. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 97–106. ACM, 2018.
 [34] Hanqing Zeng, Ren Chen, Chi Zhang, and Viktor Prasanna. A framework for generating high throughput CNN implementations on FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 117–126. ACM, 2018.
 [35] Eriko Nurvitadhi, Jeffrey Cook, Asit Mishra, Debbie Marr, Kevin Nealis, Philip Colangelo, Andrew Ling, Davor Capalija, Utku Aydonat, Aravind Dasu, et al. In-package domain-specific ASICs for Intel® Stratix® 10 FPGAs: A case study of accelerating deep learning using TensorTile ASIC. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL), pages 106–1064. IEEE, 2018.
 [36] Zhe Chen, Andrew Howe, Hugh T. Blair, and Jason Cong. FPGA-based LSTM acceleration for real-time EEG signal processing. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 288–288. ACM, 2018.
 [37] Yankang Du, Qinrang Liu, Shuai Wei, and Chen Gao. Software-defined FPGA-based accelerator for deep convolutional neural networks. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 291–291. ACM, 2018.
 [38] Shuanglong Liu, Xinyu Niu, and Wayne Luk. A lowpower deconvolutional accelerator for convolutional neural network based segmentation on fpga. In Proceedings of the 2018 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 293–293. ACM, 2018.
 [39] Yifan Yang, Qijing Huang, Bichen Wu, Tianjun Zhang, Liang Ma, Giulio Gambardella, Michaela Blott, Luciano Lavagno, Kees Vissers, John Wawrzynek, et al. Synetgy: Algorithmhardware codesign for convnet accelerators on embedded fpgas. In Proceedings of the 2019 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 23–32. ACM, 2019.
 [40] Junzhong Shen, You Huang, Mei Wen, and Chunyuan Zhang. Accelerating 3d cnnbased lung nodule segmentation on a multifpga system.
 [41] Lu Jing, Jun Liu, and FuHai Yu. A deep learning inference accelerator based on model compression on fpga. In Proceedings of the 2019 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 118–118. ACM, 2019.
 [42] Weijie You and Chang Wu. A reconfigurable accelerator for sparse convolutional neural networks. In Proceedings of the 2019 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 119–119. ACM, 2019.
 [43] Xuechao Wei, Yun Liang, Peng Zhang, Cody Hao Yu, and Jason Cong. Overcoming data transfer bottlenecks in dnn accelerators via layerconscious memory managment. In Proceedings of the 2019 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 120–120. ACM, 2019.
 [44] Jialiang Zhang and Jing Li. Unleashing the power of soft logic for convolutional neural network acceleration via product quantization. In Proceedings of the 2019 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 120–120. ACM, 2019.
 [45] Shulin Zeng, Yujun Lin, Shuang Liang, Junlong Kang, Dongliang Xie, Yi Shan, Song Han, Yu Wang, and Huazhong Yang. A finegrained sparse accelerator for multiprecision dnn. In Proceedings of the 2019 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 185–185. ACM, 2019.
 [46] Hiroki Nakahara, Akira Jinguji, Masayuki Shimoda, and Shimpei Sato. An fpgabased fine tuning accelerator for a sparse cnn. In Proceedings of the 2019 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 186–186. ACM, 2019.
 [47] Liqiang Lu, Yun Liang, Ruirui Huang, Wei Lin, Xiaoyuan Cui, and Jiansong Zhang. Speedy: An accelerator for sparse convolutional neural networks on fpgas. In Proceedings of the 2019 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 187–187. ACM, 2019.
 [48] Zhucheng Tang, Guojie Luo, and Ming Jiang. Ftconv: Fpga acceleration for transposed convolution layers in deep neural networks. In Proceedings of the 2019 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 189–189. ACM, 2019.
 [49] Kaiyuan Guo, Shuang Liang, Jincheng Yu, Xuefei Ning, Wenshuo Li, Yu Wang, and Huazhong Yang. Compressed cnn training with fpgabased accelerator. In Proceedings of the 2019 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 189–189. ACM, 2019.
 [50] Ephrem Wu, Xiaoqian Zhang, David Berman, Inkeun Cho, and John Thendean. Computeefficient neuralnetwork acceleration. In Proceedings of the 2019 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 191–200. ACM, 2019.
 [51] Sebastian Vogel, Jannik Springer, Andre Guntoro, and Gerd Ascheid. Efficient acceleration of cnns for semantic segmentation on fpgas. In Proceedings of the 2019 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pages 309–309. ACM, 2019.
 [52] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
 [53] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGPLAN Notices, 49:269–284, 2014.
 [54] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M Aamodt, and Andreas Moshovos. Stripes: Bit-serial deep neural network computing. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1–12. IEEE Computer Society, 2016.
 [55] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 609–622. IEEE Computer Society, 2014.
 [56] Swagath Venkataramani, Ashish Ranjan, Subarno Banerjee, Dipankar Das, Sasikanth Avancha, Ashok Jagannathan, Ajaya Durg, Dheemanth Nagaraj, Bharat Kaul, Pradeep Dubey, et al. ScaleDeep: A scalable compute architecture for learning and evaluating deep networks. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pages 13–26. IEEE, 2017.
 [57] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 267–278. IEEE, 2016.
 [58] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. ShiDianNao: Shifting vision processing closer to the sensor. In Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on, pages 92–104. IEEE, 2015.
 [59] Mingcong Song, Kan Zhong, Jiaqi Zhang, Yang Hu, Duo Liu, Weigong Zhang, Jing Wang, and Tao Li. In-situ AI: Towards autonomous and incremental deep learning for IoT systems. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on, pages 92–103. IEEE, 2018.
 [60] Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. TABLA: A unified template-based framework for accelerating statistical machine learning. In High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pages 14–26. IEEE, 2016.
 [61] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017.
 [62] Bert Moons, Roel Uytterhoeven, Wim Dehaene, and Marian Verhelst. 14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International, pages 246–247. IEEE, 2017.
 [63] Giuseppe Desoli, Nitin Chawla, Thomas Boesch, Surinderpal Singh, Elio Guidetti, Fabio De Ambroggi, Tommaso Majo, Paolo Zambotti, Manuj Ayodhyawasi, Harvinder Singh, et al. 14.1 A 2.9 TOPS/W deep convolutional neural network SoC in FD-SOI 28nm for intelligent embedded systems. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International, pages 238–239. IEEE, 2017.
 [64] Paul N Whatmough, Sae Kyu Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu-Yeon Wei. 14.3 A 28nm SoC with a 1.2 GHz 568 nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International, pages 242–243. IEEE, 2017.
 [65] Jaehyeong Sim, Jun-Seok Park, Minhye Kim, Dongmyung Bae, Yeongjae Choi, and Lee-Sup Kim. 14.6 A 1.42 TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems. In Solid-State Circuits Conference (ISSCC), 2016 IEEE International, pages 264–265. IEEE, 2016.
 [66] Suyoung Bang, Jingcheng Wang, Ziyun Li, Cao Gao, Yejoong Kim, Qing Dong, Yen-Po Chen, Laura Fick, Xun Sun, Ron Dreslinski, et al. 14.7 A 288µW programmable deep-learning processor with 270KB on-chip weight storage using non-uniform memory hierarchy for mobile intelligence. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International, pages 250–251. IEEE, 2017.
 [67] Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. In Proceedings of the 35th International Conference on Computer-Aided Design, page 12. ACM, 2016.
 [68] Chen Zhang, Di Wu, Jiayu Sun, Guangyu Sun, Guojie Luo, and Jason Cong. Energy-efficient CNN implementation on a deeply pipelined FPGA cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pages 326–331. ACM, 2016.

 [69] Google’s Tensor Processing Unit explained: this is what the future of computing looks like, http://www.techradar.com/news/computing-components/processors/googles-tensor-processing-unit-explained-this-is-what-the-future-of-computing-looks-like-1326915.
 [70] https://www.sdxcentral.com/articles/news/intels-deep-learning-chips-will-arrive-2017/2016/11/.
 [71] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
 [72] Xiaoliang Dai, Hongxu Yin, and Niraj K Jha. NeST: a neural network synthesis tool based on a grow-and-prune paradigm. arXiv preprint arXiv:1711.02017, 2017.
 [73] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128, 2016.
 [74] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
 [75] Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 4857–4867, 2017.
 [76] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
 [77] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pages 548–560. IEEE, 2017.
 [78] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
 [79] Cong Leng, Hao Li, Shenghuo Zhu, and Rong Jin. Extremely low bit neural network: Squeeze the last bit out with ADMM. arXiv preprint arXiv:1707.09870, 2017.

 [80] Eunhyeok Park, Junwhan Ahn, and Sungjoo Yoo. Weighted-entropy-based quantization for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7197–7205, 2017.
 [81] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. In International Conference on Learning Representations (ICLR), 2017.
 [82] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
 [83] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.
 [84] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 [85] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.
 [86] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
 [87] https://www.tensorflow.org/mobile/tflite/.
 [88] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. PyTorch, 2017.
 [89] Ao Ren, Tianyun Zhang, Shaokai Ye, Jiayu Li, Wenyao Xu, Xuehai Qian, Xue Lin, and Yanzhi Wang. ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction method of multipliers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems.
 [90] Hua Ouyang, Niao He, Long Tran, and Alexander Gray. Stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pages 80–88, 2013.
 [91] Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In International Conference on Machine Learning, pages 392–400, 2013.
 [92] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
 [93] Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. arXiv preprint arXiv:1804.03294, 2018.
 [94] Zhe Yuan, Jinshan Yue, Huanrui Yang, Zhibo Wang, Jinyang Li, Yixiong Yang, Qingwei Guo, Xueqing Li, Meng-Fan Chang, Huazhong Yang, et al. Sticker: A 0.41–62.1 TOPS/W 8-bit neural network processor with multi-sparsity compatible convolution arrays and online tuning acceleration for fully connected layers. In 2018 IEEE Symposium on VLSI Circuits, pages 33–34. IEEE, 2018.
 [95] Ao Ren, Tianyun Zhang, Shaokai Ye, Jiayu Li, Wenyao Xu, Xuehai Qian, Xue Lin, and Yanzhi Wang. ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction method of multipliers. arXiv preprint arXiv:1812.11677, 2018.
 [96] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. Cambricon-X: An accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 20. IEEE Press, 2016.
 [97] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 27–40. IEEE, 2017.
 [98] Zhezhi He and Deliang Fan. Simultaneously optimizing weight and quantizer of ternary neural network using truncated gaussian approximation. arXiv preprint arXiv:1810.01018, 2018.
 [99] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.
 [100] Mingyi Hong, ZhiQuan Luo, and Meisam Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.

 [101] Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred Hero. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. In International Conference on Artificial Intelligence and Statistics, pages 288–297, 2018.
 [102] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [103] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [104] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [105] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
 [106] Hsin-Pai Cheng, Yuanjun Huang, Xuyang Guo, Feng Yan, Yifei Huang, Wei Wen, Hai Li, and Yiran Chen. Differentiable fine-grained quantization for deep neural network compression. In NIPS 2018 CDNNRIA Workshop, 2018.
 [107] Google supercharges machine learning tasks with TPU custom chip, https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html.
 [108] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), 2016.
 [109] Chuhan Min, Aosen Wang, Yiran Chen, Wenyao Xu, and Xin Chen. 2PFPCE: Two-phase filter pruning based on conditional entropy. arXiv preprint arXiv:1809.02220, 2018.
 [110] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In The European Conference on Computer Vision (ECCV), September 2018.
 [111] Junyuan Xie, Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, and Mu Li. Bag of tricks for image classification with convolutional neural networks. arXiv preprint arXiv:1812.01187, 2018.
 [112] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
 [113] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
 [114] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.
 [115] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
 [116] Shaokai Ye, Kaidi Xu, Sijia Liu, Hao Cheng, JanHenrik Lambrechts, Huan Zhang, Aojun Zhou, Kaisheng Ma, Yanzhi Wang, and Xue Lin. Second rethinking of network pruning in the adversarial setting. arXiv preprint arXiv:1903.12561, 2019.
 [117] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, et al. CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 395–408. ACM, 2017.
 [118] C Deng, S Liao, Y Xie, K K Parhi, X Qian, and B Yuan. PermDNN: Efficient compressed deep neural network architecture with permuted diagonal matrices. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018.