1 Introduction
The high computational and storage requirements of large-scale DNNs, such as VGG [25] or ResNet [6], make it prohibitive to deploy them broadly in real-time applications on mobile devices. Model compression techniques have been proposed to reduce both the storage and computational costs of the DNN inference phase [28, 17, 18, 4, 5, 8, 7, 33, 34, 21, 12]. One key model compression technique is DNN weight pruning [28, 17, 18, 4, 5, 8, 7, 33, 34], which reduces the number of weight parameters with minor (or no) accuracy loss.
There are mainly two categories of weight pruning. General, nonstructured pruning [17, 4, 5, 33] can prune arbitrary weights in a DNN. Despite the high pruning rate (weight reduction), it suffers from limited acceleration in actual hardware implementations due to the sparse weight matrix storage and the associated indices [5, 28, 8]. On the other hand, structured pruning [28, 18, 8, 34] directly reduces the size of the weight matrix while maintaining the form of a full matrix, without the need for indices. It is thus more compatible with hardware acceleration and has become the recent research focus. There are multiple types/schemes of structured pruning, e.g., filter pruning, channel pruning, and column pruning for the CONV layers of a DNN, as summarized in [28, 17, 8, 34]. Recently, a systematic solution framework [33, 34] has been developed based on the powerful optimization tool ADMM (Alternating Direction Method of Multipliers) [2, 26]. It is applicable to different schemes of structured pruning (and to nonstructured pruning) and achieves state-of-the-art results [33, 34] by far.
The structured pruning problem of DNNs is flexible, comprising a large number of hyperparameters, including the scheme (or combination of schemes) of structured pruning for each layer, the per-layer weight pruning rate, etc. A conventional hand-crafted policy has to explore this large design space to determine hyperparameters for weight or computation (FLOPs) reduction with minimum accuracy loss. This trial-and-error process is highly time-consuming, and the derived hyperparameters are usually suboptimal. It is thus desirable to employ an automated process of hyperparameter determination for the structured pruning problem, motivated by the concept of AutoML (automated machine learning)
[35, 1, 13, 22, 14]. The recent work AMC [7] employs the popular deep reinforcement learning (DRL) [35, 1] technique for automatic determination of per-layer pruning rates. However, it has two limitations: (i) it employs an early weight pruning technique based on fixed regularization, and (ii) it considers only filter pruning for structured pruning. As we shall see later, the underlying incompatibility between the utilized DRL framework and the problem further limits its ability to achieve high weight pruning rates (the maximum pruning rate reported in [7] is only 5×, and that is for nonstructured pruning).

This work makes the following innovative contributions to the automatic hyperparameter determination process for DNN structured pruning. First, we analyze the automatic process in detail and extract its generic flow, with four steps: (i) action sampling, (ii) quick action evaluation, (iii) decision making, and (iv) actual pruning and result generation. Next, we identify three sources of performance improvement compared with prior work. We adopt the ADMM-based structured weight pruning algorithm as the core algorithm, and propose an innovative additional purification step for further weight reduction without accuracy loss. Furthermore, we find that the DRL framework is inherently incompatible with the characteristics of the target pruning problem, and conclude that these issues can be mitigated simultaneously using an effective heuristic search method enhanced by experience-based guided search.
Combining all the improvements results in our automatic framework AutoCompress, which outperforms the prior work on automatic model compression by up to 33× in pruning rate (a 120× reduction in the actual parameter count) under the same accuracy. Through extensive experiments on the CIFAR-10 and ImageNet datasets, we conclude that AutoCompress is the key to achieving ultra-high pruning rates on the number of weights and FLOPs that could not be achieved before, while DRL cannot compete with human experts in achieving high pruning rates. Significant inference speedups have been observed with the AutoCompress framework in actual measurements on a smartphone, based on our compiler-assisted mobile DNN acceleration framework. We release all models of this work at an anonymous link: http://bit.ly/2VZ63dS.
2 Related Work
DNN Weight Pruning and Structured Pruning: DNN weight pruning includes two major categories: the general, nonstructured pruning [17, 4, 5, 33], where arbitrary weights can be pruned, and structured pruning [28, 17, 18, 8, 34], which maintains certain regularity. Nonstructured pruning can achieve a higher pruning rate (weight reduction). However, as the weights are stored in a sparse matrix format with indices, it often causes performance degradation in highly parallel implementations such as GPUs. This limitation is overcome in structured weight pruning.
Figure 1 illustrates three structured pruning schemes on the CONV layers of a DNN: filter pruning, channel pruning, and filter-shape pruning (a.k.a. column pruning), which remove whole filter(s), channel(s), and the same location in each filter of a layer, respectively. CONV operations in DNNs are commonly transformed to matrix multiplications by converting the weight tensors and feature map tensors to matrices [28], known as general matrix multiplication (GEMM). The key advantage of structured pruning is that the GEMM weight matrix remains a full matrix with reduced dimensions, without the need for indices, thereby facilitating hardware implementations. It is also worth mentioning that filter pruning and channel pruning are correlated [8]: pruning a filter in layer i (after batch normalization) results in the removal of the corresponding channel in layer i+1. The relationship in ResNet [6] and MobileNet [23] is more complicated due to bypass links.
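The GEMM view of structured pruning can be made concrete with a small NumPy sketch (all dimensions here are made up for illustration):

```python
import numpy as np

# A CONV weight tensor of shape (F filters, C channels, K x K kernels)
# becomes an (F, C*K*K) GEMM matrix.
F, C, K = 8, 4, 3
W = np.random.randn(F, C, K, K)
W_gemm = W.reshape(F, C * K * K)

# Filter pruning removes whole rows; the result is still a full matrix.
keep_filters = [0, 1, 2, 5, 6]                 # 3 of 8 filters pruned
W_filter_pruned = W_gemm[keep_filters, :]
assert W_filter_pruned.shape == (5, C * K * K)

# Column (filter-shape) pruning removes whole columns, i.e., the same
# (channel, kernel-position) location across all filters.
keep_cols = np.arange(C * K * K) % 2 == 0      # half the columns pruned
W_col_pruned = W_gemm[:, keep_cols]
assert W_col_pruned.shape == (F, C * K * K // 2)

# Since filter f of layer i produces input channel f of layer i+1,
# pruning that filter also lets the next layer drop K*K GEMM columns.
```

In both cases the pruned matrix stays dense, which is why no index storage is needed.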
ADMM: Alternating Direction Method of Multipliers (ADMM) is a powerful mathematical optimization technique that decomposes an original problem into two subproblems that can be solved separately and efficiently [2]. Consider the general optimization problem min_x f(x) + g(x). In ADMM, it is first rewritten as min_{x,z} f(x) + g(z) subject to x = z, where z is an auxiliary variable, and then decomposed into two subproblems on x and z, which are solved iteratively until convergence. The first subproblem derives x given z: min_x f(x) + q1(x). The second subproblem derives z given x: min_z g(z) + q2(z). Both q1 and q2 are quadratic functions.

As a key property, ADMM can effectively deal with a subset of combinatorial constraints and yield optimal (or at least high-quality) solutions. The constraints associated with DNN weight pruning (both nonstructured and structured) belong to this subset [10]. In the DNN weight pruning problem, f(x) is the loss function of the DNN, and the first subproblem is DNN training with a dynamic regularizer, which can be solved using current gradient descent techniques and solution tools [11, 27] for DNN training. g(z) corresponds to the combinatorial constraints on the number of weights. Owing to this compatibility with ADMM, the second subproblem has an optimal, analytical solution for weight pruning via Euclidean projection. This solution framework applies both to nonstructured pruning and to different variations of structured pruning schemes.

AutoML: Many recent works have investigated the concept of automated machine learning (AutoML), i.e., using machine learning for hyperparameter determination in DNNs. Neural architecture search (NAS) [35, 1, 14] is a representative application of AutoML. NAS has been deployed in Google's Cloud AutoML framework, which frees customers from the time-consuming DNN architecture design process. The most related prior work, AMC [7], applies AutoML to DNN weight pruning, leveraging a DRL framework similar to Google AutoML to generate the weight pruning rate for each layer of the target DNN. In conventional machine learning methods, the overall performance (accuracy) depends greatly on the quality of features. To reduce the burdensome manual feature selection process, automated feature engineering learns to generate an appropriate feature set that improves the performance of the corresponding machine learning tools.
3 The Proposed AutoCompress Framework for DNN Structured Pruning
Given a pre-trained DNN or a predefined DNN structure, the automatic hyperparameter determination process decides the per-layer weight pruning rate and the type (and possible combination) of structured pruning scheme for each layer. The objective is the maximum reduction in the number of weights or FLOPs with minimum accuracy loss.
3.1 Automatic Process: Generic Flow
Figure 2 illustrates the generic flow of such an automatic process, which applies to both AutoCompress and the prior work AMC. We call a sampled selection of hyperparameters an "action" for compatibility with DRL. The flow has the following steps: (i) action sampling, (ii) quick action evaluation, (iii) decision making, and (iv) actual pruning and result generation. Due to the large search space of hyperparameters, steps (i) and (ii) need to be fast. This is especially important for step (ii), in that we cannot employ time-consuming, retraining-based weight pruning (e.g., fixed regularization [28, 8] or ADMM-based techniques) to evaluate the actual accuracy loss. Instead, we can only use a simple heuristic, e.g., eliminating a predefined portion (based on the chosen hyperparameters) of the weights with the least magnitudes in each layer and then evaluating the accuracy. This is similar to [7]. Step (iii) makes the decision on hyperparameter values based on the collection of action samples and evaluations. Step (iv) generates the pruning result; the optimized (core) algorithm for structured weight pruning is employed here. This algorithm can be more complicated and higher-performing (e.g., the ADMM-based one), as it is performed only once per round.
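One round of the four-step flow can be sketched as follows. All helper names (`quick_evaluate`, `one_round`, `accuracy_fn`, `real_prune`) are assumptions for illustration, not the paper's implementation; the quick evaluation simply zeroes the smallest-magnitude fraction of weights per layer, with no retraining:

```python
import numpy as np

def quick_evaluate(weights, rates, accuracy_fn):
    """Step (ii): magnitude-based pruning heuristic, no retraining."""
    pruned = []
    for W, r in zip(weights, rates):
        thresh = np.quantile(np.abs(W), r)        # per-layer magnitude cutoff
        pruned.append(np.where(np.abs(W) < thresh, 0.0, W))
    return accuracy_fn(pruned)

def one_round(weights, sample_action, accuracy_fn, real_prune, n_samples=20):
    # (i) action sampling + (ii) quick action evaluation
    scored = [(a, quick_evaluate(weights, a, accuracy_fn))
              for a in (sample_action() for _ in range(n_samples))]
    # (iii) decision making: here, simply keep the best-scoring action
    best_action, _ = max(scored, key=lambda t: t[1])
    # (iv) actual pruning with the slower core algorithm, run once per round
    return real_prune(weights, best_action)
```

Only step (iv) invokes the expensive core algorithm; steps (i)–(iii) stay cheap so many actions can be explored.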
The overall automatic process is often iterative, and steps (i) through (iv) above reflect only one round. The reason is that it is difficult to reach high pruning rates in a single round, so the overall weight pruning process is progressive. This applies to both AMC and AutoCompress. The number of rounds is 4–8 in AutoCompress for fair comparison. Note that AutoCompress supports a flexible number of progressive rounds to achieve the maximum weight/FLOPs reduction given an accuracy requirement (or with zero accuracy loss).
3.2 Motivation: Sources of Performance Improvements
Based on the generic flow, we identify three sources of performance improvement (in terms of pruning rate, accuracy, etc.) compared with prior work. The first is the structured pruning scheme. Our observation is that an effective combination of filter pruning (which is correlated with channel pruning) and column pruning performs better than filter pruning alone (as employed in AMC [7]). Comparison results are shown in the evaluation section. This is due to the high flexibility of column pruning, which still maintains the hardware-friendly full-matrix format in GEMM. The second is the core algorithm for structured weight pruning in Step (iv). We adopt the state-of-the-art ADMM-based weight pruning algorithm in this step. Furthermore, we propose a further improvement, a purification step on top of the ADMM-based algorithm, taking advantage of the special characteristics after ADMM regularization. In the following two subsections, we discuss the core algorithm and the proposed purification step, respectively.
The third source of improvement is the underlying principle of action sampling (Step (i)) and decision making (Step (iii)). The DRL-based framework in [7] adopts an exploration-vs.-exploitation-based search for action sampling. For Step (iii), it trains a neural network using action samples and fast evaluations, and uses the neural network to make decisions on hyperparameter values. Our hypothesis is that DRL is inherently incompatible with the target automatic process and can be easily outperformed by effective heuristic search methods (such as simulated annealing or genetic algorithms), especially their enhanced versions. More specifically, the DRL-based framework adopted in [7] can hardly achieve high pruning rates (the maximum pruning rate in [7] is only 5×, and that is for nonstructured pruning), for the following reasons.

First, the sample actions in DRL are generated in a randomized manner and are evaluated (Step (ii)) using a very simple heuristic. As a result, these action samples and evaluation results (rewards) are only rough estimations. A neural network trained on them and relied upon for decision making will hardly generate satisfactory decisions, especially for high pruning rates.

Second, there is a common limitation of reinforcement learning techniques (both basic RL and DRL) on optimization problems with constraints [29, 32, 9]. As pruning rates cannot be set as hard constraints in DRL, it has to adopt a composite reward function combining accuracy loss and weight-number/FLOPs reduction. This causes a controllability issue, as the relative strength of accuracy loss versus weight reduction is very different for small pruning rates (the first couple of rounds) and high pruning rates (the later rounds). This creates a paradox between using a single reward function in DRL (which can hardly satisfy the requirement throughout the pruning process) and multiple reward functions (how many? how to adjust their parameters?).

Third, it is difficult for DRL to support a flexible and adaptive number of rounds in the automatic process to achieve the maximum pruning rates. As different DNNs have vastly different degrees of compression, it is challenging to achieve the best weight/FLOPs reduction with a fixed, predefined number of rounds. The difficulty of DRL in achieving high pruning rates can be observed in the evaluation section. As these issues can be mitigated by effective heuristic search, we emphasize that an additional benefit of heuristic search is the ability to perform guided search based on prior human experience. In fact, DRL research also tries to learn from heuristic search methods for action sampling [19, 24], but the generality of such approaches has not been widely evaluated.

3.3 Core Algorithm for Structured Pruning
This work adopts the ADMM-based weight pruning algorithm [33, 34] as the core algorithm, which generates state-of-the-art results in both nonstructured and structured weight pruning. Details are in [33, 34, 2, 26]. The major step in the algorithm is ADMM regularization. Consider a general DNN with loss function f({W_i}, {b_i}), where W_i and b_i correspond to the collections of weights and biases in layer i, respectively. The overall (structured) weight pruning problem is defined as

    minimize f({W_i}, {b_i}), subject to W_i ∈ S_i, i = 1, ..., N,   (1)

where S_i reflects the requirement that the remaining weights in layer i satisfy predefined "structures". Please refer to [28, 8] for more details.

By (i) defining indicator functions g_i(·) for the sets S_i, (ii) incorporating auxiliary variables Z_i and dual variables U_i, and (iii) adopting the augmented Lagrangian [2], ADMM regularization decomposes the overall problem into two subproblems and solves them iteratively until convergence. The first subproblem is

    minimize f({W_i}, {b_i}) + Σ_i (ρ_i/2) ||W_i − Z_i + U_i||_F^2,

which can be solved using current gradient descent techniques and solution tools for DNN training. The second subproblem is

    minimize Σ_i g_i(Z_i) + Σ_i (ρ_i/2) ||W_i − Z_i + U_i||_F^2,

which can be solved optimally via Euclidean projection of W_i + U_i onto S_i.
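For column pruning, the Euclidean projection has a simple closed form: keep the k columns with the largest ℓ2 norm and zero the rest. The sketch below runs one ADMM-regularization loop with a toy quadratic "loss" standing in for real DNN training; dimensions, ρ, and the step size are made up:

```python
import numpy as np

def project_columns(M, k):
    """Euclidean projection onto the set 'at most k nonzero columns'."""
    norms = np.linalg.norm(M, axis=0)        # per-column l2 norms
    keep = np.argsort(norms)[-k:]            # indices of the k largest
    Z = np.zeros_like(M)
    Z[:, keep] = M[:, keep]
    return Z

rng = np.random.default_rng(0)
rho = 1e-2
W = rng.standard_normal((16, 36))            # GEMM-view weight matrix
U = np.zeros_like(W)                         # dual variable
for _ in range(50):                          # ADMM-regularization loop
    Z = project_columns(W + U, k=9)          # second subproblem (projection)
    # First subproblem: a gradient step on f(W) + (rho/2)||W - Z + U||_F^2,
    # with toy f(W) = 0.5||W||_F^2 in place of the DNN loss.
    W = W - 0.1 * (W + rho * (W - Z + U))
    U = U + W - Z                            # dual update
assert (np.linalg.norm(Z, axis=0) > 0).sum() == 9
```

Each iteration dynamically pulls W toward the currently-best structured target Z, which is the "dynamic regularization" behavior discussed below.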
Overall, ADMM regularization is a dynamic regularization, in which the regularization target is dynamically adjusted in each iteration rather than penalizing all weights. This is why ADMM regularization outperforms prior work based on fixed ℓ1/ℓ2 regularization or projected gradient descent (PGD). To further enhance the convergence rate, the multi-ρ method [30] is adopted in ADMM regularization, in which the ρ_i values gradually increase with the ADMM iterations.
3.4 Purification and Unused Weights Removal
After ADMM-based structured weight pruning, we propose a purification and unused-weights removal step for further weight reduction without accuracy loss. First, as also noticed in prior work [8], a specific filter in layer i is responsible for generating one channel in layer i+1. As a result, removing a filter in layer i (in fact, removing its batch norm results) also allows the removal of the corresponding channel in layer i+1, thereby achieving further weight reduction. Beyond this straightforward procedure, there is a further margin of weight reduction based on the characteristics of ADMM regularization. As ADMM regularization is essentially a dynamic, norm-based regularization procedure, a large number of nonzero but small weight values remain after regularization. Due to the non-convex nature of ADMM regularization, our observation is that removing these weights maintains the accuracy, and occasionally even slightly improves it. We therefore define two thresholds for each DNN layer: a column-wise threshold and a filter-wise threshold. When the norm of a column (or filter) of weights is below the respective threshold, the column (or filter) is removed. The corresponding channel in layer i+1 can also be removed upon filter removal in layer i. The structures in each DNN layer are maintained after this purification step.
These two threshold values are layer-specific, depending on the relative weight values of each layer and on its sensitivity with respect to overall accuracy. They are hyperparameters to be determined for each layer in the AutoCompress framework, targeting the maximum weight/FLOPs reduction without accuracy loss.
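The purification step can be sketched as follows; the thresholds and shapes are made-up illustrations, not the paper's tuned values:

```python
import numpy as np

def purify(W_gemm, col_thresh, filt_thresh):
    """Zero columns/filters whose l2 norm falls below the layer-specific
    thresholds; return the purified matrix and the surviving filter ids."""
    W = W_gemm.copy()
    W[:, np.linalg.norm(W, axis=0) < col_thresh] = 0.0   # column-wise removal
    removed = np.linalg.norm(W, axis=1) < filt_thresh    # filter-wise removal
    W[removed, :] = 0.0
    return W, np.flatnonzero(~removed)                   # surviving filters

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 27))            # layer i, GEMM view (8 filters)
W1p, kept = purify(W1, col_thresh=2.0, filt_thresh=3.0)
# Input channels of layer i+1 indexed by pruned filters of layer i go too:
W2 = rng.standard_normal((16, 8, 3, 3))      # layer i+1: (F, C, K, K)
W2p = W2[:, kept, :, :]
```

Because removal happens at column/filter granularity, the pruned layer keeps its full-matrix structure in GEMM.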
3.5 The Overall AutoCompress Framework for Structured Weight Pruning and Purification
In this section, we discuss the AutoCompress framework based on the enhanced, guided heuristic search method, in which the automatic process determines the per-layer weight pruning rates, the structured pruning schemes (and combinations), as well as the hyperparameters in the purification step (discussed in Section 3.4). The overall framework has two phases, as shown in Figure 3: Phase I for structured weight pruning based on ADMM, and Phase II for the purification step. Each phase has multiple progressive rounds as discussed in Section 3.1, in which the weight pruning result from the previous round serves as the starting point of the subsequent round. We use Phase I as an illustrative example; Phase II uses similar steps.
The AutoCompress framework supports a flexible number of progressive rounds, as well as hard constraints on the weight or FLOPs reduction. In this way, it aims to achieve the maximum weight or FLOPs reduction while maintaining accuracy (or satisfying an accuracy requirement). For each round, we set the overall reduction in weight number/FLOPs to be a factor of 2 (with a small variance), based on the result from the previous round. In this way, we can achieve around 4× weight/FLOPs reduction within 2 rounds, already outperforming the reported structured pruning results in prior work [7].

We leverage a classical heuristic search technique, simulated annealing (SA), enhanced with guided search based on prior experience. The enhanced SA technique is based on the observation that a DNN layer with a larger number of weights often tolerates a higher degree of model compression with less impact on overall accuracy. The basic idea of SA lies in the search for actions: when a perturbation of the candidate action results in a better evaluation result (Step (ii) in Figure 2), the perturbation is accepted; otherwise, the perturbation is accepted with a probability depending on the degradation of the evaluation result, as well as on a temperature T. The rationale is to avoid being trapped in a local minimum during the search. The temperature gradually decreases during the search, in analogy to the physical "annealing" process.

Given the overall pruning rate (on weight number or FLOPs) in the current round, we initialize a randomized action by the following process: (i) order all layers by the number of remaining weights, (ii) assign a randomized pruning rate (and a partition between filter and column pruning schemes) to each layer, such that a layer with more weights receives no lower a pruning rate, and (iii) normalize the pruning rates to meet the overall pruning rate. We also start with a high initial temperature T. We define a perturbation as a change of the weight pruning rates (and of the portions of structured pruning schemes) in a subset of DNN layers; the perturbation also maintains the property that a layer with more remaining weights has a higher pruning rate. The result evaluation is the fast evaluation introduced in Section 3.1. The acceptance/denial of action perturbations, the decrease of the temperature T, and the associated reduction of the degree of perturbation with T follow the SA rules until convergence. The resulting action becomes the decision on hyperparameter values (Step (iii); this differs from DRL, which trains a neural network). ADMM-based structured pruning is then applied to generate the pruning result (Step (iv)), possibly feeding the next round until the final result.
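The SA loop described above can be sketched as follows. The helper names, constants, and the `evaluate` score are assumptions; an action is a vector of per-layer pruning rates kept in descending order (layers assumed sorted by remaining weight count), encoding the experience-based guidance that larger layers get no lower a pruning rate:

```python
import numpy as np

def sa_search(evaluate, n_layers, target_rate, T=1.0, cooling=0.9,
              steps=200, rng=np.random.default_rng(0)):
    action = np.sort(rng.uniform(0.1, 0.9, n_layers))[::-1]
    action *= target_rate / action.mean()         # meet the overall rate
    best, e_cur = action.copy(), evaluate(action)
    for _ in range(steps):
        # Perturb the action; perturbation size shrinks with temperature T.
        cand = np.clip(action + rng.normal(0, 0.05 * T, n_layers), 0.01, 0.99)
        cand = np.sort(cand)[::-1]                # keep the guided ordering
        cand *= target_rate / cand.mean()
        e_new = evaluate(cand)
        # Accept improvements; accept degradations with prob. exp(dE / T).
        if e_new >= e_cur or rng.random() < np.exp((e_new - e_cur) / T):
            action, e_cur = cand, e_new
            if evaluate(best) < e_cur:
                best = action.copy()
        T *= cooling                              # annealing schedule
    return best
```

Here `evaluate` would be the fast magnitude-based evaluation of Section 3.1; in this sketch any higher-is-better score works.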
4 Evaluation, Experimental Results, and Discussions
Table 1: Structured pruning results on VGG-16 for CIFAR-10.

Method | Accuracy | CONV Params Rt. | CONV FLOPs Rt. | Inference time
Original VGG-16 | 93.7% | 1.0× | 1.0× | 14ms
Filter pruning: 2PFPCE [18] | 92.8% | — | N/A | N/A
Filter pruning: 2PFPCE [18] | 91.0% | 8.3× | N/A | N/A
Filter pruning: ADMM, manual hyper. determ. | 93.48% | 9.3× | 2.1× | 7.1ms
Auto filter pruning: ADMM-based, enhanced SA | 93.22% | 13.7× | 3.1× | 4.8ms
Auto filter pruning: Train-From-Scratch | 93.19% | 13.7× | 3.1× | 4.8ms
Auto filter pruning: ADMM-based, enhanced SA | 88.78% | 47.4× | 14.0× | 1.7ms
Combined structured pruning: ADMM, manual hyper. determ. | 93.26% | 44.3× | 8.1× | 2.9ms
Combined structured pruning: Full AutoCompress | 93.21% | 52.2× | 8.8× | 2.7ms
Combined structured pruning: Train-From-Scratch | 91.4% | 52.2× | 8.8× | 2.7ms
Table 2: Structured pruning results on ResNet-18 for CIFAR-10.

Method | Accuracy | CONV Params Rt. | CONV FLOPs Rt. | Inference time
Original ResNet-18 | 93.9% | 1.0× | 1.0× | 11ms
Filter pruning: NISP [31] | 93.2% | — | N/A | N/A
Filter pruning: ADMM, manual hyper. determ. | 93.9% | 5.2× | 2.7× | 4.2ms
Auto filter pruning: AMC [7] | 93.5% | 1.7× | N/A | N/A
Auto filter pruning: ADMM-based, enhanced SA | 93.91% | 8.0× | 4.7× | 2.4ms
Auto filter pruning: Train-From-Scratch | 93.89% | 8.0× | 4.7× | 2.4ms
Combined structured pruning: ADMM, DRL hyper. determ. | 93.55% | 11.8× | 3.8× | 4.7ms
Combined structured pruning: ADMM, manual hyper. determ. | 93.69% | 43.3× | 9.6× | 1.9ms
Combined structured pruning: Full AutoCompress | 93.43% | 61.2× | 13.3× | 1.3ms
Combined structured pruning: Full AutoCompress | 93.81% | 54.2× | 12.2× | 1.45ms
Combined structured pruning: Train-From-Scratch | 91.88% | 54.2× | 12.2× | 1.45ms
Setup:
The effectiveness of AutoCompress is evaluated on VGG-16 and ResNet-18 on the CIFAR-10 dataset, and on VGG-16 and ResNet-18/50 on the ImageNet dataset. We focus on structured pruning of the CONV layers, which are the most computationally intensive layers in DNNs and the major source of storage in state-of-the-art DNNs such as ResNet. In this section we focus on the objective of reducing the number of weight parameters, and leave the objective of reducing the amount of computation (FLOPs) to the Supplementary Materials. The implementations are based on PyTorch [20]. For structured pruning, we support (i) filter pruning only and (ii) combined filter and column pruning, both supported in the ADMM-based algorithm and the AutoCompress framework. In the ADMM-based structured pruning algorithm, the number of epochs in each progressive round is 200, which is lower than in the prior iterative pruning heuristic [5]. We use fixed initial values for the ADMM penalty parameters ρ_i and the learning rate, and utilize the ADAM [11] optimizer. In the SA setup, the cooling factor and Boltzmann's constant are fixed, and the initial probability of accepting high-energy (bad) moves is set relatively high.

Models and Baselines: We aim at a fair and comprehensive evaluation of the three sources of performance improvement discussed in Section 3.2. Besides the original, unpruned DNN models, we compare with a set of prior baseline methods. Perhaps for convenience of software implementation, almost all baseline methods we can find focus on filter/channel pruning. For fair comparison, we also provide pruning results for ADMM-based filter pruning with manual hyperparameter determination. This case differs from prior work by only a single source of performance improvement, namely the ADMM-based core algorithm. We also show results for ADMM-based filter pruning with enhanced-SA-based hyperparameter determination, to show the effect of an additional source of improvement.

Beyond filter-only pruning, we show combined structured pruning results using ADMM to demonstrate the last source of performance improvement. We provide results for manual, our crafted DRL-based, and enhanced-SA-based hyperparameter determination for fair comparison, the last representing the full version of AutoCompress. We provide the inference times of the pruned models on the latest Qualcomm Adreno 640 GPU in a Samsung Galaxy S10 smartphone. The results clearly demonstrate the actual acceleration obtained by combined structured pruning. Note that our mobile DNN acceleration framework is a compiler-assisted, strong framework by itself. For the original VGG-16 and ResNet-18 (without pruning) on CIFAR-10, it achieves 14ms and 11ms end-to-end inference times, respectively, on the Adreno 640 mobile GPU. For the original VGG-16 and ResNet-50 on ImageNet, it achieves 95ms and 48ms inference times, respectively. All these results, as starting points, outperform current DNN acceleration frameworks such as TensorFlow Lite [27] and TVM [3].

Recent work [15] points out an interesting aspect: when one trains from scratch based on the structure (not the weight values) of a pruned model, one can often retrieve the same accuracy as the model after pruning. We incorporate this "Train-From-Scratch" process based on the results of filter pruning and of combined filter and column pruning (both the best results using the enhanced-SA-based search), and observe whether the accuracy can be retrieved.
Through extensive experiments, we conclude that AutoCompress is the key to achieving ultra-high pruning rates on the number of weights and FLOPs that could not be achieved before, while DRL cannot compete with human experts in achieving high structured pruning rates.
4.1 Results and Discussions on CIFAR10 Dataset
Table 1 presents the comparison results on VGG-16 for the CIFAR-10 dataset, while Table 2 shows the results on ResNet-18 (ResNet-50 for some baselines, whose accuracy is similar to our ResNet-18). The objective function of our AutoCompress framework is reducing the number of weight parameters (please refer to the Supplementary Materials for the cases where FLOPs reduction is the objective), but the FLOPs reduction results are also reported.
From the two tables we draw the following conclusions. First, for filter/channel pruning only with manual hyperparameter determination, our method outperforms the prior work 2PFPCE, NISP, and AMC (in both accuracy and pruning rate). As no other sources of improvement are exploited, this improvement is attributed to the ADMM-based algorithm equipped with purification. Second, combined structured pruning outperforms filter-only pruning in both weight reduction and FLOPs reduction. For manual hyperparameter determination, combined structured pruning enhances the pruning rate from 9.3× to 44.3× on VGG-16, and from 5.2× to 43.3× on ResNet-18. If we aim at the same high pruning rate with filter-only pruning, it suffers a notable accuracy drop (e.g., 88.78% accuracy at 47.4× pruning rate on VGG-16). Third, the enhanced-SA-based hyperparameter determination outperforms its DRL and manual counterparts. As can be observed in the two tables, the full AutoCompress achieves a moderate improvement in pruning rate compared with manual hyperparameter optimization, but significantly outperforms the DRL-based framework (all other sources of improvement being the same). This supports the statement that DRL is not compatible with ultra-high pruning rates. For relatively small pruning rates, DRL can hardly outperform the manual process either, as our improvement over 2PFPCE (manual) is smaller than our improvement over AMC (DRL-based).
With all sources of performance improvement effectively exploited, the full AutoCompress framework achieves a 15.3× improvement in weight reduction compared with 2PFPCE and a 33× improvement compared with NISP and AMC, under the same (or higher, for AutoCompress) accuracy. When accounting for the different numbers of parameters in ResNet-18 and ResNet-50 (NISP and AMC), the improvement can even be perceived as 120×. This demonstrates the significant performance of our proposed AutoCompress framework, and also implies that the high redundancy of DNNs on the CIFAR-10 dataset had not been exploited in prior work. The measured inference speedup on the mobile GPU also validates the effectiveness of the combined pruning scheme and our proposed AutoCompress framework.
Moreover, there are some interesting results on “TrainFromScratch” cases, in response to the observations in [15]. When “TrainFromScratch” is performed based the result of filteronly pruning, it can recover the similar accuracy. The insight is that filter/channel pruning is similar to finding a smaller DNN model. In this case, the main merit of AutoCompress framework is to discover such DNN model, especially corresponding compression rates in each layer, and our method still outperforms prior work. On the other hand, when “TrainFromScratch” is performed based on the result of combined structured pruning, the accuracy CANNOT be recovered. This is an interesting observation. The underlying insight is that the combined pruning is not just training a smaller DNN model, but with adjustments of filter/kernel shapes. In this case, the pruned model represents a solution that cannot be achieved through DNN training only, even with detailed structures already given. In this case, weight pruning (and the AutoCompress framework) will be more valuable due to the importance of training from a fullsized DNN model.
4.2 Results and Discussions on ImageNet Dataset
In this subsection, we show the application of AutoCompress on the ImageNet dataset, together with more comparison results against filter-only pruning (equipped with the ADMM-based core algorithm and SA-based hyperparameter determination), which illustrates the first source of improvement. Tables 3 and 4 show the comparison results for structured pruning of VGG-16 and ResNet-18 (ResNet-50) on the ImageNet dataset, respectively. We can clearly see the advantage of AutoCompress over prior work, such as [8] (filter pruning with manual determination), AMC [7] (filter pruning with DRL), and ThiNet [16] (filter pruning with manual determination). We can also see the advantage of AutoCompress over manual hyperparameter determination (both with combined structured pruning and the ADMM-based core algorithm), improving the structured pruning rate from 2.7× to 3.3× on ResNet-18 (ResNet-50) under the same (Top-5) accuracy. Finally, the full AutoCompress also outperforms filter-only pruning (both with the ADMM-based core algorithm and SA-based hyperparameter determination), improving the structured pruning rate from 3.8× to 6.4× on VGG-16 under the same (Top-5) accuracy. This demonstrates the advantage of combined filter and column pruning over filter-only pruning when the other sources of improvement are the same. Besides, our filter-only pruning results also outperform prior work, demonstrating the strength of the proposed framework.
Table 3: Structured pruning results on VGG-16 for ImageNet.

Method | Top-5 Acc. Loss | Params Rt. | Objective
Filter [8] | 1.7% | 4× | N/A
AMC [7] | 1.4% | 4× | N/A
Filter pruning, ADMM, SA | 0.6% | 3.8× | Params#
Full AutoCompress | 0.6% | 6.4× | Params#
Table 4: Structured pruning results on ResNet-18 (ResNet-50) for ImageNet.

Method | Top-5 Acc. Loss | Params Rt. | Objective
ThiNet-50 [16] | 1.1% | 2× | N/A
ThiNet-30 [16] | 3.5% | 3.3× | N/A
Filter pruning, ADMM, SA | 0.8% | 2.7× | Params#
Combined pruning, ADMM, manual | 0.1% | 2.7× | N/A
Full AutoCompress | 0.1% | 3.3× | Params#
Table 5: Nonstructured pruning results on ResNet-50 for ImageNet.

Method | Top-5 Acc. Loss | Params Rt. | Objective
AMC [7] | 0% | 4.8× | N/A
ADMM, manual hyper. | 0% | 8.0× | N/A
Full AutoCompress | 0% | 9.2× | Params#
Full AutoCompress | 0.7% | 17.4× | Params#
Last but not least, the proposed AutoCompress framework can also be applied to nonstructured pruning. For nonstructured pruning on the ResNet-50 model for the ImageNet dataset, AutoCompress achieves a 9.2× nonstructured pruning rate on the CONV layers without accuracy loss (92.7% Top-5 accuracy), outperforming manual hyperparameter optimization with ADMM-based pruning (8.0× pruning rate) and the prior work AMC (4.8× pruning rate).
5 Conclusion
This work proposes AutoCompress, an automatic structured pruning framework with the following key performance improvements: (i) it effectively incorporates combinations of structured pruning schemes into the automatic process; (ii) it adopts state-of-the-art ADMM-based structured weight pruning as the core algorithm, and proposes an innovative additional purification step for further weight reduction without accuracy loss; and (iii) it develops an effective heuristic search method enhanced by experience-based guided search, replacing the prior deep reinforcement learning technique, which has an underlying incompatibility with the target pruning problem. Extensive experiments on the CIFAR-10 and ImageNet datasets demonstrate that AutoCompress is the key to achieving ultra-high pruning rates in the number of weights and FLOPs that could not be achieved before.
References
 [1] (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167.
 [2] (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3 (1), pp. 1–122.
 [3] (2018) TVM: an automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594.
 [4] (2016) Dynamic network surgery for efficient DNNs. In NIPS, pp. 1379–1387.
 [5] (2015) Learning both weights and connections for efficient neural network. In NIPS, pp. 1135–1143.
 [6] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE CVPR, pp. 770–778.
 [7] (2018) AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, pp. 815–832.
 [8] (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE ICCV, pp. 1389–1397.
 [9] (2018) Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence.
 [10] (2016) Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization 26 (1), pp. 337–364.
 [11] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [12] (2018) Extremely low bit neural network: squeeze the last bit out with ADMM. In Thirty-Second AAAI Conference on Artificial Intelligence.
 [13] (2016) Hyperband: a novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560.
 [14] (2018) Progressive neural architecture search. In ECCV, pp. 19–34.
 [15] (2018) Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
 [16] (2017) ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE ICCV, pp. 5058–5066.
 [17] (2017) An entropy-based pruning method for CNN compression. arXiv preprint arXiv:1706.05791.
 [18] (2018) 2PFPCE: two-phase filter pruning based on conditional entropy. arXiv preprint arXiv:1809.02220.
 [19] (2016) Deep exploration via bootstrapped DQN. In NIPS, pp. 4026–4034.
 [20] (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
 [21] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, pp. 525–542.
 [22] (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2902–2911.
 [23] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE CVPR, pp. 4510–4520.
 [24] (2008) Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th International Conference on Machine Learning, pp. 968–975.
 [25] (2015) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
 [26] (2013) Dual averaging and proximal gradient descent for online alternating direction multiplier method. In ICML, pp. 392–400.
 [27] (2017) TensorFlow Lite. TensorFlow. Note: https://www.tensorflow.org/lite
 [28] (2016) Learning structured sparsity in deep neural networks. In NIPS, pp. 2074–2082.
 [29] (2011) Protecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 120–127.
 [30] (2018) Progressive weight pruning of deep neural networks using ADMM. arXiv preprint arXiv:1810.07378.
 [31] (2018) NISP: pruning networks using neuron importance score propagation. In Proceedings of the IEEE CVPR, pp. 9194–9203.
 [32] (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
 [33] (2018) A systematic DNN weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199.
 [34] (2018) ADAM-ADMM: a unified, systematic framework of structured weight pruning for DNNs. arXiv preprint arXiv:1807.11091.
 [35] (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
Supplementary Materials
Comparison between two objective functions (weight parameter reduction and FLOPs reduction)
Table 6: Comparison of objective functions on VGG-16 structured pruning for CIFAR-10.
Method | Acc. | Params Rate | FLOPs Rate | Objective
2PFPCE [18] | 92.8% | 4.0× | N/A | N/A
2PFPCE [18] | 91.0% | 8.3× | N/A | N/A
Combined struct. pruning, ADMM, manual determ. | 93.26% | 44.3× | 8.1× | N/A
Full AutoCompress | 93.21% | 52.2× | 8.8× | Params#
Full AutoCompress | 92.72% | 61.1× | 10.6× | Params#
Full AutoCompress | 92.65% | 59.1× | 10.8× | FLOPs#
Full AutoCompress | 92.79% | 51.3× | 9.1× | FLOPs#
Table 7: Comparison of objective functions on ResNet-18 structured pruning for CIFAR-10.
Method | Acc. | Params Rate | FLOPs Rate | Objective
AMC [7] | 93.5% | 1.7× | N/A | N/A
NISP [31] | 93.2% | N/A | N/A | N/A
Combined struct. pruning, ADMM, DRL determ. | 93.55% | 11.8× | 3.8× | Params#
Combined struct. pruning, ADMM, manual determ. | 93.69% | 43.3× | 9.6× | N/A
Full AutoCompress | 93.75% | 55.6× | 12.3× | Params#
Full AutoCompress | 93.43% | 61.2× | 13.3× | Params#
Full AutoCompress | 92.98% | 80.8× | 17.2× | Params#
Full AutoCompress | 93.81% | 54.2× | 12.2× | FLOPs#
We compare the two search objectives, weight number (Params#) and FLOPs (FLOPs#) reduction, within the AutoCompress framework. Detailed results for the VGG-16 and ResNet-18 models on the CIFAR-10 dataset are shown in Table 6 and Table 7, respectively. Moreover, Figure 4 illustrates the portion of pruned weights per layer on VGG-16 for CIFAR-10 under the Params# and FLOPs# search objectives. One can observe only a slight difference in the portion of pruned weights per layer (the tables also show that parameter count reduction and FLOPs reduction are highly correlated). This is because further weight reduction in the first several layers of the VGG-16 and ResNet-18 models (and many other DNN models alike) results in significant accuracy degradation. This convergence of results under the two objectives appears to be an interesting feature of ultra-high pruning rates. It also hints that observations (and conclusions) made under relatively low pruning rates may not hold under ultra-high pruning rates.
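The relationship between the two objectives can be made concrete: for a CONV layer, FLOPs equal the weight count multiplied by the output feature-map size, so pruning one weight in an early layer (large feature map) saves far more FLOPs than pruning one in a late layer. A minimal sketch, using illustrative VGG-style layer sizes rather than the paper's exact configurations:

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k CONV layer (bias ignored)."""
    return c_out * c_in * k * k

def conv_flops(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate count: each output activation consumes
    one k x k x c_in dot product, so FLOPs = params * h_out * w_out."""
    return conv_params(c_in, c_out, k) * h_out * w_out

# Illustrative layer shapes: an early layer has few weights but a large
# feature map; a late layer is the opposite.
early = (64, 64, 3, 224, 224)   # c_in, c_out, k, h_out, w_out
late = (512, 512, 3, 14, 14)

flops_per_weight_early = conv_flops(*early) / conv_params(*early[:3])
flops_per_weight_late = conv_flops(*late) / conv_params(*late[:3])
print(flops_per_weight_early)   # 50176.0 (= 224 * 224)
print(flops_per_weight_late)    # 196.0   (= 14 * 14)
```

Because FLOPs per weight is just the feature-map area, the two objectives only diverge if the search prunes early and late layers very differently; once early layers cannot be pruned further without accuracy loss, Params# and FLOPs# results converge, consistent with the tables above.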