The high computational and storage requirements of large-scale DNNs, such as VGG (1) or ResNet (2), make them prohibitive for broad, real-time applications on mobile devices. Model compression techniques have been proposed that aim at reducing both the storage and computational costs of the DNN inference phase (3; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15). One key model compression technique is DNN weight pruning (3; 4; 5; 6; 7; 8; 9; 10; 11; 14; 15), which reduces the number of weight parameters with minor (or no) accuracy loss.
There are two main categories of weight pruning. General, non-structured pruning (4; 6; 7; 10; 14; 15) can prune arbitrary weights in a DNN. Despite its high pruning rate (weight reduction), it suffers from limited acceleration in actual hardware implementations due to the sparse weight matrix storage and associated indices (7; 3; 8). Structured pruning (3; 5; 8; 11), on the other hand, directly reduces the size of the weight matrix while maintaining the form of a full matrix, without the need for indices. It is thus more compatible with hardware acceleration and has become the recent research focus. There are multiple types/schemes of structured pruning, e.g., filter pruning, channel pruning, and column pruning for the CONV layers of a DNN, as summarized in (3; 4; 8; 11). Recently, a systematic solution framework (10; 11) has been developed based on the powerful optimization tool ADMM (Alternating Direction Method of Multipliers) (16; 17; 18). It is applicable to different schemes of structured pruning (as well as non-structured pruning) and achieves state-of-the-art results (10; 11).
The structured pruning problem of DNNs is flexible, comprising a large number of hyperparameters, including the scheme of structured pruning and its combination (for each layer), the per-layer weight pruning rate, etc. A conventional hand-crafted policy has to explore the large design space to determine hyperparameters for weight or computation (FLOPs) reduction with minimum accuracy loss. This trial-and-error process is highly time-consuming, and the derived hyperparameters are usually sub-optimal. It is thus desirable to employ an automated process of hyperparameter determination for the structured pruning problem, motivated by the concept of AutoML (automated machine learning) (19; 20; 21; 22; 23; 24; 25). The recent work AMC (9) employs the popular deep reinforcement learning (DRL) (19; 20) technique for automatic determination of per-layer pruning rates. However, it has two limitations: (i) it employs an early weight pruning technique based on fixed regularization, and (ii) it only considers filter pruning for structured pruning. As we shall see later, the underlying incompatibility between the utilized DRL framework and the problem further limits its ability to achieve high weight pruning rates (the maximum reported pruning rate in (9) is only 5× and is for non-structured pruning).
This work makes the following contributions to the automatic hyperparameter determination process for DNN structured pruning. First, we analyze this automatic process in detail and extract its generic flow, which has four steps: (i) action sampling, (ii) quick action evaluation, (iii) decision making, and (iv) actual pruning and result generation. Next, we identify three sources of performance improvement compared with prior work. We adopt the ADMM-based structured weight pruning algorithm as the core algorithm, and propose an additional purification step for further weight reduction without accuracy loss. Furthermore, we find that the DRL framework has an underlying incompatibility with the characteristics of the target pruning problem, and conclude that these issues can be mitigated simultaneously using an effective heuristic search method enhanced by experience-based guided search.
Combining all the improvements results in our automatic framework AutoSlim, which outperforms the prior work on automatic model compression by up to 33× in pruning rate under the same accuracy. Through extensive experiments on the CIFAR-10 and ImageNet datasets, we conclude that AutoSlim is the key to achieving ultra-high pruning rates on the number of weights and FLOPs that could not be achieved before, while DRL cannot compete with human experts in achieving high pruning rates. We release the code and all models of this work at an anonymous link: http://bit.ly/2VZ63dS.
2 Related Work
DNN Weight Pruning and Structured Pruning: DNN weight pruning includes two major categories: general, non-structured pruning (4; 6; 7; 10; 14; 15), where arbitrary weights can be pruned, and structured pruning (3; 4; 5; 8; 11), which maintains certain regularity. Non-structured pruning can result in a higher pruning rate (weight reduction). However, as the weights are stored in a sparse matrix format with indices, it often results in performance degradation in highly parallel implementations such as GPUs. This limitation can be overcome by structured weight pruning.
Figure 1 illustrates three structured pruning schemes on the CONV layers of a DNN: filter pruning, channel pruning, and filter-shape pruning (a.k.a. column pruning), which remove whole filter(s), whole channel(s), and the same location in every filter of a layer, respectively. CONV operations in DNNs are commonly transformed to matrix multiplications by converting weight tensors and feature map tensors to matrices (3), an approach named general matrix multiplication (GEMM). The key advantage of structured pruning is that a full matrix is maintained in GEMM with reduced dimensionality, without the need for indices, thereby facilitating hardware implementations.
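As a minimal sketch of why structured pruning keeps a dense GEMM operand, assume the weight tensor of a CONV layer has been flattened into a matrix with one row per filter and one column per (channel, kernel position) pair. The helper names and the toy matrix are illustrative assumptions and do not come from the original work.

```python
def prune_filters(weight_matrix, filters_to_remove):
    """Drop whole rows (filters) from the GEMM weight matrix."""
    drop = set(filters_to_remove)
    return [row[:] for r, row in enumerate(weight_matrix) if r not in drop]


def prune_columns(weight_matrix, columns_to_remove):
    """Drop the same column (filter-shape position) from every filter."""
    drop = set(columns_to_remove)
    return [[w for c, w in enumerate(row) if c not in drop] for row in weight_matrix]


# Toy layer: 4 filters, each with 6 weights (e.g. 2 channels x 3 kernel positions).
weights = [[float(10 * r + c) for c in range(6)] for r in range(4)]

# Remove filter 1, then remove columns 0 and 4 from the remaining filters.
pruned = prune_columns(prune_filters(weights, [1]), [0, 4])
shape = (len(pruned), len(pruned[0]))
print(shape)  # the result is still a dense matrix, only smaller
```

No index structure is needed afterwards: the pruned operand is an ordinary 3 x 4 matrix that can be passed to the same GEMM routine.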
It is also worth mentioning that filter pruning and channel pruning are correlated (8), as pruning a filter in layer $i$ (after batch norm) results in the removal of the corresponding channel in layer $i+1$. The relationship in ResNet (2) and MobileNet (26) is more complicated due to bypass links.
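The coupling between filter pruning and channel pruning can be sketched on a plain two-layer chain without bypass links. The storage layout and the helper name are assumptions made for illustration: each layer is a list of filters, each filter is a list with one entry per input channel, and every kernel is abbreviated to a single number.

```python
def remove_filter(layer_i, layer_next, filter_index):
    """Remove one filter from a layer and the channel it feeds in the next layer."""
    new_i = [f for idx, f in enumerate(layer_i) if idx != filter_index]
    new_next = [
        [k for ch, k in enumerate(filt) if ch != filter_index]
        for filt in layer_next
    ]
    return new_i, new_next


# First layer: 3 filters over 2 input channels.
# Next layer: 4 filters over 3 input channels (one per filter of the first layer).
layer_i = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
layer_next = [[0.1, 0.2, 0.3] for _ in range(4)]

new_i, new_next = remove_filter(layer_i, layer_next, filter_index=1)
removed = (3 * 2 + 4 * 3) - (len(new_i) * 2 + len(new_next) * len(new_next[0]))
print(removed)  # kernels removed across both layers
```

Removing a single filter here eliminates two kernels in the first layer and four more in the next one, which is the extra saving the correlation provides.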
Alternating Direction Method of Multipliers (ADMM) is a powerful mathematical optimization technique that decomposes an original problem into two subproblems that can be solved separately and efficiently (16). Consider the general optimization problem $\min_{x} f(x) + g(x)$. In ADMM, it is decomposed into two subproblems on $x$ and $z$ ($z$ is an auxiliary variable), to be solved iteratively until convergence. The first subproblem derives $x$ given $z$: $\min_{x} f(x) + q_1(x|z)$. The second subproblem derives $z$ given $x$: $\min_{z} g(z) + q_2(z|x)$. Both $q_1$ and $q_2$ are quadratic functions.
As a key property, ADMM can effectively deal with a subset of combinatorial constraints and yield optimal (or at least high quality) solutions. The associated constraints in DNN weight pruning (both non-structured and structured) belong to this subset (27; 28). In the DNN weight pruning problem, $f(x)$ is the loss function of the DNN, and the first subproblem is DNN training with dynamic regularization, which can be solved using current gradient descent techniques and solution tools (29; 30) for DNN training. $g(x)$ corresponds to the combinatorial constraints on the number of weights. As a result of the compatibility with ADMM, the second subproblem has an optimal, analytical solution for weight pruning via Euclidean projection. This solution framework applies both to non-structured pruning and to different variations of structured pruning schemes.
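To make the Euclidean projection concrete, the following sketch handles one structured case, column pruning, under the assumption that the constraint is "at most `keep` non-zero columns". Under that assumption, the closest matrix in Frobenius norm is obtained by keeping the columns with the largest L2 norms and zeroing the rest. The function name and the toy matrix are illustrative.

```python
import math


def project_columns(matrix, keep):
    """Euclidean projection onto matrices with at most `keep` non-zero columns."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[c] ** 2 for row in matrix)) for c in range(n_cols)]
    # Keeping the columns with the largest L2 norms minimizes the Frobenius
    # distance to the original matrix under the column-sparsity constraint.
    kept = set(sorted(range(n_cols), key=lambda c: norms[c], reverse=True)[:keep])
    return [[w if c in kept else 0.0 for c, w in enumerate(row)] for row in matrix]


W = [
    [0.9, 0.01, -0.4, 0.05],
    [-0.8, 0.02, 0.5, -0.03],
]
Z = project_columns(W, keep=2)
print(Z)
```

The same pattern applies to filter pruning by ranking rows instead of columns, and to non-structured pruning by ranking individual weights by magnitude.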
AutoML: Many recent works have investigated the concept of automated machine learning (AutoML), i.e., using machine learning for hyperparameter determination in DNNs. Neural architecture search (NAS) (19; 20; 25) is a representative application of AutoML. NAS has been deployed in Google’s Cloud AutoML framework, which frees customers from the time-consuming DNN architecture design process. The most related prior work, AMC (9), applies AutoML to DNN weight pruning, leveraging a DRL framework similar to that of Google AutoML to generate the weight pruning rate for each layer of the target DNN. In conventional machine learning methods, the overall performance (accuracy) depends greatly on the quality of features (31). To reduce the burden of manual feature selection, automated feature engineering (32) learns to generate an appropriate feature set in order to improve the performance of the corresponding machine learning tools.
3 The Proposed AutoSlim Framework for DNN Structured Pruning
Given a pretrained DNN or predefined DNN structure, the automatic hyperparameter determination process will decide the per-layer weight pruning rate, and type (and possible combination) of structured pruning scheme per layer. The objective is the maximum reduction in the number of weights or FLOPs, with minimum accuracy loss.
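The two objectives differ because a weight in a layer with a large output feature map is reused many more times than a weight in a layer with a small one. The sketch below counts weights and multiply-accumulate operations (one common convention for FLOPs) for a single CONV layer; the layer shapes are illustrative assumptions, not values reported in this work.

```python
def conv_params(filters, channels, kernel):
    """Number of weights in a CONV layer (biases ignored)."""
    return filters * channels * kernel * kernel


def conv_macs(filters, channels, kernel, out_h, out_w):
    """Multiply-accumulates: every weight is used once per output position."""
    return conv_params(filters, channels, kernel) * out_h * out_w


# Illustrative shapes only: an early layer with a large feature map and
# a late layer with a small one.
early = dict(filters=64, channels=3, kernel=3, out_h=32, out_w=32)
late = dict(filters=512, channels=512, kernel=3, out_h=2, out_w=2)

for name, layer in (("early", early), ("late", late)):
    params = conv_params(layer["filters"], layer["channels"], layer["kernel"])
    macs = conv_macs(**layer)
    print(name, params, macs, macs // params)
```

In this toy setting the early layer holds few weights but each one costs 1024 operations, whereas each weight of the late layer costs only 4, so a search that targets FLOPs weighs layers differently from one that targets the weight count.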
3.1 Automatic Process: Generic Flow and Key Steps
Figure 2 illustrates the generic flow of such an automatic process, which applies to both AutoSlim and the prior work AMC. Here we call a sample selection of hyperparameters an “action” for compatibility with DRL. The flow has the following steps: (i) action sampling, (ii) quick action evaluation, (iii) decision making, and (iv) actual pruning and result generation. Due to the large search space of hyperparameters, steps (i) and (ii) should be fast. This is especially important for step (ii), in that we cannot employ time-consuming, retraining-based weight pruning (e.g., fixed regularization (3; 8) or ADMM-based techniques) to evaluate the actual accuracy loss. Instead, we can only use a simple heuristic, e.g., eliminating a predefined portion (based on the chosen hyperparameters) of the weights with the least magnitudes in each layer, and evaluating the accuracy. This is similar to (9). Step (iii) decides on the hyperparameter values based on the collection of action samples and evaluations. Step (iv) generates the pruning result, and the optimized (core) algorithm for structured weight pruning is employed here. This algorithm can be more complicated and deliver higher performance (e.g., the ADMM-based one), as it is only performed once in each round.
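The quick evaluation heuristic of step (ii) can be sketched as follows. This is an assumption-laden illustration: the model is a dictionary of flattened weight lists, an action maps each layer name to the fraction of weights to eliminate, and the accuracy measurement that would follow on a validation set is omitted. All names are hypothetical.

```python
def magnitude_prune(weights, prune_fraction):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    n_prune = int(len(weights) * prune_fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


def apply_action(model, action):
    """Apply per-layer pruning fractions (the action) without any retraining."""
    return {name: magnitude_prune(w, action[name]) for name, w in model.items()}


# Hypothetical two-layer model with flattened weights.
model = {
    "conv1": [0.5, -0.1, 0.3, -0.7],
    "conv2": [0.05, -0.9, 0.02, 0.4, -0.01, 0.6, 0.03, -0.2],
}
action = {"conv1": 0.25, "conv2": 0.5}

pruned = apply_action(model, action)
remaining = sum(1 for w in sum(pruned.values(), []) if w != 0.0)
print(pruned, remaining)
```

Because no retraining is involved, one such evaluation costs roughly a single inference pass over the evaluation data, which is what makes it usable inside a search loop.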
The overall automatic process is often iterative, and the above steps (i) through (iv) reflect only one round. The reason is that it is difficult to search for high pruning rates in a single round, so the overall weight pruning process is progressive. This applies to both AMC and AutoSlim. The number of rounds is 4 to 8 in AutoSlim for fair comparison. Note that AutoSlim supports a flexible number of progressive rounds to achieve the maximum weight/FLOPs reduction under a given accuracy requirement (or with zero accuracy loss).
3.2 Motivation: Sources of Performance Improvements
Based on the generic flow, we identify three sources of performance improvement (in terms of pruning rate, accuracy, etc.) compared with prior work. The first is the structured pruning scheme. Our observation is that an effective combination of filter pruning (which is correlated with channel pruning) and column pruning performs better than filter pruning alone (as employed in AMC (9)). Comparison results are shown in Section 4. This is because of the high flexibility of column pruning, which still maintains the hardware-friendly full matrix format in GEMM. The second is the core algorithm for structured weight pruning in Step (iv). We adopt the state-of-the-art ADMM-based weight pruning algorithm in this step. Furthermore, we propose a purification step as a further improvement of the ADMM-based algorithm, taking advantage of the special characteristics of the weights after ADMM regularization. In Sections 3.3 and 3.4, we discuss the core algorithm and the proposed purification step, respectively.
The third source of improvement is the underlying principle of action sampling (Step (i)) and decision making (Step (iii)). The DRL-based framework in (9) adopts an exploration vs. exploitation-based search for action sampling. For Step (iii), it trains a neural network using action samples and fast evaluations, and uses the neural network to decide on hyperparameter values. Our hypothesis is that DRL is inherently incompatible with the target automatic process, and can be easily outperformed by effective heuristic search methods (such as simulated annealing or genetic algorithms), especially their enhanced versions. More specifically, it is difficult for the DRL-based framework adopted in (9) to achieve high pruning rates (the maximum pruning rate in (9) is only 5× and is on non-structured pruning), for the following reasons.
First, the sample actions in DRL are generated in a randomized manner, and are evaluated (Step (ii)) using a very simple heuristic. As a result, these action samples and evaluation results (rewards) are only rough estimations. A neural network trained on them and relied upon for decision making will hardly generate satisfactory decisions, especially for high pruning rates.
Second, there is a common limitation of reinforcement learning techniques (both basic ones and DRL) on optimization problems with constraints (33; 34; 35; 36). As pruning rates cannot be set as hard constraints in DRL, it has to adopt a composite reward function covering both accuracy loss and weight No./FLOPs reduction. This is the source of a controllability issue, as the relative strength of accuracy loss and weight reduction is very different for small pruning rates (the first couple of rounds) and high pruning rates (the later rounds). This leads to a dilemma between using a single reward function in DRL (hard to satisfy the requirement throughout the pruning process) and using multiple reward functions (how many? how to adjust the parameters?).
Third, it is difficult for DRL to support a flexible and adaptive number of rounds in the automatic process to achieve the maximum pruning rates. As different DNNs admit vastly different degrees of compression, it is challenging to achieve the best weight/FLOPs reduction with a fixed, predefined number of rounds.
These effects can be observed in Section 4, which shows the difficulty of DRL in achieving high pruning rates. These issues can be mitigated by effective heuristic search. An additional benefit of heuristic search is the ability to perform guided search based on prior human experience. In fact, DRL research also tries to learn from heuristic search methods in this respect for action sampling (37; 38; 39; 40), but the generality is still not widely evaluated.
3.3 Core Algorithm for Structured Weight Pruning
This work adopts the ADMM-based weight pruning algorithm (10; 11) as the core algorithm, which generates state-of-the-art results in both non-structured and structured weight pruning. Details are in (10; 11; 16; 17; 18). The major step in the algorithm is ADMM regularization. Consider a general $N$-layer DNN with loss function $f(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N})$, where $W_i$ and $b_i$ correspond to the collections of weights and biases in layer $i$, respectively. The overall (structured) weight pruning problem is defined as
$$\min_{\{W_i\},\{b_i\}} f(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N}) \quad \text{subject to} \quad W_i \in S_i, \; i = 1, \dots, N,$$
where $S_i$ denotes the set of weight tensors of layer $i$ that satisfy the predefined structure (pruning scheme and pruning rate).
By (i) defining indicator functions $g_i(W_i)$, equal to $0$ if $W_i \in S_i$ and $+\infty$ otherwise, (ii) incorporating auxiliary variables $Z_i$ and dual variables $U_i$, and (iii) adopting the augmented Lagrangian (16), the ADMM regularization decomposes the overall problem into two subproblems and solves them iteratively until convergence. The first subproblem is
$$\min_{\{W_i\},\{b_i\}} f(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N}) + \sum_{i=1}^{N} \frac{\rho_i}{2} \| W_i - Z_i^{k} + U_i^{k} \|_F^2 .$$
It can be solved using current gradient descent techniques and solution tools for DNN training. The second subproblem is
$$\min_{\{Z_i\}} \sum_{i=1}^{N} g_i(Z_i) + \sum_{i=1}^{N} \frac{\rho_i}{2} \| W_i^{k+1} - Z_i + U_i^{k} \|_F^2 ,$$
which can be optimally solved by Euclidean projection.
Overall, ADMM regularization is a dynamic regularization in which the regularization target is adjusted in each iteration, without penalizing all the weights. This is why ADMM regularization outperforms prior work based on fixed $\ell_1$ or $\ell_2$ regularization or projected gradient descent (PGD). To further enhance the convergence rate, the multi-$\rho$ method (41) is adopted in ADMM regularization, where the $\rho$ values gradually increase with ADMM iterations.
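The alternation between the two subproblems can be illustrated on a toy problem. The sketch below is not the DNN training procedure: it assumes a quadratic stand-in loss f(w) = 0.5 * ||w - target||^2 so that the first subproblem has a closed form, uses a fixed penalty `rho` rather than the multi-rho schedule, and takes the constraint to be "at most k non-zero entries". The names `w`, `z`, and `u` stand for the weights, the auxiliary variable, and the scaled dual variable.

```python
def project_top_k(vector, k):
    """Euclidean projection onto vectors with at most k non-zero entries."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


def admm_prune(target, k, rho=1.0, iterations=30):
    """ADMM on the toy loss f(w) = 0.5 * ||w - target||^2 with at most k non-zeros."""
    w = list(target)
    z = project_top_k(w, k)
    u = [0.0] * len(w)
    for _ in range(iterations):
        # First subproblem: minimize f(w) + rho/2 * ||w - z + u||^2.
        # Closed form for this toy loss; a real DNN would instead run
        # gradient descent on the loss plus this quadratic term.
        w = [(t + rho * (zi - ui)) / (1.0 + rho) for t, zi, ui in zip(target, z, u)]
        # Second subproblem: Euclidean projection of w + u onto the constraint set.
        z = project_top_k([wi + ui for wi, ui in zip(w, u)], k)
        # Dual update accumulates the remaining mismatch between w and z.
        u = [ui + wi - zi for ui, wi, zi in zip(u, w, z)]
    return w, z


w, z = admm_prune([3.0, -0.5, 2.0, 0.1], k=2)
print(z)
```

The regularization target `z - u` changes in every iteration, and only the entries that the projection has zeroed are pulled towards zero; the two large entries are left unpenalized, which mirrors the "dynamic regularization" behaviour described above.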
3.4 Purification and Unused Weights Removal Step
After ADMM-based structured weight pruning, we propose a purification and unused weights removal step for further weight reduction without accuracy loss. First, as also noticed by prior work (8), a specific filter in layer $i$ is responsible for generating one channel in layer $i+1$. As a result, removing the filter in layer $i$ (in fact removing the batch norm results) also results in the removal of the corresponding channel, thereby achieving further weight reduction. Besides this straightforward procedure, there is a further margin for weight reduction based on the characteristics of ADMM regularization. As ADMM regularization is essentially a dynamic, $\ell_2$-norm based regularization procedure, there are a large number of non-zero, small weight values after regularization. Due to the non-convex property of ADMM regularization, our observation is that removing these weights maintains the accuracy, or occasionally even improves it slightly. As a result, we define two thresholds, a column-wise threshold and a filter-wise threshold, for each DNN layer. When the $\ell_2$ norm of a column (or filter) of weights is below the threshold, the column (or filter) is removed. The corresponding channel in layer $i+1$ can also be removed upon filter removal in layer $i$. Structures in each DNN layer are maintained after this purification step.
These two threshold values are layer-specific, depending on the relative weight values of each layer, and the sensitivity on overall accuracy. They are hyperparameters to be determined for each layer in the AutoSlim framework, for maximum weight/FLOPs reduction without accuracy loss.
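The thresholding rule can be sketched on the GEMM view of one layer, with one row per filter. This is an illustration under assumptions: the norm is taken to be the L2 norm, filters are checked before columns, and the threshold values and function names are hypothetical rather than values determined by the framework.

```python
import math


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def purify(matrix, filter_threshold, column_threshold):
    """Remove filters (rows) and columns whose L2 norm is below the thresholds."""
    kept_rows = [r for r, row in enumerate(matrix) if l2_norm(row) >= filter_threshold]
    rows = [matrix[r] for r in kept_rows]
    n_cols = len(matrix[0])
    kept_cols = [
        c for c in range(n_cols)
        if l2_norm([row[c] for row in rows]) >= column_threshold
    ]
    purified = [[row[c] for c in kept_cols] for row in rows]
    return purified, kept_rows, kept_cols


# Toy weight matrix after regularization: small but non-zero values remain.
W = [
    [0.80, 0.001, -0.60, 0.002],
    [0.003, -0.002, 0.001, 0.001],
    [-0.50, 0.002, 0.70, -0.001],
]
purified, kept_rows, kept_cols = purify(W, filter_threshold=0.05, column_threshold=0.05)
print(kept_rows, kept_cols, purified)
```

The list of removed rows also identifies which input channels of the following layer become unused and can be dropped there.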
3.5 The Overall AutoSlim Framework for Structured Weight Pruning and Purification
In this section, we discuss the AutoSlim framework based on the enhanced, guided heuristic search method, in which the automatic process determines per-layer weight pruning rates, structured pruning schemes (and combinations), as well as the hyperparameters of the purification step (discussed in Section 3.4). The overall framework has two phases, as shown in Figure 3: Phase I for structured weight pruning based on ADMM, and Phase II for the purification step. Each phase has multiple progressive rounds as discussed in Section 3.1, in which the weight pruning result from the previous round serves as the starting point of the subsequent round. We use Phase I as an illustrative example; Phase II uses similar steps.
The AutoSlim framework supports a flexible number of progressive rounds, as well as hard constraints on the weight or FLOPs reduction. In this way, it aims to achieve the maximum weight or FLOPs reduction while maintaining accuracy (or satisfying an accuracy requirement). In each round, we set the overall reduction in weight number/FLOPs to be a factor of 2 (with a small variance), based on the result from the previous round. In this way, we can achieve around 4× weight/FLOPs reduction within 2 rounds, already outperforming the reported structured pruning results in prior work (9).
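The effect of applying a fixed per-round factor can be checked with a short calculation. The helper names are illustrative, and the small per-round variance mentioned above is ignored.

```python
def cumulative_reduction(per_round_factor, rounds):
    """Overall reduction after applying the same factor in every round."""
    return per_round_factor ** rounds


def rounds_needed(target_reduction, per_round_factor=2.0):
    """Smallest number of rounds whose cumulative reduction reaches the target."""
    rounds, total = 0, 1.0
    while total < target_reduction:
        total *= per_round_factor
        rounds += 1
    return rounds


print(cumulative_reduction(2.0, 2))  # 4.0
print(rounds_needed(60.0))           # 6
```

With a factor of 2 per round, a reduction on the order of 60× is reached after six rounds, which is consistent with a small number of progressive rounds being sufficient for very high pruning rates.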
We leverage a classical heuristic search technique, simulated annealing (SA), enhanced with guided search based on prior experience. The enhanced SA technique is based on the observation that a DNN layer with more weights often admits a higher degree of model compression with less impact on overall accuracy. The basic idea of SA lies in the search for actions: when a perturbation of the candidate action results in a better evaluation result (Step (ii) in Figure 2), the perturbation is accepted; otherwise the perturbation is accepted with a probability that depends on the degradation in the evaluation result, as well as on a temperature. The purpose is to avoid being trapped in a local minimum during the search process. The temperature gradually decreases during the search process, in analogy to the physical “annealing” process.
Given the overall pruning rate (on weight No. or FLOPs) in the current round, we initialize a randomized action using the following process: (i) order all layers based on the number of remaining weights, (ii) assign a randomized pruning rate (and partition between filter and column pruning schemes) to each layer, such that a layer with more weights has a no-smaller pruning rate, and (iii) normalize the pruning rates so that they match the overall pruning rate of the round. We also set a high initial temperature $T$. We define a perturbation as a change of the weight pruning rates (and of the portion of structured pruning schemes) in a subset of DNN layers. The perturbation must also satisfy the requirement that a layer with more remaining weights has a higher pruning rate. The result evaluation is the fast evaluation introduced in Section 3.1. The acceptance/denial of an action perturbation, the decrease in temperature $T$, and the associated reduction in the degree of perturbation with $T$ follow the SA rules until convergence. The action outcome becomes the decision on hyperparameter values (Step (iii); this differs from DRL, which trains a neural network). The ADMM-based structured pruning is then adopted to generate the pruning result (Step (iv)), possibly serving as the input to the next round, until the final result is reached.
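The guided SA search for one round can be sketched as follows. Everything specific here is an assumption made for illustration: pruning is expressed as the fraction of weights removed per layer, the partition between filter and column pruning is omitted, `toy_accuracy_loss` is a stand-in for the fast evaluation, and the layer sizes, sensitivities, temperature, cooling factor, and constant `k` are hypothetical values.

```python
import math
import random


def normalize(fractions, sizes, overall_fraction, cap=0.95):
    """Scale per-layer fractions so the overall fraction of removed weights is met."""
    target = overall_fraction * sum(sizes)
    lo, hi = 0.0, cap / max(min(fractions), 1e-9)
    for _ in range(60):  # bisection on a common scale factor
        mid = (lo + hi) / 2.0
        removed = sum(n * min(mid * f, cap) for n, f in zip(sizes, fractions))
        if removed < target:
            lo = mid
        else:
            hi = mid
    return [min(hi * f, cap) for f in fractions]


def guided(fractions, sizes):
    """Guided-search rule: a layer with more weights gets a no-smaller fraction."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    ranked = sorted(fractions)
    result = [0.0] * len(sizes)
    for rank, layer in enumerate(order):
        result[layer] = ranked[rank]
    return result


def anneal(sizes, evaluate, overall_fraction, rng,
           t_init=1.0, cooling=0.95, k=1.0, steps=300):
    start = [rng.uniform(0.1, 0.9) for _ in sizes]
    action = normalize(guided(start, sizes), sizes, overall_fraction)
    cost = evaluate(action)
    best, best_cost, temp = action, cost, t_init
    for _ in range(steps):
        candidate = list(action)
        layer = rng.randrange(len(sizes))
        # The perturbation size shrinks together with the temperature.
        step = rng.uniform(-0.2, 0.2) * temp
        candidate[layer] = min(0.95, max(0.01, candidate[layer] + step))
        candidate = normalize(guided(candidate, sizes), sizes, overall_fraction)
        new_cost = evaluate(candidate)
        delta = new_cost - cost
        # Always accept improvements; accept degradations with a probability
        # that decays with the degradation and with the falling temperature.
        if delta <= 0 or rng.random() < math.exp(-delta / (k * temp)):
            action, cost = candidate, new_cost
            if cost < best_cost:
                best, best_cost = action, cost
        temp *= cooling
    return best, best_cost


sizes = [2_000, 40_000, 600_000, 2_400_000]
sensitivity = [8.0, 4.0, 2.0, 1.0]  # hypothetical per-layer sensitivities


def toy_accuracy_loss(action):
    """Stand-in for the fast evaluation: larger value means worse accuracy."""
    return sum(s * f * f for s, f in zip(sensitivity, action))


best, best_cost = anneal(sizes, toy_accuracy_loss, overall_fraction=0.5,
                         rng=random.Random(0))
removed = sum(n * f for n, f in zip(sizes, best)) / sum(sizes)
print([round(f, 3) for f in best], round(removed, 3))
```

Because every candidate is re-ranked and re-normalized before evaluation, the overall reduction acts as a hard constraint and the guided rule holds for every action visited, which is the property a composite reward in DRL cannot guarantee.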
4 Evaluation, Experimental Results, and Discussions
In this section, the effectiveness of AutoSlim is evaluated on VGG-16 and ResNet-18 on the CIFAR-10 dataset, and on VGG-16 and ResNet-18/50 on the ImageNet dataset. We focus on structured pruning of the CONV layers, which are the most computationally intensive layers in DNNs and account for most of the storage in state-of-the-art DNNs such as ResNet. We focus on two objective functions: reduction in the number of weight parameters and reduction in computation (FLOPs). The implementations are based on PyTorch (42). In the ADMM-based structured pruning algorithm, the number of epochs in each progressive round is 200, which is lower than that of the prior iterative pruning and retraining heuristic (7). We use an initial penalty parameter for ADMM and initial learning rate . The ADAM (29) optimizer is utilized. In the SA setup, we use cooling factor and Boltzmann’s constant . The initial probability of accepting high energy (bad) moves is set to be relatively high.
We aim at fair and comprehensive evaluation on the effectiveness of three sources of performance improvements discussed in Section 3.2. In order to illustrate the effect of individual improvement source, we design a number of combinations of techniques in the experiments. In the source of structured pruning scheme, we compare between filter pruning only vs. combined structured pruning (abbr. Fil vs. Comb). In the source of core algorithm, we compare between fixed regularization vs. ADMM (abbr. Fix vs. ADMM). In action sampling and decision making framework, we compare among DRL, manual optimization, and enhanced SA (abbr. DRL vs. Man vs. SA). For example, the configuration Comb-ADMM-SA reflects the full AutoSlim, while Fil-Fix-DRL reflects AMC (9).
Through extensive experiments, we conclude that AutoSlim is the key to achieving ultra-high pruning rates on the number of weights and FLOPs that could not be achieved before, while DRL cannot compete with human experts in achieving high structured pruning rates.
4.1 Results and Discussions on CIFAR-10 Dataset
Table 1 shows the comparison results on weight reduction and FLOPs reduction on VGG-16 for the CIFAR-10 dataset, while Table 2 shows the results on ResNet-18. The proposed AutoSlim framework (Comb-ADMM-SA) uses two objective functions: reducing the number of weight parameters or FLOPs. For VGG-16, compared to the prior work 2PFPCE (5) (Fil-Fix-Man) with 4× weight reduction, AutoSlim achieves 61.1× weight reduction under the same accuracy, a 15.3× improvement. For ResNet-18, compared to the prior work AMC (9) (Fil-Fix-DRL) with 1.7× weight reduction, AutoSlim achieves 61.2× reduction under the same accuracy, an improvement as high as 33×. When accounting for the different number of parameters in ResNet-18 and ResNet-50 (AMC), the improvement can even be perceived as 120×. This implies that the high redundancy of DNNs on the CIFAR-10 dataset has not been exploited in prior work.
AutoSlim outperforms the prior work due to all three sources of improvement. We perform further comparisons to show the improvement due to enhanced SA compared with DRL and manual hyperparameter optimization. More specifically, we compare AutoSlim (Comb-ADMM-SA) with the configurations Comb-ADMM-Man and Comb-ADMM-DRL at (about) the same accuracy. As can be observed in the two tables, AutoSlim achieves a moderate improvement in pruning rate compared with manual hyperparameter optimization, but significantly outperforms the DRL-based framework (all other sources of improvement being the same). This supports the statement that DRL is not compatible with ultra-high pruning rates. For relatively small pruning rates, it appears that DRL can hardly outperform the manual process either, as the improvement over 2PFPCE (Fil-Fix-Man) is smaller than the improvement over AMC (Fil-Fix-DRL).
Next, we compare the two objectives: weight and FLOPs reduction. Figure 4 shows the portion of pruned weights per layer on VGG-16 for CIFAR-10 under the Params# and FLOPs# search objectives. Only a slight difference in the portion of pruned weights per layer can be observed. This is because further weight reduction in the first several layers results in significant accuracy degradation. This partial convergence of the results under the two objectives seems to be a characteristic of ultra-high pruning rates.
4.2 Results and Discussions on ImageNet Dataset
In this subsection, we show the application of AutoSlim to the ImageNet dataset, and more comparison results with Fil-ADMM-SA (showing the first source of improvement). Table 3 and Table 4 show the comparison results for structured pruning of VGG-16 and ResNet-18 (ResNet-50) on the ImageNet dataset, respectively. We can clearly see the advantage of AutoSlim over prior work, such as (8) (Fil-Fix-Man), AMC (9) (Fil-Fix-DRL), and ThiNet (43) (Fil-Fix-Man). We can also see the advantage of AutoSlim over manual hyperparameter determination (Comb-ADMM-Man), with the structured pruning rate improving from 2.7× to 3.3× on ResNet-18 (ResNet-50) under the same (Top-5) accuracy. Finally, AutoSlim also outperforms filter pruning only (Fil-ADMM-SA), with the structured pruning rate improving from 3.8× to 6.4× on VGG-16 under the same (Top-5) accuracy. This demonstrates the advantage of combined filter and column pruning over filter pruning only, when the other sources of improvement are the same. In addition, our Fil-ADMM-SA also outperforms the prior work (Fil-Fix-Man) and (Fil-Fix-DRL), demonstrating the advantage of the proposed AutoSlim framework.
Last but not least, the proposed AutoSlim framework can also be applied to non-structured pruning. For non-structured pruning of the ResNet-50 model on the ImageNet dataset, AutoSlim achieves a 9.2× non-structured pruning rate on the CONV layers without accuracy loss (92.7% Top-5 accuracy), which outperforms manual hyperparameter optimization with ADMM-based pruning (8× pruning rate) and the prior work AMC (4.8× pruning rate).
5 Conclusion
This work proposes AutoSlim, an automatic structured pruning framework with the following key performance improvements: (i) it effectively incorporates the combination of structured pruning schemes in the automatic process; (ii) it adopts the state-of-the-art ADMM-based structured weight pruning as the core algorithm, and adds a purification step for further weight reduction without accuracy loss; and (iii) it develops an effective heuristic search method enhanced by experience-based guided search, replacing the prior deep reinforcement learning technique, which has an underlying incompatibility with the target pruning problem. Extensive experiments on the CIFAR-10 and ImageNet datasets demonstrate that AutoSlim is the key to achieving ultra-high pruning rates on the number of weights and FLOPs that could not be achieved before.
References
- (1) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2015.
- (2) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- (3) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pages 2074–2082, 2016.
- (4) Jian-Hao Luo and Jianxin Wu. An entropy-based pruning method for cnn compression. arXiv preprint arXiv:1706.05791, 2017.
- (5) Chuhan Min, Aosen Wang, Yiran Chen, Wenyao Xu, and Xin Chen. 2pfpce: Two-phase filter pruning based on conditional entropy. arXiv preprint arXiv:1809.02220, 2018.
- (6) Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
- (7) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
- (8) Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
- (9) Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In European Conference on Computer Vision, pages 815–832. Springer, 2018.
- (10) Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 184–199, 2018.
- (11) Tianyun Zhang, Kaiqi Zhang, Shaokai Ye, Jiayu Li, Jian Tang, Wujie Wen, Xue Lin, Makan Fardad, and Yanzhi Wang. Adam-admm: A unified, systematic framework of structured weight pruning for dnns. arXiv preprint arXiv:1807.11091, 2018.
- (12) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
- (13) Cong Leng, Zesheng Dou, Hao Li, Shenghuo Zhu, and Rong Jin. Extremely low bit neural network: Squeeze the last bit out with admm. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- (14) Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 4857–4867, 2017.
- (15) Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5687–5695, 2017.
- (16) Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
- (17) Hua Ouyang, Niao He, Long Tran, and Alexander Gray. Stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pages 80–88, 2013.
- (18) Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In International Conference on Machine Learning, pages 392–400, 2013.
- (19) Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
- (20) Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
- (21) Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.
- (22) Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
- (23) Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2902–2911. JMLR. org, 2017.
- (24) Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017.
- (25) Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
- (26) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
- (27) Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.
- (28) Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred O Hero. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. arXiv preprint arXiv:1710.07804, 2017.
- (29) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- (30) https://www.tensorflow.org/mobile/tflite/.
- (31) Tom Mitchell, Bruce Buchanan, Gerald DeJong, Thomas Dietterich, Paul Rosenbloom, and Alex Waibel. Machine learning. Annual review of computer science, 4(1):417–433, 1990.
- (32) Gilad Katz, Eui Chul Richard Shin, and Dawn Song. ExploreKit: Automatic feature generation and selection. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 979–984. IEEE, 2016.
- (33) Shimon Whiteson, Brian Tanner, Matthew E Taylor, and Peter Stone. Protecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 120–127. IEEE, 2011.
- (34) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
- (35) Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- (36) Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774, 2018.
- (37) Arthur Guez, David Silver, and Peter Dayan. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems, pages 1025–1033, 2012.
- (38) Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.
- (39) Andrew W Moore and Christopher G Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine learning, 13(1):103–130, 1993.
- (40) David Silver, Richard S Sutton, and Martin Müller. Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th international conference on Machine learning, pages 968–975. ACM, 2008.
- (41) Shaokai Ye, Tianyun Zhang, Kaiqi Zhang, Jiayu Li, Kaidi Xu, Yunfei Yang, Fuxun Yu, Jian Tang, Makan Fardad, Sijia Liu, et al. Progressive weight pruning of deep neural networks using ADMM. arXiv preprint arXiv:1810.07378, 2018.
- (42) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
- (43) Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 5058–5066, 2017.