1 Introduction
Deep learning has prevailed in many real-world applications such as autonomous driving, robotics, and mobile VR/AR, and efficiency is the key to bridging research and deployment. Given a constrained resource budget on the target hardware (e.g., latency, model size, and energy consumption), the network architecture must be designed carefully to achieve the best performance within the constraint. Traditionally, the deployment of efficient deep learning is split into model architecture design and model compression (pruning and quantization). Some existing works [10, 9] have shown that such a sequential pipeline can significantly reduce the cost of existing models. Nevertheless, careful hyperparameter tuning is required to obtain optimal performance [12]. The number of hyperparameters grows exponentially when we consider the three stages of the pipeline together, which soon exceeds what human labor can handle.
To tackle the problem, recent works have applied AutoML techniques to automate the process. Researchers proposed Neural Architecture Search (NAS) [44, 45, 18, 19, 2, 4, 3, 9] to automate the model design, outperforming human-designed models by a large margin. Using a similar technique, researchers adopted reinforcement learning to compress the model through automated pruning [12] and automated quantization [36]. However, optimizing these three factors in separate stages leads to suboptimal results: e.g., the best network architecture for the full-precision model is not necessarily the optimal one after pruning and quantization. Besides, this three-step strategy also requires considerable search time and energy consumption [32]. Therefore, we need a solution that jointly optimizes the deep learning model for a given hardware platform.
ProxylessNAS  ChamNet  SPOS  AMC  HAQ  APQ  

Hardware-aware  ✓  ✓  ✓  ✓  ✓  ✓ 
No training during search  ✓  ✓  ✓  
No evaluation during search  ✓  ✓  
Channel pruning  ✓  ✓  
Mixed-precision quantization  ✓  ✓  ✓ 
Directly extending existing AutoML techniques to the joint model optimization setting can be problematic. First, the joint search space is much larger (multiplicative) than that of the stage-wise search, making the search difficult. Second, pruning and quantization usually require a time-consuming fine-tuning process to restore accuracy [36, 39], which dramatically increases the search cost. As shown in Fig. 2, searching for each deployment (ProxylessNAS+AMC+HAQ) leads to considerable CO2 emission, which can exacerbate the greenhouse effect and seriously harm the environment. Moreover, each step has its own optimization objective (e.g., accuracy, latency, energy), so the final policy of the pipeline usually turns out to be suboptimal.
To this end, we propose APQ, a joint design method that enables end-to-end search of model Architecture, Pruning, and Quantization policy at low cost. The core idea of APQ is to use a quantization-aware accuracy predictor to accelerate the search process. The predictor takes the model architecture and the quantization scheme as input and quickly predicts the resulting accuracy. Instead of fine-tuning the pruned and quantized network to measure its accuracy, we use the accuracy estimated by the predictor, which can be obtained at negligible cost (since the predictor consists of only a few FC layers).
However, training an accurate predictor is challenging: it requires many (quantized model, quantized accuracy) data points. Collecting each data point can be quite expensive: (1) we need to train the network to get the initial fp32 weights, and (2) we need to fine-tune it further to get the quantized int8 weights and evaluate the accuracy. Both stages are expensive, requiring hundreds of GPU hours.
Fortunately, inspired by the weight-sharing mechanism in recent one-shot neural architecture search methods [8, 3], we reduce the cost of stage 1 by training a super network that contains all the sub-networks in the search space through weight sharing, and we directly evaluate sub-network accuracy without further fine-tuning. As shown in [3], it is possible to train a “once-for-all” super network that supports all the sub-networks while achieving on-par or even higher accuracy compared to training from scratch. In this way, we only need to evaluate a sub-network, rather than train it, to get a (fp32 model, fp32 accuracy) data point, which requires orders of magnitude less computation.
Reducing the cost of stage 2 is more challenging. Direct low-bit quantization without fine-tuning usually leads to near-zero accuracy, so fine-tuning is still needed to collect (quantized model, quantized accuracy) data points. To reduce the cost of stage 2, we propose a predictor-transfer technique. Instead of collecting many expensive (quantized model, quantized accuracy) data points to train the quantization-aware predictor directly, we first train an fp32 model accuracy predictor using the cheap (fp32 model, fp32 accuracy) data points collected with the weight-sharing once-for-all network (evaluation only, no training required), and then transfer the predictor to the quantized model domain by fine-tuning it on a small number of expensive (quantized model, quantized accuracy) data points. The transfer technique dramatically improves sample efficiency in the quantized network domain and reduces the overall cost of training the predictor.
Once this quantization-aware predictor is trained, the architecture search becomes ultra-fast because candidates are scored by the predictor. With the above design, we can efficiently perform a joint search over model architecture, channel number, and mixed-precision quantization. The predictor can also be reused for new hardware and deployment scenarios.
Extensive experiments show the superiority of APQ. APQ achieves an 8× BitOps reduction compared with an 8-bit ResNet while having higher accuracy. APQ can optimize not only latency and accuracy but also energy: we obtain the same accuracy as MobileNetV2+HAQ with 2×/1.3× latency/energy savings. APQ outperforms the separate sequential optimization using ProxylessNAS+AMC+HAQ by 2.3% accuracy under the same latency constraint, while reducing GPU hours and CO2 emission by 600×. APQ thus searches for an efficient model efficiently, pushing the frontier of environmentally friendly green AI.
The contributions of this paper are:

We propose a methodology to jointly perform NAS-pruning-quantization, unifying the conventionally separate stages into an integrated solution.

We propose a predictor-transfer method to tackle the high cost of collecting the (NN architecture, quantization policy, quantized accuracy) dataset for the quantization-aware accuracy predictor.

We achieve a significant speedup in searching for the optimal network architecture and quantization policy via this joint optimization, and enable automatic model adjustment in diverse deployment scenarios.
2 Background and Outline
Researchers have proposed various methods to accelerate the model inference, including architecture design [14, 30], network pruning [11, 22] and network quantization [10].
Neural Architecture Search.
Tracing the development of NAS, one can see a steady reduction in search time. Early NAS methods [45, 29] use an RL agent to determine the cell-wise architecture. To search for the architecture efficiently, many later works view architecture search as a path-finding problem [20, 5], which cuts down the search time by jointly training the candidates rather than iteratively training each from scratch. Inspired by the path structure, some one-shot methods [8] have been proposed to further reuse the network’s weights at training time and have begun to handle the mixed-precision case for efficient deployment. Another line of work captures the information with a performance predictor [23, 7], which reduces the need for frequent evaluation on the target dataset when searching for the optimal architecture.
Pruning.
Extensive works show the progress achieved in pruning. Early on, researchers proposed fine-grained pruning [11, 10], which cuts off connections (i.e., elements) within the weight matrix. However, this kind of method is not friendly to CPUs and GPUs and requires dedicated hardware [26, 40] to support sparse matrix multiplication, which is highly demanding to design [35, 34, 24]. Later, some researchers proposed channel-level pruning [13, 22, 17, 25, 1, 15, 27], which prunes entire convolution channels based on some importance score (e.g., L1-norm) to enable acceleration on general-purpose hardware. However, both fine-grained pruning and channel-level pruning introduce an enormous search space, as different layers have different sensitivities (e.g., the first convolution layer is very sensitive to pruning as it extracts important low-level features, while the last layer can be easily pruned as it is very redundant). To this end, recent research leverages AutoML techniques [12, 39] to automate this exploration process and surpass human design.
Quantization.
Quantization is a necessary technique for deploying models on hardware platforms like FPGAs and mobile phones. [10] quantized the network weights to reduce the model size by grouping the weights using k-means. [6] binarized the network weights into $\{-1, +1\}$; [42] quantized the network using one bit for weights and two bits for activations; [28] binarized each convolution filter into $\{-w, +w\}$; [43] mapped the network weights into $\{-w_N, 0, +w_P\}$ using two bits with a trainable range; [41] explicitly regularized the loss perturbation and weight approximation error in an incremental way to quantize the network using binary or ternary weights. [16] used 8-bit integers for both weights and activations for deployment on mobile devices. Some existing works explored the relationship between quantization and network architecture. HAQ [36] proposed to leverage AutoML to determine the bitwidth for a mixed-precision quantized model. A better trade-off can be achieved when different layers are quantized with different bits, showing the strong correlation between network architecture and quantization.
Multi-Stage Optimization.
The above methods are orthogonal to each other, and a straightforward way to combine them is to apply them sequentially in multiple stages, i.e., NAS+Pruning+Quantization:

In the second stage, we can prune the channels in the model automatically [12]:
(2) where the pruning function outputs a pair denoting the model architecture and fine-tuned weights after applying a certain pruning policy.

In the third stage, we can quantize the model to mixed precision [36]:
(3) where the quantization function outputs a pair denoting the model architecture and fine-tuned weights after applying a certain quantization policy.
However, this separation usually leads to a suboptimal solution: e.g., the best neural architecture for the floating-point model may not be optimal for the quantized model. Moreover, frequent evaluations on the target dataset make this kind of method time-consuming: e.g., a typical pipeline as above can take about 300 GPU hours, making it hard for researchers with limited computation resources to do automatic design.
Joint Optimization.
Instead of optimizing NAS, pruning and quantization independently, joint optimization aims to find a balance among these configurations and search for the optimal strategy. To this end, the joint optimization objective can be formalized into:
(4) 
However, the search space of this new objective is tripled compared with the original one, which makes joint optimization challenging. We endeavor to unify NAS, pruning, and quantization as a joint optimization. The outline is: 1. Train a once-for-all network that covers a large search space and from which every sub-network can be directly extracted without re-training. 2. Build a quantization-aware accuracy predictor to predict the quantized accuracy given a sub-network and a quantization policy. 3. Construct a latency/energy lookup table and perform a resource-constrained evolution search. In this way, the optimization problem can be tackled jointly.
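One plausible way to contrast the staged pipeline with the joint objective is written out below. The notation is an assumption chosen for illustration and is not taken from the numbered equations: $\mathcal{L}$ denotes the validation loss, $\mathcal{A}$ the architecture, $w$ its weights, and $P$ and $Q$ the pruning and quantization policies, each mapping an (architecture, weights) pair to a new pair. Resource constraints are omitted for brevity.

```latex
\begin{align*}
\text{Stage 1 (NAS):} \quad
  & \mathcal{A}^{*}, w^{*} = \arg\min_{\mathcal{A},\, w} \mathcal{L}(\mathcal{A}, w) \\
\text{Stage 2 (pruning):} \quad
  & P^{*} = \arg\min_{P} \mathcal{L}\big(P(\mathcal{A}^{*}, w^{*})\big) \\
\text{Stage 3 (quantization):} \quad
  & Q^{*} = \arg\min_{Q} \mathcal{L}\big(Q(P^{*}(\mathcal{A}^{*}, w^{*}))\big) \\
\text{Joint:} \quad
  & \mathcal{A}^{*}, w^{*}, P^{*}, Q^{*}
    = \arg\min_{\mathcal{A},\, w,\, P,\, Q} \mathcal{L}\big(Q(P(\mathcal{A}, w))\big)
\end{align*}
```

Written this way, each stage of the pipeline freezes the result of the previous one, whereas the joint form lets all four quantities vary together.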
3 Joint Design Methodology
The overall framework of our joint design is shown in Figure 3. It consists of a highly flexible once-for-all network with fine-grained channels, an accuracy predictor, and an evolution search that jointly optimizes the architecture, pruning, and quantization.
3.1 Once-For-All Network with Fine-grained Channel Pruning
Neural architecture search aims to find a good sub-network from a large search space. Traditionally, each sampled network is trained to obtain its actual accuracy [44], which is time-consuming. Recent one-shot-based NAS [8] first trains a large, multi-branch network. Each time, a sub-network is extracted from the large network and evaluated directly to approximate its accuracy. Such a large network is called a once-for-all network. Since the choices for different layers in a deep neural network are largely independent, a popular approach is to design multiple choices (e.g., kernel sizes, expansion ratios) for each layer.
In this paper, we use MobileNetV2 as the backbone to build a once-for-all network that supports different kernel sizes (i.e., 3, 5, 7) and channel numbers (i.e., 4× to 6× the base channel number of that block, with an interval of 8) at the block level, and different depths (i.e., 2, 3, 4) at the stage level. The combined search space contains an enormous number of sub-networks, which is large enough to perform the search on top of it.
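A minimal sketch of how one sub-network configuration could be drawn from such a search space is shown below. The per-stage base channel numbers and the dictionary layout are hypothetical and chosen only for illustration; the kernel sizes, depths, and the 8-channel interval come from the description above.

```python
import random

KERNEL_SIZES = (3, 5, 7)   # per-block kernel size choices
DEPTHS = (2, 3, 4)         # per-stage depth choices
CHANNEL_STEP = 8           # fine-grained channel interval


def channel_choices(base_channels):
    """Channel numbers from 4x to 6x the base channel number, in steps of 8."""
    return list(range(4 * base_channels, 6 * base_channels + 1, CHANNEL_STEP))


def sample_subnetwork(stage_base_channels, rng):
    """Draw one configuration: a depth per stage, then a kernel size and a
    channel number for every active block in that stage."""
    config = []
    for base in stage_base_channels:
        depth = rng.choice(DEPTHS)
        blocks = [
            {
                "kernel_size": rng.choice(KERNEL_SIZES),
                "channels": rng.choice(channel_choices(base)),
            }
            for _ in range(depth)
        ]
        config.append({"depth": depth, "blocks": blocks})
    return config


# Hypothetical base channel numbers for five stages (not taken from the paper).
STAGE_BASE_CHANNELS = (24, 40, 80, 96, 192)
subnet = sample_subnetwork(STAGE_BASE_CHANNELS, random.Random(0))
```

Because every choice is drawn independently per block, the number of distinct configurations grows multiplicatively with the number of blocks, which is why the fine-grained channel interval enlarges the space so sharply.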
Properties of the Once-For-All Network.
To ensure efficient architecture search, we find that the once-for-all network needs to satisfy the following properties: (1) For every extracted sub-network, the performance can be directly evaluated without re-training, so that the cost of training needs to be paid only once. (2) It supports an extremely large and fine-grained search space that allows channel number search. As we hope to incorporate the pruning policy into the architecture space, the once-for-all network needs to support not only different operators but also fine-grained channel numbers (with an interval of 8). As a result, the new space is significantly enlarged (nearly quadratically).
However, it is hard to achieve the two goals at the same time due to the nature of once-for-all network training: it is generally believed that if the search space gets too large (e.g., supporting fine-grained channel numbers), the accuracy approximation becomes inaccurate [21]. A large search space results in high variance when training the once-for-all network. To address the issue, we adopt the progressive shrinking (PS) algorithm [3] to train the once-for-all network. Specifically, we first train a full sub-network with the largest kernel sizes, channel numbers, and depths in the once-for-all network, and use it as a teacher to progressively distill the smaller sub-networks sampled from the once-for-all network. During distillation, the already-trained sub-networks still update their weights to prevent accuracy loss. The PS algorithm effectively reduces the variance during once-for-all network training. By doing so, we can ensure that a sub-network extracted from the once-for-all network preserves competitive accuracy without re-training.
3.2 Quantization-Aware Accuracy Predictor
To reduce the design cost across various deployment scenarios, we propose to build a quantization-aware accuracy predictor, which predicts the accuracy of a mixed-precision (MP) model based on its architecture configuration and quantization policy. During search, we use the predicted accuracy instead of the measured accuracy. The input to the predictor is the encoding of the network architecture, the pruning strategy, and the quantization policy.
Architecture and Quantization Policy Encoding.
We encode the network architecture block by block: for each building block (i.e., a bottleneck residual block as in MobileNetV2 [30]), we encode the kernel size, channel number, and weight/activation bits for the pointwise and depthwise convolutions into one-hot vectors, and concatenate these vectors as the encoding of the block. For example, suppose a block has 3 choices of kernel size (e.g., 3, 5, 7) and 4 choices of channel number (e.g., 16, 24, 32, 40). If we choose kernel size = 3 and channel number = 32, we get two vectors [1,0,0] and [0,0,1,0], which we concatenate into [1,0,0,0,0,1,0] to represent this block’s architecture. Likewise, we use one-hot vectors to denote the bitwidth choice for the weights/activations of the pointwise and depthwise layers: e.g., if the weight/activation bitwidth choices for the pointwise/depthwise layers are 4 or 8, we use [1,0,0,1,0,1,1,0] to denote the choice (4,8,8,4) for the quantization policy. If a block is skipped, we set all values of its vector to 0. We further concatenate the features of all blocks as the encoding of the whole network. For a 5-layer network, we can thus use a 75-dim (5×(3+4+2×4)=75) vector to represent such an encoding. In our setting, the choices of kernel size are [3,5,7], the choices of channel number depend on the base channel number of each block, the bitwidth choices are [4,6,8], and there are 21 blocks in total to design.
Accuracy Predictor.
The predictor we use is a 3-layer feed-forward neural network with each embedding dimension equal to 400. As shown on the left of Figure 4, the input of the predictor is the one-hot encoding described above and the output is the predicted accuracy. Unlike existing methods [20, 5, 37], our predictor-based method does not require frequent evaluation of architectures on the target dataset in the search phase. Once we have the predictor, we can integrate it with any search method (e.g., reinforcement learning, evolution, Bayesian optimization) to perform joint design over architecture-pruning-quantization at negligible cost. However, the biggest challenge is how to collect the (architecture, quantization policy, accuracy) dataset to train the predictor for quantized models, for two reasons: 1) Collecting a quantized model’s accuracy is time-consuming: fine-tuning is required to recover the accuracy after quantization, which takes about 0.2 GPU hours per data point. In fact, we find that 80k (NN architecture, ImageNet accuracy) data pairs are enough to train a good full-precision accuracy predictor. However, collecting a quantized dataset of the same size as the full-precision one would cost 16,000 GPU hours, which is far beyond affordable. 2) The quantization-aware accuracy predictor is harder to train than the traditional accuracy predictor for full-precision models: the architecture design and the quantization policy affect network performance from two separate aspects, making it hard to model their mutual influence. Thus, training the quantization-aware accuracy predictor in the traditional way can result in a significant performance drop (Table 2).
Transfer Predictor to Quantized Models.
Collecting a quantized NN dataset for training the predictor is difficult (it needs fine-tuning), but collecting a full-precision NN dataset is easy: we can directly pick sub-networks from the once-for-all network and measure their accuracy. We propose the predictor-transfer technique to increase sample efficiency and make up for the lack of data. As the order of accuracy before and after quantization is usually preserved, we first pre-train the predictor on a large-scale dataset to predict the accuracy of full-precision models, then transfer it to quantized models. The quantized accuracy dataset is much smaller, and we only perform short-term fine-tuning. As shown in Figure 4, we add the quantization bits (weights & activations) of the current block to the input embedding to build the quantization-aware accuracy predictor. We then fine-tune the quantization-aware accuracy predictor using the pre-trained FP predictor’s weights as initialization. Since most of the weights are inherited from the full-precision predictor, the training requires much less data than training from scratch.
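The block-by-block one-hot encoding that feeds the predictor in this subsection can be sketched as follows, using the toy choice sets from the worked example (3 kernel sizes, 4 channel numbers, bitwidths 4 or 8). The function names and the ordering of the four bitwidth fields are assumptions made for illustration.

```python
KERNEL_CHOICES = (3, 5, 7)
CHANNEL_CHOICES = (16, 24, 32, 40)
BIT_CHOICES = (4, 8)
# One block: kernel one-hot + channel one-hot + four bitwidth one-hots
# (assumed order: weight/activation bits for the pointwise and depthwise layers).
BLOCK_DIM = len(KERNEL_CHOICES) + len(CHANNEL_CHOICES) + 4 * len(BIT_CHOICES)


def one_hot(value, choices):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in choices:
        raise ValueError(f"{value} is not one of {choices}")
    return [1 if value == c else 0 for c in choices]


def encode_block(block):
    """Encode (kernel_size, channels, bits) for one block; None means skipped."""
    if block is None:
        return [0] * BLOCK_DIM
    kernel_size, channels, bits = block
    vec = one_hot(kernel_size, KERNEL_CHOICES) + one_hot(channels, CHANNEL_CHOICES)
    for b in bits:
        vec += one_hot(b, BIT_CHOICES)
    return vec


def encode_network(blocks):
    """Concatenate the per-block encodings into one flat feature vector."""
    return [x for block in blocks for x in encode_block(block)]


example = (3, 32, (4, 8, 8, 4))
print(encode_block(example))
# [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
print(len(encode_network([example, None, example, example, example])))  # 75
```

The first seven entries reproduce the architecture vector [1,0,0,0,0,1,0] and the last eight reproduce the quantization vector [1,0,0,1,0,1,1,0] from the example, and five blocks of 15 entries give the 75-dim network encoding.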
3.3 Hardware-Aware Evolutionary Search
As different hardware might have drastically different properties (e.g., cache size, level of parallelism), the optimal network architecture and quantization policy for one hardware are not necessarily the best for the other. Therefore, instead of relying on some indirect signals (e.g., BitOps), our optimization is directly based on the measured latency and energy on the target hardware.
Measuring Latency and Energy.
Evaluating each candidate policy on actual hardware can be very costly. Thanks to the sequential structure of the neural network, we can approximate the latency (or energy) of the model by summing up the latency (or energy) of each layer. We first build a lookup table containing the latency and energy of each layer under different architecture configurations and bitwidths. Afterward, for any candidate policy, we break it down into layers and query the lookup table to calculate the latency (or energy) at negligible cost. In practice, we find that this approach approximates the actual inference cost precisely.
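The lookup-table estimate described above amounts to a dictionary query per layer followed by a sum. The sketch below assumes a hypothetical layer key of (layer type, kernel size, channels, weight bits, activation bits); the latency and energy values are placeholders, not measurements.

```python
def build_lookup_table(measurements):
    """Map each layer configuration to its measured (latency_ms, energy_mj)."""
    return {key: (latency, energy) for key, latency, energy in measurements}


def estimate_cost(layers, table):
    """Approximate whole-model latency and energy by summing per-layer entries."""
    latency = sum(table[layer][0] for layer in layers)
    energy = sum(table[layer][1] for layer in layers)
    return latency, energy


# Placeholder entries: (layer key, latency in ms, energy in mJ).
table = build_lookup_table([
    (("pointwise", 1, 96, 4, 8), 0.5, 1.0),
    (("depthwise", 3, 96, 8, 8), 0.25, 0.5),
])
candidate = [("pointwise", 1, 96, 4, 8), ("depthwise", 3, 96, 8, 8)]
latency_ms, energy_mj = estimate_cost(candidate, table)
```

Once the table is filled by profiling each layer configuration once, scoring a new candidate needs no access to the hardware.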
Resource-Constrained Evolution Search.
We adopt the evolution-based architecture search [8] to explore the best resource-constrained model. On top of this, we replace the evaluation process with our quantization-aware accuracy predictor to estimate the performance of each candidate directly. The cost for each candidate is thereby reduced from $N$ model inferences to a single predictor inference (where $N$ is the size of the validation set). Furthermore, we verify the resource constraints with our latency/energy lookup table to avoid direct interaction with the target hardware. Given a resource budget, we directly eliminate the candidates that exceed the constraints.
4 Implementation Details
Data Preparation for Quantization-aware Accuracy Predictor.
We generate two kinds of data (2,500 for each): 1. randomly sample both the architecture and the quantization policy; 2. randomly sample the architecture, and sample 10 quantization policies for each architecture configuration. We mix the data to train the quantization-aware accuracy predictor, and use the full-precision pre-trained predictor’s weights for transfer. The number of data points used to train the full-precision predictor is 80,000. As such, our quantization-aware accuracy predictor can generalize across different architecture/quantization policy pairs and learn the mutual relation between architecture and quantization policy.
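One way the weight transfer could be initialized is sketched below in plain Python. The paper states only that the pre-trained full-precision predictor's weights are used as initialization; zero-initializing the new bitwidth input columns, appending the bit features at the end of the input, and the small hidden size are all assumptions of this sketch. Fine-tuning on the quantized data would follow and is not shown.

```python
import random


def init_mlp(in_dim, hidden_dim, rng):
    """Three-layer feed-forward predictor: in -> hidden -> hidden -> 1."""
    def layer(n_out, n_in):
        return {
            "W": [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            "b": [0.0] * n_out,
        }
    return [layer(hidden_dim, in_dim), layer(hidden_dim, hidden_dim), layer(1, hidden_dim)]


def forward(mlp, x):
    """Run the predictor; ReLU on hidden layers, linear output."""
    for i, layer in enumerate(mlp):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(layer["W"], layer["b"])]
        if i < len(mlp) - 1:
            x = [max(0.0, v) for v in x]
    return x[0]


def transfer_to_quantized(fp_mlp, n_bit_features):
    """Copy every full-precision weight, then widen the first layer with
    zero columns for the new bitwidth inputs (assumed initialization)."""
    q_mlp = [{"W": [row[:] for row in l["W"]], "b": l["b"][:]} for l in fp_mlp]
    for row in q_mlp[0]["W"]:
        row.extend([0.0] * n_bit_features)
    return q_mlp


rng = random.Random(0)
fp_predictor = init_mlp(in_dim=7, hidden_dim=16, rng=rng)  # hidden size 400 in the paper
q_predictor = transfer_to_quantized(fp_predictor, n_bit_features=8)

arch = [1, 0, 0, 0, 0, 1, 0]      # architecture one-hots
bits = [1, 0, 0, 1, 0, 1, 1, 0]   # bitwidth one-hots
before = forward(fp_predictor, arch)
after = forward(q_predictor, arch + bits)
```

With this initialization the transferred predictor starts out reproducing the full-precision prediction, so fine-tuning only has to learn the correction due to quantization.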
Evolutionary Architecture Search.
For evolutionary architecture search, we set the population size to 100 and choose the top-25 candidates to produce the next generation (50 by mutation, 50 by crossover). Each individual in the population is a network architecture with a quantization policy, using the same encoding as the quantization-aware accuracy predictor. The mutation rate is 0.1 for each layer, which is the same as in [8], and we randomly choose the new kernel size and channel number for mutation. For crossover, each layer is randomly chosen from the layer configurations of its parents. We set the maximum number of iterations to 500 and choose the best candidate among the final population.
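The search loop with these settings can be sketched as below. The candidate representation, the toy accuracy and cost functions, and the choice to resample the bitwidths together with kernel size and channel number during mutation are assumptions of this sketch; a real run would plug in the trained predictor and the latency/energy lookup table.

```python
import random

KERNELS, CHANNELS, BITS = (3, 5, 7), (16, 24, 32, 40), (4, 6, 8)


def random_layer(rng):
    """One layer gene: (kernel size, channels, weight bits, activation bits)."""
    return (rng.choice(KERNELS), rng.choice(CHANNELS), rng.choice(BITS), rng.choice(BITS))


def random_candidate(n_layers, rng):
    return tuple(random_layer(rng) for _ in range(n_layers))


def mutate(parent, rng, rate=0.1):
    """Resample each layer independently with probability `rate`."""
    return tuple(random_layer(rng) if rng.random() < rate else layer for layer in parent)


def crossover(a, b, rng):
    """Pick each layer from one of the two parents at random."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def sample_feasible(make, is_feasible, max_tries=1000):
    """Keep drawing candidates until one satisfies the resource budget."""
    for _ in range(max_tries):
        candidate = make()
        if is_feasible(candidate):
            return candidate
    raise RuntimeError("no feasible candidate found within max_tries")


def evolution_search(predict_acc, cost, budget, n_layers, rng,
                     pop_size=100, n_parents=25, n_mutate=50, n_cross=50, iters=500):
    feasible = lambda c: cost(c) <= budget
    population = [sample_feasible(lambda: random_candidate(n_layers, rng), feasible)
                  for _ in range(pop_size)]
    for _ in range(iters):
        # Rank by predicted accuracy; no model inference is needed here.
        parents = sorted(population, key=predict_acc, reverse=True)[:n_parents]
        children = [sample_feasible(lambda: mutate(rng.choice(parents), rng), feasible)
                    for _ in range(n_mutate)]
        children += [sample_feasible(lambda: crossover(*rng.sample(parents, 2), rng), feasible)
                     for _ in range(n_cross)]
        population = children
    return max(population, key=predict_acc)


# Toy stand-ins (hypothetical) for the accuracy predictor and the lookup-table cost.
toy_acc = lambda c: sum(k + ch / 8 + wb + ab for k, ch, wb, ab in c)
toy_cost = lambda c: sum(ch * (wb + ab) for k, ch, wb, ab in c)
best = evolution_search(toy_acc, toy_cost, budget=1500, n_layers=5,
                        rng=random.Random(0), iters=20)
```

Candidates that exceed the budget are discarded at sampling time, so every individual that reaches the ranking step already satisfies the constraint.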
Model                     ImageNet   Latency  Energy  BitOps  Design cost  CO2e        Cloud compute cost
                          Top-1 (%)  (ms)     (mJ)    (G)     (GPU hours)  (marginal)  (marginal)
MobileNetV2 8-bit         71.8       9.10     12.46   19.2    -            -           -
ProxylessNAS 8-bit        74.2       13.14    14.12   19.5    200N         56.72       $148 – $496
ProxylessNAS + AMC 8-bit  73.3       9.77     10.53   15.0    204N         57.85       $151 – $506
MobileNetV2 + HAQ         71.9       8.93     11.82   -       96N          27.23       $71 – $238
ProxylessNAS + AMC + HAQ  71.8       8.45     8.84    -       300N         85.08       $222 – $744
DNAS [38]                 74.0       -        -       57.3    40N          11.34       $30 – $99
Single Path One-Shot [8]  74.6       -        -       51.9    288 + 24N    6.81        $18 – $60
Ours-A (w/o transfer)     72.1       8.85     11.79   13.2    2400 + 0.5N  0.14        $0.4 – $1.2
Ours-B (w/ transfer)      74.1       8.40     12.18   16.5    2400 + 0.5N  0.14        $0.4 – $1.2
Ours-C (w/ transfer)      75.1       12.17    14.14   23.6    2400 + 0.5N  0.14        $0.4 – $1.2
Quantization.
We follow the implementation in [36] to do quantization. Specifically, we quantize the weights and activations with the specified quantization policies. For each layer with weights $w$ and quantization bitwidth $b$, we linearly quantize the weights to $[-v, v]$; the quantized weight is:
(5) 
We choose a different $v$ for each layer that minimizes the KL-divergence between the original weights and the quantized weights. For activations, we quantize to $[0, v]$, since the values are non-negative after the ReLU6 layer.
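A minimal sketch of this kind of linear quantization with a KL-based choice of clipping range is given below. It follows a common clamp-and-round formulation and is not a transcription of Eq. (5); the scale definitions, the histogram-based KL estimate, and the candidate grid for the clipping value are all assumptions of the sketch.

```python
import math
import random


def quantize_weights(weights, bits, v):
    """Clamp to [-v, v] and round onto a uniform grid with
    2**(bits - 1) - 1 levels on each side of zero."""
    scale = v / (2 ** (bits - 1) - 1)
    return [round(max(-v, min(v, w)) / scale) * scale for w in weights]


def quantize_activations(acts, bits, v):
    """Clamp to [0, v] (activations are non-negative) and round onto
    a uniform grid with 2**bits - 1 steps."""
    scale = v / (2 ** bits - 1)
    return [round(max(0.0, min(v, a)) / scale) * scale for a in acts]


def histogram(values, lo, hi, n_bins):
    """Count values per bin over [lo, hi]; out-of-range values go to the edge bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins or 1.0
    for x in values:
        idx = min(n_bins - 1, max(0, int((x - lo) / width)))
        counts[idx] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) between two smoothed histograms."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for p, q in zip(p_counts, q_counts):
        p_prob, q_prob = (p + eps) / p_total, (q + eps) / q_total
        kl += p_prob * math.log(p_prob / q_prob)
    return kl


def choose_clip(weights, bits, candidates, n_bins=64):
    """Pick the clipping value whose quantized weights stay closest in KL."""
    lo, hi = min(weights), max(weights)
    reference = histogram(weights, lo, hi, n_bins)

    def score(v):
        quantized = quantize_weights(weights, bits, v)
        return kl_divergence(reference, histogram(quantized, lo, hi, n_bins))

    return min(candidates, key=score)


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.5) for _ in range(2000)]
max_abs = max(abs(w) for w in weights)
candidates = [max_abs * f for f in (0.25, 0.5, 0.75, 1.0)]
v = choose_clip(weights, bits=4, candidates=candidates)
quantized = quantize_weights(weights, 4, v)
```

A smaller clipping value gives finer resolution near zero at the cost of saturating the tails, which is the trade-off the per-layer search over the range resolves.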
5 Experiments
To verify the effectiveness of our method, we conduct experiments that cover two of the most important constraints for on-device deployment, latency and energy consumption, in comparison with some state-of-the-art models obtained by neural architecture search. In addition, we compare BitOps with some multi-stage optimized models.
Dataset, Models and Hardware Platform.
The experiments are conducted on the ImageNet dataset. We compare the performance of our jointly designed models with mixed-precision models searched by [36, 12, 5] and some SOTA fixed-precision 8-bit models. The platform we use to measure the resource consumption of the mixed-precision models is BitFusion [31], a state-of-the-art spatial ASIC design for neural network acceleration. It employs a 2D systolic array of Fusion Units, which spatially sum the shifted partial products of two-bit elements from weights and activations.
5.1 Comparison with SOTA Efficient Models
Table 2 presents the results for different efficiency constraints. As one can see, our models consistently outperform state-of-the-art models with either fixed or mixed precision. Specifically, our small model (Ours-B) achieves a 2.2% accuracy boost over the mixed-precision MobileNetV2 searched by HAQ (from 71.9% to 74.1%); our large model (Ours-C) attains better accuracy (from 74.6% to 75.1%) while requiring only about half of the BitOps. The transfer technique also helps the model achieve better performance (from 72.1% to 74.1%). It is also notable that the marginal cloud compute cost and CO2 emission are two orders of magnitude smaller than those of other works.
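The design-cost column of Table 2 can be read as a fixed cost plus a marginal cost per unit of N. The short calculation below assumes that N counts deployment scenarios, which is an interpretation rather than something stated in the table, and compares the ProxylessNAS + AMC + HAQ row (300N) with our rows (2400 + 0.5N).

```python
def pipeline_gpu_hours(n):
    """Design cost of the ProxylessNAS + AMC + HAQ row: 300N GPU hours."""
    return 300 * n


def apq_gpu_hours(n):
    """Design cost of the Ours rows: 2400 + 0.5N GPU hours."""
    return 2400 + 0.5 * n


# Ratio of marginal (per-scenario) costs.
marginal_ratio = 300 / 0.5

# Smallest N (assumed: number of deployment scenarios) at which the
# one-time 2400 GPU hours are amortized.
break_even = next(n for n in range(1, 1000) if apq_gpu_hours(n) < pipeline_gpu_hours(n))

print(marginal_ratio)  # 600.0
print(break_even)      # 9
```

Under this reading, the one-time training cost pays off from the ninth scenario onward, and each additional scenario is 600 times cheaper than rerunning the staged pipeline.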
5.2 Effectiveness of Joint Design
Comparison with MobileNetV2+HAQ.
Figure 5 shows the results on the BitFusion platform under different latency and energy constraints. Our jointly designed models consistently outperform both mixed-precision and fixed-precision SOTA models under the given constraints. Notably, when the constraint is tight, our models improve significantly over state-of-the-art mixed-precision models. Specifically, with similar efficiency constraints, we improve the ImageNet top-1 accuracy from the MobileNetV2 baseline of 61.4% to 71.9% (+10.5%) and 72.7% (+11.3%) for latency and energy constraints, respectively. Moreover, we show some models searched by our quantization-aware predictor without the predictor-transfer technique. With this technique applied, the accuracy improves consistently, since the non-transferred predictor might lose some mutual information between architecture and quantization policy.
Comparison with Multi-Stage Optimized Model.
Figure 6 compares multi-stage optimization with our joint optimization. As one can see, under the same latency/energy constraint, our model attains better accuracy than the multi-stage optimized model (74.1% vs 71.8%). This is reasonable, since per-stage optimization might not find the globally optimal model as the joint design does.
Comparison under Limited BitOps.
Figure 6 reports the results under a limited BitOps budget. As one can see, under a tight BitOps constraint, our model improves accuracy by over 2% (from 71.5% to 73.9%) compared with the model searched using [8]. Moreover, our models achieve even higher accuracy (75.1%) than the ResNet34 8-bit model (75.0%) while saving 8× BitOps.
5.3 Effectiveness of Predictor-Transfer
Figure 7 shows the performance of our predictor-transfer technique compared with training from scratch. For each setting, we train the predictor to convergence and evaluate the pairwise accuracy (i.e., the proportion of cases in which the predictor correctly identifies which is better between two randomly selected candidates from a held-out dataset), which measures the predictor’s performance. We use the same test set of 2000 (NN architecture, ImageNet accuracy) pairs, generated by randomly choosing the network architecture and quantization policy. For training with $N$ data points, the two kinds of data mentioned in Sec. 4 are equal in number, i.e., $N/2$ each. As shown, the transferred predictor converges faster and to a higher pairwise accuracy. Also, when the data is very limited, our method achieves more than 10% higher pairwise accuracy than training from scratch.
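The pairwise accuracy metric can be sketched as follows. This version enumerates all candidate pairs rather than sampling pairs at random, and it skips pairs whose measured accuracies tie; both choices are assumptions of the sketch.

```python
from itertools import combinations


def pairwise_accuracy(predicted, actual):
    """Fraction of candidate pairs whose predicted ordering matches
    the ordering of the measured accuracies."""
    correct = total = 0
    for i, j in combinations(range(len(actual)), 2):
        if actual[i] == actual[j]:
            continue  # a tie carries no ordering information
        total += 1
        if (predicted[i] - predicted[j]) * (actual[i] - actual[j]) > 0:
            correct += 1
    return correct / total if total else 0.0


# Three held-out candidates: the predictor orders two of the three pairs correctly.
score = pairwise_accuracy(predicted=[0.70, 0.72, 0.71], actual=[71.0, 73.0, 74.0])
```

Since the search only needs to rank candidates, a ranking metric of this kind is more informative for this purpose than the absolute prediction error.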
6 Conclusion
We propose APQ, a joint design method for architecting mixed-precision models. Unlike former works that decouple the problem into separate stages, we directly search for the optimal mixed-precision architecture without multi-stage optimization. We use a predictor-based method that needs no extra evaluation on the target dataset, which greatly saves GPU hours when searching for a new scenario and thus reduces the marginal CO2 emission and cloud compute cost. To tackle the high expense of data collection, we propose a predictor-transfer technique to make up for the limited data. Comparisons with state-of-the-art models show the necessity of joint optimization and the effectiveness of our joint design method.
Acknowledgments
We thank NSF Career Award #1943349, MIT-IBM Watson AI Lab, Samsung, SONY, SRC, and AWS Machine Learning Research Award for supporting this research. We thank Hanrui Wang and Yujun Lin for their kind help with this paper.
References

[1] (2016) Compact deep convolutional neural networks with coarse pruning. arXiv:1610.09639.
[2] (2018) Efficient architecture search by network transformation. In AAAI.
[3] (2020) Once for all: train one network and specialize it for efficient deployment. In ICLR.
[4] (2018) Path-level network transformation for efficient architecture search. In ICML.
[5] (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In ICLR.
[6] (2016) Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv.
[7] (2019) ChamNet: towards efficient network design through platform-aware model adaptation. In CVPR.
[8] (2019) Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420.
[9] (2019) Design automation for efficient deep learning computing. arXiv preprint arXiv:1904.10616.
[10] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR.
[11] (2015) Learning both weights and connections for efficient neural network. In NeurIPS.
[12] (2018) AMC: automl for model compression and acceleration on mobile devices. In ECCV.
[13] (2017) Channel pruning for accelerating very deep neural networks. In ICCV.
[14] (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[15] (2016) Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. arXiv:1607.03250.
[16] (2018) Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In CVPR.
[17] (2017) Runtime Neural Pruning. In NIPS.
[18] (2018) Progressive neural architecture search. In ECCV.
[19] (2018) Hierarchical representations for efficient architecture search. In ICLR.
[20] (2019) DARTS: differentiable architecture search. In ICLR.
[21] (2019) MetaPruning: meta learning for automatic neural network channel pruning. In ICCV.
[22] (2017) Learning efficient convolutional networks through network slimming. In ICCV.
[23] (2018) Neural architecture optimization. In NeurIPS.
[24] (2019) Park: an open platform for learning-augmented computer systems. In Advances in Neural Information Processing Systems, pp. 2490–2502.
[25] (2016) Pruning convolutional neural networks for resource efficient inference. arXiv:1611.06440.
[26] (2018) Outerspace: an outer product based sparse matrix multiplication accelerator. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 724–736.
[27] (2015) Channel-level acceleration of deep face representations. IEEE Access.
[28] (2016) XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In ECCV.
[29] (2019) Regularized evolution for image classifier architecture search. In AAAI.
[30] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR.
[31] (2018) Bit fusion: bit-level dynamically composable architecture for accelerating deep neural network. In ISCA.
[32] (2019) Energy and policy considerations for deep learning in nlp. In ACL.
[33] (2019) MnasNet: platform-aware neural architecture search for mobile. In CVPR.
[34] (2020) TTS: transferable transistor sizing with graph neural networks and reinforcement learning. In ACM/IEEE 57th Design Automation Conference (DAC).
[35] (2018) Learning to design circuits. In NeurIPS 2018 Machine Learning for Systems Workshop.
[36] (2019) HAQ: hardware-aware automated quantization. In CVPR.
[37] (2019) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR.
[38] (2018) Mixed precision quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv 1812.
[39] (2018) NetAdapt: platform-aware neural network adaptation for mobile applications. Lecture Notes in Computer Science.
[40] (2020) SpArch: efficient architecture for sparse matrix multiplication. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[41] (2018) Explicit loss-error-aware quantization for low-bit deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9426–9435.
[42] (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160.
[43] (2017) Trained ternary quantization. In ICLR.
[44] (2017) Neural architecture search with reinforcement learning. In ICLR.
[45] (2018) Learning transferable architectures for scalable image recognition. In CVPR.