1 Introduction
In many real-time machine learning applications (such as robotics, autonomous driving, and mobile VR/AR), deep neural networks are strictly constrained by latency, energy, and model size. To improve hardware efficiency, many researchers have proposed quantizing the weights and activations to low precision [8, 18, 33]. Conventional quantization methods use the same number of bits for all layers [2, 14], but since different layers have different redundancy and behave differently on hardware (computation bounded or memory bounded), it is necessary to use flexible bitwidths for different layers (as shown in Figure 1). This flexibility was originally not supported by chip vendors until recently, when hardware manufacturers started to implement this feature: Apple released the A12 Bionic chip that supports flexible bitwidths for neural network inference [6]; NVIDIA recently introduced the Turing GPU architecture that supports 1-bit, 4-bit, 8-bit, and 16-bit arithmetic operations [21]; Imagination launched a flexible neural network IP that supports per-layer bitwidth adjustment for both weights and activations [13]. Besides industry, academia has also recently worked on bit-level flexible hardware design: BISMO [26] proposed a bit-serial multiplier to support multiplications of 1 to 8 bits; BitFusion [25] supports multiplications of 2, 4, 8, and 16 bits in a spatial manner.
Table 1: Inference latency on three hardware architectures (HW1-HW3) for the quantization policy specialized for each hardware.

                          HW1        HW2        HW3
Best Q. policy for HW1    16.29 ms   85.24 ms   117.44 ms
Best Q. policy for HW2    19.95 ms   64.29 ms   108.64 ms
Best Q. policy for HW3    19.94 ms   66.15 ms   99.68 ms
However, a missing part is how to determine the bitwidth of both weights and activations for each layer on different hardware accelerators. This is a vast design space: with M different neural network models, each with N layers, on H different hardware platforms, there are in total O(H × M × 8^{2N}) possible solutions (here, we assume that the bitwidth is between 1 and 8 for both weights and activations). For the widely used ResNet-50 [9] model, the size of the search space is about 8^{100}, which is even larger than the number of particles in the universe. Conventional methods require domain experts (with knowledge of both machine learning and hardware architecture) to explore this huge design space smartly with rule-based heuristics. For instance, we should retain more bits in the first layer, which extracts low-level features, and in the last layer, which computes the final outputs; also, we should use more bits in the convolution layers than in the fully-connected layers because, empirically, the convolution layers are more sensitive. As the neural network becomes deeper, the search space increases exponentially, which makes it infeasible to rely on hand-crafted strategies. Therefore, these rule-based quantization policies are usually sub-optimal, and they cannot generalize from one model to another. In this paper, we would like to automate this exploration process with a learning-based framework.

Another great challenge is how to measure the latency and the energy consumption of a given model on the hardware. A widely adopted approach is to rely on proxy signals (e.g., FLOPs, number of memory references) [12, 24]. However, as different hardware behaves very differently, the performance of a model on hardware cannot always be accurately reflected by these proxy signals. Therefore, it is important to directly involve the hardware architecture in the loop. Also, as demonstrated in Table 1, the quantization solution optimized on one hardware might not be optimal on another, which raises the demand for specialized policies for different hardware architectures.
We propose the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically predict the quantization policy given the hardware's feedback. The RL agent decides the bitwidth of a given neural network in a layer-wise manner. For each layer, the agent receives the layer configuration and statistics as the observation, and it then outputs the action, which is the bitwidth of weights and activations. We then leverage the hardware accelerator as the environment to obtain direct feedback from hardware to guide the RL agent to satisfy the resource constraints. After all layers are quantized, we fine-tune the quantized model for one more epoch, and feed the validation accuracy after this short-term retraining as the reward signal to our RL agent. During exploration, we leverage the deep deterministic policy gradient (DDPG) [17] to supervise our RL agent. We studied the quantization policy on multiple hardware architectures: both cloud and edge NN accelerators, with spatial or temporal multi-precision designs.

The contributions of this paper have four aspects:

Automation: We propose an automated framework for quantization, which does not require domain experts or rule-based heuristics. It frees human labor from exploring the vast search space of bitwidth choices.

Hardware-Aware: Our framework involves the hardware architecture in the loop so that it can directly reduce the latency, energy, and storage on the target hardware.

Specialization: For different hardware architectures, our framework can offer a specialized quantization policy that is exactly tailored for the hardware architecture.

Design Insights: We interpret the different quantization policies learned for different hardware architectures. Taking both computation and memory access into account, the interpretation offers insights for both neural network architecture design and hardware architecture design.
2 Related Work
Quantization.
There have been extensive explorations on compressing and accelerating deep neural networks using quantization. Han et al. [8] quantized the network weights to reduce the model size using rule-based strategies: e.g., they used human heuristics to determine the bitwidths for convolution and fully-connected layers. Courbariaux et al. [4] binarized the network weights into {-1, +1}; Rastegari et al. [23] binarized each convolution filter into {-w, +w}; Zhu et al. [33] mapped the network weights into {-w_N, 0, +w_P} using two bits; Zhou et al. [32] used one bit for network weights and two bits for activations; Jacob et al. [14] made use of 8-bit integers for both weights and activations. We refer the reader to the survey paper by Krishnamoorthi [16] for a more detailed overview. These conventional quantization methods either simply assign the same number of bits to all layers or require domain experts to determine the bitwidths for different layers, while our framework automates this design process, and our learning-based policy outperforms rule-based strategies.
AutoML.
Many researchers have aimed to improve the performance of deep neural networks by searching for network architectures: Zoph et al. [34] proposed Neural Architecture Search (NAS) to explore and design transformable network building blocks, and their network architectures outperform several human-designed networks; Liu et al. [19] introduced Progressive NAS to accelerate the architecture search by 5× using sequential model-based optimization; Pham et al. [22] introduced Efficient NAS to speed up the exploration by 1000× using parameter sharing; Cai et al. [1] introduced path-level network transformation to effectively search the tree-structured architecture space. Motivated by these AutoML frameworks, He et al. [10] leveraged reinforcement learning to automatically prune convolution channels. Our framework further explores automated quantization for network weights and activations, and it takes the hardware architecture into consideration.

Efficient Models.
To facilitate efficient deployment, researchers have designed hardware-friendly approaches to slim neural network models. For instance, coarse-grained channel pruning methods [11, 20] prune away entire channels of convolution kernels to achieve speedup. Recently, researchers have explicitly optimized for various aspects of hardware properties, including the inference latency and energy: Yang et al. [30] proposed energy-aware pruning to directly optimize the energy consumption of neural networks; Yang et al. [31] reduced the inference time of neural networks on mobile devices through a lookup table. Nevertheless, these methods are still rule-based and mostly focus on pruning. Our framework automates the quantization process by taking hardware-specific metrics as direct rewards, using a learning-based method.
3 Approach
The overview of our proposed framework is shown in Figure 2. We model the quantization task as a reinforcement learning problem. We use an actor-critic model with a DDPG agent to give the action: the bitwidths for each layer. We collect hardware counters, together with accuracy, as direct rewards to search the optimal quantization policy for each layer. We have three hardware environments that cover edge and cloud, and spatial and temporal architectures for multi-precision accelerators. Below we describe the details of the RL formulation.
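For concreteness, the following pseudocode sketches one search episode. The helper names (make_observation, action_to_bitwidth, enforce_budget, quantize_model, finetune, compute_reward, and the agent interface) are hypothetical placeholders for the components described in the rest of this section, not our released implementation.

```python
# Minimal sketch of one HAQ search episode (hypothetical helpers).
def search_episode(agent, layers, budget, model):
    observations, actions, prev_action = [], [], 0.0
    for k, layer in enumerate(layers):
        obs = make_observation(layer, k, prev_action)   # 10-dim state (Sec. 3.1)
        act = agent.act(obs)                            # continuous action in [0, 1] (Sec. 3.2)
        observations.append(obs)
        actions.append(act)
        prev_action = act

    bits = [action_to_bitwidth(a) for a in actions]     # Eq. (3)
    bits = enforce_budget(bits, budget, measure_cost)   # resource constraints (Sec. 3.2, 3.3)

    quantized = quantize_model(model, bits)             # linear quantization (Sec. 3.4)
    finetune(quantized, epochs=1)                       # short-term retraining
    reward = compute_reward(quantized, model)           # accuracy-based reward, Eq. (7)

    agent.remember(observations, actions, reward)       # store transitions for the DDPG update
    agent.update()                                      # actor-critic optimization (Sec. 3.6)
```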
3.1 Observation (State Space)
Our agent processes the neural network in a layer-wise manner. For each layer, our agent takes two steps: one for weights, and one for activations. In this paper, we introduce a ten-dimensional feature vector O_k as our observation.

If the k-th layer is a convolution layer, the state O_k is

O_k = (k, c_in, c_out, s_kernel, s_stride, s_feat, n_params, i_dw, i_w/a, a_{k-1}),    (1)

where k is the layer index, c_in is #input channels, c_out is #output channels, s_kernel is the kernel size, s_stride is the stride, s_feat is the input feature map size, n_params is #parameters, i_dw is a binary indicator for depthwise convolution, i_w/a is a binary indicator for weight/activation, and a_{k-1} is the action from the last time step.
If the k-th layer is a fully-connected layer, the state O_k is

O_k = (k, h_in, h_out, 1, 0, s_feat, n_params, 0, i_w/a, a_{k-1}),    (2)

where k is the layer index, h_in is #input hidden units, h_out is #output hidden units, s_feat is the size of the input feature vector, n_params is #parameters, i_w/a is a binary indicator for weight/activation, and a_{k-1} is the action from the last step (the kernel, stride, and depthwise dimensions are set to constants so that the vector stays ten-dimensional).
For each dimension in the observation vector O_k, we normalize it into [0, 1] so that all dimensions are on the same scale.
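A minimal Python sketch of how such an observation could be assembled and min-max normalized; the field names are illustrative, not our exact implementation.

```python
# Sketch of the 10-dimensional observation for a convolution layer and the
# per-dimension min-max normalization applied across all layers.
import numpy as np

def conv_observation(k, c_in, c_out, kernel, stride, feat_size,
                     n_params, is_depthwise, is_weight, prev_action):
    return np.array([k, c_in, c_out, kernel, stride, feat_size,
                     n_params, float(is_depthwise), float(is_weight),
                     prev_action], dtype=np.float32)

def normalize(observations):
    obs = np.stack(observations)                      # shape: (num_steps, 10)
    lo, hi = obs.min(axis=0), obs.max(axis=0)
    return (obs - lo) / np.maximum(hi - lo, 1e-8)     # each dimension now in [0, 1]
```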
3.2 Action Space
We use a continuous action space to determine the bitwidth. The reason we do not use a discrete action space is that it loses the relative order: e.g., 2-bit quantization is more aggressive than 4-bit, and even more so than 8-bit. At the k-th time step, we take the continuous action a_k (which is in the range of [0, 1]) and round it into the discrete bitwidth value b_k:

b_k = round(b_min - 0.5 + a_k × (b_max - b_min + 1)),    (3)

where b_min and b_max denote the minimum and maximum bitwidth (in our experiments, we set b_min to 2 and b_max to 8).
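A minimal sketch of this mapping, assuming b_min = 2 and b_max = 8 as above.

```python
# Eq. (3): map a continuous action a_k in [0, 1] to a discrete bitwidth.
def action_to_bitwidth(a_k, b_min=2, b_max=8):
    b = round(b_min - 0.5 + a_k * (b_max - b_min + 1))
    return int(min(max(b, b_min), b_max))   # clamp to guard the a_k = 0 or 1 endpoints
```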
Resource Constraints.
In real-world applications, we have limited computation budgets (i.e., latency, energy, and model size). We would like to find the quantization policy with the best performance given the constraint.
We encourage our agent to meet the computation budget by limiting the action space. After our RL agent gives actions to all layers, we measure the amount of resources that will be used by the quantized model. The feedback is directly obtained from the hardware accelerator, which we discuss in Section 3.3. If the current policy exceeds our resource budget (on latency, energy, or model size), we sequentially decrease the bitwidth of each layer until the constraint is finally satisfied.
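A sketch of this constraint-enforcement step, assuming a hypothetical measure_cost callable that returns the hardware-reported cost (latency, energy, or model size) of a bitwidth assignment.

```python
# Sequentially decrease each layer's bitwidth until the measured cost fits the budget.
def enforce_budget(bits, budget, measure_cost, b_min=2):
    bits = list(bits)
    while measure_cost(bits) > budget:
        changed = False
        for i in range(len(bits)):                # one sweep over the layers
            if bits[i] > b_min:
                bits[i] -= 1
                changed = True
                if measure_cost(bits) <= budget:
                    return bits
        if not changed:                           # every layer is already at b_min
            break
    return bits
```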
3.3 Direct Feedback from Hardware Accelerators
An intuitive feedback signal for our RL agent could be FLOPs or the model size. However, these proxy signals are indirect and do not equal the actual performance (i.e., latency, energy consumption) on the hardware: cache locality, number of kernel calls, and memory bandwidth all matter. Proxy feedback cannot model these hardware properties and therefore cannot find specialized strategies (see Table 1).
Instead, we use direct latency and energy feedback from hardware accelerators to optimize the performance, which enables our RL agent to determine the bitwidth allocation policy from the subtle differences between layers: e.g., vanilla convolution has more data reuse and better locality, while depthwise convolution [3] has less reuse and worse locality, which makes it memory bounded.
3.4 Quantization
We linearly quantize the weights and activations of each layer using the action given by our agent, since a linearly quantized model only needs fixed-point arithmetic units, which are efficient to implement on hardware.
Specifically, for each weight value w in the k-th layer, we first truncate it into the range of [-c, c], and we then quantize it linearly into a_k bits:

quantize(w, a_k, c) = round(clamp(w, c) / s) × s,    (4)

where clamp(w, x) truncates the values into [-x, x], and the scaling factor s is defined as

s = c / (2^{a_k - 1} - 1).    (5)
In this paper, we choose the value of c by finding the optimal value x that minimizes the KL-divergence between the original weight distribution W_k and the quantized weight distribution quantize(W_k, a_k, x):

c = argmin_x D_KL(W_k || quantize(W_k, a_k, x)),    (6)

where D_KL(·||·) is the KL-divergence that characterizes the distance between two distributions. As for activations, we quantize the values similarly, except that we truncate them into the range of [0, c] rather than [-c, c], since the activation values (which are the outputs of ReLU layers) are non-negative.
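The following numpy sketch illustrates Eqs. (4)-(6). The grid search over candidate clipping thresholds is one simple way to minimize the KL-divergence and is an assumption of this illustration, not our exact calibration procedure.

```python
# Linear quantization with a KL-calibrated clipping threshold (sketch).
import numpy as np

def linear_quantize(w, bits, c):
    s = c / (2 ** (bits - 1) - 1)                 # scaling factor, Eq. (5)
    return np.round(np.clip(w, -c, c) / s) * s    # Eq. (4); use np.clip(w, 0, c) for activations

def kl_divergence(p, q, eps=1e-8):
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def calibrate_clip(w, bits, num_candidates=64, num_bins=256):
    w_max = np.abs(w).max()
    edges = np.linspace(-w_max, w_max, num_bins + 1)
    ref_hist, _ = np.histogram(w, bins=edges)             # original weight distribution
    best_c, best_d = w_max, float("inf")
    for c in np.linspace(w_max / num_candidates, w_max, num_candidates):
        q_hist, _ = np.histogram(linear_quantize(w, bits, c), bins=edges)
        d = kl_divergence(ref_hist.astype(float), q_hist.astype(float))  # Eq. (6)
        if d < best_d:
            best_c, best_d = c, d
    return best_c
```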
3.5 Reward Signal
After quantization, we retrain the quantized model for one more epoch to recover the performance. We define our reward function R to be directly related to the accuracy:

R = λ × (acc_quant - acc_origin),    (7)

where acc_origin is the accuracy of the original full-precision model, acc_quant is the accuracy of the quantized model after fine-tuning, and λ is a scaling factor (set to 0.1 in our experiments).
3.6 Agent
As for our agent, we leverage the deep deterministic policy gradient (DDPG) [17], which is an off-policy actor-critic algorithm for continuous control problems. We apply a variant form of the Bellman's Equation, where each transition in an episode is defined as

T_k = (O_k, a_k, R, O_{k+1}).    (8)

During exploration, the Q-function is computed as

Q̂_k = R_k - B + γ × Q(O_{k+1}, μ(O_{k+1}) | θ^Q),    (9)

and the gradient signal can be obtained from the loss

L = (1 / N_s) × Σ_{k=1}^{N_s} (Q̂_k - Q(O_k, a_k | θ^Q))²,    (10)

where N_s denotes the number of steps in this episode, μ(· | θ^μ) denotes the actor network, the baseline B is defined as an exponential moving average of all previous rewards in order to reduce the variance of the gradient estimation, and the discount factor γ is set to 1 to avoid over-prioritizing the short-term rewards.
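A PyTorch-style sketch of the critic update implied by Eqs. (9)-(10). The critic, target_critic, and target_actor modules are assumed, and the actor is treated here as a function of the observation alone, since the previous action is already part of O_k.

```python
# Critic target and loss for the DDPG update (sketch, not the reference code).
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, target_actor, batch, baseline, gamma=1.0):
    obs, act, reward, next_obs, done = batch            # tensors of shape (N, ...)
    with torch.no_grad():
        next_act = target_actor(next_obs)
        next_q = target_critic(next_obs, next_act).squeeze(-1)
        target_q = (reward - baseline) + gamma * (1.0 - done) * next_q   # Eq. (9)
    q = critic(obs, act).squeeze(-1)
    return F.mse_loss(q, target_q)                       # Eq. (10)
```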
3.7 Implementation Details
In this section, we present some implementation details about RL exploration and fine-tuning of the quantized models.
Agent.
The DDPG agent consists of an actor network and a critic network. Both follow the same architecture: they take the state vector and the action from the last time step as inputs and feed them into two separate fully-connected layers; we then add the two hidden vectors together and pass the result through another two fully-connected layers. As for the actor network, we use an additional sigmoid function to project the output into the range of [0, 1].
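A PyTorch sketch of the actor described above. The hidden sizes (400 and 300) are illustrative assumptions, not a statement of the exact values used in our experiments; the critic follows the same structure without the final sigmoid.

```python
# Sketch of the two-branch actor network.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim=10, act_dim=1, h1=400, h2=300):
        super().__init__()
        self.obs_fc = nn.Linear(obs_dim, h1)        # branch for the state vector
        self.act_fc = nn.Linear(act_dim, h1)        # branch for the previous action
        self.head = nn.Sequential(
            nn.ReLU(), nn.Linear(h1, h2), nn.ReLU(), nn.Linear(h2, act_dim)
        )

    def forward(self, obs, prev_act):
        h = self.obs_fc(obs) + self.act_fc(prev_act)    # add the two hidden vectors
        return torch.sigmoid(self.head(h))              # continuous action in [0, 1]
```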
Exploration.
Optimization of the DDPG agent is carried out using ADAM [15]. We use fixed (but different) learning rates for the actor and the critic networks. During exploration, we employ the following stochastic noise process:
w' ~ TN(μ(O_k | θ^μ), σ², 0, 1),    (11)

where TN denotes the truncated normal distribution and μ(· | θ^μ) is the actor. The noise σ is initialized at the start of the search and decayed exponentially after each episode.

Finetuning.
During exploration, we fine-tune the quantized model for one epoch to help recover the performance (using SGD with a fixed learning rate and momentum). We randomly select 100 categories from ImageNet [5] to accelerate the model fine-tuning during exploration. After exploration, we quantize the model with our best policy and fine-tune it on the full dataset.

4 Experiments
We conduct extensive experiments to demonstrate the consistent effectiveness of our framework for multiple objectives: latency, energy, model size, and accuracy.
Datasets and Models.
Our experiments are performed on the ImageNet [5] dataset. As our focus is on more efficient models, we extensively study the quantization of MobileNetV1 [12] and MobileNetV2 [24]. Both MobileNets are built on depthwise separable convolutions [3] and replace regular convolutions with pointwise and depthwise convolutions: MobileNetV1 stacks multiple "depthwise - pointwise" blocks repeatedly, while MobileNetV2 uses "pointwise - depthwise - pointwise" blocks as its basic building primitives.
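A minimal PyTorch sketch of the MobileNetV1 "depthwise - pointwise" block, for illustration only (the MobileNetV2 block adds an expanding pointwise convolution and a residual connection).

```python
# Depthwise separable convolution block (sketch).
import torch.nn as nn

def depthwise_separable(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, kernel_size=3, stride=stride, padding=1,
                  groups=c_in, bias=False),                 # depthwise: one filter per channel
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),  # pointwise: 1x1 convolution
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )
```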
Table 2: Configurations of our edge (Zynq-7020) and cloud (VU9P) FPGA platforms.

Hardware            Batch   PE Array   AXI port   Block RAM
Edge   Zynq-7020    1       8×8        4×64b      140×36Kb
Cloud  VU9P         16      16×16      4×256b     2160×36Kb
4.1 Latency-Constrained Quantization
We first evaluate our framework under latency constraints on two representative hardware architectures: a temporal architecture and a spatial architecture for multi-precision CNN inference.
Temporal Architecture.
The Bit-Serial Matrix Multiplication Overlay (BISMO) (https://github.com/EECSNTNU/bismo) proposed by Umuroglu et al. [26] is a classic temporal design of a neural network accelerator on FPGA. It introduces bit-serial multipliers which are fed with one-bit digits from 256 weights and corresponding activations in parallel at a time, and it accumulates their partial products by shifting over time.
Spatial Architecture.
The BitFusion architecture (https://github.com/hsharma35/bitfusion) proposed by Sharma et al. [25] is a state-of-the-art spatial ASIC design for neural network acceleration. It employs a 2D systolic array of Fusion Units which spatially sum the shifted partial products of two-bit elements from weights and activations.
Table 3: Latency-constrained quantization on the BISMO architecture (edge and cloud accelerators).
Edge Accelerator  Cloud Accelerator
MobileNetV1  MobileNetV2  MobileNetV1  MobileNetV2
Bitwidths  Top-1  Top-5  Latency  Top-1  Top-5  Latency  Top-1  Top-5  Latency  Top-1  Top-5  Latency
PACT [2]  4 bits  62.44  84.19  45.45 ms  61.39  83.72  52.15 ms  62.44  84.19  57.49 ms  61.39  83.72  74.46 ms 
Ours  flexible  67.40  87.90  45.51 ms  66.99  87.33  52.12 ms  65.33  86.60  57.40 ms  67.01  87.46  73.97 ms 
PACT [2]  5 bits  67.00  87.65  57.75 ms  68.84  88.58  66.94 ms  67.00  87.65  77.52 ms  68.84  88.58  99.43 ms 
Ours  flexible  70.58  89.77  57.70 ms  69.40  88.84  66.92 ms  69.97  89.37  77.49 ms  69.45  88.94  99.07 ms 
PACT [2]  6 bits  70.46  89.59  70.67 ms  71.25  90.00  82.49 ms  70.46  89.59  99.86 ms  71.25  90.00  127.07 ms 
Ours  flexible  71.20  90.19  70.35 ms  71.34  90.04  82.34 ms  71.20  90.08  99.66 ms  71.85  90.24  127.03 ms 
Original  8 bits  70.82  89.85  96.20 ms  71.81  90.25  115.84 ms  70.82  89.85  151.09 ms  71.81  90.25  189.82 ms 
4.1.1 Quantization policy for BISMO Architecture
Inference on edge devices and on cloud servers can be quite different, since the tasks on cloud servers are more intensive while edge devices are usually limited in computation resources and memory bandwidth. We use the Xilinx Zynq-7020 FPGA [29] as our edge device and the Xilinx VU9P [28] as our cloud device. Table 2 shows our experiment configurations on these two platforms along with their available resources.
For comparison, we adopt PACT [2] as our baseline, which uses the same number of bits for all layers except the first layer: since that layer extracts low-level features, has few parameters, and is very sensitive to quantization errors, it is kept at 8 bits for both weights and activations. We follow a similar setup: we quantize the weights and activations of the first and last layers to 8 bits and explore the bitwidth allocation policy for all the other layers. Under the same latency, HAQ consistently achieves better accuracy than the baseline on both the cloud and the edge (Table 3). With similar accuracy, HAQ can reduce the latency by 1.4× to 1.95× compared with the baseline.
Interpreting the quantization policy.
Our agent gave quite different quantization policies for the edge and cloud accelerators (Figure 3). For activations, the depthwise convolution layers are assigned far fewer bits than the pointwise layers on the edge, while on the cloud device the bitwidths of these two types of layers are similar to each other. For weights, the bitwidths of these two types of layers are nearly the same on the edge, while on the cloud the depthwise convolution layers are assigned many more bits than the pointwise convolution layers.
Table 4: Latency-constrained quantization on the BitFusion architecture (MobileNetV1).
Weights  Activations  Top-1  Top-5  Latency
PACT [2]  4 bits  4 bits  62.44  84.19  7.86 ms 
Ours  flexible  flexible  67.45  87.85  7.86 ms 
PACT [2]  6 bits  4 bits  67.51  87.84  11.10 ms 
Ours  flexible  flexible  70.40  89.69  11.09 ms 
PACT [2]  6 bits  6 bits  70.46  89.59  19.99 ms 
Ours  flexible  flexible  70.90  89.95  19.98 ms 
Original  8 bits  8 bits  70.82  89.85  20.08 ms 
Table 5: Energy-constrained quantization on the BitFusion architecture (MobileNetV1).
Weights  Activations  Top-1  Top-5  Energy
PACT [2]  4 bits  4 bits  62.44  84.19  13.47 mJ 
Ours  flexible  flexible  64.78  85.85  13.69 mJ 
PACT [2]  6 bits  4 bits  67.51  87.84  16.57 mJ 
Ours  flexible  flexible  70.37  89.40  16.30 mJ 
PACT [2]  6 bits  6 bits  70.46  89.59  26.80 mJ 
Ours  flexible  flexible  70.90  89.73  26.67 mJ 
Original  8 bits  8 bits  70.82  89.95  31.03 mJ 
We explain the difference in quantization policy between the edge and the cloud using the roofline model [27]. Many previous works use FLOPs or BitOPs as metrics to measure computation complexity. However, they do not directly reflect the latency, since many other factors influence hardware performance, such as memory access cost and degree of parallelism [24, 20]. Taking both computation and memory access into account, the roofline model assumes that applications are either computation-bound or memory-bandwidth-bound (if they do not fit in on-chip caches), depending on their operation intensity. Operation intensity is measured as the number of operations (MACs in neural networks) per DRAM byte accessed; a lower operation intensity indicates that a layer suffers more from memory access.
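The following sketch estimates operation intensity for depthwise and pointwise convolution layers under simplifying assumptions (8-bit values, each tensor moved through DRAM exactly once); it is meant only to illustrate why depthwise layers fall on the memory-bound side of the roofline.

```python
# Rough operation-intensity estimates (MACs per DRAM byte) for two layer types.
def op_intensity_depthwise(c, h, w, k=3, bytes_per_elem=1):
    macs = c * h * w * k * k
    traffic = bytes_per_elem * (c * h * w * 2 + c * k * k)        # in + out feature maps + weights
    return macs / traffic

def op_intensity_pointwise(c_in, c_out, h, w, bytes_per_elem=1):
    macs = c_in * c_out * h * w
    traffic = bytes_per_elem * (c_in * h * w + c_out * h * w + c_in * c_out)
    return macs / traffic

# Example: a 112x112 feature map with 64 channels gives roughly 4.5 MACs/byte for the
# depthwise layer, versus roughly 40 MACs/byte for a 64->128 pointwise layer, so the
# depthwise layer sits far lower on the roofline and is dominated by memory access.
```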
Table 6: Model size-constrained quantization compared with Deep Compression [8].
MobileNetV1  MobileNetV2  ResNet-50
Weights  Top-1  Top-5  Model Size  Top-1  Top-5  Model Size  Top-1  Top-5  Model Size
Han et al. [8]  2 bits  37.62  64.31  1.09 MB  58.07  81.24  0.96 MB  68.95  88.68  6.32 MB 
Ours  flexible  57.14  81.87  1.09 MB  66.75  87.32  0.95 MB  70.63  89.93  6.30 MB 
Han et al. [8]  3 bits  65.93  86.85  1.60 MB  68.00  87.96  1.38 MB  75.10  92.33  9.36 MB 
Ours  flexible  67.66  88.21  1.58 MB  70.90  89.76  1.38 MB  75.30  92.45  9.22 MB 
Han et al. [8]  4 bits  71.14  89.84  2.10 MB  71.24  89.93  1.79 MB  76.15  92.88  12.40 MB 
Ours  flexible  71.74  90.36  2.07 MB  71.47  90.23  1.79 MB  76.14  92.89  12.14 MB 
Original  32 bits  70.90  89.90  16.14 MB  71.87  90.32  13.37 MB  76.15  92.86  97.49 MB 
The bottom of Figure 3 shows the operation intensities (OPs per byte) of the convolution layers in MobileNetV1. Depthwise convolution is a memory-bounded operation, while pointwise convolution is computation-bounded. Our experiments show that when running MobileNetV1 on the edge device with a small batch size, its latency is dominated by the depthwise convolution layers. Since the feature maps take up a major proportion of the memory traffic in depthwise convolution layers, our agent gives their activations fewer bits. In contrast, when running MobileNetV1 on the cloud with a large batch size, both types of layers have nearly equal influence on the speed, so our agent tries to reduce the bitwidth of both activations and weights. However, since the weights of the depthwise convolution layers take up only a small proportion of the memory, our agent increases their bitwidth to preserve the network accuracy at low memory overhead. A similar phenomenon can be observed in Figure 4 for quantizing MobileNetV2. Moreover, since the activations in the deeper layers are smaller, they get assigned more bits. Another interesting phenomenon in Figure 4 is that the downsampling layer gets assigned more activation bits than the adjacent layers. This is because downsampled features are more prone to losing information, so our agent learns to assign more bits to the activations to compensate.
4.1.2 Quantization policy for BitFusion Architecture
To demonstrate the effectiveness of our framework on different hardware architectures, we further compare our framework with PACT [2] under latency constraints on the BitFusion [25] architecture. As demonstrated in Table 4, our framework performs much better than the handcrafted policy at the same latency. It can achieve almost no degradation of accuracy with only half of the latency of the original MobileNetV1 model (from 20.08 ms to 11.09 ms). Therefore, our framework is indeed flexible and can be applied to different hardware platforms.
4.2 Energy-Constrained Quantization
We then evaluate our framework under energy constraints on the BitFusion [25] architecture. Similar to the latency-constrained experiments, we compare our framework with PACT [2], which uses a fixed number of bits for both weights and activations. From Table 5, we can clearly see that our framework outperforms the rule-based baseline: it achieves much better performance while consuming a similar amount of energy. In particular, our framework is able to achieve almost no loss of accuracy with nearly half of the energy consumption of the original MobileNetV1 model (from 31.03 mJ to 16.57 mJ), which suggests that flexible bitwidths can indeed help reduce the energy consumption.
4.3 Model Size-Constrained Quantization
Finally, we evaluate our framework under model size constraints. Following Han et al. [8], we employ the k-means algorithm to quantize the weight values into different centroids instead of using linear quantization for compression.
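A sketch of this k-means weight quantization using scikit-learn, an illustration of the idea rather than the exact procedure of [8].

```python
# Cluster each layer's weights into 2^bits centroids and snap every weight
# to its nearest centroid (sketch).
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights, bits):
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=2 ** bits, n_init=4).fit(flat)
    centroids = km.cluster_centers_.squeeze(1)
    return centroids[km.labels_].reshape(weights.shape)
```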
We compare our framework with Deep Compression [8] on MobileNets and ResNet-50. From Table 6, we can see that our framework performs much better than Deep Compression: it achieves higher accuracy at the same model size. For MobileNets, which are already very compactly designed, our framework can largely preserve their performance, while Deep Compression significantly degrades performance, especially when the model size is very small. For instance, when Deep Compression quantizes the weights of MobileNetV1 to 2 bits, the top-1 accuracy drops significantly from 70.90 to 37.62, while our framework can still achieve 57.14 top-1 accuracy at the same model size, because it makes full use of the flexible bitwidths.
Discussions.
In Figure 5, we visualize the bitwidth allocation strategy for MobileNetV2. From this figure, we can observe that our framework assigns more bits to the weights in depthwise convolution layers than in pointwise convolution layers. Intuitively, this is because the number of parameters in the former is much smaller than in the latter. Comparing Figure 4 and Figure 5, the policies are drastically different under different optimization objectives (fewer bits for depthwise convolutions under latency optimization, more bits for depthwise convolutions under model size optimization). Our framework succeeds in learning to adjust its bitwidth policy under different constraints.
4.4 Accuracy-Guaranteed Quantization
Apart from the resource-constrained experiments, we also evaluate our framework in the accuracy-guaranteed scenario; that is, we aim to minimize the resources we use (i.e., latency and energy) while preserving the accuracy.
Instead of using the resource-constrained action space in Section 3.2, we define a new reward function that takes both the resource and the accuracy into consideration:

R = R_acc + R_lat + R_energy.    (12)

Here, the reward terms are defined to encourage each quantity to be as good as possible:

R_acc = λ_acc × (acc_quant - acc_origin),  R_lat = λ_lat × (lat_origin - lat_quant),  R_energy = λ_energy × (energy_origin - energy_quant),    (13)

where λ_acc, λ_lat, and λ_energy are scaling factors that encourage the RL agent to trade off between the computation resources and the accuracy. In our experiments, we set λ_acc much larger than λ_lat and λ_energy to ensure that our RL agent prioritizes the accuracy.
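A sketch of this composite reward; the functional form and the scaling factors shown are illustrative assumptions consistent with Eq. (13), not the exact values used in our experiments.

```python
# Composite reward trading off accuracy, latency, and energy (sketch).
def accuracy_guaranteed_reward(acc_q, acc_o, lat_q, lat_o, energy_q, energy_o,
                               lam_acc=1.0, lam_lat=0.1, lam_energy=0.1):
    r_acc = lam_acc * (acc_q - acc_o)            # penalize accuracy loss vs. full precision
    r_lat = lam_lat * (lat_o - lat_q)            # reward latency reduction
    r_energy = lam_energy * (energy_o - energy_q)  # reward energy reduction
    return r_acc + r_lat + r_energy
```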
As it is extremely challenging to preserve the accuracy while reducing the computation resources, we perform these experiments on a ten-category subset of ImageNet. In Figure 6, we illustrate the exploration curves of our RL agent, and we can observe that the exploration process can be divided into three phases. In the first phase, the RL agent focuses on the accuracy: it tries to preserve the accuracy while completely ignoring the latency and the energy consumption. In the second phase, the accuracy begins to stabilize, and the RL agent starts to aggressively reduce the latency and the energy. In the third phase, the RL agent converges to the best policy it has found. We conjecture that this behavior arises because the scaling factor λ_acc is much larger than the other two, which encourages the agent to first optimize the accuracy term; after the accuracy has stabilized, the agent then reduces the latency and energy terms to further improve the reward (see the reward curve in Figure 6).
5 Conclusion
In this paper, we propose Hardware-Aware Automated Quantization (HAQ), an automated framework for quantization that does not require any domain experts or rule-based heuristics. We provide a learning-based method that searches the quantization policy with hardware feedback. Compared with indirect proxy signals, our framework can offer a specialized quantization solution for different hardware platforms. Extensive experiments demonstrate that our framework performs better than conventional rule-based approaches for multiple objectives: latency, energy, and model size. Our framework reveals that the optimal policies on different hardware architectures are drastically different, and we interpret the implications of those policies. We believe these insights will inspire future software and hardware co-design for the efficient deployment of deep neural networks.
Acknowledgements.
We thank MIT Quest for Intelligence, Xilinx, Samsung, Intel, ARM, Qualcomm, and SONY for supporting this research. We thank Google Cloud and AWS for providing the computation resource.
References
 [1] H. Cai, J. Yang, W. Zhang, S. Han, and Y. Yu. Path-Level Network Transformation for Efficient Architecture Search. In ICML, 2018.
 [2] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan. PACT: Parameterized Clipping Activation for Quantized Neural Networks. arXiv, 2018.
 [3] F. Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. In CVPR, 2017.
 [4] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv, 2016.
 [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 [6] EENews. Apple describes 7nm A12 Bionic chips, 2018.
 [7] S. Han. Efficient Methods and Hardware for Deep Learning. PhD thesis, 2017.
 [8] S. Han, H. Mao, and W. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR, 2016.
 [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
 [10] Y. He, J. Lin, Z. Liu, H. Wang, L.J. Li, and S. Han. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In ECCV, 2018.
 [11] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.

 [12] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv, 2017.
 [13] Imagination. PowerVR neural network accelerator, 2018.
 [14] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. G. Howard, H. Adam, and D. Kalenichenko. Quantization and Training of Neural Networks for Efficient IntegerArithmeticOnly Inference. In CVPR, 2018.
 [15] D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
 [16] R. Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv, 2018.
 [17] T. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
 [18] J. Lin, Y. Rao, J. Lu, and J. Zhou. Runtime Neural Pruning. In NIPS, 2017.
 [19] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.J. Li, L. FeiFei, A. Yuille, J. Huang, and K. Murphy. Progressive Neural Architecture Search. In ECCV, 2018.
 [20] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.

 [21] Nvidia. NVIDIA Tensor Cores, 2018.
 [22] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient Neural Architecture Search via Parameter Sharing. In ICML, 2018.
 [23] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In ECCV, 2016.
 [24] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.C. Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In CVPR, 2018.
 [25] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh. Bit fusion: Bitlevel dynamically composable architecture for accelerating deep neural network. In ISCA, 2018.
 [26] Y. Umuroglu, L. Rasnayake, and M. Sjalander. BISMO: A scalable bit-serial matrix multiplication overlay for reconfigurable computing. In FPL, 2018.
 [27] S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.
 [28] Xilinx. UltraScale architecture and product data sheet: Overview, 2018.
 [29] Xilinx. Zynq-7000 SoC data sheet: Overview, 2018.
 [30] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv, 2016.
 [31] T.J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam. Netadapt: Platformaware neural network adaptation for mobile applications. In ECCV, 2018.
 [32] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv, 2016.
 [33] C. Zhu, S. Han, H. Mao, and W. Dally. Trained Ternary Quantization. In ICLR, 2017.
 [34] B. Zoph and Q. V. Le. Neural Architecture Search with Reinforcement Learning. In ICLR, 2017.