
Design Automation for Efficient Deep Learning Computing

by   Song Han, et al.

Efficient deep learning computing requires algorithm and hardware co-design to enable specialization: we usually need to change the algorithm to reduce memory footprint and improve energy efficiency. However, the extra degree of freedom from the algorithm makes the design space much larger: it's not only about designing the hardware but also about how to tweak the algorithm to best fit the hardware. Human engineers can hardly exhaust the design space by heuristics; the process is labor-intensive and sub-optimal. We propose design automation techniques for efficient neural networks. We investigate automatically designing specialized fast models, auto channel pruning, and auto mixed-precision quantization. We demonstrate that such learning-based, automated design achieves better performance and efficiency than rule-based human design. Moreover, we shorten the design cycle by 200× compared with previous work, so that we can afford to design specialized neural network models for different hardware platforms.



1 Introduction

Algorithm and hardware co-design plays an important role in efficient deep learning computing. Unlike optimizing for the SPEC2006 benchmark, where we can treat the algorithm as a black box, there is plenty of room at the algorithm level to improve the hardware efficiency of deep learning. We should open the box and explore model optimization techniques. The benefit usually comes from memory saving and locality improvement. For example, model compression techniques [1], including pruning and quantization, can drastically reduce the memory footprint and save energy. Another example is small model design. SqueezeNet [2] and MobileNet [3] have model sizes of only 4.8MB/4.2MB, which can fit in on-chip SRAM and improve locality.

However, efficient model design and compression have a large design space. Many different neural network architectures can lead to similar accuracy but drastically different hardware efficiency. This space is difficult to exhaust with rule-based heuristics, since there is a shortage of deep learning and hardware experts to hand-tune each model to make it run fast. There is a strong need to systematically study how to design efficient neural networks under hardware constraints. We propose hardware-centric AutoML techniques that automatically design hardware-efficient neural networks [4, 5, 6]. Such joint optimization is systematic and transfers well between tasks. It requires less engineering effort while designing better neural networks at low cost. We explore three aspects of neural network design automation (Figure 1): automated design of specialized models, automated channel pruning, and automated mixed-precision quantization. Each aspect is summarized as follows.

Fig. 1: Design automation for model specialization, channel pruning and mixed-precision quantization.

There is plenty of specialized hardware for neural networks, but little research has been done on specializing the neural network architecture for a given hardware architecture (the reverse specialization). A specialized model has several advantages: it can fully utilize the parallelism of the target hardware (e.g., by fitting the channel size to the PE size); it can fully utilize the on-chip buffer and improve locality and reuse; and it can match the model's computation intensity to the hardware's roofline model. However, designing a specialized neural network architecture used to be difficult. First, there are limited heuristics. Second, the computation cost used to be prohibitive: even searching for a model on the CIFAR-10 dataset consumes a large number of GPU hours [7, 8]. We cut the search cost by two orders of magnitude (actually more than that, since we directly search on ImageNet). The search cost is reduced by two techniques, path-level pruning and path-level binarization, which save GPU hours and GPU memory. Cutting the search cost enables us to design a specialized model directly on the target task and target hardware. On the mobile phone, our searched model [4] runs 1.8× faster than the best human-designed model [9].

After designing a specialized model, compression and pruning are effective techniques to further reduce the memory footprint [1]. Conventional model compression techniques rely on hand-crafted heuristics and rule-based policies that require domain experts to explore the large design space. We propose an automated design flow that leverages reinforcement learning to find the best model compression policy. This learning-based compression policy outperforms conventional rule-based compression policies by achieving a higher compression ratio, better preserving the accuracy, and freeing human labor. We applied this automated, push-the-button compression pipeline to MobileNet and achieved a 1.81× speedup of measured inference latency on an Android phone and a 1.43× speedup on the Titan XP GPU, with only 0.1% loss of ImageNet Top-1 accuracy.

The last step is automatic mixed-precision quantization. Emerging DNN hardware accelerators are beginning to support flexible bitwidths (1-8 bits), which raises a great challenge: finding the optimal bitwidth for each layer requires domain experts to explore a vast design space, trading off among accuracy, latency, energy, and model size. Conventional quantization algorithms ignore the differences between hardware architectures and quantize all layers in a uniform way. We introduce an automated design flow for model quantization that takes the hardware accelerator's feedback into the design loop. Our framework can specialize the quantization policy for different hardware architectures. It can effectively reduce the latency by 1.4-1.95× and the energy consumption by 1.9× with negligible loss of accuracy compared with fixed-bitwidth (8 bits) quantization.

2 Automated Model Specialization

In order to fully utilize the hardware resources, we propose to search for a specialized CNN architecture for the given hardware. The model is compact and runs fast. We start with a large design space (Figure 1(a)) that includes many candidate paths and learn which one is best by gradient descent, rather than hand-picking with rule-based heuristics. Instead of just learning the weight parameters, we jointly learn the architecture parameters (shown in red in Figure 1(a)). The architecture parameters determine the probability of choosing each path. The search space for each block consists of many choices:

  • ConvOp: mobile inverted bottleneck conv [9] with various kernel sizes and expansion ratios

    • Kernel size: {3×3, 5×5, 7×7}

    • Expansion ratio: {3, 6}

  • ZeroOp: if ZeroOp is chosen at a block, that block is skipped.

Therefore, the number of possible architectures in the design space is $7^N$ (six ConvOp variants plus ZeroOp per block), where $N$ is the number of blocks (21 in our experiments). Given the vast design space, it is infeasible to rely on domain experts to manually design the CNN model for each hardware platform, so we need to employ automatic architecture design techniques. However, early reinforcement learning-based NAS methods [7, 8] are very expensive to run in terms of GPU hours, since they need to iteratively sample an architecture, train it from scratch and update the meta-controller. It typically requires tens of thousands of networks to be trained to find a good neural network architecture.

We adopt a different approach to improve the efficiency of model specialization [4]. We first build a super network that comprises all candidate architectures. Concretely, it has a similar structure to a CNN model in the design space, except that each specific operation is replaced with a mixed operation that has $K$ parallel paths. Each path in a mixed operation corresponds to a candidate operation $o_i$, and we introduce an architecture parameter $\alpha_i$ for each path to learn which paths are redundant and can thereby be pruned (i.e., path-level pruning). In the forward step, to save GPU memory, we allow only one candidate path to actively reside in the GPU memory. This is achieved by hard-thresholding the probability of each candidate path to either 0 or 1 (i.e., path-level binarization). As such, the output of a mixed operation is given as

$$ m(x) = \sum_{i=1}^{K} g_i \, o_i(x), \qquad g_i \in \{0, 1\}, \quad \sum_{i=1}^{K} g_i = 1, \tag{1} $$

where the binary gate $g$ is sampled according to the multinomial distribution derived from the architecture parameters, i.e., $\{p_i = \mathrm{softmax}(\alpha)_i\}$. In the backward step, we update the weight parameters of active paths using standard gradient descent. Since the architecture parameters are not directly involved in the computational graph (Eq. 1), we use the gradient w.r.t. binary gates to update the corresponding architecture parameters:

$$ \frac{\partial L}{\partial \alpha_i} \approx \sum_{j=1}^{K} \frac{\partial L}{\partial g_j} \, \frac{\partial p_j}{\partial \alpha_i}. $$
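A minimal sketch of path-level binarization and the gate-based gradient may help here. It assumes the path probabilities are the softmax of the architecture parameters; the function names are hypothetical, and the gradient with respect to each gate (`grad_g`) is taken as given rather than computed from a real network.

```python
import math
import random

def softmax(alpha):
    """Numerically stable softmax over a list of architecture parameters."""
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]

def sample_gate(alpha, rng):
    """Sample a one-hot binary gate g from p = softmax(alpha)."""
    p = softmax(alpha)
    idx = rng.choices(range(len(alpha)), weights=p, k=1)[0]
    g = [1.0 if i == idx else 0.0 for i in range(len(alpha))]
    return g, p

def mixed_op_forward(x, ops, g):
    """Evaluate only the single active path, so only one path needs memory."""
    return ops[g.index(1.0)](x)

def alpha_grad_from_gate_grad(grad_g, p):
    """dL/dalpha_i ~= sum_j dL/dg_j * dp_j/dalpha_i, with
    dp_j/dalpha_i = p_j * (delta_ij - p_i) for a softmax."""
    baseline = sum(gj * pj for gj, pj in zip(grad_g, p))
    return [pi * (gi - baseline) for gi, pi in zip(grad_g, p)]

if __name__ == "__main__":
    rng = random.Random(0)
    alpha = [0.1, -0.3, 0.5]
    # Toy stand-ins for candidate operations; the last one mimics ZeroOp.
    ops = [lambda x: x, lambda x: 2 * x, lambda x: 0 * x]
    g, p = sample_gate(alpha, rng)
    print(mixed_op_forward(3.0, ops, g), alpha_grad_from_gate_grad([1.0, -2.0, 0.5], p))
```

Because the softmax probabilities sum to one, the resulting gradient components sum to zero: raising the probability of one path necessarily lowers the others.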

In order to specialize the model for hardware, we need to use the latency measured on the hardware as a design reward. However, directly measuring the inference latency has three problems: (i) it is slow; (ii) it has high variance due to different battery conditions and thermal throttling; (iii) latency is non-differentiable and cannot be directly optimized. To address these, we present our latency prediction model and hardware-aware loss. To build the latency model, we pre-compute the latency of each operator with all possible inputs and store the results in a lookup table, which we query during the search. The overall latency of a block is the sum of the latencies of its candidate operators, weighted by the path probabilities.
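The lookup-table latency model can be sketched as follows. The table entries, operator names and input shapes below are made-up placeholders, not measurements from the paper; the point is only that the expected latency is a probability-weighted sum of table entries, and so a smooth function of the path probabilities.

```python
# Hypothetical latency lookup table: (operator name, input shape) -> latency in ms.
# In practice these values would be profiled once on the target device.
LATENCY_LUT = {
    ("mbconv3_k3", (56, 56, 24)): 2.0,
    ("mbconv6_k7", (56, 56, 24)): 4.0,
    ("zero", (56, 56, 24)): 0.0,  # a skipped block costs nothing
}

def expected_block_latency(probs, op_names, input_shape, lut=LATENCY_LUT):
    """Probability-weighted sum of the pre-computed operator latencies."""
    return sum(p * lut[(name, input_shape)] for p, name in zip(probs, op_names))

def expected_network_latency(blocks, lut=LATENCY_LUT):
    """Sum the expected latency over blocks.

    `blocks` is a list of (probs, op_names, input_shape) tuples.
    """
    return sum(expected_block_latency(p, names, shape, lut) for p, names, shape in blocks)

if __name__ == "__main__":
    names = ["mbconv3_k3", "mbconv6_k7", "zero"]
    block = ([0.5, 0.25, 0.25], names, (56, 56, 24))
    print(expected_network_latency([block, block]))
```

No on-device measurement is needed inside the search loop, which avoids both the slowness and the variance of direct measurement.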

Model                        Top-1 (%)   Top-5 (%)   GPU latency
MobileNet-V2 [9]             72.0        91.0        6.1 ms
ResNet-34 [10]               73.3        91.4        8.0 ms
NASNet-A [8]                 74.0        91.3        38.3 ms
MnasNet [11]                 74.0        91.8        6.1 ms
Specialized model for GPU    75.1        92.5        5.1 ms
TABLE I: ImageNet accuracy (%) and GPU latency (Tesla V100).

Then we combine the latency and the training loss (e.g., cross-entropy loss) into a single objective. Two hyper-parameters control the trade-off between accuracy and latency, and the latency term is expressed relative to a target latency. Note that our formulation not only provides a fast estimate of the latency of the searched model but also makes the search process fully differentiable.

We demonstrate the effectiveness of our model specialization on the ImageNet dataset with a CPU (Xeon E5-2640 v4), a GPU (Tesla V100) and a mobile phone (Google Pixel-1). We first search for a specialized CNN model for the mobile phone (Figure 2). Compared to MobileNet-V2 (the state-of-the-art human-engineered architecture), our model improves the top-1 accuracy by 2.6% while maintaining a similar latency. Under the same level of top-1 accuracy (around 74.6%), MobileNet-V2 has 143ms latency while ours has only 78ms (1.83× faster). Compared with the state-of-the-art automatically designed model, MnasNet [11], our model achieves 0.6% higher top-1 accuracy with slightly lower mobile latency. More remarkably, we reduced the search cost by 200×, from 40,000 GPU hours to only 200 GPU hours.

Table I reports the speedup on GPU. Our method achieves superior performance compared to both human-designed and automatically searched architectures. Compared with general-purpose models, our specialized model improves the top-1 accuracy by 1.1%-3.1% while being 1.2-7.5× faster. Table II compares the specialized models on CPU, GPU and mobile. As expected, models specialized for GPU do not run fast on CPU and mobile phone, and vice versa. It is essential to learn specialized neural networks to cater for different hardware.

Our automated design flow produced CNN architectures that were long dismissed as being too inefficient but are, in fact, very efficient. For instance, engineers have essentially stopped using 7×7 filters, because they are computationally more expensive than multiple smaller filters (one 7×7 layer has the same receptive field as three 3×3 layers, but has 49 weights rather than 27). However, our automatically designed model found that using 7×7 filters is very efficient on GPUs. That is because GPUs have high parallelism, and invoking one large kernel call is more efficient than invoking multiple small kernel calls. This design choice goes against previous human thinking. The larger the search space, the more unknown things you can find. You don't know if something will be better than past human experience. Let the automated design tool figure it out [12].
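The receptive-field and weight-count comparison between one 7×7 layer and three stacked 3×3 layers can be checked with a few lines of arithmetic. This sketch assumes stride-1, undilated convolutions and counts weights per input/output channel pair; the helper names are illustrative.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1, undilated convolutions applied in sequence."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each layer widens the field by (k - 1) pixels
    return rf

def weights_per_channel_pair(kernel_sizes):
    """Number of weights per (input channel, output channel) pair."""
    return sum(k * k for k in kernel_sizes)

if __name__ == "__main__":
    for layers in ([7], [3, 3, 3]):
        print(layers, stacked_receptive_field(layers), weights_per_channel_pair(layers))
```

Both configurations cover a 7×7 window, with 49 versus 27 weights. The weight count alone says nothing about the number of kernel launches, which is the effect the search exploited on GPUs.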

Fig. 2: The automatically designed specialized model consistently outperforms the human-designed MobileNet-V2 under various latency settings.

Model                     Top-1 (%)   GPU       CPU        Mobile
Specialized for GPU       75.1        5.1 ms    204.9 ms   124 ms
Specialized for CPU       75.3        7.4 ms    138.7 ms   116 ms
Specialized for Mobile    74.6        7.2 ms    164.1 ms   78 ms
TABLE II: Hardware prefers specialized models. Models optimized for GPU do not run fast on CPU and mobile phone, and vice versa. Our method provides an efficient way to search for a specialized neural network architecture for a target hardware architecture, while cutting the search cost by 200× compared with the state of the art [7, 11].

3 Automated Channel Pruning

Pruning [13] is widely used in model compression and acceleration. It is very important to find the optimal sparsity for each layer during pruning. Pruning too much hurts accuracy; pruning too little does not achieve a high compression ratio. The sparsity used to be determined manually in previous studies [1]. Our goal is to automatically find an effective sparsity for each layer. We train a reinforcement learning agent to predict the best sparsity for a given hardware [5]. We evaluate the accuracy and FLOPs after pruning. Then we update the agent by encouraging smaller, faster and more accurate models.

Model                   Million MAC   Top-1 Acc.   Top-5 Acc.   GPU latency   GPU speed   Android latency   Android speed   Memory
MobileNet (1.0-224)     569           70.6%        89.5%        0.46 ms       2191 fps    123.3 ms          8.1 fps         20.1 MB
MobileNet (0.75-224)    325           68.4%        88.2%        0.34 ms       2944 fps    72.3 ms           13.8 fps        14.8 MB
AMC (50% FLOPs)         285           70.5%        89.3%        0.32 ms       3127 fps    -                 14.6 fps        -
AMC (50% Latency)       272           70.2%        89.2%        0.30 ms       3350 fps    -                 16.0 fps        -
TABLE III: AMC speeds up MobileNet. On the Google Pixel-1 CPU, AMC achieves a 1.95× measured speedup with batch size one, while reducing memory by 34%. On the NVIDIA Titan XP GPU, AMC achieves a 1.53× speedup with a batch size of 50.

Our automatic model compression (AMC) leverages reinforcement learning to efficiently search for the pruning ratio (Figure 1(b)). The reinforcement learning agent receives an embedding state $s_t$ of layer $L_t$ from the environment and then outputs a sparsity ratio as action $a_t$. The layer is compressed with $a_t$ (rounded to the nearest feasible fraction). Then the agent moves to the next layer $L_{t+1}$ and receives state $s_{t+1}$. After finishing the final layer $L_T$, the reward (accuracy) is evaluated on the validation set and returned to the agent. With our framework, we are able to push the expert-tuned limit of fine-grained model pruning. For ResNet-50 on ImageNet, we can push the compression ratio from 3.4× to 5× without loss of accuracy. With further investigation, we find that AMC automatically learns to prune 3×3 convolution kernels harder than 1×1 kernels, which is similar to human heuristics, since the latter are less redundant.
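The layer-by-layer episode described above can be sketched as a simple loop. This is a structural illustration only: the policy here is a fixed placeholder function rather than a trained reinforcement learning agent, the state embedding is a toy tuple, and `evaluate` stands in for measuring validation accuracy after pruning. All names are hypothetical.

```python
def run_episode(layer_channels, policy, evaluate):
    """Walk the layers in order, pick a sparsity per layer, and score the result once.

    layer_channels: number of output channels of each prunable layer
    policy:         maps a state to a sparsity ratio (placeholder for the RL agent)
    evaluate:       maps the kept-channel list to a scalar reward (placeholder
                    for validation accuracy of the pruned model)
    """
    kept = []
    num_layers = len(layer_channels)
    for t, channels in enumerate(layer_channels):
        state = (t, channels, num_layers)              # toy embedding of layer t
        sparsity = min(max(policy(state), 0.0), 1.0)   # clip the action to [0, 1]
        # Round to a feasible channel count and keep at least one channel.
        kept.append(max(1, round(channels * (1.0 - sparsity))))
    # The reward is only available after the final layer has been processed.
    return kept, evaluate(kept)

if __name__ == "__main__":
    kept, reward = run_episode(
        [32, 64, 128],
        policy=lambda state: 0.5,        # placeholder: prune half of every layer
        evaluate=lambda k: -sum(k),      # placeholder reward: fewer channels is better
    )
    print(kept, reward)
```

A learning agent would replace the fixed policy and use the end-of-episode reward to update itself before the next episode.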

Model           Policy                    FLOPs   Acc. change (%)
MobileNet-V1    uniform (0.75-224) [3]    56%     -2.5
                AMC (ours)                50%     -0.4
                uniform (0.75-192) [3]    41%     -3.7
                AMC (ours)                40%     -1.7
MobileNet-V2    uniform (0.75-224) [9]    -       -
                AMC (ours)                -       -1.0
TABLE IV: Learning-based automated model compression (AMC) outperforms rule-based model compression. Rule-based heuristics are suboptimal.

We also compare AMC with a heuristic-based channel reduction method on the modern efficient neural networks MobileNet [3] and MobileNet-V2 [9] (Table IV). Since these networks are already very compact, compressing them further is a convincing test of the method. The easiest way to reduce the channels of a model is uniform channel shrinkage, i.e., using a width multiplier to uniformly reduce the channels of each layer by a fixed ratio. Both MobileNet and MobileNet-V2 report the performance of different multipliers and input sizes, and we compare our pruned result with models of the same computation. The format is denoted as uniform (width multiplier - input size). Our method consistently outperforms the uniform baselines. Even for the current state-of-the-art efficient model design, MobileNet-V2, AMC can still improve the accuracy at the same computation.

Mobile inference acceleration has drawn attention in recent years. AMC can optimize not only FLOPs and model size but also inference latency. For all mobile inference experiments, we use the TensorFlow Lite framework for timing evaluation. Our experiment platform is the Google Pixel 1. Models pruned to 0.5× FLOPs and 0.5× inference time are shown in Table III. For the 0.5× FLOPs setting, we achieve a 1.81× speedup on a Google Pixel 1 phone. For the 0.5× latency setting, we achieve a 1.95× speedup, which is very close to the 2× target, showing that AMC can directly optimize inference time and achieve an accurate speedup ratio. On GPUs, we also achieve up to a 1.5× speedup, which is less than on the mobile phone but still significant for an already very compact model. The speedup is smaller because a GPU has a higher degree of parallelism than a mobile phone.
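The FLOPs percentages of the uniform baselines in Table IV can be approximated with a short calculation. The sketch assumes that convolution cost scales with the square of the width multiplier (input channels times output channels) and with the square of the input resolution; this ignores layers whose cost is only linear in channels, such as the first layer and depthwise convolutions, so it is an approximation rather than an exact count.

```python
def uniform_flops_fraction(width_mult, input_size, base_input_size=224):
    """Approximate FLOPs relative to the full model under uniform shrinkage.

    Assumes cost ~ (in_channels * out_channels) * (height * width), so it
    scales with width_mult**2 and with (input_size / base_input_size)**2.
    """
    return (width_mult ** 2) * (input_size / base_input_size) ** 2

if __name__ == "__main__":
    for mult, size in [(0.75, 224), (0.75, 192)]:
        print(f"uniform ({mult}-{size}): {100 * uniform_flops_fraction(mult, size):.0f}% FLOPs")
```

The two settings come out at roughly 56% and 41%, matching the FLOPs column of the uniform rows for MobileNet-V1.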

4 Automated Mixed-Precision Quantization

Conventional quantization methods quantize each layer of the model to the same precision. Mixed-precision quantization is more flexible but suffers from a huge design space that is difficult to explore. Meanwhile, as demonstrated in Table V, the quantization solution optimized for one hardware platform might not be optimal for another, which raises the demand for specialized policies for different hardware architectures and further enlarges the design space. Assuming the bitwidth is between 1 and 8 for both weights and activations, each layer has $8 \times 8 = 64$ choices. If we have $M$ different neural network models, each with $N$ layers, on $H$ different hardware platforms, there are in total $O(H \times M \times 8^{2N})$ possible solutions.

Rather than using rule-based heuristics, we propose an automated design flow to quantize different layers with mixed precision. Our hardware-aware automatic quantization (HAQ) [6] models the quantization task as a reinforcement learning problem. We use the actor-critic model to produce the quantization policy (#bits per layer) (Figure 1(c)). The goal is not only high accuracy but also low energy and low latency. An intuitive reward would be FLOPs or model size. However, these proxy signals are indirect and do not necessarily translate to latency or energy improvement: cache locality, the number of kernel calls, and memory bandwidth all matter. Instead, we use direct latency and energy feedback from the hardware simulator. Such feedback enables our RL agent to learn the hardware characteristics of different layers: e.g., vanilla convolution has more data reuse and better locality, while depthwise convolution has less reuse and worse locality, which makes it memory-bound.
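To give a sense of scale for the mixed-precision design space, the count can be computed directly. The sketch assumes independent weight and activation bitwidths from 1 to 8 for every layer; the layer count used in the example is an arbitrary illustration, not a figure from the paper.

```python
def per_layer_choices(min_bits=1, max_bits=8):
    """Number of (weight bitwidth, activation bitwidth) pairs for one layer."""
    n = max_bits - min_bits + 1
    return n * n

def total_policies(num_layers, num_models=1, num_hardware=1):
    """Quantization policies to consider across models and hardware targets."""
    return num_hardware * num_models * per_layer_choices() ** num_layers

if __name__ == "__main__":
    print(per_layer_choices())                # choices for a single layer
    print(f"{total_policies(20):.3e}")        # one hypothetical 20-layer model
    print(f"{total_policies(20, 3, 3):.3e}")  # three models on three platforms
```

Even a single 20-layer model has far more policies than could be evaluated exhaustively, which is the motivation for a learned search.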

                          Inference latency on
                          HW1         HW2         HW3
Best Q. policy for HW1    16.29 ms    85.24 ms    117.44 ms
Best Q. policy for HW2    19.95 ms    64.29 ms    108.64 ms
Best Q. policy for HW3    19.94 ms    66.15 ms    99.68 ms
TABLE V: Inference latency of MobileNet-V1 [3] on three hardware architectures under different quantization policies. The quantization policy that is optimized for one hardware architecture is not optimal for the others. This suggests we need a specialized quantization solution for each hardware architecture. (HW1: spatial accelerator [14], HW2: edge accelerator [15], HW3: cloud accelerator [15], batch = 16.)
Fig. 3: Quantization policy under latency constraints for MobileNet-V1.
Fig. 4: HAQ improves the roofline performance of pointwise layers in MobileNet-V1.
                        Edge Accelerator        Cloud Accelerator
Method     Bitwidths    Top-1     Latency       Top-1     Latency
PACT       4 bits       62.44     45.45 ms      61.39     52.15 ms
Ours       flexible     67.40     45.51 ms      66.99     52.12 ms
PACT       5 bits       67.00     57.75 ms      68.84     66.94 ms
Ours       flexible     70.58     57.70 ms      70.90     66.92 ms
PACT       6 bits       70.46     70.67 ms      71.25     82.49 ms
Ours       flexible     71.20     70.35 ms      71.89     82.34 ms
Original   8 bits       70.82     96.20 ms      71.81     115.84 ms
TABLE VI: Latency-constrained quantization on the edge and cloud accelerator on ImageNet.

In real-world applications, we have limited resource budgets (i.e., latency, energy, and model size). We would like to find the quantization policy with the best performance given the resource constraint. We encourage our agent to meet the computation budget by limiting the action space. After our RL agent gives actions to all layers, we measure the amount of resources that will be used by the quantized model. The feedback is obtained directly from the hardware simulator. If the current policy exceeds our resource budget (on latency, energy or model size), we sequentially decrease the bitwidth of each layer until the constraint is finally satisfied.

We applied HAQ to three different hardware architectures to show the importance of a specialized quantization policy. Inference on edge devices and cloud servers can be quite different, since (1) the batch size on cloud servers is larger and (2) edge devices usually have limited computation resources and memory bandwidth. We use the embedded FPGA Xilinx Zynq-7020 as our edge device and the server FPGA Xilinx VU9P as our cloud device to compare the specialized quantization policies. Compared to fixed-bitwidth quantization (PACT [16]), our automated quantization consistently achieves better accuracy under the same latency (see Table VI). With similar accuracy, HAQ can reduce the latency by 1.4-1.95× compared with the baseline.

Our agent gave specialized quantization policies for the edge and cloud accelerators (Figure 3). The policy is quite different on different hardware. For the activations, the depthwise convolution layers are assigned much lower bitwidths than the pointwise layers on the edge device, while on the cloud device the bitwidths of these two types of layers are similar to each other. For weights, the bitwidths of these two types of layers are nearly the same on the edge, while on the cloud the depthwise convolution layers are assigned much higher bitwidths than the pointwise convolution layers.
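The budget-enforcement step, in which bitwidths are sequentially decreased until the resource constraint is met, can be sketched as below. The cost function is a toy placeholder for the hardware simulator's feedback, and the function and parameter names are hypothetical.

```python
def enforce_budget(bits, cost_fn, budget, min_bits=1):
    """Lower per-layer bitwidths one step at a time until cost_fn(bits) <= budget.

    bits:    proposed bitwidth per layer (e.g. the agent's actions)
    cost_fn: maps a bitwidth list to a resource cost (placeholder for the
             latency, energy or size reported by a hardware simulator)
    """
    bits = list(bits)  # do not mutate the caller's policy
    while cost_fn(bits) > budget:
        changed = False
        for i in range(len(bits)):
            if bits[i] > min_bits:
                bits[i] -= 1
                changed = True
                if cost_fn(bits) <= budget:
                    return bits
        if not changed:
            raise ValueError("budget cannot be met even at the minimum bitwidth")
    return bits

if __name__ == "__main__":
    layer_sizes = [100, 200, 400]  # toy per-layer parameter counts
    model_bits = lambda b: sum(n * w for n, w in zip(layer_sizes, b))
    print(enforce_budget([8, 8, 8], model_bits, budget=5000))
```

Sweeping the layers in order, rather than repeatedly cutting one layer, spreads the reduction across the network, although the order of the sweep still affects which layers lose precision first.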
We interpret the difference between the edge and cloud quantization policies using the roofline model. Operation intensity is defined as operations (MACs in neural networks) per DRAM byte accessed. A lower operation intensity indicates that the model suffers more from memory access. The bottom of Figure 3 shows the operation intensity (OPs per byte) of the convolution layers in MobileNet-V1. On the edge accelerator, which has much less memory bandwidth, our RL agent allocates fewer activation bits to the depthwise convolutions, since the activations dominate the memory access. On the cloud accelerator, which has more memory bandwidth, our agent allocates more bits to the depthwise convolutions and fewer bits to the pointwise convolutions to prevent them from being compute-bound. Figure 4 shows the roofline model before and after HAQ. HAQ gives a more reasonable policy for allocating the bits of each layer and pushes all the points toward the more efficient upper right corner.
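A rough calculation illustrates why depthwise layers behave differently from pointwise layers under the roofline model. The layer shape and hardware numbers are hypothetical, and the sketch assumes every weight and activation crosses DRAM exactly once, which ignores on-chip reuse.

```python
def op_intensity(macs, n_weights, n_acts, weight_bits=8, act_bits=8):
    """MACs per DRAM byte, assuming each weight and activation moves once."""
    bytes_moved = (n_weights * weight_bits + n_acts * act_bits) / 8.0
    return macs / bytes_moved

def attainable(intensity, peak_macs_per_s, bandwidth_bytes_per_s):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_macs_per_s, bandwidth_bytes_per_s * intensity)

# Hypothetical 14x14 feature map with 512 channels in and out.
H = W = 14
C = 512
depthwise = dict(macs=H * W * C * 9, n_weights=C * 9, n_acts=2 * H * W * C)
pointwise = dict(macs=H * W * C * C, n_weights=C * C, n_acts=2 * H * W * C)

if __name__ == "__main__":
    for name, layer in [("depthwise", depthwise), ("pointwise", pointwise)]:
        print(name, round(op_intensity(**layer), 1),
              round(op_intensity(**layer, act_bits=4), 1))
```

In this example the depthwise layer's traffic is dominated by activations, so lowering the activation bitwidth raises its operation intensity far more than it does for the pointwise layer, which is consistent with the edge policy described above.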

Method                      Bitwidth    Top-1    Latency
PACT                        4 bits      61.39    52.15 ms
Ours (search for V2)        flexible    66.99    52.12 ms
Ours (transfer from V1)     flexible    65.80    52.06 ms
PACT                        5 bits      68.84    66.94 ms
Ours (search for V2)        flexible    70.90    66.92 ms
Ours (transfer from V1)     flexible    69.90    66.93 ms
TABLE VII: Our RL agent generalizes well to different network architectures: the quantization policy transferred from MobileNet-V1 to MobileNet-V2 performs very close to the policy searched directly for MobileNet-V2. Both outperform the fixed-bitwidth baseline.

Finally, we evaluate the transfer ability of our framework: we first train our agent on one network (MobileNet-V1), then directly apply the agent to another network (MobileNet-V2) (see Table VII). The quantization policy transferred from MobileNet-V1 to MobileNet-V2 performs much better than the fixed-bitwidth baseline and is only slightly worse than the quantization policy searched directly for MobileNet-V2. This experiment validates that our RL agent generalizes well to different network architectures. That is to say, given a new model that the agent has not seen before, it can use its past knowledge to give a decent quantization policy, shortening the design cycle.

5 Conclusion

We present design automation techniques for efficient deep learning computing. There is plenty of room at the algorithm level to improve hardware efficiency, but the large design space is difficult for humans to exhaust. We covered three aspects of design automation: specialized model design, compression and pruning, and mixed-precision quantization. Such learning-based design automation outperformed rule-based heuristics. Our framework reveals that the optimal design policies on different hardware architectures are drastically different, so specialization is important. We interpreted those design policies and believe the insights will inspire future software and hardware co-design for efficient deep learning computing.