Deep Convolutional Neural Networks (CNNs) have achieved remarkable accuracy in many problems such as image classification, object detection, and machine translation. These state-of-the-art models are complex and deep, which makes them challenging to deploy in real-time applications on edge devices that are constrained by latency, energy, and model size. Therefore, many research directions have been proposed, such as hand-crafting efficient DNN models (MobileNet-V2 Sandler et al. (2018), ResNet50 He et al. (2016)) or quantizing the weights and activations of models.
For resource efficiency, researchers have found great success in representing the network with quantized bit-widths Zhou et al. (2016); Courbariaux et al. (2014). Most conventional methods quantize the model to a uniform bit-width across all layers Courbariaux et al. (2014). However, different layers exhibit different properties and have been shown to have different sensitivity to bit-width quantization Krishnamoorthi (2018). Thus, researchers have started to quantize layers with different bit-widths (layer-wise quantization) Wang et al. (2019). However, determining the optimal quantization bit-width for each layer is an extremely complex problem because of the massive design space. The size of the design space is determined by the number of layers (N), each with a flexible bit-width (32 options, assuming the quantization approach supports 1-bit to single-precision 32-bit). For example, a quantized MobileNet-V2 with N layers has a design space of $32^{N}$ possible solutions. Therefore many quantization strategies are formulated as rule-based heuristics and often require domain experts to tune the model. Moreover, this design space is unique for each HW configuration (i.e., a HW with a certain energy, latency, memory budget, and accuracy tolerance), meaning that when the underlying HW changes, which happens frequently because of the fast advancement of technology, the entire process has to be repeated.
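To give a feel for how quickly this design space grows, the following minimal sketch computes its size for a few layer counts. The layer counts are hypothetical values chosen only for illustration; they are not the layer counts of any model evaluated in this paper.

```python
import math


def design_space_size(num_layers: int, num_bitwidths: int = 32) -> int:
    """Number of distinct layer-wise bit-width assignments."""
    return num_bitwidths ** num_layers


def design_space_log10(num_layers: int, num_bitwidths: int = 32) -> float:
    """Order of magnitude of the design space (avoids printing huge integers)."""
    return num_layers * math.log10(num_bitwidths)


# Hypothetical layer counts, used only to show the growth rate.
for n in (10, 20, 50):
    print(f"N={n}: about 10^{design_space_log10(n):.1f} configurations")
```

Even a modest number of layers places exhaustive search far out of reach, which motivates the automated approaches discussed next.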
A recent work, HAQ Wang et al. (2019), utilized a reinforcement learning method to automate the search process, which leaves humans out of the design optimization loop. To meet different resource constraints, the authors of HAQ modified the actions taken by the agent. The HAQ framework also permits one to incorporate resource constraints into the reward function, as demonstrated by Tan et al. (2019); Tan and Le (2019); Kim et al. (2017). However, both methods have potential challenges: first, the designer needs expertise to design rules for a good action reduction strategy or to design the parameters of the incorporated reward function; second, the HW configuration can change frequently in practice, and each change requires relaunching the RL search process, which leads to large convergence times.
We believe that an ideal scenario when tuning a model for HW in practice is to have an agent that has learned the mapping from the quantization space to the model accuracy from prior samples, such that designers can interact with the educated agent in real-time while flexibly tuning the conditions (HW configuration/resource requirement).
In this paper, we take a key step towards realizing this ideal and propose a new autonomous framework for quantizing DNN models called Autonomous Quantization GAN (AQGAN). We make two key contributions: first, we enable the generation of quantized networks without requiring expert knowledge from either the model or the HW perspective; second, with our proposed AQGAN, we provide a new simplified HW-aware tuning flow, which reduces the search cost of finding the right quantization across different HW resource budgets.
In this work, we define “response contour” as the relationship between accuracy and quantization. For each DNN model, we learn the response contour, and employ several overfitting prevention techniques that we describe later. The response contour of design point (quantization configuration) to accuracy is an n-to-1 mapping, that is, several design configurations can yield the same accuracy. This opens up the opportunity to leverage the inverse 1-to-n mapping property through a generative model. Different HW configurations (e.g., TPU Jouppi et al. (2017), Eyeriss Chen et al. (2016b), and ShiDiaNao Du et al. (2015)) have different HW resource budgets (e.g., memory capacity, bandwidth, etc.), and hence require different quantization strategies to fit the models. Therefore, we build a conditional GAN (cGAN) based framework in which the designer only needs to specify the conditioned accuracy, and the agent generates a set of different design points for the designer in real-time. With these alternatives, the designer can pick a design point based on the HW configuration at hand. Contrary to RL or optimization-based works, the designer can interact with the agent in real-time with different conditioning accuracy numbers. Following the initial creation of the generative model, no additional training is involved in this interaction. Our proposed Autonomous Quantization GAN (AQGAN) enables the generation of quantization configurations by conditioning on the designer’s input accuracy number, thus enabling a data-driven inverse design process.
The primary contributions of this paper are:
AQGAN: A new DNN model quantization framework is proposed. Based on the algorithmic foundation of cGAN, we build a framework that conditions on a continuous value of model accuracy to generate a set of quantization configurations. We further enhance the work by training a forward model and incorporating it into the training loop of AQGAN.
Hardware-aware: We demonstrate how to manage different resource constraints. By referring to the resource consumption model, the designer picks the feasible configuration and can interact with the agent to explore different solutions.
Quantitative results: We experiment on various widely-used efficient models, including MobileNet-V2 Sandler et al. (2018), ResNet-18 He et al. (2016), ResNeXt Xie et al. (2017), ProxylessNAS Cai et al. (2018), and MnasNet Tan et al. (2019). We compare the performance with a conventional uniform quantization algorithm, and we compare the performance and search time with the state-of-the-art autonomous quantization method. The experiments show that across all models, resource constraints, and conditioning accuracies, the generated set of quantized models can achieve an accuracy within 3.5% of that requested.
2 Related Work
Quantization Algorithms. For resource efficiency, researchers have found great success in representing the network with quantized bit-widths. Deep Compression Han et al. (2015) quantized the values into bins, where each bin shares the same weight, and only a small number of indices are required. Courbariaux et al. (2014) directly shifted the floating point to fixed point and integer values. Many works showed that 8-bit can be an empirical quantization strategy, and Gysel et al. (2018) presented a fine-tuning method for quantization. To exploit the benefits of low bit-width, DoReFa-Net Zhou et al. (2016) retrained the network after quantization and enabled quantized backward propagation. To find the best quantization at inference time, quantization-aware training has been widely used: Krishnamoorthi (2018) quantized the network to hardware-friendly bit-widths, and Jacob et al. (2017) optimized the co-designed training procedure with integer arithmetic inference. Most works quantize the model in one step. Zhou et al. (2017) utilizes a multi-step incremental quantization strategy, which leads to higher accuracy, albeit taking longer.
Autonomous flexible quantization. A fine-grained approach, layer-wise quantization, could achieve more aggressive quantization. HAQ Wang et al. (2019) drives the search for flexible quantization with reinforcement learning in a DDPG Lillicrap et al. (2015) manner, which it uses to figure out the quantization configuration of each layer. HAQ optimizes the configuration to fit the defined target function, which is the Top-1 accuracy difference to the original (full-precision) model. However, the challenges of reinforcement learning methods are that they take a large number of epochs to converge and that they need to be retrained when the target (HW resource budget) changes. This becomes significant when considering HW-aware quantization in this era of booming new HW platforms.
2.2 Generative Adversarial Networks (GANs)
Generative Adversarial Nets (GANs) Goodfellow et al. (2014), based on a game theory approach, contain a generative network (generator) and an adversarial network (discriminator). The generative model is pitted against the discriminator, which learns to determine whether a sample is from the generator or the training data. Instead of building multiple GAN networks for different classes of images, the fundamental GAN can be augmented by incorporating a condition Mirza and Osindero (2014); Van den Oord et al. (2016). Both the generator and discriminator take in the condition in the training phase, so that the mapping from different conditions to different distributions can be learned. Similar approaches such as ACGAN Odena et al. (2017), infoGAN Chen et al. (2016a), and others Odena (2016); Ramsundar et al. (2015) task the discriminator to reconstruct the class label taken by the generator rather than feeding the condition to the discriminator. GANs have tackled the tasks of image generation Goodfellow et al. (2014); Chen et al. (2016a); Odena et al. (2017), image prediction Yoo et al. (2016), text-to-image synthesis Reed et al. (2016), and sequence generation Yu et al. (2017). Conditional GANs force the output to be conditioned on the input, and different conditions such as classification label Goodfellow et al. (2014), text Reed et al. (2016), beauty-level of image Diamant et al. (2019), and the image itself Isola et al. (2017) have been applied to different applications. Most GAN works have focused on improving and extending image-based applications such as super-resolution image generation Wang et al. (2018); et al. (2017); Almahairi et al. (2018). This work applies GAN to the generation of quantization configurations, conditioned on the user’s input accuracy number.
We model the DNN quantization problem as a generative model training problem. We build a condition-based GAN infrastructure, Autonomous Quantization GAN (AQGAN), which inverts the conventional DNN quantization process: conditioned on the targeted accuracy, it generates a quantization configuration. Figure 1 shows the system overview. We develop and incorporate several useful techniques to prevent overfitting and mode collapse. We evaluate the generative model with the ground-truth environment and show only 3.5% error on average.
3.1 Experience collection
The first step of AQGAN training is to build a ground-truth environment, which is a quantizing procedure of a target model (e.g., quantized MobileNet-V2). On each target model, we apply quantization-aware training to each layer of the target models. We quantize both weights and activations. A complete quantization configuration will describe the bit-width allocation of each layer, which we feed into the built ground-truth model and receive the post-quantized Top-1 accuracy number. This is used to train the generative model in AQGAN.
An example of a design point (X, Y) is as follows. Assuming an N-layer DNN, X is the quantization configuration, a 1-by-N sequence whose values range from 1 to 32, indicating the bit-width of each layer. Y is the corresponding accuracy number of the model quantized by this X configuration. A design point (X, Y) is one point in the dataset, which we further split into training and testing sets. We apply random sampling, which has been shown to be a competitive approach for efficiently sampling through the response contour Such et al. (2017), and generate a dataset of design-point-to-response pairs. With the dataset at hand, we apply an 80-20 split to form the training and testing sets. The testing set is never seen in any training process of Generator, Discriminator, or Quantizer, which we will describe later. In the vast design space (more than billions of design points), we found that the designed model can achieve competitive performance on the left-out testing set while we only collect 20K design points as a dataset in each ground-truth environment of modern models, including MobileNet-V2, ResNet-18, ResNeXt, ProxylessNAS, and MnasNet. (We empirically find that 20K design points are enough, i.e., the point where the marginal performance gain starts to flatten, for the models we investigated.)
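The experience collection step can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: `evaluate_accuracy` is a hypothetical callback standing in for the ground-truth environment, and `toy_accuracy` is a placeholder that does not model any real network.

```python
import random


def sample_configuration(num_layers, rng, min_bits=1, max_bits=32):
    """Draw one layer-wise bit-width assignment X uniformly at random."""
    return [rng.randint(min_bits, max_bits) for _ in range(num_layers)]


def collect_experience(num_layers, num_points, evaluate_accuracy, seed=0):
    """Build a list of (X, Y) design points by random sampling."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_points):
        x = sample_configuration(num_layers, rng)
        dataset.append((x, evaluate_accuracy(x)))
    return dataset


def split_dataset(dataset, train_fraction=0.8, seed=0):
    """Shuffle a copy of the dataset and split it into training and testing sets."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def toy_accuracy(x):
    """Placeholder for the ground-truth environment (NOT a real accuracy model)."""
    return sum(x) / (32.0 * len(x))


data = collect_experience(num_layers=8, num_points=100, evaluate_accuracy=toy_accuracy)
train_set, test_set = split_dataset(data)
print(len(train_set), len(test_set))
```

In practice the callback would quantize the target model with the sampled configuration and return its measured Top-1 accuracy, which is the expensive part of this step.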
3.2 Condition labeling
With the training set at hand, we label the data by normalizing the response accuracy number to the continuous range of [0, 1]. The labels are taken by AQGAN as a condition for generating the quantization configuration.
The designer with a well-trained generator on a target model will serve the targeted accuracy number to AQGAN as an input. AQGAN normalizes the targeted accuracy to its label range [0, 1], condition on this label, and generates the corresponding quantization configuration.
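A minimal sketch of the labeling step is given below. The text only states that accuracies are normalized to [0, 1]; the min-max form and the `acc_min`/`acc_max` bounds used here are assumptions made for illustration.

```python
def normalize_accuracy(acc, acc_min, acc_max):
    """Map an accuracy value to a label in [0, 1], clipping out-of-range inputs."""
    if acc_max <= acc_min:
        raise ValueError("acc_max must be greater than acc_min")
    label = (acc - acc_min) / (acc_max - acc_min)
    return min(1.0, max(0.0, label))


def denormalize_label(label, acc_min, acc_max):
    """Map a label in [0, 1] back to the accuracy domain."""
    return acc_min + label * (acc_max - acc_min)


# Assumed bounds: accuracies expressed in percent, between 0 and 100.
label = normalize_accuracy(50.0, acc_min=0.0, acc_max=100.0)
print(label)
```

The same normalization would be applied both when labeling the training set and when converting a designer's target accuracy into a condition.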
3.3 AQGAN structure
We train our GAN infrastructure with two-step training. First, we train the Quantizer, an accuracy-predicting model that serves as an auxiliary instructor in the next step, which is inspired by the auxiliary classifier in ACGAN Odena et al. (2017) and the classification of a discriminator in infoGAN Chen et al. (2016a). Second, we train the generator and discriminator with Quantizer as an auxiliary instructor.
The Quantizer is built as an MLP model. We train Quantizer with the training dataset as a regression problem, which takes the design points (quantization configuration) and regresses the value of the corresponding label, i.e., the normalized accuracy number. In the evaluation phase, Quantizer takes in a design point in the testing set and predicts its corresponding label. We evaluate the performance of Quantizer by the L1-loss to the true label, and our model achieves under 3.5% error across all the DNN models considered. The intuition of our design is that we first stabilized the predictor before the iterative discriminator and generator training loop to relieve the stress of discriminator, which usually plays a crucial role in the convergence of GAN.
3.3.2 Overfitting prevention.
When training the Quantizer, we include conventional overfitting prevention methods such as batch normalization, dropout, and early stopping to strike a balance between variance and bias owing to the difference in model complexity. In addition, we use an ensemble method comprising multiple Quantizers to guide the training of the generative model. We train multiple Quantizers with different complexity, from thin MLPs to wide MLPs. In the experiments, we train 4 Quantizers, each with three layers and with 64, 128, 256, and 512 nodes on each layer, respectively. The number of Quantizers is decided empirically; we found that 4 Quantizers were sufficient. They are trained individually with the same training set. The intuition is that the thin MLPs capture more general features while the wide MLPs capture more sophisticated ones. After training, the parameters of this group of Quantizers are fixed, and they become the group of instructors for the generator in the next step.
The training of Generator and Discriminator starts at the second step. The Generator is built with MLP infrastructure with output dimension $N$ and input dimension $Z + C$, where $N$ is the length of the quantization configuration that needs to be estimated, which is also the number of layers/blocks of the model that we are quantizing; $Z$ is the dimension of the latent space, which is a $Z$-dimensional Gaussian noise, and we take $Z = 10$ in our experiments; $C$ is the dimension of the condition, where we have $C = 1$ because we use one continuous label as the condition. When training Generator, we sample a batch of fake labels in the range of [0, 1] as the condition and a batch of noise in $\mathbb{R}^{Z}$ as the latent input, both of which are fed to Generator. We then collect the $N$-dimensional outputs, each of which is a fake quantization configuration. Assuming we are quantizing a DNN with $N$ layers, the output space of the generative model is a 1-by-N sequence whose values range from 1 to 32, indicating the bit-width.
Discriminator is also built with MLP infrastructure with output dimension $1$ and input dimension $N$, where the single output is the judgment of true or fake and $N$ is the same dimension as the Generator outputs. Discriminator is trained as a standard discriminator, which takes the $N$-dimensional design points from the training set and the $N$-dimensional outputs from Generator and is tasked to judge whether they are true (from the training set) or fake (from Generator).
3.4 GAN training
In the second step of training, the parameters of Quantizers are not updated. We use three kinds of loss in our framework: the Discriminator, Generator, and Quantizers losses, which are used to estimate the parameters of the Discriminator and the Generator.
3.4.1 Quantizers loss:
The core part of the framework that drives the learning process is the Quantizers loss. In each iteration of training, we use a batch of fake labels $y_{fake}$ to let Generator output fake data. We feed the generated fake data into the group of Quantizers, which predict the corresponding label values $\hat{y}_1, \dots, \hat{y}_M$, where $M = 4$ in our setting. The Quantizers together play the instructors’ role, teaching Generator to generate fake data that leads to the desired predicted label value. We use mean square error (MSE) loss. The Quantizers loss is the sum of each loss term as follows:

$$L_{Q} = \sum_{i=1}^{M} \mathrm{MSE}\left(\hat{y}_i, y_{fake}\right)$$

It is worth noting that we have $M$ corresponding predicted labels with the same $y_{fake}$ and the same Generator. We let each of these instruct the Generator to prevent Generator from overfitting to any single Quantizer. Also, the wider Quantizers instruct Generator to deal with sophisticated generation tasks, and the thinner Quantizers regularize Generator to build more general generation rules.
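The summed ensemble loss described above can be sketched in plain Python as follows. The Quantizers here are hypothetical toy callables standing in for the frozen MLPs, and the sketch only computes the loss value; in actual training, this quantity would be backpropagated through the frozen Quantizers into Generator by an automatic differentiation framework.

```python
def mse(predictions, targets):
    """Mean square error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def quantizers_loss(quantizers, fake_configs, fake_labels):
    """Sum of per-Quantizer MSE between predicted labels and conditioning labels."""
    total = 0.0
    for quantizer in quantizers:
        predicted = [quantizer(x) for x in fake_configs]
        total += mse(predicted, fake_labels)
    return total


# Toy stand-ins for two frozen Quantizers (NOT trained accuracy predictors).
toy_quantizers = [
    lambda x: sum(x) / (32.0 * len(x)),
    lambda x: min(x) / 32.0,
]

fake_configs = [[32, 32], [16, 16]]  # generated bit-width sequences
fake_labels = [1.0, 0.5]             # labels the Generator was conditioned on
loss = quantizers_loss(toy_quantizers, fake_configs, fake_labels)
print(loss)
```

Because every Quantizer contributes its own term, a Generator output that satisfies only one predictor still incurs loss from the others, which is the regularizing effect the ensemble is meant to provide.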
Discriminator and Generator loss: Discriminator loss measures how well Discriminator classifies samples from the training set as true and samples from Generator as fake, and Generator loss measures how well Generator makes Discriminator misclassify. Discriminator and Generator losses are defined as in a conventional GAN. We compute the loss with the Wasserstein distance and leverage a gradient penalty to stabilize the training and prevent mode collapse in our experiments.
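For reference, the widely used form of the Wasserstein loss with gradient penalty is written out below. The text does not spell out the exact objective used, so this is the standard formulation rather than a statement of the precise losses in our implementation; the penalty coefficient $\lambda$ and the interpolated samples $\hat{x}$ (drawn on straight lines between real samples $x$ and generated samples $\tilde{x}$) are the usual ingredients of that formulation, and $z$ and $y$ denote the noise and the condition fed to Generator $G$.

```latex
L_{D} = \mathbb{E}_{\tilde{x} \sim P_{g}}\left[ D(\tilde{x}) \right]
      - \mathbb{E}_{x \sim P_{r}}\left[ D(x) \right]
      + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
        \left[ \left( \left\lVert \nabla_{\hat{x}} D(\hat{x}) \right\rVert_{2} - 1 \right)^{2} \right],
\qquad
L_{G} = - \mathbb{E}_{z,\, y}\left[ D\big(G(z, y)\big) \right]
```

Here $P_{r}$ is the distribution of design points in the training set and $P_{g}$ is the distribution of configurations produced by Generator.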
With the three training losses defined, we train the model iteratively in this second training step. The insight of this framework is that with a good pre-trained instructor, the stress on the Discriminator is relieved so that the Discriminator can keep pace with the improvements of Generator, thus avoiding mode collapse.
3.5 Generative process and evaluation
After AQGAN is trained, the designer can use the Generator as a conditional generative model.
3.5.1 Evaluation and performance index.
The model is trained on the ground-truth environment, where the target model is defined and the parameter initialization seed is set. During testing, we randomly sample a batch of target accuracies and normalize them into the label domain [0, 1], denoted as $y_{t}$. We feed those labels into AQGAN as the condition and collect the batch of outputs from Generator, which are fake configurations (generated quantization configurations). We evaluate those fake configurations by testing them in the ground-truth environment, i.e., we apply those configurations to quantize the target DNN model and gather the corresponding Top-1 accuracy, which is the ground-truth accuracy ($y_{gt}$). We use the L1-loss as the performance index of the model. The loss of the generative model, $L_{1}$, is defined as

$$L_{1} = \frac{1}{K} \sum_{k=1}^{K} \left| y_{gt}^{(k)} - y_{t}^{(k)} \right|,$$

where K is the batch size.
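The performance index amounts to a mean absolute difference over the batch, as in the following minimal sketch. The function and argument names are illustrative, and both inputs are assumed to be expressed in the same (normalized) domain.

```python
def l1_performance_index(target_accuracies, ground_truth_accuracies):
    """Mean absolute difference between requested and measured accuracies."""
    if len(target_accuracies) != len(ground_truth_accuracies):
        raise ValueError("both batches must have the same size K")
    k = len(target_accuracies)
    if k == 0:
        raise ValueError("batch must not be empty")
    total = sum(abs(t - g) for t, g in zip(target_accuracies, ground_truth_accuracies))
    return total / k


# Hypothetical batch of K = 2 targets and the accuracies measured for them.
index = l1_performance_index([0.5, 0.7], [0.52, 0.66])
print(index)
```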
3.5.2 Generative design process.
The designer specifies a target accuracy as the condition to the generative model, together with the number of outputs to be generated. The accuracy is normalized internally and AQGAN produces a batch of outputs, which are the set of valid quantization configurations predicted to meet the specified condition.
3.5.3 Implementation details
The backbones of Quantizer, Generator, and Discriminator are MLPs with dropout and batch normalization. We apply a dropout rate of 50% and a batch size of 256. For Generator, we use a latent space of dimension 10. We use the Wasserstein distance as our loss function.
3.6 Autonomous quantization
The benefits of having a generative model propose designs are twofold: first, it automates finding a flexible quantization bit-width for each layer; second, it simplifies the HW-aware tuning process.
3.6.1 Autonomous flexible quantization
Owing to the sensitivity of each layer, a uniform quantization implementation of the DNN may not be sufficient to meet the accuracy constraints sought by the designer. To flexibly search the design space, an optimal quantization configuration is needed for each layer of the DNN. However, as the search space is massive, it is not possible to exhaustively search for the optimal answer. Hence, intelligent autonomous methods such as the RL-based HAQ have been designed.
Our work, a generative model, is also a framework for generating the quantization configuration autonomously. Taking in the target accuracy, our work generates a set of design proposals that meet the targeted accuracy. The difference between HAQ (RL-based method) and ours (generative model) is the search cost and the number of designs that can be generated with each search. The RL-based method could take hundreds to thousands of epochs to converge to a single design point. With a generative model, however, multiple design proposals are generated instantaneously during inference. While RL searches and trains simultaneously, the generative model relies on previously gathered data; once trained, it can generate any number of design points during inference and for different accuracy conditions.
3.6.2 HW-aware tuning
The goal of quantization is to create a HW-efficient DNN model with good HW performance in terms of memory, energy, or latency. Thus, quantization of a DNN model occurs with a specification of the HW budget (resource). In an RL-based method, a HW performance index can be incorporated into the target function to drive the search process, or the recommended actions can be modified as in HAQ. However, both of these approaches require a pre-defined HW budget. The search is thus driven by the target function (accuracy number) and the HW budget. However, the HW budget often changes, for example with the energy constraints of different systems or the memory capacities of different platforms. Traditional approaches require a new search for each different HW consideration, limiting the ability to transfer any knowledge gained.
To overcome this limitation, we propose a new simplified design flow for HW-aware tuning. A trained generative model generates a set of proposals with similar accuracy numbers but different HW requirements. Our HW-aware tuning process is as follows: First, the model generates a set of design proposals based on the designer’s desired target accuracy. Second, the designer ranks the proposals by the HW performance that is to be optimized. Finally, the designer selects the one that best fits the HW budget. These three steps can be accomplished with a simple and fast program, which only sorts the generated design alternatives and selects one or more that meet the hardware constraints.
With our AQGAN-based HW-aware tuning flow, the generative model could be distributed to different designers irrespective of their application, i.e., both an ML practitioner working on IoT devices and an ML practitioner working on cloud acceleration would utilize the same generative model. The only difference is that they run the HW-aware tuning flow with different HW budgets/constraints.
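The rank-and-select part of this flow can be sketched in a few lines. The example below uses parameter memory as the HW metric; `params_per_layer`, the proposals, and the budget are hypothetical values chosen for illustration, and other metrics such as latency or energy could be substituted for the memory model.

```python
def parameter_memory_bits(config, params_per_layer):
    """Parameter memory of a model whose layer i stores its weights in config[i] bits."""
    return sum(bits * count for bits, count in zip(config, params_per_layer))


def hw_aware_select(proposals, params_per_layer, memory_budget_bits):
    """Rank proposals by parameter memory and keep those within the budget."""
    ranked = sorted(proposals, key=lambda c: parameter_memory_bits(c, params_per_layer))
    return [c for c in ranked
            if parameter_memory_bits(c, params_per_layer) <= memory_budget_bits]


# Hypothetical 3-layer model and three generated proposals of similar accuracy.
params_per_layer = [100, 200, 50]
proposals = [[8, 8, 8], [4, 6, 8], [6, 4, 8]]

feasible = hw_aware_select(proposals, params_per_layer, memory_budget_bits=2100)
print(feasible)
```

The designer would then pick from the feasible list according to the chosen criterion, for example the smallest footprint or the proposal closest to the budget.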
4.1 Datasets and models
As our focus is to provide flexible bit-width allocation for different layers, we apply standard range-based linear quantization for each layer. We pick the scaling factor based on the range of the current tensor (weights/activations): $s = \frac{2^{n} - 1}{x_{max} - x_{min}}$, where $n$ is the quantization bit-width and $x_{max}$ and $x_{min}$ are the maximum and minimum values in the tensor. The quantization formula becomes $x_{q} = \mathrm{round}(s \cdot x_{f})$, where $x_{q}$ and $x_{f}$ indicate a post-quantized and pre-quantized value in a tensor. There are advanced quantization algorithms such as DoReFa Zhou et al. (2016) with low bit-width gradients, PACT Choi et al. (2018) with parameterized clipping activation, WRPN Mishra et al. (2017) with wide reduced-precision networks, QIL Jung et al. (2019), and HAWQ-V2 Dong et al. (2019). These papers work on the quantization algorithm itself, while the bit-widths need to be exhaustively searched, manually assigned, or assigned by heuristics. In this work, we have an orthogonal goal. Our focus is to automate the process of bit-width assignment, and we apply the basic range-based quantization scheme. The two techniques can, however, be combined in an extension of this work to reach better results. HAQ Wang et al. (2019), which uses RL to automate the process and uses the basic quantization scheme, is the most closely related work to ours. Therefore we pick HAQ as our main comparison.
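A minimal sketch of range-based linear quantization on a flat list of values is given below. It assumes the scaling factor $(2^{n} - 1) / (x_{max} - x_{min})$ and quantization by rounding the scaled value; zero-point handling and clipping are omitted, so this should be read as an illustration of the idea rather than the exact scheme used in the experiments.

```python
def linear_quantize(tensor, num_bits):
    """Range-based linear quantization of a flat list of floats.

    Returns the quantized integers and the scaling factor (assumed form).
    """
    x_max, x_min = max(tensor), min(tensor)
    if x_max == x_min:
        raise ValueError("tensor has zero range; scaling factor is undefined")
    scale = (2 ** num_bits - 1) / (x_max - x_min)
    return [round(scale * x) for x in tensor], scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate floating-point values."""
    return [q / scale for q in quantized]


weights = [0.0, 0.2, 1.0]  # hypothetical weight tensor
q, s = linear_quantize(weights, num_bits=8)
print(q, s)
print(dequantize(q, s))
```

Lowering `num_bits` shrinks the scaling factor and therefore the number of distinct levels available to the layer, which is the knob the layer-wise bit-width assignment controls.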
4.2 Comparisons of related works
4.2.1 One-step quantization.
First, we compare our work with the one-step quantization methods applied in Chen et al. (2015); Gupta et al. (2015) with uniform bit-width, which we name One-step-Q. One-step-Q is performed with the same procedure as our framework, except that the models are quantized with a uniform bit-width.
4.2.2 Autonomous quantization
The performance of our autonomous quantization method is compared against HAQ, which also searches for the quantization bit-width of each layer. We apply the same setting as in the original paper, with the modification that HAQ is able to target different accuracies.
4.3 Memory-constrained quantization
We construct the experiments such that the hardware considered is constrained by the DNN parameter memory computed post-quantization. We compare our results to One-step-Q, which quantizes the model instantly without the need for expensive searches. We apply 8-bit, 6-bit, and 4-bit quantization by One-step-Q to quantize five different models and list their parameter memory sizes and model accuracies in Table 1. Then we examine the accuracy AQGAN can achieve under similar parameter memory constraints. The flow is as follows.
The generator proposes the set of model configurations when constrained by the desired target accuracy. Following this, we could apply the HW-aware tuning flow of AQGAN, i.e., we pick the designs that are closest to the memory size of the One-step-Q method and report the accuracy that AQGAN targets. We also evaluate the maximum accuracy that can be achieved given a resource constraint. This is realized by searching through the possible set of accuracies for designs that meet a specified resource constraint.
Table 1 shows that, given a memory constraint, our work leads to designs with higher accuracy in most cases. In very few cases, our approach degrades the accuracy by less than 1%. The effectiveness of our work is more apparent at low bit-widths. In the 4-bit cases, One-step-Q has poor accuracy, and sometimes the model has an accuracy of 0%. However, with flexible bit assignment, in most cases our work can generate a quantized model with much higher accuracy. It is worth noting that unlike DoReFa Zhou et al. (2016), PACT Choi et al. (2018), WRPN Mishra et al. (2017), QIL Jung et al. (2019), and HAWQ-V2 Dong et al. (2019), whose bit-width strategies need to be exhaustively searched or manually assigned, and HAQ Wang et al. (2019), whose bit-width strategy needs a search process through an RL algorithm, AQGAN generates the bit-width strategy with a generative model at inference time without retraining.
4.4 Autonomous flexible quantization
We construct experiments that perform autonomous flexible quantization and target different accuracies. We compare it with HAQ. We construct the target function to target different accuracy numbers being 0.3, 0.4,…, 1 times that of the original accuracy.
Experiment detail: HAQ. As indicated by Table 2, there are 8 tasks (columns) for each algorithm (HAQ, AQGAN). For each task, the algorithm is required to quantize the model such that its accuracy is close to the target accuracy number (e.g., 0.3 times the original accuracy for the first task in Table 2). For HAQ, we modify the reward function used in training. Rather than always maximizing the accuracy, which corresponds only to the last task (column) of our experiments, we define the reward so as to minimize the distance between the target accuracy and the achieved accuracy.
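One simple way to express such a distance-based reward is sketched below. The exact functional form used for HAQ in these experiments is not given in the text, so the negative absolute distance here is an assumption, and the original accuracy value is a hypothetical number used only to show how the eight targets are derived.

```python
def distance_reward(achieved_accuracy, target_accuracy):
    """Reward that peaks (at zero) when the achieved accuracy hits the target."""
    return -abs(achieved_accuracy - target_accuracy)


def task_targets(original_accuracy, ratios):
    """Target accuracies defined as fractions of the original accuracy."""
    return [ratio * original_accuracy for ratio in ratios]


# The eight tasks: 0.3, 0.4, ..., 1.0 times the original accuracy.
ratios = [round(0.3 + 0.1 * i, 1) for i in range(8)]
targets = task_targets(original_accuracy=70.0, ratios=ratios)  # hypothetical 70% Top-1
print(ratios)
print(distance_reward(achieved_accuracy=20.0, target_accuracy=targets[0]))
```

With this reward, overshooting the target is penalized as much as undershooting it, which is what distinguishes these tasks from plain accuracy maximization.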
Experiment detail: AQGAN.
For our work (conditional GAN-based), we simply use the target accuracy as the condition, sample 50 points from a normal distribution as noise, and generate 50 outputs (quantization configurations). We then evaluate the accuracy of each of the 50 quantization configurations. We calculate their distances to the target accuracy as the error range. Therefore, each recorded value takes the form of the target value ± the error range, as in the first column of Table 2.
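The bookkeeping for this experiment can be sketched as follows. The use of a standard normal distribution for the noise and of the largest absolute deviation as the error range are assumptions, since the text does not pin down either choice, and the measured accuracies in the example are made-up values rather than experimental results.

```python
import random


def sample_noise(num_samples, latent_dim, seed=0):
    """Draw latent vectors from a standard normal distribution (assumed parameters)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(num_samples)]


def error_range(achieved_accuracies, target_accuracy):
    """Largest absolute deviation from the target across generated configurations."""
    return max(abs(a - target_accuracy) for a in achieved_accuracies)


noise = sample_noise(num_samples=50, latent_dim=10)

# Made-up accuracies for three generated configurations, for illustration only.
measured = [0.28, 0.31, 0.33]
print(f"0.30 ± {error_range(measured, 0.30):.2f}")
```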
Discussion. According to Table 2, HAQ can achieve a model with an accuracy close to the target accuracy in most cases. Our work generates a set of model configurations that are within close margins of the requested accuracy, as recorded in the table. Both works perform well when targeting an accuracy number. However, for each target-accuracy experiment, HAQ needs about 98 GPU minutes to complete the search task, while our work needs less than 1 GPU minute to execute an inference as a generative model. We also conduct a similar experiment on MobileNet-V2. The search cost for HAQ is about 96 GPU minutes, while our work stays under 1 GPU minute.
The search time cost plays a significant role when ML practitioners deploy models on different platforms that target different use cases with different resource budgets such as energy/latency, where the required accuracy level is different. The search cost is the time they need to pay for a new design point in a new hardware configuration.
4.5 The effectiveness of AQGAN
In this section, we inspect the effectiveness of the set of design proposals generated by our model. In the evaluation, we condition on different target accuracy numbers and gather the quantization configurations that our model generates. We collect the statistics of the parameter memory size of those proposals. The distribution of their model sizes is shown in Figure 2. The same conditioning accuracy leads to a bag of proposals with a range of different parameter memory sizes. Those are the generated design points (inference results) the ML practitioner can choose from. Similarly, we show the distribution of the associated activation memory size for different target accuracies in Figure 3. Likewise, we could also examine their distribution of latency, energy, or other HW performance metrics that the ML practitioners are targeting. With this generative model at hand, the only task left to ML practitioners is to rank the proposals quickly and select one or more of them.
4.6 In-depth evaluation of AQGAN
We evaluate how our works performs on average in all the five models we apply to. We show the time for each phase of training. We examine two performance index the first being the Top-1 accuracy that the generated model configurations can achieve and the second, is the L1-loss described in subsection 4.6. We measure the L1-loss, , by random sampling a batch of desired accuracies. We use the generative model to generate a set of quantization configurations which are applied to the ground-truth model, thus generating the evaluation data. Finally, we compute the L1-loss by comparing the ground-truth accuracy with that requested by the designer, i.e., the target accuracy.
Table 4 shows the time taken for each phase of the generative model. Experience collection takes about 108 GPU hours, and training takes about 25 GPU hours. However, both are one-time costs: once the experience is collected and the generative models are trained, we can use the same model in different hardware settings to automatically generate quantization configurations or execute HW-aware tuning.
For the Top-1 accuracy, we observe a small accuracy drop. This is because we apply one-step training with a basic range-based linear quantization algorithm. A better Top-1 accuracy could be achieved by applying more advanced methods in those two aspects. However, the goal of this work is to discuss the effectiveness of the generative model approach and its fast HW-aware tuning flow. Hence, we leave the evaluation of advanced techniques as future work.
For the mean L1-loss, we observe that each model reports a loss under 3.5%, with an average of 2.9% across the five models.
Explanation of the results: When a designer is given a trained AQGAN and conditions on a target accuracy, the model generates a set of quantization configurations whose mean L1-loss to the target accuracy is, on average across the five models, under 3%. The designer can then follow the HW-aware tuning flow to pick one of them based on the resource constraints at hand.
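The selection step of the HW-aware tuning flow can be illustrated with a short sketch. It assumes the designer has a cost function for the resource of interest (here, parameter memory in bytes) and a budget, filters the generated proposals to those within the budget, and ranks the survivors by cost. The helper `select_under_budget`, the layer sizes, and the proposals are hypothetical; a real flow could rank on latency, energy, or a combination of metrics instead.

```python
from typing import Callable, List

Config = List[int]  # one bit-width per layer


def select_under_budget(
    proposals: List[Config],
    cost: Callable[[Config], float],
    budget: float,
    top_k: int = 1,
) -> List[Config]:
    """Keep proposals whose cost fits the budget and return the cheapest top_k."""
    feasible = [p for p in proposals if cost(p) <= budget]
    return sorted(feasible, key=cost)[:top_k]


# Hypothetical per-layer parameter counts for a 3-layer model.
LAYER_PARAMS = [1000, 2000, 500]


def memory_bytes(config: Config) -> float:
    # Assumed cost model: parameters times bit-width, converted to bytes.
    return sum(n * b for n, b in zip(LAYER_PARAMS, config)) / 8


# Hypothetical proposals generated for one target accuracy.
candidates = [[8, 8, 8], [8, 4, 8], [4, 4, 4]]
chosen = select_under_budget(candidates, memory_bytes, budget=3000, top_k=2)
print(chosen)
```

Because all proposals were generated for the same target accuracy, ranking by resource cost alone is a reasonable first pass. The designer may still want to re-check the accuracy of the chosen configurations.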
Quantization is a key technique for enhancing the energy efficiency of DNN models inside accelerators. Unfortunately, the design space of optimal quantization values for each layer is extremely large, as it depends on the number of layers in the model and the number of quantization levels the hardware can support. Moreover, multiple quantization configurations with the same accuracy can have vastly different memory requirements, making it challenging to design an automated framework that finds the right quantization values for a specific hardware platform.
In this work, we propose a generative design method, AQGAN, to autonomously quantize DNN models. Our key novelty is to leverage conditional GANs to learn the relationship between accuracy and quantization, and to generate quantization values for a given target accuracy. Leveraging AQGAN, we propose a new, simplified HW-aware tuning flow that enables rapid HW-aware DNN deployment on both ASIC and FPGA platforms.
- Augmented cyclegan: learning many-to-many mappings from unpaired data. arXiv preprint arXiv:1802.10151.
- Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332.
- Compressing neural networks with the hashing trick. In International conference on machine learning, pp. 2285–2294.
- Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180.
- Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE journal of solid-state circuits 52 (1), pp. 127–138.
- Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085.
- Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024.
- Beholder-gan: generation and beautification of facial images with conditioning on their beauty level. arXiv preprint arXiv:1902.02593.
- HAWQ-v2: hessian aware trace-weighted quantization of neural networks. arXiv preprint arXiv:1911.03852.
- ShiDianNao: shifting vision processing closer to the sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 92–104.
- Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
- Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746.
- Ristretto: a framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Transactions on Neural Networks and Learning Systems 29 (11), pp. 5784–5789.
- Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134.
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. arXiv preprint arXiv:1712.05877.
- In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12.
- Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4350–4359.
- Nemo: neuro-evolution with multiobjective optimization of deep neural network for speed and accuracy. In ICML 2017 AutoML Workshop.
- Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342.
- Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- WRPN: wide reduced-precision networks. arXiv preprint arXiv:1709.01134.
- Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2642–2651.
- Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583.
- Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072.
- Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.
- Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567.
- Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828.
- EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
- Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pp. 4790–4798.
- HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620.
- Esrgan: enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV).
- Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500.
- Pixel-level domain transfer. In European Conference on Computer Vision, pp. 517–532.
- Seqgan: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.
- Incremental network quantization: towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044.
- Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.