1 Introduction
Deep Neural Networks (DNNs) have seen tremendous success in solving problems such as regression, image classification, object detection, and sequential decision making. Owing to this, and to the growth in personal and cloud computing capabilities, machine learning (ML) practitioners today design deep learning architectures with millions of parameters that require billions of computations. With the advent of edge computing, there has been a significant thrust toward deploying such complex DNNs on devices with limited computing power, such as mobile and embedded systems, to perform real-time analytics on perceived and sensed data, as in the case of object detection in autonomous cars. Realizing such deployments, however, requires extremely high resource and energy efficiency, and has led to a large number of recent works addressing hardware deployment through model compression or quantization He et al. (2018); Yang et al. (2018); Liu et al. (2017); Han et al. (2015); Zhou et al. (2017). However, these model compression techniques often rely on a predefined DNN such as MobileNetV2 Sandler et al. (2018), which limits the ability to find other DNN structures that may reach better performance under the same constraint. This is a common dilemma in practice, since ML designers and hardware practitioners often sit in different entities within an organization. There is little, if any, product-tuning iteration in the process of model deployment. When iterations do occur, they result in tedious retraining and redeployment that is extremely time consuming.
Therefore, many platform-aware Neural Architecture Search (NAS) techniques Lin et al.; Tan et al. (2019) have been proposed, which search for a neural architecture that fits a given platform. However, the found ML model is optimized for that specific platform: a new search procedure is needed whenever the underlying platform changes, which implies the found model is not portable from platform to platform.
Inspired by the wide success of Conditional Generative Adversarial Networks (cGANs), we introduce the Neural Architecture GAN (NAGAN), which, when conditioned on an expected model performance (model performance in this paper refers to accuracy for classification problems and R-square error for regression problems; details in subsection 5.1), generates multiple feasible network architectures together with their quantization bitwidths. NAGAN generates various NN structures that share similar performance. Therefore, when switching platforms, we can choose a different NN that fits the new constraints from the pool of generated NNs. Thus NAGAN can generate NNs for unseen platforms, which is essential in an era when new platforms come out every day. Moreover, NAGAN adapts to both high-end and low-end platforms: since performance is expected to trade off against area and power, NAGAN generates lighter NNs when conditioned on a lower expected model performance.
We also present an Inverse Neural Architecture Generation (INAG) workflow, which demonstrates a practical design flow for applying a conditional neural architecture search method to resource-constrained problems, as shown in Figure 1.
To verify the proposed method, we apply NAGAN to Multi-Layer Perceptron (MLP) and Convolutional Neural Network (CNN) architecture search on regression and classification problems over both synthetic and real-world datasets, including MNIST and CIFAR10. NAGAN generates a bag of NNs within a 10% difference of the conditioning value. We then show how to apply the INAG workflow to choose feasible NNs. We recognize that there are many unexplored challenges in this framework, such as extending to larger datasets, supporting more diverse NN structures, and scaling to more complex problems; we intend to investigate these topics in future work. Nevertheless, this work is a preliminary step toward incorporating conditioning and adversarial techniques into the NAS problem, and it demonstrates a potential path toward a variety of NAS applications.
The primary contributions of this work are as follows:

This is the first work, to the best of our knowledge, to propose a conditional GAN-based neural architecture search.

Our conditional neural architecture search framework optimizes the network architecture and layer-wise quantization bitwidths simultaneously.

We present a practical workflow to utilize the conditional neural architecture search method.

Our end-to-end workflow learns the mapping between performance and neural architectures and can inversely generate architectures for a desired performance.
2 Background and Motivation
Resource constraints on edge devices have led to active interest in designing DNN models that trade off accuracy for lower compute and memory footprint Iandola et al. (2016); Sandler et al. (2018); Zhang et al. (2018). We describe related work in section 6 and qualitatively compare the contributions of relevant state-of-the-art works in Table 1.
2.1 Training Quantized Networks
Quantized DNN models quantize weights and activations to represent floating-point numbers in lower bitwidths Courbariaux et al. (2015); Rastegari et al. (2016); Zhou et al. (2017). In this work, we target edge platforms where floating-point arithmetic is expensive. We follow the concept of integer-arithmetic-only quantization Jacob et al. (2017), which performs all parameter arithmetic in integers. We leverage this trainable quantization method to learn the network architecture and the quantization bitwidths simultaneously.
2.2 Multi-objective NAS
Neural Architecture Search (NAS) automates the model search process. Most NAS works can be formulated as a hyperparameter search problem. While most NAS works optimize accuracy alone, platform-aware NAS Tan et al. (2019); Tan and Le (2019); Kim et al. (2017) jointly optimizes accuracy and platform constraints by adding the platform constraint to the objective function, a formulation called multi-objective optimization.
However, multi-objective NAS faces two challenges: (1) the objective function needs careful manual design to represent different platform constraints; (2) a platform-optimized NN found this way may not be portable to other platforms.
2.3 Generative Adversarial Networks (GAN)
Generative Adversarial Nets (GANs) Goodfellow et al. (2014) consist of a generative network (generator) and an adversarial network (discriminator). The generator is pitted against the discriminator, which learns to determine whether a sample comes from the generator or the training data. The generator and discriminator are trained iteratively to outperform each other, so both models improve over the course of training.
Conditional GAN Mirza and Osindero (2014); Van den Oord et al. (2016) takes in the condition during the training phase, so that the mapping from different conditions to different distributions can be learned. Similar approaches such as ACGAN Odena et al. (2017), infoGAN Chen et al. (2016), and others Odena (2016); Ramsundar et al. (2015) task the discriminator with reconstructing the condition given to the generator.
3 Neural Architecture GAN
This section illustrates the architecture of the proposed Neural Architecture GAN (NAGAN), which is the first stage of the proposed INAG workflow, described in section 4.
3.1 NAGAN Overview
NAGAN takes in the normalized expected model performance as its input and generates a set of feasible neural architectures, each with per-layer quantization bitwidths, as shown in Figure 2. The training of NAGAN involves a generator, a discriminator, and an encoder. The generator and discriminator are inherited from the general conditional GAN framework, with modifications that enable them to learn the distribution of neural architectures.
Encoder. As shown in Figure 3, we add an encoder to enhance the training of the generator. The encoder is trained on pairs of model definition and model performance in a regression manner. Its main purpose is fast estimation of the quality of the generated data. While quality evaluation in GAN image-generation tasks is not standardized and mostly relies on human judgment, the quality of a network architecture can be judged systematically by deploying the generated network descriptions on the platform and collecting their performance. However, as noted in platform-aware NAS Tan et al. (2019), interacting with the real platform can be time-consuming, so an appropriate surrogate model that predicts performance is needed for fast interaction. NAGAN trains an encoder to serve as this surrogate model. The encoder is a pretrained network that maps a network description to its performance. We found the encoder helpful in several ways during training. First, before training NAGAN itself, training the encoder provides a fast estimate of how NAGAN may perform; e.g., we can notice early if the amount of training data is inadequate. Second, the pretrained encoder weights are a good initialization for the first few layers of the discriminator, since the encoder and discriminator operate on the same feature map.
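As an illustration of the encoder's role as a fast surrogate, the sketch below fits a simple regressor from a flattened network description (nodes and bitwidth per layer) to a toy performance score. The paper's encoder is a trained neural network; a linear least-squares fit, the toy performance function, and the parameter ranges here are all stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_description(num_layers=3):
    """Sample a flattened network description: nodes per layer followed
    by quantization bitwidth per layer (ranges are illustrative)."""
    nodes = rng.integers(16, 256, size=num_layers)
    bits = rng.integers(2, 16, size=num_layers)
    return np.concatenate([nodes, bits]).astype(float)

def toy_performance(desc):
    """Toy stand-in for measured performance: wider, higher-precision
    networks score higher (bounded via tanh)."""
    return float(np.tanh(desc.sum() / 1000.0))

# "Training data": descriptions paired with measured performance.
X = np.stack([make_description() for _ in range(200)])
y = np.array([toy_performance(x) for x in X])

# Fit the surrogate: performance ~ X @ w + b (least squares).
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def encoder(desc):
    """Fast surrogate estimate of model performance."""
    return float(np.append(desc, 1.0) @ w)
```

In the real workflow this predictor replaces slow deploy-and-train evaluation during both NAGAN training and INAG selection.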
3.2 Training of NAGAN
The inputs to the generator are Gaussian noise in the latent dimensions and a fake (generated) condition. For example, if the latent space has 10 dimensions and the condition has one dimension whose value ranges from 0 to 1 after normalization, then during training we randomly sample 10-dimensional Gaussian noise and a value in [0, 1] as the latent input and condition, respectively, as shown at the bottom left of Figure 3. The discriminator has two sources of input: the generated data (generated network descriptions) and the training data (real network descriptions with their corresponding normalized performance values). The discriminator is trained to accomplish two tasks: first, to tell whether the input data comes from the generated set or the training set, and second, to reconstruct the condition of the input data. As shown in Algorithm 1, we alternate among training the generator, discriminator, and encoder by optimizing their value functions.
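The generator-input construction from the walkthrough above can be sketched directly; the 10-dimensional latent space and scalar condition in [0, 1] follow the example in the text, while the batch size and concatenation layout are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_generator_inputs(batch_size, latent_dim=10):
    """Build one training batch of generator inputs: Gaussian latent
    noise plus a randomly sampled (fake) normalized condition."""
    z = rng.standard_normal((batch_size, latent_dim))  # latent noise
    c = rng.uniform(0.0, 1.0, size=(batch_size, 1))    # fake condition
    return np.hstack([z, c])                           # generator input

batch = sample_generator_inputs(32)
```

Each row feeds the generator once per training step; the condition column is also what the discriminator later tries to reconstruct.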
Value functions. The value functions of the generator and discriminator follow the ACGAN Odena et al. (2017) setting. On top of these, we add the encoder's value function to the training process. In our framework, the weights of the encoder are pretrained and fixed. The encoder's value function evaluates the loss of the generated NNs' condition as approximated by the surrogate model (the encoder). This loss backpropagates through the encoder and trains the generator; the encoder thus acts as a teaching model for the generator.
Continuous condition. We use the model performance as a continuous condition and find that it encourages a smooth transition along the condition dimension. The effectiveness of continuous conditions has also been observed in prior work Diamant et al. (2019); Pumarola et al. (2018); Souza and Ruiz (2018).
Training Data. Since there is no public training data available for the task of neural architecture generation, we prepare the training data as follows. Each training example for NAGAN is composed of a network description (number of nodes/channels and quantization bitwidth per layer) and its corresponding model performance. We randomly sample a value for each parameter in the network description, deploy and train the resulting network on the regression or classification problem, and gather its model performance, e.g., accuracy. We execute Monte Carlo sampling of the network description space to generate sufficient data for the training dataset.
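The data-preparation loop above can be sketched as follows. The parameter ranges, layer count, and especially the `train_and_measure` stub are illustrative placeholders; in the real workflow that function deploys and trains each sampled network and returns its measured performance.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_description(num_layers=3):
    """One Monte Carlo sample of the search space: nodes/channels and
    quantization bitwidth per layer (ranges are illustrative)."""
    return {
        "nodes": rng.integers(16, 256, size=num_layers).tolist(),
        "bits": rng.integers(2, 16, size=num_layers).tolist(),
    }

def train_and_measure(desc):
    """Placeholder: the real workflow trains the described network and
    returns its accuracy (classification) or R-square (regression)."""
    return float(rng.uniform(0.0, 1.0))

# Assemble (description, performance) pairs as NAGAN training data.
dataset = [(d, train_and_measure(d))
           for d in (sample_description() for _ in range(100))]
```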
3.3 Generating Quantized Neural Architectures
To achieve the best data efficiency for the neural architecture, our goal is to produce not only the network architecture but also the quantization bitwidth for each layer of the generated networks. Existing research targets uniform quantization of the whole model (model-wise quantization) and reports impressive results with 8-bit quantization on the tested applications Gysel et al. (2018). In this work, we target per-layer quantization, as shown in the example in Figure 2, which reduces the model size more aggressively than uniform quantization.
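To make the per-layer idea concrete, the sketch below applies a generic uniform min-max quantizer with a different bitwidth to each layer's weights. This is not necessarily the trainable integer-arithmetic-only scheme the paper adopts; the quantizer, layer shapes, and bitwidth assignment are illustrative assumptions.

```python
import numpy as np

def quantize_layer(weights, bitwidth):
    """Uniform fake quantization of one layer's weights to `bitwidth`
    bits over the layer's [min, max] range."""
    levels = 2 ** bitwidth - 1
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / levels
    q = np.round((weights - w_min) / scale)  # integer level per weight
    return q * scale + w_min                 # dequantized values

rng = np.random.default_rng(3)
layer_weights = [rng.standard_normal((8, 8)) for _ in range(3)]
per_layer_bits = [8, 4, 2]  # e.g., one generated per-layer assignment
quantized = [quantize_layer(w, b)
             for w, b in zip(layer_weights, per_layer_bits)]
```

The 2-bit layer collapses to at most four distinct values, while the 8-bit layer stays close to the original weights, which is why per-layer assignments can cut storage far more aggressively than a single model-wide bitwidth.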
4 Inverse Neural Arch Generation
The NAGAN generates a bag of NNs that fit the required condition. Out of these NNs, we pick the ones that fit the platform. In this section, we present a practical workflow, Inverse Neural Architecture Generation (INAG), to accomplish this selection process.
4.1 Toolflow and Walkthrough example of INAG.
After NAGAN generates a bag of NNs, INAG simply selects among them according to criteria derived from the platform constraints. However, when no generated NN fits the constraints (for example, when a high-complexity NN must fit on a small edge device), we condition on a lower expected model performance and generate a new bag of lower-end NNs to fit the lower-end platform.
4.2 Selecting Stages
Selecting stages are fast evaluation processes that output the data points meeting a specified set of constraints. Unlike state-of-the-art multi-objective optimization approaches, we consider the constraints after candidate data points have been generated by NAGAN. Moreover, selecting stages remove the need to design weighting parameters for each constraint in a multi-objective formulation, and they automatically lead to distinct optimum points when the constraints change.
Confidence selector. Since NAGAN generates models from Gaussian sampling, a wider accepted confidence range leads to a higher diversity of models and likewise a larger range of performance differences within a bag of NNs. We evaluate the confidence of each NN by calculating the normalized distance d between the input condition c and the evaluated condition ĉ of the generated NN. INAG uses the encoder as a surrogate model to estimate ĉ. The normalized distance is defined as:
d = |c − ĉ| / Z    (1)
where Z is the normalization factor that scales d to the range 0 to 1. The confidence selector uses the normalized distance d as its confidence measure for selecting feasible NNs.
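A minimal sketch of this selector, assuming the condition is a scalar in [0, 1] and Z = 1; the threshold, the scalar "descriptions", and the identity "encoder" in the usage example are illustrative stand-ins.

```python
def normalized_distance(condition, estimated, Z=1.0):
    """Eq. (1): d = |c - c_hat| / Z, with Z normalizing d to [0, 1]."""
    return abs(condition - estimated) / Z

def confidence_select(candidates, condition, encoder, threshold=0.1):
    """Keep generated NNs whose encoder-estimated condition lies within
    `threshold` normalized distance of the input condition."""
    return [d for d in candidates
            if normalized_distance(condition, encoder(d)) <= threshold]

# Toy usage: descriptions are scalars and the "encoder" is the identity.
pool = [0.55, 0.61, 0.9, 0.58]
selected = confidence_select(pool, condition=0.6, encoder=lambda d: d)
```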
Storage selector. The storage required to deploy a DNN on a device is formulated as a function of the number of parameters and their bitwidths, which approximates the required memory space. The evaluation function is formulated as:
Storage(m) = Σ_{l=1}^{L} (W_l + F_l) · b_l    (2)
where L is the number of layers of model m, W_l is the number of weight parameters, F_l is the number of feature-map parameters, and b_l is the bitwidth of layer l. The INAG workflow uses a normalized storage constraint, whose value is normalized to the range 0 to 1.
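The storage evaluation of Eq. (2) is a one-line sum; the sketch below computes it in raw bits for a hypothetical two-layer MLP (the layer dictionary fields and the example layer sizes are illustrative, and normalization to [0, 1] is left out).

```python
def storage(model):
    """Eq. (2): sum over layers of (weight params + feature-map params)
    times that layer's bitwidth, in bits."""
    return sum((l["weights"] + l["fmap"]) * l["bits"] for l in model)

# Hypothetical 784->128->10 MLP with per-layer bitwidths 8 and 4.
net = [
    {"weights": 784 * 128, "fmap": 128, "bits": 8},
    {"weights": 128 * 10,  "fmap": 10,  "bits": 4},
]
bits_needed = storage(net)
```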
Energy selector. Since MACs dominate the operating energy Yang et al. (2017), and the power of one MAC unit is a function of its arithmetic bitwidth, INAG formulates the operating energy consumption as a function of MAC counts and bitwidths:
Energy(m) = Σ_{l=1}^{L} M_l · e(b_l)    (3)
where M_l is the total number of MAC operations of layer l and e(·) is the energy evaluation function mapping a bitwidth to per-MAC energy. We use normalized energy values between 0 and 1 throughout the flow.
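Eq. (3) can be sketched the same way; the per-MAC energy model below (cost proportional to bitwidth) and the MAC counts are illustrative assumptions, not the paper's calibrated function.

```python
def energy(model, e_mac):
    """Eq. (3): total energy = sum over layers of MAC count times the
    per-MAC energy for that layer's bitwidth."""
    return sum(l["macs"] * e_mac(l["bits"]) for l in model)

# Toy per-MAC energy model: cost grows linearly with arithmetic bitwidth.
e_mac = lambda bits: bits / 8.0
net = [{"macs": 100_352, "bits": 8}, {"macs": 1_280, "bits": 4}]
total = energy(net, e_mac)
```

Swapping in a measured energy table for `e_mac` (or an external estimator, as in subsection 4.3) requires no change to the selector itself.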
Output selector. This selector is the last stage of INAG; it selects the highest-ranked generated design configuration based on the criterion of normalized distance, normalized storage, or normalized energy.
4.3 Extensions
In practice, INAG can easily incorporate additional constraints such as endtoend performance, mainmemory bandwidth, and energy on a target device by enhancing or adding more selecting stages that leverage advanced external analytical tools for estimating latency or throughput Kwon et al. (2018), optimal dataflow Kwon et al. (2018), energy Wu et al. (2019), and so on.
5 Evaluation
5.1 Experiment setup
Dataset. For the regression problems, we use two types of datasets: synthetic and real-world. For the synthetic datasets, we generate two datasets of different complexity, each defined by a synthetic function with additive Gaussian noise, a common assumption for most regression models. The first dataset uses a lower-degree polynomial with randomly assigned coefficients; the second uses a more complex, higher-degree polynomial. We synthesize these two datasets to show that our method is effective in both simple and complex regression problems. For the real-world dataset, we use California Housing [1], whose task is to predict the price of a property given its attributes. For the image classification problems, we use two datasets: MNIST and CIFAR10.
Target Problems. We construct six experimental settings for regression and classification with MLP-based and CNN-based structures and name them accordingly in Table 2.
Exp.  Reg.A  Reg.B  Reg.C  Cls.A  Cls.B  Cls.C
Arch.  MLP  MLP  MLP  MLP  MLP  CNN
Data  Synthetic 1  Synthetic 2  C. H.  MNIST  CIFAR  CIFAR
When evaluating the performance of the NAGAN generator, we statistically calculate the normalized distance between the generated models and the normalized condition, sweeping the condition from 0.1 to 1.0 with a step size of 0.1. The generator's performance is evaluated by the mean normalized distance. In the experiments, we use two testing routes: fast evaluation by the encoder, as described in subsection 4.2, which gives the encoder-estimated mean distance d_E; and actual evaluation by deploying the generated NNs on the platform, which gives the platform-measured mean distance d_P.
Size of Training Data. The neural architecture search space is extremely large. For instance, in our setting, the total number of datapoints in the search space for MLP architecture generation is more than 17 billion. However, empirically we found that we can train NAGAN with a comparatively small training set. In Figure 4(a), we show the impact of the number of training samples on the quality of NAGAN using Reg.A as an example. We trained NAGAN with different training-set sizes and measured the mean normalized distance, which starts to converge around 1k datapoints. This implies we can train NAGAN with a small amount of data and achieve performance close to that obtained with a large amount of data. In the following experiments, we use training-set sizes as follows: 8,000 datapoints for the first synthetic dataset, 8,000 for the second, 7,000 for California Housing, 10,000 for MNIST, and 20,000 for CIFAR10.
Method  Perf.(%)  Storage  Energy  Time
GA  97.3  0.21  0.21  6 hrs
Bayesian  97.7  0.20  0.213  13 hrs
INAG  95  0.19  0.19  5 secs
* Note: The constraint in this experiment is defined as: normalized storage < 0.21, normalized energy < 0.21. Time is the search time for GA and Bayesian, and the inference time for INAG to obtain this result; pretraining NAGAN takes around 2 hrs.
5.2 Comparison with GA and Bayesian Optimization
The results of the INAG workflow are compared against the baseline approaches of a genetic algorithm (GA) and Bayesian optimization. We perform a constrained optimization with the GA, in which the algorithm attempts to maximize the accuracy of the model subject to storage and energy constraints; the constraints are incorporated into the fitness of each design point using the interior penalty method. We similarly use a Bayesian optimizer to search for the optimum point by incorporating the constraints into the target function. The search time of each method is measured on a desktop with a GTX 1080 GPU, as shown in Table 3. Further, a trained NAGAN is portable between platforms: we can use NAGAN to regenerate NNs and run the fast INAG selection whenever the platform changes, whereas optimization-based methods such as GA and Bayesian optimization require redesigning the target function and repeating the search.
5.3 Results on NAGAN
We also statistically evaluate NAGAN with the encoder-estimated and platform-measured mean normalized distances, d_E and d_P, defined in subsection 5.1. Across all experiments in Table 2, the average actual mean normalized distance of the generator is below 10%, as shown in Table 4. We observe that d_E is smaller than d_P, which means the surrogate model (the encoder) is sometimes overly optimistic. However, it still serves as a fast evaluation in the NAGAN training process and the INAG selecting process, and it helps navigate NAGAN through training.
Discussion. From experiment Cls.C, we found that the generator is capable of capturing the distribution of the CNN, yielding accuracies comparable to its MLP counterpart. Even though the search space of the CNN is larger and the latent distribution more complex, NAGAN achieves a good platform-measured distance of 8.7% on the CIFAR10 dataset. Although this demonstrates NAGAN's capability to capture the complexities of CNNs, we recognize that the results shown in Table 4 are more data intensive, requiring 20,000 training datapoints to reach the quoted performance. Hence, there is still plenty of room for exploration, e.g., exploiting different characteristics of CNNs and supporting more complex CNN models. We intend to investigate these topics in future work.
Exp  Reg.A  Reg.B  Reg.C  Cls.A  Cls.B  Cls.C
d_E (encoder, %)  0.3  1.2  0.5  0.3  0.4  4.8
d_P (platform, %)  12.0  6.2  11.3  8.1  11.7  8.7
5.4 Results on INAG
We demonstrate INAG on Reg.A to show its advantage of a wide Pareto frontier. Figure 4(b) shows a scatter plot of normalized performance versus storage as we sweep the expected normalized performance from 0.0 to 1.0. INAG generates NNs with different storage consumption conditioned on the expected normalized performance, and the generated NNs form a wide Pareto frontier.
6 Related Works
Quantization. Deep Compression Han et al. (2015) quantizes values into bins, where each bin shares the same weight, so only a small number of indices is required. Courbariaux et al. (2014) directly shifted floating-point values to fixed-point and integer values. DoReFa-Net Zhou et al. (2016) retrains the network after quantization and enables quantized backward propagation. HAQ Wang et al. (2019) approaches quantized network training with reinforcement learning.
Neural Architecture Search (NAS). Genetic algorithms (GA) and evolutionary algorithms Stanley et al. (2019) have been studied for decades for parameter optimization. NASBOT Kandasamy et al. (2018) applies Bayesian optimisation to architecture search. NAS with reinforcement learning is currently the most general framework, e.g., NASNet Zoph et al. (2018). DARTS Liu et al. (2018) uses a differentiable architecture to facilitate the search process. Platform-aware NAS, such as MnasNet Tan et al. (2019), MONAS Hsu et al. (2018), HAS Lin et al., and NetAdapt Yang et al. (2018), includes platform constraints in the search process. Unlike these previous NAS efforts, this work uses a conditional method to jointly generate the network architecture and quantization bitwidths conditioned on different expected performance for different platforms.
7 Conclusion
We propose a conditional GAN-based NAS method that provides platform portability. We also present a workflow to select feasible NNs for the targeted platform. Finally, we validate our method on both regression and classification problems.
References
 [1] California housing data set. Note: https://www.kaggle.com/camnugent/californiahousingprices Cited by: §5.1.
 Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §2.3.
 Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024. Cited by: §6.
 Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131. Cited by: §2.1.
 Beholdergan: generation and beautification of facial images with conditioning on their beauty level. arXiv preprint arXiv:1902.02593. Cited by: §3.2.
 Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.3.

 Ristretto: a framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Transactions on Neural Networks and Learning Systems 29 (11), pp. 5784–5789. Cited by: §3.3.
 Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1, §6.

 AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: §1.
 MONAS: multi-objective neural architecture search using reinforcement learning. arXiv preprint arXiv:1806.10332. Cited by: §6.
 SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. Cited by: §2.
 Quantization and Training of Neural Networks for Efficient IntegerArithmeticOnly Inference. arXiv eprints, pp. arXiv:1712.05877. External Links: 1712.05877 Cited by: §2.1.
 Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, pp. 2016–2025. Cited by: §6.
 Nemo: neuroevolution with multiobjective optimization of deep neural network for speed and accuracy. In ICML 2017 AutoML Workshop, Cited by: §2.2.
 An analytic model for costbenefit analysis of dataflows in dnn accelerators. arXiv preprint arXiv:1805.02566. Cited by: §4.3.
 [16] Neural-hardware architecture search. Cited by: §1, §6.
 Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §6.
 Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744. Cited by: §1.
 Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.3.

 Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2642–2651. Cited by: §2.3, §3.2.
 Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583. Cited by: §2.3.
 Ganimation: anatomicallyaware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 818–833. Cited by: §3.2.
 Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072. Cited by: §2.3.
 Xnornet: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §2.1.

 MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §1, §2.
 GAN-based realistic face pose synthesis with continuous latent code. In The Thirty-First International Flairs Conference. Cited by: §3.2.
 Designing neural networks through neuroevolution. Nature Machine Intelligence 1 (1), pp. 24–35. Cited by: §6.
 Mnasnet: platformaware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §1, §2.2, §3.1, §6.
 EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §2.2.
 Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pp. 4790–4798. Cited by: §2.3.
 HAQ: hardwareaware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620. Cited by: §6.
 Accelergy: An ArchitectureLevel Energy Estimation Methodology for Accelerator Designs. In IEEE/ACM International Conference On Computer Aided Design (ICCAD), Cited by: §4.3.
 Designing energyefficient convolutional neural networks using energyaware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5687–5695. Cited by: §4.2.
 Netadapt: platformaware neural network adaptation for mobile applications. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 285–300. Cited by: §1, §6.
 Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §2.
 Incremental network quantization: towards lossless cnns with lowprecision weights. arXiv preprint arXiv:1702.03044. Cited by: §1, §2.1.
 Dorefanet: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §6.
 Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §6.