1 Introduction
Deep neural networks have demonstrated extraordinary power in automatic feature engineering, which however involves extensive human effort in finding good network architectures. To eliminate such handcrafted architecture design, neural architecture search (NAS) was recently proposed to automatically discover suitable networks by searching over a vast architecture space. Recent endeavors have demonstrated the superior ability of NAS to find more effective network architectures, which have achieved state-of-the-art performance in various computer vision tasks and beyond, such as image classification Zoph and Le (2016), semantic segmentation Chen et al. (2018); Liu et al. (2019), and language modeling Liu et al. (2018b); Zoph et al. (2018). Despite the remarkable progress, existing NAS methods are limited by intensive computation and memory costs in the offline architecture search. For example, reinforcement learning (RL) based methods Zoph et al. (2018); Zoph and Le (2016) train and evaluate a large number of neural networks across many GPUs over several days. To accelerate this training, recent methods like DARTS Liu et al. (2018b) reduce the search time by formulating the task in a differentiable manner, where the search space is relaxed to a continuous one. The objective function can thus be optimized by gradient descent, which reduces the search time to a few days while retaining comparable accuracy. However, DARTS still suffers from high GPU memory consumption, which increases linearly with the size of the candidate search set. Therefore, the need for faster NAS algorithms remains urgent in various real-world applications.

A conventional NAS method consists of three parts Elsken et al. (2018): search space, search strategy, and performance estimation. Most NAS methods share the same search space, but have intensive computational requirements in the search strategy and performance estimation. In terms of search strategy, reinforcement learning Sutton and Barto (2018) and evolutionary algorithms Back (1996) are widely used in the literature, which require a large number of structure-performance pairs to find the optimal architecture. In terms of performance estimation, most NAS methods Zoph et al. (2018); Chen et al. (2018) use full-fledged training and validation over each searched architecture, which is computationally expensive and thus limits the search exploration. To reduce the computational cost, Zoph et al. (2018); Zela et al. (2018) propose to use early stopping to estimate the performance with a shorter training time. However, extensive experiments in Ying et al. (2019) show that the performance ranking is not consistent across different training epochs, which indicates that early stopping may result in suboptimal architectures.

In this paper, we propose a Dynamic Distribution Pruning method for extremely efficient Neural Architecture Search (termed DDPNAS), which considers architectures as samples from a dynamic joint categorical distribution. More specifically, we introduce a dynamic distribution to control the choices of inputs and operations, and a specific network structure is directly obtained via sampling. In the search process, we generate different samples and train them on the training set for a few epochs. Then, the evaluation results on the validation set are used to estimate the parameters of the distribution, and the element with the lowest probability is dynamically pruned. The best architecture is obtained when only one architecture remains in the search space. Fig. 1 shows the overall framework of the proposed DDPNAS.
We validate the search efficiency and performance of our architecture on the classification task on CIFAR-10 Krizhevsky and Hinton (2009) and ImageNet 2012 Russakovsky et al. (2015). Our architecture reaches a state-of-the-art test error on CIFAR-10. On ImageNet, our model achieves competitive top-1 accuracy under the MobileNet setting (MobileNet V1/V2 Howard et al. (2017); Sandler et al. (2018)). Our contributions are summarized as follows:

We introduce a novel NAS strategy, referred to as DDPNAS, which is memory-efficient and flexible on large datasets. For the first time, we enable NAS to have a computational cost similar to the training of conventional CNNs.

A new theoretical perspective is introduced for NAS. Rather than optimizing a proxy as in other methods Zoph et al. (2018); Liu et al. (2018b), we directly optimize in the NAS search space. Our model can thus be easily incorporated into most existing NAS algorithms to speed up the search process. A theoretical analysis is further provided.

In experiments on CIFAR-10 and ImageNet, we show that DDPNAS achieves remarkable search efficiency, e.g., a low test error on CIFAR-10 after only a few hours of searching on one Tesla V100 (substantially faster than state-of-the-art algorithms Zoph et al. (2018); Real et al. (2018)). When evaluating on ImageNet, DDPNAS can directly search over the full ImageNet dataset within 2 days, achieving strong top-1 accuracy under the MobileNet settings.
2 Related Work
Neural architecture search is an automatic architecture engineering technique that has received significant attention over the last few years. For a given dataset, architectures with high accuracy or low latency are obtained by performing a heuristic search in a predefined search space. For image classification, most human-designed networks are built by stacking reduction cells (i.e., the spatial dimension is reduced and the channel size is increased) and norm cells (i.e., the spatial and channel dimensions are preserved) He et al. (2016); Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Huang et al. (2017); Hu et al. (2018). Therefore, existing NAS methods Zoph and Le (2016); Zoph et al. (2018); Liu et al. (2018a, b) can search architectures under the same settings to work with a small search space.

Many different search algorithms have been proposed to explore the neural architecture space with specific search strategies. One popular approach is to model NAS as a reinforcement learning (RL) problem Zoph and Le (2016); Zoph et al. (2018); Baker et al. (2016); Cai et al. (2018a); Liu et al. (2018a); Cai et al. (2018b). Zoph et al. (2018) employ a recurrent neural network as the policy function to sequentially generate a string that encodes the specific neural architecture. The policy network can be trained with the policy gradient algorithm or proximal policy optimization. Cai et al. (2018a, b) propose a method that regards the architecture search space as a tree structure for network transformation. In this method, new network architectures are generated from a father network by predefined transformations, which reduces the search space and thus speeds up the search. An alternative way to explore the architecture space is through evolution-based methods, which evolve a population of network structures using evolutionary algorithms Xie and Yuille (2017); Real et al. (2018). Although the above architecture search algorithms have achieved state-of-the-art results on various tasks, they still need a large amount of computational resources.

To overcome this problem, several recent works have proposed to accelerate NAS in a one-shot setting, which has demonstrated the possibility of finding the optimal network architecture within a few GPU days. In one-shot architecture search, each architecture in the search space is considered a subgraph sampled from a supergraph, and the search process can be accelerated by parameter sharing Pham et al. (2018). Liu et al. (2018b) jointly optimize the weights between two nodes with the architecture hyperparameters under a continuous relaxation; both the weights in the graph and the hyperparameters are updated via standard gradient descent. However, the method in Liu et al. (2018b) still suffers from a large GPU memory footprint, and its search complexity is still not practical for real-world applications. To this end, Cai et al. (2018c) adopt the differentiable framework and propose to search architectures without any proxy. However, that method still keeps the same search algorithm as the previous work Liu et al. (2018b).
Different from the previous methods, we consider NAS in another way: the operation selection is treated as a sample from a dynamic categorical distribution. Thus, the optimal architecture can be obtained through distribution pruning, which achieves extreme efficiency, as quantified in Sec. 4.
3 The Methodology
In this section, we present the proposed dynamic distribution pruning method for neural architecture search. We first describe the architecture search space in Sec. 3.1. Then, the proposed dynamic distribution pruning framework is introduced in Sec. 3.2. Finally, a theoretical analysis of the error bound of the proposed method is provided in Sec. 3.3.
3.1 Architecture Search Space
We follow the same architecture search space as in Liu et al. (2018b); Zoph and Le (2016); Zoph et al. (2018). A network consists of a predefined number of cells Zoph and Le (2016), which can be either norm cells or reduction cells. Each cell takes the outputs of the two previous cells as input. A cell is a fully-connected directed acyclic graph (DAG) of nodes. Each node takes its dependent nodes as input and generates an output through a sum operation. Here each node is a specific tensor (e.g., a feature map in a convolutional neural network), and each directed edge between two nodes denotes an operation sampled from the corresponding operation search space. Note that a constraint on edge directions ensures there are no cycles in a cell. Each cell takes the outputs of two dependent cells as input, and for simplicity we treat these two as the input nodes. Following Liu et al. (2018b), the operation search space consists of the following operations: max pooling, no connection (zero), average pooling, skip connection (identity), dilated convolutions with two different rates, and depthwise separable convolutions with two different kernel sizes. The size of the whole search space is therefore exponential in the number of possible edges of the fully-connected DAG with the intermediate nodes. Together with the two input nodes, the total number of cell structures in the search space is extremely large, and thus requires efficient optimization strategies.

3.2 Dynamic Distribution Pruning
As illustrated in Fig. 1, the architecture search is formulated as a staged conditional sampling process in our approach. More specifically, for each edge we introduce a dynamic categorical distribution over the candidate operations. The sampling process begins from the latent state, and each operation on an edge is selected with its corresponding probability. We follow the state-of-the-art works Liu et al. (2018b); Pham et al. (2018) and use an over-parameterized parent network containing all possible operations at each edge with a weighted probability. This design allows the neural architecture search to be optimized through stochastic gradient descent (SGD) with an EM-like algorithm, i.e., iteratively fixing the distribution parameters to update the network parameters, and fixing the network parameters to update the distribution parameters. While real-valued weights bring convenience in optimization Liu et al. (2018b), they also require every possible operation in the search space to be evaluated, which directly causes impractically long training times. Instead, we set the operation weight as a one-hot indicator vector:
(1) 
which can be sampled from a categorical distribution. While bringing a significant speedup, this discrete weight design also makes optimization difficult. Here we propose to optimize it through the validation likelihood as a proxy, which has nice theoretical properties and is one of the core contributions of this paper. Given a set of indicator variables, a network structure is determined. Based on the training data, we can then train the model to obtain the network parameters, which finally allows us to test on the validation set to obtain the predicted labels, as shown in Fig. 1. While our ultimate goal is to find an optimal network architecture, fully represented by the indicator variables, we show in the following theorem that this is equivalent to maximizing the likelihood of the validation targets.
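Concretely, drawing a structure amounts to sampling one one-hot indicator vector per edge from its categorical distribution. The sketch below is illustrative only: the edge names and probability values are hypothetical placeholders, not the paper's actual search-space configuration.

```python
import random

def sample_structure(edge_probs, rng=random):
    """For each edge, sample a one-hot operation indicator (as in Eq. 1)
    from the edge's categorical distribution over candidate operations."""
    structure = {}
    for edge, probs in edge_probs.items():
        # Draw the index of the selected operation with the given weights.
        k = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        one_hot = [0] * len(probs)
        one_hot[k] = 1  # exactly one operation is active on this edge
        structure[edge] = one_hot
    return structure
```

With degenerate distributions the sample is deterministic, e.g. `sample_structure({"edge01": [0.0, 1.0, 0.0]})` always selects the second operation.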
Theorem 1.
In a certain training epoch, the structure variable directly determines the validation performance, specifically:
Proof.
As illustrated in Fig. 1, the function can be formulated as:
(2) 
where the conditioning variables are the inputs and labels from the training set, the set of network weights, and the number of training epochs. Since the training inputs and labels are observed variables, during a specific training epoch, Eq. 2 can be further simplified to:
(3) 
To simplify the analysis, without loss of generality, we assume the network weights are initialized as constants, which means the weights are fixed given a certain structure and training epoch; we can thus further simplify Eq. 3 to
(4) 
As shown in Eq. 4, the structure variable directly determines the validation performance, i.e., if a structure shows better performance on the validation set, the corresponding indicator holds a higher probability, and vice versa. Therefore, the theorem holds during any specific training epoch. ∎
Based on Theorem 1, the distribution can be optimized through the validation likelihood, which involves the standard sampling, training, evaluating, and updating processes. While discrete indicators reduce the computational requirement considerably, such a procedure is still time-consuming considering the large search space and the complexity of network training. Inspired by Ying et al. (2019), we further propose a dynamic pruning process to boost the efficiency by a large margin. Ying et al. (2019) conducted a series of experiments showing that in the early stage of training, the validation accuracy ranking of different network architectures is not a reliable indicator of the final architecture quality. However, we observe that their results actually suggest a useful property: if an architecture performs badly at the beginning of training, there is little hope that it can be the final optimal model, and this observation becomes less uncertain as training progresses. Based on this intuition, we derive a simple yet effective pruning process: during training, along with the increasing epoch index, we progressively prune the worst-performing candidates. Further theoretical analysis shows that this strategy has a nice theoretical bound, as will be introduced in Sec. 3.3.
Specifically, as illustrated in Fig. 1, we first sample network structures by drawing indicator vectors from the categorical distributions. These structures are then trained for a few epochs, and the probability of each candidate is estimated by
(5) 
Using Theorem 1, the distribution of the latent state is updated by softmax:
(6) 
Note that a nonzero entry denotes that the structure selects the corresponding operation at the given edge. Finally, we prune the element with the minimal probability from the categorical distribution. The optimal structure is obtained when only one architecture remains in the distribution. Our dynamic distribution pruning algorithm is presented in Alg. 1.
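As an illustration, the sample–train–evaluate–update–prune cycle of Alg. 1 can be sketched for a single edge. This is a simplified sketch under stated assumptions, not the paper's implementation: the operation names are hypothetical, and the training-plus-validation step is abstracted into a placeholder scoring function `evaluate`.

```python
import math

def ddp_nas_single_edge(ops, evaluate):
    """Sketch of dynamic distribution pruning for one edge: evaluate the
    remaining candidates on a validation proxy, re-estimate the categorical
    distribution with a softmax (as in Eq. 6), and prune the operation with
    the minimal probability until only one candidate remains."""
    probs = {op: 1.0 / len(ops) for op in ops}  # uniform categorical prior
    while len(probs) > 1:
        # Validation score of each surviving candidate (training abstracted).
        scores = {op: evaluate(op) for op in probs}
        # Softmax update of the distribution from validation performance.
        z = sum(math.exp(s) for s in scores.values())
        probs = {op: math.exp(s) / z for op, s in scores.items()}
        # Dynamically prune the lowest-probability operation.
        worst = min(probs, key=probs.get)
        del probs[worst]
    return next(iter(probs))  # the single remaining operation
```

With a deterministic scoring function, the best-scoring operation always survives the pruning sequence.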
3.3 Theoretical Analysis
For the greedy pruning to work well, we need an accurate early estimate of the probability in Eq. 5. To this end, a theoretical upper bound is given below.
Corollary 1.
In a given training epoch, the standard deviation of the estimation error in Eq. 5 is
(7)
where the two coefficients are constants, and the final epoch is the one at which convergence is reached.
This corollary is a generalized conclusion from Ying et al. (2019), which demonstrates that the validation performance at a given epoch and that at the final epoch, where convergence is met, have a Spearman rank correlation that increases as training approaches convergence. Moreover, the correlation shows a strongly significant linear relationship with the training progress, under the assumption that the schedule-related coefficient is a constant. Considering this assumption widely holds in popular learning rate reduction schemes such as the cosine annealed schedule, we generalize this empirical observation in formal mathematical language in Eq. 7 by introducing a deviation function of the estimation error, e.g., a low rank correlation corresponds to a high deviation.
We say pruning makes a mistake when the pruned architecture is actually the optimal one, which means the estimation error is considerably large. While the error deviations may vary case by case, when taking all possible architectures into consideration, there exists a threshold such that, if we consider pruning to make a mistake whenever the error exceeds this threshold, we obtain the same error rate, i.e., the probability of pruning making a mistake. Following Eq. 7, the threshold has the form
(8) 
Theorem 2.
The upper bound of the error rate of Alg. 1 is
(9) 
Proof.
Following the discussion above, the error rate equals the probability that the estimation error exceeds the threshold. From Chebyshev’s inequality, we have
(10) 
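For reference, the step above uses the standard form of Chebyshev's inequality: for any random variable $X$ with mean $\mu$ and standard deviation $\sigma$,

$$
P\big(|X - \mu| \ge k\sigma\big) \le \frac{1}{k^2}, \qquad k > 0,
$$

so the probability of a large estimation error is bounded by the inverse square of how many standard deviations the threshold is away.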
While the bound above is for one epoch only, if we consider the series of pruning operations until only one architecture is left, the overall bound on the total error rate is
(11) 
Based on the deviation function and the threshold defined in Eq. 7 and Eq. 8, the error bound can be further formulated as
(12) 
∎
Theorem 2 quantitatively demonstrates the rationale of the dynamic pruning design. The error bound is decided by the deviation function and the threshold. On the one hand, when training has just begun, the deviation is large, and we have to be conservative and not prune architectures early. On the other hand, as training approaches convergence, we can prune more aggressively with a guaranteed low risk of missing the optimal architecture.
4 Experiments
In this section, we compare our approach with state-of-the-art methods in both effectiveness and efficiency on CIFAR-10 and ImageNet. First, we conduct experiments under the same settings as previous methods Liu et al. (2018b); Cai et al. (2018b); Zoph et al. (2018); Liu et al. (2018a) to evaluate the generalization capability, i.e., first searching on the CIFAR-10 dataset, then stacking the optimal cells into deeper networks. Second, we further perform experiments to search architectures directly on ImageNet under the mobile setting, following Cai et al. (2018c). Our results show that we can obtain network architectures with comparable performance but with much fewer GPU hours.
4.1 Search on CIFAR-10 and Transfer
In this experiment setting, we first search neural architectures on an over-parameterized network, and then evaluate the best architecture with a stacked deeper network. We ran the experiment multiple times and found that the resulting architectures showed only slight variance in performance, which demonstrates the stability of the proposed method.
4.1.1 Experiment Settings
We use the same datasets and evaluation metrics as existing NAS works Liu et al. (2018b); Cai et al. (2018b); Zoph et al. (2018); Liu et al. (2018a). First, most experiments are conducted on CIFAR-10 Krizhevsky and Hinton (2009), which has 50K training images and 10K testing images at 32×32 resolution from 10 classes. The color intensities of all images are normalized. During architecture search, we randomly select a subset of images from the training set as a validation set. To further evaluate the generalization capability, we stack the optimal cell discovered on CIFAR-10 into a deeper network, and then evaluate the classification accuracy on ILSVRC 2012 ImageNet Russakovsky et al. (2015), which consists of 1,000 classes with 1.28M training images and 50K validation images. Here, we consider the mobile setting, where the input image size is 224×224 and the number of multiply-add operations is less than 600M.

In the search process, we consider a small number of cells in the network, where the reduction cells are inserted at the second and third layers, with several internal nodes in each cell. The search epoch correlates with the estimation epoch in Eq. 5; in our experiment, the network is trained for fewer than 150 epochs, with a batch size of 512 (due to the shallow network and few operation samplings) and a small number of initial channels. We use SGD with momentum to optimize the network weights, with an initial learning rate annealed down to zero following a cosine schedule, a momentum of 0.9, and weight decay. The search takes only a few GPU hours on a single Tesla V100 for CIFAR-10.
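The cosine-annealed learning rate schedule mentioned above follows the standard form; a minimal sketch is shown below (the initial learning rate and epoch count in the usage are placeholders, not the paper's exact values).

```python
import math

def cosine_annealed_lr(lr_init, epoch, total_epochs):
    """Standard cosine annealing: decay lr_init smoothly to zero over
    total_epochs, i.e. lr(t) = 0.5 * lr_init * (1 + cos(pi * t / T))."""
    return 0.5 * lr_init * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

For example, `cosine_annealed_lr(0.1, 50, 100)` gives the midpoint value 0.05, and the rate reaches zero at the final epoch.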
In the architecture evaluation step, our experimental settings are similar to Liu et al. (2018b); Zoph et al. (2018); Pham et al. (2018). A large network of stacked cells is trained for many epochs with additional regularization, such as cutout DeVries and Taylor (2017). When stacking cells to evaluate on ImageNet, we use two initial strided convolutional layers before stacking the cells, with scale reduction at several intermediate cells. The total number of FLOPs is determined by the initial number of channels. The network is trained for 250 epochs with a batch size of 512, weight decay, and an initial SGD learning rate of 0.1. All the experiments and models are implemented in PyTorch
Paszke et al. (2017).

Table 1: Comparison with state-of-the-art architectures on CIFAR-10.

| Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|
| ResNet-18 He et al. (2016) | 3.53 | 11.1 | – | Manual |
| DenseNet Huang et al. (2017) | 4.77 | 1.0 | – | Manual |
| SENet Hu et al. (2018) | 4.05 | 11.2 | – | Manual |
| NASNet-A Zoph et al. (2018) | 2.65 | 3.3 | 1800 | RL |
| AmoebaNet-A Real et al. (2018) | 3.34 | 3.2 | 3150 | Evolution |
| PNAS Liu et al. (2018a) | 3.41 | 3.2 | 225 | SMBO |
| ENAS Pham et al. (2018) | 2.89 | 4.6 | 0.5 | RL |
| Path-level NAS Cai et al. (2018b) | 3.64 | 3.2 | 8.3 | RL |
| DARTS (first order) Liu et al. (2018b) | 2.94 | 3.1 | 1.5 | Gradient-based |
| DARTS (second order) Liu et al. (2018b) | 2.83 | 3.4 | 4 | Gradient-based |
| Random Sample Liu et al. (2018b) | 3.49 | 3.1 | – | – |
| DDPNAS | 2.58 | 3.4 | 0.06 | Pruning |
| DDPNAS (large) | 1.9 | 4.8 | 0.06 | Pruning |
4.1.2 Results on CIFAR-10
We compare our method with both manually designed networks and other NAS networks. The manually designed networks include ResNet He et al. (2016), DenseNet Huang et al. (2017), and SENet Hu et al. (2018). For NAS networks, we classify them according to the search method: RL methods (NASNet Zoph et al. (2018), ENAS Pham et al. (2018), and Path-level NAS Cai et al. (2018b)), evolutionary algorithms (AmoebaNet Real et al. (2018)), Sequential Model-Based Optimization (SMBO) (PNAS Liu et al. (2018a)), and gradient-based methods (DARTS Liu et al. (2018b)).

The summarized results for convolutional architectures on CIFAR-10 are presented in Tab. 1. In addition, we define an enhanced training variant, where a larger network with more initial channels is trained for more epochs with AutoAugment Cubuk et al. (2018) and dropout Srivastava et al. (2014). It is worth noting that the proposed method outperforms the state-of-the-art methods Zoph et al. (2018); Liu et al. (2018b) in accuracy with much less computational cost than the many GPU days in Zoph et al. (2018). We attribute our superior results to our novel way of solving the problem with pruning, as well as the fast learning procedure: the network architecture can be directly obtained from the distribution when it converges. On the contrary, previous methods Zoph et al. (2018) evaluate architectures only when the training process is complete, which is highly inefficient. Another notable observation in Tab. 1 is that, even with random sampling in the search space, the test error rate reported in Liu et al. (2018b) is comparable with previous methods in the same search space. We can therefore conclude that the high performance of previous methods comes partially from a search space that is carefully and manually designed with specific expert knowledge. Meanwhile, the proposed method quickly explores the search space and generates a better architecture. We also report the results of hand-crafted networks in Tab. 1. Clearly, our method shows a notable enhancement, which indicates its superiority in both resource consumption and test accuracy.
4.1.3 Results on ImageNet
We further compare our method under the mobile setting on ImageNet to demonstrate the generalization capability. The best architecture obtained by our algorithm on CIFAR-10 is transferred to ImageNet, following the same experimental settings as Zoph et al. (2018); Pham et al. (2018); Cai et al. (2018b). Results in Tab. 2 show that the best cell architecture on CIFAR-10 is transferable to ImageNet. The proposed method achieves comparable accuracy to state-of-the-art methods Zoph et al. (2018); Real et al. (2018); Liu et al. (2018a); Pham et al. (2018); Liu et al. (2018b); Cai et al. (2018b) while using much less computational resources.
Table 2: Comparison under the mobile setting on ImageNet.

| Architecture | Top-1 (%) | Top-5 (%) | Params (M) | Search Cost (GPU days) | Search Method |
|---|---|---|---|---|---|
| MobileNetV2 Sandler et al. (2018) | 72.0 | 91.0 | 3.4 | – | Manual |
| ShuffleNetV2 2x (V2) Ma et al. (2018) | 73.7 | – | 5 | – | Manual |
| NASNet-A Zoph et al. (2018) | 74.0 | 91.6 | 5.3 | 1800 | RL |
| AmoebaNet-A Real et al. (2018) | 74.5 | 92.0 | 5.1 | 3150 | Evolution |
| AmoebaNet-C Real et al. (2018) | 75.7 | 92.4 | 6.4 | 3150 | Evolution |
| PNAS Liu et al. (2018a) | 74.2 | 91.9 | 5.1 | 225 | SMBO |
| DARTS Liu et al. (2018b) | 73.1 | 91.0 | 4.9 | 4 | Gradient-based |
| DDPNAS (Ours) | 74.3 | 91.8 | 4.51 | 0.06 | Pruning |
Table 3: Results of direct search on ImageNet.

| Model | Top-1 (%) | Search Time (GPU days) | GPU Latency |
|---|---|---|---|
| MobileNetV2 | 72.0 | – | 6.1 ms |
| ShuffleNetV2 | 72.6 | – | 7.3 ms |
| Proxyless (GPU) Cai et al. (2018c) | 74.8 | 4 | 5.1 ms |
| Proxyless (CPU) Cai et al. (2018c) | 74.1 | 4 | 7.4 ms |
| DDPNAS | 75.2 | 2 | 6.09 ms |
4.2 Search on ImageNet
The minimal time and GPU memory consumption make applying our algorithm directly to ImageNet feasible. We further conduct a search experiment on ImageNet following Cai et al. (2018c). In particular, we employ a set of mobile convolutional layers with various kernel sizes and expansion ratios. To further accelerate the search, we directly use the networks with the CPU and GPU structures obtained in Cai et al. (2018c). In this way, the zero and identity layers in the search space are abandoned, and we only search the hyperparameters related to the convolutional layers.
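As a hypothetical illustration of such a per-layer candidate set, the sketch below enumerates (kernel size, expansion ratio) pairs for mobile inverted bottleneck layers. The specific values `[3, 5, 7]` and `[3, 6]` are common MobileNet-style choices used here only as assumed placeholders; the text above does not list the exact sets searched.

```python
from itertools import product

# Hypothetical candidate sets for mobile convolutional layers; the paper's
# exact kernel sizes and expansion ratios may differ.
KERNEL_SIZES = [3, 5, 7]
EXPAND_RATIOS = [3, 6]

def mbconv_candidates():
    """Enumerate the (kernel_size, expand_ratio) operation candidates
    that the categorical distribution of one layer would range over."""
    return list(product(KERNEL_SIZES, EXPAND_RATIOS))
```

Each layer's categorical distribution would then have one probability per candidate pair, pruned down over training as in Sec. 3.2.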
On ImageNet, we keep the same search hyperparameters as on CIFAR-10. We follow the training settings in Cai et al. (2018c), training the models with a learning rate annealed down to zero following a cosine schedule and a large batch size across multiple Tesla V100 GPUs. Experimental results are reported in Tab. 3, where our DDPNAS achieves superior performance compared to both human-designed and automatically searched architectures with much less computational cost.
5 Conclusion
In this paper, we presented DDPNAS, the first pruning-based architecture search algorithm based on dynamic distributions for convolutional networks, which reduces the search time by pruning the search space in the early training stage. DDPNAS drastically reduces the computational cost while achieving excellent model accuracies on CIFAR-10 and ImageNet compared with other NAS methods. Furthermore, DDPNAS can directly search on ImageNet, outperforming human-designed networks and other NAS methods under the mobile setting.
References

Back [1996] Thomas Back. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, 1996.
Baker et al. [2016] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
Cai et al. [2018a] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018a.
Cai et al. [2018b] Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Path-level network transformation for efficient architecture search. arXiv preprint arXiv:1806.02639, 2018b.
Cai et al. [2018c] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018c.
Chen et al. [2018] Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems, pages 8713–8724, 2018.
Cubuk et al. [2018] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
DeVries and Taylor [2017] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
Elsken et al. [2018] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Hu et al. [2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
Liu et al. [2018a] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision, pages 19–34, 2018a.
Liu et al. [2019] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. arXiv preprint arXiv:1901.02985, 2019.
Liu et al. [2018b] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018b.
Ma et al. [2018] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision, pages 116–131, 2018.
Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
Pham et al. [2018] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
Real et al. [2018] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
Xie and Yuille [2017] Lingxi Xie and Alan Yuille. Genetic CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1379–1388, 2017.
Ying et al. [2019] Chris Ying, Aaron Klein, Esteban Real, Eric Christiansen, Kevin Murphy, and Frank Hutter. NAS-Bench-101: Towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635, 2019.
Zela et al. [2018] Arber Zela, Aaron Klein, Stefan Falkner, and Frank Hutter. Towards automated deep learning: Efficient joint neural architecture and hyperparameter search. arXiv preprint arXiv:1807.06906, 2018.
Zoph and Le [2016] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
Zoph et al. [2018] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.