1 Introduction
Deep learning has achieved significant success in classification [15, 12], retrieval [48, 47] and detection [42, 46, 20]. To this end, neural architecture search (NAS) aims to automatically discover a suitable neural network architecture by exploring a tremendous architecture search space, and has shown remarkable performance over manual designs in various computer vision tasks [9, 51, 28, 52, 10, 26]. Despite the extensive success, previous methods still demand intensive computational resources, which severely restricts their application prospects and flexibility. For instance, reinforcement learning (RL) based methods [52, 51] search a suitable architecture on CIFAR10 by training and evaluating more than architectures by using GPUs over days. For another instance, the evolutionary algorithm (EA) based method in [34] needs GPU days to find an optimal architecture on CIFAR10.
A NAS method generally consists of three components, i.e., search space, search strategy and performance estimation. As established by [51], the cell-based search space is now well adopted [49, 44, 11, 31, 34, 33, 51, 52]; it is predefined and fixed during the architecture search to ensure a fair comparison among different NAS methods. On the other hand, as illustrated in Fig. 1, different search strategies (RL or EA) have similar runtime (after subtracting the performance estimation cost), which can also be well accelerated with GPU packages. Therefore, the major computational consumption of NAS lies in the performance estimation (PE) step, as validated in Fig. 1. However, few works have been devoted to the efficiency issue of PE, which is crucial to cope with the explosive growth of dataset size and model complexity. Moreover, it is highly desirable to conduct fast architecture search under different datasets for deployment in emerging applications like self-driving cars [5].
In this paper, we propose a novel and efficient performance estimation under the resource-constrained regime, termed budgeted performance estimation (BPE), which is the first of its kind in the NAS community. BPE essentially controls the hyperparameters of training, network design and dataset processing, such as the number of channels, the number of layers, the learning rate and the image size. Rather than pursuing model precision on a specific dataset, BPE aims to preserve, as far as possible, the relative performance order of different neural architectures in a specific architecture space. In other words, a good network structure should still rank relatively high under an accurate BPE. We argue that the lack of an accurate and efficient BPE remains the main barrier to the wide usage of NAS research. However, finding an accurate and effective BPE is extremely challenging compared to other black-box optimization problems. First, BPE needs to carefully deal with both discrete (like layers or channels) and continuous (like learning rate) hyperparameters. Second, evaluating a specific BPE needs to train a large number of neural networks, e.g., networks in the cell-based architecture search space [28].
As implicitly employed in previous NAS methods [52, 22, 33, 28, 7, 31], most BPE methods only leverage intuitive tricks including early stopping [52], dataset sampling [22] and lower-resolution datasets, or use a proxy search network with fewer filters per layer and fewer cells [52, 28]. While such methods can reduce the computational cost to a certain extent (the search is still time consuming [52, 34]), they also introduce noise into PE, which tends to underestimate the performance of an architecture. Little work investigates the relative performance rank between approximated evaluations and full evaluations, even though such approximation is traditionally treated as a well-founded trick [28, 52, 34]. However, as subsequently validated in Sec. 5, such a relative rank can change dramatically under a tiny difference in the training condition.
In this paper, we present a unified, fast and effective framework, termed Minimum Importance Pruning (MIP), to find an optimal BPE on a specific architecture search space such as the cell-based search space [49, 44, 11, 31, 34, 33, 51, 52], as illustrated in Fig. 2. In particular, for a given large-scale hyperparameter search space, we first sample examples with the lowest time consumption. The sampled examples are then used to estimate the hyperparameter importance using a random forest [6, 16]. The hyperparameter of the lowest importance is set to the value with the minimum time cost. The algorithm stops when every hyperparameter is set. The contributions of this paper include:

It is the first work to systematically investigate the performance estimation in NAS under the resource-constrained regime. We seek an optimal budgeted PE (BPE) by designing a Spearman correlation loss function on a group of key hyperparameters.

A novel hyperparameter optimization method, termed Minimum Importance Pruning (MIP), is proposed, which is effective for black-box optimization with an extremely time-consuming evaluation step.

The proposed MIP-BPE generalizes well to various architecture search methods, including Reinforcement Learning (RL), Evolutionary Algorithms (EA), Random Search (RS) and DARTS. MIP-BPE achieves remarkable performance on both CIFAR10 and ImageNet, while accelerating the search process by .
2 Related Work
2.1 Performance Estimation in NAS
Performance estimation refers to estimating the performance of a specific architecture in the architecture search space. A conventional option is to perform a standard training and validation process of this architecture on the dataset, which is computationally expensive and limits the number of architectures that can be explored. To accelerate performance estimation, most NAS methods only rely on simple intuitive tricks such as early stopping [52], dataset sampling [22], lower-resolution datasets, or a proxy search network with fewer filters and fewer cells [52, 28].
Another way of estimating the architecture performance is offered by one-shot based methods [50, 28, 1], which consider each individual in the search space as a subgraph sampled from a supergraph. In this way, they accelerate the search process by parameter sharing [31]. Chen et al. [11] proposed to progressively grow the depth of searched architectures during the training procedure. Xu et al. [44] presented a partially connected method that samples a small part of the supernet to reduce the redundancy in the network space, thereby performing a more efficient search without compromising the performance. However, these methods do not deeply investigate the influence of different hyperparameters, which introduces large noise, as validated in Sec. 5.
2.2 Hyperparameter Optimization
Hyperparameter optimization [41] aims to automatically optimize the hyperparameters during the learning process [4, 19, 39, 45]. To this end, grid search and random search [4] are the two simplest and most straightforward approaches. Note that these methods do not make use of experience (the examples sampled in the search process). Subsequently, sequential model-based optimization (SMBO) [19] was proposed to learn a proxy function from the experience and estimate the performance for unknown hyperparameters. As one of the most popular methods, Bayesian optimization [39] fits a Gaussian process to the sampled examples, and then decides the best hyperparameter for the next trial by maximizing the corresponding improvement function.
However, all these methods mostly deal with hyperparameters for particular machine learning models, and cannot handle the optimization of BPE with such an expensive evaluation step. Different from the previous methods, we estimate the importance of the hyperparameters by sampling examples with the minimum time consumption, and then prune the hyperparameters of minimum importance in the next iteration, which is effective and efficient for finding the optimal BPE.
3 Preliminaries
3.1 NAS Pipeline
Given a training set, conventional NAS algorithms [52, 49, 25] first sample an architecture in the predefined search space by a certain search strategy like Reinforcement Learning (RL) or Evolutionary Algorithm (EA). Then the sampled neural architecture is passed to the performance estimation (PE), which returns the performance of the architecture to the search algorithm.
In most NAS methods [49, 28, 44], PE is accelerated by using a group of lower-cost hyperparameters (like a smaller image size, fewer channels and a shallower network) in the search space , termed budgeted PE (BPE), which contains several kinds of training hyperparameters including the number of training epochs, batch size, learning rate, the number of layers, floating-point precision, channels, cutout [13] and image size. For instance, Liu et al. [28] proposed to estimate the performance of an architecture on a small network of layers trained for epochs, with batch size and initial number of channels . After the search process, the optimal neural architecture is then evaluated with a full, time-consuming training hyperparameter set . In the existing works [49, 28, 44], controls the final evaluation hyperparameters of the optimal architecture, i.e., a large network of layers is trained for epochs with a batch size of and an additional regularization such as cutout [13].
However, in this pipeline, the BPE and the final evaluation phase are decoupled. There is no guarantee that the BPE is correlated with the final evaluation step, i.e., the same architectures may have large ranking distances under different training conditions. Most NAS methods [28, 52] intuitively construct the BPE with fewer channels or layers. Nevertheless, extensive experiments in Sec. 5 show that the effectiveness of BPE is very sensitive to such choices, which means that the corresponding hyperparameters in NAS need to be carefully selected and analyzed. Indeed, we believe, and validate in Sec. 5, that BPE is a crucial component, while unfortunately no existing works are devoted to this area.
3.2 Cell based Architecture Search Space
As mentioned in Sec. 1, BPE aims to find optimal training hyperparameters on a specific architecture search space. In this paper, we follow the widely-used cell-based architecture search space in [49, 44, 11, 31, 34, 33, 51, 52, 50]: A network consists of a predefined number of cells [51], which can be either normal cells or reduction cells. Each cell takes the outputs of the two previous cells as input. A cell is a fully-connected directed acyclic graph (DAG) of nodes, i.e., . Each node takes the dependent nodes as input, and generates an output through a sum operation
Here each node is a specific tensor (e.g., a feature map in convolutional neural networks) and each directed edge between and denotes an operation , which is sampled from the corresponding operation search space . Note that the constraint ensures no cycles in a cell. Each cell takes the outputs of two dependent cells as input, and the two input nodes are set as and for simplicity. Following [28], the operation search space consists of operations: dilated convolution with rate , dilated convolution with rate , depthwise separable convolution, depthwise separable convolution, max pooling, average pooling, no connection (zero), and a skip connection (identity). Therefore, the size of the whole search space is , where is the set of possible edges with intermediate nodes in the fully-connected DAG. In our case with the total number of cell structures in the search space is , which is an extremely large space to search.
4 The Proposed Method
In this section, we first describe the formal setting of BPE in Sec. 4.1. We then present the proposed minimum importance pruning (MIP) to find the optimal BPE.
4.1 Budgeted Performance Estimation
The performance estimation is a training algorithm with hyperparameters in a domain . Given an architecture set sampled from , we address the following optimization problem:
(1) 
where , calculates the Spearman Rank Correlation between and . and are the performance on validation set of every architecture in with full training hyperparameter and BPE parameter , respectively. We aim to find the optimal with less average training consumption on .
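To make the objective concrete, the following minimal sketch computes a Spearman rank correlation between two score vectors: one obtained with full training and one obtained with a budgeted setting. The function names and the accuracy values are hypothetical and only illustrate the quantity being maximized; they are not taken from the paper's implementation.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(full_scores, budget_scores):
    """Spearman correlation = Pearson correlation of the two rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(full_scores), average_ranks(budget_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical validation accuracies of five architectures.
full = [94.1, 95.3, 93.2, 96.0, 94.8]      # full training setting
budget = [71.0, 74.5, 69.9, 73.8, 72.2]    # budgeted setting
score = spearman(full, budget)             # two adjacent ranks swapped -> 0.9
```

A budgeted setting is only useful for search if this value stays high: absolute accuracies may drop sharply under the budget, as long as the ordering of architectures is preserved.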
Optimizing Eq. 1 is extremely challenging, as we need to train over architectures to validate one example in . This large set of models to be trained and evaluated prevents most NAS methods from being widely deployed. Fortunately, Radosavovic et al. [32] observed that sampling about models from a given architecture search space is sufficient to perform robust estimation, which is also validated in our work. Specifically, we randomly sample neural architectures in the cell-based architecture search space to construct the architecture set . Then and are obtained by training and validating every architecture in with the hyperparameters and , respectively.
4.2 Minimum Importance Pruning
Although the time consumption of the validation step has been drastically reduced, it is still very difficult to optimize Eq. 1. For example, in our evaluation, the average training time of an architecture from for different hyperparameters is hours on the CIFAR10 benchmark. In this case, each example needs to train networks, and the time consumption for one BPE example is hours. Such a time consumption still makes it difficult to find an optimal BPE efficiently.
To handle this issue, we propose minimum importance pruning (MIP), as illustrated in Fig. 2. We first sample the hyperparameter examples around the lowest time cost. Then the sampled examples are trained to estimate the hyperparameter importance by using a random forest [16, 6]. After that, the hyperparameter with the lowest importance is pruned by setting it to the value with the minimum time cost. The pruning step stops when there is only one hyperparameter left in the search space, and the optimal BPE is the example with the maximum .
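The first step, sampling around the lowest time cost, can be illustrated with a small sketch. It assumes that every candidate value comes with a relative cost (for instance a FLOPs estimate measured with all other hyperparameters at their cheapest value) and that the sampling probability is inversely proportional to this cost. The inverse-cost form, the search space and all names below are assumptions made for illustration, not the paper's exact distribution.

```python
import random


def cost_aware_probs(costs):
    """Categorical distribution that favours cheap values (assumed: p proportional to 1/cost)."""
    inverse = [1.0 / c for c in costs]
    total = sum(inverse)
    return [v / total for v in inverse]


def sample_config(space, rng):
    """Draw one value per hyperparameter independently from its cost-aware distribution."""
    config = {}
    for name, candidates in space.items():
        values = [value for value, _ in candidates]
        probs = cost_aware_probs([cost for _, cost in candidates])
        config[name] = rng.choices(values, weights=probs, k=1)[0]
    return config


# Hypothetical search space: (value, relative cost) pairs per hyperparameter.
space = {
    "epochs": [(10, 1.0), (30, 3.0), (50, 5.0)],
    "channels": [(8, 1.0), (16, 4.0)],
    "image_size": [(16, 1.0), (32, 4.0)],
}
rng = random.Random(0)
samples = [sample_config(space, rng) for _ in range(1000)]
```

Because each hyperparameter is drawn independently, the joint probability of a configuration is the product of the per-hyperparameter probabilities, so cheap configurations dominate the sampled set while expensive ones are still visited occasionally.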
Lowest time cost sampling. For each element in , we introduce a category distribution related to the computational cost:
(2) 
where denotes the th element of the th hyperparameter . The function is the number of floating point operations. We set with the th element and fix other hyperparameters in by the value with the minimum time cost. An example is generated by sampling the joint probability in Eq. 2, e.g., . Then, we obtain by training every architecture in using the sampled , and the objective is calculated with by using Eq. 1.
Random forest training. After repeating previous steps over times, we get a set with different BPEs and corresponding objective values, which is used as a training set for the random forest. In random forest, each tree is built from a set drawn with a replacement sampling from . Training random forest is to train multiple regression trees. Given a training set with and the corresponding Spearman rank correlation vector sampled from , a regression tree in the random forest recursively partitions the space such that the examples in with similar values are grouped together. When training the regression tree, we need to consider how to measure and choose the partition feature (hyperparameter in our case). Specifically, let the data at node be represented by . For each candidate partition consisting of hyperparameter and threshold , we partition the data into and subsets as follows:
(3)
We further define the impurity function for a given split set as
(4) 
where , denotes the number of examples in set . And the impurity for a specific partition is the weighted sum of the impurity function:
(5) 
We adopt an exhaustive search to find the optimal partition, that is, we iterate through all possible partitions and select the partition with the minimum impurity:
(6) 
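The exhaustive partition search can be sketched as follows. The sketch assumes that the impurity of a subset is its mean squared deviation from the subset mean, which is the usual choice for regression trees; this choice, the helper names and the toy data are assumptions for illustration.

```python
def mse(targets):
    """Impurity of a subset: mean squared deviation from the subset mean."""
    if not targets:
        return 0.0
    mu = sum(targets) / len(targets)
    return sum((t - mu) ** 2 for t in targets) / len(targets)


def best_split(rows, targets):
    """Exhaustively search (hyperparameter, threshold) pairs for the lowest weighted impurity.

    rows: list of dicts mapping hyperparameter name -> numeric value.
    targets: objective value (e.g. a rank correlation) observed for each row.
    """
    n = len(rows)
    best = None  # (weighted impurity, hyperparameter name, threshold)
    for feature in rows[0]:
        # Candidate thresholds: every observed value except the largest,
        # so that both sides of the split are non-empty.
        for threshold in sorted({r[feature] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
            weighted = len(left) / n * mse(left) + len(right) / n * mse(right)
            if best is None or weighted < best[0]:
                best = (weighted, feature, threshold)
    return best


# Hypothetical data: the objective depends strongly on "epochs", barely on "channels".
rows = [
    {"epochs": 10, "channels": 8},
    {"epochs": 10, "channels": 16},
    {"epochs": 30, "channels": 8},
    {"epochs": 30, "channels": 16},
]
targets = [0.30, 0.32, 0.60, 0.62]
impurity, feature, threshold = best_split(rows, targets)
```

On this toy data the search selects the split on `epochs` at threshold 10, because it separates the low and high objective values almost perfectly, whereas a split on `channels` leaves both subsets mixed.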
Hyperparameter importance. For every node in the regression tree, we calculate the parameter importance as the decrease in node impurity, which is weighted by the number of samples that reach the node. The parameter importance for node is defined as:
(7) 
The importance for each is the summation of the importance through the node in the random forest, which uses as the partition parameter:
(8) 
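One common way to write these two quantities is the mean-decrease-in-impurity form used by standard random forest implementations. The notation below is an assumption rather than the paper's own: $N$ is the total number of samples, $N_t$ the number of samples reaching node $t$, $H(\cdot)$ the impurity, $t_L$ and $t_R$ the children of $t$, and $v(t)$ the hyperparameter used to split node $t$.

```latex
\begin{align}
I(t) &= \frac{N_t}{N}\left( H(t) - \frac{N_{t_L}}{N_t}\, H(t_L) - \frac{N_{t_R}}{N_t}\, H(t_R) \right), \\
I(\theta_i) &= \sum_{t \,:\, v(t) = \theta_i} I(t).
\end{align}
```

Under this form, a hyperparameter that is never chosen as a split variable receives zero importance, and a split near the root counts more than an equally pure split deep in the tree because more samples reach it.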
Parameter pruning. After the importance estimation process in Eq. 8, the hyperparameter with the lowest probability is pruned by setting
(9) 
is the value of the lowest FLOPs in hyperparameter when . Otherwise, is the corresponding parameter value with the maximum in . The pruning step significantly improves the search efficiency. By setting the less important hyperparameter to a value with less resource consumption, we can allocate more computational resource on important parameters. Our minimum importance pruning algorithm is presented in Alg. 1.
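The overall pruning loop can be sketched in a few lines. To stay self-contained, this sketch makes three simplifications that are assumptions rather than the paper's procedure: the random forest is replaced by the best single-split variance reduction per hyperparameter, configurations are sampled uniformly instead of around the lowest cost, and a pruned hyperparameter is always fixed to its cheapest value. The objective function is a hypothetical stand-in for "train the architecture set and compute the rank correlation".

```python
import random


def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def importance(samples, scores, name):
    """Stand-in for random-forest importance: best single-split variance reduction."""
    n = len(samples)
    base = variance(scores)
    best = 0.0
    for thr in sorted({s[name] for s in samples})[:-1]:
        left = [y for s, y in zip(samples, scores) if s[name] <= thr]
        right = [y for s, y in zip(samples, scores) if s[name] > thr]
        gain = base - len(left) / n * variance(left) - len(right) / n * variance(right)
        best = max(best, gain)
    return best


def minimum_importance_pruning(space, evaluate, n_samples, rng):
    """space: name -> candidate values sorted from cheapest to most expensive."""
    space = {k: list(v) for k, v in space.items()}
    history = []  # (config, score) pairs collected across all rounds
    free = [k for k, v in space.items() if len(v) > 1]
    while len(free) > 1:
        samples = [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_samples)]
        scores = [evaluate(c) for c in samples]
        history.extend(zip(samples, scores))
        # Prune the least important free hyperparameter to its cheapest value.
        weakest = min(free, key=lambda k: importance(samples, scores, k))
        space[weakest] = space[weakest][:1]
        free.remove(weakest)
    # The returned setting is the evaluated example with the highest objective.
    return max(history, key=lambda pair: pair[1])


def toy_objective(config):
    # Hypothetical objective: depends on epochs and layers, not on channels.
    return 0.01 * config["epochs"] + 0.001 * config["layers"]


space = {"epochs": [10, 30, 50], "layers": [6, 8, 16], "channels": [8, 16]}
best_config, best_score = minimum_importance_pruning(
    space, toy_objective, n_samples=200, rng=random.Random(0)
)
```

Each round shrinks the search space by one hyperparameter, so later rounds spend their evaluations only on the hyperparameters that matter most for the objective.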
Table 1: Hyperparameter settings of the found BPE1 and BPE2, compared with DARTS [28].
Hyperparameter  BPE1  BPE2  DARTS [28]
Epoch  10  30  50
Batch size  128  128  64
Learning rate  0.03  0.03  0.025
N_Layers  6  16  8
Channels  8  16  16
Image Size  16  16  32
Correlation  0.50  0.63  0.57
Training Time  0.08  0.55  1.38
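For readers who want to reuse these settings, the rows of Table 1 can be written down as a small configuration object. The class and field names are hypothetical; the values are copied from the table, and the training time is kept in the table's own (unstated) unit, so only ratios are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EstimationSetting:
    epochs: int
    batch_size: int
    learning_rate: float
    n_layers: int
    channels: int
    image_size: int
    correlation: float
    training_time: float  # same unit as the "Training Time" row of Table 1


SETTINGS = {
    "BPE1": EstimationSetting(10, 128, 0.03, 6, 8, 16, 0.50, 0.08),
    "BPE2": EstimationSetting(30, 128, 0.03, 16, 16, 16, 0.63, 0.55),
    "DARTS": EstimationSetting(50, 64, 0.025, 8, 16, 32, 0.57, 1.38),
}


def speedup(baseline, candidate):
    """Ratio of training times, i.e. how many times cheaper `candidate` is."""
    return SETTINGS[baseline].training_time / SETTINGS[candidate].training_time


bpe1_speedup = speedup("DARTS", "BPE1")
bpe2_speedup = speedup("DARTS", "BPE2")
```

With the values of Table 1, BPE1 is about 17 times cheaper than the DARTS setting at a slightly lower correlation, while BPE2 is about 2.5 times cheaper at a higher correlation.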
Table 2: Comparison with manually designed and NAS-found architectures on CIFAR10.
Architecture  Test Error (%)  Params (M)  Search Cost (GPU days)  Search Method
ResNet18 [15]  3.53  11.1  -  Manual
DenseNet [18]  4.77  1.0  -  Manual
SENet [17]  4.05  11.2  -  Manual
NASNetA [52]  2.65  3.3  1800  RL
ENAS [31]  2.89  4.6  0.5  RL
Path-level NAS [8]  3.64  3.2  8.3  RL
RL+BPE1 (Ours)  2.66 ± 0.05  2.7  0.33  RL
RL+BPE2 (Ours)  2.65 ± 0.12  2.9  2  RL
AmoebaNetB [34]  2.55  2.8  3150  Evolution
EA+BPE1 (Ours)  2.68 ± 0.09  2.46  0.33  Evolution
EA+BPE2 (Ours)  2.66 ± 0.07  2.87  2  Evolution
DARTS [28]  2.7 ± 0.01  3.1  1.5  Gradient-based
GDAS [14]  2.93 ± 0.07  3.4  0.8  Gradient-based
PDARTS [11]  2.75 ± 0.06  3.4  0.3  Gradient-based
SNAS [43]  2.85 ± 0.02  2.8  1.5  Gradient-based
DARTS + BPE1 (Ours)  2.89 ± 0.0  3.9  0.05  Gradient-based
DARTS + BPE2 (Ours)  2.72 ± 0.0  4.04  0.33  Gradient-based
Random Sample 100  2.55  2.9  108  Random Search
Random Sample 100 + BPE1 (Ours)  2.68 ± 0.09  2.7  0.33 (337)  Random Search
Random Sample 100 + BPE2 (Ours)  2.68 ± 0.05  1.9  2 (54)  Random Search
5 Experiment
As mentioned before, the average time consumption for evaluating BPE examples is GPU hours, which means that with a similar sample magnitude (), methods such as Bayesian optimization or random search need about GPU hours (almost infeasible). In contrast, our method needs only GPU hours. Therefore, we do not compare with these methods in our paper.
We first combine BPE with different search strategies including Reinforcement Learning (RL) [21], Evolutionary Algorithm (EA) [2], Random Search (RS) [4] and Differentiable Architecture Search (DARTS) [28]. As shown in Sec. 5.1, we compare with state-of-the-art methods in terms of both effectiveness and efficiency using CIFAR10 [23] and ImageNet [35]. In Sec. 5.2, we investigate the effect of each hyperparameter in BPE, as well as the efficiency of using Spearman Rank Correlation as the objective function. Although many works [38, 25] pointed out that one-shot based methods [37, 31, 28, 3, 50] could not effectively estimate the performance throughout the entire search space, in Sec. 5.3 we find that these methods are indeed effective in the local search space. This reasonably explains their reproducibility and effectiveness, i.e., the corresponding algorithms are actually able to find good architectures, while the optimal architectures are quite different in different runs due to the local information.
5.1 Comparison with State-of-the-art Methods
We first search neural architectures by using the found BPE1 and BPE2 in Tab. 1, and then evaluate the best architecture with a stacked deeper network. To ensure the stability of the proposed method, we run each experiment times and find that the resulting architectures only show a slight variance in performance.
5.1.1 Experimental Settings
We use the same datasets and evaluation metrics as existing NAS methods [28, 8, 52, 27]. First, most experiments are conducted on CIFAR10 [24], which has K training images and K testing images from 10 classes with a resolution . During the architecture search, we randomly select K images from the training set as a validation set. To further evaluate the generalization capability, we stack the optimal cell discovered on CIFAR10 into a deeper network, and then evaluate the classification accuracy on ILSVRC 2012 [35], which consists of classes with M training images and K validation images. Here, we consider the mobile setting, where the input image size is and the number of FLOPs is less than 600M.
In the search process, we directly use the found BPE1 and BPE2 in Tab. 1 as the performance estimation with other search algorithms. After finding the optimal architecture in the search space, we validate the final accuracy on a large network of cells, which is trained for epochs with a batch size of and additional regularization such as cutout [13], similar to [28, 52, 31]. When stacking cells to evaluate on ImageNet, we use two initial convolutional layers of stride before stacking cells with the scale reduction at the st, nd, th and th cells. The total number of FLOPs is determined by the initial number of channels. The network is trained for 250 epochs with a batch size of 512, a weight decay of, and an initial SGD learning rate of 0.1. All the experiments and models are implemented in PyTorch [30].
Table 3: Comparison with state-of-the-art models on ImageNet under the mobile setting.
Model  Top-1 (%)  Params (M)  Search time (GPU days)
MobileNetV2 [36]  72.0  3.4  -
ShuffleNetV2 2x (V2) [29]  73.7  5  -
NASNetA [52]  74.0  5.3  1800
AmoebaNetA [34]  74.5  5.1  3150
MnasNet92 [40]  74.8  4.4  -
SNAS [43]  72.7  4.3  522
DARTS [28]  73.1  4.9  4
RL + BPE1 (Ours)  74.18  5.5  0.33
EA + BPE1 (Ours)  74.56  5.0  0.33
RS + BPE1 (Ours)  74.2  5.5  0.33
DARTS [28] + BPE1 (Ours)  74.0  5.9  0.05
5.1.2 Result on CIFAR10
We compare our method with both manually designed networks and NAS networks. The manually designed networks include ResNet [15], DenseNet [18] and SENet [17]. We evaluate on four categories of NAS methods, i.e., RL methods (NASNet [52], ENAS [31] and Path-level NAS [8]), evolutionary algorithms (AmoebaNet [34]), gradient-based methods (DARTS [28]) and Random Search.
The results for convolutional architectures on CIFAR10 are presented in Tab. 2. It is worth noting that the found BPE, combined with various search algorithms, outperforms various state-of-the-art search algorithms [52, 28, 33] in accuracy, with much lower computational consumption (only GPU days in [34]). We attribute our superior results to the found BPE. Another notable observation from Tab. 2 is that, even with random search in the search space, the test error rate is only %, which outperforms previous methods in the same search space. In conclusion, with the found BPE, search algorithms can quickly explore the architecture search space and generate a better architecture. We also report the results of handcrafted networks in Tab. 2. Clearly, our method shows a notable improvement over them.
5.1.3 Results on ImageNet
We further compare our method under the mobile setting on ImageNet to demonstrate its generalizability. The best architecture on CIFAR10 is transferred to ImageNet, following the same experimental settings as in [52, 31, 8]. Results in Tab. 3 show that the best cell architecture on CIFAR10 is transferable to ImageNet. The proposed method achieves comparable accuracy to the state-of-the-art methods [52, 34, 27, 31, 28, 8] while using far less computational resources, e.g., times faster compared to EA, and times faster compared to RL.
5.2 Deep Analysis in Performance Estimation
We further study the effectiveness of using Spearman Rank Correlation as the objective function in Fig. 3. In Fig. 4, we also provide a deeper analysis of the importance of every hyperparameter. One can make full use of this analysis to transfer the found BPEs to other datasets and tasks.
We randomly select hyperparameter settings in and apply them to the DARTS [28] search algorithm to find optimal architectures. Fig. 3 illustrates the relationship between and the accuracy of the optimal architecture found by the corresponding setting. The performance is highly correlated to (with a correlation), which demonstrates the effectiveness of the proposed objective function in Eq. 1.
After exploring the BPE space by the proposed method, we get a dataset w.r.t. each and , which is used as the training set to train a random forest regression predictor [16] for each . We then report the estimated by the predictor and the importance for each hyperparameter in Fig. 4. As illustrated in Fig. 4, the steepness and importance are highly correlated, i.e., the more important the parameter is, the steeper the corresponding curve is, and vice versa. At the same time, for the two most important parameters (epoch and layer), we get a high within a small range. This means that we only need to carefully fine-tune these two parameters in a small range when transferring to other datasets.
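Table 4: Global and local correlation under fair and random training at different epochs.
Training  Scope  Epoch 50  Epoch 200  Epoch 400  Epoch 600
Fair  Global  0.10  0.06  0.13  0.03
Fair  Local  0.13  0.50  0.31  0.31
Random  Global  0.10  0.05  0.30  0.19
Random  Local  0.14  0.52  0.61  0.01
Random_10  Global  0.0  0.0  0.02  0.08
Random_10  Local  0.26  0.11  0.57  0.58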
5.3 Understanding Oneshot based Methods
Previous works [25, 38] have reported that one-shot based methods such as DARTS do not work well (in some cases even no better than random search). Two main questions have not been explained yet: (1) One-shot based methods cannot make a good estimate of performance, but they can search for good neural architectures. (2) One-shot based methods are unstable, that is, the found networks differ across random seeds. With the found BPE, we can effectively investigate every search phase in these methods.
To understand and explain these questions, we first train the same hypergraph with different settings: (1) Fair training, where each operation in an edge is trained for exactly the same number of epochs; (2) Random training, where each operation in an edge is trained randomly at different random levels. In Tab. 4, we report the global and local in the case of fair training and random training. The global denotes that we use our trained hypergraph to get the validation performance for the networks in , and then calculate the with . The local is obtained by the following steps: When training the hypergraph, we save the sampled network architectures and the corresponding validation performance at epoch . The local is then obtained by using the found BPE2 and Eq. 1, i.e., BPE2. As illustrated in Tab. 4, one-shot based methods have a poor performance estimation in global , which is consistent with previous works [25, 38]. However, these methods have a high local , which means that these methods are essentially using the local information. That is to say, each epoch in the search phase can only perceive and optimize by using local information, which reasonably explains the instability of DARTS.
6 Conclusion
In this paper, we present the first systematic analysis of the budgeted performance estimation (BPE) in NAS, and propose a minimum importance pruning (MIP) towards optimal PE. The proposed MIP gradually reduces the number of BPE hyperparameters, which allocates more computational resources to the more important hyperparameters. The found MIP-BPE generalizes to various search algorithms, including reinforcement learning, random search, evolutionary algorithms and gradient-based methods. Combining the found BPE with various NAS algorithms, we reach state-of-the-art test error on CIFAR10 with much less search time, which also helps us to better understand the widely-used one-shot based methods.
Acknowledgements.
This work is supported by the Nature Science Foundation of China (No.U1705262, No.61772443, No.61572410, No.61802324 and No.61702136), National Key R&D Program (No.2017YFC0113000, and No.2016YFB1001503), and Nature Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106).
References
[1] (2019) Adaptive stochastic natural gradient method for one-shot neural architecture search. In ICML.
[2] (1996) Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford University Press.
[3] (2018) Understanding and simplifying one-shot architecture search. In ICML.
[4] (2012) Random search for hyper-parameter optimization. JMLR.
[5] (2016) End to end learning for self-driving cars. arXiv.
[6] (2001) Random forests. Machine Learning.
[7] (2018) Efficient architecture search by network transformation. In AAAI.
[8] (2018) Path-level network transformation for efficient architecture search. arXiv.
[9] (2019) Binarized neural architecture search. arXiv preprint arXiv:1911.10862.
[10] (2018) Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS.
[11] (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In ICCV.
[12] (2019) Local to global learning: gradually adding classes for training deep neural networks. In CVPR.
[13] (2017) Improved regularization of convolutional neural networks with cutout. arXiv.
[14] (2019) Searching for a robust neural architecture in four GPU hours. In CVPR.
[15] (2016) Deep residual learning for image recognition. In CVPR.
[16] (2014) An efficient approach for assessing hyperparameter importance. In ICML, pp. 754–762.
[17] (2018) Squeeze-and-excitation networks. In CVPR.
[18] (2017) Densely connected convolutional networks. In CVPR.
[19] (2011) Sequential model-based optimization for general algorithm configuration. In LION.
[20] (2019) Semi-supervised adversarial monocular depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[21] (1996) Reinforcement learning: a survey. JAIR.
[22] (2016) Fast Bayesian optimization of machine learning hyperparameters on large datasets. arXiv.
[23] (2009) Learning multiple layers of features from tiny images. Technical report.
[24] (2009) Learning multiple layers of features from tiny images. Technical report.
[25] (2019) Random search and reproducibility for neural architecture search. arXiv.
[26] (2019) Auto-DeepLab: hierarchical neural architecture search for semantic image segmentation. In CVPR.
[27] (2018) Progressive neural architecture search. In ECCV.
[28] (2019) DARTS: differentiable architecture search. In ICLR.
[29] (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In ECCV.
[30] (2017) Automatic differentiation in PyTorch.
[31] (2018) Efficient neural architecture search via parameter sharing. In ICML.
[32] (2019) On network design spaces for visual recognition. arXiv.
[33] (2019) Aging evolution for image classifier architecture search. In AAAI.
[34] (2019) Regularized evolution for image classifier architecture search. In AAAI.
[35] (2015) ImageNet large scale visual recognition challenge. IJCV.
[36] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR.
[37] (2016) Convolutional neural fabrics. In NeurIPS.
[38] (2019) Evaluating the search phase of neural architecture search. arXiv.
[39] (2012) Practical Bayesian optimization of machine learning algorithms. In NeurIPS, pp. 2951–2959.
[40] (2019) MnasNet: platform-aware neural architecture search for mobile. In CVPR.
[41] (2013) Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In SIGKDD.
[42] (2019) C-MIL: continuation multiple instance learning for weakly supervised object detection. In CVPR, pp. 2199–2208.
[43] (2019) SNAS: stochastic neural architecture search. In ICLR.
[44] (2019) PC-DARTS: partial channel connections for memory-efficient differentiable architecture search. arXiv.
[45] (2018) Towards automated deep learning: efficient joint neural architecture and hyperparameter search. ICML Workshop.
[46] (2019) FreeAnchor: learning to match anchors for visual object detection. In NeurIPS, pp. 147–155.
[47] (2018) Centralized ranking loss with weakly supervised localization for fine-grained object retrieval. In IJCAI, pp. 1226–1233.
[48] (2019) Towards optimal fine grained retrieval via decorrelated centralized loss with normalize-scale layer. In AAAI, Vol. 33, pp. 9291–9298.
[49] (2019) Dynamic distribution pruning for efficient network architecture search. arXiv.
[50] (2019) Multinomial distribution learning for effective neural architecture search. In ICCV.
[51] (2016) Neural architecture search with reinforcement learning. arXiv.
[52] (2018) Learning transferable architectures for scalable image recognition. In CVPR.