To this end, neural architecture search (NAS) aims to automatically discover a suitable neural network architecture by exploring a vast architecture search space, and it has shown remarkable performance over manual designs in various computer vision tasks [9, 51, 28, 52, 10, 26].
Despite this success, previous methods still demand intensive computational resources, which severely restricts their applicability and flexibility. For instance, reinforcement learning (RL) based methods [52, 51] search a suitable architecture on CIFAR10 by training and evaluating more than architectures by using GPUs over days. For another instance, the evolutionary algorithm (EA) based method in needs GPU days to find an optimal architecture on CIFAR10.
A NAS method generally consists of three components, i.e., search space, search strategy and performance estimation. As established by , the cell-based search space is now widely adopted [49, 44, 11, 31, 34, 33, 51, 52], and it is pre-defined and fixed during the architecture search to ensure a fair comparison among different NAS methods. On the other hand, as illustrated in Fig. 1, different search strategies (RL or EA) have similar run-time (after subtracting the performance estimation cost), and they can also be accelerated with GPU packages. Therefore, the major computational cost of NAS lies in the performance estimation (PE) step, as validated in Fig. 1. However, few works have been devoted to the efficiency of PE, which is crucial to cope with the explosive growth of dataset size and model complexity. Moreover, it is highly desirable to conduct fast architecture search on different datasets for deployment in emerging applications like self-driving cars.
In this paper, we propose a novel and efficient performance estimation under the resource-constrained regime, termed budgeted performance estimation (BPE), which is the first of its kind in the NAS community. BPE essentially controls the hyper-parameters of training, network design and dataset processing, such as the number of channels, the number of layers, the learning rate and the image size. Rather than pursuing model precision on a specific dataset, BPE aims to recover, as faithfully as possible, the relative precision order of different neural architectures in a specific architecture space. In other words, a good network structure still ranks relatively high under an accurate BPE. We argue that the lack of an accurate and efficient BPE remains the main barrier to the wide usage of NAS. However, finding an accurate and effective BPE is extremely challenging compared to other black-box optimization problems. First, BPE needs to carefully deal with both discrete (like layers or channels) and continuous (like learning rate) hyper-parameters. Second, evaluating a specific BPE requires training a large number of neural networks, e.g., networks in the cell-based architecture search space .
As implicitly employed in previous NAS methods [52, 22, 33, 28, 7, 31], most BPE methods only leverage intuitive tricks including early stopping , dataset sampling and a lower resolution dataset, or use a proxy search network with fewer filters per layer and fewer cells [52, 28]. While such methods can reduce the computational cost to a certain extent (the search is still time-consuming [52, 34]), they also introduce noise into PE, so that the performance of an architecture is underestimated. Little work investigates the relative performance rank between approximated and full evaluations, even though approximation is traditionally regarded as a well-founded trick [28, 52, 34]. However, as subsequently validated in Sec. 5, such a relative rank can change dramatically under a tiny difference in the training condition.
In this paper, we present a unified, fast and effective framework, termed Minimum Importance Pruning (MIP), to find an optimal BPE on a specific architecture search space such as cell-based search space [49, 44, 11, 31, 34, 33, 51, 52], as illustrated in Fig. 2. In particular, for a given large-scale hyper-parameter search space, we first sample examples with the lowest time consumption. The sampled examples are then used to estimate the hyper-parameter importance using random forest [6, 16]. The hyper-parameter of the lowest importance is set to the value with the minimum time cost. The algorithm stops when every hyper-parameter is set. The contributions of this paper include:
- It is the first work to systematically investigate performance estimation in NAS under the resource-constrained regime. We seek an optimal budgeted PE (BPE) by designing a Spearman correlation loss function on a group of key hyper-parameters.
- A novel hyper-parameter optimization method, termed Minimum Importance Pruning (MIP), is proposed, which is effective for black-box optimization with an extremely time-consuming evaluation step.
- The proposed MIP-BPE generalizes well to various architecture search methods, including Reinforcement Learning (RL), Evolutionary Algorithms (EA), Random Search (RS) and DARTS. MIP-BPE achieves remarkable performance on both CIFAR-10 and ImageNet, while accelerating the search process by.
2 Related Work
2.1 Performance Estimation in NAS
Performance estimation refers to estimating the performance of a specific architecture in the architecture search space. A conventional option is to perform a standard training and validation process of this architecture on the dataset, which is computationally expensive and limits the number of architectures that can be explored. To accelerate performance estimation, most NAS methods only provide simple intuitive cues such as early stopping , dataset sampling  and lower resolution dataset, or using a proxy search network with fewer filters and fewer cells [52, 28].
Another possibility for estimating the architecture performance is one-shot based methods [50, 28, 1], which consider each individual in the search space as a sub-graph sampled from a super-graph. In this way, they accelerate the search process by parameter sharing. Chen et al. proposed to progressively grow the depth of searched architectures during the training procedure. Xu et al. presented a partially connected method that samples a small part of the super-net to reduce the redundancy in the network space, and thereby performs a more efficient search without compromising the performance. However, these methods do not deeply investigate the influence of different hyper-parameters, which introduces large noise as validated in Sec. 5.
2.2 Hyper-parameter Optimization
Hyper-parameter optimization aims to automatically optimize the hyper-parameters during the learning process [4, 19, 39, 45]. To this end, grid search and random search are the two simplest and most straightforward approaches. Note that these methods do not make use of the experience gathered so far (the examples sampled in the search process). Subsequently, sequential model-based optimization (SMBO) was proposed to learn a proxy function from the experience and to estimate the performance for unknown hyper-parameters. As one of the most popular methods, Bayesian optimization fits a Gaussian process to the sampled examples, and then decides the best hyper-parameter for the next trial by maximizing the corresponding improvement function.
However, these methods mostly deal with hyper-parameters of particular machine learning models, and cannot handle the optimization of BPE with such an expensive evaluation step. Different from the previous methods, we estimate the importance of the hyper-parameters by sampling examples with the minimum time consumption. The hyper-parameters with the minimum importance are then pruned in the next iteration, which makes the search for the optimal BPE both effective and efficient.
3.1 NAS Pipeline
Given a training set, conventional NAS algorithms [52, 49, 25] first sample an architecture in the pre-defined search space by a certain search strategy like Reinforcement Learning (RL) or Evolution Algorithm (EA). Then the sampled neural architecture is passed to the performance estimation (PE), which returns the performance of the architecture to the search algorithm.
In most NAS methods [49, 28, 44], PE is accelerated by using a group of lower-cost hyper-parameters (like a smaller image size, fewer channels and a shallower network) in the search space , termed budgeted PE (BPE). It covers various training hyper-parameters including the number of training epochs, batch size, learning rate, the number of layers, floating-point precision, channels, cutout and image size. For instance, Liu et al. proposed to estimate the performance of an architecture on a small network of layers trained for epochs, with batch size and initial number of channels . After the search process, the optimal neural architecture is then evaluated with a full, time-consuming training hyper-parameter set . In the existing works [49, 28, 44], controls the final evaluation hyper-parameters of the optimal architecture, i.e., a large network of layers is trained for epochs with a batch size of and an additional regularization such as cutout.
However, in this pipeline, the BPE and the final evaluation phase are decoupled. There is no guarantee that the BPE is correlated with the final evaluation step, i.e., the same architectures may have large ranking distances under different training conditions. Most NAS methods [28, 52] intuitively build the BPE with fewer channels or layers. Nevertheless, extensive experiments in Sec. 5 show that the effectiveness of BPE is very sensitive to such choices, which means that the corresponding hyper-parameters in NAS must be carefully selected and analyzed. Indeed, we believe, and validate in Sec. 5, that BPE is a crucial component, while unfortunately no existing work is devoted to this area.
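The decoupling described above can be illustrated with a minimal Python sketch. It assumes a toy search space of integer tuples and uses random search as the search strategy; `sample_architecture`, `budgeted_estimate` and `full_evaluation` are hypothetical stand-ins that do not come from the source. The only point is that the search loop sees the budgeted score, while the full evaluation is applied once to the winner.

```python
import random
from typing import Callable, Optional, Tuple

Architecture = Tuple[int, ...]


def search(sample: Callable[[random.Random], Architecture],
           estimate: Callable[[Architecture], float],
           num_trials: int,
           rng: random.Random) -> Tuple[Optional[Architecture], float]:
    """Random-search loop: sample, score with the estimator, keep the best."""
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample(rng)
        score = estimate(arch)  # proxy score handed back to the search strategy
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


# Hypothetical toy problem: an architecture is four operation indices in [0, 7].
def sample_architecture(rng: random.Random) -> Architecture:
    return tuple(rng.randrange(8) for _ in range(4))


def budgeted_estimate(arch: Architecture) -> float:
    # Stand-in for a short, low-cost training run (an assumption, not the paper's BPE).
    return float(sum(arch))


def full_evaluation(arch: Architecture) -> float:
    # Stand-in for the expensive final training; deliberately a different function.
    return float(sum(arch)) + 0.5 * arch[0]


best_arch, proxy_score = search(sample_architecture, budgeted_estimate, 50, random.Random(0))
final_score = full_evaluation(best_arch)  # computed once, after the search has finished
```

Because `budgeted_estimate` and `full_evaluation` are different functions, nothing in the loop guarantees that the architecture with the best proxy score also has the best final score, which is the gap the rank-correlation objective of Sec. 4.1 is meant to measure.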
3.2 Cell based Architecture Search Space
As mentioned in Sec. 1, BPE aims to find optimal training hyper-parameters on a specific architecture search space. In this paper, we follow the widely-used cell-based architecture search space in [49, 44, 11, 31, 34, 33, 51, 52, 50]: A network consists of a pre-defined number of cells , which can be either normal cells or reduction cells. Each cell takes the outputs of the two previous cells as input. A cell is a fully-connected directed acyclic graph (DAG) of nodes, i.e., . Each node takes the dependent nodes as input, and generates an output through a sum operation.
Here each node is a specific tensor (e.g., a feature map in convolutional neural networks) and each directed edge between and denotes an operation , which is sampled from the corresponding operation search space . Note that the constraint ensures no cycles in a cell. Each cell takes the outputs of two dependent cells as input, and the two input nodes are set as and for simplicity. Following , the operation search space consists of operations: dilated convolution with rate , dilated convolution with rate , depth-wise separable convolution, depth-wise separable convolution, max pooling, average pooling, no connection (zero), and a skip connection (identity). Therefore, the size of the whole search space is , where is the set of possible edges with intermediate nodes in the fully-connected DAG. In our case with the total number of cell structures in the search space is , which is an extremely large space to search.
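To make the cell structure concrete, the following Python sketch evaluates a cell in which every intermediate node sums the outputs of operations applied to all earlier nodes, and counts the edges of such a cell. Scalars stand in for tensors. The indexing (cell inputs at positions 0 and 1), the function names and the example values are illustrative assumptions, since the symbols in this copy of the text are missing.

```python
from typing import Callable, Dict, List, Tuple

Op = Callable[[float], float]


def cell_forward(inputs: Tuple[float, float],
                 ops: Dict[Tuple[int, int], Op],
                 num_nodes: int) -> List[float]:
    """Node j is the sum of ops[(i, j)] applied to every earlier node i < j."""
    states = [inputs[0], inputs[1]]  # positions 0 and 1 hold the two cell inputs
    for j in range(2, 2 + num_nodes):
        # Only edges with i < j exist, so the graph cannot contain a cycle.
        states.append(sum(ops[(i, j)](states[i]) for i in range(j)))
    return states[2:]


def num_edges(num_nodes: int) -> int:
    """Edges of the fully-connected DAG: the k-th intermediate node has k + 1 inputs."""
    return sum(k + 1 for k in range(1, num_nodes + 1))


def search_space_size(num_ops: int, num_nodes: int) -> int:
    """Number of ways to assign one of num_ops operations to every edge of one cell."""
    return num_ops ** num_edges(num_nodes)


# Usage with identity operations on every edge and two intermediate nodes.
identity_ops = {(i, j): (lambda x: x) for j in range(2, 4) for i in range(j)}
outputs = cell_forward((1.0, 2.0), identity_ops, num_nodes=2)  # [3.0, 6.0]

# Eight candidate operations (as listed above) and four nodes (an illustrative value).
size = search_space_size(num_ops=8, num_nodes=4)
```

The count covers a single cell only; the totals quoted in the source are not reproduced here because they are missing from this copy.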
4 The Proposed Method
In this section, we first describe the formal setting of BPE in Sec. 4.1. We then present the proposed minimum importance pruning (MIP) to find the optimal BPE.
4.1 Budgeted Performance Estimation
The performance estimation is a training algorithm with hyper-parameters in a domain . Given an architecture set sampled from , we address the following optimization problem:
where , calculates the Spearman Rank Correlation between and . and are the performance on validation set of every architecture in with full training hyper-parameter and BPE parameter , respectively. We aim to find the optimal with less average training consumption on .
Optimizing Eq. 1 is extremely challenging, as we need to train over architectures to validate one example in . This large set of models to be trained and evaluated prevents most NAS methods from being widely deployed. Fortunately, Radosavovic et al. observed that sampling about models from a given architecture search space is sufficient to perform robust estimation, which is also validated in our work. Specifically, we randomly sample neural architectures in the cell-based architecture search space to construct the architecture set . Then and are obtained by training and validating every architecture in with the hyper-parameters and , respectively.
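The objective in Eq. 1 scores a BPE by how well it preserves the ranking of architectures. A minimal pure-Python sketch of the Spearman rank correlation follows; the function names and the accuracy values are illustrative assumptions, and returning 0.0 for a constant input is a convention chosen here, not something stated in the source.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation, i.e. the Pearson correlation of the ranks."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # constant input carries no ranking information (convention)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical validation accuracies of four architectures.
full_training = [0.90, 0.92, 0.95, 0.93]
bpe_faithful = [0.60, 0.65, 0.80, 0.70]   # lower accuracy, same ordering
bpe_distorted = [0.80, 0.70, 0.65, 0.60]  # ordering largely reversed

score_faithful = spearman(full_training, bpe_faithful)
score_distorted = spearman(full_training, bpe_distorted)
```

The faithful setting reaches a correlation of 1 even though every absolute accuracy is far below the fully trained one, which is exactly the property a BPE is asked to have.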
4.2 Minimum Importance Pruning
Although the time consumption of the validation step has been drastically reduced, it is still very difficult to optimize Eq. 1. For example, in our evaluation, the average training time of an architecture from for different hyper-parameters is hours on the CIFAR10 benchmark. In this case, each example needs to train networks, and the time consumption for one BPE example is hours. Such a cost still makes it difficult to find an optimal BPE efficiently.
To handle this issue, we propose minimum importance pruning (MIP), as illustrated in Fig. 2. We first sample the hyper-parameter examples around the lowest time cost. Then the sampled examples are trained to estimate the hyper-parameter importance by using a random forest [16, 6]. After that, the hyper-parameter with the lowest importance is pruned by setting it to the value with the minimum time cost. Pruning stops when only one hyper-parameter is left in the search space, and the optimal BPE is the example with the maximum .
Lowest time cost sampling. For each element in , we introduce a categorical distribution related to the computational cost:
where denotes the th element of the th hyper-parameter . The function returns the number of floating-point operations (FLOPs). We set with the th element and fix the other hyper-parameters in to the values with the minimum time cost. An example is generated by sampling the joint probability in Eq. 2, e.g., . Then, we obtain by training every architecture in using the sampled , and the objective is calculated with by using Eq. 1.
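A minimal Python sketch of cost-biased sampling is given below. Since the exact form of Eq. 2 is missing from this copy, the sketch assumes a softmax over negative FLOPs normalised by the cheapest choice, with a hypothetical `temperature` parameter; the search space and FLOPs numbers are illustrative only. Each FLOPs entry is assumed to have been measured with all other hyper-parameters fixed at their cheapest values, as described above.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def cost_distribution(flops: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Categorical distribution that favours cheap values (assumed form, see lead-in)."""
    lowest = min(flops)
    if lowest <= 0:
        raise ValueError("FLOPs must be positive")
    logits = [-(f / lowest) / temperature for f in flops]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_bpe(space: Dict[str, List[Tuple[float, float]]],
               rng: random.Random) -> Dict[str, float]:
    """Draw one value per hyper-parameter, each from its own cost-biased distribution."""
    example = {}
    for name, choices in space.items():
        values = [value for value, _ in choices]
        probs = cost_distribution([cost for _, cost in choices])
        example[name] = rng.choices(values, weights=probs, k=1)[0]
    return example


# Hypothetical space: each entry is (value, FLOPs when the rest is at its cheapest).
toy_space = {
    "epochs": [(10, 1e9), (20, 2e9), (30, 3e9)],
    "channels": [(8, 1e9), (16, 4e9)],
}
probs = cost_distribution([1e6, 2e6, 4e6])
example = sample_bpe(toy_space, random.Random(0))
```

Sampling each hyper-parameter independently corresponds to treating the joint probability as a product of the per-parameter distributions, which is an assumption of this sketch.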
Random forest training. After repeating the previous steps over times, we get a set with different BPEs and the corresponding objective values, which is used as a training set for the random forest. In a random forest, each tree is built from a set drawn from by sampling with replacement. Training a random forest amounts to training multiple regression trees. Given a training set with and the corresponding Spearman rank correlation vector sampled from , a regression tree in the random forest recursively partitions the space such that the examples in with similar values are grouped together. When training the regression tree, we need to consider how to measure and choose the partition feature (a hyper-parameter in our case). Specifically, let the data at node be represented by . For each candidate partition consisting of hyper-parameter and threshold , we partition the data into and subsets as follows:
We further define the impurity function for a given split set as
where , denotes the number of examples in set . And the impurity for a specific partition is the weighted sum of the impurity function:
We adopt an exhaustive search to find the optimal partition, that is, we iterate through all possible partitions and select the one with the minimum impurity:
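The exhaustive partition search can be sketched in a few lines of Python. The sketch assumes the mean squared error around the subset mean as the impurity function, which is the usual choice for regression trees but is not confirmed by this copy of the text; the function names and the toy data are illustrative.

```python
from typing import List, Optional, Sequence, Tuple


def impurity(targets: Sequence[float]) -> float:
    """Mean squared deviation from the subset mean (assumed impurity function)."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((y - mean) ** 2 for y in targets) / len(targets)


def best_split(rows: List[List[float]],
               targets: List[float]) -> Optional[Tuple[float, int, float]]:
    """Try every (hyper-parameter, threshold) pair and keep the lowest weighted impurity.

    Returns (weighted impurity, feature index, threshold), or None if no split exists.
    """
    n = len(rows)
    best = None
    for feature in range(len(rows[0])):
        # Dropping the largest value guarantees a non-empty right subset.
        for threshold in sorted({row[feature] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, targets) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, targets) if row[feature] > threshold]
            score = len(left) / n * impurity(left) + len(right) / n * impurity(right)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


# Hypothetical examples: columns are (epochs, learning rate), targets are correlations.
rows = [[10, 0.1], [10, 0.2], [50, 0.1], [50, 0.2]]
targets = [0.2, 0.2, 0.8, 0.8]
split = best_split(rows, targets)  # splits on epochs (feature 0) at threshold 10
```

In this toy data the correlation depends only on the number of epochs, so splitting on the first column separates the targets perfectly, while any split on the learning rate leaves both subsets mixed.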
Hyper-parameter importance. For every node in the regression tree, we calculate the parameter importance as the decrease in node impurity, which is weighted by the number of samples that reach the node. The parameter importance for node is defined as:
The importance of each is the sum of the importances over all nodes in the random forest that use as the partition parameter:
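The two importance definitions can be sketched as follows, assuming the common weighted-impurity-decrease form: the impurity of a node minus the size-weighted impurities of its children, scaled by the fraction of samples reaching the node. The `SplitNode` record, the final normalisation to a sum of one and the numbers in the usage example are assumptions of this sketch, not taken from the source.

```python
from typing import Dict, List, NamedTuple


class SplitNode(NamedTuple):
    """Summary of one internal tree node (hypothetical record type)."""
    feature: str            # hyper-parameter used for the partition
    n_node: int             # samples reaching this node
    impurity: float         # impurity before the split
    n_left: int
    impurity_left: float
    n_right: int
    impurity_right: float


def node_importance(node: SplitNode, n_total: int) -> float:
    """Impurity decrease of the split, weighted by the share of samples at the node."""
    decrease = (node.impurity
                - node.n_left / node.n_node * node.impurity_left
                - node.n_right / node.n_node * node.impurity_right)
    return node.n_node / n_total * decrease


def hyperparameter_importance(nodes: List[SplitNode], n_total: int) -> Dict[str, float]:
    """Sum node importances per hyper-parameter, then normalise them to sum to one."""
    totals: Dict[str, float] = {}
    for node in nodes:
        totals[node.feature] = totals.get(node.feature, 0.0) + node_importance(node, n_total)
    norm = sum(totals.values())
    return {name: value / norm for name, value in totals.items()} if norm > 0 else totals


# Hypothetical tree: the root splits on epochs, one child splits on layers.
nodes = [
    SplitNode("epochs", 8, 0.10, 4, 0.02, 4, 0.02),
    SplitNode("layers", 4, 0.02, 2, 0.00, 2, 0.00),
]
importance = hyperparameter_importance(nodes, n_total=8)
```

For a forest, the node lists of all trees would simply be concatenated before calling `hyperparameter_importance`; a hyper-parameter that never appears in a split receives no entry and can be treated as having zero importance.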
Parameter pruning. After the importance estimation process in Eq. 8, the hyper-parameter with the lowest probability is pruned by setting
is the value of the lowest FLOPs in hyper-parameter when . Otherwise, is the corresponding parameter value with the maximum in . The pruning step significantly improves the search efficiency. By setting a less important hyper-parameter to a value with lower resource consumption, we can allocate more computational resources to the important parameters. Our minimum importance pruning algorithm is presented in Alg. 1.
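The overall loop can be sketched in Python under several simplifying assumptions that are not taken from the source: values are sampled uniformly rather than with the cost-biased distribution, each value list is ordered from cheapest to most expensive, a pruned hyper-parameter is always fixed to its cheapest value (the alternative branch above is omitted because its condition is missing from this copy), and `spread_importance` is a crude stand-in for the random forest importance.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

BPE = Dict[str, float]
History = List[Tuple[BPE, float]]


def spread_importance(history: History, names: Sequence[str]) -> Dict[str, float]:
    """Stand-in importance: spread of the mean score across the values of each name."""
    scores = {}
    for name in names:
        groups: Dict[float, List[float]] = {}
        for bpe, score in history:
            groups.setdefault(bpe[name], []).append(score)
        means = [sum(g) / len(g) for g in groups.values()]
        scores[name] = max(means) - min(means)
    return scores


def minimum_importance_pruning(
        space: Dict[str, List[float]],
        evaluate: Callable[[BPE], float],
        estimate_importance: Callable[[History, Sequence[str]], Dict[str, float]],
        samples_per_round: int,
        rng: random.Random) -> Tuple[BPE, float, Dict[str, List[float]]]:
    """Sample, estimate importance, fix the least important hyper-parameter, repeat."""
    space = {name: list(values) for name, values in space.items()}  # work on a copy
    history: History = []
    while sum(len(values) > 1 for values in space.values()) > 1:
        for _ in range(samples_per_round):
            bpe = {name: rng.choice(values) for name, values in space.items()}
            history.append((bpe, evaluate(bpe)))  # evaluate returns the objective value
        free = [name for name, values in space.items() if len(values) > 1]
        importance = estimate_importance(history, free)
        weakest = min(free, key=lambda name: importance[name])
        space[weakest] = space[weakest][:1]  # keep only the cheapest value
    if not history:
        raise ValueError("the space needs at least two free hyper-parameters")
    best_bpe, best_score = max(history, key=lambda item: item[1])
    return best_bpe, best_score, space


# Hypothetical space and objective; a real objective would be the correlation of Eq. 1.
toy_space = {"epochs": [10, 20, 30], "layers": [6, 8], "batch_size": [64, 128]}


def toy_objective(bpe: BPE) -> float:
    return 0.01 * bpe["epochs"] + 0.001 * bpe["layers"]


best_bpe, best_score, final_space = minimum_importance_pruning(
    toy_space, toy_objective, spread_importance, samples_per_round=8, rng=random.Random(0))
```

Each round spends its samples on a space with one fewer free hyper-parameter, which is how the pruning shifts the evaluation budget towards the hyper-parameters that matter most.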
|Architecture||Test Error (%)||Params (M)||Search Cost (GPU days)||Search Method|
|Path-level NAS||3.64||3.2||8.3||RL|
|RL+BPE-1 (Ours)||2.66 ± 0.05||2.7||0.33||RL|
|RL+BPE-2 (Ours)||2.65 ± 0.12||2.9||2||RL|
|EA+BPE-1 (Ours)||2.68 ± 0.09||2.46||0.33||Evolution|
|EA+BPE-2 (Ours)||2.66 ± 0.07||2.87||2||Evolution|
|DARTS||2.7 ± 0.01||3.1||1.5||Gradient-based|
|GDAS||2.93 ± 0.07||3.4||0.8||Gradient-based|
|P-DARTS||2.75 ± 0.06||3.4||0.3||Gradient-based|
|SNAS||2.85 ± 0.02||2.8||1.5||Gradient-based|
|DARTS + BPE-1 (Ours)||2.89 ± 0.0||3.9||0.05||Gradient-based|
|DARTS + BPE-2 (Ours)||2.72 ± 0.0||4.04||0.33||Gradient-based|
|Random Sample 100||2.55||2.9||108||Random Search|
|Random Sample 100 + BPE-1 (Ours)||2.68 ± 0.09||2.7||0.33 (337)||Random Search|
|Random Sample 100 + BPE-2 (Ours)||2.68 ± 0.05||1.9||2 (54)||Random Search|
As we mentioned before, the average time consumption for evaluating BPE examples is GPU hours, which means that with a similar number of samples (), methods such as Bayesian optimization or random search need about GPU hours, which is almost infeasible. In contrast, our method needs only GPU hours. Therefore, we do not compare with these methods in our paper.
We first combine BPE with different search strategies including Reinforcement Learning (RL) , Evolutionary Algorithm (EA) , Random Search (RS) and Differentiable Architecture Search (DARTS) . In Sec. 5.1, we compare with state-of-the-art methods in terms of both effectiveness and efficiency using CIFAR10 and ImageNet . In Sec. 5.2, we investigate the effect of each hyper-parameter in BPE, as well as the effectiveness of using Spearman Rank Correlation as the objective function. Although many works [38, 25] pointed out that one-shot based methods [37, 31, 28, 3, 50] cannot effectively estimate the performance over the entire search space, in Sec. 5.3 we find that these methods are indeed effective in a local search space. This reasonably explains both their effectiveness and their reproducibility issues: the corresponding algorithms are actually able to find good architectures, yet the optimal architectures differ considerably across runs because only local information is used.
5.1 Comparing with State-of-the-arts
We first search neural architectures by using the found BPE-1 and BPE-2 in Tab. 1, and then evaluate the best architecture with a stacked deeper network. To ensure the stability of the proposed method, we run each experiment times and find that the resulting architectures only show a slight variance in performance.
5.1.1 Experimental Settings
We use the same datasets and evaluation metrics as existing NAS methods [28, 8, 52, 27]. First, most experiments are conducted on CIFAR-10 , which has K training images and K testing images from 10 classes with a resolution . During the architecture search, we randomly select K images from the training set as a validation set. To further evaluate the generalization capability, we stack the optimal cell discovered on CIFAR-10 into a deeper network, and then evaluate the classification accuracy on ILSVRC 2012 , which consists of classes with M training images and K validation images. Here, we consider the mobile setting, where the input image size is and the number of FLOPs is less than 600M.
In the search process, we directly use the found BPE-1 and BPE-2 in Tab. 1 as the performance estimation with other search algorithms. After finding the optimal architecture in the search space, we validate the final accuracy on a large network of cells, which is trained for epochs with a batch size of and additional regularization such as cutout , similar to [28, 52, 31]. When stacking cells to evaluate on ImageNet, we use two initial convolutional layers of stride before stacking cells with the scale reduction at the st, nd, th and th cells. The total number of FLOPs is determined by the initial number of channels. The network is trained for 250 epochs with a batch size of 512, a weight decay of , and an initial SGD learning rate of 0.1. All the experiments and models are implemented in PyTorch.
|ShuffleNetV2 2x (V2) ||73.7||5||-|
|RL + BPE-1 (Ours)||74.18||5.5||0.33|
|EA + BPE-1 (Ours)||74.56||5.0||0.33|
|RS + BPE-1 (Ours)||74.2||5.5||0.33|
|DARTS  + BPE-1 (Ours)||74.0||5.9||0.05|
5.1.2 Result on CIFAR10
We compare our method with both manually designed networks and NAS networks. The manually designed networks include ResNet , DenseNet and SENet . We evaluate on four categories of NAS methods, i.e., RL methods (NASNet , ENAS and Path-level NAS ), evolutionary algorithms (AmoebaNet ), gradient-based methods (DARTS ) and Random Search.
The results for convolutional architectures on CIFAR-10 are presented in Tab. 2. It is worth noting that the found BPE, combined with various search algorithms, outperforms various state-of-the-art search algorithms [52, 28, 33] in accuracy, with much lower computational consumption (only GPU days in ). We attribute our superior results to the found BPE. Another notable observation from Tab. 2 is that, even with random search in the search space, the test error rate is only %, which outperforms previous methods in the same search space. In conclusion, with the found BPE, search algorithms can quickly explore the architecture search space and generate a better architecture. We also report the results of hand-crafted networks in Tab. 2. Clearly, our method shows a notable improvement.
5.1.3 Results on ImageNet
We further compare our method under the mobile settings on ImageNet to demonstrate the generalizability. The best architecture on CIFAR-10 is transferred to ImageNet, following the same experimental settings as in [52, 31, 8]. Results in Tab. 3 show that the best cell architecture on CIFAR10 is transferable to ImageNet. The proposed method achieves comparable accuracy to the state-of-the-art methods [52, 34, 27, 31, 28, 8] while using far fewer computational resources, e.g., times faster compared to EA, and times faster compared to RL.
5.2 Deep Analysis in Performance Estimation
We further study the efficiency of using Spearman Rank Correlation as the objective function in Fig. 3. In Fig. 4, we also provide a deep analysis about the importance of every hyper-parameter. One can make full use of this analysis to transfer the found BPEs to other datasets and tasks.
We randomly select hyper-parameter settings in and apply them on the DARTS  search algorithm to find optimal architectures. Fig. 3 illustrates the relationship between and the accuracy of the optimal architecture found by the corresponding setting. The performance is highly correlated to (with a correlation), which denotes the efficiency of the proposed objective function in Eq. 1.
After exploring the BPE space by the proposed method, we get a dataset w.r.t. each and , which is used as the training set to train a random forest regression predictor for each . We then report the estimated by the predictor and the importance of each hyper-parameter in Fig. 4. As illustrated in Fig. 4, the steepness and importance are highly correlated, i.e., the more important a parameter is, the steeper the corresponding curve is, and vice versa. At the same time, for the two most important parameters (epoch and layer), we get a high with a small range. This means that we only need to carefully fine-tune these two parameters in a small range when transferring to other datasets.
5.3 Understanding One-shot based Methods
Previous works [25, 38] have reported that one-shot based methods such as DARTS do not work well (in some cases even no better than random search). There are two main questions which have not been explained yet: (1) One-shot based methods cannot make a good estimate of performance, but they can still find good neural architectures. (2) One-shot based methods are unstable, that is, the found networks differ with different random seeds. With the found BPE, we can effectively investigate every search phase in these methods.
To understand and explain such questions, we first train the same hypergraph with different settings: (1) Fair training, where each operation in an edge is trained for exactly the same number of epochs; (2) Random training, where each operation in an edge is trained randomly at different random levels. In Tab. 4, we report the global and local in the case of fair training and random training. The global denotes that we use our trained hypergraph to get the validation performance for the networks in , and then calculate the with . The local is obtained by the following steps: When training the hypergraph, we save the sampled network architectures and the corresponding validation performance at epoch . The local is then obtained by using the found BPE-2 and Eq. 1 i.e., BPE-2. As illustrated in Tab. 4, one-shot based methods have a poor performance estimation in terms of the global , which is consistent with previous works [25, 38]. However, these methods have a high local , which means that they essentially rely on local information. That is to say, each epoch in the search phase can only perceive and optimize by using local information, which reasonably explains the instability of DARTS.
In this paper, we present the first systematic analysis of the budgeted performance estimation (BPE) in NAS, and propose minimum importance pruning (MIP) to find an optimal PE. The proposed MIP gradually reduces the number of BPE hyper-parameters, which allocates more computational resources to the more important hyper-parameters. The found MIP-BPE generalizes to various search algorithms, including reinforcement learning, random search, evolutionary algorithms and gradient-based methods. Combining the found BPE with various NAS algorithms, we have reached state-of-the-art test error on CIFAR10 with much less search time, which also helps us to better understand the widely-used one-shot based methods.
This work is supported by the Nature Science Foundation of China (No.U1705262, No.61772443, No.61572410, No.61802324 and No.61702136), National Key R&D Program (No.2017YFC0113000, and No.2016YFB1001503), and Nature Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106).
- [1] (2019) Adaptive stochastic natural gradient method for one-shot neural architecture search. In ICML.
- [2] Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford University Press.
- [3] (2018) Understanding and simplifying one-shot architecture search. In ICML.
- [4] (2012) Random search for hyper-parameter optimization. JMLR.
- [5] (2016) End to end learning for self-driving cars. arXiv.
- [6] (2001) Random forests. Machine learning.
- [7] (2018) Efficient architecture search by network transformation. In AAAI.
- [8] (2018) Path-level network transformation for efficient architecture search. arXiv.
- [9] (2019) Binarized neural architecture search. arXiv preprint arXiv:1911.10862.
- [10] (2018) Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS.
- [11] (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. ICCV.
- [12] Local to global learning: gradually adding classes for training deep neural networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [13] (2017) Improved regularization of convolutional neural networks with cutout. arXiv.
- [14] (2019) Searching for a robust neural architecture in four gpu hours. In CVPR.
- [15] (2016) Deep residual learning for image recognition. In CVPR.
- [16] An efficient approach for assessing hyperparameter importance. In ICML, pp. 754–762.
- [17] (2018) Squeeze-and-excitation networks. In CVPR.
- [18] (2017) Densely connected convolutional networks. In CVPR.
- [19] (2011) Sequential model-based optimization for general algorithm configuration. In LION.
- [20] (2019) Semi-supervised adversarial monocular depth estimation. IEEE transactions on pattern analysis and machine intelligence.
- [21] (1996) Reinforcement learning: a survey. JAIR.
- [22] (2016) Fast bayesian optimization of machine learning hyperparameters on large datasets. arXiv.
- [23] (2009) Learning multiple layers of features from tiny images. Technical report.
- [24] (2009) Learning multiple layers of features from tiny images. Technical report.
- [25] (2019) Random search and reproducibility for neural architecture search. arXiv.
- [26] (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. CVPR.
- [27] (2018) Progressive neural architecture search. In ECCV.
- [28] (2019) Darts: differentiable architecture search. ICLR.
- [29] (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In ECCV.
- [30] (2017) Automatic differentiation in pytorch.
- [31] (2018) Efficient neural architecture search via parameter sharing. ICML.
- [32] (2019) On network design spaces for visual recognition. arXiv.
- [33] Aging evolution for image classifier architecture search. In AAAI.
- [34] (2019) Regularized evolution for image classifier architecture search. AAAI.
- [35] (2015) Imagenet large scale visual recognition challenge. IJCV.
- [36] (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In CVPR.
- [37] (2016) Convolutional neural fabrics. In NeurIPS.
- [38] (2019) Evaluating the search phase of neural architecture search. arXiv.
- [39] (2012) Practical bayesian optimization of machine learning algorithms. In NeurIPS, pp. 2951–2959.
- [40] (2019) MnasNet: platform-aware neural architecture search for mobile. In CVPR.
- [41] (2013) Auto-weka: combined selection and hyperparameter optimization of classification algorithms. In SIGKDD.
- [42] (2019) C-mil: continuation multiple instance learning for weakly supervised object detection. In IEEE CVPR, pp. 2199–2208.
- [43] (2019) SNAS: stochastic neural architecture search. In ICLR.
- [44] (2019) Pc-darts: partial channel connections for memory-efficient differentiable architecture search. arXiv.
- [45] (2018) Towards automated deep learning: efficient joint neural architecture and hyperparameter search. ICML Workshop.
- [46] (2019) FreeAnchor: learning to match anchors for visual object detection. In NeurIPS, pp. 147–155.
- [47] (2018) Centralized ranking loss with weakly supervised localization for fine-grained object retrieval. In IJCAI, pp. 1226–1233.
- [48] (2019) Towards optimal fine grained retrieval via decorrelated centralized loss with normalize-scale layer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9291–9298.
- [49] (2019) Dynamic distribution pruning for efficient network architecture search. arXiv.
- [50] (2019) Multinomial distribution learning for effective neural architecture search. In ICCV.
- [51] (2016) Neural architecture search with reinforcement learning. arXiv.
- [52] (2018) Learning transferable architectures for scalable image recognition. In CVPR.