Rethinking Performance Estimation in Neural Architecture Search

05/20/2020 ∙ Xiawu Zheng, et al. ∙ Xiamen University, HUAWEI Technologies Co., Ltd., Peking University

Neural architecture search (NAS) remains a challenging problem, which is largely attributed to the indispensable and time-consuming component of performance estimation (PE). In this paper, we provide a novel yet systematic rethinking of PE in a resource-constrained regime, termed budgeted PE (BPE), which precisely and effectively estimates the performance of an architecture sampled from an architecture space. Since searching for an optimal BPE is extremely time-consuming, as it requires training a large number of networks for evaluation, we propose a Minimum Importance Pruning (MIP) approach. Given a dataset and a BPE search space, MIP estimates the importance of the hyper-parameters using a random forest and subsequently prunes the least important one from the next iteration. In this way, MIP effectively prunes less important hyper-parameters to allocate more computational resources to more important ones, thus achieving an effective exploration. By combining BPE with various search algorithms, including reinforcement learning, evolutionary algorithms, random search and differentiable architecture search, we achieve a 1,000x NAS speed-up with a negligible performance drop compared to the SOTA.


1 Introduction

Figure 1: Time cost with/without budgeted performance estimation (BPE-1). (a) Previous methods did not optimize the huge computation cost in PE. (b) By incorporating the BPE, we can largely accelerate NAS methods including reinforcement learning (RL) [52], evolution algorithm (EA) [34], random search (RS) and DARTS [28] (One-shot) with a negligible performance drop.
Figure 2: The overall framework of the proposed Minimum Importance Pruning (MIP) for finding an optimal Budgeted Performance Estimation (BPE). The search space of BPE is built from the training hyper-parameters, including training epoch, batch size, learning rate, layer number, floating point precision, channels, cutout and image size. We first sample examples around the lowest time cost. The sampled examples are then used to train a random forest, which evaluates the importance of the corresponding hyper-parameters. The hyper-parameter with the lowest importance is pruned by assigning it the value with the minimum time cost.

Deep learning has achieved significant success in classification [15, 12], retrieval [48, 47] and detection [42, 46, 20]. Building on this success, neural architecture search (NAS) aims to automatically discover a suitable neural network architecture by exploring a tremendous architecture search space, and has shown remarkable performance over manual designs in various computer vision tasks [9, 51, 28, 52, 10, 26].

Despite this extensive success, previous methods still demand intensive computational resources, which severely restricts their applicability and flexibility. For instance, reinforcement learning (RL) based methods [52, 51] search for a suitable architecture on CIFAR-10 by training and evaluating a huge number of candidate architectures, consuming about 1,800 GPU days. For another instance, the evolutionary algorithm (EA) based method in [34] needs 3,150 GPU days to find an optimal architecture on CIFAR-10.

A NAS method generally consists of three components, i.e., the search space, the search strategy and performance estimation. As established by [51], the cell-based search space is now widely adopted [49, 44, 11, 31, 34, 33, 51, 52]; it is pre-defined and fixed during the architecture search to ensure a fair comparison among different NAS methods. On the other hand, as illustrated in Fig. 1, different search strategies (RL or EA) have similar run-time (after subtracting the performance estimation cost), which can also be well accelerated with GPU packages. Therefore, the major computational cost of NAS lies in the performance estimation (PE) step, as validated in Fig. 1. However, few works have been devoted to the efficiency of PE, which is crucial to cope with the explosive growth of dataset size and model complexity. Moreover, it is highly desirable to conduct fast architecture search on different datasets for deployment in emerging applications like self-driving cars [5].

In this paper, we propose a novel and efficient performance estimation under the resource-constrained regime, termed budgeted performance estimation (BPE), which is the first of its kind in the NAS community. BPE essentially controls the hyper-parameters of training, network design and dataset processing, such as the number of channels, the number of layers, the learning rate and the image size. Rather than pursuing model precision on a specific dataset, BPE aims to preserve, as accurately as possible, the relative precision order of different neural architectures in a specific architecture space. In other words, a good network structure should still rank relatively high under an accurate BPE. We argue that the lack of an accurate and efficient BPE remains the main barrier to the wide usage of NAS research. However, finding an accurate and effective BPE is extremely challenging compared to other black-box optimization problems. First, BPE needs to carefully deal with both discrete (e.g., layers or channels) and continuous (e.g., learning rate) hyper-parameters. Second, evaluating a specific BPE requires training a large number of neural networks from the cell-based architecture search space [28].

As implicitly employed in previous NAS methods [52, 22, 33, 28, 7, 31], most BPE schemes only leverage intuitive tricks, including early stopping [52], dataset sampling [22], lower-resolution datasets, or a proxy search network with fewer filters per layer and fewer cells [52, 28]. While such methods can reduce the computational cost to a certain extent (the search remains time-consuming [52, 34]), they also introduce noise into PE that leads to underestimating the corresponding performance. Little work investigates whether the relative performance rank between approximated evaluations and full evaluations is preserved, which is traditionally taken for granted [28, 52, 34]. However, as subsequently validated in Sec. 5, this relative rank can change dramatically under a tiny difference in the training condition.

In this paper, we present a unified, fast and effective framework, termed Minimum Importance Pruning (MIP), to find an optimal BPE on a specific architecture search space, such as the cell-based search space [49, 44, 11, 31, 34, 33, 51, 52], as illustrated in Fig. 2. In particular, for a given large-scale hyper-parameter search space, we first sample examples with the lowest time consumption. The sampled examples are then used to estimate the hyper-parameter importance using a random forest [6, 16]. The hyper-parameter with the lowest importance is set to the value with the minimum time cost. The algorithm stops when every hyper-parameter is set. The contributions of this paper include:

  • To the best of our knowledge, this is the first work to systematically investigate performance estimation in NAS under the resource-constrained regime. We seek an optimal budgeted PE (BPE) by optimizing a Spearman rank correlation objective over a group of key hyper-parameters.

  • A novel hyper-parameter optimization method, termed Minimum Importance Pruning (MIP), is proposed, which is effective for black-box optimization problems whose evaluation step is extremely time-consuming.

  • The proposed MIP-BPE generalizes well to various architecture search methods, including Reinforcement Learning (RL), Evolutionary Algorithms (EA), Random Search (RS) and DARTS. MIP-BPE achieves remarkable performance on both CIFAR-10 and ImageNet, while accelerating the search process by up to 1,000x.

2 Related Work

2.1 Performance Estimation in NAS

Performance estimation refers to estimating the performance of a specific architecture in the architecture search space. The conventional option is to perform a standard training and validation procedure for each architecture on the dataset, which is computationally expensive and limits the number of architectures that can be explored. To accelerate performance estimation, most NAS methods only rely on simple intuitive heuristics such as early stopping [52], dataset sampling [22], lower-resolution datasets, or a proxy search network with fewer filters and fewer cells [52, 28].

Another possibility for estimating architecture performance is one-shot based methods [50, 28, 1], which consider each individual in the search space as a sub-graph sampled from a super-graph. In this way, they accelerate the search process by parameter sharing [31]. Chen et al. [11] proposed to progressively grow the depth of the searched architectures during the training procedure. Xu et al. [44] presented a partially connected method that samples a small part of the super-net to reduce redundancy in the network space, thereby performing a more efficient search without compromising performance. However, these methods do not deeply investigate the influence of different hyper-parameters, which introduces large noise, as validated in Sec. 5.

2.2 Hyper-parameter Optimization

Hyper-parameter optimization [41] aims to automatically optimize the hyper-parameters of the learning process [4, 19, 39, 45]. To this end, grid search and random search [4] are the two simplest and most straightforward approaches. Note that these methods do not exploit the accumulated experience (i.e., the examples sampled during the search process). Subsequently, sequential model-based optimization (SMBO) [19] was proposed to learn a proxy function from the experience and estimate the performance of unknown hyper-parameters. As one of the most popular methods, Bayesian optimization [39] learns a Gaussian process from the sampled examples, and then decides the best hyper-parameter for the next trial by maximizing the corresponding improvement function.

However, all these methods mostly deal with hyper-parameters of particular machine learning models and cannot handle the optimization of BPE, whose evaluation step is extremely expensive. Different from previous methods, we evaluate and estimate the importance of the hyper-parameters by sampling examples with the minimum time consumption, and the hyper-parameters of minimum importance are then pruned in the next iteration, which is extremely effective and efficient for finding the optimal BPE.

3 Preliminaries

3.1 NAS Pipeline

Given a training set, conventional NAS algorithms [52, 49, 25] first sample an architecture in the pre-defined search space with a certain search strategy, like Reinforcement Learning (RL) or an Evolutionary Algorithm (EA). The sampled neural architecture is then passed to the performance estimation (PE) step, which returns the performance of the architecture to the search algorithm.

In most NAS methods [49, 28, 44], PE is accelerated by using a group of lower-cost hyper-parameters (e.g., a smaller image size, fewer channels and a shallower network) in the search space, termed budgeted PE (BPE), which covers various training hyper-parameters including the number of training epochs, batch size, learning rate, the number of layers, floating point precision, channels, cutout [13] and image size. For instance, Liu et al. [28] estimate the performance of an architecture on a small network of 8 layers trained for 50 epochs, with a batch size of 64 and 16 initial channels (cf. Tab. 1). After the search process, the optimal neural architecture is then evaluated with a full and time-consuming training hyper-parameter set $\lambda^{*}$. In existing works [49, 28, 44], $\lambda^{*}$ controls the final evaluation of the optimal architecture, i.e., a much larger and deeper network is trained for many more epochs with an additional regularization such as cutout [13].

However, in this pipeline, the BPE and the final evaluation phase are decoupled. There is no guarantee that the BPE is correlated with the final evaluation step, i.e., the same architectures may have large ranking distances under different training conditions. Most NAS methods [28, 52] intuitively modify BPE with fewer channels or layers. Nevertheless, extensive experiments in Sec. 5 show that the effectiveness of BPE is highly sensitive to these choices, which means that the corresponding hyper-parameters need to be selected and analyzed carefully. Indeed, we believe, and validate in Sec. 5, that BPE is a crucial component, yet unfortunately no previous work has been devoted to this area.
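To make the decoupling concrete, the sketch below contrasts a budgeted setting (the BPE-1 values later reported in Tab. 1) with a hypothetical full-evaluation setting; the full-evaluation values and the train_and_eval routine are illustrative assumptions, not the paper's prescription.

```python
# Minimal sketch of the BPE / final-evaluation decoupling.
# bpe_1 follows Tab. 1; full_evaluation is an illustrative assumption
# in the style of common final-training recipes, not the paper's setting.
bpe_1 = {"epochs": 10, "batch_size": 128, "learning_rate": 0.03,
         "layers": 6, "channels": 8, "image_size": 16}
full_evaluation = {"epochs": 600, "batch_size": 96, "learning_rate": 0.025,
                   "layers": 20, "channels": 36, "image_size": 32, "cutout": True}

def rank_architectures(archs, hparams, train_and_eval):
    """Rank candidate architectures by validation accuracy under one
    hyper-parameter set (train_and_eval is a user-supplied routine)."""
    scores = {a: train_and_eval(a, hparams) for a in archs}
    return sorted(archs, key=scores.get, reverse=True)
```

A good BPE is one under which the ranking produced with bpe_1 (nearly) matches the ranking produced with the full evaluation setting.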

3.2 Cell based Architecture Search Space

As mentioned in Sec. 1, BPE aims to find optimal training hyper-parameters on a specific architecture search space. In this paper, we follow the widely-used cell-based architecture search space in [49, 44, 11, 31, 34, 33, 51, 52, 50]: a network consists of a pre-defined number of cells [51], which can be either normal cells or reduction cells. Each cell takes the outputs of the two previous cells as input. A cell is a fully-connected directed acyclic graph (DAG) of $N$ nodes, i.e., $\{x_1, x_2, \dots, x_N\}$. Each node $x_j$ takes its dependent nodes as input and generates an output through a sum operation:

$$x_j = \sum_{i < j} o_{i,j}(x_i).$$

Here each node $x_i$ is a specific tensor (e.g., a feature map in a convolutional neural network), and each directed edge $(i, j)$ between $x_i$ and $x_j$ denotes an operation $o_{i,j}$, which is sampled from the corresponding operation search space $\mathcal{O}$. Note that the constraint $i < j$ ensures that there are no cycles in a cell. Each cell takes the outputs of two preceding cells as input, and the two input nodes are fixed as the first two nodes of the cell for simplicity. Following [28], the operation search space $\mathcal{O}$ consists of 8 operations: $3\times3$ and $5\times5$ dilated convolutions with rate 2, $3\times3$ and $5\times5$ depth-wise separable convolutions, $3\times3$ max pooling, $3\times3$ average pooling, no connection (zero), and a skip connection (identity). Therefore, the size of the whole search space is $8^{|\mathcal{E}_N|}$ per cell type, where $\mathcal{E}_N$ is the set of possible edges with $N$ intermediate nodes in the fully-connected DAG. In our case, with $N = 4$ intermediate nodes and two input nodes, $|\mathcal{E}_N| = 2+3+4+5 = 14$, so there are $8^{14}$ possible structures per cell type (normal and reduction), which is an extremely large space to search.
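As a quick sanity check of the scale of this space, the snippet below counts the candidate edges and cell structures, assuming $N = 4$ intermediate nodes, 2 input nodes and 8 candidate operations per edge as in the DARTS-style space described above.

```python
# Back-of-the-envelope size of the cell-based search space (assumptions:
# N = 4 intermediate nodes, 2 input nodes, 8 operations, 2 cell types).
N_INTERMEDIATE, NUM_OPS = 4, 8

# Intermediate node j (1-indexed) may receive an edge from the 2 cell
# inputs and from the j-1 earlier intermediate nodes: j + 1 candidates.
num_edges = sum(j + 1 for j in range(1, N_INTERMEDIATE + 1))   # 2+3+4+5 = 14
per_cell_type = NUM_OPS ** num_edges                           # 8^14 ≈ 4.4e12
joint = per_cell_type ** 2     # if normal and reduction cells are searched jointly

print(num_edges, f"{per_cell_type:.2e}", f"{joint:.2e}")
```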

4 The Proposed Method

In this section, we first describe the formal setting of BPE in Sec. 4.1. We then present the proposed minimum importance pruning (MIP) to find the optimal BPE.

4.1 Budgeted Performance Estimation

The performance estimation is a training algorithm with hyper-parameters $\lambda$ in a domain $\Lambda$. Given an architecture set $\mathcal{S}$ sampled from the architecture search space $\mathcal{A}$, we address the following optimization problem:

$$\lambda^{o} = \mathop{\arg\max}_{\lambda \in \Lambda} \; \rho\big(P_{\lambda}, P_{\lambda^{*}}\big), \qquad (1)$$

where $\rho(\cdot, \cdot)$ calculates the Spearman Rank Correlation between $P_{\lambda}$ and $P_{\lambda^{*}}$. $P_{\lambda^{*}}$ and $P_{\lambda}$ are the validation-set performances of every architecture in $\mathcal{S}$ under the full training hyper-parameters $\lambda^{*}$ and the BPE hyper-parameters $\lambda$, respectively. We aim to find the optimal $\lambda^{o}$ with as little average training consumption on $\mathcal{S}$ as possible.

Optimizing Eq. 1 is extremely challenging, as every architecture in $\mathcal{S}$ must be trained to validate a single example in $\Lambda$. This large set of models to be trained and evaluated prevents most NAS methods from being widely deployed. Fortunately, Radosavovic et al. [32] observed that sampling on the order of 100 models from a given architecture search space is sufficient to perform a robust estimation, which is also validated in our work. Specifically, we randomly sample neural architectures from the cell-based architecture search space to construct the architecture set $\mathcal{S}$. Then $P_{\lambda^{*}}$ and $P_{\lambda}$ are obtained by training and validating every architecture in $\mathcal{S}$ with the hyper-parameters $\lambda^{*}$ and $\lambda$, respectively.
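The following sketch shows how a single BPE candidate could be scored against Eq. 1, assuming a user-supplied train_and_eval(arch, hparams) routine (hypothetical) that trains an architecture under the given hyper-parameters and returns its validation accuracy.

```python
# Scoring one BPE candidate via Eq. 1: the Spearman rank correlation between
# the ranking induced by the budgeted setting and the full-training ranking.
from scipy.stats import spearmanr

def evaluate_bpe(archs, bpe_hparams, full_perf, train_and_eval):
    """archs: list of architectures; full_perf: dict arch -> accuracy under
    the full setting lambda*; returns rho in [-1, 1] (1 = ranking preserved)."""
    bpe_perf = [train_and_eval(a, bpe_hparams) for a in archs]
    rho, _ = spearmanr(bpe_perf, [full_perf[a] for a in archs])
    return rho
```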

4.2 Minimum Importance Pruning

Although the time consumption of the validation step has been drastically reduced, optimizing Eq. 1 is still very difficult: in our evaluation, training a single architecture from $\mathcal{S}$ under typical hyper-parameter settings takes hours on the CIFAR-10 benchmark, and each BPE example requires training every architecture in $\mathcal{S}$, so evaluating even one BPE example costs a substantial number of GPU hours. Such a time consumption makes it difficult to find an optimal BPE efficiently.

To handle this issue, we propose minimum importance pruning (MIP), as illustrated in Fig. 2. We first sample hyper-parameter examples around the lowest time cost. The sampled examples are then trained to estimate the hyper-parameter importance using a random forest [16, 6]. After that, the hyper-parameter with the lowest importance is pruned by setting it to the value with the minimum time cost. The pruning step stops when only one hyper-parameter remains in the search space, and the optimal BPE is the sampled example with the maximum $\rho$.

Lowest time cost sampling. For each hyper-parameter $\lambda_i$ in $\lambda$, we introduce a categorical distribution related to the computational cost:

$$p\big(\lambda_i = \lambda_i^{j}\big) \;\propto\; \frac{1}{F(\lambda_i^{j})}, \qquad (2)$$

where $\lambda_i^{j}$ denotes the $j$-th candidate value of the $i$-th hyper-parameter $\lambda_i$, and the function $F(\cdot)$ returns the number of floating point operations obtained by setting $\lambda_i$ to $\lambda_i^{j}$ while fixing the other hyper-parameters in $\lambda$ to the values with the minimum time cost. An example $\lambda$ is generated by sampling the joint probability over all hyper-parameters in Eq. 2. Then, we obtain $P_{\lambda}$ by training every architecture in $\mathcal{S}$ using the sampled $\lambda$, and the objective $\rho$ is calculated against $P_{\lambda^{*}}$ using Eq. 1.
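A possible instantiation of the cost-aware sampling in Eq. 2 is sketched below, normalizing the inverse FLOPs of each candidate value into a categorical distribution; the normalization choice and the flops_of helper are assumptions made for illustration.

```python
# Cost-aware sampling in the spirit of Eq. 2: cheaper candidate values
# (lower FLOPs) receive proportionally higher sampling probability.
# flops_of(name, value) is a hypothetical helper measuring FLOPs with the
# named hyper-parameter set to `value` and all others at their cheapest values.
import numpy as np

def sample_bpe(search_space, flops_of, rng=None):
    """search_space: dict mapping hyper-parameter name -> list of candidate values."""
    rng = rng or np.random.default_rng()
    example = {}
    for name, values in search_space.items():
        inv_cost = np.array([1.0 / flops_of(name, v) for v in values])
        probs = inv_cost / inv_cost.sum()        # normalized categorical distribution
        example[name] = values[rng.choice(len(values), p=probs)]
    return example
```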

Input: Architecture search space $\mathcal{A}$; hyper-parameter space $\Lambda$; sampling number $T$.
Output: Optimal BPE hyper-parameter $\lambda^{o}$.
1: $n \leftarrow$ the number of hyper-parameters in $\Lambda$;
2: $\mathcal{S} \leftarrow$ {sample network architectures in $\mathcal{A}$};
3: Train $\mathcal{S}$ with the full training condition $\lambda^{*}$;
4: $P_{\lambda^{*}} \leftarrow$ {$\mathcal{S}$'s performance on the validation set};
5: $D \leftarrow \emptyset$;
6: while $n > 1$ do
7:     $\{\lambda^{1}, \dots, \lambda^{T}\} \leftarrow$ {randomly sample $T$ examples in $\Lambda$ by the distribution in Eq. 2};
8:     for $\lambda^{t}$ in $\{\lambda^{1}, \dots, \lambda^{T}\}$ do
9:         Train $\mathcal{S}$ with BPE $\lambda^{t}$;
10:        $P_{\lambda^{t}} \leftarrow$ {$\mathcal{S}$'s performance on the validation set};
11:        $\rho^{t} \leftarrow$ Spearman Rank Correlation between $P_{\lambda^{t}}$ and $P_{\lambda^{*}}$;
12:        $D \leftarrow D \cup \{(\lambda^{t}, \rho^{t})\}$;
13:    end for
14:    Train the random forest by using Eq. 5 and Eq. 6 on $D$;
15:    Calculate the importance by Eq. 8;
16:    Prune the space $\Lambda$ by Eq. 9;
17:    $n \leftarrow n - 1$;
18: end while
Algorithm 1: Minimum Importance Pruning

Random forest training. After repeating the previous sampling step $T$ times, we get a set $D = \{(\lambda^{t}, \rho^{t})\}_{t=1}^{T}$ of BPE examples and their corresponding objective values, which is used as the training set for the random forest. In a random forest, each tree is built from a set drawn with replacement sampling from $D$. Training the random forest amounts to training multiple regression trees. Given a training set of hyper-parameter examples $\{\lambda^{t}\}$ and the corresponding Spearman rank correlations $\{\rho^{t}\}$ sampled from $\Lambda$, a regression tree recursively partitions the space such that examples with similar $\rho$ values are grouped together. When training a regression tree, we need to decide how to measure and choose the partition feature (a hyper-parameter in our case). Specifically, let the data at node $m$ be represented by $Q_m$. For each candidate partition $\theta = (\lambda_i, t_m)$ consisting of a hyper-parameter $\lambda_i$ and a threshold $t_m$, we partition the data into $Q_m^{\mathrm{left}}(\theta)$ and $Q_m^{\mathrm{right}}(\theta)$ subsets as follows:

$$Q_m^{\mathrm{left}}(\theta) = \{(\lambda, \rho) \in Q_m \mid \lambda_i \le t_m\}, \qquad Q_m^{\mathrm{right}}(\theta) = Q_m \setminus Q_m^{\mathrm{left}}(\theta). \qquad (3)$$

We further define the impurity function for a given split set $Q$ as

$$H(Q) = \frac{1}{n_Q} \sum_{(\lambda, \rho) \in Q} \big(\rho - \bar{\rho}_Q\big)^{2}, \qquad \bar{\rho}_Q = \frac{1}{n_Q} \sum_{(\lambda, \rho) \in Q} \rho, \qquad (4)$$

where $n_Q$ denotes the number of examples in the set $Q$. The impurity of a specific partition $\theta$ is the weighted sum of the impurity function:

$$G(Q_m, \theta) = \frac{n_{Q_m^{\mathrm{left}}(\theta)}}{n_{Q_m}} H\big(Q_m^{\mathrm{left}}(\theta)\big) + \frac{n_{Q_m^{\mathrm{right}}(\theta)}}{n_{Q_m}} H\big(Q_m^{\mathrm{right}}(\theta)\big). \qquad (5)$$

We adopt an exhaustive method to find the optimal partition, that is, we iterate through all possible partitions and select the one with the minimum impurity:

$$\theta^{*} = \mathop{\arg\min}_{\theta} \; G(Q_m, \theta). \qquad (6)$$
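For concreteness, the exhaustive split search of Eqs. 3–6 for a single tree node can be sketched as follows, using variance (mean squared deviation) as the impurity in Eq. 4 and encoding each hyper-parameter example as a numeric feature vector; the encoding itself is an assumption of this sketch.

```python
# Exhaustive best-split search for one regression-tree node (Eqs. 3-6).
# X: (n_samples, n_hyperparams) array of encoded hyper-parameter values,
# y: corresponding Spearman correlations rho (numpy array).
import numpy as np

def impurity(y):                                  # Eq. 4: variance impurity
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    best_theta, best_g = None, np.inf             # theta = (feature index, threshold)
    n = len(y)
    for i in range(X.shape[1]):
        for t in np.unique(X[:, i]):
            left = X[:, i] <= t                   # Eq. 3: partition Q_m by (lambda_i, t)
            if left.all() or not left.any():
                continue
            g = (left.sum() / n) * impurity(y[left]) + \
                ((~left).sum() / n) * impurity(y[~left])   # Eq. 5: weighted impurity
            if g < best_g:                        # Eq. 6: keep minimum-impurity split
                best_theta, best_g = (i, t), g
    return best_theta, best_g
```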

Hyper-parameter importance. For every node in a regression tree, we calculate the parameter importance as the decrease in node impurity, weighted by the fraction of samples that reach the node. The parameter importance for node $m$ is defined as:

$$I_m = \frac{n_{Q_m}}{n_{D}} \Big( H(Q_m) - \frac{n_{Q_m^{\mathrm{left}}}}{n_{Q_m}} H\big(Q_m^{\mathrm{left}}\big) - \frac{n_{Q_m^{\mathrm{right}}}}{n_{Q_m}} H\big(Q_m^{\mathrm{right}}\big) \Big), \qquad (7)$$

where $n_D$ is the total number of training examples. The importance of each hyper-parameter $\lambda_i$ is the summation of $I_m$ over all nodes in the random forest that use $\lambda_i$ as the partition parameter:

$$I(\lambda_i) = \sum_{m \,:\, \theta_m \text{ splits on } \lambda_i} I_m. \qquad (8)$$

Parameter pruning. After the importance estimation in Eq. 8, the hyper-parameter $\lambda_i$ with the lowest importance is pruned by fixing it to a single value:

$$\lambda_i \leftarrow v_i. \qquad (9)$$

During the pruning iterations, $v_i$ is the candidate value of $\lambda_i$ with the lowest FLOPs; for the last remaining hyper-parameter, $v_i$ is instead the value achieving the maximum $\rho$ in $D$. The pruning step significantly improves the search efficiency. By setting less important hyper-parameters to values with lower resource consumption, we can allocate more computational resources to the important ones. The minimum importance pruning algorithm is summarized in Alg. 1.
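Putting the pieces together, a high-level sketch of the MIP loop of Alg. 1 is given below. It reuses the sample_bpe and evaluate_bpe sketches above, assumes a hypothetical to_vector encoder that maps a hyper-parameter dict to a numeric feature vector, and approximates the importance computation of Eqs. 7–8 with scikit-learn's impurity-based feature_importances_; the handling of Eq. 9 is simplified.

```python
# Sketch of Minimum Importance Pruning (Alg. 1), with the random-forest
# importance of Eqs. 7-8 delegated to scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def minimum_importance_pruning(space, archs, full_perf, train_and_eval,
                               flops_of, to_vector, samples_per_round=8):
    all_names = list(space)                      # fixed feature ordering for the forest
    space, fixed, history = dict(space), {}, []  # `fixed` holds pruned hyper-parameters
    while len(space) > 1:
        for _ in range(samples_per_round):       # sample T examples around the lowest cost
            lam = {**fixed, **sample_bpe(space, flops_of)}
            history.append((lam, evaluate_bpe(archs, lam, full_perf, train_and_eval)))
        X = np.array([to_vector(lam, all_names) for lam, _ in history])
        y = np.array([rho for _, rho in history])
        forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
        # Least important among the hyper-parameters that are still free (Eq. 8).
        least = min(space, key=lambda n: forest.feature_importances_[all_names.index(n)])
        # Eq. 9: fix the pruned hyper-parameter to its cheapest candidate value.
        fixed[least] = min(space.pop(least), key=lambda v: flops_of(least, v))
    best_lam, _ = max(history, key=lambda item: item[1])   # best sampled example
    # Simplification: later-pruned parameters are overridden by their fixed cheap values.
    return {**best_lam, **fixed}
```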

Hyper-parameter BPE-1 BPE-2 DARTS[28]
Epoch 10 30 50
Batch size 128 128 64
Learning rate 0.03 0.03 0.025
N_Layers 6 16 8
Channels 8 16 16
Image Size 16 16 32
Correlation 0.50 0.63 0.57
Training Time 0.08 0.55 1.38
Table 1: Detailed hyper-parameters of the best settings discovered on CIFAR-10 by MIP. The found BPE-1 and BPE-2 show a better correlation with less average training time (in GPU hours) than the DARTS [28] setting.
Architecture | Test Error (%) | Params (M) | Search Cost (GPU days) | Search Method
ResNet-18 [15] | 3.53 | 11.1 | - | Manual
DenseNet [18] | 4.77 | 1.0 | - | Manual
SENet [17] | 4.05 | 11.2 | - | Manual
NASNet-A [52] | 2.65 | 3.3 | 1800 | RL
ENAS [31] | 2.89 | 4.6 | 0.5 | RL
Path-level NAS [8] | 3.64 | 3.2 | 8.3 | RL
RL + BPE-1 (Ours) | 2.66±0.05 | 2.7 | 0.33 | RL
RL + BPE-2 (Ours) | 2.65±0.12 | 2.9 | 2 | RL
AmoebaNet-B [34] | 2.55 | 2.8 | 3150 | Evolution
EA + BPE-1 (Ours) | 2.68±0.09 | 2.46 | 0.33 | Evolution
EA + BPE-2 (Ours) | 2.66±0.07 | 2.87 | 2 | Evolution
DARTS [28] | 2.7±0.01 | 3.1 | 1.5 | Gradient-based
GDAS [14] | 2.93±0.07 | 3.4 | 0.8 | Gradient-based
P-DARTS [11] | 2.75±0.06 | 3.4 | 0.3 | Gradient-based
SNAS [43] | 2.85±0.02 | 2.8 | 1.5 | Gradient-based
DARTS + BPE-1 (Ours) | 2.89±0.0 | 3.9 | 0.05 | Gradient-based
DARTS + BPE-2 (Ours) | 2.72±0.0 | 4.04 | 0.33 | Gradient-based
Random Sample 100 | 2.55 | 2.9 | 108 | Random Search
Random Sample 100 + BPE-1 (Ours) | 2.68±0.09 | 2.7 | 0.33 (337) | Random Search
Random Sample 100 + BPE-2 (Ours) | 2.68±0.05 | 1.9 | 2 (54) | Random Search
Table 2: Comparison of test error rates for our discovered architectures, human-designed networks and other NAS architectures on CIFAR-10. For a fair comparison, we select architectures and results with a similar number of parameters and the same training condition (all networks are trained with cutout [13]). Values are reported as mean ± standard deviation across multiple runs.

5 Experiment

As mentioned before, evaluating a single BPE example requires training every architecture in $\mathcal{S}$, which amounts to a substantial number of GPU hours. As a consequence, with a comparable sampling budget, methods such as Bayesian optimization or random search would require a prohibitive amount of computation (almost infeasible), whereas the proposed method needs far fewer GPU hours. Therefore, we do not compare against these methods in our paper.

We first combine BPE with different search strategies, including Reinforcement Learning (RL) [21], Evolutionary Algorithms (EA) [2], Random Search (RS) [4] and Differentiable Architecture Search (DARTS) [28]. As shown in Sec. 5.1, we compare with state-of-the-art methods in terms of both effectiveness and efficiency on CIFAR-10 [23] and ImageNet [35]. In Sec. 5.2, we investigate the effect of each hyper-parameter in BPE, as well as the efficiency of using the Spearman Rank Correlation as the objective function. Although many works [38, 25] have pointed out that one-shot based methods [37, 31, 28, 3, 50] cannot effectively estimate performance over the entire search space, in Sec. 5.3 we find that these methods are indeed effective in a local search space, which reasonably explains both their effectiveness and their instability, i.e., the corresponding algorithms are actually able to find good architectures, yet the optimal architectures differ across runs because only local information is used.

5.1 Comparing with State-of-the-arts

We first search neural architectures using the found BPE-1 and BPE-2 in Tab. 1, and then evaluate the best architecture with a stacked deeper network. To ensure the stability of the proposed method, we run each experiment multiple times and find that the resulting architectures only show a slight variance in performance.

5.1.1 Experimental Settings

We use the same datasets and evaluation metrics as existing NAS methods [28, 8, 52, 27]. First, most experiments are conducted on CIFAR-10 [24], which has 50K training images and 10K testing images from 10 classes at a resolution of 32×32. During the architecture search, we randomly select a portion of the training set as a validation set. To further evaluate the generalization capability, we stack the optimal cell discovered on CIFAR-10 into a deeper network, and then evaluate the classification accuracy on ILSVRC 2012 [35], which consists of 1,000 classes with 1.28M training images and 50K validation images. Here, we consider the mobile setting, where the input image size is 224×224 and the number of FLOPs is less than 600M.

In the search process, we directly use the found BPE-1 and BPE-2 in Tab. 1 as the performance estimation for the other search algorithms. After finding the optimal architecture in the search space, we validate the final accuracy by training a large stacked network for a much larger number of epochs with an additional regularization such as cutout [13], similar to [28, 52, 31]. When stacking cells to evaluate on ImageNet, we use two initial convolutional layers of stride 2 before stacking the cells, with scale reductions at fixed positions among the stacked cells. The total number of FLOPs is determined by the initial number of channels. The network is trained for 250 epochs with a batch size of 512, weight decay, and an initial SGD learning rate of 0.1. All experiments and models are implemented in PyTorch [30].

Model | Top-1 (%) | Params (M) | Search time (GPU days)
MobileNetV2 [36] | 72.0 | 3.4 | -
ShuffleNetV2 2x (V2) [29] | 73.7 | 5 | -
NASNet-A [52] | 74.0 | 5.3 | 1800
AmoebaNet-A [34] | 74.5 | 5.1 | 3150
MnasNet-92 [40] | 74.8 | 4.4 | -
SNAS [43] | 72.7 | 4.3 | 522
DARTS [28] | 73.1 | 4.9 | 4
RL + BPE-1 (Ours) | 74.18 | 5.5 | 0.33
EA + BPE-1 (Ours) | 74.56 | 5.0 | 0.33
RS + BPE-1 (Ours) | 74.2 | 5.5 | 0.33
DARTS [28] + BPE-1 (Ours) | 74.0 | 5.9 | 0.05
Table 3: Comparison with the state-of-the-art image classification methods on ImageNet. All the NAS networks in this table are searched on CIFAR10, and are then directly transferred to ImageNet.

5.1.2 Results on CIFAR-10

We compare our method with both manually designed networks and NAS networks. The manually designed networks include ResNet [15], DenseNet [18] and SENet [17]. We evaluate four categories of NAS methods, i.e., RL methods (NASNet [52], ENAS [31] and Path-level NAS [8]), evolutionary algorithms (AmoebaNet [34]), gradient-based methods (DARTS [28]) and Random Search.

The results for convolutional architectures on CIFAR-10 are presented in Tab. 2. It is worth noting that the found BPE, combined with various search algorithms, outperforms various state-of-the-art search algorithms [52, 28, 33] in accuracy with a much lower computational consumption (e.g., 3,150 GPU days in [34]). We attribute these superior results to the found BPE. Another notable observation from Tab. 2 is that, even with random search in the search space, the test error rate is only 2.68%, which outperforms previous methods in the same search space. In conclusion, with the found BPE, search algorithms can quickly explore the architecture search space and generate better architectures. We also report the results of hand-crafted networks in Tab. 2; clearly, our method shows a notable improvement over them.

Figure 3: The relationship between $\rho$ and the final performance for randomly sampled BPEs in $\Lambda$. The x-axis measures the Spearman Rank Correlation $\rho$ of Eq. 1, and the y-axis measures the real performance of the architecture found by DARTS+BPE on CIFAR-10 under each setting. $\rho$ and the final performance are highly correlated.
Figure 4: The importance (in brackets) and the regression prediction curves, with mean and variance, learned by the random forest for each hyper-parameter. Importance is highly correlated with the curve steepness. For the two most important parameters (epoch and layer), a high $\rho$ can be obtained within a small range of values.

5.1.3 Results on ImageNet

We further compare our method under the mobile setting on ImageNet to demonstrate its generalizability. The best architecture found on CIFAR-10 is transferred to ImageNet, following the same experimental settings as [52, 31, 8]. The results in Tab. 3 show that the best cell architecture found on CIFAR-10 is transferable to ImageNet. The proposed method achieves accuracy comparable to the state-of-the-art methods [52, 34, 27, 31, 28, 8] while using far fewer computational resources, e.g., thousands of times faster than the EA [34] and RL [52] baselines in Tab. 3 (0.33 vs. 3,150 and 1,800 GPU days).

5.2 Deep Analysis in Performance Estimation

We further study the efficiency of using the Spearman Rank Correlation as the objective function in Fig. 3. In Fig. 4, we also provide an in-depth analysis of the importance of every hyper-parameter. One can make use of this analysis to transfer the found BPEs to other datasets and tasks.

We randomly select a set of hyper-parameter settings in $\Lambda$ and apply them to the DARTS [28] search algorithm to find optimal architectures. Fig. 3 illustrates the relationship between $\rho$ and the accuracy of the optimal architecture found under the corresponding setting. The performance is highly correlated with $\rho$, which demonstrates the efficiency of the proposed objective function in Eq. 1.

After exploring the BPE space with the proposed method, we obtain a dataset of $(\lambda, \rho)$ pairs, which is used as the training set to train a random forest regression predictor [16] for each hyper-parameter. We then report the $\rho$ estimated by the predictor and the importance of each hyper-parameter in Fig. 4. As illustrated in Fig. 4, the steepness and the importance are highly correlated, i.e., the more important a parameter is, the steeper its corresponding curve is, and vice versa. At the same time, for the two most important parameters (epoch and layer), a high $\rho$ is reached within a small range of values. This means that only these two parameters need to be carefully fine-tuned, within a small range, when transferring to other datasets.

Training setting | Epoch 50 | Epoch 200 | Epoch 400 | Epoch 600
Fair, Global ρ | 0.10 | -0.06 | 0.13 | -0.03
Fair, Local ρ | 0.13 | 0.50 | 0.31 | 0.31
Random, Global ρ | 0.10 | -0.05 | -0.30 | -0.19
Random, Local ρ | -0.14 | -0.52 | 0.61 | -0.01
Random_10, Global ρ | 0.0 | 0.0 | 0.02 | -0.08
Random_10, Local ρ | 0.26 | 0.11 | 0.57 | 0.58
Table 4: Comparison of the global and local $\rho$ under different training conditions. "Fair" denotes that each operation on an edge is trained for exactly the same number of epochs. "Random" denotes that each operation on an edge is trained for a random number of epochs, at different levels of randomness. Global and local denote that the trained model is used for performance estimation globally and locally, respectively.

5.3 Understanding One-shot based Methods

Previous works [25, 38] have reported that one-shot based methods such as DARTS do not work well (in some cases being no better than random search). Two main questions have not been explained yet: (1) Why can one-shot based methods find good neural architectures even though they cannot accurately estimate performance? (2) Why are one-shot based methods unstable, i.e., why do different random seeds lead to different found networks? With the found BPE, we can effectively investigate every search phase of these methods.

To understand and answer these questions, we first train the same hypergraph with different settings: (1) fair training, where each operation on an edge is trained for exactly the same number of epochs; (2) random training, where each operation on an edge is trained for a random number of epochs, at different levels of randomness. In Tab. 4, we report the global and local $\rho$ for fair training and random training. The global $\rho$ means that we use the trained hypergraph to obtain the validation performance of the networks in $\mathcal{S}$, and then calculate $\rho$ against $P_{\lambda^{*}}$. The local $\rho$ is obtained as follows: while training the hypergraph, we save the sampled network architectures and their validation performance at each reported epoch; the local $\rho$ is then calculated via Eq. 1 between these values and the performance of the same architectures under the found BPE-2. As illustrated in Tab. 4, one-shot based methods yield a poor global $\rho$, which is consistent with previous works [25, 38]. However, they yield a high local $\rho$, which means that these methods essentially rely on local information. That is to say, each epoch in the search phase can only perceive and optimize using local information, which reasonably explains the instability of DARTS.

6 Conclusion

In this paper, we present the first systematic analysis of budgeted performance estimation (BPE) in NAS, and propose minimum importance pruning (MIP) to find an optimal BPE. The proposed MIP gradually reduces the number of free BPE hyper-parameters, allocating more computational resources to the more important ones. The found MIP-BPE generalizes to various search algorithms, including reinforcement learning, random search, evolutionary algorithms and gradient-based methods. Combining the found BPE with various NAS algorithms, we reach state-of-the-art test error on CIFAR-10 with much less search time, which also helps to better understand the widely-used one-shot based methods.

Acknowledgements.

This work is supported by the Natural Science Foundation of China (No.U1705262, No.61772443, No.61572410, No.61802324 and No.61702136), the National Key R&D Program (No.2017YFC0113000 and No.2016YFB1001503), and the Natural Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106).

References

  • [1] Y. Akimoto, S. Shirakawa, N. Yoshinari, K. Uchida, S. Saito, and K. Nishida (2019) Adaptive stochastic natural gradient method for one-shot neural architecture search. In ICML, Cited by: §2.1.
  • [2] T. Back (1996) Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford University Press. Cited by: §5.
  • [3] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. Le (2018) Understanding and simplifying one-shot architecture search. In ICML, Cited by: §5.
  • [4] J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. JMLR. Cited by: §2.2, §5.
  • [5] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. (2016) End to end learning for self-driving cars. arXiv. Cited by: §1.
  • [6] L. Breiman (2001) Random forests. Machine learning. Cited by: §1, §4.2.
  • [7] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang (2018) Efficient architecture search by network transformation. In AAAI, Cited by: §1.
  • [8] H. Cai, J. Yang, W. Zhang, S. Han, and Y. Yu (2018) Path-level network transformation for efficient architecture search. arXiv. Cited by: Table 2, §5.1.1, §5.1.2, §5.1.3.
  • [9] H. Chen, L. Zhuo, B. Zhang, X. Zheng, J. Liu, D. Doermann, and R. Ji (2019) Binarized neural architecture search. arXiv preprint arXiv:1911.10862. Cited by: §1.
  • [10] L. Chen, M. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens (2018) Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS, Cited by: §1.
  • [11] X. Chen, L. Xie, J. Wu, and Q. Tian (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. ICCV. Cited by: §1, §1, §2.1, §3.2, Table 2.
  • [12] H. Cheng, D. Lian, B. Deng, S. Gao, T. Tan, and Y. Geng (2019-06) Local to global learning: gradually adding classes for training deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [13] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv. Cited by: §3.1, Table 2, §5.1.1.
  • [14] X. Dong and Y. Yang (2019) Searching for a robust neural architecture in four gpu hours. In CVPR, Cited by: Table 2.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, Table 2, §5.1.2.
  • [16] H. Hoos and K. Leyton-Brown (2014) An efficient approach for assessing hyperparameter importance. In ICML, pp. 754–762. Cited by: §1, §4.2, §5.2.
  • [17] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, Cited by: Table 2, §5.1.2.
  • [18] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, Cited by: Table 2, §5.1.2.
  • [19] F. Hutter, H. H. Hoos, and K. Leyton-Brown (2011) Sequential model-based optimization for general algorithm configuration. In LION, Cited by: §2.2.
  • [20] R. Ji, K. Li, Y. Wang, X. Sun, F. Guo, X. Guo, Y. Wu, F. Huang, and J. Luo (2019) Semi-supervised adversarial monocular depth estimation. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1.
  • [21] L. P. Kaelbling, M. L. Littman, and A. W. Moore (1996) Reinforcement learning: a survey. JAIR. Cited by: §5.
  • [22] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter (2016) Fast bayesian optimization of machine learning hyperparameters on large datasets. arXiv. Cited by: §1, §2.1.
  • [23] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Cited by: §5.
  • [24] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Cited by: §5.1.1.
  • [25] L. Li and A. Talwalkar (2019) Random search and reproducibility for neural architecture search. arXiv. Cited by: §3.1, §5.3, §5.3, §5.
  • [26] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. Yuille, and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. CVPR. Cited by: §1.
  • [27] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In ECCV, Cited by: §5.1.1, §5.1.3.
  • [28] H. Liu, K. Simonyan, and Y. Yang (2019) Darts: differentiable architecture search. ICLR. Cited by: Figure 1, §1, §1, §1, §2.1, §2.1, §3.1, §3.1, §3.2, Table 1, Table 2, §5.1.1, §5.1.1, §5.1.2, §5.1.2, §5.1.3, §5.2, Table 3, §5.
  • [29] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In ECCV, Cited by: Table 3.
  • [30] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §5.1.1.
  • [31] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. ICML. Cited by: §1, §1, §1, §2.1, §3.2, Table 2, §5.1.1, §5.1.2, §5.1.3, §5.
  • [32] I. Radosavovic, J. Johnson, S. Xie, W. Lo, and P. Dollár (2019) On network design spaces for visual recognition. arXiv. Cited by: §4.1.
  • [33] E. Real, A. Aggarwal, Y. Huang, and Q. Le (2019) Aging evolution for image classifier architecture search. In AAAI, Cited by: §1, §1, §1, §3.2, §5.1.2.
  • [34] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. AAAI. Cited by: Figure 1, §1, §1, §1, §1, §3.2, Table 2, §5.1.2, §5.1.2, §5.1.3, Table 3.
  • [35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. IJCV. Cited by: §5.1.1, §5.
  • [36] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In CVPR, Cited by: Table 3.
  • [37] S. Saxena and J. Verbeek (2016) Convolutional neural fabrics. In NeurlPS, Cited by: §5.
  • [38] C. Sciuto, K. Yu, M. Jaggi, C. Musat, and M. Salzmann (2019) Evaluating the search phase of neural architecture search. arXiv. Cited by: §5.3, §5.3, §5.
  • [39] J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In NeurIPS, pp. 2951–2959. Cited by: §2.2.
  • [40] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) MnasNet: platform-aware neural architecture search for mobile. In CVPR, Cited by: Table 3.
  • [41] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown (2013) Auto-weka: combined selection and hyperparameter optimization of classification algorithms. In SIGKDD, Cited by: §2.2.
  • [42] F. Wan, C. Liu, W. Ke, X. Ji, J. Jiao, and Q. Ye (2019) C-mil: continuation multiple instance learning for weakly supervised object detection. In IEEE CVPR, pp. 2199–2208. Cited by: §1.
  • [43] S. Xie, H. Zheng, C. Liu, and L. Lin (2019) SNAS: stochastic neural architecture search. In ICLR, Cited by: Table 2, Table 3.
  • [44] Y. Xu, L. Xie, X. Zhang, X. Chen, G. Qi, Q. Tian, and H. Xiong (2019) Pc-darts: partial channel connections for memory-efficient differentiable architecture search. arXiv. Cited by: §1, §1, §2.1, §3.1, §3.2.
  • [45] A. Zela, A. Klein, S. Falkner, and F. Hutter (2018) Towards automated deep learning: efficient joint neural architecture and hyperparameter search. ICML Workshop. Cited by: §2.2.
  • [46] X. Zhang, F. Wan, C. Liu, R. Ji, and Q. Ye (2019) FreeAnchor: learning to match anchors for visual object detection. In NeurIPS, pp. 147–155. Cited by: §1.
  • [47] X. Zheng, R. Ji, X. Sun, Y. Wu, F. Huang, and Y. Yang (2018) Centralized ranking loss with weakly supervised localization for fine-grained object retrieval.. In IJCAI, pp. 1226–1233. Cited by: §1.
  • [48] X. Zheng, R. Ji, X. Sun, B. Zhang, Y. Wu, and F. Huang (2019) Towards optimal fine grained retrieval via decorrelated centralized loss with normalize-scale layer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9291–9298. Cited by: §1.
  • [49] X. Zheng, R. Ji, L. Tang, Y. Wan, B. Zhang, Y. Wu, Y. Wu, and L. Shao (2019) Dynamic distribution pruning for efficient network architecture search. arXiv. Cited by: §1, §1, §3.1, §3.1, §3.2.
  • [50] X. Zheng, R. Ji, L. Tang, B. Zhang, J. Liu, and Q. Tian (2019) Multinomial distribution learning for effective neural architecture search. In ICCV, Cited by: §2.1, §3.2, §5.
  • [51] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv. Cited by: §1, §1, §1, §1, §3.2.
  • [52] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In CVPR, Cited by: Figure 1, §1, §1, §1, §1, §1, §2.1, §3.1, §3.1, §3.2, Table 2, §5.1.1, §5.1.1, §5.1.2, §5.1.2, §5.1.3, Table 3.