A Surgery of the Neural Architecture Evaluators

08/07/2020 ∙ by Xuefei Ning, et al. ∙ Tsinghua University

Neural architecture search (NAS) has recently received extensive attention due to its effectiveness in automatically designing effective neural architectures. A major challenge in NAS is the fast and accurate evaluation of neural architectures. Commonly used fast architecture evaluators include one-shot evaluators (including weight-sharing and hypernet-based ones) and predictor-based evaluators. Despite their high evaluation efficiency, the evaluation correlation of these evaluators is still questionable. In this paper, we conduct an extensive assessment of both one-shot and predictor-based evaluators on the NAS-Bench-201 benchmark search space, and break down how and why different factors influence the evaluation correlation and other NAS-oriented criteria. Code is available at https://github.com/walkerning/aw_nas.




1 Introduction

Neural Architecture Search (NAS) has received extensive attention due to its capability to discover neural network architectures in an automated manner. Studies have shown that architectures automatically discovered by NAS can outperform hand-crafted architectures in various applications, such as classification [nayman2019xnas, zoph2016neural], detection [ghiasi2019fpn, chen2019detnas], video understanding [ryoo2019assemblenet], and text modeling [zoph2016neural]. The vanilla NAS algorithm [zoph2016neural] suffers from an extremely heavy computational burden, since the evaluation of neural architectures is slow. Thus, evaluating a neural architecture in a fast and accurate way is vital for addressing the computational challenge of NAS.

A neural architecture evaluator takes an architecture as input and outputs a score that indicates the quality of the architecture. In both hyperparameter optimization and neural architecture search, the straightforward way to evaluate a configuration or an architecture is to train a model from scratch to convergence and then test it on the validation dataset, which is extremely time-consuming. Instead of exactly evaluating architectures on the target task, researchers usually construct a proxy model with fewer layers or fewer channels [enas, real2019regularized, wu2019fbnet], and train this model on a proxy task of smaller scale [cai2018efficient, elsken2018efficient, klein2017fast, wu2019fbnet], e.g., a smaller dataset or subset of the dataset, or training or finetuning for fewer epochs.

Traditional evaluators conduct separate training phases to approximately discover the weights that are suitable for each architecture. In contrast, one-shot evaluation amortizes the training cost of different architectures through parameter sharing or a global hypernetwork, thus significantly reducing the architecture evaluation cost. enas construct an over-parametrized super network (supernet) such that all architectures in the search space are sub-architectures of the supernet. Throughout the search process, the shared parameters in the supernet are updated on the training dataset split, and each architecture is evaluated by directly using the corresponding subset of the weights in the supernet. The parameter sharing technique has since been widely used for architecture search in different search spaces [wu2019fbnet], and incorporated with different search strategies [darts, nayman2019xnas, xie2018snas, yang2019cars]. Hypernetwork-based evaluation is another type of one-shot evaluation strategy: brock2018smash, zhang2018graph utilized hypernetworks to generate proper weights for evaluating each architecture.
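The parameter-sharing idea can be illustrated with a minimal sketch (not the paper's implementation; the op list, edge count, and scalar "weights" are simplifications for illustration): a supernet stores one shared parameter set per (edge, operation) pair, and evaluating a candidate architecture only reads its own subset of those shared parameters, so training cost is amortized across all architectures.

```python
import random

# Candidate operations and edge count, loosely modeled on a
# NAS-Bench-201-like cell (hypothetical simplification).
OPS = ["none", "skip", "conv1x1", "conv3x3", "avgpool"]
N_EDGES = 6

# Shared weights: one entry per (edge, op); a scalar stands in for a tensor.
supernet = {(e, op): 0.0 for e in range(N_EDGES) for op in OPS}

def sample_architecture():
    """Uniformly sample one operation per edge."""
    return tuple(random.choice(OPS) for _ in range(N_EDGES))

def weights_of(arch):
    """Return the subset of shared weights this sub-architecture uses."""
    return {(e, op): supernet[(e, op)] for e, op in enumerate(arch)}

def train_step(arch, update=0.1):
    """Update only the sampled sub-architecture's shared weights."""
    for e, op in enumerate(arch):
        supernet[(e, op)] += update

random.seed(0)
for _ in range(100):  # supernet training loop: sample, then update
    train_step(sample_architecture())

arch = sample_architecture()
print(len(weights_of(arch)))  # 6: one shared entry per edge is reused
```

Evaluating any of the 5^6 = 15625 architectures touches only its 6 shared entries, which is the source of both the speedup and the ranking-correlation concerns discussed below.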

Whether the one-shot strategies can provide highly correlated evaluation results for different architectures is essential for the efficacy of the NAS process. bender2018understanding conduct a first study on the correlation between standalone architecture performances and the metrics evaluated with parameter sharing. A more recent study [sciuto2019evaluating] finds that parameter sharing evaluation cannot reflect the true performance ranking of architectures well enough. However, they only conduct experiments in a toy search space with only 32 architectures in total.

Besides one-shot evaluation strategies, predictor-based evaluation strategies [nao2018, liu2018progressive, deng2017peephole, sun2019e2pp, alphax, xu2019renas, ning2020generic] use a performance predictor that takes the architecture description as input and outputs a predicted performance score. The performance predictor should be trained using “ground-truth” architecture performances. Thus, in a predictor-based NAS framework, there should be another “oracle” evaluator that provides the instruction signals for training the predictor. nao2018 make an attempt to use the noisy signals provided by a parameter-sharing-based one-shot evaluator to train the predictor. They find that, compared with using the more expensive traditional evaluation, using one-shot evaluation to provide the instruction signals leads to the discovery of much worse architectures. In this paper, we try to answer the question of whether it is reasonable to use a one-shot evaluator as the oracle evaluator to train a predictor.

Figure 1: The overview of the neural architecture evaluators

Current fast evaluation strategies for neural architectures are summarized in Fig. 1, including shared weights, hypernetworks, and predictor-based ones. This paper aims to systematically reveal the status and shortcomings of current architecture evaluation strategies. Specifically, we conduct a controlled and comprehensive evaluation of different neural architecture evaluation strategies with various criteria, and also analyze the evaluation results to identify the architectures that are under- or over-estimated.

2 Related Work

2.1 Training Acceleration with Proxy Tasks and Models

Since the exact evaluation of a hyperparameter setting or a neural architecture is time-consuming, it is common practice to evaluate smaller proxy models [enas, real2019regularized, wu2019fbnet] on smaller proxy tasks [cai2018efficient, elsken2018efficient, klein2017fast, wu2019fbnet], e.g., a smaller dataset or subset of the dataset, or training or finetuning for fewer epochs. When the proxy setting is too aggressive, the resulting proxy metrics can be weakly correlated with the true architecture performances. One method of using low-fidelity proxy metrics is to only partially train a hyperparameter setting or a neural architecture by early stopping the training process. On this basis, domhan2015speeding, baker2017accelerating proposed training curve extrapolation to predict the final performance. Given an overall search budget, there is also research [li2017hyperband, falkner2018bohb] exploring strategies for trading off between the number of explored configurations and the training resources for a single configuration.

To shorten the finetuning time with better weight initialization, network morphism based methods [cai2018efficient, elsken2018efficient] conduct search based on architecture mutation decisions, and initialize the weights of each candidate architecture by inheriting and transforming weights from its parent model.

In this paper, we investigate the correlation gaps brought by several types of proxy tasks or models, including channel number reduction, layer number reduction, and training epoch reduction.

2.2 One-shot Evaluators

One-shot evaluation mainly consists of two types of strategies: 1) weight sharing [enas, wu2019fbnet, darts, nayman2019xnas, xie2018snas, yang2019cars], 2) hypernetworks [brock2018smash, zhang2018graph]. These two strategies both amortize the training cost of different architectures via the sharing of the network or hypernetwork parameters.

The ranking correlation gaps of existing shared-weights evaluators come from two factors: 1) the proxy model and task: due to memory constraints, a proxy super network (supernet) [darts, enas] with fewer channels or layers is usually used; 2) weight sharing itself. To alleviate the first factor, some studies [cai2018proxylessnas, chen2019progressive] aim at making one-shot evaluation more memory efficient, so that the one-shot search can be carried out without a proxy supernet. For the second factor, a few studies have carried out correlation evaluations of one-shot evaluators. zhang2018graph (GHN) conducted a correlation comparison between the GHN hypernetwork evaluator, a shared-weights evaluator, and several small proxy tasks. However, the correlation is evaluated on only 100 architectures randomly sampled from a large search space, which is not a convincing and consistent benchmark metric. luo2019understanding did a preliminary investigation into why weight sharing evaluation fails to provide correlated evaluations, and proposed to increase the sampling probabilities of the large models. Their evaluation is also conducted on only dozens of architectures sampled from the search space.

zela2020bench propose a benchmark framework to assess weight sharing NAS on NAS-Bench-101, and compare the correlation of different search strategies. sciuto2019evaluating conduct weight sharing NAS in a toy RNN search space with only 32 architectures in total, and discover that the weight sharing rankings do not correlate with the true rankings of architectures. To improve the correlation of one-shot evaluation, chu2019fairnas propose a sampling strategy in a layer-wise search space.

In summary, a correlation evaluation of various one-shot evaluation strategies on all architectures in more meaningful search spaces is still missing in the literature. In this paper, we analyze the ranking correlation gaps that are brought by the model proxy (difference in the number of channels and layers) and the one-shot evaluation, respectively. Moreover, after investigating the factors that influence the evaluation correlation, we give some suggestions on improving the evaluation quality.

2.3 Predictor-based Evaluators

An architecture performance predictor takes architecture descriptions as inputs, and outputs predicted performance scores without training the architectures. In the overall NAS framework, the predictor-based evaluator plays a different role from traditional or one-shot evaluators. Fig. 6 shows the general workflow of predictor-based NAS: the predictor is trained using “ground-truth” architecture performances, which are provided by another “oracle” evaluator. Usually, expensive traditional evaluators that provide relatively accurate architecture performances are chosen as the oracle evaluators [kandasamy2018bayesian, liu2018progressive, nao2018]. Utilizing a good predictor, we can choose the architectures that are most worth evaluating with the expensive oracle evaluator. Thus, the fitness of the performance predictor is vital to the efficacy of the NAS algorithm, as fewer architectures need to be trained when a good predictor is incorporated.
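The predictor-in-the-loop workflow can be sketched as follows. This is a hypothetical toy, not any paper's implementation: "architectures" are integers, the "oracle" is a linear toy accuracy, and the "predictor" is a hand-rolled least-squares fit; the point is only the loop structure, in which the predictor decides where the oracle budget is spent.

```python
import random

def fit_linear(pairs):
    """Least-squares fit of score = slope * x + intercept on labeled data."""
    xs = [float(a) for a, _ in pairs]
    ys = [float(y) for _, y in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predictor_based_nas(space, oracle, n_init=10, n_iter=3, k=5):
    """Label a few architectures with the expensive oracle, fit the
    predictor, then spend the remaining oracle budget only on the
    architectures the predictor scores highest."""
    labeled = {a: oracle(a) for a in random.sample(space, n_init)}
    for _ in range(n_iter):
        slope, intercept = fit_linear(labeled.items())
        candidates = [a for a in space if a not in labeled]
        candidates.sort(key=lambda a: -(slope * a + intercept))
        for a in candidates[:k]:  # oracle budget goes to promising candidates
            labeled[a] = oracle(a)
    return max(labeled, key=labeled.get)

random.seed(0)
space = list(range(100))          # toy "architectures": integers 0..99
oracle = lambda a: 2 * a + 1      # toy "accuracy", linear in the encoding
best = predictor_based_nas(space, oracle)
print(best)  # 99: the predictor steers the oracle budget to the true optimum
```

Because the toy oracle is exactly linear, the fitted predictor ranks unlabeled candidates perfectly after the first round, so only a handful of oracle calls are needed instead of labeling all 100 "architectures".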

Two factors are crucial to the fitness of the predictors: 1) the embedding space; 2) the training technique. On the one hand, to embed neural architectures into a continuous space and obtain a meaningful embedding space, studies have proposed different architecture encoders, e.g., sequence-based [nao2018, liu2018progressive, deng2017peephole, sun2019e2pp, alphax] and graph-based [shi2019multi, ning2020generic] ones. As for nonparametric predictors, kandasamy2018bayesian design a kernel function in the architecture space and exploit a Gaussian process to obtain the posterior of the architecture performances. shi2019multi combined a graph-based encoder and a nonparametric Gaussian process to construct the performance predictor. On the other hand, from the aspect of training techniques, nao2018 employed an encoder-decoder structure and used a reconstruction loss as an auxiliary loss term. xu2019renas, ning2020generic employed learning-to-rank techniques to train the predictors.

3 Evaluation Criteria

In this section, we introduce the evaluation criteria used in this paper. We denote the search space size as $N$, and the true performances and approximated evaluation scores of the architectures as $\{y_i\}_{i=1}^N$ and $\{s_i\}_{i=1}^N$, respectively. We denote the rankings of the true performance and the evaluated score of architecture $i$ as $r_i$ and $\hat{r}_i$ ($r_i = 1$ indicates that architecture $i$ is the best in the search space). Firstly, the correlation criteria adopted in our paper are

  • Linear correlation: the Pearson correlation coefficient between $y$ and $s$.

  • Kendall’s Tau ranking correlation: the relative difference of concordant and discordant pairs, $\tau = (N_c - N_d) / \binom{N}{2}$.

  • Spearman’s ranking correlation: the Pearson correlation coefficient between the rank variables $r$ and $\hat{r}$.

Besides these correlation criteria, we also adopt several criteria that place more emphasis on the relative order of the architectures with good performances. Denoting by $A_K$ the set of architectures whose evaluated scores are among the top $K$ proportion of the search space, these two criteria are

  • Precision@K (P@K): the proportion of architectures in $A_K$ whose true performances are also among the top $K$ proportion, $\mathrm{P@K} = |\{i \in A_K : r_i \le KN\}| / (KN)$.

  • BestRanking@K (BR@K): the best normalized true ranking among the top $K$ proportion of architectures according to the scores, $\mathrm{BR@K} = \min_{i \in A_K} r_i / N$.

The two criteria are similar to those used in ning2020generic, except that rankings and architecture numbers are all normalized with respect to the search space size $N$.

The above criteria are used to compare the fitness of various architecture evaluators with different configurations. Not only do we want to choose appropriate configurations of architecture evaluators, we would also like to interpret their evaluation results. To identify which architectures are under- or over-estimated by various evaluators, and to analyze the reasons accordingly, we investigate the relationship between the normalized true-evaluated ranking difference, $(r_i - \hat{r}_i)/N$, and architecture properties such as the FLOPs.
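The criteria above can be computed with a short, self-contained sketch (our own illustration, using the rank-1-is-best convention stated earlier; the toy accuracies and scores are hypothetical):

```python
from itertools import combinations

def ranks(values):
    """Rank 1 = best (largest value)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def kendall_tau(a, b):
    """(N_c - N_d) / (N choose 2), by brute-force pair counting."""
    n = len(a)
    nc = nd = 0
    for i, j in combinations(range(n), 2):
        prod = (a[i] - a[j]) * (b[i] - b[j])
        nc += prod > 0
        nd += prod < 0
    return (nc - nd) / (n * (n - 1) / 2)

def spearman(a, b):
    """Pearson correlation of the rank variables."""
    return pearson(ranks(a), ranks(b))

def precision_at_k(y, s, k):
    """Share of the predicted top-K proportion that is truly top-K."""
    n = len(y)
    top = max(1, int(k * n))
    true_top = set(sorted(range(n), key=lambda i: -y[i])[:top])
    pred_top = set(sorted(range(n), key=lambda i: -s[i])[:top])
    return len(true_top & pred_top) / top

def best_ranking_at_k(y, s, k):
    """Best normalized true rank within the predicted top-K proportion."""
    n = len(y)
    top = max(1, int(k * n))
    pred_top = sorted(range(n), key=lambda i: -s[i])[:top]
    true_rank = ranks(y)
    return min(true_rank[i] for i in pred_top) / n

y = [0.94, 0.92, 0.90, 0.88]   # hypothetical true accuracies
s = [0.80, 0.85, 0.70, 0.60]   # hypothetical evaluator scores
print(kendall_tau(y, s))           # one discordant pair out of six
print(precision_at_k(y, s, 0.5))   # predicted top-2 equals true top-2
print(best_ranking_at_k(y, s, 0.25))
```

In practice one would use a library routine for the correlations (the brute-force Kendall's Tau here is $O(N^2)$), but the sketch makes the rank-1-is-best and normalization conventions explicit.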

4 One-shot Evaluators

In this section, we introduce the ranking correlation gaps of one-shot evaluators and evaluate the influence of several sampling strategies and training techniques.

4.1 Experimental Setup

One-shot evaluators mainly include weight sharing evaluators and hypernetwork evaluators. Since hypernetwork solutions are currently not generic, we concentrate on the evaluation of weight sharing evaluators (i.e., evaluating each architecture using a weight-sharing supernet) in this paper; from now on, “weight-sharing supernet” and “one-shot evaluator” are used interchangeably. During the training process of the supernet, candidate architectures are randomly sampled, and their corresponding weights are updated in every iteration.

We conduct our experiments on CIFAR-10 using a recent NAS benchmarking search space, NAS-Bench-201 [Dong2020NAS-Bench-201]. NAS-Bench-201 is a NAS benchmark that provides the performances of all 15625 architectures in a cell-based search space. However, architectures with different matrix representations can actually be isomorphic in this search space. As reported in the original paper, there are 6466 unique topology structures in the de-isomorphized search space.

The hyperparameters used to train all the weight-sharing supernets are summarized in Tab. 1. We train weight sharing evaluators via SGD with momentum 0.9 and weight decay 0.0005. The batch size is set to 512. The learning rate is set to 0.05 initially and decayed by 0.5 each time the supernet accuracy stops increasing for 30 epochs. During training, the dropout rate is set to 0.1, and the gradient norm is clipped to 5.0.

optimizer      SGD      initial LR       0.05
momentum       0.9      LR schedule      ReduceLROnPlateau
weight decay   0.0005   LR decay         0.5
batch size     512      LR patience      30
dropout rate   0.1      grad norm clip   5.0
Table 1: Training hyperparameters

4.2 Trend of Different Indicators

Figure 2: Comparison of different indicators
Figure 3: (a) Influence of the channel proxy. (b) Influence of the layer proxy. (c) Kendall's Tau and average rank difference of architectures in different FLOPs groups

We inspect how BR@K, P@K, and the correlation criteria converge during the training process. We train a weight sharing model with 17 layers and 16 initial channels on the de-isomorphized NAS-Bench-201 search space. As shown in Fig. 2(a), the speeds of convergence differ greatly: BR@K converges in a very short time; P@K converges in around 250 epochs and then even gradually decreases; meanwhile, the linear correlation, Kendall's Tau, and Spearman correlation are still growing until 500 epochs, while the weight sharing accuracy grows throughout the whole 1000 epochs. This indicates that models at different rankings change at different speeds as training progresses, and the top-ranked models stand out faster. Further evidence is shown in Fig. 2(b): P@5% converges much faster than P@50%. Another unexpected fact to note in Fig. 2(b) (also see Tab. 3) is that P@5% usually shows a decreasing trend from 200 epochs on. This is because, while the architectures with the best performances stand out very fast in one-shot training, their one-shot performances are caught up by other architectures as training goes on.

4.3 Sampling Strategy

The NAS-Bench-201 search space includes many isomorphic architectures. We expect that one-shot evaluators can handle isomorphic architectures, i.e., the evaluated accuracies of isomorphic architectures should be as close as possible. We calculate the average variances of test accuracy and ranking within isomorphism groups during the training process, as shown in Tab. 2. As training progresses, the variance within each group gradually shrinks, which indicates that more sufficient training makes the one-shot evaluator handle isomorphic architectures better.

epochs             GT   200   400   600   800   1000
Accuracy std (%)
Ranking std
Table 2: Average standard deviation of accuracies and rankings within architecture isomorphism groups. “GT” (ground truth) stands for the deviation of training the same architecture with different seeds

We compare the results of sampling with and without isomorphic architectures during training, shown in Tab. 3. If de-isomorphism sampling is not used in the training process, the criterion is worse (2.515% vs. 0.015% with de-isomorphism sampling). In this case, we find that the top-ranked cell architectures are simple architectures (e.g., a single convolution). That is to say, weight sharing training without de-isomorphism sampling might over-estimate simple architectures. We suppose this might be because the effective sampling probability is larger for architectures with many isomorphic counterparts. We also compare de-isomorphism sampling in training with post de-isomorphism, in which the performances of architectures inside each isomorphism group are averaged during testing, while no changes are made to the training process. We find that post de-isomorphism results are almost as good as those of de-isomorphism sampling.

epochs / criterion     200   400   600   800   1000
No de-isomorphism
Post de-isomorphism
Table 3: Comparison of supernet training with and without de-isomorphism sampling
Equivalent 1000 epochs
MC samples        1        3        5        Fair-NAS [chu2019fairnas]
BR@5%             0.093%   0.139%   0.495%   0.139%
P@5%              24.77%   9.60%    20.74%   11.76%
Kendall's Tau     0.7226   0.7128   0.6714   0.7137

1000 epochs
MC samples        1        3        5        Fair-NAS [chu2019fairnas]
BR@5%             0.093%   0.015%   0.124%   0.031%
P@5%              24.77%   14.24%   17.03%   15.17%
Kendall's Tau     0.7226   0.7025   0.7018   0.6965

Table 4: Comparison of using different numbers of architecture Monte-Carlo (MC) samples in every supernet training step. Upper: models with MC samples and Fair-NAS are trained for an equivalent number of epochs. Lower: models with MC samples and Fair-NAS are all trained for 1000 epochs. All results are tested with post de-isomorphism.

Tab. 4 shows the comparison of using different numbers of architecture Monte-Carlo (MC) samples in supernet training. We can see that the influence of the architecture MC sample number is not significant, and MC sample = 1 is a good choice. We also adapt the Fair-NAS [chu2019fairnas] sampling strategy to the NAS-Bench-201 search space (a special case of MC sample = 5), and find that it does not bring improvements.
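The Fair-NAS-style strict-fairness sampling can be sketched as follows (our own adaptation for illustration; op names and edge count are placeholders): each supernet step draws one random permutation of the operations per edge, yielding as many architectures as there are operations, so that every operation on every edge is trained exactly once per step.

```python
import random

OPS = ["none", "skip", "conv1x1", "conv3x3", "avgpool"]
N_EDGES = 6

def fairnas_batch():
    """One supernet step samples len(OPS) architectures such that, on
    every edge, each operation appears exactly once (strict fairness)."""
    perms = [random.sample(OPS, len(OPS)) for _ in range(N_EDGES)]
    return [tuple(perms[e][k] for e in range(N_EDGES))
            for k in range(len(OPS))]

random.seed(1)
batch = fairnas_batch()
for e in range(N_EDGES):
    # every edge sees each op exactly once across the batch
    assert sorted(a[e] for a in batch) == sorted(OPS)
print(len(batch))  # 5 architectures per step: a special case of MC sample 5
```

Compared with independently drawing 5 MC samples, this couples the samples so the per-op update counts are perfectly balanced; as Tab. 4 shows, this balancing did not improve correlation in our setting.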

4.4 Proxy Model

Due to memory and time constraints, it is common to use a shallower or thinner proxy model in the search process: one searches using small proxy models with fewer channels and layers, and then “augments” the discovered architecture into a large network. From the experimental results shown in Fig. 3(a)(b), we find that the channel proxy has little influence, while the layer proxy significantly reduces the reliability of search results. Thus, for cell-based search spaces, proxy-less search w.r.t. the layer number is worth studying.

4.5 Over- and Under-estimation of Architectures

For one-shot evaluators, we expect the training process to be fair and balanced for all architectures. However, sub-architectures have different amounts of computation, and they might converge at different speeds. To understand which architectures are under- or over-estimated by the one-shot evaluators, we inspect the Ranking Diff between the ground-truth performance and the one-shot evaluation of an architecture $i$: $(r_i - \hat{r}_i)/N$. We divide the architectures into ten groups according to the amount of computation (FLOPs), and show Kendall's Tau and the average Rank Diff of each group in Fig. 3(c).

Note that a positive Ranking Diff indicates that an architecture is over-estimated; otherwise it is under-estimated. The x-axis is organized such that the architecture group with the least amount of computation is leftmost. The architectures in the first few groups include only one or two conv1x1 layers or one conv3x3 layer, so this part is of little significance. For the larger architectures, the average Rank Diff shows a decreasing trend, which means that the larger the model, the more easily it is under-estimated. Also, the decreasing intra-group Kendall's Tau indicates that it is harder for the one-shot evaluator to compare larger models (which usually have better performances) than smaller ones.
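The per-group analysis above can be reproduced with a few lines (a sketch on hypothetical toy data, using the convention that rank 1 is best and Ranking Diff is the normalized true rank minus the evaluated rank, so positive means over-estimated):

```python
def group_rank_diff(flops, true_rank, eval_rank, n_groups=5):
    """Mean normalized Ranking Diff, (true_rank - eval_rank) / N, per
    FLOPs group; a positive mean means the group tends to be
    over-estimated by the evaluator."""
    n = len(flops)
    order = sorted(range(n), key=lambda i: flops[i])  # cheapest group first
    bounds = [g * n // n_groups for g in range(n_groups + 1)]
    groups = [order[bounds[g]:bounds[g + 1]] for g in range(n_groups)]
    return [sum((true_rank[i] - eval_rank[i]) / n for i in g) / len(g)
            for g in groups]

# hypothetical toy data: 10 architectures where the truly-best models are
# the largest, but the evaluator ranks small models best
flops = list(range(10))
true_rank = [10 - i for i in range(10)]  # larger model -> better true rank
eval_rank = [i + 1 for i in range(10)]   # evaluator prefers small models
diffs = group_rank_diff(flops, true_rank, eval_rank)
print(diffs)  # decreasing: small models over-estimated, large under-estimated
```

On real data, `true_rank` would come from the benchmark's ground-truth accuracies and `eval_rank` from the one-shot scores; a decreasing curve like this toy one is exactly the pattern reported in Fig. 3(c).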

5 Predictor-based Evaluators

In this section, we employ the same criteria (i.e., Kendall’s Tau, Precision@K, BestRanking@K) to evaluate the architecture predictors.

5.1 Experimental Setup

We experiment with 4 different architecture predictors: MLP, LSTM, GATES [ning2020generic], and a random forest regressor (RF). For MLP, LSTM, and RF, we serialize each architecture matrix using the 6 elements of its lower triangular portion. We follow [ning2020generic] to construct the MLP and LSTM: the MLP encoder contains 4 fully-connected layers with 512, 2048, 2048, and 512 nodes, and the output of the last layer is used as the architecture's embedding; the LSTM encoder contains 1 layer with the embedding and hidden dimensions both set to 100, and the final hidden state is used as the embedding of a cell architecture. The RF predictor applies a random forest regressor to the 6-dim sequence. The construction of the GATES encoder is exactly the same as in [ning2020generic].

For optimizing MLP, LSTM, and GATES, an Adam optimizer with a learning rate of 1e-3 is used, the batch size is set to 512, and the training lasts for 200 epochs. Following [ning2020generic], a hinge pairwise ranking loss with margin 0.1 is used for training these predictors. For RF, we use a random forest with 100 CARTs to predict architecture performances.
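The hinge pairwise ranking loss can be illustrated with a minimal sketch (our own illustration of the loss form described above; the function name and toy values are ours, and gradient computation through the predictor is omitted):

```python
def hinge_pairwise_ranking_loss(scores, perfs, margin=0.1):
    """For every pair where architecture i truly outperforms j, penalize
    the predictor unless its score gap scores[i] - scores[j] exceeds
    the margin; the loss only cares about relative order, not values."""
    total, n_pairs = 0.0, 0
    for i in range(len(perfs)):
        for j in range(len(perfs)):
            if perfs[i] > perfs[j]:
                total += max(0.0, margin - (scores[i] - scores[j]))
                n_pairs += 1
    return total / max(n_pairs, 1)

# correctly ordered predictions with gaps beyond the margin: zero loss
print(hinge_pairwise_ranking_loss([0.9, 0.5, 0.1], [0.93, 0.91, 0.88]))
# reversed predictions are penalized on every ordered pair
print(hinge_pairwise_ranking_loss([0.1, 0.5, 0.9], [0.93, 0.91, 0.88]))
```

Since only score differences enter the loss, the predictor is free to output scores on any scale, which is one reason ranking losses tend to be more stable than regression losses in this setting (see Sec. 5.2).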

5.2 Evaluation Results

We train these predictors on training sets of different sizes: 39 (0.25%), 78 (0.5%), 390 (2.5%), and 781 (5%). Specifically, for each training set size, we randomly sample 3 different training sets, and train each predictor on each training set with 3 different random seeds (20, 2020, 202020). After training, we evaluate each model on the whole NAS-Bench-201 search space using Kendall's Tau, Precision@K, and BestRanking@K. (Different from [ning2020generic], the evaluation is carried out on all the architectures instead of a separate validation split, and the training and evaluation are carried out on 3 randomly sampled datasets and 3 training seeds.) As shown in Fig. 4, GATES outperforms the other predictors in all settings.

As can be seen, training with different seeds on different training sets leads to similar results. In contrast, we found that training predictors with a regression loss is not stable and is sensitive to the choice of the training set. For example, the Kendall's Taus of 3 GATES models trained with the regression loss on different training sets of size 78 are 0.7127, 0.7213, and 0.2067, respectively, while with the ranking loss the results are 0.7852, 0.7694, and 0.7456.

Figure 4: Comparison of predictor performances across different training set sizes: (a) Kendall's Tau, (b) P@5%, (c) BR@1%

5.3 Over- and Under-estimation of Architectures

Fig. 5(d)(e)(f) illustrates the relationship between the FLOPs of architectures and how likely they are to be over-estimated. MLP and RF are more likely to over-estimate the smaller architectures and under-estimate the larger ones, while LSTM and GATES show no obvious preference regarding the architectures' FLOPs. Fig. 5(a)(b)(c) shows that GATES gives more accurate rankings for smaller architectures than for larger ones, which indicates that GATES might still have trouble comparing larger architectures, which usually have good performances.

Figure 5: (a)(b)(c) Kendall's Tau in different FLOPs groups, with training set sizes 39, 78, and 390, respectively. (d)(e)(f) Average rank difference in different FLOPs groups, with training set sizes 39, 78, and 390, respectively.

5.4 One-shot Oracle Evaluator

Figure 6: The overview of predictor-based (i.e., surrogate model-based) neural architecture search (NAS). The underlined descriptions in parentheses denote different methods

In this section, we try to answer the question: “Can one use a one-shot evaluator as the oracle evaluator to guide the training of the predictor?” nao2018 made an attempt to use the noisy signals provided by a weight sharing evaluator to train the predictor together with the encoder-decoder pair. Although this significantly accelerates the NAS process, it was found to cause the NAS algorithm to fail to discover good architectures. Moreover, if a fast one-shot evaluator is available and we can sample architectures using the one-shot scores directly, do we need to train another predictor using the noisy one-shot scores? In other words, we ask whether a properly constructed and trained predictor can recover from the noisy training signals provided by a one-shot evaluator. Since GATES achieves consistently better results than the other predictors, it is used in the experiments in this section.

Specifically, we want to answer two questions:

  1. Can sampling only a subset of architectures during supernet training help achieve a better Kendall's Tau on those architectures?

  2. Can predictor training help recover from the noisy training signals provided by the one-shot evaluator?

We randomly sample 78 architectures from the search space. Two differently trained weight-sharing evaluators are used to provide the one-shot instruction signals for these 78 architectures: 1) uniformly sampling from the whole search space during supernet training; 2) uniformly sampling from only the 78 architectures. We find that strategy 1 (sampling from the whole search space) achieves a higher evaluation Kendall's Tau, no matter whether the evaluation is on the 78 architectures (0.657 vs. 0.628) or on the whole search space (0.701 vs. 0.670). Thus the answer to Question 1 is “No”.

Then, to answer the second question, we utilize the one-shot instruction signals provided by the supernet trained with all 15625 architectures to train the predictor (the average of the scores provided by 3 supernets trained with different seeds is used). The Kendall's Tau between the architecture scores given by the resulting predictor and the ground-truth performances is 0.718 on all 15625 architectures, which is slightly worse than that of the one-shot instruction signals themselves (0.719). More importantly, BR@1% degrades from 2.5% to 12.1%. Thus, further training a predictor using one-shot signals is not beneficial, and our tentative answer to Question 2 is “No”. Incorporating more prior knowledge of the search space and regularization might increase the denoising effect of predictor training, which is worth future research.

6 Conclusion

In this paper, we conduct an assessment of both weight-sharing evaluators and architecture predictors on the NAS-Bench-201 search space, with a set of carefully designed criteria. We hope the knowledge revealed by this paper can guide future applications of one-shot NAS and predictor-based NAS, and motivate further research.