1 Introduction
Neural Architecture Search (NAS) has received extensive attention due to its capability to discover neural network architectures in an automated manner. Studies have shown that the architectures automatically discovered by NAS can outperform handcrafted architectures in various applications, such as classification
[nayman2019xnas, zoph2016neural], detection [ghiasi2019fpn, chen2019detnas], video understanding [ryoo2019assemblenet], text modeling [zoph2016neural], etc. The vanilla NAS algorithm [zoph2016neural] suffers from an extremely heavy computational burden, since the evaluation of neural architectures is slow. Thus, how to evaluate a neural architecture in a fast and accurate way is vital for addressing the computational challenge of NAS.

A neural architecture evaluator takes an architecture as input and outputs a score that indicates the quality of the architecture. In both hyperparameter optimization and neural architecture search algorithms, the straightforward way to evaluate a configuration or an architecture is to train a model from scratch to convergence and then test it on the validation dataset, which is extremely time-consuming. Instead of exactly evaluating architectures on the target task, researchers usually construct a proxy model with fewer layers or fewer channels [enas, real2019regularized, wu2019fbnet], and train this model to solve a proxy task of smaller scale [cai2018efficient, elsken2018efficient, klein2017fast, wu2019fbnet], e.g., a smaller dataset or subsets of the dataset, or training or fine-tuning for fewer epochs.
Traditional evaluators conduct separate training phases to approximately discover the weights that are suitable for each architecture. In contrast, one-shot evaluation amortizes the training cost of different architectures through parameter sharing or a global hypernetwork, thus significantly reducing the architecture evaluation cost. enas construct an over-parameterized super network (supernet) such that all architectures in the search space are sub-architectures of the supernet. Throughout the search process, the shared parameters in the supernet are updated on the training dataset split, and each architecture is evaluated by directly using the corresponding subset of the weights in the supernet. Afterwards, the parameter sharing technique has been widely used for architecture search in different search spaces [wu2019fbnet], or incorporated with different searching strategies [darts, nayman2019xnas, xie2018snas, yang2019cars]. Hypernetwork-based evaluation is another type of one-shot evaluation strategy: brock2018smash, zhang2018graph utilized hypernetworks to generate proper weights for evaluating each architecture.
Whether or not the one-shot strategies can provide highly correlated evaluation results for different architectures is essential for the efficacy of the NAS process. bender2018understanding conducted a first study on the correlation between stand-alone architecture performances and the metrics evaluated with parameter sharing. A more recent study [sciuto2019evaluating] finds that parameter sharing evaluation cannot reflect the true performance ranking of architectures well enough. However, they only conduct experiments in a toy search space with only 32 architectures in total.

Besides one-shot evaluation strategies, predictor-based evaluation strategies [nao2018, liu2018progressive, deng2017peephole, sun2019e2pp, alphax, xu2019renas, ning2020generic] use a performance predictor that takes the architecture description as input and outputs a predicted performance score. The performance predictor should be trained using “ground-truth” architecture performances. Thus, in a predictor-based NAS framework, there should be another “oracle” evaluator that provides the instruction signals for training the predictor. nao2018 make an attempt to use the noisy signals provided by a parameter-sharing-based one-shot evaluator to train the predictor. They find that, compared with using the more expensive traditional evaluation, using one-shot evaluation to provide the instruction signal can only discover much worse architectures. In this paper, we try to answer the question of whether it is reasonable to use a one-shot evaluator as the oracle evaluator to train a predictor.
Current fast evaluation strategies of neural architectures are summarized in Fig. 1, including shared-weights, hypernetwork, and predictor-based ones. This paper aims at systematically revealing the status and shortcomings of current architecture evaluation strategies. Specifically, we conduct a controlled and comprehensive evaluation of different neural architecture evaluation strategies with various criteria, and also analyze the evaluation results to identify the architectures being under- or over-estimated.
2 Related Work
2.1 Training Acceleration with Proxy Tasks and Models
Since the exact evaluation of a hyperparameter setting or a neural architecture is time-consuming, it is a common practice to evaluate smaller proxy models [enas, real2019regularized, wu2019fbnet] on smaller proxy tasks [cai2018efficient, elsken2018efficient, klein2017fast, wu2019fbnet], e.g., a smaller dataset or subsets of the dataset, or training or fine-tuning for fewer epochs. When the proxy setting is too aggressive, the resulting proxy metrics can be weakly correlated with the true architecture performances. One method of using low-fidelity proxy metrics is to only partially train a hyperparameter setting or a neural architecture by early-stopping the training process. On that basis, domhan2015speeding, baker2017accelerating proposed to conduct training curve extrapolation to predict the final performance. Given the overall search budget, there are also studies [li2017hyperband, falkner2018bohb] that explore strategies for trading off between the number of explored configurations and the training resources for a single configuration.

To shorten the fine-tuning time with better weight initialization, network-morphism-based methods [cai2018efficient, elsken2018efficient] conduct the search based on architecture mutation decisions, and initialize the weights of each candidate architecture by inheriting and transforming weights from its parent model.
In this paper, we investigate the correlation gaps brought by several types of proxy tasks or models, including channel number reduction, layer number reduction, and training epoch reduction.
2.2 One-shot Evaluators
One-shot evaluation mainly consists of two types of strategies: 1) weight sharing [enas, wu2019fbnet, darts, nayman2019xnas, xie2018snas, yang2019cars]; 2) hypernetworks [brock2018smash, zhang2018graph]. Both strategies amortize the training cost of different architectures via the sharing of network or hypernetwork parameters.
The ranking correlation gaps of existing shared-weights evaluators are brought about by two factors: 1) the proxy model and task: due to the memory constraint, a proxy super network (supernet) [darts, enas] with fewer channels or layers is usually used; 2) weight sharing itself. To alleviate the first factor, some studies [cai2018proxylessnas, chen2019progressive] aim at making one-shot evaluation more memory-efficient, so that the one-shot search can be carried out without using a proxy supernet. For the second factor, a few studies have carried out correlation evaluation for one-shot evaluators. zhang2018graph (GHN) conducted a correlation comparison between the GHN hypernetwork evaluator, the shared-weights evaluator, and several small proxy tasks. However, the correlation is evaluated using 100 architectures randomly sampled from a large search space, which is not a convincing and consistent benchmark metric. luo2019understanding did a preliminary investigation into why weight-sharing evaluation fails to provide correlated evaluations, and proposed to increase the sample probabilities of the large models. Their evaluation is also conducted on only dozens of architectures sampled from the search space.
zela2020bench propose a benchmark framework to assess weight-sharing NAS on NAS-Bench-101, and compare the correlation of different search strategies. sciuto2019evaluating conduct weight-sharing NAS in a toy RNN search space with only 32 architectures in total, and discover that the weight-sharing rankings do not correlate with the true rankings of architectures. To improve the correlation of one-shot evaluation, chu2019fairnas proposed a sampling strategy in a layer-wise search space.

In summary, a correlation evaluation of various one-shot evaluation strategies on all architectures in more meaningful search spaces is still missing in the literature. In this paper, we analyze the ranking correlation gaps brought by the model proxy (differences in the number of channels and layers) and by one-shot evaluation, respectively. Moreover, after investigating the factors that influence the evaluation correlation, we give some suggestions on improving the evaluation quality.
2.3 Predictor-based Evaluators
An architecture performance predictor takes architecture descriptions as inputs and outputs predicted performance scores without training the architectures. In the overall NAS framework, the predictor-based evaluator plays a different role from the traditional or one-shot evaluators. Fig. 6 shows the general workflow of predictor-based NAS: the predictor is trained using “ground-truth” architecture performances, which are provided by another “oracle” evaluator. Usually, expensive traditional evaluators that provide relatively accurate architecture performances are chosen as the oracle evaluators [kandasamy2018bayesian, liu2018progressive, nao2018]. Utilizing a good predictor, we can choose the architectures that are most worth evaluating with the expensive oracle evaluator. Thus, the fitness of the performance predictor is vital to the efficacy of the NAS algorithm, as fewer architectures need to be trained when a good predictor is incorporated.
Two factors are crucial to the fitness of the predictors: 1) the embedding space; 2) the training technique. On the one hand, to embed neural architectures into a continuous space and obtain a meaningful embedding space, several studies propose different architecture encoders, e.g., sequence-based [nao2018, liu2018progressive, deng2017peephole, sun2019e2pp, alphax] and graph-based [shi2019multi, ning2020generic] ones. As for non-parametric predictors, kandasamy2018bayesian design a kernel function in the architecture space and exploit a Gaussian process to get the posterior of the architecture performances. shi2019multi combined a graph-based encoder and a non-parametric Gaussian process to construct the performance predictor. On the other hand, from the aspect of training techniques, nao2018 employed an encoder-decoder structure and used a reconstruction loss as an auxiliary loss term. xu2019renas, ning2020generic employed learning-to-rank techniques to train the predictors.
3 Evaluation Criteria
In this section, we introduce the evaluation criteria used in this paper. We denote the search space size as N, and the true performances and approximated evaluated scores of the architectures as {y_i} and {s_i}, respectively. We denote the ranking of the true performance and of the evaluated score of architecture i as r_i and r̂_i, respectively (r_i = 1 indicates that architecture i is the best in the search space). Firstly, the correlation criteria adopted in our paper are

Linear correlation: the Pearson correlation coefficient between {y_i} and {s_i}.

Kendall’s Tau ranking correlation: the relative difference of concordant and discordant pairs, τ = (N_c − N_d) / (N(N−1)/2), where N_c and N_d are the numbers of concordant and discordant pairs over all pairs of architectures.

Spearman’s ranking correlation: the Pearson correlation coefficient between the rank variables {r_i} and {r̂_i}.
Besides these correlation criteria, we also adopt several criteria that put more emphasis on the relative order of the architectures with good performances. Denoting by A_K the set of architectures whose evaluated scores are among the top K proportion of the search space, these two criteria are

Precision@K (P@K): the proportion of true top-K-proportion architectures among the top-K architectures according to the evaluated scores, P@K = |{i ∈ A_K : r_i ≤ KN}| / (KN).

BestRanking@K (BR@K): the best normalized true ranking among the top K proportion of architectures according to the evaluated scores, BR@K = min_{i ∈ A_K} r_i / N.
These two criteria are similar to those used in ning2020generic, except that rankings and architecture numbers are all normalized with respect to the search space size N.
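For concreteness, the criteria above can be computed as follows. This is a sketch under the assumptions that higher performances and scores are better and that there are no ties; the function name is ours.

```python
import numpy as np
from scipy import stats

def evaluation_criteria(y_true, y_score, k=0.05):
    """Compute the criteria of Sec. 3 for one evaluator.

    y_true:  ground-truth performances of all N architectures
    y_score: scores assigned by the (approximate) evaluator
    k:       top proportion used by P@K and BR@K
    """
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n = len(y_true)
    top_k = max(1, int(k * n))

    # Rank 1 = best architecture, matching the convention in this section.
    true_rank = stats.rankdata(-y_true)
    # Indices of the top-k architectures according to the evaluator (the set A_K).
    top_by_score = np.argsort(-y_score)[:top_k]

    return {
        "pearson": stats.pearsonr(y_true, y_score)[0],
        "kendall_tau": stats.kendalltau(y_true, y_score)[0],
        "spearman": stats.spearmanr(y_true, y_score)[0],
        # P@K: fraction of true top-k architectures among the predicted top-k.
        "P@K": float(np.mean(true_rank[top_by_score] <= top_k)),
        # BR@K: best true ranking in the predicted top-k, normalized by N.
        "BR@K": float(true_rank[top_by_score].min() / n),
    }
```

A perfect evaluator (y_score identical in order to y_true) attains Kendall's Tau of 1, P@K of 1, and the minimal BR@K of 1/N.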
The above criteria are used to compare the fitness of various architecture evaluators with different configurations. Not only do we want to choose appropriate configurations of architecture evaluators, we would also like to interpret their evaluation results. To identify which architectures are under- or over-estimated by various evaluators, and to analyze the reasons accordingly, we investigate the relationship between the true-predicted ranking difference RD(i) = r_i − r̂_i and architecture properties such as the FLOPs.
4 One-shot Evaluators
In this section, we introduce the ranking correlation gaps of one-shot evaluators and evaluate the influence of several sampling strategies and training techniques.
4.1 Experimental Setup
One-shot evaluators mainly include weight-sharing evaluators and hypernetwork evaluators. Since hypernetwork solutions are currently not generic, we concentrate on the evaluation of weight-sharing evaluators (i.e., evaluating each architecture using a weight-sharing supernet) in this paper. (From now on, “weight-sharing supernet” and “one-shot evaluator” are used interchangeably.) During the training process of the supernet, candidate architectures are randomly sampled, and their corresponding weights are updated in every iteration.
We conduct our experiments on CIFAR-10 using a recent NAS benchmarking search space, NAS-Bench-201 [Dong2020NASBench201]. NAS-Bench-201 is a NAS benchmark that provides the performances of all 15625 architectures in a cell-based search space. However, some architectures with different matrix representations are actually isomorphic in this search space. As reported by the original paper, there are 6466 unique topology structures in the de-isomorphized search space.

The hyperparameters used to train all the weight-sharing supernets are summarized in Tab. 1. We train weight-sharing evaluators via SGD with momentum 0.9 and weight decay 0.0005. The batch size is set to 512. The learning rate is set to 0.05 initially and decayed by 0.5 each time the supernet accuracy stops increasing for 30 epochs. During training, the dropout rate is set to 0.1, and the gradient norm is clipped to be less than 5.0.
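The single-path training scheme described above can be sketched as follows. This is a toy illustration, not the actual implementation: the supernet is reduced to a dictionary of shared per-(edge, op) parameter vectors, the gradient is a placeholder for real backpropagation through the sampled sub-network on a training batch, and only the constants (6 edges, 5 candidate operations per edge) follow the NAS-Bench-201 cell.

```python
import random
import numpy as np

# A NAS-Bench-201 cell has 6 edges, each choosing one of 5 candidate operations;
# all architectures share the per-(edge, op) parameters below.
EDGES, OPS = 6, 5

shared_weights = {(e, o): np.zeros(4) for e in range(EDGES) for o in range(OPS)}

def sample_architecture(rng):
    """Uniformly sample a candidate architecture (one op per edge)."""
    return tuple(rng.randrange(OPS) for _ in range(EDGES))

def train_step(arch, lr=0.05):
    """Update only the shared weights used by the sampled sub-architecture.

    The 'gradient' here is a stub; in the real supernet it comes from
    backprop through the sampled sub-network on a training mini-batch.
    """
    for e, o in enumerate(arch):
        fake_grad = np.ones(4)
        shared_weights[(e, o)] -= lr * fake_grad

rng = random.Random(0)
for _ in range(100):  # in every iteration, one architecture is sampled and updated
    train_step(sample_architecture(rng))
```

After many iterations, every architecture in the search space can be scored by reading out its subset of the shared weights, without any separate training phase.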
4.2 Trend of Different Indicators
We inspect how BR@K, P@K, and the correlation criteria converge during the training process. We train a weight-sharing model with 17 layers and 16 initial channels on the de-isomorphized NAS-Bench-201 search space. As shown in Fig. 2(a), the speeds of convergence are highly different: BR@K converges in a very short time, while P@K converges in around 250 epochs and then even gradually decreases. Meanwhile, the linear correlation, Kendall’s Tau, and Spearman correlation are still growing until 500 epochs, and the weight-sharing accuracy grows during the whole 1000 epochs. This indicates that the models with different rankings change at different speeds as the training progresses, and the top-ranked models stand out faster. Further evidence is shown in Fig. 2(b): P@5% converges much faster than P@50%. Another unexpected fact to note in Fig. 2(b) (also see Tab. 3) is that P@5% usually shows a decreasing trend from 200 epochs on. This is because, while the architectures with the best performances stand out very fast in one-shot training, their one-shot performances are caught up with by other architectures as the training goes on.
4.3 Sampling Strategy
The NAS-Bench-201 search space includes many isomorphic architectures. We expect one-shot evaluators to handle isomorphic architectures well, i.e., the evaluated accuracies of isomorphic architectures should be as close as possible. We calculate the average standard deviations of the test accuracy and the ranking within isomorphism groups during the training process, as shown in Tab. 2. As the training progresses, the deviation within each group gradually shrinks, which indicates that more sufficient training makes the one-shot evaluator handle isomorphic architectures better.

Tab. 2: Average standard deviation of accuracies and rankings in architecture isomorphism groups at 200, 400, 600, 800, and 1000 epochs. “GT” (ground truth) stands for the deviation of training the same architecture with different seeds.
We compare the results of sampling with and without isomorphic architectures during training, as shown in Tab. 3. If de-isomorphism sampling is not used in the training process, the BR@5% criterion is worse (2.515% vs. 0.015% with de-isomorphism sampling). In this case, we find that the top-ranked cell architectures are simple architectures (e.g., a single convolution). That is to say, weight-sharing training without de-isomorphism sampling might overestimate simple architectures. We suppose that this might be because the equivalent sampling probability is larger for architectures with many isomorphic counterparts. We also compare de-isomorphism sampling during training with post de-isomorphism, in which the performances of the architectures inside each isomorphism group are averaged during testing, while the training process is left unchanged. We find that the post de-isomorphism results are almost as good as those of de-isomorphism sampling.
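Post de-isomorphism only touches the evaluation step: given a mapping from each architecture to its isomorphism class (which the NAS-Bench-201 benchmark can provide), the one-shot scores are averaged within each class. A minimal sketch, with hypothetical function and argument names:

```python
from collections import defaultdict

def post_deisomorphism(scores, group_of):
    """Average one-shot scores within each isomorphism group.

    scores:   dict mapping arch_id -> one-shot score
    group_of: dict mapping arch_id -> isomorphism-class id
    Returns a dict mapping arch_id -> group-averaged score, so that all
    isomorphic architectures receive the same evaluated score.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for arch, score in scores.items():
        group = group_of[arch]
        totals[group] += score
        counts[group] += 1
    return {arch: totals[group_of[arch]] / counts[group_of[arch]]
            for arch in scores}
```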
Tab. 3: Comparison at 200, 400, 600, 800, and 1000 epochs of three settings: no de-isomorphism, de-isomorphism sampling, and post de-isomorphism.
MC samples      1        3        5        FairNAS [chu2019fairnas]
BR@5%           0.093%   0.139%   0.495%   0.139%
P@5%            24.77%   9.60%    20.74%   11.76%
Kendall's Tau   0.7226   0.7128   0.6714   0.7137

1000 epochs     1        3        5        FairNAS [chu2019fairnas]
BR@5%           0.093%   0.015%   0.124%   0.031%
P@5%            24.77%   14.24%   17.03%   15.17%
Kendall's Tau   0.7226   0.7025   0.7018   0.6965
Tab. 4 shows a comparison of using different numbers of architecture Monte-Carlo samples in supernet training. We can see that the influence of the number of architecture MC samples is not significant, and MC sample = 1 is a good choice. We also adapt the FairNAS [chu2019fairnas] sampling strategy to the NAS-Bench-201 search space (a special case of MC sample 5), and find that it does not bring improvements.
4.4 Proxy Model
Due to memory and time constraints, it is common to use a shallower or thinner proxy model in the search process: search using a small proxy model with fewer channels and layers, and then “augment” the discovered architecture into a large neural network. From the experimental results shown in Fig. 3(a)(b), we find that the channel proxy has little influence, while the layer proxy significantly reduces the reliability of the search results. Thus, for cell-based search spaces, proxyless search w.r.t. the layer number is worth studying.
4.5 Over- and Under-estimation of Architectures
For one-shot evaluators, we expect the training process to be fair and balanced for all architectures. However, sub-architectures have different amounts of computation and might converge at different speeds. To understand which architectures are under- or over-estimated by the one-shot evaluators, we inspect the Ranking Diff between the ground-truth performance and the one-shot evaluation of an architecture i: RD(i) = r_i − r̂_i, where r_i and r̂_i are the true and the one-shot rankings, respectively. We divide the architectures into ten groups according to the amount of computation (FLOPs), and show the Kendall’s Tau and the average Rank Diff of each group in Fig. 3(c).
Note that a positive Ranking Diff indicates that the architecture is overestimated; otherwise, it is underestimated. The x-axis is ordered so that the architecture group with the least amount of computation is leftmost. The architectures in the smallest groups include only one or two 1x1 convolutions or one 3x3 convolution, so this part is of little significance. For the larger architectures, the average Rank Diff shows a decreasing trend, which means that the larger the model, the more easily it is underestimated. Also, the decreasing intra-group Kendall’s Tau indicates that it is harder for the one-shot evaluator to compare larger models (which usually have better performances) than to compare smaller models.
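The per-group analysis above can be reproduced with a short script. This is a sketch (the function name and the decile split via np.array_split are ours); rank 1 denotes the best architecture, and RD = r_true − r_score so that a positive value means overestimation, as in the text.

```python
import numpy as np
from scipy.stats import kendalltau, rankdata

def per_flops_group_stats(flops, y_true, y_score, n_groups=10):
    """Split architectures into FLOPs groups of equal size and report,
    per group, the average Ranking Diff RD = r_true - r_score (positive
    means overestimated) and the intra-group Kendall's Tau."""
    flops, y_true, y_score = map(np.asarray, (flops, y_true, y_score))
    # Rank 1 = best architecture, as in Sec. 3.
    r_true = rankdata(-y_true)
    r_score = rankdata(-y_score)
    order = np.argsort(flops)  # leftmost group = smallest FLOPs
    results = []
    for group in np.array_split(order, n_groups):
        avg_rd = float(np.mean(r_true[group] - r_score[group]))
        tau = kendalltau(y_true[group], y_score[group])[0]
        results.append((avg_rd, tau))
    return results
```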
5 Predictor-based Evaluators
In this section, we employ the same criteria (i.e., Kendall’s Tau, Precision@K, BestRanking@K) to evaluate the architecture predictors.
5.1 Experimental Setup
We experiment with 4 different architecture predictors: an MLP, an LSTM, GATES [ning2020generic], and a random forest regressor (RF). For the MLP, LSTM, and RF, we serialize each architecture matrix using the 6 elements of its lower triangular portion. We follow [ning2020generic] to construct the MLP and LSTM: the MLP encoder contains 4 fully-connected layers with 512, 2048, 2048, and 512 nodes, and the output of the last layer is used as the architecture’s embedding; the LSTM encoder contains 1 layer with the embedding and hidden dimensions both set to 100, and the final hidden state is used as the embedding of a cell architecture. The RF predictor applies a random forest regressor to the 6-dim sequence. The construction of the GATES encoder is exactly the same as in [ning2020generic].

For optimizing the MLP, LSTM, and GATES, an Adam optimizer with a learning rate of 1e-3 is used, the batch size is set to 512, and the training lasts for 200 epochs. Following [ning2020generic], a hinge pairwise ranking loss with margin 0.1 is used for training these predictors. For the RF, we use a random forest with 100 CARTs to predict architecture performances.
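For reference, a numpy sketch of the hinge pairwise ranking loss with margin 0.1 is given below (the function name is ours; in the actual training the loss is applied to predictor outputs within each mini-batch and differentiated through the predictor):

```python
import numpy as np

def hinge_pairwise_ranking_loss(scores, targets, margin=0.1):
    """Hinge pairwise ranking loss: for every pair (i, j) with
    targets[i] > targets[j], penalize the predictor unless scores[i]
    exceeds scores[j] by at least `margin`."""
    scores, targets = np.asarray(scores, dtype=float), np.asarray(targets)
    total, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if targets[i] > targets[j]:
                total += max(0.0, margin - (scores[i] - scores[j]))
                n_pairs += 1
    return total / max(n_pairs, 1)
```

Only the relative order of the targets matters, which is why the ranking loss is less sensitive to the noise scale of the instruction signals than a regression loss.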
5.2 Evaluation Results
We train these predictors on training sets of different sizes: 39 (0.25%), 78 (0.5%), 390 (2.5%), and 781 (5%). Specifically, for each training set size, we randomly sample 3 different training sets, and train each predictor on each training set with 3 different random seeds (20, 2020, 202020). After training, we evaluate each model on the whole NAS-Bench-201 search space using Kendall’s Tau, Precision@K, and BestRanking@K. (Different from [ning2020generic], the evaluation is carried out on all the architectures instead of a separate validation split, and the training and evaluation are carried out on 3 randomly sampled training sets and 3 training seeds.) As shown in Fig. 4, GATES outperforms the other predictors in all settings.
As can be seen, training with different seeds on different training sets leads to similar results. In contrast, we found that training predictors with a regression loss is unstable and sensitive to the choice of the training set. For example, the Kendall’s Taus of 3 GATES models trained with regression loss on different training sets of size 78 are 0.7127, 0.7213, and 0.2067, respectively, while with the ranking loss, the results are 0.7852, 0.7694, and 0.7456, respectively.
5.3 Over- and Under-estimation of Architectures
Fig. 5(d)(e)(f) illustrates the relationship between the FLOPs of an architecture and how likely it is to be overestimated. The MLP and RF are more likely to overestimate the smaller architectures and underestimate the larger ones, while the LSTM and GATES show no obvious preference with respect to the architectures’ FLOPs. Fig. 5(a)(b)(c) shows that GATES gives more accurate rankings on smaller architectures than on larger ones, which indicates that GATES might still have trouble comparing the larger architectures that usually have good performances.
5.4 One-shot Oracle Evaluator
In this section, we try to answer the question: “Could one use a one-shot evaluator as the oracle evaluator to guide the training of the predictor?” nao2018 made an attempt to use the noisy signals provided by a weight-sharing evaluator to train the predictor together with the encoder-decoder pair. Although this significantly accelerates the NAS process, it is found to cause the NAS algorithm to fail to discover good architectures. A related question is: if a fast one-shot evaluator is available and we can conduct sampling using the one-shot scores, do we need to train another predictor using the noisy one-shot scores? In this paper, we investigate whether a properly constructed and trained predictor can recover from the noisy training signals provided by the one-shot evaluator. Since GATES achieves consistently better results than the other predictors, it is used in the experiments in this section.
Specifically, we want to answer two questions:

Can sampling only a subset of architectures during supernet training help achieve a better Kendall’s Tau on these architectures?

Can predictor training help recover from the noisy training signals provided by the one-shot evaluator?
We randomly sample 78 architectures from the search space. Two differently trained weight-sharing evaluators are used to provide the one-shot instruction signals for these 78 architectures: 1) uniformly sampling from the whole search space; 2) uniformly sampling from the 78 architectures. We find that strategy 1 (sampling from the whole search space) achieves a higher evaluation Kendall’s Tau, no matter whether the evaluation is on the 78 architectures (0.657 vs. 0.628) or on the whole search space (0.701 vs. 0.670). Thus, the answer to Question 1 is “No”.
Then, to answer the second question, we utilize the one-shot instruction signals provided by the supernet trained with all 15625 architectures to train the predictor. (The average of the scores provided by 3 supernets trained with different seeds is used.) The Kendall’s Tau between the architecture scores given by the resulting predictor and the ground-truth performances is 0.718 on all 15625 architectures, which is slightly worse than that of the one-shot instruction signals themselves (0.719). More importantly, BR@1% degrades from 2.5% to 12.1%. Thus, further training a predictor using one-shot signals is not beneficial, and our tentative answer to Question 2 is “No”. Incorporating more prior knowledge of the search space and regularizations might increase the denoising effect of predictor training, which may be worth future research.
6 Conclusion
In this paper, we conduct an assessment of both weight-sharing evaluators and architecture predictors on the NAS-Bench-201 search space, with a set of carefully designed criteria. We hope that the knowledge revealed by this paper can guide future applications of one-shot NAS and predictor-based NAS, and motivate further research.