
Revisiting the Train Loss: an Efficient Performance Estimator for Neural Architecture Search

by   Binxin Ru, et al.
University of Oxford
Imperial College London

Reliable yet efficient evaluation of generalisation performance of a proposed architecture is crucial to the success of neural architecture search (NAS). Traditional approaches face a variety of limitations: training each architecture to completion is prohibitively expensive, early stopping estimates may correlate poorly with fully trained performance, and model-based estimators require large training sets. Instead, motivated by recent results linking training speed and generalisation with stochastic gradient descent, we propose to estimate the final test performance based on the sum of training losses. Our estimator is inspired by the marginal likelihood, which is used for Bayesian model selection. Our model-free estimator is simple, efficient, and cheap to implement, and does not require hyperparameter-tuning or surrogate training before deployment. We demonstrate empirically that our estimator consistently outperforms other baselines and can achieve a rank correlation of 0.95 with final test accuracy on the NAS-Bench201 dataset within 50 epochs.



1 Introduction

Reliably estimating the generalisation performance of a proposed architecture is crucial to the success of Neural Architecture Search (NAS) but has always been a major bottleneck in NAS algorithms elsken2018neural . The traditional approach of training each architecture for a large number of epochs and evaluating it on validation data (full evaluation) provides a reliable performance measure, but requires prohibitively high computational resources on the order of thousands of GPU days ZophLe17_NAS ; Real2017_EvoNAS ; zoph2018learning ; real2019regularized ; elsken2018neural . This motivates the development of methods for speeding up performance estimation to make NAS practical for limited computing budgets. A popular simple approach is early-stopping which offers a low-fidelity approximation of generalisation performance by training for fewer epochs li2016hyperband ; falkner2018bohb ; Li2019_random . However, if we stop the training early, the relative performance ranking may not correlate well with the final performance ranking zela2018towards . Another line of work focuses on learning curve extrapolation domhan2015speeding ; klein2016learning ; baker2017accelerating , which trains a surrogate model to predict the final generalisation performance based on the initial learning curve and/or meta-features of the architecture. However, the training of the surrogate often requires hundreds of fully evaluated architectures to achieve satisfactory extrapolation performance and the hyper-parameters of the surrogate also need to be optimised. Alternatively, the idea of weight sharing is adopted in one-shot NAS methods to speed up evaluation Pham2018_ENAS ; Liu2019_DARTS ; Xie19_SNAS . Despite leading to significant cost-saving, weight sharing heavily underestimates the true performance of good architectures and is unreliable in predicting the relative ranking among architectures Yang2020NASEFH ; Yu2020Evaluating .

In view of the above limitations, we propose a simple model-free method which provides a reliable yet computationally cheap estimation of the generalisation performance ranking of architectures: the Sum over Training Losses (SoTL). Our method harnesses the training losses of the commonly-used SGD optimiser during training, and is motivated by recent empirical and theoretical results linking training speed and generalisation. We ground our method in the Bayesian update setting, where we show that the SoTL estimator computes a lower bound to the model evidence, a quantity with sound theoretical justification for model selection mackay1992bayesian . We show empirically that our estimator can outperform a number of strong baselines in predicting the relative performance ranking among architectures, while speeding up different NAS approaches significantly.

2 Method

We propose a simple metric that estimates the generalisation performance of a deep neural network model via the Sum of its Training Losses (SoTL). After training a deep neural network whose prediction is $f_\theta(\cdot)$ for $T$ epochs (note that $T$ can be far smaller than the total number of training epochs $T_{end}$ used in complete training), we sum the training losses collected so far:

$$\mathrm{SoTL} = \sum_{t=1}^{T} \left[ \frac{1}{B} \sum_{i=1}^{B} \ell\left(f_{\theta_{t,i}}(X_i), y_i\right) \right]$$

where $\ell$ is the training loss of a mini-batch $(X_i, y_i)$ at epoch $t$ and $B$ is the number of training steps within an epoch. If we use the first few epochs as the burn-in phase for $\theta_{t,i}$ to converge to a certain distribution and start the sum from epoch $t = T - E + 1$ instead of $t = 1$, we obtain a variant SoTL-E. In the case where $E = 1$ (i.e. the sum starts at $t = T$), our estimator corresponds to the sum over training losses within epoch $T$. We discuss the theoretical interpretation of SoTL, based on the Bayesian marginal likelihood and training speed, in Section 3, and show empirically in Section 5 that SoTL, despite its simple form, can reliably estimate the generalisation performance of neural architectures.
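The definition above can be sketched in a few lines. This is a minimal illustration, assuming the per-mini-batch training losses have already been logged as one list per epoch; the function name `sotl`, the `window` argument (playing the role of $E$) and the example numbers are hypothetical and not taken from the paper.

```python
from typing import Optional, Sequence


def sotl(batch_losses: Sequence[Sequence[float]], window: Optional[int] = None) -> float:
    """Sum over epochs of the mean mini-batch training loss.

    batch_losses[t][i] is the training loss of mini-batch i in the (t+1)-th
    epoch. With window=None every recorded epoch is summed (SoTL); with
    window=E only the most recent E epochs are summed (SoTL-E).
    """
    if not batch_losses:
        raise ValueError("need at least one epoch of losses")
    if window is not None and window < 1:
        raise ValueError("window must be a positive number of epochs")
    epochs = batch_losses if window is None else batch_losses[-window:]
    return sum(sum(epoch) / len(epoch) for epoch in epochs)


# Made-up losses for three epochs with two mini-batches each.
example_losses = [[2.0, 1.8], [1.2, 1.0], [0.6, 0.4]]
sotl_all = sotl(example_losses)             # all three epochs
sotl_e1 = sotl(example_losses, window=1)    # last epoch only (E = 1)
```

Because the estimator only reuses losses that SGD already computes, it adds no forward or backward passes to training.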

If the sum over training losses is a useful indicator for the generalisation performance, one might expect the sum over validation losses to be a similarly effective performance estimator. The sum over validation losses (SoVL) lacks the link to the Bayesian model evidence, and so its theoretical motivation is different from our SoTL. Instead, the validation loss sum can be viewed as performing a bias-variance trade-off; the parameters at epoch $t$ can be viewed as a potentially high-variance sample from a noisy SGD trajectory, and so summation reduces the resulting variance in the validation loss estimate at the expense of incorporating some bias due to the relative ranking of models’ test performance changing during training. We show in Section 5 that SoTL clearly outperforms SoVL in estimating the true test performance, and do not consider its possible theoretical motivation further.

3 Theoretical motivation

The SoTL metric is a direct measure of training speed and draws inspiration from two lines of work: the first is a Bayesian perspective that connects training speed with the marginal likelihood in the model selection setting, and the second is the link between training speed and generalisation hardt2016train . In this section, we will summarize recent results that demonstrate the connection between SoTL and generalisation, and further show that in Bayesian updating regimes, the SoTL metric corresponds to an estimate of a lower bound on the model’s marginal likelihood, under certain assumptions.

3.1 Training speed and the marginal likelihood

We motivate the SoTL estimator by a connection to the model evidence, also called the marginal likelihood, which is used in the Bayesian framework for model selection. The model evidence quantifies how likely a model $\mathcal{M}$ is to have produced a data set $\mathcal{D}$, and so can be used to update a prior belief distribution over which model from a given set is most likely to have generated $\mathcal{D}$. Given a model with parameters $\theta$, prior $\pi(\theta)$, and likelihood $P(\mathcal{D} \mid \theta)$ for a training data set $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_n\}$, the (log) marginal likelihood is expressed as follows.

$$\log P(\mathcal{D}) = \log \mathbb{E}_{\pi(\theta)}\left[P(\mathcal{D} \mid \theta)\right] = \sum_{i=1}^{n} \log P(\mathcal{D}_i \mid \mathcal{D}_{<i}) = \sum_{i=1}^{n} \log \mathbb{E}_{P(\theta \mid \mathcal{D}_{<i})}\left[P(\mathcal{D}_i \mid \theta)\right]$$

Interpreting the negative log posterior predictive probability $-\log P(\mathcal{D}_i \mid \mathcal{D}_{<i})$ of each data point as a ‘loss’ function, the log evidence then corresponds to the area under a training loss curve, where each training step would be computed by sampling a data point $\mathcal{D}_i$, taking the log expected likelihood under the current posterior as the current loss, and then updating the posterior by incorporating the new sampled data point. One can therefore interpret the marginal likelihood as a measure of training speed in a Bayesian updating procedure. In the setting where we cannot compute the posterior analytically and only samples $\theta_i \sim P(\theta \mid \mathcal{D}_{<i})$ from the posterior over parameters are available, we obtain an unbiased estimator of a lower bound

$$\sum_{i=1}^{n} \mathbb{E}_{P(\theta \mid \mathcal{D}_{<i})}\left[\log P(\mathcal{D}_i \mid \theta)\right] \le \log P(\mathcal{D})$$

on the marginal likelihood by Jensen’s inequality, which again corresponds to minimizing a sum over training losses.
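The decomposition of the log evidence into sequential posterior predictive terms, and the lower bound obtained from posterior samples, can be checked numerically on a toy conjugate model. The sketch below assumes a Beta-Bernoulli model, which is not used in the paper and is chosen only because its posterior and evidence are available in closed form; all function names and the example data are illustrative.

```python
import math
import random


def log_beta(p, q):
    """Logarithm of the Beta function B(p, q)."""
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)


def log_evidence_closed_form(data, a=1.0, b=1.0):
    """log P(D) for 0/1 data under a Beta(a, b) prior on the success rate."""
    ones = sum(data)
    zeros = len(data) - ones
    return log_beta(a + ones, b + zeros) - log_beta(a, b)


def log_evidence_sequential(data, a=1.0, b=1.0):
    """Sum of log posterior predictive probabilities, one data point at a time."""
    total = 0.0
    for x in data:
        p_one = a / (a + b)                  # P(x = 1 | earlier points)
        total += math.log(p_one if x == 1 else 1.0 - p_one)
        a, b = a + x, b + (1 - x)            # conjugate posterior update
    return total


def sampled_lower_bound(data, a=1.0, b=1.0, n_samples=2000, seed=0):
    """Monte Carlo estimate of sum_i E[log P(D_i | theta)], theta ~ posterior_i."""
    rng = random.Random(seed)
    total = 0.0
    for x in data:
        acc = 0.0
        for _ in range(n_samples):
            theta = rng.betavariate(a, b)
            theta = min(max(theta, 1e-12), 1.0 - 1e-12)   # guard against log(0)
            acc += math.log(theta if x == 1 else 1.0 - theta)
        total += acc / n_samples
        a, b = a + x, b + (1 - x)
    return total


data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]        # made-up observations
exact = log_evidence_closed_form(data)
chain = log_evidence_sequential(data)
bound = sampled_lower_bound(data)
```

Here `exact` and `chain` agree up to floating-point error, while `bound` falls below both, as Jensen's inequality requires.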

A full analysis of the Bayesian setting is outside of the scope of this work. We refer the reader to anonymous2020 for more details of the properties of this estimator in Bayesian linear models. Although the NAS setting does not yield the same interpretation of SoTL as model evidence estimation, we argue that the SoTL metric is still plausibly useful for model selection. Just as the marginal likelihood measures how useful the updates performed from a subset of the data are for predicting later data points, the SoTL of a model trained with SGD will be lower for models whose mini-batch gradient descent updates improve the loss of later mini-batches seen during optimisation.

3.2 Training speed and generalisation

Independent of the accuracy of SoTL in estimating the Bayesian model evidence, it is also possible to motivate our method by its relationship with training speed: models which achieve low training loss quickly will have low SoTL. There are both empirical and theoretical lines of work that illustrate a deep connection between training speed and generalisation. On the theoretical front, we find that models which train quickly can attain lower generalisation bounds. Training speed and generalisation can be related via stability-based generalisation bounds hardt2016train ; liu2017algorithmic , which characterize the dependence of the solution found by a learning algorithm on its training data. In networks of sufficient width, arora2019fine propose a neural-tangent-kernel-based data complexity measure which bounds both the convergence rate of SGD and the generalisation error of the model obtained by optimisation. A similar generalisation bound and complexity measure is obtained by cao2019 .

While theoretical work has largely focused on ranking bounds on the test error, current results do not provide guarantees on consistency between the ranking of different models’ test set performance and their generalisation bounds. The empirical work of jiang2020fantastic demonstrates that many complexity measures are uncorrelated or negatively correlated with the relative performance of models on their test data. Notably, however, a particular measure of training speed (the number of steps required to reach a cross-entropy loss of 0.1) was highly correlated with the test set performance ranking of different models. The connection between training speed and generalisation is also observed by zhang2016understanding , who find that models trained on true labels converge faster than models trained on random labels, and attain better generalisation performance.
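For comparison with SoTL, the steps-to-threshold measure of training speed mentioned above can be written as a short function. This is an illustrative sketch only: the function name and the example loss curves are made up, and the exact protocol used by jiang2020fantastic is not reproduced here.

```python
def steps_to_reach(losses, threshold=0.1):
    """Return the 1-based step at which the loss first falls to `threshold`
    or below, or None if the threshold is never reached."""
    for step, loss in enumerate(losses, start=1):
        if loss <= threshold:
            return step
    return None


fast_model = steps_to_reach([2.0, 0.9, 0.3, 0.09, 0.05])
slow_model = steps_to_reach([2.0, 1.5, 1.1, 0.8, 0.6])
```

Unlike this measure, a sum of training losses is defined even when a model never reaches the threshold within the training budget, which is the common case when only a few epochs are run.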

4 Related work

Various approaches have been developed to speed up architecture performance estimation, thus improving the efficiency of NAS. Low-fidelity estimation methods accelerate NAS by using the validation accuracy obtained after training architectures for fewer epochs (namely early-stopping) li2016hyperband ; falkner2018bohb ; zoph2018learning ; zela2018towards , training a down-scaled model with fewer cells during the search phase zoph2018learning ; real2019regularized or training on a subset of the data klein2016fast . However, low-fidelity estimates underestimate the true performance of the architecture and can change the relative ranking among architectures elsken2018neural . This undesirable effect on relative ranking is more prominent when the cheap approximation set-up is too dissimilar to the full evaluation zela2018towards . As shown in Fig. 4 below, the validation accuracy at early epochs of training suffers from low rank correlation with the final test performance.

Another way to cheaply estimate architecture performance is to train a regression model to extrapolate the learning curve from what is observed in the initial phase of training. Regression model choices that have been explored include Gaussian processes with a tailored kernel function domhan2015speeding , an ensemble of parametric functions domhan2015speeding , a Bayesian neural network klein2016learning and more recently a $\nu$-support vector machine regressor ($\nu$-SVR) baker2017accelerating which achieves state-of-the-art prediction performance. Although these model-based methods can often predict the performance ranking better than their model-free early-stopping counterparts, they require a relatively large amount of fully evaluated architecture data (e.g. 100 fully evaluated architectures in baker2017accelerating ) to train the regression surrogate properly and optimise the model hyperparameters in order to achieve good prediction performance. The high computational cost of collecting the training set makes such model-based methods less favourable for NAS unless the practitioner has already evaluated hundreds of architectures on the target task. Moreover, both low-fidelity estimates and learning curve extrapolation estimators are empirically developed and lack theoretical motivation.

Finally, one-shot NAS methods employ weight sharing to reduce computational costs Pham2018_ENAS ; Liu2019_DARTS ; Xie19_SNAS . Under the one-shot setting, all architectures are considered as subgraphs of a supergraph. Only the weights of the supergraph are trained while the architectures (subgraphs) inherit the corresponding weights from the supergraph. Weight sharing removes the need for retraining each architecture during the search and thus achieves a significant speed-up. However, the weight sharing ranking among architectures often correlates very poorly with the true performance ranking Yang2020NASEFH ; Yu2020Evaluating ; Zela2020NAS-Bench-1Shot1: , meaning architectures chosen by one-shot NAS are likely to be sub-optimal when evaluated independently Zela2020NAS-Bench-1Shot1: . Moreover, one-shot methods are often outperformed by sample-based NAS methods Dong2020nasbench201 ; Zela2020NAS-Bench-1Shot1: .

Apart from the above mentioned performance estimators used in NAS, many complexity measures have been proposed to analyse the generalisation performance of deep neural networks. jiang2020fantastic provides a rigorous empirical analysis of over 40 such measures. This investigation finds that sharpness-based measures mcallester1999pac ; keskar2016large ; neyshabur2017exploring ; dziugaite2017computing (including PAC-Bayesian bounds) provide good correlation with test set performance, but their estimation requires adding randomly generated perturbations to the network parameters and the magnitude of the perturbations needs to be carefully optimised with additional training, making them unsuitable performance estimators for NAS. Optimisation-based complexity measures also perform well in predicting generalisation. Specifically, the number of steps required to reach loss of 0.1, as mentioned in Section 3.2, is closely related to our approach as both quantities measure the training speed of architectures. To our knowledge though, this measure has never been used in the NAS context before.

5 Experiments

In this section we compare the following measures. Note that $T$ denotes the intermediate training epoch, which is smaller than the final epoch number $T_{end}$:

  • Sum of training losses over all preceding epochs (SoTL): our proposed performance estimator sums the training losses of an architecture from epoch $t = 1$ to the current epoch $T$;

  • Sum of training losses over the most recent $E$ epochs (SoTL-E): the variant of our proposed estimator uses the sum of the training losses from epoch $t = T - E + 1$ to $T$;

  • Sum of validation losses over all preceding epochs (SoVL): this estimator computes the sum of the validation losses of a neural architecture from epoch $t = 1$ to the current epoch $T$;

  • Validation accuracy at an early epoch (Val Acc): this corresponds to the early-stopping practice whereby the user assumes the validation accuracy of an architecture at an early epoch $T$ is a good estimator of its final test performance at epoch $T_{end}$;

  • Learning curve extrapolation (LcSVR): the state-of-the-art learning curve extrapolation method baker2017accelerating uses a trained $\nu$-SVR to predict the final validation accuracy of an architecture. The inputs for the SVR regression model comprise architecture meta-features (e.g. number of parameters and depth of the architecture), training hyper-parameters (e.g. initial learning rate, mini-batch size and weight decay), and learning curve features up to epoch $T$ (e.g. the validation accuracies up to epoch $T$, and the 1st-order and 2nd-order differences of the validation curve up to epoch $T$). In our experiments, we train the SVR on data of randomly sampled architectures and, following the practice in baker2017accelerating , we optimise the SVR hyperparameters via random search using 3-fold cross-validation.
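To make the LcSVR inputs concrete, the following sketch assembles one feature vector from the three groups described in the last item above. The ordering of the features, the function name and the example values are assumptions made for illustration; the exact layout used in baker2017accelerating is not specified here.

```python
def learning_curve_features(val_acc, meta=(), hyperparams=()):
    """Concatenate architecture meta-features, training hyper-parameters,
    the validation accuracies observed so far, and the 1st- and 2nd-order
    differences of that validation curve."""
    first = [b - a for a, b in zip(val_acc, val_acc[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    return list(meta) + list(hyperparams) + list(val_acc) + first + second


# Made-up example: (number of parameters, depth), (learning rate, batch size,
# weight decay) and four epochs of validation accuracy.
features = learning_curve_features(
    val_acc=[0.2, 0.5, 0.7, 0.8],
    meta=(1.2e6, 8),
    hyperparams=(0.1, 256, 5e-4),
)
```

A matrix of such vectors, one row per fully trained architecture, is the kind of input a $\nu$-SVR implementation such as scikit-learn's `NuSVR` would be fitted on; collecting those fully trained rows is the cost that the model-free estimators avoid.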

The datasets we used to compare these performance estimators are:

  • NASBench-201 Dong2020nasbench201 : the dataset contains information of 15,625 different neural architectures, each of which is trained with SGD optimiser for 200 epochs ($T_{end} = 200$) and evaluated on 3 different datasets: CIFAR10, CIFAR100, IMAGENET-16-120. The NASBench-201 datasets can be used to benchmark almost all up-to-date NAS search strategies.

  • RandWiredNN: we produced this dataset by generating 552 randomly wired neural architectures from the random graph generators proposed in xie2019exploring and evaluating the architecture performance on the FLOWERS102 dataset nilsback2008automated . We explored 69 sets of hyperparameter values for the random graph generators and for each set of hyperparameter values, we sampled 8 randomly wired neural networks from the generator. All the architectures are trained with SGD optimiser for 250 epochs ($T_{end} = 250$). More details are in Appendix A. This dataset allows us to evaluate the performance of our simple estimator on model selection for the random graph generator in Section 5.4.

In NAS, the relative performance ranking among different models matters more than the exact test performance of models. Thus, we evaluate different performance estimators by comparing their rank correlation with the model’s true/final test accuracy. We adopt Spearman’s rank correlation following ying2019bench ; Dong2020nasbench201 . We flip the sign of SoTL/SoTL-E/SoVL (which we want to minimise) to compare to the Spearman’s rank correlation of the other methods (which we want to maximise). All experiments were conducted on a 36-core 2.3GHz Intel Xeon processor with 512 GB RAM.
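The evaluation protocol described above, including the sign flip for loss-based estimators, can be sketched as follows. The implementation of Spearman's rank correlation is a plain standard-library version written for illustration (tied values receive their average rank); the function names and example numbers are not from the paper.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks.
    Undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def rank_correlation_with_test_acc(loss_scores, test_acc):
    """Negate loss-based scores (lower is better) before correlating them
    with test accuracy (higher is better)."""
    return spearman([-s for s in loss_scores], test_acc)


# Made-up scores for four architectures.
rho = rank_correlation_with_test_acc([3.5, 2.0, 5.0, 1.0], [0.80, 0.90, 0.70, 0.95])
```

In this toy example the architecture with the lowest loss sum also has the highest test accuracy, and so on down the ranking, so the correlation after the sign flip is perfect.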

5.1 Example on Bayesian linear regression

We illustrate how the SoTL metric corresponds to a lower bound on the marginal likelihood that can be used for model selection in a simple Bayesian linear regression setting. We consider an idealised data set $(X, y)$ in which a single input dimension is correlated with the target and the remaining dimensions are noise. We wish to compare two Bayesian linear regression models $\mathcal{M}_1$ and $\mathcal{M}_2$, each of which uses one of two different feature embeddings: $\phi_1$ and $\phi_2$, where $\phi_1$ is the identity and $\phi_2$ retains only the single dimension that is correlated with the target, removing the noisy components of the input. The model which uses $\phi_2$ will have less opportunity to overfit to its training data, and will therefore generalise better than the model which uses $\phi_1$; similarly, it will also have a higher marginal likelihood. We demonstrate empirically in Fig. 1 that the SoTL estimator computed on the iterative posterior updates of the Bayesian linear regression models also exhibits this relative ranking, and illustrate how the SoTL relates to the lower bound described in Section 3.

Figure 1: Example on a simple Bayesian linear regression problem. We see that the sum over training losses gives an estimator for the lower bound of model evidence, and that the SoTL measure is more effective than the final training loss at distinguishing the two models $\mathcal{M}_1$ and $\mathcal{M}_2$.

5.2 Method study

Training loss vs Validation loss

We perform a simple sanity check against the validation loss on NASBench-201 datasets. Specifically, we compare our proposed estimators, SoTL and SoTL-E, against two equivalent variants of validation loss-based estimators: SoVL and the sum of validation losses over the most recent epochs (SoVL-E with $E = 10$; this corresponds to a smoothed version of the validation losses, as the epoch-wise validation loss and its rank correlation with final test accuracy are quite noisy). For each image dataset, we randomly sample 5000 different neural network architectures from the search space and compute the rank correlation between the true test accuracies (at $T_{end} = 200$) of these architectures and their corresponding SoTL/SoTL-E as well as SoVL/SoVL-E up to epoch $T$. The results in Fig. 2 show that our proposed estimators SoTL and SoTL-E clearly outperform their validation counterparts.

Another intriguing observation is that the rank correlation performance of SoVL-E drops significantly in the later phase of the training (after around 100 epochs for CIFAR10 and 150 epochs for CIFAR100) and the final test loss, TestL (T=200), also correlates poorly with final test accuracy. This implies that the validation/test losses can become an unreliable indicator of the validation/test accuracy on certain datasets; as training proceeds, the validation accuracy keeps improving but the validation losses could stagnate at a relatively high level or even start to rise mukhoti2020calibrating ; soudry2018implicit . This is because while the neural network can make more correct classifications on validation points (which depend on the argmax of the logits) over the training epochs, it also gets more and more confident on the correctly classified training data, and thus the weight norm and the maximum of the logits keep increasing. This can make the network overconfident on the misclassified validation data and cause the corresponding validation loss to rise, thus offsetting or even outweighing the gain due to improved prediction performance soudry2018implicit . The training loss does not suffer from this problem (Appendix B). While SoTL-E struggles to distinguish architectures once their training losses have converged to approximately zero, this contributes to a much smaller drop in estimation performance of SoTL-E compared to that of SoVL-E and only happens near the very late phase of training (after 150 epochs), which will hardly be reached if we want efficient NAS using as few training epochs as possible. Therefore, the possibility of network overconfidence under misclassification is another reason for our use of training losses instead of the validation losses.
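A small numerical example shows how validation accuracy and validation loss can move in opposite directions. The probabilities below are invented for illustration and assume a binary classifier, so that a validation point is classified correctly exactly when the probability assigned to its true class exceeds 0.5.

```python
import math


def accuracy_and_loss(p_true):
    """p_true[i] is the probability assigned to the true class of validation
    point i. Returns (accuracy, mean cross-entropy loss)."""
    accuracy = sum(p > 0.5 for p in p_true) / len(p_true)
    loss = sum(-math.log(p) for p in p_true) / len(p_true)
    return accuracy, loss


# Early in training: hesitant predictions, half of them correct.
early_acc, early_loss = accuracy_and_loss([0.6, 0.6, 0.4, 0.4])

# Late in training: confident predictions, three of four correct, but the
# single mistake is made with very high confidence.
late_acc, late_loss = accuracy_and_loss([0.99, 0.99, 0.99, 0.001])
```

Accuracy rises from 0.5 to 0.75 while the mean cross-entropy more than doubles, because the one confidently misclassified point dominates the average.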

(a) CIFAR10
(b) CIFAR100
(c) IMAGENET-16-120
Figure 2: Rank correlation (with final test accuracy) performance of the sum of training losses, SoTL (blue) and SoTL-E (red), and those of validation losses (purple), SoVL (solid) and SoVL-E (dash dot), as well as that of final test loss (black) for 5000 random architectures in NASBench-201 on three image datasets.

Effect of summation window

As shown in Fig. 2, summing the training losses over the most recent epochs (SoTL-E) can achieve higher rank correlation with the true test accuracy than summing over all the previous epochs (SoTL), especially early on in training. We grid-search different summation window sizes to investigate the effect of $E$ and observe consistently across all 3 image datasets that a smaller window size gives higher rank correlation during the early training phase and that all $E$ values converge to the same maximum rank correlation. Thus, we recommend $E = 1$ as the default choice for our SoTL-E estimator and use this for the following sections. Note SoTL-E=1 corresponds to the sum of training losses over all the batches in one single epoch.

(a) CIFAR10
(b) CIFAR100
(c) IMAGENET-16-120
Figure 3: Rank correlation performance of the sum of training losses over most recent epochs (SoTL-E). Different $E$ values are investigated for 5000 random architectures in NASBench-201 on three image datasets. In all three cases, smaller $E$ consistently achieves better rank correlation performance in the early training phase, with $E = 1$ being the best choice.

5.3 Comparison against other baselines

We now compare our estimators SoTL and SoTL-E against other baselines: early-stopping validation accuracy (Val Acc), learning curve extrapolation methods (LcSVR) and sum of validation losses (SoVL). The results on both NASBench-201 and RandWiredNN datasets are shown in Fig. 4. Our proposed estimator SoTL-E, despite its simple form and cheap computation, outperforms all other methods under evaluation in the early training phase on all architecture datasets. Although the validation accuracy (Val Acc) at later epochs can reach similar rank correlation, this is less interesting for applications like NAS where we want to speed up the evaluation as much as possible and thus use as few training epochs as possible. The learning curve extrapolation method, LcSVR, is competitive. However, the method requires data from hundreds of fully trained architectures to train the regression surrogate (baker2017accelerating trains the SVR on 100 architectures; we trained it on a larger number of architectures to make it perform better). Substantial computational resources are needed to obtain such training data.

(a) CIFAR10
(b) CIFAR100
(c) IMAGENET-16-120
(d) FLOWERS102
Figure 4: Rank correlation performance of various baselines: SoTL-E, SoTL, SoVL, Val Acc and LcSVR for 5000 random architectures in NASBench-201 on three image datasets (a) to (c) and for 552 randomly wired architectures on FLOWERS102 (d). In all cases, our SoTL-E achieves superior rank correlation with the true test performance in far fewer epochs than other baselines. We shade the later-epoch region, which is less interesting in NAS where we want to use as few training epochs as possible to maximise the speed-up gain compared to full evaluation.

5.4 Architecture Generator Selection

Figure 5: Model selection among 69 random graph generator hyperparameters on the RandWiredNN dataset. We use each hyperparameter value to generate 8 architectures and evaluate their true test accuracies after complete training. The mean and standard error of the test performance across 8 architectures for each hyperparameter value are presented as Test Acc (yellow) and treated as ground truth (Right y-axis). We then compute our SoTL-E=1 estimator for all the architectures by using their first $T$ epochs of training losses. The mean and standard error of SoTL-E scores for various $T$ are presented in different colours (Left y-axis). The rank correlation between the mean Test Acc and that of SoTL-E for various $T$ is shown in the corresponding legends. With only 10 epochs of training, our SoTL-E estimator can already capture the trend of the true test performance of different hyperparameters relatively well (Rank correlation) and can successfully identify the 24th hyperparameter setting as the optimal choice.

For the RandWiredNN dataset, we use 69 different hyperparameter values for the random graph generator which generates the randomly wired neural architecture. Here we would like to investigate whether our estimator can be used in place of the true test accuracy to select among different hyperparameter values. For each graph generator hyperparameter value, we sample 8 neural architectures with different wiring. The mean and standard error of both the true test accuracies and SoTL-E scores over the 8 samples are presented in Fig. 5. Our estimator predicts the relative performance ranking among different hyperparameters well (Rank correlation) based on as few as 10 epochs of training. The rank correlation between our estimator and the final test accuracy improves as we use the training loss in later epochs.
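The selection procedure described above amounts to grouping the estimator scores by generator setting and comparing group means. The sketch below assumes the SoTL-E scores are already available per setting; the setting labels and all numbers are made up, and only two architectures per setting are shown rather than 8.

```python
from statistics import mean, stdev


def summarise_by_setting(scores):
    """scores maps a generator hyperparameter setting to the SoTL-E values of
    the architectures sampled from it (at least two per setting).
    Returns {setting: (mean, standard error of the mean)}."""
    return {
        setting: (mean(vals), stdev(vals) / len(vals) ** 0.5)
        for setting, vals in scores.items()
    }


def best_setting(scores):
    """Setting with the lowest mean SoTL-E (a lower loss sum is better)."""
    summary = summarise_by_setting(scores)
    return min(summary, key=lambda s: summary[s][0])


toy_scores = {
    "setting_a": [4.0, 4.2],
    "setting_b": [3.0, 3.4],
    "setting_c": [5.0, 5.5],
}
summary = summarise_by_setting(toy_scores)
chosen = best_setting(toy_scores)
```

Reporting the standard error alongside the mean matters here because architectures drawn from the same generator setting can differ noticeably in wiring, and hence in score.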

5.5 Speed up NAS

(a) RE-CIFAR10
(b) RE-CIFAR100
(c) RE-IMAGENET-16-120
(e) TPE-CIFAR100
(f) TPE-IMAGENET-16-120
Figure 6: NAS performance of Regularised Evolution (RE) (Top row) and TPE (Bottom row) combined with final validation accuracy (Val Acc (T=200)), early-stopping validation accuracy (Val Acc (T=50)) and our estimator SoTL-E on NASBench-201. SoTL-E leads to the fastest convergence to the top performing architectures in all cases.

Similar to early stopping, our method is model-free and can significantly speed up the architecture performance evaluation by using information from early training epochs. In this section, we incorporate our estimator, SoTL-E, at $T = 50$ into several NAS search strategies: Regularised Evolution real2019regularized (top row in Fig. 6), TPE bergstra2011algorithms (bottom row in Fig. 6) and Random Search bergstra2012random (Appendix C), and perform architecture search on NASBench-201 datasets. We compare this against two other benchmarks, which evaluate the architecture’s generalisation performance using the final validation accuracy at $T = 200$, denoted as Val Acc (T=200), and the early-stop validation accuracy at $T = 50$, denoted as Val Acc (T=50), respectively. All the NAS search strategies start their search from 10 random initial data points and are repeated for 20 seeds. The mean and standard error results over the search time are shown in Fig. 6. By using our estimator, the NAS search strategies can find architectures with lower test error given the same time budget, or identify the top performing architectures using much less runtime, compared to using final or early-stopping validation accuracy. The gain of using our estimator is also more significant for NAS methods performing both exploitation and exploration (RE and TPE) than for those doing pure exploration (Random Search in Appendix C).
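To show where a cheap estimator plugs into a search strategy, the following is a minimal sketch of regularised (aging) evolution in which the fitness is any score to be minimised, such as SoTL-E after a few epochs. The search space, mutation rule and scoring function are toy placeholders invented for this example; they do not reproduce the NASBench-201 search space or the paper's experimental settings.

```python
import collections
import random


def regularised_evolution(estimate, random_arch, mutate, cycles=100,
                          population_size=10, sample_size=3, seed=0):
    """Aging evolution: each cycle mutates the best of a random sample and
    removes the oldest member. `estimate(arch)` returns a score to minimise;
    `random_arch(rng)` and `mutate(arch, rng)` define the search space."""
    rng = random.Random(seed)
    population = collections.deque()
    history = []
    for _ in range(population_size):                  # random initial population
        arch = random_arch(rng)
        member = (estimate(arch), arch)
        population.append(member)
        history.append(member)
    for _ in range(cycles):
        contenders = rng.sample(list(population), sample_size)
        parent = min(contenders, key=lambda m: m[0])  # lowest estimate wins
        child = mutate(parent[1], rng)
        member = (estimate(child), child)
        population.append(member)
        history.append(member)
        population.popleft()                          # discard the oldest member
    return min(history, key=lambda m: m[0])


# Toy search space (an assumption): four integer choices in 0..4, scored by
# their sum as a stand-in for a SoTL-E value.
def toy_random_arch(rng):
    return tuple(rng.randrange(5) for _ in range(4))


def toy_mutate(arch, rng):
    i = rng.randrange(len(arch))
    return arch[:i] + (rng.randrange(5),) + arch[i + 1:]


def toy_estimate(arch):
    return float(sum(arch))


best_score, best_arch = regularised_evolution(toy_estimate, toy_random_arch, toy_mutate)
```

In a real search the call to `estimate` would train the candidate for a small number of epochs and return its training loss sum, so the cost of each cycle is a fraction of a full evaluation.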

6 Conclusion

We propose a simple yet reliable method for estimating the generalisation performance of neural architectures based on their early training losses. Our estimator enables significant speed-up for performance estimation in NAS while outperforming other efficient estimators in terms of rank correlation with the true test performance. More importantly, our estimator has a theoretical interpretation based on training speed and the Bayesian marginal likelihood, both of which have strong links with generalisation. We believe our estimator can be a very useful tool for achieving efficient NAS.

Broader Impact

Making NAS more environmentally friendly

Training a deep neural network can lead to a fair amount of carbon emissions strubell2019energy . Such environmental costs are significantly amplified if we need to perform NAS strubell2019energy , where repeated training is resource-wasteful but necessary. Our work proposes a cheap yet reliable alternative for estimating the generalisation performance of a neural network based on its early training losses; this significantly reduces the training time required during NAS (e.g. from 200 epochs to 50 epochs) and thus decreases the corresponding environmental costs incurred. Note that although developed for the NAS setting, our estimator is potentially applicable to hyperparameter tuning or model selection in general, as demonstrated in Section 5.4, both of which are frequently performed by almost all machine learning practitioners. While our estimator can hardly be on par with the fully trained test accuracy in assessing the generalisation performance of a model, if practitioners adopted our estimator in place of the fully trained test accuracy as often as possible, the environmental cost-saving would be substantial.

Making NAS accessible to more users and accelerating the development of NAS search strategies

By speeding up the NAS performance evaluation, our work can reduce not only the computational resources required to run many current NAS search strategies but also the sunk costs incurred while developing new search strategies. This increases the chance that researchers or users with limited computing budgets can use or study NAS, which may in turn stimulate the advancement of NAS research. On a broader scale, this also helps NAS better serve its original motivation, which is to free human labour from designing neural networks for new tasks and to make good machine learning models easily accessible to the general community.


Appendix A Datasets description

The datasets we experiment with are:

  • NASBench-201 Dong2020nasbench201 : the dataset contains information on 15,625 different neural architectures, each of which is trained with the SGD optimiser and evaluated on 3 different datasets (CIFAR10, CIFAR100 and IMAGENET-16-120) for 3 random initialisation seeds. The training accuracy/loss and validation accuracy/loss after every training epoch, as well as architecture meta-information such as the number of parameters and FLOPs, are all accessible from the dataset. The search space of the NASBench-201 dataset is a 4-node cell and is applicable to almost all up-to-date NAS algorithms. The dataset is available at

  • RandWiredNN: we produced this dataset by generating 552 randomly wired neural architectures from the random graph generators proposed in xie2019exploring and evaluating their performance on the image dataset FLOWERS102 nilsback2008automated . We explored 69 sets of hyperparameter values for the random graph generators and, for each set of hyperparameter values, we sampled 8 randomly wired neural networks from the generator. A randomly wired neural network comprises 3 cells connected in sequence, and each cell is a 32-node random graph. The wiring within the graph is generated with one of three classic random graph models from graph theory: the Erdős-Rényi (ER), Barabási-Albert (BA) and Watts-Strogatz (WS) models. Each random graph model has 1 or 2 hyperparameters which determine the generative distribution over edge/node connections in the graph. All the architectures are trained with the SGD optimiser for 250 epochs, and the other training set-ups follow Liu2019_DARTS . This dataset allows us to evaluate the performance of our simple estimator on hyperparameter/model selection for the random graph generator. We will release this dataset after paper publication.
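To make the RandWiredNN sampling scheme concrete, the following is a minimal sketch of drawing 8 random 32-node graphs from one Erdős-Rényi setting, using only the Python standard library. The edge probability 0.2, the seeds and the function name `erdos_renyi_edges` are illustrative assumptions rather than values from our experiments, and the sketch omits the step that turns a graph into a network cell.

```python
import random
from itertools import combinations

NUM_NODES = 32            # nodes per cell, as stated above
NUM_SETTINGS = 69         # hyperparameter settings explored, as stated above
SAMPLES_PER_SETTING = 8   # networks sampled per setting, as stated above


def erdos_renyi_edges(num_nodes, edge_prob, seed):
    """Sample an undirected Erdos-Renyi graph as a list of (u, v) edges, u < v.

    Each of the num_nodes * (num_nodes - 1) / 2 node pairs is connected
    independently with probability edge_prob.
    """
    rng = random.Random(seed)
    return [
        (u, v)
        for u, v in combinations(range(num_nodes), 2)
        if rng.random() < edge_prob
    ]


# Hypothetical setting: edge_prob=0.2 is an assumed value for illustration.
graphs = [
    erdos_renyi_edges(NUM_NODES, edge_prob=0.2, seed=seed)
    for seed in range(SAMPLES_PER_SETTING)
]

# 69 settings x 8 samples each gives the 552 architectures in the dataset.
total_architectures = NUM_SETTINGS * SAMPLES_PER_SETTING
```

The BA and WS models would replace `erdos_renyi_edges` with their own generators and hyperparameters.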

Appendix B Training losses vs validation losses

B.1 Example showing training loss is better correlated with validation accuracy than validation loss

(a) Arch A: Train loss = 0.05, Val. loss = 1.36, Val. acc = 0.70
(b) Arch B: Train loss = 0.31, Val. loss = 1.30, Val. acc = 0.67
(c) Arch C: Train loss = 0.69, Val. loss = 1.29, Val. acc = 0.64
Figure 7: Training losses, validation losses and validation accuracies of three example architectures on CIFAR100. The average of the training losses, validation losses and validation accuracies over the final 10 epochs is presented in the subcaption of each architecture.

We sample three example architectures from the NASBench-201 dataset and plot their losses and validation accuracies on CIFAR100 over the training epochs. The relative ranking for the validation accuracy is Arch A (0.70) > Arch B (0.67) > Arch C (0.64), which corresponds perfectly (negatively) with the relative ranking for the training loss: Arch A (0.05) < Arch B (0.31) < Arch C (0.69). Namely, the best-performing architecture also has the lowest final training loss. However, the ranking among their validation losses is wrongly correlated with that of validation accuracy: the worst-performing architecture has the lowest final validation loss while the best-performing architecture has the highest. Moreover, in all three examples, especially the better-performing ones, the validation loss stagnates at a similar and relatively high value while the validation accuracy continues to rise. The training loss doesn’t have this problem; it decreases while the validation accuracy increases. This confirms the observation we made in the main paper that the validation loss becomes an unreliable predictor for the validation accuracy, or the generalisation performance of the architecture, as the training proceeds, due to overconfident misclassification.
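The ranking argument can be checked numerically with the values reported in the subcaptions of Fig. 7. The sketch below is a hand-rolled Spearman rank correlation for tie-free data; the helper names `ranks` and `spearman` are ours for illustration. A loss that is a good predictor should correlate negatively with accuracy.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data of equal length."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Final-10-epoch averages for Arch A, B, C, taken from Fig. 7.
train_loss = [0.05, 0.31, 0.69]
val_loss = [1.36, 1.30, 1.29]
val_acc = [0.70, 0.67, 0.64]

rho_train = spearman(train_loss, val_acc)  # -1.0: lower train loss, higher accuracy
rho_val = spearman(val_loss, val_acc)      # 1.0: lower val loss, LOWER accuracy
```

With only three architectures this is an illustration rather than evidence; the rank correlations over 5000 architectures are those reported in the figures.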

B.2 Comparison with sum over validation accuracy

Please see Fig. 8.

(a) CIFAR10
(b) CIFAR100
(c) IMAGENET-16-120
Figure 8: Rank correlation performance of the sum of training losses, SoTL-E (red), the sum of validation losses, SoVL-E (purple) and the sum of validation accuracy, SoVA-E (green) for 5000 random architectures in NASBench-201 on three image datasets. The results on CIFAR10 and CIFAR100 confirm the discussion in the paper and the subsection above that, as the training proceeds, the validation loss can become poorly correlated with the final validation/test performance. Thus, another baseline to check against is the sum over validation accuracy, SoVA-E. SoVA-E has no specific meaning but can be viewed as a moving-average (smoothed) version of validation accuracy. It is expected that SoVA-E should converge to a perfect rank correlation (=1) with the true test performance at the end of the training. However, the results in (a), (b) and (c) show that our proposed estimator SoTL-E consistently outperforms SoVA-E in the early phase of the training. This reconfirms the superior performance of our estimator.
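The summed quantities compared in Fig. 8 can be sketched as below. The sketch assumes that the "-E" suffix denotes a sum over the most recent epochs up to the current one, with the window length passed as `window`; the function name and the per-epoch curves are hypothetical and chosen only for illustration.

```python
def sum_over_recent_epochs(per_epoch_values, window):
    """Sum the last `window` entries of a per-epoch metric curve.

    Assumption: the curve holds one value per epoch trained so far, and the
    estimator only uses the most recent `window` of them.
    """
    if window <= 0 or window > len(per_epoch_values):
        raise ValueError("window must be between 1 and the number of epochs")
    return sum(per_epoch_values[-window:])


# Hypothetical curves for one architecture after 4 epochs of training.
train_losses = [2.0, 1.5, 1.0, 0.5]
val_losses = [2.25, 2.0, 1.75, 1.75]
val_accs = [0.25, 0.5, 0.5, 0.75]

sotl = sum_over_recent_epochs(train_losses, window=2)  # lower suggests a better architecture
sovl = sum_over_recent_epochs(val_losses, window=2)    # lower suggests a better architecture
sova = sum_over_recent_epochs(val_accs, window=2)      # higher suggests a better architecture
```

Architectures would then be ranked by these sums, and the rank correlation with final test accuracy computed across architectures at each epoch.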

B.3 Overfitting on CIFAR10 and CIFAR100

(a) Training loss
(b) No. of arch with loss < 0.1
Figure 9: Mean and standard error of training losses and validation losses on 5000 architectures on different NASBench-201 image datasets. (a) shows the training curves and (b) shows the number of architectures whose training losses go below 0.1 as the training proceeds. Many architectures reach very small training losses in the later phase of the training on CIFAR10 and CIFAR100, and thus may overfit on these two datasets. In contrast, all the architectures suffer high training losses on IMAGENET-16-120, which is a much more challenging classification task, and none of them overfits.

In Figure 2 in Section 5.2 of the main paper, the rank correlation achieved by SoTL-E on CIFAR10 and CIFAR100 drops slightly in the later training epochs, but a similar trend is not observed for IMAGENET-16-120. We hypothesise that this is because many architectures converge to very small training losses on CIFAR10 and CIFAR100 in the later training phase, making it more difficult to distinguish these good architectures based on their later-epoch training losses. This doesn’t happen on IMAGENET-16-120 because it’s a more challenging dataset. We test this by visualising the training loss curves of all 5000 architectures in Fig. 9(a), where the solid line and error bar correspond to the mean and standard error respectively. We also plot the number of architectures with training losses below 0.1 in Fig. 9(b) (the threshold 0.1 is chosen following the threshold for optimisation-based measures in jiang2020fantastic ). It is evident that CIFAR10 and CIFAR100 both see an increasing number of overfitted architectures as the training proceeds, whereas all architectures still have high training losses on IMAGENET-16-120 at the end of the training and none of them overfits. Thus, our hypothesis is confirmed. A similar observation is made in jiang2020fantastic , where the authors find that the number of optimisation iterations required to reach a loss of 0.1 correlates well with generalisation, but the number of iterations required to go from a loss of 0.1 to a loss of 0.01 doesn’t.
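The count plotted in panel (b) of Figure 9 can be computed as in the following sketch. The threshold 0.1 is the one stated above; the three loss curves and the function name `count_below_threshold` are hypothetical and serve only to illustrate the computation.

```python
def count_below_threshold(loss_curves, threshold=0.1):
    """For each epoch, count the architectures whose training loss is below threshold.

    loss_curves: list of per-architecture lists, one training loss per epoch,
    all of the same length.
    """
    num_epochs = len(loss_curves[0])
    return [
        sum(1 for curve in loss_curves if curve[epoch] < threshold)
        for epoch in range(num_epochs)
    ]


# Hypothetical training-loss curves for three architectures over four epochs.
loss_curves = [
    [1.0, 0.5, 0.09, 0.05],  # drops below 0.1 at the third epoch
    [1.2, 0.6, 0.2, 0.08],   # drops below 0.1 at the fourth epoch
    [1.5, 1.1, 0.9, 0.8],    # never drops below 0.1
]

counts = count_below_threshold(loss_curves)  # [0, 0, 1, 2]
```

A count that keeps growing with the epoch, as on CIFAR10 and CIFAR100, means more architectures have near-identical late-epoch losses and are therefore harder to rank by them.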

Appendix C Additional NAS experiments

In this work, we incorporate our estimator, SoTL-E, into three NAS search strategies: Regularised Evolution real2019regularized , TPE bergstra2011algorithms and Random Search bergstra2012random , and perform architecture search on the NASBench-201 datasets. We modify the implementation available at for these three methods.

(a) RS-CIFAR10
(b) RS-CIFAR100
(c) RS-IMAGENET-16-120
Figure 10: NAS performance of Random Search (RS) combined with final validation accuracy (Final Val Acc), early-stop validation accuracy (ES Val Acc) and our estimator SoTL-E on NASBench-201. SoTL-E converges at a rate competitive with ES Val Acc, and both are faster than using Final Val Acc.

Random Search bergstra2012random is a very simple yet competitive NAS search strategy Dong2020nasbench201 . We also combine our estimator, SoTL-E, with Random Search to perform NAS. We compare it against the baselines using the final validation accuracy at T=200, denoted as Val Acc (T=200), and the early-stop validation accuracy at T=50, denoted as Val Acc (T=50). Other experimental set-ups follow Section 5.5 in the paper. The results over running hours on all three image tasks are shown in Fig. 10. The use of our estimator clearly leads to faster convergence than the use of the final validation accuracy, i.e. Val Acc (T=200). Moreover, our estimator also outperforms the early-stop validation accuracy, Val Acc (T=50), on the two more challenging image tasks, CIFAR100 and IMAGENET-16-120, and is on par with it on CIFAR10. The performance gain from using our estimator or the early-stop validation accuracy is less significant for Random Search than for Regularised Evolution and TPE. For example, given a budget of 150 hours on CIFAR100, Regularised Evolution and TPE, when combined with our estimator, can find an architecture with a test error around or below 0.26, but Random Search only finds an architecture with a test error of around 0.27. This is because Random Search is purely explorative, while Regularised Evolution and TPE both trade off exploration and exploitation during their search; by efficiently estimating the final generalisation performance of the architectures, our estimator enables better exploitation. Therefore, we recommend that users deploy our proposed estimator with search strategies which involve some degree of exploitation to maximise the potential gain.
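The way a cheaper estimator changes Random Search under a fixed budget can be sketched as follows. This is a toy illustration, not the implementation used in our experiments: the lookup table `toy_benchmark`, the function names, the window length and the budget measured in epochs are all hypothetical, and the score is the negated sum of recent training losses so that higher is better.

```python
import random


def negative_sotl(record, num_epochs, window=2):
    """Score = minus the sum of the last `window` training losses seen so far."""
    losses = record["train_losses"][:num_epochs]
    return -sum(losses[-window:])


def val_acc_at(record, num_epochs):
    """Score = validation accuracy after `num_epochs` epochs of training."""
    return record["val_accs"][num_epochs - 1]


def random_search(benchmark, estimator, epochs_per_eval, epoch_budget, seed=0):
    """Sample architectures uniformly and keep the one the estimator scores best."""
    rng = random.Random(seed)
    names = sorted(benchmark)
    best_name, best_score, spent = None, float("-inf"), 0
    while spent + epochs_per_eval <= epoch_budget:
        name = rng.choice(names)
        score = estimator(benchmark[name], epochs_per_eval)
        spent += epochs_per_eval  # each query costs the epochs it trains for
        if score > best_score:
            best_name, best_score = name, score
    return best_name, spent


# Hypothetical tabular benchmark: per-epoch curves for three architectures.
toy_benchmark = {
    "arch_a": {"train_losses": [2.0, 1.0, 0.5, 0.25], "val_accs": [0.30, 0.50, 0.60, 0.70]},
    "arch_b": {"train_losses": [2.0, 1.5, 1.0, 0.75], "val_accs": [0.30, 0.40, 0.55, 0.65]},
    "arch_c": {"train_losses": [2.5, 2.0, 1.5, 1.25], "val_accs": [0.20, 0.35, 0.50, 0.60]},
}

# Same budget of 12 epochs: 6 cheap queries versus 3 full-length queries.
early_best, early_spent = random_search(toy_benchmark, negative_sotl, epochs_per_eval=2, epoch_budget=12)
final_best, final_spent = random_search(toy_benchmark, val_acc_at, epochs_per_eval=4, epoch_budget=12)
```

Because Random Search never uses past scores to choose the next architecture, the estimator only affects how many architectures fit in the budget and which one is finally returned, which is consistent with the smaller gain observed for it.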