Towards NNGP-guided Neural Architecture Search

11/11/2020 ∙ by Daniel S. Park, et al. ∙ Google

The predictions of wide Bayesian neural networks are described by a Gaussian process, known as the Neural Network Gaussian Process (NNGP). Analytic forms for NNGP kernels are known for many models, but computing the exact kernel for convolutional architectures is prohibitively expensive. One can obtain effective approximations of these kernels through Monte-Carlo estimation using finite networks at initialization. Monte-Carlo NNGP inference is orders-of-magnitude cheaper in FLOPs compared to gradient descent training when the dataset size is small. Since NNGP inference provides a cheap measure of performance of a network architecture, we investigate its potential as a signal for neural architecture search (NAS). We compute the NNGP performance of approximately 423k networks in the NAS-Bench-101 dataset on CIFAR-10 and compare its utility against conventional performance measures obtained by shortened gradient-based training. We carry out a similar analysis on 10k randomly sampled networks in the mobile neural architecture search (MNAS) space for ImageNet. We discover comparative advantages of NNGP-based metrics, and discuss potential applications. In particular, we propose that NNGP performance is an inexpensive signal independent of metrics obtained from training that can either be used for reducing big search spaces, or improving training-based performance measures.




1 Introduction

The behavior of deep neural networks often becomes analytically tractable when the network width is very large (Neal, 1994; Williams, 1997; Hazan and Jaakkola, 2015; Schoenholz et al., 2017; Lee et al., 2018; Matthews et al., 2018a, b; Borovykh, 2018; Garriga-Alonso et al., 2019; Novak et al., 2019a; Yang and Schoenholz, 2017; Yang et al., 2019; Yang, 2019a, b; Pretorius et al., 2019; Novak et al., 2020; Hron et al., 2020; Jacot et al., 2018; Li and Liang, 2018; Allen-Zhu et al., 2018; Du et al., 2019a, 2018; Zou et al., 2019a, b; Chizat et al., 2019; Lee et al., 2019; Arora et al., 2019a; Du et al., 2019b; Sohl-Dickstein et al., 2020; Huang et al., 2020; Pennington et al., 2017; Xiao et al., 2018, 2019; Hu et al., 2020; Li et al., 2019; Arora et al., 2020; Shankar et al., 2020a; Cho and Saul, 2009; Daniely et al., 2016; Poole et al., 2016; Chen et al., 2018; Li and Nguyen, 2019; Daniely, 2017; Pretorius et al., 2018; Hayou et al., 2018; Karakida et al., 2018; Blumenfeld et al., 2019; Hayou et al., 2019)

One such example is that Bayesian inference in the parameter space of a deep neural network becomes equivalent to a Gaussian process prediction in function space

(Neal, 1994; Lee et al., 2018; Matthews et al., 2018a). An intriguing consequence of this correspondence is that once the neural network Gaussian process (NNGP) kernel is computed, exact Bayesian inference is possible in wide networks. This further allows the computation of properties such as the expected validation accuracy of a wide Bayesian neural network with a given architecture and initialization scale.

We expect performance metrics of the NNGP associated with a given network to correlate well with the network’s actual performance after training, since we expect the predictions of Bayesian and gradient descent trained networks to be correlated. An important topic in neural architecture search (NAS) Zoph and Le (2016); Baker et al. (2016) is the discovery of computationally cheap methods to predict the fully-trained performance of a given network. This suggests NNGP performance should provide a useful signal for NAS. This use case has been previously suggested (Novak et al., 2019b; Arora et al., 2019b; Shankar et al., 2020b), but never explored.

There have been extensive studies on computing the exact kernel of NNGPs, but the actual computation can be very expensive Novak et al. (2019b); Arora et al. (2019b); Shankar et al. (2020b), requiring hundreds of accelerator hours. Furthermore, networks in neural network search spaces typically use operations for which the NNGP correspondence has no known closed form. However, Monte Carlo estimates of NNGP kernels are often far cheaper, and can be computed for any architecture by repeated random initialization of the network (Lee et al., 2018).

We have released a Colab notebook demonstrating our algorithm to compute NNGP performance, as well as metrics used in this paper to measure the quality of NNGP performance as a predictor of the ground-truth network performance.

1.1 Summary of contributions

We examine the utility of NNGP validation accuracy for predicting final network performance, and compare it against that of shortened-training, which is a common method for approximating network performance Zoph and Le (2016). We do so in two different settings: on the NAS-Bench-101 dataset Ying et al. (2019) with 423k network architectures evaluated on CIFAR-10 Krizhevsky et al. ; and on 10k randomly sampled networks from the MNAS search space Tan et al. (2019) evaluated on ImageNet Russakovsky et al. (2015). In both cases we find that NNGP performance is indicative of final network performance, while being at least an order of magnitude cheaper to compute than shortened-training. We further find that for the large NAS-Bench-101 search space, thresholding by NNGP accuracy dramatically shrinks the search space which needs to be examined by more expensive methods. We also demonstrate that NNGP and shortened-training performance can potentially be combined to produce a metric with improved predictive quality on NAS-Bench-101.

1.2 Further related work

Proxy tasks, i.e., computationally manageable tasks for approximating the performance of a neural network, are commonly used in neural architecture search Zoph and Le (2016); Tan et al. (2019); Real et al. (2019); Ghiasi et al. (2019). Early stopping by leveraging features and training curves from finished trials has also been used Swersky et al. (2014); Domhan et al. (2015); Klein et al. (2017); Baker et al. (2017). A measure for predicting network performance at initialization has recently been proposed in Mellor et al. (2020). Architecture search can also be framed as finding a sub-network within a super-network that is trained once. In architecture search with weight sharing Pham et al. (2018); Liu et al. (2019); Cai et al. (2019), a controller samples sub-networks from the super-network and uses a per mini-batch reward as an approximation for the sub-network performance. In differentiable architecture search Liu et al. (2019) the sub-network selection is part of the gradient based training. Learning-based methods Wen et al. (2019) have also been recently applied to predict the performance of the network, where a neural network that takes the architecture as the input and produces the ground truth performance as the output is trained on a subset of the search space.

Our approach is orthogonal and could be combined with these prior works; the NNGP performance studied is obtained without any gradient based training and furthermore, without any pre-existing gradient based training data.

2 Background

2.1 NNGP Inference

Consider a deep neural network with parameters θ, whose architecture maps an input x into a feature vector h(x; θ) of dimension d, followed by a linear readout layer with weight variance σ_w²/d producing predicted labels ẑ(x). Consider data points x_1, …, x_N. In the NNGP approximation, the distribution over output vectors for any label at initialization is jointly Gaussian with covariance

K_ab = (σ_w²/d) E_θ[ h(x_a; θ) · h(x_b; θ) ],   (1)

where K_ab is the sample-sample second moment of the features, averaged over random network initializations and units. This approximation has been shown to be exact in the wide limit for many architectures Neal (1994); Schoenholz et al. (2017); Lee et al. (2018); Matthews et al. (2018a, b); Garriga-Alonso et al. (2019); Novak et al. (2019a); Yang (2019a, b); Hron et al. (2020).

For the rest of the paper, we proceed with the assumption that this is an adequate approximation for the architectures being studied. While exact convergence of Bayesian inference on the neural network parameter space to the NNGP has not been proven for architectures with some of the components we use in this work (e.g., max-pooling), it is expected that very general classes of architectures will exhibit GP-like behaviour as they become wide.

Since the distribution of label vectors is given by a Gaussian distribution, we can compute the exact conditional expectation values of labels. If the parameter initialization distribution is interpreted as a prior distribution, this corresponds to the predictions that would be made by a Bayesian neural network. In other words, if the inputs X_T = (x_1, …, x_{N_T}) produce the label vectors Z_T, then introducing a regularization constant ε, the expected label vector for an input x is

Label of x = K(x, X_T) (K(X_T, X_T) + ε I)⁻¹ Z_T.   (3)

The central challenge in carrying out an NNGP calculation is computing the kernel K. In this paper, we estimate the kernel by a Monte-Carlo method first studied in Novak et al. (2019a). That is, K is computed by stochastically evaluating the expectation in equation (1) over repeated random initializations of the network. We denote the number of random initializations M the ensemble number.

We search over a range of regularization constants ε, as the result of NNGP inference can vary significantly with ε. We normalize the search range of this constant with respect to the average eigenvalue λ̄ = tr(K)/N_T of the kernel matrix, i.e., ε = ε̂ λ̄. Since ε̂ has been made dimensionless in this way, the same search range of ε̂ can be used for all NNGP experiments. We take this range to be numpy.logspace(-7, 2, 20).

The full procedure for computing NNGP validation accuracy is shown in Algorithm 1. The target label vectors are derived from the target class labels by taking the one-hot vectors and shifting them to have mean zero.

Input: Training input, label vector pairs (X_T, Z_T); validation input, label pairs (X_V, y_V); network feature function h(·; θ); initialization distribution p(θ).
Result: The NNGP validation accuracy for each ε̂ in the search range.
Initialize K_TT ← 0, K_VT ← 0;
for n = 1, …, M do
       Initialize parameters θ ~ p(θ);
       Initialize batch-norm parameters of network;
       Compute features H_T = h(X_T; θ) and H_V = h(X_V; θ);
       Accumulate K_TT ← K_TT + H_T H_Tᵀ/(dM) and K_VT ← K_VT + H_V H_Tᵀ/(dM);
end for
for each ε̂ in the search range do
       Compute predictions Ẑ_V = K_VT (K_TT + ε̂ λ̄ I)⁻¹ Z_T and record the validation accuracy against y_V;
end for
Algorithm 1 NNGP inference with ensemble number M
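As an illustration, the Monte-Carlo estimate of Algorithm 1 can be sketched in a few lines of numpy. Here `feature_fn` and `init_fn` are hypothetical stand-ins for the penultimate-layer features of the network under study and its initialization distribution; they are not part of the paper's released code.

```python
import numpy as np

def nngp_validation_accuracy(feature_fn, init_fn, x_train, z_train, x_val, y_val,
                             ensemble_size=4, eps_grid=np.logspace(-7, 2, 20)):
    """Sketch of Algorithm 1 (Monte-Carlo NNGP inference).

    feature_fn(params, x) -> (n, d) penultimate-layer features at initialization;
    init_fn() draws fresh random parameters. z_train holds mean-shifted one-hot
    label vectors; y_val holds integer validation labels.
    """
    n_t, n_v = len(x_train), len(x_val)
    k_tt = np.zeros((n_t, n_t))
    k_vt = np.zeros((n_v, n_t))
    for _ in range(ensemble_size):
        params = init_fn()
        h_t = feature_fn(params, x_train)          # (n_t, d)
        h_v = feature_fn(params, x_val)            # (n_v, d)
        d = h_t.shape[1]
        k_tt += h_t @ h_t.T / (d * ensemble_size)  # accumulate K_TT
        k_vt += h_v @ h_t.T / (d * ensemble_size)  # accumulate K_VT
    lam_bar = np.trace(k_tt) / n_t                 # average eigenvalue, for normalization
    best_acc = 0.0
    for eps_hat in eps_grid:                       # sweep the dimensionless regularizer
        reg = k_tt + eps_hat * lam_bar * np.eye(n_t)
        z_pred = k_vt @ np.linalg.solve(reg, z_train)
        acc = np.mean(np.argmax(z_pred, axis=1) == y_val)
        best_acc = max(best_acc, acc)
    return best_acc
```

In practice one would record the accuracy for every ε̂ rather than only the best, but the kernel accumulation and the regularized solve are the essential steps.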

Computational Cost

Let C_A be the architecture (A) dependent number of FLOPs for inference on a single sample, d the dimension of the feature space of the penultimate layer of the network, M the ensemble number, L the number of labels, and N_T and N_V the size of the training and validation sets for the NNGP, respectively. Up to O(1) constants, the computational FLOPs required for computing the NNGP validation accuracy is

C_NNGP ≈ M(N_T + N_V) C_A + 2 M d N_T (N_T + N_V) + 20 [ N_T³/3 + 2 N_T² (N_V + L) ].   (4)

The details of this computation can be found in section B of the supplementary material (SM). This cost does not scale well with N_T, which we cap at 8k samples when computing NNGP accuracy to keep the inference cost reasonable.

Let us denote the FLOPs required for training a model per step per sample by C_A^train. We have computed C_A and C_A^train for all the networks in the NAS-Bench-101 dataset for multiple batch sizes and found that the relation C_A^train ≈ 3 C_A holds robustly (see section B of the SM). Thus, the cost of training a model for E epochs and carrying out inference for validation can be written as

C_train^all(E) ≈ 3 E N_T^all C_A + N_V^all C_A.   (5)

We add the superscript "all" to emphasize that gradient-based training of the networks is always performed on the entire dataset, while NNGP inference is performed on sub-sampled datasets.

For various plots regarding NAS-Bench-101, we will plot the computational cost of obtaining NNGP or shortened training performance based on average FLOPs computed over all networks in the dataset. The average C_A is of order GFLOPs, while d = 512 for all networks in NAS-Bench-101 (see section B of the SM).
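The cost comparison above can be made concrete with a small sketch. The O(1) prefactors below follow the schematic counting of the cost expressions (forward passes, kernel construction, and Cholesky-style solves for 20 regularizer values) and are assumptions rather than measured constants.

```python
def nngp_flops(c_a, d, m, n_t, n_v, n_labels, n_reg=20):
    """Schematic NNGP cost: forward passes + kernel construction + solves.

    Prefactors are illustrative O(1) assumptions, not measured values.
    """
    forward = m * (n_t + n_v) * c_a                       # M forward passes per sample
    kernels = 2 * m * d * n_t * (n_t + n_v)               # building K_TT and K_VT
    solves = n_reg * (n_t**3 / 3 + 2 * n_t**2 * (n_v + n_labels))
    return forward + kernels + solves

def train_flops(c_a, epochs, n_train_all, n_val_all):
    """Training cost, using the C_A^train ≈ 3·C_A rule of thumb."""
    return 3 * epochs * n_train_all * c_a + n_val_all * c_a
```

For example, with C_A around a GFLOP, NNGP inference on a few thousand sub-sampled points is far cheaper than even 12-epoch training on the full 40k-sample CIFAR-10 training split.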

2.2 Metrics

We use the following metrics to evaluate the quality of NNGP and short-training proxy tasks:

Kendall Rank Correlation Coefficient (Kendall’s Tau): The Kendall rank correlation coefficient measures how well two orderings agree. Let us assume two orderings of a set. For every pair of elements of the set, let us denote the number of concordant pairs (pairs that are ordered the same way in both orderings) P, the number of discordant pairs (pairs that are ordered the opposite way in the two orderings) Q, and the number of ties in each of the orderings T and U. There are multiple versions of Kendall’s Tau which treat ties differently. We use the following definition:

τ = (P − Q) / √[(P + Q + T)(P + Q + U)]
We compute how well a validation accuracy ranks the networks by computing its Kendall rank correlation coefficient against the ground-truth accuracy of the networks.
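The definition above (the tau-b variant) can be computed directly from pair counts; a minimal sketch, counting pairs tied in only one of the two orderings toward T and U:

```python
from itertools import combinations
import math

def kendall_tau_b(a, b):
    """Kendall's tau-b from concordant/discordant/tied pair counts.

    Pairs tied in both orderings are excluded from every count.
    """
    p = q = t = u = 0
    for (a1, b1), (a2, b2) in combinations(zip(a, b), 2):
        da, db = a1 - a2, b1 - b2
        if da == 0 and db == 0:
            continue        # tied in both orderings: excluded
        elif da == 0:
            t += 1          # tie in the first ordering only
        elif db == 0:
            u += 1          # tie in the second ordering only
        elif da * db > 0:
            p += 1          # concordant
        else:
            q += 1          # discordant
    return (p - q) / math.sqrt((p + q + t) * (p + q + u))
```

For large sets one would use an O(n log n) implementation (e.g., scipy.stats.kendalltau, whose default is also tau-b), but the quadratic form above mirrors the definition.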

Correlation Coefficient: We compute the Pearson correlation coefficient between the performance metrics and the ground-truth accuracy.

Prediction Quality for Exceedance of Threshold Performance (PQETP-t): To judge the utility of a performance metric of a network, we can measure how well it predicts whether the true network performance is above some threshold t. We do so by computing the area under the receiver operating characteristic curve (AUROC). The ROC curve for a binary classifier with a continuous output is obtained by plotting the true positive rate against the false positive rate as the true/false boundary of the classifier is varied. In our case, the binary classifier is set to be the performance metric we would like to evaluate (e.g., validation accuracy after shortened-training, NNGP validation accuracy), and the binary class is whether the ground-truth validation accuracy of the network exceeds the threshold t. Thus a metric with a better PQETP for performance t is better at determining whether the ground-truth performance of the network is above t.

Discovered Performance: The “discovered performance" of a set of networks is obtained by first choosing the top-k performers in the set according to the reference performance metric, and taking the best ground-truth performance among those of the selected networks, i.e., it is the performance of the top performer “discovered" using the metric. We choose k to be 10 throughout the paper, and will be computing the average discovered performance of subsets of NAS-Bench-101 of fixed size.
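Both metrics admit short direct implementations. The sketch below uses the rank-statistic form of AUROC (the probability that a randomly chosen above-threshold network outscores a randomly chosen below-threshold one, with ties counted as one half); for large sets one would use a sorted O(n log n) variant instead of the double loop.

```python
def pqetp(scores, ground_truth, threshold):
    """PQETP-t: AUROC of a proxy score for predicting ground_truth > t."""
    pos = [s for s, g in zip(scores, ground_truth) if g > threshold]
    neg = [s for s, g in zip(scores, ground_truth) if g <= threshold]
    if not pos or not neg:
        raise ValueError("threshold must split the networks into two classes")
    # AUROC = P(random positive outranks random negative), ties count 1/2
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def discovered_performance(scores, ground_truth, k=10):
    """Best ground-truth accuracy among the top-k networks ranked by the proxy."""
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return max(ground_truth[i] for i in top_k)
```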

3 Experiment Design

3.1 The NAS-Bench-101 Dataset

The NAS-Bench-101 dataset Ying et al. (2019) consists of 423k image classification networks (details of which are provided in section C.1 of the supplementary materials) evaluated on CIFAR-10, with the standard train/validation/test split of 40k/10k/10k samples. The dataset contains the validation and training accuracy of each network after training for 4, 12, 36 and 108 epochs for 3 different trials. We take the ground-truth performance of a network to be the average validation accuracy for the three 108-epoch training trials. Our goal is to evaluate how well NNGP performance predicts this ground truth accuracy.

We compute the NNGP validation accuracy with a range of ensemble numbers M and fixed sub-sampled training and validation sets of different sizes (N_T, N_V) with balanced labels for all networks. We compare the utility of the NNGP validation accuracy obtained using different values of (N_T, N_V, M) with those obtained by shortened training. To do so, we use the validation accuracy computed at the end of a single trial of 4, 12 and 36 epoch training. We can compute the cost for obtaining each measure using equations (4) and (5). The computational costs will be presented in peta-FLOPs (PFLOPs). All metrics introduced in section 2.2 are computed for the validation accuracies obtained from NNGP inference and shortened training.

3.2 The MNAS Search Space

The MNAS search space Tan et al. (2019) is intended for mobile neural networks on ImageNet Russakovsky et al. (2015) with the aim of balancing performance and computational cost. Throughout this paper, we refer to the provided 50k “validation set" of ImageNet as the “test set" and split a separate 50k subset from the 1.3M-sample training set, designating it as the validation set for evaluation. We select a randomly sampled set of 10k networks from the MNAS search space to study. The 5-epoch validation accuracy, and the MNAS reward function, which is a function of this accuracy and latency, are computed for all 10k networks. For more details, see section C.2 of the supplementary material.

We carry out NNGP inference on a random 8k-sample subset of the training set with balanced labels and the 50k validation set. We also construct training and validation subsets using 100 randomly subsampled labels with sizes (1k, 5k) and (8k, 5k). We use these three pairs of training/validation sets for NNGP inference with fixed ensemble number 4, and compare its utility as a performance measure against the MNAS reward obtained by 5-epoch training. We do so by selecting the top-10 networks according to the NNGP validation accuracies and shortened-training measures, and evaluating the ground-truth performance by training the networks for 400 epochs and evaluating the test set accuracy. The hyperparameters used for 5 and 400-epoch training are identical to those used in Tan et al. (2019).

4 Results

4.1 CIFAR10 on NAS-Bench-101

Figure 1: NNGP validation accuracy is a computationally cheap way to predict final network performance. Kendall’s Tau and Pearson correlation between NNGP/shortened-training validation accuracy and the ground-truth network performance plotted against the computational cost per network for all networks in NAS-Bench-101. Color indicates (N_T, N_V) for the NNGP performance, while the different points in each connected plot represent different ensemble numbers. See SM section D for average prediction performance.

We compute Kendall’s Tau and the correlation of the NNGP validation accuracies for different combinations of (N_T, N_V, M) and validation accuracies measured for shortened training against the ground-truth performance of the 423k networks in NAS-Bench-101. (The raw inference results are presented in SM section D.) These measures have been plotted against the computational cost in FLOPs in figure 1. We find that the Kendall’s Tau performance of 4-epoch or 12-epoch training can be matched or bested with NNGP inference with an order-of-magnitude fewer FLOPs.

Figure 2: PQETP-t of validation accuracies from 4 and 12-epoch training and NNGP inference for NAS-Bench-101. Color for NNGP inference is coded by (N_T, N_V). (left) PQETP-t plotted against threshold performance t. For NNGP inference, the upper-bound and lower-bound values with respect to all configurations are shown. The cumulative density of networks above threshold t is plotted for reference. NNGP inference stays competitive with 12-epoch training and is universally better than 4-epoch training for the 25 to 92 percentile values of t (solid lines). (right) PQETP-t plotted against computational cost for selected fixed values of t. These are obtained from slices of the left figure by taking the top 1, 5, 20, 50 percentile threshold values (dotted lines on left).

In figure 2 we plot PQETP-t for a range of threshold accuracies t, computed for validation accuracy via 4 and 12-epoch training and NNGP inference. We see that while 12-epoch training validation accuracy is better suited than NNGP measures for discerning whether a network performance is within the top 1, 5 or 20 percentile, many of the NNGP measures are better at predicting whether a network performance is above the median. Meanwhile, we find that most NNGP measures have better PQETP-t compared to 4-epoch training beyond the 92nd percentile values of t, despite costing less to compute.

We also compute the average discovered performance for 10 randomly sampled 10k-subsets of networks for each metric. These values have been plotted in figure 3 against the computational cost. The standard error for the results of 4 and 12-epoch training has been plotted in bands. We see that the NNGP performance is at most as good as 4-epoch training, and is worse than 12-epoch training.

Figure 3: Average discovered performance for 10 randomly sampled sets of 10k networks from NAS-Bench-101. The standard errors for 4-epoch and 12-epoch training are shown in bands.

Comments on Biases of NNGP performance

We found that competitive architectures with poor NNGP performance were mostly “linear" networks, having no or only a small number of residual connections. In figure 4 we demonstrate this bias. In the directed adjacency matrix specifying the architecture, the (i, j)-th elements with j > i + 1 correspond to residual connections. We observe that the relative performance of architectures with a small number of residual connections is poor early in training but improves significantly with extensive training. For such architectures, NNGP predicts poor performance.

Figure 4: Architectural bias of NNGPs observed on NAS-Bench-101. The relative performance of architectures with a small number of residual connections is poor early in training but improves significantly with extensive gradient-based training.

Comments on usage of NTK

NAS aims to find architectures that train well by gradient descent. Since the neural tangent kernel (NTK) Jacot et al. (2018) characterizes gradient descent training of infinitely wide networks, one might expect signals provided by the NTK to be better than the NNGP for NAS. To compare their utility, we computed the empirical NTK validation accuracy for a size-1k subset of networks from NAS-Bench-101. While the NTK validation accuracy attains a non-trivial peak value of Kendall’s Tau against the ground-truth performance, this value is lower than that computed for NNGP performance with equivalent parameters. Moreover, for the same dataset sizes, computing NTK inference is more expensive than NNGP, since the Jacobian with respect to the network parameters needs to be computed for all datapoints. This incurs a compute cost similar to computing gradients for all samples. In practice, computing the full Jacobian also consumes a large amount of memory, and the Vector-Jacobian Product and the Jacobian-Vector Product need to be utilized (see nt.empirical_ntk_fn in Novak et al. (2019b) for a reference).

Figure 5: Comparison of NTK versus NNGP on NAS-Bench-101. NTK is not as competitive as NNGP in terms of Kendall’s Tau for predicting ground-truth performance. A bigger ensemble number M improves the predictive quality of NNGP, while the optimal ensemble number for NTK is 1.

NNGP Performance and Model Size

The models within NAS-Bench-101 have widely varying sizes spanning over an order of magnitude—the smallest model has 2M parameters, while the biggest one has 50M. Given this range, model size is a strong indicator of performance for NAS-Bench-101, as we explore further in section F of the SM. One may thus be concerned that NNGP performance is merely capturing the size of the network, which is a trivial aspect of the neural network architecture. (One can carry out an equivalent analysis with the computational budget, rather than the number of parameters, of each model. For NAS-Bench-101, the model size and computational budget almost perfectly correlate—we thus restrict our discussion to model size with this understanding.)

Figure 6: Kendall’s Tau and Pearson correlation between NNGP validation accuracy and model size plotted against the computational cost to compute the NNGP validation accuracy per network for all networks in NAS-Bench-101. NNGP validation accuracy and model size exhibit low correlation, and can thus be treated as independent predictors of model quality.

In figure 6, we present the Kendall’s Tau and correlation between the NNGP validation accuracies and network size. We find very low correlation. We present further analysis on the model size distribution of NAS-Bench-101 and the utility of performance predictors under model size constraints in section F of the SM.

4.2 ImageNet on MNAS Search Space

In the left-most panel of figure 7, we plot the performance of ten randomly selected networks, and of the top-10 networks selected according to the 5-epoch MNAS reward, the 5-epoch accuracy, and the NNGP validation accuracy indexed by the number of subsampled labels, training set size and validation set size, i.e., (L, N_T, N_V). We find that the best networks according to NNGP validation accuracies perform worse than the best randomly selected network. As we discuss in section 5, this suggests that while NNGP provides a strong signal for whether a network will perform reasonably, it does not on its own identify the top performing networks.

In the latter two panels of figure 7, we give Kendall’s Tau and the correlation between the MNAS reward function computed after 5-epoch training and the NNGP validation accuracies. We see that there is a non-trivial correlation between the two different types of measures.

Figure 7: Result of architecture selection based on 5-epoch training and NNGP inference performance from 10k samples drawn from the MNAS search space. Datasets used for NNGP inference are parameterized by (L, N_T, N_V), where L is the number of subsampled labels.

5 Discussion and exploration of practical use of NNGP inference in NAS

On NAS-Bench-101, we find that Monte-Carlo estimated NNGP inference provides a computationally inexpensive signal that shows comparable utility against the validation accuracy obtained from shortened training. We further find that NNGP inference provides a strong signal for discerning whether a network exceeds median performance, but lags behind shortened training when predicting the hierarchy of top-performers.

This is further exemplified in the experiments conducted in the MNAS search space, where a randomly sampled network already exhibits good performance. To see this point, we note that the worst 5-epoch training validation accuracy we obtain from the 10k networks we sampled from the MNAS search space is 23.69%. This is to be compared with the NAS-Bench-101 networks, for which the average of the worst validation accuracy over 10 sets of 10k random networks after training for 4, 12, 36 and 108 epochs is given by 4.37%, 9.12%, 9.49% and 9.49%, respectively. The quality of the MNAS search space is evident, even before considering the fact that ImageNet is a much more difficult task than CIFAR-10 with a hundred times more labels. Thus the fact that the max performance over the top 10 networks ranked by NNGP is less than that over 10 random networks, despite the NNGP performance being correlated with the much more predictive shortened-training performance, is consistent with what we have observed for the NAS-Bench-101 dataset.

Based on these results, we suggest two ways that NNGP inference can be utilized in NAS. The first is that it can be used to shrink a large architecture search space in which there is a large variance in performance of networks (e.g., Radosavovic et al. (2020)) at a low computational cost. The second is that it can be used as a complementary signal that can improve training-based performance measures.

Example of Search-Space Reduction

Consider 10 randomly selected subsets of the networks in NAS-Bench-101 of size 10k. We compare the average discovered performance on these sets obtained by shortened training on a reduced search space, obtained by selecting the top p% of networks according to NNGP validation accuracy. As baselines, we consider the average discovered performance of shortened training without NNGP screening, as well as the average discovered performance via shortened training on a random p% subset. We experiment with p = 30 and p = 10.
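The screening procedure can be sketched directly, assuming per-network NNGP accuracies, shortened-training accuracies, and ground-truth accuracies are available as parallel lists:

```python
def screened_search(nngp_acc, short_acc, ground_truth, keep_pct=30, k=10):
    """Search-space reduction by NNGP screening.

    Keep the top keep_pct% of networks ranked by NNGP validation accuracy,
    then rank the survivors by shortened-training accuracy and return the
    best ground-truth accuracy among the top-k (the discovered performance).
    """
    n = len(nngp_acc)
    order = sorted(range(n), key=lambda i: nngp_acc[i], reverse=True)
    kept = order[: max(1, n * keep_pct // 100)]              # reduced search space
    top_k = sorted(kept, key=lambda i: short_acc[i], reverse=True)[:k]
    return max(ground_truth[i] for i in top_k)
```

The shortened-training cost is paid only for the kept networks, which is where the computational savings in figure 8 come from.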

The results are shown in figure 8. We find that a 70% reduction of the search space by screening with NNGP performance, for both 12-epoch and 36-epoch training-based random architecture search, does not sacrifice the performance of the search while significantly reducing the computational cost. This is to be contrasted with reducing the search space by random selection, which leads to a marked degradation in performance (orange/black vs. blue in plots). On the other hand, a 90% reduction leads to average performance degradation, with results similar to those obtained from random reduction of the search space. This is consistent with the PQETP-t results, where we found NNGP validation accuracy to be competitive against 4 and 12-epoch training for discerning performance threshold values in the top 29 to 92 percentile. We thus expect performance degradation when the search space size is reduced to significantly below 29% of the original size.

Figure 8: Average discovered performance on 10 sets of 10k networks by search space reduction using NNGP validation accuracy for shortened training plotted against computational cost (viridis). The average discovered performance from 4, 12 and 36-epoch training are plotted in red, orange and black. Results from search space reduction by random selection have been plotted in blue. All networks are in NAS-Bench-101. Standard errors are shown in bands. We see that search space reduction to 30% using NNGP performance retains discovered performance quality with lower computational cost.

Potential Hybrid Performance Metrics

Figure 9: Average discovered performance on 10 sets of 10k networks using hybrid metrics constructed from 4-epoch and NNGP performance (viridis). The average discovered performance from 4 and 12-epoch training are plotted in red and orange, with standard errors shown in bands.

Here, we show that a simple linear model combining 4-epoch training validation accuracy and NNGP validation accuracy produces a better performance metric than 4-epoch training alone, for only a small additional computational cost. We note that we have omitted the computational cost to actually fit the linear model used, as we aim to demonstrate the existence of a hybrid performance metric with improved predictive ability.

We use a linear model with three parameters (including the bias) to fit the 12-epoch validation accuracy against the 4-epoch validation accuracy and each NNGP validation accuracy. By doing so, we obtain a hybrid performance metric, with which we measure the average discovered performance for 10 randomly selected size-10k sets of networks. The results obtained for the hybrid metric are plotted in figure 9. We see that hybrid metrics built out of NNGPs with larger training sets can exhibit statistically significant performance gains with marginal additional computational cost.
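The three-parameter linear model above amounts to an ordinary least-squares fit; a minimal sketch:

```python
import numpy as np

def fit_hybrid_metric(acc_4ep, acc_nngp, acc_12ep):
    """Fit bias + two weights predicting 12-epoch accuracy from the 4-epoch
    and NNGP validation accuracies; returns the hybrid metric as a callable."""
    X = np.stack([np.ones_like(acc_4ep), acc_4ep, acc_nngp], axis=1)
    w, *_ = np.linalg.lstsq(X, acc_12ep, rcond=None)   # ordinary least squares
    return lambda a4, anngp: w[0] + w[1] * a4 + w[2] * anngp
```

In a real search one would fit on a held-out subset of architectures and apply the resulting metric to the rest, which is the setting the discovered-performance comparison in figure 9 probes.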

Broader Impact

Our research aims to reduce the computational cost of evaluating neural network performance and neural architecture search, which would lead to a reduction of the environmental footprint of deep learning research and applications (Strubell et al., 2019).

We thank Yasaman Bahri, Gabriel Bender, Pieter-Jan Kindermans, Quoc V. Le, Esteban Real, Samuel S. Schoenholz and Mingxing Tan for useful discussions.


Appendix A Comments on Batch-Normalization

All convolution cells in NAS-Bench-101 and MNAS utilize batch normalization, to ensure that the search space contains ResNet- and Inception-like cells Ying et al. [2019], and the batch-normalization parameters need to be initialized. We initialize the moving averages in batch-normalization layers using one forward pass on a random batch drawn from the training set. In practice, we use NAS-Bench-101’s default batch-normalization momentum value and use inference mode (training=False) to compute NNGP kernels; thus the batch normalization parameters are not far from the initial values set at zero mean and unit standard deviation.

Figure S1: Kendall’s Tau and correlation results for NNGP with batch norm momentum set to zero.

In figure S1, we compare the result of NNGP inference with the settings used in the paper against that of updating the batch-normalization parameters with statistics of a random subset of the training set by setting the momentum parameter to zero. We observe that while the performance of NNGP increases in expectation, the produced validation accuracy becomes less indicative of the ground-truth performance of the network. In fact, Kendall’s Tau and the correlation with respect to the ground-truth performance when the momentum is set to zero are close to zero, far below the respective numbers for default momentum or partial training, indicating a nearly random chance of predicting the correct ground-truth ordering.

Appendix B Computational Cost

The computational cost of NNGP inference stems from computing the Monte Carlo NNGP kernels (see algorithm 1) and from performing inference with different choices of the regularizer.
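A rough sketch of the two stages, with a random projection plus ReLU standing in for a real finite-width network at initialization (the toy "network" and the function names are illustrative, not the paper's implementation):

```python
import numpy as np

def mc_nngp_kernel(X, ensemble=8, width=64, seed=0):
    """Monte Carlo NNGP kernel estimate: average the Gram matrix of
    penultimate-layer features over an ensemble of randomly initialized
    networks. Each 'network' here is a random projection + ReLU, purely
    for illustration."""
    rng = np.random.default_rng(seed)
    n, f = X.shape
    K = np.zeros((n, n))
    for _ in range(ensemble):
        W = rng.normal(0.0, 1.0 / np.sqrt(f), size=(f, width))
        feats = np.maximum(X @ W, 0.0)      # penultimate-layer features
        K += feats @ feats.T / width
    return K / ensemble

def nngp_predict(K, y_train_onehot, n_train, reg=1e-3):
    """GP regression via a Cholesky solve on the train/train kernel block.
    Rows 0..n_train-1 of K are assumed to be the training samples."""
    Ktt = K[:n_train, :n_train] + reg * np.eye(n_train)
    Kvt = K[n_train:, :n_train]
    L = np.linalg.cholesky(Ktt)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train_onehot))
    return Kvt @ alpha                       # validation-set predictions
```

Only the Cholesky solve needs to be repeated when sweeping the regularizer; the kernel is computed once per architecture.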

Let F_A be the architecture (A) dependent number of FLOPs for inference on a single sample, d the dimension of the feature space of the penultimate layer of the network, M the ensemble number, L the number of labels, and n_t and n_v the sizes of the training and validation sets for the NNGP, respectively. Then the computational FLOPs required for NNGP training are given by

Kernel Evaluation FLOPs = M (n_t + n_v) F_A + M d (n_t + n_v)^2, (S1)

where the first term comes from forward-propagation on the network and the second term comes from the matrix computations used to construct the kernels. Meanwhile, the cost of NNGP inference via Cholesky solves for r distinct values of the regularizer is given by

Inference FLOPs = r (n_t^3 / 3 + n_t^2 L + n_t n_v L).
Adding the two terms, we arrive at the expression

Total NNGP FLOPs = M (n_t + n_v) F_A + M d (n_t + n_v)^2 + r (n_t^3 / 3 + n_t^2 L + n_t n_v L).
Denoting the FLOPs required for training a model per step per sample by T_A, and the sizes of the full training and validation sets by N_t and N_v, training a model for E epochs and carrying out inference costs

Training FLOPs = E N_t T_A + N_v F_A,

where we always train and validate over the full training and validation sets for gradient-based training.
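A back-of-the-envelope comparison of the two costs can be sketched as follows, under an assumed cost model of this general form (the constant factors, example numbers, and function names are illustrative):

```python
def nngp_flops(F_A, d, M, L, n_t, n_v, r):
    """Illustrative NNGP cost model: forward passes to collect features,
    kernel construction, and Cholesky inference for r regularizer values."""
    kernel = M * (n_t + n_v) * F_A + M * d * (n_t + n_v) ** 2
    inference = r * (n_t ** 3 / 3 + n_t ** 2 * L + n_t * n_v * L)
    return kernel + inference

def training_flops(F_A, E, N_train, N_val, T_A=None):
    """Illustrative gradient-training cost model. Training cost per step per
    sample is taken as roughly 3x the inference cost when not supplied."""
    if T_A is None:
        T_A = 3 * F_A
    return E * N_train * T_A + N_val * F_A

# Example: NNGP on a small training set vs. 4 epochs of gradient training
# on a CIFAR-10-sized dataset (numbers chosen for illustration only).
cheap = nngp_flops(F_A=1e9, d=512, M=8, L=10, n_t=1000, n_v=1000, r=10)
expensive = training_flops(F_A=1e9, E=4, N_train=40_000, N_val=10_000)
```

In this regime the NNGP cost is dominated by the forward passes used to collect features, and stays well below even shortened gradient training.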

We have computed the training cost per step per sample T_A and the inference cost per sample F_A for all the networks in the NAS-Bench-101 dataset for multiple batch sizes, and found that the relation T_A ≈ 3 F_A holds to a good approximation (see figure S2), consistent with the backward pass costing roughly twice the forward pass. We thus replace T_A with 3 F_A when evaluating computational cost.

Figure S2: Measured computational cost for each model in NAS-Bench-101. We observe that the training cost per step per sample is approximately three times the inference cost per sample.

Throughout the paper, we use the computational cost per network to assess how expensive each metric is to compute. To do so, we must evaluate F_A and d for NAS-Bench-101. F_A is measured directly (see figure S2), while d = 512 for all networks, as we explain in section C.1.

Appendix C More Details on Search Spaces and Experiment Design

C.1 NAS-Bench-101

A network in the NAS-Bench-101 dataset Ying et al. [2019] is defined by a “cell" architecture, which is parameterized by a labelled directed graph. The labelled directed graph is defined by a set of vertices, whose labels each define an operation (e.g., convolutions or max pooling), and a directed adjacency matrix that specifies how to compose these operations to yield an output. The network is constructed by composing a stem convolutional layer with 128 output channels with 3 repeated applications of blocks, each of which in turn consists of three concatenations of the defined cell, followed by an average-pooling layer and a fully-connected layer. A down-sampling layer, which halves the height and width of the image and doubles the width of the network, is applied in between blocks.

All the networks in the NAS-Bench-101 dataset produce a feature in the penultimate layer of dimension 512. This is because each of the three blocks used for constructing the network preserves the channel number, while each of the two pooling layers situated in between the three blocks doubles the channel number. Since the final average-pooling layer does not change the channel number, starting from the 128-channel output of the stem layer, we arrive at 128 × 2 × 2 = 512 channels for the penultimate layer. This has been verified by explicit inspection of all networks.

All NNGP inference has been carried out on CPUs. All CIFAR-10 images for NNGP have been processed by standardizing the RGB channels using means 125.3, 123.0, 113.9 and standard deviations 63.0, 62.1, 66.7. No augmentations have been applied.
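The channel-wise standardization described above can be written as a short helper (a sketch; the function name is ours):

```python
import numpy as np

# Per-channel CIFAR-10 statistics, as stated above.
CIFAR10_MEAN = np.array([125.3, 123.0, 113.9])
CIFAR10_STD = np.array([63.0, 62.1, 66.7])

def standardize(images):
    """Standardize a batch of HWC-format CIFAR-10 images channel-wise.
    No augmentations are applied, matching the NNGP preprocessing."""
    return (images.astype(np.float32) - CIFAR10_MEAN) / CIFAR10_STD
```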

C.2 MNAS Search Space

The MNAS search space Tan et al. [2019] is a factorized hierarchical search space, where the model is assumed to be composed of seven feed-forward blocks, each of which acts by repeated application of a convolutional layer. It includes multiple configuration parameters for each block, including the convolution operation, kernel size, squeeze-and-excitation ratio Hu et al. [2018], skip operation, filter size, and repetition number (see section 4.1 of Tan et al. [2019] for details).

The “reward" used in the MNAS search is a × (l / l_0)^w, where a is the validation accuracy of the model trained after 5 epochs, l is the latency of the model, l_0 is the target latency, which is set to be milliseconds, and w = -0.07 is the soft-constraint exponent of Tan et al. [2019].
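A sketch of this reward computation; the exponent w = -0.07 is the soft-constraint value reported in Tan et al. [2019] and should be treated as an assumption here:

```python
def mnas_reward(acc, latency_ms, target_ms, w=-0.07):
    """Latency-aware MNAS-style reward: accuracy scaled by a soft latency
    penalty (latency / target)^w. With w < 0, models slower than the target
    are penalized and faster ones are mildly rewarded."""
    return acc * (latency_ms / target_ms) ** w
```

For example, a model exactly at the target latency keeps its raw accuracy as reward, while one at twice the target latency is discounted.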

All NNGP inference has been carried out on CPUs. All ImageNet images for NNGP (including the NNGP training set images) have undergone standard validation processing, i.e., the RGB channels are standardized using means 123.7, 116.3, 103.5 and standard deviations 58.4, 57.1, 57.4, scaled so that the shortest edge has length 256 and then center-cropped to size 224 × 224. No augmentations have been applied.
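This validation pipeline can be sketched as follows; nearest-neighbor resizing stands in for the bilinear resize an image library would perform, so the sketch is illustrative only:

```python
import numpy as np

# Per-channel ImageNet statistics, as stated above.
IMAGENET_MEAN = np.array([123.7, 116.3, 103.5])
IMAGENET_STD = np.array([58.4, 57.1, 57.4])

def resize_shorter_edge(img, target=256):
    """Nearest-neighbor resize so the shorter edge equals `target`
    (a stand-in for proper bilinear resizing)."""
    h, w = img.shape[:2]
    scale = target / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    rows = np.clip((np.arange(nh) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(nw) / scale).astype(int), 0, w - 1)
    return img[rows][:, cols]

def center_crop(img, size=224):
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def preprocess(img):
    """Standard ImageNet validation processing: resize, crop, standardize."""
    img = center_crop(resize_shorter_edge(img))
    return (img.astype(np.float32) - IMAGENET_MEAN) / IMAGENET_STD
```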

Data processing, training and validation for 5-epoch and 400-epoch training have all been conducted in accord with Tan et al. [2019]. 5-epoch training of networks has been carried out using 8 Google Cloud TPU chips, while 400-epoch training was done using 32 Google Cloud TPU chips.

Appendix D NAS-Bench-101 Inference Results

In this section, we present raw NNGP inference results and some basic statistics, along with the reference shortened-training results for NAS-Bench-101.

Figure S3: Ground-truth accuracy plotted against NNGP validation accuracy for all NNGP configurations, for a selected set of 10k networks from the NAS-Bench-101 dataset.
Figure S4: Ground-truth accuracy plotted against shortened-training validation accuracy for a selected set of 10k networks from the NAS-Bench-101 dataset.

In figure S3 we plot the ground-truth accuracy against the NNGP validation accuracy for all NNGP configurations for a selected set of 10k networks. The ground-truth accuracy is plotted against the shortened-training validation accuracies for the same networks in figure S4.

In figure S5 we plot the mean and median validation accuracy obtained from NNGP inference and shortened training.

Figure S5: The mean and median validation accuracy obtained from NNGP inference and shortened training plotted against computational cost.

Appendix E More Performance Measure Plots

In this section, we present some additional analysis on the utility of NNGP validation accuracy as an indicator of ground-truth performance.

Figure S6: Range of ground-truth threshold accuracies for which NNGP performance is a better indicator compared to 4 and 12-epoch training performance, based on PQETP. The cumulative distribution of is plotted for reference. Each colored shaded region indicates the area in hyperparameter space where the corresponding NNGP configuration outperforms 4-ep or 12-ep training.

In figure S6, for each NNGP validation accuracy, we plot the range of threshold accuracies for which its PQETP exceeds that of the validation accuracies obtained from 4- and 12-epoch training. To obtain this plot, we scanned the threshold-accuracy range 0.78 to 0.95 with step size 0.003.

Appendix F Model Size Distribution of NAS-Bench-101

NAS-Bench-101 contains models of wildly varying sizes, the largest model having 20 times as many parameters as the smallest. Given this variance, model size turns out to be a good indicator for selecting top-performing models.

Figure S7 presents an overall view of the size distribution of models in NAS-Bench-101. From the left panel, which plots the ground-truth accuracy against model size for a selected set of 10k networks, we see a correlation between model size and performance. It is also evident that there are multiple clusters of models with respect to size. Meanwhile, from the plot of PQETP against the threshold ground-truth performance depicted in the right panel, we see that model size is a very strong discriminator for threshold performances above the top percentile.

Figure S7: Overview of the size distribution of models in NAS-Bench-101. (left) Ground-truth accuracy plotted against model size for a selected set of 10k networks. (right) PQETP of model size as an indicator for ground-truth validation accuracy plotted against threshold accuracy in grey.

As a consequence, we find that while the overall ranking of performance does not align very well with model size, model size is a surprisingly good discriminator for singling out the top performing networks. This is demonstrated in figure S8, which displays the Kendall’s Tau of model size against ground truth accuracy, and average discovered performance obtained by choosing models based on their size.
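Kendall's Tau itself is straightforward to compute; a minimal O(n^2) tau-a implementation (assuming no ties, for illustration):

```python
import numpy as np

def kendalls_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over total pairs.
    Measures how well the ordering induced by x matches that of y."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    s = 0.0
    for i in range(n):
        # +1 for each concordant pair (i, j>i), -1 for each discordant pair
        s += np.sum(np.sign(x[i + 1:] - x[i]) * np.sign(y[i + 1:] - y[i]))
    return s / (n * (n - 1) / 2)
```

A value of 1 indicates identical orderings, -1 a fully reversed ordering, and values near 0 a near-random relationship, as seen for the zero-momentum batch-norm variant above.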

Figure S8: Kendall’s Tau of model size against ground truth accuracy, and average discovered performance (based on 10 random sets of size 10k) obtained by selecting biggest models plotted in grey against corresponding metrics measured for shortened training and NNGP performance.

Given that there are distinct clusters of models with very different numbers of parameters, the fact that model size is a strong indicator of performance should be taken as a statement about the dataset, rather than about the utility of model size as a performance indicator. An ideal performance metric should be able to distinguish between models of comparable size; indeed, the goal of architecture search is often to find high-performance models under constraints on computational budget, e.g., Tan et al. [2019]. In figure S9, we see that shortened-training performance as well as NNGP validation accuracy satisfy this criterion, taking varying values within each cluster of models.

Figure S9: Model size plotted against shortened-training validation accuracy as well as NNGP validation accuracy for a selected 10k subset of models. We find that these performance metrics make distinct predictions within each model-size cluster.

To examine the utility of each performance metric within a size cluster, let us focus on the cluster of models with fewer than 10M parameters. There are 297k models in this cluster with GFLOPs. As before, we compute Kendall's Tau of shortened-training performance, NNGP performance and model size against the ground-truth performance for these models, and the average discovered performance across ten 10k-size subsets of such models according to each metric. The results are plotted in figure S10. We find that the discovered performance of models selected based on model size declines significantly more than that of models selected based on NNGP or shortened-training accuracy in this setting. Meanwhile, the hierarchy between NNGP and shortened-training performance on both metrics stays largely the same.
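One simple notion of discovered performance can be sketched as follows (select the top models by a cheap proxy metric and report the best ground-truth accuracy among them); the paper's exact aggregation may differ in its details:

```python
import numpy as np

def discovered_performance(proxy, ground_truth, k=10):
    """Best ground-truth accuracy among the top-k models ranked by a cheap
    proxy metric (NNGP accuracy, shortened-training accuracy, model size).
    Averaging this over random candidate subsets gives an estimate of the
    average discovered performance."""
    top = np.argsort(proxy)[-k:]          # indices of top-k by the proxy
    return ground_truth[top].max()
```

A proxy that ranks well near the top of the distribution yields high discovered performance even if its overall rank correlation with the ground truth is modest.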

Figure S10: Kendall’s Tau against ground truth accuracy (left) and average discovered performance (based on 10 random sets of size 10k) (right) computed for shortened training performance, NNGP performance and model size for models with less than 10M parameters.