Neural Architecture Search using Bayesian Optimisation with Weisfeiler-Lehman Kernel

06/13/2020 ∙ by Binxin Ru, et al. ∙ University of Oxford

Bayesian optimisation (BO) has been widely used for hyperparameter optimisation but its application in neural architecture search (NAS) is limited due to the non-continuous, high-dimensional and graph-like search spaces. Current approaches either rely on encoding schemes, which are not scalable to large architectures and ignore the implicit topological structure of architectures, or use graph neural networks, which require additional hyperparameter tuning and a large amount of observed data, which is particularly expensive to obtain in NAS. We propose a neat BO approach for NAS, which combines the Weisfeiler-Lehman graph kernel with a Gaussian process surrogate to capture the topological structure of architectures, without having to explicitly define a Gaussian process over high-dimensional vector spaces. We also harness the interpretable features learnt via the graph kernel to guide the generation of new architectures. We demonstrate empirically that our surrogate model is scalable to large architectures and highly data-efficient; competing methods require 3 to 20 times more observations to achieve equally good prediction performance as ours. We finally show that our method outperforms existing NAS approaches to achieve state-of-the-art results on NAS datasets.


1 Introduction

Neural architecture search (NAS) is a popular recent research direction that aims to automate the design of good neural network architectures for a given task/dataset. Neural architectures found via different NAS strategies have demonstrated state-of-the-art performance, outperforming human experts' designs on a variety of tasks Real et al. (2017); Zoph and Le (2017); Cai et al. (2018); Liu et al. (2018b, a); Luo et al. (2018); Pham et al. (2018); Real et al. (2018); Zoph et al. (2018a); Xie et al. (2018). Similar to hyperparameter optimisation, NAS can often be formulated as a black-box optimisation problem Elsken et al. (2018), and evaluation of the objective can be very expensive as it involves training the queried architecture. Under this setting, query efficiency is highly valued Kandasamy et al. (2018b); Elsken et al. (2018); Shi et al. (2019), and Bayesian optimisation (BO), which has been successfully applied to hyperparameter optimisation Snoek et al. (2012); Bergstra et al. (2011, 2013); Hutter et al. (2011); Klein et al. (2016); Falkner et al. (2018); Chen et al. (2018), becomes a natural choice to consider.

Yet, conventional BO methods cannot be applied directly to NAS because the popular cell-based search spaces are non-continuous and graph-like Zoph et al. (2018a); Ying et al. (2019); Dong and Yang (2020); each possible architecture can be represented as a directed acyclic graph whose nodes correspond to operation units/layers and whose edges represent the connections among these units Ying et al. (2019). To accommodate such search spaces, Kandasamy et al. (2018b) and Jin et al. (2019) propose distance metrics for comparing architectures, thus enabling the use of Gaussian process (GP)-based BO for NAS, while Ying et al. (2019) and White et al. (2019) encode architectures with a high-dimensional vector of discrete and categorical variables. However, both the distance metrics and the encoding schemes are not scalable to large architectures White et al. (2019), and they overlook the topological structure of the architectures Shi et al. (2019), which can be important Xie et al. (2019). Another line of work adopts graph neural networks (GNNs) in combination with Bayesian linear regression as the BO surrogate model to predict architecture performance Ma et al. (2019); Zhang et al. (2019); Shi et al. (2019). These approaches treat architectures as attributed graph data and consider the graph topology of architectures, but the GNN design introduces additional hyperparameter tuning, and training the GNN also requires a relatively large amount of architecture data. Moreover, applying Bayesian linear regression on the features extracted by a GNN is not a principled way to obtain model uncertainty compared to a GP, and often leads to poor uncertainty estimates Springenberg et al. (2016). The extracted features are also hard to interpret, and thus not helpful for guiding practitioners to generate new architectures.

In view of the above limitations, we propose a new BO-based NAS approach which uses a GP surrogate in combination with a graph kernel. It naturally handles the graph-like architecture search spaces and takes into account the topological structure of architectures. Meanwhile, the surrogate preserves the merits of GPs in data efficiency, uncertainty computation and automated treatment of the surrogate hyperparameters. Specifically, our main contributions can be summarised as follows.

  • We introduce a GP-based BO strategy for NAS, NAS-BOWL, which is highly query-efficient and amenable to the graph-like NAS search spaces. Our proposed surrogate model combines a GP with a Weisfeiler-Lehman subtree (WL) graph kernel to exploit the implicit topological structure of architectures. It is scalable to large architecture cells (e.g. 32 nodes) and achieves better prediction performance than GNN-based surrogates with much less training data.

  • We harness the interpretable graph features extracted by the WL graph kernel and propose to learn their effects on architecture performance from the surrogate gradient information. We then demonstrate the usefulness of this in one application: guiding architecture mutation/generation when optimising the acquisition function.

  • We empirically demonstrate that our surrogate model achieves superior prediction performance with far fewer observations in NAS search spaces of different sizes. We finally show that our search strategy achieves state-of-the-art performance on both NAS-Bench datasets.

2 Preliminaries

2.1 Graph Representation of Neural Networks

Architectures in popular NAS search spaces can be represented as a directed acyclic graph Elsken et al. (2018); Zoph et al. (2018b); Ying et al. (2019); Dong and Yang (2020); Xie et al. (2019) where each graph node represents an operation unit or layer (e.g. a conv3×3-bn-relu¹ in Ying et al. (2019)) and each edge defines the information flow from one layer to another. With this representation, NAS can be formulated as an optimisation problem over the directed graph and its corresponding node operations (i.e. the directed attributed graph $G$) to find the architecture with the best validation performance $y$:

$$G^* = \arg\max_{G \in \mathcal{G}} y(G).$$

¹A sequence of operations: convolution with filter size 3×3, batch normalisation and ReLU activation.

2.2 Bayesian Optimisation and Gaussian Processes

To solve the above optimisation, we adopt BO, which is a query-efficient technique for optimising a black-box, expensive-to-evaluate objective Brochu et al. (2010). BO uses a statistical surrogate to model the objective and builds an acquisition function based on the surrogate. The next query location is recommended by optimising the acquisition function, which balances exploitation and exploration. We use a GP as the surrogate model in this work, as it can achieve competitive modelling performance with a small number of query data Williams and Rasmussen (2006) and gives an analytic predictive posterior mean and variance for a test graph $G_*$:

$$\mu(G_*) = \mathbf{k}_*^T [K + \sigma_n^2 I]^{-1} \mathbf{y}, \qquad \sigma^2(G_*) = k(G_*, G_*) - \mathbf{k}_*^T [K + \sigma_n^2 I]^{-1} \mathbf{k}_*,$$

where $\mathbf{y}$ collects the observed objective values, $[\mathbf{k}_*]_i = k(G_*, G_i)$ and $[K]_{ij} = k(G_i, G_j)$, with $k(\cdot,\cdot)$ being the graph kernel function and $\sigma_n^2$ the noise variance. We experiment with Expected Improvement Mockus et al. (1978) as the acquisition function in this work, though our approach is compatible with alternative choices.
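As a concrete illustration of the surrogate and acquisition function described above, the sketch below computes the GP posterior and Expected Improvement from a precomputed graph Gram matrix; it is a minimal sketch with illustrative names (gp_posterior, expected_improvement) and a small assumed noise term, not the released implementation.

```python
import numpy as np
from scipy.stats import norm

def gp_posterior(K, k_star, k_star_star, y, noise=1e-6):
    """GP predictive mean/variance for one candidate graph.

    K           : (N, N) Gram matrix over observed architectures.
    k_star      : (N,)   kernel values between the candidate and the observations.
    k_star_star : float  kernel value of the candidate with itself.
    y           : (N,)   observed (normalised) validation accuracies.
    """
    K_noisy = K + noise * np.eye(len(y))
    alpha = np.linalg.solve(K_noisy, y)          # (K + sigma^2 I)^{-1} y
    v = np.linalg.solve(K_noisy, k_star)
    mean = k_star @ alpha
    var = max(k_star_star - k_star @ v, 1e-12)   # guard against numerical negatives
    return mean, var

def expected_improvement(mean, var, best_y):
    """Expected Improvement for maximising validation accuracy."""
    std = np.sqrt(var)
    z = (mean - best_y) / std
    return (mean - best_y) * norm.cdf(z) + std * norm.pdf(z)
```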

2.3 Graph Kernels

Graph kernels are kernel functions defined over graphs to compute their level of similarity. A generic graph kernel may be represented by the function $k(G, G')$ over a pair of graphs $G$ and $G'$ Kriege et al. (2020):

$$k(G, G') = \langle \phi(G), \phi(G') \rangle_{\mathcal{H}} \quad (2.1)$$

where $\phi(G)$ represents some vector embedding/features of the graph extracted by the graph kernel and $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ represents an inner product in the reproducing kernel Hilbert space (RKHS) Nikolentzos et al. (2019); Kriege et al. (2020). For more detailed reviews on graph kernels, the reader is referred to Nikolentzos et al. (2019) and Kriege et al. (2020).

3 Proposed Method

Here we discuss two key aspects of our BO-based NAS strategy while the overall algorithm is presented in App. A.

3.1 Graph Kernel Design

Our primary proposal is to use an elegant graph kernel to circumvent the aforementioned limitations of GP-based BO in the NAS setting and enable the direct definition of a GP surrogate on the graph-like search space. This construction preserves desirable properties of GPs, such as principled uncertainty estimation, which is key for BO, and allows the user to deploy the rich advances in GP-based BO (including parallel computation Ginsbourger et al. (2010); Snoek et al. (2012); González et al. (2016); Hernández-Lobato et al. (2017); Kandasamy et al. (2018a); Alvi et al. (2019), multi-objective optimisation Emmerich and Klinkenberg (2008); Lyu et al. (2018); Hernández-Lobato et al. (2016); Paria et al. (2018) and transfer learning Wistuba et al. (2016); Poloczek et al. (2016); Wang et al. (2018); Wistuba et al. (2018); Feurer et al. (2018)) on NAS problems. The kernel choice encodes our prior knowledge of the objective and is crucial in GP modelling. Here we opt to base our GP surrogate on the Weisfeiler-Lehman (WL) graph kernel family Shervashidze et al. (2011) (we term our surrogate GPWL). In this section, we first illustrate the mechanism of the WL kernel, followed by our rationale for choosing it.

Figure 1: Illustration of one WL iteration on a NAS-Bench-101 cell. Given two architectures at initialisation, the WL kernel first collects the neighbourhood labels of each node (Step 1) and compresses the collected original labels, i.e. the $h=0$ features at Initialisation, into $h=1$ features (Step 2). Each node is then relabelled with the compressed label/feature (Step 3) and the two graphs are compared based on the histograms of both the $h=0$ and $h=1$ features (Step 4). This WL iteration is repeated until $h = H$; $h$ is both the index of the WL iteration and the depth of the subtree features extracted. Substructures at $h=0$ and $h=1$ in Arch A are shown in the middle right of the plot.

The WL kernel compares two directed graphs based on both local and global structures. It starts by comparing the node labels of both graphs (the $h=0$ features 0 to 4 in Fig. 1) via a base kernel $k_{base}\big(\phi_0(G), \phi_0(G')\big)$, where $\phi_h(G)$ denotes the histogram of depth-$h$ features in the graph and $h$ is the index of the WL iteration and the depth of the subtree features extracted. For the WL kernel with $H \geq 1$, it then proceeds to collect the $h=1$ features following Steps 1 to 3 in Fig. 1 and compares the two graphs with $k_{base}\big(\phi_1(G), \phi_1(G')\big)$ based on the subtree structures of depth 1 Shervashidze et al. (2011); Höppner and Jahnke (2020). The procedure then repeats until the highest iteration level $H$ specified, and the resultant WL kernel is given by:

$$k_{WL}(G, G') = \sum_{h=0}^{H} w_h \, k_{base}\big(\phi_h(G), \phi_h(G')\big) \quad (3.1)$$

where $k_{base}$ is a base kernel specified by the user (a simple example is the dot product of the feature embeddings, $\phi_h(G)^T \phi_h(G')$) and $\{w_h\}$ contains the weights associated with each WL iteration $h$. We follow the convention in Shervashidze et al. (2011) to set all the weights equal. Note that, as a result of the WL label reassignment, the node labels in Arch A at initialisation ($h=0$) are different from those in Arch A in Step 3 ($h=1$); $h=0$ features represent subtrees of depth 0 while $h=1$ features are subtrees of depth 1. In this way, as $h$ increases, the WL kernel captures higher-order features which correspond to increasingly larger neighbourhoods (see App. A for an algorithmic description of WL).
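To make the mechanism above concrete, below is a minimal sketch of a WL subtree kernel for small labelled directed graphs with a dot-product base kernel and equal weights. Representing graphs as label dictionaries plus directed edge lists, and using incoming neighbours as the neighbourhood, are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter

def wl_features(labels, edges, H=1):
    """Collect WL subtree feature histograms for h = 0..H.

    labels : dict  node -> operation label (e.g. 'conv3x3-bn-relu')
    edges  : list  of directed (u, v) pairs meaning u -> v
    """
    labels = dict(labels)
    histograms = [Counter(labels.values())]        # h = 0 features
    for _ in range(H):
        new_labels = {}
        for node in labels:
            # neighbourhood of `node`: labels of nodes feeding into it (a choice made for illustration)
            neigh = sorted(labels[u] for (u, v) in edges if v == node)
            new_labels[node] = labels[node] + '|' + ','.join(neigh)  # compressed label
        labels = new_labels
        histograms.append(Counter(labels.values()))
    return histograms

def wl_kernel(g1, g2, H=1):
    """k_WL(G, G') = sum_h <phi_h(G), phi_h(G')> with equal weights."""
    h1, h2 = wl_features(*g1, H=H), wl_features(*g2, H=H)
    return sum(sum(c1[f] * c2[f] for f in c1) for c1, c2 in zip(h1, h2))

# Toy cells: input -> conv3x3 -> output  vs  input -> maxpool3x3 -> output
a = ({0: 'input', 1: 'conv3x3', 2: 'output'}, [(0, 1), (1, 2)])
b = ({0: 'input', 1: 'maxpool3x3', 2: 'output'}, [(0, 1), (1, 2)])
print(wl_kernel(a, b, H=1))   # shared 'input'/'output' labels and one shared h=1 subtree
```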

We argue that WL kernel is a desirable choice for the NAS application for the following reasons.


  1. WL kernel is able to compare labeled and directed graphs of different sizes. As discussed in Section 2.1, architectures in almost all popular NAS search spaces Ying et al. (2019); Dong and Yang (2020); Zoph et al. (2018b); Xie et al. (2019) can be represented as directed graphs with node/edge attributes, so the WL kernel can be directly applied to them. On the other hand, many graph kernels either do not handle node labels Shervashidze et al. (2009), or are incompatible with directed graphs Kondor and Pan (2016); de Lara and Pineau (2018). Converting architectures into undirected graphs can result in the loss of valuable information such as the direction of data flow in the architecture (we show this in Section 5.1).

  2. WL kernel is expressive yet highly interpretable. The WL kernel is able to capture substructures that go from local to global scale as $h$ increases. Such multi-scale comparison is similar to that enabled by a Multiscale Laplacian Kernel Kondor and Pan (2016) and is desirable for architecture comparison. This is in contrast to graph kernels such as Kashima et al. (2003); Shervashidze et al. (2009), which only focus on local substructures, or those based on graph spectra de Lara and Pineau (2018), which only look at global connectivity. Furthermore, the WL kernel is derived directly from the Weisfeiler-Lehman graph isomorphism test Weisfeiler and Lehman (1968), which is shown to be as powerful as a GNN in distinguishing non-isomorphic graphs Morris et al. (2019); Xu et al. (2018). However, the higher-order graph features extracted by GNNs are hard for humans to interpret, whereas the subtree features learnt by the WL kernel (e.g. the $h=0$ and $h=1$ features in Figure 1) are easily interpretable. As we will discuss in Section 3.2, we can harness the surrogate gradient information on low-$h$ substructures to identify the effect of particular node labels on the architecture performance and thus learn useful information to guide new architecture generation.

  3. WL kernel is relatively efficient and scalable. Other expressive graph kernels are often prohibitive to compute: for example, defining $n$ and $m$ to be the number of nodes and edges in a graph, the random walk Gärtner et al. (2003), shortest path Borgwardt and Kriegel (2005) and graphlet kernels Shervashidze et al. (2009) incur complexities of $\mathcal{O}(n^3)$, $\mathcal{O}(n^4)$ and $\mathcal{O}(n^k)$ respectively, where $k$ is the maximum graphlet size. Another approach based on computing the architecture edit-distance Jin et al. (2019) is also expensive: its exact solution is NP-complete Zeng et al. (2009) and is provably difficult to approximate Lin (1994). On the other hand, the WL kernel only entails a complexity² of $\mathcal{O}(Hm)$ Shervashidze et al. (2011). Empirically, we find that in typical NAS search spaces (such as the NAS-Bench datasets) featuring rather small cells, a small $H$ usually suffices (even in the deliberately large cell search space we construct later, a small $H$ is sufficient) – this implies that the kernel computing cost is likely eclipsed by the complexity of GPs, not to mention that the main bottleneck of NAS is the actual training of the architectures. The scalability of WL is also to be contrasted with other approaches such as an encoding of all input-output paths White et al. (2019), which without truncation scales exponentially with the size of the cell.

²Consequently, naively computing the Gram matrix consisting of pairwise kernels between all pairs in $N$ graphs costs $\mathcal{O}(N^2 Hm)$, but this can be further improved to $\mathcal{O}(NHm + N^2Hn)$. See Morris et al. (2019).

With the above-mentioned merits, the incorporation of the WL kernel permits the use of GP-based BO on various NAS search spaces. This enables practitioners to harness the rich literature of GP-based BO methods for hyperparameter optimisation and redeploy them on NAS problems. Meanwhile, the use of a GP surrogate frees us from hand-picking the WL kernel hyperparameter $H$, as we can automatically learn its optimal value by maximising the Bayesian marginal likelihood. This leads to a method with almost no inherent hyperparameters that require manual tuning. We empirically justify the superior prediction performance of our GP surrogate with a WL kernel against other graph kernels and GNNs in Section 5.1. Note that we may further improve the expressiveness of the surrogate by adding multiple types of kernels together, especially if the kernels used capture different aspects of graph information. We briefly investigate this in App. B and find that the extent of the performance gain depends on the NAS search space; a WL kernel alone can be sufficient for common cell-based spaces. We leave a comprehensive evaluation of such kernel combinations to future work.
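As a sketch of how the WL hyperparameter could be learnt automatically, the snippet below scores a small grid of candidate $H$ values by the standard GP log marginal likelihood and keeps the best; the grid, the fixed noise term and the gram_fn helper are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise=1e-6):
    """Standard zero-mean GP log marginal likelihood."""
    K_noisy = K + noise * np.eye(len(y))
    L = np.linalg.cholesky(K_noisy)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2 * np.pi))

def select_wl_depth(graphs, y, gram_fn, candidate_H=(0, 1, 2, 3)):
    """Pick the WL depth H whose Gram matrix maximises the marginal likelihood.

    gram_fn(graphs, H) is assumed to return the (N, N) WL Gram matrix.
    """
    y = np.asarray(y, dtype=float)
    scores = {H: log_marginal_likelihood(gram_fn(graphs, H), y) for H in candidate_H}
    return max(scores, key=scores.get)
```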

3.2 Interpretability and Gradient-guided Architecture Generation

In the preceding section, we elaborated on the advantages of using WL graph kernels to make the NAS search space amenable to GP-based BO. A key advantage of WL identified in Section 3.1 is that it extracts interpretable features. Here, we demonstrate that its integration with the GP surrogate further allows us to distinguish the relative impact of these features on the architecture performance. This could potentially be a starting point towards explainable NAS and, practically, can be helpful for practitioners who are interested not only in finding good-performing architectures but also in how to modify an architecture to further improve its performance.

To assess the effect of the extracted features, we propose to utilise the derivatives of the GP predictive mean. Derivatives as tools for interpretability have been used previously Engelbrecht et al. (1995); Koh and Liang (2017); Ribeiro et al. (2016), but the GP surrogate and WL kernel in our method mean that we may compute derivatives with respect to the interpretable features analytically. Formally, the derivative with respect to the $j$-th element of $\phi_*$ (the feature vector of a test graph $G_*$) is also Gaussian and has an expected value:

$$\mathbb{E}\left[\frac{\partial \mu(G_*)}{\partial \phi_{*,j}}\right] = \frac{\partial \mathbf{k}_*^T}{\partial \phi_{*,j}} [K + \sigma_n^2 I]^{-1} \mathbf{y} \quad (3.2)$$

where $[K + \sigma_n^2 I]^{-1}\mathbf{y}$ has the same meaning as in Section 2.2 and $\Phi$ is the feature matrix stacked from the feature vectors of the previous observations (so that, for a dot-product base kernel, $\mathbf{k}_* = \Phi \phi_*$ and the expected derivative is the $j$-th element of $\Phi^T [K + \sigma_n^2 I]^{-1}\mathbf{y}$). Intuitively, since each $\phi_{*,j}$ denotes the count of a WL feature in $G_*$, its derivative naturally encodes the direction and sensitivity of the GP objective (in this case the predicted validation accuracy) with respect to that particular feature.
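A minimal sketch of this derivative for the dot-product base kernel, where the cross-kernel vector is $\mathbf{k}_* = \Phi \phi_*$ and the expected gradient reduces to $\Phi^T [K + \sigma_n^2 I]^{-1}\mathbf{y}$; the variable names and the noise value are illustrative.

```python
import numpy as np

def feature_gradients(Phi, y, noise=1e-6):
    """Expected gradient of the GP posterior mean w.r.t. each WL feature count.

    Phi : (N, D) matrix whose rows are WL feature vectors of observed graphs.
    y   : (N,)   observed validation accuracies.
    Returns a (D,) vector; entry j is d(mean)/d(phi_*j) for a test graph.
    """
    K = Phi @ Phi.T                                       # dot-product base kernel Gram matrix
    alpha = np.linalg.solve(K + noise * np.eye(len(y)), y)
    return Phi.T @ alpha                                  # eq. (3.2) with k_* = Phi @ phi_*
```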

We illustrate the usefulness of the gradient information with an example on $h=0$ features³ in Fig. 2. We randomly sample 100 architecture data points to train our GPWL surrogate and reserve another 500 samples as the validation set, for both the NAS-Bench-101 and NAS-Bench-201 datasets. We evaluate the gradients with respect to the $h=0$ features (i.e. the node operation types) on the validation set and compute the average gradient among graphs with the same number of occurrences of a specific node feature. The mean gradient results for different node feature types are shown in Fig. 2(a) and 2(c), while the validation accuracies achieved by architectures containing different numbers of occurrences of various node feature types are shown in Fig. 2(b) and 2(d). For example, in Fig. 2(a), the gradient of maxpool33 at 2 occurrences is the average gradient of the GP posterior mean with respect to that node feature across all validation architectures which contain 2 maxpool33 nodes. The corresponding point in Fig. 2(b) is the average ±1 standard deviation of the validation accuracy across the same group of architectures.

³We choose $h=0$ features for ease of illustration; our method can be applied to WL features of higher $h$ without loss of generality.

(a) GP derivatives in N101
(b) Valid acc. in N101
(c) GP derivatives in N201
(d) Valid acc. in N201
Figure 2: Mean derivatives and the corresponding mean ±1 standard deviation of validation accuracy for different node feature types in the NAS-Bench-101 (N101) (a)(b) and NAS-Bench-201 (N201) (c)(d) datasets. The x-axis is the number of nodes of a given node feature type in the architectures. Note that architectures containing more negative-derivative features tend to have lower accuracy and vice versa; thus, the derivative information is useful in assessing the effects of node features.

Fig. 2 clearly shows the effectiveness of our surrogate gradient information in assessing the impact of various node operation features: for example, in the NAS-Bench-101 search space, the conv33-bn-relu operation always has positive gradients in Fig. 2(a), which informs us that having more such features likely contributes to better accuracy. Conversely, the negative gradients of conv11-bn-relu suggest that this operation is relatively undesirable, whereas the near-zero gradients of maxpool33 suggest that this operation has little impact on the architecture performance. These observations are confirmed by the accuracy plots in Fig. 2(b), and similar results are observed on the NAS-Bench-201 data. For various features, the average gradient tapers to zero as the number of occurrences increases. This is due either to the diminishing marginal gain of having more of a certain feature in the architecture (e.g. the conv33-bn-relu results in Fig. 2(b)) or to the increasingly rare observations of such extreme architectures, which make the posterior mean of the GP surrogate converge to the prior mean of zero.

Interpretability offered here can be useful in many respects. Here, we demonstrate one example: harnessing the feature gradient information to generate candidate architectures for acquisition function optimisation. Under the NAS setting, optimising the acquisition function over the search space can be challenging Kandasamy et al. (2018b); Ma et al. (2019); White et al. (2019), because the non-continuous search space makes it ineffective to use analytic gradients, and exhaustively evaluating all possible architectures is computationally unviable. A way to generate a population of candidate architectures for acquisition function optimisation at each BO iteration is necessary for all BO-based NAS strategies. The naïve way to do so is to randomly sample architectures from the search space Ying et al. (2019); Yang et al. (2020), while a better alternative is based on genetic mutation, which generates the candidate architectures by mutating a small pool of parent architectures Kandasamy et al. (2018b); Ma et al. (2019); White et al. (2019); Shi et al. (2019). We build on the genetic mutation approach, but instead of using random mutation, we propose to use the gradient information provided by GPWL to guide the mutation in a more informed way. A high-level comparison between these approaches is shown in App. C. Specifically, we transform the gradients into pseudo-probabilities defining the chance of mutation for each node and edge in the architecture. A sub-feature whose gradient is strongly positive, and which thus contributes positively to the validation accuracy, will have a lower probability of mutation. Once a node or an edge is chosen for mutation, we reuse the gradient information on its possible change options to define their corresponding probabilities of being chosen. The detailed procedure is described with reference to an example cell in App. C. In summary, we propose a new way to learn which architecture component to modify, as well as how to modify it to improve performance, by making use of the interpretable features extracted by our GPWL surrogate.

4 Related work

Recently, there have been several attempts to use BO for NAS Kandasamy et al. (2018b); Ying et al. (2019); Ma et al. (2019); Shi et al. (2019); White et al. (2019). To overcome the limitations of conventional BO on non-continuous and graph-like NAS search spaces, Kandasamy et al. (2018b) proposes a similarity measure among neural architectures based on optimal transport to enable the use of a GP surrogate, while Ying et al. (2019) and White et al. (2019) suggest encoding schemes to characterise neural architectures with a vector of discrete and categorical variables. Yet, the proposed kernel Kandasamy et al. (2018b) can be slow to compute for large architectures Shi et al. (2019), and such encoding schemes are not scalable to search cells with a large number of nodes White et al. (2019). Alternatively, several works use graph neural networks (GNNs) as the surrogate model Ma et al. (2019); Zhang et al. (2019); Shi et al. (2019) to capture the graph structure of neural architectures. However, the design of the GNN introduces many additional hyperparameters to be tuned, and the GNN requires a relatively large amount of training data to achieve decent prediction performance, as shown in Section 5.1. The most related work is Ramachandram et al. (2018), which applied GP-based BO with graph-induced kernels to design multimodal fusion deep neural networks. Ramachandram et al. (2018) assigns each possible architecture in the search space to a node in an undirected super-graph and uses a diffusion kernel to capture similarities between the nodes. The need to construct and compute on this large super-graph limits the application of the method to relatively small search spaces. In contrast, we model each architecture in the search space as an individual directed graph and propose to compare pairs of graphs with the WL kernel. Such a setup allows our method to act on larger and more complex architectures and to capture data-flow directions in architectures. Our approach of comparing graphs without the need to reference a super-graph is also computationally cheaper. Beyond the BO-NAS literature, there are also several works that apply graph kernels in BO Oh et al. (2019); Cui and Yang (2018); Shiraishi et al. (2019). However, all of them focus on the undirected graph setting, which is very different from our directed-graph NAS problem, and none of them investigates the WL graph kernel family.

Kernel  Complexity   N101         CIFAR10      CIFAR100     ImageNet16   Flower-102
WL      O(Hm)        0.862±0.03   0.812±0.06   0.823±0.03   0.796±0.04   0.804±0.018
RW      O(n³)        0.801±0.04   0.809±0.04   0.782±0.06   0.795±0.03   0.759±0.04
SP      O(n⁴)        0.801±0.05   0.792±0.06   0.761±0.06   0.762±0.08   0.694±0.08
MLP     see note     0.458±0.07   0.412±0.15   0.519±0.14   0.538±0.07   0.492±0.12
Note: the complexity of the MLP kernel additionally depends on the number of neighbours, a hyperparameter of the MLP kernel.
Table 1: Regression performance (in terms of rank correlation) of different graph kernels.

5 Experiments

5.1 Surrogate Regression Performance

We examine the regression performance of GPWL on the available NAS datasets: NAS-Bench-101 on CIFAR10 (denoted N101) Ying et al. (2019) and NAS-Bench-201 on CIFAR10, CIFAR100 and ImageNet16 Dong and Yang (2020) (denoted by their respective image datasets hereafter). However, recognising that both datasets only contain CIFAR-sized images and relatively small architecture cells⁴, as a further demonstration of the scalability of our proposed method to much larger architectures, we also construct a dataset with 547 architectures sampled from the randomly wired graph generator described in Xie et al. (2019); each architecture cell has 32 operation nodes and all the architectures are trained on the Flowers102 dataset Nilsback and Zisserman (2008) (we denote this dataset as Flower102 hereafter). Similar to Ying et al. (2019); Dong and Yang (2020); Shi et al. (2019), we use Spearman's rank correlation between the predicted validation accuracy and the true validation accuracy as the performance metric, because in NAS what matters is the relative ranking among different architectures.

⁴In NAS-Bench-101 and NAS-Bench-201, each cell is a graph of 7 and 4 nodes, respectively.
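For reference, one trial of this metric might be computed as in the sketch below, using scipy's spearmanr; the example numbers are purely illustrative.

```python
from scipy.stats import spearmanr

def rank_correlation(predicted_acc, true_acc):
    """Spearman's rank correlation between predicted and ground-truth accuracies."""
    rho, _ = spearmanr(predicted_acc, true_acc)
    return rho

# Example: a perfect ranking, despite differently scaled values, gives rho = 1.0
print(rank_correlation([0.90, 0.85, 0.70], [0.94, 0.93, 0.80]))
```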

Comparison with other graph kernels

We first compare the performance of WL kernel against other popular graph kernels such as (fast) Random Walk (RW) Kashima et al. (2003); Gärtner et al. (2003), Shortest-Path (SP) Borgwardt and Kriegel (2005), Multiscale Laplacian (MLP) Kondor and Pan (2016) kernels when combined with GPs. These competing graph kernels are chosen because they represent distinct graph kernel classes and are suitable for NAS search space with small or no modifications. In each NAS dataset, we randomly sample 50 architecture data to train the GP surrogate and use another 400 architectures as the validation set to evaluate the rank correlation between the predicted and the ground-truth validation accuracy.

We repeat each trial 20 times, and report the mean and standard error for all the kernel choices on all NAS datasets in Table 1. We also include the worst-case complexity of the kernel computation between a pair of graphs in the table. The results in this section justify our reasoning in Section 3.1; combined with the interpretability benefits we discussed, WL consistently outperforms the other kernels across search spaces while retaining modest computational costs. RW is often a close competitor, but its computational complexity is worse and it does not always converge. MLP, which requires us to convert directed graphs to undirected graphs, performs poorly, thereby validating that directional information is highly important. Finally, in BO, uncertainty estimates are as important as the predictive accuracy itself; we show that GPWL produces sound uncertainty estimates in App. D.

Comparison with GNN and alternative GP surrogate

We then compare the regression performance of our GPWL surrogate against two competitive baselines: GNN Shi et al. (2019), which uses a combination of a graph convolutional network and a final Bayesian linear regression layer as the surrogate, and COMBO Oh et al. (2019)⁵, which uses GPs with a graph diffusion kernel on combinatorial graphs. We follow the same set-up described above but repeat the experiments with a varying number of training data points to evaluate the data-efficiency of the different surrogates. It is evident from Fig. 3 that our GPWL surrogate clearly outperforms both competing methods on all the NAS datasets with much less training data. Specifically, GPWL requires at least 3 times fewer data than GNN and at least 10 times fewer data than COMBO on the NAS-Bench-201 datasets. It is also able to achieve high rank correlation on datasets with larger search spaces such as NAS-Bench-101 and Flowers102, while requiring 20 times fewer data than GNN on Flowers102 and over 30 times fewer data on NAS-Bench-101.

⁵We choose COMBO as it uses a GP surrogate with different kernel choices and is also very close to the most related work, Ramachandram et al. (2018), whose implementation is not publicly available.

(a) CIFAR10
(b) CIFAR100
(c) ImageNet16
(d) N101
(e) Flowers102
Figure 3: Mean rank correlation achieved by GPWL, GNN and COMBO surrogates across 20 trials on different datasets. Error bars denote standard error.

5.2 Architecture Search on NAS-Bench datasets

(a) N101
(b) CIFAR10
(c) CIFAR100
(d) ImageNet16
(e) N101
(f) CIFAR10
(g) CIFAR100
(h) ImageNet16
Figure 4: Median validation error on NAS-Bench datasets with deterministic (top row) and stochastic (bottom row) validation errors from 20 trials. Shades denote standard error.

We further benchmark our proposed NAS approach, NAS-BOWL, against a range of existing methods, including random search, TPE Bergstra et al. (2011), Reinforcement Learning (RL) Zoph and Le (2016), BO with SMAC (SMACBO) Hutter et al. (2011), regularised evolution Real et al. (2019) and BO with a GNN surrogate (GCNBO) Shi et al. (2019). On NAS-Bench-101, we also include BANANAS White et al. (2019), where the authors claim state-of-the-art performance. In both NAS-Bench datasets, validation errors over different random seeds are provided, thereby creating noisy objective function observations. We perform experiments using both the deterministic setup described in White et al. (2019), where the validation errors over multiple random initialisations are averaged to give a deterministic objective function, and also report results with noisy/stochastic objective functions. We show the validation results for both setups in Figure 4 and the corresponding test results in App. E. In these figures, we use NASBOWLm and NASBOWLr to denote the NAS-BOWL variants with candidate architectures generated by the algorithm described in Section 3.2 and by random sampling, respectively. Similarly, BANANASm and BANANASr represent the BANANAS results using mutation and random sampling in White et al. (2019). The reader is referred to App. E for our setup details.

It is evident that NAS-BOWL outperforms all baselines on all NAS-Bench tasks, achieving both the lowest validation and the lowest test errors. The recent neural-network-based methods such as BANANAS and GCNBO are often the strongest competitors, but we emphasise that, unlike our approach, these methods inevitably introduce a number of extra hyperparameters whose tuning is non-trivial, and they have poorly calibrated uncertainty estimates Springenberg et al. (2016). The experiments with stochastic errors further show that even in the more challenging setup with noisy objective function observations, NAS-BOWL still performs very well, as it inherits the GP model's robustness against noisy data. Finally, we perform ablation studies on NAS-BOWL in App. E.

6 Conclusion

In this paper, we propose a novel BO-based architecture search strategy, NAS-BOWL, which uses a GP surrogate with the WL graph kernel to handle architecture inputs. We show that our method achieves superior prediction performance across various tasks and attains state-of-the-art performance on the NAS-Bench datasets. Building on our proposed framework, a broader range of GP-based BO techniques can be deployed to tackle more challenging NAS problems such as the multi-objective and transfer learning settings. In addition, we also exploit the human-interpretable WL feature extraction for architecture generation; we believe this is a starting point for explainable NAS, which is another exciting direction that warrants future investigation.

References

  • A. Alvi, B. Ru, J. Calliess, S. Roberts, and M. A. Osborne (2019) Asynchronous batch bayesian optimisation with improved local penalisation. In International Conference on Machine Learning, pp. 253–262. Cited by: Appendix C, §E.2, §3.1.
  • J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of machine learning research 13 (Feb), pp. 281–305. Cited by: §E.2.
  • J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems (NIPS), pp. 2546–2554. Cited by: §E.2, §1, §5.2.
  • J. Bergstra, D. Yamins, and D. D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning (ICML), Cited by: §1.
  • K. M. Borgwardt and H. Kriegel (2005) Shortest-path kernels on graphs. In Fifth IEEE International Conference on Data Mining (ICDM’05), pp. 8–pp. Cited by: item 3, §5.1.
  • E. Brochu, V. M. Cora, and N. De Freitas (2010) A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599. Cited by: §2.2.
  • H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang (2018) Efficient architecture search by network transformation. In AAAI Conference on Artificial Intelligence, Cited by: §1.
  • Y. Chen, A. Huang, Z. Wang, I. Antonoglou, J. Schrittwieser, D. Silver, and N. de Freitas (2018) Bayesian optimization in AlphaGo. arXiv:1812.06855. Cited by: §1.
  • J. Cui and B. Yang (2018) Graph bayesian optimization: algorithms, evaluations and applications. arXiv preprint arXiv:1805.01157. Cited by: §4.
  • N. de Lara and E. Pineau (2018) A simple baseline algorithm for graph classification. arXiv preprint arXiv:1810.09155. Cited by: item 1, item 2.
  • X. Dong and Y. Yang (2020) NAS-bench-201: extending the scope of reproducible neural architecture search. arXiv preprint arXiv:2001.00326. Cited by: 2nd item, §E.2, §E.3, §1, §2.1, item 1, §5.1.
  • T. Elsken, J. H. Metzen, and F. Hutter (2018) Neural architecture search: a survey. arXiv:1808.05377. Cited by: §1, §2.1.
  • M. Emmerich and J. Klinkenberg (2008) The computation of the expected improvement in dominated hypervolume of pareto front approximations. Rapport technique, Leiden University 34, pp. 7–3. Cited by: §3.1.
  • A. P. Engelbrecht, I. Cloete, and J. M. Zurada (1995) Determining the significance of input parameters using sensitivity analysis. In International Workshop on Artificial Neural Networks, pp. 382–388. Cited by: §3.2.
  • S. Falkner, A. Klein, and F. Hutter (2018) BOHB: robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning (ICML), pp. 1436–1445. Cited by: §1.
  • M. Feurer, B. Letham, and E. Bakshy (2018) Scalable meta-learning for bayesian optimization using ranking-weighted gaussian process ensembles. In AutoML Workshop at ICML, Cited by: §3.1.
  • T. Gärtner, P. Flach, and S. Wrobel (2003) On graph kernels: hardness results and efficient alternatives. In Learning theory and kernel machines, pp. 129–143. Cited by: item 3, §5.1.
  • D. Ginsbourger, R. Le Riche, and L. Carraro (2010) Kriging is well-suited to parallelize optimization. In Computational intelligence in expensive optimization problems, pp. 131–162. Cited by: §3.1.
  • M. Gönen and E. Alpaydin (2011) Multiple kernel learning algorithms. Journal of machine learning research 12 (64), pp. 2211–2268. Cited by: Appendix B.
  • J. González, Z. Dai, P. Hennig, and N. Lawrence (2016) Batch bayesian optimization via local penalization. In Artificial Intelligence and Statistics, pp. 648–657. Cited by: §3.1.
  • D. Hernández-Lobato, J. Hernandez-Lobato, A. Shah, and R. Adams (2016) Predictive entropy search for multi-objective bayesian optimization. In International Conference on Machine Learning, pp. 1492–1501. Cited by: §3.1.
  • J. M. Hernández-Lobato, J. Requeima, E. O. Pyzer-Knapp, and A. Aspuru-Guzik (2017) Parallel and distributed thompson sampling for large-scale accelerated exploration of chemical space. arXiv preprint arXiv:1706.01825. Cited by: §3.1.
  • F. Höppner and M. Jahnke (2020) Enriched weisfeiler-lehman kernel for improved graph clustering of source code. In International Symposium on Intelligent Data Analysis, pp. 248–260. Cited by: §3.1.
  • F. Hutter, H. H. Hoos, and K. Leyton-Brown (2011) Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pp. 507–523. Cited by: §E.2, §1, §5.2.
  • H. Jin, Q. Song, and X. Hu (2019) Auto-keras: an efficient neural architecture search system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1946–1956. Cited by: §1, item 3.
  • K. Kandasamy, A. Krishnamurthy, J. Schneider, and B. Póczos (2018a) Parallelised bayesian optimisation via thompson sampling. In International Conference on Artificial Intelligence and Statistics, pp. 133–142. Cited by: §3.1.
  • K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing (2018b) Neural architecture search with Bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems (NIPS), pp. 2016–2025. Cited by: Appendix C, §1, §1, §3.2, §4.
  • H. Kashima, K. Tsuda, and A. Inokuchi (2003) Marginalized kernels between labeled graphs. In Proceedings of the 20th international conference on machine learning (ICML-03), pp. 321–328. Cited by: item 2, §5.1.
  • A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter (2016) Fast Bayesian optimization of machine learning hyperparameters on large datasets. arXiv:1605.07079. Cited by: §1.
  • P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1885–1894. Cited by: §3.2.
  • R. Kondor and H. Pan (2016) The multiscale laplacian graph kernel. In Advances in Neural Information Processing Systems, pp. 2990–2998. Cited by: item 1, item 2, §5.1.
  • N. M. Kriege, P. Giscard, and R. Wilson (2016) On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pp. 1623–1631. Cited by: Appendix C.
  • N. M. Kriege, F. D. Johansson, and C. Morris (2020) A survey on graph kernels. Applied Network Science 5 (1), pp. 1–42. Cited by: §2.3.
  • C. Lin (1994) Hardness of approximating graph transformation problem. In International Symposium on Algorithms and Computation, pp. 74–82. Cited by: item 3.
  • C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018a) Progressive neural architecture search. In European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: §1.
  • H. Liu, K. Simonyan, and Y. Yang (2018b) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §1.
  • H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In International Conference on Learning Representations (ICLR), Cited by: 3rd item.
  • R. Luo, F. Tian, T. Qin, E. Chen, and T. Liu (2018) Neural architecture optimization. In Advances in Neural Information Processing Systems (NIPS), pp. 7816–7827. Cited by: §1.
  • W. Lyu, F. Yang, C. Yan, D. Zhou, and X. Zeng (2018) Multi-objective bayesian optimization for analog/rf circuit synthesis. In Proceedings of the 55th Annual Design Automation Conference, pp. 1–6. Cited by: §3.1.
  • L. Ma, J. Cui, and B. Yang (2019) Deep neural architecture search with deep graph Bayesian optimization. In Web Intelligence (WI), pp. 500–507. Cited by: Appendix C, §1, §3.2, §4.
  • J. Mockus, V. Tiesis, and A. Zilinskas (1978) The application of bayesian methods for seeking the extremum. Towards global optimization 2 (117-129), pp. 2. Cited by: §2.2.
  • C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe (2019) Weisfeiler and leman go neural: higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4602–4609. Cited by: item 2, footnote 2.
  • G. Nikolentzos, G. Siglidis, and M. Vazirgiannis (2019) Graph kernels: a survey. arXiv preprint arXiv:1904.12218. Cited by: §2.3.
  • M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics and Image Processing, pp. 722–729. Cited by: §5.1.
  • C. Oh, J. Tomczak, E. Gavves, and M. Welling (2019) Combinatorial bayesian optimization using the graph cartesian product. In Advances in Neural Information Processing Systems, pp. 2910–2920. Cited by: §4, §5.1.
  • B. Paria, K. Kandasamy, and B. Póczos (2018) A flexible framework for multi-objective bayesian optimization using random scalarizations. arXiv preprint arXiv:1805.12168. Cited by: §3.1.
  • H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning (ICML), pp. 4092–4101. Cited by: §1.
  • M. Poloczek, J. Wang, and P. I. Frazier (2016) Warm starting bayesian optimization. In 2016 Winter Simulation Conference (WSC), pp. 770–781. Cited by: §3.1.
  • D. Ramachandram, M. Lisicki, T. J. Shields, M. R. Amer, and G. W. Taylor (2018) Bayesian optimization on graph-structured search spaces: optimizing deep multimodal fusion architectures. Neurocomputing 298, pp. 80–89. Cited by: §4, footnote 5.
  • C. E. Rasmussen (2003) Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Cited by: Appendix B.
  • E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2018) Regularized evolution for image classifier architecture search. arXiv:1802.01548. Cited by: §1.
  • E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the aaai conference on artificial intelligence, Vol. 33, pp. 4780–4789. Cited by: §E.2, §5.2.
  • E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In International Conference on Machine Learning (ICML), pp. 2902–2911. Cited by: §1.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) " Why should i trust you?" explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §3.2.
  • N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt (2011) Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12 (77), pp. 2539–2561. Cited by: Appendix A, §E.2, item 3, §3.1, §3.1, §3.1.
  • N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt (2009) Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488–495. Cited by: item 1, item 2, item 3.
  • H. Shi, R. Pi, H. Xu, Z. Li, J. T. Kwok, and T. Zhang (2019) Multi-objective neural architecture search via predictive network performance optimization. Cited by: Appendix C, §E.2, §1, §1, §3.2, §4, §5.1, §5.1, §5.2, footnote 8.
  • T. Shiraishi, T. Le, H. Kashima, and M. Yamada (2019) Topological bayesian optimization with persistence diagrams. arXiv preprint arXiv:1902.09722. Cited by: §4.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §1, §3.1.
  • J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter (2016) Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 4134–4142. Cited by: §1, §5.2.
  • N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger (2009) Gaussian process optimization in the bandit setting: no regret and experimental design. arXiv preprint arXiv:0912.3995. Cited by: item 4.
  • Z. Wang, B. Kim, and L. P. Kaelbling (2018) Regret bounds for meta bayesian optimization with an unknown gaussian process prior. In Advances in Neural Information Processing Systems, pp. 10477–10488. Cited by: §3.1.
  • B. Weisfeiler and A. A. Lehman (1968) A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia 2 (9), pp. 12–16. Cited by: item 2.
  • C. White, W. Neiswanger, and Y. Savani (2019) BANANAS: bayesian optimization with neural architectures for neural architecture search. arXiv preprint arXiv:1910.11858. Cited by: Appendix C, §E.2, §1, item 3, §3.2, §4, §5.2.
  • C. K. Williams and C. E. Rasmussen (2006) Gaussian processes for machine learning. Vol. 2, MIT press Cambridge, MA. Cited by: §2.2.
  • M. Wistuba, N. Schilling, and L. Schmidt-Thieme (2016) Two-stage transfer surrogate model for automatic hyperparameter optimization. In Joint European conference on machine learning and knowledge discovery in databases, pp. 199–214. Cited by: §3.1.
  • M. Wistuba, N. Schilling, and L. Schmidt-Thieme (2018) Scalable gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning 107 (1), pp. 43–78. Cited by: §3.1.
  • S. Xie, A. Kirillov, R. Girshick, and K. He (2019) Exploring randomly wired neural networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1284–1293. Cited by: 3rd item, §1, §2.1, item 1, §5.1.
  • S. Xie, H. Zheng, C. Liu, and L. Lin (2018) SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926. Cited by: §1.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: item 2.
  • A. Yang, P. M. Esperança, and F. M. Carlucci (2020) NAS evaluation is frustratingly hard. In International Conference on Learning Representations (ICLR), Cited by: Appendix C, §3.2.
  • C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, and F. Hutter (2019) NAS-Bench-101: towards reproducible neural architecture search. In International Conference on Machine Learning (ICML), pp. 7105–7114. Cited by: Appendix C, 1st item, §E.2, §E.3, §1, §2.1, item 1, §3.2, §4, §5.1.
  • Z. Zeng, A. K. Tung, J. Wang, J. Feng, and L. Zhou (2009) Comparing stars: on approximating graph edit distance. Proceedings of the VLDB Endowment 2 (1), pp. 25–36. Cited by: item 3.
  • C. Zhang, M. Ren, and R. Urtasun (2019) Graph hypernetworks for neural architecture search. In International Conference on Learning Representations, External Links: Link Cited by: §1, §4.
  • B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §E.2, §5.2.
  • B. Zoph and Q. Le (2017) Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018a) Learning transferable architectures for scalable image recognition. In Computer Vision and Pattern Recognition (CVPR), pp. 8697–8710. Cited by: §1, §1.
  • B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018b) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §2.1, item 1.

Appendix A Algorithms

The overall algorithm of our NAS-BOWL method is presented in Algorithm 1. The algorithm for the general Weisfeiler-Lehman subtree kernel is presented in Algorithm 2 (adapted from Shervashidze et al. (2011)). Note that here we assume the weights $w_h$ associated with each WL iteration in equation 3.1 to be equal for all $h$.

1:  Input: Observation data $\mathcal{D}_0$, number of BO iterations $T$, BO batch size $b$, acquisition function $\alpha(\cdot)$
2:  Output: The best architecture $G^*$
3:  for $t = 1, \dots, T$ do
4:     Generate a pool of candidate architectures (via random sampling or the gradient-guided mutation of Section 3.2)
5:     Select the next batch of $b$ architectures $\{G_t\}$ that maximise the acquisition function $\alpha$
6:     Evaluate the validation accuracies $\{y_t\}$ of $\{G_t\}$
7:     $\mathcal{D}_t \leftarrow \mathcal{D}_{t-1} \cup \{(G_t, y_t)\}$ and update the GPWL surrogate
8:  end for
Algorithm 1 NAS-BOWL Algorithm
1:  Input: Graphs $G_1$, $G_2$, maximum number of WL iterations $H$
2:  Output: The kernel function value between the two graphs
3:  Initialise the feature vectors $\phi_0(G_1)$, $\phi_0(G_2)$ with the respective counts of the original node labels, i.e. the $h=0$ WL features (e.g. $\phi_{0,i}(G_1)$ is the count of the $i$-th node label in graph $G_1$)
4:  for $h = 1, \dots, H$ do
5:     Assign a multiset label $M_h(v)$ to each node $v$ in each graph, consisting of the multiset $\{l_{h-1}(u)\}$, where $l_{h-1}(u)$ is the node label of node $u$ at the $(h-1)$-th WL iteration and $u$ ranges over the neighbour nodes of $v$
6:     Sort the elements in each $M_h(v)$ in ascending order and concatenate them into a string $s_h(v)$
7:     Add $l_{h-1}(v)$ as a prefix to $s_h(v)$
8:     Compress each string $s_h(v)$ using a hash function $f$ so that $f(s_h(v)) = f(s_h(w))$ iff $s_h(v) = s_h(w)$ for any two nodes $v, w$
9:     Set $l_h(v) \leftarrow f(s_h(v))$
10:     Concatenate the feature vectors $\phi_h(\cdot)$ with the respective counts of the new labels
11:  end for
12:  Compute the inner product $k = \langle \phi(G_1), \phi(G_2) \rangle$ between the feature vectors in the RKHS
Algorithm 2 Weisfeiler-Lehman subtree kernel computation between two graphs
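For readers who prefer code, below is a Python-flavoured skeleton of Algorithm 1; the surrogate object and the generate_candidates, acquisition and train_and_evaluate callables are assumed to be supplied by the user and are illustrative, not part of the released implementation.

```python
def nas_bowl(initial_data, surrogate, acquisition,
             generate_candidates, train_and_evaluate,
             n_iterations=30, batch_size=5):
    """Skeleton of the NAS-BOWL loop (Algorithm 1)."""
    data = list(initial_data)                 # [(architecture_graph, val_accuracy), ...]
    for _ in range(n_iterations):
        surrogate.fit(data)                   # GP with the WL kernel on observed graphs
        pool = generate_candidates(data, surrogate)          # random or gradient-guided mutation
        ranked = sorted(pool, key=lambda g: acquisition(surrogate, g), reverse=True)
        batch = ranked[:batch_size]           # next architectures to query
        data += [(g, train_and_evaluate(g)) for g in batch]  # expensive: trains each architecture
    return max(data, key=lambda pair: pair[1])[0]            # best architecture found so far
```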

Appendix B Combining Different Kernels

In general, the sum or product of valid kernels gives another valid kernel; as such, combining different kernels to yield a better-performing kernel is common in the GP and Multiple Kernel Learning (MKL) literature Rasmussen (2003); Gönen and Alpaydin (2011). In this section, we conduct a preliminary discussion of its usefulness to GPWL. As a singular example, we consider the additive kernel that is a linear combination of the WL kernel and the MLP kernel:

$$k_{add}(G, G') = w_{WL}\, k_{WL}(G, G') + w_{MLP}\, k_{MLP}(G, G') \quad (B.1)$$

where $w_{WL}$ and $w_{MLP}$ are the kernel weights. We choose WL and MLP because we expect them to extract diverse information: whereas WL processes the graph node information directly, MLP considers the spectrum of the graph Laplacian matrix, which often reflects global properties such as the topology and the graph connectivity. We expect that the more diverse the features captured by the constituent kernels, the more effective the additive kernel. While it is possible to determine the weights in a more principled way, such as jointly optimising them under the GP log-marginal likelihood, in this example we simply fix the two weights to constant values. We then perform regression on the NAS-Bench-101 and Flower102 datasets following the setup in Section 5.1. We repeat each experiment 20 times and report the mean and standard deviation in Table 2, and we show the uncertainty estimates of the additive kernel in Fig. 5. In both search spaces the additive kernel outperforms its constituent kernels, but the gain over the WL kernel is marginal. Interestingly, while MLP performs poorly on its own, the complementary spectral information it extracts can be helpful when used alongside our WL kernel. Generally, we hypothesise that as the search space increases in complexity (e.g. larger graphs, more edge connections permitted, etc.), the benefits from combining different kernels will increase; we defer a more comprehensive discussion to future work.
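At the Gram-matrix level, the additive kernel of equation B.1 is simply a weighted sum of two valid Gram matrices, as in the sketch below; the default weights are illustrative placeholders, not the values used in our experiments.

```python
import numpy as np

def additive_gram(K_wl, K_mlp, w_wl=0.5, w_mlp=0.5):
    """Weighted sum of two valid Gram matrices is itself a valid Gram matrix.

    K_wl, K_mlp : (N, N) Gram matrices from the WL and Multiscale Laplacian kernels.
    w_wl, w_mlp : non-negative kernel weights (illustrative defaults).
    """
    return w_wl * np.asarray(K_wl) + w_mlp * np.asarray(K_mlp)
```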

Kernel     N101         Flower-102
WL + MLP   0.871±0.02   0.813±0.018
WL†        0.862±0.03   0.804±0.018
MLP†       0.458±0.07   0.492±0.12
†: Taken directly from Table 1.
Table 2: Regression performance (in terms of rank correlation) of the additive kernel.
(a) N101
(b) Flower102
Figure 5: Predicted vs ground-truth validation error of GPWL with the additive kernel on N101 and Flower-102 in log-log scale. Error bars denote ±1 SD from the GP posterior predictive distribution.

Appendix C Different Ways to Generate Candidate Architectures for Acquisition Function Optimisation

(a) Random Sampling
(b) Genetic Mutation
(c) Gradient Guided Mutation
Figure 6: Different ways to generate candidate architecture population for acquisition function optimisation. Random sampling (a) uses no information from queried data or surrogate model. Conventional genetic mutation (b) uses best architectures queried so far to be the parent architectures for generating new architectures. Our proposed gradient guided mutation (c) further exploits the gradient information of posterior model with respect to the interpretable features learnt by WL kernel to guide how to mutate a given parent architecture.

A way to generate a population of candidate architectures for acquisition function optimisation at each BO iteration is necessary for all BO-based NAS strategies. The naive way to do so is to randomly sample architectures from the search space Ying et al. (2019); Yang et al. (2020) (Fig. 6(a)). This is simple to implement but ignores any information contained in the past query data or the predictive posterior model, and is thus inefficient at exploring the huge search space. A more popular approach is based on genetic mutation, which generates the candidate architectures by mutating a small pool of parent architectures Kandasamy et al. (2018b); Ma et al. (2019); White et al. (2019); Shi et al. (2019) (Fig. 6(b)). The parent architectures are usually chosen among the queried architectures which give the top validation performance or acquisition function values. Generating the candidate architecture pool in this way enables us to exploit prior information on the best architectures observed so far and explore the large search space more efficiently. However, in all prior work Kandasamy et al. (2018b); Ma et al. (2019); White et al. (2019); Shi et al. (2019), no information is available on how to mutate a parent architecture. As a result, every node and edge in a parent architecture gets an equal chance of undergoing mutation and is randomly changed to any other possible status except its current one with uniform probability⁶.

⁶E.g. in NAS-Bench-101, a conv33 node has a uniform probability of becoming a maxpool33 node or a conv11 node.

In contrast, we believe that the aforementioned gradient information of GPWL provides a more informed way of guiding the mutation. As discussed in Section 3.2 in the main text, we re-state the analytic expression of the posterior feature derivative:

$$\mathbb{E}\left[\frac{\partial \mu(G_*)}{\partial \phi_{*,j}}\right] = \frac{\partial \mathbf{k}_*^T}{\partial \phi_{*,j}} [K + \sigma_n^2 I]^{-1} \mathbf{y} \quad (C.1)$$

When a simple dot product is used as the base kernel (i.e. $k(G_*, G_i) = \phi_*^T \phi_i$), the derivative is simply given by:

$$\mathbb{E}\left[\frac{\partial \mu(G_*)}{\partial \phi_{*,j}}\right] = \left[\Phi^T [K + \sigma_n^2 I]^{-1} \mathbf{y}\right]_j \quad (C.2)$$

When the optimal assignment Kriege et al. (2016) inner product is used, we have $k(G_*, G_i) = \sum_j \min(\phi_{*,j}, \phi_{i,j})$. The $\min$ operator leads to non-differentiable points. To tackle this we can use a continuous approximation of $\min$ similar to that proposed in Alvi et al. (2019):

$$\min(a, b) \approx -\frac{1}{\rho} \log\left(e^{-\rho a} + e^{-\rho b}\right) \quad (C.3)$$

where $\rho > 0$ is a smoothing parameter. However, empirically we find that the gradients computed via the approximation in equation C.3 are mostly consistent with equation C.2 (as we will show, we normalise the gradients into pseudo-probabilities, so the magnitudes of the gradients do not matter as much), and the latter can be computed much more cheaply.

Given this, we then transform the gradients using a sigmoid transformation on the negative of the gradients to obtain positive values, and then normalise them to represent pseudo-probabilities that encode the chance of mutation for each node and edge feature in the architecture. A sub-feature (e.g. node 1 being a conv33 operation in Fig. 7) whose gradient is strongly positive will have a lower probability of being mutated, as we want to keep such good features which contribute positively to the validation accuracy. On the other hand, features with large negative gradients will be subject to a higher probability of undergoing mutation, as we want to change the features that contribute negatively to the architecture performance.

With reference to the illustrative example in Fig. 7, the posterior mean gradients with respect to node labels (i.e. features) can be obtained directly, but those with respect to edges are not as trivial: an edge can be present in multiple features, each of which can have a potentially different sign or magnitude (for example, the edge from node 0 to node 2 in Fig. 7 is present in two subtree features). To deal with this, we assign the probability of mutation of an edge based on the mean gradient over all the features in which the edge appears. This smooths out the noisy contribution of a particular edge to the architecture's validation performance. Once a node or an edge is chosen for mutation, we reuse the gradient information on its possible change options to define the corresponding probability of change. For example, if node 2 with label maxpool3×3 in Fig. 7 is chosen to mutate, it is more likely to change into conv3×3-bn-relu than conv1×1-bn-relu. In summary, we propose a procedure that learns which architecture component to modify, as well as how to modify it to improve performance, by taking advantage of the interpretable features extracted by our WL kernel GP and their corresponding derivatives. We briefly compare the different candidate generation strategies discussed in this section in App. E.5.
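For illustration, a minimal sketch of the sigmoid-and-normalise scheme described above is given below; the helper name and the toy gradient values are ours, and details such as whether node and edge probabilities are normalised jointly are assumptions rather than the exact implementation.

```python
import numpy as np

def mutation_probabilities(feature_grads):
    """Turn posterior-mean gradients of node/edge features into pseudo-probabilities of mutation:
    very positive gradients (good features) -> low mutation probability,
    very negative gradients -> high mutation probability."""
    scores = 1.0 / (1.0 + np.exp(feature_grads))  # sigmoid of the negative gradient
    return scores / scores.sum()                  # normalise into pseudo-probabilities

# Toy node-feature gradients for one parent architecture.
node_grads = np.array([2.1, -0.3, -1.7])
print(mutation_probabilities(node_grads))  # the most negative gradient gets the largest probability

# For an edge appearing in several WL features, smooth its contribution by averaging first.
edge_grad = np.mean([0.4, -0.9])  # mean gradient over the features containing the edge
```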

Figure 7: An illustration of using gradient information on various features to guide the mutation.

Note that an alternative approach is to assign a length-scale to each feature dimension produced by the WL kernel and then learn these length-scales with an automatic relevance determination (ARD) kernel based on the GP marginal likelihood, so as to assess the responsiveness of the architecture performance to different features. However, the number of possible features in an architecture search space can be large and grows exponentially with the architecture size. Accurate learning of these length-scales can then be very difficult and can easily lead to suboptimal values. Moreover, the length-scales only reflect the smoothness/variability of the objective function with respect to the features but do not tell the direction of their effects on the architecture performance. Filtering features based on ARD length-scales can therefore lead to the removal of important architecture features which correlate positively with the validation performance.

Appendix D Predictive Mean and Standard Deviation of the GPWL Surrogate on NAS Datasets

In this section, we show the GPWL predictions on the various NAS datasets when trained with 50 samples each. Not only does GPWL produce a satisfactory predictive mean in terms of rank correlation and agreement with the ground truth, it also provides sound uncertainty estimates: in most cases the ground truths lie within the error bar representing one standard deviation of the GP predictive distribution. For the training of GPWL, we always transform the validation errors (the targets of the regression) into log-scale, normalise the data and transform it back at prediction time, as empirically we find this leads to better uncertainty estimates.
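A minimal sketch of this target transformation is given below (log-transform, standardise, and invert at prediction time); the standardisation shown is one simple choice and the function names are illustrative.

```python
import numpy as np

def transform_targets(val_errors):
    """Log-transform validation errors and standardise them before GP training."""
    log_y = np.log(val_errors)
    mean, std = log_y.mean(), log_y.std()
    return (log_y - mean) / std, (mean, std)

def inverse_transform(pred, stats):
    """Map GP outputs back to the original validation-error scale at prediction time."""
    mean, std = stats
    return np.exp(pred * std + mean)

y_train = np.array([0.062, 0.071, 0.055, 0.090])  # example validation errors
z, stats = transform_targets(y_train)             # targets actually fed to the GP
print(inverse_transform(z, stats))                # recovers the original errors
```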

Figure 8: Predicted vs ground-truth validation error of GPWL in various NAS-Bench tasks in log-log scale, on (a) N101, (b) CIFAR10, (c) CIFAR100, (d) ImageNet16 and (e) Flower102. Error bars denote 1 SD of the GP posterior predictive distribution.

Appendix E Experimental Details

All experiments were conducted on a 36-core 2.3GHz Intel Xeon processor with 512 GB RAM.

E.1 Datasets

We experiment on the following datasets:

  • NAS-Bench-101 Ying et al. (2019): The search space is an acyclic directed graph with up to 7 nodes and a maximum of 9 edges. Besides the input node and output node, each of the remaining operation nodes can choose one of three possible operations: conv3×3-bn-relu, conv1×1-bn-relu and maxpool3×3. The dataset contains all 423,624 unique neural architectures in the search space. Each architecture is trained for 108 epochs and evaluated on the CIFAR10 image dataset. The evaluation is repeated over 3 random initialisation seeds. We can access the final training/validation/test accuracy, the number of parameters as well as the training time of each architecture from the dataset. The dataset and its API can be downloaded from https://github.com/google-research/nasbench/.

  • NAS-Bench-201 Dong and Yang (2020): The search space is an acyclic directed graph with 4 nodes and 6 edges. Each edge corresponds to an operation selected from a set of 5 possible options: conv1×1, conv3×3, avgpool3×3, skip-connect and zeroize. This search space is applicable to almost all up-to-date NAS algorithms. Note that although the search space of NAS-Bench-201 is more general, it is smaller than that of NAS-Bench-101. The dataset contains all 15,625 unique neural architectures in the search space. Each architecture is trained for 200 epochs and evaluated on 3 image datasets: CIFAR10, CIFAR100 and ImageNet16-120. The evaluation is repeated over 3 random initialisation seeds. We can access the training accuracy/loss and validation accuracy/loss after every training epoch, the final test accuracy/loss, the number of parameters as well as FLOPs from the dataset. The dataset and its API can be downloaded from https://github.com/D-X-Y/NAS-Bench-201.

  • Flowers102: We generate this dataset based on the random graph generators proposed in Xie et al. (2019). The search space is an acyclic directed graph with a much larger number of nodes than the NAS-Bench cells and a varying number of edges. All the nodes can take one of three possible options: input, output and relu-conv3×3-bn; thus, the graph can have multiple inputs and outputs. This search space is very different from those of NAS-Bench-101 and NAS-Bench-201 and is used to test the scalability of our surrogate model to a large search space (in terms of the number of nodes in the graph). The edges/wiring in the graph are created by one of the three classic random graph models: Erdos-Renyi (ER), Barabasi-Albert (BA) and Watts-Strogatz (WS). Different random graph models result in graphs of different topological structures and connectivity patterns and are defined by one or two hyperparameters (a minimal sketch of these generators is given after this list). We investigate a total of 69 different hyperparameter settings: 8 values for the hyperparameter of the ER model, 6 values for the hyperparameter of the BA model and 55 value combinations for the two hyperparameters of the WS model. For each hyperparameter setting, we generate 8 different architectures using the corresponding random graph model and train each architecture for 250 epochs before evaluating it on the Flowers102 dataset. The training set-up follows Liu et al. (2019). This results in our dataset of 552 randomly wired neural architectures.
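For concreteness, the three classic random graph models can be sampled with networkx as sketched below; the node count and hyperparameter values shown are placeholders rather than the settings used to build the dataset, and the subsequent conversion into a labelled architecture graph is omitted.

```python
import networkx as nx

n = 32  # number of nodes per graph (placeholder value)

# Erdos-Renyi (ER): every possible edge is included independently with probability p.
g_er = nx.erdos_renyi_graph(n=n, p=0.2, seed=0)

# Barabasi-Albert (BA): each new node attaches to m existing nodes by preferential attachment.
g_ba = nx.barabasi_albert_graph(n=n, m=4, seed=0)

# Watts-Strogatz (WS): ring lattice with k neighbours per node, edges rewired with probability p.
g_ws = nx.watts_strogatz_graph(n=n, k=4, p=0.25, seed=0)

for name, g in [("ER", g_er), ("BA", g_ba), ("WS", g_ws)]:
    print(name, g.number_of_nodes(), g.number_of_edges())
```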

E.2 Experimental Setup

NAS-BOWL

We use a batch size of B = 5 (i.e., at each BO iteration, the architectures yielding the top 5 acquisition function values are selected to be evaluated in parallel). When the mutation algorithm described in Section 3.2 is used, we use a pool of P candidate architectures, half of which are generated by mutating the top-10 best performing architectures already queried and the other half by random sampling, to encourage more exploration in NAS-Bench-101. In NAS-Bench-201, accounting for the much smaller search space and consequently the lesser need for exploration, we simply generate all candidate architectures from mutation. For experiments with random candidate generation, we use the same pool size P throughout, and we study the effect of varying P in App. E.4. We use WL with the optimal assignment (OA) inner product for all datasets apart from NAS-Bench-201. Denoting the feature vectors of two graphs G and G' as φ(G) and φ(G') respectively, the OA inner product in the WL case is given by the histogram intersection k(G, G') = Σ_j min(φ_j(G), φ_j(G')), where φ_j(·) is the j-th element of the feature vector. On NAS-Bench-201, which features a much smaller search space, we find a simple dot product of the feature vectors to perform better empirically. We always use 10 random samples to initialise NAS-BOWL.
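The two inner products mentioned above can be sketched in a few lines; the variable names below are illustrative.

```python
import numpy as np

def oa_inner_product(phi_g, phi_h):
    """Optimal assignment (histogram intersection) between two WL feature-count vectors:
    k(G, G') = sum_j min(phi_j(G), phi_j(G')). Used on all datasets except NAS-Bench-201."""
    return float(np.minimum(phi_g, phi_h).sum())

def dot_inner_product(phi_g, phi_h):
    """Simple dot product between the feature vectors, used on NAS-Bench-201."""
    return float(phi_g @ phi_h)

phi_g = np.array([3.0, 1.0, 0.0, 2.0])  # WL feature counts of architecture G
phi_h = np.array([2.0, 2.0, 1.0, 0.0])  # WL feature counts of architecture G'
print(oa_inner_product(phi_g, phi_h))   # min-sum: 2 + 1 + 0 + 0 = 3
print(dot_inner_product(phi_g, phi_h))  # dot:     6 + 2 + 0 + 0 = 8
```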

On the NAS-Bench-101 dataset, we always apply pruning (which is available in the NAS-Bench-101 API) to remove invalid nodes and edges from the graphs. On the NAS-Bench-201 dataset, since the architectures are defined over a DARTS-like, edge-labelled search space, we first convert the edge-labelled graphs to node-labelled graphs as a pre-processing step. It is worth noting that it is possible to use a WL kernel defined directly over edge-labelled graphs (e.g. the WL-edge kernel proposed by Shervashidze et al. (2011)), although in this paper we find the WL kernels over node-labelled graphs to perform better empirically.
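A minimal sketch of the edge-labelled to node-labelled conversion is given below: each labelled edge becomes a labelled node, and two new nodes are connected whenever the original edges share an endpoint. This is an illustrative construction of ours (essentially a labelled line graph) rather than the exact pre-processing code.

```python
import networkx as nx

def edge_labelled_to_node_labelled(edges):
    """edges: list of (u, v, op) tuples describing an edge-labelled cell.
    Returns a node-labelled DiGraph: each labelled edge becomes a node carrying `op`,
    and (u, v) -> (v, w) is added whenever two original edges share the node v."""
    g = nx.DiGraph()
    for u, v, op in edges:
        g.add_node((u, v), op=op)
    for u, v, _ in edges:
        for v2, w, _ in edges:
            if v == v2:
                g.add_edge((u, v), (v2, w))
    return g

# NAS-Bench-201-style cell: 4 nodes and 6 labelled edges.
cell = [(0, 1, "conv3x3"), (0, 2, "skip-connect"), (0, 3, "conv1x1"),
        (1, 2, "avgpool3x3"), (1, 3, "conv3x3"), (2, 3, "zeroize")]
g = edge_labelled_to_node_labelled(cell)
print(g.number_of_nodes(), g.number_of_edges())  # 6 nodes, 4 edges
```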

BANANAS

We use the code made public by the authors White et al. (2019) (https://github.com/naszilla/bananas) and use the default settings contained in the code, with the exception of the number of architectures queried at each BO iteration (i.e. the BO batch size): to conform to our test settings we use B = 5 instead of the default. While we do not change the default pool size at each BO iteration, instead of filling the pool entirely from mutation of the best architectures, we mutate only the top-10 best architectures and generate the rest of the pool randomly, to enable a fair comparison with our method. It is worth noting that neither change led to a significant deterioration in the performance of BANANAS: under the deterministic validation error setup, the results we report are largely consistent with those reported in White et al. (2019); under the stochastic validation error setup, our BANANAS results actually slightly outperform the results in the original paper. Finally, no public implementation of BANANAS on NAS-Bench-201 was released by the authors.

GCNBO for NAS

We implemented the GNN surrogate in Section 5.1 ourselves following the description in the most recent work Shi et al. (2019), which uses a graph convolutional neural network in combination with a Bayesian linear regression layer to predict architecture performance in its BO-based NAS (Shi et al. (2019) did not publicly release their code). To ensure a fair comparison with our NAS-BOWL, we then define a standard Expected Improvement (EI) acquisition function based on the predictive distribution of the GNN surrogate to obtain another BO-based NAS baseline in Section 5.2, GCNBO. Similar to all the other baselines, including our NASBOWLr and BANANASr, we use random sampling to generate candidate architectures for acquisition function optimisation. However, different from NAS-BOWL and BANANAS, GCNBO uses a batch size of B = 1: at each BO iteration, NAS-BOWL and BANANAS select 5 new architectures to evaluate next, whereas GCNBO selects only 1. This setup should favour GCNBO if we measure the optimisation performance against the number of architecture evaluations, which is the metric used in Figs. 4 and 9, because at each BO iteration GCNBO selects the next architecture based on the most up-to-date information, whereas NAS-BOWL and BANANAS select only one architecture in such a fully informed way and select the other four with outdated information. Specifically, in the sequential case (B = 1), each architecture is selected by maximising the acquisition function only after the previous selection has been evaluated and incorporated into the surrogate, and the same procedure applies to every subsequent selection. In the batch case (B = 5), however, the remaining architectures in a batch have to be selected before the first one is evaluated, so they are all decided based on the same, not yet updated, posterior. For a more detailed discussion on sequential (B = 1) and batch (B > 1) BO, the readers are referred to Alvi et al. (2019).
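The distinction between the two regimes can be sketched as follows, assuming a `fit_surrogate` routine that refits the surrogate on the current observations and returns an acquisition function, and an `evaluate` routine that trains an architecture and returns its validation performance; both names are placeholders.

```python
def select_sequential(candidates, fit_surrogate, evaluate, observations, steps=5):
    """Sequential BO (B = 1, as in GCNBO and the other baselines): the surrogate is refitted
    after every single evaluation, so each selection uses the most up-to-date posterior."""
    for _ in range(steps):
        acquisition = fit_surrogate(observations)   # refit on everything observed so far
        best = max(candidates, key=acquisition)
        observations.append((best, evaluate(best)))
    return observations

def select_batch(candidates, fit_surrogate, observations, batch_size=5):
    """Batch BO (B = 5, as in NAS-BOWL and BANANAS): all B architectures are chosen from the
    same posterior, before any of them has been evaluated."""
    acquisition = fit_surrogate(observations)
    return sorted(candidates, key=acquisition, reverse=True)[:batch_size]
```

Measured per architecture evaluation, the sequential loop therefore always acts on fresher information, which is the advantage referred to above; the batch variant trades this off for wall-clock parallelism.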

Other Baselines

For all the other baselines, namely random search Bergstra and Bengio (2012), TPE Bergstra et al. (2011), reinforcement learning Zoph and Le (2016), BO with SMAC Hutter et al. (2011) and regularised evolution Real et al. (2019), we follow the implementations available at https://github.com/automl/nas_benchmarks for NAS-Bench-101 Ying et al. (2019), and we modify them to be applicable to NAS-Bench-201 Dong and Yang (2020). Note that, like GCNBO, all these methods are sequential (B = 1) and thus should enjoy the same advantage mentioned above when measured against the number of architectures evaluated.

E.3 Additional NAS-Bench Results

Test Errors Against Number of Evaluations

We show the test errors against the number of evaluations using both stochastic and deterministic validation errors of the NAS-Bench datasets in Figure 9. It is worth noting that regardless of whether the validation errors are stochastic or not, the test errors are always averaged to deterministic values for a fair comparison. NAS-BOWL still outperforms the other methods under this metric, achieving a lower test error, faster convergence, or both under most circumstances. This corresponds well with the results on the validation error in Fig. 4 and further confirms the superior performance of our proposed NAS-BOWL in searching for optimal architectures.

Figure 9: Median test error on the NAS-Bench datasets (panels, left to right: N101, CIFAR10, CIFAR100, ImageNet16) with deterministic (top row) and stochastic (bottom row) validation errors from 20 trials. Shades denote standard error.

Results Against GPU-Hours

Figure 10: Median validation error on the NAS-Bench datasets (panels, left to right: N101, CIFAR10, CIFAR100, ImageNet16) with deterministic (top row) and stochastic (bottom row) validation errors from 20 trials, plotted against the number of GPU-hours. Shades denote standard error.
Figure 11: Median test error on the NAS-Bench datasets (panels, left to right: N101, CIFAR10, CIFAR100, ImageNet16) with deterministic (top row) and stochastic (bottom row) validation errors from 20 trials, plotted against the number of GPU-hours. Shades denote standard error.

In Figs. 10 and 11, we show the validation and test errors, respectively, against the number of GPU-hours used to train the architectures (instead of the number of architectures evaluated, as in Figs. 4 and 9), and it can be seen that NAS-BOWL outperforms the baselines in terms of GPU-time as well. It is worth noting that the training time is available from both NAS-Bench datasets in the standardised settings described in Ying et al. (2019); Dong and Yang (2020); we did not actually train these models. Finally, whereas for the sequential methods the GPU-time is equivalent to the wall-clock time, since our method features batch BO the wall-clock time can be dramatically reduced from the GPU-time reported here by taking advantage of any available parallel computing facility (e.g., with B = 5 and 5 GPUs available in parallel, the wall-clock time is roughly one fifth of the GPU-time).

E.4 Effect of Varying Pool Size

As discussed in the main text, NAS-BOWL introduces no inherent hyperparameters that require manual tuning, but as discussed in App. C, the choice of how to generate the candidate architectures requires us to specify a number of parameters, such as the pool size P (the number of candidate architectures to generate at each BO iteration) and the batch size B. In our main experiments, we fix B = 5 and the pool size P throughout; in this section, we consider the effect of varying P to investigate whether the performance of NAS-BOWL is sensitive to this parameter.

We keep B = 5 but adjust P, keeping all other settings consistent with the other experiments using the deterministic validation errors on NAS-Bench-101 (N101) (i.e. averaging the validation errors over seeds to remove stochasticity), and we report our results in Fig. 12, where the median is computed from 20 experiment repeats. While the convergence speed varies slightly between the different choices, for all choices of P apart from 50, which performs slightly worse, NAS-BOWL converges to similar validation and test errors by the end of 150 architecture evaluations. This suggests that the performance of NAS-BOWL is rather robust to the value of P and that our recommended default performs well both in terms of the final solution returned and the convergence speed.

Figure 12: Effect of varying the pool size P on NAS-BOWL in N101: (a) validation error; (b) test error.

E.5 Ablation Studies

In this section, we perform ablation studies on the NAS-BOWL performance on both N101 and N201 (with deterministic validation errors). We repeat each experiment 20 times and present the median and standard error of both the validation and test performance in Fig. 13 (N101 in (a)(b) and N201 in (c)(d)). We explain each legend as follows:

  1. gradmutate: Full NAS-BOWL using the gradient-guided architecture mutation described in Section 3.2 (identical to NASBOWLm in Figs. 4 and 9);

  2. mutate: NAS-BOWL with the standard mutation algorithm. Specifically, we use identical setup to the gradient-guided mutation scheme, with the only exception that the probabilities of mutation of all nodes and edges are uniform;

  3. WL: NAS-BOWL with random candidate generation. This is identical to NASBOWLr in Figs. 4 and 9;

  4. UCB: NAS-BOWL with random candidate generation, but with the acquisition function changed from Expected Improvement (EI) to the Upper Confidence Bound (UCB) of Srinivas et al. (2009), $\alpha_{\mathrm{UCB}} = \mu + \sqrt{\beta_t}\,\sigma$, where $\mu$ and $\sigma$ are the predictive mean and standard deviation of the GPWL surrogate, respectively, and $\beta_t$ is a coefficient that changes as a function of $t$, the number of BO iterations. We select the value of $\beta$ at initialisation ($t = 0$) and decay it as a function of $t$ as suggested by Srinivas et al. (2009) (an illustrative sketch of this acquisition is shown after this list).

  5. VH: NAS-BOWL with random candidate generation, but instead of leaving the number of WL iterations h to be automatically determined by optimising the GP log marginal likelihood, we set h = 0, i.e. no WL iteration takes place and the only features we use are the counts of each type of original node operation (e.g. conv3×3-bn-relu). This essentially reduces the WL kernel to a vertex histogram (VH) kernel.
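As referenced in item 4, an illustrative sketch of the UCB acquisition with a decaying exploration coefficient is given below; the initial value of β and the decay schedule shown are placeholders, not the values used in our experiments.

```python
import numpy as np

def ucb(mu, sigma, beta_t):
    """GP-UCB acquisition: alpha(G) = mu(G) + sqrt(beta_t) * sigma(G)."""
    return mu + np.sqrt(beta_t) * sigma

def beta_schedule(t, beta_0=3.0, decay=0.05):
    """Placeholder decaying exploration coefficient (the actual schedule follows Srinivas et al.)."""
    return beta_0 / (1.0 + decay * t)

# Toy GPWL predictions for three candidate architectures at BO iteration t = 10.
mu = np.array([0.93, 0.91, 0.90])     # predictive means
sigma = np.array([0.01, 0.04, 0.08])  # predictive standard deviations
print(ucb(mu, sigma, beta_schedule(10)))
```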

Figure 13: Ablation studies of NAS-BOWL: (a) validation error on N101; (b) test error on N101; (c) validation error on N201; (d) test error on N201.

We find that using an appropriate h is crucial: in both N101 and N201, VH significantly underperforms the other variants, although the extent of underperformance is smaller in N201, likely due to its smaller search space. This suggests that how the nodes are connected, which is captured by the higher-order WL features, is very important, and that the multi-scale feature extraction of the WL kernel is crucial to the success of NAS-BOWL. On the other hand, the choice of acquisition function seems to matter much less, as there is little difference between the UCB and WL runs in both N101 and N201. Finally, using either mutation algorithm leads to a significant improvement in the performance of NAS-BOWL; between gradmutate and mutate, while there is little difference in final performance, gradmutate does converge faster in the initial phase in both cases. We note that both datasets feature rather small search spaces where the gain from guided search can be limited, as random mutation might already be sufficient to find high-performing regions of the search space. We expect the potential gain of gradmutate to be larger for more complex NAS search spaces, which we defer to future work.