1 Introduction
Many real-world problems can be posed as optimizing a directed acyclic graph (DAG) representing some computational task. For example, the architecture of a neural network is a DAG, so searching for optimal neural architectures is essentially a DAG optimization task. Similarly, one critical problem in learning graphical models – optimizing the connection structures of Bayesian networks [1] – is also a DAG optimization task. DAG optimization is pervasive in other fields as well. In electronic circuit design, engineers need to optimize DAG circuit blocks not only to realize target functions, but also to meet specifications such as power usage and operating temperature.
DAG optimization is a hard problem. Firstly, evaluating a DAG's performance is often time-consuming (e.g., training a neural network). Secondly, state-of-the-art black-box optimization techniques such as simulated annealing and Bayesian optimization primarily operate in continuous spaces, and thus are not directly applicable to DAG optimization due to the discrete nature of DAGs. In particular, to make Bayesian optimization work for discrete structures, we need a kernel to measure the similarity between discrete structures, as well as a method to explore the design space and extrapolate to new points. Principled solutions to these problems are still lacking.
Is there a way to circumvent the trouble caused by discreteness? The answer is yes. If we can embed all DAGs into a continuous space and make the space relatively smooth, we might be able to directly use principled black-box optimization algorithms to optimize DAGs in this space, or even use gradient methods if gradients are available. Recently, there has been increased interest in training generative models for discrete data types such as molecules [2, 3], arithmetic expressions [4], source code [5], undirected graphs [6], etc. In particular, Kusner et al. [3] developed a grammar variational autoencoder (GVAE) for molecules, which encodes and decodes molecules into and from a continuous latent space, allowing one to optimize molecule properties by searching in this well-behaved space instead of the discrete space. Inspired by this work, we propose to also train a variational autoencoder for DAGs, and to optimize DAG structures in the latent space via Bayesian optimization.
To encode DAGs, we leverage graph neural networks (GNNs) [7]. Traditionally, a GNN treats all nodes symmetrically, and extracts local features around nodes by simultaneously passing all nodes' neighbors' messages to them. However, such a simultaneous message passing scheme is designed to learn local structural features. It might not be suitable for DAGs, since in a DAG: 1) nodes are not symmetric, but intrinsically have an ordering based on their dependency structure; and 2) we are more concerned with the computation represented by the entire graph than with the local structures.
In this paper, we propose an asynchronous message passing scheme to encode the computations on DAGs. The message passing no longer happens at all nodes simultaneously, but respects the computation dependencies (the partial order) among the nodes. For example, suppose node A has two predecessors, B and C, in a DAG. Our scheme does not perform feature learning for A until the feature learning on both B and C is finished. Then, the aggregated message from B and C is passed to A to trigger A's feature learning. We incorporate this feature learning scheme in both our encoder and decoder, and propose the DAG variational autoencoder (DVAE). DVAE has an excellent theoretical property for modeling DAGs – we prove that DVAE can injectively encode computations on DAGs. This means we can build a mapping from the discrete space to a continuous latent space so that every DAG computation has a unique embedding in the latent space, which justifies performing optimization in the latent space instead of the original design space.
Our contributions in this paper are: 1) We propose DVAE, a variational autoencoder for DAGs using a novel asynchronous message passing scheme, which is able to injectively encode computations. 2) Based on DVAE, we propose a new DAG optimization framework which performs Bayesian optimization in a continuous latent space. 3) We apply DVAE to two problems, neural architecture search and Bayesian network structure learning. Experiments show that DVAE not only generates novel and valid DAGs, but also learns smooth latent spaces effective for optimizing DAG structures.
2 Related work
Variational autoencoder (VAE) [8, 9] provides a framework to learn both a probabilistic generative model p_θ(x|z) (the decoder) and an approximated posterior distribution q_φ(z|x) (the encoder). A VAE is trained by maximizing the evidence lower bound

L(φ, θ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) || p(z) ).   (1)

The posterior approximation q_φ(z|x) and the generative model p_θ(x|z) can in principle take arbitrary parametric forms, whose parameters are output by the encoder and decoder networks. After learning p_θ(x|z), we can generate new data by decoding latent space vectors z sampled from the prior p(z). For generating discrete data, p_θ(x|z) is often decomposed into a series of decision steps.

Deep graph generative models use neural networks to learn distributions over graphs. There are mainly three types: token-based, adjacency-matrix-based, and graph-based. Token-based models [2, 3, 10] represent a graph as a sequence of tokens (e.g., characters, grammar rules) and model these sequences using RNNs. They are less general, since task-specific graph grammars such as SMILES for molecules [11] are required. Adjacency-matrix-based models [12, 13, 14, 15, 16] leverage the proxy adjacency matrix representation of a graph, and generate the matrix in one shot or generate its columns/entries sequentially. In contrast, graph-based models [6, 17, 18, 19] seem more natural, since they operate directly on graph structures (instead of proxy matrix representations) by iteratively adding new nodes/edges to a graph based on the existing graph and node states. In addition, the graph and node states are learned by graph neural networks (GNNs), which have already shown their powerful graph representation learning ability on various tasks [20, 21, 22, 23, 24, 25, 26].
Neural architecture search (NAS) aims at automating the design of neural network architectures. It has seen major advances in recent years [27, 28, 29, 30, 31, 32]. See Hutter et al. [33] for an overview. NAS methods can be mainly categorized into: 1) reinforcement learning methods [27, 30, 32], which train controllers to generate architectures with high rewards in terms of validation accuracy; 2) Bayesian optimization based methods [34], which define kernels to measure architecture similarity and extrapolate the architecture space heuristically; 3) evolutionary approaches [28, 35, 36], which use evolutionary algorithms to optimize neural architectures; and 4) differentiable methods [31, 37, 38], which use continuous relaxations/mappings of neural architectures to enable gradient-based optimization. In Appendix A, we include more discussion of several closely related works.

Bayesian network structure learning (BNSL) aims to learn the structure of the underlying Bayesian network from observed data [39, 40, 41, 42]. A Bayesian network is a probabilistic graphical model which represents conditional dependencies among variables via a DAG [1]. One main approach to BNSL is score-based search: we define some "goodness-of-fit" score for network structures, and search for one with the optimal score in the discrete design space. Commonly used scores include BIC and BDeu, mostly based on marginal likelihood [1]. Due to the NP-hardness of the problem [43], however, exact algorithms such as dynamic programming [44] or shortest-path approaches [45, 46] can only solve small-scale problems. Thus, one has to resort to heuristic methods such as local search and simulated annealing [47]. In general, BNSL remains a hard problem with much ongoing research.
3 DAG variational autoencoder (DVAE)
In this section, we describe our proposed DAG variational autoencoder (DVAE). DVAE uses an asynchronous message passing scheme to encode and decode DAGs. In contrast to the simultaneous message passing in traditional GNNs, DVAE allows encoding computations rather than structures.
Definition 1.
(Computation) Given a set of elementary operations O, a computation C is the composition of a finite number of operations o ∈ O applied to an input signal x, with the output of each operation being the input to its succeeding operations.
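As a minimal illustration of Definition 1, the following Python sketch evaluates the computation a DAG represents by visiting nodes in topological order. The node names, operation set, and dict-based graph encoding are illustrative choices for this sketch, not from the paper.

```python
def evaluate(dag, ops, x):
    """Evaluate the computation a DAG represents on input signal x.

    dag: dict node -> list of predecessor nodes, keys in topological order.
    ops: dict node -> operation applied to the sum of predecessor outputs.
    """
    out = {}
    for v, preds in dag.items():       # topological order guarantees all
        if not preds:                  # predecessor outputs already exist
            out[v] = x                 # starting node passes the input through
        else:
            out[v] = ops[v](sum(out[u] for u in preds))
    return out[v]                      # v is the ending node after the loop

# A toy computation: the input is doubled on one branch, incremented on the
# other, and the two branch outputs are summed at the ending node.
dag = {"in": [], "v1": ["in"], "v2": ["in"], "v3": ["v1", "v2"]}
ops = {"v1": lambda s: 2 * s, "v2": lambda s: s + 1, "v3": lambda s: s}

evaluate(dag, ops, 5)   # (2*5) + (5+1) = 16
```

Listing v3's predecessors as ["v2", "v1"] instead yields the same value, hinting at why structurally different DAGs can denote the same computation when input aggregation is order-invariant.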
The set of elementary operations depends on the specific application. For example, when we are interested in computations performed by a calculator, O will be the set of operations defined on the functional buttons, such as +, −, ×, ÷, etc. When modeling neural networks, O can be a predefined set of basic layers, such as 3×3 convolution, 5×5 convolution, 2×2 max pooling, etc. A computation can be represented as a directed acyclic graph (DAG), with directed edges representing signal flow directions among node operations. The graph must be acyclic, since otherwise the input signal would go through an infinite number of operations and the computation would never stop. Figure 1 shows two examples. Note that the two DAGs in Figure 1 represent the same computation, as the input signal goes through exactly the same operations.

3.1 Encoding
We first introduce DVAE’s encoder. The DVAE encoder can be seen as a graph neural network (GNN) using an asynchronous message passing scheme. Given a DAG, we assume there is a single starting node which does not have any predecessors (e.g., the input layer of a neural architecture). If there are multiple such nodes, we add a virtual starting node connecting to all of them.
Similar to standard GNNs, we use an update function U to compute the hidden state of each node based on its neighbors' incoming messages. The hidden state of node v is given by:

h_v = U(x_v, h_v^in),   (2)

where x_v is the one-hot encoding of v's type, and h_v^in represents the incoming message to v. h_v^in is given by aggregating the hidden states of v's predecessors using an aggregation function A:

h_v^in = A({h_u : u → v}),   (3)

where u → v denotes that there is a directed edge from u to v, and {h_u : u → v} represents a multiset of v's predecessors' hidden states. If an empty set is input to A (corresponding to the case of the starting node without any predecessors), we let A output an all-zero vector.
Compared to the traditional simultaneous message passing, in DVAE the message passing for a node must wait until all of its predecessors’ hidden states have already been computed. This simulates how a computation is really performed – to execute some operation, we also need to wait until all its input signals are ready. To make sure the required hidden states are available when a new node comes, we can perform message passing for nodes sequentially following a topological ordering of the DAG.
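The sequential message passing along a topological ordering can be sketched as follows, with a toy order-invariant sum standing in for the aggregation function and a fixed random affine map with tanh standing in for the update function; all parameter choices here are illustrative, not the paper's learned networks.

```python
import numpy as np

def encode(dag, node_type, n_types, dim=4, seed=0):
    """Return the ending node's hidden state, visiting nodes in topological order.

    dag: dict node -> list of predecessors, keys given in a topological order.
    node_type: dict node -> integer type id (one-hot encoded below).
    """
    rng = np.random.default_rng(seed)
    W_x = rng.standard_normal((dim, n_types))   # toy update parameters
    W_h = rng.standard_normal((dim, dim))
    h = {}
    for v, preds in dag.items():
        h_in = sum((h[u] for u in preds), np.zeros(dim))  # A: order-invariant sum
        x_v = np.eye(n_types)[node_type[v]]               # one-hot type encoding
        h[v] = np.tanh(W_x @ x_v + W_h @ h_in)            # stands in for U(x_v, h_in)
    return h[v]                                           # ending node's state
```

Because the toy aggregator is order-invariant, feeding the nodes in another valid topological order produces the identical ending state, in line with Theorem 1.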
In Figure 2, we use a real neural architecture to illustrate the encoding process. After all nodes' hidden states are computed, we use h_{v_n}, the hidden state of the ending node v_n without any successors, as the output of the encoder. Then we feed h_{v_n} to two multi-layer perceptrons (MLPs) to get the mean and variance parameters of the posterior approximation q_φ(z|x) in (1), which is a normal distribution in our experiments. If there are multiple nodes without successors, we again add a virtual ending node connected from all of them.
Note that although topological orderings are usually not unique for a DAG, we can take any one of them as the message passing order while ensuring the encoder output is always the same, formalized by the following theorem. We include all theorem proofs in the appendix.
Theorem 1.
The DVAE encoder is invariant to node permutations of the input DAG if the aggregation function A is invariant to the order of its inputs (e.g., summing, averaging, etc.).
Theorem 1 means isomorphic DAGs will have the same encoding result, no matter how we reorder/reindex the nodes. It also indicates that so long as we encode a DAG complying with its partial order, the real message passing order and node order do not influence the encoding result.
The next theorem shows another property of DVAE that is crucial for its success in modeling DAGs, i.e., it is able to injectively encode computations on DAGs.
Theorem 2.
Let G be any DAG representing some computation C. Let v_1, …, v_n be its nodes, following a topological order, each representing some operation o_i, where v_n is the ending node. Then, the encoder of DVAE maps C to h_{v_n} injectively if A is injective and U is injective.
The significance of Theorem 2 is that it provides a way to injectively encode computations on DAGs, so that every computation has a unique embedding in the latent space. Therefore, instead of performing optimization in the original discrete space, we may equivalently perform optimization in the continuous latent space. In this wellbehaved Euclidean space, distance is well defined, and principled Bayesian optimization can be applied to search for latent points with high performance scores, which transforms the discrete optimization problem into an easier continuous problem.
Note that Theorem 2 states DVAE injectively encodes computations on graph structures, rather than graph structures themselves. Being able to injectively encode graph structures is a very strong condition, as it might provide an efficient algorithm to solve the challenging graph isomorphism (GI) problem. Luckily, here what we really want to injectively encode are computations instead of structures, since we do not need to differentiate two different structures G_1 and G_2 as long as they represent the same computation. Figure 1 shows such an example. Our DVAE can identify that the two DAGs in Figure 1 actually represent the same computation by encoding them to the same vector, while encoders focusing on encoding structures might fail to capture the underlying computation and output different vectors. In Appendix G, we discuss more advantages of Theorem 2 for optimizing DAGs in the latent space.
To model and learn the injective functions A and U, we resort to neural networks, thanks to the universal approximation theorem [48]. For example, we can let A be a gated sum:

h_v^in = Σ_{u→v} g(h_u) ⊙ m(h_u),   (4)

where m is a mapping network and g is a gating network. Such a gated sum can model injective multiset functions [49], and is invariant to input order. To model the injective update function U, we can use a gated recurrent unit (GRU) [50], with h_v^in treated as the input hidden state:

h_v = GRU_e(x_v, h_v^in).   (5)

Here the subscript e denotes "encoding". Using a GRU also allows reducing our framework to traditional sequence-to-sequence modeling frameworks [51], as discussed in Section 3.3.
The above aggregation and update functions can be used to encode general computation graphs. For neural architectures, depending on how the outputs of multiple previous layers are aggregated as the input to the next layer, we make a modification to (4), which is discussed in Appendix E. For Bayesian networks, we also modify the encoding due to the special d-separation properties of Bayesian networks, as discussed in Appendix F.
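As a concrete (toy) rendering of the gated sum in equation (4), the following numpy sketch uses single random linear layers for the mapping network m and the gating network g; in DVAE these are learned networks and the update function is a GRU.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 4
W_m = rng.standard_normal((DIM, DIM))   # stands in for the mapping network m
W_g = rng.standard_normal((DIM, DIM))   # stands in for the gating network g

def gated_sum(pred_states):
    """h_in = sum_u g(h_u) * m(h_u); invariant to the order of the inputs."""
    h_in = np.zeros(DIM)                          # empty input -> all-zero vector,
    for h_u in pred_states:                       # as required for the starting node
        gate = 1.0 / (1.0 + np.exp(-(W_g @ h_u)))  # sigmoid gating
        h_in += gate * (W_m @ h_u)
    return h_in
```

Since the contributions are summed, permuting the predecessor states leaves the result unchanged, which is exactly the order-invariance required by Theorem 1.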
3.2 Decoding
We now describe how DVAE decodes latent vectors to DAGs (the generative part). The DVAE decoder uses the same asynchronous message passing scheme as the encoder to learn intermediate node and graph states. Similar to (5), the decoder uses another GRU, denoted by GRU_d, to update node hidden states during generation. Given the latent vector z to decode, we first use an MLP to map z to h_0, the initial hidden state fed to GRU_d. Then, the decoder constructs a DAG node by node. For the i-th generated node v_i, the following steps are performed:

1. Compute v_i's type distribution using an MLP (followed by a softmax) based on the current graph state h_G.

2. Sample v_i's type. If the sampled type is the ending type, stop the decoding, connect all loose ends (nodes without successors) to v_i, and output the DAG; otherwise, continue the generation.

3. Update v_i's hidden state by h_{v_i} = GRU_d(x_{v_i}, h_{v_i}^in), where h_{v_i}^in = h_0 if i = 1; otherwise, h_{v_i}^in is the aggregated message from v_i's predecessors' hidden states, given by equation (4).

4. For j = i−1, i−2, …, 1: (a) compute the edge probability of (v_j, v_i) using an MLP based on h_{v_j} and h_{v_i}; (b) sample the edge; and (c) if a new edge is added, update h_{v_i} using step 3.
The above steps are iteratively applied to each newly generated node, until step 2 samples the ending type. For every new node, we first predict its node type based on the current graph state, and then sequentially predict whether each existing node has a directed edge to it based on the existing and current nodes' hidden states. Figure 3 illustrates this process. Since edges always point to new nodes, the generated graph is guaranteed to be acyclic. Note that we maintain hidden states for both the current node and existing nodes, and keep updating them during the generation. For example, whenever step 4 samples a new edge between v_j and v_i, we update h_{v_i} to reflect the change of its predecessors and thus the change of the computation so far. Then, we use the new h_{v_i} for the next prediction. Such a dynamic updating scheme is flexible, computation-aware, and always uses the up-to-date state of each node to predict next steps. In contrast, methods based on RNNs [3, 13] do not maintain states for old nodes, and only use the current RNN state to predict the next step.
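The control flow of steps 1–4 can be sketched as follows, with toy random scores standing in for the learned type and edge MLPs and node-state updates omitted; the sketch shows how sampled edges always point from earlier to later nodes, which guarantees acyclicity. The edge probability of 0.5 is an arbitrary illustrative value.

```python
import numpy as np

def generate(n_types, end_type, max_nodes=10, seed=0):
    """Generate (types, edges); every edge (j, i) points to a later node."""
    rng = np.random.default_rng(seed)
    types, edges = [], []
    for i in range(max_nodes):
        t = int(rng.integers(n_types))             # steps 1-2: sample v_i's type
        if t == end_type:                          # ending type: connect all
            loose = [j for j in range(len(types))  # loose ends (no successors)
                     if not any(a == j for a, _ in edges)]
            edges += [(j, i) for j in loose]
            types.append(t)
            break
        types.append(t)
        for j in reversed(range(i)):               # step 4: candidate predecessors,
            if rng.random() < 0.5:                 # in reversed order i-1, ..., 0
                edges.append((j, i))
    return types, edges
```

Because a new node can only receive edges from already-generated nodes, any topological sort of the result is simply the generation order.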
In step 4, when sequentially predicting incoming edges from previous nodes, we choose the reversed order i−1, i−2, …, 1 instead of 1, 2, …, i−1 or any other order. This is based on the prior knowledge that a new node is more likely to first connect from the node immediately before it. For example, in neural architecture design, when adding a new layer, we often first connect it from the last added layer, and then decide whether there should be skip connections from other previous layers. Note, however, that this order is not fixed and can be adapted to specific applications.
3.3 Model extensions and discussion
Relation with RNNs. The DVAE encoder and decoder reduce to ordinary RNNs when the input DAGs are reduced to linked lists. Although we propose DVAE from a GNN perspective, our model can also be seen as a generalization of traditional sequence modeling frameworks [51, 52], where a time step depends only on the time step immediately before it, to the DAG case, where a time step has multiple previous dependencies. Similar ideas have been explored for trees [53, 17], a special type of DAG in which a node can have multiple incoming edges yet only one outgoing edge.
Bidirectional encoding. DVAE's encoding process can be seen as simulating how an input signal goes through a DAG, with h_v simulating the output signal at each node v. This is also known as forward propagation in neural networks. Inspired by the bidirectional RNN [54], we can use another GRU to reversely encode a DAG (i.e., reverse all edge directions and encode the DAG again), thus simulating the backward propagation too. After the reverse encoding, we get two ending states, which are concatenated and linearly mapped back to the original size as the final output state. We find that this bidirectional encoding improves the performance and convergence speed on neural architectures.
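The edge-reversal step can be sketched with a small helper, assuming a DAG is encoded as a dict mapping each node to its predecessor list with keys in topological order (an illustrative convention for this sketch, not the paper's data structure):

```python
def reverse(dag):
    """Reverse all edge directions of a DAG given as node -> predecessor list.

    The result's keys follow the reversed key order, which is a valid
    topological order of the reversed DAG.
    """
    rev = {v: [] for v in reversed(list(dag))}
    for v, preds in dag.items():
        for u in preds:
            rev[u].append(v)   # the edge u -> v becomes v -> u
    return rev
```

The reversed DAG can then be fed to the same encoding loop with the second GRU, and the two ending states concatenated.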
Incorporating vertex semantics. Note that DVAE currently uses a one-hot encoding of node types as x_v, which does not consider the semantic meanings of different node types. For example, a 3×3 convolution layer might be functionally very similar to a 5×5 convolution layer, while being functionally distinct from a max pooling layer. We expect incorporating such semantic meanings of node types to further improve DVAE's performance; for example, we could use pretrained embeddings of node types to replace the one-hot encoding. We leave this for future work.
4 Experiments
We validate the proposed DAG variational autoencoder (DVAE) on two DAG optimization tasks:

1. Neural architecture search. Our neural network dataset contains 19,020 neural architectures from the ENAS software [32]. Each neural architecture has 6 layers (excluding input and output layers) sampled from: 3×3 and 5×5 convolutions, 3×3 and 5×5 depthwise-separable convolutions [55], max pooling, and average pooling. We evaluate each neural architecture's weight-sharing accuracy [32] (a proxy of the true accuracy) on CIFAR-10 [56] as its performance measure. We split the dataset into 90% training and 10% held-out test sets. We use the training set for VAE training, and the test set only for evaluation. More details are in Appendix H.

2. Bayesian network structure learning. Our Bayesian network dataset contains 200,000 random 8-node Bayesian networks from the bnlearn package [57] in R. For each network, we compute the Bayesian Information Criterion (BIC) score to measure the performance of the network structure for fitting the Asia dataset [58]. We split the Bayesian networks into 90% training and 10% test sets. For more details, please refer to Appendix I.
Following [3], we do four experiments for each task:

1. Basic abilities of VAE models. In this experiment, we perform standard tests to evaluate the reconstructive and generative abilities of a VAE model for DAGs, including reconstruction accuracy, prior validity, uniqueness and novelty. We move the results of this part to Appendix M.1.

2. Predictive performance of latent representation. We test how well we can use the latent embeddings of neural architectures and Bayesian networks to predict their performances.

3. Bayesian optimization. This is the motivating application of DVAE. We test how well the learned latent space can be used for searching for high-performance DAGs through Bayesian optimization.

4. Latent space visualization. We visualize the latent space to qualitatively evaluate its smoothness.
Since there is little previous work on DAG generation, we compare DVAE with three generative baselines adapted for DAGs: SVAE, GraphRNN and GCN. Among them, SVAE [52] and GraphRNN [13] are adjacencymatrixbased methods, and GCN [22] uses simultaneous message passing to encode DAGs. We include more details about these baselines and discuss DVAE’s advantages over them in Appendix J. The training details are in Appendix K. All the code and data will be made publicly available.
4.1 Predictive performance of latent representation.
In this experiment, we evaluate how well the learned latent embeddings can predict the corresponding DAGs’ performances, which tests a VAE’s unsupervised representation learning ability. Being able to accurately predict a latent point’s performance also makes it much easier to search for highperformance points in this latent space. Thus, the experiment is also an indirect way to evaluate a VAE latent space’s suitability for DAG optimization. Following [3], we train a sparse Gaussian Process (SGP) regression model [59] with 500 inducing points on the training data’s embeddings to predict the unseen test data’s performances. We include the SGP training details in Appendix L.
We use two metrics to evaluate the predictive performance of the latent embeddings (given by the means of the posterior approximations). One is the RMSE between the SGP predictions and the true performances. The other is the Pearson correlation coefficient (Pearson's r), measuring how well the predicted and real performances tend to go up and down together. A small RMSE and a large Pearson's r indicate better predictive performance. Table 1 shows the results. All experiments are repeated 10 times, and the means and standard deviations are reported.
Methods   | Neural architectures       | Bayesian networks
          | RMSE         Pearson's r   | RMSE         Pearson's r
DVAE      | 0.384±0.002  0.920±0.001   | 0.300±0.004  0.959±0.001
SVAE      | 0.478±0.002  0.873±0.001   | 0.369±0.003  0.933±0.001
GraphRNN  | 0.726±0.002  0.669±0.001   | 0.774±0.007  0.641±0.002
GCN       | 0.832±0.001  0.527±0.001   | 0.421±0.004  0.914±0.001
From Table 1, we find that both the RMSE and Pearson's r of DVAE are significantly better than those of the other models. A possible explanation is that DVAE encodes the computation, which is directly related to a DAG's performance. SVAE follows closely, achieving the second best performance. GraphRNN and GCN have less satisfying performances in this experiment. The better predictive power of DVAE's latent space suggests that performing Bayesian optimization in it may be more likely to find high-performance points.
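For reference, the two metrics reported in Table 1 can be computed directly in numpy (y are true performances, y_hat are the SGP predictions):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between predictions and true values."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def pearson_r(y, y_hat):
    """Pearson correlation coefficient between predictions and true values."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    yc, pc = y - y.mean(), y_hat - y_hat.mean()
    return float((yc @ pc) / np.sqrt((yc @ yc) * (pc @ pc)))
```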
Figure 6: Great-circle interpolation starting from a point and returning to itself. Upper: DVAE. Lower: SVAE.
4.2 Bayesian optimization
We perform Bayesian optimization using the two best models, DVAE and SVAE, validated by previous experiments. Based on the SGP model from the last experiment, we perform 10 iterations of batch Bayesian optimization, and average results across 10 trials. A batch size of 50 and the expected improvement (EI) heuristic [60] are used, following Kusner et al. [3]. Concretely speaking, we start from the training data’s embeddings, and iteratively propose new points from the latent space that maximize the EI acquisition function. For each batch of selected points, we evaluate their decoded DAGs’ real performances and add them back to the SGP to select the next batch. Finally, we check the bestperforming DAGs found by each model to evaluate its DAG optimization performance.
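The EI acquisition used to score candidate latent points can be sketched as follows, assuming the surrogate (here the SGP) supplies a Gaussian posterior mean mu and standard deviation sigma at each candidate; this is the generic maximization-form EI, not the exact batch procedure of [60].

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, best):
    """EI for maximization: (mu-best)*Phi(z) + sigma*phi(z), z=(mu-best)/sigma."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)          # no uncertainty: plain improvement
    z = (mu - best) / sigma
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF
    phi = exp(-0.5 * z * z) / sqrt(2.0 * pi)  # standard normal PDF
    return (mu - best) * Phi + sigma * phi
```

A batch round then amounts to ranking candidates by this score, decoding and evaluating the top 50, and refitting the SGP with the new observations.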
Neural architectures. For neural architectures, we select the top 15 found architectures in terms of their weightsharing accuracies, and fully train them on CIFAR10’s train set to evaluate their true test accuracies. More details can be found in Appendix H. We show the 5 architectures with the highest true test accuracies in Figure 4. As we can see, DVAE in general found much better neural architectures than SVAE. Among the selected architectures, DVAE achieved a highest accuracy of 94.80%, while SVAE’s highest accuracy was only 92.79%. In addition, all the 5 architectures of DVAE have accuracies higher than 94%, indicating that DVAE’s latent space can stably find many highperformance architectures. Although not outperforming stateoftheart NAS techniques such as NAONet [38] (2.11% error rate on CIFAR10), our search space was much smaller, and we did not apply any data augmentation techniques nor did we copy multiple folds or add more filters after finding the architecture. We emphasize that in this paper, we mainly focus on idea illustration rather than record breaking, since achieving stateoftheart NAS results typically requires enormous computation resources beyond our capability. Nevertheless, DVAE does provide a promising new direction for neural architecture search based on graph generation, alternative to existing approaches.
Bayesian networks. We similarly report the top 5 Bayesian networks found by each model, ranked by their BIC scores, in Figure 5. DVAE generally found better Bayesian networks than SVAE. The best Bayesian network found by DVAE achieved a BIC of −11125.75, which is better than that of the best network in the training set, whose BIC is −11141.89 (a higher BIC score is better). Considering that BIC is in log scale, the probability of our found network explaining the data is actually about 1E7 times larger than that of the best training network. For reference, the true Bayesian network used to generate the Asia data has a BIC of −11109.74. Although we did not exactly recover the true network, our found network is close to it and outperforms all the training data. Our experiments show that searching in an embedding space is a promising direction for Bayesian network structure learning.
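For reference, the BIC score in the "higher is better" convention used here can be sketched as follows; the parameter-counting rule for discrete Bayesian networks is standard, but the function names and dict-based encoding are our illustrative choices.

```python
from math import log

def bic(log_likelihood, n_params, n_samples):
    """BIC in the 'higher is better' convention: LL - (log N / 2) * k."""
    return log_likelihood - 0.5 * log(n_samples) * n_params

def n_free_params(card, parents):
    """Free parameters of a discrete Bayesian network.

    card: node -> number of states; parents: node -> list of parent nodes.
    Each node contributes (product of parent cardinalities) * (states - 1).
    """
    total = 0
    for v, ps in parents.items():
        q = 1
        for p in ps:
            q *= card[p]                 # number of parent configurations
        total += q * (card[v] - 1)
    return total
```

The penalty term grows with both the sample size and the number of edges (through the parent configurations), so denser networks need a correspondingly larger likelihood gain to score better.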
4.3 Latent space visualization
In this experiment, we visualize the latent spaces of the VAE models to get a sense of their smoothness.
For neural architectures, we visualize the decoded architectures from points along a great circle in the latent space. We start from the latent embedding of a straight network without skip connections. Imagine this point as a point on the surface of a sphere. We randomly pick a great circle that starts from this point, goes around the sphere, and returns to itself. Along this circle, we evenly pick 35 points and visualize their decoded networks in Figure 6. As we can see, both DVAE and SVAE show relatively smooth interpolations, changing only a few node types or edges each time. Visually speaking, SVAE's structural changes are even smoother. This is because SVAE treats DAGs purely as strings, thus tending to embed DAGs with small differences in their string representations to similar regions of the latent space, without considering their computational differences (see Appendix J). In contrast, DVAE models computations, and focuses more on smoothness w.r.t. computation rather than structure.
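The great-circle walk can be sketched as follows; picking a random orthogonal direction and keeping the starting point's norm as the sphere radius are illustrative conventions for this sketch.

```python
import numpy as np

def great_circle(z0, n_points, seed=0):
    """Evenly spaced points on a random great circle through latent point z0."""
    rng = np.random.default_rng(seed)
    r = np.linalg.norm(z0)
    u = z0 / r
    v = rng.standard_normal(z0.shape)
    v -= (v @ u) * u                    # orthogonalize against z0 ...
    v /= np.linalg.norm(v)              # ... and normalize
    angles = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return [r * (np.cos(a) * u + np.sin(a) * v) for a in angles]
```

Each returned point would then be decoded to a DAG for visualization; the first point is z0 itself and the circle closes back onto it.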
For Bayesian networks, we aim to directly visualize the BIC score distribution over the latent space. To do so, we reduce its dimensionality by choosing the 2-D subspace of the latent space spanned by the first two principal components of the training data's embeddings. In this low-dimensional subspace, we compute the BIC scores of all points evenly spaced within a grid and visualize the scores using a colormap in Figure 7. As we can see, DVAE better differentiates high-score points from low-score ones and shows more smoothly changing BIC scores, while SVAE shows sharp boundaries and mixes high-score and low-score points more severely. We suspect this helps Bayesian optimization find high-performance Bayesian networks more easily in DVAE's latent space.
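The 2-D principal-component subspace and grid can be sketched as below; the grid size and span are arbitrary illustrative choices, and each grid point would be decoded and BIC-scored in practice.

```python
import numpy as np

def pca_grid(Z, grid_size, span=2.0):
    """Evenly spaced grid in the plane of the first two principal components.

    Z: (n_points, latent_dim) array of training embeddings.
    """
    mean = Z.mean(axis=0)
    # right singular vectors of the centered data are the principal directions
    _, _, Vt = np.linalg.svd(Z - mean, full_matrices=False)
    basis = Vt[:2]                                  # shape (2, latent_dim)
    ticks = np.linspace(-span, span, grid_size)
    return [mean + a * basis[0] + b * basis[1]
            for a in ticks for b in ticks]
```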
5 Conclusion
In this paper, we have proposed DVAE, a deep generative model for DAGs. DVAE uses a novel asynchronous message passing scheme to explicitly model computations on DAGs. By performing Bayesian optimization in DVAE's latent spaces, we offer promising new directions to two important problems, neural architecture search and Bayesian network structure learning. We hope DVAE can inspire more research on extending graph generative models' applications to structure optimization.
References
 Koller and Friedman [2009] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
 Gómez-Bombarelli et al. [2018] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.
 Kusner et al. [2017] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In International Conference on Machine Learning, pages 1945–1954, 2017.
 Kusner and Hernández-Lobato [2016] Matt J Kusner and José Miguel Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv preprint arXiv:1611.04051, 2016.
 Gaunt et al. [2016] Alexander L Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow. TerpreT: A probabilistic programming language for program induction. arXiv preprint arXiv:1608.04428, 2016.
 Li et al. [2018] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.
 Wu et al. [2019] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
 Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Dai et al. [2018] Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786, 2018.
 Weininger [1988] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.
 Simonovsky and Komodakis [2018] Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. arXiv preprint arXiv:1802.03480, 2018.
 You et al. [2018a] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep autoregressive models. In International Conference on Machine Learning, pages 5694–5703, 2018a.
 De Cao and Kipf [2018] Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.
 Bojchevski et al. [2018] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. NetGAN: Generating graphs via random walks. arXiv preprint arXiv:1803.00816, 2018.
 Ma et al. [2018] Tengfei Ma, Jie Chen, and Cao Xiao. Constrained generation of semantically valid graphs via regularizing variational autoencoders. In Advances in Neural Information Processing Systems, pages 7113–7124, 2018.
 Jin et al. [2018] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In Proceedings of the 35th International Conference on Machine Learning, pages 2323–2332, 2018.
 Liu et al. [2018a] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L Gaunt. Constrained graph variational autoencoders for molecule design. arXiv preprint arXiv:1805.09076, 2018a.
 You et al. [2018b] Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems, pages 6412–6422, 2018b.
 Duvenaud et al. [2015] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán AspuruGuzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
 Li et al. [2015] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
 Kipf and Welling [2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

 Niepert et al. [2016] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.
 Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

 Zhang et al. [2018] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Zhang and Chen [2018] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pages 5165–5175, 2018.
 Zoph and Le [2016] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
 Real et al. [2017] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041, 2017.
 Elsken et al. [2017] Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528, 2017.

 Zoph et al. [2018] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.
 Liu et al. [2018b] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018b.
 Pham et al. [2018] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
 Hutter et al. [2018] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, editors. Automatic Machine Learning: Methods, Systems, Challenges. Springer, 2018. In press, available at http://automl.org/book.
 Kandasamy et al. [2018] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric Xing. Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, 2018.
 Liu et al. [2017] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.
 Miikkulainen et al. [2019] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Daniel Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, et al. Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pages 293–312. Elsevier, 2019.
 Cai et al. [2018] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
 Luo et al. [2018] Renqian Luo, Fei Tian, Tao Qin, En-Hong Chen, and Tie-Yan Liu. Neural architecture optimization. In Advances in neural information processing systems, 2018.

 Chow and Liu [1968] C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
 Gao et al. [2017] Tian Gao, Kshitij Fadnis, and Murray Campbell. Local-to-global Bayesian network structure learning. In International Conference on Machine Learning, pages 1193–1202, 2017.
 Gao and Wei [2018] Tian Gao and Dennis Wei. Parallel Bayesian network structure learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1685–1694, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/gao18b.html.
 Linzner and Koeppl [2018] Dominik Linzner and Heinz Koeppl. Cluster variational approximations for structure learning of continuous-time Bayesian networks from incomplete data. In Advances in Neural Information Processing Systems, pages 7891–7901, 2018.
 Chickering [1996] David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from data, pages 121–130. Springer, 1996.
 Singh and Moore [2005] Ajit P. Singh and Andrew W. Moore. Finding optimal Bayesian networks by dynamic programming, 2005.
 Yuan et al. [2011] Changhe Yuan, Brandon Malone, and Xiaojian Wu. Learning optimal Bayesian networks using A* search. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI'11, pages 2186–2191. AAAI Press, 2011. ISBN 978-1-57735-515-1. doi: 10.5591/978-1-57735-516-8/IJCAI11-364. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-364.
 Yuan and Malone [2013] Changhe Yuan and Brandon Malone. Learning optimal Bayesian networks: A shortest path perspective. Journal of Artificial Intelligence Research, 48(1):23–65, October 2013. ISSN 1076-9757. URL http://dl.acm.org/citation.cfm?id=2591248.2591250.
 Chickering et al. [1995] Do Chickering, Dan Geiger, and David Heckerman. Learning Bayesian networks: Search methods and experimental results. In Proceedings of Fifth Conference on Artificial Intelligence and Statistics, pages 112–128, 1995.
 Hornik et al. [1989] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
 Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
 Cho et al. [2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 Bowman et al. [2015] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 Tai et al. [2015] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
 Schuster and Paliwal [1997] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
 Chollet [2017] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2017.
 Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Scutari [2010] Marco Scutari. Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, Articles, 35(3):1–22, 2010. ISSN 1548-7660. doi: 10.18637/jss.v035.i03. URL https://www.jstatsoft.org/v035/i03.
 Lauritzen and Spiegelhalter [1988] Steffen L Lauritzen and David J Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society. Series B (Methodological), pages 157–224, 1988.
 Snelson and Ghahramani [2006] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in neural information processing systems, pages 1257–1264, 2006.
 Jones et al. [1998] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492, 1998.
 Zöller and Huber [2019] MarcAndré Zöller and Marco F Huber. Survey on automated machine learning. arXiv preprint arXiv:1904.12054, 2019.
 Mueller et al. [2017] Jonas Mueller, David Gifford, and Tommi Jaakkola. Sequence to better sequence: continuous revision of combinatorial structures. In International Conference on Machine Learning, pages 2536–2544, 2017.
 Fusi et al. [2018] Nicolo Fusi, Rishit Sheth, and Melih Elibol. Probabilistic matrix factorization for automated machine learning. In Advances in Neural Information Processing Systems, pages 3352–3361, 2018.
 Yackley and Lane [2012] Benjamin Yackley and Terran Lane. Smoothness and Structure Learning by Proxy. In International Conference on Machine Learning, 2012.
 Anderson and Lane [2009] Blake Anderson and Terran Lane. Fast Bayesian network structure search using Gaussian processes. 2009. Available at https://www.cs.unm.edu/ treport/tr/0906/paper.pdf.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Appendix A More Related Work
Both neural architecture search (NAS) and Bayesian network structure learning (BNSL) are subfields of AutoML. See Zöller and Huber [61] for a survey. We have given a brief overview of NAS and BNSL in section 2. Below we discuss several works most related to our work in detail.
Luo et al. [38] proposed a novel NAS approach called Neural Architecture Optimization (NAO). The basic idea is to jointly learn an encoder-decoder between networks and a continuous space, along with a performance predictor that maps the continuous representation of a network to its performance on a given dataset; they then perform two or three iterations of gradient descent guided by the performance predictor to find better architectures in the continuous space, which are decoded back into real networks for evaluation. This methodology is similar to that of Gómez-Bombarelli et al. [2] and Jin et al. [17] for molecule optimization, and also to Mueller et al. [62] for slightly revising a sentence.
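This kind of predictor-guided latent-space search can be sketched as follows. This is a minimal illustration with hypothetical names (`predictor_grad` stands in for differentiating a trained predictor network), not NAO's actual implementation:

```python
def latent_gradient_search(z, predictor_grad, lr=0.01, steps=3):
    """Take a few gradient steps on a latent code z in the direction that
    increases the predicted performance.  `predictor_grad(z)` is assumed to
    return the gradient of the performance predictor w.r.t. z."""
    for _ in range(steps):
        g = predictor_grad(z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]  # gradient ascent step
    return z
```

The updated latent code would then be decoded back into an architecture for evaluation.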
There are several key differences compared to our approach. First, they use strings (e.g., “node2 conv 3x3 node1 maxpooling 3x3”) to represent neural architectures, whereas we directly use graph representations, which are more natural and generalize to other graphs such as Bayesian network structures. Second, they use supervised instead of unsupervised learning. This means they must first evaluate a considerable number of randomly sampled graphs on a typically large dataset (e.g., train many neural networks) and use these results to supervise the training of the autoencoder; given a new dataset, the autoencoder needs to be completely retrained. In contrast, we train our variational autoencoder in a fully unsupervised manner, so the model is general-purpose.
Fusi et al. [63] proposed a novel AutoML algorithm that also uses model embedding, but with a matrix factorization approach. They first construct a matrix of the performance of thousands of ML pipelines on hundreds of datasets; then they apply probabilistic matrix factorization to obtain latent representations of the pipelines. Given a new dataset, Bayesian optimization with the expected improvement heuristic is used to find the best pipeline. This approach only allows choosing from predefined off-the-shelf ML models, so its flexibility is somewhat limited.
Kandasamy et al. [34] use Bayesian optimization for NAS: they define a kernel that measures the similarity between networks by solving an optimal transport problem, and in each iteration they use evolutionary heuristics to generate a set of candidate networks by making small modifications to existing networks, then use expected improvement to choose the next one to evaluate. This work is similar to ours in its application of Bayesian optimization. However, defining a kernel to measure the similarity between discrete structures is a nontrivial problem. In addition, the discrete search space is heuristically extrapolated near existing architectures, which makes the search essentially local. In contrast, we directly fit a Gaussian process over the entire continuous latent space, enabling more global optimization.
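The expected improvement acquisition used in both approaches has a closed form given a Gaussian process posterior. A minimal sketch (for maximization; `xi` is an optional exploration margin, our notation):

```python
import math

def expected_improvement(mu, sigma, best_y, xi=0.01):
    """EI at a candidate point with GP posterior mean `mu` and standard
    deviation `sigma`; `best_y` is the best objective value seen so far."""
    if sigma == 0.0:
        return 0.0
    z = (mu - best_y - xi) / sigma
    # Standard normal pdf and cdf.
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best_y - xi) * cdf + sigma * pdf
```

The candidate (a network for Kandasamy et al., a latent point for us) maximizing this quantity is evaluated next.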
Using Gaussian processes (GPs) for Bayesian network structure learning has also been studied before. Yackley and Lane [64] analyzed the smoothness of the BDe score, showing that a local change (e.g., adding an edge) changes the score by at most a quantity depending on the number of training points, and proposed using a GP as a proxy for the score to accelerate the search. Anderson and Lane [65] used a GP to model the BDe score, and showed that using the probability of improvement to guide the local search outperforms hill climbing. However, these methods still operate heuristically and locally in the discrete space, whereas our latent space makes both local and global methods, such as gradient descent and Bayesian optimization, applicable in a principled manner.
Appendix B Computation vs. Function
In section 3 we defined computations. Here we discuss the difference between a computation and a function. A computation defines a function; however, two distinct computations can define the same function. In other words, a computation is (informally speaking) a process that describes the course of how the input is transformed into the output, while a function is a mapping that only cares about the result. Different computations can define the same function.
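A toy illustration of this distinction (our example, not the paper's):

```python
# Two different computations that define the same function f(x) = 2x.
comp_a = lambda x: x * 2          # x --(*2)--> 2x
comp_b = lambda x: (x * 4) / 2    # x --(*4)--> 4x --(/2)--> 2x

# As functions, they agree on every input...
assert all(comp_a(x) == comp_b(x) for x in range(100))
# ...but as computations they differ: comp_b performs two operations and
# produces the intermediate value 4x, which comp_a never computes.
```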
Sometimes the same computation can also define different functions; e.g., two identical neural architectures will represent different functions if they are trained differently (since the weights of their layers will differ). In D-VAE, we model computations instead of functions, since 1) modeling functions is much harder than modeling computations (it requires understanding the semantic meaning of each operation, such as the cancelling out of inverse operations), and 2) modeling functions additionally requires knowing the parameters of some operations, which are unknown before training.
Note also that in Definition 1 we only allow a single input signal, while in the real world a computation sometimes has multiple initial input signals. However, the multiple-input case can be reduced to the single-input case by adding an initial assignment operation that distributes the combined input signal to the corresponding next-level operations. For ease of presentation, we uniformly assume a single input throughout the paper.
Appendix C Proof of Theorem 1
Consider the starting node, which has no predecessors. By assumption, it is the single starting node no matter how we permute the nodes of the input DAG. For the starting node, the aggregation function always outputs a zero vector, and is therefore invariant to node permutations. Consequently, the starting node's hidden state is also invariant to node permutations.
Now we prove the theorem by structural induction. Consider a node v, and suppose that for every predecessor u of v, the hidden state h_u is invariant to node permutations. We will show that h_v is also invariant to node permutations. Notice that in (3), the output of the aggregation function is invariant to node permutations, since it is invariant to the order of its inputs and all the predecessor states h_u are invariant to node permutations. Consequently, node v's hidden state h_v is invariant to node permutations. By induction, every node's hidden state is invariant to node permutations, including the ending node's hidden state. Thus, the D-VAE encoder is invariant to node permutations. ∎
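The order-insensitivity of the aggregation step can be checked numerically. Below is a toy scalar-valued gated sum (our simplification; the paper's gated sum uses learned gating and mapping networks):

```python
import math
import random

def gated_sum(states, w_gate, w_map):
    """Toy gated-sum aggregator over scalar hidden states: each state is
    passed through a sigmoid gate and a linear map before summing, so the
    result depends only on the multiset of inputs, not their order."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    return sum(sigmoid(w_gate * h) * (w_map * h) for h in states)

states = [0.3, -1.2, 0.7, 2.5]
shuffled = states[:]
random.shuffle(shuffled)
# The aggregate is a function of the multiset of predecessor states only.
assert abs(gated_sum(states, 0.5, 1.3) - gated_sum(shuffled, 0.5, 1.3)) < 1e-9
```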
Appendix D Proof of Theorem 2
Suppose an arbitrary input signal is fed to the starting node. For convenience, we will refer to the composition of all the operations along the paths from the starting node to a vertex as that vertex's computation, and to the result of applying this computation to the input signal as the vertex's output signal.
For the starting node, recall that we feed a fixed vector to (2), so its hidden state is fixed. Since the starting node also represents a fixed input operation, the mapping from its computation to its hidden state is injective. We now prove the theorem by induction: assume that the mapping from computation to hidden state is injective for every predecessor of a node; we will show that it is also injective for the node itself.
By the induction hypothesis, each predecessor's hidden state injectively encodes that predecessor's computation. Consider the node's output signal, which is given by feeding its predecessors' output signals to the node's operation. Thus,
(6) 
In other words, the node's computation can be written as
(7) 
where an injective function defines the composite computation based upon the node's operation and its predecessors' computations. Note that the multiset of predecessor computations can be either unordered or ordered depending on the operation: if the operation is symmetric, such as addition or multiplication, the multiset can be unordered; if the operation is something like subtraction or division, the multiset must be ordered.
With (2) and (3), we can write the hidden state as follows:
(8) 
where the one-hot encoding function mapping each operation type to its feature vector is injective. All the functions composed in the above equation are injective. Since the composition of injective functions is injective, there exists an injective function such that
(9) 
Then combining (7) we have:
(10) 
The resulting mapping is injective since the composition of injective functions is injective. Thus, we have proved that the mapping from the node's computation to its hidden state is injective. ∎
Appendix E Modifications for Encoding Neural Architectures
According to Theorem 2, to ensure that D-VAE injectively encodes computations, we need the aggregation function to be injective. Recall that the aggregation function takes the multiset of predecessor hidden states as input. If the order of the elements does not matter, the gated sum in (4) can model such an injective multiset function without issues. However, if the order matters (i.e., permuting the elements changes the output), we need a different aggregation function that can encode such orders.
Whether the order should matter for the aggregation function depends on whether the input order matters for the operations (see the proof of Theorem 2 for more details). For example, if multiple previous layers' outputs are summed or averaged as the input to a subsequent layer of the neural network, the aggregation can be modeled by the gated sum in (4), since the order of the inputs does not matter. However, if these outputs are concatenated as the next layer's input, the order does matter. In our experiments, the neural architectures use the second way of aggregating outputs from previous layers, and the order of concatenation is determined by a global ordering of the layers in a neural architecture. For example, if layer 2's and layer 4's outputs are both input to layer 5, then layer 2's output precedes layer 4's output in the concatenation.
Since the gated sum in (4) can only handle the unordered case, we slightly modify (4) to make it order-aware and thus more suitable for our neural architectures. Our scheme is as follows:
(11) 
where each predecessor's hidden state is concatenated with the one-hot encoding of that layer's global ID (1, 2, 3, …). Such an aggregation function respects the concatenation order of the layers. We empirically observed that this aggregation function improves D-VAE's performance on neural architectures compared to the plain aggregation function (4); however, even using (4), D-VAE still outperformed all baselines.
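The effect of concatenating a global-ID one-hot can be illustrated with a toy version of such an order-aware gated sum. This is our simplification, with scalar stand-ins for the learned gating and mapping networks:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]

def aggregate(pred_states, pred_ids, num_layers, w_gate=1.0, w_map=1.0):
    """Toy order-aware gated sum: each predecessor state h_u is concatenated
    with the one-hot encoding o_u of its global layer ID before gating.
    The gate here depends on the whole concatenated vector, mimicking a
    learned gating network that mixes state and ID information."""
    dim = len(pred_states[0]) + num_layers
    total = [0.0] * dim
    for h, i in zip(pred_states, pred_ids):
        v = h + one_hot(i, num_layers)   # concatenation [h_u ; o_u]
        g = sigmoid(w_gate * sum(v))     # gate sees state AND position
        total = [t + g * w_map * x for t, x in zip(total, v)]
    return total

h_a, h_b = [0.5], [-0.3]
# Same multiset of hidden states, but assigned to different global layer
# IDs (i.e., different concatenation positions) -> different aggregates.
agg1 = aggregate([h_a, h_b], [2, 4], num_layers=5)
agg2 = aggregate([h_a, h_b], [4, 2], num_layers=5)
assert agg1 != agg2
```

A plain gated sum over the bare states would return identical results for both assignments, losing the concatenation order.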
Appendix F Modifications for Encoding Bayesian Networks
We also make some modifications when encoding Bayesian networks. One modification is that the aggregation function (4) is changed to:
(12) 
Compared to (4), we replace the hidden states of a node's predecessors with their node type features. This is due to the differences between computations on a neural architecture and on a Bayesian network. In a neural network, the signal flow follows the network architecture: the output signal of a layer is fed as the input signal to its succeeding layers, and what we are interested in is the result output by the final layer. In contrast, for a Bayesian network, the graph represents a set of conditional dependencies among variables rather than a computational flow. In particular, for Bayesian network structure learning we are often concerned with computing the (log) marginal likelihood score of a dataset given a graph structure, which decomposes into local scores of individual variables given their parents (see Definition 18.2 in Koller and Friedman [1]). For example, in Figure 8 the overall score decomposes into a sum of such local scores, one per variable. To compute the score for a variable, we only need the values of the variable and its parents; its grandparents have no influence on this local score.
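This decomposability can be illustrated with a toy score computation. Our example uses the MLE log-likelihood of empirical conditionals (a real BIC score would subtract a complexity penalty per family):

```python
import math
from collections import Counter

def family_loglik(data, child, parents):
    """Decomposed log-likelihood term for one variable given its parents,
    computed from empirical conditional probabilities.  It reads only the
    columns of `child` and `parents`, so grandparents cannot affect it."""
    joint = Counter(tuple(row[i] for i in parents + [child]) for row in data)
    marg = Counter(tuple(row[i] for i in parents) for row in data)
    return sum(c * math.log(c / marg[key[:-1]]) for key, c in joint.items())

# Toy data over variables (0, 1, 2); rows are binary assignments.
data = [(0, 0, 0), (0, 1, 1), (1, 1, 1), (1, 0, 1), (0, 0, 0), (1, 1, 0)]

# The term for variable 2 with parent 1 ignores variable 0 entirely.
t = family_loglik(data, child=2, parents=[1])  # a (negative) log-likelihood
```

Summing the family terms along a factorization recovers the full log-likelihood of the data under that factorization, which is exactly the decomposition exploited by structure-learning scores.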
Based on this intuition, when computing the hidden state of a node, we use its parents' type features instead of their hidden states, which “d-separates” the node from its grandparents. For the update function, we still use (5).
Also based on the decomposability of the score, we make another modification when encoding Bayesian networks: we use the sum of all node states as the final output state, instead of only the ending node's state. Similarly, when decoding Bayesian networks, the graph state is taken to be the sum of all node states.
Note that the combination of (12) and (5) can injectively model the conditional dependence between a node and its parents. In addition, summing can model injective set functions [49, Lemma 5]. Therefore, the above encoding scheme is able to injectively encode the complete set of conditional dependencies of a Bayesian network, and thus also the network's overall score function.
Appendix G Advantages of Encoding Computations in DAG Optimization
Here we discuss why D-VAE's ability to injectively encode computations (Theorem 2) greatly benefits DAG optimization in the latent space. Our target is to find a DAG that achieves high performance (e.g., the accuracy of a neural network, or the BIC score of a Bayesian network) on a given dataset. The performance of a DAG is directly related to its computation: for example, given the same set of layer parameters, two neural networks with the same computation will have the same performance on a given test set. Since D-VAE encodes computations instead of structures, it can embed DAGs with similar performance in the same regions of the latent space, rather than embedding DAGs with merely similar structural patterns together. Consequently, the latent space can be smooth w.r.t. performance instead of structure. Such smoothness greatly facilitates searching for high-performance DAGs in the latent space, since similar-performing DAGs tend to be located near each other rather than randomly, and a smoothly changing performance surface is much easier to model.
Note that Theorem 2 is a necessary condition for the latent space to be smooth w.r.t. performance: if D-VAE could not injectively encode computations, it might map two DAGs representing completely different computations to the same encoding, making that point of the latent space arbitrarily unsmooth. Although there is as yet no theoretical guarantee that the latent space must be smooth w.r.t. DAG performance, we empirically observe that both the predictive performance and the Bayesian optimization performance of D-VAE's latent space are significantly better than those of the baselines, which is indirect evidence that D-VAE's latent space is smoother w.r.t. performance. Our visualization results also confirm the smoothness. See sections 4.1, 4.2, and 4.3 for details.
Appendix H More Details about Neural Architecture Search
We use the efficient neural architecture search (ENAS) software [32] to generate the training and testing neural architectures. With these seed architectures, we can train a VAE model and then search for new high-performance architectures in the latent space.
ENAS alternately trains two components: 1) an RNN-based controller that proposes new architectures, and 2) the shared weights of the proposed architectures. It uses a weight-sharing (WS) scheme to obtain a quick but rough estimate of how good an architecture is: all proposed architectures are forced to use the same set of shared weights, instead of fully training each neural network individually. The assumption is that an architecture with a high validation accuracy under the shared weights (i.e., a high weight-sharing accuracy) is more likely to have a high test accuracy after fully retraining its weights from scratch.
We first run ENAS in the macro space (section 2.3 of Pham et al. [32]) for 1,000 epochs, with 20 architectures proposed in each epoch. For all the proposed architectures excluding the first 1,000 burn-in ones, we evaluate their weight-sharing accuracies using the shared weights from the last epoch. We further split the data into 90% training and 10% held-out test sets. Our task then becomes training a VAE on the training neural architectures and generating new high-performance architectures from the latent space via Bayesian optimization. Note that our target performance measure here is the weight-sharing accuracy, not the true validation/test accuracy after fully retraining the architecture: the weight-sharing accuracy takes around 0.5 seconds to evaluate, while fully training a network takes over 12 hours. In consideration of our limited computational resources, we choose the weight-sharing accuracy as the optimization target in the Bayesian optimization experiments.
After Bayesian optimization finds a final set of architectures with high weight-sharing accuracies, we fully train them to evaluate their true test accuracies on CIFAR-10. To fully train an architecture, we follow the original setting of ENAS: each architecture is trained on CIFAR-10's training set for 310 epochs, and we report the test accuracy of the last epoch's network. See [32, section 3.2] for details.
Due to our constrained computational resources, we choose not to perform Bayesian optimization on the true validation accuracy (after full training), which would be a more principled way of searching neural architectures. Nevertheless, we describe the procedure here for future exploration: after training the D-VAE, we have no architectures at all to initialize a Gaussian process regression on the true validation accuracy. Thus, we need to randomly pick some points in the latent space, decode them into neural architectures, and obtain their true validation accuracies after full training. With these initial points, we then run Bayesian optimization as in section 4.2, with the optimization target replaced by the true validation accuracy. Finally, we would report the true test accuracies of the architectures with the highest true validation accuracies. This experiment would take much longer (months of GPU time), so parallelizing the training is essential.
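The procedure just described can be sketched as a loop (a skeleton with hypothetical names: `decode`, `evaluate`, and `propose_latent` stand in for the trained decoder, full training plus validation, and the GP + expected-improvement step, respectively):

```python
import random

def bayesian_optimization_loop(decode, evaluate, propose_latent,
                               latent_dim=16, n_init=10, n_iter=50):
    """Skeleton of latent-space Bayesian optimization: initialize with
    randomly decoded points, then repeatedly let an acquisition step
    propose the next latent point given the evaluation history."""
    history = []
    # Initialize by decoding random latent points and evaluating them.
    for _ in range(n_init):
        z = [random.gauss(0.0, 1.0) for _ in range(latent_dim)]
        history.append((z, evaluate(decode(z))))
    # Bayesian optimization iterations.
    for _ in range(n_iter):
        z = propose_latent(history)
        history.append((z, evaluate(decode(z))))
    # Return the best (latent point, score) pair found.
    return max(history, key=lambda t: t[1])
```

In the actual experiment, `evaluate` is the expensive step (full training), which is why parallelizing it matters.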
One might wonder why we train another generative model when we already have ENAS. Firstly, ENAS is not general-purpose but task-specific: it leverages validation accuracy signals to train its controller via reinforcement learning, so for any new NAS task ENAS must be completely retrained. In contrast, D-VAE is unsupervised: it only needs to be trained once and can be applied to other NAS tasks. Secondly, D-VAE also provides a way to learn neural architecture embeddings, which can be used for downstream tasks such as visualization, classification, clustering, etc.
In the Bayesian optimization experiments (section 4.2), the best architecture found by D-VAE achieves a test accuracy of 94.80% on CIFAR-10. Although this does not outperform state-of-the-art NAS techniques, which reach an error rate of 2.11%, our architecture contains only 3 million parameters, compared to the 128 million parameters of the state-of-the-art NAONet + Cutout [38]. In addition, NAONet used 200 GPUs to fully train 1,000 architectures for one day, stacked the final found cell 6 times, and added 4 times more filters after optimization. In comparison, we used only 1 GPU to evaluate weight-sharing accuracies, and never used any data augmentation techniques or architecture stacking to boost performance, since achieving new state-of-the-art NAS results (through great resources and heavy engineering) is beyond the main purpose of our paper.
Appendix I More Details about Bayesian Network Structure Learning
We consider a small synthetic problem called Asia [58] as our target Bayesian network structure learning problem. The Asia dataset is composed of 5,000 samples, each generated by a true network with 8 binary variables (see http://www.bnlearn.com/documentation/man/asia.html). The Bayesian Information Criterion (BIC) score is used to evaluate how well a Bayesian network fits the 5,000 samples. To train a VAE model to generate Bayesian network structures, we sample 200,000 random 8-node Bayesian networks using the bnlearn package [57] in R, split into 90% training and 10% testing sets. Our task is to train a VAE model on the training Bayesian networks and search the latent space for Bayesian networks with high BIC scores using Bayesian optimization. In this task, we consider a simplified setting where the topological order of the true network is known: we let the sampled training and test Bayesian networks have topological orders consistent with the true network of Asia. This is a reasonable assumption for many practical applications, e.g., when the variables have a temporal order [1]. When sampling a network, the probability of a node having an edge with a previous node (as specified by the order) is set to the package's default of 2/(n−1), where n is the number of nodes, which results in sparse graphs whose number of edges is of the same order as the number of nodes.
Appendix J Baselines
As discussed in the related work, there are other types of graph generative models that could potentially work for DAGs. We explore three possible approaches and contrast them with D-VAE.
SVAE. The SVAE baseline treats a DAG as a sequence of node strings; we call it the string-based variational autoencoder (SVAE). In SVAE, each node is represented as the one-hot encoding of its type number concatenated with a 0/1 indicator vector marking which previous nodes have directed edges to it (i.e., a column of the adjacency matrix). For example, suppose there are two node types and five nodes; then node 4's string “0 1, 0 1 1 0 0” means this node has type 2 and has directed edges from previous nodes 2 and 3. SVAE applies a standard GRU-based RNN variational autoencoder [52] to the topologically sorted node sequences, with each node's string treated as its input bit vector.
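The node-string construction can be sketched as follows (our helper; we use 0-based type indices, so the paper's "type 2" is index 1):

```python
def svae_node_strings(node_types, adj_cols, num_types):
    """Build SVAE's per-node bit vector: the one-hot encoding of the node's
    type concatenated with a 0/1 vector marking which previous nodes have a
    directed edge to it (a column of the adjacency matrix)."""
    strings = []
    for t, col in zip(node_types, adj_cols):
        one_hot = [1 if k == t else 0 for k in range(num_types)]
        strings.append(one_hot + col)
    return strings

# The paper's example: two node types, five nodes; node 4 has type 2 and
# incoming edges from nodes 2 and 3, giving the string "0 1, 0 1 1 0 0".
s = svae_node_strings([1], [[0, 1, 1, 0, 0]], num_types=2)[0]
assert s == [0, 1, 0, 1, 1, 0, 0]
```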
GraphRNN. One similar generative model is GraphRNN [13]. Different from SVAE, it further decomposes an adjacency column into entries and generates the entries one by one using an additional edge-level GRU. GraphRNN is a pure generative model without an encoder, and thus cannot optimize DAG performance in a latent space. To compare with GraphRNN, we equip it with SVAE's encoder and use it as another baseline. Note that the original GraphRNN feeds nodes in a BFS order (for undirected graphs), yet we find that this performs much worse than using a topological order here. Note also that although GraphRNN seems more expressive than SVAE, we find that in our applications GraphRNN tends to overfit more severely and generates less diverse DAGs.
Both GraphRNN and SVAE treat DAGs as bit strings and use RNNs to model them. This representation has several drawbacks. Firstly, since the topological ordering of a DAG is often not unique, there might be multiple string representations for the same DAG, all resulting in different encoded representations. This violates the permutation invariance in Theorem 1. Secondly, the string representations can be very brittle in terms of modeling DAGs' computational purposes. In Figure 9, the left and right DAGs' string representations differ by only two bits, i.e., the edge (2,3) in the left is changed to the edge (1,3) in the right. However, this two-bit change in structure greatly alters the signal flow, so the right DAG computes an entirely different function. In SVAE and GraphRNN, since the bit representations of the left and right DAGs are very similar, they are highly likely to be encoded to similar latent vectors. In particular, the only difference between encoding the left and right DAGs is that, for node 3, the encoder RNN reads an adjacency column of [0, 1, 0, 0, 0, 0] in the left and [1, 0, 0, 0, 0, 0] in the right, while all the remaining encoding is exactly the same. By embedding two DAGs serving very different computational purposes into the same region of the latent space, SVAE and GraphRNN tend to have less smooth latent spaces, which makes optimization on them more difficult. In contrast, DVAE can better differentiate such subtle differences, as changing the edge (2,3) to (1,3) completely changes the aggregated message node 3 receives in DVAE (the hidden state of node 2 vs. the hidden state of node 1), which greatly affects the feature learning of node 3 and all its successors.
GCN. The graph convolutional network (GCN) [22] is a representative graph neural network with a simultaneous message passing scheme. In GCN, all nodes take their neighbors' incoming messages and update their own states simultaneously instead of following an order. After message passing, the sum of the node states is used as the graph state. We include GCN as the third baseline. Since GCN can only encode graphs, we equip it with DVAE's decoder to make it a VAE model.
Using GCN as the encoder ensures permutation invariance, since node ordering does not matter in GCN. However, GCN's message passing focuses on propagating neighboring nodes' features to each center node to encode the local substructure pattern around each node. In comparison, DVAE's message passing simulates how the computation is performed along the directed paths of a DAG and focuses on encoding this computation. Although learning local substructure features is essential for GCN's success in node and graph classification, in our tasks modeling the entire computation is much more important than modeling local features. Encoding only local substructures may also lose important information about the global DAG topology, making it more difficult to reconstruct the DAG.
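The difference between the two message-passing schemes can be illustrated on a tiny path DAG (a toy sketch only: plain sum aggregation stands in for both models' actual, learned update functions):

```python
import numpy as np

# A 3-node path DAG 1 -> 2 -> 3, one scalar feature per node.
adj = np.array([[0, 1, 0],   # adj[i, j] = 1 iff there is an edge i -> j
                [0, 0, 1],
                [0, 0, 0]])
x = np.array([1.0, 0.0, 0.0])  # only the source node carries signal

# GCN-style simultaneous update: every node aggregates its predecessors'
# *current* states at once; one layer moves information only one hop, so
# node 3 receives node 2's initial state (still zero).
h_sync = x + adj.T @ x

# DVAE-style sequential update: nodes are processed in topological order,
# so within one pass the signal already flows 1 -> 2 -> 3.
h_seq = x.copy()
for v in range(3):               # 0, 1, 2 is a topological order here
    h_seq[v] = x[v] + adj[:, v] @ h_seq

print(h_sync)  # [1. 1. 0.]
print(h_seq)   # [1. 1. 1.]
```

The sequential pass lets the last node see the source's signal in a single sweep, which is the sense in which DVAE "encodes the computation" along directed paths.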
Appendix K VAE Training Details
We use the same settings and hyperparameters (where applicable) for all four models to be as fair as possible. Many hyperparameters are inherited from Kusner et al. [3]. Single-layer GRUs are used in all models requiring recurrent units, with the same hidden state size of 501. We set the dimension of the latent space to 56 for all models. All VAE models use the standard normal N(0, I) as the prior distribution p(z), and take the posterior approximation q(z|G) (where G denotes the input DAG) to be a normal distribution with a diagonal covariance matrix, whose mean and variance parameters are output by the encoder. The two MLPs used to output the mean and variance parameters are both implemented as single linear layers.

For the decoder network of DVAE, we let the node-type and edge prediction networks be two-layer MLPs with ReLU nonlinearities, where the hidden layer sizes are set to two times the input sizes. Softmax activation is used after the node-type MLP, and sigmoid activation is used after the edge MLP. For the gating network, we use a single linear layer with sigmoid activation. For the mapping function, we use a linear mapping without activation. The bidirectional encoding discussed in Section 3.3 is enabled for DVAE on neural architectures, and disabled for DVAE on Bayesian networks and for the other models, where it yields no better results. To measure the reconstruction loss, we use teacher forcing [17]: following the topological order in which the input DAG's nodes are consumed, we sum the negative log-likelihood of each decoding step, forcing the model to generate the ground-truth node type or edge at each step. This ensures that the model makes predictions based on the correct histories. Then, we optimize the VAE loss (the negative of (1)) using gradient descent following [17].

When optimizing the VAE loss, we use L = L_recon + α · L_KLD as the loss function, where α weights the KL divergence term. In the original VAE framework, α is set to 1. However, we found that this led to poor reconstruction accuracies, similar to the findings of previous work [3, 10, 17]. Following the implementation of Jin et al. [17], we set α to a smaller value. Mini-batch SGD with the Adam optimizer [66] is used for all models. For neural architectures, we use a batch size of 32 and train all models for 300 epochs. For Bayesian networks, we use a batch size of 128 and train all models for 100 epochs. We use an initial learning rate of 1E-4, and multiply the learning rate by 0.1 whenever the training loss does not decrease for 10 epochs. We use PyTorch to implement all the models.
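The weighted loss can be sketched in numpy as follows (a minimal illustration; the reconstruction value and the KLD weight below are made-up placeholders, not the paper's numbers):

```python
import numpy as np

def kld_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def vae_loss(recon_nll, mu, logvar, alpha):
    """Teacher-forced reconstruction NLL plus a down-weighted KLD term."""
    return recon_nll + alpha * kld_diag_gaussian(mu, logvar)

# When the posterior equals the prior, the KLD term vanishes and the
# loss reduces to the reconstruction term alone.
mu, logvar = np.zeros(56), np.zeros(56)   # 56-dim latent space, as above
print(vae_loss(recon_nll=3.2, mu=mu, logvar=logvar, alpha=0.005))  # 3.2
```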
Appendix L SGP Training Details
We use sparse Gaussian process (SGP) regression as the predictive model, with the open-sourced SGP implementation from [3]. Both the training and testing data's performance scores are standardized according to the mean and standard deviation of the training data's performance scores before being fed to the SGP. The RMSE and Pearson's r in Table 1 are also calculated on the standardized performances. We use the default Adam optimizer to train the SGP for 100 epochs, with a mini-batch size of 1,000 and a learning rate of 5E-4.
For neural architectures, we use all the training data to train the SGP. For Bayesian networks, we randomly sample 5,000 training examples each time, for two reasons: 1) using all 180,000 examples to train the SGP might not be realistic for a typical scenario where the network/dataset is large and evaluating a network is expensive; and 2) we found that using a smaller sample of training data results in more stable BO performance, since it reduces the probability of duplicate rows, which can result in ill-conditioned kernel matrices. Note also that when training the variational autoencoders, all the training data are used, since the VAE training is purely unsupervised.
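The target standardization step can be sketched as follows (a toy illustration; the performance values are made up, and only the training statistics are used, so no test-set information leaks in):

```python
import numpy as np

def standardize(train_y, test_y):
    """Standardize targets by the training set's mean and std; the same
    statistics are then applied to the test targets."""
    mean, std = train_y.mean(), train_y.std()
    return (train_y - mean) / std, (test_y - mean) / std

train_y = np.array([0.70, 0.72, 0.74, 0.76])   # hypothetical performances
test_y = np.array([0.73, 0.75])
train_s, test_s = standardize(train_y, test_y)
# train_s now has mean ~0 and std ~1; test_s uses the same transform.
```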
Appendix M More Experimental Results
M.1 Reconstruction accuracy, prior validity, uniqueness and novelty
Being able to accurately reconstruct input examples and generate valid new examples are basic requirements for VAE models. In this experiment, we evaluate the models by measuring 1) how often they can reconstruct input DAGs perfectly (Accuracy), 2) how often they can generate valid neural architectures or Bayesian networks from the prior distribution (Validity), 3) the portion of unique DAGs out of the valid generations (Uniqueness), and 4) the portion of valid generations that are never seen in the training set (Novelty).
We first evaluate each model's reconstruction accuracy on the test sets. Following previous work [3, 17], we regard the encoding as a stochastic process. That is, after getting the mean and variance parameters of the posterior approximation q(z|G), we sample a z from it as the latent vector of the input DAG G. To estimate the reconstruction accuracy, we sample z 10 times for each G, and decode each z 10 times too. Then we report the average portion of the 100 decoded DAGs that are identical to the input.
To calculate prior validity, we sample 1,000 latent vectors from the prior distribution and decode each latent vector 10 times. Then we report the portion of these 10,000 generated DAGs that are valid. A generated DAG is valid if it can be read by the original software which generated the training data. More details about the validity experiment are in Appendix M.2.
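The prior-validity estimate is a simple Monte Carlo procedure, sketched below; the `decode` and `is_valid` arguments are placeholders standing in for the trained decoder and the software-specific validity check:

```python
import random

def prior_validity(decode, is_valid, latent_dim=56,
                   n_samples=1000, n_decodes=10, seed=0):
    """Sample latent vectors from the N(0, I) prior, decode each several
    times, and report the fraction of valid generated DAGs."""
    rng = random.Random(seed)
    valid = total = 0
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        for _ in range(n_decodes):
            total += 1
            valid += bool(is_valid(decode(z)))
    return valid / total

# Toy stand-ins: "decoding" returns the sign of the first coordinate,
# which is "valid" when positive -- so the rate should be near 0.5.
rate = prior_validity(decode=lambda z: z[0] > 0, is_valid=lambda g: g,
                      n_samples=200, n_decodes=1)
```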
Table 2: Reconstruction accuracy, prior validity, uniqueness and novelty (%).

Methods    | Neural architectures                     | Bayesian networks
           | Accuracy  Validity  Uniqueness  Novelty  | Accuracy  Validity  Uniqueness  Novelty
DVAE       | 99.96     100.00    37.26       100.00   | 99.94     98.84     38.98       98.01
SVAE       | 99.98     100.00    37.03       99.99    | 99.99     100.00    35.51       99.70
GraphRNN   | 99.85     99.84     29.77       100.00   | 96.71     100.00    27.30       98.57
GCN        | 5.42      99.37     41.18       100.00   | 99.07     99.89     30.53       98.26
We show the results in Table 2. Among all the models, DVAE and SVAE generally perform best. We find that DVAE, SVAE and GraphRNN all have near-perfect reconstruction accuracy, prior validity and novelty. However, DVAE and SVAE show higher uniqueness, meaning that they generate more diverse examples. We find that GCN is not suitable for modeling neural architectures, as it reconstructs only 5.42% of unseen inputs. This is not surprising, since the simultaneous message passing scheme in GCN focuses on learning local graph structures, but fails to encode the computation represented by the entire neural network. Besides, the sum pooling after the message passing may also lose global topology information that is important for the reconstruction.
M.2 More details on the prior validity experiment
Since different models can have different levels of convergence w.r.t. the KLD loss in (1), their posterior distributions q(z|x) may have different degrees of alignment with the prior distribution p(z). If we evaluate prior validity by sampling from p(z) for all models, we will favor those models with a higher level of KLD convergence. To remove such effects and focus purely on the models' intrinsic ability to generate valid DAGs, when evaluating prior validity, we apply z' = μ_train + σ_train ⊙ z for each model (where μ_train and σ_train are the mean and standard deviation of the training data's encoded means under that model), so that the latent vectors are scaled and shifted to the center of the training data's embeddings. Without such transformations, we find that we can easily manipulate the prior validity results by optimizing for more or fewer epochs, or by putting more or less weight on the KLD loss.
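The scale-and-shift transformation can be sketched as follows (minimal numpy code; the toy "encoded means" are made-up stand-ins for a model's actual training embeddings):

```python
import numpy as np

def shift_to_training_region(z, train_means):
    """Scale and shift prior samples z ~ N(0, I) so that they match the
    per-dimension mean and std of the training data's encoded means."""
    mu = train_means.mean(axis=0)
    sigma = train_means.std(axis=0)
    return mu + sigma * z

rng = np.random.default_rng(0)
train_means = rng.normal(loc=3.0, scale=0.5, size=(1000, 2))  # toy embeddings
z = rng.standard_normal((5000, 2))                            # prior samples
z_shifted = shift_to_training_region(z, train_means)
# z_shifted is now centered near the training embeddings (mean ~[3, 3]).
```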
For a generated neural architecture to be read by ENAS, it has to pass the following validity checks: 1) It has one and only one starting node (the input layer); 2) It has one and only one ending node (the output layer); 3) Other than the input node, there are no nodes without any predecessors (no isolated paths); 4) Other than the output node, there are no nodes without any successors (no blocked paths); 5) Each node has a directed edge from the node immediately before it (a constraint of ENAS), i.e., there is always a main path connecting all the nodes; and 6) It is a DAG.
For a generated Bayesian network to be read by bnlearn and evaluated on the Asia dataset, it has to pass the following validity checks: 1) It has exactly 8 nodes; 2) Each type in "ASTLBEXD" appears exactly once; and 3) It is a DAG.
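The Bayesian network checks are simple to state in code (a sketch only; representing a graph as a list of node types plus an edge list is our assumption, and the acyclicity test uses Kahn's algorithm):

```python
from collections import deque

def is_valid_bn(types, edges):
    """Check the three conditions: exactly 8 nodes, each type in
    'ASTLBEXD' appearing exactly once, and no directed cycles."""
    if len(types) != 8 or sorted(types) != sorted("ASTLBEXD"):
        return False
    indeg = {t: 0 for t in types}
    succ = {t: [] for t in types}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Kahn's algorithm: repeatedly remove nodes with in-degree zero.
    queue = deque(t for t in types if indeg[t] == 0)
    seen = 0
    while queue:
        u = queue.popleft()
        seen += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen == 8  # every node is removed iff the graph is acyclic

print(is_valid_bn(list("ASTLBEXD"), [("A", "T"), ("S", "L")]))  # True
print(is_valid_bn(list("ASTLBEXD"), [("A", "T"), ("T", "A")]))  # False
```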
Note that the training graphs generated by the original software all satisfy these validity constraints.
M.3 Bayesian optimization vs. random search
To validate that Bayesian optimization (BO) in the latent space does provide guidance in searching for better DAGs, we compare BO with Random (which randomly samples points from the latent space of DVAE). Figures 10 and 11 show the results (averaged across 10 trials). In each figure, the left plot shows the average performance of all the points found in each BO round, and the right plot shows the highest performance of all the points found so far. As we can see, BO consistently selects points with better average performance in each round than random search, which is expected. For the highest performance, however, BO tends to fall behind Random in the first few rounds. This might be because our batch expected improvement heuristic exploits the currently most promising regions by selecting most points of a batch from the same region (exploitation), while Random explores the entire space more evenly (exploration). Nevertheless, BO quickly catches up after a few rounds and shows a long-term advantage.
M.4 More Visualization Results for Neural Architectures
We randomly pick a neural architecture and use its encoded mean as the starting point. We then generate two random orthogonal directions, and move along combinations of these two directions from the starting point to render a 2D visualization of the decoded architectures in Figure 12.
M.5 More Visualization Results for Bayesian Networks
We similarly show the 2D visualization of decoded Bayesian networks in Figure 13. Both DVAE and SVAE show smooth latent spaces.