D-VAE: A Variational Autoencoder for Directed Acyclic Graphs

04/24/2019 · by Muhan Zhang, et al. · Washington University in St. Louis

Graph structured data are abundant in the real world. Among different graph types, directed acyclic graphs (DAGs) are of particular interest to machine learning researchers, as many machine learning models are realized as computations on DAGs, including neural networks and Bayesian networks. In this paper, we study deep generative models for DAGs, and propose a novel DAG variational autoencoder (D-VAE). To encode DAGs into the latent space, we leverage graph neural networks. We propose a DAG-style asynchronous message passing scheme that allows encoding the computations defined by DAGs, rather than using existing simultaneous message passing schemes to encode the graph structures. We demonstrate the effectiveness of our proposed D-VAE through two tasks: neural architecture search and Bayesian network structure learning. Experiments show that our model not only generates novel and valid DAGs, but also produces a smooth latent space that facilitates searching for DAGs with better performance through Bayesian optimization.


1 Introduction

Many real-world problems can be posed as optimizing a directed acyclic graph (DAG) representing some computational task. For example, the architecture of a neural network is a DAG. The problem of searching for optimal neural architectures is essentially a DAG optimization task. Similarly, one critical problem in learning graphical models, optimizing the connection structure of a Bayesian network [1], is also a DAG optimization task. DAG optimization is pervasive in other fields as well. In electronic circuit design, engineers need to optimize DAG circuit blocks not only to realize target functions, but also to meet specifications such as power usage and operating temperature.

DAG optimization is a hard problem. Firstly, the evaluation of a DAG’s performance is often time-consuming (e.g., training a neural network). Secondly, state-of-the-art black-box optimization techniques such as simulated annealing and Bayesian optimization primarily operate in a continuous space, thus are not directly applicable to DAG optimization due to the discrete nature of DAGs. In particular, to make Bayesian optimization work for discrete structures, we need a kernel to measure the similarity between discrete structures as well as a method to explore the design space and extrapolate to new points. Principled solutions to these problems are still lacking.

Is there a way to circumvent the trouble from discreteness? The answer is yes. If we can embed all DAGs to a continuous space and make the space relatively smooth, we might be able to directly use principled black-box optimization algorithms to optimize DAGs in this space, or even use gradient methods if gradients are available. Recently, there has been increased interest in training generative models for discrete data types such as molecules [2, 3], arithmetic expressions [4], source code [5], undirected graphs [6], etc. In particular, Kusner et al. [3] developed a grammar variational autoencoder (GVAE) for molecules, which is able to encode and decode molecules into and from a continuous latent space, allowing one to optimize molecule properties by searching in this well-behaved space instead of a discrete space. Inspired by this work, we propose to also train a variational autoencoder for DAGs, and optimize DAG structures in the latent space via Bayesian optimization.

To encode DAGs, we leverage graph neural networks (GNNs) [7]. Traditionally, a GNN treats all nodes symmetrically, and extracts local features around nodes by simultaneously passing all nodes' neighbors' messages to themselves. However, such a simultaneous message passing scheme is designed to learn local structure features. It might not be suitable for DAGs, since in a DAG: 1) nodes are not symmetric, but intrinsically have an ordering based on their dependency structure; and 2) we are more concerned about the computation represented by the entire graph, not the local structures.

In this paper, we propose an asynchronous message passing scheme to encode the computations on DAGs. The message passing no longer happens at all nodes simultaneously, but respects the computation dependencies (the partial order) among the nodes. For example, suppose node A has two predecessors, B and C, in a DAG. Our scheme does not perform feature learning for A until the feature learning on B and C is finished on both. Then, the aggregated message from B and C is passed to A to trigger A's feature learning. We incorporate this feature learning scheme in both our encoder and decoder, and propose the DAG variational autoencoder (D-VAE). D-VAE has an excellent theoretical property for modeling DAGs: we prove that D-VAE can injectively encode computations on DAGs. This means we can build a mapping from the discrete space to a continuous latent space so that every DAG computation has its unique embedding in the latent space, which justifies performing optimization in the latent space instead of the original design space.

Our contributions in this paper are: 1) We propose D-VAE, a variational autoencoder for DAGs using a novel asynchronous message passing scheme, which is able to injectively encode computations. 2) Based on D-VAE, we propose a new DAG optimization framework which performs Bayesian optimization in a continuous latent space. 3) We apply D-VAE to two problems, neural architecture search and Bayesian network structure learning. Experiments show that D-VAE not only generates novel and valid DAGs, but also learns smooth latent spaces effective for optimizing DAG structures.

2 Related work

Variational autoencoder (VAE) [8, 9] provides a framework to learn both a probabilistic generative model p_θ(x|z) (the decoder) as well as an approximated posterior distribution q_φ(z|x) (the encoder). VAE is trained through maximizing the evidence lower bound

L(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) ).    (1)

The posterior approximation q_φ(z|x) and the generative model p_θ(x|z) can in principle take arbitrary parametric forms, whose parameters are output by the encoder and decoder networks respectively. After learning p_θ(x|z), we can generate new data by decoding latent space vectors z sampled from the prior p(z). For generating discrete data, p_θ(x|z) is often decomposed into a series of decision steps.
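
For concreteness, the following is a minimal PyTorch-style sketch of the (negative) ELBO in (1) for a diagonal-Gaussian posterior and a standard normal prior, together with the reparameterization trick used to sample z during training; the function names and the β weight (see Appendix K) are illustrative only, not part of any particular released implementation.

```python
import torch

def reparameterize(mu, logvar):
    """Sample z ~ q(z|x) = N(mu, diag(exp(logvar))) with the reparameterization trick."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(recon_nll, mu, logvar, beta=1.0):
    """Negative ELBO: reconstruction NLL plus (beta-weighted) KL(q(z|x) || N(0, I)).

    recon_nll : -log p(x|z), summed over the decoding steps of one DAG.
    beta      : weight on the KL term; beta = 1 recovers the standard ELBO in (1).
    """
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_nll + beta * kl
```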

Deep graph generative models use neural networks to learn distributions over graphs. There are mainly three types: token-based, adjacency-matrix-based, and graph-based. Token-based models [2, 3, 10] represent a graph as a sequence of tokens (e.g., characters, grammar rules) and model these sequences using RNNs. They are less general since task-specific graph grammars such as SMILES for molecules [11] are required. Adjacency-matrix-based models [12, 13, 14, 15, 16] leverage the proxy adjacency matrix representation of a graph, and generate the matrix in one shot or generate the columns/entries sequentially. In contrast, graph-based models [6, 17, 18, 19] seem more natural, since they operate directly on graph structures (instead of proxy matrix representations) by iteratively adding new nodes/edges to a graph based on the existing graph and node states. In addition, the graph and node states are learned by graph neural networks (GNNs), which have already shown their powerful graph representation learning ability on various tasks [20, 21, 22, 23, 24, 25, 26].

Neural architecture search (NAS) aims at automating the design of neural network architectures. It has seen major advances in recent years [27, 28, 29, 30, 31, 32]. See Hutter et al. [33] for an overview. NAS methods can be mainly categorized into: 1) reinforcement learning methods [27, 30, 32], which train controllers to generate architectures with high rewards in terms of validation accuracy; 2) Bayesian optimization based methods [34], which define kernels to measure architecture similarity and extrapolate the architecture space heuristically; 3) evolutionary approaches [28, 35, 36], which use evolutionary algorithms to optimize neural architectures; and 4) differentiable methods [31, 37, 38], which use continuous relaxations/mappings of neural architectures to enable gradient-based optimization. In Appendix A, we include more discussion of the most related works.

Bayesian network structure learning (BNSL) aims to learn the structure of the underlying Bayesian network from observed data [39, 40, 41, 42]. A Bayesian network is a probabilistic graphical model which represents conditional dependencies among variables via a DAG [1]. One main approach for BNSL is score-based search, i.e., we define some "goodness-of-fit" score for network structures, and search for one with the optimal score in the discrete design space. Commonly used scores include BIC and BDeu, mostly based on the marginal likelihood [1]. Due to the NP-hardness of the problem [43], however, exact algorithms such as dynamic programming [44] or shortest path approaches [45, 46] can only solve small-scale problems. Thus, people have to resort to heuristic methods such as local search and simulated annealing [47]. In general, BNSL is still a hard problem with much research ongoing.

3 DAG variational autoencoder (D-VAE)

In this section, we describe our proposed DAG variational autoencoder (D-VAE). D-VAE uses an asynchronous message passing scheme to encode and decode DAGs. In contrast to the simultaneous message passing in traditional GNNs, D-VAE allows encoding computations rather than structures.

Definition 1.

(Computation) Given a set of elementary operations O, a computation C is the composition of a finite number of operations applied to an input signal x, with the output of each operation being the input to its succeeding operations.

Figure 1: Computations can be represented by DAGs. Note that the left and right DAGs represent the same computation.

The set of elementary operations O depends on the specific application. For example, when we are interested in computations given by a calculator, O will be the set of all the operations defined on the functional buttons, such as +, −, ×, ÷, etc. When modeling neural networks, O can be a predefined set of basic layers, such as 3×3 convolution, 5×5 convolution, 2×2 max pooling, etc. A computation can be represented as a directed acyclic graph (DAG), with directed edges representing signal flow directions among node operations. The graph must be acyclic, since otherwise the input signal would go through an infinite number of operations and the computation would never stop. Figure 1 shows two examples. Note that the two DAGs in Figure 1 represent the same computation, as the input signal goes through exactly the same operations.

3.1 Encoding

We first introduce D-VAE’s encoder. The D-VAE encoder can be seen as a graph neural network (GNN) using an asynchronous message passing scheme. Given a DAG, we assume there is a single starting node which does not have any predecessors (e.g., the input layer of a neural architecture). If there are multiple such nodes, we add a virtual starting node connecting to all of them.

Similar to standard GNNs, we use an update function U to compute the hidden state of each node based on its incoming message. The hidden state h_v of node v is given by:

h_v = U(x_v, h_v^in),    (2)

where x_v is the one-hot encoding of v's type, and h_v^in represents the incoming message to v. h_v^in is given by aggregating the hidden states of v's predecessors using an aggregation function A:

h_v^in = A({ h_u : u → v }),    (3)

where u → v denotes a directed edge from u to v, and { h_u : u → v } represents the multiset of v's predecessors' hidden states. If an empty set is input to A (corresponding to the starting node without any predecessors), we let A output an all-zero vector.

Compared to the traditional simultaneous message passing, in D-VAE the message passing for a node must wait until all of its predecessors’ hidden states have already been computed. This simulates how a computation is really performed – to execute some operation, we also need to wait until all its input signals are ready. To make sure the required hidden states are available when a new node comes, we can perform message passing for nodes sequentially following a topological ordering of the DAG.
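
The scheduling itself is straightforward. The sketch below (plain Python, with aggregate and update standing in for the functions A and U in (2) and (3)) computes a topological order with Kahn's algorithm and then visits nodes in that order, so that every node's state is computed only after all of its predecessors' states are available; function names here are illustrative only.

```python
from collections import deque

def topological_order(num_nodes, edges):
    """Kahn's algorithm: return a topological ordering of a DAG given as (u, v) edge pairs."""
    indeg = [0] * num_nodes
    succ = [[] for _ in range(num_nodes)]
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in range(num_nodes) if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in succ[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return order

def encode_dag(node_types, edges, aggregate, update):
    """Asynchronous message passing: node v's state is computed only after all of its
    predecessors' states are ready (cf. (2) and (3)). `aggregate` must return an
    all-zero vector when given an empty list (the case of the starting node)."""
    pred = {v: [] for v in range(len(node_types))}
    for u, v in edges:
        pred[v].append(u)
    h = {}
    for v in topological_order(len(node_types), edges):
        h[v] = update(node_types[v], aggregate([h[u] for u in pred[v]]))
    return h
```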

In Figure 2, we use a real neural architecture to illustrate the encoding process. After all nodes' hidden states are computed, we use h_{v_n}, the hidden state of the ending node v_n without any successors, as the output of the encoder. Then we feed h_{v_n} to two multi-layer perceptrons (MLPs) to get the mean and variance parameters of the posterior approximation q(z|G) in (1), which is a normal distribution in our experiments. If there are multiple nodes without successors, we again add a virtual ending node connecting from all of them.

Figure 2: An illustration of the encoding procedure for a neural architecture. Following a topological ordering, we iteratively compute the hidden state for each node (red) by feeding in its predecessors' hidden states (blue). This simulates how an input signal goes through the DAG, with h_v simulating the output signal at node v.

Note that although topological orderings are usually not unique for a DAG, we can take any one of them as the message passing order while ensuring the encoder output is always the same, formalized by the following theorem. We include all theorem proofs in the appendix.

Theorem 1.

The D-VAE encoder is invariant to node permutations of the input DAG if the aggregation function A is invariant to the order of its inputs (e.g., summing, averaging, etc.).

Theorem 1 means isomorphic DAGs will have the same encoding result, no matter how we reorder/reindex the nodes. It also indicates that so long as we encode a DAG complying with its partial order, the real message passing order and node order do not influence the encoding result.

The next theorem shows another property of D-VAE that is crucial for its success in modeling DAGs, i.e., it is able to injectively encode computations on DAGs.

Theorem 2.

Let G be any DAG representing some computation C. Let v_1, ..., v_n be its nodes following a topological order, each representing some operation o_i, where v_n is the ending node. Then, the encoder of D-VAE maps C to h_{v_n} injectively if the aggregation function A is injective and the update function U is injective.

The significance of Theorem 2 is that it provides a way to injectively encode computations on DAGs, so that every computation has a unique embedding in the latent space. Therefore, instead of performing optimization in the original discrete space, we may equivalently perform optimization in the continuous latent space. In this well-behaved Euclidean space, distance is well defined, and principled Bayesian optimization can be applied to search for latent points with high performance scores, which transforms the discrete optimization problem into an easier continuous problem.

Note that Theorem 2 states D-VAE injectively encodes computations on graph structures, rather than graph structures themselves. Being able to injectively encode graph structures is a very strong condition, as it might provide an efficient algorithm to solve the challenging graph isomorphism (GI) problem. Luckily, here what we really want to injectively encode are computations instead of structures, since we do not need to differentiate two different structures G1 and G2 as long as they represent the same computation. Figure 1 shows such an example. Our D-VAE can identify that the two DAGs in Figure 1 actually represent the same computation by encoding them to the same vector, while encoders focusing on encoding structures might fail to capture the underlying computation and output different vectors. In Appendix G, we discuss more advantages of Theorem 2 for optimizing DAGs in the latent space.

To model and learn the injective functions A and U, we resort to neural networks, thanks to the universal approximation theorem [48]. For example, we can let A be a gated sum:

A({ h_u : u → v }) = Σ_{u → v} g(h_u) ⊙ m(h_u),    (4)

where m is a mapping network and g is a gating network. Such a gated sum can model injective multiset functions [49], and is invariant to input order. To model the injective update function U, we can use a gated recurrent unit (GRU) [50], with h_v^in treated as the input hidden state:

h_v = GRU_e(x_v, h_v^in).    (5)

Here the subscript e denotes "encoding". Using a GRU also allows reducing our framework to traditional sequence-to-sequence modeling frameworks [51], as discussed in 3.3.

The above aggregation and update functions can be used to encode general computation graphs. For neural architectures, depending on how the outputs of multiple previous layers are aggregated as the input to a next layer, we will make a modification to (4), which is discussed in Appendix E. For Bayesian networks, we also make some modifications to their encoding due to the special d-separation properties of Bayesian networks, which is discussed in Appendix F.
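
As one possible instantiation, the gated sum in (4) and the GRU update in (5) can be written in PyTorch as below; the module names and layer shapes are illustrative and do not correspond to any particular released implementation.

```python
import torch
import torch.nn as nn

class GatedSum(nn.Module):
    """Gated sum aggregator of (4): A({h_u}) = sum_u g(h_u) * m(h_u).
    Invariant to the order of the predecessor states."""
    def __init__(self, hidden_size):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Sigmoid())  # g
        self.mapper = nn.Linear(hidden_size, hidden_size, bias=False)                 # m

    def forward(self, pred_states):            # pred_states: (num_predecessors, hidden)
        if pred_states.numel() == 0:           # starting node: no predecessors
            return torch.zeros(self.mapper.out_features)
        return (self.gate(pred_states) * self.mapper(pred_states)).sum(dim=0)

class NodeUpdate(nn.Module):
    """GRU update of (5): h_v = GRU_e(x_v, h_v_in), with the aggregated message
    treated as the previous hidden state of the GRU."""
    def __init__(self, num_types, hidden_size):
        super().__init__()
        self.cell = nn.GRUCell(num_types, hidden_size)

    def forward(self, x_v, h_in):              # x_v: (num_types,), h_in: (hidden,)
        return self.cell(x_v.unsqueeze(0), h_in.unsqueeze(0)).squeeze(0)
```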

Figure 3: An illustration of the steps for generating a new node.

3.2 Decoding

We now describe how D-VAE decodes latent vectors to DAGs (the generative part). The D-VAE decoder uses the same asynchronous message passing scheme as in the encoder to learn intermediate node and graph states. Similar to (5), the decoder uses another GRU, denoted by GRU_d, to update node hidden states during the generation. Given the latent vector z to decode, we first use an MLP to map z to h_0 as the initial hidden state to be fed to GRU_d. Then, the decoder constructs a DAG node by node. For the i-th generated node v_i, the following steps are performed:

  1. Compute v_i's type distribution using an MLP (followed by a softmax) based on the current graph state h_G.

  2. Sample v_i's type. If the sampled type is the ending type, stop the decoding, connect all loose ends (nodes without successors) to v_i, and output the DAG; otherwise, continue the generation.

  3. Update v_i's hidden state by h_{v_i} = GRU_d(x_{v_i}, h_{v_i}^in), where h_{v_i}^in = h_0 if i = 1; otherwise, h_{v_i}^in is the aggregated message from its predecessors' hidden states, given by equation (4).

  4. For j = i−1, i−2, ..., 1: (a) compute the probability of edge (v_j, v_i) using an MLP based on h_{v_j} and h_{v_i}; (b) sample the edge; and (c) if a new edge is added, update h_{v_i} using step 3.

The above steps are iteratively applied to each newly generated node, until step 2 samples the ending type. For every new node, we first predict its node type based on the current graph state, and then sequentially predict whether each existing node has a directed edge to it based on the existing and current nodes' hidden states. Figure 3 illustrates this process. Since edges always point to new nodes, the generated graph is guaranteed to be acyclic. Note that we maintain hidden states for both the current node and existing nodes, and keep updating them during the generation. For example, whenever step 4 samples a new edge between v_j and v_i, we update h_{v_i} to reflect the change of its predecessors and thus the change of the computation so far. Then, we use the new h_{v_i} for the next prediction. Such a dynamic updating scheme is flexible, computation-aware, and always uses the up-to-date state of each node to predict next steps. In contrast, methods based on RNNs [3, 13] do not maintain states for old nodes, and only use the current RNN state to predict the next step.

In step 4, when sequentially predicting incoming edges from previous nodes, we choose the reversed order v_{i−1}, v_{i−2}, ..., v_1 instead of v_1, v_2, ..., v_{i−1} or any other order. This is based on the prior knowledge that a new node is more likely to first connect from the node immediately before it. For example, in neural architecture design, when adding a new layer, we often first connect it from the last added layer, and then decide whether there should be skip connections from other previous layers. Note, however, that such an order is not fixed and can be adapted to specific applications.
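
The generation loop can be summarized by the following simplified PyTorch sketch. It follows steps 1-4 above but makes two simplifications for brevity: predecessor messages are aggregated by a plain sum rather than the gated sum in (4), and the graph state is taken to be the newest node's hidden state; connecting loose ends to the ending node and teacher forcing (Appendix K) are omitted. All module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class DVAEDecoderSketch(nn.Module):
    def __init__(self, num_types, hidden, latent, end_type, max_nodes=12):
        super().__init__()
        self.num_types, self.end_type, self.max_nodes = num_types, end_type, max_nodes
        self.init_state = nn.Linear(latent, hidden)   # maps z to the initial state h_0
        self.type_mlp = nn.Linear(hidden, num_types)  # step 1: node type logits
        self.edge_mlp = nn.Linear(2 * hidden, 1)      # step 4(a): edge logit
        self.grud = nn.GRUCell(num_types, hidden)     # GRU_d used in step 3

    def node_state(self, x_v, preds, states, h0):
        """Step 3: h_v = GRU_d(x_v, h_in); h_in is h_0 for the first node,
        otherwise the aggregated message from the sampled predecessors."""
        if not states:                                # the very first node
            h_in = h0
        elif preds:
            h_in = torch.stack([states[u] for u in preds]).sum(0)  # full model: gated sum (4)
        else:
            h_in = torch.zeros_like(h0)               # no predecessors sampled yet
        return self.grud(x_v.unsqueeze(0), h_in.unsqueeze(0)).squeeze(0)

    def forward(self, z):
        h0 = self.init_state(z)
        states, types, edges = [], [], []
        graph_state = h0
        for v in range(self.max_nodes):
            # steps 1-2: predict and sample the new node's type from the graph state
            t = int(torch.distributions.Categorical(logits=self.type_mlp(graph_state)).sample())
            if t == self.end_type and v > 0:
                break                                 # (connecting loose ends is omitted)
            x_v = torch.eye(self.num_types)[t]
            preds = []
            h_v = self.node_state(x_v, preds, states, h0)
            # step 4: predict incoming edges from existing nodes, newest first
            for u in reversed(range(v)):
                p = torch.sigmoid(self.edge_mlp(torch.cat([states[u], h_v])))
                if torch.rand(1).item() < p.item():
                    preds.append(u)
                    edges.append((u, v))
                    h_v = self.node_state(x_v, preds, states, h0)  # refresh h_v (step 3)
            types.append(t)
            states.append(h_v)
            graph_state = h_v
        return types, edges
```

For example, DVAEDecoderSketch(num_types=8, hidden=501, latent=56, end_type=7)(torch.randn(56)) returns one sampled node type sequence and edge list.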

3.3 Model extensions and discussion

Relation with RNNs. The D-VAE encoder and decoder reduce to ordinary RNNs when the input DAGs are reduced to linked lists. Although we propose D-VAE from a GNN perspective, our model can also be seen as a generalization of traditional sequence modeling frameworks [51, 52], where a time step depends only on the time step immediately before it, to the DAG case where a time step has multiple previous dependencies. As special DAGs, similar ideas have been explored for trees [53, 17], where a node can have multiple incoming edges yet only one outgoing edge.

Bidirectional encoding. D-VAE's encoding process can be seen as simulating how an input signal goes through a DAG, with h_v simulating the output signal at each node v. This is also known as forward propagation in neural networks. Inspired by the bidirectional RNN [54], we can also use another GRU to reversely encode a DAG (i.e., reverse all edge directions and encode the DAG again), thus simulating backward propagation too. After the reverse encoding, we get two ending states, which are concatenated and linearly mapped to their original size as the final output state. We find that this bidirectional encoding can increase the performance and convergence speed on neural architectures.

Incorporating vertex semantics. Note that D-VAE currently uses the one-hot encoding of node types as x_v, which does not consider the semantic meanings of different node types. For example, a 3×3 convolution layer might be functionally very similar to a 5×5 convolution layer, while being functionally distinct from a max pooling layer. We expect incorporating such semantic meanings of node types to further improve D-VAE's performance. For example, we could use pretrained embeddings of node types to replace the one-hot encoding. We leave this for future work.

4 Experiments

We validate the proposed DAG variational autoencoder (D-VAE) on two DAG optimization tasks:


  • Neural architecture search. Our neural network dataset contains 19,020 neural architectures from the ENAS software [32]. Each neural architecture has 6 layers (excluding input and output layers) sampled from: 3×3 and 5×5 convolutions, 3×3 and 5×5 depthwise-separable convolutions [55], max pooling, and average pooling. We evaluate each neural architecture's weight-sharing accuracy [32] (a proxy of the true accuracy) on CIFAR-10 [56] as its performance measure. We split the dataset into 90% training and 10% held-out test sets. We use the training set for VAE training, and use the test set only for evaluation. More details are in Appendix H.

  • Bayesian network structure learning. Our Bayesian network dataset contains 200,000 random 8-node Bayesian networks from the bnlearn package [57] in R. For each network, we compute the Bayesian Information Criterion (BIC) score to measure the performance of the network structure for fitting the Asia dataset [58]. We split the Bayesian networks into 90% training and 10% test sets. For more details, please refer to Appendix I.

Following [3], we do four experiments for each task:


  • Basic abilities of VAE models. In this experiment, we perform standard tests to evaluate the reconstructive and generative abilities of a VAE model for DAGs, including reconstruction accuracy, prior validity, uniqueness and novelty. We move the results of this part to Appendix M.1.

  • Predictive performance of latent representation. We test how well we can use the latent embeddings of neural architectures and Bayesian networks to predict their performances.

  • Bayesian optimization. This is the motivating application of D-VAE. We test how well the learned latent space can be used for searching for high-performance DAGs through Bayesian optimization.

  • Latent space visualization. We visualize the latent space to qualitatively evaluate its smoothness.

Since there is little previous work on DAG generation, we compare D-VAE with three generative baselines adapted for DAGs: S-VAE, GraphRNN and GCN. Among them, S-VAE [52] and GraphRNN [13] are adjacency-matrix-based methods, and GCN [22] uses simultaneous message passing to encode DAGs. We include more details about these baselines and discuss D-VAE’s advantages over them in Appendix J. The training details are in Appendix K. All the code and data will be made publicly available.

4.1 Predictive performance of latent representation

In this experiment, we evaluate how well the learned latent embeddings can predict the corresponding DAGs’ performances, which tests a VAE’s unsupervised representation learning ability. Being able to accurately predict a latent point’s performance also makes it much easier to search for high-performance points in this latent space. Thus, the experiment is also an indirect way to evaluate a VAE latent space’s suitability for DAG optimization. Following [3], we train a sparse Gaussian Process (SGP) regression model [59] with 500 inducing points on the training data’s embeddings to predict the unseen test data’s performances. We include the SGP training details in Appendix L.
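
The evaluation protocol can be reproduced with any GP regression implementation; the sketch below uses an exact GP from scikit-learn purely as an accessible stand-in for the sparse GP of [3], and reports the two metrics described next.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def evaluate_latent_space(train_mu, train_y, test_mu, test_y):
    """Fit a GP from latent embeddings (posterior means) to performances on the
    training split, then report RMSE and Pearson's r on the held-out split.
    Targets are standardized by the training statistics, as in Appendix L."""
    mean, std = train_y.mean(), train_y.std()
    train_t, test_t = (train_y - mean) / std, (test_y - mean) / std
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
    gp.fit(train_mu, train_t)
    pred = gp.predict(test_mu)
    rmse = np.sqrt(np.mean((pred - test_t) ** 2))
    r, _ = pearsonr(pred, test_t)
    return rmse, r
```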

We use two metrics to evaluate the predictive performance of the latent embeddings (given by the means of the posterior approximations). One is the RMSE between the SGP predictions and the true performances. The other is the Pearson correlation coefficient (Pearson's r), measuring how well the predictions and the real performances tend to go up and down together. A small RMSE and a large Pearson's r indicate better predictive performance. Table 1 shows the results. All the experiments are repeated 10 times and the means and standard deviations are reported.

              Neural architectures            Bayesian networks
Methods       RMSE            Pearson's r     RMSE            Pearson's r
D-VAE         0.384 ± 0.002   0.920 ± 0.001   0.300 ± 0.004   0.959 ± 0.001
S-VAE         0.478 ± 0.002   0.873 ± 0.001   0.369 ± 0.003   0.933 ± 0.001
GraphRNN      0.726 ± 0.002   0.669 ± 0.001   0.774 ± 0.007   0.641 ± 0.002
GCN           0.832 ± 0.001   0.527 ± 0.001   0.421 ± 0.004   0.914 ± 0.001
Table 1: Predictive performance of encoded means.

From Table 1, we find that both the RMSE and the Pearson's r of D-VAE are significantly better than those of the other models. A possible explanation is that D-VAE encodes the computation, which is directly related to a DAG's performance. S-VAE follows closely, achieving the second best performance. GraphRNN and GCN have less satisfactory performance in this experiment. The better predictive power of D-VAE's latent space means that performing Bayesian optimization in it may be more likely to find high-performance points.

Figure 4: Top 5 neural architectures found by each model and their true test accuracies.
Figure 5: Top 5 Bayesian networks found by each model and their BIC scores (the higher the better).
Figure 6: Great circle interpolation starting from a point and returning to itself. Upper: D-VAE. Lower: S-VAE.

4.2 Bayesian optimization

We perform Bayesian optimization using the two best models, D-VAE and S-VAE, validated by previous experiments. Based on the SGP model from the last experiment, we perform 10 iterations of batch Bayesian optimization, and average results across 10 trials. A batch size of 50 and the expected improvement (EI) heuristic [60] are used, following Kusner et al. [3]. Concretely speaking, we start from the training data’s embeddings, and iteratively propose new points from the latent space that maximize the EI acquisition function. For each batch of selected points, we evaluate their decoded DAGs’ real performances and add them back to the SGP to select the next batch. Finally, we check the best-performing DAGs found by each model to evaluate its DAG optimization performance.
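
For reference, the expected improvement acquisition used to select each batch has the standard closed form below (a maximization variant in numpy; the small exploration jitter xi is a common convention and not necessarily part of the original setup).

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_y, xi=0.01):
    """EI for maximization: expected amount by which a candidate with GP posterior
    mean `mu` and standard deviation `sigma` improves on the best observed value."""
    sigma = np.maximum(sigma, 1e-9)          # avoid division by zero
    imp = mu - best_y - xi
    z = imp / sigma
    return imp * norm.cdf(z) + sigma * norm.pdf(z)
```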

Neural architectures. For neural architectures, we select the top 15 found architectures in terms of their weight-sharing accuracies, and fully train them on CIFAR-10's training set to evaluate their true test accuracies. More details can be found in Appendix H. We show the 5 architectures with the highest true test accuracies in Figure 4. As we can see, D-VAE in general found much better neural architectures than S-VAE. Among the selected architectures, D-VAE achieved a highest accuracy of 94.80%, while S-VAE's highest accuracy was only 92.79%. In addition, all 5 architectures of D-VAE have accuracies higher than 94%, indicating that D-VAE's latent space can stably find many high-performance architectures. Although this does not outperform state-of-the-art NAS techniques such as NAONet [38] (2.11% error rate on CIFAR-10), our search space is much smaller, and we did not apply any data augmentation techniques, nor did we stack multiple copies of the found architecture or add more filters after the search. We emphasize that in this paper, we mainly focus on idea illustration rather than record breaking, since achieving state-of-the-art NAS results typically requires enormous computation resources beyond our capability. Nevertheless, D-VAE does provide a promising new direction for neural architecture search based on graph generation, alternative to existing approaches.

Bayesian networks. We similarly report the top 5 Bayesian networks found by each model, ranked by their BIC scores, in Figure 5. D-VAE generally found better Bayesian networks than S-VAE. The best Bayesian network found by D-VAE achieved a BIC of -11125.75, which is better than the best network in the training set, whose BIC is -11141.89 (a higher BIC score is better). Since BIC is in log scale, the probability of our found network explaining the data is actually about 1E7 times larger than that of the best training network. For reference, the true Bayesian network used to generate the Asia data has a BIC of -11109.74. Although we did not exactly recover the true network, our found network is close to it and outperforms all the training data. These experiments show that searching in an embedding space is a promising direction for Bayesian network structure learning.

4.3 Latent space visualization

In this experiment, we visualize the latent spaces of the VAE models to get a sense of their smoothness.

For neural architectures, we visualize the decoded architectures from points along a great circle in the latent space. We start from the latent embedding of a straight network without skip connections. Imagine this point as a point on the surface of a sphere (visualize the Earth). We randomly pick a great circle that starts from this point and returns to it around the sphere. Along this circle, we evenly pick 35 points and visualize their decoded networks in Figure 6. As we can see, both D-VAE and S-VAE show relatively smooth interpolations, changing only a few node types or edges each time. Visually speaking, S-VAE's structural changes are even smoother. This is because S-VAE treats DAGs purely as strings, thus tending to embed DAGs with small differences in their string representations to similar regions of the latent space, without considering their computational differences (see Appendix J). In contrast, D-VAE models computations, and focuses more on smoothness w.r.t. computation rather than structure.
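
The interpolation points themselves are easy to generate: starting from the embedding z_0, pick a random direction orthogonal to it and sweep a full revolution on the sphere of radius ||z_0||. A small numpy sketch (illustrative only) is given below.

```python
import numpy as np

def great_circle_points(z0, num_points=35, seed=0):
    """Evenly spaced points on a great circle through z0, on the origin-centered
    sphere of radius ||z0||; the circle starts at z0 and returns to it."""
    rng = np.random.default_rng(seed)
    radius = np.linalg.norm(z0)
    u = z0 / radius
    d = rng.normal(size=z0.shape)
    d -= d.dot(u) * u                       # project out the z0 direction
    d /= np.linalg.norm(d)
    angles = np.linspace(0.0, 2 * np.pi, num_points, endpoint=False)
    return np.array([radius * (np.cos(a) * u + np.sin(a) * d) for a in angles])
```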

Figure 7: Visualizing a principal 2-D subspace of the latent space.

For Bayesian networks, we aim to directly visualize the BIC score distribution over the latent space. To do so, we reduce the dimensionality by choosing the 2-D subspace of the latent space spanned by the first two principal components of the training data's embeddings. In this low-dimensional subspace, we compute the BIC scores of points evenly spaced within a grid and visualize the scores using a colormap in Figure 7. As we can see, D-VAE seems to better differentiate high-score points from low-score ones and shows more smoothly changing BIC scores, while S-VAE shows sharp boundaries and seems to mix high-score and low-score points more severely. We suspect this helps Bayesian optimization find high-performance Bayesian networks more easily in D-VAE.

5 Conclusion

In this paper, we have proposed D-VAE, a deep generative model for DAGs. D-VAE uses a novel asynchronous message passing scheme to explicitly model computations on DAGs. By performing Bayesian optimization in D-VAE's latent space, we offer promising new directions for two important problems, neural architecture search and Bayesian network structure learning. We hope D-VAE can inspire more research on extending graph generative models' applications to structure optimization.

References

Appendix A More Related Work

Both neural architecture search (NAS) and Bayesian network structure learning (BNSL) are subfields of AutoML. See Zöller and Huber [61] for a survey. We have given a brief overview of NAS and BNSL in section 2. Below we discuss in detail several works most closely related to ours.

Luo et al. [38] proposed a novel NAS approach called Neural Architecture Optimization (NAO). The basic idea is to jointly learn an encoder-decoder between networks and a continuous space, together with a performance predictor that maps the continuous representation of a network to its performance on a given dataset; they then perform two or three iterations of gradient descent guided by the predictor to find better architectures in the continuous space, which are then decoded into real networks to evaluate. This methodology is similar to that of Gómez-Bombarelli et al. [2] and Jin et al. [17] for molecule optimization, and also similar to Mueller et al. [62] for slightly revising a sentence.

There are several key differences compared to our approach. First, they use strings (e.g. "node-2 conv 3x3 node1 max-pooling 3x3") to represent neural architectures, whereas we directly use graph representations, which is more natural and generally applicable to other graphs such as Bayesian network structures. Second, they use supervised learning instead of unsupervised learning. That means they need to first evaluate a considerable number of randomly sampled graphs on a typically large dataset (e.g., train many neural networks), and use these results to supervise the training of the autoencoder. Given a new dataset, the autoencoder needs to be completely retrained. In contrast, we train our variational autoencoder in a fully unsupervised manner, so the model is general purpose.

Fusi et al. [63] proposed a novel AutoML algorithm also using model embedding, but with a matrix factorization approach. They first construct a matrix of performances of thousands of ML pipelines on hundreds of datasets; then they use a probabilistic matrix factorization to get the latent representations of the pipelines. Given a new dataset, Bayesian optimization with the expected improvement heuristic is used to find the best pipeline. This approach only allows us to choose from predefined off-the-shelf ML models, hence its flexibility is somewhat limited.

Kandasamy et al. [34] use Bayesian optimization for NAS; they define a kernel that measures the similarities between networks by solving an optimal transport problem, and in each iteration, they use some evolutionary heuristics to generate a set of candidate networks based on making small modifications to existing networks, and use expected improvement to choose the next one to evaluate. This work is similar to ours in the application of Bayesian optimization. However, defining a kernel to measure the similarities between discrete structures is a non-trivial problem. In addition, the discrete search space is heuristically extrapolated near existing architectures, which makes the search essentially local. In contrast, we directly fit a Gaussian process over the entire continuous latent space, enabling more global optimization.

Using Gaussian processes (GPs) for Bayesian network structure learning has also been studied before. Yackley and Lane [64] analyzed the smoothness of the BDe score, showing that a local change (e.g., adding an edge) changes the score by an amount that is bounded in terms of the number of training points, and proposed to use a GP as a proxy for the score to accelerate the search. Anderson and Lane [65] used a GP to model the BDe score, and showed that guiding the local search with the probability of improvement outperforms hill climbing. However, these methods still operate heuristically and locally in the discrete space, whereas our latent space makes both local and global methods, such as gradient descent and Bayesian optimization, applicable in a principled manner.

Appendix B Computation vs. Function

In section 3 we defined computations. Here we discuss the difference between a computation and a function. A computation defines a function: for instance, the computation x + x defines the function f(x) = 2x. However, the computations 2 × x and x × 2 define the same function f, yet x + x, 2 × x and x × 2 are different computations. In other words, a computation is (informally speaking) a process which focuses on the course of how the input is processed into the output, while a function is a mapping which only cares about the result. Different computations can define the same function.

Sometimes, the same computation can also define different functions; e.g., two identical neural architectures will represent different functions if they are trained differently (since the weights of their layers will be different). In D-VAE, we model computations instead of functions, since 1) modeling functions is much harder than modeling computations (it requires understanding the semantic meaning of each operation, e.g., that certain operations cancel each other out), and 2) modeling functions additionally requires knowing the parameters of some operations, which are unknown before training.

Note also that in Definition 1, we only allow a single input signal. In the real world, however, a computation sometimes has multiple initial input signals. The case of multiple input signals can be reduced to the single-input case by adding an initial assignment operation that distributes the combined input signal to the corresponding next-level operations. For ease of presentation, we uniformly assume a single input throughout the paper.

Appendix C Proof of Theorem 1

Let v_1 be the starting node with no predecessors. By assumption, v_1 is the single starting node no matter how we permute the nodes of the input DAG. For v_1, the aggregation function A always outputs an all-zero vector. Thus, h_{v_1}^in is invariant to node permutations. Subsequently, the hidden state h_{v_1} = U(x_{v_1}, h_{v_1}^in) is also invariant to node permutations.

Now we prove the theorem by structural induction. Consider an arbitrary node v. Suppose for every predecessor u of v, the hidden state h_u is invariant to node permutations. We will show that h_v is also invariant to node permutations. Notice that in (3), the incoming message h_v^in output by A is invariant to node permutations, since A is invariant to the order of its inputs { h_u : u → v }, and all h_u are invariant to node permutations. Subsequently, node v's hidden state h_v = U(x_v, h_v^in) is invariant to node permutations. By induction, every node's hidden state is invariant to node permutations, including the ending node's hidden state. Thus, the D-VAE encoder is invariant to node permutations. ∎

Appendix D Proof of Theorem 2

Suppose there is an arbitrary input signal x fed to the starting node v_1. For convenience, we will use C_i(x) to denote the output signal at vertex v_i, where C_i represents the composition of all the operations along the paths from v_1 to v_i.

For the starting node v_1, remember we feed a fixed x_{v_1} and an all-zero incoming message to (2), thus h_{v_1} is also fixed. Since v_1 also represents a fixed input operation, we know that the mapping from C_1 to h_{v_1} is injective. Now we prove the theorem by induction. Assume the mapping from C_j to h_{v_j} is injective for all j < i, i.e., h_{v_j} = φ_j(C_j) with φ_j injective. We will prove that the mapping from C_i to h_{v_i} is also injective.

Consider the output signal C_i(x) at v_i, which is given by feeding the outputs of v_i's predecessors to the operation o_i. Thus,

C_i(x) = o_i( { C_j(x) : v_j → v_i } ).    (6)

In other words, we can write C_i as

C_i = g_i( o_i, { C_j : v_j → v_i } ),    (7)

where g_i is an injective function used for defining the composite computation C_i based upon o_i and { C_j : v_j → v_i }. Note that { C_j : v_j → v_i } can be either unordered or ordered depending on the operation o_i. For example, if o_i is a symmetric operation such as addition or multiplication, then { C_j : v_j → v_i } can be unordered. If o_i is an operation like subtraction or division, then { C_j : v_j → v_i } must be ordered.

With (2) and (3), we can write the hidden state h_{v_i} as follows:

h_{v_i} = U( e(o_i), A( { φ_j(C_j) : v_j → v_i } ) ),    (8)

where e is the injective one-hot encoding function mapping o_i to x_{v_i}. In the above equation, e, φ_j, A and U are all injective. Since the composition of injective functions is injective, there exists an injective function ρ_i so that

h_{v_i} = ρ_i( o_i, { C_j : v_j → v_i } ).    (9)

Then combining (7) we have:

h_{v_i} = ρ_i( g_i^{-1}( C_i ) ),    (10)

which is injective in C_i since the composition of injective functions is injective. Thus, we have proved that the mapping from C_i to h_{v_i} is injective. ∎

Appendix E Modifications for Encoding Neural Architectures

According to Theorem 2, to ensure D-VAE injectively encodes computations, we need the aggregation function A to be injective. Remember A takes the multiset { h_u : u → v } as input. If the order of its elements does not matter, then the gated sum in (4) can model this injective multiset function without issues. However, if the order matters (i.e., permuting the elements of { h_u : u → v } makes the operation output different results), we need a different aggregation function that can encode such orders.

Whether the order should matter for A depends on whether the input order matters for the operations (see the proof of Theorem 2 for more details). For example, if multiple previous layers' outputs are summed or averaged as the input to a next layer in the neural networks, then A can be modeled by the gated sum in (4), as the order of inputs does not matter. However, if these outputs are concatenated as the next layer's input, then the order does matter. In our experiments, the neural architectures use the second way to aggregate outputs from previous layers. The order of concatenation depends on a global order of the layers in a neural architecture. For example, if layer 2's and layer 4's outputs are input to layer 5, then layer 2's output will come before layer 4's output in the concatenation.

Since the gated sum in (4) can only handle the unordered case, we slightly modify (4) to make it order-aware and thus more suitable for our neural architectures. Our scheme is as follows:

h_v^in = Σ_{u → v} g([h_u ; id_u]) ⊙ m([h_u ; id_u]),    (11)

where id_u is the one-hot encoding of layer u's global ID (1, 2, 3, ...) and [· ; ·] denotes concatenation. Such an aggregation function respects the concatenation order of the layers. We empirically observed that this aggregation function can increase D-VAE's performance on neural architectures compared to the plain aggregation function (4). However, even using (4) still outperformed all baselines.
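
A possible PyTorch realization of (11) simply concatenates each predecessor's state with the one-hot global ID before the gating and mapping networks; sizes and the maximum number of IDs are illustrative.

```python
import torch
import torch.nn as nn

class OrderAwareGatedSum(nn.Module):
    """Order-aware gated sum as in (11): the one-hot global layer ID is concatenated
    with each predecessor state so that the concatenation order is reflected."""
    def __init__(self, hidden, max_nodes):
        super().__init__()
        self.max_nodes = max_nodes
        self.gate = nn.Sequential(nn.Linear(hidden + max_nodes, hidden), nn.Sigmoid())  # g
        self.mapper = nn.Linear(hidden + max_nodes, hidden, bias=False)                 # m

    def forward(self, pred_states, pred_ids):
        # pred_states: (k, hidden); pred_ids: list of k 0-based global layer IDs
        if len(pred_ids) == 0:
            return torch.zeros(self.mapper.out_features)
        ids = torch.eye(self.max_nodes)[torch.tensor(pred_ids)]
        z = torch.cat([pred_states, ids], dim=1)
        return (self.gate(z) * self.mapper(z)).sum(dim=0)
```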

Appendix F Modifications for Encoding Bayesian Networks

We also make some modifications when encoding Bayesian networks. One modification is that the aggregation function (4) is changed to:

h_v^in = Σ_{u → v} g(x_u) ⊙ m(x_u).    (12)

Compared to (4), we replace the predecessor states h_u with the node type features x_u. This is due to the differences between computations on a neural architecture and on a Bayesian network. In a neural network, the signal flow follows the network architecture, where the output signal of a layer is fed as the input signal to its succeeding layers, and what we are interested in is the result output by the final layer. In contrast, for a Bayesian network, the graph represents a set of conditional dependencies among variables instead of a computational flow. In particular, for Bayesian network structure learning, we are often concerned with computing the (log) marginal likelihood score of a dataset given a graph structure, which decomposes into local scores of individual variables given their parents (see Definition 18.2 in Koller and Friedman [1]). For example, in Figure 8, the overall score decomposes into a sum of local scores, one for each node given its parents. To compute the local score of a node, we only need the values of that node and of its parents; its grandparents have no influence on this local score.
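
To make the decomposition concrete, the sketch below computes a decomposable BIC score for a discrete Bayesian network from data: each node contributes a local log-likelihood given its parents minus a complexity penalty, and the local terms are simply summed. This is an illustrative pandas/numpy version, not the bnlearn scoring routine used in our experiments.

```python
import numpy as np
import pandas as pd

def bic_score(data: pd.DataFrame, parents: dict) -> float:
    """Decomposable BIC of a discrete Bayesian network structure.

    data    : categorical observations, one column per variable.
    parents : maps each variable name to the list of its parents' names,
              e.g. {"A": [], "B": ["A"], "C": ["A", "B"]}.
    """
    n = len(data)
    score = 0.0
    for var, pa in parents.items():
        r = data[var].nunique()                                         # states of var
        q = int(np.prod([data[p].nunique() for p in pa])) if pa else 1  # parent configurations
        groups = data.groupby(pa) if pa else [(None, data)]
        ll = 0.0
        for _, sub in groups:                                           # one parent config at a time
            counts = sub[var].value_counts().to_numpy()
            ll += (counts * np.log(counts / counts.sum())).sum()
        score += ll - 0.5 * np.log(n) * (r - 1) * q                     # local score of var
    return score
```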

Figure 8: An example Bayesian network and its encoding.

Based on this intuition, when computing the hidden state of a node, we use the features x_u of its parents instead of their hidden states h_u, which "d-separates" the node from its grandparents. For the update function, we still use (5).

Also based on the decomposability of the score, we make another modification for encoding Bayesian networks: we use the sum of all node states as the final output state instead of only using the ending node's state. Similarly, when decoding Bayesian networks, the graph state is given by the sum of all existing nodes' hidden states.

Note that the combination of (12) and (5) can injectively model the conditional dependence between a node and its parents. In addition, summing can model injective set functions [49, Lemma 5]. Therefore, the above encoding scheme is able to injectively encode the complete set of conditional dependencies of a Bayesian network, and thus also the overall score function of the network.

Appendix G Advantages of Encoding Computations in DAG Optimization

Here we discuss why D-VAE’s ability to injectively encode computations (Theorem 2) is of great benefit to performing DAG optimization in the latent space. Firstly, our target is to find a DAG that achieves high performance (e.g., accuracy of neural network, BIC score of Bayesian network) on a given dataset. The performance of a DAG is directly related to its computation. For example, given the same set of layer parameters, two neural networks with the same computation will have the same performance on a given test set. Since D-VAE encodes computations instead of structures, it allows embedding DAGs with similar performances to the same regions in the latent space, rather than embedding DAGs with merely similar structure patterns to the same regions. Subsequently, the latent space can be smooth w.r.t. performance instead of structure. Such smoothness can greatly facilitate searching for high-performance DAGs in the latent space, since similar-performance DAGs tend to locate near each other in the latent space instead of locating randomly, and modeling a smoothly-changing performance surface is much easier.

Note that Theorem 2 is a necessary condition for the latent space to be smooth w.r.t. performance, because if D-VAE could not injectively encode computations, it might map two DAGs representing completely different computations to the same encoding, making this point of the latent space arbitrarily unsmooth. Although there is not yet a theoretical guarantee that the latent space must be smooth w.r.t. DAGs' performances, we do empirically observe that the predictive performance and the Bayesian optimization performance of D-VAE's latent space are significantly better than those of the baselines, which is indirect evidence that D-VAE's latent space is smoother w.r.t. performance. Our visualization results also confirm the smoothness. See Sections 4.1, 4.2 and 4.3 for details.

Appendix H More Details about Neural Architecture Search

We use the efficient neural architecture search (ENAS) software [32] to generate the training and testing neural architectures. With these seed architectures, we can train a VAE model and then search for new high-performance architectures in the latent space.

ENAS alternately trains two components: 1) an RNN-based controller which is used to propose new architectures, and 2) the shared weights of the proposed architectures. It uses a weight-sharing (WS) scheme to obtain a quick but rough estimate of how good an architecture is. That is, it forces all the proposed architectures to use the same set of shared weights, instead of fully training each neural network individually. It assumes that an architecture with a high validation accuracy using the shared weights (i.e., the weight-sharing accuracy) is more likely to have a high test accuracy after fully retraining its weights from scratch.

We first run ENAS in the macro space (section 2.3 of Pham et al. [32]) for 1000 epochs, with 20 architectures proposed in each epoch. For all the proposed architectures excluding the first 1000 burn-in ones, we evaluate their weight-sharing accuracies using the shared weights from the last epoch. We further split the data into 90% training and 10% held-out test sets. Then our task becomes to train a VAE on the training neural architectures, and to generate new high-performance architectures from the latent space based on Bayesian optimization. Note that our target performance measure here is the weight-sharing accuracy, not the true validation/test accuracy after fully retraining the architecture. This is because the weight-sharing accuracy takes around 0.5 second to evaluate, while fully training a network takes over 12 hours. In consideration of our limited computational resources, we choose the weight-sharing accuracy as our optimization target in the Bayesian optimization experiments.

After the Bayesian optimization finds a final set of architectures with high weight-sharing accuracies, we will fully train them to evaluate their true test accuracies on CIFAR-10. To fully train an architecture, we follow the original setting of ENAS to train each architecture on CIFAR-10’s training set for 310 epochs, and report the last epoch’s net’s test accuracy. See [32, section 3.2] for details.

Due to our constrained computational resources, we choose not to perform Bayesian optimization on the true validation accuracy (after full training), which would be a more principled way to search for neural architectures. Nevertheless, we describe the procedure here for future exploration: after training the D-VAE, we have no architectures at all to initialize a Gaussian process regression on the true validation accuracy. Thus, we would need to randomly pick some points in the latent space, decode them into neural architectures, and obtain their true validation accuracies after full training. Then, with these initial points, we would start the Bayesian optimization similarly to section 4.2, with the optimization target replaced by the true validation accuracy. Finally, we would obtain a set of architectures with the highest true validation accuracies, and report their true test accuracies. This experiment would take much longer (months of GPU time), so parallelizing the training would be necessary.

One might wonder why we train another generative model after we already have ENAS. Firstly, ENAS is not general-purpose, but task specific. It leverages the validation accuracy signals to train the controller based on reinforcement learning. For any new NAS tasks, ENAS needs to be completely retrained. In contrast, D-VAE is unsupervised. It only needs to be trained once, and can be applied to other NAS tasks. Secondly, D-VAE also provides a way to learn neural architecture embeddings, which can be used for downstream tasks such as visualization, classification, clustering etc.

In the Bayesian optimization experiments (section 4.2), the best architecture found by D-VAE achieves a test accuracy of 94.80% on CIFAR-10. Although this does not outperform state-of-the-art NAS techniques, which reach an error rate of 2.11%, our architecture only contains 3 million parameters, compared to the state-of-the-art NAONet + Cutout which has 128 million parameters [38]. In addition, NAONet used 200 GPUs to fully train 1,000 architectures for 1 day, and stacked the final found cell 6 times as well as adding 4 times more filters after the optimization. In comparison, we only used 1 GPU to evaluate the weight-sharing accuracy, and never used any data augmentation techniques or architecture stacking to boost the performance, since achieving new state-of-the-art NAS results (through using great resources and heavy engineering) is beyond the main purpose of our paper.

Appendix I More details about Bayesian network structure learning

We consider a small synthetic problem called Asia [58] as our target Bayesian network structure learning problem. The Asia dataset is composed of 5,000 samples, each generated from a true network with 8 binary variables (http://www.bnlearn.com/documentation/man/asia.html). The Bayesian Information Criterion (BIC) score is used to evaluate how well a Bayesian network fits the 5,000 samples. To train a VAE model to generate Bayesian network structures, we sample 200,000 random 8-node Bayesian networks from the bnlearn package [57] in R, which are split into 90% training and 10% testing sets. Our task is to train a VAE model on the training Bayesian networks, and search in the latent space for Bayesian networks with high BIC scores using Bayesian optimization. In this task, we consider a simplified case where the topological order of the true network is known: we let the sampled training and test Bayesian networks have topological orders consistent with the true network of Asia. This is a reasonable assumption for many practical applications, e.g., when the variables have a temporal order [1]. When sampling a network, the probability of a node having an edge with a previous node (as specified by the order) is set to the default value 2/(n−1), where n is the number of nodes, which results in sparse graphs where the number of edges is on the same order as the number of nodes.
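
The sampling setup can be mimicked as follows: given the fixed topological order, every forward edge is included independently with probability 2/(n−1). This is an illustrative Python stand-in; the exact sampling scheme of bnlearn's generator may differ.

```python
import numpy as np

def sample_dag(order, prob=None, seed=None):
    """Sample a random DAG consistent with a fixed topological order `order`.
    Each edge order[i] -> order[j] with i < j is kept with probability `prob`
    (default 2/(n-1), so the expected number of edges is about n)."""
    rng = np.random.default_rng(seed)
    n = len(order)
    p = prob if prob is not None else 2.0 / (n - 1)
    return [(order[i], order[j])
            for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]
```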

Appendix J Baselines

As discussed in the related work, there are other types of graph generative models that can potentially work for DAGs. We explore three possible approaches and contrast them with D-VAE.

S-VAE. The S-VAE baseline treats a DAG as a sequence of node strings, which we call string-based variational autoencoder (S-VAE). In S-VAE, each node is represented as the one-hot encoding of its type number concatenated with a 0/1 indicator vector indicating which previous nodes have directed edges to it (i.e., a column of the adjacency matrix). For example, suppose there are two node types and five nodes, then node 4’s string “0 1, 0 1 1 0 0” means this node has type 2, and has directed edges from previous nodes 2 and 3. S-VAE leverages a standard GRU-based RNN variational autoencoder [52] on the topologically sorted node sequences, with each node’s string treated as its input bit vector.
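
For clarity, the node strings consumed by S-VAE can be constructed as in the sketch below (illustrative only; padding and ordering details of the actual baseline code may differ).

```python
import numpy as np

def svae_node_vectors(node_types, edges, num_types):
    """Build S-VAE's per-node input vectors for a topologically sorted DAG:
    node i becomes [one-hot(type_i), 0/1 indicator of incoming edges from earlier nodes].

    node_types : list of type ids, already in a topological order.
    edges      : iterable of (u, v) pairs meaning u -> v, with u < v in that order.
    """
    n = len(node_types)
    adj = np.zeros((n, n))
    for u, v in edges:
        adj[u, v] = 1.0
    return np.stack([np.concatenate([np.eye(num_types)[t], adj[:, i]])
                     for i, t in enumerate(node_types)])
```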

GraphRNN. One similar generative model is GraphRNN [13]. Different from S-VAE, it further decomposes an adjacency column into entries and generates the entries one by one using another edge-level GRU. GraphRNN is a pure generative model which does not have an encoder, thus cannot optimize DAG performance in a latent space. To compare with GraphRNN, we equip it with S-VAE’s encoder and use it as another baseline. Note that the original GraphRNN feeds nodes using a BFS order (for undirected graphs), yet we find that it is much worse than using a topological order here. Note also that although GraphRNN seems more expressive than S-VAE, we find that in our applications GraphRNN tends to have more severe overfitting and generates less diverse DAGs.

Figure 9: Two bits of change in the string representations can completely change the computational purpose.

Both GraphRNN and S-VAE treat DAGs as bit strings and use RNNs to model them. This representation has several drawbacks. Firstly, since the topological ordering is often not unique for a DAG, there might be multiple string representations for the same DAG, which all result in different encoded representations. This violates the permutation invariance of Theorem 1. Secondly, the string representations can be very brittle in terms of modeling DAGs' computational purposes. In Figure 9, the left and right DAGs' string representations differ by only two bits, i.e., the edge (2,3) in the left is changed to the edge (1,3) in the right. However, this two-bit change in structure greatly changes the signal flow, and thus the computation performed by the right DAG. In S-VAE and GraphRNN, since the bit representations of the left and right DAGs are very similar, they are highly likely to be encoded to similar latent vectors. In particular, the only difference between encoding the left and right DAGs is that, for node 3, the encoder RNN reads an adjacency column of [0, 1, 0, 0, 0, 0] in the left and [1, 0, 0, 0, 0, 0] in the right, while all the remaining encoding is exactly the same. By embedding two DAGs serving very different computational purposes to the same region of the latent space, S-VAE and GraphRNN tend to have less smooth latent spaces, which makes optimization over them more difficult. In contrast, D-VAE can better differentiate such subtle differences, as the change of edge (2,3) to (1,3) completely changes the aggregated message node 3 receives in D-VAE (the hidden state of node 2 vs. the hidden state of node 1), which greatly affects node 3's and all its successors' feature learning.

GCN. The graph convolutional network (GCN) [22] is a representative graph neural network with a simultaneous message passing scheme. In GCN, all the nodes take their neighbors' incoming messages to update their own states simultaneously instead of following an order. After message passing, the sum of the node states is used as the graph state. We include GCN as the third baseline. Since GCN can only encode graphs, we equip it with D-VAE's decoder to make it a VAE model.

Using GCN as the encoder can ensure permutation invariance, since node ordering does not matter in GCN. However, GCN’s message passing focuses on propagating the neighboring nodes’ features to each center node to encode the local substructure pattern around each node. In comparison, D-VAE’s message passing simulates how the computation is performed along the directed paths of a DAG and focuses on encoding the computation. Although learning local substructure features is essential for GCN’s successes in node classification and graph classification, here in our tasks, modeling the entire computation is much more important than modeling the local features. Encoding only local substructures may also lose important information about the global DAG topology, thus making it more difficult to reconstruct the DAG.

We omit other possible approaches such as GraphVAE [12], GraphSAGE [24], and other graph-based models [6, 17, 18, 19], either because they share similar characteristics with the compared baselines, or because they lack official code or target specific graph types (such as molecules).

Appendix K VAE Training Details

We use the same settings and hyperparameters (where applicable) for all four models to make the comparison as fair as possible. Many hyperparameters are inherited from Kusner et al. [3]. Single-layer GRUs are used in all models requiring recurrent units, with the same hidden state size of 501. We set the dimension of the latent space to 56 for all models. All VAE models use the standard normal N(0, I) as the prior distribution p(z), and take the approximate posterior q(z|G) (where G denotes the input DAG) to be a normal distribution with a diagonal covariance matrix, whose mean and variance parameters are output by the encoder. The two MLPs used to output the mean and variance parameters are both implemented as single linear layers.
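Under these settings, the encoder head can be sketched as follows; this is our own minimal PyTorch illustration (not the released implementation), and we assume the variance is parameterized through its log, a common convention.

```python
import torch
import torch.nn as nn

HIDDEN_SIZE, LATENT_DIM = 501, 56

class EncoderHead(nn.Module):
    """Maps the final graph state to a diagonal-Gaussian posterior q(z|G);
    the mean and (log-)variance heads are single linear layers."""
    def __init__(self):
        super().__init__()
        self.fc_mean = nn.Linear(HIDDEN_SIZE, LATENT_DIM)
        self.fc_logvar = nn.Linear(HIDDEN_SIZE, LATENT_DIM)

    def forward(self, graph_state):
        mean = self.fc_mean(graph_state)
        logvar = self.fc_logvar(graph_state)
        # Reparameterization trick: z = mean + sigma * eps, eps ~ N(0, I)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(logvar)
        return z, mean, logvar

head = EncoderHead()
z, mean, logvar = head(torch.randn(4, HIDDEN_SIZE))  # batch of 4 graph states
```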

For the decoder network of D-VAE, the MLP that predicts the new node's type and the MLP that predicts whether to add each incoming edge are both two-layer MLPs with ReLU nonlinearities, where the hidden layer size is set to two times the input size. A softmax activation follows the node-type MLP, and a sigmoid activation follows the edge MLP. For the gating network, we use a single linear layer with sigmoid activation. For the mapping function, we use a linear mapping without activation. The bidirectional encoding discussed in Section 3.3 is enabled for D-VAE on neural architectures, and disabled for D-VAE on Bayesian networks and for the other models, where it yields no improvement. To measure the reconstruction loss, we use teacher forcing [17]: following the topological order in which the input DAG's nodes are consumed, we sum the negative log-likelihood of each decoding step, forcing the model to generate the ground-truth node type or edge at each step. This ensures that the model makes predictions based on the correct histories. We then optimize the VAE loss (the negative of (1)) using gradient descent, following [17].
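The sketch below illustrates the stated shapes of these decoder components. It is an illustration only: the number of node types, the assumption that the edge MLP takes the concatenation of two node states, and the input/output sizes of the gating and mapping layers are our own guesses rather than details confirmed above.

```python
import torch
import torch.nn as nn

def two_layer_mlp(in_dim, out_dim):
    """Two-layer MLP with ReLU; hidden size is twice the input size."""
    return nn.Sequential(nn.Linear(in_dim, 2 * in_dim), nn.ReLU(),
                         nn.Linear(2 * in_dim, out_dim))

HIDDEN_SIZE, NUM_TYPES = 501, 8   # NUM_TYPES is dataset-dependent (hypothetical here)

add_vertex = two_layer_mlp(HIDDEN_SIZE, NUM_TYPES)            # followed by softmax
add_edge   = two_layer_mlp(2 * HIDDEN_SIZE, 1)                # followed by sigmoid
gate       = nn.Sequential(nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE), nn.Sigmoid())
mapping    = nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE, bias=False)  # linear, no activation

graph_state = torch.randn(1, HIDDEN_SIZE)            # current partial-graph state
type_probs  = torch.softmax(add_vertex(graph_state), dim=-1)
pair_state  = torch.randn(1, 2 * HIDDEN_SIZE)        # new node + candidate predecessor
edge_prob   = torch.sigmoid(add_edge(pair_state))
```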

When optimizing the VAE loss, we weight the KL divergence term in (1) by a factor α, i.e., the loss is the reconstruction loss plus α times the KLD. In the original VAE framework, α is set to 1. However, we found that this led to poor reconstruction accuracies, similar to the findings of previous work [3, 10, 17]. Following the implementation of Jin et al. [17], we set α to a smaller value. Mini-batch gradient descent with the Adam optimizer [66] is used for all models. For neural architectures, we use a batch size of 32 and train all models for 300 epochs. For Bayesian networks, we use a batch size of 128 and train all models for 100 epochs. We use an initial learning rate of 1E-4, and multiply the learning rate by 0.1 whenever the training loss does not decrease for 10 epochs. We use PyTorch to implement all the models.
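A minimal sketch of the corresponding loss and learning-rate schedule is given below; the `vae_loss` helper, the stand-in `model`, and the use of `ReduceLROnPlateau` are our own illustration of the stated settings, and the value of `alpha` is left as a placeholder.

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

def vae_loss(recon_nll, mean, logvar, alpha):
    """Teacher-forced reconstruction NLL plus alpha-weighted KL divergence
    between q(z|G) = N(mean, diag(exp(logvar))) and the prior N(0, I)."""
    kld = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
    return recon_nll + alpha * kld

model = nn.Linear(501, 56)   # stand-in for the full encoder-decoder model
optimizer = Adam(model.parameters(), lr=1e-4)
# Multiply the learning rate by 0.1 whenever the tracked training loss
# has not decreased for 10 epochs.
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)
# After each training epoch: scheduler.step(epoch_training_loss)
```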

Appendix L SGP Training Details

We use sparse Gaussian process (SGP) regression as the predictive model, with the open-sourced SGP implementation from [3]. Both the training and test performances are standardized according to the mean and standard deviation of the training data's performances before being fed to the SGP, and the RMSE and Pearson's r in Table 1 are also computed on the standardized performances. We train the SGP with the default Adam optimizer for 100 epochs, using a mini-batch size of 1,000 and a constant learning rate of 5E-4.

For neural architectures, we use all the training data to train the SGP. For Bayesian networks, we randomly sample 5,000 training examples each time, for two reasons: 1) using all 180,000 examples to train the SGP might not be realistic for a typical scenario where the network/dataset is large and evaluating a network is expensive; and 2) we found that using a smaller sample of the training data results in more stable BO performance, because duplicate rows, which can lead to ill-conditioned matrices, become less likely. Note that when training the variational autoencoders, all the training data are used, since the VAE training is purely unsupervised.
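The standardization and subsampling described above can be sketched as follows (NumPy only; we do not reproduce the SGP implementation of [3], and the array names are ours).

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize(y_train, y_test):
    """Standardize performances by the training data's mean and std."""
    mu, sigma = y_train.mean(), y_train.std()
    return (y_train - mu) / sigma, (y_test - mu) / sigma

def subsample(X, y, n=5000):
    """Randomly draw n training examples (used for the Bayesian network task)."""
    idx = rng.choice(len(y), size=min(n, len(y)), replace=False)
    return X[idx], y[idx]
```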

Appendix M More Experimental Results

M.1 Reconstruction accuracy, prior validity, uniqueness and novelty

Being able to accurately reconstruct input examples and to generate valid new examples are basic requirements for VAE models. In this experiment, we evaluate the models by measuring 1) how often they can perfectly reconstruct input DAGs (Accuracy), 2) how often they can generate valid neural architectures or Bayesian networks from the prior distribution (Validity), 3) the proportion of unique DAGs among the valid generations (Uniqueness), and 4) the proportion of valid generations that are never seen in the training set (Novelty).

We first evaluate each model's reconstruction accuracy on the test sets. Following previous work [3, 17], we regard the encoding as a stochastic process. That is, after obtaining the mean and variance parameters of the posterior approximation q(z|G), we sample a z from it as G's latent vector. To estimate the reconstruction accuracy, we sample z 10 times for each G, and decode each z 10 times. We then report the average proportion of the 100 decoded DAGs that are identical to the input.
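A sketch of this estimation procedure is given below; `model.encode`, `model.decode`, and the `same_dag` equality check are placeholders for the actual interfaces.

```python
import torch

def reconstruction_accuracy(model, test_dags, same_dag, n_z=10, n_decode=10):
    """Sample z n_z times per input DAG, decode each z n_decode times, and
    report the average fraction of decoded DAGs identical to the input."""
    hits, total = 0, 0
    for g in test_dags:
        mean, logvar = model.encode(g)            # placeholder encoder interface
        for _ in range(n_z):
            z = mean + torch.exp(0.5 * logvar) * torch.randn_like(logvar)
            for _ in range(n_decode):
                g_hat = model.decode(z)           # placeholder decoder interface
                hits += int(same_dag(g, g_hat))   # placeholder DAG-equality check
                total += 1
    return hits / total
```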

To calculate prior validity, we sample 1,000 latent vectors from the prior distribution and decode each latent vector 10 times. We then report the proportion of these 10,000 generated DAGs that are valid. A generated DAG is valid if it can be read by the original software that generated the training data. More details about the validity experiment are in Appendix M.2.

                      Neural architectures                     Bayesian networks
Methods     Accuracy  Validity  Uniqueness  Novelty   Accuracy  Validity  Uniqueness  Novelty
D-VAE          99.96    100.00       37.26   100.00      99.94     98.84       38.98    98.01
S-VAE          99.98    100.00       37.03    99.99      99.99    100.00       35.51    99.70
GraphRNN       99.85     99.84       29.77   100.00      96.71    100.00       27.30    98.57
GCN             5.42     99.37       41.18   100.00      99.07     99.89       30.53    98.26

Table 2: Reconstruction accuracy, prior validity, uniqueness and novelty (%).

We show the results in Table 2. Among all the models, D-VAE and S-VAE generally perform the best. D-VAE, S-VAE and GraphRNN all have near-perfect reconstruction accuracy, prior validity and novelty, but D-VAE and S-VAE show higher uniqueness, meaning that they generate more diverse examples. We find that GCN is not suitable for modeling neural architectures, as it reconstructs only 5.42% of unseen inputs. This is not surprising, since the simultaneous message passing scheme in GCN focuses on learning local graph structures but fails to encode the computation represented by the entire neural network. Besides, the sum pooling after message passing may also lose some global topology information that is important for reconstruction.

M.2 More details on the prior validity experiment

Since different models can have different levels of convergence w.r.t. the KLD loss in (1), their posterior distributions may have different degrees of alignment with the prior distribution p(z). If we evaluated prior validity by sampling from p(z) for all models, we would favor models with a higher level of KLD convergence. To remove such effects and focus purely on a model's intrinsic ability to generate valid DAGs, when evaluating prior validity we scale and shift each sampled latent vector as z' = σ ⊙ z + μ for each model, where μ and σ are the mean and standard deviation of the training data's encoded means under that model, so that the latent vectors are moved to the center of the training data's embeddings. If we do not apply such a transformation, we find that we can easily manipulate the prior validity results by training for more or fewer epochs or by putting more or less weight on the KLD loss.
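A sketch of this adjustment, as we read it from the description above, is the following: prior samples are rescaled and recentered using the mean and standard deviation of the training data's encoded means.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_samples_centered(train_means, n_samples=1000, latent_dim=56):
    """Sample z ~ N(0, I) and scale/shift it toward the center of the training
    data's embeddings: z' = sigma * z + mu, where mu and sigma are computed
    from the encoded means of the training set."""
    mu = train_means.mean(axis=0)
    sigma = train_means.std(axis=0)
    z = rng.standard_normal((n_samples, latent_dim))
    return sigma * z + mu
```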

For a generated neural architecture to be read by ENAS, it has to pass the following validity checks: 1) It has one and only one starting node (the input layer); 2) It has one and only one ending node (the output layer); 3) Other than the input node, there are no nodes without any predecessors (no isolated paths); 4) Other than the output node, there are no nodes without any successors (no blocked paths); 5) Each node has a directed edge from the node immediately before it (a constraint of ENAS), i.e., there is always a main path connecting all the nodes; and 6) It is a DAG.
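A minimal checker implementing these six rules might look like the following; the representation of an architecture as a node count plus a set of directed edges, with nodes indexed in generation order (0 the input, n-1 the output), is our own convention for illustration.

```python
def is_valid_architecture(n, edges):
    """Check the six ENAS validity rules for an architecture given as n nodes
    (0 = input, n-1 = output, indexed in generation order) and directed edges (i, j)."""
    preds = {i: set() for i in range(n)}
    succs = {i: set() for i in range(n)}
    for i, j in edges:
        preds[j].add(i)
        succs[i].add(j)
    # Rules 1 & 3: node 0 is the only node without predecessors.
    if preds[0] or any(not preds[i] for i in range(1, n)):
        return False
    # Rules 2 & 4: node n-1 is the only node without successors.
    if succs[n - 1] or any(not succs[i] for i in range(n - 1)):
        return False
    # Rule 5: a main path connects every pair of consecutive nodes.
    if any((i - 1, i) not in edges for i in range(1, n)):
        return False
    # Rule 6: acyclicity -- under this indexing, every edge must point forward.
    return all(i < j for i, j in edges)
```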

For a generated Bayesian network to be read by bnlearn and evaluated on the Asia dataset, it has to pass the following validity checks: 1) It has exactly 8 nodes; 2) Each type in "ASTLBEXD" appears exactly once; and 3) It is a DAG.

Note that the training graphs generated by the original software all satisfy these validity constraints.

Figure 10: Comparing BO with random search on neural architectures. Left: average weight-sharing accuracy of the selected points in each iteration. Right: highest weight-sharing accuracy of the selected points over time.
Figure 11: Comparing BO with random search on Bayesian networks. Left: average BIC score of the selected points in each iteration. Right: highest BIC score of the selected points over time.

Figure 12: 2-D visualization of decoded neural architectures. Left: D-VAE. Right: S-VAE.

M.3 Bayesian optimization vs. random search

To validate that Bayesian optimization (BO) in the latent space does provide guidance toward better DAGs, we compare BO with Random (which randomly samples points from the latent space of D-VAE). Figures 10 and 11 show the results (averaged across 10 trials). In each figure, the left plot shows the average performance of the points selected in each BO round, and the right plot shows the highest performance of all points found so far. As expected, BO consistently selects points with better average performance in each round than random search. However, in terms of the highest performance, BO tends to fall behind Random in the first few rounds. This might be because our batch expected improvement heuristic exploits the currently most promising regions by selecting most points of a batch in the same region (exploitation), while Random explores the entire space more evenly (exploration). Nevertheless, BO quickly catches up after a few rounds and shows a long-term advantage.
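For reference, a common form of the expected improvement acquisition together with a simple top-k batch selection is sketched below; this is a generic illustration that may differ from the exact batch heuristic used here, and `mu` and `sigma` are assumed to come from the SGP's predictive posterior.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_y):
    """EI for maximization under a Gaussian predictive posterior (mean mu, std sigma)."""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

def select_batch(candidates, mu, sigma, best_y, batch_size=50):
    """Greedy batch selection: take the batch_size candidates with the highest EI.
    This concentrates the batch in the currently most promising region (exploitation)."""
    ei = expected_improvement(mu, sigma, best_y)
    top = np.argsort(-ei)[:batch_size]
    return candidates[top]
```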

M.4 More Visualization Results for Neural Architectures

We randomly pick a neural architecture and use its encoded mean as the starting point. We then generate two random orthogonal directions, and move along combinations of these two directions from the starting point to render a 2-D visualization of the decoded architectures in Figure 12.
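A sketch of how such a grid can be produced is shown below; the QR-based orthogonalization and all names are our own choices, and `model.decode` is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_grid(center, grid_size=11, step=1.0, latent_dim=56):
    """Build a 2-D grid of latent vectors around `center` by moving along two
    random orthogonal directions (orthogonalized via a QR decomposition)."""
    q, _ = np.linalg.qr(rng.standard_normal((latent_dim, 2)))
    d1, d2 = q[:, 0], q[:, 1]
    offsets = (np.arange(grid_size) - grid_size // 2) * step
    return [[center + a * d1 + b * d2 for b in offsets] for a in offsets]

# Each grid point would then be decoded (e.g., dag = model.decode(z)) and the
# resulting DAGs drawn in a grid, as in Figures 12 and 13.
```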

Figure 13: 2-D visualization of decoded Bayesian networks. Left: D-VAE. Right: S-VAE.

M.5 More Visualization Results for Bayesian Networks

We similarly show the 2-D visualization of decoded Bayesian networks in Figure 13. Both D-VAE and S-VAE show smooth latent spaces.