1 Introduction and Related Work
Generative models for real-world graphs have important applications in many domains, including modeling physical and social interactions, discovering new chemical and molecular structures, and constructing knowledge graphs. Development of generative graph models has a rich history, and many methods have been proposed that can generate graphs based on a priori structural assumptions (Newman, 2010). However, a key open challenge in this area is developing methods that can directly learn generative models from an observed set of graphs. Developing generative models that can learn directly from data is an important step towards improving the fidelity of generated graphs, and paves the way for new kinds of applications, such as discovering new graph structures and completing evolving graphs.
In contrast, traditional generative models for graphs (e.g., Barabási-Albert model, Kronecker graphs, exponential random graphs, and stochastic block models) (Erdős & Rényi, 1959; Leskovec et al., 2010; Albert & Barabási, 2002; Airoldi et al., 2008; Leskovec et al., 2007; Robins et al., 2007) are hand-engineered to model a particular family of graphs, and thus do not have the capacity to directly learn the generative model from observed data. For example, the Barabási-Albert model is carefully designed to capture the scale-free nature of empirical degree distributions, but fails to capture many other aspects of real-world graphs, such as community structure.
Recent advances in deep generative models, such as variational autoencoders (VAE) (Kingma & Welling, 2014) and generative adversarial networks (GAN) (Goodfellow et al., 2014), have made important progress towards generative modeling for complex domains, such as image and text data. Building on these approaches, a number of deep learning models for generating graphs have been proposed (Kipf & Welling, 2016; Grover et al., 2017; Simonovsky & Komodakis, 2018; Li et al., 2018). For example, Simonovsky & Komodakis (2018) propose a VAE-based approach, while Li et al. (2018) propose a framework based upon graph neural networks. However, these recently proposed deep models are either limited to learning from a single graph (Kipf & Welling, 2016; Grover et al., 2017) or generating small graphs with 40 or fewer nodes (Li et al., 2018; Simonovsky & Komodakis, 2018). These limitations stem from three fundamental challenges in the graph generation problem:

Large and variable output spaces: To generate a graph with $n$ nodes, the generative model has to output $n^2$ values to fully specify its structure. Also, the number of nodes and edges varies between different graphs, and a generative model needs to accommodate such complexity and variability in the output space.

Non-unique representations: In the general graph generation problem studied here, we want distributions over possible graph structures without assuming a fixed set of nodes (e.g., to generate candidate molecules of varying sizes). In this general setting, a graph with $n$ nodes can be represented by up to $n!$ equivalent adjacency matrices, each corresponding to a different, arbitrary node ordering/numbering. Such high representation complexity is challenging to model and makes it expensive to compute and then optimize objective functions, like reconstruction error, during training. For example, GraphVAE (Simonovsky & Komodakis, 2018) uses approximate graph matching to address this issue, requiring $O(n^4)$ operations in the worst case (Cho et al., 2014).

Complex dependencies: Edge formation in graphs involves complex structural dependencies. For example, in many real-world graphs two nodes are more likely to be connected if they share common neighbors (Newman, 2010). Therefore, edges cannot be modeled as a sequence of independent events, but rather need to be generated jointly, where each next edge depends on the previously generated edges. Li et al. (2018) address this problem using graph neural networks to perform a form of "message passing"; however, while expressive, this approach takes $O(mn^2\,\mathrm{diam}(G))$ operations to generate a graph with $m$ edges, $n$ nodes and diameter $\mathrm{diam}(G)$.
Present work. Here we address the above challenges and present Graph Recurrent Neural Networks (GraphRNN), a scalable framework for learning generative models of graphs. GraphRNN models a graph in an autoregressive (or recurrent) manner, as a sequence of additions of new nodes and edges, to capture the complex joint probability of all nodes and edges in the graph. In particular, GraphRNN can be viewed as a hierarchical model, where a graph-level RNN maintains the state of the graph and generates new nodes, while an edge-level RNN generates the edges for each newly generated node. Due to its autoregressive structure, GraphRNN can naturally accommodate variable-sized graphs, and we introduce a breadth-first-search (BFS) node-ordering scheme to drastically improve scalability. This BFS approach alleviates the fact that graphs have non-unique representations, by collapsing distinct representations to unique BFS trees, and the tree structure induced by BFS allows us to limit the number of edge predictions made for each node during training. Our approach requires $O(n^2)$ operations on worst-case (i.e., complete) graphs, but we prove that our BFS ordering scheme permits sub-quadratic complexity in many cases.
In addition to the novel GraphRNN framework, we also introduce a comprehensive suite of benchmark tasks and baselines for the graph generation problem, with all code made publicly available (the code is available at https://github.com/snap-stanford/GraphRNN, and the appendix at https://arxiv.org/abs/1802.08773). A key challenge for the graph generation problem is quantitative evaluation of the quality of generated graphs. Whereas prior studies have mainly relied on visual inspection or first-order moment statistics for evaluation, we provide a comprehensive evaluation setup by comparing graph statistics such as the degree distribution, clustering coefficient distribution and motif counts for two sets of graphs based on variants of the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012). This quantitative evaluation approach can compare higher-order moments of graph-statistic distributions and provides a more rigorous evaluation than simply comparing mean values.
Extensive experiments on synthetic and real-world graphs of varying size demonstrate the significant improvement GraphRNN provides over baseline approaches, including the most recent deep graph generative models as well as traditional models. Compared to traditional baselines (e.g., stochastic block models), GraphRNN is able to generate high-quality graphs on all benchmark datasets, while the traditional models are only able to achieve good performance on specific datasets that exhibit special structures. Compared to other state-of-the-art deep graph generative models, GraphRNN is able to achieve superior quantitative performance, in terms of the MMD distance between the generated and test set graphs, while also scaling to graphs that are larger than what these previous approaches can handle. Overall, GraphRNN reduces MMD relative to the baselines on average across all datasets and generalizes effectively, achieving comparatively high log-likelihood scores on held-out data.
2 Proposed Approach
We first describe the background and notation for building generative models of graphs, and then describe our autoregressive framework, GraphRNN.
2.1 Notations and Problem Definition
An undirected graph $G = (V, E)$ is defined by its node set $V = \{v_1, \ldots, v_n\}$ and edge set $E = \{(v_i, v_j) \mid v_i, v_j \in V\}$. (We focus on undirected graphs; extensions to directed graphs and graphs with features are discussed in the Appendix.) One common way to represent a graph is using an adjacency matrix, which requires a node ordering $\pi$ that maps nodes to rows/columns of the adjacency matrix. More precisely, $\pi$ is a permutation function over $V$ (i.e., $(\pi(v_1), \ldots, \pi(v_n))$ is a permutation of $(v_1, \ldots, v_n)$). We define $\Pi$ as the set of all $n!$ possible node permutations. Under a node ordering $\pi$, a graph $G$ can then be represented by the adjacency matrix $A^\pi \in \mathbb{R}^{n \times n}$, where $A^\pi_{i,j} = \mathbb{1}[(\pi(v_i), \pi(v_j)) \in E]$. Note that elements in the set of adjacency matrices $A^\Pi = \{A^\pi \mid \pi \in \Pi\}$ all correspond to the same underlying graph.
The goal of learning generative models of graphs is to learn a distribution $p_{model}(G)$ over graphs, based on a set of observed graphs $\mathbb{G} = \{G_1, \ldots, G_s\}$ sampled from data distribution $p(G)$, where each graph $G_i$ may have a different number of nodes and edges. When representing $G \in \mathbb{G}$, we further assume that we may observe any node ordering $\pi$ with equal probability, i.e., $p(\pi) = \frac{1}{n!}$ for all $\pi \in \Pi$. Thus, the generative model needs to be capable of generating graphs where each graph could have exponentially many representations, which is distinct from previous generative models for images, text, and time series.
Finally, note that traditional graph generative models (surveyed in the introduction) usually assume a single input training graph. Our approach is more general and can be applied to a single as well as multiple input training graphs.
2.2 A Brief Survey of Possible Approaches
We start by surveying some general alternative approaches for modeling $p(G)$, in order to highlight the limitations of existing non-autoregressive approaches and motivate our proposed autoregressive architecture.
Vector-representation based models. One naïve approach would be to represent $G$ by flattening $A^\pi$ into a vector in $\mathbb{R}^{n^2}$, which is then used as input to any off-the-shelf generative model, such as a VAE or GAN. However, this approach suffers from serious drawbacks: it cannot naturally generalize to graphs of varying size, and requires training on all possible node permutations or specifying a canonical permutation, both of which require $O(n!)$ time in general.
Node-embedding based models. There have been recent successes in encoding a graph's structural properties into node embeddings (Hamilton et al., 2017), and one approach to graph generation could be to define a generative model that decodes edge probabilities based on pairwise relationships between learned node embeddings (as in Kipf & Welling, 2016). However, this approach is only well-defined when given a fixed set of nodes, limiting its utility for the general graph generation problem, and approaches based on this idea are limited to learning from a single input graph (Kipf & Welling, 2016; Grover et al., 2017).
2.3 GraphRNN: Deep Generative Models for Graphs
The key idea of our approach is to represent graphs under different node orderings as sequences, and then to build an autoregressive generative model on these sequences. As we will show, this approach does not suffer from the drawbacks common to other general approaches (cf. Section 2.2), allowing us to model graphs of varying size with complex edge dependencies, and we introduce a BFS node ordering scheme to drastically reduce the complexity of learning over all possible node sequences (Section 2.3.4). In this autoregressive framework, the model complexity is greatly reduced by weight sharing with recurrent neural networks (RNNs). Figure 1 illustrates our GraphRNN approach, where the main idea is that we decompose graph generation into a process that generates a sequence of nodes (via a graph-level RNN), and another process that then generates a sequence of edges for each newly added node (via an edge-level RNN).
2.3.1 Modeling graphs as sequences
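We first define a mapping $f_S$ from graphs to sequences, where for a graph $G \sim p(G)$ with $n$ nodes under node ordering $\pi$, we have
$$S^\pi = f_S(G, \pi) = (S^\pi_1, \ldots, S^\pi_n), \qquad (1)$$
where each element $S^\pi_i \in \{0,1\}^{i-1}$, $i \in \{1, \ldots, n\}$, is an adjacency vector representing the edges between node $\pi(v_i)$ and the previous nodes $\pi(v_j)$, $j \in \{1, \ldots, i-1\}$, already in the graph (we prohibit self-loops, and $S^\pi_1$ is defined as an empty vector):
$$S^\pi_i = (A^\pi_{1,i}, \ldots, A^\pi_{i-1,i})^T, \quad \forall i \in \{2, \ldots, n\}. \qquad (2)$$
For undirected graphs, $S^\pi$ determines a unique graph $G$, and we write the mapping as $f_G(\cdot)$ where $f_G(S^\pi) = G$.
Thus, instead of learning $p(G)$, whose sample space cannot be easily characterized, we sample the auxiliary $\pi$ to get the observations of $S^\pi$ and learn $p(S^\pi)$, which can be modeled autoregressively due to the sequential nature of $S^\pi$. At inference time, we can sample $G$ without explicitly computing $p(G)$ by sampling $S^\pi$, which maps to $G$ via $f_G$.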
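The graph-to-sequence mapping and its inverse can be illustrated with a minimal sketch. The function names `graph_to_sequence` and `sequence_to_graph` are hypothetical, and the representation (a dense 0/1 adjacency matrix as nested lists, an ordering as a list of node indices) is an assumption made only for this example.

```python
def graph_to_sequence(adj, order):
    """Return adjacency vectors: element i lists the edges between the
    i-th node in `order` and all nodes placed before it."""
    seq = []
    for i, u in enumerate(order):
        # The first node has no predecessors, so its vector is empty.
        seq.append([adj[order[j]][u] for j in range(i)])
    return seq


def sequence_to_graph(seq):
    """Rebuild a symmetric adjacency matrix (nodes relabeled by position)."""
    n = len(seq)
    adj = [[0] * n for _ in range(n)]
    for i, s in enumerate(seq):
        for j, bit in enumerate(s):
            adj[i][j] = adj[j][i] = bit
    return adj


# Star graph: node 1 is connected to nodes 0, 2 and 3.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 1],
       [0, 1, 0, 0],
       [0, 1, 0, 0]]
seq = graph_to_sequence(adj, [1, 0, 2, 3])
relabeled = sequence_to_graph(seq)
```

Different orderings of the same graph give different sequences, but each sequence decodes to a graph isomorphic to the original.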
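Given the above definitions, we can write $p(G)$ as the marginal distribution of the joint distribution $p(G, S^\pi)$:
$$p(G) = \sum_{S^\pi} p(S^\pi)\, \mathbb{1}[f_G(S^\pi) = G], \qquad (3)$$
where $p(S^\pi)$ is the distribution that we want to learn using a generative model. Due to the sequential nature of $S^\pi$, we further decompose $p(S^\pi)$ as the product of conditional distributions over the elements:
$$p(S^\pi) = \prod_{i=1}^{n+1} p(S^\pi_i \mid S^\pi_1, \ldots, S^\pi_{i-1}), \qquad (4)$$
where we set $S^\pi_{n+1}$ as the end-of-sequence token EOS, to represent sequences with variable lengths. We simplify $p(S^\pi_i \mid S^\pi_1, \ldots, S^\pi_{i-1})$ as $p(S^\pi_i \mid S^\pi_{<i})$ in further discussions.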
2.3.2 The GraphRNN framework
So far we have transformed the modeling of $p(G)$ to modeling $p(S^\pi)$, which we further decomposed into the product of conditional probabilities $p(S^\pi_i \mid S^\pi_{<i})$. Note that $p(S^\pi_i \mid S^\pi_{<i})$ is highly complex as it has to capture how node $\pi(v_i)$ links to previous nodes based on how previous nodes are interconnected among each other. Here we propose to parameterize $p(S^\pi_i \mid S^\pi_{<i})$ using expressive neural networks to model the complex distribution. To achieve scalable modeling, we let the neural networks share weights across all time steps $i$. In particular, we use an RNN that consists of a state-transition function and an output function:
$$h_i = f_{trans}(h_{i-1}, S^\pi_{i-1}), \qquad (5)$$
$$\theta_i = f_{out}(h_i), \qquad (6)$$
where $h_i \in \mathbb{R}^d$ is a vector that encodes the state of the graph generated so far, $S^\pi_{i-1}$ is the adjacency vector for the most recently generated node $i-1$, and $\theta_i$ specifies the distribution of the next node's adjacency vector (i.e., $S^\pi_i \sim \mathcal{P}_{\theta_i}$). In general, $f_{trans}$ and $f_{out}$ can be arbitrary neural networks, and $\mathcal{P}_{\theta_i}$ can be an arbitrary distribution over binary vectors. This general framework is summarized in Algorithm 1.
Note that the proposed problem formulation is fully general; we discuss and present some specific variants with implementation details in the next section. Note also that RNNs require fixed-size input vectors, while we previously defined $S^\pi_i$ as having varying dimensions depending on $i$; we describe an efficient and flexible scheme to address this issue in Section 2.3.4.
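The generic loop of a state-transition function followed by an output function can be sketched as below. This is not the authors' Algorithm 1: the toy `toy_trans` and `toy_out` stand in for neural networks, `f_out` takes an extra argument giving the number of previous nodes, and an all-zero adjacency vector is treated as the end-of-sequence signal. All of these are assumptions made purely for illustration.

```python
import random


def generate_sequence(f_trans, f_out, h0, max_nodes, rng):
    """Autoregressively sample adjacency vectors until EOS or max_nodes."""
    seq = [[]]          # the first node has no predecessors
    h = h0
    while len(seq) < max_nodes:
        h = f_trans(h, seq[-1])           # update graph state, cf. Eq. (5)
        theta = f_out(h, len(seq))        # edge probabilities, cf. Eq. (6)
        s = [1 if rng.random() < p else 0 for p in theta]
        if not any(s):                    # assumed EOS convention: no edges
            break
        seq.append(s)
    return seq


def toy_trans(h, s_prev):
    # Placeholder state: running count of edges generated so far.
    return h + sum(s_prev)


def toy_out(h, n_prev):
    # Placeholder output: connect to every previous node with probability 0.5.
    return [0.5] * n_prev


seq = generate_sequence(toy_trans, toy_out, 0, 6, random.Random(0))
full = generate_sequence(toy_trans, lambda h, k: [1.0] * k, 0, 5, random.Random(1))
```

With an output function that always returns probability 1, the loop produces a complete graph on `max_nodes` nodes, which is a quick sanity check of the indexing.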
2.3.3 GraphRNN variants
Different variants of the GraphRNN model correspond to different assumptions about $p(S^\pi_i \mid S^\pi_{<i})$. Recall that each dimension of $S^\pi_i$ is a binary value that models existence of an edge between the new node $\pi(v_i)$ and a previous node $\pi(v_j)$, $j \in \{1, \ldots, i-1\}$. We propose two variants of GraphRNN, both of which implement the transition function $f_{trans}$ (i.e., the graph-level RNN) as a Gated Recurrent Unit (GRU) (Chung et al., 2014) but differ in the implementation of $f_{out}$ (i.e., the edge-level model). Both variants are trained using stochastic gradient descent with a maximum likelihood loss over $S^\pi$, i.e., we optimize the parameters of the neural networks to maximize $\prod p_{model}(S^\pi)$ over all observed graph sequences.
Multivariate Bernoulli. First we present a simple baseline variant of our GraphRNN approach, which we term GraphRNN-S ("S" for "simplified"). In this variant, we model $p(S^\pi_i \mid S^\pi_{<i})$ as a multivariate Bernoulli distribution, parameterized by the $\theta_i \in \mathbb{R}^{i-1}$ vector that is output by $f_{out}$. In particular, we implement $f_{out}$ as a single-layer multi-layer perceptron (MLP) with sigmoid activation function, that shares weights across all time steps. The output of $f_{out}$ is a vector $\theta_i$, whose element $\theta_i[j]$ can be interpreted as a probability of edge $(i, j)$. We then sample edges in $S^\pi_i$ independently according to a multivariate Bernoulli distribution parametrized by $\theta_i$.
Dependent Bernoulli sequence. To fully capture complex edge dependencies, in the full GraphRNN model we further decompose $p(S^\pi_i \mid S^\pi_{<i})$ into a product of conditionals,
$$p(S^\pi_i \mid S^\pi_{<i}) = \prod_{j=1}^{i-1} p(S^\pi_{i,j} \mid S^\pi_{i,<j}, S^\pi_{<i}), \qquad (7)$$
where $S^\pi_{i,j}$ denotes a binary scalar that is $1$ if node $\pi(v_i)$ is connected to node $\pi(v_j)$ (under ordering $\pi$). In this variant, each distribution in the product is approximated by another RNN. Conceptually, we have a hierarchical RNN, where the first (i.e., the graph-level) RNN generates the nodes and maintains the state of the graph, while the second (i.e., the edge-level) RNN generates the edges of a given node (as illustrated in Figure 1). In our implementation, the edge-level RNN is a GRU model, where the hidden state is initialized via the graph-level hidden state $h_i$ and where the output at each step is mapped by an MLP to a scalar indicating the probability of having an edge. $S^\pi_{i,j}$ is sampled from the distribution specified by the $j$th output of the $i$th edge-level RNN, and is fed into the $(j+1)$th input of the same RNN. All edge-level RNNs share the same parameters.
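The difference between the two variants can be sketched in a few lines. The first function scores an adjacency vector under independent Bernoulli edges (the multivariate Bernoulli variant); the second samples edges one at a time so that each probability may depend on the edges already drawn (the dependent Bernoulli sequence). The callable `edge_prob` is a hypothetical stand-in for the edge-level RNN, and the small `eps` guard is an implementation assumption.

```python
import math
import random


def bernoulli_nll(s, theta, eps=1e-12):
    """Negative log-likelihood of binary vector s under independent
    Bernoulli probabilities theta."""
    return -sum(
        math.log(p + eps) if bit else math.log(1.0 - p + eps)
        for bit, p in zip(s, theta)
    )


def sample_dependent(edge_prob, n_prev, rng):
    """Sample n_prev edges sequentially; edge_prob(prefix) returns the
    probability that the next edge exists given the edges drawn so far."""
    s = []
    for _ in range(n_prev):
        s.append(1 if rng.random() < edge_prob(s) else 0)
    return s


# Toy dependency: once one edge has been created, create no more.
def at_most_one(prefix):
    return 0.0 if any(prefix) else 0.9


nll = bernoulli_nll([1, 0], [0.5, 0.5])
samples = [sample_dependent(at_most_one, 5, random.Random(k)) for k in range(20)]
```

A rule such as `at_most_one` cannot be expressed by independent edge probabilities, which is the kind of dependency the sequential decomposition is meant to capture.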
2.3.4 Tractability via breadthfirst search
A crucial insight in our approach is that rather than learning to generate graphs under any possible node permutation, we learn to generate graphs using breadth-first-search (BFS) node orderings, without loss of generality. Formally, we modify Equation (1) to
$$S^\pi = f_S(G, \mathrm{BFS}(G, \pi)), \qquad (8)$$
where $\mathrm{BFS}(\cdot)$ denotes the deterministic BFS function. In particular, this BFS function takes a random permutation $\pi$ as input, picks $\pi(v_1)$ as the starting node and appends the neighbors of a node into the BFS queue in the order defined by $\pi$. Note that the BFS function is many-to-one, i.e., multiple permutations can map to the same ordering after applying the BFS function.
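A minimal sketch of such a permutation-driven BFS is given below, assuming a connected graph stored as a dictionary from node to neighbor set. The name `bfs_order` and this data layout are illustrative choices, not taken from the paper's code.

```python
from collections import deque


def bfs_order(adj_list, pi):
    """BFS ordering that starts at pi[0] and visits the neighbors of each
    node in the relative order given by the permutation pi."""
    rank = {v: r for r, v in enumerate(pi)}
    start = pi[0]
    order, seen, queue = [], {start}, deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in sorted(adj_list[u], key=rank.__getitem__):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order


# Path graph 0 - 1 - 2 - 3: two different permutations, same BFS ordering.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
a = bfs_order(path, [0, 1, 2, 3])
b = bfs_order(path, [0, 3, 2, 1])

# Star graph centred at 0: the leaf order follows the permutation.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
c = bfs_order(star, [0, 3, 1, 2])
```

The path example shows the many-to-one behaviour, while the star example shows a case where every permutation of the leaves yields a distinct ordering.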
Using BFS to specify the node ordering during generation has two essential benefits. The first is that we only need to train on all possible BFS orderings, rather than all possible node permutations, i.e., multiple node permutations map to the same BFS ordering, providing a reduction in the overall number of sequences we need to consider. (In the worst case, e.g., star graphs, the number of BFS orderings is $n!$, but we observe substantial reductions on many real-world graphs.) The second is that the BFS ordering makes learning easier by reducing the number of edge predictions we need to make in the edge-level RNN; in particular, when we are adding a new node under a BFS ordering, the only possible edges for this new node are those connecting to nodes that are in the "frontier" of the BFS (i.e., nodes that are still in the BFS queue), a notion formalized by Proposition 1 (proof in the Appendix):
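Proposition 1.
Suppose $v_1, \ldots, v_n$ is a BFS ordering of $n$ nodes in graph $G$, and $(v_i, v_{j-1}) \in E$ but $(v_i, v_j) \notin E$ for some $i < j \le n$, then $(v_{i'}, v_{j'}) \notin E$, $\forall\, 1 \le i' \le i$ and $j \le j' < n$.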
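Importantly, this insight allows us to redefine the variable-size $S^\pi_i$ vector as a fixed $M$-dimensional vector, representing the connectivity between node $\pi(v_i)$ and nodes in the current BFS queue with maximum size $M$:
$$S^\pi_i = (A^\pi_{\max(1, i-M),\, i}, \ldots, A^\pi_{i-1,\, i})^T, \quad i \in \{2, \ldots, n\}. \qquad (9)$$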
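The truncation to a window of the most recent nodes can be sketched as follows. The helper names are hypothetical, the left zero-padding to a fixed length is an assumed convention, and `required_window` simply measures, for a given ordering, how far back the earliest connected predecessor of any node lies.

```python
def truncated_sequence(adj, order, M, pad=True):
    """Adjacency vectors restricted to the M most recent predecessors."""
    seq = []
    for p in range(1, len(order)):          # p is a 0-based position
        lo = max(0, p - M)
        s = [adj[order[j]][order[p]] for j in range(lo, p)]
        if pad:
            s = [0] * (M - len(s)) + s      # assumed: left-pad with zeros
        seq.append(s)
    return seq


def required_window(adj, order):
    """Smallest M for which no edge is lost under this ordering."""
    best = 0
    for p in range(1, len(order)):
        for j in range(p):
            if adj[order[j]][order[p]]:
                best = max(best, p - j)     # earliest connected predecessor
                break
    return best


# Path graph 0 - 1 - 2 - 3.
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
m_end = required_window(path, [0, 1, 2, 3])     # BFS from an end node
m_mid = required_window(path, [1, 0, 2, 3])     # BFS from an interior node
vectors = truncated_sequence(path, [1, 0, 2, 3], m_mid)
```

Starting the BFS from an end of the path needs a window of one, while starting from an interior node needs a window of two, which illustrates how the required size depends on the width of the BFS frontier.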
As a consequence of Proposition 1, we can bound $M$ as follows:
Corollary 1.
With a BFS ordering, the maximum number of entries that the GraphRNN model needs to predict for $S^\pi_i$, $\forall\, 1 \le i \le n$, is $O\left(\max_{d=1}^{\mathrm{diam}(G)} \left|\{v_i \mid \mathrm{dist}(v_i, v_1) = d\}\right|\right)$, where $\mathrm{dist}$ denotes the shortest-path distance between vertices.
The overall time complexity of GraphRNN is thus $O(Mn)$. In practice, we estimate an empirical upper bound for $M$ (see the Appendix for details).
3 GraphRNN Model Capacity
In this section we analyze the representational capacity of GraphRNN, illustrating how it is able to capture complex edge dependencies. In particular, we discuss two very different cases on how GraphRNN can learn to generate graphs with a global community structure as well as graphs with a very regular geometric structure. For simplicity, we assume that $h_i$ (the hidden state of the graph-level RNN) can exactly encode $S^\pi_{<i}$, and that the edge-level RNN can encode $S^\pi_{i,<j}$. That is, we assume that our RNNs can maintain memory of the decisions they make, and we elucidate the model's capacity in this ideal case. We similarly rely on the universal approximation theorem of neural networks (Hornik, 1991).
Graphs with community structure. GraphRNN can model structures that are specified by a given probabilistic model. This is because the posterior of a new edge probability can be expressed as a function of the outcomes of previous nodes. For instance, suppose that the training set contains graphs generated from the following distribution : half of the nodes are in community , and half of the nodes are in community (in expectation), and nodes are connected with probability within each community and probability between communities. Given such a model, we have the following key (inductive) observation:
Observation 1.
Assume there exists a parameter setting for GraphRNN such that it can generate and according to the distribution over implied by , then there also exists a parameter setting for GraphRNN such that it can output according to .
This observation follows from three facts: First, we know that can be expressed as a function of , , and (which holds by ’s definition). Second, by our earlier assumptions on the RNN memory, can be encoded into the initial state of the edgelevel RNN, and the edgelevel RNN can also encode the outcomes of . Third, we know that is computable from and (by Bayes’ rule and ’s definition, with an analogous result for ). Finally, GraphRNN can handle the base case of the induction in Observation 1, i.e., , simply by sampling according to at the first step of the edgelevel RNN (i.e., 0.5 probability is in same community as node ).
Graphs with regular structure. GraphRNN can also naturally learn to generate regular structures, due to its ability to learn functions that only activate for where has specific degree. For example, suppose that the training set consists of ladder graphs (Noy & Ribó, 2004). To generate a ladder graph, the edgelevel RNN must handle three key cases: if , then the new node should only connect to the degree node or else any degree node; if , then the new node should only connect to the degree node that is exactly two hops away; and finally, if then the new node should make no further connections. And note that all of the statistics needed above are computable from and . The appendix contains visual illustrations and further discussions on this example.
4 Experiments
We compare GraphRNN to state-of-the-art baselines, demonstrating its robustness and ability to generate high-quality graphs in diverse settings.
4.1 Datasets
We perform experiments on both synthetic and real datasets, with drastically varying sizes and characteristics. The sizes of graphs vary from to .
Community. 500 two-community graphs with . Each community is generated by the Erdős-Rényi model (ER) (Erdős & Rényi, 1959) with nodes and . We then add inter-community edges with uniform probability.
Grid. 100 standard 2D grid graphs with . We also run our models on 100 standard 2D grid graphs with , and achieve comparable results.
BA. 500 graphs with that are generated using the Barabási-Albert model. During generation, each node is connected to 4 existing nodes.
Protein. 918 protein graphs (Dobson & Doig, 2003) with . Each protein is represented by a graph, where nodes are amino acids and two nodes are connected if they are less than Angstroms apart.
Ego. 757 3-hop ego networks extracted from the Citeseer network (Sen et al., 2008) with . Nodes represent documents and edges represent citation relationships.
4.2 Experimental Setup
We compare the performance of our model against various traditional generative models for graphs, as well as some recent deep graph generative models.
Traditional baselines. Following Li et al. (2018), we compare against the Erdős-Rényi model (ER) (Erdős & Rényi, 1959) and the Barabási-Albert (BA) model (Albert & Barabási, 2002). In addition, we compare against popular generative models that include learnable parameters: Kronecker graph models (Leskovec et al., 2010) and mixed-membership stochastic block models (MMSB) (Airoldi et al., 2008).
Deep learning baselines. We compare against the recent methods of Simonovsky & Komodakis (2018) (GraphVAE) and Li et al. (2018) (DeepGMG). We provide reference implementations for these methods (which do not currently have associated public code), and we adapt GraphVAE to our problem setting by using one-hot indicator vectors as node features for the graph convolutional network encoder. (We also attempted using degree and clustering coefficients as features for nodes, but did not achieve better performance.)
Experiment settings. We use part of the graphs in each dataset for training and test on the rest. We set the hyperparameters for baseline methods based on recommendations made in their respective papers. The hyperparameter settings for GraphRNN were fixed after development tests on data that was not used in follow-up evaluations (further details in the Appendix). Note that all the traditional methods are only designed to learn from a single graph; therefore we train a separate model for each training graph in order to compare with these methods. In addition, both deep learning baselines suffer from the aforementioned scalability issues, so we only compare to these baselines on a small version of the community dataset (Community-small) and 200 ego graphs (Ego-small).
Community (160,1945)  |  Ego (399,1071)  |  Grid (361,684)  |  Protein (500,1575)  |
Deg.  Clus.  Orbit  Deg.  Clus.  Orbit  Deg.  Clus.  Orbit  Deg.  Clus.  Orbit  
ER  0.021  1.243  0.049  0.508  1.288  0.232  1.011  0.018  0.900  0.145  1.779  1.135 
BA  0.268  0.322  0.047  0.275  0.973  0.095  1.860  0  0.720  1.401  1.706  0.920 
Kronecker  0.259  1.685  0.069  0.108  0.975  0.052  1.074  0.008  0.080  0.084  0.441  0.288 
MMSB  0.166  1.59  0.054  0.304  0.245  0.048  1.881  0.131  1.239  0.236  0.495  0.775 
GraphRNNS  0.055  0.016  0.041  0.090  0.006  0.043  0.029  0.011  0.057  0.102  0.037  
GraphRNN  0.014  0.002  0.039  0.077  0.316  0.030  0  0.034  0.935  0.217 
Community-small (20,83)  |  Ego-small (18,69)  |

Degree  Clustering  Orbit  Train NLL  Test NLL  Degree  Clustering  Orbit  Train NLL  Test NLL  
GraphVAE  0.35  0.98  0.54  13.55  25.48  0.13  0.17  0.05  12.45  14.28 
DeepGMG  0.22  0.95  0.40  106.09  112.19  0.04  0.10  0.02  21.17  22.40 
GraphRNNS  0.02  0.15  0.01  31.24  35.94  0.002  0.05  0.0009  8.51  9.88 
GraphRNN  0.03  0.03  0.01  28.95  35.10  0.0003  0.05  0.0009  9.05  10.61 
4.3 Evaluating the Generated Graphs
Evaluating the sample quality of generative models is a challenging task in general (Theis et al., 2016), and in our case, this evaluation requires a comparison between two sets of graphs (the generated graphs and the test sets). Whereas previous works relied on qualitative visual inspection (Simonovsky & Komodakis, 2018) or simple comparisons of average statistics between the two sets (Leskovec et al., 2010), we propose novel evaluation metrics that compare all moments of their empirical distributions.
Our proposed metrics are based on Maximum Mean Discrepancy (MMD) measures. Suppose that a unit ball in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ is used as its function class $\mathcal{F}$, and $k$ is the associated kernel. The squared MMD between two sets of samples from distributions $p$ and $q$ can then be derived as (Gretton et al., 2012)
$$\mathrm{MMD}^2(p \,\|\, q) = \mathbb{E}_{x,y \sim p}[k(x,y)] + \mathbb{E}_{x,y \sim q}[k(x,y)] - 2\,\mathbb{E}_{x \sim p,\, y \sim q}[k(x,y)]. \qquad (10)$$
Proper distance metrics over graphs are in general computationally intractable (Lin, 1994). Thus, we compute MMD using a set of graph statistics $\mathbb{M} = \{M_1, \ldots, M_k\}$, where each $M_i(G)$ is a univariate distribution over $\mathbb{R}$, such as the degree distribution or clustering coefficient distribution. We then use the first Wasserstein distance as an efficient distance metric between two distributions $p$ and $q$:
$$W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x,y) \sim \gamma}\left[\|x - y\|\right], \qquad (11)$$
where $\Pi(p, q)$ is the set of all distributions whose marginals are $p$ and $q$ respectively, and $\gamma$ is a valid transport plan. To capture high-order moments, we use the following kernel, whose Taylor expansion is a linear combination of all moments (proof in the Appendix):
Proposition 2.
The kernel function defined by induces a unique RKHS.
In experiments, we show this derived MMD score for degree and clustering coefficient distributions, as well as average orbit counts statistics, i.e., the number of occurrences of all orbits with 4 nodes (to capture higher-level motifs) (Hočevar & Demšar, 2014). We use the RBF kernel to compute distances between count vectors.
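As a rough illustration of the evaluation pipeline, the sketch below computes a squared MMD between two sets of histograms (for example, per-graph degree histograms). The kernel used here, a Gaussian of the one-dimensional Wasserstein distance, is an assumption chosen for the example rather than the exact kernel of Proposition 2, and the estimator is the simple biased one that averages over all pairs.

```python
import math


def emd_1d(p, q):
    """First Wasserstein distance between two histograms on unit-width
    bins, computed as the summed absolute difference of their CDFs."""
    n = max(len(p), len(q))
    p = list(p) + [0.0] * (n - len(p))
    q = list(q) + [0.0] * (n - len(q))
    sp, sq = float(sum(p)), float(sum(q))
    cp = cq = total = 0.0
    for a, b in zip(p, q):
        cp += a / sp
        cq += b / sq
        total += abs(cp - cq)
    return total


def gaussian_emd_kernel(p, q, sigma=1.0):
    # Assumed kernel form for this sketch only.
    d = emd_1d(p, q)
    return math.exp(-d * d / (2.0 * sigma * sigma))


def mmd_squared(X, Y, kernel):
    """Biased estimate: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_k(A, B):
        return sum(kernel(a, b) for a in A for b in B) / (len(A) * len(B))
    return mean_k(X, X) + mean_k(Y, Y) - 2.0 * mean_k(X, Y)


X = [[1, 0]]          # all mass in bin 0
Y = [[0, 1]]          # all mass in bin 1
score = mmd_squared(X, Y, gaussian_emd_kernel)
```

Identical sample sets give a score of zero, and the score grows as the two sets of histograms move apart.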
4.4 Generating High Quality Graphs
Our experiments demonstrate that GraphRNN can generate graphs that match the characteristics of the ground truth graphs in a variety of metrics.
Graph visualization. Figure 2 visualizes the graphs generated by GraphRNN and various baselines, showing that GraphRNN can capture the structure of datasets with vastly differing characteristics—being able to effectively learn regular structures like grids as well as more natural structures like ego networks. Specifically, we found that grids generated by GraphRNN do not appear in the training set, i.e., it learns to generalize to unseen grid widths/heights.
Evaluation with graph statistics. We use three graph statistics—based on degrees, clustering coefficients and orbit counts—to further quantitatively evaluate the generated graphs. Figure 3 shows the average graph statistics in the test vs. generated graphs, which demonstrates that even from hundreds of graphs with diverse sizes, GraphRNN can still learn to capture the underlying graph statistics very well, with the generated average statistics closely matching the overall test set distribution.
Tables 1 and 2 summarize MMD evaluations on the full datasets and small versions, respectively. Note that we train all the models with a fixed number of steps, and report the test set performance at the step with the lowest training error. (Using the training set or a validation set to evaluate MMD gave analogous results, so we used the train set for early stopping.) GraphRNN variants achieve the best performance on all datasets, with a decrease of MMD on average compared with traditional baselines, and a decrease of MMD compared with deep learning baselines. Interestingly, on the protein dataset, our simpler GraphRNN-S model performs very well, which is likely due to the fact that the protein dataset is a nearest neighbor graph over Euclidean space and thus does not involve highly complex edge dependencies. Note that even though some baseline models perform well on specific datasets (e.g., MMSB on the community dataset), they fail to generalize across other types of input graphs.
Generalization ability. Table 2 also shows negative log-likelihoods (NLLs) on the training and test sets. We report the average $p(S^\pi)$ in our model, and report the likelihood in baseline methods as defined in their papers. A model with good generalization ability should have a small NLL gap between training and test graphs. We found that our model can generalize well, with a smaller average NLL gap. (The average likelihood is ill-defined for the traditional models.)
4.5 Robustness
Finally, we also investigate the robustness of our model by interpolating between Barabási-Albert (BA) and Erdős-Rényi (ER) graphs. We randomly perturb a varying fraction of the edges of BA graphs. With all edges perturbed, the graphs are ER graphs; with no edges perturbed, the graphs are BA graphs. Figure 4 shows the MMD scores for degree and clustering coefficient distributions for the sets of graphs. Both BA and ER perform well when graphs are generated from their respective distributions, but their performance degrades significantly once noise is introduced. In contrast, GraphRNN maintains strong performance as we interpolate between these structures, indicating high robustness and versatility.
5 Further Related Work
In addition to the deep graph generative approaches and traditional graph generation approaches surveyed previously, our framework also builds off a variety of other methods.
Molecule and parse-tree generation. There has been related domain-specific work on generating candidate molecules and parse trees in natural language processing. Most previous work on discovering molecule structures makes use of expert-crafted sequence representations of molecular graph structures (SMILES) (Olivecrona et al., 2017; Segler et al., 2017; Gómez-Bombarelli et al., 2016). Most recently, SD-VAE (Dai et al., 2018) introduced a grammar-based approach to generate structured data, including molecules and parse trees. In contrast to these works, we consider the fully general graph generation setting without assuming features or special structures of graphs.
Deep autoregressive models. Deep autoregressive models decompose joint probability distributions as a product of conditionals, a general idea that has achieved striking successes in the image (Oord et al., 2016b) and audio (Oord et al., 2016a) domains. Our approach extends these successes to the domain of generating graphs. Note that the DeepGMG algorithm (Li et al., 2018) and the related prior work of Johnson (2017) can also be viewed as deep autoregressive models of graphs. However, unlike these methods, we focus on providing a scalable (i.e., $O(n^2)$) algorithm that can generate general graphs.
6 Conclusion and Future Work
We proposed GraphRNN, an autoregressive generative model for graph-structured data, along with a comprehensive evaluation suite for the graph generation problem, which we used to show that GraphRNN achieves significantly better performance compared to previous state-of-the-art models, while being scalable and robust to noise. However, significant challenges remain in this space, such as scaling to even larger graphs and developing models that are capable of doing efficient conditional graph generation.
Acknowledgements
The authors thank Ethan Steinberg, Bowen Liu, Marinka Zitnik and Srijan Kumar for their helpful discussions and comments on the paper. This research has been supported in part by DARPA SIMPLEX, ARO MURI, Stanford Data Science Initiative, Huawei, JD, and Chan Zuckerberg Biohub. W.L.H. was also supported by the SAP Stanford Graduate Fellowship and an NSERC PGSD grant.
References
 Airoldi et al. (2008) Airoldi, E., Blei, D., Fienberg, S., and Xing, E. Mixed membership stochastic blockmodels. JMLR, 2008.
 Albert & Barabási (2002) Albert, R. and Barabási, A.-L. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.
 Cho et al. (2014) Cho, M., Sun, J., Duchenne, O., and Ponce, J. Finding matches in a haystack: A max-pooling strategy for graph matching in the presence of outliers. In CVPR, 2014.
 Chung et al. (2014) Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS Workshop on Deep Learning, 2014.
 Dai et al. (2018) Dai, H., Tian, Y., Dai, B., Skiena, S., and Song, L. Syntax-directed variational autoencoder for structured data. ICLR, 2018.
 Dobson & Doig (2003) Dobson, P. and Doig, A. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.
 Erdős & Rényi (1959) Erdős, P. and Rényi, A. On random graphs I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
 Gómez-Bombarelli et al. (2016) Gómez-Bombarelli, R., Wei, J., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 2016.
 Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS, 2014.
 Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. JMLR, 2012.
 Grover et al. (2017) Grover, A., Zweig, A., and Ermon, S. Graphite: Iterative generative modeling of graphs. In NIPS Bayesian Deep Learning Workshop, 2017.
 Hamilton et al. (2017) Hamilton, W. L., Ying, R., and Leskovec, J. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017.
 Hočevar & Demšar (2014) Hočevar, T. and Demšar, J. A combinatorial approach to graphlet counting. Bioinformatics, 30(4):559–565, 2014.
 Hornik (1991) Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
 Johnson (2017) Johnson, D. D. Learning graphical state transitions. In ICLR, 2017.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.
 Kipf & Welling (2016) Kipf, T. N. and Welling, M. Variational graph auto-encoders. In NIPS Bayesian Deep Learning Workshop, 2016.
 Leskovec et al. (2007) Leskovec, J., Kleinberg, J., and Faloutsos, C. Graph evolution: Densification and shrinking diameters. TKDD, 1(1):2, 2007.
 Leskovec et al. (2010) Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. Kronecker graphs: An approach to modeling networks. JMLR, 2010.
 Li et al. (2018) Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. Learning deep generative models of graphs, 2018. URL https://openreview.net/forum?id=Hy1debAb.
 Lin (1994) Lin, C. Hardness of approximating graph transformation problem. In International Symposium on Algorithms and Computation, 1994.
 Newman (2010) Newman, M. Networks: An Introduction. Oxford University Press, 2010.
 Noy & Ribó (2004) Noy, M. and Ribó, A. Recursively constructible families of graphs. Advances in Applied Mathematics, 32(1–2):350–363, 2004.
 Olivecrona et al. (2017) Olivecrona, M., Blaschke, T., Engkvist, O., and Chen, H. Molecular de novo design through deep reinforcement learning. Journal of Cheminformatics, 9(1):48, 2017.
 Oord et al. (2016a) Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.
 Oord et al. (2016b) Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In ICML, 2016b.
 Robins et al. (2007) Robins, G., Pattison, P., Kalish, Y., and Lusher, D. An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2):173–191, 2007.
 Segler et al. (2017) Segler, M., Kogej, T., Tyrchan, C., and Waller, M. Generating focussed molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 2017.
 Sen et al. (2008) Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93, 2008.
 Simonovsky & Komodakis (2018) Simonovsky, M. and Komodakis, N. GraphVAE: Towards generation of small graphs using variational autoencoders, 2018. URL https://openreview.net/forum?id=SJlhPMWAW.
 Theis et al. (2016) Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models. In ICLR, 2016.
Appendix A Appendix
A.1 Implementation Details of GraphRNN
In this section we detail parameter setting, data preparation and training strategies for GraphRNN.
We use two sets of model parameters for GraphRNN. One larger model is used to train and test on the larger datasets that are used to compare with traditional methods. One smaller model is used to train and test on datasets with nodes up to . This model is only used to compare with the two most recent preliminary deep generative models for graphs proposed in (Li et al., 2018; Simonovsky & Komodakis, 2018).
For GraphRNN, the graph-level RNN uses layers of GRU cells, with dimensional hidden state for the larger model, and dimensional hidden state for the smaller model in each layer. The edge-level RNN uses layers of GRU cells, with dimensional hidden state for both the larger model and the smaller model. To output the adjacency vector prediction, the edge-level RNN first maps the highest layer of the dimensional hidden state to a dimensional vector through an MLP with ReLU activation, then another MLP maps the vector to a scalar with sigmoid activation. The edge-level RNN is initialized by the output of the graph-level RNN at the start of generating , . Specifically, the highest-layer hidden state of the graph-level RNN is used to initialize the lowest layer of the edge-level RNN, with a linear layer to match the dimensionality. During training, teacher forcing is used for both the graph-level and edge-level RNNs, i.e., we use the ground truth rather than the model's own prediction. At inference time, the model uses its own predictions at each time step to generate a graph.
For the simple version GraphRNN-S, a two-layer MLP with ReLU and sigmoid activations respectively is used to generate , with dimensional hidden state for the larger model, and dimensional hidden state for the smaller model. In practice, we find that the performance of the model is relatively stable with respect to these hyperparameters.
We generate the graph sequences used for training the model following the procedure in Section 2.3.4. Specifically, we first randomly sample a graph from the training set, then randomly permute the node ordering of the graph. We then run the deterministic BFS discussed in Section 2.3.4 over the graph with random node ordering, resulting in a graph with BFS node ordering. An exception is in the robustness section, where we use the node ordering that generates BA graphs to get graph sequences, in order to see if GraphRNN can capture the underlying preferential attachment properties of BA graphs.
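The preprocessing step described above (random permutation followed by a deterministic BFS) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `bfs_node_ordering`, the adjacency-dictionary input, and the tie-breaking rule (neighbours visited in order of their permuted index) are assumptions, and the graph is assumed to be connected.

```python
import random
from collections import deque


def bfs_node_ordering(adj, rng):
    """Return a BFS node ordering of a connected undirected graph.

    adj: dict mapping each node to a list of its neighbours (assumed format).
    rng: a random.Random instance, so that runs are reproducible.
    """
    nodes = list(adj)
    rng.shuffle(nodes)                      # random permutation of the nodes
    rank = {v: i for i, v in enumerate(nodes)}
    start = nodes[0]                        # first node under the permutation
    order, seen, queue = [], {start}, deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        # Assumed tie-breaking: visit neighbours by their permuted rank,
        # which makes the BFS deterministic once the permutation is fixed.
        for v in sorted(adj[u], key=rank.__getitem__):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order
```

Calling `bfs_node_ordering(adj, random.Random(seed))` with different seeds yields different BFS orderings of the same graph, which is what makes each sampled training sequence a random BFS ordering.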
With the proposed BFS node ordering, we can reduce the maximum dimension of , as illustrated in Figure 5. To set the maximum dimension of , we use the following empirical procedure. We randomly ran times the above data preprocessing procedure to get graphs with BFS node orderings. We remove all consecutive zeros in all resulting , to find the empirical distribution of the dimensionality of . We set to be roughly the percentile, to account for the majority dimensionality of . In general, we find that graphs with regular structures tend to have smaller , while random graphs or community graphs tend to have larger . Specifically, for the community dataset, we set ; for the grid dataset, we set ; for the BA dataset, we set ; for the protein dataset, we set ; for the ego dataset, we set ; for all small graph datasets, we set .
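One way to read the empirical procedure above is sketched below. It is an illustration under stated assumptions rather than the authors' code: the helper names `lookback_lengths` and `empirical_m` are hypothetical, the graph is given as an adjacency dictionary together with a precomputed BFS ordering, the percentile is a free parameter `q` because its value is not given here, and the nearest-rank percentile definition is an arbitrary choice.

```python
import math


def lookback_lengths(adj, order):
    """For each node after the first, return how far back its earliest
    already-generated neighbour lies in the ordering. This is the length of
    the node's adjacency vector once the run of zeros before that
    neighbour has been removed."""
    pos = {v: i for i, v in enumerate(order)}
    lengths = []
    for i, v in enumerate(order[1:], start=1):
        earlier = [pos[w] for w in adj[v] if pos[w] < i]
        if earlier:                       # skip nodes with no earlier neighbour
            lengths.append(i - min(earlier))
    return lengths


def empirical_m(lengths, q):
    """Nearest-rank q-th percentile of the pooled lengths (0 < q <= 100)."""
    ranked = sorted(lengths)
    k = max(1, math.ceil(q / 100 * len(ranked)))
    return ranked[k - 1]
```

In use, the lengths from many sampled BFS orderings would be pooled into one list before calling `empirical_m`, so that the chosen value reflects the whole dataset rather than a single graph.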
The Adam optimizer is used for minibatch training. Each minibatch contains graph sequences. We train the model for batches in all experiments. We set the learning rate to be , which is decayed by at step and in all experiments.
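The two-milestone decay described above can be written as a small step schedule. This is a sketch only: the function name `step_decay_lr` is hypothetical, the initial rate, decay factor and milestone steps are left as parameters because their values are not given here, and it is assumed that "decayed by" means multiplication by a factor applied once a milestone step is reached.

```python
def step_decay_lr(step, base_lr, factor, milestones):
    """Learning rate at a given training step under multi-step decay.

    The rate is multiplied by `factor` once for every milestone that the
    current step has reached (assumed convention: step >= milestone).
    """
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * factor ** passed
```

For example, with the placeholder values `base_lr=1.0`, `factor=0.5` and `milestones=(10, 20)`, the rate is 1.0 before step 10, 0.5 from step 10 to step 19, and 0.25 from step 20 onward.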
A.2 Running Time of GraphRNN
Training is performed on only Titan X GPU. For the protein dataset that consists of about graphs, each containing about nodes, training converges at around iterations. The runtime is around to hours. This also includes preprocessing, batching and BFS, which are currently implemented on the CPU without multithreading. The less expressive GraphRNN-S variant is about twice as fast. At inference time, for the above dataset, generating a graph using the trained model only takes about second.
A.3 More Details on GraphRNN’s Expressiveness
We illustrate the intuition underlying the good performance of GraphRNN on graphs with regular structures, such as grid and ladder networks. Figure 6 (a) shows the generation process of a ladder graph at an intermediate step. At this time step, the ground truth data (under BFS node ordering) specifies that the new node added to the graph should make an edge to the node with degree . Note that node degree is a function of , and thus could be approximated by a neural network.
Once the first edge has been generated, the new node should make an edge with another node of degree . There are multiple ways to do so, but only one of them gives a valid grid structure, i.e., one that forms a cycle with the new edge. GraphRNN crucially relies on the edge-level RNN and the knowledge of the previously added edge in order to distinguish between the correct and incorrect connections in Figure 6 (c) and (d).
A.4 Code Overview
In the code repository, main.py contains the main training pipeline, which loads datasets and performs training and inference. It also contains the Args class, which stores the hyperparameter settings of the model. model.py contains the RNN, MLP and loss function modules that are used to build GraphRNN. data.py contains the minibatch sampler, which samples a random BFS ordering of a batch of randomly selected graphs. evaluate.py contains the code for evaluating the generated graphs using the MMD metric introduced in Sec. 4.3.
Baselines including the Erdős-Rényi model, Barabási-Albert model, MMSB, and the very recent deep generative models (GraphVAE, DeepGMG) are also implemented in the baselines folder. We adopt the C++ Kronecker graph model implementation in the SNAP package (available at http://snap.stanford.edu/snap/index.html).
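To give a feel for the kind of computation an MMD-based evaluation performs, the sketch below computes a simple (biased) estimate of the squared MMD between two sets of histograms. It is not the code in evaluate.py: the function names are hypothetical, and a Gaussian kernel on Euclidean distance is used here as a stand-in for whatever kernel the actual evaluation uses.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on the Euclidean distance between two vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of squared MMD between two samples:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_k(a, b):
        return sum(kernel(p, q) for p in a for q in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)
```

Here each element of `xs` and `ys` would be a statistic of one graph, such as a normalized degree histogram, so that a value near zero indicates that the generated graphs and the test graphs have similar distributions of that statistic.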
A.5 Proofs
A.5.1 Proof of Proposition 1
We use the following observation:
Observation 2.
By definition of BFS, if , then the children of in the BFS ordering come before the children of that do not connect to , .
By definition of BFS, all neighbors of a node include the parent of in the BFS tree, all children of which have consecutive indices, and some children of which connect to both and , for some . Hence if but , is the last child of in the BFS ordering. Hence , .
For all , suppose that but . By Observation 2, . By the conclusion of the previous paragraph, , . Specifically, , . This is true for all . Hence we prove that , and .
A.5.2 Proof of Proposition 2
As proven in Kolouri et al. (2016), this Wasserstein distance based kernel is a positive definite (p.d.) kernel. By the properties that linear combinations, products and limits (if they exist) of p.d. kernels are p.d. kernels, is also a p.d. kernel (this can be seen by expressing the kernel function using its Taylor expansion). By the Moore-Aronszajn theorem, a symmetric p.d. kernel induces a unique RKHS. Therefore Equation (9) holds if we set to be .
A.6 Extension to Graphs with Node and Edge Features
Our GraphRNN model can also be applied to graphs where nodes and edges have feature vectors associated with them. In this extended setting, under node ordering , a graph is associated with its node feature matrix and edge feature matrix , where and are the feature dimensions for node and edge respectively. In this case, we can extend the definition of to include feature vectors of corresponding nodes as well as edges . We can adapt the module by using an MLP to generate and an edge-level RNN to generate respectively. Note also that a directed graph can be viewed as an undirected graph with two edge types, which is a special case of the above extension.
A.7 Extension to Graphs with Four Communities
To further show the ability of GraphRNN to learn from community graphs, we conduct experiments on a four-community synthetic graph dataset. Specifically, the dataset consists of 500 four-community graphs with . Each community is generated by the Erdős-Rényi model (ER) (Erdős & Rényi, 1959) with nodes and . We then add inter-community edges with uniform probability. Figure 7 shows a visual comparison of graphs generated by GraphRNN and other baselines. We observe that, in contrast to the baselines, GraphRNN consistently generates four-community graphs, and each community has a structure similar to that in the training set.
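A synthetic dataset of the kind described above could be generated along the following lines. This is a hedged sketch, not the authors' generator: the function name `community_graph` and all parameter values are placeholders, and inter-community edges are modelled here by an independent per-pair probability `p_inter`, which is one possible reading of "uniform probability".

```python
import random
from itertools import combinations


def community_graph(num_communities, size, p_intra, p_inter, rng):
    """Sample a graph made of equally sized ER communities.

    Pairs inside a community are connected with probability p_intra,
    pairs across communities with probability p_inter (assumed scheme).
    Returns the number of nodes and a set of undirected edges (u, v), u < v.
    """
    n = num_communities * size
    block = [v // size for v in range(n)]     # community index of each node
    edges = set()
    for u, v in combinations(range(n), 2):
        p = p_intra if block[u] == block[v] else p_inter
        if rng.random() < p:
            edges.add((u, v))
    return n, edges
```

Setting `num_communities=4` and repeating the call with a seeded `random.Random` would produce a reproducible collection of four-community graphs.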