Scalable Deep Generative Modeling for Sparse Graphs

by   Hanjun Dai, et al.

Learning graph generative models is a challenging task for deep learning and has wide applicability to a range of domains like chemistry, biology and social science. However current deep neural methods suffer from limited scalability: for a graph with n nodes and m edges, existing deep neural methods require Ω(n^2) complexity by building up the adjacency matrix. On the other hand, many real world graphs are actually sparse in the sense that m≪ n^2. Based on this, we develop a novel autoregressive model, named BiGG, that utilizes this sparsity to avoid generating the full adjacency matrix, and importantly reduces the graph generation time complexity to O((n + m)log n). Furthermore, during training this autoregressive model can be parallelized with O(log n) synchronization stages, which makes it much more efficient than other autoregressive models that require Ω(n). Experiments on several benchmarks show that the proposed approach not only scales to orders of magnitude larger graphs than previously possible with deep autoregressive graph generative models, but also yields better graph generation quality.


page 1

page 2

page 3

page 4


A Systematic Survey on Deep Generative Models for Graph Generation

Graphs are important data representations for describing objects and the...

Image-Conditioned Graph Generation for Road Network Extraction

Deep generative models for graphs have shown great promise in the area o...

Learning Deep Generative Models of Graphs

Graphs are fundamental data structures which concisely capture the relat...

SPECTRE : Spectral Conditioning Helps to Overcome the Expressivity Limits of One-shot Graph Generators

We approach the graph generation problem from a spectral perspective by ...

TIGGER: Scalable Generative Modelling for Temporal Interaction Graphs

There has been a recent surge in learning generative models for graphs. ...

Contextual Graph Markov Model: A Deep and Generative Approach to Graph Processing

We introduce the Contextual Graph Markov Model, an approach combining id...

Your Autoregressive Generative Model Can be Better If You Treat It as an Energy-Based One

Autoregressive generative models are commonly used, especially for those...

1 Introduction

Representing a distribution over graphs provides a principled foundation for tackling many important problems in knowledge graph completion 

(Xiao et al., 2016), de novo drug design (Li et al., 2018; Simonovsky & Komodakis, 2018), architecture search (Xie et al., 2019) and program synthesis (Brockschmidt et al., 2018). The effectiveness of graph generative modeling usually depends on learning the distribution given a collection of relevant training graphs. However, training a generative model over graphs is usually quite difficult due to their discrete and combinatorial nature.

Classical generative models of graphs, based on random graph theory (Erdős & Rényi, 1960; Barabási & Albert, 1999; Watts & Strogatz, 1998), have been long studied but typically only capture a small set of specific graph properties, such as degree distribution. Despite their computational efficiency, these distribution models are usually too inexpressive to yield competitive results in challenging applications.

Recently, deep graph generative models that exploit the increased capacity of neural networks to learn more expressive graph distributions have been successfully applied to real-world tasks. Prominent examples include VAE-based methods

(Kipf & Welling, 2016; Simonovsky & Komodakis, 2018), GAN-based methods (Bojchevski et al., 2018), flow models (Liu et al., 2019; Shi et al., 2020) and autoregressive models (Li et al., 2018; You et al., 2018; Liao et al., 2019). Despite the success of these approaches in modeling small graphs, e.g. molecules with hundreds of nodes, they are not able to scale to graphs with over 10,000 nodes.

A key shortcoming of current deep graph generative models is that they attempt to generate a full graph adjacency matrix, implying a computational cost of for a graph with nodes and edges. For large graphs, it is impractical to sustain such a quadratic time and space complexity, which creates an inherent trade-off between expressiveness and efficiency. To balance this trade-off, most recent work has introduced various conditional independence assumptions (Liao et al., 2019), ranging from the fully auto-regressive but slow GraphRNN (You et al., 2018), to the fast but fully factorial GraphVAE (Simonovsky & Komodakis, 2018).

In this paper, we propose an alternative approach that does not commit to explicit conditional independence assumptions, but instead exploits the fact that most interesting real-world graphs are sparse, in the sense that . By leveraging sparsity, we develop a new graph generative model, BiGG (BIg Graph Generation), that streamlines the generative process and avoids explicit consideration of every entry in an adjacency matrix. The approach is based on three key elements: (1) an process for generating each edge using a binary tree data structure, inspired by R-MAT (Chakrabarti et al., 2004); (2) a tree-structured autoregressive model for generating the set of edges associated with each node; and (3) an autoregressive model defined over the sequence of nodes. By combining these elements, BiGG can generate a sparse graph in time, which is a substantial improvement over .

For training, the design of BiGG also allows every context embedding in the autoregressive model to be computed in only sequential steps, which enables significant gains in training speed through parallelization. By comparison, the context embedding in GraphRNN requires sequential steps to compute, while GRAN requires steps. In addition, we develop a training mechanism that only requires sublinear memory cost, which in principle makes it possible to train models of graphs with millions of nodes on a single GPU.

On several benchmark datasets, including synthetic graphs and real-world graphs of proteins, 3D mesh and SAT instances, BiGG is able to achieve comparable or superior sample quality than the previous state-of-the-art, while being orders of magnitude more scalable.

To summarize the main contributions of this paper:

  • [leftmargin=*,nolistsep,nosep]

  • We propose an autoregressive generative model, BiGG, that can generate sparse graphs in time, successfully modeling graphs with 100k nodes on 1 GPU.

  • The training process can be largely parallelized, requiring only steps to synchronize learning updates.

  • Memory cost is reduced to for training and for inference.

  • BiGG not only scales to orders of magnitude larger graphs than current deep models, it also yields comparable or better model quality on several benchmark datasets.

Other related work There has been a lot of work (Chakrabarti et al., 2004; Robins et al., 2007; Leskovec et al., 2010; Airoldi et al., 2009)

on generating graphs with a set of specific properties like degree distribution, diameter, and eigenvalues. All these classical models are hand-engineered to model a particular family of graphs, and thus not general enough. Besides the general graphs, a lot of work also exploit domain knowledge for better performance in specific domains. Examples of this include

(Kusner et al., 2017; Dai et al., 2018; Jin et al., 2018; Liu et al., 2018) for modeling molecule graphs, and (You et al., 2019) for SAT instances. See app:related for more discussions.

2 Model

A graph is defined by a set of nodes and set of edges , where a tuple is used to represent an edge between node and . We denote and as the number of nodes and edges in respectively. Note that a given graph may have multiple equivalent adjacency matrix representations, with different node orderings. However, given an ordering of the nodes , there is a one-to-one mapping between the graph structure and the adjacency matrix .

Our goal is to learn a generative model, , given a set of training graphs . In this paper, we assume the graphs are not attributed, and focus on the graph topology only. Such an assumption implies



is the probability of generating the set of edges

under a particular ordering , which is equivalent to the probability of a particular adjacency matrix . Here, is the distribution of number of nodes in a graph. In this paper, we use a single canonical ordering to model each graph , as in (Li et al., 2018):


which is clearly a lower bound on  (Liao et al., 2019)

. We estimate

directly using the empirical distribution over graph size, and only learn an expressive model for . In the following presentation, we therefore omit the ordering and assume a default canonical ordering of nodes in the graph when appropriate.

As will be extremely sparse for large sparse graphs (), generating only the non-zero entries in , , the edge set , will be much more efficient than the full matrix:


where each include the indices of the two nodes associated with one edge, resulting in a generation process of steps. We order the edges following the node ordering, hence this process generates all the edges for the first row in , before generating the second row, etc. A naive approach to generating a single edge will be if we factorize and assume both and to be simple multinomials over nodes. This, however, will not give us any benefit over traditional methods.

2.1 Recursive edge generation

The main scalability bottleneck is the large output space. One way to reduce the output space size is to use a hierarchical recursive decomposition, inspired by the classic random graph model R-MAT (Chakrabarti et al., 2004). In R-MAT, each edge is put into the adjacency matrix by dividing the 2D matrix into 4 equally sized quadrants, and recursively descending into one of the quadrants, until reaching a single entry of the matrix. In this generation process, the complexity of generating each edge is only .

Figure 1: Recursive generation of the edge given .

We adopt the recursive decomposition design of R-MAT, and further simplify and neuralize it, to make our model efficient and expressive. In our model, we generate edges following the node ordering row by row, so that each edge only needs to be put into the right place in a single row, reducing the process to 1D, as illustrated in fig:edge_gen. For any edge , the process of picking node as one of ’s neighbors starts by dividing the node index interval in half, then recursively descending into one half until reaching a single entry. Each corresponds to a unique sequence of decisions , where is the -th decision in the sequence, and is the maximum number of required decisions to specify .

The probability of can then be formulated as


where each is the probability of following the decision that leads to at step .

Let us use to denote the set of edges incident to node , and . Generating only a single edge is similar to hierarchical softmax (Mnih & Hinton, 2009), and applying the above procedure repeatedly can generate all of edges in time. But we can do better than that when generating all these edges.

Further improvement using binary trees. As illustrated in the left half of fig:row_gen, the process of jointly generating all of is equivalent to building up a binary tree , where each tree node corresponds to a graph node index interval , and for each the process starts from the root and ends in a leaf .

Taking this perspective, we propose a more efficient generation process for , which generates the tree directly instead of repeatedly generating each leaf through a path from the root. We propose a recursive process that builds up the tree following a depth-first or in-order traversal order, where we start at the root, and recursively for each tree node : (1) decide if has a left child denoted as , and (2) if so recurse into and generate the left sub-tree, and then (3) decide if has a right child denoted as , (4) if so recurse into and generate the right sub-tree, and (5) return to ’s parent. This process is shown in alg:row_gen, which will be elaborated in next section.

Overloading the notation a bit, we use to denote the probability that tree node has a left child, when , or does not have a left child, when , under our model, and similarly define . Then the probability of generating or equivalently tree is


This new process generates each tree node exactly once, hence the time complexity is proportional to the tree size , and it is clear that , since is the number of leaf nodes in the tree and is the max depth of the tree. The time saving comes from removing the duplicated effort near the root of the tree. When is large, as some fraction of when is one of the “hub” nodes in the graph, the tree becomes dense and our new generation process will be significantly faster, as the time complexity becomes close to while generating each leaf from the root would require time.

In the following, we present our approach to make this model fully autoregressive,  making and depend on all the decisions made so far in the process of generating the graph, and make this model neuralized so that all the probability values in the model come from expressive deep neural networks.

2.2 Autoregressive conditioning for generating

1:  function recursive(
2:     if is_leaf(t) then
3:        Return
4:     end if
5:     has_left using Eq. (8)
6:     if has_left then
7:        Create , and let recursive()
8:     else
10:     end if
11:     has_right using Eq. (9)
12:     if has_right then
13:        Create , and let recursive()
14:     else
16:     end if
19:     Return ,
20:  end function
Algorithm 1 Generating outgoing edges of node

In this section we consider how to add autoregressive conditioning to and when generating .

Figure 2: Autoregressive generation of edge-binary tree of node . To generate the red node , the two embeddings that captures ’s left subtree (orange region) and nodes with in-order traversal before (blue region) respectively are used for conditioning.

In our generation process, the decision about whether exists for a particular tree node is made after

, all its ancestors, and all the left sub-trees of the ancestors are generated. We can use a top-down context vector

to summarize all these contexts, and modify to . Similarly, the decision about is made after generating and its dependencies, and ’s entire left-subtree (see fig:row_gen right half for illustration). We therefore need both the top-down context , as well as the bottom-up context that summarizes the sub-tree rooted at , if any. The autoregressive model for therefore becomes


and we can recursively define


where and are two TreeLSTM cells (Tai et al., 2015) that combine information from the incoming nodes into a single node state, and and represents the embedding vector for the binary values “left” and “right”. We initialize , and discuss in the next section.

The distributions can then be parameterized as


2.3 Full autoregressive model

With the efficient recursive edge generation and autoregressive conditioning presented in sec:edge_gen and sec:row_gen respectively, we are ready to present the full autoregressive model for generating the entire adjacency matrix .

The full model will utilize the autoregressive model for as building blocks. Specifically, we are going to generate the adjacency matrix row by row:


Let be the embedding that summarizes , suppose we have an efficient mechanism to encode into , then we can effectively use to generate and thus the entire process would become autoregressive. Again, since there are rows in total, using a chain structured LSTM would make the history length too long for large graphs. Therefore, we use an approach inspired by the Fenwick tree (Fenwick, 1994) which is a data structure that maintains the prefix sum efficiently. Given an array of numbers, the Fenwick tree allows calculating any prefix sum or updating any single entry in for a sequence of length . We build such a tree to maintain and update the prefix embeddings. We denote it as row-binary forest as such data structure is a forest of binary trees.

Figure 3: Autoregressive conditioning across rows of adjacency matrix. Embedding summarizes all the rows before , and is used for generating next.

fig:row2048 demonstrates one solution. Before generating the edge-binary tree , the embeddings that summarize each individual edge-binary tree will be organized into the row-binary forest . This forest is organized into levels, with the bottom -th level as edge-binary tree embeddings. Let be the -th node in the -th level, then


where .

Embedding row-binary forest One way to embed this row-binary forest is to embed the root of each tree in this forest. As there will be at most one root in each level (otherwise the two trees in the same level will be merged into a larger tree for the next level), the number of tree roots will also be at most. Thus we calculate as follows:


Here, is the bit-level ‘and’ operator. Intuitively as in Fenwick tree, the calculation of any prefix sum of length requires the block sums that corresponds to each binary digit in the binary bits representation of integer . Recall the operator defined in sec:row_gen, here when , and equals to zero when . With the embedding of at each state served as ‘glue’, we can connect all the individual row generation modules in an autoregressive way.

Updating row-binary forest It is also efficient to update such forest every time a new is obtained after the generation of . Such an updating procedure is similar to the Fenwick tree update. As each level of this forest has at most one root node, merging in the way defined in (11) will happen at most once per each level, which effectively makes the updating cost to be .

1:  function update_forest(
3:     for  to  do
5:        if such exists and is an even number then
8:        end if
9:     end for
10:     Update using Eq (12)
11:     Return: ,
12:  end function
13:  Input: The number of nodes
15:  for  to  do
16:     Let be the root of an empty edge-binary tree
17:      recursive
18:      update_forest()
19:  end for
20:  Return: with and
Algorithm 2 Generating graph using BiGG

alg:graph_gen summarizes the entire procedure for sampling a graph from our model in a fully autoregressive manner.

BiGG generates a graph with node and edges in time. In the extreme case where , the overall complexity becomes . The generation of each edge-binary tree in sec:row_gen requires time complexity proportional to the number of nodes in the tree. The Fenwick tree query and update both take time, hence maintaining the data structure takes . The overall complexity is . For a sparse graph , hence . For a complete graph, where , each will be a full binary tree with leaves, hence and the overall complexity would be .

3 Optimization

In this section, we propose several ways to scale up the training of our auto-regressive model. For simplicity, we focus on how to speed up the training with a single graph. Training multiple graphs can be easily extended.

3.1 Training with synchronizations

A classical autoregressive model like LSTM is fully sequential, which allows no concurrency across steps. Thus training LSTM with a sequence of length takes of synchronized computation. In this section, we show how to exploit the characteristic of BiGG to increase concurrency during training.

Figure 4: Parallelizing computation during training. The four stages are executed sequentially. In each stage, the embeddings of nodes with the same color can be computed concurrently.

Given a graph , estimating the likelihood under the current model can be divided into four steps.

  1. [wide,noitemsep,topsep=0.5pt,parsep=0pt,partopsep=0.5pt]

  2. Computing : as the graph is given during training, the corresponding edge-binary trees are also known. From Eq (7) we can see that the embeddings of all nodes at the same depth of tree can be computed concurrently without dependence, and the synchronization only happens between different depths. Thus steps of synchronization is sufficient.

  3. Computing : the row-binary forest grows monotonically, thus the forest in the end contains all the embeddings needed for computing . Similarly, computing synchronizes steps.

  4. Computing : as each runs an LSTM independently, this stage simply runs LSTM on a batch of sequences with length .

  5. Computing and likelihood: the last step computes the likelihood using Eq (8) and  (9). This is similar to the first step, except that the computation happens in a top-down direction in each .

fig:par_train demonstrates this process. In summary, the four stages each take steps of synchronization. This allows us to train large graphs much more efficiently than a simple sequential autoregressive model.

3.2 Model parallelism

It is possible that during training the graph is too large to fit into memory. Thus to train on large graphs, model parallelism is more important than data parallelism.

To split the model, as well as intermediate computations, into different machines, we divide the adjacency matrix into multiple consecutive chunks of rows, where each machine is responsible for one chunk. It is easy to see that by doing so, Stage 1 and Stage 4 mentioned in sec:par_train can be executed concurrently on all the machines without any synchronization, as the edge-binary trees can be processed independently once the conditioning states like are made ready by synchronization.

Figure 5: Model parallelism for training single graph. Red circled nodes are computed on GPU 1 but is required by GPU 2 as well.

fig:model_par illustrates the situation when training a graph with 7 nodes using 2 GPUs. During Stage 1, GPU 1 and 2 work concurrently to compute and , respectively. In Stage 2, the embeddings and are needed by GPU 2 when computing and . We denote such embeddings as ‘g-messages’. Note that such ‘g-messages’ will transit in the opposite direction when doing a backward pass in the gradient calculation. Passing ‘g-messages’ introduces serial dependency across GPUs. However as the number of such embeddings is upper bounded by depth of row-binary forest, the communication cost is manageable.

3.3 Reducing memory consumption

Sublinear memory cost:

Another way to handle the memory issue when training large graphs is to recompute certain portions of the hidden layers in the neural network when performing backpropagation, to avoid storing such layers during the forward pass.

Chen et al. (2016) introduces a computation scheduling mechanism for sequential structured neural networks that achieves memory growth for an -layer neural network.

We divide rows as in sec:model_par. During the forward pass, only the ‘g-messages’ between chunks are kept. The only difference from the previous section is that the edge-binary tree embeddings are recomputed, due to the single GPU limitation. The memory cost will be:


Here accounts for the memory holding the ‘g-message’, and accounts for the memory of in each chunk. The optimal is achieved when , hence and the corresponding memory cost is . Also note that such sublinear cost requires only one additional feedforward in Stage 1, so this will not hurt much of the training speed.

Bits compression: The vector summarizes the edge-binary tree structure rooted at node for -th row in adjacency matrix , as defined in Eq (7). As node represents the interval of the row, another equivalent way is to directly use , , the binary vector to represent . Each where is also a binary vector. Thus no neural network computation is needed in the subtree rooted at node . Suppose we use such bits representation for any nodes that have the corresponding interval length no larger than , then for a full edge-binary tree (, connects to every other node in graph) which has nodes in the tree, the corresponding storage required for neural part is which essentially reduces the memory consumption of neural network to of the original cost. Empirically we use in all experiments, which saves of the memory during training without losing any information in representation.

Note that to represent an interval of length , we use vector where

That is to say, we use ternary bit vector to encode both the interval length and the binary adjacency information.

3.4 Position encoding:

During generation of , each tree node of the edge-binary tree knows the span which corresponds to the columns it will cover. One way is to augment with the position encoding as:


where PE is the position encoding using sine and cosine functions of different frequencies as in Vaswani et al. (2017). Similarly, the in Eq (12) can be augmented by in a similar way. With such augmentation, the model will know more context into the future, and thus help improve the generative quality.

Please refer to our released open source code located at for more implementation and experimental details.

4 Experiment

4.1 Model Quality Evaluation on Benchmark Datasets

In this part, we compare the quality of our model with previous work on a set of benchmark datasets. We present results on median sized general graphs with number of nodes ranging in 0.1k to 5k in sec:exp_general_graph, and on large SAT graphs with up to 20k nodes in sec:exp_sat_graph. In sec:ablation we perform ablation studies of BiGG with different sparsity and node orders.

4.1.1 General graphs

Datasets Methods
Erdos-Renyi GraphVAE GraphRNN-S GraphRNN GRAN BiGG
Grid Deg. 0.79 7.07 0.13 1.12 8.23
Clus. 2.00 7.33 3.73 7.73 3.79
, Orbit 1.08 0.12 0.18 1.03 1.59
, Spec. 0.68 1.44 0.19 1.18 1.62
Protein Deg. 5.64 0.48 4.02 1.06 1.98
Clus. 1.00 7.14 4.79 0.14 4.86
, Orbit 1.54 0.74 0.23 0.88 0.13
, Spec. 9.13 0.11 0.21 1.88 5.13
3D Point Cloud Deg. 0.31 OOM OOM OOM 1.75
Clus. 1.22 OOM OOM OOM 0.51 0.21
, Orbit 1.27 OOM OOM OOM 0.21
, Spec. 4.26 OOM OOM OOM 7.45
Lobster Deg. 0.24 2.09 3.48 9.26 3.73
Clus. 3.82 7.97 4.30 0.00 0.00 0.00
, Orbit 2.42 1.43 2.48 2.19 7.67
, Spec. 0.33 3.94 6.72 1.14 2.71
Err. 1.00 0.91 1.00 0.00 0.12 0.00
Table 1: Performance on benchmark datasets. The MMD metrics uses test functions from . For all the metrics, the smaller the better. Baseline results are obtained from Liao et al. (2019), where OOM indicates the out-of-memory issue.

The general graph benchmark is obtained from Liao et al. (2019) and part of it was also used in (You et al., 2018). This benchmark has four different datasets: (1) Grid, 100 2D grid graphs; (2) Protein, 918 protein graphs (Dobson & Doig, 2003); (3) Point cloud, 3D point clouds of 41 household objects (Neumann et al., 2013); (4) Lobster, 100 random Lobster graphs (Golomb, 1996), which are trees where each node is at most 2 hops away from a backbone path. tab:benchmark contains some statistics about each of these datasets. We use the same protocol as Liao et al. (2019) that splits the graphs into training and test sets.

Baselines: We compare with deep generative models including GraphVAE (Simonovsky & Komodakis, 2018), GraphRNN, GraphRNN-S (You et al., 2018) and GRAN  (Liao et al., 2019). We also include the Erdős–Rényi random graph model that only estimates the edge density. Since our setups are exactly the same, the baseline results are directly copied from Liao et al. (2019).


We use exactly the same evaluation metric as 

Liao et al. (2019), which compares the distance between the distribution of held-out test graphs and the generated graphs. We use maximum mean discrepancy (MMD) with four different test functions, namely the node degree, clustering coefficient, orbit count and the spectra of the graphs from the eigenvalues of the normalized graph Laplacian. Besides the four MMD metrics, we also use the error rate for Lobster dataset. This error rate reflects the fraction of generated graphs that doesn’t have Lobster graph property.

Results: tab:benchmark reports the results on all the four datasets. We can see the proposed BiGG outperforms all other methods on all the metrics. The gain becomes more significant on the largest dataset, , the 3D point cloud. While GraphVAE and GraphRNN gets out of memory, the orbit metric of BiGG is 2 magnitudes better than GRAN. This dataset reflects the scalability issue of existing deep generative models. Also from the Lobster graphs we can see, although GRAN scales better than GraphRNN, it yields worse quality due to its approximation of edge generation with mixture of conditional independent distributions. Our BiGG improves the scalability while also maintaining the expressiveness.

4.1.2 SAT graphs

Clustering Modularity Variable Clause Modularity Modularity
Training-24 0.53 0.08 0.61 0.13 5.30 3.79 5.14 3.13 0.76 0.08 0.70 0.07
G2SAT 0.41 0.18 (23%) 0.55 0.18 (10%) 5.30 3.79 (0%) 7.22 6.38 (40%) 0.71 0.12 (7%) 0.68 0.06 (3%)
BiGG-0.1 0.49 0.21 (8%) 0.36 0.21 (41%) 5.30 3.79 (0%) 3.76 1.21 (27%) 0.58 0.16 (24%) 0.58 0.11 (17%)
BiGG-0.01 0.54 0.13(2%) 0.53 0.21 (13%) 5.30 3.79(0%) 4.28 1.50 (17%) 0.71 0.13 (7%) 0.67 0.09 (4%)
Table 2: Training and generated graph statistics with 24 SAT formulas used in You et al. (2019)

. The neural baselines in tab:benchmark are not applicable due to scalability issue. We report mean and std of different test statistics, as well as the gap between true SAT instances.

Figure 6: Inference time per graph.
Figure 7: Training time per update.
Figure 8: Training memory cost.

In addition to comparing with general graph generative models, in this section we compare against several models that are designated for generating the Boolean Satisfiability (SAT) instances. A SAT instance can be represented using bipartite graph, , the literal-clause graph (LCG). For a SAT instance with variables and clauses, it creates positive and negative literals, respectively. The canonical node ordering assigns 1 to for literals and to for clauses.

The following experiment largely follows G2SAT (You et al., 2019). We use the train/test split of SAT instances obtained from G2SAT website. This result in 24 and 8 training/test SAT instances, respectively. The size of the SAT graphs ranges from 491 to 21869 nodes. Note that the original paper reports results using 10 small training instances instead. For completeness, we also include such results in app:sat together with other baselines from You et al. (2019).

Baseline: We mainly compare the learned model with G2SAT, a specialized deep graph generative model for bipartite SAT graphs. Since BiGG is general purposed, to guarantee the generated adjacency matrix is bipartite, we let our model to generate the upper off-diagonal block of the adjacency matrix only, , .

G2SAT requires additional ‘template graph’ as input when generating the graph. Such template graph is equivalent to specify the node degree of literals in LCG. We can also enforce the degree of each node in our model.


Following G2SAT, we report the mean and standard deviation of statistics with respect to different test functions. These include the modularity, average clustering coefficient and the scale-free structure parameters for different graph representations of SAT instances. Please refer to 

Newman (2001, 2006); Ansótegui et al. (2009); Clauset et al. (2009) for more details. In general, the closer the statistical estimation the better it is.

Results: Following You et al. (2019), we compare the statistics of graphs with the training instances in tab:g2sat_24. To mimic G2SAT which picks the best action among sampled options each step, we perform -sampling variant (which is denoted BiGG-). Such model has

probability to sample from Bernoulli distribution (as in Eq (

8) (9)) each step, and to pick best option otherwise. This is used to demonstrate the capacity of the model. We can see that the proposed BiGG can mostly recover the statistics of training graph instances. This implies that despite being general, the full autoregressive model is capable of modeling complicated graph generative process. We additionally report the statistics of generated SAT instances against the test set in  app:sat, where G2SAT outperforms BiGG in 4/6 metrics. As G2SAT is specially designed for bipartite graphs, the inductive bias it introduces allows the extrapolation to large graphs. Our BiGG is general purposed and has higher capacity, thus also overfit to the small training set more easily.

4.2 Scalability of BiGG

In this section, we will evaluate the scalability of BiGG regarding the time complexity, memory consumption and the quality of generated graphs with respect to the number of nodes in graphs.

4.2.1 Runtime and memory cost

Here we empirically verify the time and memory complexity analyzed in sec:model. We run BiGG on grid graphs with different numbers of nodes that are chosen from . In this case . Additionally, we also plot curves from the theoretical analysis for verification. Specifically, suppose the asymptotic cost function is w.r.t. graph size, then if there exist constants such that , then we can claim . In fig:test_time to 8, the two constants are tuned for better visualization.

fig:test_time reports the time needed to sample a single graph from the learned model. We can see the computation cost aligns well with the ideal curve of .

To evaluate the training time cost, we report the time needed for each round of model update, which consists of forward, backward pass of neural network, together with the update of parameters. As analyzed in sec:par_train, if there is a device with infinite FLOPS, then the time cost would be . We can see from fig:train_time that this analysis is consistent when graph size is less than 5,000. However as graph gets larger, the computation time grows linearly on a single GPU due to the limit of FLOPS and RAM.

Finally fig:memory_cost shows the peak memory cost during training on a single graph. We select the optimal number of chunks as suggested in sec:sqrtn, and thus the peak memory grows as . We can see such sublinear growth of memory can scale beyond sparse graphs with 100k of nodes.

4.2.2 Quality w.r.t graph size

0.5k 1k 5k 10k 50k 100k
Erdős–Rényi 0.84 0.86 0.91 0.93 0.95 0.95
GRAN 0.39 1.06 N/A N/A
Table 3: MMD using orbit test function on grid graphs with different average number of nodes. N/A denotes runtime error during training, due to RAM or file I/O limitations.

In addition to the time and memory cost, we are also interested in the generated graph quality as it gets larger. To do so, we follow the experiment protocols in sec:exp_general_graph on a set of grid graph datasets. The datasets have the average number of nodes ranging in . We train on of the instances, and evaluate results on held-out instances. As calculating spectra is no longer feasible for large graphs, we report MMD with orbit test function in tab:large_grid. For neural generative models we compare against GRAN as it is the most scalable one currently. GRAN fails on training graphs beyond 50k nodes as runtime error occurs due to RAM or file I/O limitations. We can see the proposed BiGG still preserves high quality up to grid graphs with 100k nodes. With the latest advances of GPUs, we believe BiGG would scale further due to its superior asymptomatic complexity over existing methods.

4.3 Ablation study

In this section, we take a deeper look at the performance of BiGG with different node ordering in sec:ordering. We also show the effect of edge density to the generative performance in sec:sparsity.

4.3.1 BiGG with different node ordering

In the previous sections we use DFS or BFS orders. We find these two orders give consistently good performance over a variety of datasets. For the completeness, we also present results with other node orderings.

We use different orders presented in GRAN’s Github implementation. We use the protein dataset with spectral-MMD as evaluation metric. See tab:ordering for the experimental results. In summary: 1) BFS/DFS give our model consistently good performance over all tasks, as it reduces the tree-width for BiGG (similar to Fig5 in GraphRNN) and we suggest to use BFS or DFS by default; 2) BiGG is also flexible enough to take any order, which allows for future research on deciding the best ordering.

DFS BFS Default Kcore Acc Desc
3.64 3.89 4.81 2.60 3.93 4.54
Table 4: BiGG with nodes ordered by DFS, BFS, default, k-core ordering, degree accent ordering and degree descent ordering respectively on protein data. We report spectral-MMD metric here.

Determining the optimal ordering is NP-hard, and learning a good ordering is also difficult, as shown in the prior works. In this paper, we choose a single canonical ordering among graphs, as Li et al. (2018) shows that canonical ordering mostly outperforms variable orders in their Table 2, 3, while Liao et al. (2019) uses single DFS ordering (see their Sec 4.4 or github) for all experiments.

4.3.2 Performance on random graphs with decreasing sparsity

We here present experiments on Erdos-Renyi graphs with on average 500 nodes and different densities. We report spectral MMD metrics for GRAN, GT and BiGG, where GT is the ground truth Erdos-Renyi model for the data.

1% 2% 5% 10%
GRAN 3.50 1.23 7.81 1.31
GT 9.97 4.55 2.82 1.94
BiGG 9.47 5.08 3.18 8.38
Table 5: Graph generation quality with decreasing sparsity. We use spectral-MMD as evaluation metric against held-out test graphs.

Our main focus is on sparse graphs that are more common in the real world, and for which our approach can gain significant speed ups over the alternatives. Nevertheless, as shown in tab:sparsity, we can see that BiGG is consistently doing much better than GRAN while being close to the ground truth across different edge densities.

5 Conclusion

We presented BiGG, a scalable autoregressive generative model for general graphs. It takes complexity for sparse graphs, which substantially improves previous algorithms. We also proposed both time and memory efficient parallel training method that enables comparable or better quality on benchmark and large random graphs. Future work include scaling up it further while also modeling attributed graphs.


We would like to thank Azalia Mirhoseini, Polo Chau, Sherry Yang and anonymous reviewers for valuable comments and suggestions.


Appendix A Detailed discussion of related work

Traditional graph generative models

There has been a lot of work (Erdös & Rényi, 1959; Barabási & Albert, 1999; Albert & lászló Barabási, 2002; Chakrabarti et al., 2004; Robins et al., 2007; Leskovec et al., 2010; Airoldi et al., 2009) on generating graphs with a set of specific properties like degree distribution, diameter, and eigenvalues. While some of them (Erdös & Rényi, 1959; Barabási & Albert, 1999; Albert & lászló Barabási, 2002; Chakrabarti et al., 2004) focused on the statistical mechanics of random and complex networks, (Robins et al., 2007) proposed exponential random graph models, known as model, in estimating network models comes from the area of social sciences, statistics and social network analysis. However the model usually focuses on local structural features of networks. (Airoldi et al., 2009; Leskovec et al., 2010), on the other hand, model the structure of the network as a whole by combining a global model of dense patches of connectivity with a local model.

Most of these classical models are hand-engineered to model a particular family of graphs, and thus do not have the capacity to directly learn the generative model from observed data.

Deep graph generative models

Recent work on deep neural network based graph generative models are typically much more expressive than classic random graph models, at a higher computation cost.

Similar to visual data, there are VAE (Kingma & Welling, 2013) and GAN (Goodfellow et al., 2014) based graph generative models, like GraphVAEs (Kipf & Welling, 2016; Simonovsky & Komodakis, 2018) and NetGAN (Bojchevski et al., 2018). These models typically generates large parts (or all) of the adjacency matrix independently, making them unable to model complex graph structures.

Auto-regressive models, on the other hand, generate a graph sequentially part by part, gaining expressivity but are typically more expensive as large graphs require large numbers of generation steps. Scalability is a constant important topic in this line of work, starting from (Li et al., 2018). More recent work GraphRNN (You et al., 2018) and GRAN (Liao et al., 2019) reported improved scalability and success generating graphs of up to 5,000 nodes. Our novel BiGG model advances this line of work by significantly improving both on model quality and scalability.

Besides models for general graphs, a lot of work also exploit domain knowledge for better performance in specific domains. Examples of this include (Kusner et al., 2017; Dai et al., 2018; Jin et al., 2018; Liu et al., 2018) for modeling molecule graphs, and (You et al., 2019) for SAT instances.

Appendix B More experiment details

Manual Batching

Though theoretically sec:par_train enables the parallelism, the commonly used packages like PyTorch and TensorFlow has limited support for auto-batching operators. Also optimizing over a computation graph with

operators directly would be infeasible. Inspired by previous works (Looks et al., 2017; Neubig et al., 2017; Bradbury & Fu, 2018), we enable the parallelism by performing gathering and dispatching for tree or LSTM cells. This requires a single call of gemm on CUDA, instead of multiple gemv calls for the cell operations. The indices used for gathering and dispatching are computed in a customized c++ op for the sake of speed.

Parameter sharing

As the maximum depth in BiGG is , it is feasible to have different parameters per different layers for tree or LSTM cells. However experimentally it didn’t affect the performance significantly. With this change, not all parameters are updated at each step, as the trees have different heights and some are updated more often than others, which could potentially make the learning harder. Also it would make the model times larger, and limits the potential of generating graphs larger than seen during training. So we still share the parameters of cells like other autoregressive models.

b.1 BiGG setup

Hyperparameters: For the proposed BiGG, we use embedding size together with position encoding for state representation. Learning rate is initialized to and decays to when training loss gets plateau.

All the experiments for both baselines and BiGG are carried out on a single V100 GPU. For the sublinear memory technique mentioned in sec:sqrtn, we choose (, the number of blocks) as small as possible such that the training would be able to fit in a single V100 GPU. This is to make the training feasible while minimizing the synchronization overhead. Empirically the block size used in sublinear memory technique is set to 6,000 for sparse graphs.

b.2 Baselines setup

We include all the baseline numbers from their original papers when applicable, as the most experimental protocols are the same. We run the following baselines when the corresponding numbers are not reported in the literature:

  • Erdős–Rényi: for the results in tab:large_grid, we estimate the graph edge density using the training grid graphs, and generate new Erdős–Rényi random graphs with this density and compare with the held-out test graphs;

  • GraphRNN-S (You et al., 2018): the result on Lobster random graphs in tab:benchmark is obtained by training the GraphRNN using official code 111

    for 3000 epochs;

  • GRAN (Liao et al., 2019): the results of GRAN in tab:large_grid are obtained by training GRAN with grid graph configuration for 8,000 epochs or 3 days, whichever limit the model reaches first.

  • G2SAT (You et al., 2019): For results in tab:g2sat_24 and  tab:g2sat_24_app, we run the code 222 for 1000 epochs. For the test graph comparison in tab:g2sat_24_app, we use the 8 test graphs as ‘template-graphs’ for G2SAT to generate graphs.

Clustering Modularity Variable Clause Modularity Modularity
Training 0.50 0.07 0.58 0.09 3.57 1.08 4.53 1.09 0.74 0.06 0.67 0.05
CA 0.33 0.08 (34%) 0.48 0.10 (17%) 6.30 1.53 (76%) N/A 0.65 0.08 (12%) 0.53 0.05 (16%)
PS(T=0) 0.82 0.04 (64%) 0.72 0.13 (24%) 3.25 0.89 (9%) 4.70 1.59 (4%) 0.86 0.05 (16%) 0.64 0.05 (2%)
PS(T=1.5) 0.30 0.10 (40%) 0.14 0.03 (76%) 4.19 1.10 (17%) 6.86 1.65 (51%) 0.40 0.05 (46%) 0.41 0.05 (35%)
G2SAT 0.41 0.09 (18%) 0.54 0.11 (7%) 3.57 1.08 (0%) 4.79 2.80 (6%) 0.68 0.07 (8%) 0.67 0.03 (6%)
BiGG 0.50 0.07 (0%) 0.58 0.09 (0%) 3.57 1.08 (0%) 4.53 1.09 (0%) 0.73 0.06 (1%) 0.61 0.09 (9%)
BiGG-0.1 0.50 0.07 (0%) 0.58 0.09 (0%) 3.57 1.08 (0%) 4.53 1.09 (0%) 0.74 0.06 (0%) 0.65 0.07 (7%)
Table 6: Training and generated graph statistics with 10 SAT formulas used in You et al. (2019). Baselines results are directly copied from You et al. (2019).
Clustering Modularity Variable Clause Modularity Modularity
Test-8 0.50 0.05 0.68 0.19 4.66 1.92 6.33 3.23 0.84 0.02 0.75 0.05
G2SAT 0.22 0.10 (56%) 0.74 0.11 (9%) 4.66 1.92 (0%) 6.60 4.15 (5%) 0.80 0.09 (5%) 0.68 0.04 (9%)
BiGG 0.37 0.19 (26%) 0.28 0.12 (58%) 4.66 1.92 (0%) 2.74 0.37 (57%) 0.44 0.07 (48%) 0.48 0.09 (36%)
Table 7: Test and generated graph statistics with 8 SAT formulas from G2SAT and BiGG are trained on 24 sat formulas as in tab:g2sat_24

b.3 More experimental results on SAT graphs

In main paper we reported the results on 24 training instances with the evaluation metric used in You et al. (2019). Here we include the results on the subset which is used in the original G2SAT paper (You et al., 2019). This subset contains 10 instances, with 82 to 1122 variables and 327 to 4555 clauses. This result in graphs with 6,799 nodes at most. tab:subset summarizes the results, together with other baseline results that are copied from You et al. (2019). We can see our model can almost perfectly recover the training statistics if the generative process is biased more towards high probability region (, BiGG-0.1 which has 90% chance to use greedy decoding at each step). This again demonstrate that our model is flexible enough to capture complicated distributions of graph structures. On the other hand, as tab:g2sat_24_app shows, the G2SAT beats BiGG on 4 out of 6 metrics. Since G2SAT is a specialized generative model for bipartite graphs, it enjoys more of inductive bias towards this specific problem, and thus would work better in the low-data and extrapolation scenario. Our general purposed model thus would be expected to require more training data due to its high capacity in model space.