1 Introduction
Representing a distribution over graphs provides a principled foundation for tackling many important problems in knowledge graph completion
(Xiao et al., 2016), de novo drug design (Li et al., 2018; Simonovsky & Komodakis, 2018), architecture search (Xie et al., 2019) and program synthesis (Brockschmidt et al., 2018). The effectiveness of graph generative modeling usually depends on learning the distribution given a collection of relevant training graphs. However, training a generative model over graphs is usually quite difficult due to their discrete and combinatorial nature.Classical generative models of graphs, based on random graph theory (Erdős & Rényi, 1960; Barabási & Albert, 1999; Watts & Strogatz, 1998), have been long studied but typically only capture a small set of specific graph properties, such as degree distribution. Despite their computational efficiency, these distribution models are usually too inexpressive to yield competitive results in challenging applications.
Recently, deep graph generative models that exploit the increased capacity of neural networks to learn more expressive graph distributions have been successfully applied to realworld tasks. Prominent examples include VAEbased methods
(Kipf & Welling, 2016; Simonovsky & Komodakis, 2018), GANbased methods (Bojchevski et al., 2018), flow models (Liu et al., 2019; Shi et al., 2020) and autoregressive models (Li et al., 2018; You et al., 2018; Liao et al., 2019). Despite the success of these approaches in modeling small graphs, e.g. molecules with hundreds of nodes, they are not able to scale to graphs with over 10,000 nodes.A key shortcoming of current deep graph generative models is that they attempt to generate a full graph adjacency matrix, implying a computational cost of for a graph with nodes and edges. For large graphs, it is impractical to sustain such a quadratic time and space complexity, which creates an inherent tradeoff between expressiveness and efficiency. To balance this tradeoff, most recent work has introduced various conditional independence assumptions (Liao et al., 2019), ranging from the fully autoregressive but slow GraphRNN (You et al., 2018), to the fast but fully factorial GraphVAE (Simonovsky & Komodakis, 2018).
In this paper, we propose an alternative approach that does not commit to explicit conditional independence assumptions, but instead exploits the fact that most interesting realworld graphs are sparse, in the sense that . By leveraging sparsity, we develop a new graph generative model, BiGG (BIg Graph Generation), that streamlines the generative process and avoids explicit consideration of every entry in an adjacency matrix. The approach is based on three key elements: (1) an process for generating each edge using a binary tree data structure, inspired by RMAT (Chakrabarti et al., 2004); (2) a treestructured autoregressive model for generating the set of edges associated with each node; and (3) an autoregressive model defined over the sequence of nodes. By combining these elements, BiGG can generate a sparse graph in time, which is a substantial improvement over .
For training, the design of BiGG also allows every context embedding in the autoregressive model to be computed in only sequential steps, which enables significant gains in training speed through parallelization. By comparison, the context embedding in GraphRNN requires sequential steps to compute, while GRAN requires steps. In addition, we develop a training mechanism that only requires sublinear memory cost, which in principle makes it possible to train models of graphs with millions of nodes on a single GPU.
On several benchmark datasets, including synthetic graphs and realworld graphs of proteins, 3D mesh and SAT instances, BiGG is able to achieve comparable or superior sample quality than the previous stateoftheart, while being orders of magnitude more scalable.
To summarize the main contributions of this paper:

[leftmargin=*,nolistsep,nosep]

We propose an autoregressive generative model, BiGG, that can generate sparse graphs in time, successfully modeling graphs with 100k nodes on 1 GPU.

The training process can be largely parallelized, requiring only steps to synchronize learning updates.

Memory cost is reduced to for training and for inference.

BiGG not only scales to orders of magnitude larger graphs than current deep models, it also yields comparable or better model quality on several benchmark datasets.
Other related work There has been a lot of work (Chakrabarti et al., 2004; Robins et al., 2007; Leskovec et al., 2010; Airoldi et al., 2009)
on generating graphs with a set of specific properties like degree distribution, diameter, and eigenvalues. All these classical models are handengineered to model a particular family of graphs, and thus not general enough. Besides the general graphs, a lot of work also exploit domain knowledge for better performance in specific domains. Examples of this include
(Kusner et al., 2017; Dai et al., 2018; Jin et al., 2018; Liu et al., 2018) for modeling molecule graphs, and (You et al., 2019) for SAT instances. See app:related for more discussions.2 Model
A graph is defined by a set of nodes and set of edges , where a tuple is used to represent an edge between node and . We denote and as the number of nodes and edges in respectively. Note that a given graph may have multiple equivalent adjacency matrix representations, with different node orderings. However, given an ordering of the nodes , there is a onetoone mapping between the graph structure and the adjacency matrix .
Our goal is to learn a generative model, , given a set of training graphs . In this paper, we assume the graphs are not attributed, and focus on the graph topology only. Such an assumption implies
(1) 
where
is the probability of generating the set of edges
under a particular ordering , which is equivalent to the probability of a particular adjacency matrix . Here, is the distribution of number of nodes in a graph. In this paper, we use a single canonical ordering to model each graph , as in (Li et al., 2018):(2) 
which is clearly a lower bound on (Liao et al., 2019)
. We estimate
directly using the empirical distribution over graph size, and only learn an expressive model for . In the following presentation, we therefore omit the ordering and assume a default canonical ordering of nodes in the graph when appropriate.As will be extremely sparse for large sparse graphs (), generating only the nonzero entries in , , the edge set , will be much more efficient than the full matrix:
(3) 
where each include the indices of the two nodes associated with one edge, resulting in a generation process of steps. We order the edges following the node ordering, hence this process generates all the edges for the first row in , before generating the second row, etc. A naive approach to generating a single edge will be if we factorize and assume both and to be simple multinomials over nodes. This, however, will not give us any benefit over traditional methods.
2.1 Recursive edge generation
The main scalability bottleneck is the large output space. One way to reduce the output space size is to use a hierarchical recursive decomposition, inspired by the classic random graph model RMAT (Chakrabarti et al., 2004). In RMAT, each edge is put into the adjacency matrix by dividing the 2D matrix into 4 equally sized quadrants, and recursively descending into one of the quadrants, until reaching a single entry of the matrix. In this generation process, the complexity of generating each edge is only .
We adopt the recursive decomposition design of RMAT, and further simplify and neuralize it, to make our model efficient and expressive. In our model, we generate edges following the node ordering row by row, so that each edge only needs to be put into the right place in a single row, reducing the process to 1D, as illustrated in fig:edge_gen. For any edge , the process of picking node as one of ’s neighbors starts by dividing the node index interval in half, then recursively descending into one half until reaching a single entry. Each corresponds to a unique sequence of decisions , where is the th decision in the sequence, and is the maximum number of required decisions to specify .
The probability of can then be formulated as
(4) 
where each is the probability of following the decision that leads to at step .
Let us use to denote the set of edges incident to node , and . Generating only a single edge is similar to hierarchical softmax (Mnih & Hinton, 2009), and applying the above procedure repeatedly can generate all of edges in time. But we can do better than that when generating all these edges.
Further improvement using binary trees. As illustrated in the left half of fig:row_gen, the process of jointly generating all of is equivalent to building up a binary tree , where each tree node corresponds to a graph node index interval , and for each the process starts from the root and ends in a leaf .
Taking this perspective, we propose a more efficient generation process for , which generates the tree directly instead of repeatedly generating each leaf through a path from the root. We propose a recursive process that builds up the tree following a depthfirst or inorder traversal order, where we start at the root, and recursively for each tree node : (1) decide if has a left child denoted as , and (2) if so recurse into and generate the left subtree, and then (3) decide if has a right child denoted as , (4) if so recurse into and generate the right subtree, and (5) return to ’s parent. This process is shown in alg:row_gen, which will be elaborated in next section.
Overloading the notation a bit, we use to denote the probability that tree node has a left child, when , or does not have a left child, when , under our model, and similarly define . Then the probability of generating or equivalently tree is
(5) 
This new process generates each tree node exactly once, hence the time complexity is proportional to the tree size , and it is clear that , since is the number of leaf nodes in the tree and is the max depth of the tree. The time saving comes from removing the duplicated effort near the root of the tree. When is large, as some fraction of when is one of the “hub” nodes in the graph, the tree becomes dense and our new generation process will be significantly faster, as the time complexity becomes close to while generating each leaf from the root would require time.
In the following, we present our approach to make this model fully autoregressive, making and depend on all the decisions made so far in the process of generating the graph, and make this model neuralized so that all the probability values in the model come from expressive deep neural networks.
2.2 Autoregressive conditioning for generating
In this section we consider how to add autoregressive conditioning to and when generating .
In our generation process, the decision about whether exists for a particular tree node is made after
, all its ancestors, and all the left subtrees of the ancestors are generated. We can use a topdown context vector
to summarize all these contexts, and modify to . Similarly, the decision about is made after generating and its dependencies, and ’s entire leftsubtree (see fig:row_gen right half for illustration). We therefore need both the topdown context , as well as the bottomup context that summarizes the subtree rooted at , if any. The autoregressive model for therefore becomes(6) 
and we can recursively define
(7)  
where and are two TreeLSTM cells (Tai et al., 2015) that combine information from the incoming nodes into a single node state, and and represents the embedding vector for the binary values “left” and “right”. We initialize , and discuss in the next section.
The distributions can then be parameterized as
(8)  
(9) 
2.3 Full autoregressive model
With the efficient recursive edge generation and autoregressive conditioning presented in sec:edge_gen and sec:row_gen respectively, we are ready to present the full autoregressive model for generating the entire adjacency matrix .
The full model will utilize the autoregressive model for as building blocks. Specifically, we are going to generate the adjacency matrix row by row:
(10) 
Let be the embedding that summarizes , suppose we have an efficient mechanism to encode into , then we can effectively use to generate and thus the entire process would become autoregressive. Again, since there are rows in total, using a chain structured LSTM would make the history length too long for large graphs. Therefore, we use an approach inspired by the Fenwick tree (Fenwick, 1994) which is a data structure that maintains the prefix sum efficiently. Given an array of numbers, the Fenwick tree allows calculating any prefix sum or updating any single entry in for a sequence of length . We build such a tree to maintain and update the prefix embeddings. We denote it as rowbinary forest as such data structure is a forest of binary trees.
fig:row2048 demonstrates one solution. Before generating the edgebinary tree , the embeddings that summarize each individual edgebinary tree will be organized into the rowbinary forest . This forest is organized into levels, with the bottom th level as edgebinary tree embeddings. Let be the th node in the th level, then
(11) 
where .
Embedding rowbinary forest One way to embed this rowbinary forest is to embed the root of each tree in this forest. As there will be at most one root in each level (otherwise the two trees in the same level will be merged into a larger tree for the next level), the number of tree roots will also be at most. Thus we calculate as follows:
(12) 
Here, is the bitlevel ‘and’ operator. Intuitively as in Fenwick tree, the calculation of any prefix sum of length requires the block sums that corresponds to each binary digit in the binary bits representation of integer . Recall the operator defined in sec:row_gen, here when , and equals to zero when . With the embedding of at each state served as ‘glue’, we can connect all the individual row generation modules in an autoregressive way.
Updating rowbinary forest It is also efficient to update such forest every time a new is obtained after the generation of . Such an updating procedure is similar to the Fenwick tree update. As each level of this forest has at most one root node, merging in the way defined in (11) will happen at most once per each level, which effectively makes the updating cost to be .
alg:graph_gen summarizes the entire procedure for sampling a graph from our model in a fully autoregressive manner.
BiGG generates a graph with node and edges in time. In the extreme case where , the overall complexity becomes . The generation of each edgebinary tree in sec:row_gen requires time complexity proportional to the number of nodes in the tree. The Fenwick tree query and update both take time, hence maintaining the data structure takes . The overall complexity is . For a sparse graph , hence . For a complete graph, where , each will be a full binary tree with leaves, hence and the overall complexity would be .
3 Optimization
In this section, we propose several ways to scale up the training of our autoregressive model. For simplicity, we focus on how to speed up the training with a single graph. Training multiple graphs can be easily extended.
3.1 Training with synchronizations
A classical autoregressive model like LSTM is fully sequential, which allows no concurrency across steps. Thus training LSTM with a sequence of length takes of synchronized computation. In this section, we show how to exploit the characteristic of BiGG to increase concurrency during training.
Given a graph , estimating the likelihood under the current model can be divided into four steps.

[wide,noitemsep,topsep=0.5pt,parsep=0pt,partopsep=0.5pt]

Computing : as the graph is given during training, the corresponding edgebinary trees are also known. From Eq (7) we can see that the embeddings of all nodes at the same depth of tree can be computed concurrently without dependence, and the synchronization only happens between different depths. Thus steps of synchronization is sufficient.

Computing : the rowbinary forest grows monotonically, thus the forest in the end contains all the embeddings needed for computing . Similarly, computing synchronizes steps.

Computing : as each runs an LSTM independently, this stage simply runs LSTM on a batch of sequences with length .
fig:par_train demonstrates this process. In summary, the four stages each take steps of synchronization. This allows us to train large graphs much more efficiently than a simple sequential autoregressive model.
3.2 Model parallelism
It is possible that during training the graph is too large to fit into memory. Thus to train on large graphs, model parallelism is more important than data parallelism.
To split the model, as well as intermediate computations, into different machines, we divide the adjacency matrix into multiple consecutive chunks of rows, where each machine is responsible for one chunk. It is easy to see that by doing so, Stage 1 and Stage 4 mentioned in sec:par_train can be executed concurrently on all the machines without any synchronization, as the edgebinary trees can be processed independently once the conditioning states like are made ready by synchronization.
fig:model_par illustrates the situation when training a graph with 7 nodes using 2 GPUs. During Stage 1, GPU 1 and 2 work concurrently to compute and , respectively. In Stage 2, the embeddings and are needed by GPU 2 when computing and . We denote such embeddings as ‘gmessages’. Note that such ‘gmessages’ will transit in the opposite direction when doing a backward pass in the gradient calculation. Passing ‘gmessages’ introduces serial dependency across GPUs. However as the number of such embeddings is upper bounded by depth of rowbinary forest, the communication cost is manageable.
3.3 Reducing memory consumption
Sublinear memory cost:
Another way to handle the memory issue when training large graphs is to recompute certain portions of the hidden layers in the neural network when performing backpropagation, to avoid storing such layers during the forward pass.
Chen et al. (2016) introduces a computation scheduling mechanism for sequential structured neural networks that achieves memory growth for an layer neural network.We divide rows as in sec:model_par. During the forward pass, only the ‘gmessages’ between chunks are kept. The only difference from the previous section is that the edgebinary tree embeddings are recomputed, due to the single GPU limitation. The memory cost will be:
(13) 
Here accounts for the memory holding the ‘gmessage’, and accounts for the memory of in each chunk. The optimal is achieved when , hence and the corresponding memory cost is . Also note that such sublinear cost requires only one additional feedforward in Stage 1, so this will not hurt much of the training speed.
Bits compression: The vector summarizes the edgebinary tree structure rooted at node for th row in adjacency matrix , as defined in Eq (7). As node represents the interval of the row, another equivalent way is to directly use , , the binary vector to represent . Each where is also a binary vector. Thus no neural network computation is needed in the subtree rooted at node . Suppose we use such bits representation for any nodes that have the corresponding interval length no larger than , then for a full edgebinary tree (, connects to every other node in graph) which has nodes in the tree, the corresponding storage required for neural part is which essentially reduces the memory consumption of neural network to of the original cost. Empirically we use in all experiments, which saves of the memory during training without losing any information in representation.
Note that to represent an interval of length , we use vector where
That is to say, we use ternary bit vector to encode both the interval length and the binary adjacency information.
3.4 Position encoding:
During generation of , each tree node of the edgebinary tree knows the span which corresponds to the columns it will cover. One way is to augment with the position encoding as:
(14) 
where PE is the position encoding using sine and cosine functions of different frequencies as in Vaswani et al. (2017). Similarly, the in Eq (12) can be augmented by in a similar way. With such augmentation, the model will know more context into the future, and thus help improve the generative quality.
Please refer to our released open source code located at
https://github.com/googleresearch/googleresearch/tree/master/bigg for more implementation and experimental details.4 Experiment
4.1 Model Quality Evaluation on Benchmark Datasets
In this part, we compare the quality of our model with previous work on a set of benchmark datasets. We present results on median sized general graphs with number of nodes ranging in 0.1k to 5k in sec:exp_general_graph, and on large SAT graphs with up to 20k nodes in sec:exp_sat_graph. In sec:ablation we perform ablation studies of BiGG with different sparsity and node orders.
4.1.1 General graphs
Datasets  Methods  
ErdosRenyi  GraphVAE  GraphRNNS  GraphRNN  GRAN  BiGG  
Grid  Deg.  0.79  7.07  0.13  1.12  8.23  
Clus.  2.00  7.33  3.73  7.73  3.79  
,  Orbit  1.08  0.12  0.18  1.03  1.59  
,  Spec.  0.68  1.44  0.19  1.18  1.62  
Protein  Deg.  5.64  0.48  4.02  1.06  1.98  
Clus.  1.00  7.14  4.79  0.14  4.86  
,  Orbit  1.54  0.74  0.23  0.88  0.13  
,  Spec.  9.13  0.11  0.21  1.88  5.13  
3D Point Cloud  Deg.  0.31  OOM  OOM  OOM  1.75  
Clus.  1.22  OOM  OOM  OOM  0.51  0.21  
,  Orbit  1.27  OOM  OOM  OOM  0.21  
,  Spec.  4.26  OOM  OOM  OOM  7.45  
Lobster  Deg.  0.24  2.09  3.48  9.26  3.73  
Clus.  3.82  7.97  4.30  0.00  0.00  0.00  
,  Orbit  2.42  1.43  2.48  2.19  7.67  
,  Spec.  0.33  3.94  6.72  1.14  2.71  
Err.  1.00  0.91  1.00  0.00  0.12  0.00 
The general graph benchmark is obtained from Liao et al. (2019) and part of it was also used in (You et al., 2018). This benchmark has four different datasets: (1) Grid, 100 2D grid graphs; (2) Protein, 918 protein graphs (Dobson & Doig, 2003); (3) Point cloud, 3D point clouds of 41 household objects (Neumann et al., 2013); (4) Lobster, 100 random Lobster graphs (Golomb, 1996), which are trees where each node is at most 2 hops away from a backbone path. tab:benchmark contains some statistics about each of these datasets. We use the same protocol as Liao et al. (2019) that splits the graphs into training and test sets.
Baselines: We compare with deep generative models including GraphVAE (Simonovsky & Komodakis, 2018), GraphRNN, GraphRNNS (You et al., 2018) and GRAN (Liao et al., 2019). We also include the Erdős–Rényi random graph model that only estimates the edge density. Since our setups are exactly the same, the baseline results are directly copied from Liao et al. (2019).
Evaluation:
We use exactly the same evaluation metric as
Liao et al. (2019), which compares the distance between the distribution of heldout test graphs and the generated graphs. We use maximum mean discrepancy (MMD) with four different test functions, namely the node degree, clustering coefficient, orbit count and the spectra of the graphs from the eigenvalues of the normalized graph Laplacian. Besides the four MMD metrics, we also use the error rate for Lobster dataset. This error rate reflects the fraction of generated graphs that doesn’t have Lobster graph property.Results: tab:benchmark reports the results on all the four datasets. We can see the proposed BiGG outperforms all other methods on all the metrics. The gain becomes more significant on the largest dataset, , the 3D point cloud. While GraphVAE and GraphRNN gets out of memory, the orbit metric of BiGG is 2 magnitudes better than GRAN. This dataset reflects the scalability issue of existing deep generative models. Also from the Lobster graphs we can see, although GRAN scales better than GraphRNN, it yields worse quality due to its approximation of edge generation with mixture of conditional independent distributions. Our BiGG improves the scalability while also maintaining the expressiveness.
4.1.2 SAT graphs
Method  VIG  VCG  LCG  

Clustering  Modularity  Variable  Clause  Modularity  Modularity  
Training24  0.53 0.08  0.61 0.13  5.30 3.79  5.14 3.13  0.76 0.08  0.70 0.07 
G2SAT  0.41 0.18 (23%)  0.55 0.18 (10%)  5.30 3.79 (0%)  7.22 6.38 (40%)  0.71 0.12 (7%)  0.68 0.06 (3%) 
BiGG0.1  0.49 0.21 (8%)  0.36 0.21 (41%)  5.30 3.79 (0%)  3.76 1.21 (27%)  0.58 0.16 (24%)  0.58 0.11 (17%) 
BiGG0.01  0.54 0.13(2%)  0.53 0.21 (13%)  5.30 3.79(0%)  4.28 1.50 (17%)  0.71 0.13 (7%)  0.67 0.09 (4%) 
. The neural baselines in tab:benchmark are not applicable due to scalability issue. We report mean and std of different test statistics, as well as the gap between true SAT instances.
In addition to comparing with general graph generative models, in this section we compare against several models that are designated for generating the Boolean Satisfiability (SAT) instances. A SAT instance can be represented using bipartite graph, , the literalclause graph (LCG). For a SAT instance with variables and clauses, it creates positive and negative literals, respectively. The canonical node ordering assigns 1 to for literals and to for clauses.
The following experiment largely follows G2SAT (You et al., 2019). We use the train/test split of SAT instances obtained from G2SAT website. This result in 24 and 8 training/test SAT instances, respectively. The size of the SAT graphs ranges from 491 to 21869 nodes. Note that the original paper reports results using 10 small training instances instead. For completeness, we also include such results in app:sat together with other baselines from You et al. (2019).
Baseline: We mainly compare the learned model with G2SAT, a specialized deep graph generative model for bipartite SAT graphs. Since BiGG is general purposed, to guarantee the generated adjacency matrix is bipartite, we let our model to generate the upper offdiagonal block of the adjacency matrix only, , .
G2SAT requires additional ‘template graph’ as input when generating the graph. Such template graph is equivalent to specify the node degree of literals in LCG. We can also enforce the degree of each node in our model.
Evaluation:
Following G2SAT, we report the mean and standard deviation of statistics with respect to different test functions. These include the modularity, average clustering coefficient and the scalefree structure parameters for different graph representations of SAT instances. Please refer to
Newman (2001, 2006); Ansótegui et al. (2009); Clauset et al. (2009) for more details. In general, the closer the statistical estimation the better it is.Results: Following You et al. (2019), we compare the statistics of graphs with the training instances in tab:g2sat_24. To mimic G2SAT which picks the best action among sampled options each step, we perform sampling variant (which is denoted BiGG). Such model has
probability to sample from Bernoulli distribution (as in Eq (
8) (9)) each step, and to pick best option otherwise. This is used to demonstrate the capacity of the model. We can see that the proposed BiGG can mostly recover the statistics of training graph instances. This implies that despite being general, the full autoregressive model is capable of modeling complicated graph generative process. We additionally report the statistics of generated SAT instances against the test set in app:sat, where G2SAT outperforms BiGG in 4/6 metrics. As G2SAT is specially designed for bipartite graphs, the inductive bias it introduces allows the extrapolation to large graphs. Our BiGG is general purposed and has higher capacity, thus also overfit to the small training set more easily.4.2 Scalability of BiGG
In this section, we will evaluate the scalability of BiGG regarding the time complexity, memory consumption and the quality of generated graphs with respect to the number of nodes in graphs.
4.2.1 Runtime and memory cost
Here we empirically verify the time and memory complexity analyzed in sec:model. We run BiGG on grid graphs with different numbers of nodes that are chosen from . In this case . Additionally, we also plot curves from the theoretical analysis for verification. Specifically, suppose the asymptotic cost function is w.r.t. graph size, then if there exist constants such that , then we can claim . In fig:test_time to 8, the two constants are tuned for better visualization.
fig:test_time reports the time needed to sample a single graph from the learned model. We can see the computation cost aligns well with the ideal curve of .
To evaluate the training time cost, we report the time needed for each round of model update, which consists of forward, backward pass of neural network, together with the update of parameters. As analyzed in sec:par_train, if there is a device with infinite FLOPS, then the time cost would be . We can see from fig:train_time that this analysis is consistent when graph size is less than 5,000. However as graph gets larger, the computation time grows linearly on a single GPU due to the limit of FLOPS and RAM.
Finally fig:memory_cost shows the peak memory cost during training on a single graph. We select the optimal number of chunks as suggested in sec:sqrtn, and thus the peak memory grows as . We can see such sublinear growth of memory can scale beyond sparse graphs with 100k of nodes.
4.2.2 Quality w.r.t graph size
0.5k  1k  5k  10k  50k  100k  

Erdős–Rényi  0.84  0.86  0.91  0.93  0.95  0.95 
GRAN  0.39  1.06  N/A  N/A  
BiGG 
In addition to the time and memory cost, we are also interested in the generated graph quality as it gets larger. To do so, we follow the experiment protocols in sec:exp_general_graph on a set of grid graph datasets. The datasets have the average number of nodes ranging in . We train on of the instances, and evaluate results on heldout instances. As calculating spectra is no longer feasible for large graphs, we report MMD with orbit test function in tab:large_grid. For neural generative models we compare against GRAN as it is the most scalable one currently. GRAN fails on training graphs beyond 50k nodes as runtime error occurs due to RAM or file I/O limitations. We can see the proposed BiGG still preserves high quality up to grid graphs with 100k nodes. With the latest advances of GPUs, we believe BiGG would scale further due to its superior asymptomatic complexity over existing methods.
4.3 Ablation study
In this section, we take a deeper look at the performance of BiGG with different node ordering in sec:ordering. We also show the effect of edge density to the generative performance in sec:sparsity.
4.3.1 BiGG with different node ordering
In the previous sections we use DFS or BFS orders. We find these two orders give consistently good performance over a variety of datasets. For the completeness, we also present results with other node orderings.
We use different orders presented in GRAN’s Github implementation. We use the protein dataset with spectralMMD as evaluation metric. See tab:ordering for the experimental results. In summary: 1) BFS/DFS give our model consistently good performance over all tasks, as it reduces the treewidth for BiGG (similar to Fig5 in GraphRNN) and we suggest to use BFS or DFS by default; 2) BiGG is also flexible enough to take any order, which allows for future research on deciding the best ordering.
DFS  BFS  Default  Kcore  Acc  Desc 

3.64  3.89  4.81  2.60  3.93  4.54 
Determining the optimal ordering is NPhard, and learning a good ordering is also difficult, as shown in the prior works. In this paper, we choose a single canonical ordering among graphs, as Li et al. (2018) shows that canonical ordering mostly outperforms variable orders in their Table 2, 3, while Liao et al. (2019) uses single DFS ordering (see their Sec 4.4 or github) for all experiments.
4.3.2 Performance on random graphs with decreasing sparsity
We here present experiments on ErdosRenyi graphs with on average 500 nodes and different densities. We report spectral MMD metrics for GRAN, GT and BiGG, where GT is the ground truth ErdosRenyi model for the data.
1%  2%  5%  10%  

GRAN  3.50  1.23  7.81  1.31 
GT  9.97  4.55  2.82  1.94 
BiGG  9.47  5.08  3.18  8.38 
Our main focus is on sparse graphs that are more common in the real world, and for which our approach can gain significant speed ups over the alternatives. Nevertheless, as shown in tab:sparsity, we can see that BiGG is consistently doing much better than GRAN while being close to the ground truth across different edge densities.
5 Conclusion
We presented BiGG, a scalable autoregressive generative model for general graphs. It takes complexity for sparse graphs, which substantially improves previous algorithms. We also proposed both time and memory efficient parallel training method that enables comparable or better quality on benchmark and large random graphs. Future work include scaling up it further while also modeling attributed graphs.
Acknowledgements
We would like to thank Azalia Mirhoseini, Polo Chau, Sherry Yang and anonymous reviewers for valuable comments and suggestions.
References
 Airoldi et al. (2009) Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. Mixed membership stochastic blockmodels. In Advances in Neural Information Processing Systems 21, pp. 33–40, 2009.
 Albert & lászló Barabási (2002) Albert, R. and lászló Barabási, A. Statistical mechanics of complex networks. Rev. Mod. Phys, 2002.
 Ansótegui et al. (2009) Ansótegui, C., Bonet, M. L., and Levy, J. On the structure of industrial sat instances. In International Conference on Principles and Practice of Constraint Programming, pp. 127–141. Springer, 2009.
 Barabási & Albert (1999) Barabási, A.L. and Albert, R. Emergence of scaling in random networks. science, 286(5439):509–512, 1999.
 Bojchevski et al. (2018) Bojchevski, A., Shchur, O., Zügner, D., and Günnemann, S. Netgan: Generating graphs via random walks. arXiv preprint arXiv:1803.00816, 2018.

Bradbury & Fu (2018)
Bradbury, J. and Fu, C.
Automatic batching as a compiler pass in pytorch.
In Workshop on Systems for ML, 2018.  Brockschmidt et al. (2018) Brockschmidt, M., Allamanis, M., Gaunt, A. L., and Polozov, O. Generative code modeling with graphs. arXiv preprint arXiv:1805.08490, 2018.
 Chakrabarti et al. (2004) Chakrabarti, D., Zhan, Y., and Faloutsos, C. Rmat: A recursive model for graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining, pp. 442–446. SIAM, 2004.
 Chen et al. (2016) Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
 Clauset et al. (2009) Clauset, A., Shalizi, C. R., and Newman, M. E. Powerlaw distributions in empirical data. SIAM review, 51(4):661–703, 2009.
 Dai et al. (2018) Dai, H., Tian, Y., Dai, B., Skiena, S., and Song, L. Syntaxdirected variational autoencoder for structured data. arXiv preprint arXiv:1802.08786, 2018.
 Dobson & Doig (2003) Dobson, P. D. and Doig, A. J. Distinguishing enzyme structures from nonenzymes without alignments. Journal of molecular biology, 330(4):771–783, 2003.
 Erdös & Rényi (1959) Erdös, P. and Rényi, A. On random graphs i. Publicationes Mathematicae Debrecen, 6:290–297, 1959.
 Erdős & Rényi (1960) Erdős, P. and Rényi, A. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
 Fenwick (1994) Fenwick, P. M. A new data structure for cumulative frequency tables. Software: Practice and experience, 24(3):327–336, 1994.
 Golomb (1996) Golomb, S. W. Polyominoes: puzzles, patterns, problems, and packings, volume 16. Princeton University Press, 1996.
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Jin et al. (2018) Jin, W., Barzilay, R., and Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kipf & Welling (2016) Kipf, T. N. and Welling, M. Variational graph autoencoders. arXiv preprint arXiv:1611.07308, 2016.

Kusner et al. (2017)
Kusner, M. J., Paige, B., and HernándezLobato, J. M.
Grammar variational autoencoder.
In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1945–1954. JMLR. org, 2017.  Leskovec et al. (2010) Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. Kronecker graphs: An approach to modeling networks. The Journal of Machine Learning Research, 11:985–1042, 2010.
 Li et al. (2018) Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.
 Liao et al. (2019) Liao, R., Li, Y., Song, Y., Wang, S., Hamilton, W., Duvenaud, D. K., Urtasun, R., and Zemel, R. Efficient graph generation with graph recurrent attention networks. In Advances in Neural Information Processing Systems, pp. 4257–4267, 2019.
 Liu et al. (2019) Liu, J., Kumar, A., Ba, J., Kiros, J., and Swersky, K. Graph normalizing flows. In Advances in Neural Information Processing Systems, pp. 13556–13566, 2019.
 Liu et al. (2018) Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. Constrained graph variational autoencoders for molecule design. In Advances in Neural Information Processing Systems, pp. 7795–7804, 2018.
 Looks et al. (2017) Looks, M., Herreshoff, M., Hutchins, D., and Norvig, P. Deep learning with dynamic computation graphs. arXiv preprint arXiv:1702.02181, 2017.
 Mnih & Hinton (2009) Mnih, A. and Hinton, G. E. A scalable hierarchical distributed language model. In Advances in neural information processing systems, pp. 1081–1088, 2009.
 Neubig et al. (2017) Neubig, G., Goldberg, Y., and Dyer, C. Onthefly operation batching in dynamic computation graphs. In Advances in Neural Information Processing Systems, pp. 3971–3981, 2017.
 Neumann et al. (2013) Neumann, M., Moreno, P., Antanas, L., Garnett, R., and Kersting, K. Graph kernels for object category prediction in taskdependent robot grasping. In Online Proceedings of the Eleventh Workshop on Mining and Learning with Graphs, pp. 0–6, 2013.
 Newman (2001) Newman, M. E. Clustering and preferential attachment in growing networks. Physical review E, 64(2):025102, 2001.
 Newman (2006) Newman, M. E. Modularity and community structure in networks. Proceedings of the national academy of sciences, 103(23):8577–8582, 2006.
 Robins et al. (2007) Robins, G., Pattison, P., Kalish, Y., and Lusher, D. An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2):173–191, 2007.
 Shi et al. (2020) Shi, C., Xu, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. Graphaf: a flowbased autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382, 2020.
 Simonovsky & Komodakis (2018) Simonovsky, M. and Komodakis, N. Graphvae: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pp. 412–422. Springer, 2018.
 Tai et al. (2015) Tai, K. S., Socher, R., and Manning, C. D. Improved semantic representations from treestructured long shortterm memory networks. arXiv preprint arXiv:1503.00075, 2015.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
 Watts & Strogatz (1998) Watts, D. J. and Strogatz, S. H. Collective dynamics of ‘smallworld’networks. nature, 393(6684):440, 1998.
 Xiao et al. (2016) Xiao, H., Huang, M., and Zhu, X. Transg: A generative model for knowledge graph embedding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2316–2325, 2016.

Xie et al. (2019)
Xie, S., Kirillov, A., Girshick, R., and He, K.
Exploring randomly wired neural networks for image recognition.
In
Proceedings of the IEEE International Conference on Computer Vision
, pp. 1284–1293, 2019.  You et al. (2018) You, J., Ying, R., Ren, X., Hamilton, W. L., and Leskovec, J. Graphrnn: Generating realistic graphs with deep autoregressive models. arXiv preprint arXiv:1802.08773, 2018.
 You et al. (2019) You, J., Wu, H., Barrett, C., Ramanujan, R., and Leskovec, J. G2sat: Learning to generate sat formulas. In Advances in Neural Information Processing Systems, pp. 10552–10563, 2019.
Appendix A Detailed discussion of related work
Traditional graph generative models
There has been a lot of work (Erdös & Rényi, 1959; Barabási & Albert, 1999; Albert & lászló Barabási, 2002; Chakrabarti et al., 2004; Robins et al., 2007; Leskovec et al., 2010; Airoldi et al., 2009) on generating graphs with a set of specific properties like degree distribution, diameter, and eigenvalues. While some of them (Erdös & Rényi, 1959; Barabási & Albert, 1999; Albert & lászló Barabási, 2002; Chakrabarti et al., 2004) focused on the statistical mechanics of random and complex networks, (Robins et al., 2007) proposed exponential random graph models, known as model, in estimating network models comes from the area of social sciences, statistics and social network analysis. However the model usually focuses on local structural features of networks. (Airoldi et al., 2009; Leskovec et al., 2010), on the other hand, model the structure of the network as a whole by combining a global model of dense patches of connectivity with a local model.
Most of these classical models are handengineered to model a particular family of graphs, and thus do not have the capacity to directly learn the generative model from observed data.
Deep graph generative models
Recent work on deep neural network based graph generative models are typically much more expressive than classic random graph models, at a higher computation cost.
Similar to visual data, there are VAE (Kingma & Welling, 2013) and GAN (Goodfellow et al., 2014) based graph generative models, like GraphVAEs (Kipf & Welling, 2016; Simonovsky & Komodakis, 2018) and NetGAN (Bojchevski et al., 2018). These models typically generates large parts (or all) of the adjacency matrix independently, making them unable to model complex graph structures.
Autoregressive models, on the other hand, generate a graph sequentially part by part, gaining expressivity but are typically more expensive as large graphs require large numbers of generation steps. Scalability is a constant important topic in this line of work, starting from (Li et al., 2018). More recent work GraphRNN (You et al., 2018) and GRAN (Liao et al., 2019) reported improved scalability and success generating graphs of up to 5,000 nodes. Our novel BiGG model advances this line of work by significantly improving both on model quality and scalability.
Appendix B More experiment details
Manual Batching
Though theoretically sec:par_train enables the parallelism, the commonly used packages like PyTorch and TensorFlow has limited support for autobatching operators. Also optimizing over a computation graph with
operators directly would be infeasible. Inspired by previous works (Looks et al., 2017; Neubig et al., 2017; Bradbury & Fu, 2018), we enable the parallelism by performing gathering and dispatching for tree or LSTM cells. This requires a single call of gemm on CUDA, instead of multiple gemv calls for the cell operations. The indices used for gathering and dispatching are computed in a customized c++ op for the sake of speed.Parameter sharing
As the maximum depth in BiGG is , it is feasible to have different parameters per different layers for tree or LSTM cells. However experimentally it didn’t affect the performance significantly. With this change, not all parameters are updated at each step, as the trees have different heights and some are updated more often than others, which could potentially make the learning harder. Also it would make the model times larger, and limits the potential of generating graphs larger than seen during training. So we still share the parameters of cells like other autoregressive models.
b.1 BiGG setup
Hyperparameters: For the proposed BiGG, we use embedding size together with position encoding for state representation. Learning rate is initialized to and decays to when training loss gets plateau.
All the experiments for both baselines and BiGG are carried out on a single V100 GPU. For the sublinear memory technique mentioned in sec:sqrtn, we choose (, the number of blocks) as small as possible such that the training would be able to fit in a single V100 GPU. This is to make the training feasible while minimizing the synchronization overhead. Empirically the block size used in sublinear memory technique is set to 6,000 for sparse graphs.
b.2 Baselines setup
We include all the baseline numbers from their original papers when applicable, as the most experimental protocols are the same. We run the following baselines when the corresponding numbers are not reported in the literature:

Erdős–Rényi: for the results in tab:large_grid, we estimate the graph edge density using the training grid graphs, and generate new Erdős–Rényi random graphs with this density and compare with the heldout test graphs;

GraphRNNS (You et al., 2018): the result on Lobster random graphs in tab:benchmark is obtained by training the GraphRNN using official code ^{1}^{1}1https://github.com/JiaxuanYou/graphgeneration
for 3000 epochs;

GRAN (Liao et al., 2019): the results of GRAN in tab:large_grid are obtained by training GRAN with grid graph configuration for 8,000 epochs or 3 days, whichever limit the model reaches first.

G2SAT (You et al., 2019): For results in tab:g2sat_24 and tab:g2sat_24_app, we run the code ^{2}^{2}2https://github.com/JiaxuanYou/G2SAT for 1000 epochs. For the test graph comparison in tab:g2sat_24_app, we use the 8 test graphs as ‘templategraphs’ for G2SAT to generate graphs.
Method  VIG  VIG  LCG  

Clustering  Modularity  Variable  Clause  Modularity  Modularity  
Training  0.50 0.07  0.58 0.09  3.57 1.08  4.53 1.09  0.74 0.06  0.67 0.05 
CA  0.33 0.08 (34%)  0.48 0.10 (17%)  6.30 1.53 (76%)  N/A  0.65 0.08 (12%)  0.53 0.05 (16%) 
PS(T=0)  0.82 0.04 (64%)  0.72 0.13 (24%)  3.25 0.89 (9%)  4.70 1.59 (4%)  0.86 0.05 (16%)  0.64 0.05 (2%) 
PS(T=1.5)  0.30 0.10 (40%)  0.14 0.03 (76%)  4.19 1.10 (17%)  6.86 1.65 (51%)  0.40 0.05 (46%)  0.41 0.05 (35%) 
G2SAT  0.41 0.09 (18%)  0.54 0.11 (7%)  3.57 1.08 (0%)  4.79 2.80 (6%)  0.68 0.07 (8%)  0.67 0.03 (6%) 
BiGG  0.50 0.07 (0%)  0.58 0.09 (0%)  3.57 1.08 (0%)  4.53 1.09 (0%)  0.73 0.06 (1%)  0.61 0.09 (9%) 
BiGG0.1  0.50 0.07 (0%)  0.58 0.09 (0%)  3.57 1.08 (0%)  4.53 1.09 (0%)  0.74 0.06 (0%)  0.65 0.07 (7%) 
Method  VIG  VIG  LCG  

Clustering  Modularity  Variable  Clause  Modularity  Modularity  
Test8  0.50 0.05  0.68 0.19  4.66 1.92  6.33 3.23  0.84 0.02  0.75 0.05 
G2SAT  0.22 0.10 (56%)  0.74 0.11 (9%)  4.66 1.92 (0%)  6.60 4.15 (5%)  0.80 0.09 (5%)  0.68 0.04 (9%) 
BiGG  0.37 0.19 (26%)  0.28 0.12 (58%)  4.66 1.92 (0%)  2.74 0.37 (57%)  0.44 0.07 (48%)  0.48 0.09 (36%) 
b.3 More experimental results on SAT graphs
In main paper we reported the results on 24 training instances with the evaluation metric used in You et al. (2019). Here we include the results on the subset which is used in the original G2SAT paper (You et al., 2019). This subset contains 10 instances, with 82 to 1122 variables and 327 to 4555 clauses. This result in graphs with 6,799 nodes at most. tab:subset summarizes the results, together with other baseline results that are copied from You et al. (2019). We can see our model can almost perfectly recover the training statistics if the generative process is biased more towards high probability region (, BiGG0.1 which has 90% chance to use greedy decoding at each step). This again demonstrate that our model is flexible enough to capture complicated distributions of graph structures. On the other hand, as tab:g2sat_24_app shows, the G2SAT beats BiGG on 4 out of 6 metrics. Since G2SAT is a specialized generative model for bipartite graphs, it enjoys more of inductive bias towards this specific problem, and thus would work better in the lowdata and extrapolation scenario. Our general purposed model thus would be expected to require more training data due to its high capacity in model space.