1 Introduction
Graphs are ubiquitous and fundamental data structures in the real world, appearing in various fields such as social science, chemistry, and biology. For example, knowledge bases, molecular structures, and relations between objects in an image can be represented as graphs, which allows us to efficiently capture the essence of these data. Moreover, in applications such as drug discovery or network simulation, graph generative models that can approximate a distribution over graphs in a specific domain and draw new samples from it are very important. Compared with generation tasks for images or natural language, however, graph generation tasks are considerably more difficult owing to the necessity of modeling complex local/global dependencies between nodes and edges, as well as the intractable properties of graphs themselves, such as discreteness, a variable number of nodes and edges, and uncertainty of node ordering.
Although graph generation tasks pose several challenges as described above, scalability is one of the most important for applications in a wide range of real-world domains. In this work, we define scalability from three perspectives: graph scalability, data scalability, and label scalability. Graph scalability denotes scalability to large graphs with many nodes, data scalability denotes scalability to large datasets with many samples, and label scalability denotes scalability to graphs with many node/edge labels, all within the limits of practical time/space complexity, especially during training. In addition to these perspectives, it is possible to consider other viewpoints, such as the number of edges, the diameter of graphs, or the total number of nodes in datasets. However, because these viewpoints are closely related to the above three, we adopt these three as a mutually exclusive and collectively exhaustive division.
In recent years, an increasing number of graph generative models based on machine learning have been proposed, which have demonstrated strong performance in several tasks, such as link prediction, molecular property optimization, and network structure optimization Li et al. (2018); You et al. (2018b); Grover et al. (2018); Luo et al. (2018); You et al. (2018a); Liu et al. (2018); Simonovsky and Komodakis (2018); Wang et al. (2017); Li et al. (2018). However, to the best of our knowledge, no model is scalable in all three contexts. For example, DeepGMG Li et al. (2018) can generate only small graphs with fewer than 40 nodes, GraphRNN You et al. (2018b) cannot generate node/edge-labeled graphs, and both models are poorly suited to training parallelization.
In this work, we propose the Graph Generative Model with Graph Attention Mechanism (GRAM) for generating real-world graphs, which is scalable in all three contexts, especially during training. Given a set of graphs, our model approximates their distribution in an unsupervised manner. To achieve graph scalability, we employ an autoregressive, sequential generation process that is flexible to variable numbers of nodes and edges, formulate the likelihood of graphs in a simple manner to simplify the generation process, and exploit properties of real-world graphs such as community structure and sparseness of edges to reduce the computational cost of training. With regard to data scalability, we apply a novel graph attention mechanism that is a simple extension to graphs of the attention mechanism used in the field of natural language processing Vaswani et al. This graph attention mechanism does not include sequentially dependent hidden states as a Recurrent Neural Network (RNN) does, which improves the parallelizability of training significantly. Compared with other graph attention mechanisms such as Veličković et al. (2018); Abu-El-Haija et al. (2018); Ishiguro et al. (2019), ours is architecturally simpler, computationally lightweight, and general; its applicability is not limited to generation tasks. Finally, for label scalability, we formulate the likelihood of graphs assuming multiple node/edge labels and use graph convolution and graph attention layers whose number of parameters does not depend directly on the number of labels.
Moreover, we introduce a non-domain-specific evaluation metric for generation tasks of node/edge-labeled graphs. Because no such metric currently exists, prior works relied on domain-specific metrics or visual inspection, which made unified and objective evaluation difficult. Thus, we construct a general evaluation metric that dispenses with domain-specific knowledge and considers node/edge labels, combining a graph kernel and Maximum Mean Discrepancy (MMD) Gretton et al. (2012). Although a similar statistical test based on MMD via a graph kernel is used for schema matching of protein graphs in Kriegel et al. (2006), this work, to the best of our knowledge, is the first to use a graph kernel and MMD as an evaluation metric for graph generation tasks.
Our experiments on datasets of protein and molecular graphs, each associated with one aspect of scalability, demonstrated that our models can scale up to handle large graphs and datasets that previous methods had difficulty handling, with results that were competitive with or superior to those of baseline methods.
In summary, the contributions of this work are as follows:

We propose GRAM, a generative model for real-world graphs that is scalable, especially during training.

We propose a novel graph attention mechanism that is general and architecturally simple.

We define scalability in graph generation tasks from three perspectives: the number of nodes, the amount of data, and the number of labels.

We construct a non-domain-specific, general evaluation metric for node/edge-labeled graph generation tasks, combining a graph kernel and MMD.
2 Related Work
Although there are several traditional graph generative models Erdös and Rényi (1959); Albert and Barabási (2002); Leskovec et al. (2010); Robins et al. (2007); Airoldi et al. (2009), we focus here on the latest machine learning-based approaches, which outperform traditional methods in various graph generation tasks.
In terms of their generation process, existing graph generative models are classified into at least two types: tensor generation models and sequential generation models. Tensor generation models such as Simonovsky and Komodakis (2018); De Cao and Kipf (2018); Grover et al. (2018) generate a graph by outputting tensors that correspond to the graph. Although architecturally simple and easy to optimize for small graphs, these models have difficulty generating large graphs owing to the non-unique correspondence between a graph and tensors or because of limitations in the predefined maximum number of nodes. In contrast, sequential generation models such as Li et al. (2018); You et al. (2018b) generate graphs by adding nodes and edges one by one, which alleviates the above problems and allows larger graphs to be generated. To achieve graph scalability, we employ the latter. However, to generate a graph with $n$ nodes and $m$ edges, DeepGMG Li et al. (2018) requires at least $O(mn^2)$ operations because of its complex generation process. Although GraphRNN You et al. (2018b) reduces its time complexity to $O(nM)$ with a constant value $M$, utilizing the properties of breadth-first search and an RNN, it cannot generate node/edge-labeled graphs, mainly because it does not calculate the features of each node and relies mainly on the information in the hidden states of the RNN. Moreover, these models include sequentially dependent hidden states, which makes training parallelization difficult. In contrast, our model employs a graph attention mechanism without sequentially dependent hidden states, which improves the parallelizability of training significantly. In addition, by utilizing properties of graphs such as community structure and sparseness of edges, we reduce the complexity to almost a linear order of $n$ while conducting a rich feature extraction.
As another classification, there are unsupervised learning approaches and reinforcement learning approaches. Given samples of graphs, unsupervised learning models Li et al. (2018); You et al. (2018b); Simonovsky and Komodakis (2018); Grover et al. (2018) approximate the distribution of these graphs in an unsupervised manner. Reinforcement learning models Luo et al. (2018); You et al. (2018a); De Cao and Kipf (2018); Liu et al. (2018); Li et al. (2018) learn to generate graphs that maximize a given objective function called the return. Although reinforcement learning approaches have demonstrated promising results on several tasks such as molecular generation and network architecture optimization, we employ an unsupervised approach, which is considered more advantageous in cases where new samples similar to the training samples of arbitrary graphs are required or where reward functions cannot be designed (e.g., a family of pharmaceutical molecules against a certain disease is known, but the mechanism by which they work is not entirely known).
3 Proposed Method
In this section, we first describe notations of graphs and define the graph generation task we tackle. Then, we describe the graph attention mechanism we propose, our scalable graph generative model GRAM, and its variant. GRAM approximates a distribution over graphs in an autoregressive manner and utilizes the novel graph attention mechanism to improve the parallelizability of training significantly. One step in the generation process of GRAM is illustrated in Figure 2.
3.1 Notations and Problem Definition
In this work, we define a graph $G = (V, E)$ as a data structure that consists of its node set $V$ and edge set $E$. In the case of a directed graph, we define the direction of an edge $(v_i, v_j)$ as from $v_i$ to $v_j$. We assume nodes and edges are associated with multiple node/edge labels and denote the numbers of them as $a$ and $b$, respectively. For graph representation, we employ a tensor representation. Specifically, given a node ordering $\pi$, we represent a pair $(G, \pi)$ comprising a graph and a node ordering as a pair of tensors $(X^{\pi}, A^{\pi})$. Note that $\pi$ is a permutation function over $V$. $X^{\pi}$ stores information about nodes, whose $i$-th row is a one-hot vector corresponding to the label of $v_{\pi(i)}$. $A^{\pi}$ stores information about edges, whose $(i, j)$-th element is a one-hot vector corresponding to the label of $(v_{\pi(i)}, v_{\pi(j)})$. If no edge exists between $v_{\pi(i)}$ and $v_{\pi(j)}$, we replace it with a zero vector. Note that $(G, \pi)$ and $(X^{\pi}, A^{\pi})$ correspond uniquely. Finally, for simplicity, we do not consider self-looping or multiple edges, and we assume that the independent elements of $A^{\pi}$ are limited to its upper triangular elements. A possible extension to self-looping edges is to add a step to estimate them, and for multiple edges, we can prepare additional labels that correspond to them.
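As an illustration of the tensor representation described above, the following Python sketch builds a one-hot node tensor and an upper-triangular one-hot edge tensor for a small undirected labeled graph under a given node ordering. The function name `to_tensors`, the dictionary-based graph encoding, and the toy labels are assumptions made for this example and do not come from the original implementation.

```python
def to_tensors(nodes, edges, ordering, node_labels, edge_labels):
    """Build one-hot node tensor X and edge tensor A (hypothetical helper).

    nodes:    dict mapping node id -> node label
    edges:    dict mapping (u, v) -> edge label (undirected, no self-loops)
    ordering: list of node ids, playing the role of the node ordering
    """
    n = len(ordering)
    pos = {v: i for i, v in enumerate(ordering)}
    X = [[0] * len(node_labels) for _ in range(n)]
    A = [[[0] * len(edge_labels) for _ in range(n)] for _ in range(n)]
    for v, label in nodes.items():
        X[pos[v]][node_labels.index(label)] = 1
    for (u, v), label in edges.items():
        i, j = sorted((pos[u], pos[v]))  # store only the upper triangle
        A[i][j][edge_labels.index(label)] = 1
    return X, A


# Toy graph: a - b - c with labeled nodes and edges.
nodes = {"a": "C", "b": "O", "c": "C"}
edges = {("a", "b"): "double", ("b", "c"): "single"}
X, A = to_tensors(nodes, edges, ["b", "a", "c"], ["C", "O"], ["single", "double"])
```

A pair of nodes without an edge keeps the all-zero vector, so the absence of an edge needs no extra label in this encoding.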
The graph generation task we target is, given samples from a distribution of graphs $p(G)$, to approximate $p(G)$ in such a way that we can draw new samples from it.
3.2 Graph Attention Mechanism
The aim of employing a graph attention mechanism is to efficiently take in the information of distant nodes by attention. Our graph attention mechanism is an extension to graphs of the attention mechanism used in the field of natural language processing Vaswani et al. One of the significant differences between graphs and sentences is that we cannot define absolute coordinates in a graph, which makes it difficult to embed positional information into nodes as in Vaswani et al. To alleviate this problem, we focus on the multi-head attention in Vaswani et al. and introduce bias terms in the projection to subspaces, which are functions of the shortest path length between two nodes. Following Vaswani et al., we denote the matrices into which query, key, and value vectors are stacked as $Q$, $K$, and $V$, respectively. In addition, we assume the $i$-th row corresponds to the feature vector of node $v_i$. With these notations, the operation in the graph attention mechanism is defined as
(1)  
where represents the number of projections to subspaces. Additionally, the operation of is defined as
(2)  
Each attention weight is calculated as
(3)  
The parameters to be updated are four weight matrices, , , and , and three bias terms, , and , where , and represent dimensions of input, output, key, and value vector, respectively. We used different parameters for calculating each head.
To consider the geometric relation between two nodes, we used functions of the shortest path length between the two nodes as the bias terms. Other possible approaches include, for example, using network flow instead of shortest path length or utilizing functional approximation. Furthermore, because path length is discrete, we used different weight parameters for each path length.
In particular, setting all bias terms to zero yields the original multi-head attention. As in Vaswani et al., we added a two-layer feedforward neural network (FNN), which was applied after the above operations. We refer to these operations, including the FNN, as the graph attention mechanism in the following sections.
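To make the idea of path-length-dependent bias terms concrete, the following Python sketch implements a single self-attention head in which a bias vector, looked up by the shortest path length between two nodes, is added to each key and value before the usual scaled dot-product attention. This is one plausible form under stated assumptions, not the exact equations of this section: the learned projection matrices, the query bias, and the multi-head concatenation are omitted, the graph is assumed to be connected, and all function and parameter names are hypothetical.

```python
import math
from collections import deque


def shortest_path_lengths(adj):
    """All-pairs shortest path lengths by BFS; adj maps node index -> neighbors."""
    n = len(adj)
    dist = [[None] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if dist[s][w] is None:
                    dist[s][w] = dist[s][u] + 1
                    queue.append(w)
    return dist


def graph_attention_head(feats, dist, bias_k, bias_v):
    """One attention head with path-length-dependent key/value biases (assumed form).

    feats:          node feature vectors, used as queries, keys, and values
    bias_k, bias_v: dict mapping path length -> bias vector
    """
    d = len(feats[0])
    out = []
    for i, q in enumerate(feats):
        scores = []
        for j, k in enumerate(feats):
            key = [kx + bx for kx, bx in zip(k, bias_k[dist[i][j]])]
            scores.append(sum(a * b for a, b in zip(q, key)) / math.sqrt(d))
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        o = [0.0] * d
        for j, v in enumerate(feats):
            for c in range(d):
                o[c] += weights[j] * (v[c] + bias_v[dist[i][j]][c])
        out.append(o)
    return out


adj = {0: [1], 1: [0, 2], 2: [1]}  # path graph 0 - 1 - 2
dist = shortest_path_lengths(adj)
zero_bias = {length: [0.0, 0.0] for length in range(3)}
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
plain = graph_attention_head(feats, dist, zero_bias, zero_bias)
```

With all biases set to zero, as in `zero_bias`, the function reduces to ordinary scaled dot-product self-attention over the node features.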
3.3 GRAM: Scalable Generative Models for Graphs
We first describe the likelihood formulation of graphs and then provide an overview of GRAM and its variant.
3.3.1 Likelihood Formulation of Graphs
To approximate distributions over graphs in an autoregressive manner, we formulate the likelihood of a pair comprising a graph and a node ordering and decompose it into a product of conditional probabilities. Because the pair of a graph and a node ordering uniquely defines the corresponding pair of tensors and vice versa, the likelihood of the former equals that of the latter, which is decomposed into a product of conditional probabilities as
(4) 
where represents a partial tensor and "" represents all indices and other notations follow as these. We omit the last dimension of each tensor for simplicity. In addition, uniquely defines a subgraph of , which we denote as . To keep notations clear, we represent as in the following equations. With this formulation, the likelihood of a graph is defined as marginal, .
During training, given a set of samples from the graph distribution, GRAM approximates the joint probability of a graph and a node ordering. Although the choice of the node ordering leaves room for discussion, we adopt the breadth-first search used in You et al. (2018b). More precisely, our model approximates this joint probability by approximating the conditional probabilities in Equation 4. As for , we used a sampling distribution from training data.
On testing, we sequentially sample and from the approximated distribution and we get when EOS is output. This can be viewed as sampling from the approximated distribution . Especially by focusing only on , we can view it as sampling from the marginal .
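The sequential view of a graph under a breadth-first node ordering can be sketched as follows. The code computes a BFS ordering and then lists, for each step, the newly added node together with its edges to previously generated nodes, which are the quantities an autoregressive model would predict one step at a time. The function names and the adjacency-list encoding are assumptions for illustration only; node and edge labels are omitted.

```python
from collections import deque


def bfs_ordering(adj, start):
    """Return the nodes of a connected graph in breadth-first order from start."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order


def generation_steps(adj, order):
    """List (new node, neighbors among already generated nodes) for each step."""
    placed, steps = [], []
    for v in order:
        steps.append((v, [u for u in placed if u in adj[v]]))
        placed.append(v)
    return steps


adj = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
order = bfs_ordering(adj, "a")
steps = generation_steps(adj, order)
```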
3.3.2 Model Overview
Here, we provide an overview of GRAM. For simplicity, we assume the node ordering $\pi$ to be the identity permutation (i.e., $\pi(i) = i$) in the following sections.
The architecture of GRAM consists of three networks: a feature extractor, a node estimator, and an edge estimator. The feature extractor calculates node feature vectors and, by summing them up, a graph feature vector; the node estimator predicts the label of the newly added node; and the edge estimator predicts the labels of the edges between previously generated nodes and the new node. In other words, the node estimator and the edge estimator approximate the respective conditional probabilities in Equation 4.
3.3.3 Feature Extractor
Given a pair of tensors , the feature extractor calculates the node feature vectors and a graph feature vector of the corresponding subgraph . It consists of feature extraction blocks and a subsequent graph pooling layer. We used in the experiments.
A feature extraction block is composed of a graph convolution layer and a graph attention layer stacked in parallel, as illustrated in Figure 2. To keep compatibility with the community-oriented variant described in Section 3.4, we used divided feature vectors to keep the output of graph attention layers from flowing into the input of the graph convolution layers. We aim to extract local information by using graph convolution layers and global information by using graph attention layers. Although there are various types of graph convolutions, we employed the one used in Johnson et al. (2018). (Footnote 1: In Johnson et al. (2018), directed graphs are assumed as input. In the case of undirected graphs, we consider only and ignore , illustrated in Figure 3 in Johnson et al. (2018).) Roughly, the selected graph convolution type convolves the features of neighboring nodes and edges into each node and edge in a graph. A graph attention layer performs self-attention, in which query/key/value vectors are all node feature vectors. To reduce the computational cost, we restricted the range of attention to neighboring nodes. Also, to exploit low-level features of graphs, we stacked the degree, the clustering coefficient, and the distance from the center of the graph into each node vector, which are all non-domain-specific statistics in graphs.
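Two of the low-level statistics mentioned above, the degree and the local clustering coefficient, can be computed directly from an adjacency list, as in the following sketch. The function name and graph encoding are assumptions for this example, and the distance from the center of the graph is left out for brevity.

```python
def degree_and_clustering(adj):
    """Return {node: (degree, local clustering coefficient)} for an undirected graph."""
    feats = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        # Count edges among the neighbors of v (each unordered pair once).
        links = sum(1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:] if b in adj[a])
        cc = 2.0 * links / (k * (k - 1)) if k > 1 else 0.0
        feats[v] = (k, cc)
    return feats


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
stats = degree_and_clustering(adj)
```

Such statistics could then be appended to each node's label vector before the first feature extraction block.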
The graph pooling layer computes a graph feature vector by summing up all node feature vectors in the graph. To improve its expressive power in aggregation, we used a gating network as in Li et al. (2018). Specifically, the operation in the graph pooling layer is defined as
(5) 
where is a two-layer FNN and is a sigmoid function.
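A gated sum of node features, in the spirit of the gating network of Li et al. (2018), can be sketched as follows. The exact form of Equation 5 is not reproduced here; the sketch assumes that each node feature vector is multiplied elementwise by a sigmoid gate computed from that same vector and that the gated vectors are then summed. The name `gate_fn` stands in for the two-layer FNN and is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_sum_pooling(node_feats, gate_fn):
    """Sum node feature vectors after elementwise sigmoid gating (assumed form)."""
    dim = len(node_feats[0])
    pooled = [0.0] * dim
    for h in node_feats:
        gate = [sigmoid(g) for g in gate_fn(h)]
        for c in range(dim):
            pooled[c] += gate[c] * h[c]
    return pooled


# Placeholder gate network that always outputs zeros, so every gate equals 0.5.
constant_gate = lambda h: [0.0] * len(h)
graph_feat = gated_sum_pooling([[1.0, 2.0], [3.0, 4.0]], constant_gate)
```

In a trained model the gate would depend on the node features, letting the pooling emphasize some nodes over others instead of weighting all of them equally.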
3.3.4 Node Estimator
The node estimator predicts the label of the new node . Concretely, it computes from as
(6) 
where is a three-layer FNN whose last layer has a dimension equal to the number of node labels plus one (for EOS) and uses a softmax activation function. We terminate the generation process when EOS is output.
3.3.5 Edge Estimator
The edge estimator predicts the labels of the edges between previously generated nodes and the new node . More precisely, it computes from and previously predicted edge labels as
(7) 
where is the embedded vector of the predicted label of . is calculated by a graph attention in which the query vector is and the value/key vectors are , where denotes the embedded vector of the predicted label of . Thereby, we aim to express the dependency to in . We use a three-layer FNN as , in which the last layer has a dimension equal to the number of edge labels plus one (for EOS), with a softmax function as the activation function. When EOS is output, we do not add an edge.
However, to generate a graph with $n$ nodes, the edge estimation process requires $O(n)$ operations for one edge, resulting in $O(n^2)$ operations for one step and $O(n^3)$ operations in total, which is a significant obstacle to achieving graph scalability. Therefore, we resort to a remedy obtained from empirical observation. Specifically, our inspection of each attention weight in the edge estimation showed that most of the weights for nodes that were predicted to have no edge to the new node were nearly zero, while the weights for nodes that were predicted to have edges took larger values. This observation suggests that, among the previously processed nodes, those that have edges to the new node are important for the prediction and the others are not. Hence, we deterministically set the weights of the latter to zero, which means we do not consider them in graph attention. With this approximation (i.e., ), we reduce the number of required operations to , where . In addition, because holds, we can assume when , which is often the case with many real-world graphs.
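The effect of discarding nodes that were predicted to have no edge can be illustrated with a masked softmax: attention weights are computed only over the nodes flagged as connected, and all other weights are fixed to zero. Whether the remaining weights are renormalized in exactly this way is an assumption of this sketch, and the function name is hypothetical.

```python
import math


def masked_attention_weights(scores, has_edge):
    """Softmax restricted to nodes predicted to share an edge with the new node."""
    kept = [s for s, e in zip(scores, has_edge) if e]
    if not kept:
        return [0.0] * len(scores)  # nothing to attend to yet
    top = max(kept)
    exps = [math.exp(s - top) if e else 0.0 for s, e in zip(scores, has_edge)]
    total = sum(exps)
    return [x / total for x in exps]


weights = masked_attention_weights([1.0, 5.0, 1.0], [True, False, True])
```

Because only the flagged nodes enter the sum, the cost of one edge estimate scales with the number of predicted edges rather than with the number of previously generated nodes.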
3.4 cGRAM: CommunityOriented Variant
In addition to GRAM described above, we present a community-oriented variant, cGRAM, in which we aim to reduce the computational cost significantly by imposing restrictions based on community structure.
In real-world graphs, the distribution of edges is biased, not uniform, which allows us to divide the node set into subsets called communities, in which nodes are densely connected with each other Fortunato and Castellano (2012). We utilize this property to reduce computational complexity. Specifically, denoting communities in a subgraph as , in this model, we assume that the nodes that have edges to the new node are restricted to within the same community and that the label of depends only on . This assumption is mainly based on the property of community structure that intra-community connections are denser than inter-community connections and that nodes in the same community have stronger relations than those in other communities Fortunato and Castellano (2012). Concretely, we modify the graph pooling of Equation 5 to a community pooling defined as
(8) 
and we replace with in the subsequent operations. Additionally, we restrict the range of graph attention to the same community. With these assumptions and restrictions, on training we only have to consider nodes in and their neighbors in the feature extraction, the node estimation, and the edge estimation, which improves computational performance significantly. On testing, we assume the probability that belongs to each community is equal (i.e., ) and select the target community randomly.
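The community-restricted pooling and the uniform choice of a target community at test time can be sketched as follows. The sketch sums the feature vectors of the nodes in one community and omits the gating network for brevity; the community assignment is assumed to be given (for example, by a community detection algorithm), and all names are hypothetical.

```python
import random


def community_pooling(node_feats, community_of, target):
    """Sum the feature vectors of the nodes belonging to the target community."""
    dim = len(next(iter(node_feats.values())))
    pooled = [0.0] * dim
    for v, h in node_feats.items():
        if community_of[v] == target:
            for c in range(dim):
                pooled[c] += h[c]
    return pooled


def pick_target_community(community_of, rng):
    """Choose one community uniformly at random."""
    return rng.choice(sorted(set(community_of.values())))


node_feats = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [5.0, 5.0]}
community_of = {0: "c1", 1: "c1", 2: "c2"}
pooled_c1 = community_pooling(node_feats, community_of, "c1")
target = pick_target_community(community_of, random.Random(0))
```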
4 Experiments
To evaluate the performance and scalability of our models, we compared GRAM and cGRAM with prior graph generative models on generation tasks for two types of real-world graphs, protein graphs and molecular graphs, which correspond to graph scalability and data scalability, respectively. As for label scalability, in this work we only considered whether a model could handle multiple node/edge labels, because performance would be strongly affected by architecture design and thus a fair comparison is difficult.
4.1 Protein Graphs: Graph Scalability
To evaluate graph scalability, we performed an experiment using protein graphs that contain a relatively large number of nodes. The experimental settings largely followed You et al. (2018b).
We used a dataset of 918 protein graphs Dobson and Doig (2003), in which each node corresponds to an amino acid and two nodes that are chemically connected or spatially close are connected by an edge. In detail, the number of nodes is and the number of node/edge labels is . We allocated 80% of the graphs for training and the rest for testing.
We compared our models with GraphRNN and GraphRNN-S You et al. (2018b) as recent deep learning-based baselines. In addition, we reported the results of the Erdös-Rényi model (ER) Erdös and Rényi (1959), the Barabási-Albert model (BA) Albert and Barabási (2002), Kronecker Graphs Leskovec et al. (2010), and Mixed Membership Stochastic Blockmodels (MMSB) Airoldi et al. (2009) as traditional baselines.
To measure the quality of the generated graphs, we used MMD Gretton et al. (2012) on three types of graph features: degree, clustering coefficient, and orbit counts, as proposed in You et al. (2018b). These metrics allowed us to measure the distance between the distribution of the generated graphs and that of real graphs quantitatively. We reported the average of the squared MMD score over three runs.
Table 1 summarizes the results. (Footnote 2: We used the code provided by http://github.com/snapstandard/GraphRNN. To correctly calculate the MMD score and for a fair comparison, we omitted the cleaning post-process in the code and re-evaluated all models.) Our models performed better than most of the baselines in terms of clustering coefficient and orbit counts, and cGRAM was the best. Considering that these are middle/high-level features of graphs, this is likely due to employing graph attention in the edge estimation, which can consider the relative position between two nodes. However, our models performed poorly in terms of degree compared with GraphRNN, which is likely because the way that nodes are connected involves their spatial proximity; because this is difficult to estimate and noisy, it is possible that the rich feature extraction backfired and thus failed to capture low-level features. On average, cGRAM performed the best. We presume the reason is that focusing on one community made training stable and led to better performance because the size of a community is smaller and relatively more constant than that of a graph.
Table 1: Squared MMD scores on protein graphs.

| Method | Deg. | Clus. | Orbit |
| --- | --- | --- | --- |
| ER Erdös and Rényi (1959) | 0.154 | 1.788 | 1.098 |
| BA Albert and Barabási (2002) | 1.452 | 1.713 | 0.914 |
| Kronecker Leskovec et al. (2010) | 1.048 | 1.799 | 0.668 |
| MMSB Airoldi et al. (2009) | 0.623 | 1.793 | 1.261 |
| GraphRNN-S You et al. (2018b) | 0.274 | 0.293 | 0.204 |
| GraphRNN You et al. (2018b) | 0.087 | 1.055 | 0.774 |
| GRAM (ours) | 0.257 | 0.431 | 0.101 |
| cGRAM (ours) | 0.171 | 0.119 | 0.077 |
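
Table 2: Results on molecular graphs.

| Method | Valid (%) | Novel (%) | GKMMD |
| --- | --- | --- | --- |
| DeepGMG Li et al. (2018) | 89.2 | 89.1 | 0.0200 |
| GRAM (ours) | 94.0 | 94.0 | 0.0192 |
| cGRAM (ours) | 96.1 | 94.4 | 0.0214 |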
4.2 Molecular Graphs: Data Scalability
To evaluate data scalability, we conducted an experiment using a dataset that consists of 250k molecular graphs, following Li et al. (2018).
We used the 250k samples from ZINC dataset Sterling and Irwin (2015) provided by Kusner et al. (2017), where nodes and edges correspond to heavy atoms and chemical bonds, respectively. The number of nodes is and the number of node/edge labels is .
We used DeepGMG Li et al. (2018) as a baseline. (Footnote 3: To evaluate DeepGMG with GKMMD, we used generated molecular graphs provided by the author of Li et al. (2018).) With an emphasis on the ability to approximate distributions of arbitrary real-world graphs, we excluded other graph generative models that use domain-specific knowledge or are based on reinforcement learning.
Following Li et al. (2018), we generated 100k graphs and calculated the percentage of graphs that were valid as molecules and the percentage of novel samples that did not appear in the training set. In addition, combining a graph kernel and MMD, we constructed a general evaluation metric to measure the distance between the distribution of the generated graphs and that of real graphs, both of which are node/edge-labeled.
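MMD is a test statistic to determine whether two sets of samples from distributions $p$ and $q$ are derived from the same distribution (i.e., $p = q$), and especially when its function class is a unit ball in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, we can derive the squared MMD as
(9) $\mathrm{MMD}^2(p, q) = \mathbb{E}_{x, x' \sim p}[k(x, x')] - 2\,\mathbb{E}_{x \sim p,\, y \sim q}[k(x, y)] + \mathbb{E}_{y, y' \sim q}[k(y, y')]$
where $k$ is the associated kernel function Gretton et al. (2012).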
Graph kernels are kernel functions over graphs. We used the Neighborhood Subgraph Pairwise Distance Kernel (NSPDK) Costa and Grave (2010), which measures the similarity of two graphs by matching pairs of subgraphs with different radii and distances. Because NSPDK is a positive-definite kernel Costa and Grave (2010), it defines a unique RKHS Aronszajn (1950). This fact allows us to calculate the squared MMD using Equation 9. Moreover, because NSPDK considers node/edge labels, we used this graph kernel MMD (GKMMD) as an evaluation metric for node/edge-labeled graph generation tasks. To reduce the computational cost, we sampled 100 graphs from the generated and the real graphs and then reported the average squared GKMMD over 10 runs.
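The way a kernel over graphs turns into a distance between two sets of graphs can be sketched with a simple plug-in estimate of the squared MMD: the mean kernel value within the first set, minus twice the mean across the sets, plus the mean within the second set. The sketch below uses a toy stand-in kernel (cosine similarity of node-label counts) rather than NSPDK, represents each graph only by its non-empty list of node labels, and uses the biased estimate that includes diagonal terms; all of these are assumptions made for illustration.

```python
import math
from collections import Counter


def label_histogram_kernel(g1, g2):
    """Toy positive-definite kernel: cosine similarity of node-label counts."""
    c1, c2 = Counter(g1), Counter(g2)
    dot = sum(c1[label] * c2[label] for label in c1)
    norm = math.sqrt(sum(v * v for v in c1.values()) * sum(v * v for v in c2.values()))
    return dot / norm


def squared_mmd(xs, ys, kernel):
    """Biased estimate: mean k(x, x') - 2 * mean k(x, y) + mean k(y, y')."""
    def mean_kernel(a, b):
        return sum(kernel(p, q) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) - 2.0 * mean_kernel(xs, ys) + mean_kernel(ys, ys)


generated = [["C", "C", "O"], ["C", "N"]]
reference = [["O", "O"], ["O", "N"]]
score = squared_mmd(generated, reference, label_histogram_kernel)
self_score = squared_mmd(generated, generated, label_histogram_kernel)
```

Replacing `label_histogram_kernel` with any positive-definite graph kernel that accounts for node and edge labels gives a metric of the same shape as the one described above.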
Table 2 lists the results. GRAM and cGRAM achieved higher valid and novel percentages, and GRAM performed the best in terms of GKMMD, which is likely due to the rich feature extraction and the consideration of the distance between nodes when estimating edges. We suppose the poor performance of cGRAM on GKMMD is due to the restrictions on community structure because molecular graphs are relatively small.
5 Conclusion
In this work, we tackled the problem of scalability, as it is one of the most important challenges in graph generation tasks. We first defined scalability from three perspectives and then proposed a scalable graph generative model, GRAM, and its variant. In addition, we proposed a novel graph attention mechanism as a key component of the model and constructed graph kernel MMD as a general evaluation metric for node/edge-labeled graph generation tasks. In our experiments, which used protein graphs and molecular graphs, we verified the scalability and the competitive or superior performance of our models.
Acknowledgments
This work was supported by JST CREST Grant Number JPMJCR1403, Japan. We would like to thank Atsuhiro Noguchi for his helpful advice on graphs. We would also like to express our gratitude to Yujia Li for providing data of generated molecular graphs and giving permission to use it for comparison in the experiment. Finally, we would like to thank every member of the lab for beneficial discussions, which inspired our work greatly.
References
 Abu-El-Haija et al. (2018) Abu-El-Haija, S., Perozzi, B., Al-Rfou, R., and Alemi, A. A. (2018). Watch your step: Learning node embeddings via graph attention. In Advances in Neural Information Processing Systems 31.
 Airoldi et al. (2009) Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2009). Mixed membership stochastic blockmodels. In Advances in Neural Information Processing Systems 21.
 Albert and Barabási (2002) Albert, R. and Barabási, A.-L. (2002). Statistical mechanics of complex networks. Rev. Mod. Phys., 74:47–97.
 Aronszajn (1950) Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.
 Blondel et al. (2008) Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.
 Costa and Grave (2010) Costa, F. and Grave, K. D. (2010). Fast neighborhood subgraph pairwise distance kernel. In Proceedings of the 27th International Conference on Machine Learning.
 De Cao and Kipf (2018) De Cao, N. and Kipf, T. (2018). MolGAN: An implicit generative model for small molecular graphs. arXiv e-prints, page arXiv:1805.11973.
 Dobson and Doig (2003) Dobson, P. D. and Doig, A. J. (2003). Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783.
 Erdös and Rényi (1959) Erdös, P. and Rényi, A. (1959). On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290–297.
 Fortunato (2010) Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3-5):75–174.
 Fortunato and Castellano (2012) Fortunato, S. and Castellano, C. (2012). Community structure in graphs. Computational Complexity: Theory, Techniques, and Applications, pages 490–512.
 Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. J. Mach. Learn. Res., 13:723–773.
 Grover et al. (2018) Grover, A., Zweig, A., and Ermon, S. (2018). Graphite: Iterative Generative Modeling of Graphs. arXiv e-prints, page arXiv:1803.10459.
 Ishiguro et al. (2019) Ishiguro, K., Maeda, S.-i., and Koyama, M. (2019). Graph Warp Module: an Auxiliary Module for Boosting the Power of Graph Neural Networks. arXiv e-prints, page arXiv:1902.01020.
 Johnson et al. (2018) Johnson, J., Gupta, A., and Fei-Fei, L. (2018). Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
 Kriegel et al. (2006) Kriegel, H.-P., Borgwardt, K. M., Gretton, A., Schölkopf, B., Rasch, M. J., and Smola, A. J. (2006). Integrating structured biological data by Kernel Maximum Mean Discrepancy. Bioinformatics, 22(14):e49–e57.
 Kusner et al. (2017) Kusner, M. J., Paige, B., and Hernández-Lobato, J. M. (2017). Grammar Variational Autoencoder. arXiv e-prints, page arXiv:1703.01925.
 Lancichinetti and Fortunato (2009) Lancichinetti, A. and Fortunato, S. (2009). Community detection algorithms: a comparative analysis. Physical Review E, 80(5):056117.
 Leskovec et al. (2010) Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. (2010). Kronecker graphs: An approach to modeling networks. J. Mach. Learn. Res., 11:985–1042.
 Li et al. (2018) Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. (2018). Learning Deep Generative Models of Graphs. arXiv e-prints, page arXiv:1803.03324.
 Li et al. (2018) Li, Y., Zhang, L., and Liu, Z. (2018). Multi-objective de novo drug design with conditional graph generative model. Journal of Cheminformatics, 10(1):33.
 Liu et al. (2018) Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. (2018). Constrained graph variational autoencoders for molecule design. In Advances in Neural Information Processing Systems 31.
 Luo et al. (2018) Luo, R., Tian, F., Qin, T., Chen, E., and Liu, T.-Y. (2018). Neural architecture optimization. In Advances in Neural Information Processing Systems 31.
 Robins et al. (2007) Robins, G., Pattison, P., Kalish, Y., and Lusher, D. (2007). An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2):173–191.
 Simonovsky and Komodakis (2018) Simonovsky, M. and Komodakis, N. (2018). GraphVAE: Towards generation of small graphs using variational autoencoders. In ICANN.
 Sterling and Irwin (2015) Sterling, T. and Irwin, J. J. (2015). ZINC 15 – ligand discovery for everyone. Journal of Chemical Information and Modeling, 55(11):2324–2337.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30.
 Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph Attention Networks. International Conference on Learning Representations.
 Wang et al. (2017) Wang, H., Wang, J., Wang, J., Zhao, M., Zhang, W., Zhang, F., Xie, X., and Guo, M. (2017). GraphGAN: Graph Representation Learning with Generative Adversarial Nets. arXiv e-prints, page arXiv:1711.08267.
 You et al. (2018a) You, J., Liu, B., Ying, Z., Pande, V., and Leskovec, J. (2018a). Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems 31.
 You et al. (2018b) You, J., Ying, R., Ren, X., Hamilton, W. L., and Leskovec, J. (2018b). GraphRNN: Generating realistic graphs with deep autoregressive models. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018.
Appendix A Appendix
a.1 Complexity Analysis
In this section, we compare the training-time computational complexity of our models with that of existing models. Following Vaswani et al. (2017), we consider two criteria: the total computational complexity and the minimum number of sequential operations under parallelization. Note that we assume that all matrix-vector products can be conducted with time complexity. Table 3 summarizes the evaluation results.
To generate a graph with nodes and edges, DeepGMG Li et al. (2018) requires at least operations. Because the initialization of the state of the new node depends on the state in the previous step, the minimum number of its sequential operations is . On the other hand, GraphRNN You et al. (2018b) requires operations with a constant value , utilizing the properties of breadth-first search and relying mainly on the hidden states of an RNN. However, the minimum number of its sequential operations is because of the sequential dependency of the RNN's hidden state; moreover, it cannot generate node/edge-labeled graphs.
Next, we evaluate GRAM. We denote the average number of neighboring nodes as and reuse , defined in Section 3.3.5. Focusing on one step of the generation process, the feature extractor requires operations, considering that a graph convolution layer requires , and a graph attention layer requires . In addition, the node estimator requires operations, and the edge estimator requires operations. Therefore, the total complexity of GRAM is . Note that all the estimation operations of the node estimator and the edge estimator can be parallelized during training because, unlike an RNN, our model has no sequentially dependent hidden states. As for cGRAM, we can similarly derive its total complexity as , where and denote the average number of nodes and edges in one community, respectively.
From the above analysis, we can say that GRAM, while keeping rich feature extraction, requires fewer operations than DeepGMG when holds, which is often the case in many real-world graphs. In addition, assuming the size of each community is nearly constant and smaller than that of a graph, the total complexity of cGRAM can be regarded as almost linear in , which is competitive with GraphRNN, while ours retains feature extraction. More importantly, all the estimation operations can be parallelized, which facilitates training with multiple computing nodes. Therefore, we can expect graph scalability and data scalability of our models. We also expect label scalability because our models can accommodate a variable number of labels by modifying only the dimensions of the input and output layers.
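The difference between the two criteria can be illustrated with a toy sketch. The code below is a generic illustration, not the GRAM architecture: it uses scalar node features, and the function names `rnn_states`, `attention_output`, and `attention_all` are hypothetical. It contrasts a recurrent update, where each hidden state needs the previous one, with an attention-style readout over a known prefix (as available during training), where every step depends only on the inputs and can therefore be evaluated in any order or in parallel.

```python
import math

def rnn_states(xs, w=0.5):
    """Recurrent update: h_t depends on h_{t-1}, so steps must run in order."""
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w * h + x)
        out.append(h)
    return out

def attention_output(xs, t):
    """Softmax-weighted average of xs[0..t]; uses only the inputs, no hidden state."""
    prefix = xs[: t + 1]
    scores = [xs[t] * x for x in prefix]          # dot-product scores (scalar features)
    top = max(scores)                             # subtract max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return sum(wt * x for wt, x in zip(weights, prefix)) / z

def attention_all(xs, order=None):
    """Evaluate every step; `order` may be any permutation of the step indices."""
    order = range(len(xs)) if order is None else order
    out = [None] * len(xs)
    for t in order:
        out[t] = attention_output(xs, t)
    return out

xs = [0.3, -1.2, 0.8, 2.0]
forward = attention_all(xs)
shuffled = attention_all(xs, order=[3, 1, 0, 2])  # same result in any order
```

Because `attention_all` gives the same outputs for any evaluation order, its steps could be dispatched to separate workers, whereas `rnn_states` has a chain of dependencies as long as the sequence.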
a.2 Inspection on Attention Weights in Edge Estimation
In a brief experiment on protein graphs where we consider all of the previously generated nodes in edge estimation (i.e., its total computational complexity is ), we examined the distributions of attention weights for nodes that are predicted to have edges to the new node and for those that are not. Figure 3 shows the two distributions of attention weights in one step of edge estimation (i.e., in the calculation of ). The blue one represents the distribution of weights for nodes that are predicted not to have edges to the new node (i.e., ), and the orange one represents the distribution of weights for nodes that are predicted to have edges to the new node (i.e., ). From the figure, we can see that the former distribution is more sharply peaked around zero than the latter; most of its weights are nearly zero. As for molecular graphs, we observed a similar result.
a.3 Community Detection Algorithm
Here we describe the community detection algorithm used in this work. Although it is possible to include community detection into the learning framework, we did not consider this option for simplicity.
While various algorithms to detect communities in a graph have been proposed Lancichinetti and Fortunato (2009), we use an algorithm modified from the Louvain algorithm Blondel et al. (2008), which is simple and computationally lightweight. The Louvain algorithm detects communities by maximizing modularity in a greedy manner. Modularity is a function that measures the quality of a partition and is calculated as:
(10) $Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$
where $A_{ij}$ denotes the weight of the edge between $i$ and $j$, $c_i$ represents the community that $i$ belongs to, and $k_i = \sum_j A_{ij}$, $m = \frac{1}{2} \sum_{i,j} A_{ij}$ Blondel et al. (2008). Note that $\delta(c_i, c_j) = 1$ if $c_i = c_j$ and $0$ otherwise. Specifically, the Louvain algorithm leverages the fact that the change in modularity caused by adding/deleting a node to/from a community can be calculated easily.
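As a concrete illustration of Equation 10, the following minimal sketch computes the standard modularity of Blondel et al. (2008) for a given partition of an undirected weighted graph. It groups the double sum by community, which gives the same value. The function name `modularity` and the edge-list input format are assumptions made for this example, not part of the original implementation.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity Q of a partition of an undirected weighted graph.

    edges: iterable of (u, v, weight), each undirected edge listed once.
    community: dict mapping node -> community id.
    """
    degree = defaultdict(float)   # k_i: total edge weight attached to node i
    inner = defaultdict(float)    # edge weight inside each community
    total = 0.0                   # m: total edge weight of the graph
    for u, v, w in edges:
        degree[u] += w
        degree[v] += w
        total += w
        if community[u] == community[v]:
            inner[community[u]] += w
    if total == 0:
        return 0.0
    comm_degree = defaultdict(float)  # sum of k_i over the nodes of each community
    for node, k in degree.items():
        comm_degree[community[node]] += k
    # Q = sum over communities c of [ inner_c / m - (comm_degree_c / 2m)^2 ]
    return sum(inner[c] / total - (comm_degree[c] / (2 * total)) ** 2
               for c in comm_degree)

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
two_parts = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_part = {n: "a" for n in range(6)}
q_two = modularity(edges, two_parts)   # 5/14, about 0.357
q_one = modularity(edges, one_part)    # 0.0
```

Splitting the graph at the bridge scores higher than keeping a single community, which is the behavior the greedy maximization exploits.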
However, we found that applying this algorithm as-is can produce quite small communities, which makes training unstable. To alleviate this problem, we introduced coverage Fortunato (2010) as a regularization term. Coverage is defined as the ratio of inner-community edges to all edges. More precisely, we modify the objective function of the Louvain algorithm to with a regularization parameter . Note that setting yields the original Louvain algorithm and setting returns only one community without partitioning. Even with this modification, the change in the objective can still be calculated quite simply; hence its time complexity is almost the same as that of the original. We set for molecular graphs and for protein graphs.
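To make the coverage term concrete, the sketch below computes it directly from its definition as the ratio of inner-community edges to all edges. The function name `coverage` and the edge-list format are assumptions for illustration. The way coverage is combined with modularity in the modified objective is deliberately not reproduced here.

```python
def coverage(edges, community):
    """Fraction of total edge weight that lies inside communities.

    edges: iterable of (u, v, weight), each undirected edge listed once.
    community: dict mapping node -> community id.
    """
    total = 0.0
    inner = 0.0
    for u, v, w in edges:
        total += w
        if community[u] == community[v]:
            inner += w
    return inner / total if total > 0 else 0.0

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
one_part = {n: 0 for n in range(6)}                 # a single community
two_parts = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}    # split at the bridge
singletons = {n: n for n in range(6)}               # every node on its own

cov_one = coverage(edges, one_part)      # 1.0: every edge is inner-community
cov_two = coverage(edges, two_parts)     # 6/7: only the bridge is cut
cov_single = coverage(edges, singletons) # 0.0: every edge is cut
```

Coverage is maximal for a single community and falls as the partition fragments, so rewarding it counteracts the tendency toward very small communities.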
a.4 Visual Inspection on Generated Molecular Graphs
For qualitative evaluation, we show molecules from the training set and those generated by each model in Figure 7, 7, 7, 7. For GRAM and DeepGMG, we could not see a significant difference from the training set, whereas cGRAM generated some small molecules, which we consider to be reflected in its poor performance in GKMMD.
a.5 Detail of Experiment Settings
For convenience of implementation, we started the generation process from a small subgraph with nodes (i.e., replace with in Equation 4). We set for protein graphs and for molecular graphs.
In the experiment on protein graphs, we used Tesla V100 × 4 for training. We trained the models with batch size 2048 for 200 epochs. It took about 10 hours for GRAM and 5 hours for cGRAM. We set .
In the experiment on molecular graphs, we used Tesla V100 × 2 for training. We trained the models with batch size 8192 for 40 epochs. It took about 10 hours for GRAM and 8 hours for cGRAM. We set . Although 250k samples is a small amount compared with the large number of existing molecular graphs, this training time suggests scalability to larger datasets.