Graphs are ubiquitous and fundamental data structures in the real world, appearing in various fields such as social science, chemistry, and biology. For example, knowledge bases, molecular structures, and relations between objects in an image can be represented as graphs, which allows us to efficiently capture the essence of these data. Moreover, in applications such as drug discovery or network simulation, graph generative models that can approximate distributions over graphs in a specific domain and derive new samples from them are very important. Compared with generation tasks for images or natural language, however, graph generation tasks are significantly more difficult because of the need to model complex local/global dependencies between nodes and edges, as well as the intractable properties of graphs themselves, such as discreteness, variable numbers of nodes and edges, and uncertainty of node ordering.
Although graph generation tasks involve the several challenges described above, scalability is one of the most important for applications in a wide range of real-world domains. In this work, we define scalability from three perspectives: graph scalability, data scalability, and label scalability. Graph scalability denotes scalability to large graphs with many nodes, data scalability denotes scalability to large datasets with many samples, and label scalability denotes scalability to graphs with many node/edge labels, all within practical time/space complexity, especially during training. Other viewpoints are possible, such as the number of edges, the diameter of graphs, or the total number of nodes in a dataset. However, because these are closely related to the above three, we adopt these three as a mutually exclusive and collectively exhaustive division.
In recent years, an increasing number of graph generative models based on machine learning have been proposed and have demonstrated strong performance in several tasks, such as link prediction, molecular property optimization, and network structure optimization Li et al. (2018); You et al. (2018b); Grover et al. (2018); Luo et al. (2018); You et al. (2018a); Liu et al. (2018); Simonovsky and Komodakis (2018); Wang et al. (2017); Li et al. (2018). However, to the best of our knowledge, no model is scalable in all three contexts. For example, DeepGMG Li et al. (2018) can generate only small graphs with fewer than 40 nodes, GraphRNN You et al. (2018b) cannot generate node/edge-labeled graphs, and both models are poorly suited to training parallelization.
In this work, we propose the Graph Generative Model with Graph Attention Mechanism (GRAM) for generating real-world graphs, which is scalable in all three contexts, especially during training. Given a set of graphs, our model approximates their distribution in an unsupervised manner. To achieve graph scalability, we employ an autoregressive, sequential generation process that accommodates variable numbers of nodes and edges, formulate the likelihood of graphs in a simple manner to simplify the generation process, and exploit properties of real-world graphs, such as community structure and sparseness of edges, to reduce the computational cost of training. With regard to data scalability, we apply a novel graph attention mechanism that is a simple extension to graphs of the attention mechanism used in natural language processing Vaswani et al. This graph attention mechanism does not include sequentially dependent hidden states as a Recurrent Neural Network (RNN) does, which improves the parallelizability of training significantly. Compared with other graph attention mechanisms such as Veličković et al. (2018); Abu-El-Haija et al. (2018); Ishiguro et al. (2019), ours is architecturally simpler, computationally lightweight, and general; its applicability is also not limited to generation tasks. Finally, for label scalability, we formulate the likelihood of graphs assuming multiple node/edge labels and use graph convolution and graph attention layers whose number of parameters does not depend directly on the number of labels.
Moreover, we introduce a non-domain-specific evaluation metric for generation tasks of node/edge-labeled graphs. Because such a method does not currently exist, prior works relied on domain-specific metrics or visual inspection, which made unified and objective evaluation difficult. Thus, we construct a general evaluation metric that dispenses with domain-specific knowledge and considers node/edge labels, combining a graph kernel and Maximum Mean Discrepancy (MMD) Gretton et al. (2012). Although a similar statistical test based on MMD via a graph kernel is used for schema matching of protein graphs in Kriegel et al. (2006), this work, to the best of our knowledge, is the first to use a graph kernel and MMD as an evaluation metric for graph generation tasks.
Our experiments on datasets of protein and molecular graphs, each associated with one aspect of scalability, demonstrated that our models can scale to large graphs and datasets that previous methods had difficulty handling, and yielded results that were competitive with or superior to those of baseline methods.
In summary, the contributions of this work are as follows:
- We propose GRAM, a generative model for real-world graphs that is scalable, especially during training.
- We propose a novel graph attention mechanism that is general and architecturally simple.
- We define scalability in graph generation tasks from three perspectives: graph scalability (number of nodes), data scalability, and label scalability.
- We construct a non-domain-specific, general evaluation metric for node/edge-labeled graph generation tasks by combining a graph kernel and MMD.
2 Related Work
Although there are several traditional graph generative models Erdös and Rényi (1959); Albert and Barabási (2002); Leskovec et al. (2010); Robins et al. (2007); Airoldi et al. (2009), here we focus on the latest machine learning-based approaches, which outperform traditional methods in various graph generation tasks.
In terms of their generation process, existing graph generative models can be classified into at least two types: tensor generation models and sequential generation models. Tensor generation models such as Simonovsky and Komodakis (2018); De Cao and Kipf (2018); Grover et al. (2018) generate a graph by outputting tensors that correspond to the graph. Although architecturally simple and easy to optimize for small graphs, these models have difficulty generating large graphs owing to the non-unique correspondence between a graph and tensors, or because of the pre-defined maximum number of nodes. In contrast, sequential generation models such as Li et al. (2018); You et al. (2018b) generate graphs by adding nodes and edges one by one, which alleviates the above problems and allows larger graphs to be generated. To achieve graph scalability, we employ the latter. However, to generate a graph with $n$ nodes and $m$ edges, DeepGMG Li et al. (2018) requires at least $O(mn^2)$ operations because of its complex generation process. Although GraphRNN You et al. (2018b) reduces the time complexity to $O(nM)$ with a constant value $M$ by utilizing the properties of breadth-first search and an RNN, it cannot generate node/edge-labeled graphs, mainly because it does not calculate the features of each node and instead relies mainly on the information in the hidden states of the RNN. Moreover, these models include sequentially dependent hidden states, which makes training parallelization difficult. In contrast, our model employs a graph attention mechanism without sequentially dependent hidden states, which improves the parallelizability of training significantly. In addition, by utilizing properties of graphs such as community structure and sparseness of edges, we reduce the complexity to an almost linear order of $n$ while conducting rich feature extraction.
3 Proposed Method
In this section, we first describe notations of graphs and define the graph generation task we tackle. Then, we describe the graph attention mechanism we propose, our scalable graph generative model GRAM, and its variant. GRAM approximates a distribution over graphs in an autoregressive manner and utilizes the novel graph attention mechanism to improve the parallelizability of training significantly. One step in the generation process of GRAM is illustrated in Figure 2.
3.1 Notations and Problem Definition
In this work, we define a graph $G = (V, E)$ as a data structure that consists of its node set $V = \{v_1, \dots, v_n\}$ and edge set $E \subseteq \{e_{i,j} \mid i, j \in \{1, \dots, n\}\}$. In the case of a directed graph, we define the direction of $e_{i,j}$ as from $v_i$ to $v_j$. We assume nodes and edges are associated with multiple node/edge labels and denote the numbers of node and edge labels as $a$ and $b$, respectively. For graph representation, we employ a tensor representation. Specifically, given a node ordering $\pi$, we represent a pair comprising a graph and a node ordering $(G, \pi)$ as a pair of tensors $(X^\pi, A^\pi)$. Note that $\pi$ is a permutation function over $V$. $X^\pi \in \{0,1\}^{n \times a}$ stores information about nodes; its $i$-th row is a one-hot vector corresponding to the label of $v_{\pi(i)}$. $A^\pi \in \{0,1\}^{n \times n \times b}$ stores information about edges; its $(i,j)$-th element is a one-hot vector corresponding to the label of $e_{\pi(i),\pi(j)}$. If no edge exists between $v_{\pi(i)}$ and $v_{\pi(j)}$, we replace it with a zero vector. Note that $(G, \pi)$ and $(X^\pi, A^\pi)$ correspond uniquely. Finally, for simplicity, we do not consider self-loops or multiple edges, and we assume that the independent elements of $A^\pi$ are limited to its upper triangular elements $\{A^\pi_{i,j} \mid i < j\}$. A possible extension to self-loops is to add a step that estimates them; for multiple edges, we can prepare additional labels that correspond to them.
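The tensor representation described above can be illustrated with a small sketch. It is only an illustration under assumed conventions: the function name `to_tensors`, the dictionary-based inputs, and the toy graph are hypothetical, and nested Python lists stand in for tensors.

```python
def to_tensors(node_labels, edge_labels, order, num_node_labels, num_edge_labels):
    """Build the pair (X, A) for an undirected labeled graph under a node ordering.

    node_labels: dict mapping node -> label index in [0, num_node_labels)
    edge_labels: dict mapping (u, v) -> label index in [0, num_edge_labels)
    order:       list of nodes; order[i] is the node placed at position i
    """
    n = len(order)
    pos = {v: i for i, v in enumerate(order)}
    # X[i] is the one-hot label of the node at position i.
    X = [[0] * num_node_labels for _ in range(n)]
    for v, lab in node_labels.items():
        X[pos[v]][lab] = 1
    # A[i][j] is the one-hot edge label, or the zero vector if there is no edge.
    A = [[[0] * num_edge_labels for _ in range(n)] for _ in range(n)]
    for (u, v), lab in edge_labels.items():
        i, j = sorted((pos[u], pos[v]))
        if i == j:
            raise ValueError("self-loops are not represented")
        A[i][j][lab] = 1  # only the upper triangle (i < j) is filled
    return X, A


# Toy example: a path c - n - o with two node labels and two edge labels.
X, A = to_tensors(
    node_labels={"c": 0, "n": 1, "o": 0},
    edge_labels={("c", "n"): 0, ("n", "o"): 1},
    order=["n", "c", "o"],
    num_node_labels=2,
    num_edge_labels=2,
)
```

Changing `order` permutes the rows of `X` and the entries of `A`, which is why the likelihood is formulated over pairs of a graph and a node ordering rather than over graphs alone.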
The graph generation task we target is, given samples from a distribution of graphs $p(G)$, to approximate $p(G)$ in such a way that we can draw new samples from it.
3.2 Graph Attention Mechanism
The aim of employing a graph attention mechanism is to efficiently take in the information of distant nodes through attention. Our graph attention mechanism is an extension to graphs of the attention mechanism used in natural language processing Vaswani et al. One of the significant differences between graphs and sentences is that we cannot define absolute coordinates in a graph, which makes it difficult to embed positional information into nodes as in Vaswani et al. To alleviate this problem, we focus on the multi-head attention in Vaswani et al. and introduce bias terms in the projections to subspaces, which are functions of the shortest path length between two nodes. Following Vaswani et al., we denote the matrices into which query, key, and value vectors are stacked as $Q$, $K$, and $V$, respectively. In addition, we assume the $i$-th row corresponds to the feature vector of node $v_i$. With these notations, the operation in the graph attention mechanism is defined as
where represents the number of projections to subspaces. Additionally, the operation of is defined as
Each attention weight is calculated as
The parameters to be updated are four weight matrices, , , and , and three bias terms, , and , where , and represent dimensions of input, output, key, and value vector, respectively. We used different parameters for calculating each head.
To consider the geometric relation between two nodes, we used functions of the shortest path length between and as , and . Other possible approaches, for example, are using network flow instead of shortest path length or utilizing functional approximation. Furthermore, because path length is discrete, we used different weight parameters for each path length.
In particular, setting all three bias terms to zero yields the original multi-head attention. As in Vaswani et al., we added a two-layer feedforward neural network (FNN), which was applied after the above operations. We refer to these operations, including the FNN, as the graph attention mechanism in the following sections.
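As an illustration of the idea described in this section, the following sketch implements multi-head attention in which the query, key, and value projections each receive a bias looked up by the shortest path length between the two nodes. It is one plausible instantiation, not the exact formulation: the function names, the clipping of path lengths at `max_len`, the random parameters, and the omission of the trailing FNN are all assumptions made for brevity.

```python
import math
import random
from collections import deque


def shortest_path_lengths(adj, max_len):
    """All-pairs BFS distances, clipped at max_len (also used for unreachable pairs)."""
    n = len(adj)
    dist = [[max_len] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        seen, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    dist[s][w] = min(dist[s][u] + 1, max_len)
                    queue.append(w)
    return dist


def project(x, W, b):
    """Return x @ W + b for a vector x, a matrix W with len(x) rows, and a bias b."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) + b[c] for c in range(len(b))]


def attention_head(H, dist, p):
    """One head; the biases b_q, b_k, b_v are looked up by shortest path length."""
    d_k = len(p["b_q"][0])
    out = []
    for i in range(len(H)):
        scores, values = [], []
        for j in range(len(H)):
            length = dist[i][j]
            q = project(H[i], p["W_q"], p["b_q"][length])
            k = project(H[j], p["W_k"], p["b_k"][length])
            scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
            values.append(project(H[j], p["W_v"], p["b_v"][length]))
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, values)) / total
                    for c in range(len(values[0]))])
    return out


def graph_multi_head(H, adj, heads, W_o, max_len):
    """Concatenate all heads per node, then apply the output projection W_o."""
    dist = shortest_path_lengths(adj, max_len)
    per_head = [attention_head(H, dist, p) for p in heads]
    concat = [[x for head in per_head for x in head[i]] for i in range(len(H))]
    return [project(row, W_o, [0.0] * len(W_o[0])) for row in concat]


def init_head(rng, d_in, d_k, d_v, max_len):
    """Random placeholder parameters; one bias vector per path length 0..max_len."""
    def mat(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
    return {"W_q": mat(d_in, d_k), "W_k": mat(d_in, d_k), "W_v": mat(d_in, d_v),
            "b_q": mat(max_len + 1, d_k), "b_k": mat(max_len + 1, d_k),
            "b_v": mat(max_len + 1, d_v)}


rng = random.Random(0)
adj = [[1], [0, 2], [1, 3], [2]]  # path graph 0-1-2-3
H = [[rng.uniform(-1, 1) for _ in range(4)] for _ in adj]
heads = [init_head(rng, d_in=4, d_k=3, d_v=3, max_len=2) for _ in range(2)]
W_o = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(2 * 3)]
out = graph_multi_head(H, adj, heads, W_o, max_len=2)
```

Because the biases are stored in a table indexed by a discrete path length, each length gets its own parameters, and no recurrent state links one node's output to another's, so all rows of `out` could be computed in parallel.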
3.3 GRAM: Scalable Generative Models for Graphs
We first describe the likelihood formulation of graphs and then provide an overview of GRAM and its variant.
3.3.1 Likelihood Formulation of Graphs
To approximate distributions over graphs in an autoregressive manner, here we formulate the likelihood of a pair of a graph and a node ordering $(G, \pi)$ and decompose it into a product of conditional probabilities. Because $(G, \pi)$ uniquely defines $(X^\pi, A^\pi)$ and vice versa, we have $p(G, \pi) = p(X^\pi, A^\pi)$, which is decomposed into a product of conditional probabilities as
where represents a partial tensor and "" represents all indices and other notations follow as these. We omit the last dimension of each tensor for simplicity. In addition, uniquely defines a subgraph of , which we denote as . To keep notations clear, we represent as in the following equations. With this formulation, the likelihood of a graph is defined as marginal, .
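For reference, an autoregressive factorization consistent with the description in this section can be written as follows. This is a sketch under assumed notation rather than a transcription of the original equation: $X_s$ denotes the label row of the $s$-th node, $A_{t,s}$ the label vector of the edge between the $t$-th and $s$-th nodes, subscripts such as $<s$ denote all earlier indices, the superscript $\pi$ is dropped for readability, and the EOS event that terminates generation is omitted.

```latex
\begin{align}
p(X, A) &= p(X_1) \prod_{s=2}^{n} \Bigl[ p\bigl(X_s \mid X_{<s}, A_{<s,<s}\bigr)
           \prod_{t=1}^{s-1} p\bigl(A_{t,s} \mid A_{<t,s}, X_{\le s}, A_{<s,<s}\bigr) \Bigr], \\
p(G) &= \sum_{\pi} p(G, \pi).
\end{align}
```

The first factor inside the brackets corresponds to predicting the label of the next node, and the inner product corresponds to predicting, one at a time, the labels of the edges between that node and the earlier nodes.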
On training, given a set of samples from , our GRAM approximates the joint probability . Although the choice of provides room for discussion, we adopt the breadth-first search used in You et al. (2018b). More precisely, our model approximates by approximating the conditional probabilities and . As for , we used a sampling distribution from training data.
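A breadth-first node ordering of the kind adopted here can be sketched as below. The choice of start node and the handling of disconnected components are assumptions of this sketch (the function name `bfs_order` is hypothetical), not details stated in the text.

```python
from collections import deque


def bfs_order(adj, start):
    """Return nodes in breadth-first order from `start`.

    Nodes that are unreachable from `start` are appended afterwards,
    component by component, in sorted order of their smallest node.
    """
    order, seen = [], set()
    for root in [start] + sorted(adj):
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return order


adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1], 4: []}
order = bfs_order(adj, start=1)
```

Under a breadth-first ordering, a new node can only connect to nodes that were added recently, which is the property GraphRNN exploits to bound the number of edge predictions per step.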
On testing, we sequentially sample and from the approximated distribution and we get when EOS is output. This can be viewed as sampling from the approximated distribution . Especially by focusing only on , we can view it as sampling from the marginal .
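The sampling procedure at test time can be outlined as a simple loop. The sketch below only shows the control flow: the three callables stand in for the first-node sampling distribution, the node estimator, and the edge estimator, and the `EOS` sentinel, the `max_nodes` cap, and all function names are hypothetical.

```python
EOS = -1  # hypothetical sentinel for the end-of-sequence / "no edge" outcome


def generate(sample_first_node, sample_node, sample_edge, max_nodes=100):
    """Sequentially sample node labels and edge labels until EOS is produced."""
    nodes = [sample_first_node()]  # label of the first node
    edges = {}                     # (t, s) with t < s -> edge label
    while len(nodes) < max_nodes:
        label = sample_node(nodes, edges)
        if label == EOS:           # the node estimator ends the generation
            break
        s = len(nodes)
        nodes.append(label)
        for t in range(s):         # edges between earlier nodes and the new node
            edge_label = sample_edge(nodes, edges, t, s)
            if edge_label != EOS:  # for edges, EOS means "no edge"
                edges[(t, s)] = edge_label
    return nodes, edges


# Deterministic stand-ins for the learned estimators, for illustration only.
def first_node():
    return 0


def next_node(nodes, edges):
    return len(nodes) % 2 if len(nodes) < 3 else EOS


def next_edge(nodes, edges, t, s):
    return 0 if s - t == 1 else EOS


nodes, edges = generate(first_node, next_node, next_edge)
```

With these stand-ins the loop produces a three-node path; with learned estimators, each call would instead sample from the predicted categorical distribution.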
3.3.2 Model Overview
Here, we provide an overview of GRAM. For simplicity, we assume to be an identity permutation (i.e., ) in the following sections.
The architecture of GRAM consists of three networks: a feature extractor, a node estimator, and an edge estimator. The feature extractor calculates node feature vectors and obtains a graph feature vector by summing them up, the node estimator predicts the label of the newly added node , and the edge estimator predicts the labels of edges between previously generated nodes and the new node . In other words, the node estimator approximates and the edge estimator approximates in Equation 4.
3.3.3 Feature Extractor
Given a pair of tensors , the feature extractor calculates the node feature vectors and a graph feature vector of the corresponding subgraph . It consists of feature extraction blocks and a subsequent graph pooling layer. We used in the experiments.
A feature extraction block is composed of a graph convolution layer and a graph attention layer arranged in parallel, as illustrated in Figure 2. To keep compatibility with the community-oriented variant described in Section 3.4, we used divided feature vectors to keep the output of graph attention layers from flowing into the input of the graph convolution layers. We aim to extract local information by using graph convolution layers and global information by using graph attention layers. Although there are various types of graph convolutions, we employed the one used in Johnson et al. (2018). (In Johnson et al. (2018), directed graphs are assumed as input. In the case of undirected graphs, we consider only and ignore , illustrated in Figure 3 in Johnson et al. (2018).) Roughly, the selected graph convolution convolves the features of neighboring nodes and edges into each node and edge in a graph. A graph attention layer performs self-attention, in which the query/key/value vectors are all node feature vectors. To reduce the computational cost, we restricted the range of attention to -neighboring nodes. Also, to exploit low-level features of graphs, we stacked the degree, clustering coefficient, and distance from the center of the graph into each node vector, all of which are non-domain-specific graph statistics.
The graph pooling layer computes a graph feature vector by summing up all node feature vectors in the graph. To improve its expressive power in aggregation, we used a gating network as in Li et al. (2018). Specifically, the operation in the graph pooling layer is defined as
where is a two-layer FNN and
is a sigmoid function.
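A gated sum pooling of the kind described here can be sketched as follows. The exact form of the gate is an assumption: this sketch scales each node vector elementwise by the sigmoid of a two-layer FNN applied to that same vector and then sums over nodes, and the helper names `make_fnn` and `gated_sum_pool` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_fnn(W1, b1, W2, b2):
    """Two-layer feedforward network with a ReLU in between.

    W1 and W2 are lists of weight vectors, one per output unit (placeholders here).
    """
    def fnn(x):
        hidden = [max(0.0, sum(xi * w for xi, w in zip(x, unit)) + b)
                  for unit, b in zip(W1, b1)]
        return [sum(hi * w for hi, w in zip(hidden, unit)) + b
                for unit, b in zip(W2, b2)]
    return fnn


def gated_sum_pool(node_features, gate_fn):
    """Sum node vectors after scaling each elementwise by sigmoid(gate_fn(h))."""
    dim = len(node_features[0])
    pooled = [0.0] * dim
    for h in node_features:
        gate = [sigmoid(g) for g in gate_fn(h)]
        for c in range(dim):
            pooled[c] += gate[c] * h[c]
    return pooled


# With all-zero weights the gate is 0.5 everywhere, so the result is half the plain sum.
zero_gate = make_fnn([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0],
                     [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
graph_feature = gated_sum_pool([[1.0, 2.0], [3.0, 4.0]], zero_gate)
```

The gate lets the network learn which nodes and which feature dimensions should contribute to the graph-level vector, instead of weighting all nodes equally as a plain sum does.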
3.3.4 Node Estimator
The node estimator predicts the label of the new node . Concretely, it computes from as
where is a three-layer FNN, the dimension of whose last layer is
including EOS and the activation function is a softmax function. We terminate the generation process when EOS is output.
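The output stage of the node estimator, a softmax over the node labels plus one extra EOS class, can be sketched as below. The FNN is passed in as an arbitrary callable, and the convention that the last index is EOS, the use of `None` to signal it, and the function names are assumptions of this sketch.

```python
import math
import random


def softmax(logits):
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_node_label(graph_feature, fnn, num_node_labels, rng):
    """Sample a node label in [0, num_node_labels), or None when EOS is drawn."""
    probs = softmax(fnn(graph_feature))
    if len(probs) != num_node_labels + 1:
        raise ValueError("the output layer needs one class per label plus EOS")
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return None if index == num_node_labels else index


rng = random.Random(0)
# Stand-in networks with extreme logits, so the sampled outcome is deterministic.
eos_case = sample_node_label([0.0], lambda g: [0.0, 0.0, 1000.0], 2, rng)
label_case = sample_node_label([0.0], lambda g: [1000.0, 0.0, 0.0], 2, rng)
```

Treating EOS as one more class means the stopping decision is learned jointly with the label prediction rather than by a separate network.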
3.3.5 Edge Estimator
The edge estimator predicts the labels of the edges between previously generated nodes and the new node . More precisely, it computes from and previously predicted edge labels as
where is the embedded vector of the predicted label of . is calculated by a graph attention in which the query vector is and the value/key vectors are , where denotes the embedded vector of the predicted label of . Thereby, we aim to express the dependency to in . We use a three-layer FNN as , in which the dimension of the last layer is including EOS with a softmax function as the activation function. When EOS is output, we do not add an edge.
However, to generate a graph with nodes, the edge estimation process requires operations for one edge, resulting in operations for one step and operations in total, which is a significant obstacle to achieving graph scalability. Therefore, we resort to a remedy obtained from empirical observation. Specifically, our inspection on each attention weight in the edge estimation showed that most of the weights for nodes that were predicted to have no edge between (i.e., ) were nearly zero, while the weights for nodes that were predicted to have edges (i.e., ) took larger values. This observation suggests that among , those that have edges between are important and the others are not in the prediction of . Hence, we deterministically set these weights to zero, which means we do not consider them in graph attention. With this approximation (i.e., ), we reduce the number of required operations to , where . In addition, because holds, we can assume when , which is often the case with many real-world graphs.
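The approximation described above, where attention weights of nodes predicted to have no edge are fixed to zero, amounts to computing attention over a subset of positions. A minimal sketch, assuming plain scaled dot-product attention and a hypothetical list `has_edge` of earlier edge predictions:

```python
import math


def masked_attention(query, keys, values, keep):
    """Scaled dot-product attention restricted to the positions listed in `keep`."""
    dim = len(values[0])
    if not keep:
        return [0.0] * dim  # nothing to attend to yet
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, keys[j])) / scale for j in keep]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * values[j][c] for w, j in zip(weights, keep)) / total
            for c in range(dim)]


# Earlier predictions for the new node: edges to positions 0 and 2, none to position 1.
has_edge = [True, False, True]
keep = [j for j, flag in enumerate(has_edge) if flag]
context = masked_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    values=[[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]],
    keep=keep,
)
```

The cost of one edge prediction then scales with the number of kept positions rather than with the number of previously generated nodes, which is small when the graph is sparse.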
3.4 cGRAM: Community-Oriented Variant
In addition to GRAM described above, here we present a community-oriented variant, cGRAM, which aims to reduce the computational cost significantly by imposing restrictions based on community structure.
In real-world graphs, the distribution of edges is biased rather than uniform, which allows us to divide the node set into subsets called communities, in which nodes are densely connected with each other Fortunato and Castellano (2012). We utilize this property to reduce computational complexity. Specifically, denoting communities in a subgraph as , in this model, we assume that the nodes that have edges to the new node are restricted to a single community and that the label of depends only on . This assumption is mainly based on the property of community structure that intra-community connections are denser than inter-community connections and that nodes in the same community have stronger relations than those in other communities Fortunato and Castellano (2012). Concretely, we modify the graph pooling of Equation 5 to a community pooling defined as
and we replace with in the subsequent operations. Additionally, we restrict the range of graph attention to the same community. With these assumptions and restrictions, on training we only have to consider nodes in and their -neighbors in the feature extraction, the node estimation, and the edge estimation, which improves computational performance significantly. On testing, we assume the probability that belongs to each community is equal (i.e., ) and select the target community randomly.
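The community restriction can be sketched as pooling over a single community and, at test time, picking the target community uniformly at random. The gating network is omitted for brevity, and the function names and the list-of-lists community representation are assumptions of this sketch; how communities are detected is outside its scope.

```python
import random


def community_pool(node_features, community):
    """Sum the feature vectors of the nodes in one community (gating omitted)."""
    dim = len(node_features[0])
    return [sum(node_features[v][c] for v in community) for c in range(dim)]


def pick_community(communities, rng):
    """Test-time choice: every community is equally likely to receive the new node."""
    return rng.choice(communities)


node_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
communities = [[0, 1], [2, 3]]  # node indices grouped by community
pooled_first = community_pool(node_features, communities[0])
pooled_second = community_pool(node_features, communities[1])
target = pick_community(communities, random.Random(0))
```

Since only the nodes of one community enter the pooled vector, the per-step cost depends on the community size rather than on the size of the whole graph.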
4 Experiments
To evaluate the performance and scalability of our models, we compared GRAM and cGRAM with prior graph generative models on generation tasks for two types of real-world graphs, protein graphs and molecular graphs, which correspond to graph scalability and data scalability, respectively. As for label scalability, in this work we only considered whether a model could handle multiple node/edge labels, because performance would be strongly affected by architecture design, which makes a fair comparison difficult.
4.1 Protein Graphs: Graph Scalability
To evaluate graph scalability, we performed an experiment using protein graphs, which contain a relatively large number of nodes. The experimental settings largely followed You et al. (2018b).
We used a dataset of 918 protein graphs Dobson and Doig (2003), in which each node corresponds to an amino acid and two nodes that are chemically connected or spatially close are connected by an edge. In detail, the number of nodes is and the number of node/edge labels is . We allocated 80% of the graphs for training and the rest for testing.
We compared our models with GraphRNN and GraphRNN-S You et al. (2018b) as recent deep learning-based baselines. In addition, we reported the results of the Erdös-Rényi model (E-R) Erdös and Rényi (1959), the Barabási-Albert model (B-A) Albert and Barabási (2002), Kronecker Graphs Leskovec et al. (2010), and Mixed Membership Stochastic Blockmodels (MMSB) Airoldi et al. (2009) as traditional baselines.
To measure the quality of the generated graphs, we used MMD Gretton et al. (2012) for three types of graph features: degree, clustering coefficient, and orbit counts, as proposed in You et al. (2018b). These metrics allowed us to measure the distance between the distribution of the generated graphs and that of real graphs quantitatively. We reported the average squared MMD score over three runs.
Table 2 summarizes the results. (We used the code provided by http://github.com/snap-standard/GraphRNN. To correctly calculate the MMD score and for a fair comparison, we omitted the cleaning post-process in the code and re-evaluated all models.) Our models performed better than most of the baselines in terms of clustering coefficient and orbit counts, and cGRAM was the best. Considering that these are middle/high-level features of graphs, this is likely due to employing graph attention in the edge estimation, which can consider the relative position between two nodes. However, our models performed poorly in terms of degree compared with GraphRNN. This is likely because the way nodes are connected involves their spatial proximity; because this is noisy and difficult to estimate, it is possible that the rich feature extraction backfired and the model failed to capture low-level features. On average, cGRAM performed the best. We presume the reason is that focusing on one community made training stable and led to better performance, because the size of a community is smaller and relatively more constant than that of a graph.
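Protein graphs (Section 4.1), squared MMD:

| Model | Degree | Clustering | Orbit |
|---|---|---|---|
| E-R Erdös and Rényi (1959) | 0.154 | 1.788 | 1.098 |
| B-A Albert and Barabási (2002) | 1.452 | 1.713 | 0.914 |
| Kronecker Leskovec et al. (2010) | 1.048 | 1.799 | 0.668 |
| MMSB Airoldi et al. (2009) | 0.623 | 1.793 | 1.261 |
| GraphRNN-S You et al. (2018b) | 0.274 | 0.293 | 0.204 |
| GraphRNN You et al. (2018b) | 0.087 | 1.055 | 0.774 |

Molecular graphs (Section 4.2):

| Model | Valid (%) | Novel (%) | GK-MMD |
|---|---|---|---|
| DeepGMG Li et al. (2018) | 89.2 | 89.1 | 0.0200 |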
4.2 Molecular Graphs: Data Scalability
To evaluate data scalability, we conducted an experiment using a dataset that consists of 250k molecular graphs, following Li et al. (2018).
We used the 250k samples from ZINC dataset Sterling and Irwin (2015) provided by Kusner et al. (2017), where nodes and edges correspond to heavy atoms and chemical bonds, respectively. The number of nodes is and the number of node/edge labels is .
We used DeepGMG Li et al. (2018) as a baseline. (To evaluate DeepGMG in GK-MMD, we used generated molecular graphs provided by the author of Li et al. (2018).) With an emphasis on the ability to approximate distributions of arbitrary real-world graphs, we excluded other graph generative models that use domain-specific knowledge or are based on reinforcement learning.
Following Li et al. (2018), we generated 100k graphs and calculated the percentage of graphs that were valid as molecules and the percentage of novel samples that did not appear in the training set. In addition, by combining a graph kernel and MMD, we constructed a general evaluation metric to measure the distance between the distribution of the generated graphs and that of real graphs, both of which are node/edge labeled.
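The two percentages can be computed along the following lines. This sketch makes several assumptions: graphs are compared through a canonical string form, both percentages are taken over all generated samples, and the validity check is a caller-supplied predicate (in practice a chemistry toolkit would be used). The toy strings and the toy predicate are purely illustrative.

```python
def valid_and_novel_percentages(generated, training_set, is_valid):
    """Return (percent valid, percent not present in the training set)."""
    total = len(generated)
    num_valid = sum(1 for g in generated if is_valid(g))
    num_novel = sum(1 for g in generated if g not in training_set)
    return 100.0 * num_valid / total, 100.0 * num_novel / total


# Toy data: strings stand in for canonical forms of molecular graphs, and the
# predicate rejects strings with an unclosed ring digit.
generated = ["CCO", "CCN", "C1CC", "CCO"]
training_set = {"CCO"}
valid_pct, novel_pct = valid_and_novel_percentages(
    generated, training_set, is_valid=lambda s: "1" not in s
)
```

Neither percentage says how close the generated distribution is to the real one, which is the gap the kernel-based metric is meant to fill.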
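MMD is a test statistic for determining whether two sets of samples, drawn from distributions $p$ and $q$, are derived from the same distribution (i.e., $p = q$). In particular, when its function class $\mathcal{F}$ is the unit ball in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, we can derive the squared MMD as

$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \mathbb{E}_{x, x' \sim p}[k(x, x')] - 2\,\mathbb{E}_{x \sim p,\, y \sim q}[k(x, y)] + \mathbb{E}_{y, y' \sim q}[k(y, y')], \tag{9}$$

where $k$ is the associated kernel function Gretton et al. (2012).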
Graph kernels are kernel functions over graphs, and we used the Neighborhood Subgraph Pairwise Distance Kernel (NSPDK) Costa and Grave (2010), which measures the similarity of two graphs by matching pairs of subgraphs with different radii and distances. Because NSPDK is a positive-definite kernel Costa and Grave (2010), it defines a unique RKHS Aronszajn (1950). This fact allows us to calculate the squared MMD using Equation 9. Moreover, because NSPDK considers node/edge labels, we used this graph kernel MMD (GK-MMD) as an evaluation metric for node/edge-labeled graph generation tasks. To reduce the computational cost, we sampled 100 graphs each from the generated and the real graphs and reported the average squared GK-MMD over 10 runs.
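A sample-based estimate of the squared MMD for an arbitrary kernel can be sketched as follows. The estimator shown is the simple biased one that averages over all pairs, which is an assumption about the evaluation details. NSPDK itself is not implemented here: `label_histogram_kernel` is a deliberately simple stand-in that compares normalized node-label counts, used only to make the example runnable.

```python
import math
from collections import Counter


def squared_mmd(xs, ys, kernel):
    """Biased estimate of the squared MMD between two samples under `kernel`."""
    def mean_kernel(a, b):
        return sum(kernel(u, v) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) - 2.0 * mean_kernel(xs, ys) + mean_kernel(ys, ys)


def label_histogram_kernel(g1, g2):
    """Toy positive-definite kernel: cosine similarity of node-label counts."""
    c1, c2 = Counter(g1), Counter(g2)
    dot = sum(c1[label] * c2[label] for label in c1)
    norm = math.sqrt(sum(v * v for v in c1.values()) * sum(v * v for v in c2.values()))
    return dot / norm


# Each toy "graph" is just its list of node labels.
real = [["C", "C", "O"], ["C", "O"]]
fake = [["N", "N"], ["N", "C"]]
same = squared_mmd(real, real, label_histogram_kernel)
different = squared_mmd(real, fake, label_histogram_kernel)
```

Any positive-definite graph kernel can be passed as `kernel`, so swapping in a label-aware kernel such as NSPDK changes what counts as similar without changing the estimator.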
Table 2 lists the results. GRAM and cGRAM achieved higher valid and novel percentages, and GRAM performed the best in terms of GK-MMD, which is likely due to the rich feature extraction and the consideration of the distance between nodes when estimating edges. We suppose the poor performance of cGRAM on GK-MMD is due to the restrictions on community structure because molecular graphs are relatively small.
5 Conclusion
In this work, we tackled the problem of scalability, as it is one of the most important challenges in graph generation tasks. We first defined scalability from three perspectives and then proposed a scalable graph generative model, GRAM, and its variant. In addition, we proposed a novel graph attention mechanism as a key component of the model and constructed graph kernel MMD as a general evaluation metric for node/edge-labeled graph generation tasks. In our experiments, which used protein graphs and molecular graphs, we verified the scalability and the competitive or superior performance of our models.
This work was supported by JST CREST Grant Number JPMJCR1403, Japan. We would like to thank Atsuhiro Noguchi for his helpful advice on graphs. We would also like to express our gratitude to Yujia Li for providing data of generated molecular graphs and giving permission to use it for comparison in the experiment. Finally, we would like to thank every member of the lab for beneficial discussions, which inspired our work greatly.
- Abu-El-Haija et al. (2018) Abu-El-Haija, S., Perozzi, B., Al-Rfou, R., and Alemi, A. A. (2018). Watch your step: Learning node embeddings via graph attention. In Advances in Neural Information Processing Systems 31.
- Airoldi et al. (2009) Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2009). Mixed membership stochastic blockmodels. In Advances in Neural Information Processing Systems 21.
- Albert and Barabási (2002) Albert, R. and Barabási, A.-L. (2002). Statistical mechanics of complex networks. Rev. Mod. Phys., 74:47–97.
- Aronszajn (1950) Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.
- Blondel et al. (2008) Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.
- Costa and Grave (2010) Costa, F. and Grave, K. D. (2010). Fast neighborhood subgraph pairwise distance kernel. In Proceedings of the 27th International Conference on Machine Learning.
- De Cao and Kipf (2018) De Cao, N. and Kipf, T. (2018). MolGAN: An implicit generative model for small molecular graphs. arXiv e-prints, page arXiv:1805.11973.
- Dobson and Doig (2003) Dobson, P. D. and Doig, A. J. (2003). Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783.
- Erdös and Rényi (1959) Erdös, P. and Rényi, A. (1959). On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290–297.
- Fortunato (2010) Fortunato, S. (2010). Community detection in graphs. Physics reports, 486(3-5):75–174.
- Fortunato and Castellano (2012) Fortunato, S. and Castellano, C. (2012). Community structure in graphs. Computational Complexity: Theory, Techniques, and Applications, pages 490–512.
- Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. J. Mach. Learn. Res., 13:723–773.
- Grover et al. (2018) Grover, A., Zweig, A., and Ermon, S. (2018). Graphite: Iterative Generative Modeling of Graphs. arXiv e-prints, page arXiv:1803.10459.
- Ishiguro et al. (2019) Ishiguro, K., Maeda, S.-i., and Koyama, M. (2019). Graph Warp Module: an Auxiliary Module for Boosting the Power of Graph Neural Networks. arXiv e-prints, page arXiv:1902.01020.
- Johnson et al. (2018) Johnson, J., Gupta, A., and Fei-Fei, L. (2018). Image generation from scene graphs.
- Kriegel et al. (2006) Kriegel, H.-P., Borgwardt, K. M., Gretton, A., Schölkopf, B., Rasch, M. J., and Smola, A. J. (2006). Integrating structured biological data by Kernel Maximum Mean Discrepancy. Bioinformatics, 22(14):e49–e57.
- Kusner et al. (2017) Kusner, M. J., Paige, B., and Hernández-Lobato, J. M. (2017). Grammar Variational Autoencoder. arXiv e-prints, page arXiv:1703.01925.
- Lancichinetti and Fortunato (2009) Lancichinetti, A. and Fortunato, S. (2009). Community detection algorithms: a comparative analysis. Physical review E, 80(5):056117.
- Leskovec et al. (2010) Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. (2010). Kronecker graphs: An approach to modeling networks. J. Mach. Learn. Res., 11:985–1042.
- Li et al. (2018) Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. (2018). Learning Deep Generative Models of Graphs. arXiv e-prints, page arXiv:1803.03324.
- Li et al. (2018) Li, Y., Zhang, L., and Liu, Z. (2018). Multi-objective de novo drug design with conditional graph generative model. Journal of Cheminformatics, 10(1):33.
- Liu et al. (2018) Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. (2018). Constrained graph variational autoencoders for molecule design. In Advances in Neural Information Processing Systems 31.
- Luo et al. (2018) Luo, R., Tian, F., Qin, T., Chen, E., and Liu, T.-Y. (2018). Neural architecture optimization. In Advances in Neural Information Processing Systems 31.
- Robins et al. (2007) Robins, G., Pattison, P., Kalish, Y., and Lusher, D. (2007). An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2):173–191.
- Simonovsky and Komodakis (2018) Simonovsky, M. and Komodakis, N. (2018). GraphVAE: Towards generation of small graphs using variational autoencoders. In ICANN.
- Sterling and Irwin (2015) Sterling, T. and Irwin, J. J. (2015). ZINC 15 – Ligand discovery for everyone. Journal of Chemical Information and Modeling, 55(11):2324–2337.
- Vaswani et al. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30.
- Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph Attention Networks. International Conference on Learning Representations.
- Wang et al. (2017) Wang, H., Wang, J., Wang, J., Zhao, M., Zhang, W., Zhang, F., Xie, X., and Guo, M. (2017). GraphGAN: Graph Representation Learning with Generative Adversarial Nets. arXiv e-prints, page arXiv:1711.08267.
- You et al. (2018a) You, J., Liu, B., Ying, Z., Pande, V., and Leskovec, J. (2018a). Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems 31.
- You et al. (2018b) You, J., Ying, R., Ren, X., Hamilton, W. L., and Leskovec, J. (2018b). GraphRNN: Generating realistic graphs with deep auto-regressive models. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018.
Appendix A Appendix
a.1 Complexity Analysis
In this section, we compare the training-time computational complexity of our models with that of existing models. Following Vaswani et al. (2017), we consider two criteria: the total computational complexity and the minimum number of sequential operations under parallelization. Note that we assume that all matrix-vector products can be computed in constant, i.e., $O(1)$, time. Table 3 summarizes the results.
To generate a graph with nodes and edges, DeepGMG Li et al. (2018) requires at least operations. Because the initialization of the state of a new node depends on the state in the previous step, its minimum number of sequential operations is . On the other hand, GraphRNN You et al. (2018b) requires operations with a constant value , exploiting the properties of breadth-first search and relying mainly on the hidden states of an RNN. However, its minimum number of sequential operations is because of the sequential dependency of the RNN hidden state; moreover, it cannot generate node/edge-labeled graphs.
Next, we evaluate GRAM. We denote the average number of -neighboring nodes as and reuse , defined in Section 3.3.5. Focusing on one step of the generation process, the feature extractor requires operations, considering that a graph convolution layer requires and a graph attention layer requires . In addition, the node estimator requires operations, and the edge estimator requires operations. Therefore, the total complexity of GRAM is . Note that all the estimation operations of the node estimator and the edge estimator can be parallelized during training because, unlike an RNN, our model has no sequentially dependent hidden states. As for cGRAM, we can similarly derive its total complexity as , where and denote the average numbers of nodes and edges in one community, respectively.
From the above analysis, GRAM, while retaining rich feature extraction, requires fewer operations than DeepGMG when holds, which is often the case in real-world graphs. In addition, assuming that the size of each community is nearly constant and smaller than that of the whole graph, the total complexity of cGRAM can be regarded as almost linear in , which is competitive with GraphRNN, while ours retains feature extraction. More importantly, all the estimation operations can be parallelized, which facilitates training on multiple computing nodes. Therefore, we can expect graph scalability and data scalability from our models. We also expect label scalability because our models can accommodate a variable number of labels by modifying only the dimensions of the input and output layers.
a.2 Inspection on Attention Weights in Edge Estimation
In a brief experiment on protein graphs in which we considered all of the previously generated nodes in edge estimation (i.e., its total computational complexity is ), we examined the distributions of attention weights for nodes that are predicted to have edges with the new node and for those that are not. Figure 3 shows the two distributions of attention weights at one step of edge estimation (i.e., in the calculation of ). The blue one represents the distribution of weights for nodes that are predicted not to have edges with the new node (i.e., ), and the orange one represents the distribution of weights for nodes that are predicted to have edges with the new node (i.e., ). From the figure, we can see that the former distribution is more sharply concentrated around zero than the latter, with most of its weights taking values near zero. As for molecular graphs, we observed a similar result.
a.3 Community Detection Algorithm
Here we describe the community detection algorithm used in this work. Although it is possible to include community detection into the learning framework, we did not consider this option for simplicity.
While various algorithms for detecting communities in a graph have been proposed Lancichinetti and Fortunato (2009), we use a modified version of the Louvain algorithm Blondel et al. (2008), which is simple and computationally lightweight. The Louvain algorithm detects communities by greedily maximizing modularity. Modularity is a function that measures the quality of a partition and is calculated as:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),$$
where $A_{ij}$ denotes the weight of the edge between $i$ and $j$, $k_i = \sum_j A_{ij}$ is the sum of the weights of the edges attached to $i$, $c_i$ represents the community that $i$ belongs to, and $m = \frac{1}{2} \sum_{i,j} A_{ij}$ Blondel et al. (2008). Note that $\delta(u, v) = 1$ if $u = v$ and $0$ otherwise. Specifically, the Louvain algorithm leverages the fact that the change in modularity caused by adding a node to or deleting a node from a community can be calculated easily.
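As a minimal sketch of how the modularity of Blondel et al. (2008) can be evaluated for a given partition, the Python function below groups the sum by community, which is equivalent to the pairwise definition for undirected graphs without self-loops. The function name, the edge-list input format, and the example graph are assumptions made for illustration and are not taken from our implementation.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition of an undirected weighted graph.

    edges: iterable of (u, v, weight), each undirected edge listed once
           (assumption: no self-loops).
    community: dict mapping each node to its community id.
    """
    degree = defaultdict(float)  # weighted degree of each node
    inner = defaultdict(float)   # edge weight inside each community
    total = 0.0                  # total edge weight of the graph
    for u, v, w in edges:
        degree[u] += w
        degree[v] += w
        total += w
        if community[u] == community[v]:
            inner[community[u]] += w
    if total == 0:
        return 0.0
    # Sum of weighted degrees per community.
    comm_degree = defaultdict(float)
    for node, k in degree.items():
        comm_degree[community[node]] += k
    # Per community: (inner weight / total) - (community degree / (2 * total))^2.
    return sum(
        inner[c] / total - (comm_degree[c] / (2.0 * total)) ** 2
        for c in comm_degree
    )


# Illustrative graph: two triangles joined by a single bridge edge.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
two_communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, two_communities))
```

Placing every node in a single community gives a modularity of zero under this definition, which is why a greedy maximizer prefers the two-triangle split in this example.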
However, we found that applying this algorithm as is can produce quite small communities, which makes training unstable. To alleviate this problem, we introduced coverage Fortunato (2010) as a regularization term. Coverage is defined as the ratio of intra-community edges to all edges. More precisely, we modify the objective function of the Louvain algorithm to with a regularization parameter . Note that setting yields the original Louvain algorithm and setting returns only one community without partitioning. Even with this modification, the change in the objective can still be calculated quite simply; hence its time complexity is almost the same as that of the original. We set for molecular graphs and for protein graphs.
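The coverage term can be sketched in a few lines of Python. The `coverage` function follows the definition above (ratio of intra-community edges to all edges). The `regularized_objective` function is only a hypothetical illustration: it assumes a convex combination of modularity and coverage with a parameter `alpha`, chosen because it reproduces the two limiting behaviors described above. The exact form of the modified objective and all names here are assumptions, not a statement of our implementation.

```python
def coverage(edges, community):
    """Ratio of intra-community edges to all edges.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping each node to its community id.
    """
    total = 0
    inner = 0
    for u, v in edges:
        total += 1
        if community[u] == community[v]:
            inner += 1
    return inner / total if total > 0 else 0.0


def regularized_objective(modularity_value, coverage_value, alpha):
    """HYPOTHETICAL combination, assumed for illustration only.

    alpha = 0 recovers plain modularity; alpha = 1 keeps only coverage,
    which is maximized by placing all nodes in one community.
    """
    return (1.0 - alpha) * modularity_value + alpha * coverage_value


# Illustrative graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_community = {node: "a" for node in range(6)}
print(coverage(edges, two_communities))  # 6 of 7 edges are intra-community
print(coverage(edges, one_community))    # every edge is intra-community
```

Because coverage always equals one for the trivial single-community partition, weighting it more heavily pushes a greedy optimizer toward fewer and larger communities, which is the effect the regularization is meant to have.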
a.4 Visual Inspection on Generated Molecular Graphs
For qualitative evaluation, we show molecules from the training set and those generated by each model in Figure 7. As for GRAM and DeepGMG, we could not see a significant difference from the training set, whereas cGRAM generated some small molecules, which we consider to be reflected in its poor GK-MMD performance.
a.5 Detail of Experiment Settings
For convenience of implementation, we started the generation process from a small subgraph with nodes (i.e., replace with in Equation 4). We set for protein graphs and for molecular graphs.
In the experiment on protein graphs, we used four Tesla V100 GPUs for training. We trained the models with a batch size of 2048 for 200 epochs. It took about 10 hours for GRAM and 5 hours for cGRAM. We set .
In the experiment on molecular graphs, we used two Tesla V100 GPUs for training. We trained the models with a batch size of 8192 for 40 epochs. It took about 10 hours for GRAM and 8 hours for cGRAM. We set . Although 250k samples is a small amount compared with the vast number of known molecular graphs, this training time suggests scalability to larger datasets.