GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation
Graph generative models have been extensively studied in the data mining literature. While traditional techniques are based on generating structures that adhere to a pre-decided distribution, recent techniques have shifted towards learning this distribution directly from the data. While learning-based approaches have imparted significant improvement in quality, some limitations remain to be addressed. First, learning graph distributions introduces additional computational overhead, which limits their scalability to large graph databases. Second, many techniques only learn the structure and do not address the need to also learn node and edge labels, which encode important semantic information and influence the structure itself. Third, existing techniques often incorporate domain-specific rules and lack generalizability. Fourth, the experimentation of existing techniques is not comprehensive enough due to either using weak evaluation metrics or focusing primarily on synthetic or small datasets. In this work, we develop a domain-agnostic technique called GraphGen to overcome all of these limitations. GraphGen converts graphs to sequences using minimum DFS codes. Minimum DFS codes are canonical labels and capture the graph structure precisely along with the label information. The complex joint distributions between structure and semantic labels are learned through a novel LSTM architecture. Extensive experiments on million-sized, real graph datasets show GraphGen to be 4 times faster on average than state-of-the-art techniques while being significantly better in quality across a comprehensive set of 11 different metrics. Our code is released at https://github.com/idea-iitd/graphgen.
Modeling and generating real-world graphs have applications in several domains, such as understanding interaction dynamics in social networks (Wang et al., 2018; Grover et al., 2019; Tran et al., 2019), graph classification (Ranu and Singh, 2009b; Ranu et al., 2011; Banerjee et al., 2016), and anomaly detection (Ranu and Singh, 2009a). Owing to their wide applications, development of generative models for graphs has a rich history, and many methods have been proposed. However, a majority of the techniques make assumptions about certain properties of the graph database and generate graphs that adhere to those assumed properties (Rényi, 1959; Albert and Barabási, 2002; Airoldi et al., 2008; Leskovec et al., 2010; Zhang et al., 2019). A key area that lacks development is the ability to directly learn patterns, both local and global, from a given set of observed graphs and use this knowledge to generate graphs instead of making prior assumptions. In this work, we bridge this gap by developing a domain-agnostic graph generative model for labeled graphs without making any prior assumption about the dataset. Such an approach reduces the usability barrier for non-technical users and applies to domains where distribution assumptions made by traditional techniques do not fit well, e.g., chemical compounds and protein interaction networks.
Modeling graph structures is a challenging task due to the inherent complexity of encoding both local and global dependencies in link formation among nodes. We briefly summarize how existing techniques tackle this challenge and their limitations.
Technique | Domain-agnostic | Node Labels | Edge Labels | Scalable | Complexity |
MolGAN(De Cao and Kipf, 2018) | ✗ | ✓ | ✓ | ✗ | |
NeVAE(Samanta et al., 2019) | ✗ | ✓ | ✓ | ✓ | |
GCPN(You et al., 2018a) | ✗ | ✓ | ✓ | ✗ | O |
LGGAN(Fan and Huang, 2019) | ✓ | ✓ | ✗ | ✗ | |
Graphite(Grover et al., 2019) | ✓ | ✓ | ✗ | ✗ | |
DeepGMG(Li et al., 2018a) | ✓ | ✓ | ✓ | ✗ | |
GraphRNN(You et al., 2018b) | ✓ | ✗ | ✗ | ✓ | |
GraphVAE(Simonovsky and Komodakis, 2018) | ✓ | ✓ | ✓ | ✗ | |
GRAN(Liao et al., 2019) | ✓ | ✗ | ✗ | ✗ | |
GraphGen | ✓ | ✓ | ✓ | ✓ |
Traditional graph generative models: Several models exist (Airoldi et al., 2008; Albert and Barabási, 2002; Rényi, 1959; Robins et al., 2007; Leskovec et al., 2010; Watts and Strogatz, 1998) that are engineered towards modeling a pre-selected family of graphs, such as random graphs (Rényi, 1959), small-world networks (Watts and Strogatz, 1998), and scale-free graphs (Albert and Barabási, 2002). The a priori assumption introduces several limitations. First, because the family of graphs is pre-selected, i.e., the distributions modeling some structural properties are fixed in advance, these models cannot adapt to datasets that either do not fit well or evolve to a different structural distribution. Second, these models only incorporate structural properties and do not look into the patterns encoded by labels. Third, these models assume the graph to be homogeneous. In reality, it is possible that different local communities (subgraphs) within a graph adhere to different structural distributions.
Learning-based graph generative models: To address the above outlined limitations, recent techniques have shifted towards a learning-based approach(Li et al., 2018a; You et al., 2018b; Fan and Huang, 2019; Simonovsky and Komodakis, 2018; Kawai et al., 2019; Liao et al., 2019). While impressive progress has been made through this approach, there is scope to further improve the performance of graph generative models. We outline the key areas we target in this work.
Domain-agnostic modeling: Several models have been proposed recently that target graphs from a specific domain and employ domain-specific constraints(Samanta et al., 2019; Popova et al., 2019; De Cao and Kipf, 2018; You et al., 2018a; Jin et al., 2018; Liu et al., 2018; Li et al., 2018b) in the generative task. While these techniques produce excellent results on the target domain, they do not generalize well to graphs from other domains.
Labeled graph generation: Real-world graphs, such as protein interaction networks, chemical compounds, and knowledge graphs, encode semantic information through node and edge labels. To be useful in these domains, we need a generative model that jointly captures the relationship between structure and labels (both node and edge labels).
Data Scalability: It is not uncommon to find datasets containing millions of graphs (Irwin and Shoichet, 2005). Consequently, it is important for any generative model to scale to large training datasets so that all of the information can be exploited for realistic graph generation. Many of the existing techniques do not scale to large graph databases (You et al., 2018b; Li et al., 2018a; Liao et al., 2019; Fan and Huang, 2019; Simonovsky and Komodakis, 2018). Often this non-scalability stems from dealing with an exponential number of representations of the same graph, complex neural architecture, and large parameter space. For example, LGGAN (Fan and Huang, 2019) models a graph through its adjacency matrix and therefore has $O(|V|^2)$ complexity, where $|V|$ is the number of nodes in the graph. GraphRNN (You et al., 2018b) is the most scalable among existing techniques due to employing a smaller recurrent neural network (RNN) of $O(|V|M)$ complexity, where $M$ is a hyper-parameter. In GraphRNN, the sequence representations are constructed by performing a large number of breadth-first-search (BFS) enumerations for each graph. Consequently, even if the size of the graph dataset is small, the number of unique sequences fed to the neural network is much larger, which in turn affects scalability.
Table 1 summarizes the limitations of learning-based approaches. GraphGen addresses all of the above outlined limitations and is the first technique that is domain-agnostic, assumption-free, models both labeled and unlabeled graphs, and is data scalable. Specifically, we make the following contributions.
We propose the problem of learning a graph generative model that is assumption-free, domain-agnostic, and captures the complex interplay between semantic labels and graph structure (§ 2).
To solve the proposed problem, we develop an algorithm called GraphGen that employs a unique combination of graph canonization with deep learning. Fig. 1 outlines the pipeline. Given a database of training graphs, we first construct the canonical label of each graph (Kuramochi and Karypis, 2004; Yan and Han, 2002). The canonical label of a graph is a string representation such that two graphs are isomorphic if and only if they have the same canonical label. We use DFS codes for canonization. Through canonization, the graph generative modeling task converts to a sequence modeling task, for which we use Long Short-term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) (§ 3).
We perform an extensive empirical evaluation on real million-sized graph databases spanning three different domains. Our empirical evaluation establishes that we are significantly better in quality than the state-of-the-art techniques across an array of metrics, more robust, and 4 times faster on average (§ 4).
As a notational convention, a graph is represented by the tuple $G = (V, E)$ consisting of node set $V = \{v_1, \dots, v_n\}$ and edge set $E = \{(v_i, v_j) \mid v_i, v_j \in V\}$. Graphs may be annotated with node and edge labels. Let $\mathcal{L}_n : V \rightarrow \mathbb{V}$ and $\mathcal{L}_e : E \rightarrow \mathbb{E}$ be the node and edge label mappings respectively, where $\mathbb{V}$ and $\mathbb{E}$ are the sets of all node and edge labels respectively. We use the notation $L(v)$ and $L(e)$ to denote the labels of node $v$ and edge $e$ respectively. We assume that the graph is connected and there are no self-loops.
The goal of a graph generative model is to learn a distribution $p_{model}(\mathbb{G})$ over graphs, from a given set of observed graphs $\mathbb{G} = \{G_1, G_2, \dots, G_m\}$ that is drawn from an underlying hidden distribution $p(\mathbb{G})$. Each $G_i$ could have a different number of nodes and edges, and could possibly have a different set of node labels and edge labels. The graph generative model is effective if the learned distribution is similar to the hidden distribution, i.e., $p_{model}(\mathbb{G}) \approx p(\mathbb{G})$. More simply, the learned generative model should be capable of generating similar graphs of varied sizes from the same distribution as $\mathbb{G}$ without any prior assumption about structure or labeling.
Large and variable output space: The structure of a graph containing $n$ nodes can be characterized using an $n \times n$ adjacency matrix, which means a large output space of $O(n^2)$ values. With node and edge labels, this problem becomes even more complex as the adjacency matrix is no longer binary. Furthermore, the mapping from a graph to its adjacency matrix is not one-to-one. A graph with $n$ nodes can be equivalently represented using $n!$ adjacency matrices corresponding to each possible node ordering. Finally, $n$ itself varies from graph to graph.
Joint distribution space: While we want to learn the hidden graph distribution $p(\mathbb{G})$, defining $p(\mathbb{G})$ itself is a challenge since graph structures are complex and composed of various properties such as node degree, clustering coefficients, node and edge labels, number of nodes and edges, etc. One could learn a distribution over each of these properties, but that is not enough since these distributions are not independent. The key challenge is therefore to learn the complex dependencies between various properties.
Local and global dependencies: The joint distribution space itself may not be homogeneous since the dependence between various graph parameters varies across different regions of a graph. For example, not all regions of a graph are equally dense.
We overcome these challenges through GraphGen.
Instead of using adjacency matrices to capture graph structures, we perform graph canonization to convert each graph into a unique sequence. This step provides two key advantages. First, the length of the sequence is $|E|$ for graph $G(V, E)$ instead of $|V|^2$. Although in the worst case $|E| = |V|^2$, this is rarely seen in real graphs. Second, there is a one-to-one mapping between a graph and its canonical label instead of $|V|!$ mappings. Consequently, the size of the output space is significantly reduced, which results in faster and better modeling.
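To make the many-to-one issue concrete, the following minimal sketch enumerates all node orderings of a small graph and counts how many different adjacency matrices describe it. The helper name `distinct_adjacency_matrices` and the toy path graph are illustrative assumptions, not part of GraphGen.

```python
from itertools import permutations


def distinct_adjacency_matrices(n, edges):
    """Collect every adjacency matrix obtained by re-ordering the n nodes."""
    seen = set()
    for perm in permutations(range(n)):
        # perm[u] is the row/column index assigned to node u in this ordering
        mat = [[0] * n for _ in range(n)]
        for u, v in edges:
            mat[perm[u]][perm[v]] = 1
            mat[perm[v]][perm[u]] = 1
        seen.add(tuple(map(tuple, mat)))
    return seen


# Toy input (hypothetical): an unlabeled path on 4 nodes, 0 - 1 - 2 - 3
path_edges = [(0, 1), (1, 2), (2, 3)]
matrices = distinct_adjacency_matrices(4, path_edges)
print(len(matrices))  # number of different matrices describing the same graph
```

Orderings that differ only by a symmetry of the graph produce the same matrix, so the count can be smaller than $n!$, but a model trained on adjacency matrices still has to treat all of them as the same graph.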
First, we formally define the concept of graph canonization.
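Graph $G_1 = (V_1, E_1)$ is isomorphic to $G_2 = (V_2, E_2)$ if there exists a bijection $\phi$ such that for every vertex $v \in V_1$, $\phi(v) \in V_2$, and for every edge $e = (u, v) \in E_1$, $\phi(e) = (\phi(u), \phi(v)) \in E_2$. If the graphs are labeled, we also need to ensure that the labels of mapped nodes and edges are the same, i.e., $L(v) = L(\phi(v))$ and $L(e) = L(\phi(e))$.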
The canonical label of a graph is a string representation such that two graphs have the same label if and only if they are isomorphic to each other.
We use DFS codes (Yan and Han, 2002) to construct graph canonical labels. A DFS code encodes a graph into a unique edge sequence by performing a depth-first search (DFS). The DFS traversal may start from any node in the graph. To illustrate, consider Fig. 2(a). A DFS traversal of this graph is shown in Fig. 2(b), where the bold edges show the edges that are part of the DFS traversal. During the DFS traversal, we assign each node a timestamp based on when it is discovered; the starting node has the timestamp $0$. Fig. 2(b) shows the timestamps for one such order of DFS traversal. We use the notation $t_u$ to denote the timestamp of node $u$. Any edge $(u, v)$ that is part of the DFS traversal is guaranteed to have $t_u < t_v$, and we call such an edge a forward edge. On the other hand, the edges that are not part of the DFS traversal, such as the dashed edge in Fig. 2(b), are called backward edges. In Fig. 2(b), the bold edges represent the forward edges, and the dashed edge represents the backward edge. Based on the timestamps assigned, an edge $(u, v)$ is described using a 5-tuple $(t_u, t_v, L_u, L_{(u,v)}, L_v)$, where $L_u$ and $L_{(u,v)}$ denote the node and edge labels of $u$ and $(u, v)$ respectively.
Given a DFS traversal, our goal is to impose a total ordering on all edges of the graph. Forward edges already have an ordering between them based on when they are traversed in DFS. To incorporate backward edges, we enforce the following rules:
Any backward edge $(u, u')$ must be ordered before all forward edges of the form $(u, v)$.
Any backward edge $(u, u')$ must be ordered after the forward edge of the form $(w, u)$, i.e., the first forward edge pointing to $u$.
Among the backward edges from the same node $u$ of the form $(u, u')$ and $(u, u'')$, $(u, u')$ has a higher order if $t_{u'} < t_{u''}$.
With the above rules, corresponding to any DFS traversal of a graph, we can impose a total ordering on all edges of a graph and can thus convert it to a sequence of edges that are described using their 5-tuple representations. We call this sequence representation a DFS code.
Since each graph may have multiple DFS traversals, we choose the lexicographically smallest DFS code based upon the lexicographical ordering proposed in (Yan and Han, 2002). Hereon, the lexicographically smallest DFS code is referred to as the minimum DFS code. We use the notation $\mathcal{F}(G) = S = [s_1, \dots, s_m]$ to denote the minimum DFS code of graph $G(V, E)$, where $m = |E|$.
$\mathcal{F}(G)$ is a canonical label for any graph $G$ (Yan and Han, 2002).
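As a rough illustration of canonization by minimum DFS code, the sketch below enumerates every DFS traversal of a small connected labeled graph by brute force, emits the 5-tuples in the edge order described above, and keeps the smallest code. Two caveats are assumptions of this sketch: it compares codes with Python's built-in tuple ordering as a stand-in for the DFS lexicographic order of Yan and Han (2002), which still yields a canonical label because the set of DFS codes is the same for isomorphic graphs, and the function names and dictionary-based graph format are hypothetical.

```python
def all_dfs_codes(node_label, edge_label):
    """Enumerate the DFS code of every DFS traversal of a connected graph."""
    adj = {u: set() for u in node_label}
    for u, v in edge_label:
        adj[u].add(v)
        adj[v].add(u)

    def elabel(u, v):
        return edge_label[(u, v)] if (u, v) in edge_label else edge_label[(v, u)]

    codes = []

    def extend(stack, time, code):
        if len(time) == len(node_label):  # every node has been discovered
            codes.append(tuple(code))
            return
        stack = list(stack)
        # Backtrack to the deepest node that still has an undiscovered neighbour
        # (assumes the graph is connected, so the stack never runs empty).
        while all(w in time for w in adj[stack[-1]]):
            stack.pop()
        u = stack[-1]
        for v in sorted(adj[u]):
            if v in time:
                continue
            t_v = len(time)
            # Forward edge that discovers v ...
            step = [(time[u], t_v, node_label[u], elabel(u, v), node_label[v])]
            # ... followed by v's backward edges, lowest target timestamp first.
            back = sorted((w for w in adj[v] if w in time and w != u), key=time.get)
            for w in back:
                step.append((t_v, time[w], node_label[v], elabel(v, w), node_label[w]))
            extend(stack + [v], {**time, v: t_v}, code + step)

    for start in node_label:
        extend([start], {start: 0}, [])
    return codes


def min_dfs_code(node_label, edge_label):
    """Smallest DFS code under plain tuple comparison (an assumed ordering)."""
    return min(all_dfs_codes(node_label, edge_label))


# Two isomorphic labeled triangles with different node ids (toy data).
g1 = ({0: "X", 1: "X", 2: "Y"}, {(0, 1): "a", (1, 2): "b", (0, 2): "a"})
g2 = ({2: "X", 0: "X", 1: "Y"}, {(2, 0): "a", (0, 1): "b", (2, 1): "a"})
print(min_dfs_code(*g1) == min_dfs_code(*g2))
```

Because the sketch tries every traversal, it is only practical for tiny graphs; it is meant to show what the canonical label is, not how to compute it efficiently.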
Constructing the minimum DFS code of a graph is equivalent to solving the graph isomorphism problem. Both operations have a worst-case computation cost of $O(|V|!)$ since we may need to evaluate all possible permutations of the nodes to identify the lexicographically smallest one. Therefore an important question arises: How can we claim scalability with a factorial computation complexity? To answer this question, we make the following observations.
Labeled graphs are ubiquitous: Most real graphs are labeled. Labels allow us to drastically prune the search space and identify the minimum DFS code quickly. For example, in Fig. 2(a), the only possible starting nodes are or as they contain the lexicographically smallest node label “X”. Among them, their 2-hop neighborhoods are enough to derive that must be the starting node of the minimum DFS code as is the lexicographically smallest possible tuple after in the graph in Fig. 2(a). We empirically substantiate this claim in § 4.3.2.
Invariants: What happens if the graph is unlabeled or has less diversity in labels? In such cases, vertex and edge invariants can be used as node and edge labels. Invariants are properties of nodes and edges that depend only on the structure and not on graph representations such as adjacency matrix or edge sequence. Examples include node degree, betweenness centrality, clustering coefficient, etc. We empirically study this aspect further in § 4.4.
Precise training: Existing techniques rely on the neural network to learn the multiple representations that correspond to the same graph. Many-to-one mappings introduce imprecision and bloat the modeling task with redundant information. In GraphGen, we feed precise one-to-one graph representations, which allows us to use a more lightweight neural architecture. Multiple graph representations are handled algorithmically, which leads to significant improvement in quality and scalability.
Owing to the conversion of graphs to minimum DFS codes, and given that DFS codes are canonical labels, modeling a database of graphs $\mathbb{G} = \{G_1, \dots, G_n\}$ is equivalent to modeling their sequence representations $\mathbb{S} = \{\mathcal{F}(G_1), \dots, \mathcal{F}(G_n)\}$, i.e., $p_{model}(\mathbb{G}) \equiv p_{model}(\mathbb{S})$. To model the sequential nature of the DFS codes, we use an auto-regressive model to learn $p_{model}(\mathbb{S})$. At inference time, we sample DFS codes from this distribution instead of directly sampling graphs. Since the mapping of a graph to its minimum DFS code is a bijection, i.e., $G = \mathcal{F}^{-1}(\mathcal{F}(G))$, the graph structure along with all node and edge labels can be fully constructed from the sampled DFS code.
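The inverse direction, rebuilding a labeled graph from a DFS code, can be sketched in a few lines. The function name and the dictionary-based output format are assumptions made for illustration; nodes are identified by their DFS timestamps.

```python
def graph_from_dfs_code(code):
    """Rebuild node and edge label maps from a sequence of 5-tuples."""
    node_label, edge_label = {}, {}
    for t_u, t_v, l_u, l_e, l_v in code:
        node_label[t_u] = l_u
        node_label[t_v] = l_v
        # Store each undirected edge once, keyed by (smaller, larger) timestamp.
        edge_label[(min(t_u, t_v), max(t_u, t_v))] = l_e
    return node_label, edge_label


# Toy DFS code of a labeled triangle: two forward edges, one backward edge.
code = [(0, 1, "X", "a", "X"), (1, 2, "X", "b", "Y"), (2, 0, "Y", "a", "X")]
nodes, edges = graph_from_dfs_code(code)
print(nodes)
print(edges)
```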
Fig. 3 provides an overview of the GraphGen model. For graph $G(V, E)$ with $\mathcal{F}(G) = S = [s_1, \dots, s_m]$, the proposed auto-regressive model generates $s_i$ sequentially from $i = 1$ to $i = m = |E|$. This means that our model produces a single edge at a time, and since it belongs to a DFS sequence, it can only be between two previously generated nodes (backward edge) or one of the already generated nodes and an unseen node (forward edge), in which case a new node is also formed. Consequently, only the first generated edge produces two new nodes, and all subsequent edges produce at most one new node.
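We now describe the proposed algorithm to characterize $p(S)$ in detail. Since $S$ is sequential in nature, we decompose the probability of sampling $S$ from $p(S)$ as a product of conditional distributions over its elements as follows:

$$p(S) = \prod_{i=1}^{m+1} p(s_i \mid s_1, \dots, s_{i-1}) = \prod_{i=1}^{m+1} p(s_i \mid s_{<i}) \tag{1}$$

where $m$ is the number of edges and $s_{m+1}$ is the end-of-sequence token EOS, which allows variable-length sequences. We denote $p(s_i \mid s_1, \dots, s_{i-1})$ as $p(s_i \mid s_{<i})$ in further discussions.
Recall that each element $s_i$ is an edge tuple of the form $(t_u, t_v, L_u, L_e, L_v)$, where $e = (u, v)$. We make the simplifying assumption that timestamps $t_u$, $t_v$, node labels $L_u$, $L_v$, and edge label $L_e$ at each generation step of $s_i$ are independent of each other. This makes the model tractable and easier to train. Note that this assumption does not depend on the data and hence does not impose a prior bias of any sort on the data. Mathematically, Eq. 1 reduces to the following.

$$p(s_i \mid s_{<i}) = p((t_u, t_v, L_u, L_e, L_v) \mid s_{<i}) = p(t_u \mid s_{<i}) \times p(t_v \mid s_{<i}) \times p(L_u \mid s_{<i}) \times p(L_e \mid s_{<i}) \times p(L_v \mid s_{<i}) \tag{2}$$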
Eq. 2 is extremely complex as it has to capture complete information about an upcoming edge, i.e., the nodes to which the edge is connected, their labels, and the label of the edge itself. To capture this highly complex nature of $p(s_i \mid s_{<i})$, we propose to use expressive neural networks. Recurrent neural networks (RNNs) are capable of learning features and long-term dependencies from sequential and time-series data (Salehinejad et al., 2017). Specifically, LSTMs (Hochreiter and Schmidhuber, 1997) have been one of the most popular and efficient methods for reducing the effects of vanishing and exploding gradients in training RNNs. In recent times, LSTMs and stacked LSTMs have delivered promising results in cursive handwriting (Graves et al., 2007), sentence modeling (Bowman et al., 2016), and speech recognition (Graves et al., 2013).
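In this paper, we design a custom LSTM that consists of a state transition function $f_{trans}$ (Eq. 3), an embedding function $f_{emb}$ (Eq. 3), and five separate output functions (Eqs. 4-8) for each of the five components of the 5-tuple $s_i = (t_u, t_v, L_u, L_e, L_v)$. ($\sim_M$ in the following equations represents sampling from a multinomial distribution.)

$$h_i = f_{trans}(h_{i-1}, f_{emb}(s_{i-1})) \tag{3}$$
$$t_u \sim_M \theta_{t_u} = f_{t_u}(h_i) \tag{4}$$
$$t_v \sim_M \theta_{t_v} = f_{t_v}(h_i) \tag{5}$$
$$L_u \sim_M \theta_{L_u} = f_{L_u}(h_i) \tag{6}$$
$$L_e \sim_M \theta_{L_e} = f_{L_e}(h_i) \tag{7}$$
$$L_v \sim_M \theta_{L_v} = f_{L_v}(h_i) \tag{8}$$
$$s_i = \mathrm{concat}(t_u, t_v, L_u, L_e, L_v) \tag{9}$$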
Here, $s_i$ is the concatenated component-wise one-hot encoding of the real edge, and $h_i$ is an LSTM hidden state vector that encodes the state of the graph generated so far. An embedding function $f_{emb}$ is used to compress the sparse information possessed by the large edge representation ($s_i$) to a small vector of real numbers (Eq. 3). Given the graph state $h_i$, the outputs of the functions $f_{t_u}$, $f_{t_v}$, $f_{L_u}$, $f_{L_e}$, $f_{L_v}$, i.e., $\theta_{t_u}$, $\theta_{t_v}$, $\theta_{L_u}$, $\theta_{L_e}$, $\theta_{L_v}$, respectively represent the multinomial distributions over the possibilities of the five components of the newly formed edge tuple. Finally, the components of the newly formed edge tuple are sampled from the multinomial distributions parameterized by the respective $\theta$s and concatenated to form the new edge ($s_i$) (Eq. 9).
With the above design, the key components that dictate the modeling quality are the multinomial probability distribution parameters $\theta_{t_u}$, $\theta_{t_v}$, $\theta_{L_u}$, $\theta_{L_e}$, $\theta_{L_v}$ over each edge tuple component $t_u$, $t_v$, $L_u$, $L_e$, $L_v$. Our goal is, therefore, to ensure that the learned distributions are as close as possible to the real distributions in the training data. Using the broad mathematical architecture explained above, we now explain the graph generation process, followed by the sequence modeling (training) process.
Given the learned state transition function $f_{trans}$, embedding function $f_{emb}$, and output functions $f_{t_u}$, $f_{t_v}$, $f_{L_u}$, $f_{L_e}$, $f_{L_v}$, Alg. 1 provides the pseudocode for the graph generation process. In Alg. 1, $S$ stores the generated sequence and is initialized along with other variables. The edge tuple $s_i$ formed for sequence $S$ is component-wise sampled from the multinomial distributions learned by the output functions and finally appended to the sequence $S$. This process of forming a new edge and appending it to sequence $S$ is repeated until any of the five components is sampled as an EOS. Note that since the five components of $s_i$, i.e., $t_u$, $t_v$, $L_u$, $L_e$, $L_v$, are one-hot encodings and can be of different sizes, their EOS tokens could be of different sizes, in which case each component is compared with its respective EOS to indicate the stopping condition.
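The generation loop described above can be sketched without any neural network by hiding the learned functions behind a callback. In this minimal sketch, `predict` is a hypothetical stand-in for the composition of the state transition, embedding, and output functions: it receives the sequence generated so far and returns one distribution per component. The `max_edges` cap and the `toy_predict` script are assumptions added so that the example terminates and is reproducible.

```python
import random

EOS = "EOS"
COMPONENTS = ("t_u", "t_v", "L_u", "L_e", "L_v")


def generate_dfs_code(predict, max_edges=1000, rng=None):
    """Sample edge tuples one at a time until any component is EOS."""
    rng = rng or random.Random()
    sequence = []
    while len(sequence) < max_edges:  # safety cap, not part of the paper
        theta = predict(sequence)  # component name -> {value: probability}
        edge = tuple(
            rng.choices(list(theta[c]), weights=list(theta[c].values()))[0]
            for c in COMPONENTS
        )
        if EOS in edge:  # stop as soon as any component samples EOS
            break
        sequence.append(edge)
    return sequence


def toy_predict(sequence):
    """Hypothetical 'model' that deterministically emits a 2-edge path."""
    script = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "a", "X"), (EOS,) * 5]
    target = script[len(sequence)]
    return {c: {value: 1.0} for c, value in zip(COMPONENTS, target)}


print(generate_dfs_code(toy_predict))
```

Each component is sampled independently from its own distribution, mirroring the independence assumption behind Eq. 2; a trained model would replace `toy_predict`.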
We now discuss how the required inference functions (Eqs. 3-9) are learned from a set of training graphs.
Given an input graph dataset $\mathbb{G}$, Alg. 2 first converts the graph dataset $\mathbb{G}$ to a sequence dataset $\mathbb{S}$ of minimum DFS codes. We initialize the unlearned neural functions with suitable weights. Further details about these functions are given in Sec. 3.2.5. For each edge $s_i$ in each sequence $S \in \mathbb{S}$, a concatenated component-wise one-hot encoding of the real edge $s_i$ is fed to the embedding function $f_{emb}$, which compresses it, and its compressed form along with the previous graph state is passed to the transition function $f_{trans}$. $f_{trans}$ generates the new graph state $h_i$. This new state $h_i$ is passed to all five output functions, which provide the corresponding probability distributions over the components of the new edge tuple. We concatenate these probability distributions to form $\tilde{s}_i$ (of the same size as $s_i$), which is representative of a new edge. Finally, this component-wise concatenated probability distribution is matched (component-wise) against the real tuple from the DFS code being learned. This implies that we use ground truth rather than the model's own prediction during successive time steps in training. In contrast, GraphGen uses its own predictions at graph generation time. The accuracy of the model is optimized using the cumulative binary cross-entropy loss function. Here $s[c]$ represents component $c \in \{t_u, t_v, L_u, L_e, L_v\}$, and $\log$ is taken elementwise on a vector.

$$BCE(\tilde{s}_i, s_i) = -\sum_{c} \left( s_i[c]^T \log \tilde{s}_i[c] + (1 - s_i[c])^T \log (1 - \tilde{s}_i[c]) \right) \tag{10}$$
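A cumulative binary cross-entropy of this kind, summed over the five components and over every entry of each component vector, can be written in plain Python as below. This is a sketch for a single edge tuple: the function name, the dictionary layout, and the `eps` clipping that guards against `log(0)` are assumptions, not details taken from the paper.

```python
import math


def cumulative_bce(predicted, target, eps=1e-12):
    """Sum binary cross-entropy over all components and all vector entries."""
    loss = 0.0
    for name, one_hot in target.items():
        for p, y in zip(predicted[name], one_hot):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss


# Toy example with two components only (real tuples have five).
target = {"t_u": [1, 0], "L_e": [0, 1, 0]}
predicted = {"t_u": [0.9, 0.1], "L_e": [0.2, 0.7, 0.1]}
print(cumulative_bce(predicted, target))
```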
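Variable (One-hot vector) | Dimension |
$t_u$, $t_v$ | $1 + \max_{G(V,E) \in \mathbb{G}} \lvert V \rvert$ |
$L_u$, $L_v$ | $1 + \lvert \mathbb{V} \rvert$ |
$L_e$ | $1 + \lvert \mathbb{E} \rvert$ |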
The state transition function $f_{trans}$ in our architecture is a stacked LSTM cell, each of the five output functions $f_{t_u}$, $f_{t_v}$, $f_{L_u}$, $f_{L_e}$, $f_{L_v}$ is a fully connected, independent multi-layer perceptron (MLP) with dropout, and $f_{emb}$ is a simple linear embedding function. The dimensions of the one-hot encodings of $t_u$, $t_v$, $L_u$, $L_e$, $L_v$ are estimated from the training set. Specifically, the largest values for each element are computed in the first pass while computing the minimum DFS codes of the training graphs. Table 2 shows the dimensions of the one-hot encodings for each component of $s_i$. Note that each variable has one more dimension than its intuitive size to incorporate the EOS token. Since we set the dimensions of $t_u$ and $t_v$ to the largest training graph size, in terms of the number of nodes, the largest generated graph cannot exceed the largest training graph. Similarly, setting the dimensions of $L_u$, $L_e$, and $L_v$ to the number of unique node and edge labels in the training set means that our model is bound to produce only labels that have been observed at least once. Since the size of each component of $s_i$ is fixed, the size of $s_i$, which is formed by concatenating these components, is also fixed, and therefore satisfies the prerequisite condition for LSTMs to work, i.e., each input must be of the same size. We employ weight sharing among all time steps in the LSTM to make the model scalable.
Consistent with the complexity analysis of existing models (recall Table 1), we only analyze the complexity of the forward and backward propagations. The optimization step to learn weights is not included.
The length of the sequence representation for graph $G(V, E)$ is $|E|$. All operations in the LSTM model consume $O(1)$ time per edge tuple and hence the complexity is $O(|E|)$. The generation algorithm consumes $O(1)$ time per edge tuple, and for a generated graph with edge set $E$, the complexity is $O(|E|)$.
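The fixed-size input described above, five one-hot components with one extra slot each for EOS, can be sketched as follows. The helper names, the convention that the last index of every component is reserved for EOS, and the toy label sets are assumptions for illustration.

```python
def one_hot(index, size):
    """Return a list of zeros of the given size with a single 1 at index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_edge(edge, max_nodes, node_labels, edge_labels):
    """Concatenate the one-hot encodings of the five tuple components.

    Every component gets one extra slot (its last index) reserved for EOS.
    """
    t_u, t_v, l_u, l_e, l_v = edge
    return (
        one_hot(t_u, max_nodes + 1)
        + one_hot(t_v, max_nodes + 1)
        + one_hot(node_labels.index(l_u), len(node_labels) + 1)
        + one_hot(edge_labels.index(l_e), len(edge_labels) + 1)
        + one_hot(node_labels.index(l_v), len(node_labels) + 1)
    )


# Toy vocabulary (hypothetical): at most 4 nodes, 2 node labels, 1 edge label.
vec = encode_edge((0, 1, "X", "a", "Y"), 4, ["X", "Y"], ["a"])
print(len(vec), vec)
```

The vector length depends only on the training-set statistics, not on the particular edge, which is what lets every time step share the same input size.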
In this section, we benchmark GraphGen on real graph datasets and establish that:
Scalability: GraphGen, on average, is 4 times faster than GraphRNN, which is currently the fastest generative model that works for labeled graphs. In addition, GraphGen scales better with graph size, both in quality and efficiency.
The code and datasets used for our empirical evaluation can be downloaded from https://github.com/idea-iitd/graphgen.
# | Name | Domain | No. of graphs | $\lvert V \rvert$ | $\lvert E \rvert$ | $\lvert \mathbb{V} \rvert$ | $\lvert \mathbb{E} \rvert$ |
1 | NCI-H23 (Lung)(National Center for Biotechnology Information, [n. d.]) | Chemical | 24k | [6, 50] | [6, 57] | 11 | 3 |
2 | Yeast(National Center for Biotechnology Information, [n. d.]) | Chemical | 47k | [5, 50] | [5, 57] | 11 | 3 |
3 | MOLT-4 (Leukemia)(National Center for Biotechnology Information, [n. d.]) | Chemical | 33k | [6, 111] | [6, 116] | 11 | 3 |
4 | MCF-7 (Breast)(National Center for Biotechnology Information, [n. d.]) | Chemical | 23k | [6, 111] | [6, 116] | 11 | 3 |
5 | ZINC(Irwin and Shoichet, 2005) | Chemical | 3.2M | [3, 42] | [2, 49] | 9 | 4 |
6 | Enzymes(Borgwardt et al., 2005) | Protein | 575 | [2, 125] | [2, 149] | 3 | ✗ |
7 | Citeseer(Sen et al., 2008) | Citation | 1 | 3312 | 4460 | 6 | ✗ |
8 | Cora(Sen et al., 2008) | Citation | 1 | 2708 | 5278 | 6 | ✗ |
Dataset | Model | Degree | Clustering | Orbit | NSPDK | Avg # Nodes (Gen/Gold) | Avg # Edges (Gen/Gold) | Node Label | Edge Label | Joint Node Label & Degree | Novelty | Uniqueness | Training Time | # Epochs |
Lung | GraphGen | 0.009 | 0.035 | 35.20/35.88 | 36.66/37.65 | 0.001 | 0.205 | 6h | 550 | |||||
GraphRNN | 0.103 | 0.301 | 0.043 | 0.325 | 6.32/35.88 | 6.35/37.65 | 0.193 | 0.005 | 0.836 | 1d 1h | 1900 | |||
DeepGMG | 0.123 | 0.001 | 0.026 | 0.260 | 11.04/35.88 | 10.28/37.65 | 0.083 | 0.002 | 0.842 | 23h | 20 | |||
Yeast | GraphGen | 0.006 | 0.026 | 35.58/32.11 | 36.78/33.22 | 0.001 | 0.093 | 97% | 99% | 6h | 250 | |||
GraphRNN | 0.512 | 0.153 | 0.026 | 0.597 | 26.58/32.11 | 27.01/33.22 | 0.310 | 0.002 | 0.997 | 93% | 90% | 21h | 630 | |
DeepGMG | 0.056 | 0.002 | 0.008 | 0.239 | 34.91/32.11 | 35.08/33.22 | 0.115 | 0.967 | 90% | 89% | 2d 3h | 18 | ||
Mixed: Lung + Leukemia + Yeast + Breast | GraphGen | 0.005 | 0.023 | 37.87/37.6 | 39.24/39.14 | 0.001 | 0.140 | 97% | 99% | 11h | 350 | |||
GraphRNN | 0.241 | 0.039 | 0.331 | 8.19/37.6 | 7.2/39.14 | 0.102 | 0.010 | 0.879 | 62% | 52% | 23h | 400 | ||
Deep GMG | 0.074 | 0.002 | 0.002 | 0.221 | 24.35/37.6 | 24.11/39.14 | 0.092 | 0.002 | 0.912 | 83% | 83% | 1d 23h | 11 | |
Enzymes | GraphGen | 0.243 | 0.198 | 0.016 | 0.051 | 32.44 /32.95 | 52.83/64.15 | 0.005 | - | 0.249 | 98% | 99% | 3h | 4000 |
GraphRNN | 0.090 | 0.151 | 0.038 | 0.067 | 11.87/32.95 | 23.52/64.15 | 0.048 | - | 0.312 | 99% | 97% | 15h | 20900 | |
Citeseer | GraphGen | 0.089 | 0.083 | 0.100 | 0.020 | 35.99/48.56 | 41.81/59.14 | 0.024 | - | 0.032 | 83% | 95% | 4h | 400 |
GraphRNN | 1.321 | 0.665 | 1.006 | 0.052 | 31.42/48.56 | 77.13/59.14 | 0.063 | - | 0.035 | 62% | 100% | 12h | 1450 | |
Cora | GraphGen | 0.061 | 0.117 | 0.089 | 0.012 | 48.66/58.39 | 55.82/68.61 | 0.017 | - | 0.013 | 91% | 98% | 4h | 400 |
GraphRNN | 1.125 | 1.002 | 0.427 | 0.093 | 54.01/58.39 | 226.46/68.61 | 0.085 | - | 0.015 | 70% | 93% | 9h | 400 |
All experiments are performed on a machine running dual Intel Xeon Gold 6142 processors with 16 physical cores each, one Nvidia 1080 Ti GPU card with 12GB GPU memory, and 384 GB RAM, with the Ubuntu 16.04 operating system.
Table 3 lists the various datasets used for our empirical evaluation. The semantics of the datasets are as follows.
Chemical compounds: The first five datasets are chemical compounds. We convert them to labeled graphs where nodes represent atoms, edges represent bonds, node labels denote the atom-type, and edge labels encode the bond order such as single bond, double bond, etc.
Citation graphs: Cora and Citeseer are citation networks; nodes correspond to publications and an edge represents one paper citing the other. Node labels represent the publication area.
Enzymes: This dataset contains protein tertiary structures representing enzymes from the BRENDA enzyme database (Schomburg et al., 2004). Nodes in a graph (protein) represent secondary structure elements, and two nodes are connected if the corresponding elements are interacting. The node labels indicate the type of secondary structure, which is either helix, turn, or sheet. This is an interesting dataset since it contains only three label types. We utilize this dataset to showcase the impact of graph invariants. Specifically, we supplement the node labels with node degrees. For example, if a node has label “A” and degree 5, we alter the label to “5, A”.
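Supplementing node labels with the degree invariant, as in the “5, A” example, takes only a few lines. The function name and the input format below are assumptions for illustration.

```python
def add_degree_to_labels(node_label, edges):
    """Prefix each node label with the node's degree, e.g. 'A' -> '5, A'."""
    degree = {u: 0 for u in node_label}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {u: f"{degree[u]}, {label}" for u, label in node_label.items()}


# Toy star graph: node 0 (label "A") is connected to five leaves (label "B").
labels = {0: "A", 1: "B", 2: "B", 3: "B", 4: "B", 5: "B"}
star_edges = [(0, leaf) for leaf in range(1, 6)]
print(add_degree_to_labels(labels, star_edges))
```

The enriched labels are more diverse than the original three label types, which gives the minimum DFS code search more to prune on.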
We compare GraphGen with DeepGMG (Li et al., 2018a) and GraphRNN (You et al., 2018b). For DeepGMG, we modified the implementation for unlabeled graphs provided by Deep Graph Library (DGL). For GraphRNN, we use the original code released by the authors. Since this code does not support labels, we extend the model to incorporate node and edge labels based on the discussion provided in the paper (You et al., 2018b).
For both GraphRNN and DeepGMG, we use the parameters recommended in the respective papers. Estimation of $M$ in GraphRNN and evaluation of DFS codes from the graph database in GraphGen are done in parallel using 48 threads. We adapted the minimum DFS code implementation from kaviniitm.
For GraphGen, we use 4 layers of LSTM cells for with hidden state dimension of and the dimension of is set to . Hidden layers of size are used in MLPs for , , , , . We use Adam optimizer with a batch size of for training. We use a dropout of
in MLP and LSTM layers. Furthermore, we use gradient clipping to remove exploding gradients and L2 regularizer to avoid over-fitting.
To evaluate the performance in any dataset, we split it into three parts: the train set, validation set, and test set. Unless specifically mentioned, the default split among training, validation and test is , , of graphs respectively. We stop the training when validation loss is minimized or less than change in validation loss is observed over an extended number of epochs. Note that both Citeseer and Cora represent a single graph. To form the training set in these datasets, we sample subgraphs by performing random walk with restarts from multiple nodes. More specifically, to sample a subgraph, we choose a node with probability proportional to its degree. Next, we initiate random walk with restarts with restart probability . The random walks stop after iterations. Any edge that is sampled at least once during random walk with restarts is part of the sampled subgraph. This process is then repeated times to form a dataset of graphs. The sizes of the sampled subgraphs range from and in Citeseer and and in Cora.
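The subgraph sampling procedure for the single-graph datasets can be sketched as a random walk with restarts. The function name, the adjacency-dictionary input, and all parameter values in the usage lines are placeholders chosen for illustration, not the settings used in the experiments.

```python
import random


def sample_subgraph(adj, restart_prob, num_steps, rng):
    """Return the set of edges visited by one random walk with restarts."""
    nodes = list(adj)
    # Choose the start node with probability proportional to its degree.
    start = rng.choices(nodes, weights=[len(adj[u]) for u in nodes])[0]
    current, edges = start, set()
    for _ in range(num_steps):
        if rng.random() < restart_prob:
            current = start  # restart: jump back to the start node
            continue
        nxt = rng.choice(sorted(adj[current]))
        edges.add((min(current, nxt), max(current, nxt)))
        current = nxt
    return edges


# Toy graph (hypothetical): a cycle on 5 nodes.
cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
rng = random.Random(0)
dataset = [sample_subgraph(cycle, 0.5, 20, rng) for _ in range(3)]
print(dataset)
```

Every sampled edge is reachable from the start node along the walk, so each sampled subgraph is connected, which matches the assumption made on input graphs.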
The metrics used in the experiments can be classified into the following categories:
Structural metrics: We use the metrics used by GraphRNN(You et al., 2018b) for structural evaluation: (1) node degree distribution, (2) clustering coefficient distribution of nodes, and (3) orbit count distribution, which measures the number of all orbits with nodes. Orbits capture higher-level motifs that are shared between generated and test graphs. The closer the distributions between the generated and test graphs, the better is the quality. In addition, we also compare the graph size of the generated graphs with the test graphs in terms of (4) Average number of nodes and (5) Average number of edges.
Label Accounting metrics: We compare the distribution of (1) node labels, (2) edge labels, and (3) joint distribution of node labels and degree in generated and test graphs.
Graph Similarity: To capture the similarity of generated graphs in a more holistic manner that considers both structure and labels, we use Neighbourhood Sub-graph Pairwise Distance Kernel (NSPDK)(Costa and Grave, 2010). NSPDK measures the distance between two graphs by matching pairs of subgraphs with different radii and distances. The lower the NSPDK distance, the better the performance. This quality metric is arguably the most important since it captures the global similarity instead of local individual properties.
Redundancy checks: Consider a generative model that generates the exact same graphs that it saw in the training set. Although this model will perform very well in the above metrics, such a model is practically useless. Ideally, we would want the generated graphs to be diverse and similar, but not identical. To quantify this aspect, we check (1) Novelty, which measures the percentage of generated graphs that are not subgraphs of the training graphs and vice versa. Note that identical graphs are subgraph isomorphic to each other. In other words, novelty checks if the model has learned to generalize to unseen graphs. We also compute (2) Uniqueness, which captures the diversity in generated graphs. To compute Uniqueness, we first remove the generated graphs that are subgraph isomorphic to some other generated graphs. The percentage of graphs remaining after this operation is defined as Uniqueness. For example, if the model generates graphs, all of which are identical, the uniqueness is .
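To make the two redundancy checks concrete, the following is a minimal sketch under stated assumptions. The graph representation (a pair of dictionaries for node and edge labels), the brute-force matcher `is_subgraph_isomorphic` (exponential time, suitable only for tiny graphs), the use of non-induced subgraph matching, and the sequential removal order in `uniqueness` are all illustrative choices, not the procedure used in the experiments.

```python
from itertools import permutations


def _edge_matches(edge, label, mapping, big_edges):
    """Check that a mapped edge exists in the larger graph with the same label."""
    image = frozenset(mapping[v] for v in edge)
    return image in big_edges and big_edges[image] == label


def is_subgraph_isomorphic(small, big):
    """True if `small` embeds into `big` while preserving node and edge labels.

    A graph is (node_labels, edge_labels): node -> label and
    frozenset({u, v}) -> label. Brute force over injective node mappings.
    """
    s_nodes, s_edges = small
    b_nodes, b_edges = big
    if len(s_nodes) > len(b_nodes) or len(s_edges) > len(b_edges):
        return False
    s_ids = list(s_nodes)
    for image in permutations(b_nodes, len(s_ids)):
        mapping = dict(zip(s_ids, image))
        if any(s_nodes[v] != b_nodes[mapping[v]] for v in s_ids):
            continue
        if all(_edge_matches(e, lab, mapping, b_edges)
               for e, lab in s_edges.items()):
            return True
    return False


def novelty(generated, training):
    """Percentage of generated graphs unrelated (in either direction) to training graphs."""
    def seen(g):
        return any(is_subgraph_isomorphic(g, t) or is_subgraph_isomorphic(t, g)
                   for t in training)
    return 100.0 * sum(not seen(g) for g in generated) / len(generated)


def uniqueness(generated):
    """Percentage of graphs left after dropping those contained in another remaining graph."""
    kept = list(generated)
    i = 0
    while i < len(kept):
        others = kept[:i] + kept[i + 1:]
        if any(is_subgraph_isomorphic(kept[i], h) for h in others):
            kept.pop(i)  # redundant: contained in a graph that is still kept
        else:
            i += 1
    return 100.0 * len(kept) / len(generated)
```

Under this reading, a batch of identical graphs keeps exactly one survivor, because each copy is removed only while another copy remains in the set.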
To compute the distance between two distributions, like in GraphRNN(You et al., 2018b), we use Maximum Mean Discrepancy (MMD)(Gretton et al., 2012). To quantify quality using a particular metric, we generate graphs and compare them with a random sample of test graphs. On all datasets except Enzymes, we report the average of runs of computing the metric, comparing graphs in a single run. Since the Enzymes dataset contains only graphs, we sample graphs randomly from the test set times and report the average.
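As an illustration of how MMD compares a set of generated graphs with a set of test graphs, the sketch below computes the biased estimate of the squared MMD, E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)], over per-graph degree histograms. The Gaussian kernel on the squared L2 distance, the bandwidth `sigma`, and the helper names are assumptions made for this example; the kernel used in the experiments may differ.

```python
import math


def degree_histogram(adj):
    """Normalized degree histogram of one graph (adj: node -> neighbours)."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    counts = [0] * (max(degrees) + 1)
    for d in degrees:
        counts[d] += 1
    return [c / len(degrees) for c in counts]


def gaussian_kernel(p, q, sigma=1.0):
    """Gaussian kernel between two histograms, zero-padded to equal length."""
    n = max(len(p), len(q))
    p = list(p) + [0.0] * (n - len(p))
    q = list(q) + [0.0] * (n - len(q))
    sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of squared MMD between two samples of histograms."""
    return (mean_kernel(xs, xs, kernel)
            + mean_kernel(ys, ys, kernel)
            - 2.0 * mean_kernel(xs, ys, kernel))
```

Here `xs` would hold one histogram per generated graph and `ys` one per test graph; a value near zero indicates that the two sets of graphs have similar degree distributions.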
Table 4 presents the quality achieved by all three benchmarked techniques across different metrics on datasets. We highlight the key observations that emerge from these experiments. Note that some of the generated graphs may not adhere to the structural properties assumed in § 2, i.e., no self loops, multiple edges, or disconnected components, so we prune all the self edges and take the maximum connected component of each generated graph.
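The clean-up step described above can be sketched as follows. The function name `clean_generated_graph` and the edge-list input are hypothetical, labels are omitted for brevity, and collapsing repeated edges into one is an additional assumption of this sketch.

```python
from collections import deque


def clean_generated_graph(edges):
    """Drop self loops and keep only the largest connected component.

    edges: iterable of (u, v) pairs, possibly with self loops and repeats.
    Returns the surviving edges as a set of frozensets.
    """
    # Remove self loops; the set also collapses repeated edges (assumption).
    simple = {frozenset((u, v)) for u, v in edges if u != v}

    adj = {}
    for edge in simple:
        u, v = tuple(edge)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Breadth-first search to find the component with the most nodes.
    best, visited = set(), set()
    for source in adj:
        if source in visited:
            continue
        component = {source}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        visited |= component
        if len(component) > len(best):
            best = component

    return {edge for edge in simple if edge <= best}
```

For example, the edge list `[(0, 1), (1, 2), (2, 2), (3, 4)]` reduces to the two edges of the path 0-1-2: the self loop on node 2 is pruned and the smaller component {3, 4} is discarded.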
Graph and Sub-graph level similarity: On the NSPDK metric, which is the most important because it captures the global similarity of generated graphs with test graphs, GraphGen is significantly better across all datasets. The same trend carries over to orbit count distributions. Orbit count captures the presence of motifs also seen in the test set. Together, these two metrics establish that GraphGen models graphs better than GraphRNN and DeepGMG.
Node-level metrics: Among the other metrics, GraphGen remains the dominant performer. To be more precise, GraphGen is consistently the best in both Node and Edge Label preservation, as well as graph size in terms of the number of edges. It even performs the best on the joint Node Label and Degree metric, indicating its superiority in capturing structural and semantic information together. The performance of GraphGen is comparatively less impressive in the Enzymes dataset, where GraphRNN marginally outperforms it in the Degree and Clustering Coefficient metrics. Nonetheless, even in Enzymes, among the eight metrics, GraphGen is better in six. This result also highlights the need to not rely on only node-level metrics. Specifically, although GraphRNN models the node degrees and clustering coefficients well, it generates graphs that are much smaller in size and hence the other metrics, including NSPDK, suffer.
Novelty and Uniqueness: Across all datasets, GraphGen has uniqueness of at least , which means it does not generate the same graph multiple times. In contrast, GraphRNN has a significantly lower uniqueness in chemical compound datasets. In Lung, only of the generated graphs are unique. In several of the datasets, GraphRNN also has low novelty, indicating it regenerates graphs (or subgraphs) it saw during training. For example, in Cora, Citeseer, and the mixed chemical dataset, at least of the generated graphs are regenerated from the training set. While we cannot pinpoint the reason behind this performance, training on random graph sequence representations could be a factor. More specifically, even though the generative model may generate new sequences, they may correspond to the same graph. While this is possible in GraphGen as well, the likelihood is much lower as it is trained on DFS codes that enable one-to-one mapping with graphs.
Analysis of GraphRNN and DeepGMG: Among GraphRNN and DeepGMG, DeepGMG generally performs better in most metrics. However, DeepGMG is extremely slow and fails to scale on larger networks due to complexity.
GraphRNN’s major weakness is in learning graph sizes. As visible in Table 4, GraphRNN consistently generates graphs that are much smaller than the test graphs. We also observe that this issue arises only in labeled graphs. If the labels are removed while retaining their structures, GraphRNN generates graphs that correctly mirror the test graph size distribution. This result clearly highlights that labels introduce an additional layer of complexity that simple extensions of models built for unlabeled graphs do not solve.
To gain visual feedback on the quality of the generated graphs, in Fig. 4, we present a random sample of graphs from the training set of Lung dataset and those generated by each of the techniques. Visually, GraphGen looks the most similar, while GraphRNN is the most dissimilar. This result is consistent with the quantitative results obtained in Table 4. For example, consistent with our earlier observations, the GraphRNN graphs are much smaller in size and lack larger motifs like benzene rings. In contrast, GraphGen’s graphs are of similar sizes and structures.
In Fig. 5, we perform the same exercise as above with randomly picked graphs from the Cora dataset and those generated by GraphGen and GraphRNN. Since the graphs in this dataset are much larger, it is hard to comment if GraphGen is better than GraphRNN through visual inspection; the layout of a graph may bias our minds. However, we notice one aspect where GraphGen performs well. Since Cora is a citation network, densely connected nodes (communities) have a high affinity towards working in the same publication area (node label/color). This correlation of label homogeneity with communities is also visible in GraphGen(Fig. 5(b)).
After establishing the clear superiority of GraphGen in quality for labeled graph generative modeling, we turn our focus to scalability. To enable modeling of large graph databases, the training must complete within a reasonable time span. The penultimate column in Table 4 sheds light on the scalability of the three techniques. As clearly visible, GraphGen is to times faster than GraphRNN, depending on the dataset. DeepGMG is clearly the slowest, which is consistent with its theoretical complexity of . If we exclude the sequence generation aspect, GraphRNN has a lower time complexity for training than GraphGen; while GraphRNN is linear in the number of nodes in the graph, GraphGen is linear in the number of edges. However, GraphGen is still able to achieve better performance due to training on canonical labels of graphs. In contrast, GraphRNN generates a random sequence representation for a graph in each epoch. Consequently, as shown in the last column of Table 4, GraphRNN runs for a far larger number of epochs to achieve loss minimization in the validation set. Note that DeepGMG converges within the minimum number of epochs. However, the time per epoch of DeepGMG is to times higher than GraphGen. Compared to GraphRNN, the time per epoch of GraphGen is faster on average.
To gain a deeper understanding of the scalability of GraphGen, we next measure the impact of training set size on performance. Figs. 7(a)-7(b) present the growth of training times on ZINC and Cora against the number of graphs in the training set. While ZINC contains 3.2 million graphs, in Cora, we sample up to 1.2 million subgraphs from the original network for the training set. As visible, both GraphRNN and DeepGMG have a much steeper growth rate than GraphGen. In fact, both these techniques fail to finish within 3 days for datasets exceeding 100,000 graphs. In contrast, GraphGen finishes within two and a half days, even on a training set exceeding million graphs. The training times of all techniques are higher in Cora since the graphs in this dataset are much larger in size. We also observe that the growth rate of GraphGen slows at larger training sizes. On investigating further, we observe that with larger training sets, the number of epochs required to reach the validation loss minimum reduces. Consequently, we observe the trend visible in Figs. 7(a)-7(b). This result is not surprising since the learning per epoch is higher with larger training sets.
An obvious question arises at this juncture: Does lack of scalability to larger training sets hurt the quality of GraphRNN and DeepGMG? Figs. 7(c)-7(e) answer this question. In both ZINC and Cora, we see a steep improvement in the quality (reduction in NSPDK and Orbit) of GraphRNN as the training size is increased from graphs to graphs. The improvement rate of GraphGen is relatively milder, which indicates that GraphRNN has a greater need for larger training sets. However, this requirement is not met as GraphRNN fails to finish within days, even for graphs. Overall, this result establishes that scalability is not only important for higher productivity but also improved modeling.
We next evaluate how the size of the graph itself affects the quality and training time. Towards that end, we partition the graphs in a dataset into multiple buckets based on their size. Next, we train and test on each of these buckets individually and measure the impact on quality and efficiency. Figs. 7(f) and 7(h) present the impact on quality in chemical compounds (same as Mixed dataset in Table 4) and Cora respectively. In both datasets, each bucket contains exactly graphs. The size of graphs in each bucket is however different, as shown in the -axis of Figs. 7(f) and 7(h). Note that Cora contains larger graphs since the network itself is much larger than chemical compounds. The result for DeepGMG is missing on the larger graph-size buckets since it fails to model large graphs. GraphRNN runs out of GPU memory () in the largest bucket size in Cora.
As visible, there is a clear drop in quality (increase in NSPDK) as the graph sizes grow. This result is expected since larger graphs involve larger output spaces, and the larger the output space, the more complex the modeling task. Furthermore, both GraphGen and GraphRNN convert graph modeling into sequence modeling. It is well known from the literature that auto-regressive neural models struggle to learn long-term dependencies in sequences. When the sequence lengths are large, this limitation impacts the quality. Nonetheless, GraphGen has a more gradual decrease in quality compared to GraphRNN and DeepGMG.
Next, we analyze the impact of graph size on the training time. Theoretically, we expect the training time to grow, and this pattern is reflected in Figs. 7(g) and 7(i). Note that the -axis is in log scale in these plots. Consistent with previous results, GraphGen is faster. In addition to the overall running time, we also show the average time taken by GraphGen in computing the minimum DFS code. Recall that the worst-case complexity of this task is factorial in the number of nodes in the graph. However, we earlier argued that in practice, the running times are much smaller. This experiment adds credence to this claim by showing that DFS code generation is a small portion of the total running time. For example, in the largest size bucket of chemical compounds, DFS code generation of all graphs takes seconds out of the total training time of hours. In Cora, on the largest bucket size of edges, the DFS construction component consumes minutes out of the total training time of hours. GraphRNN has a similar pre-processing component as well (GraphRNN-M), where it generates a large number of BFS sequences to estimate a hyper-parameter. We find that this pre-processing time, although of polynomial-time complexity, is slower than minimum DFS code computation in chemical compounds. Overall, the results of this experiment support the proposed strategy of feeding canonical labels to the neural model instead of random sequence representations.
Minimum DFS code generation relies on the node and edge labels to prune the search space and identify the lexicographically smallest code quickly. When graphs are unlabeled or have very few labels, this process may become more expensive. In such situations, labels based on vertex invariants allow us to augment existing labels. To showcase this aspect, we use the Enzymes dataset, which has only node labels and no edge labels. We study the training time and the impact on quality based on the vertex invariants used to label nodes. We study four combinations: (1) existing node labels only, (2) node label + node degree, (3) node label + clustering coefficient (CC), and (4) label + degree + CC. Fig. 6 presents the results. As visible, there is a steady decrease in the DFS code generation time as more invariants are used. No such trend, however, is visible in the total training time, as additional features may either increase or decrease the time taken by the optimizer to find the validation loss minimum. From the quality perspective, both NSPDK and Orbit show improvement (reduction) through the use of invariants. Between degree and CC, degree provides slightly improved results. Overall, this experiment shows that invariants are useful in both improving the quality and reducing the cost of minimum DFS code generation.
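The four labeling combinations studied above can be illustrated with a small sketch that appends vertex invariants to existing node labels. The function names, the string format of the augmented label, and in particular the discretization of the clustering coefficient into `cc_bins` equal-width bins are assumptions of this example; the text does not specify how CC values are turned into discrete labels.

```python
def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbours that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))


def augment_labels(adj, labels, use_degree=True, use_cc=True, cc_bins=5):
    """Return new node labels extended with degree and/or binned CC.

    adj: node -> set of neighbours; labels: node -> original label.
    """
    augmented = {}
    for v in adj:
        parts = [str(labels[v])]
        if use_degree:
            parts.append(f"d{len(adj[v])}")
        if use_cc:
            # Equal-width binning of CC in [0, 1] (hypothetical choice).
            cc = clustering_coefficient(adj, v)
            parts.append(f"c{min(int(cc * cc_bins), cc_bins - 1)}")
        augmented[v] = "|".join(parts)
    return augmented
```

Because isomorphic nodes always share the same degree and clustering coefficient, the augmented labels remain consistent across isomorphic graphs while giving the DFS code search more distinct labels to prune on.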
For deep, auto-regressive modeling of sequences, we design an LSTM tailored for DFS codes. Can other neural architectures be used instead? Since a large volume of work has been done in the NLP community on sentence modeling, can those algorithms be used by treating each edge tuple as a word? To answer these questions, we replace LSTM with SentenceVAE(Bowman et al., 2016) and measure the performance. We present the results only in Lung and Cora due to space constraints; a similar trend is observed in other datasets as well. In Table 5, we present the two metrics that best summarize the impact of this experiment. As clearly visible, SentenceVAE introduces a huge increase in NSPDK distance, which means inferior quality. An even more significant result is observed in Uniqueness, which is close to for SentenceVAE. This means the same structure is repeatedly generated by SentenceVAE. In DFS codes, every tuple is unique due to the timestamps, and hence sentence modeling techniques that rely on repetition of words and phrases fail to perform well. While this result does not say that LSTMs are the only choice, it does indicate the need for architectures that are custom-made for DFS codes.
Technique | Lung NSPDK | Lung Uniqueness | Cora NSPDK | Cora Uniqueness
LSTM | | | |
SentenceVAE | | | |
In this paper, we studied the problem of learning generative models for labeled graphs in a domain-agnostic and scalable manner. There are two key takeaways from the conducted study. First, existing techniques model graphs either through their adjacency matrices or sequence representations. In both cases, the mapping from a graph to its representation is one-to-many. Existing models feed multiple representations of the same graph to the model and rely on the model overcoming this information redundancy. In contrast, we construct canonical labels of graphs in the form of minimum DFS codes and reduce the learning complexity. Canonical label construction has non-polynomial time complexity, which could be a factor behind this approach not being adequately explored. Our study shows that although the worst-case complexity is factorial in the number of nodes, by exploiting node/edge labels and vertex invariants, DFS-code construction is a small component of the overall training time (Figs. 7(g) and 7(i)). Consequently, the time complexity of graph canonization is not a practical concern, and feeding more precise graph representations to the training model is a better approach in terms of quality and scalability.
The second takeaway is the importance of quality metrics. Evaluating graph quality is a complicated task due to the multi-faceted nature of labeled structures. It is, therefore, important to deploy enough metrics to cover all aspects of graphs, such as local node-level properties, structural properties in the form of motifs and global similarity, graph sizes, node and edge labels, and redundancy in the generated graphs. A learned model is effective only if it performs well across all of the metrics. As shown in this study, GraphGen satisfies this criterion.
The conducted study does not solve all problems related to graph modeling. Several graphs today are annotated with feature vectors. Examples include property graphs and knowledge graphs. Can we extend label modeling to feature vectors? All of the existing learning-based techniques, including GraphGen, cannot scale to graphs containing millions of nodes. Can we overcome this limitation? We plan to explore these questions further by extending the platform provided by GraphGen.