Graphs are ubiquitous in the real world, representing objects and their relationships in settings such as social networks, citation networks, biological networks, and traffic networks. Graphs are also known to have complicated structures that carry rich underlying value (Barabási et al., 2016). Tremendous effort has been made in this area, resulting in a rich literature of methods for various kinds of graph problems, which can be categorized into two types: 1) predicting and analyzing patterns on given graphs, and 2) learning the distributions of given graphs and generating novel graphs. The first type covers many research areas including node classification, graph classification, link prediction, and community detection. Over the past few decades, a significant amount of work has been done in this domain. More recently, representation learning methods, such as deep neural networks for graphs, have also been applied to these tasks. In contrast, the second type concerns the graph generation problem, which is the focus of this paper.
Graph generation entails modeling and generating real-world graphs, and it has applications in several domains, such as understanding interaction dynamics in social networks (Grover et al., 2019; Wang et al., 2018b; Tran et al., 2019), link prediction (Kipf and Welling, 2017; Salha et al., 2019), and anomaly detection (Ranu and Singh, 2009). Owing to its many applications, the development of generative models for graphs has a rich history, resulting in famous models such as random graphs, small-world models, stochastic block models, and Bayesian network models, which generate graphs based on a priori structural assumptions (Newman, 2018). These graph generation models (Albert and Barabási, 2002; Leskovec et al., 2010; Robins et al., 2007) are engineered towards modeling a pre-selected family of graphs, such as random graphs (Erdös et al., 1959), small-world networks (Watts and Strogatz, 1998), and scale-free graphs (Albert and Barabási, 2002). However, they have limitations. First, due to their simplicity and hand-crafted nature, these random graph models generally have limited capacity to model complex dependencies and are only capable of modeling a few statistical properties of graphs. For example, Erdős–Rényi graphs do not have the heavy-tailed degree distribution that is typical of many real-world networks. Second, the reliance on a priori assumptions prevents these traditional techniques from being applied across a wider range of domains, where a priori knowledge of the graphs is not always available.
Considering the limitations of traditional graph generation techniques, a key open challenge is developing methods that can directly learn generative models from an observed set of graphs. Developing generative models that can learn directly from data is an important step towards improving the fidelity of generated graphs, and it paves the way for new kinds of applications, such as novel drug discovery (Popova et al., 2019; You et al., 2018b) and protein structure modeling (Bacciu et al., 2019; Anand and Huang, 2018; Fan and Huang, 2019). Recent advances in deep generative models, such as variational autoencoders (VAE) (Kingma and Welling, 2014) and generative adversarial networks (GAN) (Goodfellow et al., 2014), have brought important progress in generative modeling for complex domains, such as image and text data. Building on these approaches, a number of deep learning models for generating graphs have been proposed, establishing the promising area of Deep Generative Models for Graph Generation, which is the focus of this survey.
1.1. Formal Problem Definition
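A graph is defined as $G = (\mathcal{V}, \mathcal{E}, F, E)$, where $\mathcal{V}$ is the set of $N$ nodes, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of $M$ edges. $e_{i,j} \in \mathcal{E}$ is an edge connecting nodes $v_i, v_j \in \mathcal{V}$. The graph can be conveniently described in matrix or tensor form using its (weighted) adjacency matrix $A$. If the graph is node-attributed or edge-attributed, there is a node attribute matrix $F \in \mathbb{R}^{N \times D}$ assigning attributes to each node $v_i$, or an edge attribute tensor $E \in \mathbb{R}^{N \times N \times K}$ assigning attributes to each edge $e_{i,j}$. $K$ is the dimension of the edge attributes, and $D$ is the dimension of the node attributes.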
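As an illustration of the matrix and tensor forms described above, the following minimal Python sketch builds an adjacency matrix, a node attribute matrix, and an edge attribute tensor for a small undirected graph. The helper name `build_graph_tensors`, the toy attribute values, and the use of nested lists in place of a tensor library are assumptions made for this example only.

```python
def build_graph_tensors(num_nodes, edges, node_attrs, edge_attrs, edge_dim):
    """Return (A, F, E) for an undirected graph.

    A: num_nodes x num_nodes adjacency matrix with 0/1 entries
    F: num_nodes x D node attribute matrix
    E: num_nodes x num_nodes x K edge attribute tensor (K = edge_dim)
    """
    A = [[0] * num_nodes for _ in range(num_nodes)]
    E = [[[0.0] * edge_dim for _ in range(num_nodes)] for _ in range(num_nodes)]
    for (i, j) in edges:
        # Undirected graph: mirror every entry across the diagonal.
        A[i][j] = A[j][i] = 1
        E[i][j] = list(edge_attrs[(i, j)])
        E[j][i] = list(edge_attrs[(i, j)])
    F = [list(node_attrs[i]) for i in range(num_nodes)]
    return A, F, E


# Toy path graph 0 - 1 - 2 with D = 2 node attributes and K = 1 edge attribute.
edges = [(0, 1), (1, 2)]
node_attrs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
edge_attrs = {(0, 1): [1.0], (1, 2): [2.0]}
A, F, E = build_graph_tensors(3, edges, node_attrs, edge_attrs, edge_dim=1)
```

Node pairs without an edge keep an all-zero attribute vector in `E`, which is one possible convention rather than a requirement of the definition.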
Given a set of observed graphs $\mathcal{G} = \{G_1, \dots, G_s\}$ sampled from the data distribution $p(G)$, where each graph $G_i$ may have a different number of nodes and edges, the goal of learning generative models for graphs is to learn the distribution of the observed set of graphs. By sampling a graph $G \sim p_{model}(G)$, new graphs can then be obtained; this is known as deep graph generation, the short form of deep generative models for graph generation. Sometimes, the generation process can be conditioned on additional information $y$, such that $G \sim p_{model}(G \mid y)$, in order to provide extra control over the graph generation results. The generation process with such conditions is called conditional deep graph generation.
The development of deep generative models for graphs poses unique challenges, and numerous research works in recent years have sought to address them. The main challenges are listed below.
Non-unique Representations. In general deep graph generation, the aim is to learn the distribution of possible graph structures without assuming a fixed set of nodes (e.g., to generate candidate molecules of varying sizes). In this general setting, a graph with $n$ nodes can be represented by up to $n!$ equivalent adjacency matrices, each corresponding to a different, arbitrary node ordering. Such high representation complexity is challenging to model and makes it expensive to compute, and therefore to optimize, objective functions such as reconstruction error during training.
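To make the non-uniqueness concrete, the sketch below enumerates every node ordering of a small graph and collects the distinct adjacency matrices that result. The function names are hypothetical and the brute-force enumeration is only feasible for very small graphs; it is meant as an illustration, not as a practical tool.

```python
from itertools import permutations


def permuted_adjacency(adj, order):
    """Adjacency matrix of the same graph after relabeling nodes by `order`."""
    n = len(adj)
    return tuple(tuple(adj[order[i]][order[j]] for j in range(n)) for i in range(n))


def distinct_adjacency_matrices(adj):
    """Set of all adjacency matrices reachable by permuting the node order."""
    n = len(adj)
    return {permuted_adjacency(adj, order) for order in permutations(range(n))}


# Path graph 0 - 1 - 2: there are 3! = 6 orderings, but swapping the two
# end nodes leaves the matrix unchanged, so only 3 distinct matrices remain.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
matrices = distinct_adjacency_matrices(path)
print(len(matrices))  # 3
```

The count reaches the upper bound of the factorial of the node count only when the graph has no non-trivial symmetries; symmetric graphs such as this path produce fewer distinct matrices.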
Complex Dependency. The nodes and edges of a graph have complex dependencies and relationships. For example, in many real-world graphs two nodes are more likely to be connected if they share common neighbors. Therefore, the generation of each node or edge cannot be modeled as an independent event; nodes and edges need to be generated jointly. One way to formalize graph generation is to make auto-regressive decisions, which naturally accommodate complex dependencies inside the graph through a sequential formalization of graphs.
Large and Various Output Spaces. To generate a graph with $n$ nodes, the generative model may have to output $n^2$ values to fully specify its structure, which is expensive, especially for large-scale graphs. However, it is common to find graphs containing millions of nodes in the real world, such as citation and social networks. Also, the numbers of nodes and edges vary between different graphs. Consequently, it is important for generative models to scale to large graphs for realistic graph generation and to accommodate such complexity and variability in the output space.
Discrete Objects by Nature. Standard machine learning techniques, which were developed primarily for continuous data, do not work off-the-shelf for graphs and usually need adjustments. A prominent example is the back-propagation algorithm, which is not directly applicable to graphs, since it works only for continuously differentiable objective functions. To this end, it is usual to project graphs (or their constituents) into a continuous space and represent them as vectors or matrices. However, reconstructing the generated graphs from the continuous representations remains a challenge.
Conditional Generation. Sometimes, it is crucial to guide the graph generation process by conditioning it on extra contextual information. For example, in the Natural Language Processing (NLP) domain, Abstract Meaning Representation (AMR) structures and dependency graphs (Lyu and Titov, 2018; Zhang et al., 2019b) are generated conditioned on an input sequence. Another example is molecular optimization (Jin et al., 2018b), which generates the target graph conditioned on an input graph. Thus, deep graph generation can face a more challenging problem setting, which requires learning the conditional distribution of the observed graphs given the condition.
Evaluation for Implicit Properties. Evaluating the generated graphs is a critical but challenging issue, due to the unique properties of graphs, which have complex, high-dimensional structure and implicit features. Existing methods use different evaluation metrics. For example, some works (You et al., 2018b; Sun and Li, 2019; Guo et al., 2018) compute the distance between the distributions of graph statistics of the graphs in the test set and of the generated graphs, while other works (Liu et al., 2019a; Fan and Huang, 2019) indirectly use classifier-based metrics to judge whether the generated graphs follow the same distribution as the training graphs. It is important to systematically review all the existing metrics and choose the appropriate ones based on their strengths and limitations according to the application requirements.
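A statistic-based evaluation of the kind mentioned above can be sketched as follows: pool the node degrees of each graph set into a normalized histogram and measure the distance between the two histograms. The choice of degree as the statistic and of total variation as the distance are assumptions made to keep the example short; the cited works may use other statistics and other distance measures, and the function names are hypothetical.

```python
from collections import Counter


def degree_distribution(graphs):
    """Pooled, normalized degree histogram over a list of adjacency matrices."""
    counts = Counter()
    for adj in graphs:
        for row in adj:
            counts[sum(row)] += 1  # degree of one node
    total = sum(counts.values())
    return {deg: c / total for deg, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]   # stands in for a test-set graph
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]       # stands in for a generated graph
test_dist = degree_distribution([triangle])
gen_dist = degree_distribution([path])
distance = total_variation(test_dist, gen_dist)
```

A distance of zero means the two sets have identical pooled degree histograms, which is a necessary but not sufficient condition for the graphs to follow the same distribution.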
Various Validity Requirements. Modeling and understanding graph generation via deep learning involves a wide variety of important applications, including molecule design (Popova et al., 2019; Jin et al., 2018a), protein structure modeling (Anand and Huang, 2018), AMR parsing in NLP (Lyu and Titov, 2018; Zhang et al., 2019b), etc. These interdisciplinary problems have their own requirements for the validity of the generated graphs. For example, generated molecule graphs need to satisfy valency constraints, while semantic parsing in NLP requires Part-of-Speech (POS)-related constraints. Thus, addressing the validity requirements of different applications is crucial in enabling wider applications of deep graph generation.
Black-box with Low Reliability. Compared with traditional graph generation methods, deep learning-based graph modeling methods are like black boxes, which suffer from low interpretability and reliability. Improving the interpretability of deep graph generative models could be vital for unpacking the black box of the generation process and paving the way for wider application domains that are highly sensitive and require strong reliability, such as smart health and autonomous driving. In addition, semantic explanation of the latent representations can further enhance the scientific exploration of the associated application domains.
1.3. Our Contributions
Though it has emerged only recently, deep graph generation has attracted great attention. Various advanced works on deep graph generation have been conducted, ranging from one-shot graph generation to sequential graph generation, accommodating various deep generative learning strategies. These methods aim to solve one or several of the above challenges and come from different fields, including machine learning, bioinformatics, artificial intelligence, human health, and social-network mining. However, the methods developed by different research fields tend to use different vocabularies and solve problems from different angles. Also, standard and comprehensive evaluation procedures to validate the developed deep generative models for graphs are lacking. A comprehensive and systematic survey covering the research on deep generative models for graph generation as well as its applications, evaluations, and open problems is imperative yet missing.
To this end, this paper provides a systematic review of deep generative models for graph generation. We categorize methods and problems based on the challenges they address, discuss their underlying assumptions, and compare their advantages and disadvantages. The goal is to help interdisciplinary researchers choose appropriate techniques to solve problems in their application domains, and more importantly, to help graph generation researchers understand the basic principles as well as identify open research opportunities in the deep graph generation domain. As far as we know, this is the first comprehensive survey on deep generative models for graph generation. Below, we summarize the major contributions of this survey:
We propose a taxonomy of deep generative models for graph generation categorized by problem settings and methodologies. The drawbacks, advantages, relations, and differences among the subcategories are discussed.
We provide a detailed description, analysis, and comparison of deep generative models for graph generation as well as the deep generative models on which they are based.
We summarize and categorize the existing evaluation procedures and metrics of deep generative models for graph generation.
We introduce existing application domains of deep generative models for graph generation as well as the potential benefits and opportunities they bring into the application domains.
We suggest several open problems and promising future research directions in the field of deep generative models for graph generation.
1.4. Relationship with Related Surveys
There are three types of related surveys. The first type mainly centers around traditional graph generation by classic graph theory and network science (Bonifati et al., 2020), and does not focus on the most recent advances in deep generative neural networks in artificial intelligence. The second type is about representation learning on graphs (Goyal and Ferrara, 2018; Wu et al., 2020; Zhang et al., 2020). This is a very active domain in machine learning, especially deep learning. It can benefit a number of downstream tasks including node and graph classification, link prediction, and graph generation. This domain focuses on learning graph embeddings given existing graphs. Only a few of these works include a handful of deep generative models that could be used for representation learning tasks. The last type is specific to particular applications such as molecule design by deep learning, rather than covering this generic technical domain. To the best of our knowledge, there is no systematic survey on deep generative models for graph generation.
1.5. Outline of the Survey
The rest of this survey is organized as follows. In Section 2, we first introduce the preliminaries of the existing deep generative models that are used as the base models for learning graph distributions. Then we introduce the definitions of the basic concepts required to understand the deep graph generation problem as well as its extension, conditional deep graph generation. In the next two sections, we provide the taxonomy of deep graph generation; the taxonomy structure is illustrated in Fig. 1. Section 3 compares related works on the unconditional deep graph generation problem and summarizes the challenges faced in each. In Section 4, we categorize conditional deep graph generation in terms of three sub-problem settings. The challenges behind each problem are summarized, and a detailed analysis of different techniques is provided. Next, we summarize and categorize the evaluation metrics in Section 5. Then we present the applications that deep graph generation enables in Section 6. Finally, we discuss five potential future research directions and conclude this survey in Section 7.
2. Preliminary Knowledge
In recent years, there has been a resurgence of interest in deep generative models, which have been at the forefront of deep unsupervised learning for the last decade, because they offer a very efficient way to analyze and understand unlabeled data. The idea behind generative models is to capture the inner probabilistic distribution that generates a class of data in order to generate similar data (Oussidi and Elhassouny, 2018). Emerging approaches such as generative adversarial networks (GANs) (Goodfellow et al., 2014), variational auto-encoders (VAEs) (Kingma and Welling, 2014), generative recurrent neural networks (generative RNNs) (Sutskever et al., 2011) (e.g., pixelRNNs, RNN language models), flow-based learning (Papamakarios et al., 2017), and many of their variants and extensions have led to impressive results in a myriad of applications. In this section, we provide a review of five popular and classic deep generative models for learning distributions by observing large amounts of data in any format. They are VAEs, GANs, generative RNNs, flow-based learning, and reinforcement learning, which also form the backbone of all the existing deep generative models for graph generation.
2.1. Variational Auto-encoders
VAE (Kingma and Welling, 2014) is a latent variable-based model that pairs a top-down generator with a bottom-up inference network. Instead of directly performing maximum likelihood estimation on the intractable marginal log-likelihood, training is done by optimizing the tractable evidence lower bound (ELBO). Suppose we have a dataset of samples $x$ from a distribution parameterized by ground truth generative latent codes $z \in \mathbb{R}^c$ ($c$ refers to the length of the latent codes). VAE aims to learn a joint distribution between the latent space $z \sim p(z)$ and the input space $x \sim p(x)$.
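Specifically, in the probabilistic setting of a VAE, the encoder is defined by a variational posterior $q_\phi(z \mid x)$, while the decoder is defined by a generative distribution $p_\theta(x \mid z)$, as represented by the two orange trapezoids in Fig. 2(a). $\phi$ and $\theta$ are trainable parameters of the encoder and decoder. The VAE aims to learn a marginal likelihood of the data in a generative process as: $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$. Then the marginal log-likelihood of an individual data point $x_i$ can be rewritten as follows:
$$\log p_\theta(x_i) = D_{KL}\big(q_\phi(z \mid x_i)\,\|\,p_\theta(z \mid x_i)\big) + \mathcal{L}(\phi, \theta; x_i),$$
where the first term stands for the non-negative Kullback–Leibler divergence between the approximate and the true posterior; the second term is called the (variational) lower bound on the marginal likelihood. Thus, maximizing $\mathcal{L}(\phi, \theta; x_i)$ is to maximize the lower bound of the true objective:
$$\mathcal{L}(\phi, \theta; x_i) = \mathbb{E}_{q_\phi(z \mid x_i)}\big[\log p_\theta(x_i \mid z)\big] - D_{KL}\big(q_\phi(z \mid x_i)\,\|\,p(z)\big).$$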
In order to make the optimization of the above objective tractable in practice, we assume a simple prior distribution $p(z)$ as a standard Gaussian $\mathcal{N}(0, I)$ with a diagonal covariance matrix. Parameterizing the distributions in this way allows for the use of the “reparameterization trick” to estimate gradients of the lower bound with respect to the parameters $\phi$, where each random variable $z_i \sim q_\phi(z_i \mid x)$ is parameterized as a Gaussian with a differentiable transformation of a noise variable $\epsilon \sim \mathcal{N}(0, 1)$; that is, $z$ is computed as $z = \mu + \sigma \odot \epsilon$, where $\mu$ and $\sigma$ are outputs from the encoder.
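The reparameterization step and the KL term of the lower bound can be sketched in a few lines. This is a minimal illustration under stated assumptions: the encoder outputs are hard-coded rather than produced by a network, the encoder is assumed to emit a log-variance instead of a standard deviation (a common but not universal convention), and the function names are hypothetical. Plain Python has no automatic differentiation, so the sketch only shows the forward computation.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), element by element."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu = [0.5, -1.0]                    # assumed encoder mean for one input
log_var = [0.0, math.log(0.25)]     # assumed encoder log-variance
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Because the randomness enters only through the noise draw, `z` is a deterministic and differentiable function of the encoder outputs, which is what allows gradients to flow to the encoder parameters in an automatic differentiation framework.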
2.2. Generative Adversarial Nets
GANs were introduced as an alternative way to train a generative model (Goodfellow et al., 2014). GANs are based on a game theory scenario called the min-max game, where a discriminator and a generator compete against each other. The generator generates data from stochastic noise, and the discriminator tries to tell whether a sample is real (coming from a training set) or fabricated (from the generator). The two networks are optimized against opposing objectives, so that both learn simultaneously as they try to outperform each other.
Specifically, the architecture of GANs consists of two ‘adversarial’ models: a generative model $G$ which captures the data distribution $p_{data}$, and a discriminative model $D$ which estimates the probability that a sample comes from the training set rather than from $G$, as shown in Fig. 2(c). Both $G$ and $D$ could be non-linear mapping functions, such as multi-layer perceptrons (Suter, 1990), parameterized by $\theta_g$ and $\theta_d$. To learn a generator distribution $p_g$ of observed data $x$, the generator builds a mapping function from a prior noise distribution $p_z(z)$ to the data space as $G(z; \theta_g)$. The discriminator, $D(x; \theta_d)$, outputs a single scalar representing the probability that the input data $x$ came from the training data rather than being sampled from $p_g$.
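The generator and discriminator are trained simultaneously by adjusting the parameters of $G$ to minimize $\log(1 - D(G(z)))$ and adjusting the parameters of $D$ to maximize $\log D(x) + \log(1 - D(G(z)))$, as if they were following the two-player min-max game with value function $V(D, G)$:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big].$$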
The training of the generator and discriminator alternates until the generator can hopefully generate realistic data that is difficult for a strong discriminator to distinguish from real samples. In contrast to VAE, GANs learn to generate samples without assuming an approximate distribution. By utilizing the discriminator, GANs avoid optimizing an explicit likelihood loss function, which may explain their ability to produce high-quality objects as demonstrated by Denton et al. (2015). However, GANs still have drawbacks. One is that they can sometimes be extremely hard to train in the adversarial style. They may diverge or get stuck in a poor local minimum very easily. Mode collapse is also an issue, where the generator produces samples that belong to a limited set of modes, which results in low diversity. Moreover, alternating training and the large computational workload of two networks can result in slow convergence.
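The min-max value function described in this section can be estimated from discriminator outputs on a batch of real samples and a batch of generated samples. The sketch below assumes the discriminator outputs are already available as probabilities strictly between 0 and 1; the function name `gan_value` and the example numbers are hypothetical.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the GAN value function.

    d_real: discriminator outputs D(x) on real samples
    d_fake: discriminator outputs D(G(z)) on generated samples
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates the two batches well scores higher.
confident = gan_value([0.9, 0.9], [0.1, 0.1])
```

The discriminator update pushes this value up, while the generator update pushes the second term down. When the discriminator outputs 0.5 on every input, the estimate equals minus the logarithm of 4, the value reported by Goodfellow et al. (2014) for the case where the generator distribution matches the data distribution.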
2.2.1. Generative Recurrent Neural Network
RNN (Mikolov et al., 2010) is a straightforward adaptation of the standard feed-forward neural network that uses an internal state (memory) to process variable-length sequential data. At each step, the RNN predicts the output depending on the previously computed hidden states and updates its current hidden state; that is, it has a “memory” that captures information about what has been calculated so far. The RNN’s high-dimensional hidden state and nonlinear evolution endow it with great expressive power to integrate information over many iterative steps for accurate predictions. Even if the non-linearity used by each unit is quite simple, iterating it over time leads to very rich dynamics (Sutskever et al., 2011).
A standard RNN is formalized as follows: given a sequence of input vectors $(x_1, \dots, x_T)$, the RNN computes a sequence of hidden states $(h_1, \dots, h_T)$ and a sequence of outputs $(o_1, \dots, o_T)$ by iterating the following equations from $t = 1$ to $T$:
$$h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h),$$
$$o_t = W_{oh} h_t + b_o,$$
where $W_{hx}$, $W_{hh}$, and $W_{oh}$ are learnable weight matrices; the vectors $b_h$ and $b_o$ are biases for calculating the hidden states and output at each step, respectively. The expression $W_{hh} h_{t-1}$ at step $t = 1$ is initialized by a vector, $h_{init}$, and the tanh non-linearity is applied coordinate-wise.
The RNN model can be modified into a generative model for generating sequential data, as shown in Fig. 2(d). The goal of modeling a sequence is to predict the next element in the sequence given the previously generated elements. More formally, given a training sequence $(x_1, \dots, x_T)$, the RNN uses the sequence of its output vectors $(o_1, \dots, o_T)$ to parameterize a sequence of predictive distributions $p(x_{t+1} \mid x_{\le t})$. The distribution type of $p(x_{t+1} \mid x_{\le t})$ needs to be assumed in advance. For example, to determine the category of the discrete data $x_{t+1}$, we can assume a softmax distribution as $p(x_{t+1} = k \mid x_{\le t}) = \exp(o_t^{(k)}) / \sum_{j=1}^{K} \exp(o_t^{(j)})$, where $k$ refers to one of the categories of the object, $o_t^{(k)}$ refers to the $k$-th variable in the output vector $o_t$, and $K$ refers to the total number of categories of the objects. The objective of modeling sequential data is to maximize the total log likelihood of the training sequence $\sum_{t=0}^{T-1} \log p(x_{t+1} \mid x_{\le t})$, which implies that the RNN learns a joint probability distribution of sequences. Then we can generate a sequence by sampling from $p(x_{t+1} \mid x_{\le t})$ stochastically, which is parameterized by the output at each step.
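The recurrence and the softmax sampling described in this subsection can be sketched with plain Python lists. The weights below are small hand-picked numbers rather than trained parameters, the all-zero start input and zero initial hidden state are assumptions of this sketch, and all function names are hypothetical.

```python
import math
import random


def matvec(W, v):
    """Matrix-vector product for nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x, h_prev, W_hx, W_hh, b_h, W_oh, b_o):
    """One step: h = tanh(W_hx x + W_hh h_prev + b_h), o = W_oh h + b_o."""
    pre = [a + b + c for a, b, c in zip(matvec(W_hx, x), matvec(W_hh, h_prev), b_h)]
    h = [math.tanh(p) for p in pre]
    o = [a + b for a, b in zip(matvec(W_oh, h), b_o)]
    return h, o


def softmax(o):
    """Numerically stable softmax over a list of scores."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(length, params, num_categories, rng):
    """Sample a categorical sequence, feeding each sample back as the next input."""
    W_hx, W_hh, b_h, W_oh, b_o = params
    h = [0.0] * len(b_h)            # assumed zero initial hidden state
    x = [0.0] * num_categories      # assumed all-zero start input
    seq = []
    for _ in range(length):
        h, o = rnn_step(x, h, W_hx, W_hh, b_h, W_oh, b_o)
        k = rng.choices(range(num_categories), weights=softmax(o))[0]
        seq.append(k)
        x = [1.0 if i == k else 0.0 for i in range(num_categories)]  # one-hot
    return seq


# Hidden size 2, three categories; untrained illustrative weights.
params = (
    [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]],        # W_hx (2 x 3)
    [[0.1, 0.0], [0.0, 0.1]],                    # W_hh (2 x 2)
    [0.0, 0.0],                                  # b_h
    [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]],      # W_oh (3 x 2)
    [0.0, 0.0, 0.0],                             # b_o
)
seq = sample_sequence(5, params, 3, random.Random(0))
```

Training would adjust the weights to maximize the log likelihood of observed sequences; the sketch only covers the generation loop, where every sampled element becomes the input of the next step.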
2.3. Flow-based Learning
Normalizing flows (NFs) (Dinh et al., 2017) are a class of generative models that define a parameterized invertible deterministic transformation between two spaces $\mathcal{Z}$ and $\mathcal{X}$. $\mathcal{Z}$ is a latent space that follows a distribution $p_Z(z)$ such as a Gaussian, and $\mathcal{X}$ is a real-world observational space of objects such as images, graphs, and texts. Let $f_\theta: \mathcal{Z} \to \mathcal{X}$ be an invertible transformation parameterized by $\theta$. Then the relationship between the density function of real-world data $x$ and that of $z$ can be expressed via the change-of-variables formula:
$$p_X(x) = p_Z\big(f_\theta^{-1}(x)\big)\left|\det \frac{\partial f_\theta^{-1}(x)}{\partial x}\right|.$$
There are two key processes of normalizing flows as a generative model: (1) Calculating data likelihood: given a datapoint $x$, the exact density $p_X(x)$ can be calculated by inverting the transformation, $z = f_\theta^{-1}(x)$; (2) Sampling: $x$ can be sampled from the distribution $p_X(x)$ by first sampling $z \sim p_Z(z)$ and then performing the transformation $x = f_\theta(z)$. To perform these operations efficiently, $f_\theta$ is required to be invertible with an easily computable Jacobian determinant.
Autoregressive flow (AF), originally proposed in (Papamakarios et al., 2017), is a variant of normalizing flow that provides an easily computable triangular Jacobian determinant. It is specially designed for modeling the conditional distributions in a sequence. Formally, given $x \in \mathbb{R}^D$ ($D$ is the dimension of the observed sequential data), the autoregressive conditional probability for the $d$-th element in the sequence can be parameterized as a Gaussian:
$$p(x_d \mid x_{1:d-1}) = \mathcal{N}\big(x_d \mid \mu_d, (\sigma_d)^2\big), \quad \mu_d = g_\mu(x_{1:d-1}; \theta), \quad \sigma_d = g_\sigma(x_{1:d-1}; \theta),$$
where $g_\mu$ and $g_\sigma$ are unconstrained and positive scalar functions of $x_{1:d-1}$, respectively, for computing the mean and deviation. In practice, these functions can be implemented as neural networks. The affine transformation of AF can be written as follows:
$$f_\theta(\epsilon_d) = x_d = \mu_d + \sigma_d \cdot \epsilon_d, \qquad f_\theta^{-1}(x_d) = \epsilon_d = \frac{x_d - \mu_d}{\sigma_d},$$
where $\epsilon_d$ is a value randomly sampled from a standard Gaussian. The Jacobian matrix here is triangular, since $\partial x_i / \partial \epsilon_j$ is non-zero only for $j \le i$. Therefore, the determinant can be efficiently computed through $\prod_{d=1}^{D} \sigma_d$. Specifically, to perform density estimation, we can apply all individual scalar affine transformations in parallel to compute the base density, each of which depends on the previous variables $x_{1:d-1}$; to sample $x$, we can first sample $\epsilon \in \mathbb{R}^D$ and compute $x_1$ through the affine transformation, and then each subsequent $x_d$ can be computed sequentially based on $x_{1:d-1}$.
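The two operations of an autoregressive flow, sequential sampling and density evaluation through the inverse, can be sketched as follows. The conditioner `mu_sigma` is a hypothetical hand-written stand-in for the neural networks that would compute the mean and scale in practice, and the remaining function names are also assumptions of this example.

```python
import math
import random


def mu_sigma(prefix):
    """Hypothetical conditioner: mean and positive scale from earlier elements."""
    s = sum(prefix)
    return 0.5 * s, math.exp(0.1 * s)   # exp keeps the scale strictly positive


def af_sample(eps):
    """Sequential sampling: each x_d needs all previously computed x values."""
    x = []
    for e in eps:
        mu, sigma = mu_sigma(x)
        x.append(mu + sigma * e)
    return x


def af_inverse_and_logdet(x):
    """Recover the noise from x and accumulate the log-determinant.

    Every term depends only on the observed x, so the loop iterations are
    independent of each other and could be evaluated in parallel.
    """
    eps, log_det = [], 0.0
    for d, x_d in enumerate(x):
        mu, sigma = mu_sigma(x[:d])
        eps.append((x_d - mu) / sigma)
        log_det += math.log(sigma)      # log of the product of the scales
    return eps, log_det


def af_log_density(x):
    """log p(x) = log N(eps; 0, I) - sum_d log sigma_d (change of variables)."""
    eps, log_det = af_inverse_and_logdet(x)
    base = sum(-0.5 * e * e - 0.5 * math.log(2.0 * math.pi) for e in eps)
    return base - log_det


rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
x = af_sample(noise)
recovered, _ = af_inverse_and_logdet(x)
```

The asymmetry between the two directions is visible in the code: `af_sample` must run element by element, while `af_inverse_and_logdet` reads every conditioning prefix directly from the observed data.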
2.4. Reinforcement Learning and Deep Q-Network
Reinforcement learning (RL) is a commonly used framework in which a computer algorithm, the so-called agent, learns control policies through interacting with its environment (Sutton et al., 1998; Silver et al., 2007). Here, we give a brief introduction to this learning strategy as well as one of its typical forms, deep Q-networks (DQNs) (Mnih et al., 2015), for data generation.
In the RL process, an agent is faced with a sequential decision-making problem, where interaction with the environment takes place at discrete time steps. The agent takes action $a_t$ at state $s_t$ at time $t$, by following certain policies or rules, which will result in a new state $s_{t+1}$ as well as a reward $r_t$. If we consider infinite horizon problems with a discounted cumulative reward objective $R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$ ($\gamma \in [0, 1]$ is the discount factor), the aim of the agent is to find an optimal policy $\pi$ by maximizing its expected discounted cumulative rewards. Q-Learning (Watkins and Dayan, 1992) is a value-based method for solving RL problems by encoding policies through the use of action-value functions:
$$Q^\pi(s, a) = \mathbb{E}_\pi\big[R_t \mid s_t = s, a_t = a\big].$$
The optimal value function is denoted as $Q^*(s, a) = \max_\pi Q^\pi(s, a)$, and an optimal policy $\pi^*$ can be easily derived by $\pi^*(s) \in \arg\max_a Q^*(s, a)$. Typically, the Q-value function relies on all possible state-action pairs, which are often impractical to obtain. One solution for addressing this challenge is to approximate $Q(s, a)$ using a parameterized function $Q(s, a; \theta)$ (Sutton et al., 1998).
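Two of the quantities above are easy to illustrate in a tabular setting: the discounted cumulative reward and the greedy policy derived from a table of Q-values. The sketch truncates the infinite-horizon sum to a finite reward list, and the state names, action names, and Q-values are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated backwards over a finite list."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def greedy_policy(q_table):
    """For each state, pick an action with the largest Q-value."""
    return {s: max(actions, key=actions.get) for s, actions in q_table.items()}


# Hypothetical Q-table: state -> {action: Q-value}.
q_table = {
    "s0": {"left": 0.2, "right": 1.0},
    "s1": {"left": 0.7, "right": 0.1},
}
policy = greedy_policy(q_table)
ret = discounted_return([1.0, 0.0, 2.0], gamma=0.5)   # 1 + 0.5*0 + 0.25*2
print(policy, ret)  # {'s0': 'right', 's1': 'left'} 1.5
```

When the number of state-action pairs is too large to enumerate in such a table, the table lookup is replaced by a parameterized function, as described in the text.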
Based on recent advances in deep learning techniques, Mnih et al. (2015) introduced the DQN. The DQN approximates the Q-value function with a non-linear deep convolutional network, which also automatically creates useful features to represent the internal states of the RL, as shown in Fig. 2(b). In DQN, the agent interacts with the environment in discrete iterations, aiming to maximize its long term reward. DQN has shown great power in generating sequential objects by taking a series of actions (Li et al., 2016). A sequential object is generated based on a sequence of actions that are taken.
During the generation, DQN selects the action at each step using an $\epsilon$-greedy implementation. With probability $\epsilon$, a random action is selected from the range of possible actions; otherwise the action with the highest Q-value is selected. To perform experience replay, the agent’s experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time-step $t$ are stored in a data set $\mathcal{D}_t = \{e_1, \dots, e_t\}$. At each iteration $i$ in the learning process, the updates of the learning weights are applied on samples of experience $(s, a, r, s') \sim U(\mathcal{D})$, drawn randomly from the pool of stored samples, with the following loss function:
$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(\mathcal{D})}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\big)^2\Big],$$
where $\theta_i$ refers to the parameters of the Q-network at iteration $i$ and $\theta_i^-$ refers to the network parameters used to compute the target at iteration $i$. The target network parameters are only updated with the Q-network parameters every several steps and are held fixed between individual updates. The process of generating the data after training is similar to that of the training process.
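The action selection rule and the loss described above can be sketched with dictionaries standing in for the two networks. This is a simplified illustration: `q_online` and `q_target` are hypothetical lookup tables rather than deep networks, terminal states are not handled, and no gradient step is shown.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """q_values maps action -> Q(s, a); explore with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))       # uniformly random action
    return max(q_values, key=q_values.get)        # greedy action


def dqn_loss(batch, q_online, q_target, gamma):
    """Mean squared temporal-difference error over sampled experiences.

    batch: list of (state, action, reward, next_state) tuples
    q_online / q_target: state -> {action: value}; the target table plays
    the role of the periodically synchronized target network.
    """
    total = 0.0
    for (s, a, r, s_next) in batch:
        target = r + gamma * max(q_target[s_next].values())
        total += (target - q_online[s][a]) ** 2
    return total / len(batch)


q_online = {"s0": {"a": 1.0, "b": 0.0}, "s1": {"a": 0.5, "b": 2.0}}
q_target = {s: dict(v) for s, v in q_online.items()}   # frozen copy
replay = [("s0", "a", 1.0, "s1")]                      # stored experiences
batch = random.Random(0).sample(replay, 1)             # uniform draw from the pool
loss = dqn_loss(batch, q_online, q_target, gamma=0.9)
action = epsilon_greedy(q_online["s0"], epsilon=0.0, rng=random.Random(0))
```

Keeping `q_target` as a separate copy mirrors the text: the targets stay fixed while the online values change, and the copy is refreshed only every several updates.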
3. Unconditional Deep Generative Models for Graph Generation
The goal of unconditional deep graph generation is to learn the distribution $p_{model}(G)$ based on a set of observed realistic graphs sampled from the real distribution $p(G)$ by deep generative models. Based on the style of the generation process, we can categorize the methods into two main branches: (1) Sequential generating: this generates the nodes and edges in a sequential way, one after another; (2) One-shot generating: this refers to building a probabilistic graph model based on the matrix representation that can generate all nodes and edges in one shot. These two ways of generating graphs each have their limitations and merits. Sequential generating makes each local decision conditioned on the preceding ones in an efficient way, with time complexity of only $O(N)$ in the number of nodes $N$, but it has difficulty preserving long-term dependencies. Thus, some global properties of the graph (e.g., the scale-free property) are hard to capture. Moreover, existing works on sequential generating are limited to a predefined ordering of the sequence, leaving open the role of permutation. One-shot generating methods have the capacity to model the global properties of a graph by generating and refining the whole graph (i.e., nodes and edges) synchronously through several iterations, but most of them are hard to scale to large graphs since their time complexity is not less than $O(N^2)$.
| Generation style | Category | Sub-category | References |
|---|---|---|---|
| Sequential Generating | Node-sequence-based | Traversal-based | (Khodayar et al., 2019; D’Arcy et al., 2019; Zhang et al., 2019a; You et al., 2018b; Su et al., 2019; Popova et al., 2019; Assouel et al., 2018) |
| | | Selection-based | (Lim et al., 2019; Li et al., 2018a; Liu et al., 2018; Kearnes et al., 2019) |
| | Edge-sequence-based | | (Goyal et al., 2020; Bacciu et al., 2019, 2020; Bojchevski et al., 2018) |
| | Graph-Motif-sequence-based | | (Liao et al., 2019; Jin et al., 2018a; Podda et al., 2020; Gu, 2019) |
| | Rule-sequence-based | | (Dai et al., 2018; Kusner et al., 2017) |
| One-shot Generating | Adjacency-based | MLP-based | (Simonovsky and Komodakis, 2018; Ma et al., 2018; Anand and Huang, 2018; Fan and Huang, 2019; Pölsterl and Wachinger, 2019; De Cao and Kipf, 2018) |
| | | Message-Passing-based | (Bresson and Laurent, 2019; Guarino et al., 2017; Flam-Shepherd et al., 2020; Niu et al., 2020) |
| | | Invertible-transform-based | (Honda et al., 2019; Madhawa et al., 2019) |
| | Edge-list-based | Random-walk-based | (Bojchevski et al., 2018; Gamage et al., 2020; Zhang, 2019; Caridá et al., 2019) |
| | | Node-similarity-based | (Kipf and Welling, 2016; Grover et al., 2019; Shi et al., 2020; Zou and Lerman, 2019; Liu et al., 2019b; Salha et al., 2019) |
3.1. Generating a Graph Sequentially
This type of method treats graph generation as a sequential decision-making process, wherein nodes and edges are generated one by one (or group by group), conditioned on the sub-graph already generated. By modeling graph generation as a sequential process, these approaches naturally accommodate complex local dependencies between generated edges and nodes. A graph $G$ is represented as a sequence of components $S = \{s_1, \dots, s_N\}$, where each $s_i \in S$ can be regarded as a generation unit. The distribution of graphs $p(G)$ can then be formalized as the joint (conditional) probability of all the components in general. While generating graphs, different components are generated sequentially, each conditioned on the parts already generated. One core issue is how to break down graph generation into the sequential generation of its components. Regarding the choice of the unit for sequentialization, there are four common ways: node-sequence-based, edge-sequence-based, graph-motif-sequence-based, and rule-sequence-based, as shown in Fig. 1 (left).
Node-sequence-based methods essentially generate the graph by generating one node and its associated edges per step, as shown in Fig. 3 (a). Specifically, the graph can be modeled by a sequence based on a predefined ordering $\pi$ on nodes. Each unit $s_i$ in the sequence of components $S$ is represented as a tuple $s_i = (v_i, \{e_{i,j}\}_{j<i})$, indicating that at each high-level step, the generator generates one node $v_i$ and its whole set of associated edges $\{e_{i,j}\}_{j<i}$. Here we omit the node and edge attribute symbols for clarity, but we should bear in mind that the generated nodes and edges can all have attributes (e.g., type, label). Given a newly generated node $v_i$, existing methods for the generation of its associated edges can be grouped into two: 1) traversal-based, where the edges are formed when traversing the newly generated node and all the existing nodes, and 2) selection-based, which entails determining whether there is an edge between the newly generated node and any of the existing nodes.
Traversal-based. When treating a graph as a sequence of node tuples, several approaches (Khodayar et al., 2019; D’Arcy et al., 2019; Zhang et al., 2019a; You et al., 2018b; Su et al., 2019; Shi et al., 2020; Popova et al., 2019) represent each node’s associated edges by an adjacency vector (we assume that the graph is undirected, without loss of generality), which covers all the potential edges from the newly added node to the previously generated nodes. Each unit can thus be represented as a node paired with its adjacency vector, and the graph as the sequence of these pairs. The aim is to learn the distribution as:
where the conditioning terms are the nodes and the adjacency vectors generated before the current step. Such a joint probability can be implemented by sequential architectures such as generative RNN models (You et al., 2018b; Liu et al., 2019a; Popova et al., 2019; Zhang et al., 2019a) and auto-regressive flow-based learning models (Shi et al., 2020), which are introduced next.
In the generative RNN-based models, the node distributions are typically assumed to follow a multivariate Bernoulli distribution whose parameter has one entry per node category. The edge existence distribution can be modeled as the joint of several dependent Bernoulli distributions as follows:
The resulting model can be regarded as a hierarchical RNN, where the outer RNN generates the nodes and the inner RNN generates each node’s associated edges. After either a node or an edge is generated, a graph-level hidden representation of the already generated sub-graph is calculated and updated through a message passing neural network (MPNN) (Gilmer et al., 2017). Specifically, at each step, a parameter is calculated by a multilayer perceptron (MLP)-based function from the current graph-level hidden representation. This parameter is used to parameterize the Bernoulli distribution of node existence, from which the node is sampled. After that, the adjacency vector is generated by sequentially generating each of its entries. At each step in generating the adjacency vector, the edge is sampled according to a conditional parameter, which is also calculated by an MLP-based function from the current graph-level hidden representation.
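The traversal-based procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the architecture of any cited model: `node_stop_prob` and `edge_prob` are hypothetical stand-ins for the MLP outputs computed from the graph-level hidden representation, and they return fixed constants here instead of learned values.

```python
import random

def node_stop_prob(adj):
    """Hypothetical stand-in for the learned node-level output (probability of stopping)."""
    return 0.0 if len(adj) < 2 else 0.3

def edge_prob(adj, new_node, old_node):
    """Hypothetical stand-in for the learned edge-level output (probability of an edge)."""
    return 0.5

def generate_graph(max_nodes=8, seed=0):
    rng = random.Random(seed)
    adj = []  # adj[i] is the adjacency vector of node i over nodes 0..i-1
    while len(adj) < max_nodes:
        # Outer loop: decide whether another node is generated.
        if rng.random() < node_stop_prob(adj):
            break
        i = len(adj)
        row = []
        # Inner loop: traverse every existing node and sample one Bernoulli per potential edge.
        for j in range(i):
            row.append(1 if rng.random() < edge_prob(adj, i, j) else 0)
        adj.append(row)
    return adj

graph = generate_graph()
```

The inner loop makes the cost of adding a node proportional to the number of existing nodes, which is the inefficiency that selection-based methods try to avoid.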
In addition to RNN-based methods, we now introduce representative works based on auto-regressive flow-based learning models. Shi et al. (2020) achieve conditional generation via the flow-based learning introduced in Section 2.3. The idea is to first transform discrete data into continuous data by adding real-valued noise, a technique known as dequantization (Dinh et al., 2017). Specifically, the discrete unit is pre-processed into continuous data:
where the discrete value is the category of the node and the added noise is drawn from a uniform distribution (Kuipers and Niederreiter, 2012). The conditional distributions of the resulting continuous node and edge variables are then assumed to be Gaussian:
where the parameters of the Gaussian distributions for node and edge generation are calculated by MLP-based networks whose input is the hidden representation of the already generated graph. The hidden representations of the graph are typically calculated through an MPNN.
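The dequantization step can be illustrated with a small sketch. It assumes, for illustration only, that a node category is encoded as a one-hot vector and that the noise is uniform on [0, 1); the function names are hypothetical and do not come from the cited work.

```python
import random

def dequantize(one_hot, rng):
    """Add uniform noise from [0, 1) to each entry of a one-hot vector (assumed noise interval)."""
    return [x + rng.random() for x in one_hot]

def recover_category(z):
    """The hot entry is at least 1 and every other entry is below 1, so argmax recovers the category."""
    return max(range(len(z)), key=lambda k: z[k])

rng = random.Random(0)
node_category = [0, 0, 1, 0]            # hypothetical one-hot node type
continuous = dequantize(node_category, rng)
category = recover_category(continuous)
```

Because the noise never moves a non-hot entry above the hot one, the discrete category can be recovered exactly from the continuous value.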
Several additional works are based on VAE, yet their latent representations are generated sequentially. Su et al. (2019) propose a graph recurrent neural network with variational Bayes to learn the conditional distributions. It uses the conditional VAE (CVAE) (Sohn et al., 2015) and utilizes three MLP-based networks to model three distributions of the generation process, namely the prior distribution, the node generation distribution, and the edge generation distribution, all of which involve the latent representation at the current step. During the generation process, at each step, the prior network is first used to draw a latent sample from the learnt prior distribution, which is parameterized by the output of an MLP-based function that takes the already generated graph as input. Then the node and its associated edges are generated by sampling from the node and edge generation distributions, respectively, which are parameterized by the outputs of two MLP-based functions that take the latent sample and the already generated graph as input.
Selection-based. Selection-based methods generate the nodes in the same way as traversal-based methods but generate the associated edge set differently. Traversing all the existing nodes to generate the edge set of each newly generated node is time-consuming, especially for sparse graphs. It is more efficient to generate the edge set directly by selecting only the neighboring nodes from the already generated nodes. Specifically, for each newly generated node, selection-based methods generate its edge set using two functions: an addEdge function that determines the size of the edge set of the node, and a selectNode function that selects the neighboring nodes sequentially from the partially generated graph (Lim et al., 2019; Li et al., 2018a; Liu et al., 2018; Kearnes et al., 2019).
Specifically, at each step, after generating a node, the addEdge function outputs the parameter of a Bernoulli distribution indicating whether to add an edge to this node. The function takes as input the node-level hidden state of the new node, which is calculated through a node embedding function (e.g., an MPNN) based on the already generated parts of the graph. If an edge is to be added, the next step is to select the neighboring node from the existing nodes. To do so, the selectNode function computes a score (as in Eq. 14) for each existing node, and the scores are passed through a softmax function (Bishop, 2006) to be normalized into a distribution over nodes:
The MLP-based function maps each pair of node-level hidden states, one for an existing node and one for the new node, to a score for connecting the two. This can be extended to handle discrete edge attributes by outputting a vector of scores whose size equals the number of edge attribute categories and taking the softmax over all categories. The addEdge and selectNode functions are executed iteratively to generate the edges in the edge set of the new node until the addEdge function signals that no more edges are to be added.
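The interplay of the two functions can be sketched as follows. This is a simplified illustration, not the implementation of any cited method: `add_edge_prob` and `select_node_score` are hypothetical stand-ins for the learned addEdge and selectNode functions, the hidden states are hand-written vectors, and excluding already selected neighbors is an assumption made here to keep the edge set free of duplicates.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def add_edge_prob(h_new, num_edges_so_far):
    """Hypothetical addEdge output: probability of adding one more edge."""
    return 0.6 if num_edges_so_far < 3 else 0.0

def select_node_score(h_old, h_new):
    """Hypothetical selectNode score: dot product of two node-level hidden states."""
    return sum(a * b for a, b in zip(h_old, h_new))

def generate_edges(h_existing, h_new, rng):
    neighbours = []
    # Keep asking addEdge until it signals that no more edges are to be added.
    while rng.random() < add_edge_prob(h_new, len(neighbours)):
        candidates = [j for j in range(len(h_existing)) if j not in neighbours]
        if not candidates:
            break
        # Normalise the selectNode scores into a distribution over candidate nodes.
        probs = softmax([select_node_score(h_existing[j], h_new) for j in candidates])
        neighbours.append(rng.choices(candidates, weights=probs, k=1)[0])
    return neighbours

h_existing = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
new_neighbours = generate_edges(h_existing, [1.0, 0.5], random.Random(0))
```

Only as many selection steps are run as edges are added, so the cost scales with the degree of the new node and not with the number of existing nodes.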
Edge-sequence-based methods represent the graph as a sequence of edges and generate one edge together with its two end nodes per step, as shown in Fig. 3 (b). They define an ordering of the edges in the graph and an ordering function for indexing the nodes. The graph can then be modeled as a sequence of edges (Goyal et al., 2020; Bacciu et al., 2019, 2020), where each unit in the sequence is a tuple consisting of the indexes of the two end nodes, the attributes of the two nodes, and the attribute of the edge generated at that step. Edge-sequence-based methods usually employ two parallel networks to generate the two end nodes of each edge. The key problem in generating graphs as a sequence of edges is to pre-define the ordering index function for the nodes; given the indexes of the generated nodes, the graph can be constructed from the generated sequence of edges.
Goyal et al. (2020) use the depth-first search (DFS) algorithm (Yan and Han, 2002) as the ordering index function to construct a canonical index of the nodes. The conditional distribution for generating each edge in the graph can be formalized as follows:
Here the conditioning term refers to the already generated edges and nodes. A customized long short-term memory (LSTM) network is designed, consisting of a state transition function that maps the hidden state of the previous step to that of the current step (in Eq. 17), an embedding function that embeds the already generated graph into latent representations (in Eq. 17), and five separate output functions for the five distribution components (in Eq. 17 to Eq. 20). The five elements in a tuple are assumed to be independent of each other, and thus inference proceeds as:
where the generated tuple at each step is represented as the concatenation of the representations of all its components, and the graph-level LSTM hidden state vector encodes the state of the graph generated so far. Given the graph state, the outputs of the five functions model the categorical distributions of the five components of the newly formed edge tuple, each parameterized by its own vector. Finally, the components of the newly formed edge tuple are sampled from the five learnt categorical distributions.
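To make the role of the node ordering function concrete, the sketch below indexes the nodes of a small graph in DFS visiting order and rewrites the graph as a sequence of index pairs. It is a simplified illustration with hypothetical function names: it uses a plain DFS from a fixed start node and omits the node and edge attributes, so it does not reproduce the canonical labeling used by the cited method.

```python
def dfs_index(adj, start):
    """Assign each node the order in which a depth-first search first visits it."""
    index = {}
    stack = [start]
    while stack:
        u = stack.pop()
        if u in index:
            continue
        index[u] = len(index)
        # Push neighbours in reverse order so the smallest label is visited first.
        for v in sorted(adj[u], reverse=True):
            if v not in index:
                stack.append(v)
    return index

def to_edge_sequence(adj, start):
    """Represent an undirected graph as a sorted sequence of (index, index) pairs."""
    index = dfs_index(adj, start)
    edges = {tuple(sorted((index[u], index[v]))) for u in adj for v in adj[u]}
    return sorted(edges)

toy_graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
sequence = to_edge_sequence(toy_graph, "a")
# sequence == [(0, 1), (0, 2), (1, 2), (2, 3)]
```

Once the indexes are fixed, the edge sequence alone is enough to rebuild the topology, which is why the choice of indexing function matters.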
In contrast to the methods mentioned above, which assume that the elements in each tuple are independent of each other, Bacciu et al. (2019) assume that the two nodes in a tuple are dependent. This method deals with homogeneous graphs without considering node or edge categories, so each tuple in the sequence contains only the two end nodes, and the distribution of the second node is conditioned on the first. The first node is sampled in the same way as in Eq. 18, while the second node in the tuple is sampled as follows:
where the function is used for embedding the index of the first generated node in the pair.
Several methods (Liao et al., 2019; Jin et al., 2018a; Podda et al., 2020; Gu, 2019) represent a graph as a sequence of graph motifs, where the block of nodes and edges that constitutes each graph motif is generated at each step, as shown in Fig. 3 (c). Varying the size of the graph motif (i.e., the number of nodes in a graph motif) along with the sampling overlap size (i.e., the overlap between two graph motifs) allows the efficiency-quality trade-off of the generation model to be explored. A key problem in graph-motif-based methods is how to connect the newly generated graph motif to the portion of the graph that has already been generated, considering that there are many potential ways to link two sub-graphs. The solution mainly depends on the definition of the graph motifs. Currently, there are two ways to solve this problem.
The first way is designed for generating general graphs. It is similar to traversal-based node-sequence generation, such as GraphRNN (You et al., 2018b), in that it generates adjacency vectors, except that several nodes are generated per step instead of one. As described in Section 3.1.1, a graph is represented as a sequence of node-based tuples, one of which is generated per step. Based on this node sequence, Liao et al. (2019) (GRANs) regard every group of consecutive node tuples as a graph motif and generate one such block per step. In this way, the generated nodes in the new graph motif follow the ordering of the nodes in the whole graph and contain all the connection information of the existing and newly generated nodes. To model the dependency among the existing and newly generated nodes, GRANs use an MPNN-based model to generate the adjacency vectors. Specifically, at each generation step, an augmented graph is constructed that contains the already generated graph, with its nodes and the edges among them, as well as the nodes of the newly generated graph motif. Edges are initially added to fully connect the new nodes with each other and with the previous nodes. The node-level hidden states of the newly added nodes are all initialized to 0. An MPNN-based graph neural network (GNN) (Scarselli et al., 2008) is then applied to this augmented graph to update the nodes’ hidden states by encoding the graph structure. After several rounds of message passing, the node-level hidden states of both the existing and newly added nodes are used to infer the final distribution of the newly added edges as follows:
where the output of the MLP-based function, which takes the node-level hidden states as input, parameterizes the Bernoulli distribution for edge existence.
The definition of graph motifs can also involve domain knowledge, as in the case of molecules (i.e., graphs of atoms) (Jin et al., 2018a; Podda et al., 2020), where the sequence of graph motifs is generated by an RNN model. Jin et al. (2018a) propose the Junction-Tree-VAE, which first generates a tree-structured scaffold over chemical substructures and then combines them into a molecule with an MPNN. Specifically, a tree decomposition algorithm tailored for molecules (Rarey and Dixon, 1998) is used to decompose the graph into several graph motifs, each of which is regarded as a node in a tree structure, the junction tree. To generate a graph, a junction tree is first generated and then converted into the final graph. The decoder for generating a junction tree consists of a topology prediction function and a label prediction function. The topology prediction function models the probability that the current node has a child, and the label prediction function models a distribution over the labels of all types of motifs. When reproducing the molecular graph that underlies the predicted junction tree, neighboring motifs can be attached to each other as sub-graphs in many potential ways, since each motif contains several atoms. To resolve this, a scoring function over all the candidate graphs is proposed, and the candidate that maximizes the scoring function is taken as the final generated graph.
Podda et al. (2020) also address molecule generation but define the graph motifs differently. To break a molecule into a sequence of fragments, they leverage the breaking of retrosynthetically interesting chemical substructures (BRICS) algorithm (Degen et al., 2008), which breaks strategic bonds in a molecule that match a set of chemical reactions. Specifically, their fragmentation algorithm scans the atoms from left to right in the order imposed by the simplified molecular input line entry system (SMILES) encoding (Weininger, 1988). As soon as a breakable bond (according to the BRICS rules) is encountered during the scan, the molecule is broken in two at that bond. The leftmost fragment is then collected, and the process repeats recursively on the rightmost fragment. Since the fragments are extracted from left to right according to the SMILES representation, the original molecule can be reconstructed from the sequence of fragments. In this way, they represent a molecule as a sequence and view the sequence of fragments as a “sentence”. They then learn to generate such sentences, similar to the work of Bowman et al. (2016), using skip-gram embedding methods (Le and Mikolov, 2014) and gated recurrent units (GRUs) (Cho et al., 2014).
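The left-to-right recursive fragmentation can be illustrated on a toy token sequence. The sketch below is only a schematic: the function name is hypothetical, the set of breakable bond positions is supplied by hand instead of being derived from the BRICS rules, and atoms are plain characters instead of a parsed SMILES string.

```python
def fragment(atoms, breakable):
    """Split `atoms` at the first breakable bond, keep the left part, recurse on the right.

    `breakable` holds indexes i meaning the bond between atoms[i] and atoms[i + 1]
    may be broken (an assumption standing in for the BRICS rules).
    """
    for i in range(len(atoms) - 1):
        if i in breakable:
            left, right = atoms[: i + 1], atoms[i + 1 :]
            # Re-index the remaining breakable bonds relative to the right fragment.
            shifted = {b - (i + 1) for b in breakable if b > i}
            return [left] + fragment(right, shifted)
    return [atoms]

fragments = fragment(list("CCOCCN"), {1, 3})
# fragments == [['C', 'C'], ['O', 'C'], ['C', 'N']]
rebuilt = [atom for frag in fragments for atom in frag]
```

Because the fragments are collected in scanning order, concatenating them restores the original token sequence, which mirrors the reconstruction property noted above.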
Several methods (Dai et al., 2018; Kusner et al., 2017) choose to generate a sequence of production rules or commands, guided by which the graph can be constructed sequentially. Some structured data (e.g., molecules) come with formal grammars, which impose strict semantic constraints. Thus, to enforce the semantic validity of the generated graphs, graph generation is transformed into generating their parse trees, which are derived from a context-free grammar (CFG); each parse tree can in turn be expressed as a sequence of rules based on a pre-defined order.
Kusner et al. (2017) propose generating a parse tree that describes a discrete object (e.g., an arithmetic expression or a molecule) by a grammar, in a graph generation method named GrammarVAE. Taking molecule generation as an example: to encode the parse tree, they decompose it into a sequence of production rules by performing a pre-order traversal of its branches from left to right, and then convert these rules into one-hot indicator vectors, where each dimension corresponds to a rule in the SMILES grammar. A deep convolutional neural network then maps these vectors into a continuous latent vector. During decoding, the continuous vector is passed through an RNN, which produces a set of unnormalized log probability vectors (or ‘logits’). Each dimension of the logit vectors corresponds to a production rule in the grammar. The model generates the parse trees directly in a top-down direction by repeatedly expanding the tree with its production rules. The molecules are then generated by following the sequentially generated rules, as shown in Fig. 3 (d). Although the CFG provides a mechanism for generating syntactically valid objects, it cannot guarantee that the generated objects are semantically valid (Kusner et al., 2017). To deal with this limitation, Dai et al. (2018) propose the syntax-directed variational autoencoder (SD-VAE), in which a semantic restriction component is advanced to the stage of the syntax tree generator. This yields a generator that respects both syntactic and semantic validity.
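The rule-sequence representation can be illustrated with a toy grammar. The sketch below uses a hypothetical four-rule arithmetic grammar instead of the SMILES grammar, and it only shows the one-hot encoding of a rule sequence and the top-down expansion of the leftmost non-terminal; the encoder and decoder networks are left out.

```python
# Hypothetical toy grammar: each rule is (left-hand side, right-hand side).
RULES = [
    ("S", ["S", "+", "T"]),
    ("S", ["T"]),
    ("T", ["x"]),
    ("T", ["y"]),
]
NONTERMINALS = {"S", "T"}

def one_hot(rule_ids):
    """One indicator vector per rule; each dimension corresponds to one grammar rule."""
    return [[1 if k == r else 0 for k in range(len(RULES))] for r in rule_ids]

def decode(rule_ids):
    """Apply the rules top-down, always expanding the leftmost non-terminal."""
    symbols = ["S"]
    for r in rule_ids:
        lhs, rhs = RULES[r]
        pos = next(i for i, s in enumerate(symbols) if s in NONTERMINALS)
        if symbols[pos] != lhs:
            raise ValueError("rule does not match the leftmost non-terminal")
        symbols[pos : pos + 1] = rhs
    return "".join(symbols)

rule_sequence = [0, 1, 2, 3]      # pre-order traversal of the parse tree of "x+y"
encoded = one_hot(rule_sequence)
expression = decode(rule_sequence)
# expression == "x+y"
```

Any rule sequence accepted by `decode` yields a string of the grammar, which illustrates the syntactic guarantee; whether the string is also semantically meaningful is a separate question.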
3.2. Generating a Graph in One Shot
These methods learn to map each whole graph into a single latent representation that follows some probabilistic distribution in the latent space. Each whole graph can then be generated by directly sampling from this distribution in one step. The core issue of these methods is usually how to jointly generate the graph topology together with the node/edge attributes (if any). Considering that the graph topology can usually be represented as either an adjacency matrix or an edge list, the existing methods can be categorized as adjacency-based and edge-list-based. The former focus on directly generating the whole adjacency matrix, while the latter generate the graph topology by examining the existence of edges for different pairs of nodes.
Adjacency-based one-shot methods assume complex dependence among the elements of a graph and generate the whole graph in one step while considering the interactions among nodes and edges. They vary in their decoding techniques, by which the adjacency matrix or edge attribute tensor and the node attribute matrix are jointly generated from a graph-level latent representation. The main challenge is how to ensure correlation among the elements of a graph so that global properties are preserved. In terms of the techniques used to tackle this challenge, there are three categories of adjacency-based one-shot methods, elaborated as follows.
MLP-based methods. Most one-shot graph generation techniques simply construct the graph decoder using an MLP (Simonovsky and Komodakis, 2018; Ma et al., 2018; Anand and Huang, 2018; Fan and Huang, 2019; Pölsterl and Wachinger, 2019; De Cao and Kipf, 2018), where the models’ parameters can be optimized under common frameworks such as VAE and GAN. The MLP-based models ingest a latent graph representation and simultaneously output the adjacency matrix and the node attribute matrix, as shown in Fig. 4 (a). Specifically, the generator takes D-dimensional vectors sampled from a statistical distribution such as the standard normal distribution and outputs graphs. For each latent vector, it outputs, through two simple MLPs, two continuous and dense objects: one defining the edge attributes and one denoting the node attributes. Both have a probabilistic interpretation, since each node and edge attribute is represented by the probabilities of a categorical distribution over types. To generate the final graph, discrete-valued objects must be obtained from these two continuous objects. Existing works realize this step in two ways, detailed as follows.
In the first way, existing works (Simonovsky and Komodakis, 2018; Anand and Huang, 2018; Ma et al., 2018) use a sigmoid activation function to compute the continuous edge and node objects at training time. At test time, the discrete-valued estimates are obtained by taking the edge-wise and node-wise argmax of these objects. In the second way, existing works (Pölsterl and Wachinger, 2019; De Cao and Kipf, 2018; Fan and Huang, 2019) leverage categorical reparameterization with the Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017), which samples the discrete-valued objects from a categorical distribution during the forward pass and uses the original continuous-valued objects in the backward pass. In this way, these methods can perform continuous-valued operations during training and apply categorical sampling to generate the final discrete graph.
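The two discretization strategies can be contrasted on a single edge. The sketch below is illustrative only: `edge_probs` is a hypothetical decoder output over three edge types, the temperature value is arbitrary, and the backward-pass behavior of the Gumbel-Softmax is not shown because no automatic differentiation is used.

```python
import math
import random

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda k: values[k])

def gumbel_softmax(logits, tau, rng):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)               # guard against log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    return softmax(noisy)

edge_probs = [0.2, 0.7, 0.1]                       # hypothetical decoder output for one edge

# First way: deterministic argmax at test time.
argmax_type = argmax(edge_probs)

# Second way: sample a relaxed one-hot vector, then discretise it in the forward pass.
logits = [math.log(p) for p in edge_probs]
relaxed = gumbel_softmax(logits, tau=0.5, rng=random.Random(0))
sampled_type = argmax(relaxed)
```

Lower temperatures push the relaxed vector closer to a one-hot vector, at the price of higher-variance gradients in a real training setup.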
Message-passing-based methods. Message-passing-based methods generate graphs by iteratively refining the graph topology and node representations of an initialized graph through an MPNN. Specifically, based on a latent representation sampled from a simple distribution (e.g., Gaussian), an initial adjacency matrix and initial node latent representations of fixed length are usually generated first (here we omit the node ordering symbol for clarity). These are then updated through an MPNN with multiple layers to generate the final graph, as shown in Fig. 4 (c). Existing methods either leverage common generative frameworks such as VAE and GAN (Bresson and Laurent, 2019; Guarino et al., 2017; Flam-Shepherd et al., 2020), or use a plain framework based on a score-based generative process (Niu et al., 2020).
For works utilizing common generative frameworks such as VAE and GAN, the decoder is implemented as follows (Bresson and Laurent, 2019; Guarino et al., 2017; Flam-Shepherd et al., 2020). Normally, the first step projects the initial latent representation from the fixed-dimensional latent space to an initial state for each node through MLP-based or RNN-based networks. A fully connected graph is also initialized, with each entry assigned the same latent values. Next, using the initialized graph, message passing is performed on both the node and edge representations to update them at each layer as:
where the remaining symbols denote trainable parameters. Finally, after all layers have been applied, the outputs are used to parameterize the categorical distributions of each edge and node, from which each edge and node is generated through the categorical sampling introduced above.
For the score-based generative modeling process, the core is to design a plain graph generation framework based on a score function (Niu et al., 2020). Specifically, existing methods usually first sample the number of nodes to be generated from the empirical distribution of the number of nodes in the training dataset. They then sample the adjacency matrix with annealed Langevin dynamics (Song and Ermon, 2019). To do so, they first initialize the adjacency matrix with each element following a normal distribution. Then, they update the adjacency matrix by iteratively sampling from a series of trained conditional score models (i.e., a parameterized score function conditioned on the noise level) using Langevin dynamics, proceeding through a sequence of noise levels. To implement the score function, MPNN-based score networks, as described in Eq. 24, are introduced. Formally, the output of the score function is given by:
where the score function network has a fixed number of MPNN layers, and its MLP-based output layer has learnable parameters that are specific to each noise level, together with weights and a bias that are shared by the output layers of all score function models. The concatenation operation in the above equation joins all of its arguments into a single vector.
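The sampling loop of annealed Langevin dynamics can be sketched on a tiny example. This is a toy illustration under loud assumptions: `toy_score` is the closed-form score of a Gaussian centered on a fixed target matrix and merely stands in for the trained MPNN-based score network, the step-size schedule follows the form used by Song and Ermon (2019), and the final symmetrization and thresholding at 0.5 are assumptions made here to obtain a discrete undirected graph.

```python
import math
import random

TARGET = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # toy adjacency matrix the score points toward

def toy_score(a, sigma):
    """Score of N(TARGET, sigma^2 I); a stand-in for a trained conditional score model."""
    n = len(a)
    return [[(TARGET[i][j] - a[i][j]) / sigma ** 2 for j in range(n)] for i in range(n)]

def annealed_langevin(n, sigmas, steps, eps, rng):
    # Initialise every entry from a normal distribution.
    a = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    for sigma in sigmas:                                  # from large to small noise
        alpha = eps * sigma ** 2 / sigmas[-1] ** 2        # step size for this noise level
        for _ in range(steps):
            s = toy_score(a, sigma)
            for i in range(n):
                for j in range(n):
                    a[i][j] += 0.5 * alpha * s[i][j] + math.sqrt(alpha) * rng.gauss(0.0, 1.0)
    # Assumed post-processing: symmetrise, drop self-loops, threshold at 0.5.
    return [[0 if i == j else int((a[i][j] + a[j][i]) / 2 > 0.5) for j in range(n)]
            for i in range(n)]

sample = annealed_langevin(3, [1.0, 0.5, 0.1, 0.01], steps=100, eps=2e-5,
                           rng=random.Random(0))
```

With a learned score network, the same loop would draw varied graphs from the data distribution; with this toy score it simply converges to the fixed target.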
Invertible-transform-based methods. Flow-based generative methods can also perform one-shot generation by learning an invertible function between the graph and a latent variable sampled from a simple prior distribution (e.g., Gaussian), as shown in Fig. 4 (b). Based on the vanilla flow-based learning techniques introduced in Section 2.3, special forward and backward transformations need to be designed.
Madhawa et al. (2019) propose the first flow-based one-shot graph generation model, called GraphNVP. In the forward transformation, which maps a graph to its latent representation, they first convert the discrete adjacency and node attribute variables into continuous variables by adding real-valued noise (as in Eq. 12), which is known as dequantization. Then two types of reversible affine coupling layers, adjacency coupling layers and node attribute coupling layers, are used to transform the adjacency matrix and the node attribute matrix into their respective latent representations. The reversible coupling layers are designed as follows:
In these layers, the scale and translation operations are combined with the input through element-wise multiplication and addition; one pair of scale and translation functions can be implemented based on MPNN, and the other pair based on MLP networks. In the backward transformation, which maps latent representations back to a graph, the inverse of the forward transformation in Eq. 26 and 27 is applied. Specifically, after drawing random latent samples for the adjacency matrix and the node attributes, a sequence of inverted adjacency coupling layers is applied to obtain a probabilistic adjacency matrix, from which a discrete adjacency matrix is constructed by taking the node-wise and edge-wise argmax. Next, a probabilistic feature matrix is generated from the sampled node latent representation and the generated adjacency matrix through a sequence of inverted node attribute coupling layers. Likewise, the node-wise argmax of the probabilistic feature matrix yields the discrete feature matrix.
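The invertibility that makes coupling layers usable in both directions can be checked with a generic example. The sketch below is a plain affine coupling layer on a flat vector, not the row-wise adjacency or node attribute coupling layers of GraphNVP; `scale` and `translate` are hypothetical hand-written functions standing in for the learned MPNN- or MLP-based networks.

```python
import math

def scale(x_a):
    """Hypothetical scale function (a learned network in practice)."""
    return [math.tanh(v) for v in x_a]

def translate(x_a):
    """Hypothetical translation function (a learned network in practice)."""
    return [2.0 * v + 1.0 for v in x_a]

def coupling_forward(x):
    """Keep the first half unchanged; transform the second half conditioned on the first."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    s, t = scale(x_a), translate(x_a)
    y_b = [b * math.exp(si) + ti for b, si, ti in zip(x_b, s, t)]
    return x_a + y_b

def coupling_inverse(y):
    """Undo the forward pass exactly, reusing scale and translate on the unchanged half."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    s, t = scale(y_a), translate(y_a)
    x_b = [(b - ti) * math.exp(-si) for b, si, ti in zip(y_b, s, t)]
    return y_a + x_b

x = [0.3, -1.2, 0.7, 2.0]
z = coupling_forward(x)
x_back = coupling_inverse(z)
```

The inverse never needs to invert `scale` or `translate` themselves, which is why these functions can be arbitrarily complex networks.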
Honda et al. (2019) propose a graph residual flow (GRF) with more flexible and complex non-linear mappings than the above mentioned coupling flows in GraphNVP. The forward transformation is designed as follows:
where the two functions are the residual blocks for the node attribute matrix and the adjacency matrix at each layer. Each residual block is implemented based on GCNs (Kipf and Welling, 2017) and is proved to be invertible. The backward transformation is similar to that of GraphNVP, except that the inverses of the residual blocks are computed by fixed-point iteration (Behrmann et al., 2019).
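Inverting a residual block by fixed-point iteration can be demonstrated on a scalar. The sketch assumes a hypothetical residual function `g` whose Lipschitz constant is below 1, which is the condition under which the iteration converges; the GCN-based residual blocks of the cited model are not reproduced here.

```python
import math

def g(x):
    """Hypothetical contractive residual function (Lipschitz constant 0.5)."""
    return 0.5 * math.tanh(x)

def residual_forward(x):
    return x + g(x)

def residual_inverse(y, iterations=100):
    """Solve x = y - g(x) by fixed-point iteration, starting from x = y."""
    x = y
    for _ in range(iterations):
        x = y - g(x)
    return x

x_true = 1.3
y = residual_forward(x_true)
x_recovered = residual_inverse(y)
```

Each iteration shrinks the error by at least the Lipschitz constant of `g`, so a contractive residual function gives fast convergence.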
Edge-list-based one-shot methods typically require a generative model that learns edge probabilities, based on which all the edges are generated independently. These methods are usually used to learn from one large-scale graph and to generate a synthetic one given the known nodes. In terms of how the edge probabilities are generated, existing works are further categorized into two groups, namely random-walk-based (Bojchevski et al., 2018; Zhang, 2019; Gamage et al., 2020; Caridá et al., 2019) and node-similarity-based (Kipf and Welling, 2016; Grover et al., 2019; Zou and Lerman, 2019; Liu et al., 2019b; Salha et al., 2019).
Random-walk-based. Methods of this type derive the edge probabilities from a score matrix, which is calculated from the frequency with which each edge appears in a set of generated random walks. Bojchevski et al. (2018) propose NetGAN to mimic large-scale real-world networks. In the first step, a GAN-based generative model learns the distribution of random walks over the observed graph and then generates a set of random walks. In the second step, a score matrix is constructed, where each entry counts how often the corresponding edge appears in the set of generated random walks. Finally, the edge probability matrix is calculated from the score matrix and is used to generate individual edges through efficient sampling processes.
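The second and third steps can be sketched as follows. The sketch makes two assumptions that should be read as such: the walks are drawn directly from a toy graph instead of being produced by a trained GAN, and the edge probability is taken to be each edge's count divided by the total count, which is one natural normalization and is not confirmed by the text above.

```python
import random
from collections import Counter

def random_walk(adj, length, rng):
    """Plain random walk on a toy graph; a stand-in for walks produced by a generator."""
    node = rng.choice(sorted(adj))
    walk = [node]
    for _ in range(length - 1):
        node = rng.choice(adj[node])
        walk.append(node)
    return walk

def score_matrix(walks):
    """Count how often each undirected edge appears across all walks."""
    counts = Counter()
    for walk in walks:
        for u, v in zip(walk, walk[1:]):
            counts[tuple(sorted((u, v)))] += 1
    return counts

def edge_probabilities(counts):
    """Assumed normalisation: count of an edge divided by the total count."""
    total = sum(counts.values())
    return {edge: c / total for edge, c in counts.items()}

toy_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)
walks = [random_walk(toy_adj, 5, rng) for _ in range(50)]
probs = edge_probabilities(score_matrix(walks))
```

Edges that never appear in any walk receive no score, so the number and length of the walks control how much of the graph can be reproduced.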
Following this, some works improve NetGAN by changing how the first node of a random walk is chosen (Caridá et al., 2019) or by learning spatial-temporal random walks for spatial-temporal graph generation (Zhang, 2019). Gamage et al. (2020) generalize NetGAN by adding two motif-biased random-walk GANs. The edge probabilities are thus calculated from the score matrices of three sets of random walks, one generated by each of the three GANs. To sample each edge, one view is randomly selected from the three score matrices, and the edge probability is calculated from the selected score matrix.
Node-similarity-based. These methods generate the edge probabilities based on pairwise relationships between the embeddings of the given or sampled nodes (as in Kipf and Welling (2016)). Specifically, the probabilistic adjacency matrix is generated from the node representations, each of which is the latent representation of one node, and is then used to generate individual edges through efficient sampling processes. Existing methods differ in how they calculate the probabilistic adjacency matrix.
Several works (Kipf and Welling, 2016; Grover et al., 2019; Zou and Lerman, 2019) compute the edge probability from the inner product of the embeddings of the two nodes. This reflects the idea that nodes that are close in the embedding space should have a high probability of being connected. These works require a setting where the node set is pre-defined and the node attributes are known in advance. Specifically, after sampling the node latent representations from the standard normal distribution, Kipf and Welling (2016) calculate the probabilistic adjacency matrix from their pairwise inner products. The adjacency matrix is then sampled from this matrix, which parameterizes the Bernoulli distribution of edge existence, similar to the work of Zou and Lerman (2019). To further consider the complex dependence among generated edges, Grover et al. (2019) propose an iterative two-step approach that alternates between defining an intermediate graph and gradually refining it through message passing. Formally, given a latent matrix and an input feature matrix, they iterate over the following sequence of operations: