1 Introduction
Labeled graphs are powerful, complex data structures that can describe collections of related objects. Such collections could be atoms forming molecular graphs, users connecting on online social networks, or papers connected by citations. The connected objects, or nodes, may be of different types or classes, and the graphs themselves may belong to particular categories. Methods that reason about this flexible and rich representation can empower analyses of important, complex real-world phenomena. One key approach for reasoning about such graphs is to learn probability distributions over graphs. In this paper, we introduce a method that learns generative models for labeled graphs in which the nodes and the graphs may have categorical labels.
A high-quality generative model should be able to synthesize labeled graphs that preserve global structural properties of realistic graphs. Such a tool could be valuable in various settings. One motivating example application is in situations where data owners wish to share graph data but must protect sensitive information. For example, online social network providers may want to enable the scientific community to study the structural aspects of their user networks, but revealing structure could allow re-identification or other privacy-invading inferences (Zheleva et al., 2012). A generative model that can create realistic graphs that do not represent real-world users could allow for this kind of study.
Recently, Goodfellow et al. (2014) described generative adversarial networks (GANs), which have been widely explored in computer vision and natural language processing (Zhang et al., 2017; Yu et al., 2017) for generating realistic images and text, as well as performing tasks such as style transfer. GANs are composed of two neural networks. The first is a generator network that learns to map from a latent space to the distribution of the target data, and the second is a discriminator network that tries to distinguish real data from candidates synthesized by the generator. The two networks compete with each other during training, and each improves based on feedback from the other. The success of this general GAN framework has proven it to be a powerful tool for learning the distributions of complex data.
Motivated by the power of GANs, researchers have also used them to generate graphs. Bojchevski et al. (2018) proposed NetGAN, which uses the GAN framework to generate random walks on graphs. De Cao and Kipf (2018) proposed MolGAN, which generates molecular graphs using a combination of a GAN framework and a reinforcement learning objective. However, existing methods have notable limitations, such as limited generality across graphs with different structures and limited scalability to graphs of different sizes. Furthermore, they are unable to generate graphs with node labels, a critical feature of some graph-structured data.
The rapid development of deep learning techniques has also led to advances in representation learning on graphs. Many works use deep architectures to extract high-level features from nodes and their neighborhoods, capturing both node and structure information (Kipf and Welling, 2017; Hamilton et al., 2017; Fan and Huang, 2017). These methods have been shown to be useful for many applications, such as link prediction and collective classification.
Building on these advances, we propose the labeled graph generative adversarial network (LGGAN), a deep generative model trained using a GAN framework to generate graph-structured data with node labels. LGGAN can be used to generate various kinds of graph-structured data, such as citation graphs, knowledge graphs, and protein graphs. Specifically, the generator in an LGGAN generates an adjacency matrix as well as labels for the nodes, and its discriminator uses a graph convolutional network (Kipf and Welling, 2017) with residual connections to identify real graphs using adaptive, structure-aware, higher-level graph features. Our approach is the first deep generative method that addresses the generation of labeled graph-structured data. In experiments, we demonstrate that our model can generate realistic graphs that preserve important properties of the training graphs. We evaluate our model on various datasets with different graph types (such as ego networks and proteins) and with different sizes. Our experiments demonstrate that LGGAN effectively learns distributions of different graph structures and that it can scale up to generate large graphs without losing much quality.
2 Related Work
Generative graph models were pioneered by Erdös and Rényi (1959), who introduced random graphs where each possible edge appears independently with a fixed probability. More realistic models followed, such as the preferential attachment model of Albert and Barabási (2002), which grows graphs by adding nodes and connecting them to existing nodes with probability proportional to their current degrees. Goldenberg et al. (2010) proposed the stochastic block model (SBM), and Airoldi et al. (2008) proposed the mixed-membership stochastic block model (MMSB). The SBM is a more complex version of the Erdös-Rényi (ER) model that can generate graphs with multiple communities. Instead of assuming that every pair of nodes connects with identical probability, an SBM predefines the number of communities in the generated graph and uses a probability matrix of connections among the different types of nodes. Compared to the ER model, SBMs are more useful since they can learn more nuanced distributions of graphs from data. However, SBMs are still limited in that they can only generate graphs with this kind of community structure.
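To make the contrast between the ER model and the SBM concrete, the following minimal sketch samples an undirected graph from each. The function names, the edge-set representation, and the example probabilities are illustrative assumptions rather than anything specified in the works cited above.

```python
import random


def erdos_renyi(n, p, rng):
    """Sample an undirected ER graph as a set of edges (i, j) with i < j.

    Every possible edge appears independently with the same probability p.
    """
    return {(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p}


def stochastic_block_model(block_of, probs, rng):
    """Sample an undirected SBM graph as a set of edges (i, j) with i < j.

    block_of[i] is the community of node i, and probs[a][b] is the
    probability of an edge between a node in community a and one in b.
    """
    n = len(block_of)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if rng.random() < probs[block_of[i]][block_of[j]]
    }


rng = random.Random(0)
er_edges = erdos_renyi(20, 0.2, rng)

# Two communities of ten nodes: dense inside a community, sparse across.
blocks = [0] * 10 + [1] * 10
sbm_edges = stochastic_block_model(blocks, [[0.8, 0.05], [0.05, 0.8]], rng)
```

With a single community, the SBM sampler reduces to the ER sampler, which is the sense in which the SBM generalizes the ER model.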
With the recent development of deep learning, some works have proposed deep models to represent the distribution of graphs. Li et al. (2018) proposed DeepGMG, a framework based on graph neural networks that generates graphs by expansion, adding new structures at each step. You et al. (2018b) proposed GraphRNN, which decomposes graph generation into generating node and edge sequences from a hierarchical recurrent neural network. Simultaneously, researchers have also been developing other implicit yet powerful methods for generating graphs, especially based on the success of generative adversarial networks (Goodfellow et al., 2014). For example, Bojchevski et al. (2018) proposed NetGAN, which uses the GAN framework to generate random walks on graphs from which structure can be inferred, and De Cao and Kipf (2018) proposed MolGAN to generate molecular graphs using a combination of the GAN framework and reinforcement learning.
However, these recently proposed deep models are limited either to generating small graphs with 40 or fewer nodes (Li et al., 2018) or to generating specific types of graphs such as molecular graphs (You et al., 2018a; De Cao and Kipf, 2018), with no straightforward generalization to other domains because they rely on specialized tools to calculate molecule-specific losses. More broadly, most of these recently proposed methods cannot generate labeled graphs.
3 Model
In this section, we introduce LGGAN and how it trains deep generative models for graph-structured data with node labels.
3.1 Alternate GAN Frameworks
Since the GAN framework was introduced by Goodfellow et al. (2014), many variations have been proposed that proved to be powerful for generation tasks. We therefore adopt three popular variations and compare how well they perform for our task of labeled graph generation. The first is the original GAN approach. The other two include extra information in the form of classification labels for the graphs themselves. The second follows the conditional GAN (Mirza and Osindero, 2014) framework, which feeds the graph label as an extra input to the generator in addition to the noise vector. We can use this label to generate graphs of different types. To improve on this, our last variation uses the auxiliary classifier GAN (AC-GAN) (Odena et al., 2017) framework, in which the discriminator not only distinguishes whether the graph is real or fake but also incorporates a classifier of the graph labels. Due to space constraints, details on these tests are in the appendix. Based on the results, we use the AC-GAN framework in our model for the following experiments.
3.2 Architecture
In this section, we provide details on the LGGAN architecture. As in the standard GAN framework, LGGAN consists of two main components: a generator and a discriminator. The generator takes a sample from a prior distribution and generates a labeled graph represented by an adjacency matrix and a node label matrix. The discriminator is then trained to distinguish samples from the dataset from samples produced by the generator. In LGGAN, both the generator and the discriminator are trained using CT-GAN (Wei et al., 2018), which builds on the improved Wasserstein GAN approach (Gulrajani et al., 2017).
Generator
LGGAN's generative model uses a multilayer perceptron (MLP) to produce the graph. The generator takes a random vector sampled from a standard normal distribution and outputs two matrices: (1) the node label matrix, whose rows are one-hot vectors that define the node labels; and (2) the adjacency matrix, which defines the connections among nodes in the graph. The architecture uses a fixed maximum number of nodes, but it is capable of generating structures with fewer nodes by dropping the nodes that are not connected to any other node in the generated graph. Since both the adjacency matrix and the label matrix are discrete and the categorical sampling procedure is non-differentiable, we directly use their continuous counterparts during the training procedure. The original GAN formulation measures the distance between distributions with the Jensen-Shannon (JS) divergence, which makes it unsuitable for generating discrete data. Therefore, we use variants of GANs that are based on the Wasserstein distance, such as CT-GAN (Wei et al., 2018) and WGAN-GP (Gulrajani et al., 2017), which have been shown to be applicable to discrete data such as text.
Discriminator
The discriminator takes a graph sample as input, represented by an adjacency matrix and a node label matrix, and it outputs a scalar value and a one-hot vector for the class label. For the discriminator, we use a graph convolutional network (GCN) (Kipf and Welling, 2017), which is a powerful neural network architecture for representation learning with complex graph structures. GCNs propagate information along graph edges with graph convolution layers. GCNs are also permutation-invariant, which is important when analyzing graphs because they are usually considered unordered objects.
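The handling of the generator's output described above (continuous matrices during training, isolated nodes dropped from the final graph) can be sketched as a small post-processing step. This is only an illustration under stated assumptions: the function name `discretize_graph`, the 0.5 threshold, the averaging used to symmetrize edge scores, and the argmax over label scores are all hypothetical choices that the text does not specify.

```python
def discretize_graph(adj_scores, label_scores, threshold=0.5):
    """Turn continuous generator outputs into a discrete labeled graph.

    adj_scores:   n x n list of edge scores in [0, 1] (assumed format).
    label_scores: n x c list of per-class scores for each node.
    Returns (adjacency matrix, node labels) with isolated nodes removed.
    """
    n = len(adj_scores)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Average both directions so the resulting graph is undirected
            # (an assumption made for this sketch).
            if (adj_scores[i][j] + adj_scores[j][i]) / 2 > threshold:
                adj[i][j] = adj[j][i] = 1

    # Pick the highest-scoring class for every node.
    labels = [max(range(len(row)), key=row.__getitem__) for row in label_scores]

    # Drop nodes that are not connected to any other node.
    keep = [i for i in range(n) if any(adj[i])]
    adj = [[adj[i][j] for j in keep] for i in keep]
    labels = [labels[i] for i in keep]
    return adj, labels


adj_scores = [
    [0.0, 0.9, 0.1, 0.2],
    [0.8, 0.0, 0.7, 0.1],
    [0.2, 0.6, 0.0, 0.0],
    [0.1, 0.3, 0.1, 0.0],
]
label_scores = [[0.7, 0.3], [0.2, 0.8], [0.6, 0.4], [0.5, 0.5]]
adj, labels = discretize_graph(adj_scores, label_scores)
```

In this example the fourth node has no edge above the threshold, so the output is a three-node path even though the generator works with four node slots.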
We add residual connections between hidden layers of the GCN to allow the model to fuse information from previous layers. We find that these residual connections circumvent the issues reported by Kipf and Welling (2017) that limit their GCNs to only two or three layers. Allowing more depth is important because some graph types, such as proteins and molecules, have complex structures that require incorporating information from nodes farther away in graph distance than can be reached with only three graph convolutions.
With residual connections, each layer of our GCN discriminator propagates with the following rule:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) + H^{(l)} \qquad (1)$$
where $H^{(l)}$ is the output matrix at the $l$th layer, $I$ is the identity matrix, $\tilde{A} = A + I$ is the adjacency matrix of the graph with self-connections added, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $W^{(l)}$ is the trainable weight matrix at the $l$th layer, and $\sigma(\cdot)$ denotes an activation function (such as the sigmoid or ReLU (Nair and Hinton, 2010)). Since we do not include node attributes, we set $H^{(0)} = I$, where $I$ is the identity matrix.
After $L$ layers of propagation via graph convolutions, we aggregate the outputs from each layer with an aggregation function agg, such as concatenation or max-pooling. We then concatenate the aggregated matrix with the node label matrix $X$ and output the result $h$ as the final representation we learned for the graph:
$$h = \mathrm{concat}\left(\mathrm{agg}\left(H^{(1)}, \ldots, H^{(L)}\right), X\right) \qquad (2)$$
The representation of the graph is then processed by a linear layer to produce the outputs of the discriminator: a graph-level scalar probability that the input is real data, and a one-hot vector predicting the category to which the graph belongs. We illustrate the structure of this discriminative model in Figure 1.
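The discriminator's feature extraction can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard symmetric normalization of Kipf and Welling (2017), a ReLU activation, an identity input (no node attributes), square weight matrices so that the residual sum is well defined, and element-wise max-pooling across layers as the aggregation function. All function names are hypothetical, and the final linear layer is omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Add self-connections, then scale by inverse square-root degrees."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [
        [inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def residual_gcn_features(adj, labels_onehot, weights):
    """Run residual graph convolutions and build per-node features.

    weights holds one n x n matrix per layer (n = number of nodes), so the
    layer output has the same shape as its input and can be added to it.
    """
    n = len(adj)
    a_hat = normalized_adjacency(adj)
    h = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity input
    outputs = []
    for w in weights:
        conv = relu(matmul(matmul(a_hat, h), w))
        # Residual connection: add the layer input back to the layer output.
        h = [[c + r for c, r in zip(crow, hrow)] for crow, hrow in zip(conv, h)]
        outputs.append(h)
    # Aggregate across layers with an element-wise maximum.
    pooled = [[max(layer[i][j] for layer in outputs) for j in range(n)] for i in range(n)]
    # Concatenate the aggregated features with the node label matrix.
    return [prow + lrow for prow, lrow in zip(pooled, labels_onehot)]


path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
labels = [[1, 0], [0, 1], [1, 0]]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
features = residual_gcn_features(path, labels, [identity, identity])
```

Because each layer adds its input back to its output, a layer whose weights are all zero simply passes the previous representation through, which is one way to see why residual connections make deeper stacks easier to train.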
In Section 4.4, we evaluate the influence of the depth of the GCN, with and without residual connections and with different aggregation functions, to guide the choice of LGGAN settings for the experiments.
3.3 Training
GANs (Goodfellow et al., 2014) train via a minimax game with two players competing to improve themselves. In theory, the method converges when it reaches a Nash equilibrium, where the samples produced by the generator match the data distribution. However, this process is highly unstable and often results in problems such as mode collapse (Goodfellow, 2016). To deal with the most common problems in training GANs, such as mode collapse and unstable training, we use the CT-GAN (Wei et al., 2018) framework, which is one of the state-of-the-art approaches. CT-GAN adds a consistency term to the improved Wasserstein GAN (WGAN-GP) (Gulrajani et al., 2017) that helps preserve Lipschitz continuity during WGAN-GP training. We also adopt several techniques, such as feature matching and minibatch discrimination, that were shown to encourage convergence and help avoid mode collapse (Salimans et al., 2016). Details are given in the appendix.
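For reference, the critic objective of WGAN-GP, on which CT-GAN builds, has the following form as given by Gulrajani et al. (2017). Here $P_r$ is the data distribution, $P_g$ is the generator distribution, $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated samples, and $\lambda$ is the penalty weight. CT-GAN adds a further consistency term to this loss, which is omitted in this sketch.

```latex
L = \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right]
  - \mathbb{E}_{x \sim P_r}\left[D(x)\right]
  + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
    \left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]
```

The last term penalizes the critic whenever the norm of its gradient departs from one, which is a soft way of enforcing the Lipschitz constraint that the Wasserstein formulation requires.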
3.4 Node Ordering
A common representation for graph structure uses adjacency matrices. However, using matrices to train a generative model introduces the issue of how to define the node ordering in the adjacency matrix. There are $n!$ permutations of $n$ nodes, and it is time-consuming to train over all of them.
For LGGAN, we use a GCN with residual connections and a node aggregation operator (De Cao and Kipf, 2018) as the discriminator. This discriminator is invariant to node ordering, avoiding the issue. However, for the generator, we use an MLP, which does depend on node ordering. Therefore, we adapt the approach of You et al. (2018b) and arrange the nodes of each training graph in a breadth-first-search (BFS) ordering.
In particular, we preprocess the adjacency matrix and node label matrix by feeding them into a BFS function. This function takes a random permutation of the nodes in the graph as input, picks a node as the starting node, and outputs another permutation that is a BFS ordering of the nodes starting from that node. By specifying a structure-determined node ordering for the graph, we only need to consider all possible BFS orderings, rather than all possible node permutations. This reduction makes a significant difference in computational complexity when graphs are large.
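The BFS preprocessing described above can be sketched as follows. The function names `bfs_order` and `reorder` are hypothetical, and two details are assumptions made for this illustration: the starting node is taken to be the first node of the random permutation, and the search restarts from the next unvisited node so that disconnected graphs are also covered.

```python
import random
from collections import deque


def bfs_order(adj, rng):
    """Return a BFS ordering of the nodes of an n x n 0/1 adjacency matrix.

    A random permutation fixes the starting node and breaks ties among
    neighbors, so repeated calls can yield different BFS orderings.
    """
    n = len(adj)
    perm = list(range(n))
    rng.shuffle(perm)
    rank = {node: position for position, node in enumerate(perm)}

    visited, order = set(), []
    for start in perm:  # restarting covers disconnected graphs (assumption)
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            unvisited = (v for v in range(n) if adj[u][v] and v not in visited)
            for v in sorted(unvisited, key=rank.get):
                visited.add(v)
                queue.append(v)
    return order


def reorder(adj, labels, order):
    """Permute the adjacency matrix and the node labels consistently."""
    new_adj = [[adj[i][j] for j in order] for i in order]
    new_labels = [labels[i] for i in order]
    return new_adj, new_labels


path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
order = bfs_order(path, random.Random(0))
new_adj, new_labels = reorder(path, ["a", "b", "c", "d"], order)
```

For a connected graph, every node after the first appears next to at least one earlier node in the ordering, which is what keeps the number of admissible orderings far below the number of arbitrary permutations.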
4 Experiments
In this section, we compare LGGAN with other graph generation methods to demonstrate its ability to generate high-quality labeled graphs in diverse settings. We further evaluate the quality of the generated graphs by applying them to a downstream task: graph classification.
4.1 Baselines
We compare our model against various traditional generative models for graphs, as well as some recently proposed deep graph generative models. For traditional baselines, we compare against the Erdös-Rényi (ER) model (Erdös and Rényi, 1959), the Barabási-Albert (BA) model (Albert and Barabási, 2002), and mixed-membership stochastic block models (MMSB) (Airoldi et al., 2008). We also compare with recently proposed deep graph generative models, namely DeepGMG (Li et al., 2018) and GraphRNN (You et al., 2018b). Few current approaches are designed to generate labeled graphs. One exception is MolGAN (De Cao and Kipf, 2018), which is designed to generate only molecular graphs and needs specialized evaluation methods specific to that task, so we do not compare against it.
4.2 Datasets
We perform experiments on different kinds of datasets with varying sizes and characteristics. Details about the statistics of these datasets are given in the appendix.
Citation graphs. We test on scientific citation networks, using the Cora and Citeseer datasets (Sen et al., 2008). The Cora dataset is a collection of 2,708 machine learning publications categorized into seven classes, and the Citeseer dataset is a collection of 3,312 research publications crawled from the CiteSeer repository. To test the scalability of LGGAN, we extracted subsets with different graph sizes by constraining the number of nodes in each graph. For the small datasets (denoted Cora_small and Citeseer_small), we extract two-hop and three-hop ego networks within a small node-count range. For the large datasets (denoted Cora and Citeseer), we extract three-hop ego networks within a larger node-count range. For the graph label, we use the node label of the center node in the ego network.
Protein graphs. We also test on multiple collections of protein molecular graphs. The ENZYMES dataset consists of 600 enzymes (Schomburg et al., 2004). Each enzyme in the dataset is labeled with one of the six enzyme commission (EC) code top-level classes. The PROTEINS dataset includes proteins from the dataset of enzymes and non-enzymes created by Dobson and Doig (2003). There are two graph labels: enzymes and non-enzymes.
4.3 Metrics for the Quality of Generated Labeled Graphs
To evaluate the quality of the generated graphs, we follow the approach used by You et al. (2018b): we compare a distribution of generated graphs with that of real ones by measuring the maximum mean discrepancy (MMD) (Gretton et al., 2012) of graph statistics, capturing how close their distributions are. We use four graph statistics to evaluate the generated graphs: degree distribution, clustering coefficient distribution, node-label distribution, and average orbit count statistics.
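As a rough illustration of the metric, the sketch below computes a biased estimate of the squared MMD between two sets of graph statistics, each represented as a normalized histogram. The choice of a plain Gaussian kernel on histogram vectors and the bandwidth `sigma` are assumptions made for this example; the kernel used in the experiments is not specified in this section.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel between two equal-length histogram vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of the squared MMD between samples xs and ys."""

    def mean_kernel(a, b):
        return sum(kernel(u, v) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Hypothetical degree histograms (fractions of nodes with degree 0, 1, 2).
real_histograms = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0]]
generated_histograms = [[0.0, 0.5, 0.5], [0.1, 0.4, 0.5]]
score = mmd_squared(real_histograms, generated_histograms)
```

A score near zero means the two sets of statistics are hard to tell apart under the chosen kernel, so lower values indicate generated graphs that better match the real ones.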
Since we are generating labeled graphs, we also want to evaluate the graph distribution in each class. To do this, we extract subgraphs for each class from both the training graphs and the generated graphs and evaluate them on the three structural metrics (excluding the label distribution). These per-label tests help check whether the model simply assigns classes according to the label distribution without considering the underlying graph structure. Due to space constraints, the details are given in the appendix.
4.4 Evaluating the Design of the Residual GCN Discriminator
To evaluate the influence of residual connections and the depth of the GCN discriminator, we report the results for graph generation on the ENZYMES dataset based on the four evaluation metrics mentioned in Section 4.3. Through these tests, we aim to investigate the following design aspects: (1) the GCN depth, (2) residual connections, and (3) different aggregation functions (i.e., max-pooling and concatenation). We plot the results in Figure 2.
According to the plots, performance does not noticeably improve with more than two or three graph convolution layers unless we include residual connections. With residual connections, however, either aggregation function allows training deeper GCNs with more than five or six layers, achieving high-quality results that outperform the baseline models we compare against in Section 4.5. There is no notable difference between the aggregation functions. Since max-pooling does not introduce any additional parameters to learn, we use max-pooling as the aggregation function for residual connections in the remaining experiments.
To further evaluate the performance of the GCN-based discriminative model, we also compare LGGAN with a simpler variant that uses an MLP as the discriminator. Due to space constraints, the details are given in the appendix.
4.5 Comparing to Other Models
We compare LGGAN to other methods for generating graphs, including traditional generative models such as ER, BA, and MMSB as well as recently proposed deep generative models such as GraphRNN and DeepGMG. DeepGMG cannot be used to generate large graphs due to its high computational complexity, so its results on large graph datasets are not available. For each method, we measure three aspects. The first is the quality of the generated graphs, which should mimic the typical topology of the training graphs. The second is generality: a good generative model should be able to generalize to different and complex graph-structured data. The last is scalability: we want the model to be able to scale up to generate large networks instead of being restricted to relatively small graphs.
Table 1 lists results from our comparison. LGGAN achieves the best overall performance across datasets, with a lower MMD on average than both the traditional baselines and the state-of-the-art deep learning baseline GraphRNN. Although GraphRNN performs well on the two smaller protein-related datasets, ENZYMES and PROTEINS, it does not maintain the same performance on large datasets, such as Cora and Citeseer.
Table 1: MMD between generated and real graphs for each graph statistic (lower is better). N/A marks models that do not generate node labels; "-" marks results that are not available.
Model     Cora_small                        Citeseer_small                    Cora
          Degree  Clustering  Orbit  Label  Degree  Clustering  Orbit  Label  Degree  Clustering  Orbit  Label
ER        0.68    0.94        0.48   N/A    0.63    0.86        0.12   N/A    0.88    1.45        0.27   N/A
BA        0.31    0.53        0.11   N/A    0.37    0.18        0.11   N/A    0.54    1.06        0.16   N/A
MMSB      0.21    0.68        0.07   0.48   0.17    0.50        0.11   0.32   0.12    0.68        0.09   0.49
DeepGMG   0.34    0.44        0.27   N/A    0.27    0.36        0.20   N/A    -       -           -      -
GraphRNN  0.26    0.38        0.39   N/A    0.19    0.20        0.39   N/A    0.20    0.46        0.11   N/A
LGGAN     0.13    0.08        0.03   0.11   0.17    0.13        0.04   0.09   0.15    0.21        0.06   0.01

Model     PROTEINS                          ENZYMES                           Citeseer
          Degree  Clustering  Orbit  Label  Degree  Clustering  Orbit  Label  Degree  Clustering  Orbit  Label
ER        0.31    1.06        0.28   N/A    0.38    1.26        0.08   N/A    0.82    1.57        0.06   N/A
BA        0.93    0.88        0.05   N/A    1.17    1.08        0.51   N/A    0.32    1.04        0.08   N/A
MMSB      0.46    1.05        0.21   0.01   0.55    1.08        0.05   0.92   0.08    0.50        0.11   0.32
DeepGMG   0.96    0.63        0.16   N/A    0.43    0.38        0.08   N/A    -       -           -      -
GraphRNN  0.04    0.18        0.06   N/A    0.06    0.20        0.07   N/A    0.20    1.15        0.14   N/A
LGGAN     0.18    0.15        0.02   0.01   0.09    0.17        0.03   0.01   0.25    0.12        0.06   0.15
4.6 Downstream Task: Graph Classification
To further evaluate the quality of LGGAN's generated graphs, we take the generated examples and apply them to a downstream task: we use the synthetic graphs to train a model for graph classification. We first compute a kernel matrix for each of a set of graph kernels, where each entry represents an inner product between the representations of a pair of graphs. We can then train kernel support vector machines (SVMs) to classify the graphs. In our experiment, we choose three popular graph kernels: the graphlet kernel (GK) based on subgraph patterns, the shortest-path kernel (SP) based on shortest paths, and the Weisfeiler-Lehman subtree kernel (WL) based on subtrees.
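To illustrate what such a kernel entry looks like, the following sketch computes a simple Weisfeiler-Lehman subtree kernel value between two small labeled graphs. It is a simplified illustration rather than the implementation used in the experiments: the adjacency-list and label-dictionary input format, the number of refinement rounds, and the use of nested tuples instead of compressed labels are all assumptions made for this example.

```python
from collections import Counter


def wl_features(adj_list, labels, iterations=2):
    """Count node labels over successive WL refinement rounds.

    adj_list maps each node to its neighbors; labels maps each node to a
    sortable label. In each round a node's new label combines its current
    label with the sorted multiset of its neighbors' labels.
    """
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        current = {
            v: (current[v], tuple(sorted(current[u] for u in adj_list[v])))
            for v in adj_list
        }
        features.update(current.values())
    return features


def wl_kernel(g1, g2, iterations=2):
    """Inner product of the WL label-count vectors of two graphs."""
    f1 = wl_features(g1[0], g1[1], iterations)
    f2 = wl_features(g2[0], g2[1], iterations)
    return sum(count * f2[label] for label, count in f1.items())


# A triangle and a three-node path, with every node carrying label 0.
triangle = ({0: [1, 2], 1: [0, 2], 2: [0, 1]}, {0: 0, 1: 0, 2: 0})
path = ({0: [1], 1: [0, 2], 2: [1]}, {0: 0, 1: 0, 2: 0})

k_triangle_path = wl_kernel(triangle, path)
```

Evaluating this function for every pair of graphs yields the kernel matrix, which can then be handed to any SVM implementation that accepts precomputed kernels.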
We compare performance when training on synthetic graphs from LGGAN, MMSB (the only baseline model that can generate labeled graphs), and real graphs. We run this procedure with three datasets: Cora_small, ENZYMES and PROTEINS. For each dataset, we run ten trials and calculate the average value of the accuracy.
We list results in Table 2. The accuracy of models trained with graphs generated by LGGAN is close to that of models trained using the real graphs, especially compared to the models trained with graphs generated by MMSB. These results suggest that the graphs generated by LGGAN better capture the important aspects of graph structure.
Table 2: Graph classification accuracy (%) of kernel SVMs trained on graphs from each source.
Training set Cora_small           ENZYMES              PROTEINS
             GK     SP     WL     GK     SP     WL     GK     SP     WL
gen_MMSB     22.62  21.34  23.15  15.38  14.29  17.86  64.32  65.61  64.29
gen_LGGAN    23.44  26.56  26.92  23.08  26.92  30.19  65.61  66.54  69.23
real_graphs  26.56  29.69  34.32  23.08  29.68  35.71  69.05  73.81  76.19
4.7 Scalability
To evaluate the scalability of these methods, we perform experiments on two subsets of the Cora dataset with different graph sizes: the Cora_small and Cora datasets. As listed in Table 1, the traditional models all show a large gap between these two datasets in terms of the three structural evaluation metrics. Among the deep generative models, DeepGMG cannot generate large graphs due to the computational complexity of its generation procedure, which adds nodes one at a time. Compared to GraphRNN, LGGAN's MMD scores barely increase relative to the smaller dataset, suggesting that our model is more reliable and has the best ability to scale up to large graphs.
4.8 Generality
To evaluate the ability of LGGAN to adapt to different graph-structured data, we evaluate the results of all methods on the different domains of citation ego networks (Cora) and molecular protein graphs (ENZYMES). As shown in Table 1, LGGAN achieves more consistent results across datasets than the other models. Some other models generalize poorly; MolGAN, for example, can only generate specific or limited types of graph-structured data.
Some examples are visualized in Figure 3, which contains graphs generated by our model and the baselines. Although it is not as intuitive for humans to assess as, e.g., natural images, one can still see that LGGAN appears to capture the typical structures of datasets better than other models.
4.9 Diversity
A good labeled graph generative model should generate diverse examples. Two types of diversity are important: (1) diversity among generated examples would capture the natural variations in real graphs; and (2) diversity compared to training examples ensures that the generative model is doing more than exactly memorizing some training examples and outputting copies of them. Generative models should balance the need for generated outputs to be new graphs unseen during training while retaining important properties of the real data.
Therefore, to investigate to what extent our model can maintain these types of diversity, we calculate the Weisfeiler-Lehman kernel value for both the training graphs and the generated graphs (by MMSB and LGGAN) and compute the kernel distance between any two graphs as
(3) 
We plot histograms of the minimum distances between each generated example and the training set in Figure 4 for four datasets, Cora_small, Citeseer_small, PROTEINS and ENZYMES. In each plot, the left column shows the minimum distance of each training graph to any other graph; the middle column shows the minimum distance for each graph generated by LGGAN to the training graphs; and the right column shows the same for MMSB. These plots suggest that the graphs generated by our model are more similar to training graphs than the examples generated by MMSB, yet they are not exact copies of training graphs and have a similar diversity of graph distances as the real data.
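A common way to turn kernel values into a distance is the distance induced by the kernel in its feature space. The sketch below assumes that form, which may differ from the exact definition in Eq. (3), and shows how the minimum distance from each generated graph to the training set could be computed from precomputed kernel values. The function names and the input layout are hypothetical.

```python
import math


def kernel_distance(k_xx, k_yy, k_xy):
    """Feature-space distance induced by a kernel (assumed form).

    k_xx and k_yy are the kernel values of each graph with itself, and
    k_xy is the kernel value between the two graphs.
    """
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, k_xx + k_yy - 2 * k_xy))


def min_distances(cross_kernel, self_generated, self_training):
    """Minimum distance from each generated graph to the training set.

    cross_kernel[i][j] is the kernel value between generated graph i and
    training graph j; the other two arguments hold self-kernel values.
    """
    return [
        min(
            kernel_distance(self_generated[i], self_training[j], row[j])
            for j in range(len(row))
        )
        for i, row in enumerate(cross_kernel)
    ]


# One generated graph compared against two training graphs, using
# hypothetical kernel values.
distances = min_distances([[12, 27]], [27], [19, 27])
```

Under this definition, a minimum distance of zero means the generated graph is indistinguishable from some training graph under the kernel, so a histogram concentrated at zero would indicate memorization.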
5 Conclusion
In this work, we proposed a deep generative model using a GAN framework that generates labeled graphs. These labeled graphs can mimic distributions of citation graphs, knowledge graphs, social networks, and more. We also introduced an evaluation method for labeled graphs to measure how well the model learns the substructure of the labeled graphs. Our model can be useful for simulation studies, especially when access to labeled graph data is limited by availability or privacy concerns. We can use these models to generate synthetic datasets or augment existing datasets for graph-based analyses such as communication segmentation, node classification, anomaly detection, and link prediction. The experiments show that our model outperforms other state-of-the-art models for generating graphs while also being capable of the previously unaddressed task of generating labels for nodes.
6 Acknowledgements
We thank NVIDIA for donating hardware through their GPU Grant Program and Amazon for donating cloud computing resources through their AWS Cloud Credits for Research Program. We appreciate their support of our research.
References
 Airoldi et al. [2008] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed Membership Stochastic Blockmodels. Journal of Machine Learning Research, 9(Sep):1981–2014, 2008.
 Albert and Barabási [2002] Réka Albert and Albert-László Barabási. Statistical Mechanics of Complex Networks. Reviews of Modern Physics, 74(1):47, 2002.
 Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 Bojchevski et al. [2018] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. NetGAN: Generating graphs via random walks. In International Conference on Learning Representations (ICLR), 2018.
 De Cao and Kipf [2018] Nicola De Cao and Thomas Kipf. Molgan: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.
 Dobson and Doig [2003] Paul D Dobson and Andrew J Doig. Distinguishing Enzyme Structures from Nonenzymes Without Alignments. Journal of molecular biology, 330(4):771–783, 2003.
 Erdös and Rényi [1959] P. Erdös and A. Rényi. On Random Graphs I. Publicationes Mathematicae Debrecen, 6:290–297, 1959.
 Fan and Huang [2017] Shuangfei Fan and Bert Huang. Recurrent Collective Classification. Knowledge and Information Systems, pages 1–15, 2017.
 Goldenberg et al. [2010] Anna Goldenberg, Alice X. Zheng, Stephen E. Fienberg, and Edoardo M. Airoldi. A survey of statistical network models. Foundations and Trends® in Machine Learning, 2(2):129–233, 2010.
 Goodfellow [2016] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
 Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 Gretton et al. [2012] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
 Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
 Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
 Kersting et al. [2016] Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.
 Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
 Li et al. [2018] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.
 Mirza and Osindero [2014] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 Nair and Hinton [2010] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814, 2010.
 Odena et al. [2017] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2642–2651. JMLR.org, 2017.
 Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
 Schomburg et al. [2004] Ida Schomburg, Antje Chang, Christian Ebeling, Marion Gremse, Christian Heldt, Gregor Huhn, and Dietmar Schomburg. Brenda, the enzyme database: updates and major new developments. Nucleic Acids Research, 32(suppl_1):D431–D433, 2004.
 Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.
 Wei et al. [2018] Xiang Wei, Boqing Gong, Zixia Liu, Wei Lu, and Liqiang Wang. Improving the improved training of Wasserstein GANs: A consistency term and its dual effect. In International Conference on Learning Representations (ICLR). OpenReview.net, 2018.
 You et al. [2018a] Jiaxuan You, Bowen Liu, Rex Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems, 2018a.
 You et al. [2018b] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep autoregressive models. In International Conference on Machine Learning, pages 5694–5703, 2018b.

 Yu et al. [2017] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 Zhang et al. [2017] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.
 Zheleva et al. [2012] Elena Zheleva, Evimaria Terzi, and Lise Getoor. Privacy in social networks. Synthesis Lectures on Data Mining and Knowledge Discovery, 3(1):1–85, 2012.
Appendix A Different variants of LGGAN
A.1 Alternate GAN frameworks
Since Goodfellow et al. [2014] introduced the GAN framework, many variations have been proposed and shown to be powerful for generation tasks. We therefore adopt three popular variations and compare how well they perform on our task of labeled graph generation. The first is the original GAN. The other two additionally use the classification labels of the graphs themselves. The second follows the conditional GAN framework [Mirza and Osindero, 2014], which feeds the graph label as an extra input to the generator in addition to the noise $z$, so that the label can be used to generate graphs of different types. The third follows the auxiliary classifier GAN (ACGAN) framework [Odena et al., 2017], in which the discriminator not only distinguishes real graphs from fake ones but also incorporates a classifier of the graph labels. The structure of all three variations is illustrated in Figure 5.
Generative adversarial networks
Generative adversarial networks (GANs) [Goodfellow et al., 2014] train implicit generative models by competitively training two neural networks. The first is the generative model $G$, which learns to map from a latent space to the distribution of the target data. The second is the discriminative model $D$, which tries to separate the real data from the candidates produced by the generator. The two networks compete during training through different objective functions, and each adapts based on feedback from the other. The generator and the discriminator can be seen as two players in a minimax game, where the generator tries to produce samples realistic enough to fool the discriminator, and the discriminator tries to correctly tell real samples from generated ones. The standard objective for GAN training is
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{4}$$
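As a small illustration of the value function in the GAN objective, the sketch below estimates $V(D, G)$ from discriminator outputs alone. The function name `gan_value` and the example probabilities are hypothetical and not taken from the paper; the discriminator outputs are assumed to lie strictly between 0 and 1.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1).

    d_real holds D(x) for real samples, d_fake holds D(G(z)) for generated ones.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives V = -2 log 2.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident and correct discriminator pushes V toward 0 from below.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator is trained to increase this quantity, while the generator is trained to decrease it.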
Conditional GAN
After the original GAN was proposed, Mirza and Osindero [2014] introduced a conditional variant, which is constructed by feeding the class label $y$ to both the generator and the discriminator. The model can then generate fake data conditioned on class labels, which gives more control over the generated data. The objective function of the conditional GAN is similar to the original version, with the condition $y$ added to both the generator and the discriminator:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))] \tag{5}$$
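One common way to feed a class label to both networks is to append a one-hot encoding of the label to each network's input. The sketch below assumes this concatenation scheme; the helper names, the vector sizes, and the choice of six classes are illustrative assumptions rather than details from the paper.

```python
import random


def one_hot(label, num_classes):
    """Return a one-hot list encoding of an integer class label."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def with_condition(vector, label, num_classes):
    """Append the one-hot label to an input vector (assumed conditioning scheme)."""
    return list(vector) + one_hot(label, num_classes)


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(4)]  # latent noise vector
generator_input = with_condition(z, label=2, num_classes=6)

sample = [0.0] * 8  # stand-in for a flattened sample (real or generated)
discriminator_input = with_condition(sample, label=2, num_classes=6)
```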
ACGAN
Beyond the conditional GAN, Odena et al. [2017] proposed another variant called ACGAN that extends the idea of the conditional GAN. In the ACGAN framework, each generated sample has a predefined (usually randomly chosen) class label $c$, which is given to the generator $G$ together with the noise $z$. The generator uses both to generate samples $X_{fake} = G(c, z)$. The discriminator outputs probability distributions over both the source of the data and the class labels, $P(S \mid X), P(C \mid X) = D(X)$. The objective function of ACGAN has two parts: the log-likelihood of the correct source, $L_S$, and the log-likelihood of the correct class, $L_C$.
$$L_S = \mathbb{E}[\log P(S = real \mid X_{real})] + \mathbb{E}[\log P(S = fake \mid X_{fake})] \tag{6}$$
$$L_C = \mathbb{E}[\log P(C = c \mid X_{real})] + \mathbb{E}[\log P(C = c \mid X_{fake})] \tag{7}$$
The discriminator is trained to maximize $L_S + L_C$, while the generator is trained to maximize $L_C - L_S$. From the structure, we can see that in ACGAN the procedure of learning a mapping from $z$ to the representation is independent of the class label. There are two main differences between ACGAN and the conditional GAN. First, the conditional GAN feeds the class label to both the generator and the discriminator, whereas ACGAN feeds it only to the generator. Second, in ACGAN the discriminator not only classifies whether a sample comes from the real data, but also has a classifier that outputs the probability distribution over the class labels. These modifications to the standard GAN formulation can produce excellent results and also help to stabilize the training procedure.
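To make the two ACGAN terms concrete, the following sketch computes the source and class log-likelihoods from discriminator outputs, following the formulation of Odena et al. [2017], and combines them into the discriminator and generator objectives. The function name, argument layout, and toy probabilities are assumptions made for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def acgan_terms(src_real, src_fake, cls_real, cls_fake, y_real, y_fake):
    """Return (L_S, L_C) for one batch.

    src_real / src_fake: P(S = real | X) for real and generated samples.
    cls_real / cls_fake: class-probability vectors P(C | X) per sample.
    y_real / y_fake: the correct class index of each sample.
    """
    l_s = mean([math.log(p) for p in src_real]) + mean(
        [math.log(1.0 - p) for p in src_fake]
    )
    l_c = mean([math.log(p[c]) for p, c in zip(cls_real, y_real)]) + mean(
        [math.log(p[c]) for p, c in zip(cls_fake, y_fake)]
    )
    return l_s, l_c


l_s, l_c = acgan_terms(
    src_real=[0.9, 0.8],
    src_fake=[0.2, 0.3],
    cls_real=[[0.7, 0.3], [0.4, 0.6]],
    cls_fake=[[0.6, 0.4], [0.1, 0.9]],
    y_real=[0, 1],
    y_fake=[0, 1],
)
discriminator_objective = l_s + l_c  # maximized by the discriminator
generator_objective = l_c - l_s      # maximized by the generator
```

Both networks benefit from a high class term, which is how label information reaches the generator even though the label is not an input to the discriminator.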
A.2 Different models for the discriminator
For the discriminator, we compare two discriminative models: a simple multilayer perceptron (MLP) and the more advanced model proposed in our paper, a GCN with residual connections. The advanced model is composed of a series of graph convolution layers and a layer aggregation operator that integrates useful information from each layer to learn more powerful graph representations. We refer to the whole framework as LGGAN_s when it uses the simple model and simply as LGGAN when it uses the proposed model.
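The sketch below shows one plausible reading of a GCN with residual connections and layer aggregation, written with plain Python lists. Several choices are assumptions rather than details confirmed by this appendix: the symmetric normalization of Kipf and Welling [2017], a skip connection that adds the layer input to its output, sum pooling over nodes, and concatenation as the layer aggregation operator.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Compute D^-1/2 (A + I) D^-1/2, the normalization of Kipf and Welling."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def residual_gcn_layer(a_norm, h, w):
    """Return ReLU(A_norm H W) + H; W must be square so the shapes match."""
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, v) + skip for v, skip in zip(z_row, h_row)] for z_row, h_row in zip(z, h)]


def graph_representation(adj, features, weights):
    """Run the layers and concatenate a sum-pooled vector from each one."""
    a_norm = normalized_adjacency(adj)
    h = features
    pooled = []
    for w in weights:
        h = residual_gcn_layer(a_norm, h, w)
        pooled.append([sum(col) for col in zip(*h)])  # sum over nodes
    return [v for layer in pooled for v in layer]  # layer aggregation


# Toy path graph with three nodes and one-hot node labels as features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights = [[[0.5, 0.0], [0.0, 0.5]], [[0.5, 0.0], [0.0, 0.5]]]
representation = graph_representation(adj, features, weights)
```

In a discriminator, a vector like `representation` would then be passed to output layers that score the graph; those layers are omitted here.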
We evaluate these six variants of LGGAN (three GAN frameworks for each of LGGAN and LGGAN_s) on different graph-structured data to measure the quality of the learned generative models. We run experiments on two datasets: Cora_small and ENZYMES. The results are listed in Table 3.
Among the three GAN frameworks, LGGAN_ACGAN achieves the best results overall on both datasets, regardless of which discriminative model is used. This result matches our expectations, since the ACGAN framework incorporates the class information, allowing it to learn a better embedding and to propagate that information to the generator.
We also observe that using a GCN with residual connections can improve the quality of the generated graphs regardless of which GAN framework is used. An interesting point is that the gap between LGGAN_s and LGGAN is much larger on the ENZYMES dataset than on the citation network dataset. This difference may arise because the Cora_small dataset is composed of many small two-hop and three-hop ego networks whose structure is quite simple and uniform, so it could be easier to learn. The structure of the ENZYMES dataset, in contrast, is more complicated and diverse. This result therefore suggests that LGGAN_s is unable to generalize to complex data. In contrast, the quality of the graphs generated by LGGAN using a GCN with residual connections is more consistent across datasets, which suggests that LGGAN can adapt to different graph-structured data. Based on these results, we use LGGAN_ACGAN in the remaining experiments and refer to it as LGGAN when comparing with other baselines.
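| GAN framework | Cora_small Degree | Cora_small Clustering | Cora_small Orbit | Cora_small Label | ENZYMES Degree | ENZYMES Clustering | ENZYMES Orbit | ENZYMES Label |
|---|---|---|---|---|---|---|---|---|
| LGGAN_GAN_s | 0.27 | 0.18 | 0.03 | 0.37 | 0.67 | 0.88 | 0.004 | 0.01 |
| LGGAN_GAN | 0.21 | 0.14 | 0.01 | 0.15 | 0.31 | 0.20 | 0.01 | 0.01 |
| LGGAN_CGAN_s | 0.18 | 0.18 | 0.006 | 0.35 | 0.53 | 0.69 | 0.04 | 0.004 |
| LGGAN_CGAN | 0.10 | 0.24 | 0.01 | 0.19 | 0.23 | 0.13 | 0.02 | 0.01 |
| LGGAN_ACGAN_s | 0.14 | 0.009 | 0.06 | 0.13 | 0.51 | 0.29 | 0.03 | 0.01 |
| LGGAN_ACGAN | 0.13 | 0.08 | 0.03 | 0.11 | 0.09 | 0.17 | 0.01 | 0.01 |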
Appendix B Datasets
We perform experiments on different kinds of datasets with varying sizes and characteristics, such as the ENZYMES, PROTEINS, and D&D datasets [Dobson and Doig, 2003, Kersting et al., 2016], and also datasets of citation graphs such as Cora and Citeseer [Sen et al., 2008]. The statistics of these datasets are summarized in Table 4.
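| Graph type | Dataset | # Graphs | # Graph classes | Avg. # nodes | Avg. # edges | # Node labels |
|---|---|---|---|---|---|---|
| Citation graphs | Cora_small | 256 | 7 | 38.7 | 61.6 | 7 |
| Citation graphs | Citeseer_small | 256 | 6 | 44.2 | 82.7 | 6 |
| Citation graphs | Cora_large | 128 | 7 | 175.3 | 326.3 | 7 |
| Citation graphs | Citeseer_large | 128 | 6 | 172.5 | 414.7 | 6 |
| Protein graphs | PROTEINS | 384 | 2 | 28.1 | 53.4 | 3 |
| Protein graphs | ENZYMES | 256 | 6 | 39.4 | 77.7 | 3 |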
Appendix C Training
For training, we adopt several techniques such as feature matching and minibatch discrimination that were shown to encourage convergence and help avoid mode collapse.
Wasserstein GAN
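Our training builds on the Wasserstein GAN (WGAN) framework [Arjovsky et al., 2017], as it helps prevent mode collapse and leads to more stable training. Arjovsky et al. introduced WGANs, which minimize an approximation of the Wasserstein distance between the real data distribution and the distribution of the generated samples. They proposed weight clipping as a way to enforce the 1-Lipschitz constraint and help WGAN converge. In follow-up work, Gulrajani et al. [2017] proposed a better method (WGAN-GP) that uses a gradient penalty as a soft constraint in place of weight clipping. The loss function for the discriminator is therefore modified to
$$L_D = \mathbb{E}_{z \sim p_z(z)}[D(G(z))] - \mathbb{E}_{x \sim p_{data}(x)}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\right] \tag{8}$$
where $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated samples, and $\lambda$ is the penalty coefficient.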
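The gradient-penalty loss of Gulrajani et al. [2017] can be sketched without a deep learning library by approximating the critic's gradient with central finite differences. This is only an illustration: real training would use automatic differentiation, and the function names, the toy linear critic, and the value `lam=10.0` are assumptions here.

```python
import math
import random


def numerical_gradient(critic, x, eps=1e-5):
    """Approximate the gradient of a scalar critic at x by central differences."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((critic(up) - critic(down)) / (2 * eps))
    return grad


def wgan_gp_critic_loss(critic, real, fake, lam=10.0, rng=random):
    """E[D(fake)] - E[D(real)] + lam * E[(||grad D(x_hat)||_2 - 1)^2]."""
    n = len(real)
    wasserstein = sum(critic(f) for f in fake) / n - sum(critic(r) for r in real) / n
    penalty = 0.0
    for r, f in zip(real, fake):
        t = rng.random()  # uniform interpolation weight
        x_hat = [t * a + (1.0 - t) * b for a, b in zip(r, f)]
        grad = numerical_gradient(critic, x_hat)
        norm = math.sqrt(sum(g * g for g in grad))
        penalty += (norm - 1.0) ** 2
    return wasserstein + lam * penalty / n


def linear_critic(x):
    """Toy critic whose gradient norm is 2 everywhere."""
    return 2.0 * x[0]


loss = wgan_gp_critic_loss(
    linear_critic, real=[[1.0, 0.0]], fake=[[0.0, 0.0]], rng=random.Random(0)
)
```

For this toy critic the Wasserstein part is -2 and the penalty is (2 - 1)^2 = 1, so the loss is close to 8: the penalty dominates because the critic violates the unit-gradient-norm target.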
CTGAN
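CTGAN [Wei et al., 2018] is a recently proposed model based on WGAN-GP. It improves the WGAN-GP approach by adding a soft consistency term to enforce the Lipschitz constraint. We train our model based on CTGAN, since it has been shown to further stabilize the training procedure. The objective function for CTGAN is
$$L = \mathbb{E}_{z \sim p_z(z)}[D(G(z))] - \mathbb{E}_{x \sim p_{data}(x)}[D(x)] + \lambda_1\, GP|_{\hat{x}} + \lambda_2\, CT|_{x', x''} \tag{9}$$
where $GP|_{\hat{x}} = \mathbb{E}_{\hat{x}}[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2]$ is the gradient penalty of WGAN-GP and $CT|_{x', x''}$ is the consistency term of Wei et al. [2018], evaluated on two perturbed versions $x'$ and $x''$ of a real sample.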
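As a rough sketch of the consistency term, the code below follows our reading of Wei et al. [2018]: the discriminator is applied twice to each real sample under different random perturbations (such as dropout), and differences in the final outputs and in the second-to-last layer features are penalized above a margin. The function names, the feature weight of 0.1, and the default margin are assumptions for illustration.

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def consistency_term(out_a, out_b, feat_a, feat_b, margin=0.0, feat_weight=0.1):
    """Average of max(0, |D(x') - D(x'')| + w * ||f(x') - f(x'')|| - margin).

    out_a / out_b: scalar discriminator outputs from two perturbed passes.
    feat_a / feat_b: second-to-last layer features from the same passes.
    """
    terms = []
    for oa, ob, fa, fb in zip(out_a, out_b, feat_a, feat_b):
        gap = abs(oa - ob) + feat_weight * l2_distance(fa, fb) - margin
        terms.append(max(0.0, gap))
    return sum(terms) / len(terms)


# Identical passes incur no penalty; diverging passes do.
no_penalty = consistency_term([1.0], [1.0], [[0.0, 0.0]], [[0.0, 0.0]])
penalty = consistency_term([1.0], [0.0], [[3.0, 4.0]], [[0.0, 0.0]])
```

In the second call the outputs differ by 1 and the features by a distance of 5, so the term is 1 + 0.1 * 5 = 1.5.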
Feature matching
To stabilize the training of GANs, Salimans et al. [2016] proposed another technique, feature matching, which prevents the generator from overtraining on the current discriminator. It specifies a new objective function for the generator:
$$\left\lVert \mathbb{E}_{x \sim p_{data}(x)} f(x) - \mathbb{E}_{z \sim p_z(z)} f(G(z)) \right\rVert_2^2 \tag{10}$$
where $f(x)$ denotes the activations of an intermediate layer of the discriminator. Instead of directly maximizing the output of the discriminator, the generator is trained to match the expected value of the features on an intermediate layer of the discriminator. The discriminator is thus trained to find the features that best discriminate real samples from samples produced by the generative model.
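The feature matching objective reduces to a squared distance between two mean feature vectors. A minimal sketch, assuming the intermediate-layer activations have already been extracted as plain lists (the function name and toy values are hypothetical):

```python
def feature_matching_loss(real_features, fake_features):
    """Squared L2 distance between mean activations of real and generated batches."""
    dim = len(real_features[0])
    mean_real = [sum(f[i] for f in real_features) / len(real_features) for i in range(dim)]
    mean_fake = [sum(f[i] for f in fake_features) / len(fake_features) for i in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_real, mean_fake))


real_batch = [[1.0, 2.0], [3.0, 4.0]]  # mean activation is [2.0, 3.0]
matched = feature_matching_loss(real_batch, [[2.0, 3.0], [2.0, 3.0]])
mismatched = feature_matching_loss(real_batch, [[0.0, 0.0]])
```

Only the batch means are compared, so the generator is rewarded for matching feature statistics rather than for fooling the discriminator on individual samples.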
Appendix D Evaluating Labeled Graphs
To better evaluate the structure of the generated labeled graphs, we also calculate the MMD of the three graph statistics for the subgraphs centered around each node class, and take the average MMD value across all classes. Since MMSB is the only existing method that can directly generate labels, we compare LGGAN to it using the ENZYMES dataset. The results are listed in Table 5. LGGAN not only learns a good distribution of the labels, but also learns the structure within each class much more reliably than the MMSB model.
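| Model | Graph: Degree | Graph: Clustering | Graph: Orbit | Graph: Label | Subgraph: Avg. D | Subgraph: Avg. C | Subgraph: Avg. O |
|---|---|---|---|---|---|---|---|
| MMSB | 0.55 | 1.08 | 0.05 | 0.92 | 0.14 | 0.20 | 0.03 |
| LGGAN | 0.09 | 0.17 | 0.03 | 0.01 | 0.13 | 0.15 | 0.01 |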
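For readers unfamiliar with MMD on graph statistics, the sketch below computes a squared MMD between two sets of degree histograms. It uses a plain Gaussian kernel and the simple biased estimator, which are assumptions for illustration; the kernel used for the reported numbers may differ, and the helper names and toy graphs are hypothetical.

```python
import math


def gaussian_kernel(p, q, sigma=1.0):
    """Gaussian kernel between two histograms of equal length."""
    sq = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-sq / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of squared MMD between two sets of histograms."""
    def avg(a, b):
        return sum(kernel(p, q) for p in a for q in b) / (len(a) * len(b))
    return avg(xs, xs) + avg(ys, ys) - 2.0 * avg(xs, ys)


def degree_histogram(adj, max_degree):
    """Normalized degree histogram of a graph given as a 0/1 adjacency matrix."""
    hist = [0.0] * (max_degree + 1)
    for row in adj:
        hist[min(int(sum(row)), max_degree)] += 1.0
    total = sum(hist)
    return [h / total for h in hist]


path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
real_set = [degree_histogram(path, 3)]
same = mmd_squared(real_set, [degree_histogram(path, 3)])
different = mmd_squared(real_set, [degree_histogram(triangle, 3)])
```

Identical sets give a value of zero, and the value grows as the degree distributions of the generated graphs drift away from those of the real graphs, which is why lower numbers in the tables indicate better models.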