GraphTune: A Learning-based Graph Generative Model with Tunable Structural Features

Generative models for graphs have been actively studied for decades, and they have a wide range of applications. Recently, learning-based graph generation that reproduces real-world graphs has gradually attracted the attention of many researchers. Several generative models that utilize modern machine learning technologies have been proposed, though conditional generation of general graphs has been less explored in the field. In this paper, we propose a generative model that allows us to tune the value of a global-level structural feature as a condition. Our model, called GraphTune, enables tuning of the value of any structural feature of generated graphs using Long Short Term Memory (LSTM) and a Conditional Variational AutoEncoder (CVAE). We performed comparative evaluations of GraphTune and conventional models on a real graph dataset. The evaluations show that GraphTune tunes the value of a global-level structural feature more clearly than the conventional models.


I Introduction

Generative models for graphs have a wide range of applications, including communication networks, social networks, transportation systems, databases, cheminformatics, and epidemics. Repeated simulation on graphs is a basic approach to discovering information in the above fields of study. However, researchers and practitioners do not always have access to enough real graph data. Generative models of graphs can supplement a graph dataset that does not include a sufficient number of real graphs. Moreover, generating graphs that are not included in a dataset or future graphs can be used to discover novel synthesizable molecules [2, 3, 4] or to predict the growth of a network [5].

Classically, stochastic models that generate graphs with a pre-defined probability of edges and nodes have been studied, and they focus on only a single-aspect feature of graphs. Various models [6] have been proposed in the literature, including the Erdős-Rényi model [7], the Watts-Strogatz model [8], and the Barabási-Albert (BA) model [9]. These stochastic models accurately reproduce a specific target structural feature (e.g., randomness [7], small worldness [8], scale-free features [9], and clustered nodes [10]). However, they cannot be adapted to real graph data with numerous features, and they cannot guarantee that the generated graphs reproduce all features other than the target feature.

Generative models for graphs using machine learning technology learn features from graph data and try to reproduce those features in every single aspect according to the data [11, 12, 5, 13, 14, 15, 16, 17, 18, 19, 20, 6]. For the last several years, learning-based graph generation has been attracting the attention of many researchers, and several approaches have been tried in recent studies. Although many models have been proposed to generate small graphs with the aim of designing molecules [3, 14, 4, 16, 18], some recent studies [5, 17] enable the generation of relatively large graphs, including citation graphs and social networks. In particular, the sequence data-based approach, which converts a graph into sequential data and learns the sequential data with recurrent neural networks, has been successful in this field [11, 12, 5, 17]. These studies reproduce various features that reflect the global structures of graphs, including the average shortest path length, the clustering coefficient, and the power-law exponent of the degree distribution.

Although the existing generative models for graphs using machine learning technology can generate graphs similar to real-world graphs, most of them cannot generate graphs that have user-specified structural features. Despite common demand for conditional generation of graphs with specific features, it has been less explored [19, 20].

Despite the fact that some works aim for conditional generation of graphs with a specific feature, their applicability and performance are not sufficient to generate general graphs with a specific value of a structural feature. Several models [14, 16, 18] that enable conditional generation utilize domain-specific knowledge of molecular chemistry and are not suitable for graphs of other domains. DeepGMG [12], one of the pioneering studies of conditional graph generation, does not explicitly assume domain-specific knowledge and can generate graphs according to a specified condition. However, the work only evaluates generation conditioned on the number of atoms (nodes), bonds (edges), or aromatic rings (hexagons) in a molecule. DeepGMG does not have the ability to tune global-level structural features (e.g., average shortest path length, clustering coefficient, and the power-law exponent of the degree distribution), which are difficult to tune by adding or removing a local structure of a graph such as a node, edge, or hexagon. Although attempts have been made to train DeepGMG with graphs generated by the BA model, they have succeeded in unconditional generation of only very small graphs with 15 nodes [12]. CondGen [15], whose applicable domain is not limited to molecular chemistry, has achieved conditional generation of general graphs, including citation graphs. CondGen can reproduce global-level structural features (the average shortest path length and the Gini index are evaluated in the paper [15]), and it improves generation by inputting labels as conditions, using datasets grouped by label. Unfortunately, it does not provide a way to continuously tune a feature, since it requires training datasets grouped by labels, and it does not have sufficient performance to tune features flexibly based on conditions given as continuous values of a global-level feature (we discuss the performance of CondGen in Section VI).

In this paper, we propose GraphTune, a graph generative model that makes it possible to tune the value of any structural feature of a generated graph using Long Short Term Memory (LSTM) [21] and a Conditional Variational AutoEncoder (CVAE) [22]. GraphTune adopts a sequence data-based approach for learning graph structures, and graphs are converted to sequence data by using the Depth-First Search (DFS) code that achieved success in GraphGen [17]. Unlike GraphGen, GraphTune is a CVAE-based model, and the CVAE in the model is composed of an LSTM-based encoder and decoder. GraphTune uses the CVAE to generate graphs with a specific feature, including global-level structural features, and the feature can be continuously tuned. Meanwhile, features other than the specified feature are accurately reproduced in every single aspect according to the learned dataset.

In summary, the main contributions of this paper are as follows:

  • We propose a novel learning-based graph generative model called GraphTune with tunable structural features. In GraphTune, flexible generation of graphs with any feature, including global-level structural features (e.g., average shortest path length and clustering coefficient), can be achieved by giving the value of a feature as a condition vector to the CVAE-based architecture.

  • We achieve elaborate reproduction of a graph dataset in every single aspect by adopting a sequence data-based approach for learning. GraphTune tunes the value of a specific feature to a specified value while keeping values of other features within the range of values that exist in the dataset.

  • We perform empirical evaluations of GraphTune on a real graph dataset. The evaluation results establish that the tunability and reproducibility of graphs in GraphTune outperform those of conventional conditional and unconditional graph generative models.

The rest of this paper is structured as follows. In Section II, we summarize related works in the field of graph generative models. Section III formulates the generation problem of a graph with a specified feature. In Section IV, we introduce the DFS code that we adopt as a method for converting graphs into sequence data in our model. We explain the model architecture and the training and generation algorithms of GraphTune in Section V. Section VI shows the empirical evaluations of GraphTune and conventional models. Section VII discusses the limitations of the paper and future research directions. Finally, Section VIII concludes the paper.

II Related Works

Graph generation has a long history, and the literature is rich with the results of many researchers. One of the most rudimentary models, the Erdős-Rényi model, generates simple random graphs and was proposed in 1959. Around 2000, two models [8, 9] that reproduce the structural features of graphs known as small-world networks and power-law degree distributions attracted the attention of researchers. Since these studies were reported, various statistical graph generation methods inspired by them have been proposed [23, 24, 25, 6]. Many structural features of graphs have also been quantified in this line of research. What these traditional statistical generation models have in common is that each model focuses on one (or a few) of the many features and aims to reproduce it (e.g., small worldness, power-law degree distribution, or local clustering). These models cannot be adapted to real graph datasets that have numerous features, and cannot guarantee that the generated graphs reproduce every single aspect of the real data.

A recent trend in the field of graph generation is learning-based models that reproduce real-world graphs. Learning-based models have evolved rapidly over the last few years, and have attracted the attention of researchers. Learning-based models have been proposed for a wide range of domains ranging from discovering new molecular structures to modeling social networks, and most recently, survey papers have been published [19, 20, 6]. Various learning-based models have been proposed, including adjacency-based and edge-list-based approaches. The sequence data-based approach that converts a graph into sequential data has been particularly successful in this field [12, 11, 5, 17].

Although several learning-based models for graphs have been proposed, conditional generation of graphs is less explored [19, 20]. The few models that achieve conditional generation of graphs include domain-specific models in the fields of molecular chemistry [3, 4], evolutionary developmental biology [26], and natural language processing [27, 28]. Unfortunately, these models cannot easily be applied to general graphs because they rely on domain-specific knowledge. DeepGMG [12] and CondGen [15] have been proposed as models applicable to general graphs. DeepGMG adds a condition to the latent vector during the decoding process of the graph to tune local structures in the graph, such as the number of nodes, edges, or hexagons. CondGen allows tuning of global-level structural features (including at least the average shortest path length and the Gini index) with graph variational generative adversarial nets. TSGG-GAN also provides conditional generation of graphs, but it focuses on time-series-conditioned generation, and its challenges are different from ours. In TSGG-GAN, multivariate time series data are input to the model as node expression values, and graphs conditioned by the time series are generated.

III Problem Formulation

The graphs treated in this paper are undirected connected graphs without self-loops. As a notational convention, a graph is represented by G = (V, E), where V and E denote the set of nodes and the set of edges, respectively. We let 𝒢 denote the universal set of graphs.

We consider a mapping f from graphs to feature vectors, and a graph G is mapped to a feature vector c = f(G) as shown in Fig. 1. The k-th element of the vector c expresses a feature of graph G, and every feature is represented by a real number. We assume that the feature vector c for graph G contains values for all the features that can be calculated from G. For example, the elements of c represent values of features such as the number of nodes and edges, average path length, average degree, edge density, modularity, clustering coefficient, power-law exponent of the degree distribution, and largest component size.

Fig. 1: Problem formulation. A graph G is mapped to a feature vector c = f(G) by a mapping f. The elements of the feature vector represent values of all sorts of features calculated from G. This paper tackles the inference problem of finding a mapping that approximates the inverse mapping f⁻¹, using a subset of the universal set of graphs. Such a mapping can generate a graph by taking as input a feature vector in which any element is replaced by an arbitrary value.

In this paper, we formulate the problem of generating a graph with specified features as inferring the mapping from feature vectors of graphs to graphs (see Fig. 1). We define the inverse mapping f⁻¹ of the mapping f, and graph G can be obtained by calculating G = f⁻¹(c) with the inverse mapping. We tackle the inference problem of finding a mapping that approximates f⁻¹ by using a subset of 𝒢. By solving this problem, it becomes possible to generate a graph from a feature vector in which any element is replaced by an arbitrary value. Since we cannot use the universal set of graphs, the inference needs to be achieved with a subset of 𝒢 that is contained in an accessible dataset.

IV DFS Code

GraphGen [17] is a successful model for unconditional learning-based graph generation that utilizes the DFS code. The key idea is to use the DFS code to convert graphs into sequence data. The converted sequence data are learned by an LSTM in the training process. The compact expression of a graph as sequence data with DFS code allows GraphGen to accurately generate graphs that are similar to the graphs in a dataset. While we propose a novel CVAE-based generative model for graphs that differs from GraphGen, our model follows the sequence data-based approach by using DFS code in the preprocessing for training. In this section, we summarize the conversion process by DFS code that is common to GraphGen and our model.

DFS code converts a graph to a unique sequence of edges that retains the structural features of the graph. It is well known that the trajectory (i.e., a sequence) of a walk on a graph reflects features of the graph, including the degree distribution [29]. DFS code converts a graph to a compact sequence whose length equals the number of edges by using a depth-first search that prevents edges from being revisited.

In the conversion algorithm of DFS code, timestamps starting from 0 are first added to all nodes by performing the depth-first search. That is, all nodes are discovered and assigned timestamps in the order of the depth-first search. The traversal of the search is represented as a sequence of edges called a DFS traversal. In the example graph shown in Fig. 2, the DFS traversal and the node timestamps are obtained in this order of discovery. Edges contained in a DFS traversal are called forward edges, and the other edges are called backward edges. By adding a timestamp to every node, an edge (u, v) can be annotated as a 5-tuple (t_u, t_v, L_u, L_(u,v), L_v), where t_u denotes the timestamp of node u, and L_u and L_(u,v) denote the labels of node u and edge (u, v), respectively. Although the graphs treated in this paper have unlabeled nodes and edges, the original DFS code is designed for labeled graphs. A detailed treatment of L_u and L_(u,v) in our model is explained in Section V-A.

Fig. 2: Sequence converted from an example graph. All nodes are assigned timestamps in order of depth-first search, thereby obtaining a DFS traversal. The edges contained in the DFS traversal are called forward edges, and the other edges are called backward edges. A backward edge is placed between the surrounding forward edges, thereby obtaining a sequence that contains all edges.

Based on the order of node timestamps, DFS code constructs a sequence that contains all edges in a graph. Although the forward edges already form a sequence (i.e., a DFS traversal), the backward edges are not included in that sequence. To construct a sequence that contains all edges in the graph, a backward edge starting from a node is placed between the forward edge that reaches the node and the next forward edge. If there are multiple backward edges starting from the same node, the timestamps of their destination nodes are compared, and the edge with the smaller timestamp is placed in front. By performing this procedure, all backward edges are placed between forward edges, and a sequence that contains all edges is obtained, as in the example shown in Fig. 2. Although sequences constructed by the above procedure are not necessarily unique, the lexicographically smallest sequence based upon the lexicographical ordering of [30] is chosen as the unique one. As a result, the graph is represented as a unique sequence of 5-tuples by DFS code.
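To make the conversion concrete, the following Python sketch (using networkx) builds a DFS-code-style sequence of 5-tuples from an unlabeled graph, with node labels set to degrees and edge labels set to 0 as described later in Section V-A. It fixes the start node and the neighbor visiting order and omits the lexicographic minimization over all DFS traversals that makes the code canonical, so it illustrates the edge-ordering rules rather than reproducing GraphGen's exact implementation.

```python
import networkx as nx

def dfs_code(G: nx.Graph, start) -> list:
    """Convert G to a sequence of 5-tuples (t_u, t_v, L_u, L_e, L_v) by a DFS
    from `start`. Node labels are degrees and edge labels are 0 (Section V-A).
    The lexicographic minimisation over all traversals is omitted."""
    timestamp = {start: 0}   # node -> DFS discovery time
    sequence = []            # emitted 5-tuples
    seen = set()             # edges already placed in the sequence

    def emit(u, v):
        seen.add(frozenset((u, v)))
        sequence.append((timestamp[u], timestamp[v], G.degree(u), 0, G.degree(v)))

    def visit(u, parent=None):
        # backward edges from u come first, smaller destination timestamp first
        back = [v for v in G.neighbors(u)
                if v in timestamp and v != parent and frozenset((u, v)) not in seen]
        for v in sorted(back, key=timestamp.get):
            emit(u, v)
        # forward edges extend the DFS traversal
        for v in sorted(G.neighbors(u)):
            if v not in timestamp:
                timestamp[v] = len(timestamp)
                emit(u, v)
                visit(v, parent=u)

    visit(start)
    return sequence

# For a 4-cycle this yields three forward edges followed by one backward edge.
print(dfs_code(nx.cycle_graph(4), start=0))
```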

V GraphTune: A Graph Generative Model with Tunable Structural Features

We propose GraphTune – a generative model for graphs that is able to tune a specific structural feature using DFS code and a CVAE. GraphTune is composed of a CVAE with an LSTM-based encoder and decoder. Graphs are converted to sequence data by the DFS code, and the sequence data are input to the LSTM. This section provides a detailed description of the graph generation approach of GraphTune.

V-A Sequence Data Converted from Graphs

Like GraphGen, GraphTune learns a sequence dataset that is converted from a graph dataset using a conversion based on DFS code. Although the conversion is basically the same as the conversion by DFS code described in Section IV, it is modified as follows to adapt it to our problem. As mentioned above, although the original DFS code is designed for labeled graphs, we assume unlabeled graphs in Section III. In the sequence dataset learned by GraphTune, the node label L_u and the edge label L_(u,v) in a 5-tuple are set to the degree of node u and to 0, respectively. According to the DFS code procedure described in Section IV, a graph is converted to a sequence of 5-tuples. At the end of each sequence, an End Of Sequence (EOS) token is appended. Sequences with EOS allow us to learn graphs of any size. We obtain the final sequence s by further converting the sequence of 5-tuples with component-wise one-hot encoding, and the sequence s with elements s_i is input into the model of GraphTune (see below for the model architecture).
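The component-wise one-hot encoding and the EOS token can be sketched as follows; the exact vocabulary sizes and the numeric representation of EOS are our own assumptions, since the paper only states that an EOS symbol is appended to each sequence.

```python
import numpy as np

def one_hot(index: int, size: int) -> np.ndarray:
    v = np.zeros(size)
    v[index] = 1.0
    return v

def encode_sequence(tuples, max_timestamp, max_degree):
    """Turn a DFS-code sequence of 5-tuples (t_u, t_v, deg_u, 0, deg_v) into
    one-hot vectors, appending an EOS token. The EOS token reuses the last
    index of each component vocabulary (an assumption for this sketch)."""
    t_size = max_timestamp + 2      # timestamps 0..max plus an EOS slot
    d_size = max_degree + 2         # degrees 0..max plus an EOS slot
    e_size = 2                      # edge label 0 plus an EOS slot
    eos = (t_size - 1, t_size - 1, d_size - 1, e_size - 1, d_size - 1)

    encoded = []
    for (tu, tv, lu, le, lv) in list(tuples) + [eos]:
        encoded.append(np.concatenate([
            one_hot(tu, t_size), one_hot(tv, t_size),
            one_hot(lu, d_size), one_hot(le, e_size), one_hot(lv, d_size),
        ]))
    return np.stack(encoded)        # shape: (sequence length + 1, total width)
```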

V-B Condition Vectors Corresponding to Graphs

Along with the sequence data produced by the DFS code, we input condition vectors expressing structural features of the graphs in the dataset into the GraphTune model for learning. The elements of the condition vector c represent values of structural features of a graph in the graph dataset and correspond to elements of the feature vector in Fig. 1. The elements of the condition vector are calculated from the graph by a statistical process. The condition vector specifies the structural features we focus on, and we can choose any structural feature as its elements. Structural features that can be specified in a condition vector are not limited to features regarding the local structure of graphs, such as the number of nodes and edges; global-level structural features (including the average shortest path length, the clustering coefficient, and the power-law exponent of the degree distribution) can also be specified. For example, if we want to tune the model by focusing on the clustering coefficient of the graphs, then we calculate the clustering coefficient of each graph and construct a one-element condition vector for each graph.
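As a sketch of this preprocessing step, the snippet below computes a one-element condition vector per graph with networkx; the feature names and the helper condition_vector are illustrative, not part of the paper.

```python
import networkx as nx
import numpy as np

def condition_vector(G: nx.Graph, features=("clustering_coefficient",)) -> np.ndarray:
    """Compute the condition vector for one graph. A single feature per trained
    model is used in the paper's experiments; values are rounded to one decimal
    place as described in Section VI-C."""
    calculators = {
        "average_shortest_path_length": nx.average_shortest_path_length,
        "average_degree": lambda g: 2 * g.number_of_edges() / g.number_of_nodes(),
        "clustering_coefficient": nx.average_clustering,
    }
    return np.array([round(calculators[f](G), 1) for f in features])

# Condition vectors for a small synthetic dataset of graphs
dataset = [nx.connected_watts_strogatz_graph(50, 4, 0.1, seed=s) for s in range(5)]
conditions = [condition_vector(G) for G in dataset]
```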

V-C Model Architecture

Fig. 3: Proposed model composed of a CVAE with an LSTM-based encoder and decoder. A graph in the graph dataset is converted to a sequence by using DFS code. The sequence is processed by the LSTM-based encoder. The decoder generates a sequence of 5-tuples, and the sequence is converted into a generated graph. The condition vector is input to both the encoder and the decoder. See the equations in Section V-C for the detailed process of the model.

The proposed model is composed of a CVAE with an LSTM-based encoder and decoder (see Fig. 3). A graph dataset is converted to a sequence dataset in the manner explained in Section IV. Along with this, the set of condition vectors is calculated from the graph dataset. The proposed model is trained with the sequence dataset and the condition vector set. A sequence s and its condition vector c are input into the LSTM-based encoder, and the encoder finds a latent state distribution of latent vectors. A latent vector z is randomly sampled from the distribution, and the latent vector concatenated with the condition vector is input into the LSTM-based decoder. The decoder tries to reproduce the sequence s. The details of the encoder and the decoder are described below.

Encoder

The encoder learns sequences and maps them to a latent vector according to the features of graphs. To treat sequence data, we employ a stacked LSTM as the encoder. To encode a graph into the latent space, the i-th element s_i of the sequence and the condition vector c are vertically concatenated into a single vector, and the vector is embedded with a single fully connected layer E_enc. The embedded vector x_i is then fed into each LSTM block LSTM_enc. The initial hidden state vector h_0 is initialized as a zero vector. The stacked LSTM with the embedding layer processes a sequence of length L by recursively applying the LSTM block to the hidden state vector h_i. The output h_L of the last LSTM block is fed to two functions f_μ and f_σ implemented by single fully connected layers. As usual in VAE, the latent state distribution is enforced to be a multivariate Gaussian distribution. A latent vector z is then sampled from the latent state distribution N(μ, diag(σ²)), where μ = f_μ(h_L) and σ = f_σ(h_L). Summarizing the above, the process of the encoder part of our model is as follows:

x_i = E_enc([s_i ; c])   (1)
h_{i+1} = LSTM_enc(x_i, h_i),  h_0 = 0   (2)
μ = f_μ(h_L)   (3)
σ = f_σ(h_L)   (4)
z ∼ N(μ, diag(σ²))   (5)
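A minimal PyTorch sketch of the encoder side corresponding to Eqs. (1)-(5) is shown below. PyTorch itself, the class name, and the way the dimensions of Section VI-B map onto layers are assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """LSTM-based CVAE encoder: a sketch of Eqs. (1)-(5). The 2-layer LSTM,
    hidden size 223, embedding size 227, and latent size 10 follow Section
    VI-B; the exact mapping of those numbers to layers is our reading."""
    def __init__(self, tuple_dim, cond_dim, emb_dim=227, hidden_dim=223,
                 latent_dim=10, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(tuple_dim + cond_dim, emb_dim)   # E_enc
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)          # f_mu
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)      # f_sigma (log-variance)

    def forward(self, seq, cond):
        # seq: (batch, length, tuple_dim), cond: (batch, cond_dim)
        cond_rep = cond.unsqueeze(1).expand(-1, seq.size(1), -1)
        x = self.embed(torch.cat([seq, cond_rep], dim=-1))       # per-step embedding
        _, (h, _) = self.lstm(x)                                  # run the stacked LSTM
        h_last = h[-1]                                            # output of the last block
        mu, logvar = self.to_mu(h_last), self.to_logvar(h_last)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        return z, mu, logvar
```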

Decoder

The decoder learns to map a subsequence of a sequence, the condition vector c, and the latent vector z to the next element of the sequence. The decoder is also modeled by a stacked LSTM. Like the encoder, the i-th element s_i of the sequence is embedded into a vector y_i by a single fully connected layer E_dec and is processed by the stacked LSTM. Unlike the encoder, however, the embedded vector is concatenated with the sampled latent vector z and the condition vector c before being input into the stacked LSTM. The concatenated vector is fed into the LSTM block LSTM_dec, and the sequence is processed by recursively applying the LSTM block to the hidden state vector g_i. The first step is obtained by replacing the input element and the hidden state with a Start Of Sequence (SOS) token and g_0, respectively, where g_0 is converted by the function f_init from the vector in which the latent vector z and the condition vector c are concatenated together; f_init is implemented by a fully connected layer. The output vector g_{i+1} of each LSTM block is fed to five functions f_tu, f_tv, f_Lu, f_Le, and f_Lv, each implemented by a fully connected layer. The five vectors output by these functions are respectively converted to the probability distributions p_i^{tu}, p_i^{tv}, p_i^{Lu}, p_i^{Le}, and p_i^{Lv} through a softmax function, and these distributions predict the one-hot vectors of the 5-tuples in the predicted sequence ŝ. In the learning process, the stacked LSTM and the fully connected layers are trained to predict the next element s_{i+1} from the concatenated vector (see Section V-D for details). Summarizing the above, the process of the decoder part of our model is as follows:

y_i = E_dec(s_i)   (6)
g_0 = f_init([z ; c])   (7)
g_{i+1} = LSTM_dec([y_i ; z ; c], g_i)   (8)
p_i^{tu} = softmax(f_tu(g_{i+1}))   (9)
p_i^{tv} = softmax(f_tv(g_{i+1}))   (10)
p_i^{Lu} = softmax(f_Lu(g_{i+1}))   (11)
p_i^{Le} = softmax(f_Le(g_{i+1}))   (12)
p_i^{Lv} = softmax(f_Lv(g_{i+1}))   (13)
ŝ_{i+1} = [p_i^{tu} ; p_i^{tv} ; p_i^{Lu} ; p_i^{Le} ; p_i^{Lv}]   (14)
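A matching PyTorch sketch of the decoder corresponding to Eqs. (6)-(14) follows; head_dims, which lists the vocabulary size of each tuple component, is a hypothetical parameter, and the three-layer LSTM with hidden size 250 follows our reading of Section VI-B.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """LSTM-based CVAE decoder: a sketch of Eqs. (6)-(14)."""
    def __init__(self, tuple_dim, cond_dim, head_dims, emb_dim=250,
                 hidden_dim=250, latent_dim=10, num_layers=3):
        super().__init__()
        self.embed = nn.Linear(tuple_dim, emb_dim)                       # E_dec
        self.init_hidden = nn.Linear(latent_dim + cond_dim, hidden_dim)  # f_init
        self.lstm = nn.LSTM(emb_dim + latent_dim + cond_dim, hidden_dim,
                            num_layers, batch_first=True)
        # one output head per component of the 5-tuple (t_u, t_v, L_u, L_e, L_v)
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, d) for d in head_dims])
        self.num_layers = num_layers

    def forward(self, seq, z, cond):
        # seq starts with an SOS token; teacher forcing is used during training
        rep = torch.cat([z, cond], dim=-1).unsqueeze(1).expand(-1, seq.size(1), -1)
        x = torch.cat([self.embed(seq), rep], dim=-1)
        h0 = self.init_hidden(torch.cat([z, cond], dim=-1))
        h0 = h0.unsqueeze(0).repeat(self.num_layers, 1, 1)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(x, (h0, c0))
        # per-step probability distributions over each tuple component
        return [torch.softmax(head(out), dim=-1) for head in self.heads]
```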

V-D Training

Using sequence data produced by the DFS code and structural feature vectors calculated by a statistical process, GraphTune infers the functions mentioned in Section V-C. In the training process, we input a sequence s and a condition vector c into the proposed model and obtain a latent vector z and a predicted sequence ŝ.

Following the optimization manner of VAE [22], our model with the encoder and the decoder considers the two components

L_KL = D_KL( N(μ, diag(σ²)) || N(0, I) )   (15)
L_rec = E_{z ∼ N(μ, diag(σ²))} [ −log p(s | z, c) ]   (16)

of the variational lower bound, where D_KL and N(0, I) denote the Kullback-Leibler divergence and the multidimensional standard normal distribution with the same dimension as the latent vector, respectively. Like the normal VAE [22], the first component Eq. (15) regularizes the latent state distribution to be the standard normal distribution, and can be written as

L_KL = −(1/2) Σ_d ( 1 + log σ_d² − μ_d² − σ_d² ).   (17)

The second component Eq. (16) is a reconstruction loss that ensures the predicted sequence is similar to the input sequence from the dataset. For our model, the reconstruction loss is defined for a sequence s and a predicted sequence ŝ by

L_rec = −Σ_i Σ_k s_{i,k} log ŝ_{i,k},   (18)

where s_{i,k} and ŝ_{i,k} represent the k-th component of the i-th element of s and ŝ, respectively.

By uniting the two losses with the idea of β-VAE [31], the proposed model is optimized by gradient descent on the following loss with weight β:

L = L_rec + β L_KL.   (19)

The loss is backpropagated through the model, and we use the reparameterization trick [22] for backpropagation through the Gaussian latent variable.
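The combined objective can be sketched as follows, assuming the β weight multiplies the KL term as in the usual β-VAE formulation; the function name and tensor layouts are our own.

```python
import torch

def graphtune_loss(pred_dists, target_onehots, mu, logvar, beta=3.0):
    """CVAE training loss: a sketch of Eqs. (17)-(19). The reconstruction term
    is a cross-entropy between the predicted per-component distributions and
    the one-hot targets; the KL term regularises the latent distribution
    towards N(0, I)."""
    recon = 0.0
    for dist, target in zip(pred_dists, target_onehots):
        # dist, target: (batch, length, component_size)
        recon = recon - (target * torch.log(dist + 1e-10)).sum()
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # Eq. (17)
    return recon + beta * kl
```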

The detailed algorithm of the training is shown in Algorithm 1. For a given sequence dataset and the condition vector set corresponding to the dataset, Algorithm 1 returns the learned encoder functions (E_enc, LSTM_enc, f_μ, and f_σ) and decoder functions (E_dec, f_init, LSTM_dec, f_tu, f_tv, f_Lu, f_Le, and f_Lv). First, the total loss is initialized (Line 2). The algorithm then iterates over all sequences (Lines 3-23). The encoder recursively calculates the hidden states, and a latent vector is sampled (Lines 4-9). The loss for the regularization of the latent state distribution is added to the total loss (Line 10). The predicted sequence, the initial hidden state, and the initial input element are initialized with an empty vector, f_init([z ; c]), and the SOS token, respectively (Lines 11-13). The output of each LSTM block is converted to probability distributions over the one-hot vectors of the 5-tuples through the functions f_tu, f_tv, f_Lu, f_Le, and f_Lv (Lines 15-17). Next, the distributions are vertically concatenated into a single vector, and these vectors are horizontally concatenated into the predicted sequence (Lines 18-19). The decoder also recursively calculates the hidden state for the prediction of the next 5-tuple (Line 20). The reconstruction loss calculated from the concatenated probability distributions is added to the total loss (Line 22). Lastly, the weights of all functions are updated by back-propagating the total loss (Line 24). The above procedure is iterated until the total loss converges (Line 25).

0:  Graph dataset (converted to a sequence dataset), condition vector set
0:  Learned functions E_enc, LSTM_enc, f_μ, f_σ, E_dec, f_init, LSTM_dec, f_tu, f_tv, f_Lu, f_Le, and f_Lv
1:  repeat
2:     
3:     for  from 1 to  do
4:        
5:        for  from 0 to  do
6:           
7:        end for
8:        ;
9:        
10:        
11:        
12:        
13:        
14:        for  from 0 to  do
15:           for  do
16:              
17:           end for
18:           
19:           
20:           
21:        end for
22:        
23:     end for
24:  Back-propagate and update weights
25:  until stopping criteria
Algorithm 1 Training of GraphTune
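Putting the pieces together, a single training step corresponding to Algorithm 1 might look like the sketch below, reusing the Encoder, Decoder, and graphtune_loss sketches above; the SOS handling and teacher-forcing details are simplified assumptions.

```python
import torch

def train_step(encoder, decoder, optimizer, seq, cond, sos, beta=3.0):
    """One training step of Algorithm 1 (sketch): encode the sequence and its
    condition, decode with teacher forcing from an SOS-prefixed input, and
    back-propagate the combined loss of Eq. (19)."""
    optimizer.zero_grad()
    z, mu, logvar = encoder(seq, cond)                        # Lines 4-9
    # teacher forcing: shift the target sequence right and prepend SOS
    sos_batch = sos.view(1, 1, -1).expand(seq.size(0), 1, -1)
    dec_input = torch.cat([sos_batch, seq[:, :-1]], dim=1)
    dists = decoder(dec_input, z, cond)                       # Lines 14-21
    # split the one-hot targets into the five tuple components
    sizes = [d.size(-1) for d in dists]
    targets = torch.split(seq, sizes, dim=-1)
    loss = graphtune_loss(dists, targets, mu, logvar, beta)   # Lines 10 and 22
    loss.backward()                                           # Line 24
    optimizer.step()
    return loss.item()
```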

V-E Generation

When we generate graphs with specific structural features, GraphTune recursively generates sequential data in the DFS code format using learned functions. We give a sampled latent vector and a condition vector whose elements are tuned to specific values to the decoder. By giving a condition vector, the decoder recursively generates a sequence of 5-tuples according to the condition vector. Finally, we get a graph with specific structural features by inverse converting from sequence data to a graph.

The entire procedure for the generation of a graph with a specific condition is summarized in Algorithm 2. The input and output of the algorithm are a condition vector with specific values and the sequence data of a graph satisfying the condition, respectively. First, the generated sequence data and an iterator variable are initialized (Lines 1-2). For the generation, a latent vector is sampled from the standard normal distribution (Line 3). The initial hidden state is calculated from the sampled latent vector and the given condition (Lines 4-5). To obtain the next 5-tuple in the predicted sequence, the element-wise distributions of the 5-tuple are calculated, and predicted values of the five elements are sampled from these distributions (Lines 7-10). The predicted values of the elements of the 5-tuple are vertically concatenated into a single vector, and the concatenated vector is horizontally concatenated to the predicted sequence (Lines 11-12). The hidden state is recursively calculated for the prediction of the next 5-tuple, and the iterator variable is updated (Lines 13-14). To stop the generation of the sequence at a finite size, the iteration finishes if at least one element of the predicted 5-tuple is EOS (Line 15). A graph with the specified condition is easily constructed from the sequence at the end of the algorithm.

0:  Condition vector with specific values
0:  Sequence data of a graph with a specific condition
1:  
2:  
3:  
4:  
5:  
6:  repeat
7:     for  do
8:        
9:        
10:     end for
11:     
12:     
13:     
14:     
15:  until 
Algorithm 2 Generation of a graph with a specific condition
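A compact sketch of Algorithm 2 built on the Decoder sketch above is given below; the arguments sos and eos_indices (the per-component EOS indices) are hypothetical, and re-running the decoder on the whole prefix at every step trades efficiency for brevity.

```python
import torch

@torch.no_grad()
def generate(decoder, condition, sos, eos_indices, latent_dim=10, max_len=200):
    """Conditional generation (sketch of Algorithm 2): draw a latent vector
    from N(0, I), fix the user-specified condition vector, and unroll the
    decoder until any component of the predicted tuple is the EOS symbol."""
    z = torch.randn(1, latent_dim)                  # sample from the prior
    cond = condition.unsqueeze(0)                   # (1, cond_dim)
    seq = sos.view(1, 1, -1)                        # (1, 1, tuple_dim), SOS token
    tuples = []
    for _ in range(max_len):
        dists = decoder(seq, z, cond)               # per-component distributions
        sampled = [torch.multinomial(d[0, -1], 1).item() for d in dists]
        if any(s == e for s, e in zip(sampled, eos_indices)):
            break                                   # stop when EOS is predicted
        tuples.append(tuple(sampled))
        # re-encode the sampled tuple as one-hot and append it to the prefix
        onehot = torch.cat([torch.nn.functional.one_hot(
            torch.tensor(s), d.size(-1)).float() for s, d in zip(sampled, dists)])
        seq = torch.cat([seq, onehot.view(1, 1, -1)], dim=1)
    return tuples            # DFS-code 5-tuples of a newly generated graph
```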

VI Experiments

We verify that GraphTune can learn structural features from graph data and generate a graph with specific structural features. In this section, we present performance evaluations of GraphTune on a real graph dataset extracted from a who-follows-whom network of Twitter. Through the evaluations, we show that GraphTune yields better performance than the conventional generative models, namely, GraphGen and CondGen.

VI-A Baselines

To confirm the basic characteristics of GraphTune in a conditional graph generation task, we compare the performance of GraphTune with two baseline models: GraphGen [17] and CondGen [15].

GraphGen

GraphGen employs a scalable approach to domain-agnostic labeled graph generation and is a representative model that adopts a sequence data-based approach. As we mentioned in the introduction, the sequence data-based approach is one of the most successful approaches in the field of learning-based graph generation. GraphGen was compared with DeepGMG [12] and GraphRNN [11] in [17]. It was reported that GraphGen is superior to these methods in terms of the reproduction accuracy of graph structural features. Although GraphGen is an outstanding model, it unfortunately does not provide conditional generation of graphs. Hence, GraphGen is a baseline in terms of the reproduction accuracy of graph structural features, and it does not provide a baseline regarding the ability of conditional generation. In the evaluations in Section VI, we use the parameters recommended in [17].

CondGen

CondGen employs conditional structure generation through graph variational generative adversarial nets and is one of the few models that achieves conditional generation for general graphs, not limited to a specific domain. To the best of our knowledge, CondGen is almost the only model that is oriented towards the reproduction of global-level structural features in human relationship graphs, including social networks and citation networks. CondGen was compared with GraphVAE [13], NetGAN [5], and GraphRNN [11] in [15]. The study [15] reports that CondGen records the best performance in most cases. Since CondGen supports conditional generation of graphs, it provides a baseline regarding the ability of conditional generation. Unlike GraphTune, which specifies a value of a feature after the training process, CondGen requires training datasets grouped by labels. Hence, when we specify another condition, CondGen needs to relearn another dataset grouped by that condition. In the evaluations in Section VI, we use the parameters recommended in the paper [15]. Since the numbers of nodes and edges are required as input parameters for generation, the numbers of nodes and edges of graphs sampled from the dataset with a specific label are used.

VI-B Parameters and Training Dataset

The parameters of the GraphTune model are set as follows. For the encoder part of our model, we use 2-layer LSTM blocks with a hidden state vector of dimension 223, and the dimension of the embedded vector is set to 227. The dimensions of the mean and variance vectors, that is, the dimension of the latent vector z, are set to 10. Three-layer LSTM blocks with a hidden state vector of dimension 250 are adopted for the decoder part, and the corresponding embedding dimensions are set to 250. We use the Adam optimizer to train the neural networks for 10000 epochs with a batch size of 37 and an initial learning rate of 0.001. The weight β for the calculation of the loss is set to 3.0.

To evaluate the performance of GraphTune, we sampled data from the Twitter who-follows-whom graph in the Higgs Twitter Dataset [32]. To prepare a human relationship graph dataset of sufficient size for training and evaluation, we sampled 2000 graphs from the single huge graph included in the Higgs Twitter Dataset, which has 456,626 nodes and 14,855,842 edges. A graph in the dataset for the evaluations is sampled by performing a random walk that starts from a randomly selected node. The initial node of a random walk is selected following a uniform distribution. The edge for the next hop is randomly selected from the edges incident to the current node with equal probability. The random walk ends after 50 nodes have been found, and a graph in the evaluation dataset is the induced subgraph composed of the nodes included in the random walk. Note that an edge can be included in the evaluation dataset if both of its endpoints are included in the random walk, even if the edge itself is not traversed by the random walk. Although the original Higgs Twitter Dataset is a directed graph, we ignore the directions of all edges in this study since our model is designed for undirected graphs. This dataset of small and uniformly sized graphs is suitable for evaluating the basic tunability of global-level features without being affected by the difficulty of reproducing heterogeneous or very large graphs. We split the dataset into two parts: a training set and a validation set, whose sizes are 90% and 10% of the dataset, respectively.
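The sampling procedure can be sketched as follows with networkx; interpreting the next hop as a uniform choice among the current node's incident edges, and the step cap, are our own assumptions.

```python
import random
import networkx as nx

def sample_subgraph(G: nx.Graph, num_nodes: int = 50, seed=None) -> nx.Graph:
    """Sample one training graph (Section VI-B): random-walk from a uniformly
    chosen start node until `num_nodes` distinct nodes are found, then return
    the induced subgraph on those nodes."""
    rng = random.Random(seed)
    current = rng.choice(list(G.nodes))
    visited = {current}
    steps = 0
    while len(visited) < num_nodes:
        current = rng.choice(list(G.neighbors(current)))
        visited.add(current)
        steps += 1
        if steps > 10_000:   # guard against a too-small connected component
            raise RuntimeError("random walk did not reach enough nodes")
    return G.subgraph(visited).copy()

# e.g. 2000 graphs sampled from the (undirected) Higgs Twitter follower graph:
# dataset = [sample_subgraph(higgs_graph, 50, seed=i) for i in range(2000)]
```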

VI-C Structural Features

As the structural features of graphs, we focus on the following five features: average shortest path length, average degree, modularity [33], clustering coefficient, and the power-law exponent of the degree distribution [34]. The value of modularity is calculated for modules consisting of nodes divided by the Louvain algorithm [35]. We calculate the power-law exponent of the degree distribution with the powerlaw Python package [34]. These global-level structural features are selected from survey papers [36, 37] on the measurement of complex network structures, and they have been widely used as graph features of human relationship graphs [38, 39]. Compared with local structures such as the number of nodes and edges, these global-level structural features are difficult to tune by adding or removing a local structure, such as a node, edge, or hexagon, to or from a graph. Needless to say, adding or removing local structures can change the values of global-level structural features. However, to drive the value of such a feature to a specific value on a graph, we must understand the structure of the whole graph and consider the effect of the local structure on the value of the feature. In the creation of the dataset, feature values are rounded to one decimal place.
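The five features can be computed as in the sketch below; we use networkx's Louvain implementation (networkx >= 2.8) and the powerlaw package, while the paper cites the Louvain algorithm and powerlaw without fixing a specific Louvain library.

```python
import networkx as nx
import powerlaw   # the `powerlaw` Python package cited as [34]

def structural_features(G: nx.Graph) -> dict:
    """Compute the five global-level features of Section VI-C. The Louvain
    communities come from networkx's implementation here, which is one
    possible choice; values are rounded to one decimal place when building
    the training conditions."""
    degrees = [d for _, d in G.degree()]
    communities = nx.community.louvain_communities(G, seed=0)
    fit = powerlaw.Fit(degrees, discrete=True)
    return {
        "average_shortest_path_length": nx.average_shortest_path_length(G),
        "average_degree": sum(degrees) / G.number_of_nodes(),
        "modularity": nx.community.modularity(G, communities),
        "clustering_coefficient": nx.average_clustering(G),
        "power_law_exponent": fit.power_law.alpha,
    }
```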

VI-D Performance Evaluations

In this section, we show that GraphTune can generate graphs with specific structural features. Performance comparison among three methods (GraphTune, CondGen, and GraphGen) and detailed analysis of generated graphs are provided.

We trained GraphTune, CondGen, and GraphGen using the training set described in Section VI-B and generated graphs under specific conditions. For GraphGen, a single model was trained with the training set, since GraphGen does not provide conditional generation of graphs. For GraphTune and CondGen, the models were trained individually for each focused feature; that is, we trained five different models, one per feature. After the training process, we generated 300 graphs for each model. For each feature, we picked three typical values from the range of values in the training set as the values of the condition vectors. In the generation process, we give these condition vectors to the GraphTune models. Since CondGen requires training sets grouped by labels, the training set is divided into three groups at the midpoints between the typical values. Note that we cannot give a condition to GraphGen since it is designed for unconditional generation.

The results of generation are summarized in TABLE I. The values in the GraphTune, CondGen, and GraphGen columns represent the average values of the features of the graphs generated by each model. Since GraphGen does not provide conditional generation, the same value is listed for different conditions. In terms of the tunability of a condition, a method has better performance if the average value is closer to the value given as the condition. The best performance achieved under each condition for a particular feature is emphasized in bold font.

According to the results in TABLE I, we can confirm that the graphs generated by GraphTune have high reproduction accuracy and clearly change depending on the conditions. GraphGen generally reproduces the real data well, and GraphTune, which adopts sequence-based generation like GraphGen, has similar performance. For average degree and clustering coefficient, the information of the condition vector works effectively, and GraphTune has better reproduction accuracy than GraphGen. In terms of tunability, GraphTune achieves the best performance for most of the features. While GraphTune can accurately tune the average shortest path length, the value for CondGen cannot be calculated since all of its generated graphs are unconnected (we write "–" for unconnected graphs). Since we explicitly give the numbers of nodes and edges of the graphs in the training dataset, CondGen can accurately tune the value of the average degree. GraphTune is also quite accurate even though such information is not given. With regard to modularity and the clustering coefficient, GraphTune outperforms CondGen in terms of both the reproduction accuracy of real data and tunability. For highly complex statistics such as the power-law exponent, the value of the feature is somewhat tunable, but the result is a little unstable.

Global-level structural feature                | Condition | GraphTune | CondGen | GraphGen | Real data (25-percentile / median / 75-percentile)
Average shortest path length                   | 3.0       | 3.05      | –       | 4.59     | 4.26 (3.40 / 4.09 / 4.84)
                                               | 4.0       | 4.02      | –       |          |
                                               | 5.0       | 5.43      | –       |          |
Average degree                                 | 3.0       | 3.26      | 2.93    | 2.96     | 3.59 (2.96 / 3.44 / 3.92)
                                               | 3.5       | 3.60      | 3.48    |          |
                                               | 4.0       | 3.90      | 4.51    |          |
Modularity                                     | 0.40      | 0.389     | 0.299   | 0.567    | 0.550 (0.509 / 0.563 / 0.617)
                                               | 0.55      | 0.430     | 0.325   |          |
                                               | 0.70      | 0.507     | 0.336   |          |
Clustering coefficient                         | 0.1       | 0.177     | 0.344   | 0.0846   | 0.203 (0.152 / 0.196 / 0.251)
                                               | 0.2       | 0.181     | 0.366   |          |
                                               | 0.3       | 0.217     | 0.409   |          |
Power-law exponent of the degree distribution  | 2.6       | 2.91      | 4.10    | 5.48     | 4.28 (2.91 / 3.48 / 4.23)
                                               | 3.0       | 2.98      | 3.90    |          |
                                               | 3.4       | 3.50      | 3.80    |          |
TABLE I: Average values of 5 global-level structural features in graphs generated by GraphTune, CondGen, and GraphGen. We present the same value for all results of GraphGen for each feature since GraphGen does not provide conditional generation of graphs. The best performance between GraphTune and CondGen achieved under each condition for a particular feature is highlighted in bold font.

To investigate the detailed performance of GraphTune, we depict the distributions of the values of the global-level structural features. In Fig. 4, we plot pairwise relationships of the values of the features of the graphs generated by GraphTune. While the values of the other features of the generated graphs remain within the range of the real data, the distributions of the average shortest path length values in the GraphTune results are clearly distinguishable. According to the scatter plots in Fig. 4, we can confirm that the distribution of points of the real graph data and that of the generated graph data almost overlap. As a result, the relationships between any two features are accurately reproduced, and the reproduction of the graph dataset in every single aspect is achieved.

Fig. 4: Pairwise relationships of the values of the features of the graphs generated by GraphTune when we give the values 3.0, 4.0, and 5.0 of the average shortest path length as conditions. The figure is a grid of multiple plots, in which each feature is shared across the y-axes of a single row and the x-axes of a single column. The diagonal plots show the distributions of the features, and the others are scatter plots of pairs of features. The distributions of the average shortest path length values are clearly distinguishable.

VII Limitations and Future Directions

We recognize that there are some limitations of generation by GraphTune, which suggest that GraphTune has potential for future expansion.

Generation of Large Graphs

Although the number of nodes in the graphs of our dataset is relatively large compared with the datasets evaluated in studies on learning-based conditional generation of graphs, it is small in terms of social networks. In the results [17] of unconditional generation with GraphGen, which adopts sequence-based generation like GraphTune, the average number of nodes of the graphs generated by GraphGen is at most 54.01. While GraphTune generates graphs of almost the same size as those generated by GraphGen, it was unfortunately not able to generate graphs with over 100 nodes. More innovation is needed to overcome this limitation. Hierarchical generation [18] is relatively easy to implement, is expected to be effective, and is a promising option. Another promising option is a combination of deep learning and traditional statistical graph generation. We consider that the sequence-based generation in GraphTune has a high affinity and extensibility for both approaches.

Pinpoint Specification of Features

The tunability of graph features in GraphTune is not perfect. While the distribution of the feature values of graphs generated by GraphTune has distinctly different peaks for different conditions, these values are somewhat spread out. Improving the accuracy of the specification remains a future issue. In addition, GraphTune can currently specify at most one feature and has not yet succeeded in specifying multiple features at the same time. In order to achieve the specification of multiple features, it is necessary to understand the independence and/or dependence among the features. However, the independence and/or dependence of global-level structural features is very complex, and understanding it is a challenging issue. It goes without saying that analytical results based on graph theory are important for this issue. However, observing the features of graphs generated by GraphTune may reinforce the results of graph theory through a data-driven approach.

Generation of Extrapolated Graphs

Although GraphTune generates graphs with a specific feature flexibly, the tunable range of a feature is limited to the range of the feature within the graphs of the learned dataset. In generation or prediction tasks, it is generally hard to extrapolate to values that are not included in the range of the dataset. This difficulty also applies to the graph generation task, and the current GraphTune cannot generate extrapolated graphs that are outside the range of the graphs included in the training set. However, the specification technique of graph features provided by GraphTune could be a key technology to overcome the difficulty of extrapolation in the graph generation task. By specifying the value of a feature at the edge of the range of the training dataset, we can generate graphs inside and outside the border. The generated graphs enhance the original training set, and by repeatedly generating graphs at the edges, the enhanced training set comes to cover ranges that are not included in the original one. The generation of extrapolated graphs is one of our future directions.

VIII Conclusion

In this work, we proposed GraphTune, a learning-based graph generative model with tunable structural features. GraphTune is composed of a CVAE with an LSTM-based encoder and decoder. By specifying the value of a particular structural feature in a condition vector that is input into the CVAE, we can generate graphs with a specific structural feature. We performed comparative evaluations of GraphTune, CondGen, and GraphGen on a real graph dataset sourced from the who-follows-whom graph on Twitter. The results of the evaluation show that GraphTune makes it possible to tune the value of a global-level structural feature, whereas the conventional models are unable to tune global-level structural features.

Although GraphTune provides a rich variety of graphs flexibly, it does not solve all problems related to graph modeling. One improvement needed in future work is to provide richer functionality for the specification of structural features. In addition to improving the accuracy of the feature values of generated graphs, it is also necessary to be able to specify multiple features at the same time. Allowing the generation of extrapolated graphs that are not included in the training dataset is also an important function. On the other hand, the tunability of GraphTune has the potential to empower traditional graph theory through a data-driven approach. When combined with traditional graph theory, unraveling the complex relationships among global-level features is a challenging but interesting issue.

Acknowledgment

This work was partly supported by JSPS KAKENHI Grant Number JP20H04172.

References

  • [1] S. Nakazawa, Y. Sato, K. Nakagawa, S. Tsugawa, and K. Watabe, “A Tunable Model for Graph Generation Using LSTM and Conditional VAE,” in Proceedings of the 41st IEEE International Conference on Distributed Computing Systems (ICDCS 2021) Poster Track, Online, 2021.
  • [2] N. D. Cao and T. Kipf, “MolGAN: An Implicit Generative Model for Small Molecular Graphs,” in Proceedings of the 35th International Conference on Machine Learning (ICML 2018) Workshop, Stockholm, Sweden, 2018.
  • [3] Y. Li, L. Zhang, and Z. Liu, “Multi-Objective De Novo Drug Design with Conditional Graph Generative Model,” Journal of Cheminformatics, vol. 10, no. 33, 2018.
  • [4] E. Jonas, "Deep Imitation Learning for Molecular Inverse Problems," in Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 2019.
  • [5] A. Bojchevski, O. Shchur, D. Zügner, and S. Günnemann, “NetGAN: Generating Graphs via Random Walks,” in Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018, pp. 609–618.
  • [6] A. Bonifati, I. Holubová, A. Prat-Pérez, and S. Sakr, “Graph Generators: State of the Art and Open Challenges,” ACM Computing Surveys, vol. 53, no. 2, 2021.
  • [7] P. Erdös and A. Rényi, “On Random Graphs I,” Publicationes Mathematicae, vol. 6, no. 26, pp. 290–297, 1959.
  • [8] D. J. Watts and S. H. Strogatz, “Collective Dynamics of ‘Small-World’ Networks,” Nature, vol. 393, no. 6684, pp. 440–442, 1998.
  • [9] R. Albert and A.-L. Barabási, “Statistical Mechanics of Complex Networks,” Reviews of Modern Physics, vol. 74, no. 1, pp. 47–97, 2002.
  • [10] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic Blockmodels: First Steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
  • [11] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec, “GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models,” in Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
  • [12] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia, “Learning Deep Generative Model of Graphs,” in Proceedings of 6th International Conference on Learning Representations (ICLR 2018) Workshop, Vancouver, Canada, 2018.
  • [13] M. Simonovsky and N. Komodakis, “GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders,” in Proceedings of the 27th International Conference on Artificial Neural Networks (ICANN 2018), Rhodes, Greece, 2018.
  • [14] R. Assouel, M. Ahmed, M. H. Segler, A. Saffari, and Y. Bengio, “DEFactor: Differentiable Edge Factorization-Based Probabilistic Graph Generation,” arXiv, 2018.
  • [15] C. Yang, P. Zhuang, W. Shi, A. Luu, and P. Li, “Conditional Structure Generation through Graph Variational Generative Adversarial Nets,” in Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 2019, pp. 1338–1349.
  • [16] J. Lim, S.-Y. Hwang, S. Moon, S. Kim, and W. Y. Kim, “Scaffold-Based Molecular Design with a Graph Generative Model,” Chemical Science, vol. 2020, no. 4, pp. 1153–1164, 2020.
  • [17] N. Goyal, H. V. Jain, and S. Ranu, “GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation,” in Proceedings of the Web Conference 2020 (WWW 2020), Taipei, Taiwan, 2020, pp. 1253–1263.
  • [18] W. Jin, R. Barzilay, and T. Jaakkola, “Hierarchical Generation of Molecular Graphs using Structural Motifs,” in Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Online, 2020.
  • [19] X. Guo and L. Zhao, “A Systematic Survey on Deep Generative Models for Graph Generation,” arXiv, 2020.
  • [20] F. Faez, Y. Ommi, M. S. Baghshah, and H. R. Rabiee, "Deep Graph Generators: A Survey," IEEE Access, vol. 9, pp. 106675–106702, 2021.
  • [21] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [22] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in Proceedings of the 2nd International Conference on Learning Representations (ICLR 2014), Banff, Canada, 2014.
  • [23] A. Vázquez, “Growing Network with Local Rules: Preferential Attachment, Clustering Hierarchy, and Degree Correlations,” Physical Review E, vol. 67, no. 5, 2003.
  • [24] D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-MAT: A Recursive Model for Graph Mining,” in Proceedings of the 2004 SIAM International Conference on Data Mining (SDM 2004), Lake Buena Vista, Florida, USA, 2004.
  • [25] T. G. Kolda, A. Pinar, T. Plantenga, and C. Seshadhri, “A Scalable Generative Graph Model with Community Structure,” SIAM Journal on Scientific Computing, vol. 36, no. 5, 2014.
  • [26] J. Liu, Y. Chi, and C. Zhu, "A Dynamic Multiagent Genetic Algorithm for Gene Regulatory Network Reconstruction Based on Fuzzy Cognitive Maps," IEEE Transactions on Fuzzy Systems, vol. 24, no. 2, pp. 419–431, 2016.
  • [27] Y. Wang, W. Che, J. Guo, and T. Liu, "A Neural Transition-Based Approach for Semantic Dependency Graph Parsing," in Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI 2018), New Orleans, Louisiana, USA, 2018.
  • [28] B. Chen, L. Sun, and X. Han, “Sequence-to-Action: End-to-End Semantic Graph Generation for Semantic Parsing,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia, 2018, pp. 766–777.
  • [29] D. Spielman, “Spectral graph theory,” in Combinatorial Scientific Computing, U. Naumann and O. Schenk, Eds.   Taylor & Francis, 2011, ch. 9.
  • [30] X. Yan and J. Han, “gSpan: graph-based substructure pattern mining,” in Proceedings of 2002 IEEE International Conference on Data Mining (ICDM 2002), Gunma, Japan, 2002.
  • [31] I. Higgins, L. Matthey, A. Pal, C. P. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework,” in Proceedings of 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 2017.
  • [32] M. D. Domenico, A. Lima, P. Mougel, and M. Musolesi, “The Anatomy of a Scientific Rumor,” Scientific Reports, vol. 3, no. 2980, 2013.
  • [33] M. E. J. Newman and M. Girvan, “Finding and Evaluating Community Structure in Networks,” Physical Review E, vol. E, no. 69, 2004.
  • [34] A. Jeff, B. Ed, and P. Dietmar, “powerlaw: A Python Package for Analysis of Heavy-Tailed Distributions,” PLOS ONE, vol. 9, no. 1, 2014.
  • [35] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast Unfolding of Communities in Large Networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, 2008.
  • [36] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwanga, “Complex Networks: Structure and Dynamics,” Physics Reports, vol. 424, no. 4-5, pp. 175–308, 2006.
  • [37] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas, “Characterization of Complex Networks: A Survey of Measurements,” Advances in Physics, vol. 56, no. 1, pp. 167–242, 2007.
  • [38] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a Social Network or a News Media?” in Proceedings of the 19th International Conference on World Wide Web (WWW 2010), Geneva, Switzerland, 2010, pp. 591–600.
  • [39] B. Viswanath, A. Mislove, M. Cha, and K. P. Gummadi, “On the Evolution of User Interaction in Facebook,” in Proceedings of the 2nd ACM Workshop on Online Social Networks (WOSN 2009), Barcelona, Spain, 2009, pp. 37–42.