DataSynth
The use of synthetic graph generators is a common practice among graph-oriented benchmark designers, as it allows obtaining graphs with the required scale and characteristics. However, finding a graph generator that accurately fits the needs of a given benchmark is very difficult, so practitioners end up creating ad hoc ones. Such a task is usually time-consuming and often leads to reinventing the wheel. In this paper, we introduce the conceptual design of DataSynth, a framework for property graph generation with customizable schemas and characteristics. The goal of DataSynth is to assist benchmark designers in generating graphs efficiently and at scale, saving them from implementing their own generators. Additionally, DataSynth introduces novel features barely explored so far, such as modeling the correlation between properties and the structure of the graph. This is achieved by a novel property-to-node matching algorithm, for which we present promising preliminary results.
During the last decade, the amount of available data has grown exponentially, and it is expected to grow even more over the next years. Much of this data comes in the form of property graphs, which are graphs whose vertices and edges are labeled and have associated properties in the form of key-value pairs. The increasing popularity of property graphs has prompted the emergence of many systems specialized in their management [1, 3] and analysis [2, 21, 8], as well as benchmarking initiatives to compare them fairly [17, 10, 12, 11, 5].
One of the difficulties of evaluating graph systems is obtaining representative datasets with the desired scale and characteristics, because data is often sensitive and business critical, and companies do not disclose it. Thus, the use of synthetically generated graphs has become a common practice among graph-oriented benchmark designers.
Recent literature on graph system benchmarking reveals an increasing interest in large synthetic graphs that can reliably mimic real datasets at both the structural and property value levels. On the one hand, it is acknowledged that the structure of a graph can heavily affect the performance and behavior of an algorithm [19, 22]. On the other hand, graph-based technology is penetrating domains such as social networks, mobility planning and drug development, to cite just a few. Each domain requires application-specific benchmarks with graphs where not only the structure is relevant (i.e. it must be similar to that of the graphs of the domain), but also the distributions of the property values and the way these properties are correlated with the underlying graph structure. In many real graphs, we observe property-structure correlations in the form of joint probability distributions between the property values of pairs of connected nodes [16]. The presence of these property-structure correlations can be decisive for the performance of some queries. This aspect is accurately modeled, for instance, by the LDBC Social Network Benchmark [10], which uses correlated graphs that have been crafted after a detailed choke point analysis similar to that done in more traditional benchmarks such as TPC-H [6].
Given these trends, we foresee an increasing need for synthetic graph generators that can produce large graphs that are realistic both structurally and in terms of properties. However, most existing graph generators focus only on the structure [7, 9, 13], and those that generate properties are designed for specific use cases [10, 23]. Implementing a property graph generator is a time-consuming task, so we need tools that save practitioners from such a burden.
In this paper, we present the conceptual design of DataSynth, a work-in-progress, domain-agnostic framework for the generation of property graphs for benchmarking at scale. DataSynth assumes a shared-nothing environment and borrows techniques from existing tools to generate data efficiently in parallel. At the core of DataSynth lies a novel property-to-node matching algorithm that allows decoupling the generation of properties from the generation of the graph structure, while preserving property-structure correlations. According to our first experiments, this approach looks promising. In summary, DataSynth is designed to be capable of:
Generating property graphs using configurable schemas that consist of multiple node and edge types and properties.
Reproducing user-provided property value distributions and property-structure correlations, similar to those observed in many real graphs.
Scaling to billion-edge graphs and being work-efficient by applying in-place data generation and other optimization techniques whenever possible.
In this section, we identify and classify data requirements related to the schema, the structure and the property distributions; and functional requirements related to the scale factor of the graph and other characteristics of a property graph generator. We use the running example of Figure 1, which represents a simple social network.
Schema.
Graph-based algorithms are being adopted in many domains, from recommender systems to social network analysis, route planning, etc. All these applications rely on property graphs that exhibit a myriad of different schemas that graph generators have to be able to reproduce. Following the typical property graph model, such schemas are usually defined in terms of the node and edge types, their associated properties and the cardinality of the edge types. Thus, a property graph generator should allow expressing a schema in similar terms.
For instance, in our running example there are two node types, Person and Message, and two edge types, knows and creates. Person has five properties: name, country, interest, sex and creationDate. Message has two, topic and text, and the edge creates has one: creationDate. Without loss of generality, we will assume that all properties in this schema are of type String. Regarding the cardinality of the edges, knows is a many-to-many (*-*) relationship between Persons, and creates is a one-to-many (1-*) relationship between Persons and Messages.
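As an illustration only, the running-example schema can be written down as plain data. The following is a minimal Python sketch, not the DataSynth DSL (whose design is out of the scope of this paper); the class names `NodeType`, `EdgeType` and `Schema` and the cardinality strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeType:
    name: str
    properties: tuple  # property names; all of type String in the running example


@dataclass(frozen=True)
class EdgeType:
    name: str
    tail: str          # node type at the tail of the edge
    head: str          # node type at the head of the edge
    cardinality: str   # "*-*" (many-to-many) or "1-*" (one-to-many)
    properties: tuple = ()


@dataclass
class Schema:
    node_types: dict = field(default_factory=dict)
    edge_types: dict = field(default_factory=dict)

    def add_node_type(self, node_type: NodeType) -> None:
        self.node_types[node_type.name] = node_type

    def add_edge_type(self, edge_type: EdgeType) -> None:
        # An edge type may only connect node types that are already declared.
        for endpoint in (edge_type.tail, edge_type.head):
            if endpoint not in self.node_types:
                raise ValueError(f"unknown node type: {endpoint}")
        self.edge_types[edge_type.name] = edge_type

    def property_count(self) -> int:
        types = list(self.node_types.values()) + list(self.edge_types.values())
        return sum(len(t.properties) for t in types)


# The running example: Person and Message nodes, knows and creates edges.
schema = Schema()
schema.add_node_type(
    NodeType("Person", ("name", "country", "interest", "sex", "creationDate")))
schema.add_node_type(NodeType("Message", ("topic", "text")))
schema.add_edge_type(EdgeType("knows", "Person", "Person", "*-*"))
schema.add_edge_type(
    EdgeType("creates", "Person", "Message", "1-*", ("creationDate",)))
```

Declaring edge endpoints by node type name keeps the description close to the terms used above: node types, edge types, their properties and the edge cardinalities.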
Structural.
Graph theory defines tens of structural properties to characterize graphs, such as number of connected components, clustering coefficient, degree distribution, centrality, diameter, assortativity, community distribution, etc. Graphs from different domains exhibit differences in such structural characteristics, which can affect the performance of the algorithms. Thus a property graph generator should be able to generate graphs reproducing them.
For instance, our running example imposes a structural characteristic over the knows edge that the generator should be able to reproduce. In addition, the countries of the pairs of Persons connected by this edge should follow a joint probability distribution P(X,Y) that reflects that Persons from the same country are more likely to know each other. As a consequence, the resulting graph will be divided into communities of Persons from the same country.
Distribution.
The distribution of property values in real graphs is rarely uniform. For example, in our running example, a Person's country should follow a distribution similar to that found in real life. Moreover, property values may be correlated with each other. For instance, the name of a Person is clearly correlated with the sex and the country. Finally, other relations may exist between the values of different properties, such as binary logical relations between numerical values. For instance, in our running example, the creationDate of a knows edge should be greater than the creationDate of the two Persons it connects.
Schema  Structure  Distributions  Scale Factor  Others
Node  Edge  Edge  Node  Edge  Scalability  Language  Integrability
type  prop.  type  prop.  cardinality  distribution  correlation  edge
LDBC-SNB [10]  x  dd, cc  x  x  x  x
Myriad [4]  x  x  x  x  dd  x  x
RMat [7]  pl dd  x  x
LFR [15]  pl dd, c  x
BTER [13]  dd, accd  x  x
Darwini [9]  dd, ccdd  x  x
Scale Factor.
Existing benchmarks usually define some sort of scale factors for their data. Each scale factor is used to size the capabilities of the systems with respect to the amount of processed data. Some existing benchmarks base such scale on the number of nodes of the graph [17], others prefer the number of edges [11] or a combination of nodes and edges [12], or even on the size on disk of the datasets [10]. Thus, a property graph generator should provide different means of specifying the scale of the produced graph.
Other requirements.
We have identified a series of other characteristics that we believe any graph generator should have. Especially relevant are its scalability and efficiency, which must allow the generator to produce large graphs such as those found in real life. Also, taming a cross-domain property graph generator with such a degree of flexibility requires a properly designed interface. This should include some sort of Domain Specific Language (DSL) for the specification of the data to generate, with the corresponding syntax completion tools. Finally, besides the interface, generators should provide connectors for integrating the framework with production-level technologies such as databases and cluster storage systems (e.g. HDFS).
Table 1 summarizes the state-of-the-art generators in relation to the requirements described in Section 2. Note that, in the table, a marked cell indicates that the generator allows explicitly configuring the corresponding aspect.
The LDBC Social Network Benchmark [10] (LDBC-SNB) models a realistic social network, with multiple node types (Persons, Posts, Topics, etc.) and edge types (knows, creates, has). Among the novel features it incorporates, especially remarkable is the generation of a friendship graph with property-structure correlations, which also has several desirable properties observed in social networks, such as a realistic community structure [18], a small diameter, a large clustering coefficient and a Facebook-like degree distribution. However, it does not provide many ways to change the produced schema, although some distributions and cardinalities can be tuned using configuration files.
Myriad is a domain-agnostic generator for structured relational data. It is flexible and allows the definition of different domain objects with multiple properties, including foreign keys, which can be seen as one-to-one or one-to-many edges. However, Myriad does not allow generating many-to-many relationships, so it cannot be applied to fully model property graphs. Additionally, Myriad implements in-place data generation using pseudorandom number generators, a technique we borrow for DataSynth.
RMat is a graph generator used in the Graph500 competition [17] that produces graphs with power-law degree distributions. Similarly, the LFR graph generator not only generates power-law degree distributions but also communities of nodes. This graph generator is typically used to benchmark community detection algorithms, since the communities are known beforehand.
The BTER graph generator goes beyond degree distributions and is also capable of reproducing the average clustering coefficient per degree of an input graph. As a side effect of its generation process, BTER produces graphs with a positive degree of assortativity and a community structure. Darwini [9] extends BTER and captures the clustering coefficient distribution at a finer granularity. Both BTER and Darwini are highly scalable, which allows the generation of Facebook-scale graphs on the order of a trillion edges. Additionally, BTER and Darwini produce graphs with a small diameter due to their generation process, although this cannot be configured in any way.
In this section, we describe how DataSynth approaches the problem of property graph generation, given the requirements identified in Section 2. Figure 2 summarizes how DataSynth conceptually generates property graphs. First, the schema is received, expressed in a domain specific language (DSL) that allows expressing all the needs identified by the schema, structural, distribution and scale factor requirements. Then, for each edge type, we generate node properties and graph structure independently, which are later matched (node ids are assigned to graph structure nodes) in order to reproduce the joint probability distributions specified by the user. Finally, the properties of the edges are generated.
We follow this approach for the following reasons. Building a graph generator that can configure all the existing structural characteristics, generate properties at the same time, and also reproduce the joint probability distributions between property values of nodes is a very complex task. Moreover, we do not even know which of the structural characteristics have an actual impact on the performance of the algorithms, which may depend on the domain of the generated graph and the type of queries to perform. For instance, while it is acknowledged that the degree distribution and diameter affect the performance of some algorithms such as BFS, the impact of the community structure or the degree of assortativity has not yet been assessed. Thus, our approach lets the user choose among existing structure generators and the structural properties they reproduce, fulfilling the structural requirement and keeping the framework open to advances in the field.
In the rest of the section, we detail the whole approach. We first introduce some preliminary concepts, namely the data model, property generators and structure generators, and continue with the actual property graph generation process. We do not detail the design of the DSL because it is not in the scope of this paper.
Data Model.
DataSynth is designed with scalability in mind; for that purpose, we rely on distributed tables as the data storage. In more detail, we use one “Property Table” (PT), which is a 2-column table [id:Long, value:type], for each pair <node type, property> and <edge type, property>. In our running example, it would create eight PTs. In addition, we use one “Edge Table” (ET), which is a 3-column table [id:Long, tailId:Long, headId:Long], for each edge type. The first column is used to identify an edge instance, while the second and third columns contain the ids of the nodes connected by the edge. The ids, either of nodes or edges, are unique per type, and range between 0 and $n-1$, where $n$ is the number of instances of the given type. For our running example, DataSynth would create three ETs.
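The two table shapes can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: DataSynth relies on distributed tables, whereas the code below uses plain Python lists; the names `PropertyRow`, `EdgeRow`, `PROPERTIES` and the helper functions are hypothetical, and ids are assumed to be dense and zero-based.

```python
from typing import Dict, List, NamedTuple


class PropertyRow(NamedTuple):
    """One row of a Property Table (PT): [id, value]."""
    id: int
    value: str


class EdgeRow(NamedTuple):
    """One row of an Edge Table (ET): [id, tailId, headId]."""
    id: int
    tail_id: int
    head_id: int


# Properties of the running example, keyed by node or edge type.
PROPERTIES: Dict[str, List[str]] = {
    "Person": ["name", "country", "interest", "sex", "creationDate"],
    "Message": ["topic", "text"],
    "creates": ["creationDate"],
}


def property_table_names(properties: Dict[str, List[str]]) -> List[str]:
    # One PT per <type, property> pair.
    return [f"{owner}.{prop}" for owner, props in properties.items() for prop in props]


def is_valid_edge_table(rows: List[EdgeRow], num_tail_nodes: int, num_head_nodes: int) -> bool:
    # Edge ids must be exactly 0..m-1 and endpoints must be valid node ids
    # (assumption: ids are dense and start at zero).
    ids_ok = sorted(r.id for r in rows) == list(range(len(rows)))
    ends_ok = all(
        0 <= r.tail_id < num_tail_nodes and 0 <= r.head_id < num_head_nodes
        for r in rows
    )
    return ids_ok and ends_ok


pt_names = property_table_names(PROPERTIES)  # eight PTs for the running example

# A toy ET for creates: Person 0 created Messages 0 and 1, Person 1 created Message 2.
creates_et = [EdgeRow(0, 0, 0), EdgeRow(1, 0, 1), EdgeRow(2, 1, 2)]
```

Because ids are unique per type rather than globally, the Person ids and the Message ids in `creates_et` both start at zero without clashing.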
Property Generators (PGs).
PGs are pluggable “objects” that can be referenced from the DSL to specify the way property values are generated. A PG implements an interface with the following methods:
initialize: Sets up the state of the PG. It takes a variable number of parameters that depend on the strategy to generate the data (e.g. a filename to load a dictionary).
run: Generates the property values of the instances. It takes two parameters: i) the id of the instance, either a node or an edge, for which it generates the property value, and ii) the result of calling a deterministic function that generates a random number from that id. Optionally, it also takes a variable number of parameters used to specify the correlation between property values.
Notice that run depends exclusively on the id and the result of a deterministic function called with the same id. This allows regenerating a property value in place by just knowing the id, for instance, in different computing nodes. This approach is the same as that used in Myriad, where the function is a pseudorandom number generator (PRNG) with skip seed. Such a PRNG efficiently implements a method that returns the $i$-th random number in a sequence. The PG can use the number returned by this method to generate a property value randomly. In order to ensure independence between properties, DataSynth builds a different PRNG for each PT. Additionally, passing the id to run allows the generation of user-controlled values that can be correlated with other properties such as the time.
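The following Python sketch illustrates the PG interface described above. It is a sketch under assumptions, not DataSynth's implementation: `SkipSeedPRNG` emulates the "return the i-th random number" behaviour by hashing the pair (seed, i), which is a stand-in for a true skip-ahead PRNG such as the one used in Myriad, and `CategoricalPG` is a hypothetical PG that samples from a user-provided discrete distribution by inverse transform sampling.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate


class SkipSeedPRNG:
    """Stateless generator: the i-th number depends only on (seed, i).

    Assumption: a cryptographic hash stands in for a real skip-ahead PRNG.
    """

    def __init__(self, seed: int):
        self.seed = seed

    def at(self, i: int) -> float:
        digest = hashlib.sha256(f"{self.seed}:{i}".encode()).digest()
        # Keep 53 bits so the result is an exact float in [0, 1).
        return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


class CategoricalPG:
    """Hypothetical PG producing values that follow a discrete distribution."""

    def initialize(self, values, weights):
        total = sum(weights)
        self.values = list(values)
        # Cumulative distribution used by inverse transform sampling.
        self.cdf = list(accumulate(w / total for w in weights))

    def run(self, id: int, random_number: float) -> str:
        # The id is part of the interface but unused by this particular PG.
        index = bisect_right(self.cdf, random_number)
        return self.values[min(index, len(self.values) - 1)]


country_pg = CategoricalPG()
country_pg.initialize(["ES", "FR", "DE"], [0.5, 0.3, 0.2])
prng = SkipSeedPRNG(seed=42)  # one PRNG per PT keeps properties independent

# Row 7 of the PT can be (re)generated anywhere from its id alone.
row = (7, country_pg.run(7, prng.at(7)))
```

Since `at` keeps no state, any worker holding the seed can regenerate the value of any id without coordinating with the others.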
Note that the interface of a PG is flexible enough to allow the generation of the properties of our running example. The optional parameters of the run method allow implementing the generation of sequences that follow probability or conditional probability distributions. For example, the run method of the PG for Person name, which depends on Person country and Person sex, takes the country and the sex of the Person as additional parameters. In order to generate names with a realistic given distribution, that method can implement “Inverse Transform Sampling” using the provided random number. This allows fulfilling part of the distribution requirement.
Structure Generators (SGs).
Similar to PGs, SGs can be provided by users to customize the generation of the graph structure (the edges), and are referenced from the DSL as well. SGs implement the following interface.
initialize: Initializes the SG, similarly to the PG. It takes a variable number of parameters that depend on the strategy to build the structure (e.g. a file with an empirical degree distribution).
run(n): Generates an ET with the edges of a graph of size $n$ (number of nodes). The values of tailId and headId range between zero and $n-1$, while the ids of the edges range between zero and $m-1$, where $m$ is the ET size, which depends on the generation process.
getNumNodes(m): Returns the number of nodes to call the method run(n) with, such that the resulting ET is of size $m$.
This approach allows accommodating state-of-the-art graph generators such as BTER or Darwini. Their parameters (in this case, the degree and clustering coefficient distributions) would be passed to the method initialize. A call to the run method would then generate the structure for the given number of nodes. Finally, the method getNumNodes would be used to specify the scale of the graph in terms of the number of edges.
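To make the three-method interface concrete, here is a deliberately simple Python sketch of an SG. The generator itself (`RingLatticeSG`, a hypothetical name) is a toy chosen because its edge count is easy to predict; it does not reproduce any realistic structural characteristic and is not one of the generators discussed above. The method names follow the interface described in the text.

```python
import math


class RingLatticeSG:
    """Toy SG: every node i points to the next `degree` nodes on a ring."""

    def initialize(self, degree: int) -> None:
        if degree < 1:
            raise ValueError("degree must be at least 1")
        self.degree = degree

    def run(self, n: int):
        """Return an ET as a list of (id, tailId, headId) rows for n nodes."""
        if n <= self.degree:
            raise ValueError("n must exceed the degree to avoid self-loops")
        edge_table = []
        for tail in range(n):
            for offset in range(1, self.degree + 1):
                head = (tail + offset) % n
                edge_table.append((len(edge_table), tail, head))
        return edge_table

    def getNumNodes(self, m: int) -> int:
        """Number of nodes to pass to run so that the ET has about m rows.

        This toy generator emits n * degree edges, so the target is met
        exactly only when m is a multiple of the degree.
        """
        return max(self.degree + 1, math.ceil(m / self.degree))


sg = RingLatticeSG()
sg.initialize(degree=2)
num_nodes = sg.getNumNodes(10)   # scale given as a number of edges
edge_table = sg.run(num_nodes)   # structure for that many nodes
```

For generators with random output, `getNumNodes` could only return an estimate, which is why the ET size is said to depend on the generation process.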
The data generation process begins by analyzing the schema described by the user to reveal dependencies among the data to be generated. In more detail, from the dependency analysis we obtain a dependency graph, which we traverse to preserve the dependencies between the tasks. This guarantees that the required parameters are available for each task when we execute it. There are three different types of tasks: generate property, generate graph and match graph.
The dependency analysis is required for the following reason. Generate property and generate graph tasks take the size of the task to run as an input (e.g. the number of node instances or the edges of the graph to generate). These sizes are sometimes given by the size of the output of another task. For example, imagine that the user of our running example defines the scale factor of the graph only by means of the number of Persons, but says nothing about the rest of the entities. How many instances of Message does DataSynth have to generate? Notice that the number of Messages depends on the number of instances of the edge creates (due to the cardinality). In turn, the number of creates edges follows a degree distribution observed in real life and provided by the user, and is conditioned by the number of Persons. Thus, to infer the number of Messages, we first need to generate the structure for the edge creates. Once the structure of creates is generated, its size determines the number of Messages to create. Finally, we can apply the match operator between Persons, Messages and creates. Notice that the chain of dependencies can be much more complex than the one used in this example.
Alternatively, the user could be interested in specifying the scale of the graph in terms of the number of creates edges, instead of the number of Persons or the number of Messages. In this case, DataSynth would use the getNumNodes method with the desired number of edges as a parameter, and use the result to size the graph structure and the number of Persons. The size of the resulting graph structure would be used to determine the number of Messages. This flexibility in specifying the scale of the graph lets DataSynth fulfill the scale requirement.
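The ordering constraint described above can be sketched as a topological sort over a task dependency graph. The snippet below is an illustration under assumptions: the task names are hypothetical labels for the running example, only one property per node type is shown, and Python's standard `graphlib` module stands in for whatever traversal DataSynth actually uses.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose output it needs.
# The number of Persons is assumed to be given by the user, so neither the
# Person properties nor the creates structure depend on another task.
dependencies = {
    "generate_property:Person.country": set(),
    "generate_structure:creates": set(),
    # The number of Messages is only known once creates has been generated.
    "generate_property:Message.topic": {"generate_structure:creates"},
    # Matching needs both property tables and the structure.
    "match_graph:creates": {
        "generate_property:Person.country",
        "generate_property:Message.topic",
        "generate_structure:creates",
    },
}

# Any order produced here respects every dependency; a cycle would raise
# graphlib.CycleError instead of producing an order.
execution_order = list(TopologicalSorter(dependencies).static_order())
```

If the user instead fixed the number of creates edges, the same graph would apply, with the structure task first calling getNumNodes to derive the number of Persons.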
Generate Structure.
This task is responsible for the generation of the graph structure of a given edge type. For each edge type, the user specifies the SG to use and its parameters. DataSynth initializes the SG and calls the run method, which returns a table with the graph structure. The number of nodes used to call the run method is determined either by the user or by the dependency analysis of DataSynth. Remember that this task generates a graph whose ids must be matched later with node ids to reproduce the desired correlations (if any).
Generate Property.
Properties, either of nodes or of edges, are created by calling the run method of their corresponding PG, which is initialized with the parameters specified by the user using the method initialize. The method is called once per row of the PT to generate, that is, once per instance id. Before the generation of a PT, its corresponding PRNG is initialized as well. For those properties that are not correlated with any other property, the row of the PT for a given id contains the id and the value returned by run for that id and its random number. For those properties correlated with other properties, run additionally receives the results of calling the run methods of the properties the currently generated property depends on (using the appropriate PG and PRNG). Such calls, in turn, can lead to successive calls of other run methods. The dependency analysis guarantees that the recursion terminates. For example, in order to generate the property sex of Person, we would call the run method of its PG once per Person id, passing the id and the random number that the PRNG of that PT returns for it.
Properties for edges can be generated similarly, using the ids of the endpoints of the edge if needed. Note that, thanks to our approach where property values are generated independently, these can be generated efficiently in parallel in a distributed system by just knowing the ids of the nodes to generate. Knowing such ids is easy, because ids are unique per type rather than globally, so we just need the number of instances of the type. This allows achieving the desired scalability expressed in the others requirement.
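A small Python sketch can show why this scheme parallelizes well. Everything here is an assumption made for illustration: `random_at` is a hash-based stand-in for a skip-seed PRNG, the toy dictionary `NAMES` and the seeds are invented, and the name is made to depend only on the sex (the running example also conditions it on the country) to keep the sketch short.

```python
import hashlib


def random_at(seed: int, i: int) -> float:
    """i-th number of the stream identified by seed (hash-based stand-in)."""
    digest = hashlib.sha256(f"{seed}:{i}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


SEX_SEED, NAME_SEED = 1, 2  # one independent stream per PT (hypothetical seeds)
NAMES = {"female": ["Ana", "Maria"], "male": ["Joan", "Pere"]}  # toy dictionary


def run_sex(id: int, r: float) -> str:
    return "female" if r < 0.5 else "male"


def run_name(id: int, r: float, sex: str) -> str:
    options = NAMES[sex]
    return options[int(r * len(options))]


def name_row(id: int):
    # The sex is regenerated in place from the id instead of being looked up
    # in another table, so no communication between workers is needed.
    sex = run_sex(id, random_at(SEX_SEED, id))
    return (id, run_name(id, random_at(NAME_SEED, id), sex))


def generate_chunk(row_fn, start: int, stop: int):
    """Rows of a PT for the id range [start, stop), as one worker would do."""
    return [row_fn(i) for i in range(start, stop)]


whole_table = generate_chunk(name_row, 0, 10)
two_workers = generate_chunk(name_row, 0, 4) + generate_chunk(name_row, 4, 10)
```

Splitting the id range across workers yields exactly the same table as a single pass, which is the property that makes in-place generation attractive in a shared-nothing setting.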
Graph Matching.
The task of graph matching consists of matching entries of a PT with the nodes of the generated graph structure, in such a way that the desired property-structure correlation is preserved. We model the property-structure correlation as a joint probability distribution P(X,Y). This distribution, which is provided by the user, expresses the probability of picking a random edge of the graph and observing property values X and Y at its endpoints. In those cases where an edge type is not correlated with any property, the matching is done randomly.
The input of the graph matching is: the PT of the property that is correlated with the structure, a joint probability distribution P(X,Y) and a graph structure. The goal is to find a mapping function that maps the node ids of the graph structure to ids of the PT, such that the joint distribution observed after applying the mapping function is as close as possible to P(X,Y). This is the way we fulfill the remainder of the distribution requirement.
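To make the objective tangible, the sketch below computes the joint distribution observed on a graph after applying a candidate mapping, and measures how far it is from a target. It is a minimal Python illustration under assumptions: tables are plain lists and dicts, edges are treated as undirected (value pairs are unordered), and the L1 distance is just one possible way of quantifying "as close as possible"; none of the function names come from DataSynth.

```python
from collections import Counter


def observed_joint(edge_table, mapping, property_table):
    """Empirical joint distribution of the property values at edge endpoints.

    edge_table: list of (id, tailId, headId) rows of the graph structure.
    mapping: dict from structure node id to PT id (the mapping function).
    property_table: dict from PT id to property value.
    """
    counts = Counter()
    for _, tail, head in edge_table:
        x = property_table[mapping[tail]]
        y = property_table[mapping[head]]
        counts[tuple(sorted((x, y)))] += 1  # assumption: unordered pairs
    total = len(edge_table)
    return {pair: count / total for pair, count in counts.items()}


def l1_distance(p, q):
    """Sum of absolute differences over all value pairs seen in p or q."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# A path 0-1-2-3 whose nodes are mapped one-to-one onto four PT rows.
edge_table = [(0, 0, 1), (1, 1, 2), (2, 2, 3)]
property_table = {0: "ES", 1: "ES", 2: "FR", 3: "FR"}
identity = {node: node for node in range(4)}

observed = observed_joint(edge_table, identity, property_table)
target = {("ES", "ES"): 1 / 3, ("ES", "FR"): 1 / 3, ("FR", "FR"): 1 / 3}
```

A different mapping over the same graph, for example one that alternates ES and FR along the path, would put all the mass on the (ES, FR) pair and therefore score a larger distance against this target.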
We approach the problem using the Stochastic Block Model (SBM). SBM is a model used for graphs where there are groups of entities with a given property value or category (one group per property value). In SBM, for each pair of groups $(i, j)$, there is a probability that an edge exists between each pair of members of the two groups ($i$ and $j$ can be the same). The SBM is typically used to study community detection algorithms.
For example, suppose that in our input PT there are $k$ different property values. Let $s_1, \dots, s_k$ be the frequencies of each of the values observed in the PT (which are the sizes of each group). Let $W$ be a $k \times k$ matrix such that $W_{ij}$ contains the number of edges between the nodes of groups $i$ and $j$ (we work with absolute numbers of edges instead of probabilities for convenience). Given the joint probability distribution and the number of edges of the graph structure, we can compute the entries of $W$. In other words, our problem is equivalent to classifying the nodes of the graph structure into $k$ groups of sizes $s_1, \dots, s_k$, in such a way that the intra- and inter-group edges are as close as possible to those in $W$. Then, the mapping function is built by assigning to each node of the graph structure an id out of those of the PT that have the value corresponding to the partition the node has been assigned to. Thus, our problem can be seen as a graph partitioning problem.
To solve this graph partitioning problem, we have implemented a variation of the LDG streaming graph partitioning algorithm [20]. In LDG, a node arrives along with its edges and is placed in the partition that holds most of its already-seen neighbors (weighted by a factor depending on the remaining capacity of the partition). In our case, instead of taking the decision based on the node’s degree, we place the node in the partition that minimizes the Frobenius norm between the target matrix of edge counts and the matrix that results from the placement:
(1) 
where the matrix resulting from the placement contains, in entry $(i, j)$, the number of edges connecting nodes with properties $i$ and $j$, given that we put the current node in partition $t$. As in LDG, the score is balanced by the remaining capacity of partition $t$, which depends on the number of nodes placed in that partition so far. We name this method SBMPart. Notice that a small variation of SBMPart can also be applied to bipartite graphs, since the SBM can model this type of graph as well. If the bipartite graph is between two different node types, the input would contain two PTs instead of one.
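The streaming procedure can be sketched in a few lines of Python. This is a simplified illustration and not the authors' SBMPart: the exact score of Equation 1 is not reproduced here. As a stated assumption, the sketch picks the partition with the smallest Frobenius norm and uses the capacity only as a hard limit and as a tie-breaker (more remaining capacity wins), which is weaker than the capacity weighting described above. The function names, the adjacency-dict input and the convention that diagonal entries count intra-group edges once are all choices made for this example.

```python
import math


def frobenius_distance(a, b):
    """Frobenius norm of the difference of two equally sized matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


def add_node(matrix, t, seen):
    """Add the edges of a node placed in partition t to a symmetric count matrix."""
    for p, count in enumerate(seen):
        matrix[t][p] += count
        if p != t:
            matrix[p][t] += count


def sbm_part_sketch(adjacency, order, sizes, target):
    """Greedy streaming partitioner (simplified; see the assumptions above).

    adjacency: dict node -> iterable of neighbours (undirected graph).
    order: sequence in which nodes arrive.
    sizes: capacity of each partition; target: desired edge-count matrix.
    """
    k = len(sizes)
    if sum(sizes) < len(order):
        raise ValueError("partition capacities are too small")
    current = [[0] * k for _ in range(k)]
    placed = [0] * k
    assignment = {}
    for node in order:
        # Count the already placed neighbours of the node per partition.
        seen = [0] * k
        for neighbour in adjacency[node]:
            if neighbour in assignment:
                seen[assignment[neighbour]] += 1
        best, best_key = None, None
        for t in range(k):
            if placed[t] >= sizes[t]:
                continue  # partition is full
            candidate = [row[:] for row in current]
            add_node(candidate, t, seen)
            remaining = 1.0 - placed[t] / sizes[t]
            key = (frobenius_distance(target, candidate), -remaining)
            if best_key is None or key < best_key:
                best, best_key = t, key
        assignment[node] = best
        placed[best] += 1
        add_node(current, best, seen)
    return assignment, current


# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
target = [[3, 1], [1, 3]]  # three intra-group edges per group, one edge across
assignment, reached = sbm_part_sketch(adjacency, [0, 3, 1, 4, 2, 5], [3, 3], target)
```

On this toy input the greedy choice recovers the two triangles, but, being greedy, the procedure gives no such guarantee in general and its result depends on the arrival order of the nodes.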
Preliminary evaluation of graph matching.
We conducted some preliminary experiments to assess the quality of the proposed graph matching. We generated a set of graphs using the LFR and RMAT graph generators. We configured LFR with an average degree of 20, a maximum degree of 50, a minimum community size of 10 and a maximum community size of 50, which are the parameters used in [14]. The mixing factor is set to 0.1. The remaining parameters were left at their default values. We generated graphs of 10k, 100k and 1M nodes. In the case of RMAT, we used the default parameters and generated graphs of scale 18, 20 and 22.
We partitioned each of the graphs into groups representing different values, using LDG. The size of each group is derived from the number of nodes of the graph and a geometric distribution. We use a geometric distribution to emulate real-life graphs, where groups have different sizes. Then, the nodes of each partition were assigned the property value of that partition, and we computed our joint probability distribution empirically. Finally, we created a PT with one id per node of the graph, containing as many rows with each property value as the size of the corresponding partition. Then, we ran SBMPart using the PT, the empirical joint probability distribution and the graph. We sent the nodes to SBMPart randomly.
Figures 3 and 4 show the CDFs of the expected and observed distributions after running SBMPart for different graphs and numbers of values. The x axis corresponds to the different pairs of values, which are sorted by decreasing probability in the expected CDF for both distributions.
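The curves compared in these figures can be computed as follows. The Python sketch below is only an illustration of the described procedure, with hypothetical names and invented toy numbers: both distributions are accumulated over the same sequence of value pairs, ordered by decreasing expected probability. Pairs that appear only in the observed distribution are ignored, which is an assumption of this sketch.

```python
from itertools import accumulate


def expected_and_observed_cdfs(expected, observed):
    """Return (pairs, expected CDF, observed CDF) over a common pair ordering.

    expected, observed: dicts from a pair of property values to a probability.
    """
    # Both curves share the x axis: pairs sorted by decreasing expected probability.
    pairs = sorted(expected, key=lambda pair: expected[pair], reverse=True)
    expected_cdf = list(accumulate(expected[pair] for pair in pairs))
    observed_cdf = list(accumulate(observed.get(pair, 0.0) for pair in pairs))
    return pairs, expected_cdf, observed_cdf


# Toy distributions over two property values (illustrative numbers only).
expected = {("a", "a"): 0.5, ("a", "b"): 0.3, ("b", "b"): 0.2}
observed = {("a", "a"): 0.4, ("a", "b"): 0.4, ("b", "b"): 0.2}
pairs, expected_cdf, observed_cdf = expected_and_observed_cdfs(expected, observed)

# Largest vertical gap between the two curves, a simple summary of the mismatch.
max_gap = max(abs(e - o) for e, o in zip(expected_cdf, observed_cdf))
```

A matching that reproduces the target well yields two nearly overlapping curves, so the largest gap stays close to zero.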
In Figure 3, we fix the number of values to 16 and vary the size of the graph for the two generators. We see two interesting results. The first is that the quality of the results for LFR graphs seems to be very good: the shape of the observed distribution is very similar to that of the expected one, and better than that obtained for RMAT graphs. For the latter, however, note that SBMPart is able to reproduce the pronounced slope at the beginning of the distribution, that is, the most probable pairs of values. The results suggest that the performance of the algorithm might be affected by the structure of the graph being partitioned. The second result is that the quality of the results does not seem to be affected by the size of the graphs, which suggests that, qualitatively speaking, the method could scale to larger graphs.
In Figure 4, we see the results when we fix the sizes and change the number of values to 4, 16 and 64. The results are very similar to those observed in the previous experiments. The method consistently works very well with LFR graphs, while for RMAT graphs it seems that the larger the number of values, the better. This seems to confirm that there is a strong influence of the structure of the graph on the quality of the results. Finally, regarding the performance of the algorithm, it takes about 1100 s to process the largest problem, RMAT22 (with 67M edges) and 64 values, using a single thread on an Intel Xeon E2630 v3 at 2.4 GHz. No optimizations of any kind have been implemented.
We have presented the design of DataSynth, a framework for property graph generation with user-defined property distributions and correlations, different graph structures and property-structure correlations. The method relies on a novel graph partitioning algorithm, called SBMPart, that allows matching properties to nodes in a graph, in such a way that desired joint probability distributions are preserved.
SBMPart is a greedy algorithm that does not guarantee an optimal solution, so strict constraints cannot be fully guaranteed. However, special cases of one-to-one and one-to-many edges could be handled by more specific and efficient operators. These would generate both the property values and the graph structure at the same time, which would boost performance and allow reproducing strict constraints reliably. Other specific graph structures such as trees, which appear in message cascades in social networks, might also require special strategies. In this case, information propagates through the cascade, which could be modeled using a vertex-centric approach that propagates the information through the cascade iteratively.
We have presented preliminary results of SBMPart. Further study is required, including a complexity analysis, optimization strategies, etc. Especially interesting would be understanding the relation between the graph structure and the provided joint probability distribution (i.e. in which situations the algorithm performs well and in which it does not). Additionally, performing experiments for multivalued properties would also be interesting.
Finally, more work is needed regarding scalable graph generators with realistic structural characteristics. So far, BTER and Darwini are the structural graph generators that allow tweaking the largest spectrum of structural characteristics. Studying which of the characteristics are important for a given domain, and building scalable graph generators to reproduce these characteristics, are still open problems.
DAMA-UPC thanks the Ministry of Economy, Industry and Competitiveness of Spain and the Generalitat de Catalunya for grant numbers TIN201347008R and SGR1187, respectively, and also the EU H2020 for funding the Uniserver project (ICT042015688540). Also, thanks to Oracle Labs for the support to our research on graph technologies.