Graphs are a natural model to represent and analyze linked data in various domains. Property graphs allow vertices and edges to have associated key–value pair properties, besides the graph structure. This forms a rich information schema and has been used to capture knowledge graphs (concepts, relations) [mitchell2018never], social networks (person, forum, message) [cha2010measuring], and financial and retail transactions (person, store, product) [haslhofer2016bitcoin].
Path queries are a common query-class over property graphs [shu2017fake, Fan:2012:GPM:2274576.2274578]. Here, the user defines a sequence of predicates over vertices and edges that should match along a path in the graph. E.g., in the property graph in Fig. 1, “[EQ1] Find a person (vertex type) who lives in ‘UK’ (vertex property) and follows (edge type) a person who follows another person who is tagged with ‘Hiking’ (vertex property)” is a 3-hop path query, and would match Cleo–Alice–Bob, if we ignore the time intervals. Path queries are used to identify concept pathways in knowledge graphs, fake news in social media, and product suggestions in retail websites. They also need to be performed rapidly, within , as part of transactional requests from websites or exploratory queries by analysts.
While graph databases are designed for transactional read and write workloads, we consider graphs that are updated infrequently but queried often. For these workloads, graph query engines load and retain property graphs in-memory to service requests with low latency, without the need for locking or consistency protocols [janusgraph]. Property graphs can be large, with – vertices and edges, and ’s of properties on each vertex or edge. This can exceed the memory on a single machine, often dominated by the properties. This necessitates the use of distributed systems to scale to large graphs.
Challenges. Time is an increasingly common graph feature in a variety of domains. However, existing property graph models fail to consider it as a first-class entity. Here, we distinguish between graphs with a time or a lifespan associated with their entities (properties, vertices, edges), and those where the entities themselves change over time and the history is available. We call the former static temporal graphs and the latter dynamic temporal graphs. E.g., in Fig. 1, the vertices, edges and properties have a lifespan, forming a temporal graph. Other than the properties of Cleo, the rest form a static temporal graph. But the Country property of Cleo changes over time, causing it to be a dynamic temporal graph.
This gap is reflected not just in the data model but also in the queries supported. Treating time as just another property fails to express temporal relations such as ensuring time-ordering among the entities on the path. E.g., [EQ2] find people tagged with ‘Hiking’ who liked a post tagged as ‘Vacation’, before the post was liked by a person named ‘Don’, and [EQ3] find people who started to follow another person, after they stopped following ‘Don’. While these should match the paths Bob–PicPost–Don and Alice–Bob–Don, respectively, such queries are hard, if not impossible, to express in current graph databases. This motivates the need to support intuitive temporal predicates to concisely express such temporal relations.
Further, existing graph databases and query engines do not support path queries over dynamic temporal graphs. E.g., the query EQ1 above should not match Cleo–Alice–Bob since at the time Cleo was living in ‘UK’, she was not following Alice. While platforms can be adapted to support queries over graphs at a fixed time-point, temporal relationships over time-varying properties and structures cannot be expressed meaningfully. The scalability of existing graph systems is also limited.
We make the following specific contributions in this paper:
We propose a temporal property graph model, and intuitive temporal predicates for path queries over them (§3).
We design a distributed execution model for these queries using the interval-centric computing model (§4).
We develop a novel cost model that uses graph statistics to select the best from multiple execution plans (§5).
We evaluate the performance and scalability of for temporal graphs and up to queries, derived from the LDBC benchmark. We compare this against three configurations of Neo4J, and JanusGraph which uses Spark (§6).
2 Related Work
2.1 Distributed and Temporal Graph Processing
There are several distributed graph processing platforms for running graph algorithms on commodity clusters and clouds [guo2014well]. These typically offer programming abstractions like Google Pregel’s vertex-centric computing model [malewicz2010pregel] and its component centric variants [goffish, gonzalez2014graphx] to design algorithms such as Breadth First Search, centrality scores and mining [Chen:2018:GET]. These execute using a Bulk Synchronous Parallel (BSP) model, and scale to large graphs and applications that explore the entire graph. They offer high throughput batch processing that take –. We instead focus on exploratory and transactional path queries that need to be processed in . This requires careful use of existing distributed graph platforms and additional optimizations for fast responses.
There are also parallel graph platforms for HPC clusters [boost]. These optimize the memory and communication access to scale to graphs with billions of entities on thousands of cores [dang2018lightweight]. They focus on high-throughput graph algorithms. We instead target commodity hardware and cloud VMs with 10’s of nodes and 100’s of total cores, which are more accessible. We also address queries over temporal property graphs.
A few distributed platforms support high throughput temporal graph processing, with abstractions for designing temporal algorithms [simmhan2015distributed, han2014chronos]. We use one such in-house system, Graphite, which extends Apache Giraph, as the base framework for implementing our low-latency path query engine [graphite]. There are also a few platforms that support incremental computing as graph updates continuously arrive [zakian2019incrementalization, cheng2012kineograph]. We instead focus on property graphs with temporal lifespans on their vertices, edges and properties that have already been collected in the past. In future, we plan to consider incremental query processing over such dynamic graphs.
2.2 Property and Temporal Graph Querying
Query models over property graphs and associated query engines have been popular for semantic graphs [chavarria2016graql, zhou2013distributed]. Languages like SPARQL offer a highly flexible declarative syntax, but are costly to execute in practice for large graphs. Others support a narrower set of declarative query primitives, such as finding paths, reachability and patterns over property graphs, but manage to scale to large graphs using a distributed execution model [jamadagni2016godb, Sarwat:2013:HDS]. However, none of these support time as a first-class entity, either during query specification or during execution.
There has been limited work on querying and indexing over specific temporal features of property graphs. [semertzidis2018top] propose a model for finding the top-k graph patterns which exist for the longest period of time over a series of graph snapshots. They propose several indexing techniques to minimize the snapshot search space, and perform a brute-force pattern mining on the restricted set. This multi-snapshot approach limits the pattern to fully exist at a single time point and recur across time, rather than allow it to span time intervals. It is also limited to a single-machine execution, which limits scaling.
TimeReach [semertzidis2015timereach] supports conjunctive and disjunctive reachability queries on a series of temporal graph snapshots. It builds an index from strongly connected components (SCC) for each snapshot, condenses them across time, and uses this to traverse between vertices in different SCCs within a single hop, assuming that the graph has few SCCs that do not change much over time. It also requires the path to be reachable within a single snapshot rather than allow path segments to connect across time. Likewise, TopChain [wu2016reachability] supports temporal reachability queries using an index labeling scheme. It unrolls the temporal graph into a static graph, with time expanded as additional edges, finds the chain-cover over it, and stores the top-k reachable chains from each vertex as labels. It uses this to answer time-respecting reachability, earliest arrival path and fastest path queries. Paths can span time intervals. However, it does not support any predicates over the properties. Neither of these supports distributed execution.
ChronoGraph [byun2019chronograph] supports temporal path traversal queries over interval property graphs. They implement this using the Gremlin property graph query language over TinkerGraph engine. They propose a set of optimizations to the Gremlin traversal operators, offer parallelization and lazy traversals within a single machine. However, they do not design or use any indexing structures or statistics to make the execution plan efficient. Their optimizations are also tightly-coupled to the execution engine, which itself is deprecated and does not support distributed execution.
In summary, these various platforms and techniques lack one or more of the following capabilities that we offer: modeling time as a first-class graph and query concept, besides properties and enabling temporal path queries that allow the path to span time and match temporal relations across entities; a distributed execution model on commodity clusters, that scales to large graphs using a query cost optimizer based on statistics over the graph.
3 Temporal Graph and Query Models
3.1 Temporal Concepts
The temporal property graph concepts used in this paper are drawn from our earlier work [graphite]. Time is a linearly ordered discrete domain whose range is the set of non-negative whole numbers. Each instant in this domain is called a time-point and an atomic increment in time is called a time-unit. A time interval is given by $\tau = [t_s, t_e)$ where $t_s < t_e$, which indicates an interval starting from and including $t_s$ and extending to but excluding $t_e$. Interval relations [allen1983textordfemininemaintaining] are Boolean comparators between intervals; we use the fully before, starts before, fully after, starts after, and overlaps relations.
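The interval relations named above can be illustrated with a short Python sketch. It assumes half-open intervals over integer time-points; the class and function names are hypothetical, and the exact boundary conditions chosen for each relation are an assumption of this sketch rather than the formal definitions of the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Half-open interval [start, end) over discrete, non-negative time-points."""
    start: int
    end: int

    def __post_init__(self):
        if not (0 <= self.start < self.end):
            raise ValueError("require 0 <= start < end")


def fully_before(a, b):
    # a has ended by the time b starts (a.end itself is excluded from a)
    return a.end <= b.start


def starts_before(a, b):
    return a.start < b.start


def fully_after(a, b):
    return a.start >= b.end


def starts_after(a, b):
    return a.start > b.start


def overlaps(a, b):
    # the two intervals share at least one time-point
    return a.start < b.end and b.start < a.end
```

With these assumed definitions, `Interval(0, 5)` is fully before `Interval(5, 9)` and the two do not overlap, because the end point of a half-open interval is excluded.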
3.2 Temporal Property Graph Model
We formally define a temporal property graph as a directed graph $G = (V, E, P_V, P_E)$. $V$ is a set of typed vertices where each vertex $\langle vid, \sigma, \tau \rangle \in V$ is a tuple with a unique vertex ID, $vid$, a vertex type (or schema) $\sigma$, and the lifespan of existence of the vertex given by the interval, $\tau$. $E$ is a set of directed typed edges, with $\langle eid, \sigma, vid_i, vid_j, \tau \rangle \in E$. Here, $eid$ is a unique ID of the edge, $\sigma$ its type, $vid_i$ and $vid_j$ are its source and sink vertices, respectively, and $\tau$ is its lifespan. We have a schema function $S: \sigma \rightarrow K$, that maps a given vertex or edge type to the set of property keys (or names) it can have. $P_V$ is a set of vertex property values, where each $\langle vid, \kappa, val, \tau \rangle \in P_V$ represents a value for the key $\kappa$ for the vertex $vid$, with the value valid for the interval $\tau$. A similar definition applies for the edge properties $P_E$.
Further, the graph must meet the uniqueness constraint of vertices and edges, referential integrity constraints, and constant edge association constraints [moffitt2017temporal].
A static temporal property graph is a restricted version of the temporal property graph such that, for the vertex and edge properties, the validity interval of each property value equals the lifespan of its vertex or edge, i.e., each property key has a static value that is valid for the entire vertex or edge lifespan. Graphs without this restriction are called dynamic temporal property graphs, and these allow keys for a vertex or an edge to have different values for non-overlapping time intervals. E.g., omitting vertex Cleo in Fig. 1 makes it a static temporal property graph, but retaining it makes this a dynamic temporal property graph.
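The distinction between static and dynamic temporal property graphs can be checked mechanically. The following is a minimal sketch, assuming lifespans and property validity intervals are stored as `(start, end)` tuples; the function name, the record layout, and the sample values are hypothetical and are not taken from Fig. 1.

```python
def is_static(lifespans, properties):
    """Return True if every property value is valid for its owner's whole lifespan.

    lifespans:  {entity_id: (start, end)} for vertices or edges
    properties: iterable of (entity_id, key, value, (start, end)) records
    """
    return all(lifespans[eid] == span for eid, _key, _value, span in properties)


# Hypothetical sample data, for illustration only.
lifespans = {"Alice": (0, 10), "Cleo": (0, 10)}
static_props = [("Alice", "Country", "IN", (0, 10))]
dynamic_props = static_props + [
    ("Cleo", "Country", "UK", (0, 5)),   # value changes during Cleo's lifespan
    ("Cleo", "Country", "US", (5, 10)),
]
```

Here `is_static(lifespans, static_props)` holds, while adding the time-varying `Country` values for Cleo makes the check fail, which mirrors the role Cleo plays in the example above.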
3.3 Temporal Path Query
An $n$-hop linear chain path query is a pattern matching query with $n$ vertex predicates and $n-1$ edge predicates. The syntax rules for this query model and its predicates are given below.
<path> ::= <ve-path>* <v-pred>
<ve-path> ::= <v-pred> <e-pred> | <v-pred> <e-pred> <vint-pred> <e-pred>
<vint-pred> ::= <v-pred> | <v-pred> <etr-clause>
<v-pred> ::= <pred>
<e-pred> ::= <pred> <direction>
<pred> ::= <bol-pred> <prop-clause><time-clause>
<bol-pred> ::= <prop-clause> | <prop-clause> OR <bol-pred> | <prop-clause> AND <bol-pred>
<prop-clause> ::= ve-key <prop-comp> value
<time-clause> ::= ve-lifespan <time-comp> interval
<etr-clause> ::= el-lifespan <time-comp> er-lifespan
<prop-comp> ::= ‘==’ | ‘!=’
As we can see, the property and time clauses are the atomic elements of the predicate and allow in/equality and containment comparison between a property key and the given value, and a more flexible set of comparisons between a vertex/edge/property lifespan and a given interval. These temporal clauses allow a wide variety of comparison within the context of a single vertex or edge, and their properties. These clauses can be combined using Boolean AND and OR operators. Edge predicates can have an optional direction. The wildcard matches all vertices or edges at a hop.
A novel and powerful temporal operator we introduce is the edge time relationship (ETR). Unlike the time clause, this etr-clause allows comparison across entities. Specifically, it is defined on an intermediate vertex in the path (vint-pred), and allows us to compare the lifespans of the left (el-lifespan) and right (er-lifespan) edges. The motivation for this operator comes from social network mining [Fan:2012:GPM:2274576.2274578] and the need to identify flows and frauds in transaction networks [haslhofer2016bitcoin]. E.g., EQ2 and EQ3 from Sec. 1 can be captured using this.
4 Distributed Query Engine
4.1 Relaxed Interval Centric Computing
Our query engine uses a distributed in-memory iterative execution model that extends and relaxes the interval centric computing model (ICM) [graphite]. ICM adds a temporal dimension to Pregel’s vertex centric iterative computing model [malewicz2010pregel]. Users define their computation from the perspective of a single interval-vertex, i.e., the state and properties for a certain interval of a vertex’s lifespan. In each iteration (superstep) of an ICM application, a user-defined compute function is called on each active interval-vertex, which operates on its prior state and on messages it receives from its neighbors, for that interval, and updates the current state. Then, a user-defined scatter function is called on that interval-vertex that allows it to send temporal messages containing, say, the updated state to its neighbors along the out edges. The message lifespan is typically the intersection of the interval state and the edge lifespan. All active interval-vertices in the distributed graph can execute in a data parallel manner in an iteration. Messages are delivered in bulk at a barrier after the scatter phase, and the compute phase for the next iteration starts after that. Vertices receiving a message whose interval overlaps with its lifespan are activated for the overlapping period. This repeats across supersteps until no messages are generated after a superstep.
We design our engine, called , using the compute and scatter primitives offered by the Graphite implementation of ICM over Apache Giraph. However, ICM enforces time-respecting behavior. Here, the intervals of the messages and the interval-vertex state have to overlap for compute to be called, the interval states updated by compute and the edge lifespans have to overlap for scatter to be called, and scatter sends messages on edges whose lifespans overlap with the updated states. But temporal path queries do not need to meet these requirements. E.g., under ICM, navigating from a vertex that occurs after an adjacent vertex is not allowed. Also, ICM uses a TimeWarp operator that allows messages and state intervals to be aligned to enforce this time-respecting behavior, but this operator is costly. So we relax ICM to allow non-time-respecting behavior between compute, scatter and messaging, while leveraging other interval-centric properties it offers.
4.2 Distributed Execution Model
In our execution model, each vertex predicate for a path query and the succeeding edge predicate, if any, are evaluated in a single ICM superstep. Specifically, the vertex predicates are evaluated in the compute function and the edge predicates in the scatter function. We use a specialized logic called init for the first vertex predicate in a query.
4.2.1 Execution over Static Temporal Graphs
A master receives the path query from the client, and broadcasts it to all workers to start the first superstep. Each worker operates over a set of graph partitions with a single thread per partition, and each thread is responsible for calling the compute and scatter functions on every active vertex in its partition. The init logic is called on all vertices in the first superstep. It resets the vertex state for this new query and evaluates the first vertex predicate of the query. If the vertex matches, its state is updated with a match flag and scatter is invoked for each of its incident (in/out) edges. Scatter evaluates the next edge predicate, and if it matches, sends the partial path result to the destination vertex as a message, along with the evaluated path length. If a match fails, this path traversal is pruned.
In the next iteration, our compute logic is called for vertices receiving a message. This evaluates the next vertex predicate in the path and, if it matches, puts all the partial path results from the input messages in the vertex state, and scatter is called on each incident edge. If the edge matches the next edge predicate, the current vertex and edge are appended to each prior partial result and sent to the destination vertex. This continues for as many supersteps as the path length. In the last superstep, the vertices having the matching paths from their messages send them to the master to return to the client.
Scatter also evaluates the edge temporal relationship. Here, the scatter of the preceding edge passes its lifespan as part of the result message, and this is compared against the current edge’s lifespan by the next scatter to decide on a match.
For static temporal graphs, we do not use any interval-centric features of ICM, and the entire lifespan of the vertex is treated as a single interval-vertex for execution, and likewise for edges. However, we do use the property graph model and state management APIs offered by the interval-vertex.
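The superstep flow described above can be sketched as a single-process simulation. This is a minimal illustration and not the engine's implementation: it ignores lifespans and the edge time relationship, follows only out-edges, and records only vertex IDs in the partial paths. The function name, data layout, and sample property values are hypothetical, loosely modeled on EQ1.

```python
def run_path_query(vertices, out_edges, v_preds, e_preds):
    """Simulate one vertex predicate (compute) and one edge predicate (scatter) per superstep.

    vertices:  {vid: property dict}
    out_edges: {vid: [(destination vid, edge property dict), ...]}
    v_preds, e_preds: lists of functions, with len(v_preds) == len(e_preds) + 1
    """
    # First superstep (init): every vertex is active with one empty partial path.
    inbox = {vid: [()] for vid in vertices}
    results = []
    for step, v_pred in enumerate(v_preds):
        outbox = {}
        for vid, partials in inbox.items():
            if not v_pred(vertices[vid]):          # compute: vertex predicate
                continue                           # no match, prune this traversal
            if step == len(v_preds) - 1:           # last hop: emit complete paths
                results.extend(p + (vid,) for p in partials)
                continue
            for dst, e_props in out_edges.get(vid, []):   # scatter: edge predicate
                if e_preds[step](e_props):
                    outbox.setdefault(dst, []).extend(p + (vid,) for p in partials)
        inbox = outbox                             # barrier: messages delivered in bulk
    return results


# Hypothetical sample graph and a 3-hop query in the spirit of EQ1.
vertices = {
    "Cleo": {"type": "person", "country": "UK"},
    "Alice": {"type": "person", "country": "IN"},
    "Bob": {"type": "person", "tag": "Hiking"},
}
out_edges = {
    "Cleo": [("Alice", {"type": "follows"})],
    "Alice": [("Bob", {"type": "follows"})],
}
is_follows = lambda e: e["type"] == "follows"
matches = run_path_query(
    vertices,
    out_edges,
    [lambda v: v.get("country") == "UK",
     lambda v: v["type"] == "person",
     lambda v: v.get("tag") == "Hiking"],
    [is_follows, is_follows],
)
```

The loop body corresponds to one superstep, and replacing `inbox` with `outbox` stands in for the barrier between supersteps.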
4.2.2 Execution over Dynamic Temporal Graphs
For graphs with time varying properties, we leverage the interval-centric features of ICM. Specifically, we enable TimeWarp of message intervals with the vertex properties’ lifespans so that compute is called on an interval vertex with messages temporally aligned and grouped against the property intervals. Scatter is called only for edges whose lifespans overlap with the matching interval-vertex, and its scope is limited to the period of overlap. The compute or scatter functions only access messages and properties that are relevant to their current interval of relevance, and both can be called multiple times, for different intervals, on the same vertex and edge.
4.3 Distributed Execution Plans
Queries can be evaluated by splitting them into smaller path query segments that are independently evaluated and the results then combined. Each vertex predicate in the path query is a potential split point. E.g., a query V1-E1-V2-E2-V3 can be split at V2 (or 2) into the segments: V1-E1-V2 and V2-E2-V3. A trivial split at V1 (or 1) degenerates to the standard execution model, while an alternative evaluates this in reverse as V3-E2-V2-E1-V1, which is a split at V3 (or 3). For intermediate split points, execution proceeds inwards, from the outside predicates to the split point where the results are merged. Each split point and plan can be beneficial based on how many vertices and edges match the predicates on a graph. Intuitively, a good plan should evaluate the most discriminating predicate (low selectivity, few vertex/edge matches) first to quickly reduce the solution space.
We modify our logic to handle the execution of two path segments concurrently. For a split point 2, in the first superstep, we evaluate, say, predicates V1-E1 and V3-E2 in the same compute (init) and scatter logic, while in the second superstep we evaluate predicate V2. In the superstep when results from both the segments are available, we do a nested loop join to get the cross-product of the results. This can be extended to more than 1 split point which we leave as future work.
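The segment split and the join at an intermediate split point can be sketched as follows. This is an illustrative sketch under stated assumptions: partial paths hold only vertex IDs, each segment's partial paths end at the split vertex, and the function names are hypothetical. It covers intermediate split points only.

```python
def segments(labels, split):
    """Split a query at an intermediate vertex predicate.

    labels: alternating predicate labels, e.g. ["V1", "E1", "V2", "E2", "V3"]
    split:  1-based index of the vertex predicate to split at
    Returns both segments in evaluation order, each ending at the split vertex.
    """
    n = (len(labels) + 1) // 2            # number of vertex predicates
    if not 1 < split < n:
        raise ValueError("this sketch covers intermediate split points only")
    k = 2 * (split - 1)                   # position of the split vertex in labels
    left = labels[:k + 1]                 # evaluated left to right
    right = labels[k:][::-1]              # evaluated right to left, toward the split
    return left, right


def join_at_split(left_paths, right_paths):
    """Nested-loop join of partial paths that end at the same split vertex."""
    joined = []
    for lp in left_paths:
        for rp in right_paths:
            if lp[-1] == rp[-1]:
                # Reverse the right segment and drop its copy of the split vertex.
                joined.append(lp + rp[-2::-1])
    return joined
```

For a split at 2, the left segment contributes paths such as `("Cleo", "Alice")` and the right segment paths such as `("Bob", "Alice")`, which join into `("Cleo", "Alice", "Bob")`.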
4.4 System Optimizations
4.4.1 Type-based Graph Partitioning
We use knowledge of entity types to create graph partitions with only a single vertex type. This helps eliminate the evaluation of all vertices in a partition if its type does not match the vertex type specified in a hop in the query. This filtering is done before the compute function is called, at the partition compute of Giraph. We first group vertices by type to form a partition each. But these can have skewed sizes, and too few partitions reduce parallelism. So we further split each typed partition into smaller partitions using METIS [karypis1998fast], only considering the edges between vertices of the same type and weighted by their lifespan. These partitions are then distributed in a round-robin manner, by type, among all the workers.
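The partitioning steps above can be sketched as below. Note that this sketch swaps in fixed-size chunking in place of METIS, so it does not minimize edge cuts; it only illustrates the single-type invariant, the round-robin placement, and the partition-level type filter. All function names and the `max_size` parameter are hypothetical.

```python
from collections import defaultdict


def type_partitions(vertex_types, max_size):
    """Group vertices by type, then chunk each group into single-type partitions.

    Chunking is a stand-in for METIS here (an assumption of this sketch).
    """
    by_type = defaultdict(list)
    for vid, vtype in vertex_types.items():
        by_type[vtype].append(vid)
    parts = []
    for vtype in sorted(by_type):                       # keep partitions ordered by type
        vids = by_type[vtype]
        for i in range(0, len(vids), max_size):
            parts.append((vtype, vids[i:i + max_size]))
    return parts


def assign_round_robin(parts, n_workers):
    """Distribute the type-ordered partitions among workers in round-robin order."""
    workers = [[] for _ in range(n_workers)]
    for i, part in enumerate(parts):
        workers[i % n_workers].append(part)
    return workers


def active_partitions(parts, wanted_type):
    """Partition-level filter: skip partitions whose type cannot match this hop."""
    return [p for p in parts if wanted_type is None or p[0] == wanted_type]
```

Because each partition carries a single type, a hop that names a vertex type can discard whole partitions without visiting their vertices.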
4.4.2 Message Optimization
Path results have a lot of overlaps. But each partial result path is separately maintained and sent in messages during query execution. This redundancy leads to large message sizes and more memory. Instead, we construct a result tree, where vertices/edges that match at a previous hop are higher up in the tree and subsequent vertex/edge matches are its descendants. This reduces the result size from size to ; the latter quickly grows smaller for for a binary tree. When complete, a traversal of this result tree will give the expanded result paths.
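The result tree can be illustrated with a small prefix tree over path elements. This is a minimal sketch, assuming all result paths at a given hop have the same length; the nested-dictionary layout and the function names are hypothetical and not the engine's data structure.

```python
def build_result_tree(paths):
    """Merge result paths into a nested-dict prefix tree keyed by vertex/edge ID."""
    root = {}
    for path in paths:
        node = root
        for element in path:
            node = node.setdefault(element, {})   # shared prefixes are stored once
    return root


def expand(tree, prefix=()):
    """Traverse the tree from the root to every leaf to recover the full paths."""
    if not tree:
        return [prefix] if prefix else []
    out = []
    for element, child in tree.items():
        out.extend(expand(child, prefix + (element,)))
    return out


def count_nodes(tree):
    """Number of path elements actually stored in the tree."""
    return sum(1 + count_nodes(child) for child in tree.values())
```

For the three paths `A-B-C`, `A-B-D` and `A-E-F`, the tree stores 6 elements instead of the 9 held by the separate paths, and `expand` recovers all three.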
4.4.3 Memory Optimizations
In our graph data model, all property keys and values, excluding time intervals, are strings. In Java, string objects are memory-heavy. Since often many keys will repeat for different vertices in the same JVM, we map every property key to a byte, and rewrite the query at the master based on this mapping. Further, for property values that repeat, such as country, we use interning in Java that replaces individual string objects with shared string objects. This works as the graph is read-only, and besides reducing space, also allows predicate comparisons based on pointer equivalence.
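The two ideas above, compact key codes and shared value strings, can be illustrated in Python, whose `sys.intern` plays a role comparable to string interning in Java. This is a sketch under assumptions: the helper names and the clause tuple layout are hypothetical, and the codes are plain integers checked to fit in a byte.

```python
import sys


def build_key_codes(keys):
    """Assign each distinct property key a code small enough to fit in one byte."""
    distinct = sorted(set(keys))
    if len(distinct) > 256:
        raise ValueError("more than 256 distinct keys cannot be byte-encoded")
    return {key: code for code, key in enumerate(distinct)}


def encode_vertex(props, key_codes):
    """Replace keys by their codes and share one string object per repeated value."""
    return {key_codes[k]: sys.intern(v) for k, v in props.items()}


def rewrite_clause(clause, key_codes):
    """Rewrite a (key, comparator, value) query clause to use the key code."""
    key, comp, value = clause
    return (key_codes[key], comp, sys.intern(value))
```

After encoding, two vertices with the same country value hold the same string object, so an equality predicate can first try an identity comparison before falling back to a character-wise one.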
5 Query Planning and Optimization
A given path query can be executed using different distributed execution plans, each with a different execution time. The goal of the cost model is to quickly estimate the expected execution time of these plans and pick the optimal plan for execution. Rather than absolute accuracy of the estimated query execution time, what matters is its ability to discriminate between poor plans with high times and good plans with low times.
Ours is an analytical cost model that uses statistics about the temporal property graph, combined with estimates about the execution time spent in different stages of the distributed execution plan, to estimate the execution time for the different plans of a given query. We first enumerate the possible plans, contributed by each split point in the path query. The graph statistics are then used to estimate the number of vertices and edges that will be active at each superstep of query execution, and the number of vertices that will match the predicates in this superstep and flow to the next level. Based on the number of active and matched vertices and edges, our execution model will estimate the runtime for each superstep of the plan. Adding these up returns the estimated execution time for a plan. Next, we discuss the graph statistics that we maintain, the model to estimate the vertex and edge counts, and the execution time estimation.
5.1 Graph Statistics
We maintain statistics about the temporal property graph to help estimate the vertices and edges matching a specific query predicate. Typically, such statistics are maintained in relational databases as a frequency of matching tuples for different value ranges, for a given property. A unique challenge here is that the property values can be time variant. Hence, for each property key present across all vertex and edge types, we maintain a 2D histogram, where the X axis indicates different value ranges for the property and the Y axis indicates different time ranges. Each entry in the histogram stores information on the number of vertices or edges that fall within that value range for that time range.
Formally, for a given property key , we define a function , that returns an estimate of the frequency of vertices/edges which have the property value val during a time interval , and the in/out degrees of the matching vertices.
The granularity of the value and time ranges has an impact on the size of the statistics maintained and the accuracy of the estimated frequencies. We make several optimizations in this regard. We coarsen the ranges of the histogram along both axes to form a hierarchical tiling, which uses a dynamic programming (DP) strategy [muthukrishnan1999rectangular]. The tiling attempts to reduce frequency variance among the individual value–time pairs within each tile to fall below a threshold.
For important properties like vertex/edge type, out-degree and in-degree, we pre-coarsen the time steps into weeks and for other properties, the time steps are in months. This reduces the size of the histogram, and the steps are decided based on how often the properties change in the graph. For properties with 1000’s of enumerated values like Tag from Fig. 1, we first cluster the values by sorting them based on their frequency and grouping them such that each group has a certain frequency, and then perform tiling on these clusters. We maintain a map between property values and clusters for these attributes.
We use an interval tree data structure to maintain each histogram, with each tile inserted into this tree based on its time range. The leaves of this tree will have a set of tiles (property value ranges and their frequencies) that fall within its time interval. Calling the function performs a lookup in this interval tree, and matches within the set of property ranges at the leaf.
The time complexity to construct each interval tree is dominated by the tiling step that uses DP, and takes , where is the number of (clustered) property values for the property key, and the number of (coarsened) time units they span [muthukrishnan1999rectangular]. The lookup time is in the worst case where is the number of property values (clusters). The raw size of the statistics for the graphs used in our experiments ranges from – kB for about – property keys. The time required to get an optimal split point for a query ranges between – ms.
5.2 Estimating the Active/Matching Vertices/Edges
A query plan contains one or two path query segments. The query predicates on each vertex (and optionally, its edges) in the segment are evaluated in a single superstep. If two path segments are present, their results are joined at the end. For each segment, we estimate the number of active/matching vertices/edges in each superstep, and this is formalized as a recurrence relation as discussed next.
Let denote the sequence of vertex predicates and edge predicates for a given path query segment. Each predicate has a set of property clauses and temporal clauses , where is a property key, is a value to compare its value against, and is the interval to compare that vertex/edge/property’s lifespan against. These clauses themselves can be combined using AND and OR Boolean operators.
Let () denote the type of the vertex (edge) enforced by a clause of predicate (). Let and denote the set of vertices and edges of that type; if the vertex or edge type is not specified in the predicate, these sets degenerate to all vertices and edges in the graph.
As shown in Fig. 3, each superstep is decomposed into 2 stages: calling init or compute on the active vertices to find the vertices matching the vertex predicate, and calling scatter on the active edges (i.e., in or out edges of the matching vertices) to identify the edges matching the edge predicates. These in turn help identify the active vertices for the next superstep of execution. Initially, all vertices of the graph are active, but if a type is specified in the starting vertex predicate, we can use type-based partitioning to limit the active vertices to ones having that type.
Let and denote the number of active and matched vertices for vertex predicate with , and and denote the number of active and matched edges matching the edge predicate with , respectively. These can be recursively defined as:
In Eqn. 1, we set the active vertex count in the first superstep to be equal to the number of vertices of type . This reflects the localization of the search space in the init function to only vertices in the partition matching that vertex type. For subsequent supersteps, the active vertex search space is upper-bound by but is usually expected to be the number of matching edges in the previous superstep (in worst case), which would send a message to activate these vertices and call its compute function.
Next, in Eqn. 2, we use the graph statistics to find the % of vertices that match the vertex predicate (right hand term, also called selectivity) and multiply this with the number of active vertices to estimate the matched vertices. This is the expected matched output count from init or compute. For the selectivity, we iterate through all clauses of predicate , get the frequency, average in degree and average out degree of the vertex matches for each property clause, in conjunction with any temporal clause using , and aggregate () these frequencies. The aggregation between adjacent clauses can be either AND or OR, and based on this, we apply the following aggregation logic for the frequencies and degrees.
In Eqn. 5, doing an AND returns the smaller of the frequencies while doing an OR gives the larger of the two; the former can be an over-estimate while the latter an under-estimate. In Eqn. 6, we get the weighted average of the degrees of the vertices matching the predicates. Once we have the aggregated frequencies of the clauses, we divide it by the number of vertices of this vertex type to get the selectivity for the vertex predicate.
In Eqn. 3, we identify the number of edges for which scatter will be triggered by multiplying the matched vertices with the aggregated in and out degrees for the matching vertices, . Lastly, we estimate the number of edges matched by the edge predicate in Eqn. 4. Here, we get the edge selectivity (right hand term) using the frequency of edge matches returned by the graph statistics, and normalized by the number of preceding vertices of type times the average of the in and out degrees of vertices of this type. The edge selectivity is multiplied by the active edge count to get the matched edges that is expected from the scatter call. These edges will message their destination vertices, and this will form the active vertex count in superstep .
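The recurrence described in this section can be sketched in Python. This is one reading of the prose, not the paper's equations: it assumes the active vertex count after the first superstep is the smaller of the type count and the previous matched edge count, that the active edges use a single aggregated degree per hop (in plus out), that degrees are averaged with the clause frequencies as weights, and that the edge selectivity is supplied directly. All names and the dictionary keys are hypothetical.

```python
def aggregate(f1, d1, f2, d2, op):
    """Combine the (frequency, degree) estimates of two adjacent clauses."""
    # AND keeps the smaller frequency, OR the larger one.
    freq = min(f1, f2) if op == "AND" else max(f1, f2)
    total = f1 + f2
    # Degree is averaged, weighted by the clause frequencies (assumed weighting).
    degree = (f1 * d1 + f2 * d2) / total if total else 0.0
    return freq, degree


def estimate_counts(hops):
    """Estimate (active vertices, matched vertices, active edges, matched edges) per superstep.

    Each hop is a dict with 'type_count' (vertices of the hop's type), 'freq'
    (aggregated frequency of vertices matching the predicate), 'degree'
    (aggregated in plus out degree of the matches) and, except for the last
    hop, 'edge_sel' (edge predicate selectivity).
    """
    rows, prev_matched_edges = [], None
    for hop in hops:
        n = hop["type_count"]
        active_v = n if prev_matched_edges is None else min(n, prev_matched_edges)
        matched_v = active_v * hop["freq"] / n            # active count times selectivity
        active_e = matched_v * hop["degree"]              # edges on which scatter runs
        matched_e = active_e * hop.get("edge_sel", 0.0)   # edges that send a message
        rows.append((active_v, matched_v, active_e, matched_e))
        prev_matched_edges = matched_e
    return rows
```

With 1000 vertices of the starting type, 100 of them matching, an aggregated degree of 10 and an edge selectivity of 0.5, the first superstep is estimated to produce 500 matched edges, which then bounds the active vertices of the second superstep.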
5.3 Execution Time Estimate
Given the estimates of the active/matched vertices/edges in each superstep, we incorporate them into execution time models for the different stages within a superstep to predict the overall execution time. We use micro-benchmarks to develop a linear regression equation for these execution time models,and as used below. These models are unique to a cluster deployment of , and can be reused across graphs and queries.
As shown in Fig. 3, the init function is called on the active vertices in the first superstep, and generates outputs that affect the internal states of the interval vertex, and its execution time is given by . For subsequent supersteps , compute function is called similarly on the active vertices to generate matched vertices . This has a slightly different execution model since it has to process an estimated input messages from the previous superstep and does not have to do data structure initializations like init. It takes time . In a superstep , scatter is called on the active edges and generates matched edges, for an estimated time of .
Besides these, there are two per-superstep overheads of the platform: for selecting the vertices matching a given type, , and a per active vertex overhead of Graphite, .
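One way these stage models and overheads might be combined is sketched below. The coefficient values, the `StageModels` and `Superstep` names, and the choice to treat type selection as a fixed cost per superstep are all assumptions for illustration; the actual regression coefficients come from micro-benchmarks on a specific cluster.

```python
from dataclasses import dataclass
from typing import List, Sequence


def linear(coeffs: Sequence[float], features: Sequence[float]) -> float:
    """Evaluate a linear regression model; coeffs[0] is the intercept."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], features))


@dataclass
class Superstep:
    active_vertices: float
    matched_vertices: float
    active_edges: float
    matched_edges: float
    in_messages: float = 0.0


@dataclass
class StageModels:
    init: Sequence[float]       # features: active, matched vertices
    compute: Sequence[float]    # features: active, matched vertices, messages
    scatter: Sequence[float]    # features: active, matched edges
    type_select: float          # assumed fixed cost per superstep
    per_vertex: float           # overhead per active vertex


def estimate_time(steps: List[Superstep], m: StageModels) -> float:
    """Sum the per-stage estimates over all supersteps of a path segment."""
    total = 0.0
    for i, s in enumerate(steps):
        total += m.type_select + m.per_vertex * s.active_vertices
        if i == 0:
            # init runs only in the first superstep
            total += linear(m.init, (s.active_vertices, s.matched_vertices))
        else:
            total += linear(m.compute, (s.active_vertices,
                                        s.matched_vertices, s.in_messages))
        if s.active_edges > 0:
            total += linear(m.scatter, (s.active_edges, s.matched_edges))
    return total


# Hypothetical coefficients (in milliseconds) and a 2-superstep segment.
models = StageModels(init=[1.0, 0.01, 0.02],
                     compute=[0.5, 0.01, 0.02, 0.001],
                     scatter=[0.2, 0.005, 0.01],
                     type_select=2.0, per_vertex=0.001)
segment = [Superstep(1000, 100, 400, 50),
           Superstep(50, 10, 0, 0, in_messages=50)]
estimated_ms = estimate_time(segment, models)
```

Because the models depend only on counts of active and matched entities, the same fitted coefficients can be reused with the count estimates of any query on any graph, which matches the reuse property stated above.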
Given these, the total estimated execution time of the cost model for a path segment with hops is
|Query||LDBC ID||Hops||Prop. Preds.||Time Preds.||ER Pred.?|
|Query||Description of path to find (Parameterized property values are underlined)|
|Q1||Two messages with different tags belong to the same forum, with a time ordering between the messages|
|Q2||A person with a given tag creates a message with the same tag after a given date.|
|Q3||A person from a given country has commented or liked a post before a person from another given country.|
|Q4||Mutual friendships between three persons, but with a time-respecting order in which they befriend each other.|
|Q5||A person posts a message with a given tag to a forum and, after a time offset, they post another message to the same forum with a different tag.|
|Q6||A person with a specific gender replies to a post after another person replies to it.|
|Q7||A person posts a message from outside their home country, then befriends another person, and that person then posts another message from outside their home country.|
|Q8||Two persons working in different companies have a common friend at a timepoint.|
The Linked Data Benchmark Council (LDBC) offers the social network benchmark [LdbcTechSpecification], a community-standard workload with realistic transactional path queries over a social network property graph. There are two parts to this benchmark, a social network graph generator and a suite of benchmark queries.
The graph generator S3G2 [s3g2] models a social network as a large correlated directed property graph with diverse distributions. Vertices and edges have a schema type and a set of properties for each type. Vertex types include person, message, comment, university, country, etc. The graph is generated for a given number of persons in the network and a given edge distribution of person–person friendship: Altmann (A), Discrete Weibull (DW), Facebook (F) or Zipf (Z).
We make two changes to the LDBC property graph generator. One, we denormalize the schema to embed some vertex types such as country, company, university and tag as properties inside person, forum, post and comment vertices. This simplifies the data model. Two, while LDBC vertices have a creation time that can span a 3-year period, we include an end time of to form an interval. We assign lifespans to the edges incident on vertices based on their referential integrity constraints and properties like join date, post date, etc. The vertex and edge lifespans are also inherited by their properties. Fig. 4 shows the modified graph schema.
However, this is still only a static temporal property graph. To address this, we introduce temporal variability into the properties, worksAt, country and hasInterest of the Person vertex. For worksAt, we generate a new property every year using the LDBC distribution; the country is correlated with worksAt, and hence updated as well. We update the hasInterest property based on the list of tags for a forum that a person joins, at different time points.
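The yearly re-generation of a property can be pictured as a list of interval-stamped versions. The sketch below is illustrative only: `yearly_versions` and `value_at` are hypothetical helpers, time is counted in days with a finite end for simplicity, and `pick_value` stands in for sampling from the LDBC distribution.

```python
from typing import Callable, List, Optional, Tuple

Version = Tuple[int, int, str]  # (start, end, value), with end exclusive


def yearly_versions(lifespan_start: int, lifespan_end: int,
                    pick_value: Callable[[], str],
                    year: int = 365) -> List[Version]:
    """Split a vertex lifespan into yearly intervals, each with a new value."""
    versions: List[Version] = []
    t = lifespan_start
    while t < lifespan_end:
        nxt = min(t + year, lifespan_end)
        versions.append((t, nxt, pick_value()))
        t = nxt
    return versions


def value_at(versions: List[Version], t: int) -> Optional[str]:
    """Return the property value whose interval contains time t, if any."""
    for start, end, value in versions:
        if start <= t < end:
            return value
    return None


# Hypothetical employers drawn in order instead of from a distribution.
employers = iter(["CompanyA", "CompanyB", "CompanyC"])
works_at = yearly_versions(0, 800, lambda: next(employers))
```

A static temporal property is the special case with a single version that spans the whole lifespan of its vertex.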
Table 1 shows the vertex and edge counts, the number of vertices of each type and the total number of properties, for graphs we generate with 10k or 100k persons, different distributions (DW, Z, A, F), and with static (S, top 4 rows) and dynamic (D, bottom 2 rows) properties.
We select a subset of query templates provided in the LDBC query workload [LdbcTechSpecification] that conform to a linear path query, and adapt them for our temporal graphs. These are either from the business intelligence (BI) or the interactive workload (IW). We also include two additional query templates to exercise our query model. Tables 2 and 3 summarize the query templates. Each template has some parameterized property or time value. We generate 100 query instances for each template by randomly selecting a value for the parameters, evaluating the query on the temporal graph, and ensuring that there is at least 1 valid result set in most cases. That these templates can be adapted reflects the expressivity of our query model, and the ability to intuitively extend it to the time domain.
In our experiments, each query is given an execution budget of 600 secs, after which it is terminated and marked as failed.
6.2 Experiment Setup
Our commodity cluster has nodes with one Intel Xeon E5-2620 v4 CPU with 8 cores (16 HT) @ 2.10GHz, 64 GB RAM and 1 Gbps Ethernet, running CentOS v7. For some shared-memory experiments for other baseline graph platforms, we also use a “big memory” machine with 2 similar CPUs and 512 GB RAM. is implemented over our in-house Graphite v1.0, Apache Giraph v1.3.0, Hadoop v3.1.1 and Java v1.8. By default, our distributed experiments use 8 nodes in this cluster, run one worker JVM per machine with 8 threads per worker and 50 GB RAM available to the JVM. The graphs are initially loaded into from JSON files stored in HDFS, along with their cost model statistics.
6.3 Baseline Graph Platforms
We use the widely-used Neo4J Community Edition v3.2.3 as a baseline. This is a single-machine, single-threaded graph database. We evaluate three variants of it. One specifies the workload queries using the Gremlin query language (N4J-Gr, in our plots), a community standard, and another uses Neo4J’s native Cypher language (N4J-Cy). Both these variants run on a single node with 50 GB heap size. A third variant also uses Cypher, but runs on the big memory machine (N4J-Cy-M) with RAM that matches the total memory available to our distributed setup. As graph platforms are memory bound, this gives it the same memory as the distributed platforms. We build indexes on all properties.
There are few open source distributed graph engines available. JanusGraph, a fork from Titan, is popular, and uses Apache Spark v2.4.0 as a distributed backend engine to run Gremlin queries (Spark, in our plots). It uses Apache Cassandra v2.2.10 to store and access the input graph. Spark runs on 8 compute nodes with 1 worker each and 50 GB heap memory per worker. Cassandra is deployed on 8 other nodes. For all baselines, we follow the standard performance tuning guidelines provided in their documentation.
Since these platforms do not easily support temporal queries over dynamic temporal graphs, we transform each dynamic graph into a static temporal graph [wu2016reachability] and adapt the queries to operate over it, although the transformed graph is much larger.
6.4 Effectiveness of Cost Model
We first evaluate the effectiveness of our cost model in identifying the optimal split point for the distributed query execution. For each query type, we execute its queries using all their query plans. From the execution times of all plans for a query, we pick the plan with the smallest time as its optimal plan. We then compare this against the plan selected by our cost model, and report the % of excess execution time that the plan selected by our cost model takes above the optimal plan. This is the effective time penalty when we select a sub-optimal plan.
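The penalty metric described above is simple enough to state as a short sketch. The function name and the dictionary layout (split point mapped to measured execution time) are hypothetical choices for illustration.

```python
from typing import Dict


def excess_over_optimal(plan_times: Dict[int, float], selected_plan: int) -> float:
    """Percent of extra execution time the selected plan takes over the
    fastest plan measured for the same query."""
    optimal = min(plan_times.values())
    return 100.0 * (plan_times[selected_plan] - optimal) / optimal


# Hypothetical measured times (in seconds) for three split points of one query.
times = {1: 2.0, 2: 1.0, 3: 1.5}
penalty = excess_over_optimal(times, selected_plan=3)
```

A penalty of 0% means the cost model picked the optimal plan for that query.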
Fig. (a)a shows a violin plot of the distribution of the % excess time over optimal for the different fixed split points executed for the 100 queries of type Q4 on graph 100k:A-S. We also report the distribution for the plan selected by our cost model. This illustrates that the execution time varies widely across the plans, with some taking longer than optimal. We also observe that some split points like 2 and 3 are in general better than the others, but among them, neither is consistently better. This is seen by the lower median of 2.9% excess time taken by the cost model, compared to 12.2% and 6.9% by these other split points. Also, it is not possible to find a priori a single fixed split point that is generally better than the rest, without running the queries using all split points. These observations motivate the need for an automated analytical cost model for query plan selection.
Table 4 shows, for different query types (columns) and for different percentiles of their queries (rows), the % excess execution time over the optimal that is spent by the plan chosen by the cost model. This is reported only for the 100k:A-S (top) and 100k:Z-D (bottom) graphs for brevity.
For 100k:A-S, the selected plans are within 2% of optimal execution time for the percentile query within a query type, and within 13% for percentile query. It is only at the percentile that we see higher penalties of – for 3 of the 7 types. Even for the dynamic graph 100k:Z-D, at the percentile query, 6 of the 8 query types have negligible time penalties, and two have higher penalties of –. This means we predominantly pick a plan that is optimal, or has an execution time close to the optimal plan.
This is further evident in Fig. (b)b which reports that across all queries and graphs evaluated, our cost model picks the best (optimal) or the second best plan over of the time. So while our cost model is not perfect, its accuracy is high enough to discriminate between the better and the worse plans.
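The "best or second best" measure can be computed by ranking the plans of each query by their measured time. This is a minimal sketch with hypothetical names and made-up timings, not the evaluation scripts used for the paper.

```python
from typing import Dict, List, Tuple


def plan_rank(plan_times: Dict[int, float], selected_plan: int) -> int:
    """1-based rank of the selected plan when plans are sorted by time."""
    ordered = sorted(plan_times, key=plan_times.get)
    return ordered.index(selected_plan) + 1


def top_k_fraction(runs: List[Tuple[Dict[int, float], int]], k: int = 2) -> float:
    """Fraction of queries whose selected plan is among the k fastest."""
    hits = sum(1 for times, selected in runs if plan_rank(times, selected) <= k)
    return hits / len(runs)


# Hypothetical timings for four queries, each with the plan the model chose.
timings = {1: 2.0, 2: 1.0, 3: 1.5}
runs = [(timings, 2), (timings, 3), (timings, 1), (timings, 2)]
fraction_top2 = top_k_fraction(runs, k=2)
```

Ranking complements the % excess metric: a plan can rank second yet be far slower than the best, or rank low while all plans are within a few percent of each other.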
6.5 Comparison with Baselines
Figs. (a)a–(d)d show the average execution time (log scale) for the query workload on and the baseline platforms for the static temporal graphs, and Figs. (a)a and (b)b for the dynamic temporal graphs. Only queries that complete in the 600 sec time budget are plotted. As Table 5 shows, Janus-Spark did not run (DNR) for several larger graphs due to resource limits when loading the graph in-memory from Cassandra. – of queries did not finish (DNF) on Neo4J for 100k:F-S, the largest graph. completes all queries on all graphs, often within 1 sec. For the largest graph 100k:F-S, we only run queries per type for all platforms due to time limits, and uses 16 nodes to fit the graph in distributed memory.
The bar plots show that is much faster than the baselines, across all graphs and all query types, except one. On average, we are faster than N4J-Cy-M, faster than N4J-Cy, faster than N4J-Gr and faster than Spark. Other than the largest graph, completes on an average within for all static graphs and most query types, and on an average within for 100k:F-S and the dynamic graphs.
For 100k:F-S, Q3 takes secs due to the huge number of results, M on average. But this query type does not even complete for this graph for N4J-Cy, N4J-Cy-M and Spark. ’s tree-based result structure is more compact, reducing memory and communication costs. Q4 for this graph is also – better than the baselines in . Here, there is a rapid fan-out of matching vertices followed by a fan-in as they fail to match downstream predicates, leading to high costs, though the result sizes are large, k on average. is also consistently better for the dynamic graphs. The only time that our average query time is slower than a baseline is for Q5 on 10k:DW-D.
Neo4J using Cypher, on the compute node and on the big memory node, is the next best. The large memory variant gives similar performance as the regular memory one for the smaller graphs, but out-performs it for larger graphs like 100k:A and 100k:F. For the latter graph, N4J-Cy could not finish several query types. Though Neo4J uses indexes to help filter the vertices for the first hop, query processing for later hops involves a breadth-first search traversal and pruning of paths based on the predicates. There are also complex joins between consecutive edges along the path to apply the temporal edge relation. These affect their execution times. The execution times for the Gremlin and Cypher variants of Neo4J are comparable, with no strong performance skew either way. Interestingly, the Gremlin variant of Neo4J is able to run most query workloads for all graphs, albeit with slower performance.
The JanusGraph-Spark distributed baseline takes the highest amount of time for all these queries. Spark dynamically fetches the graph from Cassandra at query execution time, which adds a static overhead to each query. In contrast, our engine persists the graph in-memory across queries. Despite using distributed machines, Spark is unable to load large graphs in memory and often fails to complete execution within the time budget. A similar challenge was seen even for alternative engines used by JanusGraph, like Hadoop, and Spark was the best of the lot.
In the bar plots, we also show a black bar for the single-machine baselines. This marks the theoretical time that would be taken by these platforms with perfect scaling on 8 machines, though such scaling is not supported by them. As we see, our engine is often able to complete its execution within that mark, showing that our distributed engine has scaling performance comparable to or better than highly optimized single-machine platforms.
In this paper, we have motivated the need for, and the gap in, querying over temporal property graphs. We have proposed an intuitive temporal path query model to express a wide variety of requirements over such graphs, and designed the distributed engine to implement these at scale over the Graphite ICM platform. Our novel analytical cost model uses concise information about the graph to give highly accurate selection among alternative distributed query execution plans. These are validated through rigorous experiments over 5 graphs and 800 queries derived from the LDBC benchmark, in which our engine out-performs the baseline graph databases and distributed platforms in all but one case.
As future work, we plan to explore out of core execution models to scale beyond distributed memory, indexing techniques to accelerate performance, and more generalized temporal tree and reachability query models.
We thank Ravishankar Joshi from BITS-Pilani, Goa for his assistance with the experiments, and Swapnil Gandhi for his assistance with using and extending the Graphite platform. We thank the members of the DREAM:Lab for their help with reviewing and offering feedback on the paper.
Shriram Ramesh was supported by the Maersk CDS M.Tech. Fellowship. Yogesh Simmhan was supported by the SwarnaJayanti Fellowship.