Reducing Property Graph Queries to Relational Algebra for Incremental View Maintenance

06/19/2018 ∙ by Gábor Szárnyas, et al. ∙ Budapest University of Technology and Economics

The property graph data model of modern graph database systems is increasingly adopted for storing and processing heterogeneous datasets like networks. Many challenging applications with near real-time requirements (e.g. financial fraud detection, recommendation systems, and on-the-fly validation) can be captured with graph queries, which are evaluated repeatedly. To ensure quick response times on a changing data set, these applications would benefit from incremental view maintenance (IVM) techniques, which evaluate queries continuously and calculate the changes in the result set upon updates. However, no graph database currently provides support for incremental views. While IVM problems have been studied extensively over relational databases, views on property graph queries require operators outside the scope of standard relational algebra. Hence, tackling this problem requires the integration of numerous existing IVM techniques and possibly further extensions. In this paper, we present an approach to perform IVM on property graphs, using a nested relational algebraic representation for property graphs and graph operations. We then define a chain of transformations to reduce most property graph queries to flat relational algebra and use techniques from discrimination networks (used in rule-based expert systems) to evaluate them. We demonstrate the approach using our prototype tool, ingraph, which uses openCypher, an open graph query language specified as part of an industry initiative. However, several aspects of our approach can be generalised to other graph query languages such as G-CORE and PGQL.


1. Introduction

Concepts

Graph processing problems are common in modern database systems, where the property graph (PG) data model (Hölsch and Grossniklaus, 2016; Marton et al., 2017; Angles et al., 2018; Francis et al., 2018; Angles, 2018) is gaining widespread adoption. Property graphs extend labelled graphs with properties for both vertices and edges. Compared to previous graph modelling approaches, such as the RDF data model (which treats properties as triples), PGs allow users to store their graphs in a more compact and comprehensible representation.

openCypher

Due to the novelty of the PG data model, no standard query language has emerged yet. The openCypher initiative aims to standardise the Cypher language (Francis et al., 2018) of the Neo4j graph database. The openCypher language uses a SQL-like syntax and combines graph pattern matching with relational operators (aggregations, joins, etc.). In this paper, we target queries specified in the openCypher language.

Motivation

In graph database applications, numerous use cases rely on complex queries and require low response times for repeated executions, including financial fraud detection and recommendation engines. In addition, graph databases are increasingly used in a software engineering context as a semantic knowledge base for model validation (Bergmann et al., 2010; Szárnyas et al., 2017a; Daniel et al., 2017), source code analysis (Hawes et al., 2015), etc. While these scenarios could greatly benefit from incremental query evaluation, currently no system provides incremental views with sufficient feature coverage for a practical PG query language such as openCypher. To the best of our knowledge, the only existing incremental property graph query engine is Graphflow (Kankanamge et al., 2017), which extends Cypher with triggers, but lacks support for rich language features like negative/optional edges and transitive closures.

Incremental graph queries were successfully used in the domain of model-driven engineering. For example, the incremental query engine of Viatra ensures quick model validation and transformation over in-memory graph models (Ujhelyi et al., 2015).

Problem statement

In relational database systems, incremental view maintenance (IVM) techniques have been used for decades for repeated evaluation of a predefined query set on continuously changing data (Forgy, 1982; Blakeley et al., 1986; Miranker and Lofaso, 1991; Gupta et al., 1993; Gupta and Mumick, 1995; Hanson, 1996; Gupta and Mumick, 1999; Hanson et al., 2002; Szárnyas et al., 2014; Koch et al., 2014; Ujhelyi et al., 2015). However, these techniques typically build on assumptions that do not hold for property graph queries. In particular, PG queries present numerous challenges:

  1. Schema-optional data model. Existing IVM techniques assume that the database schema is known a priori. While this is a realistic assumption for relational databases, the data model of most property graph systems is schema-free or schema-optional at best (Francis et al., 2018). Hence, to use IVM, users are required to manually define the schema of the graph, which is a tedious and error-prone process.

  2. Nested data structures. Most IVM techniques assume a relational data model with 1NF relations. However, the property graph data model defines rich structures, including the properties on graph elements and paths. Collection types, such as sets, bags, lists, and maps, are also allowed (Angles et al., 2018; Francis et al., 2018). These can be represented as NF² (non-first normal form) data structures, but mapping them to 1NF relations is a complex challenge.

  3. Mix of instance- and meta-level data. Queries can not only access data fields from the instance graph (e.g. ids, properties), but also metadata such as vertex labels and edge types (Francis et al., 2018; van Rest et al., 2016).

  4. Handling null values and outer joins. Property graph queries allow null values and optional pattern matches, similarly to outer joins in relational databases. Most relational IVM works do not consider this challenge, except (Griffin and Kumar, 1998; Larson and Zhou, 2007).

  5. Complex aggregations. PG queries allow complex aggregations, e.g. aggregations on aggregations (Mumick et al., 1997) and using non-distributive aggregation functions (e.g. min, max, stdev) which are difficult to calculate incrementally (Palpanas et al., 2002).

  6. Reachability queries. Unbounded reachability queries on graphs with few connected components need to calculate large transitive closures, which makes them inherently expensive (Bergmann et al., 2012). Hence, the impact of IVM on reachability is more limited compared to non-recursive queries, and using space-time tradeoff techniques is more expensive: to improve execution time, one has to trade memory at an exponential rate.

  7. Mix of queries and transformations. Some property graph query languages (e.g. openCypher) allow combining update operations with queries. Most traditional IVM techniques do not consider this challenge, and omit related issues such as conflict set resolution. Discrimination networks from rule-based expert systems are better suited to handle this issue (Forgy, 1982; Miranker and Lofaso, 1991; Hanson et al., 2002).

  8. List handling. Property graph data sets and queries make use of lists both to store collections of primitive values and to represent paths in the graph. Order-preserving techniques have only been studied in the context of IVM on XQuery expressions (Dimitrova et al., 2003), i.e. for trees but not for graphs.

  9. Skewed data distribution. Subgraph matching is often implemented as a series of binary joins. Recent work revealed that binary (two-way) joins are inefficient on data sets with skewed distributions of certain edge types (displayed by graph instances in many fields, e.g. in social networks). Hence, a large body of new research proposes n-ary (multiway) joins to achieve theoretically optimal complexity (Ngo et al., 2018; Ammar et al., 2018).

  10. Higher-order queries. PG queries often employ higher-order expressions (Buneman et al., 1995), e.g. processing the vertices/edges on a path (also known as path unwinding (Angles et al., 2017)). Incrementalization of higher-order languages is a new field of research (Cai et al., 2014), and to the best of our knowledge, there are currently no implementations using these techniques for query evaluation.

In this paper, we address challenges 1–3 in detail, present a first solution to handle challenges 4–6 with acceptable performance, and propose techniques from the literature to tackle challenges 7–9. Finding applicable techniques to handle challenge 10 is left for future work.

Contributions

In this paper, we discuss the challenges of IVM on PG queries and present an approach to tackle some of these challenges. In particular:

  • We introduce extensions for relational algebra in order to handle graph-specific operators and use them to capture the semantics of (a subset of) the openCypher language.

  • We define a mapping for PG data using nested relations, and use nested relational algebra (NRA) to represent the queries. The data model can represent both the property graph and the resulting tables, while the NRA operators have sufficient expressive power to capture operations on the PG. This allows the algebra to be composable and closed even for operations such as transitive reachability.

  • We define a chain of transformations to translate the nested algebraic query plans to (incrementally maintainable) flat relational algebra (FRA) expressions.

  • We present a schema inference algorithm that eliminates the need to define the graph schema in advance.

  • We present ingraph, a research prototype capable of evaluating openCypher graph queries incrementally. (ingraph is available as an open-source tool at http://github.com/ftsrg/ingraph.)

  • We overview applicable IVM approaches from the literature in rule-based expert systems, integrity constraint checking, and materialized views.

Paper structure

We first present some theoretical background for property graphs (Section 2) and define the operators of graph relational algebra (Section 3). We then discuss the compilation and query evaluation process (Section 4) and view maintenance (Section 5). Finally, we overview related techniques (Section 6) and outline future directions (Section 7).

2. The property graph data model

Figure 1. Social network example represented graphically, formally, and as nested relations.

(a) Example graph defined formally.
(b) Example graph visualised.
(c) Nested relation of edges, with attributes id, src, trg, type, and properties (key, value):
    id 1: type KNOWS, properties {since: 2014}
    id 2: type INTEREST, properties {level: 4}
    id 3: type CLASS
    id 4: type SUBCLASS_OF
    id 5: type SUBCLASS_OF
(d) Nested relation of vertices, with attributes id, labels (label), and properties (key, value):
    labels {Student, Person}, properties {name: Alice, age: 24, speaks: [en]}
    labels {Person}, properties {name: Bob, age: 53, speaks: [en, de]}
    labels {Tag}, properties {topic: Neofolk}
    labels {Class}, properties {subject: Folk}
    labels {Class}, properties {subject: Music}
    labels {Class}, properties {subject: Art}
(e) Get-vertices result, with attributes id, labels (label), and properties (key, value):
    labels {Student, Person}, properties {name: Alice, age: 24, speaks: [en]}
(f) A result relation produced by an application of the get-edges operator, with one nested attribute each for the source vertex, the edge, and the target vertex:
    source vertex: labels {Student, Person}, properties {name: Alice, age: 24, speaks: [en]}
    edge: id 2, type INTEREST, properties {level: 4}
    target vertex: labels {Tag}, properties {topic: Neofolk}

2.1. Data model

The concept of the property graph has only been studied by a few academic works, but it already has multiple flavours and definitions (Hölsch and Grossniklaus, 2016; Marton et al., 2017; Angles et al., 2018; Francis et al., 2018; Angles, 2018). In this paper, we define it as follows.

Structure

A PG is $G = (V, E, \mathit{st}, L, T, \mathcal{L}, \mathcal{T}, P_v, P_e)$, where $V$ is a set of vertex identifiers, $E$ is a set of edge identifiers, and function $\mathit{st}: E \rightarrow V \times V$ assigns the source and target vertices to edges. Vertices are labelled and edges are typed: $L$ is a set of vertex labels, function $\mathcal{L}: V \rightarrow 2^{L}$ assigns a set of labels to each vertex; $T$ is a set of edge types, function $\mathcal{T}: E \rightarrow T$ assigns a single type to each edge.

Properties

Let $S$ be a set of scalar literals, and $\mathrm{FBAG}(S)$ denote the set of all finite bags of elements from $S$. Let $D = S \cup \mathrm{FBAG}(S)$ be the value domain for the PG. (Footnote: The data model can be generalised further, e.g. by allowing arbitrary nesting of collections. However, this data model already has higher expressive power than most graph data models (e.g. semantic graphs) and satisfies the needs of most practical use cases. It is also powerful enough to represent the complex schema of the LDBC Social Network Benchmark (Erling et al., 2015).)

  • $P_v$ is the set of vertex properties. A vertex property $p_i \in P_v$ is a partial function $p_i: V \rightarrow D$, which assigns a property value $d_j \in D$ to a vertex $v \in V$ if $v$ has property $p_i$; otherwise $p_i(v)$ returns NULL.

  • $P_e$ is the set of edge properties. An edge property $p_i \in P_e$ is a partial function $p_i: E \rightarrow D$, which assigns a property value $d_j \in D$ to an edge $e \in E$ if $e$ has property $p_i$; otherwise $p_i(e)$ returns NULL.

Example graph

An example graph inspired by the LDBC Social Network Benchmark (Szárnyas et al., 2018) is shown formally in Figure 1(a) and graphically in Figure 1(b). The graph contains a Tag, two Persons, and three Classes. Note that edges in the PG data model are always directed, hence the KNOWS relation is represented with a directed edge and the symmetric nature of the relation can be modelled in the queries.

2.2. Nested relations

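openCypher queries take a property graph as their input and return a graph relation (Hölsch and Grossniklaus, 2016; Marton et al., 2017) as their output. To represent graphs and query results using the same algebraic constructs, we use nested relations (Colby, 1990), which allow data items of a relation to contain additional relations with an arbitrary level of nesting. The domain for the internal relations is the value domain of the PG. Relations on all levels of nesting follow bag semantics, i.e. duplicate tuples are allowed. We define the schema of a relation as a list of (nested) attributes and denote it with $\mathrm{sch}(r)$ for relation $r$.

To represent the vertices and edges of the property graph, we define two nested relations, $\mathbb{V}$ and $\mathbb{E}$. Both relations have a single attribute containing nested relations. Their schema is given below and its mapping to the PG concepts is in Table 1.

$\mathrm{sch}(\mathbb{V}) = \langle \mathrm{vertex} \langle \mathrm{id}, \mathrm{labels} \langle \mathrm{label} \rangle, \mathrm{properties} \langle \mathrm{key}, \mathrm{value} \rangle \rangle \rangle$
$\mathrm{sch}(\mathbb{E}) = \langle \mathrm{edge} \langle \mathrm{id}, \mathrm{src}, \mathrm{trg}, \mathrm{type}, \mathrm{properties} \langle \mathrm{key}, \mathrm{value} \rangle \rangle \rangle$

For $\mathbb{V}$, the id attribute corresponds to the vertex identifiers. For a particular vertex, labels is the result of the vertex labelling function, whereas properties is the result of the vertex property functions. Similarly, for $\mathbb{E}$, id corresponds to the edge identifiers. For a particular edge, the pair (src, trg) corresponds to the result of the function assigning source and target vertices, type is the result of the edge typing function, and properties is the result of the edge property functions.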

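  vertex schema    PG concept           edge schema    PG concept
  id               vertex identifier    id             edge identifier
  labels(label)    vertex labels        type           edge type
  properties       vertex properties    properties     edge properties
                                        src, trg       source and target vertices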

Table 1. Mapping between the PG data model and its representation as nested relations.

The nested relations representing the example graph are shown in Figure 1(c) and 1(d). These show that the set of vertex labels is stored as a nested relation with a single attribute, label, while edge types are simply stored as a single string value. The properties of vertices/edges are stored as a nested relation with two attributes, key and value. This representation is well-suited to the flexible schema of PG databases, as new labels, types, and property keys can be added without any changes to the schemas of the relations.
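The nested representation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vertex ids and the edge endpoints are assumed (they are not given in the text), nested relations are modelled as lists of dictionaries, and the helper names `props`, `labels`, and `schema` as well as the added label `Teacher` and key `city` are hypothetical.

```python
# Hypothetical encoding of the example graph as two nested relations.
# Vertex ids and edge endpoints are assumptions made for illustration.
def props(**kv):
    """Build a nested properties relation with attributes key and value."""
    return [{"key": k, "value": v} for k, v in kv.items()]

def labels(*names):
    """Build a nested labels relation with the single attribute label."""
    return [{"label": n} for n in names]

vertex_relation = [
    {"id": 1, "labels": labels("Student", "Person"),
     "properties": props(name="Alice", age=24, speaks=["en"])},
    {"id": 2, "labels": labels("Person"),
     "properties": props(name="Bob", age=53, speaks=["en", "de"])},
    {"id": 3, "labels": labels("Tag"), "properties": props(topic="Neofolk")},
    {"id": 4, "labels": labels("Class"), "properties": props(subject="Folk")},
    {"id": 5, "labels": labels("Class"), "properties": props(subject="Music")},
    {"id": 6, "labels": labels("Class"), "properties": props(subject="Art")},
]
edge_relation = [
    {"id": 1, "src": 1, "trg": 2, "type": "KNOWS", "properties": props(since=2014)},
    {"id": 2, "src": 1, "trg": 3, "type": "INTEREST", "properties": props(level=4)},
    {"id": 3, "src": 3, "trg": 4, "type": "CLASS", "properties": props()},
    {"id": 4, "src": 4, "trg": 5, "type": "SUBCLASS_OF", "properties": props()},
    {"id": 5, "src": 5, "trg": 6, "type": "SUBCLASS_OF", "properties": props()},
]

def schema(relation):
    """Top-level attribute names used by the tuples of a relation."""
    return sorted({attr for t in relation for attr in t})

before = schema(vertex_relation)
# A new label and a new property key only add tuples to nested relations.
vertex_relation[1]["labels"].append({"label": "Teacher"})
vertex_relation[1]["properties"].append({"key": "city", "value": "Budapest"})
after = schema(vertex_relation)
```

Comparing `before` and `after` shows the point made above: the top-level schema stays the same when labels or property keys are added.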

3. Graph relational algebra

Papers (Hölsch and Grossniklaus, 2016) and (Marton et al., 2017) presented relational algebraic formalisations of the openCypher language. A more rigorous formalisation was given in (Francis et al., 2018). In this paper, we follow the approach of our previous work (Marton et al., 2017) as it is better suited to established IVM techniques. This approach uses graph relational algebra (GRA), which extends standard relational algebra operators with graph-specific navigation operators.

In this section, we formally define the operators of GRA and show example queries specified in natural language and as an openCypher query, along with the equivalent GRA expression and the resulting output relation.

3.1. Basic operators and nested property access

We first present the basic unary operators of relational algebra, found in most relational algebra textbooks like (Garcia-Molina et al., 2009). The selection operator $\sigma$ filters the incoming relation according to some criteria. Formally, $t = \sigma_{\theta}(r)$, where predicate $\theta$ is a propositional formula. Relation $t$ contains all tuples from $r$ for which $\theta$ holds. The projection operator $\pi$ keeps a specific set of attributes in the relation: $t = \pi_{x_1, \ldots, x_n}(r)$. Note that the tuples are not deduplicated, thus $t$ will have the same number of tuples as $r$. The projection operator can also alias attributes, e.g. $\pi_{x_1 \rightarrow y_1, 5 \rightarrow y_2}(r)$ renames $x_1$ to $y_1$ and returns 5 as attribute $y_2$. The duplicate-elimination operator $\delta$ eliminates duplicate tuples in a bag, enforcing set semantics on its input.
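The behaviour of these three operators under bag semantics can be illustrated with a minimal Python sketch. It assumes that a relation is a list of flat dictionaries; the function names `select`, `project`, and `distinct` and the sample relation are hypothetical and not part of ingraph.

```python
# Bags are modelled as Python lists, so duplicates are preserved.
def select(r, predicate):
    """Selection: keep the tuples for which the predicate holds."""
    return [t for t in r if predicate(t)]

def project(r, attrs):
    """Projection with aliasing. attrs maps each output attribute to an
    input attribute name, or to a callable computing a value."""
    out = []
    for t in r:
        out.append({alias: (src(t) if callable(src) else t[src])
                    for alias, src in attrs.items()})
    return out

def distinct(r):
    """Duplicate elimination: keep the first occurrence of each tuple."""
    seen, out = set(), []
    for t in r:
        key = tuple(sorted(t.items()))
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out

r = [{"x1": 1, "x2": "a"}, {"x1": 1, "x2": "b"}, {"x1": 2, "x2": "c"}]
# Rename x1 to y1 and return the constant 5 as attribute y2.
p = project(r, {"y1": "x1", "y2": lambda t: 5})
d = distinct(p)
s = select(r, lambda t: t["x1"] > 1)
```

Here `p` still has three tuples although two of them are equal, and only `distinct` collapses them.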

Shorthands

For the sake of conciseness, we introduce two shorthands. First, we allow using the dot notation to traverse the nested schema and directly access nested attributes in expressions (such as the selection predicate $\theta$), e.g. the expression $v.\mathrm{id}$ can access the id attribute of the attribute $v$. This notation requires $v$ to have an id, and the expression $v.\mathrm{id} = c$ holds iff that id equals $c$.

Second, properties stored as key-value pairs in the nested relation can be accessed directly as if they were top-level attributes, e.g. the expression $p.\mathrm{name}$ can access the name property of attribute $p$. Unlike nested attribute access, this shorthand does not require $p$ to have a property with key name; it simply returns NULL in the absence of such a key.
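The difference between the two shorthands can be sketched as follows. The helper names `nested` and `prop`, the sample vertex, and the use of Python's `None` for NULL are assumptions made for illustration.

```python
# A single vertex tuple with nested labels and properties relations.
vertex = {
    "id": 1,
    "labels": [{"label": "Person"}],
    "properties": [{"key": "name", "value": "Alice"}],
}

def nested(t, path):
    """Dot-notation access: every attribute on the path must exist,
    otherwise a KeyError is raised."""
    value = t
    for attr in path.split("."):
        value = value[attr]
    return value

def prop(element, key):
    """Property access: scan the key-value relation and return None
    (standing in for NULL) if the key is absent."""
    for kv in element["properties"]:
        if kv["key"] == key:
            return kv["value"]
    return None

vertex_id = nested({"v": vertex}, "v.id")
name = prop(vertex, "name")
age = prop(vertex, "age")  # no such key, so this is None
```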

3.2. The get-vertices and get-edges operators

For mapping a property graph to relations, we use the nullary operators get-vertices and get-edges. We define these operators using the nested vertex and edge relations ($\mathbb{V}$ and $\mathbb{E}$) introduced in Section 2.2. These operators are rather involved, hence we introduce some notational conventions used for the definitions:

  • A vertex variable $v$ is free w.r.t. an operator’s input relation $r$ if $v$ is not in the schema of $r$, and bound if it is.

  • represents a free vertex, represents a bound vertex, and represents any vertex.

  • Arrow symbols , , and represent an outgoing, incoming, and undirected edge, respectively.

  • For vertices, we use three predefined sets of labels:
    ; ; and .

  • For edges, we use a set of types .

Get-vertices

The get-vertices operator (Hölsch and Grossniklaus, 2016) returns a nested relation of a single attribute that contains vertices which have all labels of . Formally, it is defined as:

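The resulting relation has a single attribute, named after the vertex variable, with the nested attributes id, labels, and properties, as the example in Figure 1(e) shows. The usage of the operator is illustrated with the following example: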

  Example. Get the name of all Persons aged over 25.
    MATCH (p:Person) WHERE p.age > 25 RETURN p.name

  Result (p.name):
    Bob
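A rough Python sketch of how this example could be evaluated is given below. It is not the ingraph implementation: the vertex ids are assumed, properties are simplified to plain dictionaries, and `get_vertices` and `older_than` are hypothetical helper names.

```python
# Simplified vertex relation; ids are assumed for illustration.
vertices = [
    {"id": 1, "labels": {"Student", "Person"},
     "properties": {"name": "Alice", "age": 24}},
    {"id": 2, "labels": {"Person"},
     "properties": {"name": "Bob", "age": 53}},
    {"id": 3, "labels": {"Tag"}, "properties": {"topic": "Neofolk"}},
]

def get_vertices(var, required_labels):
    """Return one tuple per vertex that has all required labels;
    the tuple has a single attribute named after the variable."""
    return [{var: v} for v in vertices if set(required_labels) <= v["labels"]]

def older_than(t, var, limit):
    """A missing age (NULL) never satisfies the comparison."""
    age = t[var]["properties"].get("age")
    return age is not None and age > limit

matched = get_vertices("p", ["Person"])                     # MATCH (p:Person)
selected = [t for t in matched if older_than(t, "p", 25)]   # WHERE p.age > 25
names = [t["p"]["properties"]["name"] for t in selected]    # RETURN p.name
```

The three steps correspond to a get-vertices operator followed by a selection and a projection.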

 

The get-edges operator

Next, we introduce the get-edges operator , which returns edges along with their source and target vertices. Using theta joins on get-vertices operators and relation , the get-edges operator can be defined as:

The result has three attributes, one each for the source vertex, the edge, and the target vertex, as the example in Figure 1(f) shows.

Edge directions

Additionally to the directed get-edges operator, we define the undirected get-edges operator , which enumerates edges of both directions. Formally:

Notation

To aid readability, we always surround the get-vertices and get-edges operators with parentheses and brackets, respectively.

3.3. The expand operators

To capture navigations, we define the unary expand-out operator $\uparrow$. The expression $\uparrow_{(v)}^{(w:L)}[e:T](r)$ takes tuples from relation $r$ and returns a tuple for each possible navigation from a bound vertex $v$ to vertex $w$ through an edge $e$, while enforcing the label and type constraints ($w$ is labelled with all labels of $L$, and $e$ is typed with one type of $T$ or has any type if $T$ is empty). It can be defined using the get-edges operator:

The schema of the resulting relation is . The operator is demonstrated as follows:

  Example. Get Persons and their interests.
    MATCH (p:Person)-[i:INTEREST]->(t:Tag) RETURN p.name, i.level, t.topic

  Result (p.name, i.level, t.topic):
    Alice, 4, Neofolk
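The navigation performed by expand-out can be sketched in Python as a nested loop over the input tuples and the edges. The ids, the endpoints of the edges, and the function name `expand_out` are assumptions; a real engine would use indexes instead of scanning all edges.

```python
# Simplified graph; ids and edge endpoints are assumed for illustration.
vertices = {
    1: {"id": 1, "labels": {"Student", "Person"}, "properties": {"name": "Alice"}},
    2: {"id": 2, "labels": {"Person"}, "properties": {"name": "Bob"}},
    3: {"id": 3, "labels": {"Tag"}, "properties": {"topic": "Neofolk"}},
}
edges = [
    {"id": 1, "src": 1, "trg": 2, "type": "KNOWS", "properties": {"since": 2014}},
    {"id": 2, "src": 1, "trg": 3, "type": "INTEREST", "properties": {"level": 4}},
]

def expand_out(r, v, w, e, w_labels=(), types=()):
    """For each input tuple, navigate from bound vertex v along outgoing
    edges, binding the edge to e and the target vertex to w."""
    out = []
    for t in r:
        for edge in edges:
            target = vertices[edge["trg"]]
            if (edge["src"] == t[v]["id"]
                    and (not types or edge["type"] in types)
                    and set(w_labels) <= target["labels"]):
                out.append({**t, e: edge, w: target})
    return out

persons = [{"p": vx} for vx in vertices.values() if "Person" in vx["labels"]]
expanded = expand_out(persons, "p", "t", "i", ["Tag"], ["INTEREST"])
rows = [(t["p"]["properties"]["name"],
         t["i"]["properties"]["level"],
         t["t"]["properties"]["topic"]) for t in expanded]
```

An empty `types` argument accepts edges of any type, mirroring the convention in the definition above.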

 

Edge directions

We define two additional expand operators: the expand-in operator  accepts incoming edges, while the expand-both operator  accepts edges from both directions. Formally, they can be defined as follows:

Transitive navigation

To allow multi-hop navigation along the edges, we define a transitive variant of the expand operator, which navigates from $v$ to $w$ through edges $E$ of any type in $T$ (if $T$ is not empty), using a number of hops between a lower bound ($\mathit{low}$) and an upper bound ($\mathit{up}$).

We restate here that the nested relations in this paper follow bag semantics (Section 2.2), i.e. they do not store any ordering between their tuples. Storing the edges of a path as a single attribute would therefore lose the information on their ordering. Hence, we define attribute $E$ as a nested attribute which stores the edge attribute “edge” along with an indexing attribute “index” that denotes the position of the edge in the path. Using this attribute, the schema is:

This is demonstrated with the following example:

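  Example. Get the subclasses of Class 'Art'.
    MATCH (c:Class)-[sos:SUBCLASS_OF*1..]->(a:Class) WHERE a.subject = 'Art' RETURN c.subject, sos

  Result (c.subject, sos), where sos is a nested relation with attributes index and edge (id, src, trg, type, properties):
    Folk: (index 1, edge id 4, type SUBCLASS_OF), (index 2, edge id 5, type SUBCLASS_OF)
    Music: (index 1, edge id 5, type SUBCLASS_OF)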
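The role of the index attribute can be illustrated with a small Python sketch of transitive navigation. It assumes vertex ids 4, 5, and 6 for the three Classes, enumerates paths that do not reuse an edge, and uses the hypothetical function name `transitive_expand`; it is a naive enumeration, not an incremental algorithm.

```python
# Class vertices and edges; vertex ids and endpoints are assumed.
classes = {4: "Folk", 5: "Music", 6: "Art"}
edges = [
    {"id": 3, "src": 3, "trg": 4, "type": "CLASS"},
    {"id": 4, "src": 4, "trg": 5, "type": "SUBCLASS_OF"},
    {"id": 5, "src": 5, "trg": 6, "type": "SUBCLASS_OF"},
]

def transitive_expand(start, types, low=1, up=None):
    """Return (target id, path) pairs for all paths from start that use
    between low and up edges. A path is a nested relation of tuples with
    an index attribute and an edge attribute."""
    results = []
    stack = [(start, [])]
    while stack:
        node, path = stack.pop()
        if len(path) >= low:
            results.append(
                (node, [{"index": i + 1, "edge": e} for i, e in enumerate(path)]))
        if up is not None and len(path) >= up:
            continue
        used = {e["id"] for e in path}
        for e in edges:
            if (e["src"] == node and e["id"] not in used
                    and (not types or e["type"] in types)):
                stack.append((e["trg"], path + [e]))
    return results

rows = []
for start, subject in classes.items():
    for target, path in transitive_expand(start, ["SUBCLASS_OF"], low=1):
        if classes.get(target) == "Art":
            rows.append((subject,
                         [(step["index"], step["edge"]["id"]) for step in path]))
```

Because each path tuple carries its position, the order of the edges survives even though the nested relation itself is an unordered bag.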

 

3.4. Combining pattern matches

A single graph pattern is defined starting from get-vertices and expand operators. Multiple graph patterns can be combined based on their common attributes using the natural join operator $\bowtie$. Additionally, most PG query languages allow users to define optional pattern parts. This can be captured with the left outer join operator, which pads tuples from the left relation that did not match any from the right relation with NULL values and adds them to the result of the natural join (Silberschatz et al., 2005). This is illustrated by the following example:

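  Example. Get Persons and their interests if they have any.
    MATCH (p:Person) OPTIONAL MATCH (p)-[i:INTEREST]->(t:Tag) RETURN p.name, t

  Result (p.name, t), where t is a nested vertex with attributes id, labels (label), and properties (key, value):
    Alice: labels {Tag}, properties {topic: Neofolk}
    Bob: NULL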

 

Some queries pose structural conditions on the graph patterns (e.g. only return Persons who have at least one interest). Positive structural conditions can be captured with the semijoin operator $\ltimes$, which is defined as $r \ltimes s = \pi_{\mathrm{sch}(r)}(r \bowtie s)$. Negative structural conditions can be captured by using the antijoin operator $\triangleright$ (also known as the anti-semijoin), which is defined as $r \triangleright s = r \setminus (r \ltimes s)$. For the sake of brevity, we refrain from providing examples for these operators.
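The join variants of this section can be sketched in Python over bags of flat dictionaries. The function names and the sample relations are hypothetical, NULL is modelled as `None`, and the semijoin follows the conventional reading in which each matching left tuple is kept once.

```python
def natural_join(r, s):
    """Combine tuples that agree on all of their common attributes."""
    out = []
    for t in r:
        for u in s:
            if all(t[a] == u[a] for a in t.keys() & u.keys()):
                out.append({**t, **u})
    return out

def left_outer_join(r, s, s_schema):
    """Natural join that keeps unmatched left tuples, padding the
    attributes of the right relation with None."""
    out = []
    for t in r:
        matches = natural_join([t], s)
        out.extend(matches if matches else [{**{a: None for a in s_schema}, **t}])
    return out

def semijoin(r, s):
    """Left tuples with at least one match (positive condition)."""
    return [t for t in r if natural_join([t], s)]

def antijoin(r, s):
    """Left tuples without any match (negative condition)."""
    return [t for t in r if not natural_join([t], s)]

persons = [{"p": "Alice"}, {"p": "Bob"}]
interests = [{"p": "Alice", "t": "Neofolk"}]
optional = left_outer_join(persons, interests, ["p", "t"])
with_interest = semijoin(persons, interests)
without_interest = antijoin(persons, interests)
```

In this sketch `optional` mirrors the OPTIONAL MATCH example, while `with_interest` and `without_interest` correspond to the positive and negative structural conditions.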

3.5. Collections and aggregation

Unwinding

It is often required to handle elements in nested collections separately. To allow this, we introduce the unwind operator $\omega$, a specialized version of the unnest operator $\mu$ of nested relational algebra (Botoeva et al., 2018). In particular, $\omega_{xs \rightarrow x}(r)$ takes the bag in attribute $xs$ and creates a new tuple for each element of the bag by appending that element as attribute $x$ to the tuples of $r$.

Ordering

In common extensions to relational algebra (Garcia-Molina et al., 2009), the sort operator is used to sort a relation, returning a relation that follows list semantics. The ordering is defined according to selected attributes and with a certain direction for each attribute (ascending or descending). Additionally, the top operator (Li et al., 2005) takes a list relation as its input, skips the first $s$ tuples and returns the next $l$ tuples. The default values are 0 for $s$ and $\infty$ for $l$.

As the operators in our nested bag algebra do not define ordering, a standalone sort or top operator would have no clear semantics. Hence, we only allow these operators combined together as a single sort-and-top operator.

Grouping and aggregation

The grouping operator $\gamma$ groups tuples according to their value in one or more attributes and aggregates the remaining attributes. As determining the attributes of the grouping criteria is non-trivial, the grouping operator explicitly states these attributes. We use the notation $\gamma^{c_1, \ldots, c_m}_{e_1 \rightarrow a_1, \ldots, e_n \rightarrow a_n}$, where $c_1, \ldots, c_m$ form the grouping criteria, i.e. the list of expressions whose values partition the incoming tuples into groups. For every group, this aggregation operator emits a single tuple of expressions $e_1, \ldots, e_n$ with aliases $a_1, \ldots, a_n$, respectively. We demonstrate the unwind, grouping, sort, and top operators using a single example:

  Example. Number of speakers of the top 1 spoken language.
    MATCH (p:Person) WITH p UNWIND p.speaks AS lang RETURN lang, count(p) AS sks ORDER BY sks DESC LIMIT 1

  Result (lang, sks):
    en, 2
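The pipeline of this example can be sketched in Python as unwind, grouping with a count, and a combined sort-and-top step. The function names and the flat representation of Persons are assumptions made for illustration.

```python
from collections import Counter

persons = [
    {"name": "Alice", "speaks": ["en"]},
    {"name": "Bob", "speaks": ["en", "de"]},
]

def unwind(r, xs, x):
    """Emit one tuple per element of the collection in attribute xs,
    appending the element as attribute x."""
    return [{**t, x: element} for t in r for element in t[xs]]

def group_count(r, key, alias):
    """Group by a single attribute and count the tuples in each group."""
    counts = Counter(t[key] for t in r)
    return [{key: k, alias: c} for k, c in counts.items()]

def sort_and_top(r, key, descending=False, skip=0, limit=None):
    """Sort and top as one operator: order by key, skip the first
    tuples, then return at most limit tuples (all if limit is None)."""
    ordered = sorted(r, key=lambda t: t[key], reverse=descending)
    return ordered[skip:] if limit is None else ordered[skip:skip + limit]

unwound = unwind(persons, "speaks", "lang")
grouped = group_count(unwound, "lang", "sks")
result = sort_and_top(grouped, "sks", descending=True, limit=1)
```

Sorting and truncation are applied in one function, in line with the restriction that sort and top are only allowed as a combined operator.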

 

Language construct                                        GRA expression
  (<<v>>)
  (<<v>>:<<l1>>:...:<<lk>>)
  <p>p</p>-[<<e>>:<<t1>>|...|<<to>>]->(<<w>>)
  <p>p</p><-[<<e>>:<<t1>>|...|<<to>>]-(<<w>>)
  <p>p</p><-[<<e>>:<<t1>>|...|<<to>>]->(<<w>>)
  <p>p</p>-[<<E>>:<<t1>>|...|<<to>>*low..up]->(<<w>>)
  MATCH <p>p1</p>, <p>p2</p>, ...
  OPTIONAL MATCH <p>p</p>
  [[r]] OPTIONAL MATCH <p>p</p>
  [[r]] WHERE <<condition>>
  [[r]] WHERE (<<v>>:<<l1>>:...:<<lk>>)
  [[r]] WHERE <p>p</p>
  [[r]] WHERE NOT <p>p</p>
  [[r]] RETURN <<x1>> AS <<y1>>, ...
  [[r]] RETURN DISTINCT <<x1>> AS <<y1>>, ...
  [[r]] RETURN <<x1>>, <<x2>>, aggr(<<x3>>)
  [[r]] UNWIND <<xs>> AS <<x>>
  [[r]] ORDER BY <<x1>> ASC, <<x2>> DESC, ... SKIP s LIMIT l

Table 2. Mapping from openCypher constructs to GRA. Variables, labels, and types are typeset as <<v>>. The notation <p>p</p> represents a pattern resulting in a relation $p$. To allow navigation from this relation, we presume that relation $p$ has an attribute $v$ that represents a vertex. [[r]] stands for a relation that is the result of the previous query parts. To avoid confusion with the “..” language construct (used for ranges), we use “...” to denote omitted query parts.

In this section, we defined the operators of GRA and gave an informal specification for compiling openCypher queries to GRA. Table 2 shows a compact mapping of openCypher queries to GRA expressions. Note that the get-edges operator is not needed to capture the mapping; instead, only the get-vertices nullary operators are used, and edges are inserted by the expand and transitive expand operators. For a more detailed mapping, we refer the reader to (Marton et al., 2017).

4. Transforming Graph RA to Flat RA

(a) Example graph.
(b) The workflow of ingraph: query specification, GRA query plan, NRA query plan, FRA query plan, query view (Section 5).
(c) Query specification in openCypher:
    MATCH (p:Person)-[pi:INTEREST]->(pt:Tag)-[tc:CLASS]->
          (fc:Class)-[sos:SUBCLASS_OF*0..]->(c:Class)
    WHERE c.subject = 'Music'
    OPTIONAL MATCH (p)-[k:KNOWS]-(f:Person)
    WHERE p.street = f.street
    WITH p, count(DISTINCT f) AS nf WHERE nf < 3
    RETURN p.name
(d) Query plan in graph relational algebra.