Graph processing problems are common in modern database systems, where the property graph (PG) data model (Hölsch and Grossniklaus, 2016; Marton et al., 2017; Angles et al., 2018; Francis et al., 2018; Angles, 2018) is gaining widespread adoption. Property graphs extend labelled graphs with properties for both vertices and edges. Compared to previous graph modelling approaches, such as the RDF data model (which treats properties as triples), PGs allow users to store their graphs in a more compact and comprehensible representation.
Due to the novelty of the PG data model, no standard query language has emerged yet. The openCypher initiative aims to standardise the Cypher language (Francis et al., 2018) of the Neo4j graph database. The openCypher language uses a SQL-like syntax and combines graph pattern matching with relational operators (aggregations, joins, etc.). In this paper, we target queries specified in the openCypher language.
In graph database applications, numerous use cases rely on complex queries and require low response times for repeated executions, including financial fraud detection and recommendation engines. In addition, graph databases are increasingly used in a software engineering context as a semantic knowledge base for model validation (Bergmann et al., 2010; Szárnyas et al., 2017a; Daniel et al., 2017), source code analysis (Hawes et al., 2015), etc. While these scenarios could greatly benefit from incremental query evaluation, currently no system provides incremental views with sufficient feature coverage for a practical PG query language such as openCypher. To the best of our knowledge, the only existing incremental property graph query engine is Graphflow (Kankanamge et al., 2017), which extends Cypher with triggers, but lacks support for rich language features like negative/optional edges and transitive closures.
Incremental graph queries have been used successfully in the domain of model-driven engineering. For example, the incremental query engine of Viatra ensures quick model validation and transformation over in-memory graph models (Ujhelyi et al., 2015).
In relational database systems, incremental view maintenance (IVM) techniques have been used for decades for repeated evaluation of a predefined query set on continuously changing data (Forgy, 1982; Blakeley et al., 1986; Miranker and Lofaso, 1991; Gupta et al., 1993; Gupta and Mumick, 1995; Hanson, 1996; Gupta and Mumick, 1999; Hanson et al., 2002; Szárnyas et al., 2014; Koch et al., 2014; Ujhelyi et al., 2015). However, these techniques typically build on assumptions that do not hold for property graph queries. In particular, PG queries present numerous challenges:
Schema-optional data model. Existing IVM techniques assume that the database schema is known a priori. While this is a realistic assumption for relational databases, the data model of most property graph systems is schema-free or schema-optional at best (Francis et al., 2018). Hence, to use IVM, users are required to manually define the schema of the graph, which is a tedious and error-prone process.
Nested data structures. Most IVM techniques assume a relational data model with 1NF relations. However, the property graph data model defines rich structures, including the properties on graph elements and paths. Collection types, such as sets, bags, lists, and maps, are also allowed (Angles et al., 2018; Francis et al., 2018). These can be represented as NF2 (non-first normal form) data structures, but their mapping to 1NF relations is a complex challenge.
Reachability queries. Unbounded reachability queries on graphs with few connected components need to calculate large transitive closures, which makes them inherently expensive (Bergmann et al., 2012). Hence, the impact of the IVM on reachability is more limited compared to non-recursive queries and using space-time tradeoff techniques is more expensive: to improve execution time, one has to trade memory at an exponential rate.
Mix of queries and transformations. Some property graph query languages (e.g. openCypher) allow combining update operations with queries. Most traditional IVM techniques do not consider this challenge, and omit related issues such as conflict set resolution. Discrimination networks from rule-based expert systems are better suited to handle this issue (Forgy, 1982; Miranker and Lofaso, 1991; Hanson et al., 2002).
List handling. Property graph data sets and queries make use of lists both to store collections of primitive values and to represent paths in the graph. Order-preserving techniques have only been studied in the context of IVM on XQuery expressions (Dimitrova et al., 2003), i.e. for trees but not for graphs.
Skewed data distribution. Subgraph matching is often implemented as a series of binary joins. Recent work revealed that binary (two-way) joins are inefficient on data sets with skewed distributions of certain edge types (displayed by graph instances in many fields, e.g. in social networks). Hence, a large body of new research proposes n-ary (multiway) joins to achieve theoretically optimal complexity (Ngo et al., 2018; Ammar et al., 2018).
Higher-order queries. PG queries often employ higher-order expressions (Buneman et al., 1995), e.g. processing the vertices/edges on a path (also known as path unwinding (Angles et al., 2017)). Incrementalization of higher-order languages is a new field of research (Cai et al., 2014), and to the best of our knowledge, there are currently no implementations using these techniques for query evaluation.
In this paper, we discuss the challenges of IVM on PG queries and present an approach to tackle some of these challenges. In particular:
We introduce extensions for relational algebra in order to handle graph-specific operators and use them to capture the semantics of (a subset of) the openCypher language.
We define a mapping for PG data using nested relations, and use nested relational algebra (NRA) to represent the queries. The data model can represent both the property graph and the resulting tables, while the NRA operators have sufficient expressive power to capture operations on the PG. This allows the algebra to be composable and closed even for operations such as transitive reachability.
We define a chain of transformations to translate the nested algebraic query plans to (incrementally maintainable) flat relational algebra (FRA) expressions.
We present a schema inference algorithm that eliminates the need to define the graph schema in advance.
We present ingraph, a research prototype capable of evaluating openCypher graph queries incrementally. (Footnote 1: ingraph is available as an open-source tool at http://github.com/ftsrg/ingraph.)
We overview applicable IVM approaches from the literature in rule-based expert systems, integrity constraint checking, and materialized views.
We first present some theoretical background for property graphs (Section 2) and define the operators of graph relational algebra (Section 3). We then discuss the compilation and query evaluation process (Section 4) and view maintenance (Section 5). Finally, we overview related techniques (Section 6) and outline future directions (Section 7).
2. The property graph data model
2.1. Data model
The concept of the property graph has only been studied by a few academic works, but it already has multiple flavours and definitions (Hölsch and Grossniklaus, 2016; Marton et al., 2017; Angles et al., 2018; Francis et al., 2018; Angles, 2018). In this paper, we define it as follows.
A PG is $G = (V, E, \mathit{st}, L, T, \mathcal{L}, \mathcal{T}, P_v, P_e)$, where $V$ is a set of vertex identifiers, $E$ is a set of edge identifiers, and function $\mathit{st}: E \to V \times V$ assigns the source and target vertices to edges. Vertices are labelled and edges are typed: $L$ is a set of vertex labels, function $\mathcal{L}: V \to 2^{L}$ assigns a set of labels to each vertex; $T$ is a set of edge types, function $\mathcal{T}: E \to T$ assigns a single type to each edge.
Let $S$ be a set of scalar literals, and $\mathrm{FBAG}(S)$ denote the set of all finite bags of elements from $S$. Let $D = S \cup \mathrm{FBAG}(S)$ be the value domain for the PG. (Footnote 2: The data model can be generalised further, e.g. by allowing arbitrary nesting of collections. However, this data model already has higher expressive power than most graph data models (e.g. semantic graphs) and satisfies the needs of most practical use cases. It is also powerful enough to represent the complex schema of the LDBC Social Network Benchmark (Erling et al., 2015).)
$P_v$ is the set of vertex properties. A vertex property $p_i \in P_v$ is a partial function $p_i: V \to D$, which assigns a property value $d_i \in D$ to a vertex $v \in V$, if $v$ has property $p_i$, otherwise $p_i(v)$ returns $\mathrm{NULL}$.
$P_e$ is the set of edge properties. An edge property $p_j \in P_e$ is a partial function $p_j: E \to D$, which assigns a property value $d_j \in D$ to an edge $e \in E$, if $e$ has property $p_j$, otherwise $p_j(e)$ returns $\mathrm{NULL}$.
An example graph inspired by the LDBC Social Network Benchmark (Szárnyas et al., 2018) is shown formally in (a) and graphically in (c). The graph contains a , two , and three . Note that edges in the PG data model are always directed, hence the relation is represented with a directed edge and the symmetric nature of the relation can be modelled in the queries.
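To make the definition concrete, the following is a minimal Python sketch of the property graph data model. The class name, field names, and the toy graph are illustrative assumptions (they are not the example graph of the figure); properties are modelled as partial functions that return `None` in place of NULL.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Hypothetical container mirroring the components of the PG definition."""
    vertices: set = field(default_factory=set)        # vertex identifiers
    edges: set = field(default_factory=set)           # edge identifiers
    st: dict = field(default_factory=dict)            # edge -> (source, target)
    labels: dict = field(default_factory=dict)        # vertex -> set of labels
    types: dict = field(default_factory=dict)         # edge -> single type
    vertex_props: dict = field(default_factory=dict)  # property name -> {vertex: value}
    edge_props: dict = field(default_factory=dict)    # property name -> {edge: value}

    def vertex_property(self, name, v):
        """Partial function: None stands for NULL if v lacks the property."""
        return self.vertex_props.get(name, {}).get(v)

    def edge_property(self, name, e):
        """Partial function: None stands for NULL if e lacks the property."""
        return self.edge_props.get(name, {}).get(e)


# Assumed toy data: property values are scalars or bags (here: lists) of scalars.
g = PropertyGraph(
    vertices={1, 2, 3},
    edges={10, 11},
    st={10: (1, 2), 11: (2, 3)},
    labels={1: {"Person"}, 2: {"Person"}, 3: {"Message"}},
    types={10: "KNOWS", 11: "LIKES"},
    vertex_props={"name": {1: "Alice", 2: "Bob"}, "speaks": {1: ["en", "de"]}},
    edge_props={"since": {10: 2014}},
)
```

Storing each property as its own mapping keeps the "partial function" reading of the definition visible: a vertex that is absent from the mapping simply has no value for that property.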
2.2. Nested relations
openCypher queries take a property graph as their inputs and return a graph relation (Hölsch and Grossniklaus, 2016; Marton et al., 2017) as their output. To represent graphs and query results using the same algebraic constructs, we use nested relations (Colby, 1990), which allow data items of a relation to contain additional relations with an arbitrary level of nesting. The domain for the internal relations is . Relations on all levels of nesting follow bag semantics, i.e. duplicate tuples are allowed. We define the schema of a relation as a list of (nested) attributes and denote it with for relation .
To represent the vertices and edges of the property graph, we define two nested relations, and . Both relations have a single attribute containing nested relations. Their schema is given below and its mapping to the PG concepts is in Table 1.
For , its corresponds to the elements in . For a particular vertex, is the result of the function, whereas is the result of . Similarly for , corresponds to the elements in . For a particular edge, the corresponds to the result of , is the result of , and is the result of .
The nested relations representing the example graph are shown in Figure 0(c) and 0(d). These show that the set of vertex labels is stored as a nested relation with a single attribute , while the edge type is simply stored as a single string value. The properties of vertices/edges are stored as a nested relation with two attributes, and . This representation is well-suited to the flexible schema of PG databases, as new labels, types, and property keys can be added without any changes to the schemas of the relations.
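The nested encoding can be sketched in Python with lists of dictionaries standing in for bags of tuples. The attribute names used here (`id`, `labels`, `label`, `type`, `src`, `trg`, `properties`, `key`, `value`) and the toy data are assumptions chosen for illustration, not necessarily the names used in the schema above.

```python
# Assumed toy graph, given as plain dictionaries.
vertex_labels = {1: {"Person"}, 2: {"Person", "Student"}}
vertex_props = {1: {"name": "Alice"}, 2: {"name": "Bob", "age": 25}}
edge_ends = {10: (1, 2)}
edge_types = {10: "KNOWS"}
edge_props = {10: {"since": 2014}}


def nest_props(props):
    """Store properties as a nested relation of (key, value) tuples."""
    return [{"key": k, "value": v} for k, v in sorted(props.items())]


def vertex_relation():
    """One tuple per vertex; the single attribute 'v' holds a nested tuple."""
    return [
        {"v": {"id": vid,
               "labels": [{"label": name} for name in sorted(vertex_labels[vid])],
               "properties": nest_props(vertex_props.get(vid, {}))}}
        for vid in sorted(vertex_labels)
    ]


def edge_relation():
    """One tuple per edge; the single attribute 'e' holds a nested tuple."""
    return [
        {"e": {"id": eid,
               "src": edge_ends[eid][0],
               "trg": edge_ends[eid][1],
               "type": edge_types[eid],
               "properties": nest_props(edge_props.get(eid, {}))}}
        for eid in sorted(edge_ends)
    ]
```

Note that vertex 2 carries an extra label and an extra property key, yet its tuple has exactly the same attributes as that of vertex 1, which is the schema flexibility the paragraph describes.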
3. Graph relational algebra
Papers (Hölsch and Grossniklaus, 2016) and (Marton et al., 2017) presented relational algebraic formalisations of the openCypher language. A more rigorous formalisation was given in (Francis et al., 2018). In this paper, we follow the approach of our previous work (Marton et al., 2017) as it is better suited to established IVM techniques. This approach uses graph relational algebra (GRA), which extends standard relational algebra operators with graph-specific navigation operators.
In this section, we formally define the operators of GRA and show example queries specified in natural language and as an openCypher query, along with the equivalent GRA expression and the resulting output relation.
3.1. Basic operators and nested property access
We first present the basic unary operators of relational algebra, found in most relational algebra textbooks such as (Garcia-Molina et al., 2009). The selection operator $\sigma$ filters the incoming relation according to some criteria. Formally, $t = \sigma_{\theta}(r)$, where predicate $\theta$ is a propositional formula. Relation $t$ contains all tuples from $r$ for which $\theta$ holds. The projection operator $\pi$ keeps a specific set of attributes in the relation: $t = \pi_{x_1, \ldots, x_n}(r)$. Note that the tuples are not deduplicated, thus $t$ will have the same number of tuples as $r$. The projection operator can also alias attributes, e.g. $\pi_{x_1 \to y_1,\, 5 \to y_2}(r)$ renames $x_1$ to $y_1$ and returns 5 as attribute $y_2$. The duplicate-elimination operator $\delta$ eliminates duplicate tuples in a bag, enforcing set semantics on its input.
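The bag semantics of these three operators can be illustrated with a short Python sketch, in which a relation is a list of dictionaries. The function names and the sample relation are assumptions made for this illustration.

```python
def select(r, predicate):
    """Selection: keep the tuples for which the predicate holds."""
    return [t for t in r if predicate(t)]


def project(r, exprs):
    """Projection with aliasing: exprs maps output attribute -> function of a tuple.

    No deduplication takes place, so the output has as many tuples as the input.
    """
    return [{name: f(t) for name, f in exprs.items()} for t in r]


def distinct(r):
    """Duplicate elimination: enforce set semantics, keeping first occurrences."""
    seen, out = set(), []
    for t in r:
        key = tuple(sorted(t.items()))  # assumes hashable (scalar) attribute values
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out


# Assumed sample relation.
r = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 30},
    {"name": "Carol", "age": 41},
]

# Rename 'age' to 'a' and return the constant 5 as attribute 'five'.
ages = project(r, {"a": lambda t: t["age"], "five": lambda t: 5})
```

Here `ages` still contains three tuples, two of which are identical; only `distinct(ages)` collapses them.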
For the sake of conciseness, we introduce two shorthands. First, we allow using the dot notation () to traverse the nested schema to directly access nested attributes in the expressions (such as the selection predicate ), e.g. the expression can access the attribute of the attribute . This notation requires to have an and the expression holds iff equals to .
Second, properties stored as key-value pairs in the nested relation can be accessed directly as if they were top-level attributes, e.g. the expression can access the property of attribute . Unlike nested attribute access, this shorthand does not require to have a property with key , it simply returns in the absence of such a key.
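The difference between the two shorthands can be sketched in Python as follows. The helper names, the nested attribute names, and the use of `None` for a missing property are assumptions of this sketch.

```python
def nested_attr(t, path):
    """Dot-notation access such as "v.id"; fails if a nested attribute is missing."""
    value = t
    for name in path.split("."):
        value = value[name]  # raises KeyError when the schema lacks the attribute
    return value


def prop(t, var, key):
    """Property shorthand: scan the nested key-value relation of attribute `var`.

    Returns None when no pair with the given key exists, instead of failing.
    """
    for pair in t[var]["properties"]:
        if pair["key"] == key:
            return pair["value"]
    return None


# Assumed sample tuple in the nested vertex encoding.
t = {"v": {"id": 1,
           "labels": [{"label": "Person"}],
           "properties": [{"key": "name", "value": "Alice"}]}}
```

The asymmetry is deliberate: nested attributes are part of the schema, so a missing one is an error, whereas property keys are data and may legitimately be absent.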
3.2. The get-vertices and get-edges operators
For mapping a property graph to relations, we use the nullary operators get-vertices and get-edges. We define these operators using the nested relations and introduced in Section 2.2. These operators are rather involved, hence we introduce some notational conventions used for the definitions:
A vertex variable is free w.r.t. an operator’s input relation if and bound if .
represents a free vertex, represents a bound vertex, and represents any vertex.
Arrow symbols , , and represent an outgoing, incoming, and undirected edge, respectively.
For vertices, we use three predefined sets of labels:
; ; and .
For edges, we use a set of types .
The get-vertices operator (Hölsch and Grossniklaus, 2016) returns a nested relation of a single attribute that contains vertices which have all labels of . Formally, it is defined as:
The schema of the resulting relation is , as the example in (e) shows. The usage of the operator is illustrated with the following example:
The get-edges operator
Next, we introduce the get-edges operator , which returns edges along with their source and target vertices. Using theta joins on get-vertices operators and relation , the get-edges operator can be defined as:
The schema of the result is , as the example in (f) shows.
In addition to the directed get-edges operator, we define the undirected get-edges operator , which enumerates edges of both directions. Formally:
To aid readability, we always surround the and operators with parentheses and brackets, resp.
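As a rough illustration of the three nullary operators, the sketch below evaluates them over a small in-memory graph. It is simplified in that result attributes hold only identifiers rather than full nested tuples; the function names, parameter names, and data are assumptions.

```python
# Assumed toy graph: flat stand-ins for the nested vertex and edge relations.
V = [
    {"id": 1, "labels": {"Person"}},
    {"id": 2, "labels": {"Person", "Student"}},
    {"id": 3, "labels": {"Message"}},
]
E = [
    {"id": 10, "src": 1, "trg": 2, "type": "KNOWS"},
    {"id": 11, "src": 2, "trg": 3, "type": "LIKES"},
]


def get_vertices(var, labels=()):
    """Vertices that carry all of the given labels, bound to attribute `var`."""
    return [{var: v["id"]} for v in V if set(labels) <= v["labels"]]


def get_edges(v, e, w, v_labels=(), w_labels=(), types=()):
    """Directed edges from v to w; an empty `types` accepts any edge type."""
    sources = {t[v] for t in get_vertices(v, v_labels)}
    targets = {t[w] for t in get_vertices(w, w_labels)}
    return [
        {v: x["src"], e: x["id"], w: x["trg"]}
        for x in E
        if x["src"] in sources and x["trg"] in targets
        and (not types or x["type"] in types)
    ]


def get_edges_undirected(v, e, w, v_labels=(), w_labels=(), types=()):
    """Union (bag) of the edges from v to w and the edges from w to v."""
    outgoing = get_edges(v, e, w, v_labels, w_labels, types)
    incoming = get_edges(w, e, v, w_labels, v_labels, types)
    return outgoing + incoming
```

For instance, `get_edges_undirected("a", "e", "b", {"Person"}, {"Person"})` returns the single KNOWS edge twice, once for each orientation of `a` and `b`.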
3.3. The expand operators
To capture navigations, we define the unary expand-out operator . The expression takes tuples from relation and returns a tuple for each possible navigation from a bound vertex to vertex through an edge , while enforcing the label and type constraints ( is labelled with all labels of and is typed with one type of or has any type if is empty). It can be defined using the get-edges operator:
The schema of the resulting relation is . The operator is demonstrated as follows:
We define two additional expand operators: the expand-in operator accepts incoming edges, while the expand-both operator accepts edges from both directions. Formally, they can be defined as follows:
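The three expand operators can be sketched in Python as follows; each extends every input tuple with one edge and one newly reached vertex. The function names, the tuple layout of the edge list, and the toy data are assumptions of this sketch.

```python
# Assumed toy graph: vertex labels and edges as (id, source, target, type).
V_LABELS = {1: {"Person"}, 2: {"Person"}, 3: {"Message"}}
E = [(10, 1, 2, "KNOWS"), (11, 2, 3, "LIKES"), (12, 3, 1, "HAS_CREATOR")]


def _matches(vertex, etype, labels, types):
    """Reached vertex has all labels; edge has one of the types (or any if empty)."""
    return set(labels) <= V_LABELS[vertex] and (not types or etype in types)


def expand_out(r, v, e, w, labels=(), types=()):
    """Navigate from bound vertex v along outgoing edges to vertex w."""
    return [
        {**t, e: eid, w: trg}
        for t in r
        for eid, src, trg, etype in E
        if src == t[v] and _matches(trg, etype, labels, types)
    ]


def expand_in(r, v, e, w, labels=(), types=()):
    """Navigate from bound vertex v along incoming edges to vertex w."""
    return [
        {**t, e: eid, w: src}
        for t in r
        for eid, src, trg, etype in E
        if trg == t[v] and _matches(src, etype, labels, types)
    ]


def expand_both(r, v, e, w, labels=(), types=()):
    """Navigate along edges of either direction."""
    return expand_out(r, v, e, w, labels, types) + expand_in(r, v, e, w, labels, types)
```

Because the operators are unary, they compose directly: the output of one expand can be fed as the input relation of the next to follow a longer pattern.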
To allow multi-hop navigation along the edges, we define a transitive variant of the expand operator , which navigates from to through edges of any type in (if is not empty), using a number of hops between a lower bound () and an upper bound ().
We restate here that the nested relations in this paper follow bag semantics (Section 2.2), i.e. they do not store any ordering between their tuples. Storing the edges of a path as a single attribute would therefore lose the ordering information. Hence, we define attribute as a nested attribute which stores the edge attribute “” along with an indexing attribute “” that denotes the position of the edge in the path. Using this attribute, the schema is:
This is demonstrated with the following example:
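Separately from that example, a transitive expand with hop bounds can be sketched in Python as a bounded depth-first traversal. Several choices here are assumptions of the sketch rather than statements of the definition: the attribute names `idx` and `e`, 1-based positions, `None` standing for an unbounded upper limit, and the rule that an edge is not repeated within one path (which also guarantees termination on cyclic graphs).

```python
# Assumed toy graph: a directed cycle 1 -> 2 -> 3 -> 1, edges as (id, src, trg, type).
E = [(10, 1, 2, "KNOWS"), (11, 2, 3, "KNOWS"), (12, 3, 1, "KNOWS")]


def transitive_expand_out(r, v, path_attr, w, types=(), lower=1, upper=None):
    """Follow between `lower` and `upper` outgoing edges from bound vertex v.

    The path is stored as a nested relation of (idx, e) tuples so that the
    order of edges survives even though relations themselves are unordered.
    """
    out = []
    for t in r:
        stack = [(t[v], [])]  # (current vertex, edge ids used so far, in order)
        while stack:
            vertex, path = stack.pop()
            if len(path) >= lower:
                nested = [{"idx": i, "e": eid} for i, eid in enumerate(path, start=1)]
                out.append({**t, path_attr: nested, w: vertex})
            if upper is not None and len(path) == upper:
                continue  # hop limit reached, do not extend this path
            for eid, src, trg, etype in E:
                if src == vertex and eid not in path and (not types or etype in types):
                    stack.append((trg, path + [eid]))
    return out
```

Without the explicit `idx` attribute, the nested relation for a three-edge path would be an unordered bag of three edges, and the path could no longer be reconstructed from it.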
3.4. Combining pattern matches
A single graph pattern is defined starting from get-vertices and expand operators. Multiple graph patterns can be combined based on their common attributes using the natural join operator . Additionally, most PG query languages allow users to define optional pattern parts. This can be captured with the left outer join operator , which pads tuples from the left relation that did not match any tuple from the right relation with NULL values and adds them to the result of the natural join (Silberschatz et al., 2005). This is illustrated by the following example:
Some queries pose structural conditions on the graph patterns (e.g. only return Persons who have at least one interest). Positive structural conditions can be captured with the semijoin operator , which is defined as . Negative structural conditions can be captured by using the antijoin operator (also known as the anti-semijoin), which is defined as . For the sake of brevity, we refrain from providing examples for these operators.
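For readers who prefer code, the four join variants can be sketched in Python over lists of dictionaries. This is a naive nested-loop illustration under assumed names and data, not a description of how the operators are evaluated in practice; `None` stands for NULL.

```python
def natural_join(r, s):
    """Combine tuples that agree on all common attributes (bag semantics)."""
    out = []
    for a in r:
        for b in s:
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                out.append({**a, **b})
    return out


def left_outer_join(r, s, s_schema):
    """Natural join, plus unmatched left tuples padded with None.

    The schema of s is passed explicitly because it cannot be derived
    from the tuples when s is empty.
    """
    out = []
    for a in r:
        matches = natural_join([a], s)
        out.extend(matches or [{**{k: None for k in s_schema}, **a}])
    return out


def semijoin(r, s):
    """Left tuples with at least one match: a positive structural condition."""
    return [a for a in r if natural_join([a], s)]


def antijoin(r, s):
    """Left tuples without any match: a negative structural condition."""
    return [a for a in r if not natural_join([a], s)]


# Assumed sample relations: persons and their interests.
persons = [{"p": 1}, {"p": 2}]
interests = [{"p": 1, "tag": "db"}, {"p": 1, "tag": "graphs"}]
```

With this data, the semijoin returns person 1 once (not once per interest), while the antijoin returns person 2.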
3.5. Collections and aggregation
It is often required to handle elements in nested collections separately. To allow this, we introduce the unwind operator , a specialized version of the unnest operator of nested relational algebra (Botoeva et al., 2018). In particular, takes the bag in attribute and creates a new tuple for each element of the bag by appending that element as an attribute to .
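A minimal Python sketch of the unwind operator is given below. The function name and sample data are assumptions; in this sketch, a tuple whose bag is empty or `None` contributes no output tuples.

```python
def unwind(r, bag_attr, new_attr):
    """Emit one tuple per element of the bag, appended as attribute `new_attr`."""
    return [{**t, new_attr: x} for t in r for x in (t[bag_attr] or [])]


# Assumed sample relation with a bag-valued attribute.
r = [
    {"p": "Alice", "speaks": ["en", "de"]},
    {"p": "Bob", "speaks": []},
]
unwound = unwind(r, "speaks", "lang")
```

The original bag attribute is kept in each output tuple; a subsequent projection can drop it if only the individual elements are needed.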
In common extensions to relational algebra (Garcia-Molina et al., 2009), the sort operator is used to sort a relation, returning a relation that follows list semantics. The ordering is defined according to selected attributes and with a certain direction for each attribute (ascending or descending ), e.g. . Additionally, the top operator (Li et al., 2005) takes a list relation as its input, skips the first tuples and returns the next tuples. The default values are 0 for and for .
As the operators in our nested bag algebra do not define an ordering, a standalone sort or top operator would have no clear semantics. Hence, we only allow these operators to be combined into a single sort-and-top operator.
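The combined operator can be sketched in Python as a single function, so that the ordering never escapes as a property of an intermediate relation. The function signature is an assumption, as is the use of `None` to express the absence of a limit.

```python
def sort_and_top(r, keys, skip=0, limit=None):
    """Sort by `keys`, then skip `skip` tuples and return the next `limit`.

    keys: list of (attribute, descending) pairs, most significant first.
    """
    rows = list(r)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields the lexicographic order.
    for attr, descending in reversed(keys):
        rows.sort(key=lambda t: t[attr], reverse=descending)
    end = None if limit is None else skip + limit
    return rows[skip:end]


# Assumed sample relation.
r = [
    {"name": "a", "age": 30},
    {"name": "b", "age": 41},
    {"name": "c", "age": 30},
]
page = sort_and_top(r, [("age", True), ("name", False)], skip=1, limit=2)
```

Here the tuples are ordered by descending age and then ascending name, the oldest person is skipped, and the next two tuples are returned.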
Grouping and aggregation
The grouping operator groups tuples according to their value in one or more attributes and aggregates the remaining attributes. As determining the attributes of the grouping criteria is non-trivial, the grouping operator explicitly states these attributes. We use the notation , where form the grouping criteria, i.e. the list of expressions whose values partition the incoming tuples into groups. For every group this aggregation operator emits a single tuple of expressions with aliases , respectively. We demonstrate the unwind, grouping, sort, and top operators using a single example:
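As a hedged illustration of the grouping operator alone (not the combined example referred to above), the sketch below takes the grouping criteria explicitly and emits one tuple per group. The function signature, the restriction of grouping criteria to plain attributes, and the data are assumptions.

```python
def group(r, criteria, aggregates):
    """Group tuples by `criteria` and compute aliased aggregates per group.

    criteria:   list of grouping attributes, stated explicitly
    aggregates: alias -> (attribute, aggregation function over a list of values)
    """
    groups = {}
    for t in r:
        key = tuple(t[c] for c in criteria)
        groups.setdefault(key, []).append(t)

    out = []
    for key, members in groups.items():
        row = dict(zip(criteria, key))
        for alias, (attr, fn) in aggregates.items():
            row[alias] = fn([m[attr] for m in members])
        out.append(row)
    return out


# Assumed sample relation, e.g. the result of unwinding a 'speaks' bag.
r = [
    {"lang": "en", "p": "Alice"},
    {"lang": "de", "p": "Alice"},
    {"lang": "en", "p": "Bob"},
]
speakers = group(r, ["lang"], {"speakers": ("p", len)})
```

Passing an empty list of criteria yields a single group over the whole (non-empty) input, which corresponds to a global aggregation.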
In this section, we defined the operators of GRA and gave an informal specification for compiling from openCypher queries. Table 2 shows a compact mapping of openCypher queries to GRA expressions. Note that the get-edges operator is not needed to capture the mapping—instead, only the get-vertices nullary operators are used and edges are inserted by the expand and transitive expand operators. For a more detailed mapping, we refer the reader to (Marton et al., 2017).