Rewriting techniques have been widely used in different areas such as operational semantics of declarative languages or automated theorem proving. In this paper, our main aim is to propose to use such techniques in the case of graph-oriented database languages.
Current developments in database theory show a clear shift from relational to graph-oriented databases. Relational databases are now well mastered and have been largely investigated in the literature, with SQL as an ISO standard language [8, 9]. On the other hand, graphs are widely used as a flexible data model for numerous database applications, and various graph query languages have emerged, such as SPARQL, Cypher or G-CORE, to quote a few. An ISO project for a standard language for graph-oriented databases, called GQL, has also emerged recently (https://www.gqlstandards.org/).
Representing data graphically is quite legible. However, there is always a dilemma in choosing the right notion of graph when modeling applications. This issue is already present in well-investigated domains such as modeling languages or graph transformation. Graph-oriented data representation does not escape this dilemma. We can quote, for example, RDF graphs, on which SPARQL is based, or Property Graphs, currently used in several languages such as Cypher, G-CORE or the forthcoming GQL language.
In addition to the possibility of using different graph representations for data, graph database languages feature new kinds of queries such as graph-to-graph queries, cf. CONSTRUCT queries in SPARQL or G-CORE, besides the classical graph-to-relation (table) queries such as SELECT or MATCH queries in SPARQL or Cypher. The former constitute a class of queries which transform a graph database into another graph database. The latter transform a graph into a multiset of solutions, generally represented by means of a table, just as in the classical relational framework.
In general, graph query processing integrates features shared with graph transformation techniques (database transformation) and goal solving (variable assignments). Our main aim in this paper is to define an operational semantics, based on rewriting techniques, for graph-oriented queries. We propose a generic rule-based calculus, called gql-narrowing, which is parameterized by the actual interpretations of graphs and their matches (homomorphisms). That is to say, the obtained calculus can be adapted to different definitions of graph and the corresponding notion of match. The proposed calculus consists of a dedicated rewriting system and a narrowing-like procedure which follows closely the formal semantics of patterns or queries, in the same way as the (SLD-)resolution calculus is related to the formal models underlying Horn or Datalog clauses. The use of rewriting techniques in defining the proposed operational semantics paves the way to syntactic analysis and automated verification techniques for the proposed core language.
In order to define a sound and complete calculus, we first propose a uniform formal semantics for queries. We consider graph-to-graph queries and graph-to-table queries as two facets of one and the same syntactic object, which we call a pattern. The semantics of a pattern is a set of matches, that is to say, a set of graph homomorphisms and not only a set of variable assignments as proposed in [3, 11]. From such a set of matches, one can easily display either the tables, by considering the images of the variables as defined by the matches, or the target graph of the matches, or even both tables and graphs. Our semantics for patterns allows us to write nested patterns in a natural way, that is, new data graphs can be constructed on the fly before being queried.
The paper is organized as follows: next section introduces a graph query algebra featuring some key operations needed to express the proposed calculus. Section 3 defines the syntax of patterns and queries as well as their formal semantics. In Section 4, a sound and complete calculus is given. First we introduce a rewriting system describing how query results are found. Then, we define gql-narrowing, which is associated with the proposed rules. Concluding remarks and related work are given in Section 5.
2 Graph Query Algebra
During a query answering process, different intermediate results can be computed and composed. In this section, we introduce a Graph Query Algebra which consists of a family of operations over graphs, matches (graph homomorphisms) and expressions. These operations are used later on to define the semantics of queries, see Sections 3 and 4.
2.1 Signature for the Graph Query Algebra
The algebra is defined over a signature. The main sorts of this signature are Gr, Som, Exp and Var to be interpreted as graphs, sets of matches, expressions and variables, respectively, as explained in Sections 2.2, 2.3, 2.4 and 2.5. The sort Var is a subsort of Exp. The main operators of the signature are:
The above sorts and operations are given as an indication while being inspired by concrete languages. They may be modified or adapted according to actual graph-oriented query languages.
2.2 An Actual Interpretation of Graphs
Various interpretations of sorts Gr and Som can be given. In order to provide concrete examples, we have to fix an actual interpretation of these sorts. For all the examples given in the paper, we have chosen to interpret the sort Gr as generalized RDF graphs. We could of course have chosen other notions of graphs such as property graphs. Our choice here is motivated by the simplicity of RDF graph definition (set of triples).
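Below, we define generalized RDF graphs. They are the usual RDF graphs but they may contain isolated nodes. Let $\mathcal{L}$ be a set, called the set of labels, made of the union of two disjoint sets $\mathcal{C}$ and $\mathcal{V}$, called respectively the set of constants and the set of variables.
Definition 1 (graph)
Every element $t=(s,p,o)$ of $\mathcal{L}^3$ is called a triple and its members $s$, $p$ and $o$ are called respectively the subject, the predicate and the object of $t$. A graph $G$ is a pair $G=(G_N,G_T)$ made of a subset $G_N$ of $\mathcal{L}$ called the set of nodes of $G$ and a subset $G_T$ of $\mathcal{L}^3$ called the set of triples of $G$, such that the subject and the object of each triple of $G$ are nodes of $G$. The nodes of $G$ which are neither a subject nor an object are called the isolated nodes of $G$. The set of labels of a graph $G$ is the subset $\mathcal{L}(G)$ of $\mathcal{L}$ made of the nodes and predicates of $G$, then $\mathcal{C}(G)=\mathcal{C}\cap\mathcal{L}(G)$ and $\mathcal{V}(G)=\mathcal{V}\cap\mathcal{L}(G)$. The graph with an empty set of nodes and an empty set of triples is called the empty graph and is denoted by $\emptyset$. Given two graphs $G_1$ and $G_2$, the graph $G_1$ is a subgraph of $G_2$, written $G_1\subseteq G_2$, if $(G_1)_N\subseteq(G_2)_N$ and $(G_1)_T\subseteq(G_2)_T$, then $\mathcal{L}(G_1)\subseteq\mathcal{L}(G_2)$. The union $G_1\cup G_2$ is the graph defined by $(G_1\cup G_2)_N=(G_1)_N\cup(G_2)_N$ and $(G_1\cup G_2)_T=(G_1)_T\cup(G_2)_T$, then $\mathcal{L}(G_1\cup G_2)=\mathcal{L}(G_1)\cup\mathcal{L}(G_2)$.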
In the rest of the paper we write graphs as a couple made of a set of triples and a set of nodes: for example the graph which is made of four nodes and one triple is written as .
We define a toy database which is used as a running example throughout the paper. The database consists of persons who are either professors or students, with topics such that each professor teaches some topics and each student studies some topics.
(Alice, is, Professor), (Alice, teaches, Mathematics),
(Bob, is, Professor), (Bob, teaches, Informatics),
(Charlie, is, Student), (Charlie, studies, Mathematics),
(David, is, Student), (David, studies, Mathematics),
(Eric, is, Student), (Eric, studies, Informatics)
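To make Definition 1 concrete, here is a minimal Python sketch of generalized RDF graphs, instantiated on the toy database. The names `Graph`, `graph_from_triples` and `DB` are illustrative assumptions and not part of the paper; labels are modelled as plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """A generalized RDF graph: a set of nodes plus a set of triples."""
    nodes: frozenset
    triples: frozenset

    def __post_init__(self):
        # The subject and the object of every triple must be nodes of the graph.
        for s, p, o in self.triples:
            if s not in self.nodes or o not in self.nodes:
                raise ValueError(f"triple {(s, p, o)} uses a label that is not a node")

    def labels(self):
        # Labels of the graph: its nodes together with its predicates.
        return self.nodes | {p for _s, p, _o in self.triples}

    def isolated_nodes(self):
        # Nodes that are neither a subject nor an object of any triple.
        used = {s for s, _p, _o in self.triples} | {o for _s, _p, o in self.triples}
        return self.nodes - used

    def is_subgraph_of(self, other):
        return self.nodes <= other.nodes and self.triples <= other.triples

    def union(self, other):
        return Graph(self.nodes | other.nodes, self.triples | other.triples)


def graph_from_triples(triples, extra_nodes=()):
    """Hypothetical helper: nodes are the subjects and objects, plus any isolated nodes."""
    triples = frozenset(triples)
    nodes = {s for s, _p, _o in triples} | {o for _s, _p, o in triples}
    return Graph(frozenset(nodes | set(extra_nodes)), triples)


# The toy database of professors, students and topics.
DB = graph_from_triples([
    ("Alice", "is", "Professor"), ("Alice", "teaches", "Mathematics"),
    ("Bob", "is", "Professor"), ("Bob", "teaches", "Informatics"),
    ("Charlie", "is", "Student"), ("Charlie", "studies", "Mathematics"),
    ("David", "is", "Student"), ("David", "studies", "Mathematics"),
    ("Eric", "is", "Student"), ("Eric", "studies", "Informatics"),
])
```

Passing `extra_nodes` is the only way to obtain isolated nodes in this sketch, which is what distinguishes generalized RDF graphs from plain sets of triples.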
Below, we define the notion of match which will be used, notably, to represent results of queries.
Definition 2 (match)
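A graph homomorphism from a graph $L$ to a graph $G$, denoted $m:L\to G$, is a function $m$ from the labels of $L$ to the labels of $G$ which preserves nodes and preserves triples, in the sense that $m(n)$ is a node of $G$ for each node $n$ of $L$ and $(m(s),m(p),m(o))$ is a triple of $G$ for each triple $(s,p,o)$ of $L$. A match is a graph homomorphism $m:L\to G$ which fixes the constants, in the sense that $m(c)=c$ for each constant $c$ among the labels of $L$.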
When is an isolated node of then the node does not have to be isolated in . A match determines two functions and , restrictions of and respectively. A match is invertible if and only if both functions and are bijections. This means that a function from to is an invertible match if and only if with for each and is a bijection from to : thus, is the same as up to variable renaming. It follows that the symbol used for naming a variable does not matter as long as graphs are considered only up to invertible matches.
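The conditions of Definition 2 can be checked mechanically. The sketch below is only an illustration under stated assumptions: a match is represented as a Python dict defined on the variables of the source graph, constants are fixed implicitly, and variables are recognized by a leading "?". The function names are hypothetical.

```python
def is_variable(label):
    # Assumption: variables are exactly the labels starting with "?".
    return label.startswith("?")


def apply(m, label):
    # Constants are fixed; variables are mapped by the dict m (which must be total on them).
    return m[label] if is_variable(label) else label


def is_match(m, src_nodes, src_triples, tgt_nodes, tgt_triples):
    """Check that m preserves nodes and triples from the source graph to the target graph."""
    nodes_ok = all(apply(m, n) in tgt_nodes for n in src_nodes)
    triples_ok = all(
        tuple(apply(m, x) for x in triple) in tgt_triples for triple in src_triples
    )
    return nodes_ok and triples_ok


# A source graph with one triple and one isolated node ?z.
src_nodes = {"?x", "?y", "?z"}
src_triples = {("?x", "teaches", "?y")}
tgt_nodes = {"Alice", "Mathematics"}
tgt_triples = {("Alice", "teaches", "Mathematics")}

# The isolated node ?z is sent to Alice, which is not isolated in the target.
ok = is_match({"?x": "Alice", "?y": "Mathematics", "?z": "Alice"},
              src_nodes, src_triples, tgt_nodes, tgt_triples)
```

The example illustrates the remark that the image of an isolated node does not have to be isolated.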
Notice that RDF graphs are graphs according to Definition 1 but without isolated nodes, where constants are either IRIs (Internationalized Resource Identifiers) or literals, all predicates are IRIs and only objects can be literals. Blank nodes in RDF graphs are the same as variable nodes in our graphs. An isomorphism of RDF graphs is an invertible match in the sense of Definition 2.
2.3 More Definitions on Matches
Below we introduce some useful definitions on matches. Notice that we do not consider a match as a simple variable assignment but rather as a graph homomorphism with clearly identified source and target graphs. This nuance in the definition of matches is important in the rest of the paper.
Definition 3 (compatible matches)
Two matches and are compatible, written as , if for each . Given two compatible matches and , let denote the unique match such that and (which means that coincides with on and with on ).
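As an illustration of Definition 3, the following sketch represents each match by its variable assignment (a dict), which is a simplification: the source and target graphs that the paper attaches to a match are left implicit here. The names `compatible` and `join` are assumptions made for this example.

```python
def compatible(m1, m2):
    """Two assignments are compatible when they agree on every shared variable."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def join(m1, m2):
    """Combine two compatible assignments into the one that extends both."""
    if not compatible(m1, m2):
        raise ValueError("matches are not compatible")
    return {**m1, **m2}


m = {"?p": "Alice", "?t": "Mathematics"}
n = {"?t": "Mathematics", "?s": "Charlie"}
mn = join(m, n)  # coincides with m on the variables of m and with n on those of n
```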
Definition 4 (building a match)
Let be a match and a graph. The match is the unique match (up to variable renaming) such that for each variable in :
and is the image of by .
Definition 5 (set of matches, assignment table)
Let and be graphs. A set of matches, all of them from to , is denoted and called a homogeneous set of matches, or simply a set of matches, with source and target . The image of by is the subgraph of . We denote the set of all matches from to . When is the empty graph this set has one unique element which is the inclusion of into , then we denote this one-element set and its empty subset. The assignment table of is the two-dimensional table with the elements of in its first row, then one row for each in , and the entry in row and column equals to .
Thus, the assignment table describes the set of functions , made of the functions for all . A set of matches is determined by the graphs and and the assignment table .
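In order to determine when professor ?p teaches topic ?t which is studied by student ?s we may consider the following graph, where ?p, ?t and ?s are variables. In all examples, variables are preceded by a “?”.
{ (?p, teaches, ?t), (?s, studies, ?t) }
There are 3 matches from this graph to the toy database of Example 1. The set of all these matches is described by the following assignment table:
?p      ?t            ?s
Alice   Mathematics   Charlie
Alice   Mathematics   David
Bob     Informatics   Eric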
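As a sanity check on the count of matches, here is a minimal backtracking sketch that enumerates the matches of a query graph into a set of triples. It assumes the query graph has no isolated nodes, represents matches as dicts on variables, and uses the hypothetical variable names ?p, ?t and ?s; none of the function names come from the paper.

```python
def is_variable(label):
    return label.startswith("?")


def extend(binding, pattern, fact):
    """Try to extend a binding so that the pattern triple is mapped onto the fact."""
    new = dict(binding)
    for p, f in zip(pattern, fact):
        if is_variable(p):
            if new.setdefault(p, f) != f:
                return None  # variable already bound to a different label
        elif p != f:
            return None  # constants must be fixed by a match
    return new


def all_matches(pattern_triples, data_triples):
    """Enumerate every variable assignment mapping all pattern triples to data triples."""
    matches = [{}]
    for pattern in pattern_triples:
        matches = [
            ext
            for m in matches
            for fact in data_triples
            if (ext := extend(m, pattern, fact)) is not None
        ]
    return matches


DATA = [
    ("Alice", "is", "Professor"), ("Alice", "teaches", "Mathematics"),
    ("Bob", "is", "Professor"), ("Bob", "teaches", "Informatics"),
    ("Charlie", "is", "Student"), ("Charlie", "studies", "Mathematics"),
    ("David", "is", "Student"), ("David", "studies", "Mathematics"),
    ("Eric", "is", "Student"), ("Eric", "studies", "Informatics"),
]
QUERY = [("?p", "teaches", "?t"), ("?s", "studies", "?t")]
MATCHES = all_matches(QUERY, DATA)
```

On the toy database, `MATCHES` contains three assignments, in agreement with the count stated in the text.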
Query languages usually feature a term algebra dedicated to expressing operations over integers, booleans and so forth. We do not care here about the way basic operations are chosen, but we want to deal with aggregation operations as in most database query languages. Thus, one can think of any kind of term algebra with operators which are classified as either basic operators (unary or binary) or aggregation operators (always unary). We consider that all expressions are well typed. Typically, and not exclusively, the sets of basic unary operators, basic binary operators and aggregation operators can be:
A group of expressions is a non-empty finite list of expressions.
Definition 6 (syntax of expressions)
Expressions and their sets of in-scope variables are defined recursively as follows, with , , , , , is a group of expressions:
, , , , ,
(the variables in must be distinct from those in ).
The value of an expression with respect to a set of matches (Definition 7) is a family of constants indexed by the set . When the expression is free from any aggregation operator then is simply . But in general depends on and and it may also depend on other matches in when involves aggregation operators. The value of a group of expressions with respect to is the list . To each basic operator is associated a function (or simply ) from constants to constants if is unary and from pairs of constants to constants if is binary. To each aggregation operator in is associated a function (or simply ) from multisets of constants to constants. Note that each family of constants determines a multiset of constants: for instance a family of constants indexed by the elements of a set of matches determines the multiset of constants , which is also denoted when there is no ambiguity. Some aggregation operators in are such that depends only on the set underlying the multiset , which means that does not depend on the multiplicities in the multiset : this is the case for MAX and MIN but not for SUM, AVG and COUNT. When with in then is applied to the underlying set of . For instance, counts the number of elements of the multiset with their multiplicities, while counts the number of distinct elements in .
Definition 7 (evaluation of expressions)
Let be a graph, an expression over and a set of matches. The value of with respect to is the family defined recursively as follows. It is assumed that each in this definition is a constant.
where is the subset of made of the matches in such that .
Note that is the same for all in while is the same for all and in such that .
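The following sketch gives one possible reading of how an aggregated expression is evaluated as a family of constants indexed by the matches, with an optional grouping expression. It is an assumption-laden illustration, not the paper's definition: matches are dicts, the family is a list aligned with the list of matches, and `evaluate_aggregate`, `count` and `count_distinct` are hypothetical names.

```python
def count(multiset):
    # COUNT takes multiplicities into account.
    return len(multiset)


def count_distinct(multiset):
    # COUNT DISTINCT is applied to the underlying set of the multiset.
    return len(set(multiset))


def evaluate_aggregate(agg, expr, matches, group_by=None):
    """Return one aggregated value per match.

    Without grouping, every match receives the aggregate over all matches.
    With grouping, a match receives the aggregate over the matches that
    give the same value to the grouping expression.
    """
    values = []
    for m in matches:
        if group_by is None:
            peers = matches
        else:
            peers = [p for p in matches if group_by(p) == group_by(m)]
        values.append(agg([expr(p) for p in peers]))
    return values


matches = [
    {"?p": "Alice", "?s": "Charlie"},
    {"?p": "Alice", "?s": "David"},
    {"?p": "Bob", "?s": "Eric"},
]
overall = evaluate_aggregate(count, lambda m: m["?s"], matches)
per_professor = evaluate_aggregate(
    count, lambda m: m["?s"], matches, group_by=lambda m: m["?p"]
)
```

In this sketch the ungrouped value is identical for all matches, and the grouped value is identical for matches that agree on the grouping expression, as the remark above states.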
The sorts Gr, Som, Exp and Var of the signature in Section 2.1 are interpreted in the algebra respectively as the set of graphs (Definition 1), the set of homogeneous sets of matches (Definition 5), the set of expressions (Definition 6) and its subset of variables. Then the operators of the signature are interpreted in the algebra by the operations with the same name in Definition 8. Whenever needed, we extend the target of matches: for every graph and every match where is a subgraph of we denote when is considered as a match from to .
Definition 8 ( operations)
For all graphs and :
is the set of all matches from to .
For all sets of matches and :
For every set of matches , every expression and every variable , let for each . Then:
Equivalently, this can be expressed as follows:
if then ,
For every set of matches and every expression :
For every set of matches and every graph :
For all sets of matches and :
3 Patterns and Queries
The syntax of graph-oriented databases is still evolving. We do not consider all technical syntactic details of a real-world language nor all possible constraints on matches. We focus on a core language. Its syntax reflects significant aspects of graph-oriented queries. Conditions on graph paths, which can be seen as constraints on matches, are omitted in this paper in order not to make the syntax too cumbersome. We consider mainly two syntactic categories: patterns and queries, in addition to expressions already mentioned in Section 2.4. Queries are either SELECT queries, as in most query languages, CONSTRUCT queries, as in SPARQL and G-CORE, or the new CONSELECT queries introduced in this paper. A SELECT query applied to a graph returns a table which describes a multiset of solutions or variable bindings, while a CONSTRUCT query applied to a graph returns a graph. A CONSELECT query applied to a graph returns both a graph and a table. On the other hand, a pattern applied to a graph returns a set of matches. Patterns are the basic blocks for building queries. They are defined in Section 3.1 together with their semantics. Queries are defined in Section 3.2 and their semantics is easily derived from the semantics of patterns. In this section, as in Section 2, the set of labels is the union of the disjoint sets and , of constants and variables respectively. We assume that the set of constants contains the numbers and strings and the boolean values true and false.
In Definition 9 patterns are built from graphs by using six operators: BASIC, JOIN, BIND, FILTER, BUILD and UNION. Then, in Definition 10 the formal semantics of patterns is given by an evaluation function.
Definition 9 (syntax of patterns)
Patterns and their scope graph are defined recursively as follows.
The symbol is a pattern, called the empty pattern, and is the empty graph .
If is a graph then is a pattern, called a basic pattern, and .
If and are patterns then is a pattern and .
If is a pattern, an expression such that and a variable then is a pattern and .
If is a pattern and an expression such that then is a pattern and .
If is a pattern and a graph then is a pattern and .
If and are patterns such that then is a pattern with .
The value of a pattern over a graph is a set of matches, as defined now.
Definition 10 (evaluation of patterns, set of solutions)
The set of solutions or the value of a pattern over a graph is a set of matches from the scope graph of to a graph that contains . This value is defined inductively as follows:
In all cases, the graph is built by adding to “whatever is required” for the evaluation. When is the empty pattern, the value of over is the empty subset of . Syntactically, each operator OP builds a pattern from a pattern and a parameter , which is either a pattern (for JOIN and UNION), a pair made of an expression and a variable (for BIND), an expression (for FILTER) or a graph (for BUILD). Semantically, for every pattern , let us denote for and for . In every case it is necessary to evaluate before evaluating : for JOIN and UNION this is because pattern is evaluated on , for BIND and FILTER because expression is evaluated with respect to , and for BUILD because of the definition of Build. Note that the semantics of and is not symmetric in and in general, unless and , which occurs when and are basic patterns. Given a pattern , the pattern is a subpattern of , as well as when or . The semantics of patterns is defined in terms of the semantics of its subpatterns (and the semantics of its other arguments, if any). Thus, for instance, BUILD patterns can be nested at any depth.
For every pattern , the set of in-scope variables of is the set of variables of the scope graph . An expression is over a pattern if .
Consider the following graph, where ?p, ?s and ?z are variables.
{ (?p, teaches, ?z), (?s, studies, ?z) }
Note that this graph is the same as the graph of Example 2, except for the name of one variable. In order to determine when professor ?p teaches some topic which is studied by student ?s, whatever the topic, we consider the following pattern.
BUILD( BASIC( { (?p, teaches, ?t), (?s, studies, ?t) } ),
       { (?p, teaches, ?z), (?s, studies, ?z) } )
Note that the variable in does not appear in . Since there are 3 matches from to (Example 2), the value of over is:
where , and are 3 fresh variables and:
(Alice, teaches, ), (Charlie, studies, ), (Alice, teaches, ), (David, studies, ), (Bob, teaches, ), (Eric, studies, )
We consider three kinds of queries: CONSTRUCT queries, SELECT queries and CONSELECT queries. We define the semantics of queries from the semantics of patterns. According to Definition 10, all patterns have a graph-to-set-of-matches semantics. In contrast, CONSTRUCT queries have a graph-to-graph semantics and SELECT queries have a graph-to-multiset-of-solutions or graph-to-table semantics, while CONSELECT queries have a graph-to-graph-and-table semantics.
Definition 12 (syntax of queries)
Let be a set of variables, a graph and a pattern. A query has one of the following three shapes:
Definition 13 (result of Construct queries)
Given a pattern and a graph consider the query and the pattern . The result of the query over a graph , denoted , is the subgraph of image of by the set of matches .
Thus, the result of a CONSTRUCT query over a graph is the graph built by “gluing” the graphs for all matches in , where is a copy of with each variable replaced by a fresh variable (which means, fresh for each and each ).
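The gluing described above can be sketched as follows. This is a simplified illustration and not the calculus of the paper: matches are dicts on the variables of the pattern, the template is a list of triples, and every template variable not bound by the match is replaced by a fresh variable obtained by suffixing the copy number (the sketch assumes no existing variable already carries such a name). All function and variable names are hypothetical.

```python
def is_variable(label):
    return label.startswith("?")


def instantiate(template, match, copy_id):
    """Copy of the template for one match, with unbound variables made fresh."""
    def image(label):
        if not is_variable(label):
            return label          # constants are kept
        if label in match:
            return match[label]   # variables bound by the match take their image
        return f"{label}_{copy_id}"  # fresh variable, one per match and per variable
    return {tuple(image(x) for x in triple) for triple in template}


def construct(template, matches):
    """Glue together one instantiated copy of the template per match."""
    result = set()
    for copy_id, match in enumerate(matches, start=1):
        result |= instantiate(template, match, copy_id)
    return result


matches = [
    {"?p": "Alice", "?t": "Mathematics", "?s": "Charlie"},
    {"?p": "Alice", "?t": "Mathematics", "?s": "David"},
    {"?p": "Bob", "?t": "Informatics", "?s": "Eric"},
]
template = [("?p", "teaches", "?z"), ("?s", "studies", "?z")]
result = construct(template, matches)
```

Because the fresh variable differs from one match to the next, the two copies mentioning Alice share the node Alice but not the topic variable.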
Consider the query:
CONSTRUCT { (?p, teaches, ?z), (?s, studies, ?z) }
WHERE BASIC( { (?p, teaches, ?t), (?s, studies, ?t) } )
The corresponding pattern and the value of over are as in Example 3. It follows that the result of the query over is the subgraph of image of by :
(Alice, teaches, ), (Charlie, studies, ), (Alice, teaches, ), (David, studies, ), (Bob, teaches, ), (Eric, studies, ) .
CONSTRUCT queries in SPARQL are similar to the CONSTRUCT queries considered in this paper: the variables in play the same role as the blank nodes in SPARQL. By considering BUILD patterns, thanks to the functional orientation of the definition of patterns, our language allows BUILD subpatterns: this is new and specific to the present study.
For SELECT queries we proceed as for CONSTRUCT queries: we define a transformation from each SELECT query to a BUILD pattern and a transformation from the result of pattern to the result of query . Definition 14 below deserves more explanation. However, this is not the subject of this paper; see  for details on how to turn a table into a graph.
Definition 14 (result of Select queries)
For every set of variables , let denote the graph made of the triples for where is a fresh variable and is a fresh constant string for each . Given a pattern and a set of variables consider the query and the pattern . The value of over a graph is a set of matches whose assignment table has columns, corresponding to the variables . The result of the query over a graph , denoted , is the multiset of solutions made of the rows of the assignment table of after dropping the column .
Consider the query:
Let where is a fresh variable and , are fresh distinct strings. Then the pattern corresponding to is:
The value of over is:
where , and are 3 fresh variables and:
(, , Alice), (, , Charlie), (, , Alice),
(, , David), (, , Bob), (, , Eric)
It follows that:
Alice   Charlie
Alice   David
Bob     Eric
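The table of this example can be reproduced by a direct projection of the matches onto the selected variables. Note that this is a shortcut chosen for illustration: the paper obtains the result through a BUILD pattern over an auxiliary graph, which this sketch does not model. Matches are dicts, the multiset of solutions is a `collections.Counter`, and the names `select`, ?p and ?s are assumptions.

```python
from collections import Counter


def select(variables, matches):
    """Multiset of rows obtained by keeping only the selected variables of each match."""
    return Counter(tuple(m[v] for v in variables) for m in matches)


matches = [
    {"?p": "Alice", "?t": "Mathematics", "?s": "Charlie"},
    {"?p": "Alice", "?t": "Mathematics", "?s": "David"},
    {"?p": "Bob", "?t": "Informatics", "?s": "Eric"},
]
table = select(["?p", "?s"], matches)
professors = select(["?p"], matches)
```

Selecting only the professor column keeps the row for Alice twice, which is why the result is a multiset rather than a set.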
Definition 15 (result of Conselect queries)
Given a pattern , a graph and a set of variables , let be the graph described in Definition 14. Consider the query and the pattern . The result of the query over a graph , denoted , is the pair consisting of the subgraph of image of by the set of matches and the multiset of solutions made of the rows of the assignment table of after dropping the column .
We illustrate CONSELECT queries through a toy example. The idea is to have a query that returns both a graph and a table as its result. Typically, this may be helpful when one wants to query statistical facts about the generated graph. Let us consider the database defined in Example 1. We propose the following query, which generates a graph representing professors and their supervised students, accompanied by simple statistics about the number of students supervised by each professor.
The result of this query is the list of professors with the number of students they supervise (in our toy database, Alice has two students, and Bob has one student) together with the graph of students supervised by a professor. The expected graph and table are displayed below:
4 A Sound and Complete Calculus
In this section we propose a calculus for solving patterns and queries based on a relation over patterns called gql-narrowing. It computes values (i.e., sets of solutions) of patterns (Definition 10) and results of queries (Definitions 13, 14 and 15) over any graph. This calculus is sound and complete with respect to the set-theoretic semantics given in Section 3.
In functional and logic programming languages, narrowing or resolution  derivations are used to solve goals and may have the following shape where is the initial goal to solve (e.g., conjunction of atoms, equations or a (boolean) term) and is a “terminal” goal such as the empty clause, unifiable equations or the constant true :