The tractability frontier of well-designed SPARQL queries

12/23/2017
by   Miguel Romero, et al.
University of Oxford

We study the complexity of query evaluation of SPARQL queries. We focus on the fundamental fragment of well-designed SPARQL restricted to the AND, OPTIONAL and UNION operators. Our main result is a structural characterisation of the classes of well-designed queries that can be evaluated in polynomial time. In particular, we introduce a new notion of width called domination width, which relies on the well-known notion of treewidth. We show that, under some complexity theoretic assumptions, the classes of well-designed queries that can be evaluated in polynomial time are precisely those of bounded domination width.


1 Introduction

The Resource Description Framework (RDF) [20] is the W3C standard for representing linked data on the Web. In this model, data is represented as RDF graphs, which consist of collections of triples of internationalised resource identifiers (IRIs). Intuitively, such a triple $(s, p, o)$ represents the fact that a subject $s$ is connected to an object $o$ via a predicate $p$.

SPARQL [26] is the standard query language for RDF graphs. In a seminal paper, Pérez et al. [23] (see also [22]) gave a clean formalisation of the language, which laid the foundations for its theoretical study. Since then, a lot of work has been done on different aspects of the language such as query evaluation [19, 3, 15, 4, 16], optimisation [17, 24, 14], and expressive power [2, 25, 15, 30, 11], to name a few.

As shown in [23], it is PSPACE-complete to evaluate SPARQL queries. This motivated the introduction of a natural fragment of SPARQL called the well-designed fragment, whose evaluation problem is coNP-complete [23]. More formally, the evaluation problem wdEVAL for well-designed SPARQL is to decide, given a well-designed query $P$, an RDF graph $G$ and a mapping $\mu$, whether $\mu$ belongs to the answer of $P$ over $G$. By now the well-designed fragment is central in the study of SPARQL, and the theory community has devoted considerable effort to understanding its fundamental aspects (see e.g. [23, 17, 24, 4, 14, 11, 16, 15]). In this paper, we focus on the core fragment of well-designed SPARQL restricted to the AND, OPTIONAL and UNION operators, as defined in [23].

Despite its importance, several basic questions remain open for well-designed SPARQL. As first observed in [17], while the problem wdEVAL is coNP-complete, it becomes tractable, i.e. polynomial-time solvable, for restricted classes of well-designed queries. Indeed, it was shown that wdEVAL is in PTIME for every class of queries satisfying a certain local tractability condition [17]. We emphasise that the above-mentioned result is only briefly discussed in [17], as the focus of the authors is on the static analysis and optimisation of queries rather than on the complexity of evaluation. Subsequent works [4, 16] have studied the complexity of evaluation in more depth, but the focus has been mainly on the fragment of SPARQL including the SELECT operator (i.e., projection). In particular, the following fundamental question regarding the core well-designed fragment remains open: which classes of well-designed SPARQL can be evaluated in polynomial time?

Our main contribution is a complete answer to the question posed above. In particular, we introduce a new width measure for well-designed queries called domination width, which is based on the well-known notion of treewidth (see Section 3 for precise definitions). For a class $\mathcal{C}$ of well-designed queries, let us denote by wdEVAL($\mathcal{C}$) the evaluation problem wdEVAL restricted to the class $\mathcal{C}$. Also, we say that a class $\mathcal{C}$ of well-designed queries has bounded domination width if there is a universal constant $k \geq 1$ such that the domination width of every query in $\mathcal{C}$ is at most $k$. Then, our main technical result is as follows (Theorem 3). Assume that FPT $\neq$ W[1]. Then, for every recursively enumerable class $\mathcal{C}$ of well-designed queries, the problem wdEVAL($\mathcal{C}$) is in PTIME if and only if $\mathcal{C}$ has bounded domination width. The assumption FPT $\neq$ W[1] is widely believed in parameterised complexity (see Section 4 for precise definitions). As we observe in Section 3, one can remove the assumption that $\mathcal{C}$ is recursively enumerable by adopting an assumption stronger than FPT $\neq$ W[1] that involves non-uniform complexity classes.

Our result builds on the classical results of Dalmau et al. [6] and Grohe [9] showing that a recursively enumerable class $\mathcal{C}$ of conjunctive queries (CQs) over schemas of bounded arity is tractable if and only if the cores of the CQs in $\mathcal{C}$ have bounded treewidth. (Recall that a CQ is a first-order query using only conjunctions and existential quantification.)

For the tractability part of our result, we exploit, as in [6], the so-called existential pebble game introduced in [12] (see also [6]). This game provides a polynomial-time relaxation for the problem of checking the existence of homomorphisms, which is a well-known NP-complete problem (see e.g. [5]). Using the existential pebble game, we define a natural relaxation of the standard algorithm from [17] (see also [24]) for evaluating well-designed queries. Then we show that this relaxation correctly solves instances of bounded domination width (Theorem 1).

For the hardness part, we follow a strategy similar to that of [9]. The two main ingredients in our proof are an adaptation of the main construction of [9] to handle distinguished elements or constants (Lemma 2) and an elementary property of well-designed queries of large domination width (Lemma 3).

Finally, we emphasise that our classes of bounded domination width significantly extend the classes that are locally tractable [17], which, as we mentioned above, are the most general tractable restrictions known so far. This is even true in the case of UNION-free well-designed queries. As we discuss in Section 3.2, the notion of domination width for UNION-free queries can be simplified and coincides with a width measure called branch treewidth. Bounding this simpler width measure still strictly generalises local tractability.

Organisation. We present the basic definitions in Section 2. In Section 3, we introduce the measure of domination width and present our main tractability result. The main hardness result is presented in Section 4. We conclude with some final remarks in Section 5.

2 Preliminaries

RDF Graphs. Let $\mathbf{I}$ be a countably infinite set of IRIs. An RDF triple is a tuple in $\mathbf{I} \times \mathbf{I} \times \mathbf{I}$ and an RDF graph is a finite set of RDF triples. In this paper, we assume that no blank nodes appear in RDF graphs, i.e., we focus on ground RDF graphs.

SPARQL Syntax. SPARQL [26] is the standard query language for RDF. We rely on the formalisation proposed in [23]. We focus on the core fragment of the language given by the operators AND, OPTIONAL (OPT for short), and UNION. (Footnote 1: Additional operators include FILTER and SELECT. We briefly discuss these operators in Section 5.) Let $\mathbf{V}$ be a countably infinite set of variables, disjoint from $\mathbf{I}$. A SPARQL triple pattern (or triple pattern for short) is a tuple in $(\mathbf{I} \cup \mathbf{V}) \times (\mathbf{I} \cup \mathbf{V}) \times (\mathbf{I} \cup \mathbf{V})$. The set of variables from $\mathbf{V}$ appearing in a triple pattern $t$ is denoted by $\mathrm{vars}(t)$. Note that an RDF triple is simply a SPARQL triple pattern $t$ with $\mathrm{vars}(t) = \emptyset$. A SPARQL graph pattern (or graph pattern for short) is recursively defined as follows:

  1. a triple pattern is a graph pattern, and

  2. if $P_1$ and $P_2$ are graph patterns, then $(P_1 \ast P_2)$ is also a graph pattern, for $\ast \in \{\mathrm{AND}, \mathrm{OPT}, \mathrm{UNION}\}$.

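SPARQL Semantics. In order to define the semantics of graph patterns, we follow again the presentation in [23]. A mapping $\mu$ is a partial function from $\mathbf{V}$ to $\mathbf{I}$. We denote by $\mathrm{dom}(\mu)$ the domain of the mapping $\mu$. Two mappings $\mu_1$ and $\mu_2$ are compatible if $\mu_1(?x) = \mu_2(?x)$, for all $?x \in \mathrm{dom}(\mu_1) \cap \mathrm{dom}(\mu_2)$. If $\mu_1$ and $\mu_2$ are compatible mappings then $\mu_1 \cup \mu_2$ denotes the mapping with domain $\mathrm{dom}(\mu_1) \cup \mathrm{dom}(\mu_2)$ such that $(\mu_1 \cup \mu_2)(?x) = \mu_1(?x)$, for all $?x \in \mathrm{dom}(\mu_1)$, and $(\mu_1 \cup \mu_2)(?x) = \mu_2(?x)$, for all $?x \in \mathrm{dom}(\mu_2)$. For a triple pattern $t$ and a mapping $\mu$ such that $\mathrm{vars}(t) \subseteq \mathrm{dom}(\mu)$, we denote by $\mu(t)$ the RDF triple obtained from $t$ by replacing each $?x \in \mathrm{vars}(t)$ by $\mu(?x)$.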

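For an RDF graph $G$ and a graph pattern $P$, the evaluation of $P$ over $G$ is a set of mappings $[\![P]\!]_G$ defined recursively as follows:

  1. $[\![t]\!]_G = \{\mu \mid \mathrm{dom}(\mu) = \mathrm{vars}(t) \text{ and } \mu(t) \in G\}$, if $t$ is a triple pattern.

  2. $[\![P_1 \;\mathrm{AND}\; P_2]\!]_G = \{\mu_1 \cup \mu_2 \mid \mu_1 \in [\![P_1]\!]_G,\ \mu_2 \in [\![P_2]\!]_G$, and $\mu_1, \mu_2$ are compatible$\}$.

  3. $[\![P_1 \;\mathrm{OPT}\; P_2]\!]_G = [\![P_1 \;\mathrm{AND}\; P_2]\!]_G \cup \{\mu_1 \mid \mu_1 \in [\![P_1]\!]_G$ and there is no $\mu_2 \in [\![P_2]\!]_G$ compatible with $\mu_1\}$.

  4. $[\![P_1 \;\mathrm{UNION}\; P_2]\!]_G = [\![P_1]\!]_G \cup [\![P_2]\!]_G$.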
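The four semantic rules can be read directly as a recursive evaluator. The following is a minimal sketch under assumptions that are not part of the paper: terms are Python strings, variables are the strings starting with `?`, a triple pattern is a 3-tuple of strings, a compound pattern is a tuple `(op, P1, P2)`, and a mapping is a frozenset of `(variable, IRI)` pairs. The function names are hypothetical.

```python
def is_var(term):
    # Assumption: variables are exactly the strings starting with "?".
    return term.startswith("?")


def compatible(m1, m2):
    """Two mappings are compatible if they agree on every shared variable."""
    d1, d2 = dict(m1), dict(m2)
    return all(d1[v] == d2[v] for v in d1.keys() & d2.keys())


def match_triple(pattern, graph):
    """Rule 1: mappings mu with dom(mu) = vars(t) and mu(t) in G."""
    result = set()
    for triple in graph:
        mapping = {}
        for term, value in zip(pattern, triple):
            if is_var(term):
                if mapping.setdefault(term, value) != value:
                    break  # the same variable would need two different values
            elif term != value:
                break      # an IRI in the pattern does not match the triple
        else:
            result.add(frozenset(mapping.items()))
    return result


def evaluate(pattern, graph):
    """Evaluate a graph pattern over an RDF graph (a set of 3-tuples)."""
    if isinstance(pattern[1], str):          # a triple pattern (s, p, o)
        return match_triple(pattern, graph)
    op, left, right = pattern
    left_maps = evaluate(left, graph)
    right_maps = evaluate(right, graph)
    if op == "UNION":                        # rule 4
        return left_maps | right_maps
    joined = {m1 | m2 for m1 in left_maps for m2 in right_maps
              if compatible(m1, m2)}         # rule 2
    if op == "AND":
        return joined
    if op == "OPT":                          # rule 3
        unmatched = {m1 for m1 in left_maps
                     if not any(compatible(m1, m2) for m2 in right_maps)}
        return joined | unmatched
    raise ValueError(f"unknown operator: {op}")


graph = {("a", "knows", "b"), ("b", "email", "m"), ("c", "knows", "d")}
query = ("OPT", ("?x", "knows", "?y"), ("?y", "email", "?e"))
answers = evaluate(query, graph)
```

On this toy graph, the mapping for `a` is extended with the optional e-mail binding, while the mapping for `c` is kept as is because no compatible extension exists.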

Well-designed SPARQL. A central class of SPARQL graph patterns identified in [23], and also the focus of this paper, is the class of well-designed graph patterns. We say that a graph pattern is UNION-free if it only uses the operators AND and OPT. A UNION-free graph pattern $P$ is well-designed if for every subpattern $P' = (P_1 \;\mathrm{OPT}\; P_2)$ of $P$, it is the case that every variable occurring in $P_2$ but not in $P_1$ does not occur outside $P'$ in $P$. A SPARQL graph pattern $P$ is well-designed if it is of the form $P = P_1 \;\mathrm{UNION}\; \cdots \;\mathrm{UNION}\; P_m$, where each $P_i$ is a UNION-free well-designed graph pattern. (Footnote 2: This top-level use of the UNION operator is known as UNION-normal form [23]. Note that we are implicitly using the fact that UNION is associative.)
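The well-designedness condition for UNION-free patterns can be checked syntactically. The sketch below assumes a hypothetical encoding in which a triple pattern is a 3-tuple of strings, variables start with `?`, and a compound pattern is a tuple `(op, P1, P2)` with `op` in `{"AND", "OPT"}`. The two sample patterns are illustrative only and are not the patterns of Example 1.

```python
def pattern_vars(p):
    """Variables occurring anywhere in a (UNION-free) graph pattern."""
    if isinstance(p[1], str):                      # triple pattern
        return {x for x in p if x.startswith("?")}
    return pattern_vars(p[1]) | pattern_vars(p[2])


def leaves(p, path=()):
    """Yield (path, triple pattern) for every triple pattern in p."""
    if isinstance(p[1], str):
        yield path, p
    else:
        yield from leaves(p[1], path + (1,))
        yield from leaves(p[2], path + (2,))


def opt_nodes(p, path=()):
    """Yield (path, subpattern) for every subpattern of the form (P1 OPT P2)."""
    if isinstance(p[1], str):
        return
    if p[0] == "OPT":
        yield path, p
    yield from opt_nodes(p[1], path + (1,))
    yield from opt_nodes(p[2], path + (2,))


def is_well_designed(pattern):
    all_leaves = list(leaves(pattern))
    for path, (_, left, right) in opt_nodes(pattern):
        # Variables of P2 that do not occur in P1.
        risky = pattern_vars(right) - pattern_vars(left)
        for leaf_path, leaf in all_leaves:
            outside = leaf_path[:len(path)] != path   # leaf lies outside P'
            if outside and risky & pattern_vars(leaf):
                return False
    return True


A, B, C = ("?x", "p", "?y"), ("?y", "q", "?z"), ("?z", "r", "?w")
good = ("OPT", A, B)
bad = ("AND", ("OPT", A, B), C)   # ?z occurs in B, not in A, and again in C
```

Here `is_well_designed(good)` holds, while `bad` is rejected because `?z` is introduced on the optional side and reused outside the OPT subpattern.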

Example 1

Consider the following graph patterns:

Note that is well-designed, while is not. Indeed, in the subpattern of , the variable appears in and not in but does occur outside in .

Well-designed patterns have good properties in terms of query evaluation. More precisely, let wdEVAL be the problem of deciding, given a well-designed graph pattern $P$, an RDF graph $G$ and a mapping $\mu$, whether $\mu \in [\![P]\!]_G$. It was shown in [23] that wdEVAL is coNP-complete, while the problem is PSPACE-complete for arbitrary SPARQL graph patterns.

2.1 Pattern trees and pattern forests

Besides alleviating the cost of evaluation, another key property of UNION-free well-designed graph patterns is that they can be written in the so-called OPT-normal form [23]. In turn, patterns in OPT-normal form admit a natural tree representation, known as pattern trees [17]. Intuitively, a pattern tree is a rooted tree where each node represents a well-designed pattern using only AND operators, while its tree structure represents the nesting of OPT operators. Consequently, a well-designed graph pattern $P = P_1 \;\mathrm{UNION}\; \cdots \;\mathrm{UNION}\; P_m$ can be represented as a pattern forest [24], i.e., a set of pattern trees $\{\mathcal{T}_1, \dots, \mathcal{T}_m\}$, where $\mathcal{T}_i$ is the pattern tree representation of $P_i$. (Footnote 3: In this paper, we work with a particular type of pattern trees/forests, namely well-designed pattern trees/forests. For simplicity, we sometimes abuse notation and use the terms pattern trees/forests and well-designed pattern trees/forests interchangeably.) Pattern trees/forests are useful for understanding how to evaluate and optimise well-designed patterns, and have been used extensively as a basic tool in the study of well-designed SPARQL (see e.g. [17, 24, 4, 14, 11, 16]). As we show in this work, pattern forests are also fundamental to understanding tractable evaluation of well-designed SPARQL: by imposing restrictions on the pattern forest representation, we can identify and characterise the tractable classes of well-designed graph patterns.

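T-graphs and homomorphisms. A triple pattern graph (or t-graph for short) is a finite set of triple patterns. We denote by $\mathrm{vars}(S)$ the set of variables from $\mathbf{V}$ appearing in the t-graph $S$. Note that an RDF graph is simply a t-graph $S$ with $\mathrm{vars}(S) = \emptyset$. Let $t$ be a triple pattern and $h$ be a partial function from $\mathbf{V}$ to $\mathbf{I} \cup \mathbf{V}$ such that $\mathrm{vars}(t) \subseteq \mathrm{dom}(h)$. We define $h(t)$ to be the triple pattern obtained from $t$ by replacing each $?x \in \mathrm{vars}(t)$ by $h(?x)$. For two t-graphs $S$ and $S'$, we say that a partial function $h$ from $\mathbf{V}$ to $\mathbf{I} \cup \mathbf{V}$ is a homomorphism from $S$ to $S'$ if $\mathrm{dom}(h) = \mathrm{vars}(S)$ and for every $t \in S$, it is the case that $h(t) \in S'$.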
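To make the notion of homomorphism between t-graphs concrete, here is a brute-force search, given as an illustration only. It runs in exponential time, in line with the NP-completeness of the problem. The encoding (triples as 3-tuples of strings, variables starting with `?`) and the optional `fixed` argument, which pins the images of some variables, are assumptions of this sketch.

```python
from itertools import product


def variables(tgraph):
    """Variables (strings starting with "?") occurring in a t-graph."""
    return {term for triple in tgraph for term in triple if term.startswith("?")}


def apply(h, triple):
    """Replace every variable in dom(h) by its image; keep other terms."""
    return tuple(h.get(term, term) for term in triple)


def find_homomorphism(source, target, fixed=None):
    """Return some h with h(t) in target for all t in source, or None.

    `fixed` is an optional partial assignment that h must extend.
    Every candidate assignment is tried, so this is exponential.
    """
    fixed = dict(fixed or {})
    free = sorted(variables(source) - set(fixed))
    candidates = sorted({term for triple in target for term in triple})
    for image in product(candidates, repeat=len(free)):
        h = {**fixed, **dict(zip(free, image))}
        if all(apply(h, t) in target for t in source):
            return h
    return None


path = {("?x", "p", "?y"), ("?y", "p", "?z")}
cycle = {("?x", "p", "?y"), ("?y", "p", "?x")}
rdf = {("a", "p", "b"), ("b", "p", "c")}
```

With these inputs, `find_homomorphism(path, rdf)` maps the two-step path onto `a, b, c`, while the 2-cycle admits no homomorphism into `rdf`.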

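Basics of pattern trees and forests. For an undirected graph $H$, we denote by $V(H)$ its set of nodes. A well-designed pattern tree (or wdPT for short) is a triple $\mathcal{T} = (T, r, \lambda)$ such that

  1. $T$ is a tree rooted at a node $r \in V(T)$,

  2. $\lambda$ is a function that maps each node $n \in V(T)$ to a t-graph, and

  3. the set $\{n \in V(T) \mid {?x} \in \mathrm{vars}(\lambda(n))\}$ induces a connected subgraph of $T$, for every $?x \in \mathbf{V}$.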

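Let $\mathcal{T} = (T, r, \lambda)$ be a wdPT. A wdPT $\mathcal{T}' = (T', r', \lambda')$ is a subtree of $\mathcal{T}$ if (i) $T'$ is a subtree of $T$, (ii) $r' = r$, and (iii) $\lambda'(n) = \lambda(n)$, for all $n \in V(T')$. Note that any subtree of $\mathcal{T}$ contains the original root $r$. A child of the subtree $\mathcal{T}'$ is a node $n \in V(T) \setminus V(T')$ such that $n' \in V(T')$, where $n'$ is the parent of $n$ in $T$.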

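For convenience, we fix two functions pat($\cdot$) and vars($\cdot$) as follows. Let $\mathcal{T} = (T, r, \lambda)$ be a wdPT. We define $\mathrm{pat}(n) := \lambda(n)$, for every $n \in V(T)$, and $\mathrm{pat}(\mathcal{T}) := \bigcup_{n \in V(T)} \lambda(n)$. Note that $\mathrm{pat}(n)$ and $\mathrm{pat}(\mathcal{T})$ are t-graphs. We let $\mathrm{vars}(o) := \mathrm{vars}(\mathrm{pat}(o))$, for $o \in V(T)$ and $o = \mathcal{T}$.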

A well-designed pattern forest (wdPF for short) is a finite set of well-designed pattern trees.

In [17], it was shown that every wdPT can be translated efficiently into an equivalent wdPT in the so-called NR normal form. A wdPT $\mathcal{T} = (T, r, \lambda)$ is in NR normal form if for every node $n$ with parent $n'$ in $T$, it holds that $\mathrm{vars}(n) \setminus \mathrm{vars}(n') \neq \emptyset$. In this paper, we assume that all wdPTs are in NR normal form.

Well-designed SPARQL and wdPFs. As in the case of SPARQL graph patterns, we denote by $[\![\mathcal{T}]\!]_G$ (resp., $[\![\mathcal{F}]\!]_G$) the evaluation of a wdPT $\mathcal{T}$ (resp., wdPF $\mathcal{F}$) over an RDF graph $G$. In [17], for a wdPT $\mathcal{T}$, the set of mappings $[\![\mathcal{T}]\!]_G$ is defined via a translation to well-designed graph patterns. However, if $\mathcal{T}$ is in NR normal form, then $[\![\mathcal{T}]\!]_G$ admits a simple characterisation stated in Lemma 1 below. In this paper, we adopt this characterisation as the semantics of wdPTs.

Lemma 1 ([17, 24])

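Let $\mathcal{T}$ be a wdPT in NR normal form, $G$ an RDF graph and $\mu$ a mapping. Then $\mu \in [\![\mathcal{T}]\!]_G$ iff there exists a subtree $\mathcal{T}'$ of $\mathcal{T}$ such that

  1. $\mu$ is a homomorphism from $\mathrm{pat}(\mathcal{T}')$ to $G$.

  2. there is no child $n$ of $\mathcal{T}'$ and homomorphism $\nu$ from $\mathrm{pat}(n)$ to $G$ compatible with $\mu$.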
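The characterisation of Lemma 1 suggests a direct membership test. The sketch below is illustrative and rests on assumptions beyond the text: the tree is given as a hypothetical `children` dictionary and a `label` dictionary from nodes to t-graphs, variables are strings starting with `?`, and the extension test is a brute-force search. It also relies on NR normal form together with the connectedness condition on variables, under which the subtree whose variables equal dom(mu) is the maximal root-connected set of nodes with all variables in dom(mu).

```python
from itertools import product


def tvars(tgraph):
    return {x for t in tgraph for x in t if x.startswith("?")}


def apply(h, triple):
    return tuple(h.get(x, x) for x in triple)


def extends(tgraph, graph, mu):
    """Is there a homomorphism from tgraph to graph compatible with mu?"""
    free = sorted(tvars(tgraph) - set(mu))
    values = sorted({a for t in graph for a in t})
    return any(
        all(apply({**mu, **dict(zip(free, image))}, t) in graph for t in tgraph)
        for image in product(values, repeat=len(free))
    )


def in_answer(children, label, root, graph, mu):
    """Decide whether mu belongs to the evaluation of the wdPT over graph."""
    dom = set(mu)
    if not tvars(label[root]) <= dom:
        return False
    # Grow the candidate subtree from the root; nodes left out are its children.
    subtree, frontier, stack = {root}, [], [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if tvars(label[child]) <= dom:
                subtree.add(child)
                stack.append(child)
            else:
                frontier.append(child)
    pattern = set().union(*(label[n] for n in subtree))
    if tvars(pattern) != dom:                 # dom(mu) must equal vars(T')
        return False
    if any(apply(mu, t) not in graph for t in pattern):
        return False                          # condition 1 fails
    return not any(extends(label[c], graph, mu) for c in frontier)  # condition 2


children = {"r": ["c"]}
label = {"r": {("?x", "knows", "?y")}, "c": {("?y", "email", "?e")}}
graph = {("a", "knows", "b"), ("b", "email", "m"), ("c", "knows", "d")}
```

For instance, the mapping `{"?x": "a", "?y": "b"}` is rejected because it can be extended into the child node, whereas `{"?x": "c", "?y": "d"}` is accepted.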

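For a wdPF $\mathcal{F} = \{\mathcal{T}_1, \dots, \mathcal{T}_m\}$ and an RDF graph $G$, we define $[\![\mathcal{F}]\!]_G := [\![\mathcal{T}_1]\!]_G \cup \cdots \cup [\![\mathcal{T}_m]\!]_G$.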

As shown in [17], every UNION-free well-designed graph pattern $P$ can be translated in polynomial time into an equivalent wdPT $\mathcal{T}$, i.e., a wdPT such that $[\![P]\!]_G = [\![\mathcal{T}]\!]_G$, for all RDF graphs $G$. Consequently, and as observed in [24], every well-designed graph pattern $P$ can be translated in polynomial time into an equivalent wdPF $\mathcal{F}$. Throughout the paper, we fix a polynomial-time computable function that maps each well-designed graph pattern to an equivalent wdPF.

Example 2

Recall from Example 1 and consider the following well-designed graph pattern:

We have that , where and are the wdPTs depicted in Figure 2, for and .

2.2 Restrictions of the evaluation problem

Recall that wdEVAL denotes the problem of deciding, given a well-designed graph pattern $P$, an RDF graph $G$ and a mapping $\mu$, whether $\mu \in [\![P]\!]_G$. In this paper, we study restrictions of wdEVAL given by different classes of admissible patterns. Formally, for a class $\mathcal{C}$ of well-designed graph patterns, we define the problem wdEVAL($\mathcal{C}$) as follows:

Input: a well-designed graph pattern $P \in \mathcal{C}$,
an RDF graph $G$ and a mapping $\mu$.
Question: does $\mu \in [\![P]\!]_G$ hold?

Note that wdEVAL($\mathcal{C}$) is a promise problem, as we are given the promise that $P \in \mathcal{C}$. This allows us to analyse the complexity of evaluating patterns in $\mathcal{C}$ independently of the cost of checking membership in $\mathcal{C}$.

3 A new tractability condition

In this section, we introduce the notion of domination width of a well-designed graph pattern and show our main tractability result: wdEVAL($\mathcal{C}$) is in PTIME, for classes $\mathcal{C}$ of graph patterns of bounded domination width. Before doing so, we need to introduce some terminology.

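A generalised t-graph is a pair $(S, X)$, where $S$ is a t-graph and $X \subseteq \mathrm{vars}(S)$. Consider two generalised t-graphs of the form $(S, X)$ and $(S', X)$. A homomorphism from $(S, X)$ to $(S', X)$ is a homomorphism $h$ from $S$ to $S'$ such that $h(?x) = {?x}$, for all $?x \in X$. We write $(S, X) \to (S', X)$ whenever there is a homomorphism from $(S, X)$ to $(S', X)$; otherwise, we write $(S, X) \not\to (S', X)$. Note that the relation $\to$ is transitive, i.e., $(S, X) \to (S', X)$ and $(S', X) \to (S'', X)$ implies $(S, X) \to (S'', X)$.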

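Let $(S, X)$ be a generalised t-graph, $G$ be an RDF graph and $\mu$ be a mapping with $\mathrm{dom}(\mu) = X$. We write $(S, X) \to^{\mu} G$ if there is a homomorphism $h$ from $S$ to $G$ such that $h(?x) = \mu(?x)$, for all $?x \in X$. Notice that $\to$ composes with $\to^{\mu}$, i.e., $(S, X) \to (S', X)$ and $(S', X) \to^{\mu} G$ implies $(S, X) \to^{\mu} G$.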

Below we state several notions and properties for generalised t-graphs. We emphasise that all these properties are well-known for conjunctive queries (CQs) and relational structures and can be applied in our case as there is a strong correspondence between generalised t-graphs and CQs. Indeed, we can view a generalised t-graph $(S, X)$ as a CQ over a relational schema containing a single ternary relation, where the variables are $\mathrm{vars}(S)$, the free variables are $X$, and the IRIs appearing in $S$ correspond to constants in the query. However, for convenience and consistency with RDF and SPARQL terminology, we shall work directly with generalised t-graphs throughout the paper.

Cores. Let $(S, X)$ and $(S', X)$ be two generalised t-graphs. We say that $(S', X)$ is a subgraph of $(S, X)$ if $S' \subseteq S$, and a proper subgraph if $S' \subseteq S$ but $S' \neq S$. A generalised t-graph $(S, X)$ is a core if there is no homomorphism from $(S, X)$ to one of its proper subgraphs $(S', X)$. We say that $(S', X)$ is a core of $(S, X)$ if $(S', X)$ is a core itself, and $(S, X) \to (S', X)$ and $(S', X) \to (S, X)$. As stated below, every generalised t-graph has a unique core (up to renaming of variables), and hence, we can speak of the core of a generalised t-graph.

Proposition 1 (see e.g. [1, 10])

Every generalised t-graph has a unique core (up to renaming of variables).

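Treewidth. The notion of treewidth is a well-known measure of the tree-likeness of an undirected graph (see e.g. [7]). For instance, trees have treewidth $1$, cycles treewidth $2$ and $K_k$, the clique of size $k$, treewidth $k - 1$. Let $H$ be an undirected graph with node set $V(H)$ and edge set $E(H)$. A tree decomposition of $H$ is a pair $(F, \beta)$ where $F$ is a tree and $\beta$ is a function that maps each node $s \in V(F)$ to a subset of $V(H)$ such that

  1. for every $u \in V(H)$, the set $\{s \in V(F) \mid u \in \beta(s)\}$ induces a connected subgraph of $F$, and

  2. for every edge $\{u, v\} \in E(H)$, there is a node $s \in V(F)$ with $\{u, v\} \subseteq \beta(s)$.

The width of the decomposition $(F, \beta)$ is $\max\{|\beta(s)| \mid s \in V(F)\} - 1$. The treewidth $\mathrm{tw}(H)$ of the graph $H$ is the minimum width over all its tree decompositions.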
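The two conditions of a tree decomposition are easy to verify for a given candidate. The following sketch checks them and computes the width. It assumes, without verifying it, that the decomposition graph passed in is indeed a tree, and the argument names (`tree_edges`, `bags`) are hypothetical. Note that a valid decomposition only witnesses an upper bound on the treewidth.

```python
def is_tree_decomposition(vertices, edges, tree_edges, bags):
    """Check conditions 1 and 2 for bags: dict mapping tree node -> set of vertices.

    Assumption: (bags.keys(), tree_edges) is a tree; this is not checked.
    """
    adjacency = {node: set() for node in bags}
    for a, b in tree_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    for v in vertices:
        holders = {node for node, bag in bags.items() if v in bag}
        if not holders:
            return False                      # every vertex must appear somewhere
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:                          # search restricted to `holders`
            node = stack.pop()
            for nxt in adjacency[node] & holders:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if seen != holders:
            return False                      # condition 1 fails
    # Condition 2: each edge is contained in some bag.
    return all(any(set(e) <= bag for bag in bags.values()) for e in edges)


def width(bags):
    return max(len(bag) for bag in bags.values()) - 1


# A 4-cycle a-b-c-d-a with a decomposition of width 2.
vertices = {"a", "b", "c", "d"}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
bags = {1: {"a", "b", "c"}, 2: {"a", "c", "d"}}
tree_edges = [(1, 2)]
```

The example decomposition is valid and has width 2, matching the stated treewidth of cycles.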

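Let $(S, X)$ be a generalised t-graph. The Gaifman graph $G(S, X)$ of $(S, X)$ is the undirected graph whose vertex set is $\mathrm{vars}(S) \setminus X$ and whose edge set contains the pairs $\{?x, ?y\}$ such that $?x \neq {?y}$ and $\{?x, ?y\} \subseteq \mathrm{vars}(t)$, for some triple pattern $t \in S$. We define the treewidth of $(S, X)$ to be $\mathrm{tw}(S, X) := \mathrm{tw}(G(S, X))$. If $G(S, X)$ has no vertices, i.e., $\mathrm{vars}(S) = X$, or $G(S, X)$ has no edges, we let $\mathrm{tw}(S, X) := 1$.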
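As an illustration, the Gaifman graph of a generalised t-graph can be built as follows. This sketch assumes that the vertex set consists of the non-distinguished variables only, that triples are 3-tuples of strings, and that variables start with `?`. The function name is hypothetical.

```python
from itertools import combinations


def gaifman_graph(tgraph, distinguished):
    """Return (vertices, edges) of the Gaifman graph of (tgraph, distinguished).

    Assumption: vertices are the variables of the t-graph that are not
    distinguished; two distinct vertices are adjacent if they occur
    together in some triple pattern.
    """
    vertices = {x for t in tgraph for x in t if x.startswith("?")} - set(distinguished)
    edges = set()
    for t in tgraph:
        present = sorted({x for x in t if x in vertices})
        edges.update(combinations(present, 2))   # unordered pairs, stored sorted
    return vertices, edges


triangle = {("?x", "p", "?y"), ("?y", "p", "?z"), ("?z", "p", "?x")}
```

With no distinguished variables, `triangle` yields a clique on three vertices. Making `?x` distinguished leaves a single edge between `?y` and `?z`.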

For a generalised t-graph $(S, X)$, we let $\mathrm{ctw}(S, X) := \mathrm{tw}(C, X)$, where $(C, X)$ is the core of $(S, X)$.

Figure 1: The generalised t-graphs from Example 3. We assume that and . Note that the distinguished variables are underlined.

Example 3

Let and consider the generalised t-graphs and depicted in Figure 1, where and is the t-graph given by the set

Observe that is a core and hence , as its Gaifman graph is the clique of size . On the other hand, the core of is , where

Hence, while .

Existential $k$-pebble game. The existential $k$-pebble game was introduced by Kolaitis and Vardi [12] to analyse the expressive power of certain Datalog programs. While the original definition deals with relational structures, here we focus on the natural adaptation to the context of generalised t-graphs and RDF graphs.

Let $k \geq 2$. The existential $k$-pebble game is played by the Spoiler and the Duplicator on a generalised t-graph $(S, X)$, an RDF graph $G$ and a mapping $\mu$ with $\mathrm{dom}(\mu) = X$. During the game, the Spoiler only picks elements from $\mathrm{vars}(S) \setminus X$, while the Duplicator picks elements from $\mathrm{dom}(G)$, where $\mathrm{dom}(G)$ is the set of IRIs appearing in $G$. In the first round, the Spoiler places pebbles on (not necessarily distinct) elements $?x_1, \dots, ?x_k \in \mathrm{vars}(S) \setminus X$, and the Duplicator responds by placing pebbles on elements $a_1, \dots, a_k \in \mathrm{dom}(G)$. On any further round, the Spoiler removes a pebble and places it on another element $?x \in \mathrm{vars}(S) \setminus X$. The Duplicator responds by moving the corresponding pebble to an element $a \in \mathrm{dom}(G)$. If after a particular round, the elements covered by the pebbles are $?x_1, \dots, ?x_k$ and $a_1, \dots, a_k$ for the Spoiler and the Duplicator, respectively, then the configuration of the game is undefined if $?x_i = {?x_j}$ and $a_i \neq a_j$, for some $i, j$ with $i \neq j$; otherwise, it is the mapping $\mu \cup \nu$, where $\mathrm{dom}(\nu) = \{?x_1, \dots, ?x_k\}$ and $\nu(?x_i) = a_i$, for every $i \in \{1, \dots, k\}$ (note that $\mathrm{dom}(\mu) \cap \mathrm{dom}(\nu) = \emptyset$).

The Duplicator wins the game if he has a winning strategy, that is, he can indefinitely continue playing the game in such a way that the configuration at the end of each round is a mapping $\mu \cup \nu$ that is a partial homomorphism, i.e., for every triple pattern $t \in S$ with $\mathrm{vars}(t) \subseteq \mathrm{dom}(\mu \cup \nu)$, it is the case that $(\mu \cup \nu)(t) \in G$. If the Duplicator can win the existential $k$-pebble game on $(S, X)$, $G$ and $\mu$, then we write $(S, X) \to^{\mu}_{k} G$.

Note that if , then for every ,

(1)

i.e., is a homomorphism from to . Observe also that for every ,

$$(S, X) \to^{\mu} G \quad\text{implies}\quad (S, X) \to^{\mu}_{k} G. \tag{2}$$

In other words, the relation $\to^{\mu}_{k}$ is a relaxation of $\to^{\mu}$. As we state below, the relaxation given by $\to^{\mu}_{k}$ has good properties in terms of complexity: while checking the existence of homomorphisms, i.e., $(S, X) \to^{\mu} G$, is a well-known NP-complete problem [5], checking $(S, X) \to^{\mu}_{k} G$ can be done in polynomial time, for every fixed $k$. (Footnote 4: The existential $k$-pebble game is known to capture the so-called $k$-consistency test [13], which is a well-known heuristic for solving constraint satisfaction problems (CSPs).)

Proposition 2 ([12]; see also [6])

Let $k \geq 2$. For a given generalised t-graph $(S, X)$, an RDF graph $G$ and a mapping $\mu$ with $\mathrm{dom}(\mu) = X$, checking whether $(S, X) \to^{\mu}_{k} G$ can be done in polynomial time.
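One standard way to decide whether the Duplicator wins is a fixpoint computation in the style of the k-consistency test: start from all partial homomorphisms on at most k non-distinguished variables and repeatedly discard those that cannot be extended or whose restrictions were discarded. The Duplicator wins exactly when the empty assignment survives. The sketch below follows this idea and is not taken from the paper. The encoding (3-tuples of strings, variables starting with `?`, `mu` as a dict) and the function name are assumptions. For fixed k the family has polynomially many members.

```python
from itertools import combinations, product


def duplicator_wins(tgraph, mu, graph, k):
    """Fixpoint test for the existential k-pebble game on (tgraph, dom(mu)), graph, mu."""
    free = sorted({x for t in tgraph for x in t if x.startswith("?")} - set(mu))
    values = sorted({a for t in graph for a in t})

    def is_partial_hom(nu):
        h = {**mu, **nu}
        # Only triples whose variables are all assigned are constrained.
        return all(
            tuple(h.get(x, x) for x in t) in graph
            for t in tgraph
            if all(x in h for x in t if x.startswith("?"))
        )

    # All partial homomorphisms with at most k pebbled variables.
    family = set()
    for size in range(min(k, len(free)) + 1):
        for dom in combinations(free, size):
            for image in product(values, repeat=size):
                nu = dict(zip(dom, image))
                if is_partial_hom(nu):
                    family.add(frozenset(nu.items()))

    changed = True
    while changed:
        changed = False
        for nu in list(family):
            dom = {x for x, _ in nu}
            # Closure under removing one pebble.
            restrictions_ok = all(nu - {pair} in family for pair in nu)
            # Forth property: with a spare pebble, any variable can be answered.
            forth_ok = len(dom) == k or all(
                any(nu | {(x, a)} in family for a in values)
                for x in free if x not in dom
            )
            if not (restrictions_ok and forth_ok):
                family.discard(nu)
                changed = True
    return frozenset() in family


triangle = {("?x", "e", "?y"), ("?y", "e", "?z"), ("?z", "e", "?x")}
two_cycle = {("a", "e", "b"), ("b", "e", "a")}
```

There is no homomorphism from `triangle` to `two_cycle`, yet the Duplicator wins with two pebbles. With three pebbles the Spoiler wins, which illustrates that the game is a relaxation whose precision grows with k.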

As it turns out, there is a strong connection between existential $k$-pebble games and the notion of treewidth. In particular, it was shown by Dalmau et al. [6] that the relations $\to^{\mu}$ and $\to^{\mu}_{k}$ coincide for generalised t-graphs $(S, X)$ satisfying $\mathrm{ctw}(S, X) \leq k - 1$. (Footnote 5: In [6], it was shown that $\to$ and $\to_{k}$ coincide for relational structures whose cores have treewidth at most $k - 1$. For Proposition 3, we need a generalisation of the results in [6] that considers relational structures equipped with a set of distinguished elements. Indeed, such distinguished elements correspond to the variables in $X$ and the IRIs appearing in the generalised t-graph $(S, X)$. Such a generalisation follows straightforwardly from the results in [6].)

Proposition 3 ([6])

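Let $k \geq 2$. Let $(S, X)$ be a generalised t-graph, $G$ be an RDF graph and $\mu$ be a mapping with $\mathrm{dom}(\mu) = X$. Suppose that $\mathrm{ctw}(S, X) \leq k - 1$. Then $(S, X) \to^{\mu} G$ if and only if $(S, X) \to^{\mu}_{k} G$.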

We conclude with two basic properties of the existential pebble game that will be useful for us.

Proposition 4

Let . Let , , , be generalised t-graphs , be an RDF graph and be a mapping with . Then the following hold:

  1. if and , then it is the case that .

  2. if , for all and , for all with , then .

3.1 Domination width

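We start by giving some intuition regarding the notion of domination width. Let $P$ be a well-designed graph pattern, $G$ be an RDF graph and $\mu$ be a mapping. Suppose that $P$ is translated into the wdPF $\mathcal{F} = \{\mathcal{T}_1, \dots, \mathcal{T}_m\}$ and $\mathcal{T}_i = (T_i, r_i, \lambda_i)$, for $i \in \{1, \dots, m\}$. The natural algorithm for checking $\mu \notin [\![P]\!]_G$ is as follows (see e.g. [17, 24]): we simply iterate over all $i \in \{1, \dots, m\}$ such that $\mu$ is a potential solution of $\mathcal{T}_i$ over $G$, i.e., there is a subtree $\mathcal{T}'_i$ of $\mathcal{T}_i$ such that $\mu$ is a homomorphism from $\mathrm{pat}(\mathcal{T}'_i)$ to $G$, and we ensure that there is a child of $\mathcal{T}'_i$ where $\mu$ can be extended consistently.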

The key observation is that we can reinterpret the above-described algorithm as follows. We can choose one of the subtrees as above, and associate a collection of generalised t-graphs of the form , where , where is the set of indices such that is a potential solution of over , and is a child of . To avoid conflicts, for every , the variables from that are not in , need to be renamed to fresh variables. Therefore, checking amounts to checking that there is a homomorphism from some element of to , i.e., whether , for some .

The idea behind domination width is to ensure that is always dominated by a subset where each generalised t-graph in has small ctw. The set dominates in the sense that, for every , there is a such that . Therefore, by transitivity of the relation , checking amounts to checking that there is a homomorphism from some element of to . Since generalised t-graphs of small ctw are well-behaved with respect to the relaxation (see Proposition 3), this will imply that the relaxation of the natural algorithm, described at the beginning of this section, given by replacing homomorphism tests by , correctly decides if . Below we formalise this intuition.

Let be a wdPF. A subtree of is a subtree of some wdPT , for . The support of the subtree contains precisely the indices from such that there is a subtree of satisfying . Note that , for every subtree . Since wdPTs are in NR normal form, whenever , then the witness subtree is unique. For , we denote such a by .

Let be a subtree of . A children assignment for is a function with a non-empty domain that maps every to a child of . We denote by the set of all children assignments for . Observe that if , then it must be the case that , for every . In particular, it could be the case that . The renamed t-graphs assignment associated with maps to a t-graph obtained from by renaming all variables in to new fresh variables. In particular, if and , then

For , we define the t-graph as

We say that a children assignment is valid if for every , we have that

We denote by the set of valid children assignments for . Finally, for the subtree , we define the set of generalised t-graphs associated with as

Figure 2: The wdPF of Example 4.

Figure 3: The generalised t-graphs and from Examples 4 and 5.

Example 4

Let . Recall from Example 3 that

Consider the wdPF depicted in Figure 2. For a wdPT and a subset , we denote by the subtree of induced by the set of nodes . Observe that the only subtrees of with a non-empty set are , , , and . Consider first and note that . We have that

with , where and are described by and . Figure 3 illustrates and . Note how we need to rename to a fresh variable in . Observe also that, for instance, the children assignment given by is not valid as and

For , we have that

where . Note that in Figure 1 corresponds to . In the case of , we have that

where . Finally, note that and .

Now we are ready to define domination width.

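Definition 1 ($k$-domination)

Let $\mathcal{G}$ be a set of generalised t-graphs of the form $\mathcal{G} = \{(S, X) \mid S \in \mathcal{S}\}$, where $\mathcal{S}$ is a set of t-graphs and $X$ is a fixed set of variables with $X \subseteq \mathrm{vars}(S)$, for all $S \in \mathcal{S}$. We say that $\mathcal{G}' \subseteq \mathcal{G}$ is a dominating set of $\mathcal{G}$ if for every $(S, X) \in \mathcal{G}$, there exists $(S', X) \in \mathcal{G}'$ such that $(S', X) \to (S, X)$.

We say that $\mathcal{G}$ is $k$-dominated if the set $\{(S, X) \in \mathcal{G} \mid \mathrm{ctw}(S, X) \leq k\}$ is a dominating set of $\mathcal{G}$.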
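The dominating-set condition can be tested directly once a homomorphism test is available. The sketch below uses a brute-force test that keeps the distinguished variables fixed, and it reads domination as: every member of the family receives a homomorphism from some member of the chosen subset. It does not compute cores or treewidth, so selecting the members of small ctw is left to the caller. The encoding and function names are assumptions of this illustration.

```python
from itertools import product


def has_homomorphism(source, target, distinguished):
    """Is there a homomorphism source -> target that is the identity on `distinguished`?"""
    free = sorted({x for t in source for x in t if x.startswith("?")} - set(distinguished))
    terms = sorted({x for t in target for x in t})
    for image in product(terms, repeat=len(free)):
        h = dict(zip(free, image))
        # Distinguished variables and IRIs are left untouched by h.get(x, x).
        if all(tuple(h.get(x, x) for x in t) in target for t in source):
            return True
    return False


def is_dominating(subset, family, distinguished):
    """Every t-graph in `family` receives a homomorphism from some t-graph in `subset`."""
    return all(
        any(has_homomorphism(s_prime, s, distinguished) for s_prime in subset)
        for s in family
    )


X = {"?x"}
S1 = {("?x", "p", "?y")}
S2 = {("?x", "p", "?y"), ("?y", "p", "?z"), ("?z", "p", "?y")}
```

Here `[S1]` dominates `[S1, S2]`, since `S1` maps into `S2`, whereas `[S2]` does not, since `S2` has no homomorphism into `S1` that fixes `?x`.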

Definition 2 (Domination width)

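Let $\mathcal{F}$ be a wdPF. The domination width of $\mathcal{F}$, denoted by $\mathrm{dw}(\mathcal{F})$, is the minimum positive integer $k$ such that for every subtree $\mathcal{T}$ of $\mathcal{F}$, the set of generalised t-graphs associated with $\mathcal{T}$ is $k$-dominated.

For a well-designed graph pattern $P$, we define the domination width of $P$ as $\mathrm{dw}(P) := \mathrm{dw}(\mathcal{F}_P)$, where $\mathcal{F}_P$ is the equivalent wdPF assigned to $P$ by the translation fixed in Section 2.1.

We say that a class $\mathcal{C}$ of well-designed graph patterns has bounded domination width if there is a universal constant $k \geq 1$ such that $\mathrm{dw}(P) \leq k$, for every $P \in \mathcal{C}$.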

Example 5

Consider a class such that , where is the wdPF defined in Figure 2 and Example 4. We claim that has bounded domination width as for every , it is the case that . Indeed, following the notation from Example 4, we need to check that , and are -dominated.

Note first that . Therefore, is -dominated. Observe also that coincides with from Figure 1 and, as explained in Example 3, we have . It follows that is also -dominated. Finally, for , we have that and (see Figure 3). However, we have that , and hence, is also -dominated.

The following is our main tractability result.

Theorem 1 (Main tractability)

Let $\mathcal{C}$ be a class of well-designed graph patterns of bounded domination width. Then wdEVAL($\mathcal{C}$) is in PTIME.

Let be a positive integer such that , for all . Fix , RDF graph and mapping . Let and suppo