A Complete Classification of the Complexity and Rewritability of Ontology-Mediated Queries based on the Description Logic EL

04/29/2019 ∙ by Carsten Lutz, et al.

We provide an ultimately fine-grained analysis of the data complexity and rewritability of ontology-mediated queries (OMQs) based on an EL ontology and a conjunctive query (CQ). Our main results are that every such OMQ is in AC0, NL-complete, or PTime-complete and that containment in NL coincides with rewritability into linear Datalog (whereas containment in AC0 coincides with rewritability into first-order logic). We establish natural characterizations of the three cases in terms of bounded depth and (un)bounded pathwidth, and show that each of the associated meta problems, such as deciding whether a given OMQ is rewritable into linear Datalog, is ExpTime-complete. We also give a way to construct linear Datalog rewritings when they exist and prove that there is no constant bound on the arity of IDB relations in linear Datalog rewritings.


1 Introduction

An important application of ontologies is to enrich data with a semantics and with domain knowledge while also providing additional vocabulary for query formulation DBLP:conf/rweb/CalvaneseGLLPRR09 ; DBLP:conf/rweb/KontchakovRZ13 ; BienvenuCLW14 ; DBLP:conf/rweb/BienvenuO15 . In this context, it makes sense to view the combination of a database query and an ontology as a compound query, commonly referred to as an ontology-mediated query (OMQ). An OMQ language is then constituted by an ontology language and a query language . Prominent choices for include many description logics (DLs) such as , Horn-, and  DL-Textbook while the most common choices for are conjunctive queries (CQs), unions thereof (UCQs), and the simple atomic queries (AQ) which are of the form . Substantial research efforts have been invested into understanding the properties of the resulting OMQ languages, with two important topics being

  1. the data complexity of OMQ evaluation hustadt-2005 ; LutzKri07 ; Rosati07 ; DBLP:journals/ai/CalvaneseGLLR13 ; BienvenuCLW14 , where data complexity means that only the data is considered the input while the OMQ is fixed, and

  2. the rewritability of OMQs into more standard database query languages such as SQL (which in this context is often equated with first-order logic) and Datalog DBLP:conf/aaai/EiterOSTX12 ; BieLuWo-IJCAI13 ; BienvenuCLW14 ; kaminski14 ; DBLP:conf/ijcai/AhmetajOS16 ; ICDT17-JournalVersion .

PTime data complexity is often considered a necessary condition for efficient query evaluation in practice. Questions about rewritability are also motivated by practical concerns: Since most database systems are unaware of ontologies, rewriting OMQs into standard database query languages provides an important avenue for implementing OMQ execution in practical applications DBLP:conf/rweb/CalvaneseGLLPRR09 ; HLSW-IJCAI15 ; perezurbina10tractable ; DBLP:journals/ws/TrivelaSCS15 . Both subjects are thoroughly intertwined since rewritability into first-order logic (FO) is closely related to AC0 data complexity while rewritability into Datalog is closely related to PTime data complexity. We remark that FO-rewritability of an OMQ implies rewritability into a UCQ and thus into Datalog BienvenuCLW14 . From now on, when speaking about complexity we always mean data complexity.

Regarding complexity and rewritability, modern DLs can roughly be divided into two families: ‘expressive DLs’ such as and that result in OMQ languages with complexity and where rewritability is guaranteed neither into FO nor into Datalog BienvenuCLW14 ; DBLP:journals/ws/TrivelaSCS15 ; ICDT17-JournalVersion , and ‘Horn DLs’ such as and Horn- which typically have complexity and where rewritability into (monadic) Datalog is guaranteed, but FO-rewritability is not BieLuWo-IJCAI13 ; HLSW-IJCAI15 ; ijcai16 . In practical applications, however, ontology engineers often need to use language features that are only available in expressive DLs, but they typically do so in a way such that one may hope for hardness to be avoided by the concrete ontologies that are being designed.

Initiated in LutzWolter12 ; BienvenuCLW14 , this has led to studies of data complexity and rewritability that are much more fine-grained than the analysis of entire ontology languages, see also DBLP:conf/dlog/ZakharyaschevKG18 ; DBLP:conf/ijcai/LutzSW15 . The ultimate aim is to understand, for relevant OMQ languages , the exact complexity and rewritability status of every OMQ from . For expressive DLs, this turns out to be closely related to the complexity classification of constraint satisfaction problems (CSPs) with a fixed template FederVardi . Very important progress has recently been made in this area with the proof that CSPs enjoy a dichotomy between and DBLP:journals/corr/Bulatov17a ; DBLP:journals/corr/Zhuk17 . Via the results in BienvenuCLW14 , this implies that OMQ evaluation in languages such as enjoys a dichotomy between and . However, the picture is still far from being fully understood. For example, neither in CSP nor in expressive OMQ languages it is known whether there is a dichotomy between and , and whether containment in coincides with rewritability into linear Datalog.

The aim of this paper is to carry out an ultimately fine-grained analysis of the data complexity and rewritability of OMQs from the languages (EL, CQ) and (EL, AQ), where EL is a fundamental and widely known Horn DL that is at the core of the OWL EL profile of the OWL 2 ontology language profiles . In fact, we completely settle the complexity and rewritability status of each OMQ from (EL, CQ). Our first main result is a trichotomy: Every OMQ from (EL, CQ) is in AC0, NL-complete, or PTime-complete, and all three complexities actually occur already in (EL, AQ). We consider this a remarkable sparseness of complexities. Let us illustrate the trichotomy using an example. Formally, an OMQ from (EL, CQ) is a triple (𝒯, Σ, q) with 𝒯 an EL TBox that represents the ontology, q a CQ, and Σ an ABox signature, that is, a set of concept and role names that can occur in the data.

Example 1.

Consider an ontology that represents knowledge about genetic diseases, where is caused by and by . A patient carries if both parents carry , and the patient carries if at least one parent carries (dominant and recessive inheritance, respectively). Let

For , the OMQ is -complete and not rewritable into linear Datalog, while is -complete and rewritable into linear Datalog.

Our second main result is that for OMQs from (EL, CQ), evaluation in NL coincides with rewritability into linear Datalog. It is known that evaluation in AC0 coincides with FO-rewritability ijcai16 and thus each of the three occurring complexities coincides with rewritability into a well-known database language: AC0 corresponds to FO, NL to linear Datalog, and PTime to monadic Datalog. We also show that there is no constant bound on the arity of IDB relations in linear Datalog rewritings, that is, we find a sequence of OMQs from (EL, CQ) (and in fact, even from (EL, AQ)) that are all rewritable into linear Datalog, but require higher and higher arities of IDB relations.

We remark that rewritability into linear Datalog might also be interesting from a practical perspective. In fact, the equation “SQL = FO” often adopted in ontology-mediated querying ignores the fact that SQL contains linear recursion from its version 3 published in 1999 on, which exceeds the expressive power of FO. We believe that, in the context of OMQs, linear Datalog might be a natural abstraction of SQL that includes linear recursion, despite the fact that it does not contain full FO. Indeed, the fact that all OMQs from that are FO-rewritable are also UCQ-rewritable shows that the expressive power of FO that lies outside of linear Datalog is not useful when using SQL as a target language for OMQ rewriting.

The second main result is proved using a characterization of linear Datalog rewritability in terms of bounded pathwidth that may be of independent interest. It is easiest to state for : an OMQ is rewritable into linear Datalog (equivalently: can be evaluated in ) if the class of the following ABoxes has bounded pathwidth: is tree-shaped, delivers the root as an answer to , and is minimal w.r.t. set inclusion regarding the latter property. For , we have to replace in tree-shaped ABoxes with pseudo tree-shaped ones in which the root is an ABox that can have any relational structure, but whose size is bounded by the size of the actual query in . These results are closely related to results on bounded pathwidth obstructions of CSPs, see for example Dalmau05 ; DalmauK08 ; CarvalhoDK10 .

Finally, we consider the meta problems associated to the studied properties of OMQs, such as whether a given OMQ is rewritable into linear Datalog, -hard, -hard, etc. Each of these problems turns out to be -complete, both in and in . In the case of linear Datalog rewritability, our results provide a way of constructing a concrete rewriting when it exists.

This paper is organized as follows. We introduce preliminaries in Section 2 and then start with considering the OMQ language (EL, conCQ) where conCQ refers to the class of CQs that are connected when viewed as a graph; these CQs might have any arity, including 0. In Section 3, we show that (EL, conCQ) enjoys a dichotomy between AC0 and NL, using a notion of bounded depth that was introduced in ijcai16 . In particular, it was shown in ijcai16 that when the ABoxes in have bounded depth, then can be evaluated in AC0. We prove that otherwise, we find certain gadget ABoxes (we say that has the ability to simulate ) that allow us to reduce the reachability problem in directed graphs, thus showing NL-hardness. In Section 4, we prove a dichotomy between NL and PTime, still for (EL, conCQ). We first show that if has unbounded pathwidth, then we can find certain gadget ABoxes (we say that has the ability to simulate ) that allow us to reduce the path accessibility problem, thus showing PTime-hardness. This result is similar to, but substantially more difficult than the NL-hardness result in Section 3. We then proceed by showing that if has bounded pathwidth, then we can construct a two-way alternating word automaton that accepts suitable representations of pairs where is an ABox of low pathwidth and an answer to on . We further show how to convert this automaton into a linear Datalog rewriting, which yields NL complexity. Section 5 is concerned with extending both of our dichotomies to potentially disconnected CQs. In Section 6, we prove that there is a sequence of OMQs that are linear Datalog rewritable but for which the width of IDB relations in linear Datalog rewritings is not bounded by a constant. This strengthens a result by DalmauK08 who establish an analogous statement for CSPs. In Section 7 we prove decidability and ExpTime-completeness of the meta problems. The upper bounds are established using the ability to simulate from Section 4 and alternating tree automata.

This paper is an extended version of LS-IJCAI17 . The main differences are that LS-IJCAI17 only treats atomic queries but no conjunctive queries, does not provide characterizations in terms of bounded pathwidth, and achieves less optimal bounds on the width of IDB relations in constructed linear Datalog programs.

2 Preliminaries

We introduce description logics, ontology-mediated queries, central technical notions such as universal models and the pathwidth of ABoxes, as well as linear Datalog and a fundamental glueing construction for ABoxes. We refer to DL-Textbook for more extensive information on description logics and to AbiteboulHV95 for background in database theory.

TBoxes and Concepts. In description logic, an ontology is formalized as a TBox. Let N_C, N_R, and N_I be disjoint countably infinite sets of concept names, role names, and individual names. An EL-concept is built according to the syntax rule C, D ::= ⊤ | A | C ⊓ D | ∃r.C where A ranges over concept names and r over role names. While this paper focuses on EL, there are some places where we also consider the extension ELI of EL with inverse roles. An ELI-concept is built according to the syntax rule C, D ::= ⊤ | A | C ⊓ D | ∃r.C | ∃r⁻.C, the symbol ranges being as in the case of EL-concepts. An expression of the form r⁻ is an inverse role. An EL-TBox (ELI-TBox, resp.) is a finite set of concept inclusions (CIs) of the form C ⊑ D, with C and D EL-concepts (ELI-concepts, resp.).

The size of a TBox, a concept, or any other syntactic object , denoted , is the number of symbols needed to write , with each concept and role name counting as one symbol.

ABoxes. An ABox is the DL way to store data. Formally, it is defined as a finite set of concept assertions A(a) and role assertions r(a,b) where A is a concept name, r a role name, and a, b individual names. We use ind(𝒜) to denote the set of individuals of the ABox 𝒜. A signature is a set of concept and role names. We often assume that the ABox is formulated in a prescribed signature, which we call the ABox signature. An ABox that only uses concept and role names from a signature Σ is called a Σ-ABox. We remark that the ABox signature plays the same role as a schema in the database literature AbiteboulHV95 . If 𝒜 is an ABox and S ⊆ ind(𝒜), then we use 𝒜|_S to denote the restriction of 𝒜 to the assertions that only use individual names from S. A homomorphism from an ABox 𝒜1 to an ABox 𝒜2 is a function h : ind(𝒜1) → ind(𝒜2) such that A(a) ∈ 𝒜1 implies A(h(a)) ∈ 𝒜2 and r(a,b) ∈ 𝒜1 implies r(h(a),h(b)) ∈ 𝒜2.

Every ABox 𝒜 is associated with a directed graph G_𝒜 with nodes ind(𝒜) and edges {(a,b) | r(a,b) ∈ 𝒜}. A directed graph G is a tree if it is acyclic, connected and has a unique node with indegree 0, which is then called the root of G. An ABox 𝒜 is tree-shaped if G_𝒜 is a tree and there are no multi-edges, that is, r(a,b) ∈ 𝒜 implies s(a,b) ∉ 𝒜 for all s ≠ r. The root of a tree-shaped ABox 𝒜 is the root of G_𝒜 and we call an individual b a descendant of an individual a if a ≠ b and the unique path from the root to b contains a.
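
The tree-shape condition can be checked mechanically. The following is a minimal Python sketch, assuming a hypothetical encoding in which a concept assertion A(a) is the tuple ("A", "a") and a role assertion r(a,b) is the tuple ("r", "a", "b"). It reads the definition as requiring a rooted directed tree (one root of indegree 0, every other individual with exactly one predecessor and reachable from the root) without multi-edges. The function names are illustrative and not taken from the paper.

```python
from collections import defaultdict


def individuals(abox):
    """All individual names that occur in some assertion of the ABox."""
    return {name for assertion in abox for name in assertion[1:]}


def tree_root(abox):
    """Return the root if the ABox is tree-shaped, otherwise None."""
    inds = individuals(abox)
    # Collect the role names used on each ordered pair of individuals.
    labels = defaultdict(set)
    for assertion in abox:
        if len(assertion) == 3:
            role, source, target = assertion
            labels[(source, target)].add(role)
    # No multi-edges: at most one role name per ordered pair.
    if any(len(roles) > 1 for roles in labels.values()):
        return None
    indegree = {x: 0 for x in inds}
    children = defaultdict(list)
    for source, target in labels:
        indegree[target] += 1
        children[source].append(target)
    roots = [x for x in inds if indegree[x] == 0]
    # Exactly one root, and no individual with two predecessors.
    if len(roots) != 1 or any(d > 1 for d in indegree.values()):
        return None
    # Every individual must be reachable from the root.
    seen, stack = {roots[0]}, [roots[0]]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return roots[0] if seen == inds else None
```

For example, `tree_root({("A", "a"), ("r", "a", "b"), ("s", "a", "c")})` returns `"a"`, while an ABox containing both `("r", "a", "b")` and `("s", "a", "b")` is rejected because of the multi-edge.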

Semantics. An interpretation is a tuple , where is a non-empty set, called the domain of , and is a function that assigns to every concept name a set and to every role name a binary relation . The function can be inductively extended to assign to every concept a set in the following way.
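
For reference, the inductive clauses can be written out as below. This is the standard textbook semantics of EL and ELI concepts, given here as a reconstruction; the symbols Δ^I for the domain and ·^I for the interpretation function are assumed to match the notation intended above.

```latex
\begin{align*}
\top^{\mathcal{I}} &= \Delta^{\mathcal{I}} \\
(C \sqcap D)^{\mathcal{I}} &= C^{\mathcal{I}} \cap D^{\mathcal{I}} \\
(\exists r.C)^{\mathcal{I}} &= \{\, d \in \Delta^{\mathcal{I}} \mid \exists e \in C^{\mathcal{I}} : (d,e) \in r^{\mathcal{I}} \,\} \\
(\exists r^{-}.C)^{\mathcal{I}} &= \{\, d \in \Delta^{\mathcal{I}} \mid \exists e \in C^{\mathcal{I}} : (e,d) \in r^{\mathcal{I}} \,\}
\end{align*}
```

The last clause is only needed for ELI, where inverse roles are available.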

An interpretation satisfies a CI if , a concept assertion if , and a role assertion if . It is a model of a TBox if it satisfies all CIs in it and a model of an ABox if it satisfies all assertions in it. For an interpretation and , we use to denote the restriction of to the elements in .

Conjunctive queries. A conjunctive query (CQ) is a first order formula of the form , a conjunction of relational atoms, that uses only unary and binary relations that must be from and , respectively. A CQ with equality atoms is a CQ where additionally, atoms of the form are allowed. We also interpret as the set of its atoms. The variables in are called answer variables whereas the variables in are called quantified variables. We set . Every CQ can be viewed as an ABox by viewing (answer and quantified) variables as individual names. A CQ is connected if is and rooted if every connected component of contains at least one answer variable. A CQ is tree-shaped if is. If is a CQ and , then we use to denote the restriction of to the atoms that only use variables from (this may change the arity of ). An atomic query (AQ) is a CQ of the form .

A union of conjunctive queries (UCQ) is a disjunction of CQs that have the same answer variables. We write to emphasize that are the answer variables in . The arity of a (U)CQ , denoted , is the number of its answer variables. We say that is Boolean if . Slightly overloading notation, we write CQ to denote the set of all CQs, to denote the set of all CQs where equality atoms are allowed, conCQ for the set of all connected CQs, AQ for the set of all AQs, and UCQ for the set of all UCQs.

Let be a UCQ and an interpretation. A tuple is an answer to on , denoted , if there is a homomorphism from to with , that is, a function such that implies and implies .
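
The homomorphism-based definition of answers can be illustrated on a finite interpretation. The sketch below is a brute-force evaluator, assuming a hypothetical encoding: unary atoms are tuples ("A", "x"), binary atoms are tuples ("r", "x", "y"), and the interpretation is given by a finite domain plus dictionaries mapping concept and role names to their extensions. It is meant to mirror the definition, not to be efficient.

```python
from itertools import product


def answers(atoms, answer_vars, domain, concepts, roles):
    """All tuples h(answer_vars) for homomorphisms h from the CQ to the interpretation."""
    variables = sorted({v for atom in atoms for v in atom[1:]} | set(answer_vars))
    result = set()
    # Enumerate every assignment of domain elements to the variables.
    for values in product(sorted(domain), repeat=len(variables)):
        h = dict(zip(variables, values))
        satisfied = all(
            h[atom[1]] in concepts.get(atom[0], set())
            if len(atom) == 2
            else (h[atom[1]], h[atom[2]]) in roles.get(atom[0], set())
            for atom in atoms
        )
        if satisfied:
            result.add(tuple(h[v] for v in answer_vars))
    return result


# q(x) = exists y, z . r(x,y) and r(y,z) and A(z), over a three-element chain.
domain = {1, 2, 3}
concepts = {"A": {3}}
roles = {"r": {(1, 2), (2, 3)}}
query = [("r", "x", "y"), ("r", "y", "z"), ("A", "z")]
print(answers(query, ["x"], domain, concepts, roles))  # {(1,)}
```

A Boolean CQ is evaluated by passing an empty list of answer variables; the result is then either the empty set or the set containing only the empty tuple.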

Ontology-mediated queries. An ontology-mediated query (OMQ) is a triple that consists of a TBox , an ABox signature and a query such as a CQ or a UCQ. Let be a -ABox. A tuple is an answer to on , denoted , if for every common model of and , is an answer to on . If the TBox should be emphasized, we write instead of . For an ontology language and query language , we use to denote the OMQ language in which TBoxes are formulated in and the actual queries are from ; we also identify this language with the set of all OMQs that it admits. In this paper, we mainly concentrate on the OMQ languages and .

For an OMQ , we use eval to denote the following problem: given a -ABox and a tuple , decide whether .

TBox normal form. Throughout the paper, we generally and without further notice assume TBoxes to be in normal form, that is, to contain only concept inclusions of the form , , , , where all are concept names and is a role name or, in the case of -TBoxes, an inverse role. Every TBox can be converted into a TBox in normal form in linear time BaaderBL05 , introducing fresh concept names; the resulting TBox is a conservative extension of , that is, every model of is a model of and, conversely, every model of can be extended to a model of by interpreting the fresh concept names. Consequently, when is replaced in an OMQ with , resulting in an OMQ , then and are equivalent in the sense that they give the same answers on all -ABoxes. Thus, conversion of the TBox in an OMQ into normal form does not impact its data complexity nor rewritability into linear Datalog (or any other language).

First order Rewritability. Let . We call FO-rewritable if there exists a first-order formula without function symbols and constants, potentially using equality, and using relational atoms of arity one and two only, drawing unary relation symbols from and binary relation symbols from such that for every ABox and every tuple of individuals of , we have if and only if , where is interpreted as a relational structure over .

Linear Datalog Rewritability. A Datalog rule has the form , , where are relations of any arity and denote tuples of variables. We refer to as the head of and to as the body. Every variable that occurs in the head of a rule is required to also occur in its body. A Datalog program is a finite set of Datalog rules with a selected goal relation goal that does not occur in rule bodies. The arity of , denoted , is the arity of the goal relation. Relation symbols that occur in the head of at least one rule of are intensional (IDB) relations, and all remaining relation symbols in are extensional (EDB) relations. In our context, EDB relations must be unary or binary and are identified with concept names and role names. Note that, by definition, goal is an IDB relation. A Datalog program is linear if each rule body contains at most one IDB relation. The width of a Datalog program is the maximum arity of non-goal IDB relations used in it and its diameter is the maximum number of variables that occur in a rule in .
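
The syntactic notions just introduced (IDB relations, linearity, width) are easy to compute. The following Python sketch assumes a hypothetical encoding in which an atom is a pair (relation name, tuple of variables) and a rule is a pair (head atom, list of body atoms). It counts IDB atoms per rule body, which is one natural reading of "at most one IDB relation".

```python
def idb_relations(program):
    """Relation symbols that occur in the head of at least one rule."""
    return {head[0] for head, _ in program}


def is_safe(program):
    """Every variable in a rule head also occurs in the rule body."""
    return all(
        set(head[1]) <= {v for _, args in body for v in args}
        for head, body in program
    )


def is_linear(program):
    """Each rule body contains at most one IDB atom."""
    idb = idb_relations(program)
    return all(sum(rel in idb for rel, _ in body) <= 1 for _, body in program)


def width(program, goal="goal"):
    """Maximum arity of the non-goal IDB relations used in the program."""
    idb = idb_relations(program) - {goal}
    arities = [
        len(args)
        for head, body in program
        for rel, args in [head, *body]
        if rel in idb
    ]
    return max(arities, default=0)


# P(x) <- A(x);  P(x) <- r(x,y), P(y);  goal(x) <- P(x)
reach = [
    (("P", ("x",)), [("A", ("x",))]),
    (("P", ("x",)), [("r", ("x", "y")), ("P", ("y",))]),
    (("goal", ("x",)), [("P", ("x",))]),
]
print(is_linear(reach), width(reach))  # True 1
```

Adding a rule such as P(x) <- r(x,y), P(y), s(x,z), P(z) makes the program non-linear, since its body contains two IDB atoms.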

For an ABox that uses no IDB relations from and a tuple , we write if is an answer to on , defined in the usual way AbiteboulHV95 : if is a logical consequence of viewed as a set of first-order sentences (all variables in rules quantified universally). We also admit body atoms of the form that are vacuously true. This is just syntactic sugar since any rule with body atom can equivalently be replaced by a set of rules obtained by replacing in all possible ways with an atom where is an EDB relation and where for some and all other are fresh variables.

A Datalog program over EDB signature is a rewriting of an OMQ if iff for all -ABoxes and all . We say that is (linear) Datalog-rewritable if there is a (linear) Datalog program that is a rewriting of . It is well-known that all OMQs from are Datalog-rewritable. It follows from the results in this paper that there are rather simple OMQs that are not linear Datalog-rewritable, choose e.g. , , and .

Universal models. It is well known DBLP:journals/jsc/LutzW10 that for every -TBox and ABox there is a universal model with certain nice properties. These are summarized in the following lemma. Homomorphisms between interpretations are defined in the expected way, ignoring individual names.

Lemma 2.

Let be an -TBox in normal form and an ABox. Then there is an interpretation such that

  1. is a model of and ;

  2. for every model of and , there is a homomorphism from to that is the identity on ;

  3. for all CQs and , iff .

can be constructed using a standard chase procedure, as follows. We define a sequence of ABoxes by setting and then letting be extended as follows:

  • If and , then add to ;

  • If and , then add to ;

  • if and , then add to ;

  • if and , then add to ;

  • if , and there is no such that and , then take a fresh individual and add and to ;

  • if , and there is no such that and , then take a fresh individual and add and to .

Let . We define to be the interpretation that corresponds to . This does not actually define in a unique way since the order of applying the above rules may have an impact on the shape of . However, all resulting are homomorphically equivalent and it does not matter for the constructions in this paper which order we use. Slightly sloppily, we thus live with the fact that is not uniquely defined. Note that can be infinite and that its shape is basically the shape of , but with a (potentially infinite) tree attached to every individual in . The domain elements in these trees are introduced by Rules (v) and (vi), and we refer to them as anonymous elements. The properties in Lemma 2 are standard to prove, see for example BO15 ; DL-Textbook for similar proofs.
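
The chase can be sketched in a few lines for the EL fragment (no inverse roles). This is an illustration under stated assumptions, not the paper's exact procedure: the TBox is assumed to consist of the four usual EL normal forms, encoded as the hypothetical tagged tuples ("top", A) for ⊤ ⊑ A, ("conj", A1, A2, B) for A1 ⊓ A2 ⊑ B, ("exists_left", r, A, B) for ∃r.A ⊑ B, and ("exists_right", A, r, B) for A ⊑ ∃r.B. Since the universal model can be infinite, the sketch adds a `max_depth` cut-off on anonymous elements, which is not part of the construction above.

```python
from itertools import count


def chase(abox, tbox, max_depth=3):
    """Saturate the ABox under the TBox, creating anonymous elements up to max_depth."""
    facts = set(abox)
    # Depth 0 for ABox individuals; anonymous elements get depth of parent + 1.
    depth = {x: 0 for fact in facts for x in fact[1:]}
    fresh = count()

    def add(fact):
        if fact in facts:
            return False
        facts.add(fact)
        return True

    def has_successor(x, role, concept):
        return any(
            f[:2] == (role, x) and (concept, f[2]) in facts
            for f in facts
            if len(f) == 3
        )

    changed = True
    while changed:
        changed = False
        for ci in tbox:
            for x in list(depth):
                if ci[0] == "top":
                    changed |= add((ci[1], x))
                elif ci[0] == "conj":
                    _, a1, a2, b = ci
                    if (a1, x) in facts and (a2, x) in facts:
                        changed |= add((b, x))
                elif ci[0] == "exists_left":
                    _, r, a, b = ci
                    if has_successor(x, r, a):
                        changed |= add((b, x))
                elif ci[0] == "exists_right":
                    _, a, r, b = ci
                    if (a, x) in facts and depth[x] < max_depth and not has_successor(x, r, b):
                        y = f"_anon{next(fresh)}"  # fresh anonymous element
                        depth[y] = depth[x] + 1
                        facts.update({(r, x, y), (b, y)})
                        changed = True
    return facts
```

With the TBox {A ⊑ ∃r.B, ∃r.B ⊑ C} and the ABox {A(a)}, the sketch first creates an anonymous r-successor of a that satisfies B and then derives C(a). For a TBox such as {A ⊑ ∃r.A}, whose universal model is an infinite chain, the cut-off stops the construction after `max_depth` anonymous elements.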

The degree of an ABox is the maximum number of successors of any individual in . The following lemma often allows us to concentrate on ABoxes of small degree. We state it only for , since we only use it for these OMQs.

Lemma 3.

Let be an OMQ and a -ABox such that . Then there exists of degree at most such that .

Proof.

(sketch) Assume and let be the ABox produced by the chase procedure described above. Since , by Lemma 2, . Let be obtained from by removing all assertions that did not participate in any application of rule (i), (ii), (v) or (vi) and let be the result of chasing . Clearly, we must have . Moreover, it is easy to verify that the degree of is at most . ∎

Pathwidth. A path decomposition of a (directed or undirected) graph is a sequence of subsets of , such that

  • for and

  • .

A path decomposition is an -path decomposition if and . The pathwidth of , denoted , is the smallest integer , such that has a -path decomposition for some . Note that paths have pathwidth 1. For an ABox , a sequence of subsets of is a path decomposition of if is a path decomposition of . We assign a pathwidth to by setting .
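
A candidate path decomposition can be validated directly. The sketch below uses the standard textbook conditions (every vertex and every edge is covered by some bag, and the bags containing a given vertex form a contiguous run) and the standard width, namely the maximum bag size minus one, which agrees with the remark that paths have pathwidth 1. The finer two-parameter notion of decomposition mentioned above is not modelled, and the function names are illustrative.

```python
def is_path_decomposition(vertices, edges, bags):
    """Check the standard path decomposition conditions for a sequence of bags."""
    bags = [set(bag) for bag in bags]
    covers_vertices = set(vertices) <= set().union(*bags)
    covers_edges = all(any({u, v} <= bag for bag in bags) for u, v in edges)
    n = len(bags)
    # Interval condition: S_i and S_k share only vertices that are in every S_j between them.
    contiguous = all(
        bags[i] & bags[k] <= bags[j]
        for i in range(n)
        for k in range(i, n)
        for j in range(i, k + 1)
    )
    return covers_vertices and covers_edges and contiguous


def decomposition_width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags) - 1


# The path a - b - c - d admits a decomposition of width 1.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
bags = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
print(is_path_decomposition(vertices, edges, bags), decomposition_width(bags))  # True 1
```

The pathwidth of a graph is then the minimum of `decomposition_width` over all valid decompositions; computing that minimum is a separate and harder problem that this check does not address.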

Treeifying CQs. A Boolean CQ is treeifiable if there exists a homomorphism from into a tree-shaped interpretation. With every treeifiable Boolean CQ , we associate a tree-shaped CQ that is obtained by starting with and then exhaustively eliminating forks, that is, identifying and whenever there are atoms and . Informally, one should think of as the least constrained treeification of . It is known that a CQ is treeifiable if and only if the result of exhaustively eliminating forks is tree-shaped DBLP:conf/cade/Lutz08 . Consequently, it can be decided in polynomial time whether a Boolean CQ is treeifiable.
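
Fork elimination is a simple fixpoint computation. The following Python sketch assumes the same hypothetical tuple encoding of atoms as for ABoxes, ("A", "x") and ("r", "x", "y"), and identifies two distinct variables whenever both have an edge into the same variable. It merges predecessors regardless of the role names involved, which is an assumption of this sketch; if the role names differ, the merged query contains a multi-edge and therefore fails the subsequent tree-shape test anyway.

```python
def eliminate_forks(atoms):
    """Exhaustively identify distinct variables that share a common successor."""
    atoms = set(atoms)
    while True:
        # Look for two role atoms with the same target but different sources.
        fork = next(
            (
                (a[1], b[1])
                for a in atoms
                for b in atoms
                if len(a) == 3 and len(b) == 3 and a[2] == b[2] and a[1] != b[1]
            ),
            None,
        )
        if fork is None:
            return atoms
        keep, drop = sorted(fork)
        # Rename the dropped variable to the kept one in every atom.
        atoms = {
            (atom[0],) + tuple(keep if v == drop else v for v in atom[1:])
            for atom in atoms
        }


query = {("r", "x1", "y"), ("r", "x2", "y"), ("A", "x1"), ("B", "x2")}
print(sorted(eliminate_forks(query)))
# [('A', 'x1'), ('B', 'x1'), ('r', 'x1', 'y')]
```

Each merge removes one variable, so the loop runs at most as many times as there are variables, in line with the polynomial bound stated above. Treeifiability is then decided by checking whether the returned set of atoms is tree-shaped.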

One reason why treeification is useful is that every tree-shaped Boolean CQ can be viewed as an -concept in a straightforward way. If, for example,

then .

A pair of variables from a CQ is guarded if contains an atom of the form . For every guarded pair and every , define to be the smallest set such that

  1. and ;

  2. if , , and , then ;

  3. if , and , then .

Moreover, . We use to denote the set of all (tree-shaped) CQs such that for some guarded pair with treeifiable.

It is easy to verify that the number of CQs in is linear in . We briefly argue that can be computed in polynomial time. The number of guarded pairs is linear in . For each guarded pair , can clearly be computed in polynomial time. Moreover, exhaustively eliminating forks on takes only polynomial time, which tells us whether is treeifiable and constructs if this is the case.

Pseudo tree-shaped ABoxes. Throughout the paper, we often concentrate on ABoxes that take a restricted, almost tree-shaped form. These are called pseudo tree-shaped ABoxes, introduced in ijcai16 . An ABox is a pseudo tree-shaped ABox of core size if there exist ABoxes such that , , and all are tree-shaped ABoxes with pairwise disjoint individuals and consists precisely of the root of . We call the core of . The tree-shaped ABoxes that are part of a pseudo tree-shaped ABox should not be confused with the anonymous trees that are added when chasing a pseudo tree-shaped ABox to construct a universal model. Note that every tree-shaped ABox is pseudo tree-shaped with core size .

The following lemma describes the central property of pseudo tree-shaped ABoxes. It essentially says that if is an answer to an OMQ based on a connected CQ on an ABox , then one can unravel into a pseudo tree-shaped ABox that homomorphically maps to and such that is an answer to on , witnessed by a homomorphism from to that satisfies the additional property of being within or at least ‘close to’ the core of .

Lemma 4.

Let , a -ABox and such that . Then there is a pseudo tree-shaped -ABox of core size at most and with in its core that satisfies the following conditions:

  1. there is a homomorphism from to that is the identity on ;

  2. , witnessed by a homomorphism from to whose range consists solely of core individuals and of anonymous elements in a tree rooted in a core individual.

Proof.

(sketch) Assume that and let be a homomorphism from to . Let be the set of all individuals that are either in the range of or such that an anonymous element in the chase-generated tree below is in the range of . We can unravel into a potentially infinite pseudo tree-shaped ABox with core , see ijcai16 for details. Then and this is witnessed by a homomorphism as required by Condition (2) of Lemma 4. However, need not be finite. By the compactness theorem of first-order logic, there exists a finite subset such that . Let be the restriction of to those individuals that are reachable in from an individual in . It can be verified that is as required. ∎

We shall often be interested in pseudo tree-shaped ABoxes that give an answer to an OMQ and that are minimal with this property regarding set inclusion, that is, no strict subset of supports as an answer to . We introduce some convenient notation for this. Let . We use to denote the set of all pseudo tree-shaped -ABoxes of core size at most such that for some tuple in the core of , while for any .

-types and Glueing ABoxes. We introduce a fundamental construction for merging ABoxes. Let be an -TBox. A -type is a set of concept names from that is closed under -consequence, that is, if , then . For an ABox and , we use to denote the set of concept names from such that , which is a -type. The following lemma allows us to glue together ABoxes under certain conditions.

Lemma 5.

Let be -ABoxes and an -TBox such that for all . Then for all , .

Proof.

Let , , and be as in the lemma. It clearly suffices to show that for all , . We show the contrapositive. Thus, assume that for some . We have to show that . Let be the universal model of and and for each , let be a universal model of and . We can assume w.l.o.g. that . By assumption and since , we must have and . Consider the (non-disjoint) union of and . Clearly, is a model of and . To show , it thus remains to prove that is a model of . To do this, we argue that all concept inclusions from are satisfied:

  • Consider and such that and . Then there exist such that and . If , then , since is a model of . Otherwise , so by assumption, . It follows that and thus, . Together with and because is a model of , it follows that . Thus, the inclusion is satisfied in .

  • Consider and . Then for some . Since is a model of , we have , so and the inclusion is satisfied in .

  • Consider and . Then there are such that and . If , then follows, since is a model of . Otherwise , so by assumption, . For sure we have , so we have and since is a model of , we conclude , so the inclusion is satisfied in .

  • Consider and . Then for some . Since is a model of , we have and , hence also and and thus, is satisfied in .

3 AC0 versus NL for Connected CQs

We prove a dichotomy between AC0 and NL for (EL, conCQ) and show that for OMQs from this language, evaluation in AC0 coincides with FO rewritability. The dichotomy does not depend on assumptions from complexity theory since it is known that AC0 ≠ NL FurstSS81 . We generalize the results obtained here to potentially disconnected CQs in Section 5.

FO-rewritability of OMQs in has been characterized in ijcai16 by a property called bounded depth. Informally, an OMQ has bounded depth if it looks only boundedly far into the ABox. To obtain our results, we show that unbounded depth implies -hardness. Formally, bounded depth is defined as follows. The depth of a tree-shaped ABox is the largest number such that there exists a directed path of length starting from the root in . The depth of a pseudo tree-shaped ABox is the maximum depth of its trees. We say that an OMQ has bounded depth if there is a such that every has depth at most . If there is no such , then has unbounded depth.

Theorem 6.

Let Q ∈ (EL, conCQ). The following are equivalent:

  1. Q has bounded depth.

  2. Q is FO-rewritable.

  3. eval(Q) is in AC0.

If these conditions do not hold, then eval(Q) is NL-hard under FO reductions.

The equivalence (ii) ⇔ (iii) is closely related to a result in CSP. In fact, every OMQ of the form with a concept name and formulated in is equivalent to the complement of a CSP BienvenuCLW14 and it is a known result in CSP that FO-rewritability coincides with AC0 BulatovKL08 . Conjunctive queries, however, go beyond the expressive power of (complements of) CSPs and thus we give a direct proof for (ii) ⇔ (iii).

The equivalence (i) ⇔ (ii) follows from Theorem 9 in ijcai16 . Further, the implication (ii) ⇒ (iii) is clear because first-order formulas can be evaluated in AC0. What remains to be shown is thus the implication (iii) ⇒ (i) and the last sentence of the theorem. We show that unbounded depth implies NL-hardness, which establishes both since AC0 ≠ NL.

We first give a rough sketch of how the reduction works. We reduce from , the reachability problem in directed graphs, which is -complete under FO reductions immerman . An input for this problem is a tuple where is a directed graph, a source node and a target node. Such a tuple is a yes-instance if there exists a path from to in the graph . We further assume w.l.o.g. that and that the indegree of and the outdegree of are both , which simplifies the reduction.

Let be an OMQ of unbounded depth. The reduction has to translate a tuple into a -ABox and a tuple such that if and only if there is a path from to . We show that any ABox from of sufficiently large depth can be used to construct ABoxes , and that can serve as gadgets in the reduction. More precisely, the ABox has (among others) one individual for every node , the edges of will be represented using copies of , and the source and target nodes will be marked using the ABoxes and , respectively. We identify two -types and such that if is reachable from via a path in and otherwise. The tuple is then connected to in a way such that if and only if .

We next define a property of , called the ability to simulate , that makes the properties of , , and precise, as well as those of the -types and . We then show that having unbounded depth implies the ability to simulate and that this, in turn, implies -hardness via a reduction from .

If is a set of concept names, then denotes the ABox . We write to mean that for all . For every pseudo tree-shaped ABox and a non-core individual , we use to denote the tree-shaped ABox rooted at . Note that every tree-shaped ABox is trivially pseudo tree-shaped with only one tree and where the core consists only of the root individual, so this notation can also be used if is tree-shaped. Moreover, we use to denote the pseudo tree-shaped ABox , that is, the ABox obtained from by removing all assertions that involve descendants of (making a leaf) and all assertions of the form . We also combine these notations, writing for example for .

Boolean queries require some special attention in the reduction since they can be made true by homomorphisms to anywhere in the universal model of and , rather than to the neighborhood of the answer tuple  (recall that we work with connected CQs). We thus have to build such that the universal model does not admit unintended homomorphisms. Let be a pseudo tree-shaped -ABox of core size and a tuple from . We call a homomorphism from to core close if there is some variable in such that is in the core of or is an anonymous element in a tree below a core individual. If and is from the core of , then every homomorphism is core close, but this is not true if is Boolean.

Lemma 7.

Let be Boolean and . Then every homomorphism from to is core close.

Proof.

(sketch) Since , is minimal with the property that . Assume that there is a homomorphism from to that is not core close. Then there is no path in from any element in the range of to any individual in the core of (though a path in the converse direction might exist). Thus, we can remove all assertions in that involve a core individual and the resulting ABox satisfies , contradicting the minimality of . Formally, this can be proved by using Lemma 4 and showing that and are isomorphic when restricted to non-core individuals and all elements reachable from them on a path. ∎

For the rest of this section, we assume w.l.o.g. that in any OMQ , the TBox has been modified as follows: for every , introduce a fresh concept name and add the concept inclusion to  where is viewed as an -concept. Finally, normalize again. It is easy to see that the OMQ resulting from this modification is equivalent to the original OMQ . The extension is still useful since its types are more informative, now potentially containing also the freshly introduced concept names. We are now ready to define the ability to simulate .

Definition 8.

An OMQ