Efficiently Enumerating Answers to Ontology-Mediated Queries

by Carsten Lutz, et al.

We study the enumeration of answers to ontology-mediated queries (OMQs) where the ontology is a set of guarded TGDs or formulated in the description logic ELI and the query is a conjunctive query (CQ). In addition to the traditional notion of an answer, we propose and study two novel notions of partial answers that can take into account nulls generated by existential quantifiers in the ontology. Our main result is that enumeration of the traditional complete answers and of both kinds of partial answers is possible with linear-time preprocessing and constant delay for OMQs that are both acyclic and free-connex acyclic. We also provide partially matching lower bounds. Similar results are obtained for the related problems of testing a single answer in linear time and of testing multiple answers in constant time after linear time preprocessing. In both cases, the border between tractability and intractability is characterized by similar, but slightly different acyclicity properties.




1. Introduction

In knowledge representation, ontologies are an important means for injecting domain knowledge into an application. In the context of databases, they give rise to ontology-mediated queries (OMQs) which enrich a traditional database query such as a conjunctive query (CQ) with an ontology. OMQs aim at querying incomplete data, using the domain knowledge provided by the ontology to derive additional answers. In addition, they may enrich the vocabulary available for query formulation with relation symbols that are not used explicitly in the data. Popular choices for the ontology language include (restricted forms of) tuple-generating dependencies (TGDs), also dubbed existential rules (DBLP:conf/ijcai/BagetMRT11) and Datalog$^\pm$ (DBLP:journals/ws/CaliGL12), as well as various description logics (baader-introduction-to-dl).

The complexity of evaluating OMQs has been the subject of intense study, with a focus on single-testing as the mode of query evaluation: given an ontology-mediated query (OMQ) $Q$, a database $D$, and a candidate answer $\bar a$, decide whether $\bar a \in Q(D)$ (AbHV95; barcelo_omq_limits-g; bienvenu-answering-omq; bienvenu-ontology-disjunctive-datalog). In many applications, however, it is not realistic to assume that a candidate answer is available. This has led database theoreticians and practitioners to investigate more relevant modes of query evaluation such as enumeration: given $Q$ and $D$, generate all answers in $Q(D)$, one after the other and without repetition.

The first main aim of this paper is to initiate a study of efficiently enumerating answers to OMQs. We consider enumeration algorithms that have a preprocessing phase in which data structures are built that are used in the subsequent enumeration phase to produce the actual output. With ‘efficient enumeration’, we mean that preprocessing may only take time linear in $|D|$ while the delay between two answers must be constant, that is, independent of $D$. One may or may not impose the additional requirement that, in the enumeration phase, the algorithm may consume only a constant amount of memory on top of the data structures precomputed in the preprocessing phase. We refer to the resulting enumeration complexity classes as $\mathsf{DelayC}_{\mathsf{lin}}$ and CDLin, the former admitting unrestricted (polynomial) memory consumption; the use of these names in the literature is not consistent, and we follow (segoufin-enum; carmeli-enum-ucqs). Without ontologies, answer enumeration in CDLin and in $\mathsf{DelayC}_{\mathsf{lin}}$ has received significant attention (DBLP:journals/dagstuhl-reports/BorosKPS19; bagan-enum-cdlin; berkholz-enum-fpt; carmeli-enum-ucqs; carmeli-enum-rand; carmeli-enum-func; segoufin-enum; deep-enum-alg; deep-enum-ranked), see also the survey (berkholz-enum-tutorial). A landmark result is that a CQ $q$ admits enumeration in CDLin if it is acyclic and free-connex acyclic, where the former means that $q$ has a join tree and the latter that the extension of $q$ with an atom that ‘guards’ the answer variables is acyclic (bagan-enum-cdlin). Partially matching lower bounds pertain to self-join free CQs (bagan-enum-cdlin; BraultBaron).
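The two acyclicity conditions can be checked mechanically. The following is a minimal sketch, assuming a CQ is given simply as a list of atoms, each atom being a tuple of variable names (relation names do not matter here). It uses the standard GYO reduction to test for the existence of a join tree; the function names `is_acyclic` and `is_free_connex` are chosen for illustration only and do not come from the paper.

```python
def is_acyclic(atoms):
    """GYO reduction: True iff the hypergraph of the atoms has a join tree.

    Each atom is an iterable of variable names.
    """
    edges = [set(a) for a in atoms]
    changed = True
    while changed and len(edges) > 1:
        changed = False
        # Rule 1: delete variables that occur in exactly one hyperedge.
        for e in edges:
            lonely = {v for v in e if sum(v in f for f in edges) == 1}
            if lonely:
                e -= lonely
                changed = True
        # Rule 2: delete one hyperedge that is contained in another one.
        for i, e in enumerate(edges):
            if any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    return len(edges) <= 1


def is_free_connex(atoms, answer_vars):
    """True iff the query stays acyclic after adding an atom that
    'guards' (contains) all answer variables."""
    return is_acyclic(list(atoms) + [tuple(answer_vars)])


# Hypothetical example query with atoms R(x, y) and S(y, z).
path = [("x", "y"), ("y", "z")]
print(is_acyclic(path))                    # acyclic
print(is_free_connex(path, ("x", "y")))    # answer variables x, y
print(is_free_connex(path, ("x", "z")))    # answer variables x, z
```

With answer variables x and z, the added guard atom closes a triangle, so the query is acyclic but not free-connex acyclic; with answer variables x and y, both conditions hold.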

The second aim of this paper is to introduce a novel notion of partial answers to OMQs. With the traditional certain answers, $\bar a \in Q(D)$ if and only if $\bar a$ is a tuple of constants from $D$ such that $\bar a \in q(I)$ for every model $I$ of $D$ and the ontology $\mathcal{O}$ used in $Q$. In contrast, a partial answer may contain, apart from constants from $D$, also the wildcard symbol ‘$*$’ to indicate a constant that we know must exist, but whose identity is unknown. Such labeled nulls may be introduced by existential quantifiers in the ontology $\mathcal{O}$. To avoid redundancy, as arises when one partial answer is obtained from another by replacing a constant with a wildcard, we are interested in minimal partial answers that cannot be ‘improved’ by replacing a wildcard with a constant from $D$ while still remaining a partial answer. The following simple example illustrates that minimal partial answers may provide useful information that is not provided by the traditional answers, from now on called complete answers.

Example 1.1.

Consider the ontology $\mathcal{O}$ that contains

and the CQ $q$ giving rise to the OMQ $Q$. Take the following database $D$:

The minimal partial answers to $Q$ on $D$ are

We also introduce and study minimal partial answers with multiple wildcards $*_1, *_2, \dots$. Distinct occurrences of the same wildcard in an answer indicate the same null, while different wildcards may or may not correspond to different nulls. Multiple wildcards may thus be viewed as adding equality on wildcards, but not inequality. We note that there are certain similarities between minimal partial answers to OMQs and answers to SPARQL queries with the ‘optional’ operator (DBLP:conf/pods/BarceloPS15; DBLP:conf/icdt/KrollPS16), but also many dissimilarities.
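To make the minimality condition for a single wildcard concrete, here is a small sketch that filters a given set of partial answers down to the minimal ones. It assumes the partial answers have already been computed and are represented as tuples in which the string `"*"` stands for the wildcard; the constants `mary`, `john` and `room1` are hypothetical and are not taken from the example above. The quadratic filter is for illustration only and is unrelated to the enumeration algorithm developed in the paper.

```python
WILDCARD = "*"


def improves(better, worse):
    """True iff `better` differs from `worse` only in positions where
    `worse` has a wildcard, and differs in at least one such position."""
    return (
        better != worse
        and len(better) == len(worse)
        and all(w == WILDCARD or w == b for b, w in zip(better, worse))
    )


def minimal_partial_answers(partial_answers):
    """Keep exactly the partial answers that no other partial answer improves."""
    answers = set(partial_answers)
    return {a for a in answers if not any(improves(b, a) for b in answers)}


# Hypothetical partial answers to a binary query.
answers = {("mary", "*"), ("mary", "room1"), ("john", "*")}
print(minimal_partial_answers(answers))
```

Here `("mary", "*")` is dropped because `("mary", "room1")` improves it, while `("john", "*")` is kept: it tells us that some second component exists for `john` even though no complete answer does.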

The third aim of this paper is to study two problems for OMQs that are closely related to constant delay enumeration: single-testing in linear time (in data complexity) and all-testing in CDLin or $\mathsf{DelayC}_{\mathsf{lin}}$. Note that for Boolean queries, single-testing in linear time coincides with enumeration in CDLin and in $\mathsf{DelayC}_{\mathsf{lin}}$. An all-testing algorithm has a preprocessing phase followed by a testing phase where it repeatedly receives candidate answers $\bar a$ and returns ‘yes’ or ‘no’ depending on whether $\bar a \in Q(D)$ (berkholz-enum-tutorial). All-testing in $\mathsf{DelayC}_{\mathsf{lin}}$ grants preprocessing time $O(|D|)$ while the time spent per test must be independent of $D$, and all-testing in CDLin is defined accordingly.

An ontology-mediated query takes the form $Q = (\mathcal{O}, \mathcal{S}, q)$ where $\mathcal{O}$ is an ontology, $\mathcal{S}$ a schema for the databases on which $Q$ is evaluated, and $q$ a conjunctive query. In this paper, we consider ontologies that are sets of guarded tuple-generating dependencies (TGDs) or formulated in the description logic $\mathcal{ELI}$. We remind the reader that a TGD takes the form $\forall \bar x \forall \bar y \, (\varphi(\bar x, \bar y) \rightarrow \exists \bar z \, \psi(\bar x, \bar z))$ where $\varphi$ and $\psi$ are CQs, and that it is guarded if $\varphi$ has an atom that mentions all variables from $\bar x$ and $\bar y$. Up to normalization, an $\mathcal{ELI}$-ontology may be viewed as a finite set of guarded TGDs of a restricted form, using in particular only unary and binary relation symbols. Both guarded TGDs and $\mathcal{ELI}$ are natural and popular choices for the ontology language (cali-more-expressove-onto; cali-taming-chase; baader-introduction-to-dl). We use $(\mathbb{G}, \mathrm{CQ})$ to denote the language of all OMQs that use a set of guarded TGDs as the ontology and a CQ as the actual query, and likewise for $(\mathcal{ELI}, \mathrm{CQ})$ and $\mathcal{ELI}$-ontologies.
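The guardedness condition is purely syntactic and easy to test. The sketch below assumes that the body of a TGD is given as a list of `(relation, arguments)` pairs in which every argument is a universally quantified variable (no constants); the relation and variable names are hypothetical. A body is accepted if one of its atoms contains every variable that occurs anywhere in the body.

```python
def body_variables(body):
    """All variables occurring in the body atoms of a TGD."""
    return {v for _, args in body for v in args}


def is_guarded(body):
    """True iff some body atom mentions all variables of the body.

    `body` is a list of (relation_name, argument_tuple) pairs; all
    arguments are assumed to be variables.
    """
    all_vars = body_variables(body)
    return any(all_vars <= set(args) for _, args in body)


# Body R(x, y), A(x): the atom R(x, y) is a guard.
print(is_guarded([("R", ("x", "y")), ("A", ("x",))]))
# Body R(x, y), S(y, z): no single atom contains x, y and z.
print(is_guarded([("R", ("x", "y")), ("S", ("y", "z"))]))
```

Only the body is inspected, since the existentially quantified variables of the head play no role in the guardedness condition.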

We next summarize our results. In Section LABEL:sect:singletesting, we start with showing that in $(\mathbb{G}, \mathrm{CQ})$, single-testing complete answers is in linear time for OMQs that are weakly acyclic. A CQ is weakly acyclic if it is acyclic after replacing the answer variables with constants, and an OMQ is weakly acyclic if the CQ in it is; in what follows, we lift other properties of CQs to OMQs in the same way without further notice. Our proof relies on the construction of a ‘query-directed’ fragment of the chase and a reduction to the generation of minimal models of propositional Horn formulas. We also give a lower bound for OMQs from $(\mathcal{ELI}, \mathrm{CQ})$ that are self-join free: every such OMQ that admits single-testing in linear time is weakly acyclic unless the triangle conjecture from fine-grained complexity theory fails. This generalizes a result for the case of CQs without ontologies (BraultBaron). We observe that it is not easily possible to replace $\mathcal{ELI}$ by $\mathbb{G}$ in our lower bound, as this would allow us to remove also ‘self-join free’, while it is open whether this is possible even in the case without ontologies. We also show that single-testing minimal partial answers with a single wildcard is in linear time for OMQs from $(\mathbb{G}, \mathrm{CQ})$ that are acyclic and that the same is true for multiple wildcards and acyclic OMQs from $(\mathcal{ELI}, \mathrm{CQ})$. We also observe that these (stronger) requirements cannot easily be relaxed.
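The target of the reduction mentioned above, computing the minimal model of a propositional Horn formula, is solvable in linear time by a classical forward-chaining procedure with per-rule counters. The sketch below illustrates that standard procedure only; it is not the paper's reduction, and it assumes definite Horn rules given as `(body, head)` pairs over hypothetical atom names, with facts encoded as rules with an empty body.

```python
from collections import deque


def horn_minimal_model(rules):
    """Return the set of atoms true in the minimal model of definite Horn rules.

    `rules` is a list of (body, head) pairs; `body` is an iterable of atoms
    and `head` is a single atom. A fact is a rule with an empty body.
    """
    # Number of distinct body atoms of each rule not yet known to be true.
    remaining = [len(set(body)) for body, _ in rules]
    # For each atom, the indices of the rules whose body contains it.
    watchers = {}
    for idx, (body, _) in enumerate(rules):
        for atom in set(body):
            watchers.setdefault(atom, []).append(idx)

    true_atoms = set()
    queue = deque()

    def derive(atom):
        if atom not in true_atoms:
            true_atoms.add(atom)
            queue.append(atom)

    for idx, (_, head) in enumerate(rules):
        if remaining[idx] == 0:
            derive(head)
    while queue:
        atom = queue.popleft()
        for idx in watchers.get(atom, ()):
            remaining[idx] -= 1
            if remaining[idx] == 0:
                derive(rules[idx][1])
    return true_atoms


rules = [((), "a"), (("a",), "b"), (("a", "b"), "c"), (("d",), "e")]
print(horn_minimal_model(rules))
```

Every atom enters the queue at most once and every rule counter is decremented at most once per distinct body atom, so the running time is linear in the total size of the rules.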

In Section LABEL:sect:enumallcomplete, we turn to enumeration and all-testing of complete answers. We first show that in $(\mathbb{G}, \mathrm{CQ})$, enumerating complete answers is in CDLin for OMQs that are acyclic and free-connex acyclic while all-testing complete answers is in CDLin for OMQs that are free-connex acyclic (but not necessarily acyclic). The proof again uses the careful chase construction and a reduction to the case without ontologies. The lower bound for single-testing conditional on the triangle conjecture can be adapted to enumeration, with ‘not weakly acyclic’ replaced by ‘not acyclic’. For enumeration, it thus remains to consider OMQs that are acyclic, but not free-connex acyclic. We show that for every self-join free OMQ from $(\mathcal{ELI}, \mathrm{CQ})$ that is acyclic, connected, and admits enumeration in CDLin, the query is free-connex acyclic, unless sparse Boolean matrix multiplication (BMM) is possible in time linear in the size of the input plus the size of the output; this would imply a considerable advance in algorithm theory and currently seems to be out of reach. We also show that it is not possible to drop the requirement that the query is connected, which is not present in the corresponding lower bound for the case without ontologies (bagan-enum-cdlin; berkholz-enum-tutorial). We prove a similar lower bound for all-testing complete answers, subject to a condition regarding non-sparse BMM. All mentioned lower bounds also apply to both kinds of partial answers.
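The link between enumeration and Boolean matrix multiplication can be seen on the CQ q(x, z) ← R(x, y) ∧ S(y, z), a standard example of a query that is acyclic but not free-connex acyclic. If sparse Boolean matrices A and B are stored as sets of the positions of their 1-entries, then the answers to q on the database with R = A and S = B are exactly the 1-entries of the product. The sketch below only illustrates this correspondence with a naive join; the function name and the sample matrices are hypothetical, and the code makes no claim about running in time linear in input plus output.

```python
from collections import defaultdict


def sparse_bmm(a_entries, b_entries):
    """1-entries of the Boolean product A*B.

    Both matrices are given as sets of (row, column) pairs of their 1-entries.
    Equivalently: the answers to q(x, z) <- R(x, y), S(y, z) with R = A, S = B.
    """
    # Index B by row so that each (x, y) in A can look up all matching z.
    b_rows = defaultdict(set)
    for y, z in b_entries:
        b_rows[y].add(z)
    return {(x, z) for x, y in a_entries for z in b_rows.get(y, ())}


a = {(0, 1), (1, 2)}
b = {(1, 5), (2, 5), (3, 7)}
print(sparse_bmm(a, b))
```

An enumeration algorithm for this query with linear preprocessing and constant delay would list the product in time linear in the input plus the output, which is the kind of algorithmic advance the lower bound is conditioned on.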

In Section LABEL:sect:LPAsingleWildcardUpper, we then prove that enumerating minimal partial answers with a single wildcard is in $\mathsf{DelayC}_{\mathsf{lin}}$ for OMQs from $(\mathbb{G}, \mathrm{CQ})$ that are acyclic and free-connex acyclic. This is one of the main results of this paper, based on a non-trivial enumeration algorithm. Here, we only highlight two of its features. First, the algorithm precomputes certain data structures that describe ‘excursions’ that a homomorphism from $q$ into the chase of $D$ with $\mathcal{O}$ may make into the parts of the chase that have been generated by the existential quantifiers in the ontology. Second, it involves subtle sorting and pruning techniques to ensure that only minimal partial answers are output. We also observe that all-testing minimal partial answers is less well-behaved than enumeration, as there is an OMQ that is acyclic and free-connex acyclic, but for which all-testing is not in CDLin unless the triangle conjecture fails.

Finally, Section LABEL:sect:enummulti extends the upper bound from Section LABEL:sect:LPAsingleWildcardUpper to minimal partial answers with multiple wildcards. We first show that all-testing (not necessarily minimal!) partial answers with multiple wildcards is in $\mathsf{DelayC}_{\mathsf{lin}}$ for OMQs that are acyclic and free-connex acyclic and then reduce enumeration of minimal partial answers with multiple wildcards to this, combined with the enumeration algorithm for minimal partial answers with a single wildcard obtained in the previous section.