Counting answers to a query is an essential operation in data management, and is supported by virtually every database management system. In this paper, we focus on counting answers over a Knowledge Base (KB), which may be viewed as a database (DB) enriched with background knowledge about the domain of interest. In such a setting, counting may take into account two types of information: grounded assertions (typically DB records), and existentially quantified statements (typically statistics).
As a toy example, Figure 1 provides an imaginary KB storing a parent/child relation, where explicit instances (e.g., Alice is the child of Kendall) coexist with existentially quantified ones (e.g., Jordan has 3 children). The presence of both types of information is a common scenario when integrating multiple data sources. One source may provide detailed records (e.g. one record per purchase, medical visit, etc.), whereas another source may only provide statistics (number of purchases, of visits, etc.), due to anonymization, access restriction, or simply because the data is recorded in this way.
In such scenarios, counting answers to a query over a KB may require operations that go beyond counting records. For instance, in Figure 1, counting the minimal number of children that must exist according to the KB (where children can be explicit or existentially quantified elements in the range of hasChild) requires some non-trivial reasoning. The answer is 4: Bob or Carol may be the second child of Kendall, but Alice cannot be the third child of Parker (because Alice has two parents already), so a fourth child must exist.
One of the most extensively studied frameworks for query answering over a KB is Ontology Mediated Query Answering (OMQA) [BiOr15], also referred to as OBDA (for Ontology Based Data Access) when emphasis is placed on mappings connecting external data sources to a TBox [XCKL*18]. In OMQA, the background knowledge takes the form of a set of logical statements, called the TBox, and the records are a set of facts, called the ABox. TBoxes are in general expressed in Description Logics (DLs), which are decidable fragments of First-Order logic. Some DLs can express the combination of explicit and existentially quantified instances mentioned above. Therefore OMQA may provide valuable insight about the computational problem of counting over such data (even though, in practice, DLs may not be the most straightforward way to represent such data).
For Conjunctive Queries (CQs) and Unions of CQs (UCQs), DLs have been identified with the remarkable property that query answering over a KB does not induce extra computational cost (w.r.t. worst-case complexity), when compared to query answering over a relational DB [XCKL*18]. This key property has led to the development of numerous techniques that leverage the mature technology of relational DBs to perform query answering over a KB. In particular, the DL-Lite family [CDLLR07, ACKZ09] has been widely studied and adopted in OMQA/OBDA systems, resulting in the OWL 2 QL standard [W3Crec-OWL2-Profiles].
Yet the problem of counting answers over a DL-Lite KB has seen relatively little interest in the literature. In particular, whether counting answers exhibits desirable computational properties analogous to query answering is still a partly open question for such DLs. A key result for counting over DL-Lite KBs was provided by [KoRe15], who also formalized the semantics we adopt in this paper (from now on, we call it the count semantics). For CQs interpreted under the count semantics, they show a coNP lower bound in data complexity, i.e., considering that the sizes of the query and TBox are fixed. However, their reduction relies on a CQ that computes the cross-product of two relations, which is unlikely to occur in practice. Later on, it was shown in [NKKK*19] that coNP-hardness still holds (for a more expressive DL) using a branching and cyclic CQ without cross-product. (Strictly speaking, that result was stated for the related setting of bag semantics, but the same reduction can be applied to count semantics as well.)
Building upon these results, we further investigate how query shape affects tractability.
Another important question is whether relational DB technologies may be leveraged for counting in OMQA, as done for boolean and enumeration queries. A key property here is rewritability, extensively studied for DL-Lite and UCQs [CDLLR07], i.e., the fact that a query over a KB may be rewritten as an equivalent UCQ over its ABox only, intuitively “compiling” part of the TBox into this new UCQ. An important result in this direction was provided in [NKKK*19], but in the context of query answering under bag semantics. For certain DL-Lite variants, it is shown that queries that are rooted (i.e., with at least one constant or answer variable) can be rewritten as queries over the ABox. Despite there being a correspondence between bag semantics and count semantics, they show that these results do not automatically carry over to query answering under count semantics, due to the way bag answers are computed in the presence of a KB.
So in this work, we further investigate the boundaries of tractability and rewritability for CQs with counting over a DL-Lite KB, with an emphasis on DLs that can express statistics about missing information. As is common for DBs, we focus on data complexity, i.e., computational cost in the size of the ABox (likely to grow orders of magnitude faster than the query or TBox).
Section 2 formalizes the problem and defines key notions; Section 3 summarizes related work; Section 4 presents our results on tractability, and Section 5 addresses rewritability; Section 6 discusses implications of these results, and possible continuations. Due to space limitations, the techniques used to obtain our results are only sketched, but full proofs are available in the extended version of this paper (on arXiv).
2 Preliminaries and Problem Specification
We assume mutually disjoint sets of individuals (a.k.a. constants), of anonymous individuals (induced by existential quantification), of variables, of concept names (i.e., unary predicates, denoted with ), and of role names (i.e., binary predicates, denoted with ).
We use boldface letters, e.g., , to denote tuples, and when convenient, we treat tuples as sets. and denote the domain and range of a function . Given , the function restricted to the elements in is denoted . A function is constant-preserving iff for each . If , we use for . If is a tuple with elements in , we use for .
An atom is an expression of the form or , with , , and . If is a set of atoms, we use to denote the set of all arguments of all atoms in .
An interpretation is a FO structure where the domain is a non-empty subset of , and the interpretation function is a function that maps each constant to itself (i.e. , or in other words we adopt the standard name assumption), each concept name to a set , and each role name to a binary relation .
Given an interpretation and a constant-preserving function with domain , we use to denote the interpretation defined by and for each . Given two interpretations , , we use as a shortcut for and , for each . A homomorphism from to is a constant-preserving function with domain that verifies . We note that a set of atoms that verifies uniquely identifies an interpretation, which we denote by .
KBs, DLs, Models.
A KB is a pair , where , called ABox, is a finite set of atoms with arguments in , and , called TBox, is a finite set of axioms. We consider DLs of the DL-Lite family [ACKZ09], starting with the logic DL-Lite, where each axiom has one of the forms (i) (concept inclusion), (ii) (concept disjointness), or (iii) (role inclusion), where now and in the following, the symbols , , and are defined according to the grammar of Figure 2, and are called respectively roles, basic concepts and concepts. Concepts of the form are called number restrictions. DL-Lite allows only axioms of form (i), with the requirement that the number in number restrictions may only be 1. In this work we study extensions to this logic along three orthogonal directions: (1) allowing also for axioms of form (ii), indicated by replacing the subscript with ; (2) allowing also for axioms of form (iii), indicated by adding a superscript ; (3) allowing for arbitrary numbers in number restrictions, but only on the right-hand-side (RHS) of concept inclusion, indicated by adding a superscript . We also use the superscript for logics with role inclusions, but with the restriction on TBoxes defined in [NKKK*19], which disallows in a TBox axioms of the form if contains a role inclusion , for some .
The semantics of DL constructs and KBs is specified in the usual way [BCMNP03]. An interpretation is a model of iff , and holds for each axiom in . A KB is satisfiable iff it admits at least one model. For readability, we focus in what follows on satisfiable KBs, that is, we use “a KB” as a shortcut for “a satisfiable KB”. We use the binary relation over DL-Lite concepts and roles to denote entailment w.r.t. a TBox , defined by iff for each model of the KB .
A key property of a DL-Lite KB is the existence of a so-called canonical model , unique up to isomorphism, s.t. there exists a homomorphism from to each model of . This model can be constructed via the restricted chase procedure from [CDLLR13, BoAC10].
Finally, we observe that axioms of the form can be expressed in the logic , but with a possibly exponential blowup of the TBox (assuming is encoded in binary). For instance, the axiom can be expressed as , with fresh DL roles.
A Conjunctive Query (CQ) is an expression of the form , where each , , each , and is syntactic sugar for the duplicate-free conjunction of atoms . Since all conjunctions in this work are duplicate-free, we sometimes treat them as sets of atoms. The variables in , called distinguished, are denoted by , denotes the head of , and denotes the body of . We require safeness, i.e., . A query is boolean if is the empty tuple.
Answers, certain answers.
To define query answers under count semantics, we adapt the definitions of [CoNS07, KoRe15]. A match for a query in an interpretation is a homomorphism from to . Then an answer to over is a pair s.t. , and there are exactly matches for in that verify for . We use to designate the set of answers to over . Similarly, if is a set of queries, we use to designate the set of all pairs s.t. for some and , and . Answering a query over an interpretation (i.e., a database) is also known as query evaluation. Finally, a pair is a certain answer to a query over a KB iff , and is the smallest integer that verifies for each model of . We use to designate the certain answers to over .
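The definition of answers over a single interpretation can be illustrated with a small brute-force sketch. It is only meant to make the notion of "number of matches per answer tuple" concrete: the atom encoding (tuples such as `("hasChild", "?x", "?y")`), the `?` prefix for variables, and the function names are conventions assumed for this illustration, and the facts below are hypothetical rather than taken from Figure 1. Certain answers over a KB (the minimum over all models) are not computed here.

```python
from collections import Counter
from itertools import product


def is_var(term):
    # Convention of this sketch only: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


def answers(head_vars, body, interpretation):
    """Map each answer tuple to its number of matches in a finite interpretation.

    body           -- set of atoms (predicate, arg1[, arg2]) over variables/constants
    interpretation -- finite set of ground atoms in the same encoding
    """
    variables = sorted({t for atom in body for t in atom[1:] if is_var(t)})
    domain = sorted({t for fact in interpretation for t in fact[1:]})
    counts = Counter()
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # Constants are mapped to themselves (constant-preserving homomorphism).
        image = {
            (atom[0],) + tuple(assignment.get(t, t) for t in atom[1:])
            for atom in body
        }
        if image <= interpretation:  # every atom of the body is mapped into the interpretation
            counts[tuple(assignment[v] for v in head_vars)] += 1
    return dict(counts)


# Hypothetical facts: p1 has two recorded children, p2 has one.
facts = {("hasChild", "p1", "c1"), ("hasChild", "p1", "c2"), ("hasChild", "p2", "c3")}
per_parent = answers(["?x"], {("hasChild", "?x", "?y")}, facts)
boolean = answers([], {("hasChild", "?x", "?y")}, facts)
```

With these facts, the query with distinguished variable `?x` groups the three matches by parent, whereas the boolean variant returns the empty tuple with all three matches.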
The decision problem defined in [KoRe15] takes as input a query , mapping , KB and integer , and decides . It is easy to see though that an instance of this problem can be reduced (in linear time) to an instance where is a boolean query and is the empty mapping, by introducing constants in . We will use this simplified setting for the complexity results below: if a boolean query and the empty mapping, we use as an abbreviation for , and the problem Count is stated as follows:
As usual for query answering over DBs [Vard82] or KBs [CDLLR07], we distinguish between combined and data complexity. For the latter, we adopt the definition provided in [NKKK*19], i.e. we measure data complexity in the cumulated size of the ABox and the input integer (encoded in binary).
As will be seen later, the shape of the input CQ may play a role for tractability. We define here the different query shapes used throughout the article. Because our focus is on queries with unary and binary atoms, we can use the Gaifman graph [BKKRZ17] of a CQ to characterize such shapes: the Gaifman graph of a CQ is the undirected graph whose vertices are the variables appearing in , and contains an edge between and iff for some binary predicate . (This definition implies that the Gaifman graph of has an edge from to if .) We call connected (denoted with ) if is connected, linear ( ) if the degree of each vertex in is at most 2, and acyclic ( ) if is acyclic. We note that none of these three notions implies any of the other two. In addition, following [NKKK*19], we call a CQ rooted ( ) if each connected component in contains at least one constant or one distinguished variable. Finally, a CQ is atomic ( ) if .
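The three graph-based shapes can be checked mechanically. The sketch below builds the Gaifman graph of a CQ body and tests connectedness, linearity and acyclicity with a union-find structure. It assumes that "linear" means non-branching, i.e. every vertex has degree at most 2, and that a self-loop contributes 2 to the degree of its vertex; the atom encoding and the function names are conventions of this illustration only.

```python
def is_var(term):
    # Convention of this sketch only: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


def gaifman_graph(body):
    """Vertices are the variables of the body; edges link variables sharing a binary atom."""
    vertices = {t for atom in body for t in atom[1:] if is_var(t)}
    edges = set()
    for atom in body:
        if len(atom) == 3 and is_var(atom[1]) and is_var(atom[2]):
            # A singleton frozenset encodes a self-loop, as for an atom P(x, x).
            edges.add(frozenset((atom[1], atom[2])))
    return vertices, edges


def shapes(body):
    vertices, edges = gaifman_graph(body)
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    degree = {v: 0 for v in vertices}
    acyclic = True
    for edge in edges:
        ends = tuple(edge)
        u, w = ends[0], ends[-1]  # u == w for a self-loop
        degree[u] += 1
        degree[w] += 1
        root_u, root_w = find(u), find(w)
        if root_u == root_w:
            acyclic = False  # the edge closes a cycle (or is a self-loop)
        else:
            parent[root_u] = root_w
    components = len({find(v) for v in vertices})
    return {
        "connected": components <= 1,
        "linear": all(d <= 2 for d in degree.values()),  # assumption: degree at most 2
        "acyclic": acyclic,
    }


path = {("P", "?x", "?y"), ("R", "?y", "?z")}
star = {("P", "?x", "?a"), ("P", "?x", "?b"), ("P", "?x", "?c")}
triangle = {("P", "?x", "?y"), ("P", "?y", "?z"), ("P", "?z", "?x")}
two_parts = {("P", "?x", "?y"), ("R", "?u", "?v")}
```

The four sample bodies illustrate that the notions are independent: the star is connected and acyclic but branching, the triangle is connected and linear but cyclic, and the last body is linear and acyclic but disconnected.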
Given a query language , a -rewriting of a with respect to a KB is a query whose answers over alone coincide with the certain answers to over . For instance, for OMQA with boolean or enumeration queries, is traditionally the language of domain independent first-order queries, the logical underpinning of SQL. As for queries with counting, it has been shown in [GrMi96, NKKK*19] that counting answers over a relational DB can be captured by query languages with evaluation in LogSpace (data complexity).
3 Related work
Query answering under count semantics can be viewed as a specific case of query answering under bag semantics, investigated notably by [GrMi96, LiWo97], but for relational DBs rather than KBs. Instead, in our setting, and in line with the OMQA/OBDA literature, we assume the input ABox to be a set rather than a bag. The counting problem over sets has also been studied recently in the DB setting [PiSk13, ChMe16], but from the perspective of combined complexity, where the shape of the query (e.g., bounded treewidth) plays a prominent role.
As for (DL-Lite) KBs, [CKNT08] define an alternative count semantics, known as epistemic count semantics, that considers all grounded tuples (i.e., over ) entailed by the KB. Such a semantics does not account for existentially implied individuals, and thus cannot capture the statistics motivating our work.
Instead, the work closest to ours, and which first introduced the count semantics that we adopt here, is the one of [KoRe15], who first showed coNP-hardness of the Count problem for data complexity for DL-Lite, with a reduction that uses a disconnected and cyclic query. coNP-membership is also shown for DLs up to DL-Lite.
[NKKK*19, CNKK*19] have studied query answering over a KB under bag semantics, and provide a number of complexity results (including coNP-hardness) and query answering techniques (including a rewriting algorithm). Such semantics is clearly related to the count semantics, but there are notable differences as argued in [NKKK*19]. In short, one cannot apply the intuitive idea of treating sets as bags with multiplicities . Hence algorithms and complexity results cannot be transferred between the two settings, and this already holds for ontology languages that allow for existential restrictions on the LHS of ontology axioms (note that all the logics considered in this paper allow for such construct). The following example, from [NKKK*19], illustrates this fact.
Example 1 (From [NKKK*19]).
Consider the KB . Consider the query . If we evaluate our query in the count setting, then the answer would be the empty tuple with cardinality because of the following model:
However, such a figure does not accurately represent a bag interpretation. In fact, under bag semantics every concept and property is associated to a bag of elements, rather than a set of elements. Such a bag can be seen as a function that returns, given an element, the number of times that element occurs in the bag. Now let us try to build a (minimal) bag interpretation for . To satisfy , it must be that and (by applying the intuitive idea that our ABox is a bag of assertions with an associated multiplicity ). To satisfy the subsumption , we can introduce a single element (as in the figure) and obtain that , and . Therefore, and . According to the semantics of [NKKK*19], the latter two equalities imply that . Therefore, to satisfy , it must be that . In fact, the certain answer to over under bag semantics is the empty tuple with associated cardinality .
4 Tractability and Intractability
We investigate now conditions for in/tractability (in data complexity) of Count, focusing on the impact of the shape of the query. We observe that the queries used in [KoRe15] and [NKKK*19] to show coNP-hardness are cyclic, and either disconnected or branching. Building upon these results, we further investigate whether cyclicity is necessary for non-tractability. Our results indicate that for certain DLs, non-connectedness or branching alone is a sufficient condition for intractability, whereas cyclicity is not. We start with a membership result:
Count is in P in data complexity for DL-Lite and connected, linear CQs ().
We first sketch the proof for DL-Lite, and then discuss the extension to DL-Lite. If is a connected, linear CQ, and a DL-Lite KB, consider the set of all matches for over the canonical model of . Then viewing as a relation, let be the set of all constant-preserving functions with domain . And let be (one of) the function(s) that minimizes the number of resulting tuples, when applied to . If is connected and linear, then is a model of that minimizes the number of answers to , and can be computed in time polynomial in .
For DL-Lite, we associate a cardinality to each element of : cardinality for elements of , and possibly more than for elements of . E.g. if the KB implies that an element has 4 -successors for some , and if there is only one s.t. , then will contain one additional -successors of , with .
We now show that disconnectedness alone leads to intractability, i.e., cyclicity is not needed.
Count is coNP-hard in data complexity for DL-Lite and acyclic, linear, but disconnected CQs ().
The proof is a direct adaptation of the one provided in [KoRe15].
We use a reduction from co-3-colorability to an instance of Count.
Let be an undirected graph with vertices , and without self-loops.
The ABox is
for some fresh constants , , and .
The TBox is .
And the (acyclic, non-branching) query is
Then it can be verified that iff is not 3-colorable. ∎
Next we show that linearity is required for tractability:
Count is coNP-hard in data complexity for DL-Lite and acyclic, connected, but branching CQs ().
Finally, we observe that the coNP upper bound provided in [KoRe15] for DL-Lite extends to DL-Lite, since number restrictions can be encoded in DL-Lite, as explained in Section 2 (with a technicality: the input integer is not included in the notion of data complexity used in [KoRe15]).
Count is in coNP in data complexity for DL-Lite and arbitrary CQs ().
5 Rewritability and Non-rewritability
We now investigate conditions for rewritability. We start by showing P-hardness for DLs with role inclusions and disjointness, and atomic queries.
Count is P-hard in data complexity for and atomic queries ().
We show a LogSpace reduction from the co-problem of evaluating a boolean circuit where all gates are NAND gates [GrHR91] to an instance of Count. We view such a circuit as an interpretation whose domain is composed of the circuit inputs and gates. , and are unary predicates interpreted in as the positive circuit inputs, the negative circuit inputs and the (unique) target gate respectively. is a binary predicate s.t. iff gate has input (where can be either a circuit input or another gate).
The TBox is defined by , where , and . (The axiom can be encoded into , as explained in Section 2.) Intuitively, the unary predicates and correspond to gates that evaluate to true and false respectively in the circuit, and binary predicates and specialize to positive and negative inputs. encodes constraints pertaining to NAND gates: a positive gate must have at least one negative input, and a negative gate must have two positive inputs. Then enforce that no gate can be both positive and negative, and that the circuit inputs and the output gate have the desired truth values.
Finally, as a technicality, the ABox is an extension of , i.e. , viewing as a structure. The domain of contains 3 additional individuals and , and it extends with , , and .
Then it can be verified that is a valid circuit iff there exists a model of s.t. . Now let be the query . It follows that is not a valid circuit iff . ∎
Assuming P ≠ LogSpace, this implies that for such DLs, even atomic queries cannot be rewritten into a query language whose evaluation is in LogSpace, which is sufficient to capture counting over relational databases. Interestingly, the reduction can be adapted so that it uses instead a query that is rooted, connected and linear (but not atomic).
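To make the source problem of the P-hardness reduction concrete, the following sketch evaluates a circuit made only of NAND gates, and checks the two constraints that the proof sketch attributes to the TBox: a positive gate has at least one negative input, and a negative gate has only positive inputs. The dictionary-based encoding of circuits and all names are assumptions of this illustration; the sketch does not build the ABox or TBox of the reduction, and it assumes an acyclic circuit with two inputs per gate.

```python
def evaluate_nand_circuit(inputs, gates, target):
    """Evaluate an acyclic NAND circuit.

    inputs -- dict mapping each circuit input to its truth value
    gates  -- dict mapping each gate to the pair of its inputs (circuit inputs or gates)
    target -- name of the target gate
    Returns the value of the target gate and the full valuation.
    """
    value = dict(inputs)

    def eval_node(node):
        if node not in value:
            left, right = gates[node]
            value[node] = not (eval_node(left) and eval_node(right))
        return value[node]

    result = eval_node(target)
    for gate in gates:  # complete the valuation for gates not feeding the target
        eval_node(gate)
    return result, value


def respects_nand_constraints(gates, value):
    """Check the constraints stated in the proof sketch for a given valuation."""
    for gate, gate_inputs in gates.items():
        all_positive = all(value[i] for i in gate_inputs)
        if value[gate] and all_positive:
            return False  # a positive gate needs at least one negative input
        if not value[gate] and not all_positive:
            return False  # a negative gate needs two positive inputs
    return True


# Hypothetical circuit: g1 = NAND(x, y), g2 = NAND(g1, x).
circuit_inputs = {"x": True, "y": False}
circuit_gates = {"g1": ("x", "y"), "g2": ("g1", "x")}
output, valuation = evaluate_nand_circuit(circuit_inputs, circuit_gates, "g2")
```

In this sample circuit, `g1` evaluates to true and `g2` to false. The valuation computed by evaluation satisfies the constraints, while flipping the value of a single gate violates them.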
Count is P-hard in data complexity for and rooted, connected, linear queries ().
We now focus on positive results, and rewriting algorithms.
5.1 Universal Model
We follow the notion of universal model proposed in [NKKK*19]: a model of a KB is universal for a class of queries iff holds for every . [NKKK*19, CNKK*19] investigated the existence of a universal model for queries evaluated under bag semantics. As we discussed in Section 3, these results carry over to the setting of count semantics, but only for ontology languages not allowing for existential restrictions on the LHS of ontology axioms. The existence of such a model was proved over the class , for the DL-Lite members up to DL-Lite [NKKK*19] and DL-Lite [CNKK*19], with some syntactic restrictions. It was also shown that queries can be rewritten into (BCALC) queries to be evaluated over the (bag) input ABox. Neither of these logics is able to encode numbers in the TBox; therefore, they cannot capture statistical information about missing data. As seen in the introduction, this information may be important in some applications [ChMe16], and is one of the motivations behind our work. Note that both logics allow for existential restrictions on the LHS of axioms, and therefore these results do not carry over to count semantics.
Our first result shows the existence of a universal model for and the logic DL-Lite, and queries evaluated under count semantics. Precisely, the canonical model obtained via the restricted chase procedure from [CDLLR13, BoAC10] is a universal model. From now on, we denote by the set of atoms obtained after applying the -th chase step over the knowledge base , and by the (possibly infinite) set of atoms obtained by an unbounded number of applications.
DL-Lite has a universal model w.r.t. Count over queries.
Consider a query and a knowledge base , and let denote the set of all matches for over the canonical model . Let be a model of . Since is canonical, there exists a homomorphism from to . Since the query is rooted, one immediately obtains that . This implies that . By relying on the fact that is rooted, on the observation that the chase is restricted, and on the observation that interactions between existential quantification and role subsumption are forbidden, one can derive that . Therefore, we conclude that . Then because is canonical, it must be that . Hence, it must be , and therefore we conclude that ∎
5.2 Rewriting for DL-Lite
We introduce , a rewriting algorithm for inspired by [CDLLR06], and show its correctness. There is a fundamental complication in our setting, of which we provide an example. Consider a conjunctive query , a DL-Lite knowledge base , and a query among those produced by or any other rewriting algorithm for CQs. Then, each match for in can be extended to the anonymous individuals so as to form a “complete” match for in in a certain number of ways (dictated by the axioms in the ontology). From now on, we call such number the anonymous contribution relative to . The following example shows that the anonymous contribution is related to the number restrictions occurring in .
Consider the query , and the KB . Starting from and the axiom, will produce, as part of the final rewriting, a query . Note that there is a single match for over , and that can be extended into exactly three matches for in , by mapping variable into some anonymous individual.
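The notion of anonymous contribution can be illustrated on a hypothetical instance, since the query and KB of the example above are not recoverable here. The sketch assumes an ABox containing only `A(a)`, a single axiom stating that every instance of `A` has at least 3 `P`-successors, the boolean query with body `A(x), P(x, y)`, and the rewritten query with body `A(x)`. It materializes one chase step with fresh anonymous individuals, and compares the number of matches in the resulting structure with the number of matches of the rewritten query over the ABox multiplied by the factor 3. All names and encodings are assumptions of this illustration.

```python
from itertools import product


def count_matches(body, facts):
    """Number of homomorphisms from the atoms of `body` into the finite set `facts`."""
    variables = sorted({t for atom in body for t in atom[1:] if t.startswith("?")})
    domain = sorted({t for fact in facts for t in fact[1:]})
    total = 0
    for values in product(domain, repeat=len(variables)):
        sub = dict(zip(variables, values))
        if all((atom[0],) + tuple(sub.get(t, t) for t in atom[1:]) in facts for atom in body):
            total += 1
    return total


abox = {("A", "a")}
required_successors = 3  # hypothetical axiom: each instance of A has at least 3 P-successors

# One restricted chase step: add fresh anonymous P-successors only where they are missing.
chased = set(abox)
for fact in abox:
    if fact[0] == "A":
        individual = fact[1]
        existing = sum(1 for f in chased if f[0] == "P" and f[1] == individual)
        for i in range(existing, required_successors):
            chased.add(("P", individual, f"_anon_{individual}_{i}"))

query = [("A", "?x"), ("P", "?x", "?y")]  # evaluated over the chased structure
rewritten = [("A", "?x")]                 # evaluated over the ABox alone

matches_in_chase = count_matches(query, chased)
matches_via_rewriting = required_successors * count_matches(rewritten, abox)
```

Here the single match of the rewritten query over the ABox extends in three ways to the anonymous successors, so both counts agree. In general the factor depends on the number restrictions entailed by the TBox, which is what the rewriting algorithm has to track.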
To address our scenario in which the anonymous contribution is a non-fixed quantity that depends on the axioms in the ontology, the changes to the standard are quite substantial and highly non-trivial. Our algorithm is also not related to the one in [NKKK*19], which is based on tree-witness rewriting [KiKZ12] rather than on , and that also falls short in dealing with settings where the anonymous contribution is a non-fixed quantity.
Given an input CQ and TBox , produces a set of queries such that, for any ABox , . Each query in comes with a multiplicative factor that captures the anonymous contribution of each match for that query. Queries in are expressed in a target (query) language, called FO(Count), which is a substantial enrichment of the one introduced in Section 2. FO(Count) has a straightforward translation into SQL. Note that we use the enriched language only to express the rewriting: we do not allow it as a language to express user queries over the KBs, which are still expressed in the language of CQs introduced in Section 2.
Following [CoNS07], FO(Count) allows one to explicitly specify aggregation variables, as well as a multiplicative factor to be applied after the (count) aggregation operator. Intuitively, aggregation variables specify a subset of the non-distinguished variables for which we count the number of distinct mappings (recall that in the query language considered so far we were counting over the whole set of non-distinguished variables). The language also allows for a restricted use of disjunctions, equalities between terms, and atomic negation in the body of the queries. Finally, it allows the use of nested aggregation in the form of a special operator (which intuitively corresponds to a nested aggregation plus a boolean condition stating that the result of the aggregation must be equal to ).
Formally, a query in FO(Count) is a pair , where variables are called group-by variables, variables are called aggregation variables (intuitively, corresponds to the SQL construct count distinct), , is a positive multiplicative factor and is a set of rules . The colon symbol ':' in the head of each rule serves to distinguish between group-by and aggregation variables (head and body of a rule are defined as for CQs). Each in is a conjunction of the form , where is a conjunction of positive atoms, is a conjunction of negated atoms, is a conjunction of equalities between terms, and is a conjunction of special atoms (that we call -atoms) of the form , where and is a variable that occurs only once in .
A mapping is a match for in an interpretation if:
satisfies all equalities in ;
there is no such that , for some in ;
for each in , there are exactly mappings such that, for
A mapping is a match for in interpretation if it is a match in for for some in . A mapping is an answer to over with cardinality iff there are exactly mappings such that, for :
can be extended to a match for in such that .
Note that our semantics also captures the case when the operator is over an empty set of variables (in that case, the above would be equal to ). This technicality is necessary for the presentation of the algorithm.
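The core of this semantics (group-by variables, a distinct count over the aggregation variables only, and a multiplicative factor applied afterwards) can be sketched in SQL, here run through Python's `sqlite3` module. The tables `P` and `R`, their contents, the factor 3 and the rule with body `P(x, y), R(y, z)`, group-by variable `x` and aggregation variable `y` are all hypothetical. Negated atoms, equalities and nested aggregation atoms are not covered, and this is not claimed to be the translation used by the authors.

```python
import sqlite3


def run_example(factor=3):
    """Evaluate a hypothetical rule with head (x : y) and body P(x, y), R(y, z)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE P(s TEXT, o TEXT);
        CREATE TABLE R(s TEXT, o TEXT);
        INSERT INTO P VALUES ('a', 'b1'), ('a', 'b2'), ('c', 'b1');
        INSERT INTO R VALUES ('b1', 'd1'), ('b1', 'd2'), ('b2', 'd1');
        """
    )
    # x is the group-by variable, y the aggregation variable; z is existential
    # but not aggregated, so distinct values of z must not inflate the count.
    rows = conn.execute(
        """
        SELECT P.s, ? * COUNT(DISTINCT P.o)
        FROM P JOIN R ON P.o = R.s
        GROUP BY P.s
        ORDER BY P.s
        """,
        (factor,),
    ).fetchall()
    conn.close()
    return rows


rows = run_example()
```

For `a`, the join produces three rows (two of them through `b1`), but only the two distinct values of the aggregation variable are counted before the factor is applied. Counting over several aggregation variables at once would need a `SELECT DISTINCT` subquery in SQLite, since its `COUNT(DISTINCT ...)` accepts a single expression.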
We are now ready to introduce . Consider a satisfiable knowledge base , and a query . takes as input and and initializes the result set as
Then the algorithm expands by applying the rules AtomRewrite, Reduce, GE, and GE until saturation, with priority over AtomRewrite and Reduce. At the end of this process the resulting set does not necessarily contain just queries (in the sense of our definition above), and hence needs to be normalized (see later). To define the rules of the algorithm, we first need to introduce some notation. In the following, the notation stands for the atom . Hence, also when stands for . We use the symbol underscore '_' to denote the fresh variables that are introduced during the execution of the algorithm. Given a basic concept , the function application returns if , or , if . Given a set of basic concepts, is defined as the set of basic concepts . If , are two conjunctions of atoms and is an atom, we use (resp. ) to designate the conjunction identical to , but where is replaced with (resp. is deleted from ). By extension, if is a rule, we use to designate the rule . If is a basic concept and a role, we use to designate the maximal s.t. . From now on, we say that a variable in a rule is bound (in ) if it is a distinguished variable, or if it occurs more than once in the set of positive atoms of . We say that is -blocked if it is bound, or if it occurs more than once in , or if it occurs in some -atom in . Finally, we say that is -blocked if it is bound, or if it occurs more than once in , or if it occurs in some atom of the form with .
for some , for some , either:
is of the form , and , or
is of the form , , is an unbound variable, and if , then ;
Reduce (). if
, for some ;
is a most general unifier for and ; with the following restrictions:
a variable in can map only to a variable in ;
a variable in can map only to a variable in ;
GE (). if
is an atom such that:
Let be the conjunction of all exists-atoms in any rule (by construction, such conjunction is the same for each rule in in ). Then the conjunction (seen as a set) must not appear in other rules from ;
is the maximal set of basic concepts such that , for some ;
is defined as follows. First, let denote the set of all pairs such that , , , and . Then, for a set of basic concepts, let denote the cartesian product . And if , are two sets of basic concepts, we call atomic decomposition the formula , defined as:
If is a formula, let designate the rule:
Finally, if is an integer, let be the expression:
We can now define as:
GE (). This rule is defined as , but with the difference that conditions and are as follows:
is an atom such that: