1.1. The context
This paper is about the query determinacy problem. So let us maybe start with a definition:
Definition 1 ().
• For a query and a finite set of queries , we say that determines (denoted as ) if the implication:
holds for every pair of finite structures . [Footnote 1: Both “finite” and “unrestricted” versions of this problem were considered, but in this paper we concentrate on the finite one, which is the only one that makes sense in the multiset scenario.] [Footnote 2: Where is the result of applying to .]
• An instance of the determinacy problem, for a query language , consists of a query and a finite set of views . We ask whether .
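To make the definition concrete for the boolean, multiset case studied later in the paper, the following minimal sketch checks whether a given pair of structures is a counterexample to determinacy. It assumes the bag semantics of Section 2.1, under which the result of a boolean CQ over a structure is its number of homomorphisms into that structure. The function names `count_homs` and `is_counterexample`, the encoding of facts as tuples, and the toy queries and structures are illustrative assumptions and are not taken from the paper.

```python
from itertools import product

def count_homs(query_atoms, db_facts):
    """Number of homomorphisms from a set of atoms into a set of facts.

    Atoms and facts are tuples (relation_name, term, term, ...).
    """
    variables = sorted({t for atom in query_atoms for t in atom[1:]})
    domain = sorted({t for fact in db_facts for t in fact[1:]})
    facts = set(db_facts)
    total = 0
    for values in product(domain, repeat=len(variables)):
        h = dict(zip(variables, values))
        if all((atom[0],) + tuple(h[t] for t in atom[1:]) in facts
               for atom in query_atoms):
            total += 1
    return total

def is_counterexample(views, query, d1, d2):
    """True if every view agrees on d1 and d2 but the query does not."""
    views_agree = all(count_homs(v, d1) == count_homs(v, d2) for v in views)
    return views_agree and count_homs(query, d1) != count_homs(query, d2)

# Hypothetical toy instance (an assumption for illustration only).
edge = [("E", "x", "y")]                        # boolean CQ: some edge exists
two_path = [("E", "x", "y"), ("E", "y", "z")]   # boolean CQ: a path of length 2
d1 = [("E", 1, 2), ("E", 2, 3)]
d2 = [("E", 1, 2), ("E", 3, 4)]
print(is_counterexample([edge], two_path, d1, d2))
```

In this toy instance the single view counts edges and returns 2 on both structures, while the query counts paths of length two and returns 1 and 0, so the pair witnesses that the view does not determine the query under multiset semantics.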
Many different variants of the determinacy problem, for various query languages, and (when applicable) various arities of queries, have been studied in the last three decades. The point has now been reached where we have a fairly complete classification of the variants, in the sense that we know which of them are decidable (few) and which are not (most).
So, for example, as observed in (Marcinkowski, 2020), the problem is decidable if the queries in are unary UCQs (unions of conjunctive queries) and is any UCQ. [Footnote 3: “Unary” means that they have one free variable. A similar result for unary conjunctive queries was proven in (Nash et al., 2006).] Let us outline how one can prove this:
As noticed in the paper (Gogacz and Marcinkowski, 2015),
holds if and only if:
where and are some structures that can easily be constructed from and is some set of Tuple Generating Dependencies which can easily be constructed from , and where is a result of applying the TGDs from to until the fixpoint is reached. Then, it is easy to see that if the queries from are unary then the TGDs in are frontier-one. And query entailment (that is, condition (*) above) is decidable for sets of such TGDs (Baget et al., 2011). Then, if one is unhappy with the fact that is potentially infinite, leading to infinite and , the finite controllability result for frontier-one TGDs (implied by (Bárány et al., 2011)) can be used to replace with a finite structure with the desired properties.
We do not really want our readers to understand this complicated reasoning (unless they already do). We only outline it in order to show that database theory has reached the point where it is no longer merely a set of results about the fundamental notions and phenomena, but a real scientific theory, able to explain and interpret facts which are apparently totally unrelated: we do not believe that the authors of (Baget et al., 2011) and (Bárány et al., 2011) ever expected their results to be used in a decidability proof of a variant of the determinacy problem. [Footnote 5: Or maybe “multiset” would be a better term in this context, as some of the results were produced more than once.]
Unfortunately, this beautiful palace of database theory, both the results and the tools, collapses like a house of cards when we try to be slightly more realistic and assume that the queries do not return sets of tuples, but they return multisets (or bags) of tuples.
And this is not a new observation. It was already spotted in (Chaudhuri and Vardi, 1993), where the authors try to see what happens to the most important database theory fundamental, query containment, if bag semantics is assumed, concluding that the “techniques from the set-theoretic setting do not carry over to the bag-theoretic setting”.
The paper (Chaudhuri and Vardi, 1993) was understood, at least by part of the community, as a call “for a re-examination of the foundations of databases where the fundamental concepts and algorithmic problems are investigated under bag semantics, instead of set semantics” (see (Atserias and Kolaitis, 2020), page 2). But only rather limited progress has been achieved. Even the decidability of the conjunctive query containment problem remains open in the multiset world. And this is in spite of a considerable effort, which is reflected by a list of publications.
First, (Ioannidis and Ramakrishnan, 1995) show that containment of UCQs, which is in NP when the classical (set) semantics is considered, becomes undecidable for the multiset semantics. Then (among other papers) there is (Jayram et al., 2006), where it is shown that containment is undecidable if inequalities are allowed in conjunctive queries, and (Afrati et al., 2010), which shows decidability (and establishes complexity) for several simple subcases. And then, finally, there is the paper (Abo Khamis et al., 2020), where query containment is related in an elegant way to the information-theoretic notion of entropy, and it is shown that decidability of even a quite limited subproblem of query containment would imply a solution to a long-standing open problem in information theory.
Apart from the line of research focused on the query containment problem, the number of such re-examination attempts, while growing, remains low. And this is – we understand – not because of lack of interest, but because (as the containment problem illustrates) everything suddenly gets very complicated when multiset semantics is assumed. One example we know about is the recent paper (Atserias and Kolaitis, 2020) where the authors re-examine the old result from (Beeri et al., 1983), that a database schema is acyclic if and only if the local-to-global consistency property for relations over that schema holds.
1.2. Our contribution (and the future work).
In this paper we attempt a re-examination, under multiset semantics, of the query determinacy problem.
This means that we now read the equalities in formula ♠ as equalities of multisets. To distinguish the two notions we will use the symbol to denote the old style set-semantics determinacy and for determinacy under multiset semantics.
The first question one naturally needs to ask here is whether is really a different notion than . And, if they are indeed different, the second question is: does at least one implication hold, as in the case of query containment, where, as already noticed in (Chaudhuri and Vardi, 1993), containment under multiset semantics is a strictly stronger property than containment under set semantics?
To show that the two versions are really different let us use:
Example 2 ().
Let be the query and let consist of two conjunctive queries:
Then it is easy to see that but .
Regarding the second question notice that while equality of and (for some query ) under multiset semantics implies that they are also equal under set semantics, the formula ♠ has both a positive and a negative occurrence of equality of the answer sets. So it is not obvious at all that multiset determinacy always implies determinacy in the set semantics world. And indeed:
Example 3 ().
Let be the query and let consist of two queries:
Then it is easy to see that . But under the multiset semantics for each we have (since we consider boolean queries here, the answers are natural numbers), which implies that .
Can such an example be constructed for conjunctive queries rather than UCQs? We do not know. We conjecture that the answer is “no”, but proving it will probably be hard. What we can show (and we find it a bit surprising, because the situations where set-semantics based notions coincide with their multiset-semantics counterparts seem to be rare) is:
Theorem 1 ().
If is a set of path queries, and is a path query, then if and only if . [Footnote 6: For a definition of path queries see Section 3.]
Determinacy of path queries (under the set semantics) is one of the few decidable cases (Afrati, 2011), and, as Theorem 1 implies, it remains decidable in the multiset semantics world. For the proof of Theorem 1 see Section 3. Notice also that the queries from Example 2 are not far from being path queries, but still, for some reason, the thesis of Theorem 1 does not hold for them.
But the main focus of this paper is on understanding query determinacy in the case of boolean queries. We first present:
Theorem 2 ().
The problem whether, for a set of boolean UCQs and for another boolean UCQ , it is true that , is undecidable.
This is in stark contrast to the situation in the set semantics world where, as we already mentioned in Section 1.1, determinacy is decidable even for unary UCQs, not just boolean. But the proof of Theorem 2 is not hard. In order to show it, it is enough to notice that the “ trick” from (Segoufin and Vianu, 2005) (or the “cold-hot” trick from (Marcinkowski, 2020)) can be safely used in the multiset semantics world, and then to reuse the encoding of Hilbert's 10th problem from (Ioannidis and Ramakrishnan, 1995).
Finally, our main technical contribution is:
Theorem 3 ().
The problem whether, for a set of boolean CQs and for another boolean CQ , it is true that , is decidable.
Future work. The natural open question we leave is the decidability status of the CQ determinacy for the multiset semantics, that is of the problem whether, for a set of CQs (with free variables) and for another CQ , it is true that . The encoding method from the proof of Theorem 2 is useless when disjunction is no longer available. And also the techniques from the proof of Theorem 3 do not seem to generalize to the scenario with free variables.
1.3. The tools. And related works.
Regarding the tools used in the proofs of Theorem 1 and Theorem 3, let us quote (Chaudhuri and Vardi, 1993) again: techniques from the set-theoretic setting do not carry over to the bag-theoretic setting. The green-red chase (mentioned in Section 1.1), which is a fundamental tool to study determinacy in the set-semantics world, just vanishes in the multiset setting, together with the results that depend on it, like undecidability for the CQ case. And in general, the importance of concepts that stem from first-order logic diminishes in this new world. Instead, tools based on notions from linear algebra arise in a very natural way here. This is not at all a new observation: in order to read (Atserias and Kolaitis, 2020) one also needs to dust off the linear algebra textbook.
While we are (as far as we know) the first to consider query determinacy under multiset semantics, there exists a line of research in database theory which concentrates on the number of answers to a query (homomorphisms), including the paper (Chen and Mengel, 2017) (again in a natural way using arguments from linear algebra). And also, due to solely mathematical motivations, such homomorphism counts (and some related numbers) were studied by researchers in combinatorics, with numerous papers published, including (Lovász, 1967) and (Erdős et al., 1979), and a book (Lovász, 2012). [Footnote 7: We only had access to a free version of this book available on the web.] Some of the results regarding homomorphism counts are useful for us (see Section 6 where we use the main result from (Lovász, 1967)). Some, while not directly useful, are related to our paper, for example there is a construction in (Chen and Mengel, 2017) resembling Step 1 (and partially also Step 2) from our construction in Section 6.
The title of (Erdős et al., 1979) may suggest that there is a connection to determinacy and (as we learned) some less careful readers can have an impression that the main result from (Erdős et al., 1979) is almost our Theorem 3. So let us take some space here to explain why this is not the case. [Footnote 8: It may be a good idea to skip the rest of this Section now and come back here after you read Section 4 at the earliest.]
A set of connected non-isomorphic graphs is considered in (Erdős et al., 1979). For and for another graph the number (“homomorphism density”) is defined as the probability that a random mapping from the set of vertices of to the set of vertices of will be a homomorphism.
Let now be the set , which clearly is a subset of or, to be more precise, of . The main theorem of (Erdős et al., 1979) (Theorem 1 there) says that:
(*) contains a subset which is dense in some ball.
Then, it seems to be claimed in (Lovász, 2012) that it follows from (*) that no functional dependence between the numbers can exist, meaning that cannot be a function of arguments . [Footnote 9: Remarks after Corollary 5.45 in (Lovász, 2012); unfortunately the language is quite sloppy there, and it is not entirely clear to us how this part of the text should correctly be interpreted.] In our language this would mean that:
(**) do not determine .
If it were indeed true that (*) implied (**) then one could use the graph blow-up technique from (Lovász, 2012) (Theorem 5.32) to translate the language of “homomorphism densities” into the language of homomorphism counts and, as a result, prove our Corollary 26, which is a very special case of our Theorem 3.
But (*) does not imply (**). Let us define as the projection of on the first m-1 coordinates, that is .
Then (**) means that cannot be the graph of some function . But all (*) tells us about is that its topological closure contains a ball. And it is easy to construct such a function that the topological closure of the graph of not only contains a ball but is actually the entire cube .
What does indeed follow from (*) is that no such continuous function can exist, so in particular cannot be expressed from by operations which preserve continuity. But then it is a completely different story, as continuity may make sense when talking about homomorphism density, but not in the context of homomorphism counts.
2.1. Database Theory Notions
A multiset is a mapping where is some specified set. With we will denote the number of occurrences of in . We write that if . A union of two multisets and is a multiset such that . We define other multiset operators analogously. [Footnote 10: We (of course) think that .]
A schema is a finite set of relational symbols. A schema is -ary if the arity of each of its relations is at most . A structure (or database) over schema is a finite set consisting of facts. [Footnote 11: Which means that we assume that answers to the queries are multisets, but the structures are sets. However, all our results and techniques would survive if we defined structures which are multisets of facts.] A fact is simply an atom where is a tuple of terms from some fixed infinite set of constants. The active domain of (denoted with ) is the set of constants that appear in facts of .
For two structures and over schema , a homomorphism from to is a function such that for each atom it holds that . The set of homomorphisms from to is denoted with . Note that for the empty structure .
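The definition can be turned into a brute-force enumeration. The sketch below is only an illustration under stated assumptions: structures are encoded as Python sets of tuples `(relation_name, c1, ..., ck)`, and the name `homomorphisms` is hypothetical. It tries every function between the two active domains and keeps those that map each fact of the first structure to a fact of the second.

```python
from itertools import product

def homomorphisms(a, b):
    """Yield every homomorphism from structure a to structure b.

    A structure is a set of facts (relation_name, c1, ..., ck); a
    homomorphism is a dict from the active domain of a to that of b.
    """
    adom_a = sorted({c for fact in a for c in fact[1:]}, key=repr)
    adom_b = sorted({c for fact in b for c in fact[1:]}, key=repr)
    for image in product(adom_b, repeat=len(adom_a)):
        h = dict(zip(adom_a, image))
        # h is a homomorphism if the image of every fact of a is a fact of b
        if all((f[0],) + tuple(h[c] for c in f[1:]) in b for f in a):
            yield h

# Hypothetical example structures (not from the paper).
triangle = {("E", 0, 1), ("E", 1, 2), ("E", 2, 0)}
edge = {("E", "u", "v")}
print(len(list(homomorphisms(edge, triangle))))    # one per edge of the triangle
print(len(list(homomorphisms(set(), triangle))))   # only the empty mapping
```

In this encoding the empty structure has exactly one homomorphism into any structure, namely the empty mapping, because the active domain to be mapped is empty.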
Conjunctive Queries (CQs).
A conjunctive query is a first order formula such that is a conjunction of atoms over variables from and . With we will denote the set of variables of . The arity of CQ is simply .
The frozen body of a CQ is a structure obtained from by bijective replacement of variables with fresh constants. For a CQ and a structure , with we denote the set of all homomorphisms from the frozen body of to . A result of a CQ over a structure is a multiset such that .
For a binary schema a path query is a CQ of the form
Let denote the set of all words over relational symbols from . Given the nature of path queries we will identify them with words from , so instead of writing we may conveniently write . [Footnote 12: Note, however, that an empty word is identified with the query , although it is not a valid path query.]
A CQ with no free variables is called boolean. Boolean CQs will always be identified with their frozen bodies.
According to the previous definitions, a result of a boolean CQ over some structure is a multiset containing copies of the empty tuple. For brevity we write instead of , so .
A union of boolean conjunctive queries (boolean UCQ) is a disjunction of a finite number of boolean CQs. A result of a boolean UCQ over a structure is the natural number .
A boolean CQ is contained under set semantics in a boolean CQ (denoted as ) if for every structure it holds that . It is well-known that if and only if is non-empty.
2.2. Graph Theoretic Tools
Operations on Structures
Following (Lovász, 1967) we will use some operations on structures. For structures and over schema :
• is a disjoint union of and ; [Footnote 13: That is, if we bijectively rename variables of with fresh ones and then make .]
• is a structure such that and for any the following holds: is an atom of if and only if and ;
• we use symbols and as generalized and in the usual way;
• for and . Furthermore, is an empty structure and is a singleton such that for any ( has loops of all types).
Graph Theoretic Lemma
From (Lovász, 1967) we recall:
Lemma 3 ().
Let be structures and , then:
If is connected, then
If is connected, then
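Identities of this kind can be checked numerically on small structures. The sketch below counts homomorphisms by brute force and tests three standard facts about homomorphism counts, which are assumed here to be among those recalled in the lemma: hom(A+B, C) = hom(A, C)·hom(B, C), hom(A, B×C) = hom(A, B)·hom(A, C), and, for connected A, hom(A, B+C) = hom(A, B) + hom(A, C). All function names, the tuple encoding of facts, and the example structures are illustrative assumptions.

```python
from itertools import product

def hom_count(a, b):
    """Brute-force number of homomorphisms from structure a to structure b."""
    dom_a = sorted({c for f in a for c in f[1:]}, key=repr)
    dom_b = sorted({c for f in b for c in f[1:]}, key=repr)
    count = 0
    for image in product(dom_b, repeat=len(dom_a)):
        h = dict(zip(dom_a, image))
        if all((f[0],) + tuple(h[c] for c in f[1:]) in b for f in a):
            count += 1
    return count

def disjoint_union(a, b):
    """Tag every constant with 0 or 1 so the two copies share no element."""
    def tag(s, i):
        return {(f[0],) + tuple((i, c) for c in f[1:]) for f in s}
    return tag(a, 0) | tag(b, 1)

def direct_product(a, b):
    """R((a1,b1),...,(ak,bk)) is a fact iff R(a1..ak) is in a and R(b1..bk) is in b."""
    return {(f[0],) + tuple(zip(f[1:], g[1:]))
            for f in a for g in b
            if f[0] == g[0] and len(f) == len(g)}

# Hypothetical example structures over one binary relation E.
path2 = {("E", "x", "y"), ("E", "y", "z")}      # connected
cycle3 = {("E", 0, 1), ("E", 1, 2), ("E", 2, 0)}
lasso = {("E", 0, 1), ("E", 1, 1)}

# A sum on the source side multiplies the counts.
print(hom_count(disjoint_union(path2, cycle3), lasso)
      == hom_count(path2, lasso) * hom_count(cycle3, lasso))
# A product on the target side multiplies the counts.
print(hom_count(path2, direct_product(cycle3, lasso))
      == hom_count(path2, cycle3) * hom_count(path2, lasso))
# For a connected source, a sum on the target side adds the counts.
print(hom_count(path2, disjoint_union(cycle3, lasso))
      == hom_count(path2, cycle3) + hom_count(path2, lasso))
```

The connectedness assumption in the last identity matters: for the disconnected source `disjoint_union(path2, path2)` the count into the sum of the two targets is 25, while the sum of the two separate counts is 13.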
2.3. Basic Mathematical Tools and Notations
We are going to use standard notation from linear algebra, which should be clear in most cases. Below we describe all the conventions that might be non-obvious:
• For a set , means the linear span of (i.e. the smallest linear space containing ). For a set , we define
For two vectors, denotes the dot product of . Vector is orthogonal to if and only if .
• For a vector , denotes the value of the -th coordinate of .
• For a matrix , denotes the value of the element in the -th row and -th column of .
• For a matrix and a set , .
We will use the following well-known mathematical facts:
Fact 4 ().
Let such that . Then there is a vector such that is orthogonal to but is not orthogonal to .
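One way to obtain such a vector explicitly is to subtract from the given vector its orthogonal projection onto the span. The sketch below does this with exact rational arithmetic and Gram-Schmidt orthogonalisation; it is an illustration only, and the names `dot` and `orthogonal_witness` as well as the sample vectors are assumptions.

```python
from fractions import Fraction

def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def orthogonal_witness(vectors, q):
    """Return z with z.v = 0 for all v in vectors and z.q != 0, or None.

    z is q minus its orthogonal projection onto span(vectors); it is
    non-zero exactly when q lies outside that span.
    """
    basis = []  # orthogonal basis of span(vectors), built by Gram-Schmidt
    for v in vectors:
        w = [Fraction(x) for x in v]
        for b in basis:
            c = dot(w, b) / dot(b, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        if any(w):
            basis.append(w)
    z = [Fraction(x) for x in q]
    for b in basis:
        c = dot(z, b) / dot(b, b)
        z = [zi - c * bi for zi, bi in zip(z, b)]
    return z if any(z) else None

# Hypothetical sample vectors.
z = orthogonal_witness([[1, 1, 0], [0, 1, 1]], [1, 0, 0])
print(z, dot(z, [1, 1, 0]), dot(z, [0, 1, 1]), dot(z, [1, 0, 0]))
```

Since the given vector decomposes as its projection plus `z`, the dot product of `z` with it equals the dot product of `z` with itself, which is positive whenever `z` is non-zero.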
Fact 5 ().
If matrix is nonsingular, then the mapping is a homeomorphism (a continuous bijection whose inverse function is continuous too).
Fact 6 ().
is a dense subset of , i. e., for any and there is such that .
Corollary 7 ().
Suppose is nonsingular. Then there is such that
Proof. From Fact 5 we know that the set has non-empty interior (i. e. the set of points satisfying the condition of Corollary 7), since it is a homeomorphic image of a set with non-empty interior. By Fact 6 we get that this interior must contain a point with rational coordinates.
Important Notational Convention.
equals in this paper.
3. The Path Queries Case
In this section we prove:
Theorem 1 ().
If is a set of path queries, and is a path query, then if and only if .
One may find this theorem a bit surprising. Path queries are a reasonably wide class of queries. And we have already learned that one should not expect a set-semantics based notion to agree with its multiset-semantics based counterpart on a wide class of objects. [Footnote 14: On the other hand, for path queries, query containment under set semantics also (trivially) coincides with query containment under bag semantics. We have no idea whether there is any relation between this observation and Theorem 1.] But, as it turns out, both versions of determinacy for path queries enjoy the same elegant combinatorial characterisation:
Definition 8 ().
For a set of path queries and for another path query we define an undirected graph as follows:
. In particular, the empty word and itself are elements of . [Footnote 15: Recall that we identify path queries with words over alphabet .]
There is an edge between and if and only if for some .
Fact 9 ().
iff there is a path in from to .
In order to prove Theorem 1 we will show that the same is true for determinacy in the multiset setting:
Lemma 9 ().
iff there is a path, in , from to .
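The reachability test behind this characterisation is easy to sketch. The code below assumes that the vertices of the graph are the prefixes of the query (including the empty word and the query itself) and that two prefixes are adjacent when one is obtained from the other by appending a word from the set of views. Path queries are encoded as Python strings with one character per relational symbol; this encoding, the function name `path_determined`, and the sample inputs are illustrative assumptions.

```python
from collections import deque

def path_determined(views, q):
    """Breadth-first search from the empty word to q in the prefix graph."""
    vertices = {q[:i] for i in range(len(q) + 1)}
    seen, queue = {""}, deque([""])
    while queue:
        w = queue.popleft()
        if w == q:
            return True
        for v in views:
            neighbours = []
            if (w + v) in vertices:          # step forward along the view v
                neighbours.append(w + v)
            if v and w.endswith(v):          # step backward along the view v
                neighbours.append(w[:len(w) - len(v)])
            for n in neighbours:
                if n not in seen:
                    seen.add(n)
                    queue.append(n)
    return False

# Hypothetical instances.
print(path_determined({"ab", "b", "bc"}, "abc"))
print(path_determined({"ab", "bc"}, "abc"))
```

In the first instance the search goes forward along `ab`, backward along `b` to the prefix `a`, and forward along `bc` to the query, so a path exists. In the second instance the empty word is connected only to `ab`, so the query is not reachable.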
The rest of this section is devoted to the proof of Lemma 9.
It turns out that the (simple) proof of the (⇒) direction for the set semantics survives also in the multiset context. We include it for the sake of completeness but, due to space limitations, defer it to Appendix B.
Let us now deal with the (⇐) direction.
Assume that and are fixed and such that there is a path in , of some length ,
from to .
This means that there exist:
• a sequence of prefixes of , with and ;
• a sequence of numbers , for , each of them either equal to 1 or ;
• a sequence of elements of such that, for each , one of the conditions is true:
• and ; • and .
We are going to show that in such a case there will also be . So we assume that there are two structures and such that for each . Without loss of generality we can also assume that domains of and are equal, so let . [Footnote 16: By domain we do not mean the active domain here: we accept that there are elements in or which do not appear in any facts.]
3.1. The -walks and how to reduce them.
Let be a new alphabet, where . [Footnote 17: Or schema; in the world of path queries, words and queries are the same thing.]
Definition 10 ().
Let and for some and for some . Then is called a -walk if:
for each it holds that ;
for each it holds that
where and is the -th symbol in .
Our path in , leading from to , induces, in a natural way, a -walk. [Footnote 18: If , by we mean reversed with every letter replaced with .] For clarity, let us illustrate this with:
Example 11 ().
Imagine that and . Then there is a path in . This path induces a -walk , which is equal to .
Now we are going to explain how each -walk can be turned into by a sequence of simple reductions:
Definition 12 ().
For any and for any we define:
Relations and are defined as the reflexive transitive closure of
and of , respectively.
Lemma 12 ().
If is a -walk, then and .
For the proof of Lemma 12 see Appendix C.
3.2. Seeing (and ) as relations in .
Definition 13 ().
Let . Then is the incidence matrix of the relation in structure , that is and if and only if and if and only if .
Definition 14 ().
Let . Then we define a matrix in the inductive way:
• is the identity matrix.
• For , .
It is well-known that:
Fact 15 ().
If then .
Matrices and are defined analogously, and, obviously, Fact 15 remains true for them.
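The connection between matrix products and multiset answers of path queries can be checked on a small example. The sketch below assumes that the domain is {0, ..., n-1}, that each relation is given by its 0/1 incidence matrix, and that the fact in question states that the entry of the word matrix at position (a, b) is the number of walks from a to b spelling the word, that is, the multiplicity of the pair (a, b) in the answer. The names `mat_mul`, `word_matrix`, `count_paths` and the sample relations are illustrative assumptions.

```python
def mat_mul(x, y):
    """Product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def word_matrix(word, incidence, n):
    """Identity for the empty word, otherwise the product of the letters' matrices."""
    m = [[int(i == j) for j in range(n)] for i in range(n)]
    for symbol in reversed(word):
        m = mat_mul(incidence[symbol], m)
    return m

def count_paths(word, incidence, a, b):
    """Brute-force number of walks from a to b whose labels spell `word`."""
    if not word:
        return int(a == b)
    n = len(incidence[word[0]])
    return sum(count_paths(word[1:], incidence, c, b)
               for c in range(n) if incidence[word[0]][a][c])

# Hypothetical structure: R = {(0,1), (0,2), (1,2)}, S = {(1,2), (2,2)}.
incidence = {
    "R": [[0, 1, 1], [0, 0, 1], [0, 0, 0]],
    "S": [[0, 0, 0], [0, 0, 1], [0, 0, 1]],
}
print(word_matrix("RS", incidence, 3)[0][2], count_paths("RS", incidence, 0, 2))
```

Because matrix multiplication is associative, it does not matter whether the inductive definition peels letters off the front or the back of the word; both readings give the same product.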
Of course in general we cannot assume that for . But, since for each we have , we know that for each we have , so we can write just instead of or . Recall that we need to show that . So, if we manage to somehow present (and hence also ) as a function of arguments , then we are done.
Let us also remark that if were invertible, for all , then it would be easy to see that and likewise . However, in the general case, there is of course no reason to think that the matrices are invertible, and thus we need our argument to be a little bit more sophisticated.
Now the matrices will be understood as linear functions. And these functions will be understood as relations. And, while we know that not all matrices are invertible, and in consequence not all the functions under consideration are, relations can always be inverted!
By we will denote the identity relation: .
Definition 16 ().
For a matrix let the function be defined as .
For a function let denote the relation equal to . [Footnote 19: This means that . We make such a distinction since composition and inversion work for functions slightly differently than for relations.]
For let and
For we define inductively:
• • for
The relations depend on (in the sense that they would not be equal if we computed them in instead of ), so the reader may think that there should be instead of . But omitting the superscript leads to no confusion: is the only structure for which the relations are ever considered.
Observation 17 ().
For , and
For the proof of the Observation use (easy) induction and the fact that for it holds that
It is well-known that the correspondence is 1-1. Also the correspondence is 1-1. So in order to represent as a function of arguments it is enough to represent as a function of . Which we do in the next subsection.
3.3. Using Lemma 12.
Let us start this subsection with a really very simple lemma:
Lemma 17 ().
Let . Then and .
Now we will see what the relations are good for:
Lemma 17 ().
if then ;
if then .
Lemma 17 ().
If is a -walk, then .
Our next corollary is certainly not going to come as a surprise:
Corollary 18 ().
If is a -walk, then .
Now, recall that is a -walk. So, by the last corollary .
4. The boolean case. Our main results.
In contrast to the set-semantics world, where determinacy is easily decidable for unary UCQs, and trivially decidable for boolean UCQs, in the multiset setting already the boolean case is undecidable:
Theorem 2 ().
The problem whether, for a set of boolean UCQs and for another boolean UCQ , it holds that is undecidable.
This negative result is not really hard to prove (see Appendix A). The main technical result of this paper, however, is:
Theorem 3 ().
The problem whether, for a given set of boolean conjunctive queries and for a given boolean conjunctive query , it holds that , is decidable.
Definition 19 ().
Let be the set . Let us also denote .
Queries from are the ones that cannot return 0 in any interesting (from the point of view of this proof) structure . Queries from are free to return 0, and they actually will.
Observation 20 ().
If is any structure, and then also .
Definition 21 ().
Let be the set of all connected components of the query . In other words, is a minimal set of structures such that for every connected component of there is isomorphic to . From now on, the letter will always denote the cardinality of . [Footnote 20: When we say “set” we mean that each such connected component only occurs once in . And we think that isomorphic structures are equal.] [Footnote 21: is an operation on structures here, as defined in Section 2.2.]
Queries from are going to serve us as basis queries, in the linear algebra sense:
Observation 22 ().
Let . Then for some .
Note that such a representation is unique. Thus:
Definition 23 ().
For a query we define the vector representation of as , where are as in Observation 22.
So all the queries of interest are now seen as vectors in some -dimensional vector space.
Observation 24 ().
If is any structure and then .
Proof: Notice that . The last equality follows from Lemma 3 (the Graph Theoretic Lemma of Section 2.2).
Now we are ready for our Main Lemma:
Lemma 24 ().
if and only if .
Clearly, Theorem 3 easily follows from the Main Lemma (Lemma 24), as finding is of course decidable (in – we first need to guess a set of homomorphisms and then check that we guessed all of them), while finding and testing whether are polynomial.
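The polynomial part of this procedure is ordinary linear algebra. The sketch below assumes that the condition to be tested is membership of the vector representation of the query in the linear span of the vector representations of the relevant views, and decides it by comparing ranks computed with exact Gaussian elimination. The function names `rank` and `in_span` and the sample vectors are illustrative assumptions, not the paper's notation.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a rational matrix via Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
    return r

def in_span(vectors, q):
    """q is in span(vectors) iff appending q does not increase the rank."""
    return rank(list(vectors) + [q]) == rank(list(vectors))

# Hypothetical vector representations over three basis queries.
views = [[2, 0, 1], [0, 1, 1]]
print(in_span(views, [2, 1, 2]))   # the sum of the two view vectors
print(in_span(views, [1, 0, 0]))
```

Exact fractions are used instead of floating point so that the rank comparison is not affected by rounding.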
Example 25 ().
Let be some non-empty, pairwise non-isomorphic structures and let:
Then for a structure : ,
If , then so it is uniquely determined by . This equality corresponds to the equality of vector representations . If , then for some , so and it is determined again.
It easily follows from the Main Lemma (Lemma 24) that in the very specific case of connected queries no non-trivial determinacy is possible:
Corollary 26 ().
If all the queries in are connected, and is connected, then if and only if .
5. Proof of Lemma 24. Part 1.
In this section we assume that . And we are going to show that . To this end we need to find a pair of structures and which is a counterexample for determinacy, which means that:
Notice that there is nothing in Definition 1 that would tell us where to look for such a counterexample: and are just any structures in this definition. Our main discovery is that if such and , forming a counterexample, can be found at all, then a counterexample can also be found in some -dimensional vector space that we are now going to introduce. And this is convenient, because living in a vector space one can use linear algebra tools. [Footnote 22: Recall that denotes, as always, the cardinality of , the set of basis queries.]
Definition 27 ().
For any set of structures (call them basis structures) let be the set of all structures which can be represented as sums of elements of , that is .
Now, the totally informal idea is as follows. We know that . So there is a vector which is orthogonal to for each but not to . Let us somehow define and in such a way that is “the difference” between and . Then none of the will spot the difference between and but will.
Definition 28 ().
A set of basis structures is decent if for each and for each we have .
It is easy to see that:
Observation 29 ().
If is decent, then for each and for each we have . In consequence, if is decent, then any pair of structures from satisfies condition (B0) above.
Definition 30 ().
For a set of structures we define its evaluation matrix by the formula .
In other words, the -entry of is defined as the number of homomorphisms from to .
Definition 31 ().
is good when is decent and is nonsingular.
Recall that the set , consisting of queries, is also a set of