1 Introduction
In this paper we revisit the foundations of the relational model and unearth universal nulls, showing that they can be treated on par with the usual existential nulls [19, 12, 13]. Recall that an existential null in a tuple in a relation represents an existentially quantified variable in an atomic sentence. This corresponds to the intuition “value exists, but is unknown.” A universal null, on the other hand, does not represent anything unknown, but stands for all values of the domain. In other words, a universal null represents a universally quantified variable. Universal nulls have an obvious application in databases, as the following example shows. The symbol “*” denotes a universal null.
Example 1
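Consider binary relations $F$(ollows) and $H$(obbies), where $F(x,y)$ means that user $x$ follows user $y$ on a social media site, and $H(x,z)$ means that $z$ is a hobby of user $x$. Let the database be the following.

| $F$ | |
|---|---|
| Alice | Chris |
| * | Alice |
| Bob | * |
| Chris | Bob |
| David | Bob |

| $H$ | |
|---|---|
| Alice | Movies |
| Alice | Music |
| Bob | Basketball |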
This is to be interpreted as expressing the facts that Alice follows Chris, and that Chris and David follow Bob. Alice is a journalist who would like to give everyone access to the articles she shares on the social media site. Therefore, everyone can follow Alice. Bob is the site administrator, and is granted access to all files anyone shares on the site. Consequently, Bob follows everyone. “Everyone” in this context means all current and possible future users. The query below, in domain relational calculus, asks for the interests of people who are followed by everyone:
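(1) $x_4 \,.\, \exists x_2 \exists x_3 \forall x_1 \big( F(x_1,x_2) \wedge H(x_3,x_4) \wedge (x_2 \approx x_3) \big)$

The answer to our example query is $\{(\text{Movies}), (\text{Music})\}$. Note that star-nulls also can be part of an answer. For instance, the query $x_1, x_2 \,.\, F(x_1,x_2)$ would return all the tuples in $F$.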
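The intended meaning of the star-nulls in Example 1 can be illustrated by brute force. The following is a minimal sketch, assuming a small finite stand-in for the infinite domain: the extra user `Eve` is hypothetical and plays the role of a possible future user. Each “*” is expanded over that stand-in domain and query (1) is then evaluated directly. The names `DOMAIN`, `expand`, and `answer` are illustrative and not part of the formal development.

```python
from itertools import product

STAR = "*"

# Finite stand-in for the infinite domain (an assumption of this sketch).
# "Eve" is a hypothetical future user who is not mentioned in the database.
DOMAIN = {"Alice", "Bob", "Chris", "David", "Eve",
          "Movies", "Music", "Basketball"}

# Star-tables of Example 1.
F = [("Alice", "Chris"), (STAR, "Alice"), ("Bob", STAR),
     ("Chris", "Bob"), ("David", "Bob")]
H = [("Alice", "Movies"), ("Alice", "Music"), ("Bob", "Basketball")]


def expand(table):
    """Replace every '*' by all values of DOMAIN."""
    rows = set()
    for row in table:
        choices = [DOMAIN if value == STAR else {value} for value in row]
        rows.update(product(*choices))
    return rows


F_inst, H_inst = expand(F), expand(H)

# Query (1): the hobbies x4 of some user x2 = x3 who is followed by every x1.
answer = sorted({x4 for (x3, x4) in H_inst
                 if all((x1, x3) in F_inst for x1 in DOMAIN)})
print(answer)
```

Only Alice is followed by every element of the stand-in domain, including the hypothetical `Eve`, so the sketch returns her two hobbies. A finite expansion of this kind only approximates the intended semantics; the point of the star-cylinders developed in the paper is to avoid such an expansion altogether.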
Another area of applications of “*”-nulls relates to intuitionistic, or constructive, database logic. In the constructive four-valued approach of [15] and the three-valued approach of [13, 22] the proposition $P \vee \neg P$ is not a tautology. In order for $P \vee \neg P$ to be true, we need either a constructive proof of $P$ or a constructive proof of $\neg P$. Therefore both [15] and [22] assume that the database $I$ has a theory of the negative information, i.e. that $I = (I^+, I^-)$, where $I^+$ contains the positive information and $I^-$ the negative information. The papers [15] and [22] then show how to transform an FO-query $\varphi$ to a pair of queries $(\varphi^+, \varphi^-)$ such that $\varphi^+$ returns the tuples for which $\varphi$ is true in $I$, and $\varphi^-$ returns the tuples for which $\neg\varphi$ is true in $I$ (i.e. $\varphi$ is false in $I$). It turns out that databases containing “*”-nulls are suitable for storing $I^-$.
Example 2
Suppose that the instance in Example 1 represents $I^+$, and that all the negative information we have deduced about the $H$ relation is that Alice doesn’t play Volleyball, that Bob only has Basketball as a hobby, and that Chris has no hobby at all. This negative information about the $H$ relation is represented by the table $H^-$ below. Note that $H^-$ is part of $I^-$.

| $H^-$ | |
|---|---|
| Alice | Volleyball |
| Bob | * (except Basketball) |
| Chris | * |
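Suppose the query $\varphi$ asks for people who have a hobby, that is $\varphi = x_1 \,.\, \exists x_2\, H(x_1,x_2)$. Then $\varphi^+ = \varphi$, and $\varphi^- = x_1 \,.\, \forall x_2\, H(x_1,x_2)$. Evaluating $\varphi^+$ on $I^+$ returns $\{(\text{Alice}), (\text{Bob})\}$, and evaluating $\varphi^-$ on $I^-$ returns $\{(\text{Chris})\}$. Note that there is no closed-world assumption, as the negative facts are explicit. Thus it is unknown whether David has a hobby or not.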
Universal nulls were first studied in the early days of database theory by Biskup in [6]. This was a follow-up on his earlier paper on existential nulls [5]. The problem with Biskup’s approach, as noted by himself, was that the semantics for his algebra worked only for individual operators, not for compound expressions (i.e. queries). This was remedied in the foundational paper [19] by Imielinski and Lipski, as far as existential nulls were concerned. Universal nulls next came up in [20], where Imielinski and Lipski showed that Codd’s Relational Algebra could be embedded in CA, the Cylindric Set Algebra of Henkin, Monk, and Tarski [16, 17]. As a side remark, Imielinski and Lipski suggested that the semantics of their “*” symbol could be seen as modeling the universal null of Biskup. In this paper we follow their suggestion (footnote 1: We note that Sundarmurthy et al. [25] very recently have proposed a construct related to our universal nulls, and studied ways of placing constraints on them), and fully develop a finitary representation mechanism for databases with universal nulls, as well as an accompanying finitary algebra. We show that any FO (First Order / Domain Relational Calculus) query can be translated into an equivalent expression in a finitary version of CA, and that such algebraic expressions can be evaluated “naively” by the rules “* = *” and “* = a” for any constant “a.” Our finitary version is called Cylindric Star Algebra (SCA) and operates on finite relations containing constants and universal nulls “*.” These relations are called Star Cylinders and they are finite representations of a subclass of the infinite cylinders of Henkin, Monk, and Tarski.
Interestingly, the class of star-cylinders is closed under first order querying, meaning that the infinite result of an FO query on an infinite instance represented by a finite sequence of finite star-cylinders can be represented by a finite star-cylinder (footnote 2: Consequently there is no need to require calculus queries to be “domain independent.”). This is achieved by showing that the class of star-cylinders is closed under our cylindric star-algebra, and that SCA as a query language is equivalent in expressive power with FO.
The Cylindric Set Algebra [16, 17], as an algebraization of first order logic, is an algebra on sets of valuations of variables in an FO-formula. A valuation $\nu$ of variables $\{x_1, x_2, \ldots\}$ can be represented as a tuple $\nu$, where $\nu(i) = \nu(x_i)$. The set of all valuations can then be represented by a relation $C$ of such tuples. In particular, if the FO-formula only involves a finite number $n$ of variables, then the representing relation $C$ has arity $n$. Note however that $C$ has an infinite number of tuples, since the domain of the variables (such as the users of a social media site) should be assumed unbounded. One of the basic connections [16, 17] between FO and Cylindric Set Algebra is that, given any interpretation $I$ and FO-formula $\varphi$, the set of valuations $\nu$ under which $\varphi$ is true in $I$ can be represented as such a relation $C$. Moreover, each logical connective and quantifier corresponds to an operator in the Cylindric Set Algebra. Naturally disjunction corresponds to union, conjunction to intersection, and negation to complement. More interestingly, existential quantification on variable $x_i$ corresponds to cylindrification $\mathsf{c}_i$ on column $i$, where

$$\mathsf{c}_i(C) = \{\nu : \nu(i/a) \in C \text{ for some } a \in \mathbb{D}\},$$

and $\nu(i/a)$ denotes the valuation (tuple) $\nu'$, where $\nu'(i) = a$ and $\nu'(j) = \nu(j)$ for $j \neq i$. The algebraic counterpart of universal quantification can be derived from cylindrification and complement, or be defined directly as inner cylindrification

$$\mathsf{ɔ}_i(C) = \{\nu : \nu(i/a) \in C \text{ for all } a \in \mathbb{D}\}.$$

In addition, in order to represent equality, the Cylindric Set Algebra also contains constant relations $d_{ij}$ representing the equality $x_i \approx x_j$. That is, $d_{ij}$ is the set of all valuations $\nu$ such that $\nu(i) = \nu(j)$.
The objects $C$ and $d_{ij}$ of [16, 17] are of course infinitary. In this paper we therefore develop a finitary representation mechanism, namely relations containing universal nulls “*” and certain equality literals. These objects are called Star Tables when they represent the records stored in the database. When used as run-time constructs in algebraic query evaluation, they will be called Star Cylinders. Example 1 showed star-tables in a database. The run-time variable binding pattern of the query (1), as well as its algebraic evaluation, is shown in the star-cylinders in Example 3 below.
Example 3
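Continuing Example 1, in that database the atoms $F(x_1,x_2)$ and $H(x_3,x_4)$ of query (1) are represented by star-tables $C_F$ and $C_H$, and the equality atom $x_2 \approx x_3$ is represented by the star-cylinder $C_{23}$. Note that these are positional relations; the “attributes” are added for illustrative purposes only.

$C_F$:

| $x_1$ | $x_2$ | $x_3$ | $x_4$ |
|---|---|---|---|
| Alice | Chris | * | * |
| * | Alice | * | * |
| Bob | * | * | * |
| Chris | Bob | * | * |

$C_H$:

| $x_1$ | $x_2$ | $x_3$ | $x_4$ |
|---|---|---|---|
| * | * | Alice | Movies |
| * | * | Alice | Music |
| * | * | Bob | Basketball |

$C_{23}$:

| $x_1$ | $x_2$ | $x_3$ | $x_4$ | condition |
|---|---|---|---|---|
| * | * | * | * | 2=3 |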
The algebraic translation of query (1) is the SCA-expression
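(2) $\dot{\mathsf{c}}_2\big(\dot{\mathsf{c}}_3\big(\dot{\mathsf{ɔ}}_1\big((C_F \mathbin{\dot\cap} C_H) \mathbin{\dot\cap} C_{23}\big)\big)\big)$

The intersection of $C_F$ and $C_H$ is carried out as star-intersection $\dot\cap$, where for instance $\{(a,*,*)\} \mathbin{\dot\cap} \{(*,b,*)\} = \{(a,b,*)\}$. The result will contain 12 tuples, and when these are star-intersected with $C_{23}$, the star-cylinder $C_{23}$ will act as a selection by columns 2 and 3 being equal. The result is the star-cylinder below.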
| $x_1$ | $x_2$ | $x_3$ | $x_4$ |
|---|---|---|---|
| * | Alice | Alice | Movies |
| * | Alice | Alice | Music |
| Bob | Alice | Alice | Movies |
| Bob | Alice | Alice | Music |
| Bob | Bob | Bob | Basketball |
| Chris | Bob | Bob | Basketball |
The inner star-cylindrification on column 1 then yields
| $x_1$ | $x_2$ | $x_3$ | $x_4$ |
|---|---|---|---|
| * | Alice | Alice | Movies |
| * | Alice | Alice | Music |

Finally, applying outer star-cylindrifications on columns 2 and 3 of this star-cylinder yields the final result

| $x_1$ | $x_2$ | $x_3$ | $x_4$ |
|---|---|---|---|
| * | * | * | Movies |
| * | * | * | Music |

The system can now return the answer, i.e. the values of column 4 in the final cylinder. Note that columns where all rows are “*” do not actually have to be materialized at any stage. Negation requires some additional details that will be introduced in Section 3.2.
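The evaluation steps of Example 3 can be replayed in a few lines of code. The following is a minimal sketch of the positive case only, under two simplifying assumptions that are ours rather than the paper's: tuples carry no stored equality conditions, and the diagonal for columns 2 and 3 is applied as a selection function (`select_equal`) instead of being stored as a cylinder. All function names are illustrative.

```python
STAR = "*"


def meet(a, b):
    """Combine two entries naively: '*' matches anything, constants must agree."""
    if a == STAR:
        return b
    if b == STAR or a == b:
        return a
    return None  # two different constants: no common value


def star_intersect(c1, c2):
    """Star-intersection: combine all pairs of rows entry by entry."""
    result = set()
    for t in c1:
        for u in c2:
            row = tuple(meet(a, b) for a, b in zip(t, u))
            if None not in row:
                result.add(row)
    return result


def select_equal(c, i, j):
    """Intersect with the diagonal i = j (1-based columns), as a selection."""
    result = set()
    for t in c:
        a, b = t[i - 1], t[j - 1]
        if a == STAR and b == STAR:
            # Would need an explicit equality condition, which this sketch omits.
            raise NotImplementedError("both columns are '*'")
        value = meet(a, b)
        if value is not None:
            row = list(t)
            row[i - 1] = row[j - 1] = value
            result.add(tuple(row))
    return result


def inner_cyl(c, i):
    """Inner star-cylindrification: keep the rows with '*' in column i."""
    return {t for t in c if t[i - 1] == STAR}


def outer_cyl(c, i):
    """Outer star-cylindrification: replace column i by '*'."""
    return {t[:i - 1] + (STAR,) + t[i:] for t in c}


# The star-cylinders for F and H as listed in Example 3.
C_F = {("Alice", "Chris", STAR, STAR), (STAR, "Alice", STAR, STAR),
       ("Bob", STAR, STAR, STAR), ("Chris", "Bob", STAR, STAR)}
C_H = {(STAR, STAR, "Alice", "Movies"), (STAR, STAR, "Alice", "Music"),
       (STAR, STAR, "Bob", "Basketball")}

joined = star_intersect(C_F, C_H)          # 12 rows
selected = select_equal(joined, 2, 3)      # 6 rows
universal = inner_cyl(selected, 1)         # 2 rows
final = outer_cyl(outer_cyl(universal, 2), 3)
answer = sorted(t[3] for t in final)
print(answer)
```

Keeping only the rows with “*” in column 1 is sound here because the domain is infinite: without stored conditions, finitely many rows with a constant in column 1 can never cover every value of that column.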
The aim of this paper is to develop a clean and sound modelling of universal nulls, and furthermore show that the model can be seamlessly extended to incorporate the existential nulls of Imielinski and Lipski [19]. We show that FO and our SCA are equivalent in expressive power when it comes to querying databases containing universal nulls, and that SCA queries can be evaluated (semi) naively. This will be done in three steps: In Section 2 we show the equivalence between FO and Cylindric Set Algebra over infinitary databases. This was of course only the starting point of [16, 17], and we recast the result here in terms of database theory (footnote 3: Van Den Bussche [9] has recently referred to [16, 17] in similar terms). In Section 3 we introduce our finitary Cylindric Star Algebra. Section 3.1 develops the machinery for the positive case, where there is no negation in the query or database. This is then extended to include negation in Section 3.2. By these two sections we show that certain infinitary cylinders can be finitely represented as star-cylinders, and that our finitary Cylindric Star Algebra on finite star-cylinders mirrors the Cylindric Set Algebra on the infinite cylinders they represent. In Section 4 we tie these two results together, delivering the promised SCA evaluation of FO queries on databases containing universal nulls. In Section 5 we seamlessly extend our framework to also handle existential nulls, and show that naive evaluation can still be used for positive queries (allowing universal quantification, but not negation) on databases containing both universal and existential nulls. Section 6 then shows that all SCA expressions can be evaluated in time polynomial in the size of the database when only universal nulls are present. We also show that when both universal and existential nulls are present, the certain answer to any negation-free (allowing inner cylindrification, i.e. universal quantification) SCA-query can be evaluated naively in polynomial time. When negation is present it has long been known that the problem is coNP-complete for databases containing existential nulls. We show that the problem remains coNP-complete when universal nulls are allowed in addition to the existential ones. For databases containing existential nulls it has been known that database containment and view containment are coNP-complete and $\Pi^p_2$-complete, respectively. We also show that the addition of universal nulls does not increase these complexities.
2 Relational calculus and cylindric set algebra
Throughout this paper we assume a fixed schema $\mathscr{R} = \{R_1, \ldots, R_m, \approx\}$, where each $R_p$, $p \in \{1, \ldots, m\}$, is a relational symbol with an associated positive integer $ar(R_p)$, called the arity of $R_p$. The symbol $\approx$ represents equality.
Logic. Our calculus is the standard domain relational calculus. Let $\{x_1, x_2, \ldots\}$ be a countably infinite set of variables. We define the set of FO-formulas $\varphi$ (over $\mathscr{R}$) in the usual way: $R_p(x_{i_1}, \ldots, x_{i_{ar(R_p)}})$ and $x_i \approx x_j$ are atomic formulas, and these are closed under $\wedge$, $\vee$, $\neg$, $\exists x_i$, and $\forall x_i$ in a well-formed manner, possibly using parentheses for disambiguation.
Let $\varphi$ be an FO-formula. We denote by $vars(\varphi)$ the set of variables in $\varphi$, by $free(\varphi)$ the set of free variables in $\varphi$, and by $sub(\varphi)$ the set of subformulas of $\varphi$ (for formal definitions, see [1]). If $\varphi$ has $n$ variables we say that $\varphi$ is an $\mathrm{FO}_n$-formula. We assume without loss of generality that each variable occurs only once in the formula, except in equality literals, and that a formula with $n$ variables uses variables $x_1, \ldots, x_n$.
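Instances. Let $\mathbb{D}$ be a countably infinite domain. An instance $I$ (over $\mathscr{R}$) is a mapping that assigns a possibly infinite subset $R_p^I$ of $\mathbb{D}^{ar(R_p)}$ to each relation symbol $R_p$, and $\approx^I = \{(a,a) : a \in \mathbb{D}\}$. Note that our instances are infinite model-theoretic ones. The set of tuples actually recorded in the database will be called the stored database (to be defined in Section 4).
In order to define the (standard) notion of truth of an FO-formula $\varphi$ in an instance $I$ we first define a valuation to be a mapping $\nu : \{x_1, x_2, \ldots\} \to \mathbb{D}$. If $\nu$ is a valuation, $x_i$ a variable and $a \in \mathbb{D}$, then $\nu(i/a)$ denotes the valuation which is the same as $\nu$, except that $\nu(i/a)(x_i) = a$. Then we use the usual recursive definition of $I \models_\nu \varphi$, meaning instance $I$ satisfying $\varphi$ under valuation $\nu$, i.e. $I \models_\nu R_p(x_{i_1}, \ldots, x_{i_k})$ if $(\nu(x_{i_1}), \ldots, \nu(x_{i_k})) \in R_p^I$, $I \models_\nu (x_i \approx x_j)$ if $\nu(x_i) = \nu(x_j)$, and $I \models_\nu \exists x_i\, \varphi$ if $I \models_{\nu(i/a)} \varphi$ for some $a \in \mathbb{D}$, and so on. Our stored databases will be finite representations of infinite instances, so the semantics of answers to FO-queries will be defined in terms of the infinite instances: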
Definition 1
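Let $I$ be an instance, and $\varphi$ an $\mathrm{FO}_n$-formula with $free(\varphi) = \{x_{i_1}, \ldots, x_{i_k}\}$, $k \leq n$. Then the answer to $\varphi$ on $I$ is defined as

$$\varphi^I = \{(\nu(x_{i_1}), \ldots, \nu(x_{i_k})) : I \models_\nu \varphi\}.$$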
Algebra. As noted in [20] the relational algebra is really a disguised version of the Cylindric Set Algebra of Henkin, Monk, and Tarski [16, 17]. We shall therefore work directly with the Cylindric Set Algebra instead of Codd’s Relational Algebra. Apart from the conceptual clarity, the Cylindric Set Algebra will also allow us to smoothly introduce the promised universal nulls.
Let $n$ be a fixed positive integer. The basic building block of the Cylindric Set Algebra is an $n$-dimensional cylinder $C \subseteq \mathbb{D}^n$. Note that a cylinder is essentially an infinite $n$-ary relation. They will however be called cylinders, in order to distinguish them from instances. The rows in a cylinder will represent run-time variable valuations, whereas tuples in instances represent facts about the real world. We also have special cylinders called diagonals, of the form $d_{ij} = \{\nu \in \mathbb{D}^n : \nu(i) = \nu(j)\}$, representing the equality $x_i \approx x_j$. We can now define the Cylindric Set Algebra.
Definition 2
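Let $C$ and $C'$ be infinite $n$-dimensional cylinders. The Cylindric Set Algebra consists of the following operators.
- Union: $C \cup C'$. Set theoretic union.
- Complement: $\overline{C} = \mathbb{D}^n \setminus C$.
- Outer cylindrification: $\mathsf{c}_i(C) = \{\nu \in \mathbb{D}^n : \nu(i/a) \in C \text{ for some } a \in \mathbb{D}\}$.

The operation $\mathsf{c}_i$ is called outer cylindrification on the $i$:th dimension, and will correspond to existential quantification of variable $x_i$. For the geometric intuition behind the name cylindrification, see [16, 20]. Intersection is considered a derived operator, and we also introduce the following derived operator:
- Inner cylindrification: $\mathsf{ɔ}_i(C) = \{\nu \in \mathbb{D}^n : \nu(i/a) \in C \text{ for all } a \in \mathbb{D}\}$, corresponding to universal quantification. Note that $\mathsf{ɔ}_i(C) = \overline{\mathsf{c}_i(\overline{C})}$.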
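The operators of Definition 2 can be tried out on a small example. The following is a minimal sketch, assuming a three-element stand-in `D` for the infinite domain and dimension `N = 2`; cylinders are plain Python sets of tuples, and all names are illustrative. It checks, on one cylinder, that inner cylindrification coincides with complement, outer cylindrification, complement.

```python
from itertools import product

D = ("a", "b", "c")   # finite stand-in for the infinite domain (assumption)
N = 2                 # dimension of the cylinders

FULL = set(product(D, repeat=N))   # the full cylinder D^N


def complement(c):
    """Complement relative to the full cylinder."""
    return FULL - c


def outer(c, i):
    """Outer cylindrification on column i (1-based): existential quantification."""
    return {t[:i - 1] + (a,) + t[i:] for t in c for a in D}


def inner(c, i):
    """Inner cylindrification on column i (1-based): universal quantification."""
    return {t for t in FULL
            if all(t[:i - 1] + (a,) + t[i:] in c for a in D)}


def diagonal(i, j):
    """Diagonal d_ij: all tuples whose columns i and j agree."""
    return {t for t in FULL if t[i - 1] == t[j - 1]}


C = {("a", "a"), ("b", "a"), ("c", "a"), ("a", "b")}

exists_col1 = outer(C, 1)    # second column is "a" or "b"
forall_col1 = inner(C, 1)    # second column is "a"
dual = complement(outer(complement(C), 1))
print(forall_col1 == dual)
```

With a genuinely infinite domain the sets `FULL` and `complement(c)` cannot be materialized, which is why a finite representation of the cylinders is needed.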
We also need the notion of cylindric set algebra expressions.
Definition 3
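Let $\mathbf{C} = (C_1, \ldots, C_m, d_{11}, d_{12}, \ldots, d_{nn})$ be a sequence of infinite $n$-dimensional cylinders and diagonals. The set of CA-expressions (over $\mathbf{C}$) is obtained by closing the atomic expressions $C_p$ and $d_{ij}$ under union, intersection, complement, and inner and outer cylindrifications. Then $E(\mathbf{C})$, the value of expression $E$ on sequence $\mathbf{C}$, is defined in the usual way, e.g. $C_p(\mathbf{C}) = C_p$, $\mathsf{c}_i(E)(\mathbf{C}) = \mathsf{c}_i(E(\mathbf{C}))$, etc.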
Equivalence of FO and CA. In the next two theorems we will restate, in the context of the relational model, the correspondence between domain relational calculus and cylindric set algebra as query languages on instances [16, 17]. An expression in cylindric set algebra of dimension $n$ will be called a $\mathrm{CA}_n$-expression. When translating an $\mathrm{FO}_n$-formula to a $\mathrm{CA}_n$-expression we first need to extend all $k$-ary relations in $I$ to $n$-ary by filling the last $n-k$ columns in all possible ways. Formally, this is expressed as follows:
Definition 4
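The horizontal $n$-expansion of an infinite $k$-ary relation $R$ is

$$h_n(R) = R \times \mathbb{D}^{n-k}.$$

The equality relation $\approx^I$ is expanded into diagonals $d_{ij}$ for $i, j \in \{1, \ldots, n\}$, where

$$d_{ij} = \{\nu \in \mathbb{D}^n : \nu(i) = \nu(j)\},$$

and for an instance $I$, we have

$$h_n(I) = \big(h_n(R_1^I), \ldots, h_n(R_m^I), d_{11}, d_{12}, \ldots, d_{nn}\big).$$

Once an instance is expanded it becomes a sequence of $n$-dimensional cylinders and diagonals, on which Cylindric Set Algebra expressions can be applied.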
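The horizontal expansion of Definition 4 can be illustrated as follows. This is a minimal sketch, assuming a two-element stand-in `D` for the infinite domain; the function names `h_expand` and `diagonal` are illustrative. A binary relation is padded to dimension 3 by filling the last column in all possible ways.

```python
from itertools import product

D = ("a", "b")   # finite stand-in for the infinite domain (assumption)


def h_expand(relation, k, n):
    """Pad every k-ary tuple with all possible values in the last n - k columns."""
    return {t + pad for t in relation for pad in product(D, repeat=n - k)}


def diagonal(i, j, n):
    """n-dimensional diagonal: all tuples whose columns i and j (1-based) agree."""
    return {t for t in product(D, repeat=n) if t[i - 1] == t[j - 1]}


R = {("a", "b")}
expanded = h_expand(R, 2, 3)
print(sorted(expanded))
print(len(diagonal(1, 3, 3)))
```

Over the finite stand-in the expansion has `len(D) ** (n - k)` rows per original tuple; over an infinite domain it is infinite.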
The main technical difficulty in the translation from FO to CA is the correlation of the variables in the FO-sentence with the columns in the expanded relations in the instance. This can be achieved using a derived “swapping” operator that interchanges the columns and , where . (Footnote 4: This was already implicitly done in the expansion of in Definition 4. For a definition of swapping using the primitive operators, see Definition 1.5.12 in [16].) Every atom in will correspond to a CA-expression . However, for every occurrence of an atom in we need to interchange the columns with columns . This is achieved by the expression .
Among the many identities holding in Cylindric Set Algebra we will in the sequel need the following ones
Proposition 1
[16]. Let be an -dimensional cylinder, and . Then
-
-
-
-
If then
-
If and then
Proposition 2
Let be pairwise distinct natural numbers, such that , and let be an -dimensional cylinder that is 2-full (footnote 5: Cylinder is -full if .) and -full. Then
Proof:
The second equality follows from Theorem 1.5.18 in [16], the third equality holds since and , the fourth since . The last two equalities follow from Theorem 1.5.17 and 1.5.13 in [16], respectively.
The entire FO-formula with will then correspond to the CA-expression , where is defined recursively as follows:
-
If where , then
-
If , then .
-
If , then , if , then , and if , then .
-
If , then .
-
If , then .
For an example, let us reformulate the -query from (1) as
When translating the relation is first expanded to , and is expanded to . In order to correlate the variables in with the columns in the expanded databases, we do the shifts and . The equality was expanded to the diagonal so here the variables are already correlated. After this the conjunctions are replaced with intersections and the quantifiers with cylindrifications. Finally, the column corresponding to the free variable in (whose bindings will constitute the answer) is shifted to column 1. The final CA-expression will then be evaluated against as
We now have . The following fundamental result follows from [16, 17], but we prove it here for the benefit of the readers who don’t want to consult [16, 17].
Theorem 1
For all FO-formulas , there is a CA expression , such that
for all instances .
Proof: We prove the stronger claim: For all FO-formulas , for all , with , there is a CA expression , such that
for all instances . The main claim then follows since , and the outermost sequence of swappings can be considered part of the final expression . In all cases below we assume wlog (footnote 6: If we can introduce an additional variable and the conjunct which would assure that the :st dimension is full. Alternatively, we could introduce swapping as a primitive in the algebra. This however would require a corresponding renaming operator in the FO-formulas, see [16].) that so that the :st column can be used in the necessary swappings.
-
, where . We let We have
-
. We assume wlog that so that swaps can be performed. We let . We then have
-
, with . We assume wlog that . Then , and the inductive hypothesis is
We have
-
, with , , , , and (footnote 7: The last assumption is needed in steps .). Now . The inductive hypothesis is
We have
-
, with . Let