Datalog: Bag Semantics via Set Semantics

03/17/2018 ∙ by Leopoldo Bertossi, et al. ∙ Carleton University, University of Oxford, TU Wien

Duplicates in data management are common and problematic. In this work, we present a translation of Datalog under bag semantics into a well-behaved extension of Datalog (the so-called warded Datalog±) under set semantics. From a theoretical point of view, this allows us to reason about bag semantics by making use of the well-established theoretical foundations of set semantics. From a practical point of view, this allows us to handle the bag semantics of Datalog with powerful, existing query engines for the required extension of Datalog. Moreover, this translation has the potential for further extensions, above all to capture the bag semantics of the semantic web query language SPARQL.




1 Introduction

Duplicates are a common feature in data management. They appear, for instance, in the result of SQL queries over relational databases or when a SPARQL query is posed over RDF data. However, the semantics of data operations and queries in the presence of duplicates is not always clear, because duplicates are handled by bags or multisets, but common logic-based semantics in data management are set-theoretical, making it difficult to tell apart duplicates through the use of sets alone. To address this problem, a bag semantics for Datalog programs was proposed in [21], which we refer to as the derivation-tree bag semantics (DTB semantics). Intuitively, two duplicates of the same tuple in an intensional predicate are accepted as such if they have syntactically different derivation trees. The DTB semantics was used in [2] to provide a bag semantics for SPARQL.

The DTB semantics follows a proof-theoretic approach, which requires external, meta-level reasoning over the set of all derivation trees rather than allowing for a query language that inherently collects those duplicates. The main goal of this paper is to identify a syntactic class of extended Datalog programs, such that: (a) it extends classical Datalog with stratified negation and has a classical model-based semantics, (b) for every program in the class with a bag semantics, another program in the same class can be built that has a set-semantics and fully captures the bag semantics of the initial program, (c) it can be used in particular to give a set-semantics for classical Datalog with stratified negation with bag semantics. All these results can be applied, in particular, to multi-relational algebra, i.e. relational algebra with duplicates.

To this end, we show that the DTB semantics of a Datalog program can be represented by means of its transformation into a Datalog± program [8, 10], in such a way that the intended model of the former, including duplicates, can be characterized as the result of the duplicate-free chase instance for the latter. The crucial idea of our translation from bag semantics into set semantics (of Datalog±) is the introduction of tuple ids (tids) via existentially quantified variables in the rule heads. Different tids of the same tuple will allow us to identify usual duplicates when falling back to a bag semantics for the original Datalog program. We establish the correspondence between the DTB semantics and ours. This correspondence is then extended to Datalog with stratified negation. We thus recover full relational algebra (including set difference) with bag semantics in terms of a well-behaved query language under set semantics.

The programs we use for this task belong to warded Datalog± [17]. This is a particularly well-behaved class of programs in that it properly extends Datalog, has a tractable conjunctive query answering (CQA) problem, and has recently been implemented in a powerful query engine, namely the VADALOG System [6, 7]. None of the other well-known classes of Datalog± share these properties: for instance, guarded [8], sticky and weakly-sticky [11] Datalog± only allow restricted forms of joins and, hence, do not cover Datalog. On the other hand, more expressive languages, such as weakly frontier guarded Datalog± [5], lose tractability of CQA. Warded Datalog± has been successfully applied to represent a core fragment of SPARQL under certain OWL 2 QL entailment regimes [16], with set semantics though [17] (see also [3, 4]), and it looks promising as a general language for specifying different data management tasks [6].

We then go one step further and also express the bag semantics of Datalog± by means of the set semantics of Datalog±. In fact, we show that the bag semantics of a very general language in the Datalog± class can be expressed via the set semantics of Datalog±, and that the transformed program is warded whenever the initial program is.

Structure and main results. In Section 2, we recall some basic notions. The main results of the paper are detailed in Sections 3 to 6. In Section 7, we conclude and discuss some future extensions.

  • Our translation of Datalog with bag semantics into warded Datalog± with set semantics, which will be referred to as program-based bag (PBB) semantics, is presented in Section 3. We also show how this translation can be extended to Datalog with stratified negation.

  • In Section 4, we study the transformation from bag semantics into set semantics for Datalog± itself. We thus carry over both the DTB semantics and the PBB semantics to Datalog± with a form of stratified negation, and establish the equivalence of these two semantics also for this extended query language. Moreover, we verify that the Datalog± programs resulting from our transformation are warded whenever the programs to start with belong to this class.

  • In Section 5, we study crucial decision problems related to multiplicities. Above all, we are interested in the question of whether a given tuple has finite or infinite multiplicity. Moreover, in case of finiteness, we want to compute the precise multiplicity. We show that these tasks can be accomplished in polynomial time (data complexity).

  • In Section 6, we apply our results on Datalog with bag semantics to Multiset Relational Algebra (MRA). We also discuss alternative semantics for multiset-intersection and multiset-difference, and the difficulties of capturing them with our Datalog± approach.

2 Preliminaries

We assume familiarity with the relational data model, conjunctive queries (CQs), in particular Boolean conjunctive queries (BCQs); classical Datalog with minimal-model semantics, and Datalog with stratified negation with standard-model semantics, denoted Datalog$^{\neg s}$ (see [1] for an introduction). An $n$-ary relational predicate $P$ has positions $P[1], \ldots, P[n]$. With $pos(P)$ we denote the set of positions of predicate $P$, and with $pos(\Pi)$ the set of positions of (predicates in) a program $\Pi$.

2.1 Derivation-Tree Bag (DTB) Semantics for Datalog and Datalog$^{\neg s}$

We follow [21], where tuples are colored to tell apart duplicates of the same element in the extensional database (EDB), via an infinite, ordered list of colors  . For a multiset ,   if holds, where denotes the multiplicity of in . In this case, the copies of are denoted by , indicating that they are colored with , respectively. So, becomes a set. A multiset is (multi)contained in multiset , when for every . For a multiset , , which is a set. For a "colored" set , produces a multiset by stripping tuples of their colors.

For , , and . The inverse operation, the decoloration, gives, for instance: ; and , a multiset.

We consider Datalog programs with multiset predicates and multiset EDBs . A derivation tree (DT) for wrt. is a tree with labeled nodes and edges, as follows:

  1. For an EDB predicate and , a DT for contains a single node with label .

  2. For each rule of the form  ;   with ,  and each tuple of DTs for the atoms that unify with with mgu , generate a DT for with as the root label, as the children, and as the label for the edges from the root to the children. We assume that these children are arranged in the order of the corresponding body atoms in rule .

For a DT , we define as of the root when is a single-node tree, and the root-label of , otherwise. For a set of DTs :  , which is a multiset that multi-contains . Here, we write to denote the multi-union of (multi)sets that keeps all duplicates. If is the set of (syntactically different) DTs, the derivation-tree bag (DTB) semantics for is the multiset   . (Cf. [18] for an equivalent, provenance-based bag semantics for Datalog.)
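The counting behind the DTB semantics can be illustrated with a minimal Python sketch. It is not the construction from [21]: it only counts, for each ground atom, the number of syntactically different derivation trees up to a bounded depth, under the assumptions that rules are range-restricted, that EDB predicates do not occur in rule heads, and that variables are written with an initial uppercase letter. The program, the EDB, and all function names below are hypothetical.

```python
from collections import Counter

def is_var(term):
    """Convention of this sketch: variables start with an uppercase letter."""
    return term[:1].isupper()

def match(atom, fact, subst):
    """Extend subst so that atom equals the ground fact; None if impossible."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    subst = dict(subst)
    for a, f in zip(args, fargs):
        if is_var(a):
            if subst.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return subst

def count_derivation_trees(edb, rules, rounds):
    """Number of derivation trees with at most rounds + 1 levels, per ground atom."""
    counts = Counter(edb)  # single-node trees: one per colored EDB copy
    for _ in range(rounds):
        new = Counter(edb)
        for head, body in rules:
            def extend(i, subst, ways):
                # ways = number of child-tree tuples chosen so far
                if i == len(body):
                    pred, args = head
                    ground = tuple(subst[a] if is_var(a) else a for a in args)
                    new[(pred, ground)] += ways
                    return
                for fact, n in counts.items():
                    s = match(body[i], fact, subst)
                    if s is not None:
                        extend(i + 1, s, ways * n)
            extend(0, {}, 1)
        counts = new
    return counts

# Hypothetical multiset EDB: two copies of R(a), one copy of S(a).
edb = Counter({("R", ("a",)): 2, ("S", ("a",)): 1})
rules = [
    (("P", ("X",)), [("R", ("X",)), ("S", ("X",))]),  # P(X) :- R(X), S(X)
    (("P", ("X",)), [("R", ("X",))]),                 # P(X) :- R(X)
    (("Q", ("X",)), [("P", ("X",)), ("P", ("X",))]),  # Q(X) :- P(X), P(X)
]
counts = count_derivation_trees(edb, rules, rounds=2)
```

For this non-recursive program two rounds suffice. For recursive programs the counts may grow without bound as `rounds` increases, which reflects the possibility of infinite multiplicities.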




Figure 1: Three (out of four) derivation-trees for in Example 2.1.

Consider the program and multiset EDB , , where are defined as follows:
     ;  ;  

Here, . In total, we have 16 DTs:  (a) 6 single-node trees with labels in (b) 6 depth-two, linear trees (root to the left, i.e. rotated in ): . . . . . . (c) 4 depth-three trees for , three of which are displayed in Figure 1. The 16 different DTs in give rise to , , .

In [22], a bag semantics for Datalog was introduced via derivation-trees (DTs), extending the DTB semantics in [21] for (positive) Datalog. This extension applies to Datalog programs with stratified negation that are range-restricted and safe, i.e. a variable in a rule head or in a negative literal must appear in a positive literal in the body of the same rule. (The semantics in [22] does not consider duplicates in the EDB, but it is easy to extend the DTB semantics with multiset EDBs using colorings as above.)  If is a Datalog program with multiset predicates and multiset EDB , a derivation tree for wrt. is as for Datalog programs, but with condition 2. modified as follows:

  • Now let be a rule of the form ;   with and . Let the predicate of be of some stratum and let the predicates of be of some stratum . Assume that we have already computed all derivation trees for (atoms with) predicates up to stratum . Then, for each tuple of DTs for the atoms that unify with with mgu , such that there is no DT for any of the atoms , generate a DT for with as the root label and as the children, in this order. Furthermore, all edges from the root to its children are labelled with .

Analogously to the positive case, now for a range-restricted and safe program in Datalog and multiset EDB , we write to denote the derivation-tree based bag semantics.

(ex. 2.1 cont.) Consider now the EDB , (with one duplicate of removed from ), and modify to with  (i.e., now encodes multiset difference).  Then, predicates are on stratum 0 and is on stratum 1. The DTs for atoms with predicates from stratum 0 are as in Example 2.1 with two exceptions: there is now only one single-node DT for and only one DT for .

For ground atoms with predicate , we now only get two DTs producing . One of them is shown in Figure 2 on the left-hand side. The other DT of is obtained by replacing the leftmost leaf by . In particular, the DT of in Figure 2 shows that the "derivation" of succeeds, i.e., there is no DT for . The remaining two trees in Figure 2 establish that do not have a DT because does have a DT. In total, we have 12 different DTs in with .

Notice that the DT semantics interprets the difference in rule as "all or nothing": when computing , a single DT for "kills" all the DTs for  (cf. Section 6). For example, is not obtained despite the fact that we have two copies of and only one of , as the two trees on the right-hand side in Figure 2 show.
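The contrast between this "all or nothing" reading and a per-copy subtraction of multiplicities can be sketched on plain multisets. The following is a hedged Python illustration rather than part of the formal development: the function names and the sample multisets are hypothetical, and `Counter` subtraction is used as one possible alternative reading of multiset difference.

```python
from collections import Counter

def diff_all_or_nothing(left, right):
    """Keep every copy of a tuple only if the tuple does not occur in right at all."""
    return Counter({t: n for t, n in left.items() if n > 0 and right[t] == 0})

def diff_per_copy(left, right):
    """Subtract multiplicities copy by copy (Counter keeps only positive counts)."""
    return left - right

# Hypothetical multisets: two copies of "a" on the left, one copy on the right.
left = Counter({"a": 2, "b": 1})
right = Counter({"a": 1})

all_or_nothing = diff_all_or_nothing(left, right)
per_copy = diff_per_copy(left, right)
```

Under the first reading the single copy of `"a"` on the right removes both copies on the left. Under the second reading one copy of `"a"` survives.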




Figure 2: Derivation-trees for and in Example 2.1.

2.2 Warded Datalog±

Datalog± was introduced in [9] as an extension of Datalog, where the "+" stands for the new ingredients, namely: tuple-generating dependencies (tgds), written as existential rules of the form , with , and ; as well as equality-generating dependencies (egds) and negative constraints. In this work we ignore egds and constraints. The "−" in Datalog± stands for syntactic restrictions on rules to ensure decidability of CQ answering.

We consider three sets of term symbols: , , and containing constants, labelled nulls (also known as blank nodes in the semantic web context), and variables, respectively. Let denote an atom or a set of atoms. We write and to denote the set of variables and nulls, respectively, occurring in . In a DB, typically an EDB , all terms are from . In an instance, we also allow terms to be from . For a rule , denotes the set of atoms in its body, and , the atom in its head. A homomorphism from a set of atoms to a set of atoms is a partial function such that for all and for every atom .

We say that a rule is applicable to an instance if there exists a homomorphism from to . In this case, the result of applying to is an instance , where coincides with on and maps each existential variable in the head of to a fresh labelled null not occurring in . For such an application of to , we write . Such an application of to is called a chase step. The chase is an important tool in the presence of existential rules. A chase sequence for a DB and a Datalog program is a sequence of chase steps with , such that and for every (also denoted ). For the sake of readability, we sometimes only write the newly generated atoms of each chase step without repeating the old atoms. Also the subscript is omitted if it is clear from the context. A chase sequence then reads with .

The final atoms of all possible chase sequences for DB and form an instance referred to as , which can be infinite. We denote the result of all chase sequences up to length for some as . The chase variant assumed here is the so-called oblivious chase [8, 19], i.e., if a rule ever becomes applicable with some homomorphism , then contains exactly one atom of the form such that extends to the existential variables in the head of . Intuitively, each rule is applied exactly once for every applicable homomorphism.
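The oblivious chase described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: atoms are pairs of a predicate name and a tuple of terms, variables start with an uppercase letter, head variables that do not occur in the body are treated as existential, fresh labelled nulls are named `_n1`, `_n2`, and so on, and the number of rounds is bounded because the chase may be infinite. All names are hypothetical.

```python
from itertools import count

def is_var(term):
    """Convention of this sketch: variables start with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()

def homomorphisms(body, atoms, subst=None):
    """Yield every mapping of the body variables that sends all body atoms into atoms."""
    subst = subst or {}
    if not body:
        yield dict(subst)
        return
    (pred, args), rest = body[0], body[1:]
    for fpred, fargs in atoms:
        if fpred != pred or len(fargs) != len(args):
            continue
        s = dict(subst)
        if all((s.setdefault(a, f) == f) if is_var(a) else a == f
               for a, f in zip(args, fargs)):
            yield from homomorphisms(rest, atoms, s)

def oblivious_chase(db, rules, max_rounds):
    """Apply each rule exactly once per applicable homomorphism, up to max_rounds rounds."""
    instance = set(db)
    fired = set()      # (rule index, homomorphism) pairs that were already applied
    fresh = count(1)   # supply of fresh labelled nulls
    for _ in range(max_rounds):
        new_atoms = []
        for i, (head, body) in enumerate(rules):
            for h in homomorphisms(body, sorted(instance)):
                key = (i, tuple(sorted(h.items())))
                if key in fired:
                    continue
                fired.add(key)
                ext = dict(h)
                pred, args = head
                for v in args:
                    if is_var(v) and v not in ext:   # existential variable
                        ext[v] = f"_n{next(fresh)}"
                new_atoms.append((pred, tuple(ext.get(a, a) for a in args)))
        if not new_atoms:
            break
        instance.update(new_atoms)
    return instance

# Hypothetical example: exists Z. S(X, Z) :- R(X, Y)
db = {("R", ("a", "b")), ("R", ("a", "c"))}
rules = [(("S", ("X", "Z")), [("R", ("X", "Y"))])]
result = oblivious_chase(db, rules, max_rounds=5)
s_atoms = {atom for atom in result if atom[0] == "S"}
```

The two homomorphisms (one per `R` fact) each trigger the rule once, so two `S` atoms with different nulls are produced even though they agree on the first argument. This is the behaviour that the tid-based translation in Section 3 relies on.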

Consider a DB and a Datalog± program (the former the EDB for the latter). As a logical theory, may have multiple models, but the model turns out to be a correct representative for the class of models: for every BCQ , iff [14]. There are classes of Datalog± that, even with an infinite chase, allow for decidable or even tractable CQA in the size of the EDB. Much effort has been made in identifying and characterizing interesting syntactic classes of programs with this property (see [8] for an overview). In this direction, warded Datalog± was introduced in [3, 4, 17] as a particularly well-behaved fragment of Datalog±, for which CQA is tractable. We briefly recall and illustrate it here, for which we need some preliminary notions.

A position in Datalog program is affected if: (a) an existential variable appears in , or (b) there is such that a variable appears in in and all occurrences of in are in affected positions. and denote the sets of affected, resp. non-affected, positions of . Intuitively, contains all positions where the chase may possibly introduce a null.

Consider the following program:


By the first case, are affected. By the second case, are affected. Now that are affected, also is. We thus have , and .
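The two cases of the definition of affected positions amount to a least-fixpoint computation, which the following Python sketch makes explicit. It uses its own small, hypothetical program rather than the one of the example above, writes positions as pairs of a predicate name and a 1-based index, and treats head variables that are absent from the body as existential. These conventions are assumptions of the sketch.

```python
def is_var(term):
    """Convention of this sketch: variables start with an uppercase letter."""
    return term[:1].isupper()

def affected_positions(rules):
    """Least set of positions closed under cases (a) and (b) of the definition."""
    affected = set()
    # Case (a): an existential variable appears at the head position.
    for (hpred, hargs), body in rules:
        body_vars = {a for _, args in body for a in args if is_var(a)}
        for i, a in enumerate(hargs, start=1):
            if is_var(a) and a not in body_vars:
                affected.add((hpred, i))
    # Case (b): a head variable whose body occurrences are all at affected positions.
    changed = True
    while changed:
        changed = False
        for (hpred, hargs), body in rules:
            for i, a in enumerate(hargs, start=1):
                if not is_var(a) or (hpred, i) in affected:
                    continue
                occurrences = [(p, j) for p, args in body
                               for j, b in enumerate(args, start=1) if b == a]
                if occurrences and all(o in affected for o in occurrences):
                    affected.add((hpred, i))
                    changed = True
    return affected

# Hypothetical program:
#   exists Z. P(X, Z) :- Q(X)
#   R(Y) :- P(X, Y)
#   T(X) :- P(X, Y), Q(X)
rules = [
    (("P", ("X", "Z")), [("Q", ("X",))]),
    (("R", ("Y",)), [("P", ("X", "Y"))]),
    (("T", ("X",)), [("P", ("X", "Y")), ("Q", ("X",))]),
]
affected = affected_positions(rules)
```

Here `P[2]` is affected by case (a) and `R[1]` by case (b), while `T[1]` stays non-affected because `X` also occurs at the non-affected position `Q[1]`.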

For a rule , and a variable : (a) is harmless if it appears at least once in at a position in .   denotes the set of harmless variables in . Otherwise, a variable is called harmful. Intuitively, harmless variables will always be instantiated to constants in any chase step, whereas harmful variables may be instantiated to nulls. (b) is dangerous if and .   denotes the set of dangerous variables in  . These are the variables which may propagate nulls into the atom created by a chase step.

(ex. 2.2 cont.)   and are both harmful but only is dangerous for (1);   is harmless, is dangerous, is harmful but not dangerous for (2);   is harmless and is dangerous for (3).

Now, a rule is warded if either or there exists an atom , the ward, such that (1) and (2) . A program is warded if every rule is warded.

(ex. 2.2 cont.)  Rule (1) is trivially warded with the single body atom as the ward. Rule (2) is warded by : variable is the only dangerous variable and (the only variable shared by the ward with the rest of the body) is harmless. Actually, the other body atom contains the harmful variable ; but it is not dangerous and not shared with the ward. Finally, rule (3) is warded by ; the other atom contains no affected variable. Since all rules are warded, the program is warded.
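The classification of variables and the wardedness test for a single rule can be sketched as follows. The sketch takes the set of affected positions as a given input, uses hypothetical rules rather than rules (1) to (3) of the example, and reads the two ward conditions as: the ward contains all dangerous variables, and every variable it shares with the rest of the body is harmless.

```python
def is_var(term):
    """Convention of this sketch: variables start with an uppercase letter."""
    return term[:1].isupper()

def classify(rule, affected):
    """Return the (harmless, harmful, dangerous) variable sets of a rule."""
    (_, hargs), body = rule
    body_vars = {a for _, args in body for a in args if is_var(a)}
    harmless = {
        v for v in body_vars
        if any(b == v and (p, j) not in affected
               for p, args in body for j, b in enumerate(args, start=1))
    }
    harmful = body_vars - harmless
    dangerous = harmful & set(hargs)
    return harmless, harmful, dangerous

def is_warded(rule, affected):
    """A rule is warded if it has no dangerous variable or some body atom is a ward."""
    _, body = rule
    harmless, _, dangerous = classify(rule, affected)
    if not dangerous:
        return True
    for k, (_, args) in enumerate(body):
        ward_vars = {a for a in args if is_var(a)}
        rest_vars = {a for j, (_, other) in enumerate(body) if j != k
                     for a in other if is_var(a)}
        if dangerous <= ward_vars and (ward_vars & rest_vars) <= harmless:
            return True
    return False

# Affected positions are assumed to be given (hypothetical values).
affected = {("P", 2), ("R", 1), ("S", 1)}
rule_a = (("S", ("Y",)), [("P", ("X", "Y")), ("Q", ("X",))])  # S(Y) :- P(X, Y), Q(X)
rule_b = (("S", ("Y",)), [("P", ("X", "Y")), ("R", ("Y",))])  # S(Y) :- P(X, Y), R(Y)
```

In `rule_a` the atom `P(X, Y)` is a ward, since the only shared variable `X` is harmless. In `rule_b` the dangerous variable `Y` is shared between both body atoms, so neither atom qualifies as a ward.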

Datalog± can be extended with safe, stratified negation in the usual way, similarly to stratified Datalog [1]. The resulting Datalog can also be given a chase-based semantics [10]. The notions of affected/non-affected positions and harmless/harmful/dangerous variables carry over to a Datalog program by considering only the program obtained from by deleting all negated body atoms. For warded Datalog±, only a restricted form of stratified negation is allowed, so-called stratified ground negation. This means that we require for every rule : if contains a negated atom , then every must be either a constant (i.e., ) or a harmless variable. Hence, negated atoms can never contain a null in the chase. We write Datalog for programs in this language.

The class of warded Datalog± programs extends the class of Datalog programs. Warded Datalog± is expressive enough to capture query answering of a core fragment of SPARQL under certain OWL 2 QL entailment regimes [16], and this task can actually be accomplished in polynomial time (data complexity) [3, 4]. Hence, warded Datalog± constitutes a very good compromise between expressive power and complexity. Recently, a powerful query engine for warded Datalog± has been presented [6, 7], namely the VADALOG system.

3 Datalog±-Based Bag Semantics for Datalog

We now provide a set-semantics that represents the bag semantics for a Datalog program with a multiset EDB via the transformation into a Datalog program over a set EDB obtained from . For this, we assume w.l.o.g., that the set of nulls, , for a Datalog program is partitioned into two infinite ordered sets , for unique, global tuple identifiers (tids), and , for usual nulls in Datalog programs. Given a multiset EDB and a program , instead of using colors and syntactically different derivation trees, we will use elements of to identify both the elements of the EDB and the tuples resulting from applications of the chase.

For every predicate , we introduce a new version with an extra, first argument (its 0-th position) to accommodate a tid, which is a null from . If an atom appears in as duplicates, we create the tuples  , with the pairwise different nulls from as tids, and not used to identify any other element of . We obtain a set EDB from the multiset EDB . Given a rule in , we introduce tid-variables (i.e. appearing in the 0-th positions of predicates) and existential quantifiers in the rule head, to formally generate fresh tids when the rule applies. More precisely, a rule in of the form  , with , becomes the Datalog± rule  , with fresh, different variables . The resulting Datalog± program can be evaluated according to the usual set semantics on the set EDB via the chase: when the instantiated body , of rule becomes true, then the new tuple is created, with the first (new) null from that has not been used yet, i.e., the tid of the new atom.
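The two syntactic steps of this translation (attaching tids to the EDB copies and adding tid-variables to rules) can be sketched as follows. This is a hedged illustration: tids are represented by strings `t1`, `t2`, and so on instead of labelled nulls, the head tid-variable is named `Z` and the body tid-variables `Z1`, `Z2`, and so on, and it is assumed that these names do not clash with variables of the original rule. All function names are hypothetical.

```python
from collections import Counter
from itertools import count

def edb_with_tids(multiset_edb, fresh):
    """Turn a multiset EDB into a set EDB: each copy gets its own tid at position 0."""
    return {
        (pred, (f"t{next(fresh)}",) + args)
        for (pred, args), n in multiset_edb.items()
        for _ in range(n)
    }

def add_tid_variables(rule):
    """Add a tid-variable to every atom; the head variable Z is existential
    because it does not occur in the rewritten body."""
    (hpred, hargs), body = rule
    new_body = [(p, (f"Z{i}",) + args) for i, (p, args) in enumerate(body, start=1)]
    return ((hpred, ("Z",) + hargs), new_body)

# Hypothetical multiset EDB and rule P(X) :- R(X), S(X).
multiset_edb = Counter({("R", ("a",)): 2, ("S", ("a",)): 1})
set_edb = edb_with_tids(multiset_edb, count(1))
rule = (("P", ("X",)), [("R", ("X",)), ("S", ("X",))])
tid_rule = add_tid_variables(rule)
```

The rewritten rule reads `exists Z. P(Z, X) :- R(Z1, X), S(Z2, X)`. Evaluating it with an oblivious chase over `set_edb` would create one `P` atom, with its own tid, per pair of an `R` copy and an `S` copy.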

(ex. 2.1 cont.)  The EDB from Example 2.1 becomes

, ; (Footnote 1: Notice that this set version of can also be created by means of Datalog rules. For example, with the rule   for the EDB predicate .)

and program becomes with ;  ;  

The following is a 3-step chase sequence of and : .

Analogously to the depth-two and depth-three trees in Example 2.1, the chase produces 10 new atoms. In total, we get:   , .

In order to extend the PBB Semantics to Datalog, we have to extend our transformation of programs into to rules with negated atoms. Consider a rule of the form:


with, ; we transform it into the following two rules:


The introduction of auxiliary predicates is crucial since adding fresh variables directly to the negated atoms would yield negated atoms of the form in the rule body, which make the rule unsafe. The resulting Datalog program is from the desired class Datalog:

Let be a Datalog program and let be the transformation of into a Datalog program. Then, is a warded Datalog program.

Operation of Section 2 inspires de-identification and multiset merging operations. Sometimes we use double braces, , to emphasize that the operation produces a multiset.

For a set of tuples with tids, and , for de-identification and set-projection, respectively, are: (a) , a multiset;  and  (b) , a set.
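Both operations can be illustrated with a small Python sketch, assuming that atoms are pairs of a predicate name and a tuple of terms whose first component is the tid. The function names `de_identify` and `set_project` are hypothetical stand-ins for the two operations defined above.

```python
from collections import Counter

def de_identify(atoms_with_tids):
    """Drop the tid of every atom and keep duplicates: the result is a multiset."""
    return Counter((pred, args[1:]) for pred, args in atoms_with_tids)

def set_project(atoms_with_tids):
    """Drop the tid of every atom and collapse duplicates: the result is a set."""
    return {(pred, args[1:]) for pred, args in atoms_with_tids}

# Hypothetical set of atoms with tids t1, t2, t3.
atoms = {("P", ("t1", "a")), ("P", ("t2", "a")), ("Q", ("t3", "b"))}
bag = de_identify(atoms)
plain = set_project(atoms)
```

The two copies of `P(a)` are told apart only by their tids, so de-identification yields multiplicity 2 for `P(a)`, whereas set-projection keeps a single `P(a)`.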

Given a Datalog program and a multiset EDB , the program-based bag semantics (PBB semantics) assigns to the multiset:

The main results in this section are the correspondence of PBB semantics and DTB semantics and the relationship of both with classical set semantics of Datalog:

For a Datalog program with a multiset EDB ,   holds.

Proof Idea.

The theorem is proved by establishing a one-to-one correspondence between DTs in with a fixed root atom and (minimal) chase-derivations of from via . This proof proceeds by induction on the depth of the DTs and length of the chase sequences. ∎

Given a Datalog (resp. Datalog) program and a multiset EDB , the set is the minimal model (resp. the standard model) of the program .

4 Bag Semantics for Datalog±

In the previous section, we have seen that warded Datalog± (possibly extended with stratified ground negation) is well suited to capture the bag semantics of Datalog in terms of classical set semantics. We now want to go one step further and study the bag semantics for Datalog± itself. This question is natural given the results from [4], where it is shown that warded Datalog± captures a core fragment of SPARQL under certain OWL 2 QL entailment regimes, and given that the official W3C semantics of SPARQL is a bag semantics.

4.1 Extension of the DTB Semantics to Datalog±

The definition of a DT-based bag semantics for Datalog± is not as straightforward as for Datalog, since atoms in a model of for a (multiset) EDB and Datalog± program may have labelled nulls as arguments, which correspond to existentially quantified variables and may be arbitrarily chosen. Hence, when counting duplicates, it is not clear whether two atoms differing only in their choice of nulls should be treated as copies of each other or not. We therefore treat multiplicities of atoms analogously to multiple answers to single-atom queries, i.e., the multiplicity of an atom wrt. EDB and program corresponds to the multiplicity of answer to the query over the database and program . In other words, we ask for all instantiations of such that is true in every model of . It is well known that only ground instantiations (on ) can have this property (see e.g. [14]). Hence, below, we restrict ourselves to considering duplicates of ground atoms containing constants from only (in this section we are not using tid-positions ). In the rest of this section, unless otherwise stated, "ground atom" means instantiated on ; and programs belong to Datalog.

In order to define the multiplicity of a ground atom wrt. a (multiset) EDB and a warded Datalog± program , we adopt the notion of proof tree used in [4, 11], which generalizes the notion of derivation tree to Datalog±. We consider first positive Datalog± programs. A proof tree (PT) for an atom (possibly with nulls) wrt. (a set) EDB and Datalog± program is a node- and edge-labelled tree with labelling function , such that:

  1. The nodes are labelled by atoms over .

  2. The edges are labelled by rules from .

  3. The root is labelled by .

  4. The leaf nodes are labelled by atoms from .

  5. The edges from a node to its child nodes are all labelled with the same rule .

  6. The atom labelling corresponds to the result of a chase step where is instantiated to and becomes when instantiating the existential variables of with fresh nulls.

  7. If (resp. ) is the parent node of (resp. ) such that and share at least one null, then the entire subtrees rooted at and at must be isomorphic (i.e., the same tree structure and the same labels).

  8. If, for two nodes and , and share a null , then there exist ancestors of and of such that and are siblings corresponding to two body atoms and of rule with and for some variable and is applied with some substitution which sets ; moreover, occurs in the labels of the entire paths from to and from to .

A proof tree for Example 4.1 below is shown in Figure 3, left. As with derivation trees, we assume that siblings in the proof tree are arranged in the order of the corresponding body atoms in the rule labelling the edge to the parent node (cf. Section 2.1).

Intuitively, a PT is a tree-representation of the derivation of an atom by the chase. The parent/child relationship in a PT corresponds to the head/body of a rule application in the chase. Condition (7) above refers to the case that a non-ground atom is used in two different rule applications within the chase sequence. In this case, the two occurrences of this atom must have identical proof sub-trees. A PT can be obtained from a chase-derivation by reversing the edges and unfolding the chase graph into a tree by copying some of the nodes [4]. By definition of the chase in Section 2.2, it can never happen that the same null is created by two different chase steps. Note that the nulls in (and, likewise in ) are precisely the newly created ones. Hence, if and share such a null, then and are the same atom and the subtrees rooted at these nodes are obtained by unfolding the same subgraph of the chase graph. Condition (8) makes sure that we use the same null in a PT only if this is required by a join condition of some rule ; otherwise nulls are renamed apart.

Figure 3: Proof tree (left) and witness (right) for atom in Examples 4.1 and 5.

Let be the Datalog program with ;  ;  ;  ;  . This program belongs to Datalog, and – although not necessary to build a proof tree for it – we notice that it is also warded: ; and all other positions are not affected. Rule is warded with ward (where is the only dangerous variable in this rule). All other rules are trivially warded because they have no dangerous variables.

Now let . A possible proof tree for is shown in Figure 3 on the left. It is important to note that nodes and introducing labelled null are labelled with the same atom and give rise to identical subtrees. Of course, could also result from a chase step applying to . However, this would generate a null different from and, subsequently, the nulls in and (in and ) would be different, and rule could not be applied anymore. In contrast, the nodes and with label span different subtrees. This is desired: there are two possible derivations for each occurrence of atom in the PT.

Here we deviate slightly from the definition of PTs in [4], in that we allow the same ground atom to have different derivations. This is needed to detect duplicates and to make sure that PTs in fact constitute a generalization of the derivation trees in Section 2.1. Moreover, condition (8) is needed to avoid non-isomorphic PTs by “unforced” repetition of nulls (i.e., identical nulls that are not required by a join condition further up in the tree). Analogous to the generalization in Section 2.1 of DTs for Datalog to DTs for Datalog, it is easy to generalize proof trees to Datalog. Here it is important that we only allow stratified ground negation. Hence, analogously to DTs for Datalog, we allow negated ground atoms to occur as node labels of leaf nodes in a PT, provided that the positive atom has no PT. Moreover, it is no problem to allow also multiset EDBs since, as in Section 2.1, we can keep duplicates apart by means of a coloring function col.

Finally, we can define proof trees and as equivalent, denoted , if one is obtained from the other by renaming of nulls. We can thus normalize PTs by assuming that nulls in node labels are from some fixed set and that these nulls are introduced in the labels of the PT by some fixed-order traversal (e.g., top-down, left-to-right).
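The normalization step can be sketched as a canonical renaming of nulls in the order of their first occurrence during a top-down, left-to-right traversal. The encoding below is an assumption of the sketch: a tree node is a triple of its atom label, the rule labelling the edges to its children (`None` for leaves), and a tuple of children, and nulls are strings that start with an underscore. The function name is hypothetical.

```python
def normalize(tree):
    """Rename nulls to _1, _2, ... in order of first occurrence (preorder traversal)."""
    renaming = {}

    def visit(node):
        (pred, args), rule, children = node
        # The label of the node is renamed before its children are visited.
        new_args = tuple(
            renaming.setdefault(a, f"_{len(renaming) + 1}") if a.startswith("_") else a
            for a in args
        )
        return ((pred, new_args), rule, tuple(visit(child) for child in children))

    return visit(tree)

# Two hypothetical proof trees that differ only in the names of their nulls.
t1 = (("P", ("a", "_u")), "r1", ((("Q", ("_u", "_v")), None, ()),))
t2 = (("P", ("a", "_x")), "r1", ((("Q", ("_x", "_y")), None, ()),))
# A tree with a different sharing pattern of nulls.
t3 = (("P", ("a", "_u")), "r1", ((("Q", ("_v", "_u")), None, ()),))

equivalent = normalize(t1) == normalize(t2)
different = normalize(t1) != normalize(t3)
```

With this encoding, two proof trees are equivalent up to renaming of nulls exactly when their normalized forms are equal, so counting non-equivalent trees reduces to counting distinct normal forms.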

For a PT , we define as of the root when is a single-node tree, and the root-label of , otherwise. For a set of PTs :  , which is a multiset that multi-contains .  If is the set of normalized, non-equivalent PTs, the proof-tree bag (PTB) semantics for is the multiset . For a ground atom , denotes the multiplicity of in the multiset .

(ex. 4.1 cont.) To compute for and