
Declarative Statistical Modeling with Datalog

by   Vince Bárány, et al.

Formalisms for specifying statistical models, such as probabilistic-programming languages, typically consist of two components: a specification of a stochastic process (the prior), and a specification of observations that restrict the probability space to a conditional subspace (the posterior). Use cases of such formalisms include the development of algorithms in machine learning and artificial intelligence. We propose and investigate a declarative framework for specifying statistical models on top of a database, through an appropriate extension of Datalog. By virtue of extending Datalog, our framework offers a natural integration with the database, and has a robust declarative semantics. Our Datalog extension provides convenient mechanisms to include numerical probability functions; in particular, conclusions of rules may contain values drawn from such functions. The semantics of a program is a probability distribution over the possible outcomes of the input database with respect to the program; these outcomes are minimal solutions with respect to a related program with existentially quantified variables in conclusions. Observations are naturally incorporated by means of integrity constraints over the extensional and intensional relations. We focus on programs that use discrete numerical distributions, but even then the space of possible outcomes may be uncountable (as a solution can be infinite). We define a probability measure over possible outcomes by applying the known concept of cylinder sets to a probabilistic chase procedure. We show that the resulting semantics is robust under different chases. We also identify conditions guaranteeing that all possible outcomes are finite (and then the probability space is discrete). We argue that the framework we propose retains the purely declarative nature of Datalog, and allows for natural specifications of statistical models.





1 Introduction

Formalisms for specifying general statistical models are commonly used for developing machine learning and artificial intelligence algorithms for problems that involve inference under uncertainty. Substantial scientific effort has gone into developing such formalisms and corresponding system implementations. An intensively studied concept in that area is that of Probabilistic Programming (PP) [21], where the idea is that the programming language allows for building general random procedures, while the system executes the program not in the standard programming sense, but rather by means of inference. Hence, a PP system is built around a language and an inference engine (which is typically based on variants of Markov Chain Monte Carlo, most notably Metropolis-Hastings). An inference task is a probability-aware aggregate operation over all the possible worlds, such as finding the most likely possible world, or estimating the probability of an event (which is phrased over the outcome of the program). Recently, DARPA initiated the project of Probabilistic Programming for Advancing Machine Learning, aimed at advancing PP systems (with a focus on a specific collection of systems, e.g., [37, 29, 31]) towards facilitating the development of algorithms based on machine learning.

In probabilistic programming, a statistical model is typically phrased by means of two components. The first component is a generative process that produces a random possible world by straightforwardly following instructions with randomness, and in particular, sampling from common numerical probability functions; this gives the prior distribution. The second component allows one to phrase constraints that the relevant possible worlds should satisfy and, semantically, transforms the prior distribution into the posterior distribution, that is, the subspace conditional on the constraints.
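As a minimal illustration of these two components, the following Python sketch draws possible worlds from a toy prior and keeps only those that satisfy an observation (plain rejection sampling). The model (two fair coin flips), the function names, and the observation are hypothetical and chosen only for illustration; they are not part of the proposed framework.

```python
import random

def sample_prior(rng):
    """Generative component: produce one random possible world."""
    return {"a": rng.random() < 0.5, "b": rng.random() < 0.5}

def observation(world):
    """Constraint component: the worlds we condition on."""
    return world["a"] or world["b"]

def estimate_posterior(event, n=20000, seed=0):
    """Estimate P(event | observation) by rejection sampling."""
    rng = random.Random(seed)
    kept = [w for w in (sample_prior(rng) for _ in range(n)) if observation(w)]
    return sum(event(w) for w in kept) / len(kept)

# Exact value for this toy model: P(a and b | a or b) = (1/4) / (3/4) = 1/3.
p = estimate_posterior(lambda w: w["a"] and w["b"])
```

The estimate approaches 1/3 as the number of samples grows; real PP systems typically replace this naive loop with more efficient inference such as Metropolis-Hastings.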

As an example, in supervised text classification (e.g., spam detection) the goal is to classify a text document into one of several known classes (e.g., spam/non-spam). Training data consists of a collection of documents labeled with classes, and the goal of learning is to build a model for predicting the classes of unseen documents. One common approach to this task assumes a generative process that produces random parameters for every class, and then uses these parameters to define a generator of random words in documents of the corresponding class [32, 30]. So, the prior distribution generates parameters and documents for each class, and the posterior is defined by the actual documents of the training data. In unsupervised text classification the goal is to cluster a given set of documents, so that different clusters correspond to different topics (which are not known in advance). Latent Dirichlet Allocation [9] approaches this problem in a similar generative way as the above, with the addition that each document is associated with a distribution over topics.

While the agenda of probabilistic programming is the deployment of programming languages to developing statistical models, in this framework paper we explore this agenda from the point of view of database programming. Specifically, we propose and investigate an extension of Datalog for declarative specification of statistical models on top of a database. We believe that Datalog can be naturally extended to a language for building statistical models, since its essence is the production of new facts from known (database) facts. Of course, traditionally these facts are deterministic, and our extension enables the production of probabilistic facts that, in particular, involve numbers from available numerical distributions. And by virtue of extending Datalog, our framework offers a natural integration with the database, and has a robust declarative semantics: a program is a set of rules that is semantically invariant under transformations that retain logical equivalence. Moreover, the semantics of a program (i.e., the probability space it specifies) is fully determined by the satisfaction of rules, and does not depend on the specifics of any execution engine.

On par with languages for probabilistic programming, our proposed extension consists of two parts: a generative Datalog program that specifies a prior probability space over (finite or infinite) sets of facts that we call possible outcomes, and a definition of the posterior probability by means of observations, which come in the form of an ordinary logical constraint over the extensional and intensional relations.

The generative component of our Datalog extension provides convenient mechanisms to include conventional parameterized numerical probability functions (e.g., Poisson, geometric, etc.). Syntactically, this extension allows rules to sample values in their conclusions, according to specified parameterized distributions. As an example, consider the relation that represents clients of a service provider, along with their associated branch and average number of visits (say, per month). The following distributional rule models a random number of visits for that client in the branch.


Note, however, that a declarative interpretation of the above rule is not straightforward. Suppose that we have another rule of the following form:


Then, what would be the semantics if a person is both a client and a preferred client? Do we sample twice for that person? And what if the two s of the two facts are not the same? Is sampling according to one rule considered a satisfaction of the other rule? What if we have also the following rule:


From the viewpoint of Datalog syntax, Rule (3) is logically implied by Rule (1), since the premise of Rule (3) implies the premise of Rule (1). Hence, we would like the addition of Rule (3) to have no effect on the program. This means that some rule instantiations will not necessarily fire an actual sampling.

To make sense of rules such as the above, we associate with every program an auxiliary program , such that does not use distributions, but is rather an ordinary Datalog program where a rule can have an existentially quantified variable in the conclusion. Intuitively, in our example such a rule states that “if the premise holds, then there exists a fact where is associated with the distribution and the parameter .” In particular, if the program contains the aforementioned Rule (1), then Rule (3) has no effect; similarly, if the tuple is in both and , then in the presence of Rule (2) the outcome does not change if one of these tuples is removed.

In this paper we focus on numerical probability distributions that are discrete (e.g., the aforementioned ones). Our framework has a natural extension to continuous distributions (e.g., Gaussian or Pareto), but our analysis requires a nontrivial generalization that we defer to future work.

When applying the program to an input instance , the probability space is over all the minimal solutions of w.r.t. , such that all the numerical samples have a positive probability. To define the probabilities of a sample in this probability space, we consider two cases. In the case where all the possible outcomes are finite, we get a discrete probability distribution, and the probability of a possible outcome can be defined immediately from its content. But in general, a possible outcome can be infinite, and moreover, the set of all possible outcomes can be uncountable. Hence, in the general case we define a probability measure space. To make the case for the coherence of our definitions (i.e., our definitions yield proper probability spaces), we define a natural notion of a probabilistic chase where existential variables are produced by invoking the corresponding numerical distributions. We use cylinder sets [6] to define a measure space based on a chase, and prove that this definition is robust, since one establishes the same probability measure no matter which chase is used.

Related Work.     Our contribution is a marriage between probabilistic programming and the declarative specification of Datalog. The key features of our approach are the ability to express probabilistic models concisely and declaratively in a Datalog extension with probability distributions as first-class citizens. Existing formalisms that associate a probabilistic interpretation with logic are either not declarative (at least in the Datalog sense) or depart from the probabilistic programming paradigm (e.g., by lacking the support for numerical probability distributions). We next discuss representative related formalisms and contrast them with our work. They can be classified into three broad categories: (1) imperative specifications over logical structures, (2) logic over probabilistic databases, and (3) indirect specifications over the Herbrand base. (Some of these formalisms belong to more than one category.)

The first category includes imperative probabilistic programming languages [43], such as BLOG [31], that can express probability distributions over logical structures, via generative stochastic models that can draw values at random from numerical distributions, and condition values of program variables on observations. In contrast with closed-universe languages such as SQL and logic programs, BLOG considers open-universe probability models that allow for uncertainty about the existence and identity of objects. Instantiations of this category also do not focus on a declarative specification, and indeed, their semantics is dependent on their particular imperative implementations. P-log [7] is a Prolog-based language for specifying Bayesian networks. Although declarative in nature, the semantics inherently assumes a form of acyclicity that allows the rules to be executed serially. Here we are able to avoid such an assumption since our approach is based on the minimal solutions of an existential Datalog program.

The formalisms in the second category view the generative part of the specification of a statistical model as a two-step process. In the first step, facts are being randomly generated by a mechanism external to the program. In the second step, a logic program, such as Prolog [27] or Datalog [1], is evaluated over the resulting random structure. This approach has been taken by PRISM [40], the Independent Choice Logic [38], and to a large extent by probabilistic databases [41] and their semistructured counterparts [26]. The focus of our work, in contrast, is on a formalism that completely defines the statistical model, without referring to external processes.

One step beyond the second category and closer to our work is taken by uncertainty-aware query languages for probabilistic data such as TriQL [42], I-SQL, and world-set algebra [4, 5]. The latter two are natural analogs to SQL and relational algebra for the case of incomplete information and probabilistic data [4]. They feature constructs such as repair-key, choice-of, possible, certain, and group-worlds-by that can construct possible worlds representing all repairs of a relation with respect to (w.r.t.) key constraints, close the possible worlds by unioning or intersecting them, or group the worlds into sets with the same results to sub-queries. World-set algebra has been extended to (world-set) Datalog, fixpoint, and while-languages [14] to define Markov chains. While such languages cannot explicitly specify probability distributions, they may simulate a specific categorical distribution indirectly using non-trivial programs with specialized language constructs like repair-key on input tuples with weights representing samples from the distribution.

MCDB [25] and SimSQL [11] propose SQL extensions (with for-loops and probability distributions) coupled with Monte Carlo simulations and parallel database techniques for stochastic analytics in the database. In contrast, our work focuses on existential Datalog with recursion and probability spaces over the minimal solutions of the data w.r.t. the Datalog program.

Formalisms in the third category are indirect specifications of probability spaces over the Herbrand base, which is the set of all the facts that can be obtained using the predicate symbols and the constants of the database. This category includes Markov Logic Networks (MLNs) [15, 33], where the logical rules are used as a compact and intuitive way of defining factors. In other words, the probability of a possible world is the product of all the numbers (factors) that are associated with the rules that the world satisfies. This approach is applied in DeepDive [34], where a database is used for storing relational data and extracted text, and database queries are used for defining the factors of a factor graph. We view this approach as indirect since a rule does not determine directly the distribution of values. Moreover, the semantics of rules is such that the addition of a rule that is logically equivalent to (or implied by, or indeed equal to) an existing rule changes the semantics and thus the probability distribution. A similar approach is taken by Probabilistic Soft Logic [10], where in each possible world every fact is associated with a weight (degree of truth).

Further formalisms in this category are probabilistic Datalog [19], probabilistic Datalog+/- [22], and probabilistic logic programming (ProbLog) [27]. In these formalisms, every rule is associated with a probability. For ProbLog, the semantics is not declarative as the rules follow a certain evaluation order; for probabilistic Datalog, the semantics is purely declarative. Both semantics are different from ours and that of the other formalisms mentioned thus far. A Datalog rule is interpreted as a rule over a probability distribution over possible worlds, and it states that, for a given grounding of the rule, the marginal probability of being true is as stated in the rule. Probabilistic Datalog+/- uses MLNs as the underlying semantics. Besides our support for numerical probability distributions, our formalism is used for defining a single probability space, which is on par with the standard practice in probabilistic programming.

As said earlier, the programs in our proposed formalism allow for recursion. As we show in the paper, the semantics is captured by Markov chains that may be infinite. Related formalisms are those of the Probabilistic Context-Free Grammar (PCFG) and the more general Recursive Markov Chain (RMC) [17], where the probabilistic specification is by means of a finite set of transition graphs that can call one another (in the sense of method call) in a possibly recursive fashion. In database research, PCFGs and RMCs have been explored in the context of probabilistic XML [13, 8]. Although these formalisms do not involve numerical distributions, in future work we plan to conduct a study of the relative expressive power between them and restrictions of our framework. Moreover, we plan to study whether and how inference techniques on PCFGs and RMCs can be adapted to our framework.

Organization.     The remainder of the paper is organized as follows. In Section 2 we give basic definitions. The syntax and semantics of generative Datalog is introduced in Section 3, where we focus on the case where all solutions are finite. In Section 4 we present our adaptation of the chase. The general case of generative Datalog, where solutions can be infinite, is presented in Section 5. We complete our development in Section 6, where generative Datalog is extended with constraints (observations) to form Probabilistic-Programming Datalog (PPDL). Finally, we discuss extensions and future directions in Section 7 and conclude in Section 8.

2 Preliminaries

In this section we give some preliminary definitions that we will use throughout the paper.

Schemas and instances.     A (relational) schema is a collection of relation symbols, where each relation symbol is associated with an arity, denoted , which is a natural number. An attribute of a relation symbol is any number in . For simplicity, we consider here only databases over real numbers; our examples may involve strings, which we assume are translatable into real numbers. A fact over a schema is an expression of the form where is an -ary relation in and . An instance over is a finite set of facts over . We will denote by the set of all tuples such that is a fact of .

Datalog programs.     In this work we use Datalog with the option of having existential variables in the head [12]. Formally, an existential Datalog program, or just Datalog program for short, is a triple where: (1) is a schema, called the extensional database (EDB) schema, (2) is a schema, called the intensional database (IDB) schema, and is disjoint from , and (3) is a finite set of Datalog rules, i.e., first-order formulas of the form

where is a conjunction of atomic formulas over and is an atomic formula over , such that each variable in occurs in at least one atomic formula of . Here, by an atomic formula (or, atom) we mean an expression of the form where is an -ary relation and are either constants (i.e., real numbers) or variables. We usually omit the universal quantifiers for readability’s sake. Ordinary Datalog is the fragment of existential Datalog where the conclusion (left-hand side) of each rule is a single atomic formula without existential quantifiers.

Let be a Datalog program. An input instance for is an instance over . A solution of w.r.t.  is a possibly-infinite set of facts over , such that and satisfies all rules in (viewed as first-order sentences). A minimal solution of (w.r.t. ) is a solution of such that no proper subset of is a solution of . The set of all, finite and infinite, minimal solutions of w.r.t.  is denoted by , and the set of all finite minimal solutions is denoted by . It is a well known fact that, if is a Datalog program (that is, without existential quantifiers), then every input instance has a unique minimal solution, which is finite, and therefore .
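For programs without existential quantifiers, the unique minimal solution can be computed by naive bottom-up evaluation. The following Python sketch illustrates this under an encoding that is assumed here for illustration only: facts are (relation, tuple) pairs, atoms are (relation, terms) pairs, and variables are strings starting with an uppercase letter. The function name `naive_eval` and the edge/path example are hypothetical.

```python
def naive_eval(facts, rules):
    """Least fixpoint of Datalog rules without existential variables.

    facts: set of (relation, tuple) pairs.
    rules: list of (head, body); atoms are (relation, terms), and
           variables are strings starting with an uppercase letter.
    """
    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def match(atom, fact, env):
        # Try to extend the binding env so that atom equals fact.
        rel, terms = atom
        if rel != fact[0] or len(terms) != len(fact[1]):
            return None
        env = dict(env)
        for t, v in zip(terms, fact[1]):
            if is_var(t):
                if env.setdefault(t, v) != v:
                    return None
            elif t != v:
                return None
        return env

    def bindings(body, db):
        # All variable bindings that satisfy every atom of the body.
        envs = [{}]
        for atom in body:
            envs = [e2 for e in envs for f in db
                    if (e2 := match(atom, f, e)) is not None]
        return envs

    db = set(facts)
    while True:
        new = set()
        for (hrel, hterms), body in rules:
            for env in bindings(body, db):
                new.add((hrel, tuple(env[t] if is_var(t) else t for t in hterms)))
        if new <= db:          # nothing new is derivable: fixpoint reached
            return db
        db |= new

edges = {("edge", (1, 2)), ("edge", (2, 3))}
rules = [
    (("path", ("X", "Y")), [("edge", ("X", "Y"))]),
    (("path", ("X", "Z")), [("path", ("X", "Y")), ("edge", ("Y", "Z"))]),
]
solution = naive_eval(edges, rules)
```

The result contains the two input facts plus the three derived path facts, and no proper subset of it satisfies both rules.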

Probability spaces.     We separately consider discrete and continuous probability spaces. We initially focus on the discrete case; there, a probability space is a pair , where is a finite or countably infinite set, called the sample space, and is such that . If is a probability space, then is a probability distribution over . We say that is a numerical probability distribution if . In this work we focus on discrete numerical distributions.

A parameterized probability distribution is a function , such that is a probability distribution for all . We use to denote the number , called the parameter dimensionality of . For presentation’s sake, we may write instead of . Moreover, we denote the (non-parameterized) distribution by . Examples of (discrete) parameterized distributions follow.

  • : is , and for a parameter we have and .

  • : , and for a parameter we have .

  • : , and for a parameter we have .
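To make the notion of a parameterized discrete distribution concrete, the following Python sketch gives probability mass functions for the three families named in this paper (Flip, Poisson, geometric). The formulas are the standard textbook ones and are assumed, not confirmed, to match the definitions above; in particular, the geometric distribution is taken over {0, 1, 2, ...}, which is only one of two common conventions. The function names are hypothetical.

```python
import math

def flip(x, p):
    """Flip (Bernoulli): outcome 1 with probability p, 0 otherwise."""
    if x == 1:
        return p
    if x == 0:
        return 1.0 - p
    return 0.0

def poisson(x, lam):
    """Poisson with rate lam > 0 over the natural numbers."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def geometric(x, p):
    """Geometric over {0, 1, 2, ...}: x failures before the first success.
    (Assumed convention; some texts start the support at 1.)"""
    return (1.0 - p) ** x * p
```

Each function takes a sample value and the distribution parameters and returns the probability mass of that value, so a parameter dimensionality of one corresponds to the single extra argument.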

In Section 7 we will discuss the extension of our framework to models that have an unbounded number of parameters, and to continuous distributions.

3 Generative Datalog

A Datalog program without existential quantifiers specifies how to obtain a solution from an input EDB instance by producing the set of inferred IDB facts. In this section we present generative Datalog programs, which specify how to infer a distribution over possible outcomes given an input EDB instance.

3.1 Syntax

We first define the syntax of a generative Datalog program, which we call a GDatalog[] program.

Definition 3.1 (GDatalog[])

Let be a finite set of parameterized numerical distributions.

  1. A -term is a term of the form where is a parameterized distribution with , and are variables and/or constants.

  2. A -atom in a schema is an atomic formula with an -ary relation, such that exactly one term () is a -term, and all other are constants and/or variables.

  3. A GDatalog[] rule over a pair of disjoint schemas and is a first-order sentence of the form where is a conjunction of atoms in and is either an atom in or a -atom in .

  4. A GDatalog[] program is a triple , where and are disjoint schemas and is a finite set of GDatalog[] rules over and .

Example 3.2

Our example is based on the burglar example of Pearl [36] that has been frequently used for illustrating probabilistic programming (e.g., [35]). Consider the EDB schema consisting of the following relations: represents houses and their location cities , represents businesses and their location cities , represents cities and their associated burglary rates , and represents units (houses or businesses) where the alarm is on. Figure 3 shows an instance over this schema. Now consider the GDatalog[] program of Figure 1.

Figure 1: An example GDatalog[] program

Here, consists of only one distribution, namely . The first rule above, intuitively, states that, for every fact of the form , there must be a fact where is drawn from the Flip (Bernoulli) distribution with the parameter


3.2 Possible Outcomes

To define the possible outcomes of a GDatalog[] program, we associate to each GDatalog[] program a corresponding Datalog program . The possible outcomes of an input instance w.r.t.  will then be minimal solutions of w.r.t. . Next, we describe and .

The schema extends with the following additional relation symbols: whenever a rule in contains a -atom of the form , and is the argument position at which the -term in question occurs, then we add to a corresponding relation symbol , whose arity is . These relation symbols are called the distributional relation symbols of , and the other relation symbols of (namely, those of ) are referred to as the ordinary relation symbols. Intuitively, a fact in asserts the existence of a tuple in and a sequence of parameters, such that the th element of the tuple is sampled from using the parameters.

The set contains three kinds of rules:

  (i) All Datalog rules from that contain no -terms;

  (ii) The rule for every rule of the form in , where is the position of in ;

  (iii) The rule for every distributional relation symbol .

Note that in (ii), and are the terms that occur before and after the -term , respectively. A rule in (iii) states that every fact in should be reflected in the relation .
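The three kinds of rules can be sketched as a small rewriting procedure in Python. Everything about the encoding is an assumption made for illustration: a distribution term in a head is written as `("dist", name, params)`, positions are counted from zero, the distributional relation name `Rel_Dist_i` and the fresh variable `Y_new` are hypothetical naming choices, and the parameters are appended after the original arguments. The Client/Visits rule is a hypothetical rendering of the kind of rule discussed in the introduction.

```python
def is_dist_term(t):
    return isinstance(t, tuple) and len(t) == 3 and t[0] == "dist"

def translate(rules):
    """Rewrite generative rules into existential rules (a sketch).

    A rule is (head, body) with atoms (relation, terms); terms are tuples.
    Output rules are (existential_vars, head, body).
    """
    out, projections = [], {}
    for (rel, terms), body in rules:
        pos = [i for i, t in enumerate(terms) if is_dist_term(t)]
        if not pos:
            out.append(((), (rel, terms), body))       # first kind: unchanged
            continue
        i = pos[0]
        _, name, params = terms[i]
        drel = f"{rel}_{name}_{i}"                     # hypothetical naming scheme
        y = "Y_new"                                    # fresh existential variable
        sampled = terms[:i] + (y,) + terms[i + 1:]
        # second kind: assert the existence of a sampled value
        out.append(((y,), (drel, sampled + tuple(params)), body))
        # third kind: project the distributional relation back to the original
        xs = tuple(f"X{j}" for j in range(len(terms)))
        ps = tuple(f"P{j}" for j in range(len(params)))
        projections[drel] = ((), (rel, xs), [(drel, xs + ps)])
    return out + list(projections.values())

generative = [
    (("Visits", ("C", "B", ("dist", "Poisson", ("L",)))),
     [("Client", ("C", "B", "L"))]),
]
translated = translate(generative)
```

Keeping one projection rule per distributional relation symbol (rather than per source rule) mirrors the statement that the third kind of rule is added once for every distributional relation symbol.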

Example 3.3

The GDatalog[] program given in Example 3.2 gives rise to the corresponding Datalog program of Figure 2. As an example of (ii), rule 6 of Figure 1 is replaced with rule 6 of Figure 2. Rules 8–10 of Figure 2 are examples of (iii).

Figure 2: The Datalog program for the GDatalog[] program of Figure 1

A possible outcome is defined as follows.

Definition 3.4 (Possible Outcome)

Let be an input instance for a GDatalog[] program . A possible outcome for w.r.t.  is a minimal solution of w.r.t. , such that for every distributional fact with in the th position.

We denote the set of all possible outcomes of w.r.t.  by , and we denote the set of all finite possible outcomes by .

The following proposition provides an insight into the possible outcomes of an instance, and will reappear later on in our study of the chase. For any distributional relation , the functional dependency associated to is the functional dependency , expressing that the -th attribute is functionally determined by the rest.

Proposition 3.5

Let be any input instance for a GDatalog[] program . Then every possible outcome in satisfies all functional dependencies associated to distributional relations.

The proof of Proposition 3.5 is easy: if an instance violates the functional dependency associated to a distributional relation , then one of the two facts involved in the violation can be removed, showing that is, in fact, not a minimal solution w.r.t. .
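The functional dependency in Proposition 3.5 is easy to check mechanically. The Python sketch below assumes, for illustration only, that the facts of one distributional relation are given as tuples and that the position of the sampled value is known (counted from zero); the function name and the sample data are hypothetical.

```python
def violates_fd(facts, sampled_pos):
    """Return True if two facts agree everywhere except at sampled_pos.

    facts: iterable of tuples of a single distributional relation.
    """
    seen = {}
    for f in facts:
        key = f[:sampled_pos] + f[sampled_pos + 1:]   # all other attributes
        if seen.setdefault(key, f[sampled_pos]) != f[sampled_pos]:
            return True
    return False

# Two samples (3 and 4) for the same client, branch, and parameter:
bad = [("c1", "b1", 3, 2.5), ("c1", "b1", 4, 2.5)]
good = [("c1", "b1", 3, 2.5), ("c2", "b1", 4, 2.5)]
```

A set like `bad` cannot be part of a possible outcome: dropping either of its two facts still satisfies the existential rule, so the set is not minimal.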

Figure 3: EDB instance of the running example

3.3 Finiteness and Weak Acyclicity

Our presentation first focuses on the case where all the possible outcomes for the GDatalog[] program are finite. Before we proceed to defining the semantics of such a GDatalog[] program, we present the notion of weak acyclicity for a GDatalog[] program, as a natural syntactic property that guarantees finiteness of all possible outcomes. This draws on the notion of weak acyclicity for Datalog [18]. Consider any GDatalog[] program . A position of is a pair where and is an attribute of . The dependency graph of is the directed graph that has the attributes of as the nodes, and the following edges:

  • A normal edge whenever there is a rule and a variable at position in , and at position in .

  • A special edge whenever there is a rule of the form

    and an exported variable at position in . By an exported variable, we mean a variable that appears in both the premise and the conclusion.

We say that is weakly acyclic if no cycle in the dependency graph of contains a special edge.
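Given the dependency graph, weak acyclicity can be tested with a simple reachability check: a special edge from u to v lies on a cycle exactly when u is reachable from v. The Python sketch below assumes the graph has already been built and is passed as explicit node and edge lists; the function name and the example positions are hypothetical.

```python
def is_weakly_acyclic(nodes, normal_edges, special_edges):
    """True iff no cycle of the dependency graph contains a special edge."""
    succ = {n: set() for n in nodes}
    for u, v in list(normal_edges) + list(special_edges):
        succ[u].add(v)

    def reaches(src, dst):
        # Depth-first search over both kinds of edges.
        stack, seen = [src], set()
        while stack:
            n = stack.pop()
            if n == dst:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(succ[n])
        return False

    # A special edge u -> v lies on a cycle iff u is reachable from v.
    return not any(reaches(v, u) for u, v in special_edges)

positions = [("R", 1), ("S", 1), ("S", 2)]
ok = is_weakly_acyclic(positions,
                       normal_edges=[(("R", 1), ("S", 1))],
                       special_edges=[(("R", 1), ("S", 2))])
cyclic = is_weakly_acyclic(positions,
                           normal_edges=[(("S", 2), ("R", 1))],
                           special_edges=[(("R", 1), ("S", 2))])
```

In the second call the sampled position feeds back into the position it was generated from, so new values could be created forever; the check reports this as not weakly acyclic.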

Theorem 3.6

If a GDatalog[] program is weakly acyclic, then for all input instances .

Figure 4: A possible outcome for the input instance in the running example

3.4 Probabilistic Semantics

Intuitively, the semantics of a GDatalog[] program is a function that maps every input instance to a probability distribution over . We now make this precise. Let be a GDatalog[] program, let be an input for . Again, we first consider the case where an input instance only has finite possible outcomes (i.e., ). Observe that, when all possible outcomes of are finite, the set is countable, since we assume that all of our numerical distributions are discrete. In this case, we can define a discrete probability distribution over the possible outcomes of w.r.t. . We denote this probability distribution by .

For a distributional fact , we define the weight of (notation: ) to be . For an ordinary (non-distributional) fact , we set . For a finite set of facts, we denote by the product of the weights of all the facts in .

The probability assigned to a possible outcome , denoted , is simply . If a possible outcome does not contain any distributional facts, then by definition.
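The probability of a finite possible outcome can be computed directly from its distributional facts, as the following Python sketch shows. The encoding of a distributional fact as (distribution name, sampled value, parameters) and the lookup table of probability mass functions are assumptions made for illustration.

```python
import math

def flip(x, p):
    """Flip (Bernoulli) probability mass function."""
    return p if x == 1 else 1.0 - p

def outcome_probability(distributional_facts, pmfs):
    """Product of the weights of all distributional facts.

    Each fact is (distribution_name, sampled_value, params); ordinary
    facts have weight 1 and are therefore simply left out.
    """
    return math.prod(pmfs[name](value, *params)
                     for name, value, params in distributional_facts)

facts = [("Flip", 1, (0.5,)), ("Flip", 0, (0.25,))]
prob = outcome_probability(facts, {"Flip": flip})   # 0.5 * 0.75
```

An outcome with no distributional facts yields the empty product, which is 1.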

Example 3.7

(continued) Let be the instance that consists of all of the relations in Figures 3 and 4. Then is a possible outcome of w.r.t. . For convenience, in the case of distributional relation symbols, we have added the weight of each fact to the corresponding row as the rightmost attribute. This weight is not part of our model (since it can be inferred from the rest of the attributes). For presentation’s sake, the sampled values are under the attribute name (while attribute names are again external to our formal model). is the product of all of the numbers in the columns titled “,” that is, .

The following theorem states that is indeed a probability space over all the possible outcomes.

Theorem 3.8

Let be a GDatalog[] program, and an input instance for , such that . Then is a discrete probability function over .

We prove Theorem 3.8 in Section 4. In Section 5 we consider the general case, and in particular the generalization of Theorem 3.8, where not all possible outcomes are guaranteed to be finite. There, if one considers only the (countable set of all) finite possible outcomes, then the sum of probabilities is not necessarily one. But still:

Theorem 3.9

Let be a GDatalog[] program, and an input for . Then .

We conclude this section with some comments. First, we note that the restriction of a conclusion of a rule to include a single -term significantly simplifies the presentation, but does not reduce the expressive power. In particular, we could simulate multiple -terms in the conclusion using a collection of predicates and rules. For example, if one wishes to have a conclusion where a person gets both a random height and a random weight (possibly with shared parameters), then one can do so by deriving and separately, and using the rule . We also highlight the fact that our framework can easily simulate the probabilistic database model of independent tuples [41] with probabilities mentioned in the database, using the distribution, as follows. Suppose that we have the EDB relation where represents the probability of every tuple. Then we can obtain the corresponding probabilistic relation using the rules and . Finally, we note that a disjunctive Datalog rule [16], where the conclusion can be a disjunction of atoms, can be simulated by our model (with probabilities ignored): If the conclusion has disjuncts, then we construct a distributional rule with a probability distribution over , and additional deterministic rules corresponding to the atoms.
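The simulation of tuple-independent probabilistic databases can be mimicked in a few lines of Python. This is only a sketch of the idea: one coin is flipped per tuple using that tuple's probability, and the tuple is kept exactly when the flip yields 1. The function name and the input encoding as (tuple, probability) pairs are hypothetical.

```python
import random

def sample_tuple_independent(weighted_tuples, rng):
    """Sample one possible world of a tuple-independent relation.

    weighted_tuples: iterable of (tuple, probability) pairs.
    Mirrors the two-rule encoding: flip a coin per tuple, keep it on 1.
    """
    flips = {t: (1 if rng.random() < p else 0) for t, p in weighted_tuples}
    return {t for t, bit in flips.items() if bit == 1}

world = sample_tuple_independent(
    [(("alice",), 1.0), (("bob",), 0.0), (("carol",), 0.5)],
    random.Random(0),
)
```

Tuples with probability 1 always appear and tuples with probability 0 never do; all other tuples appear independently of one another.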

4 Chasing Generative Programs

The chase [28, 3] is a classic technique used for reasoning about tuple-generating dependencies and equality-generating dependencies. In the special case of full tuple-generating dependencies, which are syntactically isomorphic to Datalog rules, the chase is closely related to (a tuple-at-a-time version of) the naive bottom-up evaluation strategy for Datalog program (cf. [2]). In this section, we present a suitable variant of the chase for generative Datalog programs, and analyze some of its properties. The goal of that is twofold. First, as we will show, the chase provides an intuitive executional counterpart of the declarative semantics in Section 3. Second, we use the chase to prove Theorems 3.8 and 3.9.

We note that, although the notions and results could arguably be phrased in terms of a probabilistic extension of the bottom-up Datalog evaluation strategy, the fact that a GDatalog[] rule can create new values makes it more convenient to phrase them in terms of a suitable adaptation of the chase procedure.

To simplify the notation in this section, we fix a GDatalog[] program . Let be the associated Datalog program.

We define the notions of chase step and chase tree.

Chase step.     Consider an instance , a rule of the form , and a tuple such that is satisfied in but is not satisfied in . If is a distributional atom of the form , then being “not satisfied” is interpreted in the logical sense (regardless of probabilities): there is no such that the tuple is in . In that case, let be the set of all instances obtained by extending with for a specific value of the existential variable , such that . Furthermore, let be the discrete probability distribution over that assigns to the probability mass . If is an ordinary atom without existential quantifiers, is simply defined as , where extends with the facts in , and . Then, we say that

is a valid chase step.
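A minimal Python sketch of a chase step for a distributional head follows. It rests on assumptions that are not in the text: facts are tuples whose first component is the relation name, the sampled value is stored as the last component, the distribution has finite support given as a dictionary, and only values of positive probability produce children. The function name is hypothetical.

```python
from fractions import Fraction

def distributional_chase_step(instance, relation, args, pmf):
    """Return {extended instance: probability} for one distributional rule firing.

    instance: frozenset of facts; (relation, *args) is the head without its
    sampled value; pmf maps each candidate value to its probability.
    """
    args = tuple(args)
    # "Not satisfied" is read logically: no fact relation(args, y) exists for any y.
    if any(f[0] == relation and f[1:-1] == args for f in instance):
        raise ValueError("head already satisfied; no chase step applies")
    # One child instance per value with positive probability (assumed reading).
    return {
        instance | {(relation, *args, value)}: prob
        for value, prob in pmf.items()
        if prob > 0
    }

start = frozenset({("Person", "ann")})
step = distributional_chase_step(
    start, "Height", ("ann",), {170: Fraction(1, 2), 180: Fraction(1, 2)}
)
```

Each key of `step` extends the instance with exactly one new fact, and the values form a probability distribution over those extensions.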

Chase tree.     Let be an input instance for . A chase tree for w.r.t.  is a possibly infinite tree, whose nodes are labeled by instances over and where each edge is labeled by a real number such that

  1. The root is labeled by ;

  2. For each non-leaf node labeled , if is the set of labels of the children of the node, and if is the map assigning to each the label of the edge from to , then is a valid chase step for some rule and tuple ;

  3. For each leaf node labeled , there does not exist a valid chase step of the form . In other words, the tree cannot be extended to a larger chase tree.

We denote by the label (instance) of the node . Each instance of a node of a chase tree is said to be an intermediate instance w.r.t. that chase tree. A chase tree is said to be injective if no intermediate instance is the label of more than one node; that is, for we have . As we will see shortly, due to the specific construction of , every chase tree turns out to be injective.
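The definition of a chase tree can be sketched in Python for a toy program whose tree is finite. Everything named here is an assumption made for illustration: the program (every `Person` receives one `Height` drawn uniformly from two values), the tuple encoding of facts, and the choice to always fire the first applicable rule, which yields one particular chase tree among several.

```python
from fractions import Fraction

def next_step(instance):
    """Hypothetical program: each Person gets one Height from {170, 180}.

    Returns a chase step {child instance: edge label}, or None at a leaf.
    """
    for person in sorted(f[1] for f in instance if f[0] == "Person"):
        if not any(f[0] == "Height" and f[1] == person for f in instance):
            return {
                instance | {("Height", person, h)}: Fraction(1, 2)
                for h in (170, 180)
            }
    return None

def build_chase_tree(root):
    """Return {node label: {child label: edge label}}; leaves map to {}."""
    tree, frontier = {}, [root]
    while frontier:
        instance = frontier.pop()
        # Keying nodes by their label presupposes injectivity, so check it.
        if instance in tree:
            raise AssertionError("two nodes share a label: tree is not injective")
        tree[instance] = next_step(instance) or {}
        frontier.extend(tree[instance])
    return tree

root = frozenset({("Person", "ann"), ("Person", "bob")})
tree = build_chase_tree(root)
leaves = [node for node, kids in tree.items() if not kids]
```

For this input the tree has seven nodes and four leaves, and the edge labels leaving every non-leaf node sum to one.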

Properties of the chase. We now state some properties of our chase procedure.

Proposition 4.1

Let be any input instance, and consider any chase tree for w.r.t. . Then every intermediate instance satisfies all functional dependencies associated to distributional relations.

Proposition 4.2

Every chase tree w.r.t.  is injective.

We denote by the set of leaves of a chase tree , and we denote by the set .

Theorem 4.3

Let be a chase tree for an input instance w.r.t. . The following hold.

  1. Every intermediate instance is a subset of some possible outcome in .

  2. If does not have infinite directed paths, then .

This theorem is a special case of a more general result, Theorem 5.3, which we prove later.

4.1 Proof of Theorems 3.8 and 3.9

By construction, for every node of a chase tree , the weights of the edges that emanate from the node in question sum up to one. We can associate to each intermediate instance a weight, namely the product of the edge labels on the path from the root to . This weight is well defined, since is injective. We can then consider a random walk over the tree, where the probabilities are given by the edge labels. Then, for a node , the weight of is equal to the probability of visiting in this random walk. From Theorem 4.3 we conclude that, if all the possible outcomes are finite, then does not have any infinite paths, and moreover, the random walk defines a probability distribution over the labels of the leaves, which are the possible outcomes. This is precisely the probability distribution of Theorem 3.8. Moreover, in the general case, is the probability that the random walk terminates (at a leaf), and hence, Theorem 3.9 follows from the fact that this probability (as is any probability) is a number between zero and one.
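The argument can be made concrete with a short Python sketch that sums the weights of the leaves reachable within a bounded depth. The tree used here is a hypothetical one, chosen only because it has an infinite path: at every level the walk stops with probability 1/2 or continues with probability 1/2. The names `leaf_mass` and `children` are illustrative.

```python
from fractions import Fraction

def children(node):
    """Hypothetical tree: ("go", k) has a leaf child and a continuing child."""
    kind, k = node
    if kind == "stop":
        return []
    return [(("stop", k), Fraction(1, 2)), (("go", k + 1), Fraction(1, 2))]

def leaf_mass(root, max_depth):
    """Total weight of the leaves at depth <= max_depth.

    The weight of a node is the product of the edge labels from the root.
    """
    total = Fraction(0)
    frontier = [(root, Fraction(1), 0)]
    while frontier:
        node, weight, depth = frontier.pop()
        kids = children(node)
        if not kids:
            total += weight  # the walk terminates here
        elif depth < max_depth:
            frontier.extend((c, weight * p, depth + 1) for c, p in kids)
    return total

mass_after_10 = leaf_mass(("go", 0), 10)
```

The accumulated leaf weight grows with the depth bound but never exceeds one, in line with the bound used for Theorem 3.9.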

5 Infinite Possible Outcomes

In the general case of a GDatalog[] program, possible outcomes may be infinite, and moreover, the space of possible outcomes may be uncountable.

Example 5.1

We now discuss examples that show what would happen if we straightforwardly extended our current definition of the probability of possible outcomes to infinite possible outcomes (where, in the case where is infinite, would be the limit of an infinite product of weights).

Consider the GDatalog[] program defined by the rule where is a probability distribution with one parameter and such that is equal to if and otherwise. Then, has no finite possible outcome. In fact, has exactly one infinite possible outcome: .

Now consider the previous program extended with the rule , and consider the input instance . Then, has one finite possible outcome with , and another infinite possible outcome with .

Next, consider the GDatalog[] program defined by , where is a probability distribution with one parameter , and is equal to if and otherwise. Then, for , every possible outcome is infinite, and would have the probability 0.

Now consider the previous program extended with the rule , and consider again the input instance . Then would have exactly one possible outcome with , namely where .

5.1 Generalization of Probabilistic Semantics

To generalize our framework, we need to consider probability spaces over uncountable domains; those are defined by means of measure spaces, which are defined as follows.

Let be a set. A -algebra over is a collection of subsets of , such that contains and is closed under complement and countable unions. (Implied properties include that contains the empty set, and that is closed under countable intersections.) If is a nonempty collection of subsets of , then the closure of under complement and countable unions is a -algebra, and it is said to be generated by .
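On a finite set, countable unions reduce to finite unions, so the generated -algebra can be computed directly. The following Python sketch does this by closing a family of subsets under complement and pairwise union until nothing new appears; the function name and the example sets are hypothetical.

```python
from itertools import combinations

def generated_sigma_algebra(omega, generators):
    """Close the generators under complement and union over a finite omega."""
    omega = frozenset(omega)
    family = {omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(family)
        for a in current:
            complement = omega - a
            if complement not in family:
                family.add(complement)
                changed = True
        for a, b in combinations(current, 2):
            union = a | b
            if union not in family:
                family.add(union)
                changed = True
    return family

algebra = generated_sigma_algebra({1, 2, 3, 4}, [{1}, {2}])
```

Here the generators split the set into the three blocks {1}, {2} and {3, 4}, so the result has eight members, including the empty set, and is also closed under intersection.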

A probability measure space is a triple , where: (1) is a set, called the sample space, (2) is a -algebra over , and (3) , called a probability measure, is such that , and for every countable set of pairwise-disjoint measurable sets.

Let be a GDatalog[] program, and let be an input for . We say that a sequence of facts is a derivation (w.r.t. ) if for all , the fact is the result of applying some rule of that is not satisfied in (in the case of applying a rule with a -atom in the head, choosing a value randomly). If is a derivation, then the set is a derivation set. Hence, a finite set of facts is a derivation set if and only if is an intermediate instance in some chase tree.

Let be a GDatalog[] program, let be an input for , and let be a set of facts. We denote by the set of all the possible outcomes such that . The following theorem states how we determine the measure space defined by a GDatalog[] program.

Theorem 5.2

Let be a GDatalog[] program, and let be an input for . There exists a unique probability measure space , denoted , that satisfies all of the following.

  1. ;

  2. The -algebra is generated from the sets of the form where is finite;

  3. for every derivation set .

Moreover, if is a finite possible outcome, then is equal to .

Observe that items (i) and (ii) of Theorem 5.2 describe the unique properties of the probability measure space. The proof will be given in the next section. The last part of the theorem states that our discrete and continuous probability definitions coincide on finite possible outcomes; this is a simple consequence of item (iii), since for a finite possible outcome , the set is such that , and is itself a derivation set (e.g., due to Theorem 4.3).

5.2 Measure Spaces by Infinite Chase

We prove Theorem 5.2 by defining and investigating measure spaces that are defined in terms of the chase. Consider a GDatalog[] program and an input for . A maximal path of a chase tree is a path that starts with the root, and either ends in a leaf or is infinite. Observe that the labels (instances) along a maximal path form a chain (w.r.t. the set-containment partial order). A maximal path of a chase tree is fair if whenever the premise of a rule is satisfied by some tuple in some intermediate instance on , then the conclusion of the rule is satisfied for the same tuple in some intermediate instance on . A chase tree is fair (or has the fairness property) if every maximal path is fair. Note that every finite chase tree is fair. We will restrict attention to fair chase trees. Fairness is a classic notion in the study of infinite computations; moreover, fair chase trees can easily be constructed, for example, by maintaining a queue of “active rule firings” (cf. any textbook on term rewriting systems or lambda calculus).
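The queue-based construction can be sketched in Python for a single path of a chase tree. The program below is hypothetical and chosen because it never terminates: every `Nat(n)` derives `Nat(n+1)`, and every `Nat(n)` also receives one coin value `Coin(n, b)`. Always firing the successor rule would starve the coin rule; a first-in, first-out queue of firings avoids that. All names and the fact encoding are assumptions made for illustration.

```python
import random
from collections import deque

def active_firings(instance):
    """Firings whose premise holds and whose conclusion is not yet satisfied."""
    firings = []
    for fact in sorted(instance):
        if fact[0] == "Nat":
            n = fact[1]
            if ("Nat", n + 1) not in instance:
                firings.append(("succ", n))
            if not any(f[0] == "Coin" and f[1] == n for f in instance):
                firings.append(("coin", n))
    return firings

def fire(firing, rng):
    """Return the fact produced by a firing, sampling a bit for the coin rule."""
    rule, n = firing
    if rule == "succ":
        return ("Nat", n + 1)
    return ("Coin", n, rng.randint(0, 1))

def fair_chase_path(start, max_steps, rng):
    """Follow one path of a chase tree, serving firings in FIFO order."""
    instance = set(start)
    queue, queued = deque(), set()
    path = [frozenset(instance)]
    for _ in range(max_steps):
        active = active_firings(instance)
        for firing in active:
            if firing not in queued:
                queued.add(firing)
                queue.append(firing)
        # Drop firings that were satisfied by other means in the meantime.
        while queue and queue[0] not in active:
            queue.popleft()
        if not queue:
            break  # no valid chase step: a leaf was reached
        instance.add(fire(queue.popleft(), rng))
        path.append(frozenset(instance))
    return path

path = fair_chase_path({("Nat", 0)}, 6, random.Random(0))
```

After six steps the path has fired both rules for the first few numbers, so no firing waits forever, which is the property that fairness asks of every maximal path.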

Let be a GDatalog[] program, let be an input for , and let be a chase tree. We denote by the set of all the maximal paths of . (Note that may be uncountably infinite.) For , we denote by the union of the (chain of) labels along . The following generalizes Theorem 4.3.

Theorem 5.3

Let be a GDatalog[] program, an input for , and a fair chase tree. The mapping is a bijection between and .

5.3 Chase Measures

Let be a GDatalog[] program, let be an input for , and let be a chase tree. Our goal is to define a probability measure over . Given Theorem 5.3, we can do that by defining a probability measure over . A random path in can be viewed as a Markov chain that is defined by a random walk over , starting from the root. A measure space for such a Markov chain is defined by means of cylinderification [6]. Let be a node of . The -cylinder of , denoted , is the subset of that consists of all the maximal paths that contain . A cylinder of is a subset of that forms a -cylinder for some node . We denote by the set of all the cylinders of .

Recall that is a finite set of facts, and observe that is the product of the weights along the path from the root to . The following theorem is a special case of a classic result on Markov chains (cf. [6]).

Theorem 5.4

Let be a GDatalog[] program, let be an input for , and let be a chase tree. There exists a unique probability measure that satisfies all of the following.

  1. .

  2. is the -algebra generated from .

  3. for all nodes of .
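The measure of a cylinder can be computed as a product of edge weights, as the following Python sketch shows. It assumes a hypothetical binary tree in which every node has a child reached with probability 1/3 (choice 1) and a child reached with probability 2/3 (choice 0), and it identifies a node with the tuple of choices on its path from the root.

```python
from fractions import Fraction

P_ONE = Fraction(1, 3)  # hypothetical probability of taking choice 1 at any node

def cylinder_measure(node):
    """Measure of the set of maximal paths through node.

    node is a tuple of 0/1 choices; the measure is the product of the
    edge weights along the path from the root to node.
    """
    measure = Fraction(1)
    for choice in node:
        measure *= P_ONE if choice == 1 else 1 - P_ONE
    return measure

parent = (1, 0)
split = cylinder_measure(parent + (0,)) + cylinder_measure(parent + (1,))
```

The cylinders of the two children partition the cylinder of their parent, and their measures add up to the measure of the parent. Along the all-zero path the cylinder measures are powers of 2/3 and tend to zero, so a single infinite path can carry measure zero even though every finite prefix has positive measure.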

Theorems 5.3 and 5.4 suggest the following definition.

Definition 5.5 (Chase Probability Measure)

Let be a GDatalog[] program, let be an input for , let be a chase tree, and let be the probability measure of Theorem 5.4. The probability measure over is the one obtained from by replacing every maximal path with the possible outcome .

Next, we prove that the probability measure space represented by a chase tree is independent of the specific chase tree of choice. For that, we need some notation and a lemma. Let be a GDatalog[] program, let be an input for , let be a chase tree, and let be a node of . We denote by the set . The following lemma is a consequence of Proposition 4.1 and Theorem 5.3.

Lemma 5.6

Let be a GDatalog[] program, let be an input for , and let be a fair chase tree. Let be a node of and . Then ; that is, is the set .

Using Lemma 5.6 we can prove the following theorem.

Theorem 5.7

Let be a GDatalog[] program, let be an input for , and let and be two fair chase trees. Then .

5.4 Proof of Theorem 5.2

We can now prove Theorem 5.2. Let be a GDatalog[] program, let be an input for , and let be a fair chase tree for w.r.t. . Let be the probability measure on associated to , as defined in Definition 5.5.

Lemma 5.8

The -algebra is generated by the sets of the form , where is finite.

Let be the -algebra generated from the sets . We will show that every is in , and that every is in . The second claim is due to Lemma 5.6, so we will prove the first. So, let be given. Due to Lemma 5.6, the set is the countable union where is the set of all the nodes such that . Hence, .

Lemma 5.9

For every derivation set we have .

Let be a derivation set. Due to Theorem 5.7, it suffices to prove that for some chase tree it is the case that . But since is a derivation set, we can craft a chase tree that has a node with . Then we have that is the product of the weights along the path to , which is exactly .

Lemma 5.10

Let be any probability space that satisfies (i)–(iii) of Theorem 5.2. Then .

Let . Due to Lemma 5.8, we have that . So it is left to prove that . Due to Lemmas 5.9 and 5.6, we have that agrees with