Inference in Probabilistic Logic Programs with Continuous Random Variables

12/12/2011 ∙ by Muhammad Asiful Islam, et al. ∙ Stony Brook University

Probabilistic Logic Programming (PLP), exemplified by Sato and Kameya's PRISM, Poole's ICL, Raedt et al's ProbLog and Vennekens et al's LPAD, is aimed at combining statistical and logical knowledge representation and inference. A key characteristic of PLP frameworks is that they are conservative extensions to non-probabilistic logic programs which have been widely used for knowledge representation. PLP frameworks extend traditional logic programming semantics to a distribution semantics, where the semantics of a probabilistic logic program is given in terms of a distribution over possible models of the program. However, the inference techniques used in these works rely on enumerating sets of explanations for a query answer. Consequently, these languages permit very limited use of random variables with continuous distributions. In this paper, we present a symbolic inference procedure that uses constraints and represents sets of explanations without enumeration. This permits us to reason over PLPs with Gaussian or Gamma-distributed random variables (in addition to discrete-valued random variables) and linear equality constraints over reals. We develop the inference procedure in the context of PRISM; however the procedure's core ideas can be easily applied to other PLP languages as well. An interesting aspect of our inference procedure is that PRISM's query evaluation process becomes a special case in the absence of any continuous random variables in the program. The symbolic inference procedure enables us to reason over complex probabilistic models such as Kalman filters and a large subclass of Hybrid Bayesian networks that were hitherto not possible in PLP frameworks. (To appear in Theory and Practice of Logic Programming).


1 Introduction

Logic Programming (LP) is a well-established language model for knowledge representation based on first-order logic. Probabilistic Logic Programming (PLP) is a class of Statistical Relational Learning (SRL) frameworks [srlbook] which are designed for combining statistical and logical knowledge representation.

The semantics of PLP languages is defined based on the semantics of the underlying non-probabilistic logic programs. A large class of PLP languages, including ICL [PooleICL], PRISM [sato-kameya-prism], ProbLog [deRaedt] and LPAD [lpad], have a declarative distribution semantics, which defines a probability distribution over possible models of the program. Operationally, the combined statistical/logical inference is performed based on the proof structures analogous to those created by purely logical inference. In particular, inference proceeds as in traditional LPs except when a random variable’s valuation is used. Use of a random variable creates a branch in the proof structure, one branch for each valuation of the variable. Each proof for an answer is associated with a probability based on the random variables used in the proof and their distributions; an answer’s probability is determined by the probability that at least one proof holds. Since the inference is based on enumerating the proofs/explanations for answers, these languages have limited support for continuous random variables. We address this problem in this paper. A comparison of our work with recent efforts at extending other SRL frameworks to continuous variables appears in Section 2.

We provide an inference procedure to reason over PLPs with Gaussian or Gamma-distributed random variables (in addition to discrete-valued ones), and linear equality constraints over values of these continuous random variables. We describe the inference procedure based on extending PRISM with continuous random variables. This choice is based on the following reasons. First of all, the use of explicit random variables in PRISM simplifies the technical development. Secondly, standard statistical models such as Hidden Markov Models (HMMs), Bayesian Networks and Probabilistic Context-Free Grammars (PCFGs) can be naturally encoded in PRISM. Along the same lines, our extension permits natural encodings of Finite Mixture Models (FMMs) and Kalman Filters. Thirdly, PRISM’s inference naturally reduces to the Viterbi algorithm [Viterbi] over HMMs, and the Inside-Outside algorithm [InsideOutside] over PCFGs. The combination of well-defined model theory and efficient inference has enabled the use of PRISM for synthesizing knowledge in sensor networks [Sensys].

It should be noted that, while the technical development in this paper is limited to PRISM, the basic technique itself is applicable to other similar PLP languages such as ProbLog and LPAD (see Section 7).

Our Contribution:

We extend PRISM at the language level to seamlessly include discrete as well as continuous random variables. We develop a new inference procedure to evaluate queries over such extended PRISM programs.

  • We extend the PRISM language for specifying distributions of continuous random variables, and linear equality constraints over such variables.

  • We develop a symbolic inference technique to reason with constraints on the random variables. PRISM’s inference technique becomes a special case of our technique when restricted to logic programs with discrete random variables.

  • These two developments enable the encoding of rich statistical models such as Kalman Filters and a large class of Hybrid Bayesian Networks; and exact inference over such models, which were hitherto not possible in LP and its probabilistic extensions.

Note that the technique of using PRISM for in-network evaluation of queries in a sensor network [Sensys] can now be applied directly when sensor data and noise are continuously distributed. Tracking and navigation problems in sensor networks are special cases of the Kalman Filter problem [DSN]. There are a number of other network inference problems, such as the indoor localization problem, that have been modeled as FMMs [WiGEM]. Moreover, our extension permits reasoning over models with finite mixtures of Gaussians and discrete distributions (see Section 7). Our extension of PRISM brings us closer to the ideal of finding a declarative basis for programming in the presence of noisy data.

The rest of this paper is organized as follows. We begin with a review of related work in Section 2, and describe the PRISM framework in detail in Section 3. We introduce the extended PRISM language in Section 4 and the symbolic inference technique for the extended language in Section 5. In Section 6 we show the use of this technique on an example encoding of the Kalman Filter. We conclude in Section 7 with a discussion on extensions to our inference procedure.

2 Related Work

Over the past decade, a number of Statistical Relational Learning (SRL) frameworks have been developed, which support modeling, inference and/or learning using a combination of logical and statistical methods. These frameworks can be broadly classified as statistical-model-based or logic-based, depending on how their semantics is defined. In the first category are frameworks such as Bayesian Logic Programs (BLPs) [blp], Probabilistic Relational Models (PRMs) [prm], and Markov Logic Networks (MLNs) [mln], where logical relations are used to specify a model compactly. A BLP consists of a set of Bayesian clauses (constructed from the Bayesian network structure), and a set of conditional probabilities (constructed from the CPTs of the Bayesian network). PRMs encode discrete Bayesian Networks with Relational Models/Schemas. An MLN is a set of formulas in first-order logic associated with weights. The semantics of a model in these frameworks is given in terms of an underlying statistical model obtained by expanding the relations.

Inference in SRL frameworks such as PRISM [sato-kameya-prism], Stochastic Logic Programs (SLP) [Muggleton], Independent Choice Logic (ICL) [PooleICL], and ProbLog [deRaedt] is primarily driven by query evaluation over logic programs. In SLP, clauses of a logic program are annotated with probabilities, which are then used to associate probabilities with proofs (derivations in a logic program). ICL [Poole] consists of definite clauses and disjoint declarations of the form $disjoint([h_1 : p_1, \ldots, h_n : p_n])$ that specifies a probability distribution over the hypotheses (i.e., $\{h_1, \ldots, h_n\}$). Any probabilistic knowledge representable in a discrete Bayesian network can be represented in this framework. While the language model itself is restricted (e.g., ICL permits only acyclic clauses), it has a declarative distribution semantics. This semantic foundation was later used in other frameworks such as PRISM and ProbLog. CP-Logic [CPlogic] is a logical language to represent probabilistic causal laws, and its semantics is equivalent to a probability distribution over well-founded models of certain logic programs. Specifications in LPAD [lpad] resemble those in CP-Logic: probabilistic predicates are specified with disjunctive clauses, i.e., clauses with multiple disjunctive consequents, with a distribution defined over the consequents. LPAD has a distribution semantics, and a proof-based operational semantics similar to that of PRISM. ProbLog specifications annotate facts in a logic program with probabilities. In contrast to SLP, ProbLog has a distribution semantics and a proof-based operational semantics. PRISM (discussed in detail in the next section), LPAD and ProbLog are equally expressive. PRISM uses explicit random variables and a simple but restricted inference procedure. In particular, PRISM demands that the proofs for an answer are pairwise mutually exclusive, and that the random variables used in a single proof are pairwise independent. The inference procedures of LPAD and ProbLog lift these restrictions.

SRL frameworks that are based primarily on statistical inference, such as BLP, PRM and MLN, were originally defined over discrete-valued random variables, and have been naturally extended to support a combination of discrete and continuous variables. Continuous BLP [hblp] and Hybrid PRM [hprm] extend their base models by using Hybrid Bayesian Networks [hbn]. Hybrid MLN [hmln] allows description of continuous properties and attributes (e.g., a formula $length(x) = c$ with weight $w$), deriving MRFs with continuous-valued nodes (e.g., length(a) for a grounding of $x$, with mean $c$ and standard deviation $1/\sqrt{2w}$).

In contrast to BLP, PRM and MLN, SRL frameworks that are primarily based on logical inference offer limited support for continuous variables. In fact, among such frameworks, only ProbLog has been recently extended with continuous variables. Hybrid ProbLog [hproblog] extends ProbLog by adding a set of continuous probabilistic facts (e.g., $(X_i, \phi_i) :: f_i$, where $X_i$ is a variable appearing in atom $f_i$, and $\phi_i$ denotes its Gaussian density function). It adds three predicates, namely below, above and ininterval, to the background knowledge to process values of continuous facts. A Hybrid ProbLog program may use a continuous random variable, but further processing can be based only on testing whether or not the variable’s value lies in a given interval. As a consequence, statistical models such as Finite Mixture Models can be encoded in Hybrid ProbLog, but others such as certain classes of Hybrid Bayesian Networks (with a continuous child of continuous parents) and Kalman Filters cannot be encoded. The extension to PRISM described in this paper makes the framework general enough to encode such statistical models.

More recently, [apprProblog] introduced a sampling based approach for (approximate) probabilistic inference in a ProbLog-like language that combines continuous and discrete random variables. The inference algorithm uses forward chaining and rejection sampling. The language permits a large class of models where discrete and continuous variables may be combined without restriction. In contrast, we propose an exact inference algorithm with a more restrictive language, but ensure that inference matches the complexity of specialized inference algorithms for important classes of statistical models (e.g., Kalman filters).

3 Background: an overview of PRISM

PRISM programs have Prolog-like syntax (see Fig. 1). In a PRISM program the msw relation (“multi-valued switch”) has a special meaning: msw(X,I,V) says that V is the outcome of the I-th instance from a family X of random processes. (Following PRISM, we often omit the instance number in an msw when a program uses only one instance from a family of random processes.) The set of variables $\{V_i \mid \mathtt{msw}(p, i, V_i)\}$ are i.i.d. for a given random process $p$. The distribution parameters of the random variables are specified separately.

The program in Fig. 1 encodes a Hidden Markov Model (HMM) in PRISM.

Figure 1: PRISM program for an HMM

The set of observations is encoded as facts of predicate obs, where obs(I,V) means that value V was observed at time I. In the figure, the clause defining hmm says that T is the N-th state if we traverse the HMM starting at an initial state S (itself the outcome of the random process init). In hmm_part(I, N, S, T), S is the I-th state, T is the N-th state. The first clause of hmm_part defines the conditions under which we go from the I-th state S to the I+1-th state NextS. Random processes trans(S) and emit(S) give the distributions of transitions and emissions, respectively, from state S.

The meaning of a PRISM program is given in terms of a distribution semantics [sato-kameya-prism, sato]. A PRISM program is treated as a non-probabilistic logic program over a set of probabilistic facts, the msw relation. An instance of the msw relation defines one choice of values of all random variables. A PRISM program is associated with a set of least models, one for each msw relation instance. A probability distribution is then defined over the set of models, based on the probability distribution of the msw relation instances. This distribution is the semantics of a PRISM program. Note that the distribution semantics is declarative. For a subclass of programs, PRISM has an efficient procedure for computing this semantics based on OLDT resolution [OLDT].

Inference in PRISM proceeds as follows. When the goal selected at a step is of the form msw(X,I,Y), then Y is bound to a possible outcome of a random process X. Thus in PRISM, derivations are constructed by enumerating the possible outcomes of each random variable. The derivation step is associated with the probability of this outcome. If all random processes encountered in a derivation are independent, then the probability of the derivation is the product of probabilities of each step in the derivation. If a set of derivations are pairwise mutually exclusive, the probability of the set is the sum of probabilities of each derivation in the set. PRISM’s evaluation procedure is defined only when the independence and exclusiveness assumptions hold. Finally, the probability of an answer is the probability of the set of derivations of that answer.
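The product and sum rules described above can be illustrated with a minimal Python sketch. It is not PRISM code: the two-state HMM and all of its parameters (`INIT`, `TRANS`, `EMIT`) are hypothetical, and emissions are assumed to be generated by the state reached after each transition. The sketch enumerates every derivation (one per state sequence), multiplies the probabilities of the independent switch outcomes within a derivation, and sums over the mutually exclusive derivations. The forward recursion is included only as a cross-check.

```python
from itertools import product

# Hypothetical two-state HMM parameters (assumptions, not from the paper).
INIT = {"s0": 0.6, "s1": 0.4}
TRANS = {"s0": {"s0": 0.7, "s1": 0.3}, "s1": {"s0": 0.4, "s1": 0.6}}
EMIT = {"s0": {"a": 0.9, "b": 0.1}, "s1": {"a": 0.2, "b": 0.8}}


def prob_by_enumeration(obs):
    """Probability of an observation sequence by explicit enumeration.

    Each state path is one derivation. Its probability is the product of the
    (independent) switch outcomes it uses; the derivations are mutually
    exclusive, so their probabilities are summed.
    """
    states = list(INIT)
    total = 0.0
    for path in product(states, repeat=len(obs) + 1):
        p = INIT[path[0]]
        for i, v in enumerate(obs):
            prev, nxt = path[i], path[i + 1]
            p *= TRANS[prev][nxt] * EMIT[nxt][v]
        total += p
    return total


def prob_by_forward(obs):
    """Same probability via the forward recursion, which shares sub-results."""
    alpha = dict(INIT)
    for v in obs:
        alpha = {
            t: sum(alpha[s] * TRANS[s][t] for s in alpha) * EMIT[t][v]
            for t in INIT
        }
    return sum(alpha.values())


if __name__ == "__main__":
    seq = ["a", "b", "a"]
    print(prob_by_enumeration(seq), prob_by_forward(seq))
```

The enumeration grows exponentially with the sequence length, while the forward recursion reuses intermediate sums, which is the kind of sharing that tabled evaluation provides.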

4 Extended PRISM

Support for continuous variables is added by modifying PRISM’s language in two ways. We use the msw relation to sample from discrete as well as continuous distributions. In PRISM, a special relation called values is used to specify the ranges of values of random variables; the probability mass functions are specified using set_sw directives. In our extension, we extend the set_sw directives to specify probability density functions as well. For instance, set_sw(r, norm(Mu,Var)) specifies that outcomes of random processes r have Gaussian distribution with mean Mu and variance Var. (The technical development in this paper considers only univariate Gaussian variables; see the discussion in Section 7 on how multivariate Gaussian as well as other continuous distributions are handled.) Parameterized families of random processes may be specified, as long as the parameters are discrete-valued. For instance, set_sw(w(M), norm(Mu,Var)) specifies a family of random processes, with one for each value of M. As in PRISM, set_sw directives may be specified programmatically; for instance, in the specification of w(M), the distribution parameters may be computed as functions of M.

Additionally, we extend PRISM programs with linear equality constraints over reals. Without loss of generality, we assume that constraints are written as linear equalities of the form $Y = a_1 * X_1 + \ldots + a_n * X_n + b$ where $a_i$ and $b$ are all floating-point constants. The use of constraints enables us to encode Hybrid Bayesian Networks and Kalman Filters as extended PRISM programs. In the following, we use Constr to denote a set (conjunction) of linear equality constraints. We also denote by $\overline{X}$ a vector of variables and/or values, explicitly specifying the size only when it is not clear from the context. This permits us to write linear equality constraints compactly (e.g., $Y = \overline{a} \cdot \overline{X} + b$).

Encoding of Kalman Filter specifications uses linear constraints and closely follows the structure of the HMM specification, and is shown in Section 6.

Distribution Semantics:

We extend PRISM’s distribution semantics for continuous random variables as follows. The idea is to construct a probability space for the msw definitions (called probabilistic facts in PRISM) and then extend it to a probability space for the entire program using least model semantics. The sample space for the probabilistic facts is constructed from those of discrete and continuous random variables. The sample space of a continuous random variable is the set of real numbers, $\mathbb{R}$. The sample space of a set of random variables is a Cartesian product of the sample spaces of individual variables. We complete the definition of a probability space for continuous random variables by considering the Borel $\sigma$-algebra over $\mathbb{R}$, and defining a Lebesgue measure on this set as the probability measure. Lifting the probability space to cover the entire program needs one significant step. We use the least model semantics of constraint logic programs [JaffarCLP] as the basis for defining the semantics of extended PRISM programs. A point in the sample space is an arbitrary interpretation of the program, with its Herbrand universe and $\mathbb{R}$ as the domain of interpretation. For each sample, we distinguish between the interpretation of user-defined predicates and probabilistic facts. Note that only the probabilistic facts have probabilistic behavior in PRISM; the rest of a model is defined in terms of logical consequence. Hence, we can define a probability measure over a set of sample points by using the measure defined for the probabilistic facts alone. The semantics of an extended PRISM program is thus defined as a distribution over its possible models.

5 Inference

Recall that PRISM’s inference explicitly enumerates outcomes of random variables in derivations. The key to inference in the presence of continuous random variables is avoiding enumeration by representing the derivations and their attributes symbolically. A single step in the construction of a symbolic derivation is defined below.

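Definition 1 (Symbolic Derivation)

A goal $G$ directly derives goal $G'$, denoted $G \rightarrow G'$, if:

PCR:

$G = q_1(\overline{X_1}), G_1$, and there exists a clause in the program, $q_1(\overline{Y}) \;\texttt{:-}\; r_1(\overline{Y_1}), r_2(\overline{Y_2}), \ldots, r_m(\overline{Y_m})$, such that $\theta = \mathrm{mgu}(q_1(\overline{X_1}), q_1(\overline{Y}))$; then, $G' = (r_1(\overline{Y_1}), r_2(\overline{Y_2}), \ldots, r_m(\overline{Y_m}), G_1)\theta$;

MSW:

$G = \mathtt{msw}(rv(\overline{X}), Y), G_1$: then $G' = G_1$;

CONS:

$G = \mathit{Constr}, G_1$ and Constr is satisfiable: then $G' = G_1$.

A symbolic derivation of $G$ is a sequence of goals $G_0, G_1, \ldots$ such that $G = G_0$ and, for all $i \geq 0$, $G_i \rightarrow G_{i+1}$.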

We only consider successful derivations, i.e., the last step of a derivation resolves to an empty clause. Note that the traditional notion of derivation in a logic program coincides with that of symbolic derivation when the selected subgoal (literal) is not an msw or a constraint. When the selected subgoal is an msw, PRISM’s inference will construct the next step by enumerating the values of the random variable. In contrast, symbolic derivation skips msw’s and constraints and continues with the remaining subgoals in a goal. The effect of these constructs is computed by associating (a) variable type information and (b) a success function (defined below) with each goal in the derivation. The symbolic derivation for the goal widget(X) over the program in Example 1 is shown in Fig. 2(b).

Figure 2: Finite Mixture Model Program and Symbolic Derivation. (a) Mixture model program; (b) Symbolic derivation for goal widget(X).

Example 1

Consider a factory with two machines a and b. Each machine produces a widget structure and then the structure is painted with a color. In the program of Fig. 2(a), msw(m, M) chooses either machine a or b, msw(st(M), Z) gives the cost Z of a product structure, msw(pt, Y) gives the cost Y of painting, and finally X = Y + Z returns the price of a painted widget X.

Figure 3: Symbolic derivation. (a) Example program; (b) Symbolic derivation for goal q(Y).

Example 2

This example illustrates how symbolic derivation differs from traditional logic programming derivation. Fig. 3(b) shows the symbolic derivation for goal q(Y) in Fig. 3(a).

Notice that the symbolic derivation still branches in the derivation tree for the various logic definitions and outcomes. The main difference from traditional logic derivation is that it skips msw and Constr definitions, and continues with the remaining subgoals in a goal.

Success Functions:

Goals in a symbolic derivation may contain variables whose values are determined by msw’s appearing subsequently in the derivation. With each goal $G_i$ in a symbolic derivation, we associate a set of variables, $V(G_i)$, that is a subset of variables in $G_i$. The set $V(G_i)$ is such that the variables in $V(G_i)$ subsequently appear as parameters or outcomes of msw’s in some subsequent goal $G_j$, $j \geq i$. We can further partition $V$ into two disjoint sets, $V_c$ and $V_d$, representing continuous and discrete variables, respectively. The sets $V_c(G_i)$ and $V_d(G_i)$ are called the derivation variables of $G_i$, defined below.

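Definition 2 (Derivation Variables)

Let $G \rightarrow G'$ such that $G'$ is derived from $G$ using:

PCR:

Let $\theta$ be the mgu in this step. Then $V_c(G)$ and $V_d(G)$ are the largest sets of variables in $G$ such that $V_c(G)\theta \subseteq V_c(G')$ and $V_d(G)\theta \subseteq V_d(G')$.

MSW:

Let $G = \mathtt{msw}(rv(\overline{X}), Y), G_1$. Then $V_c(G)$ and $V_d(G)$ are the largest sets of variables in $G$ such that $V_c(G) \subseteq V_c(G') \cup \{Y\}$, and $V_d(G) \subseteq V_d(G') \cup \overline{X}$ if $Y$ is continuous, otherwise $V_c(G) \subseteq V_c(G')$, and $V_d(G) \subseteq V_d(G') \cup \overline{X} \cup \{Y\}$.

CONS:

Let $G = \mathit{Constr}, G_1$. Then $V_c(G)$ and $V_d(G)$ are the largest sets of variables in $G$ such that $V_c(G) \subseteq V_c(G') \cup \mathit{vars}(\mathit{Constr})$, and $V_d(G) \subseteq V_d(G')$.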

Given a goal $G_i$ in a symbolic derivation, we can associate with it a success function, which is a function from the set of all valuations of $V(G_i)$ to $[0,1]$. Intuitively, the success function represents the probability that the symbolic derivation represents a successful derivation for each valuation of $V(G_i)$.

Representation of success functions:

Given a set of variables $\mathbf{V}$, let $\mathcal{C}$ denote the set of all linear equality constraints over reals using $\mathbf{V}$. Let $\mathcal{L}$ be the set of all linear functions over $\mathbf{V}$ with real coefficients. Let $N_X(\mu, \sigma^2)$ be the PDF of a univariate Gaussian distribution with mean $\mu$ and variance $\sigma^2$, and $\delta_x(X)$ be the Dirac delta function which is zero everywhere except at $x$ and integration of the delta function over its entire range is 1. Expressions of the form $k \cdot \prod_l \delta_{v_l}(V_l) \prod_i N_{f_i}(\mu_i, \sigma_i^2)$, where $k$ is a non-negative real number and $f_i \in \mathcal{L}$, are called product PDF (PPDF) functions over $\mathbf{V}$. We use $\phi$ (possibly subscripted) to denote such functions. A pair $\langle \phi, C \rangle$ where $C \subseteq \mathcal{C}$ is called a constrained PPDF function. A sum of a finite number of constrained PPDF functions is called a success function, represented as $\sum_i \langle \phi_i, C_i \rangle$.

We use $C_i(\psi)$ to denote the constraints (i.e., $C_i$) in the $i$-th constrained PPDF function of success function $\psi$; and $D_i(\psi)$ to denote the $i$-th PPDF function of $\psi$.

Success functions of base predicates:

The success function of a constraint $C$ is $\langle 1, C \rangle$. The success function of true is $\langle 1, \mathit{true} \rangle$. The PPDF component of $\mathtt{msw}(rv(\overline{X}), Y)$’s success function is the probability density function of rv’s distribution if rv is continuous, and its probability mass function if rv is discrete; its constraint component is true.

Figure 4: Success Functions
Example 3

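The success function of msw(m,M) for the program in Example 1 is $\psi_{msw(m,M)} = p_a\,\delta_a(M) + p_b\,\delta_b(M)$, where $p_a$ and $p_b$ are the probabilities that the program assigns to the outcomes a and b of m. Note that we can represent the success function using tables, where each table row denotes a valuation of the discrete random variables. For example, the above success function can be represented as in Fig. 4a. Thus, instead of writing delta functions explicitly, we often omit them in examples and represent success functions using tables.

Fig. 4b represents the success function of msw(st(M), Z) for the program in Example 1. Similarly, the success function of msw(pt, Y) for the program in Example 1 is the Gaussian density $N_Y(\mu_{pt}, \sigma_{pt}^2)$, with the mean $\mu_{pt}$ and variance $\sigma_{pt}^2$ that the program specifies for pt.

Finally, the success function of X = Y + Z for the program in Example 1 is the constrained PPDF function $\langle 1, X = Y + Z \rangle$, i.e., the constant PPDF 1 paired with the constraint itself.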

Success functions of user-defined predicates:

If $G \rightarrow G'$ is a step in a derivation, then the success function of $G$ is computed bottom-up based on the success function of $G'$. This computation is done using join and marginalize operations on success functions.

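Definition 3 (Join)

Let $\psi_1 = \sum_i \langle D_i, C_i \rangle$ and $\psi_2 = \sum_j \langle D_j, C_j \rangle$ be two success functions, then join of $\psi_1$ and $\psi_2$ represented as $\psi_1 * \psi_2$ is the success function $\sum_{i,j} \langle D_i D_j, C_i \wedge C_j \rangle$.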

Example 4

Let Fig. 5(a) and 5(b) represent the success functions $\psi_1$ and $\psi_2$ respectively. Then Fig. 5(c) shows the join of $\psi_1$ and $\psi_2$.

Figure 5: Join of Success Functions. Panels (a), (b) and (c).

Note that we perform a simplification of success functions after the join operation. We eliminate any PPDF term in $\psi$ which is inconsistent w.r.t. delta functions. For example, $\delta_a(M)\,\delta_b(M) = 0$, as $M$ cannot be both $a$ and $b$ at the same time.
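The join and the simplification step can be sketched in Python as follows. This is only an illustration under assumed conventions, not the authors' implementation: a success function is a list of terms, each term is a tuple `(k, deltas, gaussians, constraints)`, where `deltas` maps a discrete variable to the value it is pinned to, `gaussians` is a tuple of `(expression, mean, variance)` factors, and `constraints` is a frozenset of constraint strings. All names and numbers are hypothetical.

```python
from itertools import product


def join(psi1, psi2):
    """Join two success functions term by term.

    For every pair of terms the constants are multiplied, the delta and
    Gaussian factors are combined, and the constraint sets are conjoined.
    Pairs whose delta functions pin one variable to two different values
    are dropped (the simplification step).
    """
    out = []
    for (k1, d1, g1, c1), (k2, d2, g2, c2) in product(psi1, psi2):
        if any(var in d2 and d2[var] != val for var, val in d1.items()):
            continue  # inconsistent delta functions: the product is zero
        out.append((k1 * k2, {**d1, **d2}, g1 + g2, c1 | c2))
    return out


if __name__ == "__main__":
    # Hypothetical success function of a discrete switch over {a, b}.
    psi_m = [
        (0.3, {"M": "a"}, (), frozenset()),
        (0.7, {"M": "b"}, (), frozenset()),
    ]
    # Hypothetical success function of a goal whose density depends on M.
    psi_g = [
        (1.0, {"M": "a"}, (("Z", 2.0, 1.0),), frozenset({"X = Y + Z"})),
        (1.0, {"M": "b"}, (("Z", 3.0, 1.0),), frozenset({"X = Y + Z"})),
    ]
    for term in join(psi_m, psi_g):
        print(term)
```

Of the four term pairs, the two that would require M to be both a and b are discarded, so the result has one term per value of M.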

Given a success function $\psi$ for a goal $G$, the success function for $\exists X.\, G$ is computed by the marginalization operation. Marginalization w.r.t. a discrete variable is straightforward and omitted. Below we define marginalization w.r.t. continuous variables in two steps: first rewriting the success function in a projected form and then doing the required integration.
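For the discrete case, which the text omits, one possible reading is sketched below in Python. It assumes a hypothetical term representation `(k, deltas, gaussians, constraints)` in which `deltas` maps each discrete variable to the value its delta function pins it to, and it assumes that every term pins the variable being summed out. Under those assumptions, summing over the variable simply removes its delta factor from each term.

```python
def marginalize_discrete(psi, var):
    """Sum a discrete variable out of a success function.

    Assumption: every term pins `var` with a delta function. Summing
    k * delta_v(var) * rest over all values of `var` then leaves k * rest.
    """
    out = []
    for k, deltas, gaussians, constraints in psi:
        remaining = {v: x for v, x in deltas.items() if v != var}
        out.append((k, remaining, gaussians, constraints))
    return out


if __name__ == "__main__":
    # Hypothetical success function over a discrete M and a continuous Z.
    psi = [
        (0.3, {"M": "a"}, (("Z", 2.0, 1.0),), frozenset()),
        (0.7, {"M": "b"}, (("Z", 3.0, 1.0),), frozenset()),
    ]
    for term in marginalize_discrete(psi, "M"):
        print(term)
```

The result in this example is a two-component mixture over Z whose weights are the original constants.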

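The goal of projection is to eliminate any linear constraint on $V$, where $V$ is the continuous variable to marginalize over. The projection operation involves finding a linear constraint (i.e., $V = \overline{a} \cdot \overline{X} + b$) on $V$ and replacing all occurrences of $V$ in the success function by $\overline{a} \cdot \overline{X} + b$.

Definition 4 (Projection)

Projection of a success function $\psi$ w.r.t. a continuous variable $V$, denoted by $\psi \downarrow_V$, is a success function $\psi'$ such that
$\forall i.\; D_i(\psi') = D_i(\psi)[\overline{a} \cdot \overline{X} + b / V]$; and $C_i(\psi') = (C_i(\psi) - C_{ip})[\overline{a} \cdot \overline{X} + b / V]$,
where $C_{ip}$ is a linear constraint ($V = \overline{a} \cdot \overline{X} + b$) on $V$ in $C_i(\psi)$ and $t[s/x]$ denotes replacement of all occurrences of $x$ in $t$ by $s$.

Note that the replacement of $V$ by $\overline{a} \cdot \overline{X} + b$ in PDFs and linear constraints does not alter the general form of a success function. Thus projection returns a success function. Notice that if $\psi$ does not contain any linear constraint on $V$, then the projected form remains the same.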

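Example 5

Let $\psi = \langle k\, N_Z(\mu_1, \sigma_1^2)\, N_Y(\mu_2, \sigma_2^2),\; X = Y + Z \rangle$ represent a success function. Then projection of $\psi$ w.r.t. $Y$ yields

$$\psi \downarrow_Y \;=\; \langle k\, N_Z(\mu_1, \sigma_1^2)\, N_{X - Z}(\mu_2, \sigma_2^2),\; \mathit{true} \rangle. \qquad (1)$$

Notice that $Y$ is replaced by $X - Z$.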

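Proposition 1

Integration of a PPDF function with respect to a variable $V$ is a PPDF function, i.e.,

$$\alpha \int_{-\infty}^{\infty} \prod_{k=1}^{m} N_{(\overline{a_k} \cdot \overline{X_k} + b_k)}(\mu_k, \sigma_k^2)\, dV \;=\; \alpha' \prod_{l=1}^{m'} N_{(\overline{a'_l} \cdot \overline{X'_l} + b'_l)}(\mu'_l, \sigma'^2_l)$$

where $V \in \overline{X_k}$ and $V \notin \overline{X'_l}$.

For example,

$$\int_{-\infty}^{\infty} N_{V - X_1}(\mu_1, \sigma_1^2) \cdot N_{V - X_2}(\mu_2, \sigma_2^2)\, dV \;=\; N_{X_1 - X_2}(\mu_2 - \mu_1, \sigma_1^2 + \sigma_2^2). \qquad (2)$$

Here $X_1, X_2$ are linear combinations of variables (except $V$). A proof of the proposition is presented in Section 8.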
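The two-factor case of Proposition 1 can be checked numerically. The Python sketch below compares a trapezoid-rule integral of the product of two Gaussian densities in $V - X_1$ and $V - X_2$ (means $\mu_1, \mu_2$, variances $\sigma_1^2, \sigma_2^2$) against a single Gaussian density in $X_1 - X_2$ with mean $\mu_2 - \mu_1$ and variance $\sigma_1^2 + \sigma_2^2$. The helper names, the integration bounds and the test values are all arbitrary choices made for illustration.

```python
import math


def gauss(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def integrate(f, lo, hi, n=20000):
    """Composite trapezoid rule on [lo, hi] with n sub-intervals."""
    h = (hi - lo) / n
    s = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, n))
    return s * h


def lhs(x1, x2, mu1, var1, mu2, var2):
    """Integral over V of N_{V - x1}(mu1, var1) * N_{V - x2}(mu2, var2)."""
    return integrate(
        lambda v: gauss(v - x1, mu1, var1) * gauss(v - x2, mu2, var2),
        -50.0,
        50.0,
    )


def rhs(x1, x2, mu1, var1, mu2, var2):
    """Closed form: N_{x1 - x2}(mu2 - mu1, var1 + var2)."""
    return gauss(x1 - x2, mu2 - mu1, var1 + var2)


if __name__ == "__main__":
    args = (1.0, -0.5, 0.3, 1.2, -0.7, 0.5)
    print(lhs(*args), rhs(*args))
```

The integration range is assumed wide enough for the chosen values; very large means or variances would need wider bounds.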

Definition 5 (Integration)

Let $\psi$ be a success function that does not contain any linear constraints on $V$. Then integration of $\psi$ with respect to $V$, denoted by $\oint_V \psi$, is a success function $\psi'$ such that $\forall i.\; D_i(\psi') = \int D_i(\psi)\, dV$.

It is easy to see (using Proposition 1) that the integral of a success function is also a success function. Note that if $\psi$ does not contain any PDF on $V$, then the integrated form remains the same.

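Example 6

Let $\psi = \langle k\, N_Z(\mu_1, \sigma_1^2)\, N_{X - Z}(\mu_2, \sigma_2^2),\; \mathit{true} \rangle$ represent a success function. Then integration of $\psi$ w.r.t. $Z$ yields $\langle k\, N_X(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2),\; \mathit{true} \rangle$.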

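Definition 6 (Marginalize)

Marginalization of a success function $\psi$ with respect to a variable $V$, denoted by $\mathbb{M}(\psi, V)$, is a success function $\psi'$ such that $\psi' = \oint_V (\psi \downarrow_V)$, i.e., the projection of $\psi$ w.r.t. $V$ followed by integration w.r.t. $V$.

We overload $\mathbb{M}$ to denote marginalization over a set of variables, defined such that $\mathbb{M}(\psi, \{V\} \cup \overline{X}) = \mathbb{M}(\mathbb{M}(\psi, V), \overline{X})$ and $\mathbb{M}(\psi, \{\}) = \psi$.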

Proposition 2

The set of all success functions is closed under join and marginalize operations.

The success function for a derivation is defined as follows.

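Definition 7 (Success function of a goal)

The success function of a goal $G$, denoted by $\psi_G$, is computed based on the derivation $G \rightarrow G'$:

$$\psi_G = \begin{cases} \sum_{G'} \mathbb{M}(\psi_{G'}, V(G') - V(G)) & \text{for all program clause resolutions } G \rightarrow G' \\ \psi_{msw(rv(\overline{X}), Y)} * \psi_{G'} & \text{if } G = \mathtt{msw}(rv(\overline{X}), Y), G_1 \\ \psi_{Constr} * \psi_{G'} & \text{if } G = \mathit{Constr}, G_1 \end{cases}$$

where $\mathbb{M}$ denotes marginalization (Definition 6) and $*$ denotes join (Definition 3).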

Note that the above definition carries PRISM’s assumption that an instance of a random variable occurs at most once in any derivation. In particular, the PCR step marginalizes success functions w.r.t. a set of variables; the valuations of the set of variables must be mutually exclusive for correctness of this step. The MSW step joins success functions; the goals joined must use independent random variables for the join operation to correctly compute success functions in this step.

Example 7

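Fig. 2(b) shows the symbolic derivation for the goal widget(X) over the mixture model program in Example 1. Let $p_a$ and $p_b$ denote the probabilities of the outcomes a and b of m, let $N_Z(\mu_a, \sigma_a^2)$ and $N_Z(\mu_b, \sigma_b^2)$ denote the Gaussian densities of st(a) and st(b), and let $N_Y(\mu_{pt}, \sigma_{pt}^2)$ denote the Gaussian density of pt, as specified in the program. The success function of the goal msw(pt, Y), X = Y + Z is $\langle N_Y(\mu_{pt}, \sigma_{pt}^2),\; X = Y + Z \rangle$.

The success function of the goal msw(st(M), Z), msw(pt, Y), X = Y + Z is the join of the success function of msw(st(M), Z) with the one above (Fig. 5(b)).

Then join of the success function of msw(m, M) and the success function in Fig. 5(b) yields the success function in Fig. 5(c) (see Example 4).

Finally, the success function of widget(X) is obtained by marginalizing the success function in Fig. 5(c) w.r.t. $M$, $Z$ and $Y$.

First we marginalize w.r.t. $M$:

$$p_a \langle N_Z(\mu_a, \sigma_a^2)\, N_Y(\mu_{pt}, \sigma_{pt}^2),\; X = Y + Z \rangle + p_b \langle N_Z(\mu_b, \sigma_b^2)\, N_Y(\mu_{pt}, \sigma_{pt}^2),\; X = Y + Z \rangle$$

Next we marginalize the above success function w.r.t. $Z$:

$$p_a\, N_{X - Y}(\mu_a, \sigma_a^2)\, N_Y(\mu_{pt}, \sigma_{pt}^2) + p_b\, N_{X - Y}(\mu_b, \sigma_b^2)\, N_Y(\mu_{pt}, \sigma_{pt}^2)$$

Finally, we marginalize the above function over variable $Y$ to get the success function of widget(X):

$$p_a\, N_X(\mu_a + \mu_{pt}, \sigma_a^2 + \sigma_{pt}^2) + p_b\, N_X(\mu_b + \mu_{pt}, \sigma_b^2 + \sigma_{pt}^2)$$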
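The outcome of this chain of marginalizations can be sanity-checked numerically. The Python sketch below uses hypothetical switch parameters (`P_M`, `ST`, `PT` are assumptions and are not taken from the figure). It evaluates the density of X once as a two-component Gaussian mixture, which is the form that symbolic marginalization over M, Z and Y would be expected to produce for a sum of independent Gaussians, and once by brute-force numerical integration over Y with Z replaced by X - Y.

```python
import math

# Hypothetical parameters for the widget model (assumptions for illustration).
P_M = {"a": 0.3, "b": 0.7}                 # distribution of switch m
ST = {"a": (2.0, 1.0), "b": (3.0, 1.0)}    # (mean, variance) of st(a), st(b)
PT = (0.5, 0.1)                            # (mean, variance) of pt


def gauss(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def widget_density(x):
    """Mixture form: sum over m of p_m * N_X(mu_m + mu_pt, var_m + var_pt)."""
    mu_pt, var_pt = PT
    return sum(
        p * gauss(x, ST[m][0] + mu_pt, ST[m][1] + var_pt) for m, p in P_M.items()
    )


def widget_density_numeric(x, lo=-30.0, hi=30.0, n=6000):
    """Same density by trapezoid integration over Y, using Z = X - Y."""
    mu_pt, var_pt = PT
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        y = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0
        inner = sum(p * gauss(x - y, *ST[m]) for m, p in P_M.items())
        total += weight * inner * gauss(y, mu_pt, var_pt)
    return total * h


if __name__ == "__main__":
    for x in (1.0, 3.0, 5.0):
        print(x, widget_density(x), widget_density_numeric(x))
```

The two evaluations agree to numerical precision for these parameters, and the closed form needs no integration at query time.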

Example 8

In this example, we compute success function of goal in Example 2. Fig. 2(b) shows the symbolic derivation for goal . Success function of is , and success function of is . Similarly, success function of is . Now

Success function of is . Join of and yields . Finally, .

When , only is true. Thus . On the other hand, as both and are true when Y=2. Similarly, .    

Complexity:

Let $S_\psi$ denote the number of constrained PPDF terms in $\psi$; $P_\psi$ denote the maximum number of product terms in any PPDF function in $\psi$; and $Q_\psi$ denote the maximum size of a constraint set ($C_i$) in $\psi$. The time complexity of the two basic operations used in constructing a symbolic derivation is as follows.

Proposition 3 (Time Complexity)

The worst-case time complexity of is .

The worst-case time complexity of is when is discrete and when is continuous.

Note that when computing the success function of a goal in a derivation, the join operation is limited to joining the success function of a single msw or a single constraint set to the success function of a goal, and hence the number of constrained PPDF terms, the number of product terms and the sizes of the constraint sets involved are typically small. The complexity of the size of success functions is as follows.

Proposition 4 (Success Function Size)

For a goal $G$ and its symbolic derivation, the following hold:

  1. The maximum number of product terms in any PPDF function in $\psi_G$ is linear in $|V_c(G)|$, the number of continuous variables in $G$.

  2. The maximum size of a constraint set in a constrained PPDF function in $\psi_G$ is linear in $|V_c(G)|$.

  3. The maximum number of constrained PPDF functions in any entry of $\psi_G$ is potentially exponential in the number of discrete random variables in the symbolic derivation.

The number of product terms and the size of constraint sets are hence independent of the length of the symbolic derivation. Note that for a program with only discrete random variables, there may be exponentially fewer symbolic derivations than concrete derivations. The compactness is only in terms of number of derivations and not the total size of the representations. In fact, for programs with only discrete random variables, there is a one-to-one correspondence between the entries in the tabular representation of success functions and PRISM’s answer tables. For such programs, it is easy to show that the time complexity of the inference algorithm presented in this paper is same as that of PRISM.

Correctness of the Inference Algorithm:

The technically complex aspect of correctness is the closure of the set of success functions w.r.t. join and marginalize operations. Propositions 1 and 2 state these closure properties. Definition 7 represents the inference algorithm for computing the success function of a goal. The distribution of a goal is formally defined in terms of the distribution semantics of extended PRISM programs and is computed using the inference algorithm.

Theorem 5

The success function of a goal computed by the inference algorithm represents the distribution of the answer to that goal.

Proof:

Correctness w.r.t. distribution semantics follows from the definition of join and marginalize operations, and PRISM’s independence and exclusiveness assumptions. We prove this by induction on derivation length $n$. For $n = 1$, the definition of success function for base predicates gives a correct distribution.

Now assume that for a derivation of length $n$, our inference algorithm computes a valid distribution. Let $G'$ have a derivation of length $n$ and let $G \rightarrow G'$. Thus $G$ has a derivation of length $n + 1$. We show that the success function of $G$ represents a valid distribution.

We compute $\psi_G$ using Definition 7, which carries PRISM’s assumption that an instance of a random variable occurs at most once in any derivation. More specifically, the PCR step marginalizes $\psi_{G'}$ w.r.t. a set of variables $V(G') - V(G)$. Since according to PRISM’s exclusiveness assumption the valuations of the set of variables are mutually exclusive, the marginalization operation returns a valid distribution. Analogously, the MSW/CONS step joins success functions, and the goals joined use independent random variables (following PRISM’s assumption) for the join operation to correctly compute $\psi_G$ in this step. Thus $\psi_G$ represents a valid distribution.

6 Illustrative Example

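% kf(N, T): T is the state reached after N transition/observation steps.
kf(N, T) :-
  msw(init, S),                  % S: outcome of the initial-state switch
  kf_part(0, N, S, T).

kf_part(I, N, S, T) :-
  I < N, NextI is I+1,
  trans(S, I, NextS),            % NextS: state after the I-th transition
  emit(NextS, NextI, V),         % V: noisy reading of NextS
  obs(NextI, V),                 % bind V to the observation at step NextI
  kf_part(NextI, N, NextS, T).

kf_part(I, N, S, T) :-
  I=N, T=S.

% Transition: add the noise E drawn from switch trans_err.
trans(S, I, NextS) :-
  msw(trans_err, I, E),
  NextS = S + E.

% Emission: add the noise X drawn from switch obs_err.
emit(NextS, I, V) :-
  msw(obs_err, I, X),
  V = NextS + X.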
Figure 6: Logic program for Kalman Filter

In this section, we model Kalman filters [aibook] using logic programs. The model describes a random walk of a single continuous state variable with a noisy observation at each step. The initial state distribution is assumed to be Gaussian with mean $\mu_0$ and variance $\sigma_0^2$. The transition and sensor models are Gaussian noises with zero means and constant variances $\sigma_x^2$ and $\sigma_z^2$, respectively.

Fig. 6 shows a logic program for the Kalman filter, and Fig. 7 shows the derivation for the query kf(1,T). Note the similarity between this and the hmm program (Fig. 1): only the trans/emit definitions are different. We label each derivation step, and use these labels in the next subsection to refer to the appropriate derivation step. Here, our goal is to compute the filtered distribution of the state T.

Success Function Computation:

Fig. 7 shows the bottom-up success function computation. Note that the success function of the goal that starts with obs(1, V) is the same as that of its successor, except that obs binds V to an observation. The final step involves marginalization w.r.t. S. Since the product of two Gaussian PDFs is again a (scaled) Gaussian PDF, the result is the filtered distribution of the state after seeing one observation, which equals the filtered distribution presented in [aibook].
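As a numerical sanity check of the final marginalization step, the following Python sketch compares a brute-force evaluation of the integral over S with the closed-form one-step Kalman update. The function names, the grid bounds, and the parameter values are illustrative assumptions and do not come from the paper. The closed form is the standard textbook update for a scalar random walk with Gaussian noise, written with mean `mu0` and variance `var0` for the initial state and variances `var_x`, `var_z` for the transition and sensor noise.

```python
import math


def gauss(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def filtered_closed_form(mu0, var0, var_x, var_z, v1):
    """Textbook one-step update: returns (mean, variance) of T given V = v1."""
    pred_var = var0 + var_x  # variance of the state after one transition
    mean = (pred_var * v1 + var_z * mu0) / (pred_var + var_z)
    var = pred_var * var_z / (pred_var + var_z)
    return mean, var


def filtered_numeric(mu0, var0, var_x, var_z, v1, lo=-12.0, hi=12.0, n=401):
    """Same quantity by brute force: integrate out S on a grid, then normalise.

    The grid bounds and resolution are assumptions chosen for the demo values.
    """
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    weights = []
    for t in grid:
        # Marginalise S: initial-state density times transition density.
        prior_t = sum(gauss(s, mu0, var0) * gauss(t, s, var_x) for s in grid) * step
        # Join with the sensor model evaluated at the observed value v1.
        weights.append(prior_t * gauss(v1, t, var_z))
    total = sum(weights) * step
    mean = sum(t * w for t, w in zip(grid, weights)) * step / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) * step / total
    return mean, var


if __name__ == "__main__":
    print(filtered_closed_form(0.0, 1.0, 2.0, 1.0, 2.5))
    print(filtered_numeric(0.0, 1.0, 2.0, 1.0, 2.5))
```

With the assumed values (initial mean 0, initial variance 1, transition variance 2, sensor variance 1, observation 2.5), the closed form gives mean 1.875 and variance 0.75, and the grid-based integration agrees up to quadrature error.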

Figure 7: Symbolic derivation and success functions for kf(1,T)

7 Discussion and Concluding Remarks

ProbLog and PITA [RiguzziSwift10b], an implementation of LPAD, lift PRISM’s mutual exclusion and independence restrictions by using a BDD-based representation of explanations. The technical development in this paper is based on PRISM and imposes PRISM’s restrictions. However, we can remove these restrictions by using the following approach. In the first step, we materialize the set of symbolic derivations. In the second step, we can factor the derivations into a form analogous to BDDs such that random variables on each path of the factored representation are independent, and distinct paths in the representation are mutually exclusive. For instance, consider two non-exclusive branches in a symbolic derivation tree, one of which has msw(r, X) and the other of which has msw(s,Y). This will be factored such that one of the two, say msw(r,X’), is done in common, with two branches: and . The branch containing subgoal msw(s,Y) is “and-ed” with the branch, and replicated as the branch, analogous to how BDDs are processed. The factored representation itself can be treated as symbolic derivations augmented with dis-equality constraints (i.e. of the form ). Note that the success function of an equality constraint is . The success function of a dis-equality constraint is , which is representable by extending our language of success functions to permit non-negative constants. The definitions of the join and marginalize operations work with no change over the extended success functions, and the closure properties (Prop. 2) hold as well. Hence, success functions can be readily computed over the factored representation. A detailed discussion of this extension appears in [AsifulIslam2012].

Note that the success function of a goal represents the likelihood of a successful derivation for each instance of a goal. Hence the probability measure computed by the success function is what PRISM calls inside probability. Analogously, we can define a function that represents the likelihood that a goal will be encountered in a symbolic derivation starting at a given top-level goal. This “call” function will represent the outside probability of PRISM. Alternatively, we can use the Magic Sets transformation [magicset] to compute call functions of a program in terms of success functions of a transformed program. The ability to compute inside and outside probabilities can be used to infer smoothed distributions for temporal models.

For simplicity, in this paper we focused only on univariate Gaussians. However, the techniques can be easily extended to support multivariate Gaussian distributions by extending the integration function (Defn. 5) and the set_sw directives. We can also readily extend them to support Gamma distributions. More generally, the PDF functions can be generalized to contain Gaussian or Gamma density functions, such that variables are not shared between Gaussian and Gamma density functions. Again, the only change is to extend the integration function to handle Gamma PDFs.

The concept of symbolic derivations and success functions can be applied to parameter learning as well. We have developed an EM-based learning algorithm which permits us to learn the distribution parameters of extended PRISM programs with discrete as well as Gaussian random variables [contdist-learning]. Similar to inference, our learning algorithm uses the symbolic derivation procedure to compute Expected Sufficient Statistics (ESS). The E-step of the learning algorithm computes the ESS of the random variables, and the M-step computes the MLE of the distribution parameters given the ESS and success probabilities. Analogous to the inference algorithm presented in this paper, our learning algorithm specializes to PRISM’s learning over programs without any continuous variables. For mixture models, the learning algorithm performs the same computation as the standard EM learning algorithm [bishop].

The symbolic inference and learning procedures enable us to reason over a large class of statistical models such as hybrid Bayesian networks with discrete child-discrete parent, continuous child-discrete parent (finite mixture model), and continuous child-continuous parent (Kalman filter), which was hitherto not possible in PLP frameworks. They can also be used for hybrid models, e.g., models that mix discrete and Gaussian distributions. For instance, consider the mixture model example where st(a) is Gaussian but st(b) is a discrete distribution over two values, each with the same probability. The density of the mixture distribution can then be written as a weighted sum of the Gaussian density and point masses at the two discrete values. Thus the language can be used to model problems that lie outside traditional hybrid Bayesian networks.
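One way to make such a hybrid distribution concrete is through its cumulative distribution function, which avoids writing point masses as densities. The sketch below is an illustration only: the mixing weight, the Gaussian parameters, and the two discrete values are hypothetical placeholders, not the values used in the paper's example, and working with the CDF is a choice made here rather than the paper's representation.

```python
import math


def gaussian_cdf(x, mean, var):
    """CDF of a univariate Gaussian with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def hybrid_mixture_cdf(x, w_gauss, mean, var, atoms):
    """CDF of a mixture of one Gaussian component and one discrete component.

    w_gauss is the weight of the Gaussian component; the discrete component
    gets weight 1 - w_gauss. atoms is a list of (value, probability) pairs
    whose probabilities sum to 1. All of these are hypothetical inputs.
    """
    w_disc = 1.0 - w_gauss
    discrete_part = sum(p for value, p in atoms if value <= x)
    return w_gauss * gaussian_cdf(x, mean, var) + w_disc * discrete_part


if __name__ == "__main__":
    atoms = [(1.0, 0.5), (2.0, 0.5)]  # placeholder discrete values
    for x in (0.0, 1.0, 1.5, 2.0, 5.0):
        print(x, hybrid_mixture_cdf(x, 0.3, 0.0, 1.0, atoms))
```

The CDF is continuous wherever only the Gaussian component contributes and jumps at each discrete value by that value's probability times the weight of the discrete component.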

We implemented the extended inference algorithm presented in this paper in the XSB logic programming system [XSB]. The system is available at http://www.cs.sunysb.edu/~cram/contdist. This proof-of-concept prototype is implemented as a meta-interpreter and currently supports discrete and Gaussian distributions. The meanings of the various probabilistic predicates (e.g., msw, values, set_sw) in the system are similar to those in the PRISM system. This implementation illustrates how the inference algorithm specializes to the specialized techniques that have been developed for several popular statistical models such as HMM, FMM, Hybrid Bayesian Networks and Kalman Filters. Integration of the inference algorithm in XSB and its performance evaluation are topics of future work.

Acknowledgments.

We thank the reviewers for valuable comments. This research was supported in part by NSF Grants CCF-1018459, CCF-0831298, and ONR Grant N00014-07-1-0928.

8 Appendix

This section presents the proof of Proposition 1.

Property 6

Integrated form of a PPDF function with respect to a variable is a PPDF function, i.e.,

where and .

(Proof)
The above proposition states that the integrated form of a product of Gaussian PDF functions with respect to a variable is a product of Gaussian PDF functions. We first prove it for a simple case involving two standard Gaussian PDF functions, and then generalize it to an arbitrary number of Gaussians.

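For simplicity, let us first compute the integrated form of a product of two Gaussian PDFs w.r.t. a variable, where the remaining terms of each PDF argument are linear combinations of variables (excluding the integration variable). We make the following two assumptions:
1. The coefficient of the integration variable is 1 in both PDFs.

2. Both PDFs are standard normal distributions (i.e., both have mean 0 and variance 1).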
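Under these two assumptions the integral has a simple closed form that can be checked numerically. The sketch below assumes the two PDF arguments are `v - x1` and `v - x2`, with `x1` and `x2` held at fixed numbers standing in for the linear combinations of the other variables. These names and the grid bounds are illustrative, not the paper's notation. It compares a Riemann-sum integration over `v` with a Gaussian PDF in `x1 - x2` that has mean 0 and variance 2.

```python
import math


def std_normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def integrated_form_numeric(x1, x2, lo=-20.0, hi=20.0, n=4001):
    """Riemann-sum approximation of the integral over v of
    std_normal_pdf(v - x1) * std_normal_pdf(v - x2).

    The grid bounds are an assumption that suits moderate x1 and x2.
    """
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        v = lo + i * step
        total += std_normal_pdf(v - x1) * std_normal_pdf(v - x2)
    return total * step


def integrated_form_closed(x1, x2):
    """Gaussian density in (x1 - x2) with mean 0 and variance 2."""
    d = x1 - x2
    return math.exp(-d * d / 4.0) / math.sqrt(4.0 * math.pi)


if __name__ == "__main__":
    print(integrated_form_numeric(0.7, -1.3))
    print(integrated_form_closed(0.7, -1.3))
```

The agreement of the two values illustrates the claim for this special case: integrating out the shared variable leaves a Gaussian PDF over a linear combination of the remaining variables.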

Let denote the integrated form, i.e.,