Probabilistic Logic Programming (PLP) is a class of Statistical Relational Learning (SRL) frameworks [srlbook] that combine statistical and logical knowledge representation and inference. PLP languages, such as SLP [Muggleton], ICL [PooleICL], PRISM [sato-kameya-prism], ProbLog [deRaedt] and LPAD [lpad],
extend traditional logic programming languages by implicitly or explicitly attaching random variables to certain clauses in a logic program. A large class of common statistical models, such as Bayesian networks, Hidden Markov Models and Probabilistic Context-Free Grammars, have been effectively encoded in PLP; the programming aspect of PLP has also been exploited to succinctly specify complex models, such as for discovering links in biological networks [deRaedt]. Parameter learning in these languages is typically done by variants of the EM algorithm [DempsterEM].
Operationally, combined statistical/logical inference in PLP is based on proof structures similar to those created by pure logical inference. As a result, these languages have limited support for models with continuous random variables. Recently, we extended PRISM [sato-kameya-prism]
with Gaussian and Gamma-distributed random variables, and linear equality constraints (http://arxiv.org/abs/1112.2681); relevant technical aspects of this extension are summarized in this paper to make it self-contained. This extension permits the encoding of complex statistical models including Kalman filters and a large class of Hybrid Bayesian Networks.
In this paper, we present an algorithm for parameter learning in PRISM extended with Gaussian random variables. The key aspect of this algorithm is the construction of symbolic derivations
that succinctly represent large (sometimes infinite) sets of traditional logical derivations. Our learning algorithm represents and computes Expected Sufficient Statistics (ESS) symbolically as well, for Gaussian as well as discrete random variables. Although our technical development is limited to PRISM, the core algorithm can be adapted to parameter learning in (extended versions of) other PLP languages as well.
SRL frameworks can be broadly classified as statistical-model-based or logic-based, depending on how their semantics is defined. In the first category are languages such as Bayesian Logic Programs (BLPs) [hblp], Probabilistic Relational Models (PRMs) [hprm], and Markov Logic Networks (MLNs) [mln], where logical relations are used to specify a model compactly. Although originally defined over discrete random variables, these languages have been extended (e.g., Continuous BLP [hblp], Hybrid PRM [hprm], and Hybrid MLN [hmln]) to support continuous random variables as well. Techniques for parameter learning in statistical-model-based languages are adapted from the corresponding techniques in the underlying statistical models. For example, discriminative learning techniques are used for parameter learning in MLNs [mlnlearningA, mlnlearningB].
Logic-based SRL languages include the PLP languages mentioned earlier. Hybrid ProbLog [hproblog] extends ProbLog by adding continuous probabilistic facts, but restricts their use such that statistical models such as Kalman filters and certain classes of Hybrid Bayesian Networks (those with a continuous child having continuous parents) cannot be encoded. More recently, a sampling-based approach for (approximate) probabilistic inference in a ProbLog-like language was introduced [apprProblog].
Graphical EM [sato] is the parameter learning algorithm used in PRISM. Interestingly, graphical EM reduces to the Baum-Welch algorithm [rabiner] for HMMs encoded in PRISM. A least squares optimization approach to learning distribution parameters in ProbLog was introduced in [probloglearningA]. CoPrEM [probloglearningB]
is another algorithm for ProbLog; it computes binary decision diagrams (BDDs) to represent proofs and uses a dynamic programming approach to estimate parameters. BO-EM [satoBDD] is a BDD-based parameter learning algorithm for PRISM. These techniques enumerate derivations (even when represented as BDDs), and do not readily generalize when continuous random variables are introduced.
2 Background: An Overview of PRISM
PRISM programs have Prolog-like syntax (see Example 1). In a PRISM program the msw relation ("multi-valued switch") has a special meaning: msw(X,I,V) says that V is a random variable. More precisely, V is the outcome of the I-th instance from a family X of random processes. (Following PRISM, we often omit the instance number in an msw when a program uses only one instance from a family of random processes.) The outcomes of the distinct instances of a family X are i.i.d., and their distribution is given by the random process X. The msw relation provides the mechanism for using random variables, thereby allowing us to weave together the statistical and logical aspects of a model into a single program. The distribution parameters of the random variables are specified separately.
PRISM programs have declarative semantics, called distribution semantics [sato-kameya-prism, sato]. Operationally, query evaluation in PRISM closely follows that for traditional logic programming, with one modification. When the goal selected at a step is of the form msw(X,I,Y), then Y is bound to a possible outcome of a random process X. The derivation step is associated with the probability of this outcome. If all random processes encountered in a derivation are independent, then the probability of the derivation is the product of the probabilities of its steps. If a set of derivations is pairwise mutually exclusive, the probability of the set is the sum of the probabilities of its derivations. (The evaluation procedure is defined only when the independence and exclusiveness assumptions hold.) Finally, the probability of an answer to a query is computed as the probability of the set of derivations corresponding to that answer.
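As a small illustrative sketch (a hypothetical two-derivation example, not a program from this paper), the computation just described amounts to:

```python
# Sketch: probability of a query answer as a sum over mutually exclusive
# derivations, where each derivation's probability is the product of the
# probabilities of the msw outcomes along it.  The numbers are hypothetical.

derivation_a = [0.3, 0.5]   # e.g. outcomes of msw(m, a), then a second msw
derivation_b = [0.7, 0.2]   # e.g. outcomes of msw(m, b), then a second msw

def derivation_prob(steps):
    """Product of the probabilities of the independent msw steps."""
    p = 1.0
    for s in steps:
        p *= s
    return p

# The two derivations are mutually exclusive, so probabilities add.
answer_prob = derivation_prob(derivation_a) + derivation_prob(derivation_b)
print(answer_prob)  # = 0.3*0.5 + 0.7*0.2
```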
As an illustration, consider the query fmix(X) evaluated over the program in Example 1. One step of resolution derives a goal of the form msw(m, M), msw(w(M), X). Now, depending on the value of M, there are two possible next steps: msw(w(a), X) and msw(w(b), X). Thus in PRISM, derivations are constructed by enumerating the possible outcomes of each random variable.
Example 1 (Finite Mixture Model)
In the following PRISM program, which encodes a finite mixture model [fmm], msw(m, M) chooses one distribution from a finite set of continuous distributions, and msw(w(M), X) samples X from the chosen distribution.
fmix(X) :- msw(m, M), msw(w(M), X).

% Ranges of RVs
values(m, [a,b]).
values(w(M), real).

% PDFs and PMFs
:- set_sw(m, [0.3, 0.7]),
   set_sw(w(a), norm(2.0, 1.0)),
   set_sw(w(b), norm(3.0, 1.0)).
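To make the generative reading concrete, here is a minimal Python sketch of sampling from this program (an illustration, not PRISM's machinery; it assumes the parameters set by the set_sw directives above, reading norm's second argument as a variance):

```python
# Minimal sampler for the finite mixture model of Example 1, assuming
# P(m=a)=0.3, P(m=b)=0.7, w(a) ~ N(2.0, 1.0), w(b) ~ N(3.0, 1.0).
import random

def sample_fmix(rng):
    m = 'a' if rng.random() < 0.3 else 'b'   # msw(m, M)
    mu = 2.0 if m == 'a' else 3.0            # component chosen by M
    return rng.gauss(mu, 1.0)                # msw(w(M), X); std dev = 1.0

rng = random.Random(42)
xs = [sample_fmix(rng) for _ in range(20000)]
mean = sum(xs) / len(xs)
print(mean)  # close to 0.3*2.0 + 0.7*3.0 = 2.7
```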
3 Extended PRISM
Support for continuous variables is added by modifying PRISM's language in two ways. We use the msw relation to sample from discrete as well as continuous distributions. In PRISM, a special relation called values is used to specify the ranges of values of random variables; the probability mass functions are specified using set_sw directives. We extend the set_sw directives to specify probability density functions as well. For instance, set_sw(r, norm(Mu,Var)) specifies that outcomes of random process r have a Gaussian distribution with mean Mu and variance Var. Parameterized families of random processes may be specified, as long as the parameters are discrete-valued. For instance, set_sw(w(M), norm(Mu,Var)) specifies a family of random processes, one for each value of M. As in PRISM, set_sw directives may be specified programmatically; for instance, the distribution parameters of w(M) may be computed as functions of M.
Additionally, we extend PRISM programs with linear equality constraints over reals. Without loss of generality, we assume that constraints are written as linear equalities of the form Y = a_1 X_1 + ... + a_n X_n + b, where Y and the X_i are variables, and the a_i and b are all floating-point constants. The use of constraints enables us to encode Hybrid Bayesian Networks and Kalman Filters as extended PRISM programs. In the following, we use Constr to denote a set (conjunction) of linear equality constraints. We also denote by X a vector of variables and/or values, explicitly specifying the size only when it is not clear from the context. This permits us to write linear equality constraints compactly (e.g., Y = a · X + b).
4 Inference

The key to inference in the presence of continuous random variables is avoiding enumeration by representing the derivations and their attributes symbolically. A single step in the construction of a symbolic derivation is defined below.
Definition 1 (Symbolic Derivation)
A goal G directly derives a goal G', denoted G → G', if one of the following holds:

PCR: G = q1(X_1), G_1, and there exists a clause in the program, q1(Y) :- r1(Y_1), r2(Y_2), ..., rm(Y_m), such that θ = mgu(q1(X_1), q1(Y)); then G' = (r1(Y_1), r2(Y_2), ..., rm(Y_m), G_1)θ;

MSW: G = msw(rv(X), Y), G_1; then G' = G_1;

CONS: G = Constr, G_1, and Constr is satisfiable; then G' = G_1.

A symbolic derivation of G is a sequence of goals G_0, G_1, ... such that G_0 = G and, for all i, G_i → G_{i+1}.
Note that the traditional notion of derivation in a logic program coincides with that of symbolic derivation when the selected subgoal (literal) is not an msw or a constraint. When the selected subgoal is an msw, PRISM’s inference will construct the next step by enumerating the values of the random variable. In contrast, symbolic derivation skips msw’s and constraints and continues with the remaining subgoals in a goal. The effect of these constructs is computed by associating (a) variable type information and (b) a success function (defined below) with each goal in the derivation. The symbolic derivation for the goal fmix(X) over the program in Example 1 is shown in Fig. 1.
Goals in a symbolic derivation may contain variables whose values are determined by msw's appearing subsequently in the derivation. With each goal G_i in a symbolic derivation, we associate a set of variables, V(G_i), that is a subset of the variables in G_i. The set V(G_i) is such that its variables subsequently appear as parameters or outcomes of msw's in some goal G_j, j ≥ i. We can further partition V(G_i) into two disjoint sets, V_c(G_i) and V_d(G_i), representing continuous and discrete variables, respectively.
Given a goal G_i in a symbolic derivation, we can associate with it a success function, which is a function from the set of all valuations of V(G_i) to [0, 1]. Intuitively, the success function represents, for each valuation of V(G_i), the probability that the symbolic derivation represents a successful derivation. Note that the success function computation uses a set of distribution parameters Θ; for simplicity, we omit Θ from the equations except where it is needed for clarity.
Representation of success functions:
Given a set of variables V, let CS denote the set of all linear equality constraints over reals using V, and let L denote the set of all linear functions over V with real coefficients. Let N(x; μ, σ²) be the PDF of a univariate Gaussian distribution with mean μ and variance σ², and let δ(x − l) be the Dirac delta function, which is zero everywhere except at x = l and whose integral over its entire range is 1. Expressions of the form k · Π_i δ(X_i − l_i) · Π_j N(f_j; μ_j, σ_j²), where k is a non-negative real number and the l_i, f_j ∈ L, are called product PDF (PPDF) functions over V. We use φ (possibly subscripted) to denote such functions. A pair (φ, C), where C is a conjunction of constraints from CS, is called a constrained PPDF function. A sum of a finite number of constrained PPDF functions, ψ = Σ_i (D_i, C_i), is called a success function.
We use C_i to denote the constraint component of the i-th constrained PPDF function of a success function ψ, and D_i to denote its PPDF component.
Success functions of base predicates:
The success function of a constraint C is (1, C). The success function of true is (1, true). The PPDF component of the success function of msw(rv(X), Y) is the probability density function of rv's distribution if rv is continuous, and its probability mass function if rv is discrete; its constraint component is true.
Success functions of user-defined predicates:
If G → G' is a step in a derivation, then the success function of G is computed bottom-up from the success function of G'. This computation is done using join and marginalize operations on success functions.
Definition 2 (Join)
Let ψ_1 = Σ_i (D_i, C_i) and ψ_2 = Σ_j (D'_j, C'_j) be two success functions. Then the join of ψ_1 and ψ_2, denoted ψ_1 * ψ_2, is the success function Σ_{i,j} (D_i D'_j, C_i ∧ C'_j).
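A hedged sketch of the join for the purely discrete part of a success function, representing each summand by its coefficient plus the valuations forced by its delta factors (a toy data structure assumed for illustration):

```python
# Toy join of success functions: each success function is a list of terms
# (k, deltas), where k is the coefficient and deltas maps discrete variables
# to the values forced by delta factors.  Terms with conflicting deltas
# vanish, mirroring delta(M-a) * delta(M-b) = 0.

def join(psi1, psi2):
    out = []
    for k1, d1 in psi1:
        for k2, d2 in psi2:
            if any(v in d1 and d1[v] != d2[v] for v in d2):
                continue                    # inconsistent valuation: term is 0
            merged = dict(d1)
            merged.update(d2)
            out.append((k1 * k2, merged))
    return out

psi_m = [(0.3, {'M': 'a'}), (0.7, {'M': 'b'})]   # success fn of msw(m, M)
psi_g = [(0.5, {})]                              # some success fn over M
print(join(psi_m, psi_g))                        # two surviving terms
print(join(psi_m, [(1.0, {'M': 'a'})]))          # only the M = a term survives
```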
Given a success function ψ for a goal G, the success function for ∃X. G is computed by the marginalization operation. Marginalization w.r.t. a discrete variable is straightforward and omitted. Below we define marginalization w.r.t. continuous variables in two steps: first rewriting the success function in a projected form, and then doing the required integration.
Projection eliminates any linear constraints on V, where V is the continuous variable to marginalize over. The projection operation, denoted ψ↓_V, involves finding a linear constraint on V (i.e., one of the form V = a · X + b) and replacing all occurrences of V in the success function by a · X + b.
Integration of a PPDF function with respect to a variable V again yields a PPDF function; i.e., ∫ φ dV = φ', where φ' is a PPDF function over the remaining variables whose coefficient and Gaussian parameters are determined by those of φ.
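One classical instance of this closure property, checked numerically below, is that the integral of a product of two Gaussian PDFs in the same variable collapses to a single Gaussian PDF value (a sketch; the general formula in the extension also handles linear function arguments):

```python
# Numeric check of one case of PPDF closure under integration:
#   ∫ N(x; mu1, v1) N(x; mu2, v2) dx = N(mu1; mu2, v1 + v2).
import math

def npdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu1, v1, mu2, v2 = 2.0, 1.0, 3.0, 1.0

# Crude Riemann sum over a wide interval [-10, 10].
dx = 0.001
total = sum(npdf(i * dx - 10.0, mu1, v1) * npdf(i * dx - 10.0, mu2, v2) * dx
            for i in range(20000))

closed_form = npdf(mu1, mu2, v1 + v2)
print(total, closed_form)  # the two values agree closely
```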
Definition 3 (Integration)
Let ψ = Σ_i (D_i, C_i) be a success function that does not contain any linear constraints on V. Then the integration of ψ with respect to V, denoted ∫_V ψ, is the success function ψ' = Σ_i (∫ D_i dV, C_i).
Definition 4 (Marginalize)
The marginalization of a success function ψ with respect to a variable V, denoted M(ψ, V), is the success function ψ' = ∫_V (ψ↓_V).
We overload M to denote marginalization over a set of variables, defined such that M(ψ, {V} ∪ X) = M(M(ψ, V), X) and M(ψ, {}) = ψ.
The success function for a derivation is defined as follows.
Definition 5 (Success function of a derivation)
Let G → G'. Then the success function of G, denoted ψ_G, is computed from ψ_{G'}, based on the way G' was derived:

PCR: ψ_G = M(ψ_{G'}, V), where V = vars(G') − vars(G);

MSW: let G = msw(rv(X), Y), G_1; then ψ_G = ψ_{msw(rv(X), Y)} * ψ_{G_1};

CONS: let G = Constr, G_1; then ψ_G = ψ_{Constr} * ψ_{G_1}.
Note that the above definition carries PRISM’s assumption that an instance of a random variable occurs at most once in any derivation. In particular, the PCR step marginalizes success functions w.r.t. a set of variables; the valuations of the set of variables must be mutually exclusive for correctness of this step. The MSW step joins success functions; the goals joined must use independent random variables for the join operation to correctly compute success functions in this step.
The success function of the goal msw(m, M), msw(w(M), X) is ψ_{msw(m, M)} * ψ_{msw(w(M), X)}, which yields 0.3 N(X; 2.0, 1.0) for M = a and 0.7 N(X; 3.0, 1.0) for M = b. Note that the cross terms vanish, as M cannot be both a and b at the same time.
Finally, marginalizing over M gives ψ_{fmix(X)}, which is 0.3 N(X; 2.0, 1.0) + 0.7 N(X; 3.0, 1.0). Note that ψ_{fmix(X)} represents the mixture distribution [fmm] of two Gaussian distributions: here the mixing proportions are 0.3 and 0.7, and the component distributions are N(2.0, 1.0) and N(3.0, 1.0).
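A quick numeric sketch of this success function, assuming the parameters of Example 1, evaluates the mixture density directly:

```python
# The success function of fmix(X) from Example 1, evaluated numerically:
#   psi(X) = 0.3 * N(X; 2.0, 1.0) + 0.7 * N(X; 3.0, 1.0).
import math

def npdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def psi_fmix(x):
    return 0.3 * npdf(x, 2.0, 1.0) + 0.7 * npdf(x, 3.0, 1.0)

print(psi_fmix(2.5))  # density of the two-component mixture at X = 2.5
```

Since ψ_{fmix(X)} is a genuine density, its integral over the real line is 1, which is easy to verify numerically.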
Note that for a program with only discrete random variables, there may be exponentially fewer symbolic derivations than concrete derivations à la PRISM. The compactness is only in the number of derivations, not in the total size of the representations. In fact, for programs with only discrete random variables, there is a one-to-one correspondence between the entries in the tabular representation of success functions and PRISM's answer tables. For such programs, it is easy to show that the time complexity of the inference presented in this paper is the same as that of PRISM.
5 Parameter Learning

We use the expectation-maximization algorithm [DempsterEM] to learn the distribution parameters from data. First we show how to compute the expected sufficient statistics (ESS) of the random variables, and then describe our algorithm.
The ESS of a discrete random variable is an n-tuple, where n is the number of values that the discrete variable takes. Suppose that a discrete random variable v takes values v_1, ..., v_n. Then the ESS of v is (E_1, ..., E_n), where E_i is the expected number of times variable v had valuation v_i in all possible proofs for a goal. The ESS of a Gaussian random variable is a triple (S_1, S_2, N), where the components denote the expected sum, the expected sum of squares, and the expected number of uses of the random variable, respectively, in all possible proofs of a goal. When derivations are enumerated, the ESS for each random variable can be represented by a tuple of reals. To accommodate symbolic derivations, we lift each component of the ESS to a function, represented as described below.
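For the enumerated case, the Gaussian triple can be computed directly; a small sketch (hypothetical weights and values, with the weights playing the role of normalized proof probabilities):

```python
# Sketch of the ESS of a Gaussian random variable when derivations are
# enumerated: each weighted use contributes to the expected sum, the
# expected sum of squares, and the expected count.

def gaussian_ess(weighted_uses):
    """weighted_uses: list of (weight, sampled_value) pairs."""
    s = sum(w * x for w, x in weighted_uses)        # expected sum
    s2 = sum(w * x * x for w, x in weighted_uses)   # expected sum of squares
    n = sum(w for w, _ in weighted_uses)            # expected count
    return (s, s2, n)

ess = gaussian_ess([(0.5, 2.0), (1.0, 3.0)])
print(ess)  # (4.0, 11.0, 1.5)
```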
Representation of ESS functions:
For each component (discrete variable valuation, mean, variance, total counts) of a random variable, its ESS function in a goal is represented as a finite sum Σ_i (D_i, C_i), where each (D_i, C_i) is a constrained PPDF function whose PPDF part may additionally carry a polynomial factor of degree at most two in the continuous variables, with constant coefficients; the linear and quadratic factors accumulate the expected sums and expected sums of squares, respectively.
Note that the representation of ESS functions is the same as that of success functions for discrete random variable valuations and total counts. The Join and Marginalize operations, defined earlier for success functions, can be readily extended to ESS functions. The computation of ESS functions for a goal, based on its symbolic derivation, uses these extended join and marginalize operations. The set of all ESS functions is closed under the extended Join and Marginalize operations.
ESS functions of base predicates:
The ESS function of the parameter v_i of a discrete random variable v w.r.t. msw(v, Y) is its success function restricted to the valuation Y = v_i. The ESS function of the mean of a continuous random variable rv w.r.t. msw(rv, X) is X · ψ_{msw(rv, X)}; the ESS function of the variance of rv is X² · ψ_{msw(rv, X)}; and the ESS function of the total count of rv is ψ_{msw(rv, X)} itself.
In this example, we compute the ESS functions of the random variables (m, w(a), and w(b)) in Example 1. The definition of ESS functions of base predicates directly gives the ESS functions of these random variables for the goals of the symbolic derivation in Fig. 1.
ESS functions of user-defined predicates:
If G → G' is a step in a derivation, then the ESS function of a random variable for G is computed bottom-up from its ESS function for G'.
The ESS function of a random variable component in a derivation is defined as follows.
Definition 6 (ESS functions in a derivation)
Let G → G'. Then the ESS function χ_G of a random variable component in the goal G is computed from χ_{G'}, based on the way G' was derived:

PCR: χ_G = M(χ_{G'}, V), where V = vars(G') − vars(G);

MSW: let G = msw(rv(X), Y), G_1; then χ_G = χ_{msw(rv(X), Y)} * ψ_{G_1} + ψ_{msw(rv(X), Y)} * χ_{G_1};

CONS: let G = Constr, G_1; then χ_G = ψ_{Constr} * χ_{G_1}.
Using the definition of ESS functions in a derivation involving an MSW step, we compute the ESS functions of the random variables in the goals of Fig. 1.
[Table: ESS functions for the goal]
Notice how the success function of the remaining goal enters the ESS computation at each MSW step.
Finally, for goal fmix(X) we marginalize the ESS functions w.r.t. M.
[Table: ESS functions for goal fmix(X)]
The algorithm for learning the distribution parameters (Θ) uses a fixed set of training examples (t_1, ..., t_T). Note that the success and ESS functions for the t_i are constants, as the training examples are variable-free (i.e., all the variables get marginalized over).
Algorithm 1 (Expectation-Maximization)
Initialize the distribution parameters Θ.

1. Construct the symbolic derivations for the training examples t_1, ..., t_T using the current Θ.

2. E-step: For each training example t_i, compute the ESS (χ) of the random variables and the success probability ψ_{t_i} w.r.t. Θ.

3. M-step: Compute the MLE of the distribution parameters given the ESS and success probabilities; the result Θ' contains the updated distribution parameters. More specifically, for a discrete random variable v, the parameter for its value v_i is updated as

    p_{v_i} = (Σ_t χ^{v_i}_{t} / ψ_{t}) / (Σ_j Σ_t χ^{v_j}_{t} / ψ_{t}).

For each continuous random variable rv, its mean and variance are updated as

    μ = (Σ_t χ^{rv,mean}_{t} / ψ_{t}) / N   and   σ² = (Σ_t χ^{rv,var}_{t} / ψ_{t}) / N − μ²,

where N = Σ_t χ^{rv,count}_{t} / ψ_{t} is the expected total count of rv.

4. Evaluate the log likelihood Σ_t ln ψ_{t} and check for convergence; otherwise let Θ = Θ' and return to step 1.
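The continuous-variable portion of the M-step is just the moment computation below (a sketch with illustrative ESS numbers):

```python
# Sketch of the M-step update for one continuous random variable, from its
# accumulated ESS triple: expected sum S1, expected sum of squares S2, and
# expected total count N.  Then mu = S1/N and var = S2/N - mu^2.

def mstep_gaussian(S1, S2, N):
    mu = S1 / N
    var = S2 / N - mu * mu
    return mu, var

print(mstep_gaussian(27.0, 85.0, 10.0))  # approximately (2.7, 1.21)
```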
Algorithm 1 correctly computes the MLE which (locally) maximizes the likelihood.
Proof (Sketch). The main routine of Algorithm 1 for the discrete case is the same as the learn-naive algorithm of [sato], except for the computation of the ESS. In learn-naive, the ESS of a discrete parameter is computed as Σ_E P(E) η(E), where E ranges over the explanations for the goal and η(E) is the number of times the corresponding msw outcome occurs in E.
We show that our symbolic computation of χ yields the same quantity. Suppose first that the goal has a single explanation E, where E is a conjunction of subgoals (i.e., E = s_1 ∧ ... ∧ s_k). We then need to show that χ for the goal equals P(E) η(E). We prove this by induction on the length of E. The definition of χ for base predicates gives the desired result for explanations of length 1. Assume that the claim holds for explanations of length k. For length k + 1, the claim follows by splitting off one subgoal and applying the induction hypothesis; the last step follows from the definition of χ in a derivation. Now, based on the exclusiveness assumption, for a disjunction of multiple explanations E_1 ∨ E_2 ∨ ..., it trivially follows that the ESS is the sum of the per-explanation contributions.
Let x_1, ..., x_T be the observations. For a given training example fmix(x_t), the ESS functions of m, w(a), and w(b) are constants, obtained by instantiating X to x_t in the ESS functions for fmix(X).

[Table: ESS functions for goal fmix(x_t)]
The E-step of the EM algorithm involves computation of the above ESS functions.
In the M-step, we update the model parameters from the computed ESS functions.
This example illustrates that, for the mixture model, our ESS computation performs the same computation as the standard EM learning algorithm for mixture models [bishop].
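For reference, a compact sketch of that standard mixture EM, written in the sufficient-statistics style used above (the data, initialization, and well-separated means 2.0 and 5.0 are illustrative assumptions, not Example 1's parameters):

```python
# Sketch of EM for a two-component Gaussian mixture, accumulating per-
# component ESS triples (sum, sum of squares, count) in the E-step and
# re-estimating (weight, mean, variance) in the M-step.
import math
import random

def npdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em(xs, iters=50):
    # illustrative initialization: means at the data extremes, unit variance
    params = [(0.5, min(xs), 1.0), (0.5, max(xs), 1.0)]
    for _ in range(iters):
        ess = [[0.0, 0.0, 0.0] for _ in params]      # [sum, sum_sq, count]
        for x in xs:
            ws = [p * npdf(x, mu, v) for p, mu, v in params]
            z = sum(ws)
            for k, w in enumerate(ws):               # E-step: posteriors
                r = w / z
                ess[k][0] += r * x
                ess[k][1] += r * x * x
                ess[k][2] += r
        # M-step: each component re-estimated from its ESS triple
        params = [(n / len(xs), s / n, s2 / n - (s / n) ** 2)
                  for s, s2, n in ess]
    return params

rng = random.Random(7)
xs = [rng.gauss(2.0, 1.0) if rng.random() < 0.3 else rng.gauss(5.0, 1.0)
      for _ in range(3000)]
for p, mu, v in em(xs):
    print(round(p, 2), round(mu, 2), round(v, 2))
```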
6 Discussion and Concluding Remarks
The symbolic inference and learning procedures enable us to reason over a large class of statistical models, such as hybrid Bayesian networks with discrete child-discrete parent, continuous child-discrete parent (finite mixture model), and continuous child-continuous parent (Kalman filter) dependencies, which was hitherto not possible in PLP frameworks. They can also be used for hybrid models, e.g., models that mix discrete and Gaussian distributions. For instance, consider the mixture model example (Example 1) where w(a) is Gaussian but w(b) is a discrete distribution over a finite set of values, each with equal probability. The density of the resulting mixture is then a weighted sum of a Gaussian PDF and discrete point masses.
Thus the language can be used to model problems that lie outside traditional hybrid Bayesian networks.
ProbLog and LPAD do not impose PRISM’s mutual exclusion and independence restrictions. Their inference technique first materializes the set of explanations for each query, and represents this set as a BDD, where each node in the BDD is a (discrete) random variable. Distinct paths in the BDD are mutually exclusive and variables in a single path are all independent. Probabilities of query answers are computed trivially based on this BDD representation. The technical development in this paper is limited to PRISM and imposes its restrictions. However, by materializing the set of symbolic derivations first, representing them in a factored form (such as a BDD) and then computing success functions on this representation, we can readily lift the restrictions for the parameter learning technique.
This paper considered only univariate Gaussian distributions. Traditional parameter learning techniques have been described for multivariate distributions without introducing additional machinery; extending our learning algorithm to the multivariate case is a topic of future work.