# Discrete-Continuous Mixtures in Probabilistic Programming: Generalized Semantics and Inference Algorithms

Despite the recent successes of probabilistic programming languages (PPLs) in AI applications, PPLs offer only limited support for random variables whose distributions combine discrete and continuous elements. We develop the notion of measure-theoretic Bayesian networks (MTBNs) and use it to provide more general semantics for PPLs with arbitrarily many random variables defined over arbitrary measure spaces. We develop two new general sampling algorithms that are provably correct under the MTBN framework: the lexicographic likelihood weighting (LLW) for general MTBNs and the lexicographic particle filter (LPF), a specialized algorithm for state-space models. We further integrate MTBNs into a widely used PPL system, BLOG, and verify the effectiveness of the new inference algorithms through representative examples.

06/06/2018 ∙ by Yi Wu, et al.

## 1 Introduction

As originally defined by Pearl (1988), Bayesian networks express joint distributions over finite sets of random variables as products of conditional distributions. Probabilistic programming languages (PPLs) (Koller et al., 1997; Milch et al., 2005a; Goodman et al., 2008; Wood et al., 2014b) apply the same idea to potentially infinite sets of variables with general dependency structures. Thanks to their expressive power, PPLs have been applied to many real-world problems, including Captcha (Le et al., 2017), seismic monitoring (Arora et al., 2013), 3D pose estimation (Kulkarni et al., 2015), generating design suggestions (Ritchie et al., 2015), concept learning (Lake et al., 2015), and cognitive science applications (Stuhlmüller & Goodman, 2014).

In practical applications, we often have to deal with a mixture of continuous and discrete random variables. Existing PPLs support both discrete and continuous random variables, but not discrete-continuous mixtures, i.e., variables whose distributions combine discrete and continuous elements. Such variables are fairly common in practice: sensors that have thresholded limits, e.g., thermometers, weighing scales, speedometers, and pressure gauges; or a hybrid sensor that can report either a real value or an error condition. The occurrence of such variables has been noted in many other applications from a wide range of scientific domains (Kharchenko et al., 2014; Pierson & Yau, 2015; Gao et al., 2017).

Many PPLs have a restricted syntax that forces the expressed random variables to be either discrete or continuous, including WebPPL (Goodman & Stuhlmüller, 2014), Edward (Tran et al., 2016), Figaro (Pfeffer, 2009) and Stan (Carpenter et al., 2016). Even for PPLs whose syntax allows for mixtures of discrete and continuous variables, such as BLOG (Milch et al., 2005a), Church (Goodman, 2013), Venture (Mansinghka et al., 2014) and Anglican (Wood et al., 2014a), the underlying semantics of these PPLs implicitly assumes the random variables are not mixtures. Moreover, the inference algorithms associated with the semantics inherit the same assumption and can produce incorrect results when discrete-continuous mixtures are used.

Consider the following GPA example: a two-variable Bayes net where the nationality follows a binary distribution

$$P(\textit{Nationality} = \textit{USA}) = P(\textit{Nationality} = \textit{India}) = 0.5$$

and the conditional probabilities are discrete-continuous mixtures

$$\begin{aligned}
\textit{GPA} \mid \textit{Nationality} = \textit{USA} &\sim 0.01 \cdot \mathbb{1}\{\textit{GPA} = 4\} + 0.99 \cdot \mathrm{Unif}(0, 4), \\
\textit{GPA} \mid \textit{Nationality} = \textit{India} &\sim 0.01 \cdot \mathbb{1}\{\textit{GPA} = 10\} + 0.99 \cdot \mathrm{Unif}(0, 10).
\end{aligned}$$

This is a typical scenario in practice because many top students have perfect GPAs. Now suppose we observe a student with a GPA of 4.0. Where do they come from? If the student is Indian, the probability of any singleton set $\{g\}$ where $0 < g < 10$ is zero, as this range has a probability density. On the other hand, if the student is American, the set $\{4\}$ has probability $0.01$. Thus, by Bayes' theorem, $P(\textit{Nationality} = \textit{USA} \mid \textit{GPA} = 4) = 1$, which means the student must be from the USA.

However, if we run the default Bayesian inference algorithm for this problem in PPLs, e.g., the standard importance sampling algorithm (Milch et al., 2005b), a sample that picks India receives a density weight of $0.99/10 = 0.099$, whereas one that picks USA receives a discrete-mass weight of $0.01$. Since the algorithm does not distinguish probability density from probability mass, it will conclude that the student is very probably from India, which is far from the truth.
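To make the failure concrete, the following Python sketch runs naive likelihood weighting on the GPA model. It assumes the weights implied by the mixture definitions above: a point mass of 0.01 at the top GPA and a uniform density of 0.99/4 (USA) or 0.99/10 (India) below it. The function names are illustrative and are not taken from any PPL implementation.

```python
import random

# GPA model from the text:
#   P(USA) = P(India) = 0.5
#   GPA | USA   ~ 0.01 * [GPA = 4]  + 0.99 * Unif(0, 4)
#   GPA | India ~ 0.01 * [GPA = 10] + 0.99 * Unif(0, 10)

def naive_weight(nationality, gpa):
    """Likelihood weight that treats point masses and densities as the same kind of number."""
    top = 4.0 if nationality == "USA" else 10.0
    if gpa == top:
        return 0.01          # discrete mass at the perfect GPA
    if 0.0 < gpa < top:
        return 0.99 / top    # uniform density on (0, top)
    return 0.0

def naive_posterior_usa(gpa, num_samples=100_000, seed=0):
    """Estimate P(USA | GPA = gpa) with naive likelihood weighting."""
    rng = random.Random(seed)
    total = 0.0
    usa = 0.0
    for _ in range(num_samples):
        nationality = "USA" if rng.random() < 0.5 else "India"
        w = naive_weight(nationality, gpa)
        total += w
        if nationality == "USA":
            usa += w
    return usa / total

estimate = naive_posterior_usa(4.0)
print(round(estimate, 3))
```

Under these assumptions the naive estimate of $P(\textit{USA} \mid \textit{GPA} = 4)$ settles near $0.01 / (0.01 + 0.099) \approx 0.09$, whereas the correct posterior is 1.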

We can fix the GPA example by considering a density weight to be infinitely smaller than a discrete-mass weight (Nitti et al., 2016; Tolpin et al., 2016). However, the situation becomes more complicated when more than one evidence variable is involved, e.g., GPAs over multiple semesters for students who may study in both countries. Vector-valued variables also cause problems: does a point mass in three dimensions count more or less than a point mass in two dimensions? These practical issues motivate the following two tasks:

• Inherit all the existing properties of PPL semantics and extend it to handle random variables with mixed discrete and continuous distributions;

• Design provably correct inference algorithms for the extended semantics.

In this paper, we carry out both tasks and implement the extended semantics as well as the new algorithms in a widely used PPL, Bayesian Logic (BLOG) (Milch et al., 2005a).

### 1.1 Main Contributions

##### Measure-Theoretic Bayesian Nets (MTBNs)

Measure theory can be applied to handle discrete-continuous mixtures or even more abstract measures. In this paper, we define a generalization of Bayesian networks called measure-theoretic Bayesian networks (MTBNs) and prove that every MTBN represents a unique measure on the input space. We then show how MTBNs can provide a more general semantic foundation for PPLs.

More concretely, MTBNs support (1) random variables with infinitely (even uncountably) many parents, (2) random variables valued in arbitrary measure spaces (with $\mathbb{R}^N$ as one case) distributed according to any measure (including discrete, continuous and mixed), (3) establishment of conditional independencies implied by an infinite graph, and (4) open-universe semantics in terms of the possible worlds in the vocabulary of the model.

##### Inference Algorithms

We propose a provably correct inference algorithm, lexicographic likelihood weighting (LLW), for general MTBNs with discrete-continuous mixtures. In addition, we propose LPF, a particle-filtering variant of LLW for sequential Monte Carlo (SMC) inference on state-space models.

##### Incorporating MTBNs into an existing PPL

We incorporate MTBNs into BLOG with simple modifications and then define the generalized BLOG language, measure-theoretic BLOG, which formally supports arbitrary distributions, including discrete-continuous mixtures. We prove that every generalized BLOG model corresponds to a unique MTBN. Thus, all the desired theoretical properties of MTBNs can be carried to measure-theoretic BLOG. We also implement the LLW and LPF algorithms in the backend of measure-theoretic BLOG and use three representative examples to show their effectiveness.

### 1.2 Organization

This paper is organized as follows. We first discuss related work in Section 2. In Section 3, we formally define measure-theoretic Bayesian nets and study their theoretical properties. Section 4 describes the LLW and LPF inference algorithms for MTBNs with discrete-continuous mixtures and establishes their correctness. In Section 5, we introduce the measure-theoretic extension of BLOG and study its theoretical foundations for defining probabilistic models. In Section 6, we empirically validate the generalized BLOG system and the new inference algorithms on three representative examples.

## 2 Related Work

The motivating GPA example has also been discussed as a special case in other PPL systems (Tolpin et al., 2016; Nitti et al., 2016). Tolpin et al. (2016) and Nitti et al. (2016) proposed different solutions specific to this example but did not address the general problems of representation and inference for random variables with mixtures of discrete and continuous distributions. In contrast, we present a general formulation with provably correct inference algorithms.

Our approach builds upon the foundations of the BLOG probabilistic programming language (Milch, 2006). We use a measure-theoretic formulation to generalize the syntax and semantics of BLOG to random variables that may have infinitely many parents and mixed continuous and discrete distributions. The BLP framework (Kersting & De Raedt, 2007) unifies logic programming with probability models, but requires each random variable to be influenced by a finite set of random variables in order to define the semantics. This amounts to requiring only finitely many ancestors of each random variable.

Choi et al. (2010) present an algorithm for carrying out lifted inference over models with purely continuous random variables. They also require parfactors to be functions over finitely many random variables, thus limiting the set of influencing variables for each node to be finite. Gutmann et al. (2011a) also define densities over finite-dimensional vectors. In a relatively more general formulation, Gutmann et al. (2011b) define the distribution of each random variable using a definite clause, which corresponds to the limitation that each random variable (either discrete or continuous) has finitely many parents. Frameworks building on Markov networks have similar restrictions. Wang & Domingos (2008) only consider networks of finitely many random variables, which can have either discrete or continuous distributions. Singla & Domingos (2007) extend Markov logic to infinite (non-hybrid) domains, provided that each random variable has only finitely many influencing random variables.

In contrast, our approach not only allows models with arbitrarily many random variables with mixed discrete and continuous distributions, but each random variable can also have arbitrarily many parents as long as all ancestor chains are finite (but unbounded). The presented work constitutes a rigorous framework for expressing probability models with the broadest range of cardinalities (uncountably infinite parent sets) and nature of random variables (discrete, mixed, and even arbitrary measure spaces), with clear semantics in terms of first-order possible worlds and the generalization of conditional independences on such models.

Lastly, other works use measure-theoretic approaches to analyze the semantic properties of probabilistic programs but with different emphases, such as commutativity (Staton, 2017), design choices for monad structures (Ramsey, 2016) and computing a disintegration (Shan & Ramsey, 2017).

## 3 Measure-Theoretic Bayesian Networks

In this section, we introduce measure-theoretic Bayesian networks (MTBNs) and prove that an MTBN represents a unique measure with desired theoretical properties. We assume familiarity with measure-theoretic approaches to probability theory. Some background is included in Appx. A.

We begin with some necessary definitions of graph theory.

###### Definition 3.1.

A digraph $G = (V, E)$ is a pair of a set of vertices $V$, of any cardinality, and a set of directed edges $E \subseteq V \times V$. The notation $u \to v$ denotes $(u, v) \in E$, and $u \to^{*} v$ denotes the existence of a path from $u$ to $v$ in $G$.

###### Definition 3.2.

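A vertex $v \in V$ is a root vertex if there are no incoming edges to it, i.e., there is no $u \in V$ such that $u \to v$. Let $\mathrm{pa}(v) = \{u \in V : u \to v\}$ denote the set of parents of a vertex $v$, and $\mathrm{nd}(v) = \{u \in V : \text{not } v \to^{*} u\}$ denote its set of non-descendants.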

###### Definition 3.3.

A well-founded digraph is one with no countably infinite ancestor chain $v_0 \leftarrow v_1 \leftarrow v_2 \leftarrow \cdots$.

This is the natural generalization of a finite directed acyclic graph to the infinite case. Now we are ready to give the key definition of this paper.

###### Definition 3.4.

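A measure-theoretic Bayesian network $M = (V, E, \{\mathcal{X}_v\}_{v \in V}, \{K_v\}_{v \in V})$ consists of (a) a well-founded digraph $(V, E)$ of any cardinality, (b) an arbitrary measurable space $\mathcal{X}_v$ for each $v \in V$, and (c) a probability kernel $K_v$ from $\prod_{u \in \mathrm{pa}(v)} \mathcal{X}_u$ to $\mathcal{X}_v$ for each $v \in V$.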

By definition, MTBNs allow us to define very general and abstract models with the following two major benefits:

1. We can define random variables with infinitely (even uncountably) many parents because an MTBN is defined on a well-founded digraph.

2. We can define random variables in arbitrary measure spaces (with $\mathbb{R}^N$ as one case) distributed according to any measure (including discrete, continuous and mixed).

Next, we relate MTBNs to probability measures. Fix an MTBN $M = (V, E, \{\mathcal{X}_v\}_{v \in V}, \{K_v\}_{v \in V})$. For $U \subseteq V$ let $\mathcal{X}_U = \prod_{u \in U} \mathcal{X}_u$ be the product measurable space over variables $u \in U$. With this notation, $K_v$ is a kernel from $\mathcal{X}_{\mathrm{pa}(v)}$ to $\mathcal{X}_v$. Whenever $W \subseteq U$ let $\pi^U_W : \mathcal{X}_U \to \mathcal{X}_W$ denote the projection map. Let $\mathcal{X}_V$ be our base measurable space upon which we will consider different probability measures $\mu$. Let $X_v$ for $v \in V$ denote both the underlying set of $\mathcal{X}_v$ and the random variable given by the projection $\pi^V_{\{v\}}$, and $X_U$ for $U \subseteq V$ the underlying space of $\mathcal{X}_U$ and the random variable given by the projection $\pi^V_U$.

###### Definition 3.5.

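An MTBN $M$ represents a measure $\mu$ on $\mathcal{X}_V$, if for all $v \in V$:

• $X_v$ is conditionally independent of its non-descendants $X_{\mathrm{nd}(v)}$ given its parents $X_{\mathrm{pa}(v)}$.

• $K_v(X_{\mathrm{pa}(v)}, A) = P_\mu[X_v \in A \mid X_{\mathrm{pa}(v)}]$ holds almost surely for any $A \in \mathcal{X}_v$, i.e., $K_v$ is a version of the conditional distribution of $X_v$ given its parents.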

Def. 3.5 captures the generalization of the local properties of Bayes networks – conditional independence and conditional distributions defined by parent-child relationships. Here we assume the conditional probability exists and is unique. This is a mild condition because this holds as long as the probability space is regular (Kallenberg, 2002).

The next theorem shows that MTBNs are well-defined.

###### Theorem 3.6.

An MTBN $M$ represents a unique measure $\mu$ on $\mathcal{X}_V$.

The proof of Theorem 3.6 requires several intermediate results and is presented in Appx. B. The proof proceeds by first defining a projective family of measures. This gives a way to recursively construct our measure $\mu$. We then define a notion of consistency such that every consistent projective family constructs a measure that $M$ represents. Lastly, we give an explicit characterization of the unique consistent projective family, and thus of the unique measure that $M$ represents.

## 4 Generalized Inference Algorithms

We introduce the lexicographic likelihood weighting (LLW) algorithm for provably correct inference on MTBNs. We also present the lexicographic particle filter (LPF) for state-space models by adapting LLW to the sequential Monte Carlo (SMC) framework.

### 4.1 Lexicographic likelihood weighting

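Suppose we have an MTBN with finitely many random variables $X_1, \ldots, X_N$, and that, without loss of generality, we observe real-valued random variables $X_1, \ldots, X_M$ for $M < N$ as evidence. Suppose the distribution of $X_i$ given its parents $X_{\mathrm{pa}(i)}$ is a mixture between a density $f_i(x_i \mid x_{\mathrm{pa}(i)})$ with respect to the Lebesgue measure and a discrete distribution $F_i(x_i \mid x_{\mathrm{pa}(i)})$, i.e., for any $\epsilon > 0$, we have

$$P\big(X_i \in [x_i - \epsilon, x_i + \epsilon] \mid X_{\mathrm{pa}(i)}\big) = \sum_{x \in [x_i - \epsilon,\, x_i + \epsilon]} F_i(x \mid X_{\mathrm{pa}(i)}) + \int_{x_i - \epsilon}^{x_i + \epsilon} f_i(x \mid X_{\mathrm{pa}(i)})\, dx.$$

This implies that $F_i(x_i \mid x_{\mathrm{pa}(i)})$ is nonzero for at most countably many values $x_i$. If $F_i$ is nonzero for finitely many points, it can be represented by a list of those points and their values.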

Lexicographic likelihood weighting (LLW) extends the classical likelihood weighting algorithm (Milch et al., 2005b) to this setting. It visits each node of the graph in topological order, sampling those variables that are not observed, and accumulating a weight for those that are observed. In particular, at an evidence variable $X_i$ we update a tuple $(d, w)$ of the number of densities and a weight, initially $(0, 1)$, by:

$$(d, w) \leftarrow \begin{cases} \big(d,\; w\,F_i(x_i \mid x_{\mathrm{pa}(i)})\big) & \text{if } F_i(x_i \mid x_{\mathrm{pa}(i)}) > 0, \\ \big(d + 1,\; w\,f_i(x_i \mid x_{\mathrm{pa}(i)})\big) & \text{otherwise.} \end{cases} \tag{1}$$

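Finally, having $K$ samples $x^{(1)}, \ldots, x^{(K)}$ by this process and accordingly a tuple $(d^{(i)}, w^{(i)})$ for each sample $x^{(i)}$, let $d^* = \min_{i : w^{(i)} \neq 0} d^{(i)}$ and estimate $E[f(X) \mid X_{1:M}]$ by

$$\frac{\sum_{\{i : d^{(i)} = d^*\}} w^{(i)} f(x^{(i)})}{\sum_{\{i : d^{(i)} = d^*\}} w^{(i)}}. \tag{2}$$

The algorithm is summarised in Alg. 1. The next theorem shows this procedure is consistent.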

###### Theorem 4.1.

LLW is consistent: (2) converges almost surely to $E[f(X) \mid X_{1:M}]$.
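The update rule (1) and the estimator (2) can be sketched in a few lines of Python for the GPA example from the introduction. This is a minimal illustration rather than the BLOG implementation: the helper names are hypothetical, the model has a single evidence variable, and samples with zero weight are excluded before taking the minimum density count (an assumption of this sketch).

```python
import random

def llw_update(d, w, mass, density):
    """One application of update rule (1) at an evidence variable."""
    if mass > 0.0:
        return d, w * mass          # discrete mass: density count unchanged
    return d + 1, w * density       # density: one more density factor

def gpa_mass_and_density(nationality, gpa):
    """Discrete mass F and density f of the observed GPA given nationality."""
    top = 4.0 if nationality == "USA" else 10.0
    mass = 0.01 if gpa == top else 0.0
    density = 0.99 / top if 0.0 < gpa < top else 0.0
    return mass, density

def llw_posterior_usa(gpa, num_samples=10_000, seed=0):
    """Estimate P(USA | GPA = gpa) using estimator (2) with f = [nationality is USA]."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        nationality = "USA" if rng.random() < 0.5 else "India"
        mass, density = gpa_mass_and_density(nationality, gpa)
        d, w = llw_update(0, 1.0, mass, density)
        samples.append((nationality, d, w))
    live = [s for s in samples if s[2] > 0.0]   # drop zero-weight samples
    if not live:
        raise ValueError("evidence has zero probability under every sample")
    d_star = min(d for _, d, _ in live)
    numerator = sum(w for nat, d, w in live if d == d_star and nat == "USA")
    denominator = sum(w for _, d, w in live if d == d_star)
    return numerator / denominator

print(llw_posterior_usa(4.0))   # only mass-weighted (USA) samples survive
print(llw_posterior_usa(3.0))   # all samples use densities, so both countries contribute
```

With an observed GPA of 4, only samples that picked USA have $d = 0$, so the estimate is exactly 1. With an observed GPA of 3, every sample has $d = 1$ and the procedure reduces to ordinary likelihood weighting.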

In order to prove Theorem 4.1, the main technique we adopt is to use a more restricted algorithm, the Iterative Refinement Likelihood Weighting (IRLW) as a reference.

#### 4.1.1 Iterative refinement likelihood weighting

Suppose we want to approximate the posterior distribution of an $\mathcal{X}$-valued random variable $X$ conditional on a $\mathcal{Y}$-valued random variable $Y$, for arbitrary measure spaces $\mathcal{X}$ and $\mathcal{Y}$. In general, there is no notion of a probability density of $Y$ given $X$ for weighting samples. If, however, we could make a discrete approximation $Y_d$ of $Y$, then we could weight samples by the probability $P[Y_d = y_d \mid X]$. If we increase the accuracy of the approximation with the number of samples, this should converge in the limit. We show this is possible, if we are careful about how we approximate:

###### Definition 4.2.

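An approximation scheme for a measurable space $\mathcal{Y}$ consists of a measurable space $\mathcal{A}$ and measurable approximation functions $\alpha_i : \mathcal{Y} \to \mathcal{A}$ for $i = 1, 2, \ldots$ and $\alpha_i^j : \mathcal{A} \to \mathcal{A}$ for $i < j$ such that $\alpha_i^j \circ \alpha_j = \alpha_i$ and $y$ can be measurably recovered from the subsequence $\alpha_i(y), \alpha_{i+1}(y), \ldots$ for any $i > 0$.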

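When $Y$ is a real-valued variable we will use the approximation scheme $\alpha_n(y) = 2^{-n} \lceil 2^n y \rceil$, where $\lceil r \rceil$ denotes the ceiling of $r$, i.e., the smallest integer no smaller than it. Observe in this case that $P(\alpha_n(Y) = 2^{-n} k) = P(2^{-n}(k - 1) < Y \le 2^{-n} k)$, which we can compute from the CDF of $Y$.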

###### Lemma 4.3.

If $X, Y$ are real-valued random variables with $E|X| < \infty$, then $\lim_{i \to \infty} E[X \mid \alpha_i(Y)] = E[X \mid Y]$.

###### Proof.

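Let $\mathcal{F}_i = \sigma(\alpha_i(Y))$ be the sigma-algebra generated by $\alpha_i(Y)$. Whenever $i \le j$ we have $\alpha_i(Y) = (\alpha_i^j \circ \alpha_j)(Y)$ and so $\mathcal{F}_i \subseteq \mathcal{F}_j$. This means $E[X \mid \mathcal{F}_i]$ is a martingale, so we can use martingale convergence results. In particular, since $E|X| < \infty$,

$$E[X \mid \mathcal{F}_i] \to E[X \mid \mathcal{F}_\infty] \quad \text{a.s. and in } L^1,$$

where $\mathcal{F}_\infty = \sigma\big(\bigcup_i \mathcal{F}_i\big)$ is the sigma-algebra generated by $\bigcup_i \mathcal{F}_i$ (see Theorem 7.23 in (Kallenberg, 2002)).

$Y$ is a measurable function of the sequence $(\alpha_i(Y))_{i \ge 1}$, as required by Def. 4.2, and so $\sigma(Y) \subseteq \mathcal{F}_\infty$. By definition the sequence $(\alpha_i(Y))_{i \ge 1}$ is a measurable function of $Y$, and so $\mathcal{F}_\infty \subseteq \sigma(Y)$, and so $E[X \mid \mathcal{F}_\infty] = E[X \mid Y]$, giving our result. ∎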

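Iterative refinement likelihood weighting (IRLW) samples $x^{(1)}, \ldots, x^{(K)}$ from the prior and evaluates:

$$\frac{\sum_{i=1}^{K} P\big(\alpha_n(Y) \mid X = x^{(i)}\big)\, f(x^{(i)})}{\sum_{i=1}^{K} P\big(\alpha_n(Y) \mid X = x^{(i)}\big)} \tag{3}$$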

Using Lemma 4.3, G.12, and G.13, we can show IRLW is consistent.

###### Theorem 4.4.

IRLW is consistent: (3) converges almost surely to $E[f(X) \mid Y]$.
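As an illustration of how the refinement behaves, the sketch below applies an IRLW-style weight to the GPA example from the introduction. It assumes a dyadic ceiling scheme $\alpha_n(y) = 2^{-n}\lceil 2^n y \rceil$ (one reading of the ceiling-based scheme for real-valued variables) and, to keep the output deterministic, sums over the two nationalities exactly instead of sampling them from the prior. All function names are hypothetical.

```python
import math

def alpha(n, y):
    """Dyadic ceiling approximation: the smallest multiple of 2**-n that is >= y."""
    return math.ceil(y * 2**n) / 2**n

def gpa_cdf(nationality, y):
    """CDF of GPA given nationality: 0.99 * Unif(0, top) plus a 0.01 mass at top."""
    top = 4.0 if nationality == "USA" else 10.0
    if y < 0.0:
        return 0.0
    if y >= top:
        return 1.0
    return 0.99 * y / top

def cell_probability(nationality, n, y):
    """P(alpha_n(GPA) = alpha_n(y) | nationality), computed from the CDF."""
    upper = alpha(n, y)
    lower = upper - 2.0**-n
    return gpa_cdf(nationality, upper) - gpa_cdf(nationality, lower)

def irlw_posterior_usa(n, y=4.0):
    """Ratio (3) with f = [nationality is USA], using exact prior weights of 0.5 each."""
    w_usa = 0.5 * cell_probability("USA", n, y)
    w_india = 0.5 * cell_probability("India", n, y)
    return w_usa / (w_usa + w_india)

for n in (0, 5, 10, 20):
    print(n, irlw_posterior_usa(n))
```

As $n$ grows, the cell around the observation shrinks, the density contribution vanishes at rate $2^{-n}$ while the point mass at 4 stays fixed, and the ratio approaches 1. This is the behaviour that LLW obtains directly by counting densities.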

#### 4.1.2 Proof of Theorem 4.1

Now we are ready to prove Theorem 4.1.

###### Proof of Theorem 4.1.

We prove the theorem for evidence variables that are leaves. It is straightforward to extend the proof to the case where the evidence variables are non-leaf nodes. Let $x$ be a sample produced by the algorithm with number of densities $d$ and weight $w$. With $I_n$ a $2^{-n}$-cube around $x_{1:M}$ we have

$$\lim_{n \to \infty} \frac{P(X_{1:M} \in I_n \mid X_{M+1:N} = x_{M+1:N})}{w\, 2^{-dn}} = 1.$$

Using $I_n$ as an approximation scheme by Def. 4.2, the numerator in the above limit is the weight used by IRLW. But given the above limit, using $w\, 2^{-dn}$ as the weight will give the same result in the limit. Then if we have $K$ samples, in the limit of $n \to \infty$ only those samples $x^{(i)}$ with minimal $d^{(i)}$ will contribute to the estimation, and up to normalization they will contribute weight $w^{(i)}$ to the estimation. ∎

### 4.2 Lexicographic particle filter

We now consider inference in a special class of high-dimensional models known as state-space models, and show how LLW can be adapted to avoid the curse of dimensionality when used with such models. A state-space model (SSM) consists of latent states $\{X_t\}_{0 \le t \le T}$ and the observations $\{Y_t\}_{0 \le t \le T}$ with a special dependency structure where $\mathrm{pa}(Y_t) = X_t$ and $\mathrm{pa}(X_t) = X_{t-1}$ for $0 < t \le T$.

SMC methods (Doucet et al., 2001), also known as particle filters, are a widely used class of methods for inference on SSMs. Given the observed variables $\{Y_t\}_{0 \le t \le T}$, the posterior distribution $P(X_t \mid Y_{0:t})$ is approximated by a set of $K$ particles where each particle $x_t^{(k)}$ represents a sample of $X_t$. Particles are propagated forward through the transition model $P(X_t \mid X_{t-1})$ and resampled at each time step $t$ according to the weight of each particle, which is defined by the likelihood of observation $Y_t$.

In the MTBN setting, the distribution of $Y_t$ given its parent $X_t$ can be a mixture of a density $f_t(y_t \mid x_t)$ and a discrete distribution $F_t(y_t \mid x_t)$. (There can be multiple variables observed at each time step; the notation $Y_t$ denotes $\{Y_{t,i}\}_i$ for conciseness.) Hence, the resampling step in a particle filter should be modified accordingly: following the idea from LLW, when computing the weight of a particle, we enumerate all the observations $y_{t,i}$ at time step $t$ and again update a tuple $(d, w)$, initially $(0, 1)$, by

$$(d, w) \leftarrow \begin{cases} \big(d,\; w\,F_t(y_{t,i} \mid x_t)\big) & \text{if } F_t(y_{t,i} \mid x_t) > 0, \\ \big(d + 1,\; w\,f_t(y_{t,i} \mid x_t)\big) & \text{otherwise.} \end{cases} \tag{4}$$

We discard all particles with a non-minimum $d$ value and then perform the normal resampling step. We call this algorithm the lexicographic particle filter (LPF), which is summarized in Alg. 2.
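The weighting and resampling step described above can be sketched as follows. This is a hedged illustration, not Alg. 2 itself: it assumes each observation at a time step is summarized by a hypothetical `(mass, density)` pair already evaluated for the particle's state, and it ignores zero-weight particles when determining the minimum density count.

```python
import random

def lexicographic_weight(observation_terms):
    """Apply update rule (4) over (mass, density) pairs, one pair per observation."""
    d, w = 0, 1.0
    for mass, density in observation_terms:
        if mass > 0.0:
            w *= mass        # observation explained by a point mass
        else:
            d += 1           # observation explained by a density
            w *= density
    return d, w

def lpf_resample(particles, tuples, rng):
    """Discard particles whose d is not minimal, then resample proportionally to w."""
    live = [(p, d, w) for p, (d, w) in zip(particles, tuples) if w > 0.0]
    if not live:
        raise ValueError("all particles have zero weight")
    d_star = min(d for _, d, _ in live)
    states = [p for p, d, _ in live if d == d_star]
    weights = [w for _, d, w in live if d == d_star]
    # Multinomial resampling back to the original number of particles.
    return rng.choices(states, weights=weights, k=len(particles))

# Usage: particle "a" explains the observation with a point mass (d = 0),
# so it wins over "b" and "c" regardless of their larger density weights.
rng = random.Random(0)
resampled = lpf_resample(["a", "b", "c"], [(0, 0.01), (1, 0.5), (1, 0.9)], rng)
print(resampled)
```

Multinomial resampling via `random.Random.choices` is used here for brevity; any standard resampling scheme could be substituted once the non-minimal particles are removed.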

The following theorem guarantees the correctness of LPF. Its proof follows easily from the analysis of LLW and the classical proof of particle filtering based on importance sampling.

###### Theorem 4.5.

LPF is consistent: the outputs of Alg. 2 converge almost surely to $\{E[f(X_t) \mid Y_{0:t}]\}_{0 \le t \le T}$.

## 5 Generalized Probabilistic Programming Languages

In Section 3 and Section 4 we provided the theoretical foundation of MTBNs and general inference algorithms. This section describes how to incorporate MTBNs into a practical PPL. We focus on a widely used open-universe PPL, BLOG (Milch, 2006). We define the generalized BLOG language, measure-theoretic BLOG, and prove that every well-formed measure-theoretic BLOG model corresponds to a unique MTBN. Note that our approach also applies to other PPLs, since BLOG has been shown to have semantics equivalent to those of other PPLs (Wu et al., 2014; McAllester et al., 2008).

We begin with a brief description of the core syntax of BLOG, with particular emphasis on (1) number statements, the specialized BLOG syntax for expressing models with an infinite number of variables, which are critical for expressing open-universe models, and (2) new syntax for expressing MTBNs, i.e., the Mix distribution. Further description of BLOG's syntax can be found in Li & Russell (2013).

### 5.1 Syntax of measure-theoretic BLOG

Fig. 1 shows a BLOG model with measure-theoretic extensions for a multi-student GPA example. Line 1 declares two types, Applicant and Country. Line 2 defines three distinct countries, New Zealand, India and USA, with the keyword distinct. Lines 3 to 5 define a number statement, which states that the number of US applicants follows a Poisson distribution with a higher mean than those from New Zealand or India. Line 6 defines an origin function, which maps the object being generated to the arguments that were used in the number statement that was responsible for generating it. Here Nationality maps applicants to their nationalities. Lines 7 and 13 define two random variables with the keyword random. Lines 7 to 12 state that the GPA of an applicant is distributed as a mixture of weighted discrete and continuous distributions. For US applicants, the range of values follows a truncated Gaussian with bounds 0 and 4 (line 9). The probability mass outside the range is attributed to the corresponding bounds (line 10). GPA distributions for other countries are specified similarly. Line 13 defines a random applicant, David. Line 15 states that David's GPA is observed to be 4, and line 16 queries whether David is from the USA.

##### Number Statement (lines 3 to 5)

Fig. 2 shows the syntax of a number statement for . In this specification, are origin functions (discussed below); are tuples of arguments drawn from ; are first-order formulas with free variables ; are tuples of expressions over a subset of ; and specify kernels where is the type of the expression .

The arguments provided in a number statement allow one to utilize information about the rest of the model (and possibly other generated objects) while describing the number of objects that should be generated for each type. These assignments can be recovered using the origin functions , each of which is declared as:

 origin Typej  gj(Typei),

where is the type of the argument in the number statement of where was used. The value of the variable used in the number statement that generated , an element of the universe, is given by . Line 6 in Fig. 1 is an example of origin function.

##### Mixture Distribution (lines 9 to 12)

In measure-theoretic BLOG, we introduce a new distribution, the mixture distribution (e.g., lines 9-10 in Fig. 1). A mixture distribution is specified as:

 Mix({c1(¯e1)→w1(¯e′),…,ck(¯ek)→wk(¯e′)}),

where the $c_i$ are arbitrary distributions, and the $w_i$ are arbitrary real-valued functions that sum to 1 for every possible assignment to their arguments: $\sum_{i=1}^{k} w_i(\bar{e}') = 1$. Note that in our implementation of measure-theoretic BLOG, we only allow a Mix distribution to express a mixture of densities and masses in order to simplify the system design, although it is still possible to express the same semantics without Mix.
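The idea behind Mix can be illustrated outside BLOG with a small Python sketch. The classes below are hypothetical and simplified: the weights are constants rather than functions of model expressions, and each component reports separately its discrete mass and its density at a point, which is the information a lexicographic weight update needs.

```python
import random

class PointMass:
    """Discrete component: all probability on a single value."""
    def __init__(self, value):
        self.value = value
    def sample(self, rng):
        return self.value
    def mass(self, x):
        return 1.0 if x == self.value else 0.0
    def density(self, x):
        return 0.0

class Uniform:
    """Continuous component: uniform density on the open interval (lo, hi)."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def sample(self, rng):
        return rng.uniform(self.lo, self.hi)
    def mass(self, x):
        return 0.0
    def density(self, x):
        return 1.0 / (self.hi - self.lo) if self.lo < x < self.hi else 0.0

class Mix:
    """Weighted mixture of components; the weights must sum to 1."""
    def __init__(self, components):
        if abs(sum(w for _, w in components) - 1.0) > 1e-9:
            raise ValueError("mixture weights must sum to 1")
        self.components = components
    def sample(self, rng):
        dists = [c for c, _ in self.components]
        weights = [w for _, w in self.components]
        return rng.choices(dists, weights=weights, k=1)[0].sample(rng)
    def mass(self, x):
        return sum(w * c.mass(x) for c, w in self.components)
    def density(self, x):
        return sum(w * c.density(x) for c, w in self.components)

# The USA GPA distribution from the introduction: 0.01 mass at 4, 0.99 * Unif(0, 4).
gpa_usa = Mix([(PointMass(4.0), 0.01), (Uniform(0.0, 4.0), 0.99)])
print(gpa_usa.mass(4.0), gpa_usa.density(2.0))
```

Keeping `mass` and `density` as separate queries, instead of a single likelihood function, is the design choice that lets an inference algorithm tell a point mass from a density at the observed value.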

### 5.2 Semantics of measure-theoretic BLOG

In this section we present the semantics of measure-theoretic BLOG and its theoretical properties. Every BLOG model implicitly defines a first-order vocabulary consisting of the set of functions and types mentioned in the model. BLOG’s semantics are based on the standard, open-universe semantics of first-order logic. We first define the set of all possible elements that may be generated for a BLOG model.

###### Definition 5.1.

The set of possible elements for a BLOG model with types is , where

• , is a distinct constant in

• , where is a number statement of type , is a tuple of elements of the type of from ,

Def. 5.1 allows us to define the set of random variables corresponding to a BLOG model.

###### Definition 5.2.

The set of basic random variables for a BLOG model , , consists of:

• for each number statement , a number variable over the standard measurable space , where is of the type of .

• for each function and tuple from of the type of , a function application variable with the measurable space , where is the measurable space corresponding to , the return type of .

We now define the space of consistent assignments to random variables.

###### Definition 5.3.

An instantiation of the basic RVs defined by a BLOG model is consistent if and only if:

• For every element used in an assignment of the form or , ;

• For every fixed function symbol with the interpretation , ; and

• For every element , generated by the number statement , with origin functions , for every , . That is, origin functions give correct inverse maps.

###### Lemma 5.4.

Every consistent assignment to the basic RVs for defines a unique possible world in the vocabulary of .

The proof of Lemma 5.4 is in Appx. F. In the following definition, we use the notation to denote a substitution of every occurrence of the variable with in the expression . For any BLOG model , let ; for each , is the measurable space corresponding to . Let consist of the following edges for every number statement or function application statement of the form :

• The edge if is a function symbol in such that appears in , and either or an occurrence of in uses quantified variables , is a tuple of elements of the type of and .

• The edge , for element .

Note that the first set of edges defined above may include infinitely many parents for . Let the dependency statement in the BLOG model corresponding to a number or function variable be . Let be the set of expressions used in . Each such statement then defines, in a straightforward manner, a kernel . In order to ensure consistent assignments, we include a special null value for each in , and require that whenever violates the first condition of consistent assignments (Def. 5.3). In other words, all the local kernels are locally consistent: variables involving an object get a non-null assignment only if the assignment to its number statement represents the generation of at least objects (). Each kernel of the form can be transformed into a kernel from its parent vertices (representing basic random variables) by composing the kernels determining the truth value of each expression in terms of the basic random variables, with the kernel . Let .

###### Definition 5.5.

The MTBN for a BLOG model is defined using , the set of measurable spaces and the kernels for each vertex given by .

By Thm. 3.6, we have the main result of this section, which provides the theoretical foundation for the generalized BLOG language:

###### Theorem 5.6.

If the MTBN for a BLOG model is a well-founded digraph, then represents a unique measure on .

## 6 Experiment Results

We implemented the measure-theoretic extension of BLOG and evaluated our inference algorithms on three models where naive algorithms fail: (1) the GPA model (GPA); (2) the noisy scale model (Scale); and (3) an SSM, the aircraft tracking model (Aircraft-Tracking). The implementation is based on BLOG's C++ compiler (Wu et al., 2016).

GPA model: Fig. 1 presents the BLOG code for the GPA example as explained in Sec. 5. Since the GPA of David is exactly 4, Bayes' rule implies that David must be from the USA. We evaluate LLW and the naive LW on this model in Fig. 4(a), where the naive LW converges to an incorrect posterior.

Scale model: In the noisy scale example (Fig. 3), we have an even number of coins and there might be a fake coin among them (Line 4). The fake coin will be slightly heavier than a normal coin (Lines 2-3). We divide the coins into two halves and place them onto a noisy scale. When there is no fake coin, the scale always balances (Line 7). When there is a fake coin, the scale will noisily reflect the weight difference with standard deviation $\sigma$ (sigma in Line 6). Now we observe that the scale is balanced (Line 8) and we would like to infer whether a fake coin exists. We again compare LLW against the naive LW with different choices of the parameter $\sigma$ in Fig. 4(b). Since the scale is precisely balanced, there must not be a fake coin. LLW always produces the correct answer but naive LW converges to different incorrect posteriors for different values of $\sigma$; as $\sigma$ increases, naive LW's result approaches the true posterior.

## 7 Conclusion

We presented a new formalization, measure-theoretic Bayesian networks, for generalizing the semantics of PPLs to include random variables with mixtures of discrete and continuous distributions. We developed provably correct inference algorithms for such random variables and incorporated MTBNs into a widely used PPL, BLOG. We believe that together with the foundational inference algorithms, our proposed rigorous framework will facilitate the development of powerful techniques for probabilistic reasoning in practical applications from a much wider range of scientific areas.

### Acknowledgment

This work is supported by the DARPA PPAML program, contract FA8750-14-C-0011. Simon S. Du is funded by NSF grant IIS1563887, AFRL grant FA8750-17-2-0212 and DARPA D17AP00001.

## References

• Arora et al. (2013) Arora, N. S., Russell, S., and Sudderth, E. NET-VISA: Network processing vertically integrated seismic analysis. Bulletin of the Seismological Society of America, 103(2A):709–729, 2013.
• Carpenter et al. (2016) Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., Riddell, A., et al. Stan: A probabilistic programming language. Journal of Statistical Software, 20(2):1–37, 2016.
• Choi et al. (2010) Choi, J., Amir, E., and Hill, D. J. Lifted inference for relational continuous models. In UAI, volume 10, pp. 126–134, 2010.
• Doucet et al. (2001) Doucet, A., De Freitas, N., and Gordon, N. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice, pp. 3–14. Springer, 2001.
• Durrett (2013) Durrett, R. Probability: Theory and Examples. Cambridge University Press, 2013.
• Gao et al. (2017) Gao, W., Kannan, S., Oh, S., and Viswanath, P. Estimating mutual information for discrete-continuous mixtures. In Advances in Neural Information Processing Systems, pp. 5988–5999, 2017.
• Goodman (2013) Goodman, N. D. The principles and practice of probabilistic programming. In ACM SIGPLAN Notices, volume 48, pp. 399–402. ACM, 2013.
• Goodman & Stuhlmüller (2014) Goodman, N. D. and Stuhlmüller, A. The Design and Implementation of Probabilistic Programming Languages. http://dippl.org, 2014. Accessed: 2018-6-5.
• Goodman et al. (2008) Goodman, N. D., Mansinghka, V. K., Roy, D. M., Bonawitz, K., and Tenenbaum, J. B. Church: A language for generative models. In UAI-08, 2008.
• Gutmann et al. (2011a) Gutmann, B., Jaeger, M., and De Raedt, L. Extending problog with continuous distributions. In Inductive Logic Programming, pp. 76–91. Springer, 2011a.
• Gutmann et al. (2011b) Gutmann, B., Thon, I., Kimmig, A., Bruynooghe, M., and De Raedt, L. The magic of logical inference in probabilistic programming. Theory and Practice of Logic Programming, 11(4-5):663–680, 2011b.
• Jech (2003) Jech, T. Set theory. Springer, 2003.
• Kallenberg (2002) Kallenberg, O. Foundations of Modern Probability. Springer, 2002.
• Kersting & De Raedt (2007) Kersting, K. and De Raedt, L. Bayesian logic programming: Theory and tool. Statistical Relational Learning, pp. 291, 2007.
• Kharchenko et al. (2014) Kharchenko, P. V., Silberstein, L., and Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nature methods, 11(7):740, 2014.
• Koller et al. (1997) Koller, D., McAllester, D., and Pfeffer, A. Effective Bayesian inference for stochastic programs. In AAAI-97, 1997.
• Kulkarni et al. (2015) Kulkarni, T. D., Kohli, P., Tenenbaum, J. B., and Mansinghka, V. Picture: A probabilistic programming language for scene perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4390–4399, 2015.
• Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
• Le et al. (2017) Le, T. A., Baydin, A. G., and Wood, F. Inference compilation and universal probabilistic programming. In Artificial Intelligence and Statistics, pp. 1338–1348, 2017.
• Li & Russell (2013) Li, L. and Russell, S. J. The BLOG language reference. Technical report, Technical Report UCB/EECS-2013-51, EECS Department, University of California, Berkeley, 2013.
• Mansinghka et al. (2014) Mansinghka, V., Selsam, D., and Perov, Y. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099, 2014.
• McAllester et al. (2008) McAllester, D., Milch, B., and Goodman, N. D. Random-world semantics and syntactic independence for expressive languages. Technical report, 2008.
• Milch et al. (2005a) Milch, B., Marthi, B., Russell, S. J., Sontag, D., Ong, D. L., and Kolobov, A. BLOG: Probabilistic models with unknown objects. In Proc. of IJCAI, pp. 1352–1359, 2005a.
• Milch et al. (2005b) Milch, B., Marthi, B., Sontag, D., Russell, S., Ong, D. L., and Kolobov, A. Approximate inference for infinite contingent Bayesian networks. In Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 2005b.
• Milch (2006) Milch, B. C. Probabilistic models with unknown objects. PhD thesis, University of California at Berkeley, Berkeley, CA, USA, 2006.
• Nitti et al. (2016) Nitti, D., De Laet, T., and De Raedt, L. Probabilistic logic programming for hybrid relational domains. Machine Learning, 103(3):407–449, 2016.
• Pearl (1988) Pearl, J. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.
• Pfeffer (2009) Pfeffer, A. Figaro: An object-oriented probabilistic programming language. Charles River Analytics Technical Report, 137:96, 2009.
• Pierson & Yau (2015) Pierson, E. and Yau, C. Zifa: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome biology, 16(1):241, 2015.
• Ramsey (2016) Ramsey, N. All you need is the monad.. what monad was that again. In PPS Workshop, 2016.
• Ritchie et al. (2015) Ritchie, D., Lin, S., Goodman, N. D., and Hanrahan, P. Generating design suggestions under tight constraints with gradient-based probabilistic programming. In Computer Graphics Forum, volume 34, pp. 515–526. Wiley Online Library, 2015.
• Shan & Ramsey (2017) Shan, C.-c. and Ramsey, N. Exact Bayesian inference by symbolic disintegration. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, pp. 130–144. ACM, 2017.
• Singla & Domingos (2007) Singla, P. and Domingos, P. Markov logic in infinite domains. In Proc. UAI-07, 2007.
• Staton (2017) Staton, S. Commutative semantics for probabilistic programming. In European Symposium on Programming, pp. 855–879. Springer, 2017.
• Stuhlmüller & Goodman (2014) Stuhlmüller, A. and Goodman, N. D. Reasoning about reasoning by nested conditioning: Modeling theory of mind with probabilistic programs. Cognitive Systems Research, 28:80–99, 2014.
• Tolpin et al. (2016) Tolpin, D., van de Meent, J. W., Yang, H., and Wood, F. Design and implementation of probabilistic programming language anglican. arXiv preprint arXiv:1608.05263, 2016.
• Tran et al. (2016) Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., and Blei, D. M. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.
• Wang & Domingos (2008) Wang, J. and Domingos, P. Hybrid Markov logic networks. In AAAI, volume 8, pp. 1106–1111, 2008.
• Wood et al. (2014a) Wood, F., Meent, J. W., and Mansinghka, V. A new approach to probabilistic programming inference. In Artificial Intelligence and Statistics, pp. 1024–1032, 2014a.
• Wood et al. (2014b) Wood, F., van de Meent, J. W., and Mansinghka, V. A new approach to probabilistic programming inference. In Proceedings of the 17th International conference on Artificial Intelligence and Statistics, pp. 1024–1032, 2014b.
• Wu et al. (2014) Wu, Y., Li, L., and Russell, S. BFiT: From possible-world semantics to random-evaluation semantics in open universe. 3rd NIPS Workshop on Probabilistic Programming, 2014.
• Wu et al. (2016) Wu, Y., Li, L., Russell, S., and Bodik, R. Swift: Compiled inference for probabilistic programming languages. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), 2016.

## Appendix A Background on Measure-theoretical Probability Theory

We assume familiarity with measure-theoretic approaches to probability theory, but provide the fundamental definitions. The standard Borel $\sigma$-algebra is assumed in all the discussion. See (Durrett, 2013) and (Kallenberg, 2002) for introduction and further details.

A measurable space $(X, \mathcal{X})$ (space, for short) is an underlying set $X$ paired with a $\sigma$-algebra $\mathcal{X} \subseteq 2^X$ of measurable subsets of $X$, i.e., a family of subsets containing the underlying set $X$ which is closed under complements and countable unions. We'll denote the measurable space simply by $X$ where no ambiguity results. A function $f: X \to Y$ between measurable spaces is measurable if measurable sets pull back to measurable sets: $f^{-1}(B) \in \mathcal{X}$ for all $B \in \mathcal{Y}$. A measure $\mu$ on a measurable space $X$ is a function $\mu: \mathcal{X} \to [0, \infty]$ which satisfies countable additivity: $\mu\left(\bigcup_n A_n\right) = \sum_n \mu(A_n)$ for any countable sequence $A_1, A_2, \ldots \in \mathcal{X}$ of disjoint measurable sets. $P[S]$ denotes the probability of a statement $S$ under the base measure $P$, and similarly for conditional probabilities. A probability kernel is the measure-theoretic generalization of a conditional distribution. It is commonly used to construct measures over a product space, analogously to how conditional distributions are used to define joint distributions in the chain rule.

###### Definition A.1.

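A probability kernel $K$ from one measurable space $X$ to another $Y$ is a function $K: X \times \mathcal{Y} \to [0, 1]$ such that (a) for every $x \in X$, $K(x, \cdot)$ is a probability measure over $Y$, and (b) for every $B \in \mathcal{Y}$, $K(\cdot, B)$ is a measurable function from $X$ to $[0, 1]$.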

Given an arbitrary index set $T$ and spaces $X_t$ for each index $t \in T$, the product space $X_T = \prod_{t \in T} X_t$ is the space with underlying set the Cartesian product of the underlying sets, adorned with the smallest $\sigma$-algebra such that the projection functions $\pi_t: X_T \to X_t$ are measurable.

## Appendix B MTBNs Represent Unique Measures

We prove here Theorem 3.6. The proof requires a series of intermediate results. We first define a projective family of measures, which gives a way to recursively construct our measure $\mu$. We then define a notion of consistency such that every consistent projective family constructs a measure that the MTBN represents. We end by giving an explicit characterization of the unique consistent projective family, and thus of the unique measure the MTBN represents. The appendix contains additional technical material required in the proofs.

Intuitively, the main objective of this section is to show that an MTBN defines a unique measure that “factorizes” according to the network, as an extension to the corresponding result for Bayes Nets.

### B.1 Consistent projective family of measures

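Let $K$ be a kernel from $X$ to $Y$ and $L$ a kernel from $Y$ to $Z$. Their composition $K \circ L$ (note the ordering!) is a kernel from $X$ to $Z$ defined for $x \in X$, $C \in \mathcal{Z}$ by:

$$(K \circ L)(x, C) = \int K(x, dy) \int L(y, dz)\, \mathbf{1}_C(z). \tag{5}$$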

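To allow uniform notation, we will treat measurable functions and measures as special cases of kernels. A measurable function $f: X \to Y$ corresponds to the kernel $K_f$ from $X$ to $Y$ given by $K_f(x, B) = \mathbf{1}_B(f(x))$ for $x \in X$ and $B \in \mathcal{Y}$. A measure $\mu$ on a space $X$ is a kernel $K_\mu$ from $1$, the one-element measure space, to $X$ given by $K_\mu(\ast, A) = \mu(A)$ for $A \in \mathcal{X}$. Where this yields no confusion, we use $f$ and $\mu$ in place of $K_f$ and $K_\mu$. Eq. (5) simplifies if the kernels are measures or functions. Let $\mu$ be a measure on $X$, $K$ be a kernel from $X$ to $Y$, $f$ be a measurable function from $X$ to $Y$, and $g$ be a measurable function from $Y$ to $Z$. Then $\mu \circ f$ is a measure on $Y$ and $K \circ g$ is a kernel from $X$ to $Z$ with: $(\mu \circ f)(B) = \mu(f^{-1}(B))$, and $(K \circ g)(x, C) = K(x, g^{-1}(C))$.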
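On finite spaces the composition in (5) reduces to a matrix product, which makes the ordering convention easy to check. The following Python sketch assumes finite spaces indexed by integers and represents a kernel as a list of rows, each row a probability vector; the names `compose` and `kernel_of_function` and the example numbers are illustrative choices, not notation from the paper.

```python
def compose(K, L):
    """Compose finite kernels: first K (from X to Y), then L (from Y to Z)."""
    n_y, n_z = len(L), len(L[0])
    return [[sum(row[y] * L[y][z] for y in range(n_y)) for z in range(n_z)]
            for row in K]


def kernel_of_function(f, n_in, n_out):
    """Deterministic kernel of a function: row x puts all its mass on f(x)."""
    return [[1.0 if f(x) == y else 0.0 for y in range(n_out)]
            for x in range(n_in)]


# A measure on X = {0, 1}, written as a kernel from the one-element space.
mu = [[0.25, 0.75]]
# A kernel from X = {0, 1} to Y = {0, 1, 2}.
K = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]
# A function g: Y -> Z = {0, 1}, treated as a kernel.
g = kernel_of_function(lambda y: y % 2, 3, 2)

mu_K = compose(mu, K)[0]  # a measure on Y
K_g = compose(K, g)       # a kernel from X to Z
```

Here `K_g[x][z]` equals the mass that `K[x]` assigns to the preimage of `z` under `g`, which is the finite form of composing a kernel with a measurable function.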

Let $\mathcal{U}$ denote the class of upwardly closed sets: subsets of $V$ containing all their elements’ parents.

###### Definition B.1.

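A projective family of measures is a family $\{\mu_U\}_{U \in \mathcal{U}}$ consisting of a measure $\mu_U$ on $X_U$ for every upwardly closed set $U \in \mathcal{U}$ such that whenever $W \subseteq U$ we have $\mu_W = \mu_U \circ \pi^U_W$, i.e., for all $A \in \mathcal{X}_W$, $\mu_W(A) = \mu_U\left((\pi^U_W)^{-1}(A)\right)$.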

Def. B.1 is the measure-theoretic version of the requirement that the distribution of a subset of variables equals the marginal obtained by “summing out” the other variables in the joint distribution.

###### Definition B.2.

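Let $\mu$ be a measure on a measure space $X$, and $K$ a kernel from $X$ to a measure space $Y$. Then $\mu \otimes K$ is the measure on $X \times Y$ defined for $A \in \mathcal{X} \otimes \mathcal{Y}$ by: $(\mu \otimes K)(A) = \int \mu(dx) \int K(x, dy)\, \mathbf{1}_A(x, y)$.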

Def. B.2 defines the operation of composing a conditional probability with a prior on a parent, to obtain the corresponding joint distribution.
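For finite spaces, the operation of Def. B.2 is the familiar product of a prior with a conditional probability table. The sketch below assumes dictionaries as the representation of measures and kernels; the names `joint` and `marginal_x` and the numbers are hypothetical. It also checks the marginalization property behind Def. B.1: projecting the joint back onto the parent recovers the prior.

```python
def joint(mu, K):
    """Joint measure on pairs (x, y): prior mass of x times kernel mass of y given x."""
    return {(x, y): mu[x] * K[x][y] for x in mu for y in K[x]}


def marginal_x(j):
    """Push the joint forward along the projection (x, y) -> x."""
    out = {}
    for (x, _), p in j.items():
        out[x] = out.get(x, 0.0) + p
    return out


# Hypothetical prior on a parent variable and a kernel for its child.
mu = {"a": 0.25, "b": 0.75}
K = {"a": {0: 0.5, 1: 0.5},
     "b": {0: 1.0, 1: 0.0}}

j = joint(mu, K)
recovered = marginal_x(j)
```

In this example `j` assigns mass to the four pairs `(x, y)`, and `recovered` agrees with `mu`.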

###### Definition B.3.

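Let $K_i$ for $i \in I$ be kernels from $X$ to $Y_i$. Denote by $\prod_{i \in I} K_i$ the kernel from $X$ to $\prod_{i \in I} Y_i$ defined for each $x \in X$ by the infinite product of measures: $\left(\prod_{i \in I} K_i\right)(x, \cdot) = \bigotimes_{i \in I} K_i(x, \cdot)$.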

See (Kallenberg, 2002) 1.27 and 6.18 for definition and existence of infinite products of measures. Def. B.3 captures the kernel representation for taking the equivalent of products of conditional distributions of a set of variables with a common set of parents.

###### Definition B.4.

A projective family is consistent with if for any such that and , then: .

Consistency in Def. B.4 captures the global condition that we would like to see in a generalization of a Bayes network. Namely, the distribution of any set of parent-closed random variables should “factorize” according to the network.

A projective family derived from a measure $\mu$ is consistent with the MTBN exactly when the MTBN represents $\mu$:

###### Lemma B.5.

Let $\mu$ be a measure on $X_V$, and define the projective family $\{\mu_U\}$ by $\mu_U = \mu \circ \pi^V_U$. This projective family is consistent with the MTBN iff the MTBN represents $\mu$.

###### Proof.

First we’ll relate consistency (Def. B.4) with conditional expectation and distribution properties of random variables. Take any such that and and observe that the following are equivalent:

• is a version of the conditional distribution of given ,

• is a version of the conditional distribution of given for all , and are mutually independent conditional on .

The forward direction is straightforward. For the converse we use the fact that conditional independence of families of random variables holds if it holds for all finite subsets, establishing that by chaining conditional independence (see (Kallenberg, 2002) p109 and 6.8). ∎

Lemma B.5 shows that the consistency condition of Def. B.4 holds iff the MTBN represents the joint distribution; in other words, it holds iff the local Markov property holds.

### B.2 There exists a unique consistent family

Each vertex $v$ is assigned the unique minimal ordinal $d(v)$ such that $d(u) < d(v)$ whenever $u$ is a parent of $v$ (see (Jech, 2003) for an introduction to ordinals). For any $U$ denote by $U_\alpha$ the restriction of $U$ to vertices of depth less than $\alpha$. Defining $D = \sup_{v}(d(v) + 1)$, the least strict upper bound on depth, we have that $U_\alpha = U$ for all $\alpha \ge D$. In the following, fix a limit ordinal $\lambda$.

###### Definition B.6.

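$(\mu^\alpha)_{\alpha < \lambda}$ is a projective sequence of measures on $(X_{U_\alpha})_{\alpha < \lambda}$ if whenever $\alpha < \beta < \lambda$ we have $\mu^\alpha = \mu^\beta \circ \pi^{U_\beta}_{U_\alpha}$.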

Def. B.6 generalizes the notion of subset relationships and the marginalization operations that hold between supersets and subsets to the case of infinite dependency chains.

###### Definition B.7.

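The limit $\lim_{\alpha < \lambda} \mu^\alpha$ of a projective sequence $(\mu^\alpha)_{\alpha < \lambda}$ of measures is the unique measure on $X_{U_\lambda}$ such that $\mu^\alpha = \left(\lim_{\beta < \lambda} \mu^\beta\right) \circ \pi^{U_\lambda}_{U_\alpha}$ for all $\alpha < \lambda$.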

###### Definition B.8.

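Given any upwardly closed set $U$, inductively define a measure $\mu^\alpha_U$ on $X_{U_\alpha}$ by

$$\begin{aligned} \mu^0_U &= 1, \\ \mu^{\alpha+1}_U &= \mu^\alpha_U \otimes \prod_{v \in U :\, d(v) = \alpha} \left( \pi^{U_\alpha}_{\mathrm{pa}(v)} \circ K_v \right), \\ \mu^\lambda_U &= \lim_{\alpha < \lambda} \mu^\alpha_U \quad \text{if } \lambda \text{ is a limit ordinal.} \end{aligned}$$

$\mu^\alpha_U$ stabilizes once $\alpha$ reaches the least strict upper bound on depth, defining a measure $\mu_U$ on $X_U$.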

The above definition is coherent as $(\mu^\alpha_U)_\alpha$ can be inductively shown to be a projective sequence. Lemmas B.9 and B.10 allow us to show in Theorem B.11 that $\{\mu_U\}$ is the unique consistent projective family of measures.
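For a finite network with discrete variables, the inductive construction of Def. B.8 reduces to building the joint distribution one depth layer at a time; the ordinals become non-negative integers and the limit step is never needed. The Python sketch below is an illustration under those assumptions only. The function names, the dictionary-based encoding, and the three-variable network are hypothetical and do not come from the paper.

```python
from itertools import product


def depths(parents):
    """Depth of v: the smallest integer exceeding the depth of every parent."""
    d = {}

    def depth(v):
        if v not in d:
            d[v] = max((depth(u) + 1 for u in parents[v]), default=0)
        return d[v]

    for v in parents:
        depth(v)
    return d


def build_joint(parents, values, cpd):
    """Extend the joint layer by layer; cpd[v](parent_values) -> {value: prob}."""
    d = depths(parents)
    order = []        # variables assigned so far
    mu = {(): 1.0}    # depth 0: unit mass on the empty assignment
    for alpha in range(max(d.values()) + 1):
        layer = sorted(v for v in d if d[v] == alpha)
        new_mu = {}
        for assignment, p in mu.items():
            env = dict(zip(order, assignment))
            # Variables in one layer share the earlier layers as parents,
            # so their kernels multiply independently.
            for vals in product(*(values[v] for v in layer)):
                q = p
                for v, x in zip(layer, vals):
                    pa_vals = tuple(env[u] for u in parents[v])
                    q *= cpd[v](pa_vals)[x]
                new_mu[assignment + vals] = q
        mu, order = new_mu, order + layer
    return order, mu


# Hypothetical network: A is a root; B and C each depend on A.
parents = {"A": [], "B": ["A"], "C": ["A"]}
values = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
cpd = {
    "A": lambda pa: {0: 0.5, 1: 0.5},
    "B": lambda pa: {0: 1.0, 1: 0.0} if pa[0] == 0 else {0: 0.5, 1: 0.5},
    "C": lambda pa: {0: 0.75, 1: 0.25},
}

order, mu = build_joint(parents, values, cpd)
```

Each pass of the outer loop plays the role of the successor step: the measure built so far is extended by the product of the kernels of all vertices at the current depth, each reading its parents' values from the existing assignment.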

###### Lemma B.9.

If for , then for all :

Proof is in Appx. C.

###### Lemma B.10.

If where , and if , then , , and