 # Discontinuous Hamiltonian Monte Carlo for Probabilistic Programs

Hamiltonian Monte Carlo (HMC) is the dominant statistical inference algorithm used in most popular first-order differentiable probabilistic programming languages. HMC requires that the joint density be differentiable with respect to all latent variables. This complicates expressing some models in such languages and prohibits others. A recently proposed new integrator for HMC yielded a new Discontinuous HMC (DHMC) algorithm that can be used for inference in models with joint densities that have discontinuities. In this paper we show how to use DHMC for inference in probabilistic programs. To do this we introduce a sufficient set of language restrictions, a corresponding mathematical formalism that ensures that any joint density denoted in such a language has a suitably low measure of discontinuous points, and a recipe for how to apply DHMC in the more general probabilistic-programming context. Our experimental findings demonstrate the correctness of this approach.

## Code Repositories

### PyFOPPL-2

Framework for First Order Probabilistic Programming Languages

### pyfoco

Compiler for Probabilistic Programs.

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2011)

is an efficient Markov Chain Monte Carlo (MCMC) algorithm that has been widely used for inference in a wide-range of probabilistic models

(Jacobs et al., 2016; Svensson et al., 2017; Doyle et al., 2016). Its superior performance arises from the advantageous properties of the sample paths that it generates via Hamiltonian mechanics. HMC, particularly the No-U-turn sampler (NUTS) (Hoffman and Gelman, 2014) variant, is implemented in many Probabilistic Programming Systems (PPSs)(Eli et al., 2017; Tran et al., 2017; Salvatier et al., 2016; Gelman et al., 2015), and is the main inference engine of both PyMC3 (Salvatier et al., 2016) and Stan (Gelman et al., 2015; Carpenter et al., 2015).

One drawback of using HMC in probabilistic programming is that complications can arise when the target distribution is not differentiable with respect to one or more of its parameters. Developers often impose restrictions on the models that can be encoded to try and avoid these complications, such as preventing the use of discrete variables or only allowing them if they are directly marginalized out.

However, even these restrictions are not sufficient to guarantee the program is discontinuity-free because control flow special forms, such as if-else statements in PPLs, can also induce discontinuities.

Though it turns out, perhaps surprisingly, that HMC still constitutes a valid inference algorithm when the target is discontinuous or even if it has disjoint support, this can substantially reduce the acceptance rate leading to inefficient inference.

To mitigate this issue, several variants of HMC have been developed (Afshar and Domke, 2015; Nishimura et al., 2017) to perform inference on models with discontinuous densities. However, existing systems, like Stan, are not able to support these variants as their compilation procedures do not unearth the required discontinuity information. Creating a system that addresses this issue in an automated way is non-trivial and requires significant heavy lifting from the programming languages perspective.

Therefore, in this extended abstract, we define a carefully designed probabilistic programming language with an accompanying denotational semantics and compiler. Together, these are able to both recover the discontinuity information required to use these HMC variants as inference engines and provide a framework amenable to the theoretical analysis required to demonstrate the correctness of the resulting engines. To ensure that the measure of the set of discontinuities in the target density is of measure zero, we provide a mathematical formalism to compliment our language. This then provides us with a framework in which we can conservatively and correctly employ HMC and its variants. We demonstrate this for the Discontinuous Hamiltonian Monte Carlo  (DHMC) algorithm (Nishimura et al., 2017) and provide an example of our language being employed on models that have non-differentiable densities.

## 2. A Simple PPL

SPPL uses a Lisp style syntax, like that of Church (Goodman et al., 2008), Anglican (Wood et al., 2014) and Venture (Mansinghka et al., 2014). The syntax of expressions in our language is given as follows:

We use for a real-valued variable, for a real number, for an analytic primitive operation on real, such as and , and for the type of a distribution on , such as Normal, that has a piece-wise smooth density under analytic partition.

To be less technical, this is restricted on only allows continuous distribution as primitives. However, one can easily construct discrete distributions as was seen in Figure 1, where Program 1 (a) and 2 (b) define the joint density of the model respectively as following,

 p1(z,y)= pB(z;q)pN(y;1,1)I(z=1)pN(y;0,1)I(z=0) p2(x,y)= pU(x;0,1)pN(y;1,1)I(x>q)pN(y;0,1)I(x≤q)

Note that there are no forms for applying general functions in this language and no recursion is possible at all. As a result, all programs written in this language may only have a finite and fixed number of sample and observe

statements. This means, among other things, that programs in this language can be compiled to graphical models in which there are finite number of random variable vertices coming from every sample and observe statement. For this reason we will mix our use of the terms graphical model and probabilistic program, or just program, because of their equivalence. Figure 1. A simple probabilistic program example written in Anglican (a) with Bernoulli distribution as a primitive and an equivalent one in SPPL (b) where the Bernoulli draw is constructed by a Uniform-draw and if-else statement.

The primitives are, by design, restricted to be analytic. Analytic functions are abundant and most primitive functions that we encounter in machine learning applications are analytic, and their composition is also analytic.

Intuitively, programs in SPPL have a joint density in the form as Definition 1 (See Appendix A). It can be understood as a collection of smooth functions in each partition, with no partition overlapping and the union of all partitions is the total space . To evaluate the sum, therefore, we just need to evaluate these products at one-by-one until we find one that returns a non-zero value. Then, we have to compute the function corresponding to this product at the input .

###### Theorem 1 ().

If the density has the form of Definition 1 and so is piecewise smooth under analytic partition, then there exists a (Borel) measurable subset such that is differentiable outside of and the Lebesgue measure of is zero.

Together with Definition 1 and Theorem 1, we ensures that SPPL conforms to Lemma 1 and Theorems 2 and 3 of DHMC (Nishimura et al., 2017) and so, by design, all requirements for DHMC are trivially met as the density is a piecewise smooth function and all discontinuities are of measure zero.

## 3. The Compilation Scheme

Accompanying the language, our compilation scheme is designed to establish variables which the density is discontinuous with respect to and provide information for runtime checking on boundary crossing. It works by automatically extracting if predicates and evaluating them with the corresponding random variables. Each predicate is assigned a unique name and corresponding boolean. If the boolean changes value, it indicates the corresponding random variable has crossed the boundary during the update in inference at runtime. The runtime checker will record this information and pass it to the inference engine. Note that the runtime checking can only detect boundary crossing instead of being able to calculate the analytical solution on where the boundaries are exactly. We recognize the former part is sufficient for many specialized inference engines and leave the later part to be implemented as future work.

## 4. Experiment Figure 2. Mean Squared Error for the posterior estimates to true posterior of the cluster means μ1:2. We compare the results from DHMC and NUTS with Metropolis-within-Gibbs(MwG). The median (dashed lines) and 20%/80%confidence intervals(shaded) over 20 independent runs are shown.

We now consider a classic Gaussian mixture model (GMM), where one is interested to estimate the mean of each cluster and the cluster assignment for each data. The density of this model contains a mixture of continuous and discrete variables (See details of the model in Appendix

C).

We compared the Mean Squared Error (MSE) of the posterior estimates for the cluster means of both an unoptimized version of DHMC and PyMC3 (Salvatier et al., 2016) optimized implementation of NUTS with Metropolis-within-Gibbs(MwG), with the same computation budget. The results are shown in Figure 2 as a function of number of samples. We take samples and discard for burn in. We calculate the confidence intervals over 20 independent runs and find that both approaches perform consistently well. We find that our unoptimized DHMC implementation, performs comparable to the optimized NUTS with MwG approach.

## Appendix A Piecewise Smooth Function

###### Definition 1 ().

A function is piecewise smooth under analytic partition if it has the following form:

where

1. the are analytic;

2. the are smooth;

3. is a non-negative integer or ;

4. are non-negative integers; and

5. the indicator functions

for the indices define a partition of as,

## Appendix B Proof of Theorem 1

###### Proof.

Assume that is piecewise smooth under analytic partition. Thus,

 (1) f(x)=N∑i=1Mi∏j=1I(pi,j(x)≥0)⋅Oi∏l=1I(qi,l(x)<0)⋅hi(x)

for some and that satisfy the properties in Definition 1.

We use one well-known fact: the zero set of an analytic function is the entire or has zero Lebesgue measure (Mityagin, 2015). We apply the fact to each and deduce that the zero set of is or has measure zero. Note that if the zero set of is the entire , the indicator function becomes the constant- function, so that it can be omitted from the RHS of equation (1). In the rest of the proof, we assume that this simplification is already done so that the zero set of has measure zero for every .

For every , we decompose the -th region

 Ri={x∣pi,j≥0 and qi,l(x)<0 for all j,l}

to

 R′i={x∣pi,j>0 and qi,l(x)<0 for all j,l}and R′′i=Ri∖R′i.

Note that is open because the and are analytic and so continuous, both and are open, and the inverse images of open sets by continuous functions are open. This means that for each , we can find an open ball at inside so that for all in the ball. Since is smooth, this implies that is differentiable at every . For the other part , we notice that

 R′′i⊆Mi⋃j=1{x∣pi,j(x)=0}.

The RHS of this equation is a finite union of measure-zero sets, so it has measure zero. Thus, also has measure zero as well.

Since is a partition of , we have that

 Rn=N⋃i=1R′i∪N⋃i=1R′′i.

The density is differentiable on the union of ’s. Also, since the union of finitely or countably many measure-zero sets has measure zero, the union of ’s has measure zero. Thus, we can set the set required in the theorem to be this second union. ∎

## Appendix C Details of GMM

The Gaussian Mixture Model (GMM) can be defined as,

 μk ∼N(μ0,σ0),k=1,…,K zn ∼Categorical(p0),n=1,…,N yn|zn, μzn∼N(μzn,σzn),n=1,…,N

where are latent variables, are observed data with as the number of clusters and the total number of data. Although Categorical distribution is not in our primitive, we can easily construct it in our language by the combination of uniform draws and nested if expressions.

For our experiment, we considered a simple case with and , along with as the synthetic dataset.