LF-PPL: A Low-Level First Order Probabilistic Programming Language for Non-Differentiable Models

March 6, 2019, by Yuan Zhou et al.

We develop a new Low-level, First-order Probabilistic Programming Language (LF-PPL) suited for models containing a mix of continuous, discrete, and/or piecewise-continuous variables. The key strength of this language and its compilation scheme is their ability to automatically distinguish the parameters with respect to which the density function is discontinuous, while further providing runtime checks for boundary crossings. This enables the introduction of new inference engines that are able to exploit gradient information, while remaining efficient for models which are not everywhere differentiable. We demonstrate this ability by incorporating a discontinuous Hamiltonian Monte Carlo (DHMC) inference engine that delivers automated and efficient inference for non-differentiable models. Our system is backed by a mathematical formalism ensuring that any model expressed in this language has a density whose discontinuities form a set of measure zero, which maintains the validity of the inference engine.


1 Introduction

Non-differentiable densities arise in a huge variety of common probabilistic models [1, 2]. Often, but not exclusively, they occur due to the presence of discrete variables. In the context of probabilistic programming [3, 4, 5, 6], such densities are often induced via branching, i.e. if-else statements, where the predicates depend upon the latent variables of the model. Unfortunately, performing efficient and scalable inference in models with non-differentiable densities is difficult, and algorithms adapted for such problems typically require specific knowledge about the discontinuities [7, 8, 9], such as which variables the target density is discontinuous with respect to, along with the ability to catch occurrences of the sampler crossing a discontinuity boundary. However, detecting when discontinuities occur is difficult and problem dependent. Consequently, automating specialized inference algorithms in probabilistic programming languages (PPLs) is challenging.

To address this problem, we introduce a new Low-level First-order Probabilistic Programming Language (LF-PPL), with a novel accompanying compilation scheme. Our language is based around carefully chosen mathematical constraints, such that the set of discontinuities in the density function of any model written in LF-PPL will have measure zero. This is an essential property for many inference algorithms designed for non-differentiable densities [7, 8, 9, 10, 11]. Our accompanying compilation scheme automatically classifies discontinuous and continuous random variables for any model specified in our language. Moreover, this scheme can be used to detect transitions across discontinuity boundaries at runtime, providing important information for running such inference schemes.

Relative to previous languages, LF-PPL enables one to incorporate a broader class of specialized inference techniques as automated inference engines. In doing so, it removes the burden from the user of manually establishing which variables the target is not differentiable with respect to. Its low-level nature is driven by a desire to establish the minimum language requirements to support inference engines tailored to problems with measure-zero discontinuities, and to allow for a formal proof of correctness. Though LF-PPL is usable in its own right, our main intention is for it to serve as a compilation target for existing systems, or as an intermediate representation for designing new languages.

There are a number of different derivative-based inference paradigms which LF-PPL can help extend to non-differentiable settings [7, 8, 9, 10, 11]. Of particular note are stochastic variational inference (SVI) [12, 13, 14, 15] and Hamiltonian Monte Carlo (HMC) [16, 17], two of the most widely used approaches for probabilistic programming inference.

In the context of the former, [9] recently showed that the reparameterization trick can be generalized to piecewise differentiable models when the non-differentiable boundaries can be identified, leading to an approach which provides significant improvements over previous methods that do not explicitly account for the discontinuities. LF-PPL provides a framework that could be used to apply their approach in a probabilistic programming setting, thereby paving the way for significant performance improvements for such models.

Similarly, many variants of HMC have been proposed in recent years to improve the sample efficiency and scalability when the target density is non-differentiable [7, 8, 18, 19, 20]. Despite this, no probabilistic programming systems (PPSs) support these tailored approaches at present, as the underlying languages are not able to extract the necessary information for their automation. The novel compilation approach of LF-PPL provides key information for running such approaches, enabling their implementation as automated inference engines. We realize this potential by implementing discontinuous HMC (DHMC) [8] as an inference engine in LF-PPL, allowing for efficient, automated, HMC-based inference in models with a mixture of continuous and discontinuous variables.

2 Background and Related Work

There exist a number of different approaches to probabilistic programming that are built around a variety of semantics and inference engines. Of particular relevance to our work are PPSs designed around derivative-based inference methods that exploit automatic differentiation [21], such as Stan [6], PyMC3 [22], Edward [23], Turing [24] and Pyro [25]. Derivative-based inference algorithms have been an essential component in enabling these systems to provide efficient and, in particular, scalable inference, permitting both large datasets and high dimensional models.

One important challenge for these systems arises in dealing with probabilistic programs that contain discontinuous densities and/or variables. From the statistical perspective, dealing with discontinuities is often important for conducting effective inference. For example, in HMC, discontinuities can cause statistical inefficiency by inducing large errors in the leapfrog integrator, leading to potentially very low acceptance rates [7, 8]. In other words, though the leapfrog integrator remains a valid, reversible MCMC proposal even when discontinuities break the reversibility of the Hamiltonian dynamics themselves, the discontinuities can undermine the effectiveness of this proposal.

Different methods have been suggested to improve inference performance in models with discontinuous densities. For example, sophisticated integrators have been employed in the HMC setting to remain effective in the presence of discontinuities. Analogously, in the variational inference and deep learning literature, reparameterization methods have been proposed that allow training for discontinuous targets and discrete variables [9, 26].

However, these advanced methods are, in general, not incorporated into existing gradient-based PPSs, as existing systems do not have adequate support for dealing with the discontinuities in the density functions of the models defined by probabilistic programs. Such support is usually necessary to guarantee the correct execution of those inference methods in an automated fashion, as many require the set of discontinuities to be of measure zero; that is, the union of all points where the density is discontinuous has zero measure with respect to the Lebesgue measure. In addition, some methods further require knowledge of where the discontinuities are, or at least the ability to catch occurrences of discontinuity boundaries being crossed.

Of particular relevance to our language and compilation scheme are compilers which compile a program to an artifact representing a directed acyclic graphical model (DAG), such as those employed in BUGS [27] and, in particular, the first-order PPL (FOPPL) explored in [28]. Although the dependency structures of programs in our language are established in a similar manner, unlike these setups, programs in our language will not always correspond to a DAG, due to different restrictions on our density factors, as will be explained in the next section. We also impose necessary constraints on the language by limiting the functions allowed, to ensure that the advanced inference processes remain valid.

3 The Language

LF-PPL adopts a Lisp-like syntax, like that of Church [4] and Anglican [5]. The syntax contains two key special forms, sample and observe, through which the distribution of the query is defined, and whose interpretation and syntax are analogous to those of Anglican.

More precisely, sample is used for drawing a random variable and returns that variable, while observe factors the density of the program using existing variables and fixed observations, returning zero. Both special forms are designed to take a distribution object as input, with observe further taking an observed value. These distribution objects form the elementary random procedures of the language and are constructed using one of a number of internal constructors for common objects such as normal and bernoulli. Figure 1 shows an example of an LF-PPL program.

A distribution object constructor of particular note is factor, which can only be used with observe. Including the statement (observe (factor log-p) _) will factor the program density by the value of (exp log-p), with no dependency on the observed value itself (here _). The significance of factor is that it allows the specification of arbitrary unnormalized target distributions, quantified as log-p, which can be generated internally in the program and thus can take the form of any deterministic function of the variables that can be written in the language.
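To make the effect of factor concrete (our schematic reading, using the density notation introduced in Section 5), executing (observe (factor log-p) _), where log-p evaluates to ℓ(x₁, …, xₙ), simply multiplies the unnormalized program density by exp(ℓ):

    γ(x₁, …, xₙ) ← γ(x₁, …, xₙ) · exp( ℓ(x₁, …, xₙ) ).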

(let [x (sample (uniform 0 1))]
    (if (< (- q x) 0)
        (observe (normal 1 1) y)
        (observe (normal 0 1) y))
    (< (- q x) 0))
Figure 1: An example LF-PPL program sampling from a uniform random variable and invoking a choice between two observe statements that factor the trace weight using different Gaussian likelihoods. The (< (- q x) 0) term, which would usually be written as q − x < 0, represents a Bernoulli variable parameterized by q, and its boolean value also corresponds to which branch of the if statement is taken. The slightly unusual writing of the program is due to its deliberate low-level nature, with almost all syntactic sugar removed. One sugar that has been left in for exposition is an additional term in the let block, i.e. (let [x e] e e), which can be trivially unraveled.

Unlike many first-order PPLs, such as that of [28], LF-PPL programs do not permit interpretation as DAGs, because we allow the observation of internally sampled variables and the use of factor. This increases the range of models that can be encoded and is, for example, critical in allowing undirected models to be written. LF-PPL programs need not correspond to a correctly normalized joint formed by the combination of prior and likelihood terms. Instead, we interpret the density of a program in the manner outlined by [29, §4.3.2 and §4.4.3], noting that for any LF-PPL program, the number of sample and observe statements must be fixed, a restriction that is checked during compilation.

To formalize the syntax of LF-PPL, let us write x for a real-valued variable, c for a real number, op for an analytic primitive operation on reals, such as +, -, *, / and exp, and dist for a distribution object whose density is defined with respect to the Lebesgue measure and is piecewise smooth under analytic partition (see Definition 1). The syntax of expressions e in our language is then given as:

    e ::= x | c | (op e … e) | (if (< e 0) e e) | (let [x e] e)
        | (sample (dist e … e)) | (observe (dist e … e) e)
Our syntax is deliberately low-level to permit theoretical analysis and aid the exposition of the compiler. However, common syntactic sugar such as for-loops and higher-level branching statements can be trivially included using straightforward unravellings. Similarly, we can permit discrete variable distribution objects by noting that these can themselves be desugared to a combination of continuous random variables and branching statements. Thus, it is straightforward to extend this minimalistic framework to a more user-friendly language using standard compilation approaches, such that LF-PPL will form an intermediate representation. For implementation and code, see https://github.com/bradleygramhansen/PyLFPPL.
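For example (our illustration of one such desugaring), a Bernoulli draw reduces to a uniform draw plus a branch:

    z ∼ Bernoulli(θ)   ⟺   u ∼ Uniform(0, 1),  z := 1[u − θ < 0],

so that z = 1 exactly when u < θ, which occurs with probability θ. The indicator on the right corresponds directly to an if expression with predicate (< (- u θ) 0), mirroring the construction in Figure 1.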

4 Compilation Scheme

We now provide a high-level description of how the compilation process works. Specifically, we show how it transforms an arbitrary LF-PPL program into a representation that can be exploited by an inference engine that makes use of discontinuity information.

The compilation scheme performs three core tasks: a) finding the variables which the target is discontinuous with respect to, b) extracting the density of the program in a convenient form that can be used by an inference engine, and c) allowing boundary crossings to be detected at runtime. Key to providing these features is the construction of an internal representation of the program that specifies the dependency structure of the variables, the linearized intermediate representation (LIR). The LIR contains vertices, arc pairs, and information about the if predicates. Each vertex of the LIR denotes a sample or observe statement, of which only a finite and fixed number can occur in LF-PPL. The arcs of the LIR define both the probabilistic and the if-condition dependencies of the variables. The former are constructed in the same way as in the FOPPL compiler detailed in [28].
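To make the shape of this representation concrete, the following is a minimal Python sketch of an LIR-like structure; the class and field names are our own illustration and are not the actual PyLFPPL implementation.

from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Vertex:
    name: str   # fresh name of the random variable of a sample/observe statement
    kind: str   # "sample" or "observe"

@dataclass
class IfInfo:
    branch_var: str   # unique boolean branching variable (discussed below)
    predicate: str    # textual predicate, e.g. "(< (- q x) 0)"
    deps: Set[str]    # sampled variables the predicate depends on

@dataclass
class LIR:
    vertices: List[Vertex] = field(default_factory=list)
    arcs: List[Tuple[str, str]] = field(default_factory=list)  # (parent, child)
    ifs: List[IfInfo] = field(default_factory=list)

    def discontinuous_vars(self) -> Set[str]:
        # every sampled variable capable of flipping some if predicate
        return set().union(*(i.deps for i in self.ifs)) if self.ifs else set()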

Using the dependency structure represented by the LIR, we can establish which variables are capable of changing the path taken by a program trace, that is, of changing the branch taken by one or more if statements. Because discontinuities only arise in LF-PPL through if statements, the target must be continuous with respect to any variables not capable of changing the traversed path. We can thus mark these variables as being “continuous”. Though it is possible for the target to still be continuous with respect to variables that appear in, or have dependent variables appearing in, the branching function of an if statement, such cases cannot, in general, be statically established. We therefore mark all such variables as “discontinuous”.

To extract the density in a form convenient for the inference engine, the compiler transforms the program into a quadruple of sets (X, D, S, O) by recursively applying the translation rules given in Section 5.2. Here X specifies the set of all variables sampled in the program, while D specifies only the variables marked as discontinuous. S represents the density associated with all the sample statements in a program, while O represents the density factors originating from the observe statements, along with information on the program return value. These densities are themselves represented through a collection of smooth density terms and indicator functions truncating them into disjoint regions, each corresponding to a particular program path. This construction is discussed in depth in Section 5.2.

To catch boundary crossings at runtime, each if predicate is assigned a unique boolean variable within the LIR. We refer to these variables as branching variables. The boolean value of a branching variable denotes whether the current sample falls into the true or false branch of the corresponding if statement and is used to signal boundary crossings at runtime. Specifically, if a branching variable changes its boolean value, this indicates that at least one sampled variable affecting that if predicate has crossed the boundary. The inference engine can therefore track changes in the set of all boolean values to catch boundary crossings.
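A minimal Python sketch of this runtime check follows; the representation of predicates as callables over a trace is our own illustration rather than the PyLFPPL API.

def branching_vector(predicates, trace):
    # evaluate every if predicate on the current values of the sampled variables
    return [pred(trace) for pred in predicates]

def crossed_boundary(predicates, old_trace, new_trace):
    # any change in the vector of branching variables signals a boundary crossing
    return branching_vector(predicates, old_trace) != branching_vector(predicates, new_trace)

# usage for the program in Figure 1, with q fixed to 0.5:
predicates = [lambda tr: 0.5 - tr["x"] < 0.0]  # branching variable for (< (- q x) 0)
print(crossed_boundary(predicates, {"x": 0.3}, {"x": 0.7}))  # True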

We finish the section by noting two limitations of the compiler, and of discontinuity detection more generally. Firstly, it is possible to construct programs whose piecewise smooth densities contain regions of zero density. Though it is important to retain this ability, for example to construct truncated distributions, it may cause issues for certain inference algorithms if it causes the target to have disconnected regions of non-zero density. As analytic densities are either zero everywhere or “almost-nowhere” (see Section 5.1), we (informally) have that all realizations of a program that take a particular path will either all have zero density or all have non-zero density. Consequently, it is relatively straightforward to establish whether a program has regions of zero density. However, whether these regions lead to “gaps” is far more challenging, and potentially impossible, to establish. Moreover, constructing inference procedures for such problems is extremely challenging. We therefore do not attempt to tackle this issue in the current work.

A second limitation is that a change in the vector of branching variables is only a sufficient condition for the occurrence of a boundary crossing, not a necessary one. This is because it is possible for multiple boundaries to be crossed in a single update that results in the new sample following the same path as the old one. For example, for a predicate of the form 1 − x² < 0, when moving from x = −2 to x = 2 the corresponding branching variable returns true in both cases, even though we have crossed two boundaries, at x = −1 and x = 1. The problem of establishing with certainty that no boundaries have been crossed when moving between two points is mathematically intractable in the general case. As this problem is not specific to the probabilistic programming setting, we do not give it further consideration here, noting only that, from the perspective of designing inference algorithms, it is important that convergence is not undermined by such occurrences.
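A two-line numerical illustration of this failure mode (our example, matching the predicate above):

pred = lambda x: 1.0 - x * x < 0.0   # branching variable for the predicate 1 - x^2 < 0
print(pred(-2.0), pred(2.0))         # True True: same branch on both sides, yet the
                                     # boundaries at x = -1 and x = 1 were both crossed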

5 Mathematical Foundation and Compilation Details

We have so far introduced a low-level first-order probabilistic programming language (LF-PPL) and its accompanying compilation scheme. We now set out the underlying mathematical details, which ensure that the discontinuities contained within the densities of the programs one can compile in LF-PPL are of measure zero. This enables us to satisfy the requirements of several inference algorithms for non-differentiable densities. We also provide the formal translation rules of LF-PPL, which are built around these mathematical underpinnings.

5.1 Piecewise Smooth Functions

A function f : ℝⁿ → ℝ is analytic if it is infinitely differentiable and its multivariate Taylor expansion at any point x₀ absolutely converges to f point-wise in a neighborhood of x₀. Most primitive functions that we encounter in machine learning and statistics are analytic, and the composition of analytic functions is also analytic.

Definition 1.

A function f : ℝⁿ → ℝ is piecewise smooth under analytic partition if it has the following form:

    f(x) = ∑_{i=1}^{N} ( ∏_{j=1}^{M_i} 1[p_{i,j}(x) ≥ 0] ) · ( ∏_{k=1}^{O_i} 1[q_{i,k}(x) < 0] ) · h_i(x)

where

  1. the p_{i,j}, q_{i,k} : ℝⁿ → ℝ are analytic;

  2. the h_i : ℝⁿ → ℝ are smooth;

  3. N is a positive integer or ∞;

  4. the M_i and O_i are non-negative integers; and

  5. the indicator functions

    for the indices i define a partition of ℝⁿ, that is, the following family forms a partition of ℝⁿ:

    { A_i := { x ∈ ℝⁿ : p_{i,j}(x) ≥ 0 for all j, and q_{i,k}(x) < 0 for all k } }_{i=1}^{N}.

Intuitively, f is a function defined by partitioning ℝⁿ into finitely or countably many regions and using a smooth function h_i within each region A_i. The products of the indicator functions of these summands form a partition of ℝⁿ, so that only one of these products evaluates to a non-zero value at any given x. To evaluate the sum at x, we just need to evaluate these products one-by-one until we find one that returns a non-zero value; we then compute the function h_i corresponding to this product at the input x. Even though the number of summands (regions) in the definition may be countably infinite, we can still compute the sum at any given x.
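As a simple concrete instance (our example, not from the original text), the function

    f(x) = 1[x ≥ 0] · (1 + x²) + 1[x < 0] · eˣ

is piecewise smooth under analytic partition with N = 2: the analytic boundary functions are p_{1,1}(x) = x and q_{2,1}(x) = x, the smooth pieces are h₁(x) = 1 + x² and h₂(x) = eˣ, and the two regions {x ≥ 0} and {x < 0} partition ℝ. Here f is differentiable everywhere except possibly at the single boundary point x = 0, a set of measure zero.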

Theorem 1.

If an unnormalized density γ : ℝⁿ → ℝ has the form in Definition 1, and so is piecewise smooth under analytic partition, then there exists a (Borel) measurable subset B ⊆ ℝⁿ such that γ is differentiable outside of B and the Lebesgue measure of B is zero.

The proof is given in Appendix A. The target density being almost everywhere differentiable with discontinuities of measure zero is an important property required by many inference techniques for non-differentiable models [8]. As we shall prove in Section 5.2, any program that can be compiled in LF-PPL constructs a density in the form of Definition 1, and thus satisfies this necessary condition.

5.2 Translation Rules

5.2.1 Overview

The compilation scheme translates a program, denoted as an expression e according to the syntax in Section 3, to a quadruple of sets (X, D, S, O). The first set X represents the set of all sampled random variables: all variables generated from sample statements in e are recognized and stored in X. Variables that have not occurred in any if predicate are guaranteed to be continuous. Otherwise, they are also put into the second set D, as the overall density may be discontinuous with respect to them. S represents the densities from sample statements and has the form of a set of pairs, i.e. S = {(I_k, F_k)}_{k=1}^{K}, where K is the number of pairs, I_k denotes a product of indicator functions indicating a partition of the space, and F_k represents the product of the densities defined by the sample statements. The last set O contains the densities from observe statements and the return expression of e. It is a set of tuples O = {(J_m, G_m, E_m)}_{m=1}^{M}, where M is the number of tuples, J_m functions similarly to I_k, G_m is the product of the densities defined by observe statements, and E_m denotes the return expression. Note that it is a design choice to have E_m included in O.

Given (X, D, S, O), one can then construct the unnormalized density defined by the program as

    γ(x) = ( ∑_{k=1}^{K} I_k(x) · F_k(x) ) · ( ∑_{m=1}^{M} J_m(x) · G_m(x) ),        (1)

which by Theorem 2 is piecewise smooth under analytic partition.
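As a minimal illustration of how an inference engine can evaluate Equation 1 from the compiled sets, the following Python sketch (our own, not the PyLFPPL API) hard-codes the quadruple that Section 5.3 derives for the program in Figure 1, taking q = 0.5 and y = 1.0 as example values:

import math

def normal_pdf(v, mu, sigma):
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

q, y = 0.5, 1.0
# S: (indicator, density) pairs from sample statements
S = [(lambda tr: tr["x"] >= 0.0 and tr["x"] - 1.0 < 0.0, lambda tr: 1.0)]
# O: (indicator, density, return value) tuples from observe statements
O = [(lambda tr: q - tr["x"] < 0.0, lambda tr: normal_pdf(y, 1.0, 1.0), True),
     (lambda tr: q - tr["x"] >= 0.0, lambda tr: normal_pdf(y, 0.0, 1.0), False)]

def density(trace):
    sample_part = sum(I(trace) * F(trace) for I, F in S)
    observe_part = sum(J(trace) * G(trace) for J, G, _ in O)
    return sample_part * observe_part

print(density({"x": 0.7}))  # trace taking the first branch: N(1.0; 1, 1) = 0.3989...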

Recall that, by assumption, the density of each distribution type dist is piecewise smooth under analytic partition when viewed as a function of a sampled value and its parameters. Thus, we can assume that the probability density of a distribution has the form in Definition 1. For each distribution d, we define a set of pairs Φ_d = {(I_j, G_j)}_{j=1}^{J}, where J is the number of partitions, I_j denotes the product of indicator functions indicating the partition of the space, taking the form in Definition 1, and G_j represents a smooth probability density function within that partition. One can then construct the probability density function for d from Φ_d: for given parameters θ of the distribution and a given sample value v, the probability density function defined by d is

    pdf_d(v; θ) = ∑_{j=1}^{J} I_j(v, θ) · G_j(v, θ).

For example, given v drawn from the normal distribution (normal μ σ), we have a single pair with the trivial indicator I₁ = 1 and G₁ the usual Gaussian density N(v; μ, σ). Similarly, a uniformly sampled variable v from (uniform a b) has I₁ = 1[v − a ≥ 0] · 1[v − b < 0] and G₁ = 1/(b − a). Note that in practice one can, for simplicity, omit a pair from Φ_d when the probability density in the region denoted by the corresponding I_j is zero, as we have implicitly done for the uniform distribution here.
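In the same illustrative Python style as the sketch above, the pair sets Φ_d for these two distributions might look as follows (again our own sketch, not the PyLFPPL representation):

import math

def normal_pdf(v, mu, sigma):
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def phi_normal(mu, sigma):
    # single partition covering all of R, with a trivial indicator
    return [(lambda v: True, lambda v: normal_pdf(v, mu, sigma))]

def phi_uniform(a, b):
    # one non-zero partition: 1[v - a >= 0] * 1[v - b < 0], density 1/(b - a)
    return [(lambda v: v - a >= 0 and v - b < 0, lambda v: 1.0 / (b - a))]

def pdf(phi, v):
    # the sum in Definition 1: only the active partition contributes
    return sum(I(v) * G(v) for I, G in phi)

print(pdf(phi_uniform(0.0, 1.0), 0.3))   # 1.0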

5.2.2 Formal Translation Rules

The translation process e ⇓ (X, D, S, O) is defined recursively on the structure of e. We present this recursive definition using the standard inference-rule notation

    premise₁   …   premiseₙ
    ―――――――――――――――――――――――
           conclusion

which says that if the premises hold, then the conclusion holds too. Also, for real-valued functions f and g on real-valued inputs, we write f ∘ g to denote their composition. We now define the formal translation rules.

The first two rules define how we map a variable x and a constant c to their (trivial) unnormalized densities and the values at which they are evaluated.

The third rule allows one to translate a primitive operation op defined in LF-PPL, such as +, -, * and /, applied to argument expressions e₁, …, eₙ, which are translated first. The rule then combines the resulting sets by enumerating all pairs: the result of this operation over the translated sets consists of every possible combination of their elements. For example, given three sets which have three, one and two pairs respectively as their elements, the result set will have six pairs. This convention holds for the rest of the paper.
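This combination is essentially a Cartesian product over the argument sets, conjoining indicators and multiplying densities. A small Python sketch (our illustration) of the counting in the example above:

from itertools import product
import math

def combine(*pair_sets):
    # one combined pair per choice of a pair from each argument set:
    # indicators are conjoined and densities multiplied
    combined = []
    for choice in product(*pair_sets):
        combined.append((
            lambda tr, c=choice: all(I(tr) for I, _ in c),
            lambda tr, c=choice: math.prod(F(tr) for _, F in c),
        ))
    return combined

S1 = [(lambda tr: True, lambda tr: 1.0)] * 3  # three pairs
S2 = [(lambda tr: True, lambda tr: 2.0)]      # one pair
S3 = [(lambda tr: True, lambda tr: 3.0)] * 2  # two pairs
print(len(combine(S1, S2, S3)))               # 6 = 3 * 1 * 2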

The fourth rule, for the control-flow operation if, enables us to translate the predicate (< e₁ 0), its consequent e₂ and its alternative e₃. This provides the semantics to correctly construct a piecewise smooth function that can be evaluated on each of the partitions.

The translation rule for the sample statement generates a random variable from the specified distribution. During translation, we pick a fresh variable, i.e. a variable with a unique name, to represent this random variable and add it to the set X. We then compose the density of this variable according to the distribution and its corresponding parameters.

The translation rule for the observe statement, in contrast to the sample expression, factors the density according to the distribution object, with all parameters and the observed data evaluated.

The translation rule for let expressions first translates the definition e₁ of x and the body e₂ of the let, and then joins the results of these translations. When joining the S and O sets, the rule checks whether x appears in the sets from the translation of the body, and if so, it replaces x by the expression e₁ that defines it. Although let is defined with a single binding, we can construct rules translating let expressions that define and bind multiple variables by suitable desugaring.

Theorem 2.

If e is an expression that does not contain any free variables and e ⇓ (X, D, S, O), then the unnormalized density γ defined by e is of the form in Equation 1. It is a real-valued function on the variables in X, which is non-negative and piecewise smooth under analytic partition as per Definition 1.

The proof is provided in Appendix B. Through this set of translation rules, we have been able to prove that any program written in LF-PPL constructs a density of the form in Definition 1, which is piecewise smooth under analytic partition. Together with Theorem 1, this shows that the density is almost everywhere differentiable with discontinuities of measure zero, a necessary condition for several inference schemes such as DHMC [8].

5.3 A Compilation Example

We now present a simple example of how the compiler transforms the program in Figure 1 into the quadruple (X, D, S, O). The translation rules are applied recursively and, within each rule, all individual components are compiled eagerly first; namely, we step into each individual component and step out once it is fully compiled. A desugared version of the program is:

(let [x (sample (uniform 0 1))]
     (let [x_ (if (< (- q x) 0)
                   (observe (normal 1 1) y)
                   (observe (normal 0 1) y))]
         (< (- q x) 0)))

where q and y are constants and x_ is not used. The compilation proceeds through the following steps.

  1. Rule let. We start by looking at the outer let expression, with e₁ being the sample statement and e₂ corresponding to the entire inner let block. Before we can generate the output of this rule, we step into e₁ and e₂ and compile them accordingly.

  2. Rule sample. We then apply the sample rule on e₁ from step i, with each of its components evaluated first. For (uniform 0 1), the arguments 0 and 1 are constants, handled by the constant rule. uniform represents the uniform distribution and has Φ_uniform = {(1[v ≥ 0] · 1[v − 1 < 0], 1)}. After combining each set following the rule, with a fresh variable x, we have e₁ ⇓ ({x}, ∅, {(1[x ≥ 0] · 1[x − 1 < 0], 1)}, {(1, 1, x)}).

  3. Rule let. We now step into e₂ from step i, which is itself a let expression. Its definition e₂₁ is the entire if statement and its body e₂₂ is the return expression (< (- q x) 0). Similarly, we need to compile e₂₁ and e₂₂ before obtaining the result for e₂.

  4. Rule if. To apply the if rule on e₂₁, we again need to compile each of its individual components first. We start with its predicate (< (- q x) 0), whose inner expression (- q x) follows the op rule, with the operation - applied to the constant q and the variable x.

    The consequent and alternative both follow the observe rule. Taking (observe (normal 1 1) y) as an example, the parameters are constants and normal is the normal distribution with Φ_normal = {(1, N(y; 1, 1))}. Combining each set, we obtain (∅, ∅, {(1, 1)}, {(1, N(y; 1, 1), 0)}). Similarly, (observe (normal 0 1) y) gives (∅, ∅, {(1, 1)}, {(1, N(y; 0, 1), 0)}).

    With the predicate, the consequent and the alternative all evaluated, we can now continue the if rule. The key steps are to extract the sampled variables appearing in the predicate and put them into D, and to construct the indicator functions from the predicate, attaching them to the densities of each branch respectively. As a result, e₂₁ compiles to X = ∅, D = {x}, S = {(1, 1)}, and O = {(1[q − x < 0], N(y; 1, 1), 0), (1[q − x ≥ 0], N(y; 0, 1), 0)}.

  5. Rule for the body. For e₂₂ from step iii, (< (- q x) 0) compiles to (∅, {x}, {(1, 1)}, {(1[q − x < 0], 1, true), (1[q − x ≥ 0], 1, false)}).

  6. Result of the inner let. Together with the outcomes from steps iv and v, we can continue compiling the inner let block from step iii. Joining the sets, and dropping tuples whose combined indicators are contradictory and hence everywhere zero, it translates to (∅, {x}, {(1, 1)}, {(1[q − x < 0], N(y; 1, 1), true), (1[q − x ≥ 0], N(y; 0, 1), false)}).

  7. Result of the outer let. Finally, with e₁ compiled in step ii and e₂ in step vi, we step out to step i. It is worth emphasizing that the variables in X are the sampled ones rather than the names bound in the let expressions: x is the fresh variable introduced by the sample rule, and x_ never appears. Following the let rule, occurrences of the bound names are replaced by the expressions defining them, and we have the final quadruple output:

    e ⇓ ({x}, {x}, {(1[x ≥ 0] · 1[x − 1 < 0], 1)}, {(1[q − x < 0], N(y; 1, 1), true), (1[q − x ≥ 0], N(y; 0, 1), false)}).

From the quadruple, we have the overall density

    γ(x) = 1[x ≥ 0] · 1[x − 1 < 0] · ( 1[q − x < 0] · N(y; 1, 1) + 1[q − x ≥ 0] · N(y; 0, 1) ).

We can also detect when any random variable in D, in this case x, has crossed a discontinuity boundary, by checking the boolean value of the predicate of the if statement (< (- q x) 0), as discussed in Section 4.

6 Example Inference Engine: DHMC

We shall now demonstrate an example inference algorithm that is compatible with LF-PPL. Specifically, we provide an implementation of discontinuous HMC (DHMC) [8], a variant of HMC for performing statistically efficient inference on probabilistic models with non-differentiable densities, using LF-PPL as a compilation target. LF-PPL satisfies the necessary requirement of DHMC that the target density be piecewise smooth with discontinuities of measure zero. Given the quadruple output from LF-PPL, DHMC updates the variables in D with the coordinate-wise integrator and the remaining variables in X \ D with the standard leapfrog integrator. In an existing PPS without special support, the user would be required to manually specify all the discontinuous and continuous variables, in addition to implementing DHMC accordingly. See Appendix C for further details.
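For illustration, here is a minimal NumPy sketch of such a DHMC transition, assuming unit masses and a flat vector representation of the trace (our own illustration, not the PyLFPPL implementation); U and grad_U would be derived from Equation 1, and the boolean mask disc from the set D:

import numpy as np

def dhmc_step(x, U, grad_U, disc, eps, L):
    """One DHMC transition: leapfrog on continuous coordinates,
    coordinate-wise updates on discontinuous ones (cf. Appendix C)."""
    cont = ~disc
    # Gaussian momenta for continuous coords, Laplace momenta for discontinuous
    p = np.where(disc, np.random.laplace(size=x.size),
                 np.random.standard_normal(x.size))
    x0, p0 = x.copy(), p.copy()

    def H(x_, p_):  # total Hamiltonian: quadratic + Laplace kinetic energies
        return U(x_) + 0.5 * np.sum(p_[cont] ** 2) + np.sum(np.abs(p_[disc]))

    for _ in range(L):
        p[cont] -= 0.5 * eps * grad_U(x)[cont]   # half momentum step (continuous)
        x[cont] += 0.5 * eps * p[cont]           # half position step (continuous)
        for i in np.random.permutation(np.flatnonzero(disc)):
            x_star = x.copy()
            x_star[i] += eps * np.sign(p[i])     # fixed-size coordinate-wise move
            dU = U(x_star) - U(x)
            if np.abs(p[i]) > dU:                # enough kinetic energy: take the move
                x = x_star
                p[i] -= np.sign(p[i]) * dU
            else:                                # otherwise reflect the momentum
                p[i] = -p[i]
        x[cont] += 0.5 * eps * p[cont]           # remaining half position step
        p[cont] -= 0.5 * eps * grad_U(x)[cont]   # remaining half momentum step
    if np.random.rand() < np.exp(H(x0, p0) - H(x, p)):
        return x
    return x0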

6.1 Gaussian Mixture Model (GMM)

Figure 2: Mean squared error of the posterior estimates of the cluster means, relative to the true posterior. We compare the results from our unoptimized DHMC and the optimized PyMC3 NUTS with Metropolis-within-Gibbs, and show that the performance of the two is comparable for the same computation budget. The median MSE (dashed lines) with confidence intervals (shaded regions) over independent runs is plotted.
Figure 3: We compare DHMC against HMC on the worst mean absolute error (dashed lines) with confidence intervals (shaded regions) over 20 independent runs, for increasing dimensionality (left to right). We show how the sample efficiency changes with respect to sample size (top row) and with respect to runtime (bottom row) as the dimensionality increases. We see that the performance of HMC deteriorates significantly more than that of DHMC as the dimensionality increases.

In our first example, we demonstrate how a classic model, namely a Gaussian mixture model, can be encoded in LF-PPL. The density of the GMM contains a mixture of continuous and discrete variables, where the discrete variables lead to discontinuities in the density. We construct the GMM as follows:

    μ_k ∼ Normal(0, 2),   k = 1, …, K,
    z_n ∼ Categorical(π),   y_n | z_n, μ ∼ Normal(μ_{z_n}, 1),   n = 1, …, N,

where z_{1:N} are latent cluster assignments and y_{1:N} are observed data, with K the number of clusters and N the total number of data points. The Categorical distribution is constructed by a combination of uniform draws and nested if expressions, as shown in Appendix D. For our experiments, we considered a simple case with K = 2, N = 10 and π = (0.5, 0.5), along with the synthetic dataset given in Appendix D. We compared the mean squared error (MSE) of the posterior estimates for the cluster means from an unoptimized version of DHMC against an optimized implementation of NUTS with Metropolis-within-Gibbs (MwG) in PyMC3 [22], under the same computation budget, drawing the same number of samples with each method and discarding the initial portion as burn-in. We find that our DHMC implementation performs comparably to the NUTS with MwG approach. The results are shown in Figure 2 as a function of the number of samples.

6.2 Heavy Tail Piecewise Model

In our next example, we show how the efficiency of DHMC improves, relative to vanilla HMC, on discontinuous target distributions as the dimensionality of the problem increases. We consider the following density [7], which represents a hyperbolic-like potential function,

    γ(x) ∝ exp( −√(xᵀAx) ) · 1[ ‖x‖ − 3 < 0 ] + exp( −√(xᵀAx) − 1 ) · 1[ ‖x‖ − 3 ≥ 0 ].

It generates planes of discontinuities along the boundaries defined by the if expressions. To write this as a density in our language, we make use of the factor distribution object, as shown in Appendix D.

The results in Figure 3 provide a comparison between DHMC and standard HMC on the worst mean absolute error [7], as a function of both the number of iterations and the runtime. We see that as the dimensionality of the model increases, the per-sample performance of HMC deteriorates rapidly, as seen in the top row of Figure 3. Even though DHMC is more expensive per iteration than HMC due to its sequential coordinate-wise updates, in higher dimensions the additional time cost incurred by DHMC is much smaller than the rate at which the performance of HMC deteriorates. The reason for this is that the acceptance rate of the HMC sampler degrades with increasing dimension, while the coordinate-wise integrator of the DHMC sampler circumvents this.

7 Conclusion

In this paper we have introduced a Low-level First-order Probabilistic Programming Language (LF-PPL) and an accompanying compilation scheme for programs that have non-differentiable densities. We have theoretically verified the language semantics via a series of translation rules. These ensure that any program which compiles in our language has a density whose discontinuities are of measure zero. Our language, together with its compilation scheme, can therefore be used in conjunction with scalable inference algorithms adapted for non-differentiable densities, such as variants of HMC and SVI, as we have demonstrated with one such variant, discontinuous HMC. This provides a road map for incorporating other such inference algorithms into PPSs and shows the performance improvements they can bring over existing approaches.

Acknowledgements

Yuan Zhou is sponsored by China Scholarship Council (CSC). Bradley Gram-Hansen is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems (CDT in AIMS). Tom Rainforth’s research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071. Yang was supported by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921), and also by Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (2017M3C4A7068177). Kohn and Wood were supported by DARPA D3M, Intel as part of the NERSC Big Data Center, and NSERC under its Discovery grant and accelerator programs.

References

Appendix A Proof of Theorem 1

Proof.

Assume that γ is piecewise smooth under analytic partition. Thus,

    γ(x) = ∑_{i=1}^{N} ( ∏_{j=1}^{M_i} 1[p_{i,j}(x) ≥ 0] ) · ( ∏_{k=1}^{O_i} 1[q_{i,k}(x) < 0] ) · h_i(x)        (2)

for some N, M_i, O_i, p_{i,j}, q_{i,k} and h_i that satisfy the properties in Definition 1.

We use one well-known fact: the zero set of an analytic function is the entire ℝⁿ or has zero Lebesgue measure [30]. We apply this fact to each p_{i,j} and q_{i,k}, and deduce that the zero set of each is ℝⁿ or has measure zero. Note that if the zero set of p_{i,j} is the entire ℝⁿ, the indicator function 1[p_{i,j}(x) ≥ 0] becomes the constant-1 function, so that it can be omitted from the RHS of equation (2); similarly, if the zero set of q_{i,k} is the entire ℝⁿ, the indicator 1[q_{i,k}(x) < 0] is constantly zero, so the whole summand can be dropped. In the rest of the proof, we assume that this simplification has already been done, so that the zero set of every remaining p_{i,j} and q_{i,k} has measure zero.

For every i, we decompose the i-th region

    R_i = { x ∈ ℝⁿ : p_{i,j}(x) ≥ 0 for all j, and q_{i,k}(x) < 0 for all k }        (3)

into

    R_i⁰ = { x ∈ ℝⁿ : p_{i,j}(x) > 0 for all j, and q_{i,k}(x) < 0 for all k }  and  R_i \ R_i⁰.        (4)

Note that R_i⁰ is open because the p_{i,j} and q_{i,k} are analytic and so continuous, both (0, ∞) and (−∞, 0) are open, and the inverse images of open sets under continuous functions are open. This means that for each x ∈ R_i⁰, we can find an open ball at x inside R_i⁰, so that γ = h_i on the ball. Since h_i is smooth, this implies that γ is differentiable at every x ∈ R_i⁰.

For the other part R_i \ R_i⁰, we notice that

    R_i \ R_i⁰ ⊆ ⋃_{j=1}^{M_i} { x ∈ ℝⁿ : p_{i,j}(x) = 0 }.

The RHS of this equation is a finite union of measure-zero sets, so it has measure zero. Thus, R_i \ R_i⁰ has measure zero as well.

Since {R_i}_{i=1}^{N} is a partition of ℝⁿ, we have that

    ℝⁿ = ( ⋃_{i=1}^{N} R_i⁰ ) ∪ ( ⋃_{i=1}^{N} (R_i \ R_i⁰) ).

The density γ is differentiable on the first union, that of the R_i⁰. Also, since the union of finitely or countably many measure-zero sets has measure zero, the union of the R_i \ R_i⁰ has measure zero. Thus, we can set the set B required in the theorem to be this second union. ∎

Appendix B Proof of Theorem 2

Proof.

As shown in Equation 1, the density defined by a program factors as

    γ(x) = ( ∑_{k=1}^{K} I_k(x) · F_k(x) ) · ( ∑_{m=1}^{M} J_m(x) · G_m(x) ),

so it suffices to show that both factors are non-negative and piecewise smooth under analytic partition, because such functions are closed under multiplication.

We prove a more general result. For any expression e, let FV(e) be the set of its free variables. Also, if a function of the form in Definition 1 additionally satisfies that its h_i's are analytic, we say that this function is piecewise analytic under analytic partition. We claim that for all expressions e (which may contain free variables), if e ⇓ (X, D, S, O), where S = {(I_k, F_k)}_{k=1}^{K} and O = {(J_m, G_m, E_m)}_{m=1}^{M}, then ∑_k I_k · F_k and ∑_m J_m · G_m are non-negative functions on the variables in X ∪ FV(e), and they are piecewise analytic under analytic partition, as the F_k and G_m in the sums are analytic. These two properties in turn imply that γ is a function on the variables in X and is also piecewise analytic (and thus piecewise smooth) under analytic partition. Thus, the desired conclusion follows. The claim itself can be proved by induction on the structure of the expression e. ∎

Appendix C Discontinuous Hamiltonian Monte Carlo

The discontinuous HMC (DHMC) algorithm was proposed by [8]. It uses a coordinate-wise integrator, Algorithm 1, coupled with a Laplacian momentum to perform inference in models with non-differentiable densities. The algorithm works because the Laplacian momentum ensures that each discontinuous coordinate x_i moves in steps of fixed size ±ε/m_i, for fixed mass constants m_i and step size ε. This property is advantageous because it removes the need to know where the discontinuity boundaries between regions are; the change in potential energy between the states before and after the move tells us whether we have enough kinetic energy to move into the new region. If we do not have enough energy, we reflect the momentum backwards, p_i ← −p_i. Otherwise, we move into the new region, accepting the proposed coordinate update and decrementing the momentum by the change in potential energy. This is in contrast to algorithms such as reflective/refractive HMC [7], which explicitly need to know where the discontinuity boundaries are; hence the importance of a compilation scheme, such as ours, that automatically identifies the discontinuous variables.

The addition of the random permutation of indices ensures that the coordinate-wise integrator satisfies the reversibility criterion. Although the integrator does not reproduce the exact solution of the dynamics, it nonetheless preserves the Hamiltonian exactly, even if the density is discontinuous; see Lemma 1 and Theorems 2-3 in [8]. This yields a rejection-free proposal.

1: function Coordinatewise(x, p, ε)
2:     pick an appropriate random permutation σ on {1, …, |D|}
3:     for i in σ(1), …, σ(|D|) do
4:         x* ← x
5:         x*_i ← x_i + ε · sign(p_i)
6:         ΔU ← U(x*) − U(x)
7:         if |p_i| > ΔU then
8:             x ← x*
9:             p_i ← p_i − sign(p_i) · ΔU
10:        else
11:            p_i ← −p_i
12:        end if
13:    end for
14:    return (x, p)
15: end function
Algorithm 1 Coordinate-wise integrator, with U the potential energy and D the set of discontinuous variables. A random permutation σ on {1, …, |D|} is appropriate if the induced random sequences (σ(1), …, σ(|D|)) and (σ(|D|), …, σ(1)) have the same distribution.

The DHMC algorithm [8], adapted for LF-PPL and our compilation scheme, is then as follows:

1: function DHMC-LFPPL(x, ε, L)
2:     x⁰ ← x
3:     for x_i ∈ X \ D do ▷ X \ D represents the set of continuous variables
4:         p_i ∼ Normal(0, 1)
5:     end for
6:     for x_i ∈ D do ▷ D represents the set of discontinuous variables
7:         p_i ∼ Laplace(0, 1)
8:     end for
9:     x_c, p_c ← values and momenta of the variables in X \ D ▷ continuous part
10:    x_d, p_d ← values and momenta of the variables in D ▷ discontinuous part
11:    H⁰ ← H(x, p)
12:    for t = 1 to L do
13:        x_c, p_c ← Halfstep1(x_c, p_c)
14:        x_d, p_d ← Coordinatewise(x_d, p_d, ε)
15:        x_c, p_c ← Halfstep2(x_c, p_c)
16:    end for
17:    x′, p′ ← (x_c, x_d), (p_c, p_d)
18:    u ∼ Uniform(0, 1)
19:    if u < exp(H⁰ − H(x′, p′)) then
20:        return x′
21:    else
22:        return x⁰
23:    end if
24: end function
25: function Halfstep1(x_c, p_c)
26:     p_c ← p_c − (ε/2) · ∇_{x_c} U(x)
27:     x_c ← x_c + (ε/2) · p_c
28:     return (x_c, p_c)
29: end function
30: function Halfstep2(x_c, p_c)
31:     x_c ← x_c + (ε/2) · p_c
32:     p_c ← p_c − (ε/2) · ∇_{x_c} U(x)
33:     return (x_c, p_c)
34: end function
Algorithm 2 Discontinuous HMC integrator for LF-PPL. x is a map from random-variable names in X to their values, H is the total Hamiltonian, ε is the step size, and L is the trajectory length.

Appendix D Program code

(let [y (vector -2.0 -2.5 ... 2.8)
      pi [0.5 0.5]
      z1 (sample (categorical pi))
      ...
      z10 (sample (categorical pi))
      mu1 (sample (normal 0 2))
      mu2 (sample (normal 0 2))
      mus (vector mu1 mu2)]
  (if (< (- z1) 0)
      (observe (normal mu1 1) (nth y 0))
      (observe (normal mu2 1) (nth y 0)))
  ...
  (if (< (- z10) 0)
      (observe (normal mu1 1) (nth y 9))
      (observe (normal mu2 1) (nth y 9)))
  (mu1 mu2 z1 ... z10))
Figure: The LF-PPL version of the Gaussian mixture model detailed in Section 6.

(let [x (sample (uniform -6 6))
      abs-x (max x (- x))
      z (- (sqrt (* x (* A x))))]
  (if (< (- abs-x 3) 0)
      (observe (factor z) 0)
      (observe (factor (- z 1)) 0))
  x)
Figure: The LF-PPL version of the heavy-tailed model detailed in Section 6.