 # Deriving Probability Density Functions from Probabilistic Functional Programs

The probability density function of a probability distribution is a fundamental concept in probability theory and a key ingredient in various widely used machine learning methods. However, the necessary framework for compiling probabilistic functional programs to density functions has only recently been developed. In this work, we present a density compiler for a probabilistic language with failure and both discrete and continuous distributions, and provide a proof of its soundness. The compiler greatly reduces the development effort of domain experts, which we demonstrate by solving inference problems from various scientific applications, such as modelling the global carbon cycle, using a standard Markov chain Monte Carlo framework.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Probabilistic programming promises to arm data scientists with declarative languages for specifying their probabilistic models, while leaving the details of how to translate those models to efficient sampling or inference algorithms to a compiler. Many widely used machine learning techniques that might be employed by such a compiler use the probability density function (pdf) of the model as input. Such techniques include maximum likelihood or maximum a posteriori estimation, estimation, importance sampling, and Markov chain Monte Carlo (mcmc) methods (Scott, 2001; Bishop, 2006).

However, despite their utility, density functions have been largely absent from the literature on probabilistic functional programming (Ramsey and Pfeffer, 2002; Goodman et al., 2008; Kiselyov and Shan, 2009). This is because the relationship between programs and their density functions is not straightforward: for a given program, the pdf may not exist or may be non-trivial to calculate. Such programs are not merely infrequent pathological curiosities but in fact arise in many ordinary scenarios. In this paper, we define, prove correct, and implement an algorithm for automatically computing pdfs for a large class of programs written in a rich probabilistic programming language. An abridged version of this paper was published as (Bhat et al., 2013).

##### Probability density functions.

In this work, probabilistic programs correspond directly to probability distributions, which are important because they are a powerful formalism for data analysis. However, many techniques we would like to use require the probability density function of a distribution instead of the distribution itself. Unfortunately, not every distribution has a density function.

Distributions. One interpretation of a probabilistic program is that it is a simulation that can be run to generate a random sample from some set of possible outcomes. The corresponding probability distribution characterizes the program by assigning probabilities to different subsets of (events). The probability for a subset of corresponds to the proportion of runs that generate an outcome in , in the limit of an infinite number of repeated runs of the simulation.

Consider for example a simple mixture of Gaussians, here written in Fun (Borgström et al., 2011), a probabilistic functional language embedded within F# (Syme et al., 2007).

if flip 0.7 then random(Gaussian(0.0, 1.0)) else random(Gaussian(4.0, 1.0))

The program above specifies a distribution on the real line ( is

) and corresponds to a generative process that flips a biased coin and then generates a number from one of two Gaussian distributions, both with standard deviation 1.0 but with mean either 0.0 or 4.0 depending on the result of the coin toss. In this example, we will be more likely to produce a value near 0.0 than near 4.0 because of the bias. The probability

for , for instance, is the proportion of runs that generate a number between and .

Densities. A distribution is a function that takes subsets of as input, but for many purposes it turns out to be more convenient if we can find a function that takes elements of directly, while still somehow capturing the same information as .

When is the real line, we are interested in a function that satisfies for all intervals , and we call the probability density function (pdf) of the distribution . In other words, is a function where the area under its curve on an interval gives the probability of generating an outcome falling in that interval. The pdf of our example (pictured below) is given by the function

 f(x)=0.7⋅\lstinline{{\lst@@@set@language\lst@@@set@numbers% \lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language{% \@listingGroup{ltx_lst_identifier}{\small\color[rgb]{0,0.4,0}pdf% \textunderscore Gaussian}}(0.0,{\@listingGroup{ltx_lst_space}{~{}}}1.0,{% \@listingGroup{ltx_lst_space}{~{}}}{\@listingGroup{ltx_lst_identifier}{\small% \color[rgb]{0,0.4,0}x}})}}+0.3⋅\lstinline{{\lst@@@set@language% \lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language% \lst@@@set@language{\@listingGroup{ltx_lst_identifier}{\small\color[rgb]{% 0,0.4,0}pdf\textunderscore Gaussian}}(4.0,{\@listingGroup{ltx_lst_space}{~{}}}% 1.0,{\@listingGroup{ltx_lst_space}{~{}}}{\@listingGroup{ltx_lst_identifier}{% \small\color[rgb]{0,0.4,0}x}})}}

where pdf_Gaussian(mean, sdev, ) is the pdf of a Gaussian distribution with mean mean and standard deviation sdev (the famous “bell curve” from statistics).

The pdf takes higher values where the generative process described above is more likely to generate an outcome. Now we see that the aforementioned probability for is simply the area under this curve between and . Note that, while there are some loose similarities, the expression for the pdf is different from the expression comprising the source program. In more complicated programs, the correspondence with the pdf is even less obvious.

Non-existence. Sometimes a distribution does not have a pdf. For example, if we change the else-clause in our example to return directly, instead of drawing from a Gaussian with mean , we get the following probabilistic program, which does not have a pdf:

if flip 0.7 then random(Gaussian(0.0, 1.0)) else 4.

In short, the problem is that there is a non-zero amount of probability mass located on a zero-width interval (the process now returns with probability ), but integrals on such intervals yield zero, so we would never find a function that could satisfy the properties of being a pdf.

It is not always obvious which program modifications ruin the property of having a pdf, especially for multivariate distributions (thus far we have only given univariate distributions as examples). This can be a problem if one is innocently exploring different variations of the model. Details and examples are given by Bhat et al. (2012), who provide the theory for addressing this problem, which we extend and implement in this work.

##### The task of data analysis.

So far we have detailed what pdfs are, but not why we want them. We motivate the desire by explaining one popular use-case of pdfs that arises when applying Bayesian learning to data analysis.

In the previous examples, the programs specify fully-known probability distributions. While there is indeed randomness in the probabilistic behavior of the samples they generate, the nature of this uncertainty is entirely known—we know the means and variances of the Gaussians, as well as the bias between the two, thus we can characterize the random behavior.

In real-world analysis tasks, we rarely have this luxury. Instead, we are often in the position of trying to figure out what the parameters should be, given the data we see. Thus, applications typically deal with parameterized models (families of distributions indexed by a parameter), and they try to learn something about which distributions in that family best explain the observed data. For example, in the following, moG is a parameterized model, indexed by the parameter (mA,mB):

let moG (mA,mB) =
if flip 0.7 then random(Gaussian(mA, 1.0)) else random(Gaussian(mB, 1.0))

It specifies an infinite number of distributions, one for each choice of mA and mB.

Now, as a data analyst, we may be presented a dataset that is a sequence of numeric values, and we may also have some domain-specific reason to believe that it can be well modelled as a biased mixture of Gaussians as specified by moG111Whether this is actually an appropriate modelling choice depends on whether it captures enough of the essence of the true, unknown data-generating process (i.e. Nature), and is an entirely separate discussion.. We now face the task of figuring out which choices of mA and mB are likely. Intuitively, if we see a clump of datapoints around 0.0, we might be inclined to believe that the mean of one of the Gaussians is 0.0. Note that this is a one-dimensional, probabilistic version of the problem of clustering. The means are the cluster centroids.

##### Bayesian learning.

Bayesian modelling formalizes this task by requiring the modeller to provide as input a prior distribution over the parameters and a generating distribution over the data. These are used to construct a posterior distribution over the parameters as the output.

The prior distribution is a distribution over the possible values of the parameters and captures our belief about what values they are likely to take, before having seen any data (to a Bayesian, “belief” is synonymous with “probability distribution”). The following is one possible prior:

let prior () =
let mA = random(Uniform(-10.0, 10.0)) in
let mB = random(Uniform(-10.0, 10.0)) in
(mA, mB)

This specifies that we are certain that the means lie between -10.0 and 10.0, but are otherwise uncertain, and our uncertainty about each mean is uniformly distributed between -10.0 and 10.0. This is of course a very particular assertion and is hopefully informed by domain knowledge. In practice, in the absense of domain knowledge, we can select a prior that reflects our open-mindedness about which values the means can take, such as a pair of Gaussians with a high standard deviation. The prior

produces a distribution over pairs of means as output, rather than taking a pair of means as an input, as moG does.

The generating distribution is a model of how we believe Nature is generating a dataset given a specific choice of the parameter. In our example this is given by moG together with

let gen n (mA,mB) = [| for i in 1 .. n ->  moG (mA,mB) |]

This specifies a (parameterized) model for a dataset that is generated as an array of n independent and identically distributed (i.i.d.) values generated by moG.

The posterior distribution is a distribution over the possible values of the parameters and captures our belief about what values they are likely to take, after seeing the data. The posterior is related to the prior and generating distributions by Bayes’ rule, which gives us a way to describe how the prior is updated with the observed data to yield the posterior. Intuitively, this update represents the fact that our understanding about the world (our belief about the parameters) evolves based on the data we see. Bishop provides an excellent account of Bayesian learning (Bishop, 2006).

##### Use-case of pdfs: Bayesian inference with mcmc

Unfortunately, while prior distributions and generating distributions are often straightforward to work with (we have control over them as the modeller), the posterior distributions often end up intractable or unwieldy to work with (Bayes’ rule dictates their form).

Markov chain Monte Carlo (mcmc) methods are one class of techniques that let us actually do something productive with the posterior distribution—mcmc can be used to generate samples from the posterior distribution. The idea of mcmc is to construct a Markov chain in the parameter space of the model, whose equilibrium distribution is the posterior distribution over model parameters. Neal (1993) gives an excellent review of mcmc methods. Here we use Filzbach (Purves and Lyutsarev, 2012), an adaptive mcmc sampler based on the Metropolis-Hastings algorithm. All that is required for such algorithms is the ability to calculate the posterior density given a set of parameters, up to proportion. The posterior does not need to be from a mathematically convenient family of distributions. Samples from the posterior can then serve as its representation, or be used to calculate marginal distributions of parameters or other integrals under the posterior distribution.

The posterior density is a function of the pdfs of the various pieces of the model, so to perform inference using mcmc, we also need functions to compute the pdfs. Below, pdf_moG gives the pdf of a single data point, while pdf_gen gives the pdf of an array of independent data points drawn from the same distribution (iid).

let pdf_prior (mA,mB) = pdf_Uniform(-10.0, 10.0, mA) * pdf_Uniform(-10.0, 10.0, mB)
let pdf_moG (mA,mB) x = 0.7 * pdf_Gaussian(mA, 1.0, x) + 0.3 * pdf_Gaussian(mB, 1.0, x)
let pdf_gen (mA,mB) xs = product [| for x in xs -> pdf_moG (mA,mB) x |]

(The product function multiplies together the elements of an array, returning on the empty array.) Filzbach and other mcmc libraries require users to write these three functions222The actual implementation works with log-densities, as discussed in Section 4., in addition to the generative probabilistic functions prior and gen (which are used for model validation).

The goal of this paper is to instead compile these density functions from the generative code. This relieves domain experts from having to write the density code in the first place, as well as from the error-prone task of manually keeping their model code and their density code in synch. Instead, both the pdf and synthetic data are derived from the same declarative specification of the model.

##### Contributions of this paper.

This work defines and applies automated techniques for computing densities to real inference problems from various scientific applications. The primary technical contribution is a density compiler that is correct, useful, and relatively simple and efficient. Specifically:

• We provide the first implementation of a density compiler based on the specification by Bhat et al. (2012). We compile programs in the probabilistic language Fun (described in Section 2.1) to their corresponding density functions (Section 3).

• We prove that the compilation algorithm is sound (Theorem 3.4). This is the first such proof for any variant of this compiler.

• We show that the compiler greatly reduces the development effort of domain experts by freeing them from writing tricky density code and that the produced code is comparable in performance to density functions hand-coded by experts. Our evaluation is based on textbook examples and on models from ecology (Section 4).

### 1.1. Synthetic Examples

Here are some examples of conditionals on random variables, that are not covered by the naive compiler in the model-learner paper.

Let p = random(Uniform) in if random(Bernoulli(p)) then random(Gaussian(p,1.0)) else random(Gaussian(-p,1.0))
Let p = random(Uniform) in if random(Bernoulli(p)) then p else -p

## 2. Languages

In order to describe the density compiler, we first specify its input (source) and output (target) language. Both languages are variants of a simple first-order functional language where the results of subcomputations can be bound to variables using a let construct.

### 2.1. Fun: Probabilistic Expressions (Review)

Our source language is a version of the core calculus Fun (Borgström et al., 2011), without observation. To mark certain program points as impossible, we add a fail construct (Kiselyov and Shan, 2009). Fun is a first-order functional language without recursion that extends the language of Ramsey and Pfeffer (2002), and this version has a natural semantics in the sub-probability monad. Our implementation efficiently supports a richer language with records and fixed-size arrays and array comprehensions, which can be given a semantics in this core (records and arrays can be encoded as tuples, and comprehensions of fixed size as their unrolling).

#### 2.1.1. Syntax and Types of Fun:

The language Fun has base types int, real and unit, product types (denoting pairs), and sum types (denoting tagged unions). A type is said to be discrete if it does not contain real. We let range over constant data of base type, over integers and over real numbers. We write to mean that constant has type .

 Types of Fun: t,u::=\lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language{\@listingGroup{% ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}int}}}}∣\lstinline{{% \lst@@@set@language\lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor% \lst@@@set@language\lst@@@set@language{\@listingGroup{ltx_lst_keyword}{\color[% rgb]{0,0.1,0.5}real}}}}∣\lstinline{{\lst@@@set@language% \lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language% \lst@@@set@language{\@listingGroup{ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}unit% }}}}∣(t1∗t2)∣(t1+t2)

We take , and let associate to the right. We assume a collection of total deterministic functions on these types, including arithmetic and logical operators. Each operation of arity has a signature of the form val :  *  *  -> . We also assume standard families of primitive probability distributions, including the following. Distributions:     Above, the names of the arguments to the distributions are present for documentation only. A distribution corresponds to a coin flip with probability bias to come up true. The distribution describes the number of occurrences of independent events that occur at the given average rate. The Gaussian

distribution is also known as the normal distribution; its

pdf has a symmetrical bell shape. The Beta distribution is a suitable prior for the parameter of Bernoulli distributions, and intuitively means that counts of true and events of false have been observed. Similarly the Gamma(shape,scale) distribution is a suitable prior for the parameter of Poisson. The parameters of distributions only make sense within certain ranges (e.g., the bias of the Bernoulli distribution must be in the interval ). Outside these ranges, attempting to draw a value from the distribution (e.g., Bernoulli) results in a failure (fail below). Expressions of Fun:   value variable scalar constant tuple constructor left sum constructor right sum constructor expression variable and scalar constant pairing and projections from a pair sum constructors matching (scope of is ) let (scope of is ) primitive operation (deterministic) primitive distribution failure   The let and match statements bind their variables (); we identify expressions up to alpha-renaming of bound variables. Above, inl (resp. inr) generates a value corresponding to the left (resp. right) branch of a sum type. Values of sum type are deconstructed by the match construct, which behaves as either the left () or the right () branch depending on the result of .

To ensure that a program has at most one type in a given typing environment, and are annotated with a type (see 2.1.1 below). The expression fail is annotated with the type it is used at. These types are included only for the convenience of our technical development, and can usually be inferred given a typable source program: we omit these types where they are not used. () is the unit constant.

A source language term is pure, written “”, iff does not contain any occurrence of random or fail.

We write Uniform for Beta(1.0,1.0). In the binders of let and match expressions, we let  stand for a variable that does not appear free in the scope of the binder. We make use of standard sugar for let, such as writing for . We write if then else for ; this is most commonly used when is Boolean. We let the tuple stand for . Similarly, we write for when .

When is a term from some language (possibly with binders), we write if none of the appear free in .

We write to mean that in the type environment ( distinct) the expression has type . Apart from the following, the typing rules are standard. In 2.1.1, (Fun Inr) (not shown) and 2.1.1, type annotations are used in order to obtain a unique type. In 2.1.1, a random variable drawn from a distribution of type has type .

 Selected Typing Rules: Γ⊢M:t

Substitutions, ranged over by , are finite maps from variables to pure expressions. We write for the result of substituting all free occurrences of variables in with , avoiding capture of bound variables. To compose two substitutions with disjoint domains, we write for . A substitution is called closed if the expressions in its range do not contain any free variables. A value substitution is a substitution where each expression in its range is a value. Below, we define what it means for a closed value substitution to be a valuation for a type environment. Typing Rules for Closed Value Substitutions:

There is a default value at each type , written , that is returned from operations where they otherwise would be undefined, e.g. . Default Value:

#### 2.1.2. Semantics of Fun

As usual, for precision concerning probabilities over uncountable sets, we turn to measure theory. The interpretation of a type is the set of closed values of type (real numbers, integers etc.). Below we consider only Lebesgue-measurable sets of values, defined using the standard (Euclidian) metric, and ranged over by . Indeed, the power of the axiom of choice is needed to construct a non-measurable set (Solovay, 1970).

A measure over is a function, from (measurable) subsets of to the non-negative real numbers extended with , that is -additive, that is, and if are pair-wise disjoint. We write for ; the measure is called a probability measure if , and a sub-probability measure if .

We associate a default or stock measure to each type, inductively defined as the counting measure on and , the Lebesgue measure on , and the Lebesgue-completion of the product and disjoint sum, respectively, of the two measures for and . In particular, the counting measure on a discrete type assigns measure to all sets of finite size , and measure to all infinite sets.

If is a non-negative (measurable) function , we let be the Lebesgue integral of with respect to the stock measure on if the integral is defined, and otherwise 0. This integral coincides with for discrete types , and with the standard Riemann integral (if it is defined) on . We write for , and for Lebesgue integration with respect to the measure on . Below, we often elide the index ; indeed, we may consider any function as a function from the measurable space that is zero except on .

The Iverson brackets are 1.0 if predicate is true, and 0.0 otherwise. We write for when . The function is a density of (with respect to the stock measure) if = for all . If is a (sub-)probability measure, then we say that as above is its pdf.

To turn expressions into density functions, we first need a way of interpreting a closed Fun expression as a sub-probability measure over its return type. Open fail-free Fun expressions have a straightforward semantics (Ramsey and Pfeffer, 2002) as computations in the probability monad (Giry, 1982). In order to treat the fail primitive, we use an existing extension (Gordon et al., 2013) of the semantics of Ramsey and Pfeffer (2002) to a richer monad: the sub-probability monad (Panangaden, 1999)333Sub-probabilities are also used in our compilation of match (and if) statements, where the probability that we have entered a particular branch may be less than 1.. Compared to the operations of the probability monad, the sub-probability monad additionally admits a constant, yielding the zero measure. To accommodate the zero measure, the carrier set is extended from probability measures to sub-probability measures, i.e., admitting all with .

Below we recapitulate the semantics of Fun by Gordon et al. (2013). Here is a closed value substitution whose domain contains all the free variables of , and ranges over , , , and . We also let and .

 Monadic Semantics of Fun with fail: P[[M]] σ (μ>>=f) A≜∫f(x)(A)dμ(x) Sub-probability monad’s bind (returnV) A≜1~{}if~{}V∈A,~{}else~{}0 Sub-probability monad’s return zero A≜0 Sub-probability monad’s zero Below we assume that z♯N,N1,N2,σ and x,x1,x2♯z,σ. P[[x]] σ≜returnσ(x) P[[c]] σ≜returnc P[[detOp(M)]] σ≜P[[M]] σ>>=λx.returndetOp(x) P[[\lstinline{{\lst@@@set@language% \lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language% \lst@@@set@language{\@listingGroup{ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}fail% }}}}]] σ≜zero

We let the semantics of a closed expression be , where denotes the empty substitution.

If and then is a sub-probability measure on type .

###### Proof.

By induction on . ∎

### 2.2. Target Language for Density Computations

For the target language of the density compiler, denoted un, we use a pure version of Fun augmented with real-valued first-order functions and stock integration.

 Expressions of the Target Language: E,F E,F::= target expression x∣c variable and scalar constant (E,F)∣\lstinline{{\lst@@@set@language\lst@@@set@numbers% \lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language{% \@listingGroup{ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}fst}}}} E∣% \lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language{\@listingGroup{% ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}snd}}}} E pairing and projections from a pair \lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language{\@listingGroup{% ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}inl}}}}uE∣\lstinline{{% \lst@@@set@language\lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor% \lst@@@set@language\lst@@@set@language{\@listingGroup{ltx_lst_keyword}{\color[% rgb]{0,0.1,0.5}inr}}}}tE sum constructors matching (scope of xi is Fi) \lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language{\@listingGroup{% ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}let}}}} x=E \lstinline{{% \lst@@@set@language\lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor% \lst@@@set@language\lst@@@set@language{\@listingGroup{ltx_lst_keyword}{\color[% rgb]{0,0.1,0.5}in}}}} F let (scope of x is F) op(E) primitive operation ∫tλ(x1,…,xn). E stock integration

Above, the binders in let and match are as in Fun. Additionally, in the variables bind into .

If a Fun term is pure then is also an expression in the syntax of un, and we silently treat it as such. In particular, a Fun substitution is also a valid un substitution, and substitution application for un is defined in the same way as for Fun.

The typing rule involving integration is as follows. The other typing rules are as in Fun. Typing Rule for Integration:

[Standard results for the type system of un ]

1. Substitution lemma: if and , then .

2. Strengthening: if and , then .

3. Weakening: if and , then .

#### 2.2.1. Semantics of ∫un

The target language un is equipped with a denotational semantics, written where is a substitution of closed values for variables with . We define this semantics by re-interpreting the monadic semantics of Subsection 2.1.2 with respect to the identity monad: in this monad, is the identity function, and bind ordinary function application. We rely on an auxiliary semantics that returns a function to be integrated.

 Identity Monad and Denotational Semantics of ∫un: I[[λ(z1,…,zn). E]]σ and I[[E]]σ V>>=f≜f(V) Identity monad’s bind (returnV)≜V Identity monad’s return (the other cases of I[[E]]σ are the same as the monadic semantics in Subsection 2.1.2) I[[λ(z1,…,zn). F]]σ≜f where~{}f(V1,…,Vn)≜I[[F]]σ[z1:=V1,…,zn:=Vn]% and z1,…,zn♯σ

We write if there are such that and and for all such that we have .

1. If and then is a value of type .

2. If and and then is a function of type .

###### Proof.

(1) and (2) are proved jointly, by induction on . ∎

## 3. The Density Compiler

We compute the pdf of a Fun program by compilation into un. Our compilation is based on that of Bhat et al. (2012), with modifications to treat fail statements, match (and general if) statements, pure (i.e., deterministic) let bindings, and integer arithmetic.

The compiler translates a well-typed Fun source term into a function computing the density (PDF) of . Given an implementation of stock integration, the un expression may be executed to evaluate the density of at any value of the type of . Like traditional compilers, our compiler is compositional and deterministic, producing a unique translation if any at all (Lemma 3.4). Unlike traditional compilers, our compiler is partial and will fail to produce a translation for some well-typed source terms. In particular, if does not have a density function then the compiler will fail to produce an . However, it may also fail if has a PDF, but the compiler is just not complete enough to compute it. In particular, let-bound expressions must either be pure or have a PDF, even if their result is not used. The correctness statement for the compiler is given by Theorem 3.4.

We will use a version of the moG function from the introduction as a running example (Figure 1), with some expansion in order to make use of more of the translation rules.

The structure of this section is as follows. In Section 3.1 we provide an intuitive outline of the compilation. We make preliminary definitions, such as the syntax of probability contexts , in Section 3.2. We define the compiler itself in Section 3.3 in terms of a couple of judgments. These judgments are inductively defined relations, but they in fact are partial functions and hence have a direct executable interpretation. Finally, in Section 3.4 we state and prove correctness of the compiler.

### 3.1. Outline

The simplest case in the density compilation is fail, which compiles to the function that always returns zero. The compilation works on the let-structure of the term: a sequence of random lets, as in is compiled to the product of the pdfs of the distributions

, following the chain rule of probability.

If the sequence of lets instead has a discrete deterministic return expression , then has a probability for each possible value . This probability is computed by integrating the joint pdf of over the set of values where evaluates to . A continuous deterministic return expression is treated as a mathematical function , using the change of variables rule of integration. In the one-dimensional case, if has inverse , the pdf of at is given by the pdf of at , multiplied with the derivative of . Another simple case is projection , where we simply integrate the joint pdf over the set of all possible values for the other variables .

If a distribution Dist returns a sum type (e.g., Bernoulli) we can write for the subdistribution yielding only the left part of the sum, and for the right part. By additivity of probability, we can compile the match expression to the sum of the pdfs of and .

In a nested let, such as , the expression bound to denotes some subprobability distribution. We compute its pdf by recursively compiling the inner let sequence, holding fixed. Pure lets, as in where is pure, have the same pdf as . The compilation algorithm applies the substitution lazily to avoid introducing unnecessary copies of .

### 3.2. Probability contexts

The density compilation is based on the let-structure of the expression. Variables that are bound in outer lets are referred to as parameters, and are treated as constants. A probability context gathers the variables that are bound in the current sequence of lets, together with the pure expressions defining the deterministic variables.

 Probability Context: Υ Υ::= probability context ϵ empty context Υ,x random variable Υ,x=E deterministic variable

The probability context at line 7 of Figure 1 is branch, temp, resulttemp + mB, containing two random variables and one deterministic variable.

For a probability context to be well-formed, it has to be well-scoped and well-typed. Well-formed probability context:           The probability context is well-formed: , where the type context also contains the types of the parameters mA and mB.

Given a well-formed context , we can extract the random variables , and an idempotent substitution (i.e., always) that gives values to the deterministic variables. Random variables and values of deterministic variables

We have and .

If then

###### Proof.

By simultaneous induction on the derivations of and . ∎

### 3.3. Compilation rules

A probability context is used together with a density expression ( below), which is an open term that expresses the joint density of the random variables in the context and the constraints that have been collected when choosing branches in match statements. Intuitively, the density expression is the body of the pdf of the current branch. The main judgment of the compiler is , which associates a Fun term with its density function . Parameters may occur free in , and binds into . The auxiliary judgment yields a density expression for the variables in its argument, marginalizing (i.e., integrating) out all other random variables in from . Inductively Defined Judgments of the Compiler:   in the function gives the pdf of in expression gives the density of the variables in

We begin the description of the compiler proper with the following judgment of marginal density, which computes an expression for the joint marginal pdf of the random variables in its argument. The variables in the argument are free in the computed expression. Below, we write for the tuple of variables arising from by deleting all instances of variables in , and dually for . Marginal Density:     Here we first substitute in the deterministic let-bound variables, as given by , and then integrate out the remaining random variables . In the definition of the compiler, marg() is also used with , to compute the probability of being in the current branch of the program.

Here where

 F\lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language\lst@@@set@language% \lst@@@set@numbers{\@listingGroup{ltx_lst_identifier}{\small\color[rgb]{% 0,0.4,0}temp}}}}:=∫λ(\lstinline{{\lst@@@set@language% \lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language% \lst@@@set@language\lst@@@set@language\lst@@@set@numbers{\@listingGroup{% ltx_lst_identifier}{\small\color[rgb]{0,0.4,0}branch}}}}).E[\lstinline% {{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor% \lst@@@set@language\lst@@@set@language\lst@@@set@language\lst@@@set@numbers{% \@listingGroup{ltx_lst_identifier}{\small\color[rgb]{0,0.4,0}result}}}}↦\lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language\lst@@@set@language% \lst@@@set@numbers{\@listingGroup{ltx_lst_identifier}{\small\color[rgb]{% 0,0.4,0}temp}}}}+\lstinline{{\lst@@@set@language\lst@@@set@numbers% \lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language% \lst@@@set@language\lst@@@set@numbers{\@listingGroup{ltx_lst_identifier}{% \small\color[rgb]{0,0.4,0}mB}}}}],

which will be used when computing the pdf of the variable result on line 7 (cf. Examples 3.33.3).

The main judgment of the compiler is the dens judgment , which gives the density of in the current context , where is the accumulated body of the density function so far. In this judgment, is binding into . We introduce fresh variables during the compilation: in the rules below we assume that . Density Compiler, base cases:              For a deterministic variable, 3.3 recurses into its definition. The rule 3.3 computes the marginal density of a random variable using the marg judgment. The 3.3 rule states that the probability density of a discrete constant (built from sums and products of integers and units) is the probability of being in the current branch at , and 0 elsewhere. Note the absence of a rule for real number constants, since they do not possess a density. The 3.3 rule gives that the density of fail is zero.

By 3.3 we get as computed in Example 3.3.

To compute the pdf at line 7, 3.3 yields that where is computed in Example 3.3.

 Density Compiler, random variables : Υ;E⊢dens(M)⇒λz. F

In 3.3, a random variable drawn from a primitive distribution with a constant argument has the expected pdf (multiplied with the probability that we are in the current branch). Its precondition that and intuitively means that is constant under . 3.3 treats calls to random with a random argument by marginalizing over the argument to the distribution. We here require that each primitive distribution Dist has a pdf for each value of its arguments, denoted . Using rule 3.3, we can compute the density at line 4 as
where intuitively yields the probability of being in the current branch, and is computed using 3.3 as .

 Density Compiler, rules for tuples: Υ;E⊢dens(M)⇒λz. F

The rule 3.3 computes the joint marginal density of a tuple of variables444Joint marginal densities for tuples of expressions can be computed if those expressions are conditionally independent (Bhat et al., 2012). As an example, has a pdf whenever does. However, the rules in this paper do not support such expressions, to avoid additional complexity.. The rules 3.3 and 3.3 integrate out the right or the left component of a pair, respectively. Density Compiler, let:        The rule 3.3 simply adds a pure let-binding to the context. In 3.3, we compute the density of the let-bound variable in an empty context, and multiply it into the current accumulated density when computing the density of the body. The density expression for the program fragment in Figure 1 is computed using 3.3 as where , the expression is lines 2-7, and is computed by using previously seen rules. Here , since .

The let expression on line 2 of Figure 1 is also handled by 3.3, while the one on line 6 is handled by 3.3 since .

For deterministic matches we use four deterministic operations, which we assume do not occur in source programs. We let be the indicator function for the left branch defined as , and dually for isR. To destruct a value of sum type we use defined as , and its dual fromR.

 Density Compiler, rules for sums and match: Υ;E⊢dens(\lstinline{{\lst@@@set@language% \lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language% \lst@@@set@language\lst@@@set@language\lst@@@set@numbers{\@listingGroup{% ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}match}}}} M \lstinline{{% \lst@@@set@language\lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor% \lst@@@set@language\lst@@@set@language\lst@@@set@language\lst@@@set@numbers{% \@listingGroup{ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}with}}}}…)⇒λz. F

3.3 states that the density of is the density of in the left branch of a sum, and 0 in the right. Its dual is 3.3. 3.3 is based on 3.3, and we additionally multiply the constraint that we are in the correct branch ( or ) with the joint density expression. We employ the functions fromL and fromR and their associated rules 3.3 and 3.3 to avoid additional calls to 3.3 arising from 3.3 if the compilation of the density of requires computing the density of the match-bound variable , as in . Since we assume that fromL and fromR do not appear in source programs, these rules are only ever used in the case described above. The 3.3 rule is based on 3.3, and we again multiply in the constraint that we are in the left (or right) branch of the match. The match selector on line 3 is a pure expression, so rule 3.3 applies. For the left branch, we let and and compute

 Υ4;E4⊢dens(\lstinline{{% \lst@@@set@language\lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor% \lst@@@set@language\lst@@@set@language\lst@@@set@language\lst@@@set@numbers{% \@listingGroup{ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}random}}({\@listingGroup% {ltx_lst_identifier}{\small\color[rgb]{0,0.4,0}Gaussian}}({\@listingGroup{% ltx_lst_identifier}{\small\color[rgb]{0,0.4,0}mA}},{\@listingGroup{% ltx_lst_space}{~{}}}1.0))}})⇒λz. F4

where is computed using 3.3 and 3.3 as

Here , so the contribution of the left branch to the pdf of the match is the pdf of the branch scaled by the probability of entering the left branch. In general, this holds when the branch expression is independent from the body of the branch.

For the right branch, see Example 3.3. We then obtain the pdf of the match as the sum of the pdfs of the two branches.

Our implementation of the compiler uses the following derived rules for if statements where the branching expression is of type bool, and does not treat other sum types nor matches. Derived rules for if statements

Since the match-bound variables (lines 4 and 5) do not appear in the bodies of the match branches, we can instead use rule 3.3 to avoid adding them to the probability context when computing the pdf of the body (cf. Example 3.2).

 Density compiler, discrete operations : Υ;E⊢dens(f(M))⇒F

The 3.3

rule for discrete operations, such as logical and comparison operations and integer arithmetic, computes the expectation of an indicator function over the joint distribution of the random variables occurring in the expression.

For numeric operations on real numbers we mimic the change of variable rule of integration (often summarized as “”), multiplying the density of the argument with the derivative of the inverse of the operation. For operations of more than one argument (e.g., 3.3 below), we instead use the matrix volume of the Jacobian matrix of the inverse operation (Ben-Israel, 1999). We only require that the operation is invertible on a restricted domain, namely where the pdf of its argument is non-zero. This is exemplified by the following rules.

 Density compiler, numeric operations on reals : Υ;E⊢dens(f(M))⇒F

For the 3.3 rule above, we require that negative arguments to log have zero density. Letting , the sum on line 6 is evaluated using rule 3.3 as

 Υ7;E7⊢dens(\lstinline{{% \lst@@@set@language\lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor% \lst@@@set@language\lst@@@set@language\lst@@@set@language\lst@@@set@numbers{% \@listingGroup{ltx_lst_identifier}{\small\color[rgb]{0,0.4,0}temp}}}}+% \lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language\lst@@@set@language% \lst@@@set@numbers{\@listingGroup{ltx_lst_identifier}{\small\color[rgb]{% 0,0.4,0}mB}}}})⇒λz. F7

where

 F7=\lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language\lst@@@set@language% \lst@@@set@numbers{\@listingGroup{ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}let}}% }} \lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language\lst@@@set@language% \lst@@@set@numbers{\@listingGroup{ltx_lst_identifier}{\small\color[rgb]{% 0,0.4,0}temp}}}}=z−\lstinline{{\lst@@@set@language\lst@@@set@numbers% \lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language% \lst@@@set@language\lst@@@set@numbers{\@listingGroup{ltx_lst_identifier}{% \small\color[rgb]{0,0.4,0}mB}}}} \lstinline{{\lst@@@set@language% \lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language% \lst@@@set@language\lst@@@set@language\lst@@@set@numbers{\@listingGroup{% ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}in}}}} ∫λ\lstinline% {{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor% \lst@@@set@language\lst@@@set@language\lst@@@set@language\lst@@@set@numbers{% \@listingGroup{ltx_lst_identifier}{\small\color[rgb]{0,0.4,0}branch}}}}. E7 ≡λz.0.3⋅\textscPdf\lstinline{{% \lst@@@set@language\lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor% \lst@@@set@language\lst@@@set@language\lst@@@set@language\lst@@@set@numbers{% \@listingGroup{ltx_lst_identifier}{\small\color[rgb]{0,0.4,0}Gaussian}}}}(\lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language\lst@@@set@language% \lst@@@set@numbers{\@listingGroup{ltx_lst_identifier}{\small\color[rgb]{% 0,0.4,0}mB}},1.}})(z)

since for all . Thus we obtain that the pdf of the program fragment in Figure 1 is given by

 λz.F4+F7≡λz.0.7⋅\textscPdf\lstinline{{% \lst@@@set@language\lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor% \lst@@@set@language\lst@@@set@language\lst@@@set@language\lst@@@set@numbers{% \@listingGroup{ltx_lst_identifier}{\small\color[rgb]{0,0.4,0}Gaussian}}}}(\lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language\lst@@@set@language% \lst@@@set@numbers{\@listingGroup{ltx_lst_identifier}{\small\color[rgb]{% 0,0.4,0}mA}},1.}})(z)+0.3⋅\textscPdf\lstinline{{% \lst@@@set@language\lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor% \lst@@@set@language\lst@@@set@language\lst@@@set@language\lst@@@set@numbers{% \@listingGroup{ltx_lst_identifier}{\small\color[rgb]{0,0.4,0}Gaussian}}}}(\lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language\lst@@@set@language% \lst@@@set@numbers{\@listingGroup{ltx_lst_identifier}{\small\color[rgb]{% 0,0.4,0}mB}},1.}})(z).

Finally, as another example of compilation, the if statement in the program

let p = random(Beta(1.0,1.0)) in // the uniform distribution on the unit interval
let b = random(Bernoulli(p)) in
if b then p+1.0 else p

is handled by 3.3, yielding a density function that (modulo trivial integrals) is equivalent to

 λz. \lstinline{{\lst@@@set@language\lst@@@set@numbers% \lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language% \lst@@@set@language\lst@@@set@numbers{\@listingGroup{ltx_lst_keyword}{\color[% rgb]{0,0.1,0.5}let}}}} p=z−1 \lstinline{{\lst@@@set@language% \lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language% \lst@@@set@language\lst@@@set@language\lst@@@set@numbers{\@listingGroup{% ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}in}}}} ∫λb.[0≤p≤1]⋅(\lstinline{{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame% \lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language\lst@@@set@language% \lst@@@set@numbers{\@listingGroup{ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}if}}}% } b \lstinline{{\lst@@@set@language\lst@@@set@numbers% \lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language\lst@@@set@language% \lst@@@set@language\lst@@@set@numbers{\@listingGroup{ltx_lst_keyword}{\color[% rgb]{0,0.1,0.5}then}}}} p \lstinline{{\lst@@@set@language% \lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor\lst@@@set@language% \lst@@@set@language\lst@@@set@language\lst@@@set@numbers{\@listingGroup{% ltx_lst_keyword}{\color[rgb]{0,0.1,0.5}else}}}} 1−p)⋅\lstinline% {{\lst@@@set@language\lst@@@set@numbers\lst@@@set@frame\lst@@@set@rulecolor% \lst@@@set@language\lst@@@set@language\lst@@@set@language\lst@@@set@numbers{% \@listingGroup{ltx_lst_identifier}{\small\color[rgb]{0,0.4,0}isL}}}}(b) +∫λb.[0≤z≤1]⋅