# Nesting Probabilistic Programs

We formalize the notion of nesting probabilistic programming queries and investigate the resulting statistical implications. We demonstrate that query nesting allows the definition of models which could not otherwise be expressed, such as those involving agents reasoning about other agents, but that existing systems take approaches that lead to inconsistent estimates. We show how to correct this by delineating possible ways one might want to nest queries and asserting the respective conditions required for convergence. We further introduce, and prove the correctness of, a new online nested Monte Carlo estimation method that makes it substantially easier to ensure these conditions are met, thereby providing a simple framework for designing statistically correct inference engines.


## 1 Introduction

Probabilistic programming systems (PPSs) allow probabilistic models to be represented in the form of a generative model and statements for conditioning on data (Goodman et al., 2008; Gordon et al., 2014). Informally, one can think of the generative model as the definition of a prior, the conditioning statements as the definition of a likelihood, and the output of the program as samples from a posterior distribution. Their core philosophy is to decouple model specification and inference, the former corresponding to the user-specified program code and the latter to an inference engine capable of operating on arbitrary programs. Removing the need for users to write inference algorithms significantly reduces the burden of developing new models and makes effective statistical methods accessible to non-experts.

Some, so-called universal, systems (Goodman et al., 2008; Goodman and Stuhlmüller, 2014; Mansinghka et al., 2014; Wood et al., 2014) further allow the definition of models that would be hard, or even impossible, to convey using conventional frameworks such as graphical models. One enticing manner they do this is by allowing arbitrary nesting of models, known in the probabilistic programming literature as queries (Goodman et al., 2008), such that it is easy to define and run problems that fall outside the standard inference framework (Goodman et al., 2008; Mantadelis and Janssens, 2011; Stuhlmüller and Goodman, 2014; Le et al., 2016). This allows the definition of models that could not be encoded without nesting, such as experimental design problems (Ouyang et al., 2016) and various models for theory-of-mind (Stuhlmüller and Goodman, 2014). In particular, models that involve agents reasoning about other agents require, in general, some form of nesting. For example, one might use such nesting to model a poker player reasoning about another player, as shown in Section 3.1. As machine learning increasingly starts to tackle problem domains that require interaction with humans or other external systems, such as the need for self-driving cars to account for the behavior of pedestrians, we believe that such nested problems are likely to become increasingly common and that PPSs will form a powerful tool for encoding them.

However, previous work has, in general, implicitly, and incorrectly, assumed that the convergence results from standard inference schemes carry over directly to the nested setting. In truth, inference for nested queries falls outside the scope of conventional proofs and so additional work is required to prove the consistency of PPS inference engines for nested queries. Such problems constitute special cases of nested estimation. In particular, the use of Monte Carlo (MC) methods by most PPSs means they form particular instances of nested Monte Carlo (NMC) estimation (Hong and Juneja, 2009). Recent work (Rainforth et al., 2016a, 2018; Fort et al., 2017) has demonstrated that NMC is consistent for a general class of models, but also that it entails a convergence rate in the total computational cost which decreases exponentially with the depth of the nesting. Furthermore, additional assumptions are required to achieve this convergence, most noticeably that, except in a few special cases, one needs to drive not only the total number of samples used to infinity, but also the number of samples used at each layer of the estimator, a requirement generally flouted by existing PPSs.

The aim of this work is to formalize the notion of query nesting and use these recent NMC results to investigate the statistical correctness of the resulting procedures carried out by PPS inference engines. To do this, we postulate that there are three distinct ways one might nest one query within another: sampling from the conditional distribution of another query (which we refer to as nested inference), factoring the trace probability of one query with the partition function estimate of another (which we refer to as nested conditioning), and using expectation estimates calculated using one query as first-class variables in another. We use the aforementioned NMC results to assess the relative correctness of each of these categories of nesting. In the interest of exposition, we will mostly focus on the PPS Anglican (Tolpin et al., 2016; Wood et al., 2014) (and also occasionally Church (Goodman et al., 2008)) as a basis for our discussion, but note that our results apply more generally. For example, our nested inference case covers the problem of sampling from cut distributions in OpenBugs (Plummer, 2015).

We find that nested inference is statistically challenging and incorrectly handled by existing systems, while nested conditioning is statistically straightforward and done correctly. Using estimates as variables turns out to be exactly equivalent to generic NMC estimation and must thus be dealt with on a case-by-case basis. Consequently, we will focus more on nested inference than the other cases.

To assist in the development of consistent approaches, we further introduce a new online NMC (ONMC) scheme that obviates the need to revisit previous samples when refining estimates, thereby simplifying the process of writing consistent online nested estimation schemes, as required by most PPSs. We show that ONMC’s convergence rate only varies by a small constant factor relative to conventional NMC: given some weak assumptions and the use of recommended parameter settings, its asymptotic variance is always better than the equivalent NMC estimator with matched total sample budget, while its asymptotic bias is always within a factor of two.

## 2 Background

### 2.1 Nested Monte Carlo

We start by providing a brief introduction to NMC, using similar notation to that of Rainforth et al. (2018). Conventional MC estimation approximates an intractable expectation $\gamma_0$ of a function $\lambda$ using

$$\gamma_0 = \mathbb{E}\left[\lambda(y^{(0)})\right] \approx I_0 = \frac{1}{N_0}\sum_{n=1}^{N_0} \lambda\left(y^{(0)}_n\right) \tag{1}$$

where $y^{(0)}_n \sim p(y^{(0)})$ are drawn independently, resulting in a mean squared error (MSE) that decreases at a rate $O(1/N_0)$. For nested estimation problems, $\lambda$ is itself intractable, corresponding to a nonlinear mapping of a (nested) estimation. Thus, in the single nesting case, $\lambda(y^{(0)}) = f_0\big(y^{(0)}, \mathbb{E}[f_1(y^{(0)}, y^{(1)}) \mid y^{(0)}]\big)$, giving

$$\gamma_0 = \mathbb{E}\left[f_0\left(y^{(0)}, \mathbb{E}\left[f_1(y^{(0)}, y^{(1)}) \,\middle|\, y^{(0)}\right]\right)\right] \approx I_0 = \frac{1}{N_0}\sum_{n=1}^{N_0} f_0\left(y^{(0)}_n, \frac{1}{N_1}\sum_{m=1}^{N_1} f_1\left(y^{(0)}_n, y^{(1)}_{n,m}\right)\right)$$

where each $y^{(1)}_{n,m} \sim p\big(y^{(1)} \mid y^{(0)}_n\big)$ is drawn independently and $I_0$ is now an NMC estimate using $T = N_0 N_1$ samples.
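The single-level estimator above can be sketched in a few lines of Python. This is an illustrative sketch of our own, not code from the paper; the toy problem and all names are assumptions chosen so that the nesting bias is analytically visible:

```python
import random

def nmc_single(f0, f1, sample_y0, sample_y1_given, N0, N1):
    """One-level NMC: I0 = (1/N0) sum_n f0(y0_n, (1/N1) sum_m f1(y0_n, y1_{n,m}))."""
    total = 0.0
    for _ in range(N0):
        y0 = sample_y0()
        # Inner MC average over N1 conditional draws, nested inside the outer sum.
        inner = sum(f1(y0, sample_y1_given(y0)) for _ in range(N1)) / N1
        total += f0(y0, inner)
    return total / N0

# Toy problem (our assumption): y0 ~ N(0,1), y1 | y0 ~ N(y0,1), f1(y0,y1) = y1
# and f0(y0,g) = g^2, so gamma_0 = E[y0^2] = 1 while the estimator's expectation
# is exactly 1 + 1/N1, making the nesting bias explicit.
random.seed(0)
est = nmc_single(f0=lambda y0, g: g * g,
                 f1=lambda y0, y1: y1,
                 sample_y0=lambda: random.gauss(0.0, 1.0),
                 sample_y1_given=lambda y0: random.gauss(y0, 1.0),
                 N0=200_000, N1=10)
```

With $N_1 = 10$ the estimate concentrates around $1 + 1/N_1 = 1.1$ rather than the true value $1$, illustrating why the inner budget must also grow for consistency.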

More generally, one may have multiple layers of nesting. To notate this, we first presume some fixed integral depth $D \ge 0$ (with $D = 0$ corresponding to conventional estimation) and real-valued functions $f_0, \dots, f_D$. We then recursively define

$$\gamma_D(y^{(0:D-1)}) = \mathbb{E}\left[f_D(y^{(0:D)}) \,\middle|\, y^{(0:D-1)}\right], \quad \text{and} \quad \gamma_k(y^{(0:k-1)}) = \mathbb{E}\left[f_k\left(y^{(0:k)}, \gamma_{k+1}(y^{(0:k)})\right) \,\middle|\, y^{(0:k-1)}\right]$$

for $0 \le k < D$. Our goal is to estimate $\gamma_0$, for which the NMC estimate $I_0$ is defined recursively using

$$I_D(y^{(0:D-1)}) = \frac{1}{N_D}\sum_{n_D=1}^{N_D} f_D\left(y^{(0:D-1)}, y^{(D)}_{n_D}\right) \quad \text{and} \quad I_k(y^{(0:k-1)}) = \frac{1}{N_k}\sum_{n_k=1}^{N_k} f_k\left(y^{(0:k-1)}, y^{(k)}_{n_k}, I_{k+1}\left(y^{(0:k-1)}, y^{(k)}_{n_k}\right)\right) \tag{2}$$

for $0 \le k < D$, where each $y^{(k)}_{n_k} \sim p\big(y^{(k)} \mid y^{(0:k-1)}\big)$ is drawn independently. Note that there are multiple values of $y^{(k)}_{n_k}$ for each associated $y^{(0:k-1)}$ and that $I_k(y^{(0:k-1)})$ is still a random variable given $y^{(0:k-1)}$.
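The recursion in (2) translates directly into code. The following Python sketch is our own illustration of the general depth-$D$ estimator, with `fs[k]` playing the role of $f_k$ and `N[k]` the per-level budget; the toy problem at the bottom is an assumption, not from the paper:

```python
import random

def nmc(fs, sample, N, k=0, ys=()):
    """Recursive NMC estimator in the style of (2).

    fs[k] is f_k: at the innermost level (k == D) it takes the tuple y^(0:D);
    at outer levels it also takes the inner estimate I_{k+1}.
    sample(k, ys) draws y^(k) given y^(0:k-1); N[k] is the level-k budget.
    """
    D = len(fs) - 1
    total = 0.0
    for _ in range(N[k]):
        y = sample(k, ys)
        if k == D:
            total += fs[k](ys + (y,))
        else:
            total += fs[k](ys + (y,), nmc(fs, sample, N, k + 1, ys + (y,)))
    return total / N[k]

# Depth D = 1 toy problem (our assumption): y^(0) ~ N(0,1), y^(1)|y^(0) ~ N(y^(0),1),
# f_1 returns y^(1) and f_0 squares the inner estimate, so gamma_0 = E[(y^(0))^2] = 1
# and the estimator's expectation is 1 + 1/N[1].
random.seed(1)
est = nmc(fs=[lambda ys, inner: inner ** 2, lambda ys: ys[-1]],
          sample=lambda k, ys: random.gauss(ys[-1] if ys else 0.0, 1.0),
          N=[50_000, 10])
```

Deeper nesting is obtained simply by appending further functions to `fs` and budgets to `N`, at an exponentially growing total cost.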

As shown by (Rainforth et al., 2018, Theorem 3), if each $f_k$ is continuously differentiable and

$$\varsigma_k^2 = \mathbb{E}\left[\left(f_k\left(y^{(0:k)}, \gamma_{k+1}(y^{(0:k)})\right) - \gamma_k(y^{(0:k-1)})\right)^2\right] < \infty \quad \forall k \in 0{:}D,$$

then the MSE converges at rate

$$\mathbb{E}\left[(I_0-\gamma_0)^2\right] \le \frac{\varsigma_0^2}{N_0} + \left(\frac{C_0 \varsigma_1^2}{2N_1} + \sum_{k=0}^{D-2}\left(\prod_{d=0}^{k} K_d\right)\frac{C_{k+1}\varsigma_{k+2}^2}{2N_{k+2}}\right)^2 + O(\epsilon) \tag{3}$$

where $K_k$ and $C_k$ are respectively bounds on the magnitude of the first and second derivatives of $f_k$, and $O(\epsilon)$ represents asymptotically dominated terms – a convention we will use throughout. Note that the dominant terms in the bound correspond respectively to the variance and the bias squared. Theorem 2 of Rainforth et al. (2018) further shows that the continuously differentiable assumption need only hold almost surely, rather than everywhere, for convergence more generally, such that functions with measure-zero discontinuities still converge in general.

We see from (3) that if any of the $N_k$ remain fixed, there is a minimum error that can be achieved: convergence requires each $N_k \to \infty$. As we will later show, many of the shortfalls in dealing with nested queries by existing PPSs revolve around implicitly fixing the $N_k$ of the inner levels.

For a given total sample budget $T = N_0 N_1 \cdots N_D$, the bound is tightest when $N_0 \propto N_k^2$ for each $k \ge 1$, giving a convergence rate of $O(1/T^{2/(D+2)})$. The intuition behind this potentially surprising optimum setting is that the variance is mostly dictated by $N_0$ and the bias by the other $N_k$. We see that the convergence rate diminishes exponentially with $D$. However, this optimal setting of the $N_k$ still gives a substantially faster rate than the $O(1/T^{1/(D+1)})$ resulting from naïvely setting $N_k = N_0$ for all $k$.
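As a worked example of this allocation rule, the sketch below (our own illustrative helper, not from the paper) splits a total budget $T$ following the $N_0 \propto N_k^2$ guidance:

```python
def allocate_budget(T, D):
    """Split a total budget T = N0 * N1 * ... * ND following N0 ∝ Nk^2:
    Nk ≈ T^(1/(D+2)) for k >= 1 and N0 ≈ T^(2/(D+2)), so that the variance
    term (∝ 1/N0) and the squared bias terms (∝ 1/Nk^2) shrink at matched rates."""
    inner = round(T ** (1.0 / (D + 2)))
    outer = round(T ** (2.0 / (D + 2)))
    return [outer] + [inner] * D

# For T = 10^6 with one level of nesting this gives N0 = 10^4 outer samples
# and N1 = 10^2 samples per inner call, rather than the naive 10^3 / 10^3 split.
```

The resulting MSE scales as $O(1/T^{2/(D+2)})$, versus $O(1/T^{1/(D+1)})$ for the equal-split choice.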

### 2.2 The Anglican PPS

Anglican is a universal probabilistic programming language integrated into Clojure (Hickey, 2008), a dialect of Lisp. There are two important ideas to understand for reading Clojure: almost everything is a function and parentheses cause evaluation. For example, `a + b` is coded as `(+ a b)`, where `+` is a function taking two arguments and the parentheses cause the function to evaluate.

Anglican inherits most of the syntax of Clojure, but extends it with the key special forms `sample` and `observe` (Wood et al., 2014; Tolpin et al., 2015, 2016), between which the distribution of the query is defined. Informally, `sample` specifies terms in the prior and `observe` terms in the likelihood. More precisely, `sample` is used to make random draws from a provided distribution and `observe` is used to apply conditioning, factoring the probability density of a program trace by a provided density evaluated at an “observed” point.

The syntax of `sample` is to take a distribution object as its only input and return a sample. `observe` instead takes a distribution object and an observation and returns nil, while changing the program trace probability in Anglican’s back-end. Anglican provides a number of elementary random procedures, i.e. distribution object constructors for common sampling distributions, but also allows users to define their own distribution object constructors using the `defdist` macro. Distribution objects are generated by calling a class constructor with the required parameters, e.g. `(normal 0 1)`.

Anglican queries are written using the macro `defquery`. This allows users to define a model using a mixture of `sample` and `observe` statements and deterministic code, and bind that model to a variable. As a simple example,

```clojure
(defquery my-query [data]
  (let [μ (sample (normal 0 1))
        σ (sample (gamma 2 2))
        lik (normal μ σ)]
    (map (fn [obs] (observe lik obs)) data)
    [μ σ]))
```

corresponds to a model where we are trying to infer the mean and standard deviation of a Gaussian given some data. The syntax of `defquery` is `(defquery name [args] body)`, such that we are binding the query to `my-query` here. The query starts by sampling $\mu$ and $\sigma$, before constructing a distribution object `lik` to use for the observations. It then maps over each datapoint and observes it under the distribution `lik`. After the observations are made, $\mu$ and $\sigma$ are returned from the variable-binding `let` block and then by proxy the query itself. Denoting the data as $y_{1:S}$, this particular query defines the joint distribution

$$p(\mu,\sigma,y_{1:S}) = \mathcal{N}(\mu;0,1)\,\Gamma(\sigma;2,2)\,\prod_{s=1}^{S}\mathcal{N}(y_s;\mu,\sigma).$$
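For concreteness, this joint density can be evaluated directly. The Python sketch below is our own illustration, and it assumes a shape–rate parameterization for `(gamma 2 2)`; it is not part of Anglican:

```python
import math

def log_joint(mu, sigma, data):
    """log p(mu, sigma, y_1:S) = log N(mu; 0, 1) + log Gamma(sigma; 2, 2)
    + sum_s log N(y_s; mu, sigma), mirroring the defquery above."""
    def log_normal(x, m, s):
        return -0.5 * math.log(2.0 * math.pi * s * s) - (x - m) ** 2 / (2.0 * s * s)
    shape, rate = 2.0, 2.0  # assumed parameterization of (gamma 2 2)
    log_gamma = (shape * math.log(rate) - math.lgamma(shape)
                 + (shape - 1.0) * math.log(sigma) - rate * sigma)
    return log_normal(mu, 0.0, 1.0) + log_gamma + sum(log_normal(y, mu, sigma) for y in data)

val = log_joint(0.0, 1.0, [0.0])  # log density at mu=0, sigma=1, one datapoint y=0
```

Each `observe` statement in the query contributes one `log_normal(y, mu, sigma)` term to this trace density.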

Inference on a query is performed using the macro `doquery`, which produces a lazy infinite sequence of approximate samples from the conditional distribution and, for appropriate inference algorithms, an estimate of the partition function. Its calling syntax is `(doquery inf-alg model inputs & options)`.

Key to our purposes is Anglican’s ability to nest queries within one another. In particular, the special form `conditional` takes a query and returns a distribution object constructor, the outputs of which ostensibly correspond to the conditional distribution defined by the query, with the inputs to the query becoming its parameters. However, as we will show in the next section, the true behavior of `conditional` deviates from this, thereby leading to inconsistent nested inference schemes.

## 3 Nested Inference

One of the clearest ways one might want to nest queries is by sampling from the conditional distribution of one query inside another. A number of examples of this are provided for Church in (Stuhlmüller and Goodman, 2014) (though their nesting happens within the conditioning predicate, Church’s semantics means they constitute nested inference). Such nested inference problems fall under a more general framework of inference for so-called doubly (or multiply) intractable distributions (Murray et al., 2006). The key feature of these problems is that they include terms with unknown, parameter dependent, normalization constants. For nested probabilistic programming queries, this manifests through conditional normalization.

Consider the following unnested model, written using the Anglican function declaration `defm`:

```clojure
(defm inner [y D]
  (let [z (sample (gamma y 1))]
    (observe (normal y z) D)
    z))

(defquery outer [D]
  (let [y (sample (beta 2 3))
        z (inner y D)]
    (* y z)))
```

Here `inner` is simply an Anglican function: it takes in inputs `y` and `D`, affects the trace probability through its `observe` statement, and returns the random variable `z` as output. The unnormalized distribution for this model is thus straightforwardly given by

$$\pi_u(y,z,D) = p(y)\,p(z|y)\,p(D|y,z) = \mathrm{Beta}(y;2,3)\,\Gamma(z;y,1)\,\mathcal{N}(D;y,z^2),$$

for which we can use conventional inference schemes.

We can convert this model to a nested inference problem by using `defquery` and `conditional` as follows:

```clojure
(defquery inner [y D]
  (let [z (sample (gamma y 1))]
    (observe (normal y z) D)
    z))

(defquery outer [D]
  (let [y (sample (beta 2 3))
        dist (conditional inner)
        z (sample (dist y D))]
    (* y z)))
```

This is now a nested query: a separate inference procedure is invoked for each call of `(sample (dist y D))`, returning an approximate sample from the conditional distribution defined by `inner` when input with the current values of `y` and `D`. Mathematically, `conditional` applies a conditional normalization. Specifically, the component of $\pi_u$ from the previous example corresponding to `inner` was $p(z|y)\,p(D|y,z)$, and `conditional` locally normalizes this to the probability distribution $p(z|y,D)$. The distribution now defined by `outer` is thus given by

$$\pi_n(y,z,D) = p(y)\,p(z|y,D) = p(y)\,\frac{p(z|y)\,p(D|y,z)}{\int p(z|y)\,p(D|y,z)\,dz} = \frac{p(y)\,p(z|y)\,p(D|y,z)}{p(D|y)} \ne \pi_u(y,z,D).$$

Critically, the partial normalization constant $p(D|y)$ depends on $y$, and so the conditional distribution is doubly intractable: we cannot evaluate $\pi_n(y,z,D)$ exactly.

Another way of looking at this is that wrapping `inner` in `conditional` has “protected” $y$ from the conditioning in `inner` (note that the marginal of $y$ under $\pi_n$ is simply its prior $p(y)$), such that the `observe` statement only affects the probability of $z$ given $y$ and not the marginal probability of $y$. This is why, when there is only a single layer of nesting, nested inference is equivalent to the notion of sampling from “cut distributions” (Plummer, 2015), whereby the sampling of certain subsets of the variables in a model is made with factors of the overall likelihood omitted.

It is important to note that if we had observed the output of the inner query, rather than sampling from it, this would still constitute a nested inference problem. The key to the nesting is the conditional normalization applied by `conditional`, not the exact usage of the generated distribution object `dist`. However, as discussed in Appendix B, actually observing a nested query requires numerous additional computational issues to be overcome, which are beyond the scope of this paper. We thus focus on the nested sampling scenario.

### 3.1 Motivating Example

Before jumping into a full formalization of nested inference, we first consider the motivating example of modeling a poker player who reasons about another player. Here each player has access to information the other does not, namely the cards in their hand, and they must perform their own inference to deal with the resulting uncertainty.

Imagine that the first player is deciding whether or not to bet. She could naïvely just make this decision based on the strength of her hand, but more advanced play requires her to reason about actions the other player might take given her own action, e.g. by considering whether a bluff is likely to be successful. She can carry out such reasoning by constructing a model for the other player to try and predict their action given her action and their hand. Again this nested model could just simply be based on a naïve simulation, but we can refine it by adding another layer of meta-reasoning: the other player will themselves try to infer the first player’s hand to inform their own decision.

These layers of meta-reasoning create a nesting: for the first player to choose an action, they must run multiple simulations for what the other player will do given that action and their hand, each of which requires inference to be carried out. Here adding more levels of meta-reasoning can produce smarter models, but also requires additional layers of nesting. We expand on this example to give a concrete nested inference problem in Appendix E.

### 3.2 Formalization

To formalize the nested inference problem more generally, let $y$ and $x$ denote all the random variables of an outer query that are respectively passed or not passed to the inner query. Further, let $z$ denote all random variables generated in the inner query – for simplicity, we will assume, without loss of generality, that these are all returned to the outer query, but that some may not be used. The unnormalized density for the outer query can now be written in the form

$$\pi_o(x,y,z) = \psi(x,y,z)\,p_i(z|y) \tag{4}$$

where $p_i(z|y)$ is the normalized density of the outputs of the inner query and $\psi(x,y,z)$ encapsulates all other terms influencing the trace probability of the outer query. Now the inner query defines an unnormalized density $\pi_i(y,z)$ that can be evaluated pointwise and we have

$$p_i(z|y) = \frac{\pi_i(y,z)}{\int \pi_i(y,z')\,dz'} \quad \text{giving} \tag{5}$$

$$p_o(x,y,z) \propto \pi_o(x,y,z) = \psi(x,y,z)\,\frac{\pi_i(y,z)}{\int \pi_i(y,z')\,dz'} \tag{6}$$

where $p_o(x,y,z)$ is our target distribution, for which we can directly evaluate the numerator, but the denominator is intractable and must be evaluated separately for each possible value of $y$. Our previous example is recovered by taking $x$ to be empty, $\psi(x,y,z) = p(y)$, and $\pi_i(y,z) = p(z|y)\,p(D|y,z)$. We can further straightforwardly extend to the multiple layers of nesting setting by recursively defining $\pi_i(y,z)$ in the same way as $\pi_o(x,y,z)$.

### 3.3 Relationship to Nested Estimation

To relate the nested inference problem back to the nested estimation formulation from Section 2.1, we consider using a proposal $q(x,y,z) = q(x,y)\,q(z|y)$ to calculate the expectation of some arbitrary function $g(x,y,z)$ under $p_o(x,y,z)$ as per self-normalized importance sampling:

$$\mathbb{E}_{p_o(x,y,z)}\left[g(x,y,z)\right] = \frac{\mathbb{E}_{q(x,y,z)}\left[g(x,y,z)\,\frac{\pi_o(x,y,z)}{q(x,y,z)}\right]}{\mathbb{E}_{q(x,y,z)}\left[\frac{\pi_o(x,y,z)}{q(x,y,z)}\right]} = \frac{\mathbb{E}_{q(x,y,z)}\left[\frac{g(x,y,z)\,\psi(x,y,z)\,\pi_i(y,z)}{q(x,y,z)\,\mathbb{E}_{z'\sim q(z|y)}\left[\pi_i(y,z')/q(z'|y)\right]}\right]}{\mathbb{E}_{q(x,y,z)}\left[\frac{\psi(x,y,z)\,\pi_i(y,z)}{q(x,y,z)\,\mathbb{E}_{z'\sim q(z|y)}\left[\pi_i(y,z')/q(z'|y)\right]}\right]}. \tag{7}$$

Here both the denominator and numerator are nested expectations with a nonlinearity coming from the fact that we are using the reciprocal of an expectation. A similar reformulation could also be applied in cases with multiple layers of nesting, i.e. where inner itself makes use of another query. The formalization can also be directly extended to the sequential MC (SMC) setting by invoking extended space arguments (Andrieu et al., 2010).

Typically $g$ is not known upfront and we instead return an empirical measure from the program in the form of weighted samples which can later be used to estimate an expectation. That is, if we sample $(x_n, y_n) \sim q(x,y)$ for $n = 1,\dots,N_0$ and $z_{n,m} \sim q(z|y_n)$ for $m = 1,\dots,N_1$, and return all samples (such that each $(x_n, y_n)$ is duplicated $N_1$ times in the sample set), then our unnormalized weights are given by

$$w_{n,m} = \frac{\psi(x_n,y_n,z_{n,m})\,\pi_i(y_n,z_{n,m})}{q(x_n,y_n,z_{n,m})\,\frac{1}{N_1}\sum_{\ell=1}^{N_1}\pi_i(y_n,z_{n,\ell})/q(z_{n,\ell}|y_n)}. \tag{8}$$

This, in turn, gives us the empirical measure

$$\hat{p}(\cdot) = \frac{\sum_{n=1}^{N_0}\sum_{m=1}^{N_1} w_{n,m}\,\delta_{(x_n,y_n,z_{n,m})}(\cdot)}{\sum_{n=1}^{N_0}\sum_{m=1}^{N_1} w_{n,m}} \tag{9}$$

where $\delta_{(x_n,y_n,z_{n,m})}(\cdot)$ is a delta function centered on $(x_n,y_n,z_{n,m})$. By definition, the convergence of this empirical measure to the target requires that expectation estimates calculated using it converge in probability for any integrable $g$ (presuming our proposal is valid). We thus see that the convergence of the ratio of nested expectations in (7) for any arbitrary integrable $g$ is equivalent to the produced samples converging to the distribution defined by the program. Informally, the NMC results then tell us this will happen in the limit $N_0, N_1 \to \infty$, provided that $\int \pi_i(y,z')\,dz'$ is strictly positive for all possible $y$ (as otherwise the problem becomes ill-defined). More formally, we have the following result. Its proof, along with all others, is given in Appendix A.

###### Theorem 1.

Let $g(x,y,z)$ be an integrable function, let $\gamma_0 = \mathbb{E}_{p_o(x,y,z)}\left[g(x,y,z)\right]$, and let $I_0$ be a self-normalized MC estimate for $\gamma_0$ calculated using $\hat{p}$ as per (9). Assuming that $q(x,y,z)$ forms a valid importance sampling proposal distribution for $p_o(x,y,z)$, then

$$\mathbb{E}\left[(I_0-\gamma_0)^2\right] = \frac{\sigma^2}{N_0} + \frac{\delta^2}{N_1^2} + O(\epsilon) \tag{10}$$

where $\sigma$ and $\delta$ are constants derived in the proof and, as before, $O(\epsilon)$ represents asymptotically dominated terms.

Note that rather than simply being a bound, this result is an equality and thus provides the exact asymptotic rate. Using the arguments of (Rainforth et al., 2018, Theorem 3), it can be straightforwardly extended to cases of multiple nesting (giving a rate analogous to (3)), though characterizing $\sigma$ and $\delta$ becomes more challenging.

### 3.4 Convergence Requirements

We have demonstrated that the problem of nested inference is a particular case of nested estimation. This problem equivalence will hold whether we elect to use the aforementioned nested importance sampling based approach or not, while we see that our finite sample estimates must be biased for non-trivial $g$ by the convexity of the reciprocal in (7) and Theorem 4 of Rainforth et al. (2018). Presuming we cannot produce exact samples from the inner query and that the set of possible inputs to the inner query is not finite (these are respectively considered in Appendix D and Appendix C), we thus see that there is no “silver bullet” that can reduce the problem to a standard estimation.

We now ask: what behavior do we need from Anglican’s `conditional`, and nested inference more generally, to ensure convergence? At a high level, the NMC results show us that we need the computational budget of each call of a nested query to become arbitrarily large, such that we use an infinite number of samples at each layer of the estimator: we require each $N_k \to \infty$.

We have formally demonstrated convergence when this requirement is satisfied and the previously introduced nested importance sampling approach is used. Another possible approach would be to, instead of using all the drawn samples to estimate (7) directly, importance sample $N_1$ times for each call of the inner query and then return a single sample from these, drawn in proportion to the inner query importance weights. We can think of this as drawing the same raw samples, but then constructing the estimator as

$$\hat{p}^*(\cdot) = \frac{\sum_{n=1}^{N_0} w^*_n\,\delta_{(x_n,y_n,z_{n,m^*(n)})}(\cdot)}{\sum_{n=1}^{N_0} w^*_n} \tag{11}$$

$$\text{where} \quad w^*_n = \frac{\psi(x_n,y_n,z_{n,m^*(n)})}{q(x_n,y_n)} \quad \text{and} \quad m^*(n) \sim \mathrm{Discrete}\left(\frac{\pi_i(y_n,z_{n,m})/q(z_{n,m}|y_n)}{\sum_{\ell=1}^{N_1}\pi_i(y_n,z_{n,\ell})/q(z_{n,\ell}|y_n)}\right). \tag{12}$$

As demonstrated formally in Appendix A, this approach also converges. However, if we Rao-Blackwellize (Casella and Robert, 1996) the sampling of $m^*(n)$, we find that this recovers (9). Consequently, (11) is a strictly inferior estimator (it has an increased variance relative to (9)). Nonetheless, it may often be a convenient setup from the perspective of the PPS semantics and it will typically have substantially reduced memory requirements: we need only store the single returned sample from the inner query to construct our empirical measure, rather than all of the samples generated within the inner query.

Though one can use the results of Fort et al. (2017) to show the correctness of instead using an MCMC estimator for the outer query, the correctness of using MCMC methods for the inner queries is not explicitly covered by existing results. Here we find that we need to start a new Markov chain for each call of the inner query because each value of $y$ defines a different local inference problem. One would intuitively expect the NMC results to carry over – as all the inner queries will run their Markov chains for an infinitely long time, thereby in principle returning exact samples – but we leave formal proof of this case to future work. We note that such an approach effectively equates to what is referred to as multiple imputation by Plummer (2015).

### 3.5 Shortfalls of Existing Systems

Using the empirical measure (9) provides one possible manner of producing a consistent estimate of our target by taking $N_0, N_1 \to \infty$, and so we can use this as a gold-standard reference approach (with a large value of $N_1$) to assess whether Anglican returns samples for the correct target distribution. To this end, we ran Anglican’s importance sampling inference engine on the simple model introduced earlier and compared its output to that of the reference approach. As shown in Figure 1, the samples produced by Anglican are substantially different to those of the reference code, demonstrating that the outputs do not match their semantically intended distribution. For reference, we also considered the distribution induced by the aforementioned unnested model and a naïve estimation scheme where a sample budget of $N_1 = 1$ is used for each call to `inner`, effectively corresponding to ignoring the `observe` statement by directly returning the first draw of `z`.

We see that the unnested model defines a noticeably different distribution, while the behavior of Anglican is similar, but distinct, to ignoring the `observe` statement in the inner query. Further investigation shows that the default behavior of `conditional` in a query nesting context is equivalent to using (11) but with the inner sample budget held fixed, inducing a substantial bias. More generally, the Anglican source code shows that `conditional` defines a Markov chain generated by equalizing the output of the weighted samples generated by running inference on the query. When used to nest queries, this Markov chain is only ever run for a finite length of time – specifically, one accept-reject step is carried out – and so it does not produce samples from the true conditional distribution.

Plummer (2015) noticed that WinBugs and OpenBugs (Spiegelhalter et al., 1996) similarly do not provide valid inference when using their cut function primitives, which effectively allow the definition of nested inference problems. However, that work does not note the equivalence to the NMC formulation and instead proposes a heuristic for reducing the bias that itself has no theoretical guarantees.

## 4 Nested Conditioning

An alternative way one might wish to nest queries is to use the partition function estimate of one query to factor the trace probability of another. We refer to this as nested conditioning. In its simplest form, we can think about conditioning on the values input to the inner query. In Anglican we can carry this out by using the following custom distribution object constructor:

```clojure
(defdist nest [inner inputs inf-alg M] []
  (sample [this] nil)
  (observe [this _]
    (log-marginal (take M
      (doquery inf-alg inner inputs)))))
```

When the resulting distribution object is observed, this will now generate, and factor the trace probability by, a partition function estimate for inner with inputs inputs, constructed using M samples of the inference algorithm inf-alg. For example, if we were to use the query

```clojure
(defquery outer [D]
  (let [y (sample (beta 2 3))]
    (observe (nest inner [y D] :smc 100) nil)
    y))
```

with inner from the nested inference example, then this would form a pseudo marginal sampler (Andrieu and Roberts, 2009) for the unnormalized target distribution

$$\pi_c(y,D) = \mathrm{Beta}(y;2,3)\int \Gamma(z;y,1)\,\mathcal{N}(D;y,z^2)\,dz.$$

Unlike the nested inference case, nested conditioning turns out to be valid even if our budget $M$ is held fixed, provided that the partition function estimate is unbiased, as is satisfied by, for example, importance sampling and SMC. In fact, it is important to hold the budget fixed to achieve the standard MC convergence rate. In general, we can define our target density as

$$p_o(x,y) \propto \pi_o(x,y) = \psi(x,y)\,p_i(y), \tag{13}$$

where $\psi(x,y)$ is as before (except that we no longer have returned variables from the inner query) and $p_i(y)$ is the true partition function of the inner query when given input $y$. In practice, we cannot evaluate $p_i(y)$ exactly, but can instead produce unbiased estimates $\hat{p}_i(y)$. Using a self-normalized importance sampling analogous to the nested inference case leads to the weights

$$w_n = \frac{\psi(x_n,y_n)\,\hat{p}_i(y_n)}{q(x_n,y_n)} \tag{14}$$

and corresponding empirical measure

$$\hat{p}(\cdot) = \frac{\sum_{n=1}^{N_0} w_n\,\delta_{(x_n,y_n)}(\cdot)}{\sum_{n=1}^{N_0} w_n} \tag{15}$$

such that we are conducting conventional MC estimation, but our weights are now themselves random variables for a given $(x_n, y_n)$ due to the $\hat{p}_i(y_n)$ term. However, the weights are unbiased estimates of the “true weights” $\psi(x_n,y_n)\,p_i(y_n)/q(x_n,y_n)$, such that we have proper weighting (Naesseth et al., 2015) and thus convergence at the standard MC rate, provided the budget of the inner query remains fixed. This result also follows directly from Theorem 6 of Rainforth et al. (2018), which further ensures no complications arise when conditioning on multiple queries if the corresponding partition function estimates are generated independently. These results further trivially extend to the repeated nesting case by recursion, while, using the idea of pseudo-marginal methods (Andrieu and Roberts, 2009), they also extend to using MCMC based inference for the outermost query.
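The key requirement here, that $\hat{p}_i(y)$ be unbiased for any fixed budget $M$, can be checked numerically for the running example. The Python sketch below is our own illustration, using the prior $\Gamma(z;y,1)$ as the proposal so that the estimate is simply an average of likelihood terms; the values of $y$ and $D$ are arbitrary choices:

```python
import math
import random

def normal_pdf(x, m, s):
    t = (x - m) / s
    if abs(t) > 38.0:  # guard against overflow/underflow when s is tiny
        return 0.0
    return math.exp(-0.5 * t * t) / (s * math.sqrt(2.0 * math.pi))

def phat(y, D, M, rng):
    """Unbiased IS estimate of p_i(y) = ∫ Gamma(z; y, 1) N(D; y, z^2) dz,
    sampling z from its prior so the importance weight is just the likelihood."""
    return sum(normal_pdf(D, y, rng.gammavariate(y, 1.0)) for _ in range(M)) / M

# Unbiasedness means E[phat] is the same for any fixed M;
# compare the average of many M = 1 estimates against many M = 50 estimates.
rng = random.Random(3)
m1 = sum(phat(0.3, 0.5, 1, rng) for _ in range(200_000)) / 200_000
m50 = sum(phat(0.3, 0.5, 50, rng) for _ in range(4_000)) / 4_000
```

Both averages target the same quantity $p(D|y)$, which is exactly the proper weighting property that lets the budget $M$ stay fixed in nested conditioning.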

Rather than just fixing the inputs to the nested query, one can also consider conditioning on the internally sampled variables in the program taking on certain values. Such a nested conditioning approach has been implicitly carried out by Rainforth et al. (2016b); Zinkov and Shan (2017); Scibior and Ghahramani (2016); Ge et al. (2018), each of which manipulate the original program in some fashion to construct a partition function estimator that is used within a greater inference scheme, e.g. a PMMH estimator (Andrieu et al., 2010).

## 5 Estimates as Variables

Our final case is that one might wish to use estimates as first-class variables in another query. In other words, a variable in an outer query is assigned to an MC expectation estimate calculated from the outputs of running inference on another, nested, query. By comparison, the nested inference case (without Rao-Blackwellization) can be thought of as assigning a variable in the outer query to a single approximate sample from the conditional distribution of the inner query, rather than to an MC expectation estimate constructed by averaging over multiple samples.

Whereas nested inference can only encode a certain class of nested estimation problems – because the only nonlinearity originates from taking the reciprocal of the partition function – using estimates as variables allows, in principle, the encoding of any nested estimation. This is because using the estimate as a first class variable allows arbitrary nonlinear mappings to be applied by the outer query.

An example of this approach is shown in Appendix G, where we construct a generic estimator for Bayesian experimental design problems. Here a partition function estimate is constructed for an inner query and is then used in an outer query. The output of the outer query depends on the logarithm of this estimate, thereby creating the nonlinearity required to form a nested expectation.

Because using estimates as variables allows the encoding of any nested estimation problem, the validity of doing so is equivalent to that of NMC more generally and must thus satisfy the requirements set out in (Rainforth et al., 2018). In particular, one needs to ensure that the budgets used for the inner estimates increase as more samples of the outermost query are taken.

## 6 Online Nested Monte Carlo

NMC will be highly inconvenient to actually implement in a PPS whenever one desires to provide online estimates; for example, a lazy sequence of samples that converges to the target distribution. Suppose that we have already calculated an NMC estimate, but now desire to refine it further. In general, this will require an increase to the sample budget of every inner estimator for each sample of the outermost estimator. Consequently, the previous samples of the outermost query must be revisited to refine their estimates. This significantly complicates practical implementation, necessitating additional communication between queries, introducing computational overhead, and potentially substantially increasing the memory requirements.

To highlight these shortfalls concretely, consider the nested inference class of problems and, in particular, constructing the un–Rao–Blackwellized estimator (11) in an online fashion. Increasing the inner sample budget requires the inner samples to be redrawn for each outer sample, which in turn necessitates storage of the previous samples and weights. (Note that not all previous samples and weights need storing: when making the update, we can sample whether to change each returned sample or not based on the combined weights of all the old samples compared to all the new samples.) This leads to an overhead cost from the extra computation carried out on revisitation and a memory overhead from having to store information about each call of the inner query.

Perhaps even more problematically, the need to revisit old samples when drawing new samples can cause substantial complications for implementation. Consider implementing such an approach in Anglican. Anglican is designed to return a lazy infinite sequence of samples converging to the target distribution. Once samples are taken from this sequence, they become external to Anglican and cannot be posthumously updated when further samples are requested. Even when all the output samples remain internal, revisiting samples remains difficult: one either needs to implement some form of memory for nested queries so they can be run further, or, if all information is instead stored at the outermost level, additional non-trivial code is necessary to apply post-processing and to revisit queries with previously tested inputs. The latter of these is likely to necessitate inference–algorithm–specific changes, particularly when there are multiple levels of nesting, thereby hampering the entire language construction.

To alleviate these issues, we propose to only increase the computational budget of new calls to nested queries, such that earlier calls use fewer samples than later calls. This simple adjustment removes the need for communication between different calls and requires only the storage of the number of times the outermost query has previously been sampled to make updates to the overall estimate. We refer to this approach as online NMC (ONMC), which, to the best of our knowledge, has not been previously considered in the literature. As we now show, ONMC only leads to small changes in the convergence rate of the resultant estimator compared to NMC: using recommended parameter settings, the asymptotic root mean squared error for ONMC is never more than twice that of NMC for a matched sample budget and can even be smaller.
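A minimal Python sketch of the ONMC idea on the same kind of toy nested expectation (all model and schedule details are illustrative choices of our own): each new outermost sample n uses an inner budget τ(n) = max(5, ⌊√n⌋), so earlier samples are never revisited and refining the estimate only requires knowing the current outermost sample count.

```python
import math
import random

random.seed(2)

def inner_estimate(y, n_inner):
    """MC estimate of E_z[y + z] with z ~ N(0, 1)."""
    return sum(y + random.gauss(0.0, 1.0) for _ in range(n_inner)) / n_inner

def onmc(n_outer, tau):
    """Online NMC for E_y[(E_z[y + z])^2] with y ~ N(0, 1) (true value 1).
    The inner budget tau(n) grows with the outer iteration n, so the
    estimate can be refined online without revisiting earlier samples."""
    running = 0.0
    for n in range(1, n_outer + 1):
        y = random.gauss(0.0, 1.0)
        value = inner_estimate(y, tau(n)) ** 2
        running += (value - running) / n  # online mean update
    return running

tau = lambda n: max(5, math.isqrt(n))  # grows at the sqrt rate, floored at 5
estimate = onmc(50_000, tau)
# the residual bias is roughly the average of 1/tau(n), which vanishes
# as the number of outermost samples grows
```

Note that the loop body touches only the current sample; contrast this with NMC, where increasing the inner budget forces every previous outer sample to be reprocessed.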

Let $\tau_k(n_0)$, $k = 1, \dots, D$, be monotonically increasing functions dictating the number of samples used by ONMC at depth $k$ for the $n_0$-th iteration of the outermost estimator. The ONMC estimator is defined as

\[
J_0 = \frac{1}{N_0} \sum_{n_0=1}^{N_0} f_0\left(y^{(0)}_{n_0},\, I_1\!\left(y^{(0)}_{n_0},\, \tau_{1:D}(n_0)\right)\right) \tag{16}
\]

where $I_1$ is calculated as per (2), but with the number of samples used at each depth $k$ set to $\tau_k(n_0)$. For reference, the NMC estimator, $I_0$, is as per (16), except for replacing $\tau_{1:D}(n_0)$ with fixed sample budgets $N_{1:D}$.

We see that ONMC uses fewer samples at the inner layers for earlier samples of the outermost level, and that each of the resulting inner estimates is calculated as per an NMC estimator with a reduced sample budget. We now show the consistency of the ONMC estimator.

###### Theorem 2.

If each $\tau_k(n_0) \to \infty$ as $n_0 \to \infty$ and each $f_k$ is continuously differentiable, then the mean squared error of $J_0$ as an estimator for $\gamma_0$ converges to zero as $N_0 \to \infty$.

In other words, ONMC converges for any realistic choice of $\tau_k$ provided $\tau_k(n_0) \to \infty$ as $n_0 \to \infty$: the requirements on $\tau_k$ are, for example, much weaker than requiring a logarithmic or faster rate of growth, which would already be an impractically slow rate of increase.

In the case where $\tau_k(n_0)$ increases at a polynomial rate, we can further quantify the rate of convergence, along with the relative variance and bias compared to NMC:

###### Theorem 3.

If each $\tau_k(n_0) \geq A n_0^{\alpha}$ for some constants $A > 0$ and $\alpha > 0$, and each $f_k$ is continuously differentiable, then

\[
E\left[(J_0 - \gamma_0)^2\right] \leq \frac{\varsigma_0^2}{N_0} + \left(\frac{\beta\, g(\alpha, N_0)}{A N_0^{\alpha}}\right)^2 + O(\epsilon), \tag{17}
\]
where
\[
g(\alpha, N_0) = \begin{cases} 1/(1-\alpha), & \alpha < 1 \\ \log(N_0) + \eta, & \alpha = 1 \\ \zeta(\alpha)\, N_0^{\alpha - 1}, & \alpha > 1, \end{cases} \tag{18}
\]
\[
\beta = \frac{C_0 \varsigma_1^2}{2} + \sum_{k=0}^{D-2} \left(\prod_{d=0}^{k} K_d\right) \frac{C_{k+1}\, \varsigma_{k+2}^2}{2}, \tag{19}
\]

$\eta$ is the Euler–Mascheroni constant; $\zeta(\cdot)$ is the Riemann zeta function; and $\varsigma_k$, $C_k$, and $K_k$ are constants defined as per the corresponding NMC bound given in (3).

###### Corollary 1.

Let $J_0$ be an ONMC estimator set up as per Theorem 3 using $N_0$ outermost samples, and let $I_0$ be an NMC estimator with a matched overall sample budget that uses $M_0$ outermost samples. Defining $c = \lim_{N_0 \to \infty} M_0 / N_0$, then

\[
\var[J_0] \to c\, \var[I_0] \quad \text{as} \quad N_0 \to \infty.
\]

Further, if the NMC bias decreases at a rate proportional to that implied by the bound given in (3), namely

\[
\left|E[I_0 - \gamma_0]\right| = \frac{b}{M_0^{\alpha}} + O(\epsilon) \tag{20}
\]

for some constant $b$, where $M_0$ is the number of outermost samples used by the NMC sampler, then

\[
\left|E[J_0 - \gamma_0]\right| \leq c^{\alpha}\, g(\alpha, N_0)\, \left|E[I_0 - \gamma_0]\right| + O(\epsilon).
\]

We expect the assumption that the bias decreases at the $1/M_0^{\alpha}$ rate to be satisfied in the vast majority of scenarios, but there may be edge cases, e.g. when an $f_k$ gives a constant output, for which faster rates are observed. Critically, the assumption holds for all nested inference problems because the rate given in (10) is an equality.

We see that if $\alpha < 1$, which will generally be the case in practice for sensible setups, then the convergence rates for ONMC and NMC vary only by a constant factor. Specifically, for a fixed value of $N_0$, they have the same asymptotic variance, while ONMC has a factor of $g(\alpha, N_0)$ higher bias. However, the cost of ONMC is (asymptotically) only $1/(\alpha D + 1)$ times that of NMC, so for a fixed overall sample budget it has lower variance.

As the bound varies only in constant factors for $\alpha < 1$, the asymptotically optimal setting of $\alpha$ for ONMC is the same as that for NMC, namely $\alpha = 1/2$ (Rainforth et al., 2018). For this setup, we have $g(\alpha, N_0) = 2$ and $c < 1$. Consequently, the fixed-budget variance of ONMC is always better than that of NMC, while the bias is never more than twice as large.

### 6.1 Empirical Confirmation

To test ONMC empirically, we consider the simple analytic model given in Appendix F, setting $\tau_1(n_0)$ to grow at the asymptotically optimal $\sqrt{n_0}$ rate, subject to a minimum value. The rationale for imposing a minimum value on $\tau_1(n_0)$ is to reduce the burn-in effect of ONMC – earlier samples will have larger bias than later samples, and we can mitigate this by enforcing a minimum inner budget. More generally, we recommend (in the absence of other information) choosing this minimum based on the smallest overall sample budget one expects to spend. In Figure 2, we have deliberately chosen a low minimum so as to emphasize the differences between NMC and ONMC. Given our choice of $\tau_1$, the ONMC approach is identical to fixing the inner budget while $n_0$ is small, but, unlike a fixed-budget estimator, it continues to improve beyond this point because it is not limited by asymptotic bias. Instead, we see an inflection-point-like behavior, with the rate recovering to effectively match that of the NMC estimator.

### 6.2 Using ONMC in PPSs

Using ONMC based estimation schemes to ensure consistent estimation for nested inference in PPSs is straightforward – the number of iterations the outermost query has been run for is stored and used to set the number of samples used by the inner queries. In fact, even this minimal level of communication is not necessary – the required inner budget can be inferred from the number of times we have previously run inference on the current query, the current depth $k$, and the schedule $\tau_k$.
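The bookkeeping this requires is tiny. A toy sketch (all names and structure are hypothetical, not any existing PPS API) of an inner query that infers its own sample budget purely from its invocation count and the schedule, with no communication from the outer query:

```python
import math

class NestedQuery:
    """Illustrative only: an inner query that sets its own sample budget
    from its invocation count and the schedule tau, without any
    communication from the outer query."""

    def __init__(self, tau):
        self.tau = tau   # schedule, e.g. tau(n) = max(5, isqrt(n))
        self.calls = 0   # number of times inference has been run so far

    def budget(self):
        """Return the inner sample budget for the next inference run."""
        self.calls += 1
        return self.tau(self.calls)

tau = lambda n: max(5, math.isqrt(n))
q = NestedQuery(tau)
budgets = [q.budget() for _ in range(100)]
# budgets are non-decreasing: early calls use the floor of 5 samples,
# while the 100th call uses isqrt(100) = 10
```

The same counter-based scheme extends to deeper nesting by keeping one counter per query at each depth.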

As with NMC, for nested inference problems ONMC can either return a single sample from each call of a nested query, or Rao–Blackwellize the drawing of this sample when possible. These produce estimators analogous to (11) and (9) respectively, except that the number of inner samples appearing in the definition of the inner weights is now a function of the outer iteration index $n_0$.

Returning to Figure 1, we see that using ONMC with nested importance sampling and only returning a single sample corrects the previous issues with how Anglican deals with nested inference, producing samples indistinguishable from the reference code.

## 7 Conclusions

We have formalized the notion of nesting probabilistic program queries and investigated the statistical validity of different categories of nesting. We have found that current systems tend to use methods that lead to asymptotic bias for nested inference problems, but that they are consistent for nested conditioning. We have shown how to carry out the former in a consistent manner and developed a new online estimator that simplifies the construction of algorithms satisfying the conditions required for convergence.

## Appendix A Proofs

See Theorem 1.

###### Proof.

Though informally the high-level result follows directly from (Rainforth et al., 2018, Theorem 3), there are three subtleties that require further attention. Firstly, unlike (Rainforth et al., 2018, Theorem 3), this result is an asymptotic equality rather than a bound – in the limit of large sample sizes it holds exactly. This more powerful result is made possible by knowing the exact form of the nonlinearity. Secondly, our overall estimator uses the ratio of two NMC estimators. Though Slutsky’s Theorem means this does not create complications in the general demonstration of convergence, additional care is required when calculating the exact rate. Finally, samples are reused in both the inner and outer estimators. This could easily be avoided by sampling an additional, independent set of inner samples for the outer estimator, thereby giving an estimator trivially of the form considered by (Rainforth et al., 2018, Theorem 3). However, doing so would be less efficient and is expected to have a larger variance than the estimator used.

We start by considering the partition function estimate, noting that its true value is $Z$:

\[
\hat{Z} = \frac{1}{N_0} \sum_{n=1}^{N_0} \frac{\frac{1}{N_1} \sum_{m=1}^{N_1} \frac{\psi(x_n, y_n, z_{n,m})\, \pi_i(y_n, z_{n,m})}{q(x_n, y_n, z_{n,m})}}{\frac{1}{N_1} \sum_{m=1}^{N_1} \frac{\pi_i(y_n, z_{n,m})}{q(z_{n,m} \mid y_n)}} \tag{21}
\]
\[
= \frac{1}{N_0} \sum_{n=1}^{N_0} \frac{\frac{1}{N_1} \sum_{m=1}^{N_1} v_{n,m}}{\frac{1}{N_1} \sum_{m=1}^{N_1} u_{n,m}} \tag{22}
\]

where

\[
v_{n,m} = \frac{\psi(x_n, y_n, z_{n,m})\, \pi_i(y_n, z_{n,m})}{q(x_n, y_n, z_{n,m})} \quad \text{and} \tag{23}
\]
\[
u_{n,m} = \frac{\pi_i(y_n, z_{n,m})}{q(z_{n,m} \mid y_n)} \tag{24}
\]

will be used as shorthands. Further defining

\[
\pi_i(y_n) = \int \pi_i(y_n, z)\, dz, \tag{25}
\]
\[
V_n = \frac{1}{N_1} \sum_{m=1}^{N_1} v_{n,m}, \quad \text{and} \quad U_n = \frac{1}{N_1} \sum_{m=1}^{N_1} u_{n,m}, \tag{26}
\]

and applying Taylor’s Theorem to $1/U_n$ about $\pi_i(y_n)$ gives

\[
\hat{Z} = O(\epsilon) + \frac{1}{N_0} \sum_{n=1}^{N_0} \frac{V_n}{\pi_i(y_n)} \left(1 + \frac{\pi_i(y_n) - U_n}{\pi_i(y_n)} + \left(\frac{\pi_i(y_n) - U_n}{\pi_i(y_n)}\right)^2\right) \tag{27}
\]

provided each $\pi_i(y_n) \neq 0$, to avoid singularity issues. We have by assumption that the proposal density is strictly positive wherever the target is, as otherwise the problem becomes ill-defined. On the other hand, if $\pi_i(y_n) = 0$, it must also be the case that $\pi_i(y_n, z_{n,m}) = 0$, and hence $v_{n,m} = 0$, for all $z_{n,m}$. Here, by taking the convention $0/0 = 0$ when this occurs, we can avoid all further possible singularity issues, such that (27) always holds.

Meanwhile, the standard breakdown of the mean squared error to the variance plus the bias squared gives

\[
E\left[(\hat{Z} - Z)^2\right] = \var[\hat{Z}] + \left(E[\hat{Z} - Z]\right)^2.
\]

Using (27), we see that the first term in the expansion dominates for the variance (as the relative error terms $(\pi_i(y_n) - U_n)/\pi_i(y_n)$ decrease with $N_1$), such that the weak law of large numbers gives

\[
\var[\hat{Z}] = \frac{1}{N_0} \var\left[\frac{V_1}{\pi_i(y_1)}\right] + O(\epsilon).
\]

Now we have

\[
V_1 = E[v_{1,1} \mid x_1, y_1] + \frac{1}{N_1} \sum_{m=1}^{N_1} \left(v_{1,m} - E[v_{1,m} \mid x_1, y_1]\right)
\]

and we further see from the weak law of large numbers that the second term tends to $0$ as $N_1$ increases, while the first term remains fixed. Thus the first term is dominant and we have

\[
\var[\hat{Z}] = \frac{1}{N_0} \var\left[\frac{E[v_{1,1} \mid x_1, y_1]}{\pi_i(y_1)}\right] + O(\epsilon) \tag{28}
\]
\[
= \frac{1}{N_0} \var\left[\frac{\int \psi(x_1, y_1, z)\, \pi_i(y_1, z)\, dz}{q(x_1, y_1)\, \pi_i(y_1)}\right] + O(\epsilon) \tag{29}
\]
\[
= \frac{\sigma_z^2}{N_0} + O(\epsilon) \tag{30}
\]

where

\[
\sigma_z^2 = \var\left[\frac{\int \pi_o(x_1, y_1, z)\, dz}{q(x_1, y_1)}\right]. \tag{31}
\]

Switching focus to the bias we have

\[
E[\hat{Z} - Z] = O(\epsilon) + E\left[\frac{V_1}{\pi_i(y_1)} \left(\frac{\pi_i(y_1) - U_1}{\pi_i(y_1)} + \left(\frac{\pi_i(y_1) - U_1}{\pi_i(y_1)}\right)^2\right)\right]
\]
\[
= O(\epsilon) + E\left[E\left[\frac{v_{1,1}}{\pi_i(y_1)} \left(\frac{\pi_i(y_1) - U_1}{\pi_i(y_1)} + \left(\frac{\pi_i(y_1) - U_1}{\pi_i(y_1)}\right)^2\right) \,\middle|\, y_1\right]\right]
\]

For the first order term in the expansion, only the component with respect to $u_{1,1}$ is non-zero as, for $m \neq 1$,

\[
E[v_{1,1} (\pi_i(y_1) - u_{1,m}) \mid y_1] = E[v_{1,1} \mid y_1]\, E[\pi_i(y_1) - u_{1,m} \mid y_1] = 0. \tag{32}
\]

Denoting the first order term as $T_1$, we thus have

\[
T_1 = E\left[\frac{v_{1,1} \left(\frac{1}{N_1} \sum_{m=1}^{N_1} \left(\pi_i(y_1) - u_{1,m}\right)\right)}{\left(\pi_i(y_1)\right)^2}\right] = \frac{1}{N_1} \left(E\left[\frac{v_{1,1}}{\pi_i(y_1)}\right] - E\left[\frac{v_{1,1}\, u_{1,1}}{\left(\pi_i(y_1)\right)^2}\right]\right) = \frac{1}{N_1}\Big(Z
\]