## 1 Introduction

Probabilistic programming systems (PPSs) allow probabilistic models to be represented in the form of a generative model and statements for conditioning on data (Goodman et al., 2008; Gordon et al., 2014). Informally, one can think of the generative model as the definition of a prior, the conditioning statements as the definition of a likelihood, and the output of the program as samples from a posterior distribution. Their core philosophy is to decouple model specification and inference, the former corresponding to the user-specified program code and the latter to an inference engine capable of operating on arbitrary programs. Removing the need for users to write inference algorithms significantly reduces the burden of developing new models and makes effective statistical methods accessible to non-experts.

Some, so-called universal, systems (Goodman et al., 2008; Goodman and Stuhlmüller, 2014; Mansinghka et al., 2014; Wood et al., 2014) further allow the definition of models that would be hard, or even impossible, to convey using conventional frameworks such as graphical models. One enticing manner they do this is by allowing arbitrary nesting of models, known in the probabilistic programming literature as queries (Goodman et al., 2008), such that it is easy to define and run problems that fall outside the standard inference framework (Goodman et al., 2008; Mantadelis and Janssens, 2011; Stuhlmüller and Goodman, 2014; Le et al., 2016). This allows the definition of models that could not be encoded without nesting, such as experimental design problems (Ouyang et al., 2016) and various models for theory-of-mind (Stuhlmüller and Goodman, 2014). In particular, models that involve agents reasoning about other agents require, in general, some form of nesting. For example, one might use such nesting to model a poker player reasoning about another player, as shown in Section 3.1.

As machine learning increasingly starts to tackle problem domains that require interaction with humans or other external systems, such as the need for self-driving cars to account for the behavior of pedestrians, we believe that such nested problems are likely to become increasingly common and that PPSs will form a powerful tool for encoding them.

However, previous work has, in general,
implicitly, and incorrectly, assumed that the convergence results from standard inference
schemes carry over directly to the nested setting. In truth,
inference for nested queries falls outside the
scope of conventional proofs and so additional
work is required to prove the consistency of PPS inference engines for nested queries.
Such problems constitute special cases of *nested estimation*.
In particular, the use of Monte Carlo (MC) methods by most PPSs means they form
particular instances
of *nested Monte Carlo* (NMC) estimation (Hong and Juneja, 2009).
Recent work (Rainforth et al., 2016a, 2018; Fort et al., 2017)
has demonstrated that NMC is consistent for a general class of models,
but also that it entails a convergence rate in the total computational cost
which decreases exponentially
with the depth of the nesting. Furthermore, additional assumptions are
required to achieve this convergence, most notably that, except in a few special cases,
one needs to drive not only the total number of samples used to infinity, but also
the number of samples used at each layer of the estimator,
a requirement generally flouted by existing PPSs.

The aim of this work is to formalize the notion of query nesting
and use these recent NMC results
to investigate the statistical correctness of the resulting procedures
carried out by PPS inference engines. To do this,
we postulate that there are three distinct ways one might nest one query within another:
sampling from the conditional distribution of another query (which we refer to as *nested inference*), factoring the trace probability of one query with the partition function estimate of another (which we refer to as *nested conditioning*), and using expectation estimates calculated using one query as *first class variables* in another. We use the aforementioned NMC results to assess the relative correctness of each of these categories of nesting. In the interest of exposition, we will mostly focus on the PPS *Anglican* (Tolpin et al., 2016; Wood et al., 2014) (and also occasionally Church (Goodman et al., 2008)) as a basis for our discussion, but note that our results apply more generally. For example, our nested inference case covers the problem of sampling from cut distributions in OpenBugs (Plummer, 2015).

We find that nested inference is statistically challenging and incorrectly handled by existing systems, while nested conditioning is statistically straightforward and done correctly. Using estimates as variables turns out to be exactly equivalent to generic NMC estimation and must thus be dealt with on a case-by-case basis. Consequently, we will focus more on nested inference than the other cases.

To assist in the development of consistent approaches, we further
introduce a new *online* NMC (ONMC) scheme that
obviates the need to revisit previous samples when refining estimates, thereby
simplifying the
process of writing consistent online nested estimation schemes,
as required by most PPSs.
We show that ONMC’s convergence rate only varies by a small
constant factor relative to conventional NMC: given
some weak assumptions and the use of recommended parameter settings,
its asymptotic variance is always
better than the equivalent NMC estimator with matched total sample budget,
while its asymptotic bias is
always within a factor of two.

## 2 Background

### 2.1 Nested Monte Carlo

We start by providing a brief introduction to NMC, using similar notation to that of Rainforth et al. (2018). Conventional MC estimation approximates an intractable expectation $I$ of a function $\lambda$ using

$$I = \mathbb{E}\left[\lambda\left(y^{(0)}\right)\right] \approx \frac{1}{N_0} \sum_{n=1}^{N_0} \lambda\left(y^{(0)}_n\right), \qquad (1)$$

where each $y^{(0)}_n \sim p\left(y^{(0)}\right)$ is drawn independently, resulting in a mean squared error (MSE) that decreases at a rate $O(1/N_0)$. For nested estimation problems, $\lambda\left(y^{(0)}\right)$ is itself intractable, corresponding to a nonlinear mapping of a (nested) estimation. Thus in the single nesting case, $\lambda\left(y^{(0)}\right) = f_0\left(y^{(0)}, \gamma_1\left(y^{(0)}\right)\right)$ where $\gamma_1\left(y^{(0)}\right) = \mathbb{E}\left[f_1\left(y^{(0)}, y^{(1)}\right) \mid y^{(0)}\right]$, giving

$$I \approx \frac{1}{N_0} \sum_{n=1}^{N_0} f_0\left(y^{(0)}_n,\; \frac{1}{N_1} \sum_{m=1}^{N_1} f_1\left(y^{(0)}_n, y^{(1)}_{n,m}\right)\right),$$

where each $y^{(1)}_{n,m} \sim p\left(y^{(1)} \mid y^{(0)}_n\right)$ is drawn independently and the inner sum is now a MC estimate using $N_1$ samples.

More generally, one may have multiple layers of nesting. To notate this, we first presume some fixed integral depth $D \ge 0$ (with $D = 0$ corresponding to conventional estimation), and real-valued functions $f_0, \dots, f_D$. We then recursively define

$$\gamma_D\left(y^{(0:D-1)}\right) = \mathbb{E}\left[f_D\left(y^{(0:D)}\right) \,\middle|\, y^{(0:D-1)}\right],$$

$$\gamma_k\left(y^{(0:k-1)}\right) = \mathbb{E}\left[f_k\left(y^{(0:k)}, \gamma_{k+1}\left(y^{(0:k)}\right)\right) \,\middle|\, y^{(0:k-1)}\right]$$

for $0 \le k < D$. Our goal is to estimate $I = \gamma_0$, for which the NMC estimate $I_0$ is defined recursively using

$$I_D\left(y^{(0:D-1)}\right) = \frac{1}{N_D} \sum_{n_D=1}^{N_D} f_D\left(y^{(0:D)}_{n_D}\right), \qquad I_k\left(y^{(0:k-1)}\right) = \frac{1}{N_k} \sum_{n_k=1}^{N_k} f_k\left(y^{(0:k)}_{n_k},\, I_{k+1}\left(y^{(0:k)}_{n_k}\right)\right) \qquad (2)$$

for $0 \le k < D$, where each $y^{(k)}_{n_k} \sim p\left(y^{(k)} \mid y^{(0:k-1)}\right)$ is drawn independently. Note that there are multiple values of $y^{(k)}_{n_k}$ for each associated $y^{(0:k-1)}$, and that $I_k\left(y^{(0:k-1)}\right)$ is still a random variable given $y^{(0:k-1)}$. As shown by (Rainforth et al., 2018, Theorem 3), if each $f_k$ is continuously differentiable and the estimates at each depth have finite variance, then the MSE converges at rate

$$\mathbb{E}\left[\left(I_0 - \gamma_0\right)^2\right] \le \frac{\sigma_0^2}{N_0} + \left(\sum_{k=1}^{D} \left(\prod_{j=1}^{k-1} K_j\right) \frac{L_k\, \sigma_k^2}{2 N_k}\right)^2 + O(\epsilon), \qquad (3)$$

where $K_k$ and $L_k$ are respectively bounds on the magnitude of the first and second derivatives of $f_k$, the $\sigma_k^2$ are variance constants for each depth, and $O(\epsilon)$ represents asymptotically dominated terms – a convention we will use throughout. Note that the dominant terms in the bound correspond respectively to the variance and the bias squared. Theorem 2 of Rainforth et al. (2018) further shows that the continuously differentiable assumption need only hold almost surely, rather than absolutely, for convergence more generally, such that functions with measure-zero discontinuities still converge in general.

We see from (3) that if any of the $N_k$ with $k \ge 1$ remain fixed, there is a minimum error that can be achieved: convergence requires each $N_k \to \infty$. As we will later show, many of the shortfalls in dealing with nested queries by existing PPSs revolve around implicitly fixing the number of samples used for the inner estimates.

For a given total sample budget $T = N_0 \prod_{k=1}^{D} N_k$, the bound is tightest when $N_1 = \dots = N_D = \sqrt{N_0}$, giving a convergence rate of $O\left(T^{-2/(D+2)}\right)$. The intuition behind this potentially surprising optimum setting is that the variance is mostly dictated by $N_0$ and the bias by the other $N_k$. We see that the convergence rate diminishes exponentially with $D$. However, this optimal setting of the $N_k$ still gives a substantially faster rate than the $O\left(T^{-1/(D+1)}\right)$ from naïvely setting $N_0 = N_1 = \dots = N_D$.
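This budget trade-off is easy to check numerically. The following sketch (our own toy illustration, not from the original paper) applies single-nesting NMC to a problem with known answer, $I = \mathbb{E}_y[(\mathbb{E}_z[y+z \mid y])^2] = 1$ for $y, z \sim \mathcal{N}(0,1)$; holding the inner budget $N_1$ fixed leaves a bias of roughly $1/N_1$ no matter how large $N_0$ becomes:

```python
import numpy as np

def nmc(n_outer, n_inner, rng):
    # Single-nesting NMC for I = E_y[(E_z[y + z | y])^2] with y, z ~ N(0, 1).
    # Here gamma(y) = E[y + z | y] = y, so the true value is I = E[y^2] = 1.
    y = rng.standard_normal(n_outer)             # outer samples y_n
    z = rng.standard_normal((n_outer, n_inner))  # inner samples z_{n,m}
    gamma_hat = (y[:, None] + z).mean(axis=1)    # inner MC estimates of gamma(y_n)
    return (gamma_hat ** 2).mean()               # outer average of f_0 = gamma^2

rng = np.random.default_rng(0)
print(nmc(40_000, 1, rng))    # ~2.0: fixed N_1 = 1 leaves a bias of 1/N_1 = 1
print(nmc(40_000, 100, rng))  # ~1.01: growing N_1 shrinks the bias towards 0
```

Doubling the total budget helps only if part of it is spent growing the inner estimates, exactly as the optimal allocation above prescribes.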

### 2.2 The Anglican PPS

Anglican is a universal probabilistic programming language integrated into *Clojure* (Hickey, 2008), a dialect of Lisp.
There are two important ideas to understand for reading Clojure:
almost everything is a function and parentheses cause evaluation.
For example, a + b is coded as
(+ a b), where + is a function taking two arguments
and the parentheses cause the
function to evaluate.

Anglican inherits most of the syntax of Clojure, but extends it with the key special forms sample and observe (Wood et al., 2014; Tolpin et al., 2015, 2016), between which the distribution of the query is defined. Informally, sample specifies terms in the prior and observe terms in the likelihood. More precisely, sample is used to make random draws from a provided distribution and observe is used to apply conditioning, factoring the probability density of a program trace by a provided density evaluated at an “observed” point.

The syntax of sample is to take a *distribution object* as its only input and return a sample. observe instead
takes a distribution object and an observation and returns nil, while changing
the program trace probability in Anglican’s back-end.
Anglican provides a number of *elementary random procedures*, i.e. distribution object constructors for common sampling distributions,
but also allows users to define their own distribution object constructors using the defdist macro.
Distribution objects are generated by calling a class constructor
with the required parameters, e.g. (normal 0 1).

Anglican queries are written using the macro defquery. This allows users to define a model using a mixture of sample and observe statements and deterministic code, and bind that model to a variable. As a simple example,
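the sketch below illustrates such a query; the specific priors shown (a unit Gaussian on the mean and a Gamma(2, 2) on the standard deviation) are illustrative assumptions rather than essential to the example:

```clojure
(defquery my-query [data]
  (let [mu  (sample (normal 0 1))  ;; prior on the mean (illustrative)
        sig (sample (gamma 2 2))   ;; prior on the standard deviation (illustrative)
        lik (normal mu sig)]       ;; distribution object for the likelihood
    (map (fn [y] (observe lik y)) data)
    [mu sig]))
```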

corresponds to a model where we are trying to infer the mean and standard deviation of a Gaussian given some data. The syntax of defquery is

(defquery name [args] body) such that we are binding the query to my-query here. The query starts by sampling mu and sig, before constructing a distribution object lik to use for the observations. It then maps over each datapoint and observes it under the distribution lik. After the observations are made, mu and sig are returned from the variable-binding let block and then by proxy the query itself. Denoting the data as $\mathcal{D} = \{y_s\}_{s=1}^{S}$, this particular query defines the joint distribution

$$p(\mu, \sigma, \mathcal{D}) = p(\mu)\, p(\sigma) \prod_{s=1}^{S} \mathcal{N}\left(y_s; \mu, \sigma^2\right).$$
Inference on a query is performed using the macro doquery, which produces a lazy infinite sequence of approximate samples from the conditional distribution and, for appropriate inference algorithms, an estimate of the partition function. Its calling syntax is (doquery inf-alg model inputs & options).

Key to our purposes is Anglican’s ability to *nest* queries within one another.
In particular, the special form conditional takes a query and returns a distribution object constructor, the outputs of which
ostensibly correspond to the conditional distribution
defined by the query, with the inputs to the query becoming its parameters.
However, as we will show in the next section, the true behavior of conditional deviates from this, thereby leading to inconsistent nested inference schemes.

## 3 Nested Inference

One of the clearest ways one might want to nest queries is by sampling from the conditional distribution of one query inside another. A number of examples of this are provided for Church in (Stuhlmüller and Goodman, 2014).¹ Such *nested inference* problems fall under a more general framework of inference for so-called doubly (or multiply) intractable distributions (Murray et al., 2006). The key feature of these problems is that they include terms with unknown, *parameter dependent*, normalization constants. For nested probabilistic programming queries, this manifests through *conditional normalization*.

¹ Though their nesting happens within the conditioning predicate, Church’s semantics means they constitute nested inference.

Consider the following unnested model using the Anglican function declaration defm.
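A sketch of the kind of function described is given below; the specific distributions used are illustrative assumptions on our part:

```clojure
(defm inner [y D]
  (let [z (sample (gamma y 1))]  ;; z is drawn conditionally on the input y
    (observe (normal y z) D)     ;; condition on the data D
    z))
```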

Here inner is simply an Anglican function: it takes in inputs y and D, affects the trace probability through its observe statement, and returns the random variable z as output. The unnormalized distribution for this model is thus straightforwardly given by

$$\pi(y, z) = p(y)\, p(z \mid y)\, p(\mathcal{D} \mid y, z),$$

for which we can use conventional inference schemes.

We can convert this model to a nested inference problem by using defquery and conditional as follows.
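Again as an illustrative sketch, with the same assumed distributions as before and an assumed prior on y:

```clojure
(defquery inner [y D]
  (let [z (sample (gamma y 1))]
    (observe (normal y z) D)
    z))

(defquery outer [D]
  (let [y    (sample (beta 2 3))  ;; illustrative prior on y
        dist (conditional inner)  ;; distribution constructor from the query
        z    (sample (dist y D))] ;; approximate draw from p(z | y, D)
    [y z]))
```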

This is now a nested query: a separate inference procedure is invoked for each call of (sample (dist y D)), returning an approximate sample from the conditional distribution defined by inner when input with the current values of y and D. Mathematically, conditional applies a conditional normalization. Specifically, the component of $\pi(y, z)$ from the previous example corresponding to inner was $p(z \mid y)\, p(\mathcal{D} \mid y, z)$, and conditional locally normalizes this to the probability distribution

$$p(z \mid y, \mathcal{D}) = \frac{p(z \mid y)\, p(\mathcal{D} \mid y, z)}{Z(y)}, \quad\text{where}\quad Z(y) = \int p(z \mid y)\, p(\mathcal{D} \mid y, z)\, dz.$$

The distribution now defined by outer is thus given by

$$\pi(y, z) = p(y)\, p(z \mid y, \mathcal{D}) = \frac{p(y)\, p(z \mid y)\, p(\mathcal{D} \mid y, z)}{Z(y)}.$$

Critically, the partial normalization constant $Z(y)$ depends on $y$, and so the conditional distribution is doubly intractable: we cannot evaluate $\pi(y, z)$ exactly.

Another way of looking at this is that wrapping inner in conditional has “protected” y from the conditioning in inner (noting that $p(z \mid y, \mathcal{D})$ integrates to 1 for any $y$), such that its observe statement only affects the probability of z given y and not the marginal probability of y. This is why, when there is only a single layer of nesting, nested inference is equivalent to the notion of sampling from “cut distributions” (Plummer, 2015), whereby the sampling of certain subsets of the variables in a model are made with factors of the overall likelihood omitted.

It is important to note that if we had observed the *output* of the inner query, rather than sampling
from it, this would still constitute a nested inference problem. The key to the nesting
is the conditional normalization applied by conditional, not the exact usage of the generated
distribution object dist. However, as discussed in Appendix B, actually
observing a nested query requires numerous additional computational issues to
be overcome, which are beyond the scope of this paper. We thus focus on the nested sampling scenario.

### 3.1 Motivating Example

Before jumping into a full formalization of nested inference, we first consider the motivating example of modeling a poker player who reasons about another player. Here each player has access to information the other does not, namely the cards in their hand, and they must perform their own inference to deal with the resulting uncertainty.

Imagine that the first player is deciding whether or not to bet. She could naïvely just make this decision based on the strength of her hand, but more advanced play requires her to reason about actions the other player might take given her own action, e.g. by considering whether a bluff is likely to be successful. She can carry out such reasoning by constructing a model for the other player to try and predict their action given her action and their hand. Again this nested model could just simply be based on a naïve simulation, but we can refine it by adding another layer of meta-reasoning: the other player will themselves try to infer the first player’s hand to inform their own decision.

These layers of meta-reasoning create a nesting: for the first player to choose an action, they must run multiple simulations for what the other player will do given that action and their hand, each of which requires inference to be carried out. Here adding more levels of meta-reasoning can produce smarter models, but also requires additional layers of nesting. We expand on this example to give a concrete nested inference problem in Appendix E.

### 3.2 Formalization

To formalize the nested inference problem more generally, let $y$ and $x$ denote all the random variables of an outer query that are respectively passed or not to the inner query. Further, let $z$ denote all random variables generated in the inner query – for simplicity, we will assume, without loss of generality, that these are all returned to the outer query, but that some may not be used. The unnormalized density for the outer query can now be written in the form

$$\pi(x, y, z) = \psi(x, y)\, p_i(z \mid y), \qquad (4)$$

where $p_i(z \mid y)$ is the normalized density of the outputs of the inner query and $\psi(x, y)$ encapsulates all other terms influencing the trace probability of the outer query. Now the inner query defines an unnormalized density $\pi_i(z; y)$ that can be evaluated pointwise and we have

$$p_i(z \mid y) = \frac{\pi_i(z; y)}{Z(y)}, \quad\text{where}\quad Z(y) = \int \pi_i(z; y)\, dz, \qquad (5)$$

such that

$$\pi(x, y, z) = \frac{\psi(x, y)\, \pi_i(z; y)}{Z(y)}, \qquad (6)$$

where $\pi(x, y, z)$ is our (unnormalized) target distribution, for which we can directly evaluate the numerator, but the denominator $Z(y)$ is intractable and must be evaluated separately for each possible value of $y$. Our previous example is achieved by fixing $\psi(x, y) = p(y)$ and $\pi_i(z; y) = p(z \mid y)\, p(\mathcal{D} \mid y, z)$. We can further straightforwardly extend to the multiple layers of nesting setting by recursively defining $\pi_i$ in the same way as $\pi$.

### 3.3 Relationship to Nested Estimation

To relate the nested inference problem back to the nested estimation formulation from Section 2.1, we consider using a proposal $q(x, y, z) = q(x, y)\, q(z \mid y)$ to calculate the expectation of some arbitrary function $g(x, y, z)$ under $\pi(x, y, z)$ as per self-normalized importance sampling:

$$I = \frac{\mathbb{E}_{q(x,y,z)}\left[\dfrac{g(x, y, z)\, \psi(x, y)\, \pi_i(z; y)}{Z(y)\, q(x, y, z)}\right]}{\mathbb{E}_{q(x,y,z)}\left[\dfrac{\psi(x, y)\, \pi_i(z; y)}{Z(y)\, q(x, y, z)}\right]}, \quad\text{where}\quad Z(y) = \mathbb{E}_{q(z \mid y)}\left[\frac{\pi_i(z; y)}{q(z \mid y)}\right]. \qquad (7)$$

Here both the denominator and numerator are nested expectations, with a nonlinearity coming from the fact that we are using the reciprocal of the expectation $Z(y)$. A similar reformulation could also be applied in cases with multiple layers of nesting, i.e. where inner itself makes use of another query. The formalization can also be directly extended to the sequential MC (SMC) setting by invoking extended space arguments (Andrieu et al., 2010).

Typically $Z(y)$ is not known upfront and we instead return an empirical measure from the program in the form of weighted samples, which can later be used to estimate an expectation. That is, if we sample $(x_n, y_n) \sim q(x, y)$ and $z_{n,m} \sim q(z \mid y_n)$ and return all samples (such that each $(x_n, y_n)$ is duplicated $M$ times in the sample set), then our unnormalized weights are given by

$$w_{n,m} = \frac{\psi(x_n, y_n)\, \pi_i(z_{n,m}; y_n)}{\hat{Z}(y_n)\, q(x_n, y_n)\, q(z_{n,m} \mid y_n)}, \quad\text{where}\quad \hat{Z}(y_n) = \frac{1}{M} \sum_{\ell=1}^{M} \frac{\pi_i(z_{n,\ell}; y_n)}{q(z_{n,\ell} \mid y_n)}. \qquad (8)$$

This, in turn, gives us the empirical measure

$$\hat{p}(x, y, z) = \frac{\sum_{n=1}^{N} \sum_{m=1}^{M} w_{n,m}\, \delta_{(x_n, y_n, z_{n,m})}(x, y, z)}{\sum_{n=1}^{N} \sum_{m=1}^{M} w_{n,m}}, \qquad (9)$$

where $\delta_{(x_n, y_n, z_{n,m})}(x, y, z)$ is a delta function centered on $(x_n, y_n, z_{n,m})$.
By definition, the convergence of this empirical measure to the target
requires that
expectation estimates calculated using it converge in
probability for *any* integrable $g$ (presuming our proposal
is valid). We thus see that
the convergence of the ratio of nested
expectations in (7) for any arbitrary $g$ is equivalent
to the produced samples converging to the distribution defined by the program. Informally,
the NMC results then tell us this will happen in the limit $N, M \to \infty$,
provided that $Z(y)$ is strictly positive for all possible $y$ (as otherwise
the problem becomes ill-defined).
More formally we have the following result. Its proof, along with all others,
is given in Appendix A.

###### Theorem 1.

Let $g$ be an integrable function, let $I = \mathbb{E}_{\pi}[g(x, y, z)]$, and let $\hat{I}_{N,M}$ be a self-normalized MC estimate for $I$ calculated using $\hat{p}(x, y, z)$ as per (9). Assuming that $q(x, y, z)$ forms a valid importance sampling proposal distribution for $\pi(x, y, z)$, then

$$\mathbb{E}\left[\left(\hat{I}_{N,M} - I\right)^2\right] = \frac{\sigma^2}{N} + \frac{\delta^2}{M^2} + O(\epsilon), \qquad (10)$$

where $\sigma$ and $\delta$ are constants derived in the proof and, as before, $O(\epsilon)$ represents asymptotically dominated terms.

Note that rather than simply being a bound, this result is an equality and thus provides the exact asymptotic rate. Using the arguments of (Rainforth et al., 2018, Theorem 3), it can be straightforwardly extended to cases of multiple nesting (giving a rate analogous to (3)), though characterizing and becomes more challenging.
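To make (8) and (9) concrete, the following sketch (our own toy illustration, not from the paper) implements the nested importance sampler for a model where the outer query draws $y \sim \mathcal{N}(0,1)$ and the inner query draws $z \sim \mathcal{N}(y,1)$ then observes $d$ under $\mathcal{N}(z,1)$, so that $p(z \mid y, d) = \mathcal{N}((y+d)/2, 1/2)$ and the true cut-distribution mean of $z$ is $d/2$:

```python
import numpy as np

def normpdf(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2.0 * np.pi))

def nested_is(d, n, m, rng):
    # Nested importance sampler per (8)-(9) for the cut model:
    #   outer: y ~ N(0, 1);  inner: z ~ N(y, 1), observe d under N(z, 1).
    # Proposals are the priors, so psi/q(y) = 1 and pi_i/q(z|y) = N(d; z, 1).
    y = rng.standard_normal(n)                    # y_n draws from q(y)
    z = y[:, None] + rng.standard_normal((n, m))  # z_{n,m} ~ q(z | y_n)
    u = normpdf(d, z, 1.0)                        # inner weights pi_i / q
    z_hat = u.mean(axis=1)                        # per-call estimates Z_hat(y_n)
    w = u / z_hat[:, None]                        # full weights as in (8)
    return (w * z).sum() / w.sum()                # estimate of E[z] under the target

rng = np.random.default_rng(1)
print(nested_is(2.0, 4000, 200, rng))  # ~1.0 = d/2, the cut-distribution mean
```

Holding $M$ fixed leaves an $O(1/M)$ bias term as per (10); only growing both $N$ and $M$ recovers the true value.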

### 3.4 Convergence Requirements

We have demonstrated that the problem of nested inference is a particular case of nested estimation. This problem equivalence will hold whether we elect to use the aforementioned nested importance sampling based approach or not, while we see that our finite sample estimates must be biased for non-trivial $g$ by the convexity of the reciprocal function and Theorem 4 of Rainforth et al. (2018). Presuming we cannot produce exact samples from the inner query and that the set of possible inputs to the inner query is not finite (these are respectively considered in Appendix D and Appendix C), we thus see that there is no “silver bullet” that can reduce the problem to a standard estimation.

We now ask: what behavior do we need for Anglican’s conditional, and nested inference more generally, to ensure convergence? At a high level, the NMC results show us that we need the computational budget of each call of a nested query to become arbitrarily large, such that we use an infinite number of samples at each layer of the estimator: we require each $N_k \to \infty$.

We have formally demonstrated convergence when this requirement is satisfied and the previously introduced nested importance sampling approach is used. Another possible approach would be to, instead of drawing samples to estimate (7) directly, importance sample $M$ times for each call of the inner query and then return a single sample from these, drawn in proportion to the inner query importance weights. We can think of this as drawing the same raw samples, but then constructing the estimator as

$$\hat{I} = \frac{\sum_{n=1}^{N} \bar{w}_n\, g(x_n, y_n, z_{n,m_n})}{\sum_{n=1}^{N} \bar{w}_n}, \quad\text{where}\quad \bar{w}_n = \frac{\psi(x_n, y_n)}{q(x_n, y_n)} \qquad (11)$$

$$\text{and}\quad m_n \sim \text{Discrete}\left(\left\{\frac{\pi_i(z_{n,m}; y_n) / q(z_{n,m} \mid y_n)}{\sum_{\ell=1}^{M} \pi_i(z_{n,\ell}; y_n) / q(z_{n,\ell} \mid y_n)}\right\}_{m=1}^{M}\right). \qquad (12)$$

As demonstrated formally in Appendix A, this approach also converges. However, if we Rao–Blackwellize (Casella and Robert, 1996) the sampling of the $m_n$, we find that this recovers (9). Consequently, this is a strictly inferior estimator (it has an increased variance relative to (9)). Nonetheless, it may often be a convenient setup from the perspective of the PPS semantics and it will typically have substantially reduced memory requirements: we need only store the single returned sample from the inner query to construct our empirical measure, rather than all of the samples generated within the inner query.

Though one can use the results of Fort et al. (2017) to show the correctness of instead using an MCMC estimator for the outer query, the correctness of using MCMC methods for the inner queries is not explicitly covered by existing results. Here we find that we need to start a new Markov chain for each call of the inner query because each value of $y$ defines a different local inference problem. One would intuitively expect the NMC results to carry over – as all the inner queries will run their Markov chains for an infinitely long time, thereby in principle returning exact samples – but we leave formal proof of this case to future work. We note that such an approach effectively equates to what is referred to as *multiple imputation* by Plummer (2015).

### 3.5 Shortfalls of Existing Systems

Using the empirical measure (9) provides one possible manner of producing a consistent estimate of our target by taking $N, M \to \infty$, and so we can use this as a gold-standard reference approach (with a large value of $M$) to assess whether Anglican returns samples for the correct target distribution. To this end, we ran Anglican’s importance sampling inference engine on the simple model introduced earlier and compared its output to the reference approach. As shown in Figure 1, the samples produced by Anglican are substantially different to the reference code, demonstrating that the outputs do not match their semantically intended distribution. For reference, we also considered the distribution induced by the aforementioned unnested model and a naïve estimation scheme where a sample budget of $M = 1$ is used for each call to inner, effectively corresponding to ignoring the observe statement by directly returning the first draw of z.

We see that the unnested model defines a noticeably different distribution, while the behavior of Anglican is similar, but distinct, to ignoring the observe statement in the inner query. Further investigation shows that the default behavior of conditional in a query nesting context is equivalent to using (11) but with $M$ held fixed at $M = 1$, inducing a substantial bias. More generally, the Anglican source code shows that conditional defines a Markov chain generated by equalizing the output of the weighted samples generated by running inference on the query. When used to nest queries, this Markov chain is only ever run for a finite length of time – specifically, one accept–reject step is carried out – and so does not produce samples from the true conditional distribution.

Plummer (2015) noticed that WinBugs and OpenBugs (Spiegelhalter et al., 1996) similarly do not provide valid inference when using their cut function primitives, which effectively allow the definition of nested inference problems. However, he does not note the equivalence to the NMC formulation and instead proposes a heuristic for reducing the bias that itself has no theoretical guarantees.

## 4 Nested Conditioning

An alternative way one might wish to nest queries is to
use the partition function estimate of one query to factor the trace probability of another.
We refer to this as *nested conditioning*.
In its simplest form, we can think about conditioning on the values input
to the inner query. In Anglican we can carry this out by
using the following custom distribution object constructor
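A sketch of such a constructor is given below; the parameter ordering, the defdist method names, and the use of a log-marginal helper to compute the partition function estimate from the weighted sample set are all assumptions on our part:

```clojure
(defdist marg
  [inner-q inputs inf-alg M]  ;; query, its inputs, inference algorithm, budget
  []
  (sample* [this] nil)        ;; sampling is not needed for conditioning
  (observe* [this _]
    (log-marginal (take M (doquery inf-alg inner-q inputs)))))
```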

When the resulting distribution object is observed, this will now generate, and factor the trace probability by, a partition function estimate for inner with inputs inputs, constructed using M samples of the inference algorithm inf-alg. For example, if we were to use the query
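As an illustrative sketch of such a query (the prior on y and the inference settings are assumptions of ours):

```clojure
(defquery outer [D]
  (let [y (sample (beta 2 3))]                    ;; illustrative prior on y
    (observe (marg inner [y D] :importance 100) nil)  ;; factor in Z_hat(y)
    y))
```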

with inner from the nested inference example, then this would form a pseudo marginal sampler (Andrieu and Roberts, 2009) for the unnormalized target distribution

$$\pi(y) = p(y)\, Z(y), \quad\text{where}\quad Z(y) = \int \pi_i(z; y)\, dz.$$
Unlike the nested inference case, nested conditioning turns out to be valid even if our budget $M$ is held fixed, provided that the partition function estimate is unbiased, as is satisfied by, for example, importance sampling and SMC. In fact, it is important to hold the budget fixed to achieve the standard $O(1/N)$ MC convergence rate. In general, we can define our target density as

$$\pi(x, y) = \psi(x, y)\, Z(y), \qquad (13)$$

where $\psi(x, y)$ is as before (except that we no longer have returned variables from the inner query) and $Z(y)$ is the true partition function of the inner query when given input $y$. In practice, we cannot evaluate $Z(y)$ exactly, but instead produce unbiased estimates $\hat{Z}(y)$. Using an analogous self-normalized importance sampling to the nested inference case leads to the weights

$$w_n = \frac{\psi(x_n, y_n)\, \hat{Z}(y_n)}{q(x_n, y_n)} \qquad (14)$$

and corresponding empirical measure

$$\hat{p}(x, y) = \frac{\sum_{n=1}^{N} w_n\, \delta_{(x_n, y_n)}(x, y)}{\sum_{n=1}^{N} w_n}, \qquad (15)$$

such that we are conducting conventional MC estimation, but our weights are now themselves random variables for a given $(x_n, y_n)$ due to the $\hat{Z}(y_n)$ term. However, the weights are unbiased estimates of the “true weights” $\psi(x_n, y_n)\, Z(y_n) / q(x_n, y_n)$, such that we have proper weighting (Naesseth et al., 2015) and thus convergence at the standard MC rate, provided the budget of the inner query remains fixed. This result also follows directly from Theorem 6 of Rainforth et al. (2018), which further ensures no complications arise when conditioning on multiple queries if the corresponding partition function estimates are generated independently. These results further trivially extend to the repeated nesting case by recursion, while using the idea of pseudo-marginal methods (Andrieu and Roberts, 2009), the results also extend to using MCMC based inference for the outermost query.
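The proper-weighting property is easy to check numerically. In the following sketch (a toy illustration of ours), the inner model is $z \sim \mathcal{N}(y, 1)$ with observation $d$ under $\mathcal{N}(z, 1)$, for which $Z(y) = \mathcal{N}(d; y, \sqrt{2})$ analytically, so the target $\pi(y) \propto \mathcal{N}(y; 0, 1)\, Z(y)$ has mean $d/3$; the estimator converges at the standard MC rate even with the inner budget held fixed at $M = 5$:

```python
import numpy as np

def normpdf(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2.0 * np.pi))

def nested_conditioning(d, n, m, rng):
    # Pseudo-marginal style estimate: each outer weight is an unbiased
    # m-sample partition function estimate Z_hat(y_n) for the inner model
    #   z ~ N(y, 1), observe d under N(z, 1)  =>  Z(y) = N(d; y, sqrt(2)).
    # The target, proportional to N(y; 0, 1) Z(y), has mean d / 3 analytically.
    y = rng.standard_normal(n)                    # y_n ~ q(y) = p(y)
    z = y[:, None] + rng.standard_normal((n, m))  # inner samples
    z_hat = normpdf(d, z, 1.0).mean(axis=1)       # unbiased Z_hat(y_n), as in (14)
    return (z_hat * y).sum() / z_hat.sum()        # self-normalized estimate of E[y]

rng = np.random.default_rng(2)
print(nested_conditioning(2.0, 40_000, 5, rng))  # ~0.667 = d/3, despite fixed m = 5
```

Contrast this with the nested inference example, where a fixed inner budget left an irreducible bias.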

Rather than just fixing the inputs to the nested query, one can also consider conditioning on the internally sampled variables in the program taking on certain values. Such a nested conditioning approach has been implicitly carried out by Rainforth et al. (2016b); Zinkov and Shan (2017); Scibior and Ghahramani (2016); Ge et al. (2018), each of which manipulate the original program in some fashion to construct a partition function estimator that is used within a greater inference scheme, e.g. a PMMH estimator (Andrieu et al., 2010).

## 5 Estimates as Variables

Our final case is that one might wish to use estimates as first class variables in another query. In other words, a variable in an outer query is assigned to a MC expectation estimate calculated from the outputs of running inference on another, nested, query. By comparison, the nested inference case (without Rao-Blackwellization) can be thought of as assigning a variable in the outer query to a single approximate sample from the conditional distribution of the inner query, rather than an MC expectation estimate constructed by averaging over multiple samples.

Whereas nested inference can only encode a certain class of nested estimation problems – because the only nonlinearity originates from taking the reciprocal of the partition function – using estimates as variables allows, in principle, the encoding of any nested estimation. This is because using the estimate as a first class variable allows arbitrary nonlinear mappings to be applied by the outer query.

An example of this approach is shown in Appendix G, where we construct a generic estimator for Bayesian experimental design problems. Here a partition function estimate is constructed for an inner query and is then used in an outer query. The output of the outer query depends on the logarithm of this estimate, thereby creating the nonlinearity required to form a nested expectation.

Because using estimates as variables allows the encoding of any nested estimation problem, the validity of doing so is equivalent to that of NMC more generally and must thus satisfy the requirements set out in (Rainforth et al., 2018). In particular, one needs to ensure that the budgets used for the inner estimates increase as more samples of the outermost query are taken.
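As a minimal numerical illustration of this requirement (our own toy example, not from the paper), consider a variable set to the logarithm of a partition function estimate, as in the experimental design setting: the nonlinearity of the log makes the finite-budget estimate biased, and the bias only vanishes as the inner budget grows:

```python
import numpy as np

def log_z_bias(m, trials, rng):
    # Z = E[exp(x)] with x ~ N(0, 1), so Z = exp(0.5) and log Z = 0.5 exactly.
    # Using log(Z_hat) as a first class variable inherits a Jensen-gap bias
    # of order O(1/m) from the m-sample inner estimate Z_hat.
    x = rng.standard_normal((trials, m))
    log_z_hat = np.log(np.exp(x).mean(axis=1))  # log of the inner MC estimate
    return log_z_hat.mean() - 0.5               # average bias (always negative)

rng = np.random.default_rng(3)
print(log_z_bias(5, 20_000, rng))    # noticeably below 0
print(log_z_bias(200, 20_000, rng))  # much closer to 0
```

No fixed inner budget removes this bias; it must be driven upwards alongside the number of outermost samples.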

## 6 Online Nested Monte Carlo

NMC will be highly inconvenient to actually implement in a PPS whenever one desires to provide online estimates; for example, a lazy sequence of samples that converges to the target distribution. Suppose that we have already calculated an NMC estimate, but now desire to refine it further. In general, this will require an increase to all of the $N_k$, $k \ge 1$, for each sample of the outermost estimator. Consequently, the previous samples of the outermost query must be revisited to refine their estimates. This significantly complicates practical implementation, necessitating additional communication between queries, introducing computational overhead, and potentially substantially increasing the memory requirements.

To highlight these shortfalls concretely, consider the nested inference class of problems and, in particular, constructing the un–Rao–Blackwellized estimator (11) in an online fashion. Increasing $M$ requires the returned sample to be redrawn for each $(x_n, y_n)$, which in turn necessitates storage of previous samples and weights.² This leads to an overhead cost from the extra computation carried out for re-visitation and a memory overhead from having to store information about each call of the inner query.

² Note that not all previous samples and weights need storing – when making an update we can sample whether to change the returned sample or not based on the combined weights of all the old samples compared to all the new samples.

Perhaps even more problematically, the need to revisit old samples when drawing new samples can cause substantial complications for implementation. Consider implementing such an approach in Anglican. Anglican is designed to return a lazy infinite sequence of samples converging to the target distribution. Once samples are taken from this sequence, they become external to Anglican and cannot be posthumously updated when further samples are requested. Even when all the output samples remain internal, revisiting samples remains difficult: one either needs to implement some form of memory for nested queries so they can be run further, or, if all information is instead stored at the outermost level, additional non-trivial code is necessary to apply post-processing and to revisit queries with previously tested inputs. The latter of these is likely to necessitate inference–algorithm–specific changes, particularly when there are multiple levels of nesting, thereby hampering the entire language construction.

To alleviate these issues, we propose to only increase the computational
budget of *new* calls to nested queries, such that earlier calls use fewer samples
than later calls. This simple adjustment removes the need for
communication between different calls and requires only the
storage of the number of times the outermost query has previously been sampled
to make updates to the overall estimate.
We refer to this approach as *online* NMC (ONMC),
which, to the best of our knowledge, has not been previously considered in the literature.
As we now show, ONMC only leads to small
changes in the convergence rate of the resultant estimator compared
to NMC: using recommended parameter settings, the asymptotic root mean squared error
for ONMC is never more than twice that of NMC
for a matched sample budget and can even be smaller.

Let $\tau_k : \mathbb{Z}^+ \to \mathbb{Z}^+$, $k \in \{1,\dots,D\}$, be monotonically increasing functions dictating the number of samples used by ONMC at depth $k$ for the $n$-th iteration of the outermost estimator. The ONMC estimator is defined as

$$J_0 = \frac{1}{N_0} \sum_{n=1}^{N_0} f_0\left(y_n^{(0)},\, J_1^{(n)}\big(y_n^{(0)}\big)\right), \tag{16}$$

where $J_1^{(n)}\big(y_n^{(0)}\big)$ is calculated using (2), setting $N_k = \tau_k(n)$ for each $k \in \{1,\dots,D\}$. For reference, the NMC estimator, $I_0$, is as per (16), except for replacing each $\tau_k(n)$ with a fixed $N_k$. Algorithmically, the ONMC approach is defined as follows.
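As a concrete illustration, here is a minimal Python sketch of the one-level ($D = 1$) case; the function names (`f_outer`, `f_inner`, `sample_outer`, `sample_inner`) and the schedule `tau` are illustrative assumptions rather than notation from the main text:

```python
import math
import random

def onmc(f_outer, f_inner, sample_outer, sample_inner, n_outer, tau):
    """One-level ONMC sketch: estimate E_y[ f_outer(y, E_z[f_inner(y, z)]) ].

    The n-th outermost iteration estimates the inner expectation with
    tau(n) fresh samples; earlier iterations use smaller budgets and are
    never revisited, so no communication between calls is required."""
    total = 0.0
    for n in range(1, n_outer + 1):
        y = sample_outer()
        m = tau(n)  # inner budget grows with the outer iteration index
        inner = sum(f_inner(y, sample_inner(y)) for _ in range(m)) / m
        total += f_outer(y, inner)
    return total / n_outer

# Toy check: y ~ N(0,1), z ~ N(y,1), f_inner(y,z) = z, f_outer(y,I) = I**2,
# so the target is E[y**2] = 1; the inner-sample noise biases the estimate
# upward by roughly the average of 1/tau(n), which vanishes as n_outer grows.
random.seed(0)
est = onmc(lambda y, i: i * i,
           lambda y, z: z,
           lambda: random.gauss(0.0, 1.0),
           lambda y: random.gauss(y, 1.0),
           n_outer=4000,
           tau=lambda n: max(5, int(math.sqrt(n))))
```

Because the budget of each inner call depends only on the index of the outer iteration that spawned it, no previously computed inner estimate ever needs to be revisited or stored.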

We see that ONMC uses fewer samples at inner layers for earlier samples of the outermost level, and that each of the resulting inner estimates is calculated as per an NMC estimator with a reduced sample budget. We now show the consistency of the ONMC estimator.

###### Theorem 2.

If each $\tau_k(n) \to \infty$ as $n \to \infty$ and each $f_k$ is continuously differentiable, then the mean squared error of $J_0$ as an estimator for $\gamma_0$ converges to zero as $N_0 \to \infty$.

In other words, ONMC converges for any realistic choice of $\tau_k$ provided $\tau_k(n) \to \infty$: the requirements on $\tau_k$ are, for example, much weaker than requiring a logarithmic or faster rate of growth, which would itself already be an impractically slow rate of increase.

In the case where each $\tau_k$ increases at a polynomial rate, we can further quantify the rate of convergence, along with the relative variance and bias compared to NMC:

###### Theorem 3.

If each $\tau_k(n) \ge A n^{\alpha}$ for some constants $A, \alpha > 0$ and each $f_k$ is continuously differentiable, then

$$\mathbb{E}\left[(J_0 - \gamma_0)^2\right] \le \frac{C_0}{N_0} + \left(\frac{1}{1-\alpha}\sum_{k=1}^{D} \frac{C_k}{A N_0^{\alpha}}\right)^2 + o\left(N_0^{-\min(1,2\alpha)}\right), \qquad \alpha < 1, \tag{17}$$

$$\mathbb{E}\left[(J_0 - \gamma_0)^2\right] \le \frac{C_0}{N_0} + \left(\big(\log N_0 + \gamma\big)\sum_{k=1}^{D} \frac{C_k}{A N_0}\right)^2 + o\left(N_0^{-1}\right), \qquad \alpha = 1, \tag{18}$$

$$\mathbb{E}\left[(J_0 - \gamma_0)^2\right] \le \frac{C_0}{N_0} + \left(\zeta(\alpha)\sum_{k=1}^{D} \frac{C_k}{A N_0}\right)^2 + o\left(N_0^{-1}\right), \qquad \alpha > 1, \tag{19}$$

where $\gamma \approx 0.5772$ is the Euler–Mascheroni constant; $\zeta(\cdot)$ is the Riemann zeta function; and $C_0$ and the $C_k$ are constants defined as per the corresponding NMC bound given in (3).

###### Corollary 1.

Let $J_0$ be an ONMC estimator set up as per Theorem 3 with $N_0$ outermost samples and let $I_0$ be an NMC estimator with a matched overall sample budget. Defining $\delta = (1+\alpha D)^{-1/(1+\alpha D)}$, then

$$\mathrm{Var}\left[J_0\right] / \mathrm{Var}\left[I_0\right] \to \delta \quad \text{as} \quad N_0 \to \infty.$$

Further, if the NMC bias decreases at a rate proportional to that implied by the bound given in (3), namely

$$\left|\mathbb{E}\left[I_0\right] - \gamma_0\right| = \beta M^{-\alpha} + o\left(M^{-\alpha}\right) \tag{20}$$

for some constant $\beta$, where $M$ is the number of outermost samples used by the NMC sampler, then

$$\left|\mathbb{E}\left[J_0\right] - \gamma_0\right| \big/ \left|\mathbb{E}\left[I_0\right] - \gamma_0\right| \to \frac{\delta^{\alpha}}{1-\alpha} \quad \text{as} \quad N_0 \to \infty.$$

We expect the assumption that the bias scales as $M^{-\alpha}$ to be satisfied in the vast majority of scenarios, but there may be edge cases, e.g. when an $f_k$ gives a constant output, for which faster rates are observed. Critically, the assumption holds for all nested inference problems because the rate given in (10) is an equality.

We see that if $\alpha < 1$, which will generally be the case in practice for sensible setups, then the convergence rates for ONMC and NMC vary only by a constant factor. Specifically, for a fixed value of $N_0$, they have the same asymptotic variance and ONMC has a factor of $1/(1-\alpha)$ higher bias. However, the cost of ONMC is (asymptotically) only $1/(1+\alpha D)$ times that of NMC, so for a fixed overall sample budget it has lower variance.

As the bound varies only in constant factors for $\alpha < 1$, the asymptotically optimal value of $\alpha$ for ONMC is the same as that for NMC, namely $\alpha = 1/2$ (Rainforth et al., 2018). For this setup, we have $1/(1-\alpha) = 2$, and the matched-budget variance ratio $(1+D/2)^{-1/(1+D/2)}$ is approximately $0.76$, $0.71$, and $0.69$ respectively for $D = 1$, $2$, and $3$. Consequently, when $\alpha = 1/2$, the fixed budget variance of ONMC is always better than NMC, while the bias is no more than $1.75$ times larger for moderate $D$ and no more than twice as large more generally.
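The constant-factor cost relationship can be checked numerically. The snippet below (an illustrative sketch, not code from the paper) compares the total number of inner samples drawn by ONMC under a polynomial schedule against an NMC run whose inner budgets match ONMC's final iteration; for a schedule growing as $n^{\alpha}$ at each of $D$ levels, the ratio tends to $1/(1+\alpha D)$:

```python
def onmc_cost(n_outer, tau, depth):
    # Total inner samples: the n-th outer iteration costs tau(n)**depth
    return sum(tau(n) ** depth for n in range(1, n_outer + 1))

def nmc_cost(n_outer, tau, depth):
    # NMC uses the final (largest) budget tau(n_outer) at every iteration
    return n_outer * tau(n_outer) ** depth

alpha, depth, n_outer = 0.5, 2, 10**6
tau = lambda n: n ** alpha
ratio = onmc_cost(n_outer, tau, depth) / nmc_cost(n_outer, tau, depth)
# sum_{n<=N} n**(alpha*depth) ~ N**(1+alpha*depth) / (1+alpha*depth),
# so the ratio approaches 1 / (1 + alpha*depth) = 0.5 in this setup
```

This makes concrete why ONMC's savings grow with both the schedule exponent and the nesting depth: the early, cheap iterations dominate less and less of the total, but never cost more than the matched NMC run.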

### 6.1 Empirical Confirmation

To test ONMC empirically, we consider the simple analytic model given in Appendix F, setting $\tau(n) = \max\left(\tau_{\min}, \lceil\sqrt{n}\rceil\right)$. The rationale for setting a minimum value $\tau_{\min}$ is to minimize the burn-in effect of ONMC – earlier samples will have larger bias than later samples and we can mitigate this by ensuring a minimum value for the inner budget. More generally, we recommend choosing $\tau_{\min}$ (in the absence of other information) based on the minimum overall budget we expect to spend. In Figure 2, we have chosen to set $\tau_{\min}$ deliberately low so as to emphasize the differences between NMC and ONMC. Given our value for $\tau_{\min}$, the ONMC approach is identical to fixing the inner budget at $\tau_{\min}$ while $\sqrt{n} \le \tau_{\min}$, but unlike a fixed inner budget, it continues to improve beyond this point because it is not limited by asymptotic bias. Instead, we see an inflection point-like behavior around the point where $\sqrt{n}$ overtakes $\tau_{\min}$, with the convergence rate recovering to effectively match that of the NMC estimator.
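Such a schedule is a one-liner; here `TAU_MIN = 25` is a purely illustrative floor, not the value used in the experiments:

```python
import math

TAU_MIN = 25  # illustrative floor; pick based on the minimum expected budget

def tau(n):
    """Inner sample budget for the n-th outermost iteration: a fixed floor
    early on (to limit burn-in bias), then sqrt growth matching the
    asymptotically optimal alpha = 1/2."""
    return max(TAU_MIN, math.isqrt(n))
```

Until the outer iteration count exceeds `TAU_MIN**2`, this behaves exactly like a fixed-budget nested estimator; after that, the budget resumes its square-root growth.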

### 6.2 USING ONMC IN PPSs

Using ONMC based estimation schemes to ensure consistent estimation for nested inference in PPSs is straightforward – the number of iterations $n$ the outermost query has been run for is stored and used to set the number of iterations used for the inner queries. In fact, even this minimal level of communication is not necessary – $n$ can be inferred from the number of times we have previously run inference on the current query, the current depth $k$, and the schedule $\tau$.
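A minimal sketch of this bookkeeping (hypothetical class and method names; Anglican itself is written in Clojure, but the idea is language-agnostic) is:

```python
import math

class OnlineNestedQuery:
    """Sketch of ONMC-style nested-query estimation: each query tracks only
    how many times it has been invoked and derives its own budget from that
    count, so no communication with the outermost query is required."""

    def __init__(self, run_once, tau=lambda n: max(5, math.isqrt(n))):
        self.run_once = run_once  # draws one (value, weight) pair from the query
        self.tau = tau
        self.calls = 0

    def estimate(self, *args):
        self.calls += 1
        m = self.tau(self.calls)  # budget inferred from the local call count
        pairs = [self.run_once(*args) for _ in range(m)]
        z = sum(w for _, w in pairs)
        # Rao-Blackwellized return: the self-normalized weighted average
        return sum(v * w for v, w in pairs) / z
```

Here `estimate` returns the Rao–Blackwellized weighted average; drawing a single sample in proportion to the weights instead would correspond to the single-sample variant.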

As with NMC, for nested inference problems ONMC can either return a single sample from each call of a nested query, or Rao–Blackwellize the drawing of this sample when possible. These produce estimators analogous to (11) and (9) respectively, except that the number of inner samples in the definition of the inner weights is now a function of $n$.

Returning to Figure 1, we see that using ONMC with nested importance sampling and only returning a single sample corrects the previous issues with how Anglican deals with nested inference, producing samples indistinguishable from the reference code.

## 7 Conclusions

We have formalized the notion of nesting probabilistic program queries and investigated
the statistical validity of different categories of nesting. We have found that current
systems tend to use methods that lead to asymptotic bias for nested inference problems, but that
they are consistent for nested conditioning. We have shown how to carry out the former
in a consistent manner
and developed a new *online* estimator that simplifies the construction of algorithms
satisfying the conditions required for convergence.

## Appendix A Proofs

See Theorem 1.

###### Proof.

Though informally the high-level result follows directly from (Rainforth et al., 2018, Theorem 3), there are three subtleties that require further attention. Firstly, unlike (Rainforth et al., 2018, Theorem 3), this result is an asymptotic equality rather than a bound – in the limit of large sample sizes it holds exactly. This more powerful result is made possible by knowing the exact form of the nonlinearity. Secondly, our overall estimator uses the ratio of two NMC estimators. Though Slutsky’s Theorem means this does not create complications in the general demonstration of convergence, additional care is required when calculating the exact rate. Finally, samples are reused in both the inner and outer estimators. This could easily be avoided by sampling an additional, independent set of samples for the outer estimator, thereby giving an estimator trivially of the form considered by (Rainforth et al., 2018, Theorem 3). However, doing so would be less efficient and is expected to have a larger variance than the estimator used.

We start by considering the partition function estimate, noting that its true value is $Z$,

(21)

(22)

where

(23)

(24)

will be used as shorthands. Further defining

(25)

(26)

and using Taylor’s Theorem to expand the estimator about the corresponding true value gives

(27)

provided each denominator term is non-zero, avoiding singularity issues. We have by assumption that the true normalization constant is strictly positive for all possible inputs, as otherwise the problem becomes ill-defined. On the other hand, if an estimated normalization term is zero, it must also be the case that the corresponding numerator term is zero. Here, by taking the convention $0/0 = 0$, we can avoid all further possible singularity issues, such that (27) always holds.

Meanwhile, the standard breakdown of the mean squared error into the variance plus the bias squared gives
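In symbols (writing $\hat{Z}$ for the estimator and $Z$ for its target, a notational assumption for this excerpt):

$$\mathbb{E}\left[\big(\hat{Z} - Z\big)^2\right] = \mathrm{Var}\left[\hat{Z}\right] + \left(\mathbb{E}\big[\hat{Z}\big] - Z\right)^2.$$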

Using (27), we see that the first term in the expansion dominates for the variance (as the higher-order terms decrease faster with the number of samples), such that the weak law of large numbers gives

Now we have

and we further see from the weak law of large numbers that the second term tends to zero as the number of samples increases, but the first term remains fixed. Thus the first term is dominant and we have

(28)

(29)

(30)

where

(31)

Switching focus to the bias, we have

For the first order term in the expansion, only the component with respect to is non-zero as, for ,

(32)

Denoting the first order term as , we thus have
