Monte Carlo (MC) methods are used throughout the quantitative sciences. For example, they have become a ubiquitous means of carrying out approximate Bayesian inference (doucet2001introduction; gilks1995markov). The convergence of MC estimation has been considered extensively in the literature (durrett2010probability). However, the implications arising from the nesting of MC schemes, where terms in the integrand depend on the result of separate, nested, MC estimators, are generally less well known. This paper examines the convergence of such nested Monte Carlo (NMC) methods.
Nested expectations occur in a wide variety of problems, from portfolio risk management (gordy2010nested) to stochastic control (belomestny2010regression). In particular, simulations of agents that reason about other agents often include nested expectations. Tackling such problems requires some form of nested estimation scheme like NMC.
A common class of nested expectations is doubly-intractable inference problems (murray2006mcmc; liang2010double), where the likelihood is only known up to a parameter-dependent normalizing constant. This can occur, for example, when nesting probabilistic programs (mantadelis2011nesting; le2016nested). Some problems are even multiply-intractable, such that they require multiple levels of nesting to encode (stuhlmuller2014reasoning). Our results can be used to show that changes are required to the approaches currently employed by probabilistic programming systems to ensure consistent estimation for such problems (rainforth2017thesis; rainforth2017nestpp).
The expected information gain used in Bayesian experimental design (chaloner1995bayesian) requires the calculation of an entropy of a marginal distribution and therefore the expectation of the logarithm of an expectation. By extension, any Kullback-Leibler divergence where one of the terms is a marginal distribution also involves a nested expectation. Hence, our results have important implications for relaxing mean-field assumptions, or using different bounds, in variational inference (hoffman2015stochastic; naesseth2017variational; maddison2017filtering) and deep generative models (burda2015importance; le2017auto).
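To make the nesting explicit, the expected information gain can be written in the standard form below. The symbols $d$ (design), $\theta$ (parameters), and $y$ (experiment outcome) are notation assumed here for illustration only and are not taken from the surrounding text.

```latex
\begin{align}
\mathrm{EIG}(d) &= \mathbb{E}_{p(\theta)\, p(y \mid \theta, d)}
  \left[ \log p(y \mid \theta, d) - \log p(y \mid d) \right], \\
p(y \mid d) &= \mathbb{E}_{p(\theta')} \left[ p(y \mid \theta', d) \right].
\end{align}
```

The marginal $p(y \mid d)$ in the second line is an expectation that appears inside a logarithm in the first, so any MC estimate of it is passed through a non-linear mapping before the outer expectation is taken.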
Certain nested estimation problems can be tackled by pseudo-marginal methods (beaumont2003estimation; andrieu2009pseudo; andrieu2010particle). These consider inference problems where the likelihood is intractable, but can be estimated unbiasedly. From a theoretical perspective, they reformulate the problem in an extended space with auxiliary variables that are used to represent the stochasticity in the likelihood computation, enabling the problem to be expressed as a single expectation.
Our work goes beyond this by considering cases in which a non-linear mapping is applied to the output of the inner expectation (e.g. the logarithm in the experimental design example), prohibiting such reformulation. We demonstrate that the construction of consistent NMC algorithms is possible, establish convergence rates, and provide empirical evidence that these rates are observed in practice. Our results show that whenever an outer estimator depends non-linearly on an inner estimator, then the number of samples used in both the inner and outer estimators must, in general, be driven to infinity for convergence. We extend our results to cases of repeated nesting and show that the optimal NMC convergence rate is $O(1/T^{\frac{2}{D+2}})$, where $T$ is the total number of samples used in the estimator and $D$ is the nesting depth (with $D=0$ being conventional MC), whereas naïve approaches only achieve a rate of $O(1/T^{\frac{1}{D+1}})$. We further lay out methods for reformulating certain classes of nested expectation problems into a single expectation, allowing usage of conventional MC estimation schemes with convergence rates superior to those of naïve NMC. Finally, we use our results to make application-specific advancements in Bayesian experimental design and variational auto-encoders.
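The need to grow the inner sample count can be seen on a toy problem. The sketch below is an illustration under assumed choices, not an experiment from this paper: the outer variable is standard normal, the inner variable is normal with mean equal to the outer variable and unit variance, and the non-linear mapping is squaring. The true value is 1, whereas the nested estimator has expectation 1 + 1/M for any number of outer samples, where M is the number of inner samples. The function name `nested_estimate` and all parameter names are hypothetical.

```python
import numpy as np


def nested_estimate(n_outer, n_inner, rng):
    """Nested MC estimate of E_y[(E_{z|y}[z])^2] for an assumed toy model.

    Assumed model: y ~ N(0, 1) and z | y ~ N(y, 1), so E[z | y] = y and the
    true value is E[y^2] = 1. Squaring is the non-linear mapping.
    """
    y = rng.standard_normal(n_outer)                            # outer samples
    z = y[:, None] + rng.standard_normal((n_outer, n_inner))    # inner samples per outer sample
    inner_mean = z.mean(axis=1)                                 # inner MC estimates of E[z | y]
    return float(np.mean(inner_mean ** 2))                      # non-linear map, then outer average


rng = np.random.default_rng(0)
for n_inner in (1, 10, 100):
    est = nested_estimate(20_000, n_inner, rng)
    # The estimator's expectation is 1 + 1/M regardless of the outer sample count.
    print(f"M={n_inner:4d}  estimate={est:.3f}  estimator mean={1 + 1 / n_inner:.3f}  truth=1.000")
```

With the inner sample count held fixed, adding outer samples only reduces variance around the biased value 1 + 1/M; the bias itself vanishes only as the inner sample count grows.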
1.1 Related Work
Though the convergence of NMC has previously received little attention within the machine learning literature, a number of special cases have been investigated in other fields, sometimes under the name of nested simulation (longstaff2001valuing; belomestny2010regression; gordy2010nested; broadie2011efficient). While most of this literature focuses on particular application-specific non-linear mappings, a convergence bound for a wider range of problems was shown by hong2009estimating and recently revisited in the context of rare-event problems by fort2016mcmc. The latter paper further considers the case where samples in the outer estimator originate from a Markov chain. Compared to this previous work, ours is the first to consider multiple levels of nesting, applies to a wider range of non-linear mappings, and provides more precise convergence rates. By introducing new results, outlining special cases, providing empirical assessment, and examining specific applications, we provide a unified investigation and practical guide to nesting MC estimators in a machine learning context. We begin to realize the potential significance of this by using our theoretical results to make advancements in a number of specific application areas.
Another body of literature related to our work is the study of the convergence of Markov chains with approximate transition kernels (rudolf2015perturbation; alquier2016noisy; medina2016stability). The analysis in this work is distinct from, but complementary to, our own, focusing on the impact of a known bias on an MCMC chain, whereas our focus is more on quantifying this bias. Also related is the study of techniques for variance reduction, such as multilevel MC (heinrich2001multilevel; giles2008multilevel), and bias reduction, such as the multi-step Richardson-Romberg method (pages2007multi; lemaire2017multilevel) and Russian roulette sampling (lyne2015russian), many of which are applicable in an NMC context and can improve performance.
2 Problem Formulation
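The key idea of MC is that the expectation of an arbitrary function $f \colon \mathcal{Y} \to \mathcal{F}$ under a probability distribution $p(y)$ for its input $y \in \mathcal{Y}$ can be approximated using:

$$I = \mathbb{E}_{y \sim p(y)}\left[f(y)\right] \tag{1}$$

$$\approx \frac{1}{N} \sum_{n=1}^{N} f(y_n) \quad \text{where} \quad y_n \overset{\text{i.i.d.}}{\sim} p(y). \tag{2}$$

In this paper, we consider the case that $f$ is itself intractable, defined only in terms of a functional mapping of an expectation. Specifically, $f(y) = g(y, \gamma(y))$, where we can evaluate $g \colon \mathcal{Y} \times \Phi \to \mathcal{F}$ exactly for a given $y$ and $\gamma(y)$, but $\gamma(y)$ is the output of the following intractable expectation of another variable $z \in \mathcal{Z}$:

$$\text{either} \quad \gamma(y) = \mathbb{E}_{z \sim p(z \mid y)}\left[\phi(y, z)\right] \tag{3a}$$

$$\text{or} \quad \gamma(y) = \mathbb{E}_{z \sim p(z)}\left[\phi(y, z)\right] \tag{3b}$$

depending on the problem, with $\phi \colon \mathcal{Y} \times \mathcal{Z} \to \Phi$. All our results apply to both cases, but we will focus on (3a) for clarity. Estimating $I$ involves computing an integral over $z$ for each value of $y$ in the outer integral. We refer to the approach of tackling both integrations using MC as nested Monte Carlo (NMC):

$$I \approx I_{N,M} = \frac{1}{N} \sum_{n=1}^{N} g\left(y_n, (\hat{\gamma}_M)_n\right) \quad \text{where} \quad y_n \sim p(y) \quad \text{and} \tag{4a}$$

$$(\hat{\gamma}_M)_n = \frac{1}{M} \sum_{m=1}^{M} \phi(y_n, z_{n,m}) \quad \text{where} \quad z_{n,m} \sim p(z \mid y_n), \tag{4b}$$

where each $z_{n,m}$ is independently sampled. In Section 3 we will build on this further by considering cases with multiple levels of nesting, where calculating $\phi(y, z)$ involves computation of an intractable (nested) expectation.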
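The nested estimator described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper: the function `nmc_estimate`, its arguments (`sample_y`, `sample_z_given_y`, `phi`, `g`), and the toy problem are all assumptions introduced here. In the toy problem the outer variable is standard normal, the inner variable is normal with mean equal to the outer variable, the inner function returns the inner variable itself, and the outer mapping is the exponential, so the target is exp(1/2).

```python
import numpy as np


def nmc_estimate(sample_y, sample_z_given_y, phi, g, n_outer, n_inner, rng):
    """Nested MC: average g(y, inner_estimate) over independent outer samples.

    For each outer sample y, the inner expectation of phi(y, z) is replaced
    by a sample mean over n_inner fresh, independent draws of z given y.
    """
    total = 0.0
    for _ in range(n_outer):
        y = sample_y(rng)                             # one outer sample
        z = sample_z_given_y(y, n_inner, rng)         # n_inner inner samples for this y
        gamma_hat = np.mean(phi(y, z), axis=0)        # inner MC estimate
        total += g(y, gamma_hat)                      # apply the (non-linear) mapping
    return total / n_outer


# Hypothetical toy problem: y ~ N(0, 1), z | y ~ N(y, 1), phi(y, z) = z,
# g(y, gamma) = exp(gamma). The target is E[exp(y)] = exp(1/2).
rng = np.random.default_rng(0)
estimate = nmc_estimate(
    sample_y=lambda r: r.standard_normal(),
    sample_z_given_y=lambda y, m, r: y + r.standard_normal(m),
    phi=lambda y, z: z,
    g=lambda y, gamma: np.exp(gamma),
    n_outer=5_000,
    n_inner=100,
    rng=rng,
)
print(f"NMC estimate: {estimate:.3f}, target exp(1/2): {np.exp(0.5):.3f}")
```

The explicit loop keeps the correspondence with the two-level sum clear; a vectorized version would be faster but ties the sketch to a particular sampler interface.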
3 Convergence of Nested Monte Carlo
We now show that approximating $I \approx I_{N,M}$ is in principle possible, at least when $f$ is well-behaved. In particular, we establish a convergence rate of the mean squared error of $I_{N,M}$ and prove a form of almost sure convergence to $I$. We