A key limitation of sampling algorithms for approximate inference is that it is difficult to quantify their approximation error. Widely used sampling schemes, such as sequential importance sampling with resampling and Metropolis-Hastings, produce output samples drawn from a distribution that may be far from the target posterior distribution. This paper shows how to upper-bound the symmetric KL divergence between the output distribution of a broad class of sequential Monte Carlo (SMC) samplers and their target posterior distributions, subject to assumptions about the accuracy of a separate gold-standard sampler. The proposed method applies to samplers that combine multiple particles, multinomial resampling, and rejuvenation kernels. The experiments show the technique being used to estimate bounds on the divergence of SMC samplers for posterior inference in a Bayesian linear regression model and a Dirichlet process mixture model.
This paper builds on a growing body of work begun by grosse2014model and grosse2015sandwiching into estimating upper bounds on KL divergences between a sampler’s output distribution and the posterior. In variational inference, the KL divergence of the variational approximation is the gap between the variational lower bound and the log-evidence. grosse2014model and grosse2015sandwiching
recognized that certain stochastic inference Markov chains including annealed importance sampling (AIS) and single-particle SMC can be treated as variational approximations over an extended space that includes auxiliary random choices of the sampler. A similar insight was introduced independently insalimans2015 . grosse2014model and grosse2015sandwiching
also showed how to estimate upper bounds on the log-evidence for datasets simulated from the model using generalizations of the harmonic mean estimator, and introduced the bidirectional Monte Carlo (BDMC) technique for ‘sandwiching’ the log-evidence between these upper bounds and variational lower bounds. A related approach for sandwiching the partition function was previously used in the statistical physics literaturehunter1993 . Finally, grosse2014model and grosse2015sandwiching recognized that the gap between the bounds serves as an upper bound on the KL divergence of the sampler, allowing BDMC to be used for measuring sampler accuracy on simulated datasets.
Two independent papers cusumano2016quantifying and grosse2016measuring built on grosse2014model and grosse2015sandwiching to develop the technique further in different ways. Our previous paper cusumano2016quantifying took a probabilistic programming perspective, and showed how to estimate the KL divergence bound described in grosse2014model and grosse2015sandwiching for general samplers using a ‘meta-inference’ sampler that generates sampler execution histories. cusumano2016quantifying also provided meta-inference samplers for sampling importance resampling (SIR) and particle filtering without MCMC rejuvenation kernels. cusumano2016quantifying also introduced an upper bound on the symmetric KL divergence between the sampler output and the posterior, analyzed optional use of approximate ‘reference’ samples as surrogates for exact posterior samples (prompting the label ‘subjective divergence’), and related the tightness of the bounds to the accuracy of the meta-inference sampler. A closely related but independent work grosse2016measuring introduced Bounding Divergences with REverse Annealing (BREAD), which uses the same upper bound on the symmetric KL divergence given in cusumano2016quantifying
, and showed how to evaluate AIS and single-particle SMC approximate inference quality using this bound. BREAD also includes a heuristic scheme, applicable to hierarchical Bayesian statistical models, for generating simulated datasets whose divergence profiles are used as proxies for divergence profiles on real-world datasets.grosse2016measuring also integrated their technique into existing probabilistic programming platforms.
The main contribution of the current work is a meta-inference construction for generic SMC samplers del2006sequential that is related to conditional SMC andrieu2010particle and generalizes the existing meta-inference constructions for AIS, single-particle SMC, SIR, and particle filtering. By handling a broad class of samplers, the construction increases relevance for real world problems. The construction allows analysis of samplers that rely on MCMC rejuvenation kernels for good inference quality, while permitting use of multiple particles (instead of custom model-specific annealing schemes) to tighten the KL divergence bounds.
2 Background on subjective divergence
We first review the subjective divergence procedure of cusumano2016quantifying . Let denote an approximate inference sampling program that samples output for . Suppose
also comes endowed with a side-procedure that evaluates the log probabilitythat the sampler produces any given output . Let denote the posterior distribution, and let denote an unnormalized posterior distribution. Suppose that we have access to samples from . Then the following is an unbiased Monte Carlo estimate of the symmetric KL divergence between and :
Unfortunately, it is often not possible to efficiently evaluate
for sampling programs that sample auxiliary random choices during their execution, including MCMC and SMC sampling algorithms for approximate Bayesian inference. We denote the joint distribution over auxiliary random choicesand output by . It is intractable to marginalize out the auxiliary random choices because there is an exponentially large number of terms in the sum
. Therefore, we instead compute the following unbiased estimate of an upper bound on the symmetric KL divergence, using a ‘meta-inference’ sampler programwhich samples execution histories of the sampler (assignments to the auxiliary variables ) given the output :
The upper bound estimated is the symmetric KL divergence on an extended space that includes the auxiliary variables of the sampler. As shown in cusumano2016quantifying , the tightness of the bound is governed by how well approximates on average for and . When samples from a gold-standard approximate inference ‘reference sampler’ are used in place of posterior samples, the validity of the bound is subject to the accuracy of the reference sampler cusumano2016quantifying .
3 A probabilistic programming interface for subjective divergence
We now clarify the procedures associated with a sampler that are needed for subjective divergence estimation. In particular, we introduce the following probabilistic programming interface, which consists of two stochastic procedures, denoted and for some distributions and :
The simulate procedure runs a sampler with joint distribution over execution histories and output , and returns . The regenerate procedure takes a potential sampler output as its input, and runs a ‘regeneration’ sampler that samples an execution history of the original sampler. Both procedures also return a log-weight. The log-weight returned by simulate can be interpreted as a log harmonic mean estimate of and the log-weight returned by regenerate can be interpreted as a log importance sampling estimate of . When the sampler is an inference sampler, we call the regeneration sampler a ‘meta-inference’ sampler. As will be seen, the relationship between the original sampler and the regeneration sampler is analogous to the relationship between SMC and conditional SMC andrieu2010particle .
Note that the auxiliary random variablesare not exposed through the interface. Also note that a sampler with a tractable marginal output probability trivially implements the interface because reduces to the log output probability when there are no auxiliary variables . Algorithm 1 shows a procedure that computes Equation (2) using the above interface.
4 Implementing simulate and regenerate for sequential Monte Carlo
Algorithm 2 below shows how to implement simulate and regenerate for the generic SMC sampler template introduced in del2006sequential , with independent resampling. The SMC sampler template (the simulate procedure of Algorithm 2), permits use of MCMC kernels (within the ), provided that corresponding ‘backward kernels’ are defined such that the weights can be computed. Note that simulate does not sample from the backward kernels. Building on the analysis of SMC used in andrieu2010particle , the auxiliary variables for the SMC sampler are the random choices made during its execution: the resampling choices for and and the values of all intermediate particles for . The output of the SMC sampler is denoted . The SMC stochastic regeneration template (the regenerate procedure of Algorithm 2), is given an output , and samples an execution history of the SMC sampler by first choosing the ancestral particle indices that led to the output (denoted for ), then sampling from the backward kernels in reverse order to define the ancestral particle values for that led to the output, and finally running SMC forward, with the ancestral indices and values for fixed. This is related to the conditional SMC update of andrieu2010particle , but differs in that only an output particle and not a full particle trajectory is required as input. The log-weight for this sampler and regeneration pair simplify to (see Appendix A for derivation):
Having specified how to implement simulate and regenerate for this generic variant of SMC, we can now estimate subjective divergences for SMC. We illustrate the use of Algorithm 1 and Algorithm 2 to estimate subjective bounds on symmetric KL divergences of SMC samplers and black box variational approximations to the posterior in Figure 1. Note that we optimized the performance of variational inference and SMC implementations separately, and the relative runtimes of the two approaches are not meant to be informative.
This research was supported by DARPA (PPAML program, contract number FA8750-14-2-0004), IARPA (under research contract 2015-15061000003), the Office of Naval Research (under research contract N000141310333), the Army Research Office (under agreement number W911NF-13-1-0212), and gifts from Analog Devices and Google. This research was conducted with Government support under and awarded by DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.
-  Roger Baker Grosse. Model selection in compositional spaces. PhD thesis, Massachusetts Institute of Technology, 2014.
-  Roger B Grosse, Zoubin Ghahramani, and Ryan P Adams. Sandwiching the marginal likelihood using bidirectional Monte Carlo. arXiv preprint arXiv:1511.02543, 2015.
Tim Salimans, Diederik P. Kingma, and Max Welling.
Markov chain monte carlo and variational inference: Bridging the gap.
Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1218–1226, 2015.
-  John E Hunter III, William P Reinhardt, and Thomas F Davis. A finite-time variational method for determining optimal paths and obtaining bounds on free energy changes from computer simulations. The Journal of chemical physics, 99(9):6856–6864, 1993.
-  Marco F Cusumano-Towner and Vikash K Mansinghka. Quantifying the probable approximation error of probabilistic inference programs. arXiv preprint arXiv:1606.00068, 2016.
-  Roger B Grosse, Siddharth Ancha, and Daniel M Roy. Measuring the reliability of MCMC inference with bidirectional Monte Carlo. arXiv preprint arXiv:1606.02275, 2016.
-  Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
-  Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle markov chain monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
Appendix A: Derivation of log weight for SMC
Recall that the set of auxiliary random choices for the SMC sampler of Algorithm 2 is the set of all resampling choices for and and the values of all intermediate particles for . The joint probability over auxiliary random choices and output for an execution of SMC’s simulate is:
The joint probability over auxiliary random choices for an execution of SMC’s regenerate is:
where the first factor is due to rand-ancestry. First, note that for all for and for . This is true for by the requirements and for all . For , . Either or . Using the requirements for all for all and for all for all , gives for all .
To see that , first consider some such that . Since we have for . We also have which implies for , and , which implies . Combined with for all these ensure is defined for output and .
Next, assume is defined for output and . Then we have which implies . We also have for , which implies for . We also have for . Therefore .
The weight is then defined for all and , and is:
Appendix B: SMC with sequential observation and detailed balance kernels
In the experiments, we use SMC programs defined as follows. Let be a hypothesis space, corresponding to ‘global’ latent variables. Let for be additional hypothesis space extensions, corresponding for ‘local’ latent variables for each of observations for . Define for . We use indexing notation where for any . Let denote the model’s joint probability of , where and for . Assume for the given observation set that for all . The target distribution of the SMC algorithm is the conditional distribution . Define the intermediate target distributions by , and the unnormalized target probability functions by for . Define the initialization kernel as: . This kernel samples from the model’s prior distribution over the global latents and the local latents for the first observation. Suppose there exist ‘detailed balance kernels’ for . Kernel is a collection of distributions over elements of , indexed by elements of . Each detailed balance kernel must satisfy the detailed balance property with respect to the intermediate target distribution :
Define for . Each for is a collection of distributions over , indexed by elements of . It is possible to sample from by sampling from the detailed balance kernel and then sampling from the model prior distribution over the new local latents . Intuitively, the kernel first performs inference targeting , then extends the hypothesis space to include values of the local latent variables for observation by sampling from the prior. Define . Intuitively, kernel performs inference targeting the final target distribution . Define the ‘backward kernels’ by for . Each kernel for is a collection of distributions over , indexed by elements of . To sample from , we simply sample from the detailed balance kernel . Finally, define . First, we show that for all . This follows from the detailed balance requirement (Equation 5) and from the fact that for all and for all . The same argument applies to and . Note that we do not require the detailed balance kernels to be ergodic. For example, a given kernel may only update one of the components of . Given these definitions, the weight functions become: