Monte Carlo approaches to approximate inference include importance sampling, Markov chain Monte Carlo, and sequential Monte Carlo del2006sequential . It is easy to construct Monte Carlo inference algorithms with asymptotic guarantees, but constructing Monte Carlo inference algorithms that are accurate in practice often requires knowledge of the posterior distribution. Recent work in amortized inference stuhlmuller2013learning ; paige2016inference ; ritchie2016neurally ; le2016inference aims to acquire this knowledge by training neural networks on offline problem instances and generalizing to problem instances encountered at run time. We propose to use probabilistic programming languages as a medium for users to encode their domain-specific knowledge about the posterior, and to learn the parameters of the program offline using an amortized inference approach to fill in gaps in this knowledge. In our setting, the model in which inference is being performed may or may not itself be represented by a probabilistic program.
This paper makes three contributions: First, we introduce proposal programs
, which are probabilistic programs that represent proposal distributions. Proposal programs can use internal random choices, and report estimates of their proposal probabilities instead of true proposal probabilities. We show that proposal programs can be used in place of regular proposal distributions within importance sampling and Metropolis-Hastings algorithms without sacrificing the asymptotic consistency of these algorithms. Second, we give a stochastic gradient descent algorithm for offline learning of proposal program parameters. Third, we propose a class of proposal programs that combine domain-specific heuristic randomized algorithms, which identify high-probability regions of the target distribution, with neural networks that model the uncertainty in the target distribution around the outputs of the heuristic. We also discuss how to implement proposal programs in a sampling-based probabilistic programming runtime, and give example proposal programs written in Gen.jl cusumano2017gen , a probabilistic language embedded in Julia bezanson2017julia .
Consider a target distribution π(x; y) on latent variable(s) x, and let π̃(x; y) = Z(y) π(x; y) be the unnormalized target distribution, where Z(y) is a normalizing constant. Here y denotes the problem instance—this may include the observed data, or any other context relevant to defining the inference problem. In importance sampling, we sample many values from a proposal distribution q(x; y) and evaluate the importance weight π̃(x; y)/q(x; y) for each sample, producing a weighted collection of samples that can be used to estimate expectations under π. In Markov chain Monte Carlo, we use one or more proposal distribution(s) to construct a Markov chain that converges asymptotically to π, and use iterates of one or more such chains to estimate expectations under π. Proposal distributions may be combined in cycles or mixtures tierney1994markov , and each proposal distribution may update only a subset of the variables in x, leaving the remaining variables fixed.
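As a concrete reference point, the self-normalized importance sampling scheme described above can be sketched in a few lines (in Python rather than the paper's Julia, with an illustrative discrete target and uniform proposal; all names here are hypothetical):

```python
import math
import random

def target_unnorm(x):
    # Unnormalized target density over {0,...,9}: a bimodal-ish shape.
    return math.exp(-abs(x - 3)) + 0.5 * math.exp(-abs(x - 7))

def proposal_sample(rng):
    return rng.randrange(10)       # uniform proposal over {0,...,9}

def proposal_prob(x):
    return 0.1                     # exact proposal probability

def importance_sample(n, rng):
    # Draw n proposals and compute self-normalized importance weights.
    xs = [proposal_sample(rng) for _ in range(n)]
    ws = [target_unnorm(x) / proposal_prob(x) for x in xs]
    total = sum(ws)
    return xs, [w / total for w in ws]

def estimate_mean(n, rng):
    # Estimate the posterior mean as a weighted average.
    xs, ws = importance_sample(n, rng)
    return sum(w * x for w, x in zip(xs, ws))
```

With a proposal program, the exact proposal probability in the weight would be replaced by the estimate returned by simulate, as described in Section 3.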
3 Monte Carlo inference with proposal programs
A proposal program p is a probabilistic program that takes input y and makes random choices, where each random choice is given an address a ∈ A for some countable set A, and where some subset B ⊆ A of the addresses are the output choices. Random choices that are not output choices are called internal choices. We assume that all output choices are realized in all possible executions of the program. We assume that all random choices are discrete—a rigorous treatment of continuous random choices is left for future work. A trace t of a probabilistic program is a mapping from the address of each random choice realized in an execution to its value t(a) ∈ V, where V is the set of all possible values for a random choice. A mapping u from output addresses to values is called an output trace. Let t|_B denote the output trace constructed by restricting a trace t to the output addresses B. Write t ∼ u if t(b) = u(b) for all b ∈ B (i.e. if the trace agrees with the output trace on the values of the output choices). Let p(t(a) | y, t_<a) denote the probability of the choice with address a taking value t(a) given the state of the program at that point in its execution, as determined by the values t_<a of the previous random choices. The probability of a trace t on input y is defined as:

p(t; y) = ∏_a p(t(a) | y, t_<a)

where the product ranges over the addresses realized in t.
We assume that it is possible to execute the program p, possibly with some output random choices fixed to given values, and to compute the probabilities p(t(b) | y, t_<b) for output random choices b ∈ B. Let t ∼ p(·; y) denote a trace generated by executing the program on input y, and let t ∼ p(·; y, u) denote a trace generated by executing the program on input y, but where output choices are fixed to the values given in an output trace u instead of being sampled. Note that internal random choices can occur before or after output choices in program execution, and internal random choices may depend on the values of output random choices. We require that any trace t that has nonzero probability under constrained execution (t ∼ p(·; y, u)) satisfies p(t; y) > 0. That is, any trace that can be generated by executing the program with output choices fixed to u must also be possible under standard program execution. Let the product of probabilities of each output choice in a trace given the previous choices be denoted p_B(t; y), and similarly p_I(t; y) for the internal choices, so that p(t; y) = p_I(t; y) p_B(t; y). The marginal probability of an output trace u is called the proposal probability, and is given by:

q(u; y) = ∑_{t : t|_B = u} p(t; y) = ∑_{t : t|_B = u} p_I(t; y) p_B(t; y)
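The marginalization defining the proposal probability can be made concrete with a tiny program whose internal choice can be enumerated exactly (a Python sketch with illustrative probabilities; in realistic proposal programs this sum is intractable):

```python
# Toy proposal program with one internal Bernoulli choice z and one output
# Bernoulli choice out whose bias depends on z. All specifics are illustrative.

P_Z = {0: 0.7, 1: 0.3}   # distribution of the internal choice z

def out_bias(y, z):
    # P(out = 1 | y, z): the output bias depends on the internal choice and y.
    return 0.9 if z == y else 0.2

def trace_prob(y, z, out):
    # p(t; y) = p_I(t; y) * p_B(t; y) for this two-choice program.
    p1 = out_bias(y, z)
    return P_Z[z] * (p1 if out == 1 else 1.0 - p1)

def proposal_prob(y, out):
    # q(u; y): marginalize the internal choice by exact enumeration.
    return sum(trace_prob(y, z, out) for z in P_Z)
```

For this program the sum has only two terms; with many internal choices the number of terms grows exponentially, which motivates the estimates of Section 3.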
| A | All addresses of possible random choices |
| dom(t) | Addresses of realized random choices in an execution |
| V | Possible values for random choices |
| t | Trace (record of random choices in an execution) |
| B | Output random choices |
| y | Inputs to program |
| t\|_B | Output trace generated by restricting a trace to outputs |
| t ∼ p(·; y) | Sampling a trace by executing program on input |
| t ∼ p(·; y, u) | Sampling a trace with constrained output trace |
In order to use the marginal distribution on outputs of a proposal program as a proposal distribution within a standard importance sampling (IS) algorithm for a target distribution π, we require the ability to sample from the proposal distribution and to evaluate the proposal probability q(u; y). The proposal probability is needed to compute the importance weight π̃(u; y)/q(u; y). To use the proposal program inside standard Metropolis-Hastings (MH), we require the ability to sample from the proposal and evaluate the forward proposal probability q(u′; y) for the resulting sample (where the input y may contain the previous iterate), as well as the ability to compute the reverse proposal probability q(u; y′) (the probability that the proposal generates the reverse move, where the input y′ may contain the proposed iterate). The forward and reverse proposal probabilities are needed to compute the acceptance probability:

α = min(1, (π̃(u′) q(u; y′)) / (π̃(u) q(u′; y)))
Figure 1(a) shows an interface for proposal programs that, together with the ability to evaluate the unnormalized target density, is sufficient for implementing IS and MH. The simulate procedure executes the proposal program, and returns the output trace u and the proposal probability q(u; y). The assess procedure takes an output trace u and returns q(u; y). However, evaluating q(u; y) is intractable in the general case, because Equation (2) contains a number of terms that is exponential in the number of internal random choices. Therefore, we propose an approximate implementation of the interface, shown in Figure 1(b). The approximate simulate procedure executes the proposal program, and returns the output trace u and a specific estimate of the proposal probability q(u; y). The approximate assess procedure takes an output trace u and returns a specific estimate of q(u; y). Algorithm 1 shows an IS algorithm that uses the approximate simulate procedure of Figure 1(b), and Algorithm 2 shows an MH transition operator that uses the approximate simulate and assess procedures. The remainder of the section shows that substituting the approximate procedures of Figure 1(b) for the idealized procedures of Figure 1(a) within IS and MH preserves the asymptotic consistency of both algorithms.
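A minimal sketch of the approximate simulate and assess procedures, for a toy program with one internal and one output choice (Python, illustrative probabilities; K replicate executions average the output-choice probability p_B over internal traces):

```python
import random

P_Z = {0: 0.7, 1: 0.3}   # internal Bernoulli choice z

def out_bias(y, z):
    return 0.9 if z == y else 0.2

def run(y, rng, fixed_out=None):
    # Execute the toy program; if fixed_out is given, constrain the output.
    z = 0 if rng.random() < P_Z[0] else 1
    p1 = out_bias(y, z)
    if fixed_out is None:
        out = 1 if rng.random() < p1 else 0
    else:
        out = fixed_out
    p_B = p1 if out == 1 else 1.0 - p1   # probability of the output choice
    return out, p_B

def simulate(y, k, rng):
    # One unconstrained execution yields the output trace; k - 1 constrained
    # executions contribute to the proposal probability estimate.
    out, p_B = run(y, rng)
    estimates = [p_B] + [run(y, rng, fixed_out=out)[1] for _ in range(k - 1)]
    return out, sum(estimates) / k

def assess(y, out, k, rng):
    # k constrained executions give an unbiased estimate of q(u; y).
    estimates = [run(y, rng, fixed_out=out)[1] for _ in range(k)]
    return sum(estimates) / k
```

For this toy program, averaging p_B over internal traces sampled by constrained execution converges to the exact marginal q(u; y) as K grows.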
See Appendix A.1 for the proof, which is based on an auxiliary variable construction. Algorithm 2 shows a Metropolis-Hastings transition operator that uses the simulate and assess procedures of Figure 1(b).
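The MH transition operator can be sketched as follows (Python; here the proposal is a simple independence proposal whose probability is known exactly, standing in for a proposal program's simulate and assess):

```python
import math
import random

def target_unnorm(x):
    # Illustrative unnormalized target over {0,...,9}.
    return math.exp(-abs(x - 3)) + 0.5 * math.exp(-abs(x - 7))

def simulate(rng):
    # Independence proposal over {0,...,9}; returns a sample and its (here
    # exact) proposal probability. A proposal program would return an estimate.
    x = rng.randrange(10)
    return x, 0.1

def assess(x):
    return 0.1   # reverse proposal probability (exact for this toy proposal)

def mh_step(x, rng):
    # One MH transition using forward/reverse proposal probabilities.
    x_new, q_fwd = simulate(rng)
    q_rev = assess(x)
    alpha = (target_unnorm(x_new) * q_rev) / (target_unnorm(x) * q_fwd)
    if rng.random() < min(1.0, alpha):
        return x_new
    return x
```

When the exact probabilities are replaced by the estimates of Figure 1(b), the stationary distribution is preserved by the auxiliary-variable argument of Appendix A.2.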
See Appendix A.2 for the proof. Note that this transition operator may be composed with other transition operators to construct an ergodic Markov chain. The non-asymptotic accuracies of Algorithm 1 and Algorithm 2 depend on the choice of K, the number of executions of the program used within each invocation of simulate or assess. As K → ∞, the algorithms behave like the corresponding IS and MH algorithms using the exact proposal probabilities. For finite K, the deviation of these algorithms from that limit depends on the details of the proposal program. If the output choices are independent of the internal choices, or if the internal choices are deterministic, then K = 1 already provides exact proposal probabilities. We expect that small K may be sufficient for proposal programs that have low mutual information between the internal trace and the output trace. A detailed analysis of how performance depends on K is left for future work.
The procedures in Figure 1(b) can be easily implemented in a sampling-based probabilistic programming runtime that records a probability for each random choice in a trace or ‘random database’ wingate2011lightweight . We implemented such a runtime for Gen.jl cusumano2017gen , a probabilistic programming language embedded in Julia bezanson2017julia . In Gen.jl, a probabilistic program is a Julia function that has been annotated with the @probabilistic keyword, and in which some random choices have been annotated with unique addresses using the @choice keyword.
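A minimal random-database runtime of this kind might look as follows (a Python sketch, not Gen.jl's actual API; the Trace object records each addressed choice and its probability, and supports constrained execution):

```python
import random

class Trace:
    # Records each addressed random choice and the probability of the value
    # that was recorded ("random database" style).
    def __init__(self, rng, constraints=None):
        self.rng = rng
        self.constraints = constraints or {}
        self.choices = {}   # address -> value
        self.probs = {}     # address -> probability of the recorded value

    def bernoulli(self, address, p):
        if address in self.constraints:
            value = self.constraints[address]   # constrained execution
        else:
            value = self.rng.random() < p       # standard execution
        self.choices[address] = value
        self.probs[address] = p if value else 1.0 - p
        return value

def program(trace, y):
    # Illustrative two-choice program: an internal choice and an output choice.
    z = trace.bernoulli("z", 0.7)
    out = trace.bernoulli("out", 0.9 if z else 0.2)
    return out
```

The products p_B and p_I of Section 3 are then computed by multiplying the recorded probabilities at output and internal addresses, respectively.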
4 Learning proposal program parameters
This section describes a technique to learn the parameters θ of a proposal program during an offline 'inference compilation' phase, prior to its use in importance sampling or Metropolis-Hastings. Let π(·; y) denote a family of target distributions indexed by the problem instance y. We consider a proposal program parameterized by θ, with distribution on traces denoted p(t; y, θ). We assume that it is possible to sample from a training distribution on the arguments of the proposal program and its output trace. We factor the training distribution into a distribution on arguments and a desired distribution on outputs given arguments. If the model in which we are doing inference is a generative model defining a joint distribution on latent variables and data (observations), then we can take the argument distribution to be the marginal distribution on observations, and the desired output distribution to be the posterior. For this choice, sampling from the training distribution is achieved by ancestral sampling of the latents and observations from the generative model. A different training distribution, based on gold-standard SMC sampling instead of joint generative simulation, was previously used in ritchie2016neurally . Given the training distribution, we seek parameters θ that solve the following optimization problem, where KL denotes the Kullback-Leibler (KL) divergence:
This is equivalent to maximizing the expected conditional log likelihood of the training data:
Because the proposal program may use internal random choices, it may not be possible to evaluate the proposal probability q(u; y, θ) efficiently, making direct optimization of this objective difficult. Therefore, we instead maximize a lower bound on it:
where the bound replaces log q(u; y, θ) with the expected logarithm of the estimate q̂ = (1/K) ∑_{i=1}^K p_B(t_i; y, θ), computed from K traces t_1, …, t_K sampled by executing the program with output choices fixed to u. We maximize the lower bound using stochastic gradient ascent.
We then use the following unbiased estimator of the gradient, based on the 'per-sample' baseline of mnih2016variational :
The first term accounts for the effect of θ on the internal random choices, and the second term accounts for the direct effect of θ on the output random choices. The resulting algorithm is shown below:
Algorithm 3 can be easily implemented on top of a sampling-based probabilistic programming runtime that includes reverse-mode automatic differentiation. The runtime must provide (1) the log probability of the output random choices and its gradient with respect to θ, and (2) the gradient with respect to θ of the log probability of the internal random choices. The gradient appearing in the estimator can be computed from the collection of per-choice gradients for each trace.
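The estimator's inner loop can be sketched for a one-parameter toy proposal (Python; the leave-one-out per-sample baseline here is one simple variant of the baseline in mnih2016variational, and the direct output-choice term vanishes because θ only affects the internal choice):

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def grad_estimate(theta, u, k, rng):
    # Toy proposal program: internal z ~ Bernoulli(sigmoid(theta)); the fixed
    # output u has probability p_B depending on z. All specifics illustrative.
    pz = sigmoid(theta)
    zs = [rng.random() < pz for _ in range(k)]
    # p_B(t_i): probability of the fixed output u under each internal trace.
    p_b = [(0.9 if z else 0.2) if u else (0.1 if z else 0.8) for z in zs]
    f = math.log(sum(p_b) / k)   # log q-hat, the per-example objective value
    grad = 0.0
    for i in range(k):
        # Per-sample baseline: log of the mean over the other replicates.
        others = [p_b[j] for j in range(k) if j != i]
        baseline = math.log(sum(others) / (k - 1))
        # Score function d/dtheta log P(z_i; theta) for the Bernoulli choice.
        score = (1.0 - pz) if zs[i] else -pz
        grad += (f - baseline) * score
    return grad
```

The baseline does not change the estimator's expectation (it is independent of z_i and the score has zero mean), but it can substantially reduce variance.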
Note that the log probability of the internal random choices is not itself required (only its gradient is). Recall that the log probability of the internal random choices is the sum of the log probabilities of each internal random choice in the internal trace. For internal random choices whose distributions do not depend on θ, the log probability contribution has zero gradient with respect to θ. Therefore, it suffices for the runtime to instrument only the internal random choices whose distributions depend on θ. This permits the use of black-box code within the proposal program, which can reduce the overhead introduced by the probabilistic runtime in performance-critical sections of the proposal, and is important for our proposed methodology for using proposal programs in Section 5. We implemented Algorithm 3 on top of the Gen.jl runtime cusumano2017gen . In Gen.jl, the user annotates random choices in a probabilistic program with a unique identifier. The user is free to use randomness in a probabilistic program without annotating it. In our implementation of Algorithm 3, we only accumulate gradients for random choices that were annotated, relying on the user to ensure that the distributions of un-annotated random choices do not depend on θ. In future work, automatic dependency analysis based on the approach of schulman2015gradient could remove this reliance on the user.
5 Application: Proposals based on randomized mode-finding algorithms
Proposal programs permit the inference programmer to express their knowledge about the target distribution in the form of a sampling program. Consider the case when the user possesses a heuristic randomized algorithm that finds areas of high probability of the target distribution. Although such an algorithm contains some knowledge about the target distribution, the algorithm in isolation is not useful for performing sound probabilistic inference—it does not necessarily attempt to represent the variability within the mode(s) of the target distribution, and its support may not even include the posterior support. The heuristic may also be brittle—it may have parameters that need to be tuned for it to work on a particular problem instance, or it may fail completely on some problem instances.
Because Algorithm 3 minimizes the KL divergence from the posterior (or its proxy) to the proposal distribution, optimization will tend to force the proposal to have sufficient variability to cover the posterior mass, which provides a degree of robustness even when the heuristic is brittle. On problem instances where the heuristic gives wrong answers, the proposal variability should increase. On problem instances where the heuristic produces values near the posterior modes, the proposal variability should attempt to match the variability of each posterior mode.
To construct robust and sound probabilistic inference algorithms that can take advantage of heuristic randomized algorithms, we construct proposal programs that use the heuristic algorithm as a subroutine, and use the proposal program in importance sampling (Algorithm 1) or MCMC (Algorithm 2). Let y denote the problem instance (e.g. the observed data or context variables), and consider the parameters, internal random choices, and output of the heuristic algorithm. To construct the proposal program from the heuristic algorithm, we:
1. Make the parameters that are difficult to set manually into random choices whose distributions may depend on y (other parameters can be hardcoded or can have a hardcoded dependency on y).
2. Add random choices that are equal to the output of the heuristic algorithm plus some amount of noise, such that their support, given any heuristic parameters and output, includes the support of the posterior. These choices are the output choices of the proposal program.
3. Parametrize the distribution of the heuristic's parameters given y, and the distribution of the output choices given the heuristic's result and y, using the output of neural network(s) with parameters θ. The networks take y as input and generate parameter values for the heuristic algorithm, as well as the amount of noise to add to the output of the algorithm.
Figure 3 shows the high-level data flow of the resulting proposal program. We optimize the parameters θ during an offline inference compilation phase using Algorithm 3. Note that the distributions of the internal random choices of the heuristic algorithm do not depend on θ. Therefore, to use Algorithm 3, the heuristic algorithm need not be instrumented by the probabilistic runtime, and can be fast black-box code.
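The resulting data flow can be sketched as follows (Python; the network, the heuristic, and the noise model are all illustrative stand-ins, not learned components):

```python
import math
import random

def network(y, theta):
    # Stand-in for learned networks: maps the problem instance to heuristic
    # parameters and a noise scale. Here it just passes theta through.
    return {"eps": theta["eps"], "noise": theta["noise"]}

def heuristic(y, eps, rng):
    # Black-box randomized mode-finder (here: a noisy average as a stand-in).
    return sum(y) / len(y) + eps * (rng.random() - 0.5)

def proposal(y, theta, rng):
    params = network(y, theta)                  # step 3: networks set params
    guess = heuristic(y, params["eps"], rng)    # step 1: internal randomness
    # Step 2: output choice = heuristic output plus heavy-tailed (Cauchy)
    # noise, giving the proposal full support on the real line.
    out = guess + params["noise"] * math.tan(math.pi * (rng.random() - 0.5))
    return out
```

Only the distributions parameterized by θ (here, the heuristic parameters and the noise scale) need to be instrumented by the runtime; the heuristic itself runs as plain code.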
6 Example: Proposal program combining RANSAC with a neural network
We illustrate Algorithm 1 and the approach of Section 5 on a Bayesian linear regression with outliers inference problem. The data set is a set of x-y pairs; the latent variables consist of the parameters of the line (slope and intercept) and binary variables indicating whether each data point is an outlier. The generative model, which is conditioned on the x-coordinates, is defined below:
The unnormalized target distribution is the joint probability of the latent variables and the observed y-coordinates.
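A hedged sketch of such a generative model (Python; the specific priors, outlier probability, and noise scales are illustrative, not the paper's exact choices):

```python
import random

def generate(xs, rng):
    # Bayesian linear regression with outliers, conditioned on x-coordinates.
    slope = rng.gauss(0.0, 1.0)         # prior on slope (illustrative)
    intercept = rng.gauss(0.0, 2.0)     # prior on intercept (illustrative)
    outlier_prob = 0.1
    ys, outliers = [], []
    for x in xs:
        is_outlier = rng.random() < outlier_prob
        if is_outlier:
            y = rng.gauss(0.0, 10.0)    # broad distribution for outliers
        else:
            y = rng.gauss(slope * x + intercept, 0.5)   # inlier noise
        ys.append(y)
        outliers.append(is_outlier)
    return slope, intercept, outliers, ys
```

Ancestral sampling from this model provides the training distribution used to optimize the proposal programs below.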
Random sample consensus (RANSAC, fischler1981random ) is an iterative algorithm that can quickly find lines near the modes of the posterior on this problem, using a deterministic least-squares fit within each iteration. Figure 4 shows a Julia implementation of a RANSAC algorithm for our linear regression with outliers problem. However, the output distribution of the algorithm is atomic and does not sufficiently support the posterior distribution on lines, which is defined on the continuum. Also, the algorithm contains a parameter epsilon that needs to be roughly calibrated to the inlier variability of a particular data set, and it is not clear how many iterations of the algorithm to run.
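For readers unfamiliar with RANSAC, a minimal version of the algorithm in Python (the paper's Figure 4 gives a Julia implementation; the subset size, refitting strategy, and defaults here are illustrative):

```python
import random

def least_squares(points):
    # Ordinary least squares for y = slope * x + intercept.
    n = len(points)
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sxx = sum(p[0] * p[0] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

def ransac(points, epsilon, iters, rng):
    best, best_inliers = None, -1
    for _ in range(iters):
        subset = rng.sample(points, 2)          # minimal subset for a line
        slope, intercept = least_squares(subset)
        inliers = [p for p in points
                   if abs(p[1] - (slope * p[0] + intercept)) < epsilon]
        if len(inliers) > best_inliers:
            # Refit on all inliers of the best candidate so far.
            best = least_squares(inliers)
            best_inliers = len(inliers)
    return best
```

The inlier threshold epsilon and the iteration count are exactly the parameters that the proposal program of Section 5 turns into random choices.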
To make use of RANSAC for robust and sound probabilistic inference, we wrote a proposal program in Gen.jl following the methodology proposed in Section 5, optimized the unknown parameters of the resulting proposal program using Algorithm 3, and employed the optimized proposal program within importance sampling (Algorithm 1). The proposal program, shown in Figure 5, (1) generates the parameter epsilon and the number of iterations of the RANSAC algorithm from distributions whose parameters are part of θ, then (2) runs RANSAC, (3) adds Cauchy-distributed noise to the resulting line parameters (slope and intercept), where the variability of the noise is determined by the output of a neural network whose parameters are also part of θ, and finally (4) samples the outlier binary variables from their conditional distributions given the line. Note that the random choices inside the ransac function are not annotated—the RANSAC procedure is executed as regular Julia code by Gen.jl's probabilistic runtime without incurring any overhead.
For comparison, we also implemented a proposal program (Figure 6) that does not use RANSAC, but instead generates the slope and intercept of the line from a distribution that depends on a neural network. We optimized both proposal programs using a modified version of Algorithm 3 that uses ADAM kingma2014adam instead of standard SGD, with the generative model as the training distribution. The results are shown and discussed in Figure 7.
7 Related work
Some probabilistic programming systems support combining automated and custom inference strategies mansinghka2014venture . However, custom proposal distributions in mansinghka2014venture require user implementation of a density-evaluation function for the proposal (mirroring the interface of Figure 1(a)). Other probabilistic programming systems support the specification of 'guide programs' for use in importance sampling, sequential Monte Carlo, or variational inference ritchie2016neurally ; ritchie2016deep . However, guide programs follow the control flow of a model program, or are restricted to using the same set of random choices. We focus on using general probabilistic programs to define proposal distributions, whether or not the model itself is a probabilistic program.
The Monte Carlo algorithms presented here derive their theoretical justification from auxiliary-variable constructions similar to those of pseudo-marginal MCMC andrieu2009 and random-weight particle filters fearnhead2008particle . In particular, our Metropolis-Hastings algorithm can be seen as an application of the general auxiliary-variable MCMC formalism of storvik2011flexibility to the setting where proposals are defined as probabilistic programs.
Recent work in 'amortized' or 'compiled' inference has studied techniques for offline optimization of proposal distributions for importance sampling or sequential Monte Carlo stuhlmuller2013learning ; paige2016inference ; ritchie2016neurally ; le2016inference . Others have applied similar approaches to optimize proposal distributions for use in MCMC cusumano2017probabilistic ; wang2017neural . However, these efforts focus on optimizing over neurally parameterized distributions that inherit their structure from the generative model being targeted and do not contain their own internal random choices, and that are therefore not suited to use with heuristic randomized algorithms. In contrast, we seek to let the user easily express and compute with custom proposal distributions defined as arbitrary probabilistic programs, which are independent of any structure in the probabilistic model being targeted and may include 'internal' random choices not present in the target model.
There has been much recent work in probabilistic machine learning on training generative latent variable models using stochastic gradient descent kingma2013auto ; mnih2014neural ; mnih2016variational . Our procedure for optimizing proposal programs is an application of the Monte Carlo variational objective approach of mnih2016variational to the setting where the generative model is itself a proposal distribution in a different probabilistic model. The fact that score-function estimators permit optimization of probabilistic computations that utilize black-box randomized code has previously been observed and used in reinforcement learning williams1992simple ; schulman2015gradient .
This paper formalized proposal programs, discussed how to implement them in a sampling-based probabilistic runtime, showed how to use them within importance sampling and Metropolis-Hastings and how to optimize them offline, and suggested an application of the formalism: using randomized heuristics to accelerate Monte Carlo inference. Several directions for future work seem promising.
First, we note that the efficiency of Monte Carlo inference with proposal programs depends on the number of internal replicates K used within the proposed simulate and assess procedures. It is important to better characterize how the efficiency depends on K. Also, the proposed simulate and assess procedures use forward execution of the proposal program to generate the traces used for estimating the proposal probability. Following the 'probabilistic module' formalism cusumano2016encapsulating , other distributions on traces that attempt to approximate the conditional distribution on traces given the output choices can be used instead, for more accurate proposal probability estimates (and therefore more efficient Monte Carlo inference for a fixed proposal program). These distributions, which constitute nested inference samplers, should be able to make use of existing inference machinery in probabilistic programming languages that was originally developed for inference in model programs. It also seems promising to explore proposal programs that combine neural networks and domain-specific heuristic algorithms in other ways. For example, a proposal program may contain a switch statement that decides whether to propose using a heuristic (as in Section 5) or using a pure neural-network proposal (as in Figure 6), where the probability of taking each path can itself be predicted from the problem instance by a neural network. The proposal program formalism also suggests that other approaches for inference in probabilistic programs, such as symbolic integration gehr2016psi , could find applications in estimating or computing proposal probabilities. Finally, a rigorous theoretical analysis of importance sampling, Metropolis-Hastings, and offline optimization with proposal programs that contain continuous random variables would be useful.
This research was supported by DARPA (PPAML program, contract number FA8750-14-2-0004), IARPA (under research contract 2015-15061000003), the Office of Naval Research (under research contract N000141310333), the Army Research Office (under agreement number W911NF-13-1-0212), and gifts from Analog Devices and Google. This research was conducted with Government support under and awarded by DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.
-  Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
-  Andreas Stuhlmüller, Jacob Taylor, and Noah Goodman. Learning stochastic inverses. In Advances in neural information processing systems, pages 3048–3056, 2013.
-  Brooks Paige and Frank Wood. Inference networks for sequential Monte Carlo in graphical models. In International Conference on Machine Learning, pages 3040–3049, 2016.
-  Daniel Ritchie, Anna Thomas, Pat Hanrahan, and Noah Goodman. Neurally-guided procedural models: Amortized inference for procedural graphics programs using neural networks. In Advances in Neural Information Processing Systems, pages 622–630, 2016.
-  Tuan Anh Le, Atilim Gunes Baydin, and Frank Wood. Inference compilation and universal probabilistic programming. arXiv preprint arXiv:1610.09900, 2016.
-  Marco Cusumano-Towner, Vikash Mansinghka, et al. Gen.jl: A probabilistic meta-programming platform and compositional inference programming library. https://github.com/probcomp/Gen.jl, 2017.
-  Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
-  Luke Tierney. Markov chains for exploring posterior distributions. The Annals of Statistics, pages 1701–1728, 1994.
-  David Wingate, Andreas Stuhlmüller, and Noah D Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In AISTATS, pages 770–778, 2011.
-  Marco F Cusumano-Towner, Alexey Radul, David Wingate, and Vikash K Mansinghka. Probabilistic programs for inferring the goals of autonomous agents. arXiv preprint arXiv:1704.04977, 2017.
-  Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188–2196, 2016.
-  John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528–3536, 2015.
-  Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Vikash Mansinghka, Daniel Selsam, and Yura Perov. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099, 2014.
-  Daniel Ritchie, Paul Horsfall, and Noah D Goodman. Deep amortized inference for probabilistic programs. arXiv preprint arXiv:1610.05735, 2016.
-  Christophe Andrieu and Gareth O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
-  Paul Fearnhead, Omiros Papaspiliopoulos, and Gareth O Roberts. Particle filters for partially observed diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):755–777, 2008.
-  Geir Storvik. On the flexibility of Metropolis–Hastings acceptance probabilities in auxiliary variable proposal generation. Scandinavian Journal of Statistics, 38(2):342–358, 2011.
-  Tongzhou Wang, Yi Wu, David A Moore, and Stuart J Russell. Neural block sampling. arXiv preprint arXiv:1708.06040, 2017.
-  Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
-  Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
-  Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
-  Marco F Cusumano-Towner and Vikash K Mansinghka. Encapsulating models and approximate inference programs in probabilistic modules. arXiv preprint arXiv:1612.04759, 2016.
-  Timon Gehr, Sasa Misailovic, and Martin Vechev. PSI: Exact symbolic inference for probabilistic programs. In International Conference on Computer Aided Verification, pages 62–83. Springer, 2016.
Appendix A Proofs
a.1 Consistency of importance sampling using proposal program
Consider a proposal program p and a target distribution π(·; y) such that for every output trace u with π(u; y) > 0 there exists a trace t of p with t|_B = u and p(t; y) > 0. Also assume that p_B(t; y) > 0 for all traces t. Consider the following extended target distribution:
Note that the extended target distribution could hypothetically be sampled from by first sampling an output trace u from π(·; y) and then executing p with output choices fixed to u to generate the auxiliary traces. Consider the following extended proposal distribution:
Note that the extended proposal distribution can be sampled from by first sampling an index j uniformly from {1, …, K}, then executing p to produce t_j, extracting the output choices u = t_j|_B, and executing p an additional K − 1 times with output choices fixed to u to produce t_i for i ≠ j. First note that the extended target distribution is absolutely continuous with respect to the extended proposal distribution, since p(t; y) > 0 implies p_I(t; y) > 0 and p_B(t; y) > 0. The importance weight for the extended target distribution and the extended proposal distribution is:
Note that invoking simulate in Algorithm 1 is equivalent to sampling from the extended proposal distribution, and that the importance weight in Algorithm 1 is equal to Equation (11). Therefore, Algorithm 1 is a standard self-normalized importance sampler for the extended target distribution. The consistency of Algorithm 1 then follows from the convergence of standard self-normalized importance sampling.
a.2 Stationary distribution for Metropolis-Hastings using proposal program
Algorithm 2 defines a transition operator that takes as input an output trace of the proposal program p and produces as output another output trace. Note that other state besides the output trace (i.e. state that is not mutated by the transition operator) can be included in the input to the operator, but we suppress it to simplify notation. Let δ denote the Kronecker delta. To show that this transition operator admits the target distribution as a stationary distribution, we first define an extended target distribution on tuples (u, j, t_1, …, t_K), where u is an output trace of p, where j ∈ {1, …, K}, and where each t_i is a trace of p:
One could hypothetically sample from this distribution by first sampling an output trace, then sampling j uniformly from {1, …, K}, executing the program to produce the trace t_j, and executing the program K − 1 times with output choices fixed to the values given in the output trace. We also define a proposal kernel on this extended space:
This kernel can be sampled from by first executing p, K times, with output random choices fixed to the values in the current output trace, producing traces t_1, …, t_K, then sampling an index in proportion to the output-choice probabilities p_B(t_i; ·), and then setting the proposed output trace accordingly. Consider a Metropolis-Hastings (MH) move on the extended space, using this kernel as the proposal and the extended target distribution as the target. The MH acceptance ratio for this move is:
Consider a sampling process that begins with then executes Algorithm 2 on input . Observe that the invocation of simulate in Algorithm 2 is equivalent to sampling from . Therefore, after simulate, we have a sample from the extended target distribution: . Next, note that the invocation of assess in Algorithm 2, on input and , is equivalent to sampling from the extended proposal kernel on input . Note that is not literally sampled in assess but could be without changing the behavior of assess. Finally, note that Equation (18) is the acceptance ratio used in the accept/reject step of Algorithm 2. Therefore, we can interpret Algorithm 2 as first extending the input sample onto the extended space by sampling from the conditional distribution , and then performing an MH step on the extended space. Since , we have that . Since the MH step on the extended space has the extended target distribution as a stationary distribution, we have and therefore . Therefore is a stationary distribution of Algorithm 2.
Appendix B Offline optimization of proposal programs
To see that the surrogate objective is a lower bound on the expected conditional log likelihood, note that for each training pair (y, u), the estimate q̂ = (1/K) ∑_{i=1}^K p_B(t_i; y, θ), with t_1, …, t_K sampled by executing the program with output choices fixed to u, is unbiased: E[q̂] = q(u; y, θ). By Jensen's inequality,

E[log q̂] ≤ log E[q̂] = log q(u; y, θ),

and taking expectations over the training distribution shows that the surrogate objective is a lower bound.