One strategy for improving convergence of Markov Chain Monte Carlo (MCMC) samplers is through online adaptation of the proposal distribution[1, 2, 14]. An adaptation scheme must ensure that the sample sequence converges to the correct equilibrium distribution. In a componentwise updating Metropolis-Hastings MCMC, i.e. Metropolis-within-Gibbs [5, 8, 10], the proposal distribution can be decomposed into two components:
A stochastic schedule (probability distribution) for selecting the next random variable for modification.
The kernels from which new values for each of the variables are proposed.
In this paper we concentrate on the first component—adapting the schedule for selecting a variable for modification.
Our primary interest in this work is to improve MCMC methods for probabilistic programming [6, 7, 11, 16]. Probabilistic programming languages facilitate development of general probabilistic models using the expressive power of general programming languages. The goal of inference in such programs is to reason about the posterior distribution over random variates that are sampled during execution, conditioned on observed values that constrain a subset of program expressions.
Lightweight Metropolis-Hastings (LMH) samplers  propose a change to a single random variable at each iteration. The program is then rerun, reusing previous values and computation where possible, after which the new set of sample values is accepted or rejected. While re-running the program each time may waste some computation, the simplicity of LMH makes developing probabilistic variants of arbitrary languages relatively straightforward.
Designing robust Adaptive MCMC methods for probabilistic programming is complicated because of diversity of models expressed by probabilistic programs. The same adaption scheme should perform well with different programs without manual tuning. Here we present an adaptive variant of LMH, which dynamically adjusts the schedule for selecting variables for modification. First, we review the general structure of a probabilistic program. We discuss convergence criteria with respect to the program output and propose a scheme for tracking the “influence” of each random variable on the output. We then adapt the selection probability for each variable, borrowing techniques from the upper confidence bound (UCB) family of algorithms for multi-armed bandits . We show that the proposed adaptation scheme preserves convergence to the target distribution under reasonable assumptions. Finally, we compare original and Adaptive LMH on several test problems to show how convergence is improved by adaptation.
2.1 Probabilistic Program
A probabilistic program is a stateful deterministic computation with the following properties:
Initially, expects no arguments.
On every call, returns either a distribution , a distribution and a value , a value , or .
Upon returning , expects a value drawn from as the argument to the next call.
Upon returning or , is invoked again without arguments.
Upon returning , terminates.
A program is run by calling repeatedly until termination.
Every run of the program implicitly produces a sequence of pairs of distributions and values of latent random variables. We call this sequence a trace and denote it by . A trace induces a sequence of pairs of distributions and values of observed random variables. We call this sequence an image and denote it by . We call a sequence of values an output of the program and denote it by . Program output is deterministic given the trace.
The probability of a trace is proportional to the product of the probability of all random choices and the likelihood of all observations
The objective of inference in probabilistic program is to discover the distribution of .
2.2 Adaptive Markov Chain Monte Carlo
MCMC methods generate a sequence of samples by simulating a Markov chain using a transition operator that leaves a target density invariant. In MH the transition operator is implemented by drawing a new sample from a parameterized proposal distribution that is conditioned on the current sample . The proposed sample is then accepted with probability
If is rejected, is re-used as the next sample.
The convergence rate of MH depends on parameters of the proposal distribution . The parameters can be set either offline or online. Variants of MCMC in which the parameters are continuously adjusted based on the features of the sample sequence are called adaptive. Challenges in design and analysis of Adaptive MCMC methods include optimization criteria and algorithms for the parameter adaptation, as well as conditions of convergence of Adaptive MCMC to the correct equilibrium distribution . Continuous adaptation of parameters of the proposal distribution is a well-known research subject [1, 2, 14].
In a componentwise MH algorithm  which targets a density defined on an -dimensional space , the components of a random sample are updated individually, in either random or systematic order. Assuming the component is selected at the step for modification, the proposal sampled from may differ from only in that component, and for all . Adaptive componentwise Metropolis-Hastings (Algorithm 1) chooses different probabilities for selecting a component for modification at each iteration. Parameters of this scheduling distribution may be viewed as a subset of parameters of the proposal distribution , and adjusted according to optimization criteria of the sampling algorithm.
Varying selection probabilities based on past samples violates the Markov property of . However, provided the adaptation of the selection probabilities diminishes, with , then under suitable regularity conditions for the target density (see Section 4) an adaptive componentwise MH algorithm will still be ergodic , and the distribution on induced by Algorithm 1 converges to .
2.3 Lightweight Metropolis-Hastings
LMH  is a sampling scheme for probabilistic programs where a single random variable drawn in the course of a particular execution of a probabilistic program is modified via a standard MH proposal, and this modification is accepted by comparing the values of the joint probability of old and new program traces. LMH differs from componentwise MH algorithms in that other random variables may also have to be modified, depending on the structural dependencies in the probabilistic program.
LMH initializes a proposal by selecting a single variable from an execution trace and resampling its value either using a reversible kernel or from the conditional prior distribution. Starting from this initialization, the program is rerun to generate a new trace . For each , the previous value is reused, provided it still lies in the support of the distribution on , rescoring its probability relative to the new random choices . When cannot be rescored, a new value is sampled from the prior on , conditioned on preceding choices. The acceptance probability is obtained by substituting (1) into (2):
We here further simplify LMH by assuming is sampled from the prior conditioned on earlier choices and that all variables are selected for modification with equal probability. In this case, takes the form 
where denotes the resampled variables, and denotes the variables which have the same values in both traces.
3 Adaptive Lightweight Metropolis-Hastings
We develop an adaptive variant of LMH, which dynamically adjusts the probabilities of selecting variables for modification (Algorithm 2). Let be the trace at iteration
of Adaptive LMH. We define the probability distribution of selecting variables for modification in terms of a weight vectorthat we adapt, such that the probability of selecting the variable for modification is
Just like LMH, Adaptive LMH runs the probabilistic program once and then selects variables for modification randomly. However, acceptance ratio must now include selection probabilities and of the resampled variable in the current and the proposed sample
This high-level description of the algorithm does not detail how is computed for each iteration. Indeed, this is the most essential part of the algorithm. There are two different aspects here — on one hand, the influence of a given choice on the output sequence must be quantified in terms of convergence of the sequence to the target distribution. On the other hand, the influence of the choice must be translated into re-computation of weights of random variables in the trace. Both parts of re-computation of are explained below.
3.1 Quantifying Influence
Extensive research literature is available on criteria for tuning parameters of Adaptive MCMC [1, 2, 14]. The case of inference in probabilistic programs is different: the user of a probabilistic program is interested in fast convergence of the program output rather than of the vector of the program’s random variables .
In adaptive MCMC variants the acceptance rate can be efficiently used as the optimization objective . However, for convergence of the output sequence an accepted trace that produces the same output is indistinguishable from a rejected trace. Additionally, while optimal values of the acceptance rate [1, 14] can be used to tune parameters in adaptive MCMC, in Adaptive LMH we do not change the parameters of proposal distributions of individual variables, and assume that they are fixed. However, proposing a new value for a random variable may or may not change the output even if the new trace is accepted. By changing variable selection probabilities we attempt to maximize the change in the output sequence so that it converges faster. In the pedagogical example
selecting the Bernoulli random choice for modification changes the output only when a different value is sampled, while selecting the normal random choice will change the output almost always.
Based on these considerations, we quantify the influence of sampling on the output sequence by measuring the change in the output of the probabilistic program. Since probabilistic programs may produce output of any type, we chose to discern between identical and different outputs only, rather than to quantify the distance by introducing a type-dependent norm. In addition, when , we quantify the difference by the fraction of components of with changed values.
Formally, let be the output sequence of a probabilistic program. Then the influence of a choice that produced is defined by the total reward , computed as normalized Hamming distance between the outputs
In the case of a scalar , the reward is 1 when outputs of subsequent samples are different, 0 otherwise.
The reward is used to adjust the variable selection probabilities for the subsequent steps of Adaptive LMH by computing (line 8 of Algorithm 2). It may seem sufficient to assign the reward to the last choice and use average choice rewards as their weights, but this approach will not work for Adaptive LMH. Consider the generative model
where we observe the value . Modifying may result in an accepted trace, but the value of , predicted by the program, will remain the same as in the previous trace. Only when is also modified, and a new trace with the updated values for both and is accepted, the earlier change in is indirectly reflected in the output of the program. In the next section, we discuss propagation of rewards to variable selection probabilities in detail.
3.2 Propagating Rewards to Variables
Both LMH and Adaptive LMH modify a single variable per trace, and either re-use or recompute the probabilities of values of all other variables (except those absent from the previous trace or having an incompatible distribution, for which new values are also sampled). Due to this updating scheme, the influence of modifying a variable on the output can be delayed by several iterations. We propose the following propagation scheme: for each random variable , the reward and count are kept in a data structure used to compute . The set of variables selected for modification, called here the history, is maintained for each component of output . When the value of changes, the reward is distributed between all of the variables in the history, and the history is emptied. When does not change, the selected variable is penalized by zero reward. This scheme, for the case of scalar output for simplicity, is shown in Algorithm 3 which expands line 8 of Algorithm 2. When has multiple components, histories for each component are maintained independently.
Rewarding all of the variables in the history ensures that while variables which cause changes in the output more often get a greater reward, variables with lower influence on the output are still selected for modification sufficiently often. This, in turn, ensures ergodicity of sampling sequence, and helps establish conditions for convergence to the target distribution, as we discuss in Section 4.
Let us show that under certain assumptions the proposed reward propagation scheme has a non-degenerate equilibrium for variable selection probabilities. Indeed, assume that for a program with two variables, , and , probability matching, or selecting a choice with the probability proportional to the unit reward , is used to compute the weights, that is, . Then, the following lemma holds:
Assume that for variables , where :
is the selection probability;
is the probability that the new trace is accepted given that the variable was selected for modification;
is the probability that the output changed given that the trace was accepted.
Assume further that , , and are constant. Then for the case , :
We shall proof the lemma in three steps. First, we shall analyse a sequence of samples between two subsequent arrivals of . Then, we shall derive a formula for the expected unit reward of . Finally, we shall bound the ratio .
Consider a sequence of samples, for some , between two subsequent arrivals of , including the sample corresponding to the second arrival of . Since a new value of always () and never () causes a change in the output, at the end of the sequence the history will contain occurrences of . Let us denote by , the increase of reward and count between the beginning and the end of the sequence. Noting that is penalized each time it is added to the history (line 9 of Algorithm 3), and occurrences of are rewarded when is added to the history (line 5 of Algorithm 3), we obtain
Consider now a sequence of such sequences. When , approaches the expected unit reward , where and are the reward and the count of at the end of the sequence.
Each variable is selected randomly and independently and produces an accepted trace with probability
Acceptances of form a Poisson process with rate .
is distributed according to the geometric distribution with probability, . Since for any , the expected unit reward of is . We shall substitute and into (10) to obtain the expected unit reward of :
For probability matching, selection probabilities are proportional to expected unit rewards:
By noting that any subset of variables in a probabilistic program can be considered a single random variable drawn from a multi-dimensional distribution, Lemma 1 is generalized to any number of variables by Corollary 1:
For any partitioning of the set of random variables of a probabilistic program, AdLMH with weights proportional to expected unit rewards selects variables from each of the partitions with non-zero probability.
To ensure convergence of to expected unit rewards in the stationary distribution, we use upper confidence bounds on unit rewards to compute the variable selection probabilities, an idea which we borrowed from the UCB family of algorithms for multi-armed bandits . Following UCB1 , we compute the upper confidence bound as the sum of the unit reward and the exploration term
where is an exploration factor. The default value for is in UCB1; in practice, a lower value of is preferable. Note that variable selection in Adaptive LMH is different from arm selection in multi-armed bandits: unlike in bandits, where we want to sample the best arm at an increasing rate, in Adaptive LMH we expect to converge to an equilibrium in which selection probabilities are proportional to expected unit rewards.
4 Convergence of Adaptive LMH
As adaptive MCMC algorithms may depend arbitrarily on the history at each step, showing that a given sampler correctly draws from the target distribution can be non-trivial. General conditions under which adaptive MCMC schemes are still ergodic, in the sense that the distribution of samples converges to the target in total variation, are established in . The fundamental criteria for validity of an adaptive algorithm are diminishing adaptation, which (informally) requires that the amount which the transition operator changes each iteration must asymptotically decrease to zero; and containment, a technical condition which requires that the time until convergence to the target distribution must be bounded in probability .
The class of models representable by probabilistic programs is very broad, allowing specification of completely arbitrary target densities; however, for many models the adaptive LMH algorithm reduces to an adaptive random scan Metropolis-within-Gibbs in Algorithm 1. To discuss when this is the case, we invoke the concept of structural versus structure-preserving random choices . Crucially, a structure-preserving random choice does not affect the existence of other in the trace.
Suppose we were to restrict the expressiveness of our language to admit only programs with no structural random choices: in such a language, the LMH algorithm in Algorithm 2 reduces to the adaptive componentwise MH algorithm. Conditions under which such an adaptive algorithm is ergodic have been established explicitly in [8, Theorems 4.10 and 5.5]. Given suitable assumptions on the target density defined by the program, it is necessary for the probability vector , and that for any particular component we have probability . Both of these are satisfied by our approach: from Corollary 1, we ensure that the unit reward across each converges to a positive fixed point.
While any theoretical result will require language restrictions such that programs only induce distributions satisfying regularity conditions, we conjecture that this scheme is broadly applicable across most non-pathological programs. We leave a precise theoretical analysis of the space of probabilistic programs in which adaptive MCMC schemes (with infinite adaptation) may be ergodic to future work. Empirical evaluation presented in the next section demonstrates practical convergence of Adaptive LMH on a range of inference examples, including programs containing structural random choices.
5 Empirical Evaluation
We evaluated Adaptive LMH on many probabilistic programs and observed consistent improvement of convergence rate compared to LMH. We also verified on a number of tests that the algorithm converges to the correct distribution obtained by independent exact methods. In this section, we compare Adaptive LMH to LMH on several representative examples of probabilistic programs. The rates in the comparisons are presented with respect to the number of samples, or simulations, of the probabilistic programs. The additional computation required for adaptation takes negligible time, and the computational effort per sample is approximately the same for all algorithms. Our implementation of the inference engine is available at https://bitbucket.org/dtolpin/embang.
In the following case studies differences between program outputs and target distributions are presented using KullBack-Leibler (KL) divergence, Kolmogorov-Smirnov (KS) distance, or L2 distance, as appropriate. In cases where target distributions cannot be updated exactly, they were approximated by running a non-adaptive inference algorithm for a long enough time and with a sufficient number of restarts. In each of the evaluations, all of the algorithms were run with 25 random restarts and 500 000 simulations of the probabilistic program per restart. The difference plots use the logarithmic scale for both axes. In the plots, the solid lines correspond to the median, and the dashed lines to 25% and 75% percentiles, taken over all runs of the corresponding inference algorithm. The exploration factor for computing upper confidence bounds on unit rewards (Equation 17) was fixed at for all tests and evaluations.
The first example is a latent state inference problem in an HMM with three states, one-dimensional normal observations (0.9, 0.8, 0.7, 0, -0.025, 5, 2, 0.1, 0, 0.13, 0.45, 6, 0.2, 0.3, -1, -1) with variance 1.0, a known transition matrix, and known initial state distribution. There are 18 distinct random choices in all traces of the program, and the 0th and the 17th state of the model are predicted. The results of evaluation are shown in Figure1 as KL divergences between the inference output and the ground truth obtained using the forward-backward algorithm. In addition, bar plots of unit reward and sample count distributions among random choices in Adaptive LMH are shown for , , and samples.
As can be seen in the plots, Adaptive LMH (black) exhibits faster convergence over the whole range of evaluation, requiring half as many samples as LMH (cyan) to achieve the same approximation, with the median of LMH above the 75% quantile of Adaptive LMH.
In addition, the bar plots show unit rewards and sample counts for different random choices, providing an insight on the adaptive behavior of AdLMH. On the left-hand bar plots, red bars are normalized unit rewards, and blue bars are normalized sample counts. On the right-hand bar plots, the total height of a bar is the total sample count, with green section corresponding to the accepted, and yellow to the rejected samples. At samples, the unit rewards have not yet converged, and exploration supersedes exploitation: random choices with lower acceptance rate are selected more often (choices 7, 8 and 13 corresponding to states 6, 7 and 12). At samples, the unit rewards become close to their final values, and choices 1 and 18, immediately affecting the predicted states, are selected more often. At samples, the unit rewards converge, and the sample counts correspond closely to the equilibrium state outlined in Lemma 1.
The process has five hyperparameters, . The program infers the posterior values of the hyperparameters by maximizing marginal likelihood of 6 observations , , , , , and . Parameters of the mean function are predicted. Maximum of KS distances between inferred distributions of each of the predicted parameters and an approximation of the target distributions is shown in Figure 2. The approximation was obtained by running LMH with samples per restart and 50 restarts, and then taking each 100th sample from the last samples of each restart, 5000 samples total. Just as for the previous case study, bar plots of unit rewards and sample counts are shown for , , and samples.
Here as well, Adaptive LMH (black) converges faster over the whole range of evaluation, outperforming LMH by a factor 2 over the first samples. Bar plots of unit rewards and sample counts for different number of choices, again, show the dynamics of sample allocation among random choices. Choices , , and are predicted, while choices and are required for inference but not predicted. Choice has the lowest acceptance rate (ratio between the total height of the bar and the green part on the right-hand bar plot), but the unit reward is close the unit reward of choices and . At samples, choice is selected with the highest probability. However, close to the converged state, at samples, choices , , and are selected with similar probabilities. At the same time, choices 4 and 5 are selected with a lower probability. Both the exploration-exploitation dynamics for choices – and probability matching of selection probabilities among all choices secure improved convergence.
The third case study involves a larger amount of data observed during each simulation of a probabilistic program. We use the well-known Iris dataset 
to fit a model of classifying a given flower as of the species Iris setosa, as opposite to either Iris virginica or Iris versicolor. Each record in the dataset corresponds to an observation. For each observation, we define a feature vectorand an indicator variable , which is 1 if and only if the observation is of an Iris setosa. We fit the model with five regression coefficients , defined as
To assess the convergence, we perform shuffle split leave-2-out cross validation on the dataset, selecting one instance belonging to the species Iris setosa and one belonging to a different species for each run of the inference algorithm. The classification error is shown in Figure 3 over 100 runs of LMH and Adaptive LMH.
The results are consistent with other case studies: Adaptive LMH exhibits a faster convergence rate, requiring half as many samples to achieve the same classification accuracy as LMH.
As a final case study we consider a linear dynamical system (i.e. a Kalman smoothing problem) that was previously described in 
In this problem we assume that 16-dimensional observations are conditioned on 2-dimensional latent states . We impose additional structure by assuming that the transition matrix is a simple rotation with angular velocity , whereas the transition covariance is a diagonal matrix with coefficient ,
We predict posterior values for , and in a setting where and are assumed known, under mildly informative priors and . Posterior inference is performed conditioned on a simulated sequence of observations, with , and . The observation matrix and covariance are sampled row-wise from symmetric Dirichlet distributions with parameters , and respectively.
Figure 4 shows a qualitative evaluation the mixing rate in the form of 500 consecutive samples from an LMH and AdLMH chain after 10 000 samples of burn-in. The LMH sequence exhibits good mixing over but is strongly correlated in , whereas the AdLMH sequence obtains a much better coverage of the space.
To summarize, Adaptive LMH consistently attained faster convergence than LMH, measured by differences between the ongoing output distribution of the random program and the target independently obtained distribution, assessed using various metrics. Variable selection probabilities computed by Adaptive LMH are dynamically adapted during the inference, combining exploration of the model represented by the probabilistic program and exploitation of influence of random variables on program output.
6 Contribution and Future Work
In this paper we introduced a new algorithm, Adaptive LMH, for approximate inference in probabilistic programs. This algorithm adjusts sampling parameters based on the output of the probabilistic program in which the inference is performed. Contributions of the paper include
A scheme of rewarding random choice based on program output.
An approach to propagation of choice rewards to MH proposal scheduling parameters.
An application of this approach to LMH, where the probabilities of selecting each variable for modification are adjusted.
Adaptive LMH was compared to LMH, its non-adaptive counterpart, and was found to consistently outperform LMH on several probabilistic programs, while still being almost as easy to implement. The time cost of additional computation due to adaptation was negligible.
Although presented in the context of a particular sampling algorithm, the adaptation approach can be extended to other sampling methods. We believe that various sampling algorithms for probabilistic programming can benefit from output-sensitive adaptation. Additional potential for improvement lies in acquisition of dependencies between predicted expressions and random variables. Exploring alternative approaches for guiding exploration-exploitation compromise, in particular, based on Bayesian inference, is another promising research direction.
Overall, output-sensitive approximate inference appears to bring clear advantages and should be further explored in the context of probabilistic programming models and algorithms.
This work is supported under DARPA PPAML through the U.S. AFRL under Cooperative Agreement number FA8750-14-2-0004.
-  Andrieu, C., Thoms, J.: A tutorial on adaptive MCMC. Statistics and Computing 18(4), 343–373 (2008)
-  Atchadé, Y., Fort, G., Moulines, E., Priouret, P.: Adaptive Markov chain Monte Carlo: theory and methods. In: Barber, D., Cemgil, A.T., Chiappa, S. (eds.) Bayesian Time Series Models, pp. 32–51. Cambridge University Press (2011), cambridge Books Online
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the Multiarmed Bandit problem. Machine Learning 47(2–3), 235–256 (2002)
-  Bai, Y., Roberts, G.O., Rosenthal, J.S.: On the containment condition for adaptive Markov chain Monte Carlo algorithms. Advances and Applications in Statistics 21(1), 1–54 (2011)
-  Gamerman, D., Lopes, H.F.: Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman and Hall/CRC (2006)
Goodman, N.D., Mansinghka, V.K., Roy, D.M., Bonawitz, K., Tenenbaum, J.B.: Church: a language for generative models. In: Proc. of Uncertainty in Artificial Intelligence (2008)
-  Gordon, A.D., Henzinger, T.A., Nori, A.V., Rajamani, S.K.: Probabilistic programming. In: International Conference on Software Engineering (ICSE, FOSE track) (2014)
-  Łatuszyński, K., Roberts, G.O., Rosenthal, J.S.: Adaptive Gibbs samplers and related MCMC methods. Annals of Applied Probability 23(1), 66–98 (2013)
-  Lauritzen, S.: Graphical Models. Clarendon Press (1996)
-  Levine, R.A., Yu, Z., Hanley, W.G., Nitao, J.J.: Implementing componentwise hastings algorithms. Computational Stastistics & Data Analysis 48(2), 363–389 (2005)
-  Mansinghka, V.K., Selsam, D., Perov, Y.N.: Venture: a higher-order probabilistic programming platform with programmable inference. CoRR abs/1404.0099 (2014)
-  van de Meent, J.W., Yang, H., Mansinghka, V., Wood, F.: Particle Gibbs with Ancestor Sampling for Probabilistic Programs. In: Artificial Intelligence and Statistics (2015), http://arxiv.org/abs/1501.06769
-  Roberts, G.O., Rosenthal, J.S.: Coupling and ergodicity of adaptive MCMC. Journal of Applied Probability 44, 458–475 (2007)
-  Roberts, G.O., Rosenthal, J.S.: Examples of adaptive MCMC. Journal of Computational and Graphical Statistics 18(2), 349–367 (2009)
-  Wingate, D., Stuhlmüller, A., Goodman, N.D.: Lightweight implementations of probabilistic programming languages via transformational compilation. In: Proc. of the 14th Artificial Intelligence and Statistics (2011)
-  Wood, F., van de Meent, J.W., Mansinghka, V.: A new approach to probabilistic programming inference. In: Artificial Intelligence and Statistics (2014)
-  Yang, L., Hanrahan, P., Goodman, N.D.: Generating efficient MCMC kernels from probabilistic programs. In: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics. pp. 1068–1076 (2014)