 # Markov Chain Monte Carlo and Variational Inference: Bridging the Gap

Recent advances in stochastic gradient variational inference have made it possible to perform variational Bayesian inference with posterior approximations containing auxiliary random variables. This enables us to explore a new synthesis of variational inference and Monte Carlo methods where we incorporate one or more steps of MCMC into our variational approximation. By doing so we obtain a rich class of inference algorithms bridging the gap between variational methods and MCMC, and offering the best of both worlds: fast posterior approximation through the maximization of an explicit objective, with the option of trading off additional computation for additional accuracy. We describe the theoretical foundations that make this possible and show some promising first results.

## Code Repositories

### foundations_for_deep_learning

Building a scalable foundation for deep learning

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 MCMC and Variational Inference

Bayesian analysis gives us a very simple recipe for learning from data: given a set of unknown parameters or latent variables that are of interest, we specify a prior distribution quantifying what we know about before observing any data. Then we quantify how the observed data relates to by specifying a likelihood function . Finally, we apply Bayes’ rule to give the posterior distribution, which quantifies what we know about after seeing the data.

Although this recipe is very simple conceptually, the implied computation is often intractable. We therefore need to resort to approximation methods in order to perform Bayesian inference in practice. The two most popular approximation methods for this purpose are variational inference and Markov Chain Monte Carlo (MCMC). The former has the advantage of maximizing an explicit objective, and being faster in most cases. The latter has the advantage of being nonparametric and asymptotically exact. Here, we show how both methods can be combined in order to get the best of both worlds.

### 1.1 Variational Inference

Variational inference casts Bayesian inference as an optimization problem where we introduce a parameterized posterior approximation which is fit to the posterior distribution by choosing its parameters to maximize a lower bound on the marginal likelihood:

 logp(x) ≥logp(x)−DKL(qθ(z|x)||p(z|x)) (1) =Eqθ(z|x)[logp(x,z)−logqθ(z|x)]=L. (2)

Since is independent of , maximizing the bound w.r.t. will minimize the KL-divergence . The bound above is tight at , when the approximation perfectly matches .

### 1.2 MCMC and Auxiliary Variables

A popular alternative to variational inference is the method of Markov Chain Monte Carlo (MCMC). Like variational inference, MCMC starts by taking a random draw from some initial distribution or . Rather than optimizing this distribution, however, MCMC methods subsequently apply a stochastic transition operator to the random draw :

 zt∼q(zt|zt−1,x).

By judiciously choosing the transition operator and iteratively applying it many times, the outcome of this procedure, , will be a random variable that converges in distribution to the exact posterior . The advantage of MCMC is that the samples it gives us can approximate the exact posterior arbitrarily well if we are willing to apply the stochastic transition operator a sufficient number of times. The downside of MCMC is that in practice we do not know how many times is sufficient, and getting a good approximation using MCMC can take a very long time.

The central idea of this paper is that we can interpret the stochastic Markov chain as a variational approximation in an expanded space by considering to be a set of auxiliary random variables. Integrating these auxiliary random variables into the variational lower bound (2), we obtain

 Laux (3) =Eq(y,zT|x)[log[p(x,zT)r(y|x,zT)]−logq(y,zT|x)] =L−Eq(zT|x){DKL[q(y|zT,x)||r(y|zT,x)]} ≤L≤log[p(x)],

where is an auxiliary inference distribution which we are free to choose, and our marginal posterior approximation is given by . The marginal approximation is now a mixture of distributions of the form . Since this is a very rich class of distributions, auxiliary variables may be used to obtain a closer fit to the exact posterior (Salimans & Knowles, 2013). The choice would be optimal, but again often intractable to compute; in practice, good results can be obtained by specifying a that can approximate to a reasonable degree. One way this can be achieved is by specifying to be of some flexible parametric form, and optimizing the lower bound over the parameters of this distribution. In this paper we consider the special case where the auxiliary inference distribution also has a Markov structure just like the posterior approximation: , in which case the variational lower bound can be rewritten as

 logp(x) ≥Eq[logp(x,zT)−logq(z0,…,zT|x) (4) +logr(z0,…,zt−1|x,zT)] =Eq[log[p(x,zT)/q(z0|x)] +T∑t=1log[rt(zt−1|x,zt)/qt(zt|x,zt−1)]].

where the subscript in and highlights the possibility of using different transition operators and inverse models at different points in the Markov chain. By specifying these and in some flexible parametric form, we can then optimize the value of (4) in order to get a good approximation to the true posterior distribution.

## 2 Optimizing the lower bound

For most choices of the transition operators and inverse models , the auxiliary variational lower bound (4) cannot be calculated analytically. However, if we can at least sample from the transitions , and evaluate the inverse models at those samples, we can still approximate the variational lower bound without bias using the following algorithm:

The key insight behind the recent work in stochastic gradient variational inference is that if all the individual steps of an algorithm like this are differentiable in the parameters of and , which we denote by , then so is the algorithm’s output . Since

is an unbiased estimate of the variational lower bound, its derivative is then an unbiased estimate of the derivative of the lower bound, which can be used in a stochastic optimization algorithm.

Obtaining gradients of the Monte Carlo estimate of Algorithm 1

requires the application of the chain rule through the random sampling of the transition operators

. This can in many cases be realised by drawing from these operators in two steps: In the first step we draw a set of primitive random variables from a fixed distribution , and we then transform those as with a transformation chosen in such a way that follows the distribution

. If this is the case we can apply backpropagation, differentiating through the sampling function to obtain unbiased stochastic estimates of the gradient of the lower bound objective with respect to

(Salimans & Knowles, 2013; Kingma & Welling, 2014; Rezende et al., 2014). An alternative solution, which we do not consider here, would be to approximate the gradient of the lower bound using Monte Carlo directly (Paisley et al., 2012; Ranganath et al., 2014; Mnih & Gregor, 2014).

Once we have obtained a stochastic estimate of the gradient of (2) with respect to , we can use this estimate in a stochastic gradient-based optimization algorithm for fitting our approximation to the true posterior . We do this using the following algorithm:

### 2.1 Example: bivariate Gaussian

As a first example we look at sampling from the bivariate Gaussian distribution defined by

 p(z1,z2)∝exp[−12σ21(z1−z2)2−12σ22(z1+z2)2].

We consider two MCMC methods that update the univariate in turn. The first method is Gibbs sampling, which samples from the Gaussian full conditional distributions . The second method is the over-relaxation method of (Adler, 1981), which instead updates the univariate using . For the two methods are equivalent, but for other values of the over-relaxation method may mix more quickly than Gibbs sampling. To test this we calculate the variational lower bound for this MCMC algorithm, and maximize with respect to to find the most effective transition operator.

For the inverse model we use Gaussians with mean parameter linear in

and variance independent of

. For this particular case this specification allows us to recover the distribution exactly. We use in our exact posterior, and we initialize the Markov chain at , with addition of infinitesimal noise (variance of ). Figure 1 shows the lower bound for both MCMC methods: over-relaxation with an optimal of clearly recovers the exact posterior much more quickly than plain Gibbs sampling. The fact that optimization of the variational lower bound allows us to improve upon standard methods like Gibbs sampling is promising for more challenging applications. Figure 1: The log marginal likelihood lower bound for a bivariate Gaussian target and an MCMC variational approximaton, using Gibbs sampling or Adler’s overrelaxation.

## 3 Hamiltonian variational inference

One of the most efficient and widely applicable MCMC methods is Hamiltonian Monte Carlo (HMC) (Neal, 2011). HMC is an MCMC method for approximating continuous distributions where the space of unknown variables is expanded to include a set of auxiliary variables with the same dimension as . These auxiliary variables are initialized with a random draw from a distribution , after which the method simulates the dynamics corresponding to the Hamiltonian , where and are iteratively updated using the leapfrog integrator, see (Neal, 2011).

Hamiltonian dynamics of this form is a very effective way of exploring the posterior distribution because the dynamics is guided by the gradient of the exact log posterior, and random walks are suppressed by the auxiliary variables , which are also called momentum variables. Furthermore, the transition from to in HMC is deterministic, invertible and volume preserving, which means that we have

 q(vt,zt|zt−1,x)=q(vt,zt,zt−1|x)/q(zt−1|x) =q(v′t,zt−1|x)/q(zt−1|x)=q(v′t|zt−1,x)

and similarly , with the output of the Hamiltonian dynamics.

Using this choice of transition operator and inverse model we obtain the following algorithm for stochastically approximating the log marginal likelihood lower bound:

Here we omit the Metropolis-Hastings step that is typically used with Hamiltonian Monte Carlo. Section 4.1 discusses how such as step could be integrated into Algorithm 3.

We fit the variational approximation to the true posterior distribution by stochastically maximizing the lower bound with respect to , and the parameters (stepsize and mass matrix) of the Hamiltonian dynamics using Algorithm 2. We call this version of the algorithm Hamiltonian Variational Inference (HVI). After running the algorithm to convergence, we then have an optimized approximation of the posterior distribution. Because our approximation automatically adapts to the local shape of the exact posterior, this approximation will often be better than a variational approximation with a fixed functional form, provided our model for is flexible enough.

In addition to improving the quality of our approximation, we find that adding HMC steps to a variational approximation often reduces the variance in our stochastic gradient estimates, thereby speeding up the optimization. The downside of using this algorithm is that its computational cost per iteration is higher than when using an approximate of a fixed form, mainly owing to the need of calculating additional derivatives of

. These derivatives may also be difficult to derive by hand, so it is advisable to use an automatic differentiation package such as Theano

(Bastien et al., 2012). As a rule of thumb, using the Hamiltonian variational approximation with MCMC steps and leapfrog steps is about times as expensive per iteration as when using a fixed form approximation. This may be offset by reducing the number of iterations, and in practice we find that adding a single MCMC step to a fixed-form approximation often speeds up the convergence of the lower bound optimization in wallclock time. The scaling of the computational demands in the dimensionality of is the same for both Hamiltonian variational approximation and fixed form variational approximation, and depends on the structure of .

Compared to regular Hamiltonian Monte Carlo, Algorithm 3 has a number of advantages: The samples drawn from are independent, the parameters of the Hamiltonian dynamics are automatically tuned, and we may choose to omit the Metropolis-Hastings step so as not to reject any of the proposed transitions. Furthermore, we optimize a lower bound on the log marginal likelihood, and we can assess the approximation quality using the techniques discussed in (Salimans & Knowles, 2013). By finding a good initial distribution , we may also speed up convergence to the true posterior and get a good posterior approximation using only a very short Markov chain, rather than relying on asymptotic theory.

### 3.1 Example: A beta-binomial model for overdispersion

To demonstrate our Hamiltonian variational approximation algorithm we use an example from (Albert, 2009), which considers the problem of estimating the rates of death from stomach cancer for the largest cities in Missouri. The data is available from the R package LearnBayes. It consists of 20 pairs where contains the number of individuals that were at risk for cancer in city , and is the number of cancer deaths that occurred in that city. The counts

are overdispersed compared to what one could expect under a binomial model with constant probability, so

(Albert, 2009)

assumes a beta-binomial model with a two dimensional parameter vector

. The low dimensionality of this problem allows us to easily visualize the results.

We use a variational approximation containing a single HMC step so that we can easily integrate out the 2 momentum variables numerically for calculating the exact KL-divergence of our approximation and to visualize our results. We choose to all be multivariate Gaussian distributions with diagonal covariance matrix. The mass matrix is also diagonal. The means of and are defined as linear functions in and , with adjustable coefficients. The covariance matrices are not made to depend on , and the approximation is run using different numbers of leapfrog steps in the Hamiltonian dynamics.

As can be seen from Figures 2 and 3, the Hamiltonian dynamics indeed helps us improve the posterior approximation. Most of the benefit is realized in the first two leapfrog iterations. Of course, more iterations may still prove useful for different problems and different specifications of , and additional MCMC steps may also help. Adjusting only the means of and based on the gradient of the log posterior is a simple specification that achieves good results. We find that even simpler parameterizations still do quite well, by finding a solution where the variance of is larger than that of , and the variance of is smaller than that of : The Hamiltonian dynamics then effectively transfers entropy from to , resulting in an improved lower bound. Figure 2: Approximate posteriors for a varying number of leapfrog steps. Exact posterior at bottom right. Figure 3: R-squared accuracy measure (Salimans & Knowles, 2013) for approximate posteriors using a varying number of leapfrog steps.

### 3.2 Example: Generative model for handwritten digits

Next, we demonstrate the effectiveness of our Hamiltonian variational inference approach for learning deep generative neural network models. These models are fitted to a binarized version of the MNIST dataset as e.g. used in

(Uria et al., 2014). This dataset consists of 70000 data vectors , each of which represents a black-and-white image of a handwritten digit. The task of modelling the distribution of these handwritten digit images is often used as a comparative benchmark for probability density and mass modeling approaches.

Our generative model consists of a spherical Gaussian prior , and conditional likelihood (or decoder) parameterized with either a fully connected neural network as in (Kingma & Welling, 2014; Rezende et al., 2014), or a convolutional network as in (Dosovitskiy et al., 2014). The network takes as input the latent variables

, and outputs the parameters of a conditionally independent (Bernoulli) distribution over the pixels.

Since we now have a dataset consisting of multiple datapoints , with separate latent variables per datapoint, it is efficient to let the distribution be an explicit function of the data , since in that case there is often no necessity for ’local’ variational parameters per individual datapoint ; instead, maps from global parameters and local observed value to a distribution over the local latent variable(s) . We can then optimize over for all observations jointly. The joint lower bound to be optimized is given by

 n∑i=1logp(xi)≥n∑i=1Eqθ(zi|xi)[logp(zi,xi)−logqθ(zi|xi)],

of which an unbiased estimator (and its gradients) can be constructed by sampling minibatches of data from the empirical distribution and sampling from ).

One flexible way of parameterizing the posterior approximation is by using an inference network as in Helmholtz machines (Hinton & Zemel, 1994) or the related variational auto-encoders (VAE) (Kingma & Welling, 2014; Rezende et al., 2014).

We can augment or replace such inference networks with the MCMC variational approximations developed here, as the parameters of the Markov chain can also be shared over all data vectors .

Specifically, we replace or augment inference networks as used in (Kingma & Welling, 2014; Rezende et al., 2014) with a Hamiltonian posterior approximation as described in Algorithm 3, with and a varying number of leapfrog steps. The auxiliary inference model is chosen to be a fully-connected neural network with one deterministic hidden layer with hidden units with softplus () activations and a Gaussian output variable with diagonal covariance. We tested two variants of the distribution . In one case, we let this distribution be a Gaussian with a mean and diagonal covariance structure that are learned, but independent of the datapoint . In the second case, we let be an inference network like , with two layers of hidden units, softplus activations and Gaussian output with diagonal covariance structure.

In a third experiment, we replaced the fully-connected networks with convolutional networks in both the inference model and the generative model. The inference model consists of three convolutional layers with 5

5 filters, [16,32,32] feature maps, stride of 2 and softplus activations. The convolutional layers are followed by a single fully-connected layer with

units and softplus activations. The architecture of the generative model mirrors the inference model but with stride replaced by upsampling, similar to (Dosovitskiy et al., 2014). The number of leapfrog steps was varied from 0 to 16. After broader model search with a validation set, we trained a final model with 16 leapfrog steps and .

with default hyper-parameters. Before fitting our models to the full training set, the model hyper-parameters and number of training epochs were determined based on performance on a validaton set of about 15% of the available training data. The marginal likelihood of the test set was estimated with importance sampling by taking a Monte Carlo estimate of the expectation

(Rezende et al., 2014) with over a thousand importance samples per test-set datapoint.

See table 1 for our numerical results and a comparison to reported results with other methods. Without an inference network and with 10 leapfrog steps we were able to achieve a mean test-set lower bound of , and an estimated mean marginal likelihood of . When no Hamiltonian dynamics was included the gap is more than 5 nats; the smaller difference of   2 nats when 10 leapfrog steps were performed illustrates the bias-reduction effect of the MCMC chain. Our best result is nats with convolutional networks for inference and generation, and HVI with 16 leapfrog steps. This is slightly worse than the best reported number with DRAW (Gregor et al., 2015)

, a VAE with recurrent neural networks for both inference and generation. Our approaches are not mutually exclusive, and could indeed be combined for even better results.

## 4 Specification of the Markov chain

In addition to the core contributions presented above, we now present a more detailed analysis of some possible specifications of the Markov chain used in the variational approximation. We discuss the impact of different specification choices on the theoretical and practical performance of the algorithm.

### 4.1 Detailed balance

For practical MCMC inference we almost always use a transition operator that satisfies detailed balance, i.e. a transition operator for which we have

 p(x,zt)←qt(zt−1|x,zt)p(x,zt−1)qt(zt|x,zt−1)=1,

where denotes with its arguments reversed (not : the conditional pdf of given under ). If our transition operator satisfies detailed balance, we can divide in Algorithm 1 by the ratio above (i.e. 1) to give

 log[αt]=logrt(zt−1|x,zt)−log←qt(zt−1|x,zt).

By optimally choosing in this expression, we can make the expectation non-negative: what is required is that is a predictor of the reverse dynamics that is equal or better than . If the iterate has converged to the posterior distribution by running the Markov chain for a sufficient number of steps, then it follows from detailed balance that . In that case choosing is optimal, and the lower bound is unaffected by the transition. If, on the other hand, the chain has not fully mixed yet, then : the last iterate will then have a predictable dependence on the initial conditions which allows us to choose in such a way that is positive and improves our lower bound. Hence a stochastic transition respecting detailed balance always improves our variational posterior approximation unless it is already perfect! In practice, we can only use this to improve our auxiliary lower bound if we also have an adequately powerful model that can be made sufficiently close to .

A practical transition operator that satisfies detailed balance is Gibbs sampling, which can be trivially integrated into our framework as shown in Section 2.1. Another popular way of ensuring our transitions satisfy detailed balance is by correcting them using Metropolis-Hastings rejection. In the latter case, the stochastic transition operator is constructed in two steps: First a stochastic proposal is generated from a distribution . Next, the acceptance probability is calculated as

 ρ(zt−1,z′t)=min[p(x,z′t)ϕ(zt−1|z′t)p(x,zt−1)ϕ(z′t|zt−1),1].

Finally, is set to with probability , and to with probability . The density of the resulting stochastic transition operator cannot be calculated analytically since it involves an intractable integral over . To incorporate a Metropolis-Hastings step into our variational objective we will thus need to explicitly represent the acceptance decision as an additional auxiliary binary random variable . The Metropolis-Hastings step can then be interpreted as taking a reversible variable transformation with unit Jacobian:

 zt−1 →I[a=1]z′t+I[a=0]zt−1 z′t →I[a=1]zt−1+I[a=0]z′t a →a.

Evaluating our target density at the transformed variables, we get the following addition to the lower bound:

 log[αt] =log[p(x,zt)/p(x,zt−1)]+log[rt(a|x,zt)] +I[a=1]log[rt(zt−1|x,zt)] +I[a=0]log[rt(z′t|x,zt)] −log[qt(z′t|x,zt−1)q(a|z′t,zt−1,x)].

Assuming we are working with a continuous variable

, the addition of the binary variable

has the unfortunate effect that our Monte Carlo estimator of the lower bound is no longer a continuously differentiable function of the variational parameters , which means we cannot use the gradient of the exact log posterior to form our gradient estimates. Estimators that do not use this gradient are available (Salimans & Knowles, 2013; Paisley et al., 2012; Ranganath et al., 2014; Mnih & Gregor, 2014) but these typically have much higher variance. We can regain continuous differentiability with respect to by Rao-Blackwellizing our Monte Carlo lower bound approximation and calculating the expectation with respect to analytically. For short Markov chains this is indeed an attractive solution. For longer chains this strategy becomes computationally demanding as we need to do this for every step in the chain, thereby exploring all different paths created by the accept/reject decisions. Another good alternative is to simply omit the Metropolis-Hastings acceptance step from our transition operators and to rely on a flexible specification for and to sufficiently reduce any resulting bias.

### 4.2 Annealed variational inference

Annealed importance sampling is an MCMC strategy where the Markov chain consists of stochastic transitions that each satisfy detailed balance with respect to an unnormalized target distribution , for gradually increasing from 0 to 1. The reverse model for annealed importance sampling is then constructed using transitions , which are guaranteed to be normalized densities because of detailed balance. For this choice of posterior approximation and reverse model, the marginal likelihood lower bound is then given by

 logp(x)≥EqT∑t=1(βt−βt−1)log[p(x,zt)/q0(zt)].

With this looks like the bound we have at , but notice that the expectation is now taken with respect to a different distribution than . Since this new approximation is strictly closer to than the old approximation, its expectation of the log-ratio is strictly higher, and the lower bound will thus be improved.

The main advantage of annealed variational inference over other variational MCMC strategies is that it does not require explicit specification of the reverse model , and that the addition of the Markov transitions to our base approximation is guaranteed to improve the variational lower bound. A downside of using this scheme for variational inference is the requirement that the transitions satisfy detailed balance, which can be impractical for optimizing .

### 4.3 Using multiple iterates

So far we have defined our variational approximation as the marginal of the last iterate in the Markov chain, i.e. . This is wasteful if our Markov chain consists of many steps, and practical MCMC algorithms therefore always use multiple samples from the Markov chain, with the number of samples. When using multiple samples obtained at different points in the Markov chain, our variational approximation effectively becomes a discrete mixture over the marginals of the iterates that are used:

 q(z|x) =1KT∑t=T+1−Kq(zt|x) =T∑t=T+1−KI(w=t)q(zt|x), with w∼Categorical(T+1−K,…,T).

To use this mixture distribution to form our lower bound, we need to explicitly take into account the mixture indicator variable . This variable has a categorical distribution that puts equal probability on each of the last iterates of the Markov chain, the log of which is subtracted from our variational lower bound (3). This term is then offset by adding the corresponding log probability of that iterate under the inverse model . The simplest specification for the inverse model is to set it equal to : In that case both terms cancel, and we’re effectively just taking the average of the last lower bounds computed by Algorithm 1. Although suboptimal, we find this to be an effective method of reducing variance when working with longer Markov chains. An alternative, potentially more optimal approach would be to also specify the inverse model for using a flexible parametric function such as a neural network, taking and the sampled as inputs.

### 4.4 Sequential MCVI

In Algorithm 2 we suggest optimizing the bound over all MCMC steps jointly, which is expected to give the best results for a fixed number of MCMC steps. Another approach is to optimize the MCMC steps sequentially, by maximizing the local bound contributions . Using this approach, we can take any existing variational approximation and improve it by adding one or more MCMC steps. Improving an existing approximation in this way gives us an easier optimization problem, and can be compared to how boosting algorithms are used to iteratively fit regression models.

## 5 Conclusion

By using auxiliary variables in combination with stochastic gradient variational inference we can construct posterior approximations that are much better than can be obtained using only simpler exponential family forms. One way of improving variational inference is by integrating one or more MCMC steps into the approximation. By doing so we can bridge the accuracy/speed gap between MCMC and variational inference and get the best of both worlds.