 # Variational Hamiltonian Monte Carlo via Score Matching

Traditionally, the field of computational Bayesian statistics has been divided into two main subfields: variational methods and Markov chain Monte Carlo (MCMC). In recent years, however, several methods have been proposed based on combining variational Bayesian inference and MCMC simulation in order to improve their overall accuracy and computational efficiency. This marriage of fast evaluation and flexible approximation provides a promising means of designing scalable Bayesian inference methods. In this paper, we explore the possibility of incorporating variational approximation into a state-of-the-art MCMC method, Hamiltonian Monte Carlo (HMC), to reduce the required gradient computation in the simulation of Hamiltonian flow, which is the bottleneck for many applications of HMC in big data problems. To this end, we use a free-form approximation induced by a fast and flexible surrogate function based on single-hidden layer feedforward neural networks. The surrogate provides sufficiently accurate approximation while allowing for fast exploration of parameter space, resulting in an efficient approximate inference algorithm. We demonstrate the advantages of our method on both synthetic and real data problems.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Bayesian inference has been successful in modern data analysis. Given a probabilistic model for the underlying mechanism of the observed data, Bayesian methods properly quantify uncertainty and reveal the landscape or global structure of parameter space. While conceptually simple, exact posterior inference in many Bayesian models is often intractable. Therefore, in practice, people often resort to approximation methods among which Markov chain Monte Carlo (MCMC) and variational Bayes (VB) are the two most popular choices.

The MCMC approach is based on drawing a series of correlated samples by constructing a Markov chain with guaranteed convergence to the target distribution. Therefore, MCMC methods are asymptotically unbiased. Simple methods such as random-walk Metropolis (Metropolis et al., 1953), however, often suffer from slow mixing (due to their random walk nature) when encountering complicated models with strong dependencies among parameters. Introducing an auxiliary momentum variable, Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2011)

reduces the random walk behavior by proposing states following a Hamiltonian flow which preserves the target distribution. By incorporating the geometric information of the target distribution, e.g., the gradient, HMC is able to generate distant proposals with high acceptance probabilities, enabling more efficient exploration of the parameter space than standard random-walk proposals.

A major bottleneck of HMC, however, is the computation of the gradient of the potential energy function in order to simulate the Hamiltonian flow. As the datasets involved in many practical tasks, such as “big data” problems, usually have millions to billions of observations, such gradient computations are infeasible since they need full scans of the entire dataset. In recent years, many attempts have been made to develop scalable MCMC algorithms that can cope with very large data sets (Welling and Teh, 2011; Ahn et al., 2012; Chen et al., 2014; Ding et al., 2014)

. The key idea of these methods stems from stochastic optimization where noisy estimates of the gradient based on small subsets of the data are utilized to scale up the algorithms. The noise introduced by subsampling, however, could lead to non-ignorable loss of accuracy, which in turn hinders the exploration efficiency of standard MCMC approaches

(Betancourt, 2015).

The main alternative to MCMC is variational Bayes inference (Jordan et al., 1999; Wainwright and Jordan, 2008). As a deterministic approach, VB transforms Bayesian inference into an optimization problem where a parametrized distribution is introduced to fit the target posterior distribution by minimizing the Kullback-Leibler (KL) divergence with respect to the variational parameters. Compared to MCMC methods, VB introduces bias but is usually faster.

A natural question would be: can we combine both methods to mitigate the drawbacks and get the best of both worlds? The first attempt in this direction was proposed by de Freitas et al. (2001) where a variational approximation was used as proposal distribution in a block Metropolis-Hasting (MH) MCMC kernel to locate the high probability regions quickly, thus facilitating convergence. Recently, a new synthesis of variational inference and Markov chain Monte Carlo methods has been explored in Salimans et al. (2015) where one or more steps of MCMC are integrated into variational approximation. The extra flexibility from MCMC steps provides a rich class of distributions to find a closer fit to the exact posterior, which allows for further improvement on the approximation quality.

In this work, we explore the possibility of utilizing variational approximation to speed up HMC for problems with large scale datasets. The key idea is to integrate fast variational approximations into the sampling procedure so that the overall computational complexity can be reduced. The main contributions are summarized as follows:

1. We train a fast neural network surrogate using extreme learning machine (ELM) to approximate the log-posterior (potential energy function). Neural networks with randomly-mapped features (with almost any nonlinear activation functions) has been proved to approximate a rich class of functions arbitrarily well in

Huang et al. (2006a). After the fast training of a reasonably accurate neural network surrogate, we use its gradient to simulate the surrogate induced Hamiltonian flow in order to generate proposals. The learning and training of the surrogate function provides an effective and flexible way to explore both regularity of the parameter space and redundancy in the data collectively, which can be viewed as an implicit subsampling. However, unlike the popular subsampling-based methods where the leapforg stepsize needs to be annealed or remains small enough to compensate for the resulting bias, the leapfrog stepsize in our approach can remain close to that of standard HMC while keeping a comparable acceptance probability. As such, the exploration efficiency of HMC can be maintained while reducing the computational cost.

2. A new training procedure is proposed for an efficient and general surrogate method. The neural network surrogate is trained by minimizing the squared distance between the gradient of the surrogate and the gradient of the target (log-posterior), a procedure that resembles score matching (Hyvärinen, 2005). The training data are collected while the “modified” HMC sampler (based on the surrogate induced Hamiltonian flow) explores the parameter space. In Hyvärinen (2005), the training data are assumed to be sampled from the target distribution. In our method, by including the gradient of the target distribution in the learning process, the information from the target distribution is explicitly included in the surrogate in our method. Therefore, the restriction on the training data is relaxed.

3. To further reduce the computation cost, we also use the computationally fast surrogate in the Metropolis-Hastings correction step. As a result, the modified HMC sampler will no longer converge to the exact target distribution. Instead, it will converge to the best approximation from an exponential family with pre-specified random sufficient statistics (which we call free-form variational approximation in the sequel). This variational perspective distinguishes our approach from the existing surrogate methods on accelerating HMC. Moreover, compared to traditional fixed-form variational approximations, the free-form variational approximation is usually more flexible and thus can provide a better fit to more general target distributions.

Our paper is organized as follows. In section 2, we introduce the two ingredients related to our method: Hamiltonian Monte Carlo and fixed-form variational Bayes. Section 3 presents our method, termed Variational Hamiltonian Monte Carlo (VHMC). We demonstrate the efficiency of VHMC in a number of experiments in section 4 and conclude in section 5.

## 2 Background

### 2.1 Hamiltonian Monte Carlo

In general formulation of Bayesian inference, a set of independent observations are modeled by an underlying distribution with unknown parameter . Given a prior distribution of , the posterior distribution is given by Bayesian formula

 p(θ|Y)=p(Y|θ)p(θ)p(Y)∝N∏n=1p(yn|θ)⋅p(θ)

To construct the Hamiltonian dynamical system (Duane et al., 1987; Neal, 2011), the position-dependent potential energy function is defined as the negative log unnormalized posterior density

 U(θ)=−N∑n=1logp(yn|θ)−logp(θ) (2.1)

and the kinetic energy function is defined as a quadratic function of an auxiliary momentum variable

 K(r)=12rTM−1r

where is a mass matrix and is often set to identity, . The fictitious Hamiltonian, therefore, is defined as the total energy function of the system

 H(θ,r)=U(θ)+K(r)

As one of the state-of-the-art MCMC methods, Hamiltonian Monte Carlo suppresses random walk behavior by simulating the Hamiltonian dynamical system to propose distant states with high acceptance probabilities. That is, in order to sample from the posterior distribution

, HMC augments the parameter space and generates samples from the joint distribution of

 π(θ,r)∝exp(−U(θ)−K(r)) (2.2)

Notice that and are separated in (2.2), we can simply drop the momentum samples and the samples follow the marginal distribution which is exactly the target posterior.

To generate proposals, HMC simulates the Hamiltonian flow governed by the following differential equations

 dθdt =∂H∂r=M−1r (2.3) drdt =−∂H∂θ=−∇θU(θ) (2.4)

Over a period , also called trajectory length, (2.3) and (2.4) together define a map in the extended parameter space, from the starting state to the end state. As implied by a Hamiltonian flow, is reversible, volume-preserving and also preserves the Hamiltonian . These allow us to construct -invariant Markov chains whose proposals will always be accepted. In practice, however, (2.3) and (2.4) are not analytically solvable and we need to resort to numerical integrators. As a symplectic integrator, the leapfrog scheme (see Algorithm 1) maintains reversibility and volume preservation and hence is a common practice in HMC literatures. Although discretization introduces bias which needs to be corrected in an Metroplis-Hasting (MH) step, we can control the stepsizes to maintain high acceptance probabilities even for distant proposals.

In recent years, many variants of HMC have been developed to make the algorithm more flexible and generally applicable in a variety of settings. For example, methods proposed in Hoffman and Gelman (2011) and Wang et al. (2013) enable automatically tuning of hyper-paramters such as the stepsize and the number of leapfrog steps , saving the amount of tuning-related headaches. Riemannian Manifold HMC (Girolami and Calderhead, 2011) further improves standard HMC’s efficiency by automatically adapting to local structures using Riemanian geometry of parameter space. These adaptive techniques could be potentially combined with our proposed method which focuses on reducing the computational complexity.

### 2.2 Fixed-form Variational Bayes

Instead of running a Markov chain, people can also approximate the intractable posterior distribution with a more convenient and tractable distribution. A popular approach of obtaining such an approximation is fixed-form variational Bayes (Honkela et al., 2010; Saul and Jordan, 1996; Salimans and Knowles, 2013) where a parametrized distribution is proposed to approximate the target posterior by minimizing the KL divergence

 DKL (qη(θ)||p(θ|Y))=∫qη(θ)log(qη(θ)p(θ|Y))dθ =log(p(Y))+∫qη(θ)log(qη(θ)p(θ,Y))dθ (2.5)

since is a constant (used extensively in model selection), it suffices to minimize the second term in (2.2). Usually, is chosen from the exponential family of distributions with the following canonical form:

 qη(θ)=exp[T(θ)η−A(η)]ν(θ) (2.6)

where

is a row vector of sufficient statistics,

is for normalization and is a base measure. The column vector is often called the natural parameters of the exponential family distribution . Taking this approach and substituting into (2.2), we now have a parametric optimization problem in :

 ^η=argminηEqη(θ)[logqη(θ)−logp(θ,Y)] (2.7)

The above optimization problem can be solved using gradient-based optimization or fix-point algorithms if and its derivatives with respect to can be evaluated analytically. Without assuming posterior independence and requiring conjugate exponential models, posterior approximations of this type are usually much more accurate than a factorized approximation following the mean-field assumptions. However, the requirement of being able to analytically evaluate those quantities mentioned above is also very restrictive. Many attempts, therefore, has been made to evaluate those quantities using stochastic approximations from Monte Carlo integration (Blei et al., 2012; Ranganath et al., 2013; Kingma and Welling, 2013). As an alternative, Salimans and Knowles (2013) proposed a new optimization algorithm which relates (2.7

) to stochastic linear regression. To reveal the connection, the posterior approximate (

2.6) is relaxed and rewritten in the unnormalized form

 ~q~η(θ)=exp[~T(θ)~η]ν(θ) (2.8)

where the nonlinear normalizer is removed and the vectors of sufficient statistics and natural parameters are augmented, i.e. . The unnormalized version of KL divergence is utilized to deal with and achieves its minimum at

 ~η=Eq[~T(θ)′~T(θ)]−1Eq[~T(θ)′logp(θ,Y)] (2.9)

which resembles the maximum likelihood estimator for linear regression. Based on this observation, Salimans and Knowles (2013) derived a stochastic approximation algorithm using (2.9) as a fixed point update and approximating the involved expectations by weighted Monte Carlo.

In the next section, we will discuss how the variational Bayes approach can be actually utilized to accelerate HMC. For this, we construct a fast and accurate approximation for the computationally expensive potential energy function. The approximation is provided by variational Bayes and is incorporated in the simulation of Hamiltonian flow.

## 3 Variational Hamiltonian Monte Carlo

Besides subsampling, an alternative approach that can save computation cost is to construct fast and accurate surrogate functions for the expensive potential energy functions (Liu, 2001; Neal, 2011). As one of the commonly used models for emulating expensive-to-evaluate functions, Gaussian process (GP) is used in Rasmussen (2003) to approximate the potential energy and its derivatives based on true values of these quantities (training set) collected during an initial exploratory phase. However, a major drawback of GP-based surrogate methods is that inference time typically grows cubically in the size of training set due to the necessity of inverting a dense covariance matrix. This is especially crucial in high dimensional spaces, where large training sets are often needed before a reasonable level of approximation accuracy is achieved. Our goal, therefore, is to develop a method that can scale to large training set while still maintaining a desired level of flexibility. For this purpose, we propose to use neural networks along with efficient training algorithms to construct surrogate functions. A typical single-hidden layer feedforward neural network (SLFN) with scalar output is defined as

 z(θ)=s∑i=1viσ(θ;γi) (3.1)

where and are the node parameter and output weight for the i

th hidden neuron,

is a nonlinear activation function. Given a training dataset, the estimates of those parameters can be obtained by minimizing the mean square error (MSE) cost function. To save training time, randomly assigned node parameters are suggested in Ferrari and Stengel (2005) and Huang et al. (2006b) where the optimization is reduced to a linear regression problem with randomly mapped features which can be solved efficiently using algebraic approaches. This can also be viewed as using an additive model based on random (adaptive) basis to approximate the target distribution. Unlike a standard Gaussian process, the above neural network based surrogate scales linearly in the size of training data, and cubically in the number of hidden neurons. This allows us to explicitly balance evaluation time and model capacity. Moreover, the random network in (3.1) has been proven to approximate a rich class of functions arbitrarily well (Rahimi and Recht, 2008).

### 3.1 Approximation with random networks

As shown in Rahimi and Recht (2008), functions in (3.1) with randomly assigned bases is flexible enough to provide accurate approximations for other well-studied classes of functions (e.g., Reproducing Kernel Hilbert Space). Let be a family of functions on a compact set with parameter specified over the set . Let be a distribution on , consider a rich class of functions of the following form

 f(θ)=∫Γα(γ)σ(θ;γ)dγ (3.2)

where for some constant . Define a norm and the set

 Fp≡{f(θ)=∫Γα(γ)σ(θ;γ)dγ∣∣∥f∥p<∞} (3.3)

The following theorem shows that a given can be approximated within by a function of the form

 z(θ)=s∑i=1viσ(θ;γi) (3.4)

where are sampled iid from . See Rahimi and Recht (2008) for a detailed proof.

###### Theorem 1 (Rahimi 2008).

Let be any probability measure on , and define the norm . Suppose satisfies . Fix . Then, , with probability at least over drawn iid from , there exist so that the function

 z(θ)=s∑i=1viσ(θ;γi)

satisfies

 ∥z−f∥μ<∥f∥p√s(1+√2log1δ) (3.5)

The activation function

on together defines a Reproducing Kernel Hilbert Space (RKHS) with the following kernel on

 k(θ,θ′)=∫Γp(γ)σ(θ;γ)σ(θ′;γ)dγ (3.6)

which is clearly positive definite. Alternatively, the following proposition shows that can be constructed based on functions of the form (3.2).

###### Proposition 1.

Let the space be the completion of the set of functions of the form (3.2) such that

 ∫Γα(γ)2p(γ)dγ<∞

with the inner product

 ⟨f,g⟩=∫Γα(γ)β(γ)p(γ)dγ

where . Then .

###### Proof.

See a proof in Rahimi and Recht (2008). ∎

Note that ,

 ∫Γα(γ)2p(γ)=∫Γα(γ)2p(γ)2p(γ)dγ≤∥f∥2p<∞

Therefore, is a subset of RKHS . In fact, Rahimi and Recht (2008) shows that is a dense subset of .

###### Theorem 2.

Let and be defined as above for a given activation function and probability density . Then is dense in .

###### Proof.

By the property of RKHS, is the completion of the set of all finite linear combinations of the form

 f(θ)=∑ivik(θ;θi),θi∈Θ (3.7)

with the inner product satisfying . Using (3.6), we can rewrite (3.7) in the following form

 f(θ)=∑ivi∫Γp(γ)σ(θ;γ)σ(θi;γ)dγ=∫Γp(γ)∑iviσ(θi;γ)σ(θ;γ)dγ=∫Γαf(γ)σ(θ;γ)dγ

where , implying that . ∎

Since certain RKHSs are known to fit a rich class of (density) functions arbitrarily well, approximating these RKHSs with a small number of random bases allows for efficient surrogate construction with comparable approximation accuracy.

### 3.2 Free-form Variational Bayes

The correspondence between distributions and their potential energy functions builds a bridge between distribution approximation and function approximation. Viewing this way, each random neural network in (3.1) corresponds to a distribution in the exponential family

 qv(θ)∝exp(−z(θ))=exp[−s∑i=1viσ(θ;γi)−Φ(v)] (3.8)

where is called the vector of canonical parameters, and the collection of randomly-mapped features is known as sufficient statistics. The quantity , known as the log partition function, is defined by the following integral

 Φ(v)=log∫exp(v⋅Ψ(θ))dθ

Note that is properly normalized if and only if the integral is finite. Therefore, the canonical parameters of interest belong to the set

 Ω:={v∈Rs|Φ(v)<+∞} (3.9)

We call the induced distribution of neural network surrogate if .

As our neural network surrogates approximate the true potential energy function , the underlying distribution then approximates the target posterior distribution . Because both the surrogate induced distribution and the target posterior distribution are known up to a constant, we introduce the following expected squared distance between unnormalized densities

 ~DSM(p(θ|Y)||qv(θ))=12∫qv(θ)∥∇θz(θ)−∇θU(θ)∥2dθ (3.10)

which is similar to the well known score matching squared distance (Hyvärinen, 2005) (see section 3.4 for a more detailed explanation). By minimizing the expected squared distance , we arrive at a variational inference algorithm

 ^v=argminv∈Ω~DSM(p(θ|Y)||qv(θ)) (3.11)

The surrogate induced approximation (3.8) enriches our choices for variational approximation from fixed-form tractable distributions (e.g., Gaussian or mixture of Gaussians) to fast and flexible intractable distributions. The integral in (3.10) is usually hard to evaluate in practice but can be approximated using samples from the surrogate induced distribution

by the law of large numbers. Unlike the

fixed-form approximation, the surrogate induced distribution generally does not allow for drawing samples directly. However, we can use MCMC methods to generate samples from it.

Due to the random nature in approximation (3.8), we call variational algorithm (3.11) free-form variational Bayes. By choosing a proper number of hidden neurons , the free-form variational Bayes provides an implicit subsampling procedure that can effectively remove redundancy and noise in the data while striking a good balance between computation cost and approximation accuracy of the underlying distribution.

Remark. From an approximation point of view, each is a basis with a random configuration of , which specifies the orientation and scaling properties, within a profile of the activation function . In particular we choose the softplus function for additive node to approximate the potential function, e.g., the Hamiltonian corresponding to the posterior distribution, for our free-form variational Bayes. Of course, the choice of is general. As shown in Zhang et al. (2015)

, different basis can be used in our free-form variational Bayes formulation. In particular, if exponential square kernel is used as radial basis functions (RBF) centered at given data, GP method is recovered. However, using kernel functions centered at very data point, GP method does not effectively exploit redundancy in data or regularity in parameter space and hence may result in expensive computation cost as well as instability for large data set. If exponential square kernel is used at a smaller set of properly induced points, a more computationally tractable GP model that tries to exploit redundancy in data, sparse GP

(Snelson and Ghahramani, 2006; Quinonero-Candela and Rasmussen, 2005), is recovered. Since kernel function is usually local, to approximate potential functions in HMC which go to infinity at far field, it is shown in Zhang et al. (2015) that the use of random basis based on softplus function provides a more compact and better approximation that has a good balance between local features and global behaviors. Also the choice of , the number of basis, can be used to balance flexibility/accuracy and overfitting.

### 3.3 Surrogate Induced Hamiltonian Flow

To sample from the surrogate induced distribution , we generate proposals by simulating the corresponding surrogate induced Hamiltonian flow governed by the following equations

 dθdt =∂~H∂r=M−1r (3.12) drdt =−∂~H∂θ=−∇θz(θ) (3.13)

where the modified Hamiltonian is

 ~H(θ,r)=z(θ)+K(r)

Practitioners can use leapfrog scheme to solve (3.12) and (3.13) numerically. The end-point of the trajectory starting at is accepted with probability

 αvhmc=min[1,exp(~H(θ0,r0)−~H(θ∗,r∗))] (3.14)

Similar to the true Hamiltonian flow, we have the following theorem

###### Theorem 3.

If we construct a Markov chain by simulating surrogate induced Hamiltonian flow (3.12) and (3.13) using leapfrog steps and Metropolis-Hastings correction with the acceptance probability according to (3.14), the equilibrium distribution of the chain is

 ~π(θ,r)∝exp(−z(θ)−K(r))

See Supplementary Materials for a proof.

Theorem 3 implies that the marginal distribution of is exactly the surrogate induced distribution . Therefore, we can run the Markov chain and collect the values of interest, such as the potential energy function and its derivatives, as additional training data to improve the surrogate approximation (solving (3.11)) on the fly. This way, our algorithm can be viewed as a generalized version of the stochastic approximation algorithm proposed in Salimans and Knowles (2013) for free-form variational Bayes.

### 3.4 Score Matching

A well-known strategy to estimate unnormalized models is score matching (Hyvärinen, 2005). Assuming that data come from a distribution with unknown density , we want to find an approximation with a parameterized unnormalized density model , where is an -dimensional vector of parameters. Score matching estimates the model by minimizing the expected squared distance between the model score function and the data score function

 J(v)=12∫pD(θ)∥ψv(θ)−ψD(θ)∥2dθ (3.15)

A simple trick was suggested in Hyvärinen (2005) to avoid the expensive estimation of the data score function from .

Similar ideas can be applied here to train our neural network surrogate. Notice that the posterior density is known up to a constant, the data score function then can be evaluated exactly . This allows us to estimate our density model without requiring samples from the posterior distribution. Since sampling from the posterior distribution is computationally costly, we sample from the simpler and cheaper surrogate induced distribution (our variational model) instead. The corresponding expected squared distance is

 ~J(v)=12∫qv(θ)∥ψv(θ)−ψD(θ)∥2dθ (3.16)

Note that the model score function , is exactly the expected squared distance we introduced in section 3.2. In the case that our density model is not degenerate, we have a similar local consistency result to Hyvärinen (2005) as shown by the following theorem.

###### Theorem 4.

Assume that the data density follows the model: for some . Further, assume that no other parameter value gives a probability density that is equal to , and that for all . Then

 ~J(v)=0⇔v=v∗

As such, minimizing the expected squared distance would be sufficient to estimate the model. For a proof, see Supplementary Materials.

Remark. Note that the data score function is

 ψD(θ)=N∑n=1∇θlogp(yn|θ)+∇θlogp(θ)

we may choose our surrogate function as

 z(θ)=N∑n=1logp(yn|θ)+logp(θ) (3.17)

then the data density follows the model exactly. In order to reduce computational cost, our model is usually much simpler than (3.17) (). This allows us to explore and exploit the redundancy in the data from a function approximation perspective.

In practice, samples from the surrogate induced distribution can be collected as observations and our surrogate can be trained by minimizing the empirical version of . A regularization term could be included to improve numerical stability.

Now suppose that we have collected training data of size from the Markov chain history

 T(t)s:={(θn,∇θU(θn))}tn=1∈Rd×Rd (3.18)

where is the -th sample. The estimator of the output weight vector (variational parameter) can be obtained by minimizing the empirical square distance between the gradient of the surrogate and the gradient of the potential (i.e., score function) plus an additional regularization term:

 ^v=argminv12t∑n=1∥∇θz(θn)−∇θU(θn)∥2+λ2∥v∥2 (3.19)

which has an online updating formula summarized in the following proposition 2 (see Supplementary Materials for a detailed proof). For ease of presentation, we use the additive nodes 111Note that our method works for a general class of nonlinear nodes, including the radial basis functions (RBF).

###### Proposition 2.

Suppose our current estimator of the output weight vector is based on the current training dataset using hidden neurons. Given a new training data point , the updating formula for the estimator is given by

 v(t+1)=v(t)+W(t+1)(∇θU(θt+1)−At+1v(t)) (3.20)

where

 W(t+1) =C(t)A′t+1[Id+At+1C(t)A′t+1]−1 At+1 =(A1(θt+1),…,As(θt+1))

with , and can be updated by Sherman-Morrison-Woodburyformula:

 C(t+1)=C(t)−W(t+1)At+1C(t) (3.21)

The estimator and inverse matrix can be initialized as . The online learning can be achieved by storing the inverse matrix and performing the above updating formulas, which cost computation and storage independent of .

### 3.5 Variational HMC in Practice

The neural network based surrogate is capable of approximating the potential energy function well when there is enough training data. However, the approximation could be poor when only few training data are available which is true in the early stage of the Markov chain simulations. To alleviate this issue, we propose to add an auxiliary regularizer which provides enough information for the sampler at the beginning and gradually diminishes as the surrogate becomes increasingly accurate. Here, we use the Laplace’s approximation to the potential energy function but any other fast approximation could be used. The regularized surrogate approximation then takes the form

 Vt(θ)=μtzt(θ)+12(1−μt)(θ−θL)′∇2θU(θL)(θ−θL)

where is the maximum a posteriori (MAP) estimate and is a smooth monotone function monitoring the transition from the Laplace’s approximation to the surrogate approximation. Refining the surrogate approximation by acquiring training data from simulating the regularized surrogate induced Hamiltonian flow, we arrive at an efficient approximate inference method: Variational Hamiltonian Monte Carlo (VHMC) (Algorithm 2).

In practice, the surrogate approximation may achieve sufficient quality and an early stopping could save us from inefficient updating of the output weight vector. In fact, the stopping time serves as a knob to control the desired approximation quality. Before stopping, VHMC acts as a free-form variational Bayes method that keeps improving itself by collecting training data from the history of the Markov chain. After stopping, VHMC performs as a standard HMC algorithm which samples from the surrogate induced distribution. VHMC successfully combines the advantages of both variational Bayes and Hamiltonian Monte Carlo, resulting in higher computational efficiency compared to HMC and better approximation compared to VB.

## 4 Experiments

In this section, four experiments are conducted to verify the effectiveness and efficiency of the proposed Variational HMC framework. In the first two examples, we demonstrate the performance of VHMC from a pure approximation perspective and compare it to state-of-the-art fix-form

variational Bayes methods. We then test the efficiency of VHMC on two machine learning problems with large datasets and compare our methods to SGLD

(Welling and Teh, 2011), which is one of the state-of-the-art stochastic gradient MCMC methods. Compared to standard HMC, Variational HMC introduces some additional hyper-parameters, such as the number of random bases and the transition monitor , that might be hard to tune in practice. Generally speaking, we want our method to be as efficient as possible while maintaining good approximation. Therefore, we choose to be small as long as a reasonable level of accuracy can be achieved and to stay small until enough data have been collected to stabilize the training procedure. These can all be done by monitoring the acceptance rate using an initial run as in Zhang et al. (2015). In our experiments, we found that is a useful schedule where can be adjusted to accommodate different transition speed according to the complexity of the model.

### 4.1 A Beta-binomial Model for Overdispersion

We first demonstrate the performance of our variational Hamiltonian Monte Carlo method on a toy example from Albert (2009), which considers the problem of estimating the rates of death from stomach cancer for the largest cities in Missouri. The data is available from the R package LearnBayes which consists of pairs where records the number of individuals that were at risk for cancer in city , and is the number of cancer deaths that occurred in that city. The counts are overdispersed compared to what would be expected under a binomial model with a constant probability; therefore, Albert (2009) assumes a beta-binomial model with mean and precision :

 p(yj|m,K)=(njyj)B(Km+yj,K(1−m)+nj−yj)B(Km,K(1−m))

and assigns the parameters the following improper prior:

 p(m,K)∝1m(1−m)1(1+K)2

The resulting posterior is extremely skewed (as shown in the bottom right corner in Figure

1) and a reparameterization

 x1=logit(m),x2=logit(K)

is proposed to ameliorate this issue. Figure 1: Approximate posteriors for a varying number of hidden neurons. Exact posterior at bottom right. Figure 2: KL-divergence and score matching squared distance between the surrogate approximation and the exact posterior density using an increasing number of hidden neurons.

We choose as our transition schedule and set up the HMC parameter to achieve around acceptance. We run the variational Hamiltonian Monte Carlo long enough so that we can estimate the full approximation qualify of our surrogate. To demonstrate the approximation performance in terms of the number of hidden neurons , we train the neural network based surrogate using different numbers of hidden neurons and examine the resulting KL-divergence and score matching squared distance to the true posterior density. As we can see from Figures 1 and 2, the neural network based surrogate indeed offers a high quality approximation and becomes more accurate as increases. The surrogate induced Hamiltonian flow effectively explores the parameter space and transfers information from the posterior to the surrogate.

### 4.2 Bayesian Probit Regression

Next, we demonstrate the approximation performance of our Variational HMC algorithm relative to existing variational approaches on a simple Bayesian classification problem, binary probit regression. Given observed data pairs , the model comprised a probit likelihood function and a Gaussian prior over the parameter , where is the standard Gaussian cdf. A full covariance multivariate normal approximation is used for all variational approaches. The synthetic data we use are simulated from the model, with and

. We show the performance averaged over 50 runs for all methods. We compare our algorithm to Variational Bayesian Expectation Maximization (VBEM)

(Beal and Ghahramani, 2002; Ormerod and Wand, 2010), and the best fixed-form variational approximation of Salimans and Knowles (2013) that uses Hessian and subsampling (FF-Minibatch). VBEM and FF-Minibatch are both implemented using the code provided in Salimans and Knowles (2013) with the same learning rate and minibatch size. For all variational approaches, we initialize the posterior approximation to the prior. For our Variational HMC algorithm, we choose random hidden units for the surrogate and set the starting point to be the origin. The number of hidden units is chosen in such a way that the surrogate is flexible enough to fit the target well and remain fast in computation. The HMC parameters are set to make the acceptance probability around . The target density is almost Gaussian, and a fast transition is enough to stabilize our algorithm. Since this experiment is on synthetic data, we follow Salimans and Knowles (2013) to assess the approximation performance in terms of the root mean squared error (RMSE) between the estimate (variational mean for VB and sample mean for VHMC) and the true parameter that is used to generate the dataset. Figure 3: RMSE approximate posterior mean as a function of the number of likelihood evaluations for difference variational Bayesian approaches and our Variational HMC algorithm.

Figure 3 shows the performance of our Variational HMC algorithm, as well as the performance of the other two variational Bayes methods. As we can see from the graph, VHMC and the subsampling based fixed-form variational approach (FF-minibatch) achieve lower RMSE than the VBEM algorithm. That is because of the extra factorization assumptions made by VBEM when introducing the auxiliary variables (Ormerod and Wand, 2010). Even though Gaussian approximation is already sufficiently accurate on this simple example, VHMC can still arrive at a lower RMSE due to the extra flexibility provided by the free-form neural network surrogate function.

### 4.3 Bayesian Logistic Regression

Next, we test the efficiency of our Variational HMC method as a scalable sampling method. We first apply Variational HMC to a Bayesian logistic regression model. Given the

-th input vector , the corresponding output (label) is assumed to follow the probability and a Gaussian prior is used for the model parameter . We test our proposed algorithm on the dataset (Lin et al., 2008). The original dataset, which is compiled from the UCI dataset, has 32561 observations and 123 features. We use a dimension random projection of the original features. We choose hidden units for the surrogate and set a transition schedule for our VHMC algorithm. We then compare the algorithm to HMC (Duane et al., 1987; Neal, 2011) and to stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011). For HMC and VHMC, we set the leap-frog stepsize such that the acceptance rate is around . For SGLD we choose batch size of 500 and use a range of fixed stepsizes.

To illustrate the sampling efficiency of all methods, we follow Ahn et al. (2012) to investigate the time normalized effective sample size (ESS)222Given samples, ESS = , where is the sample autocorrelation at lag (Geyer, 1992) averaged over the parameters and compare this with the relative error after a fixed amount of computation time. The relative error of mean (REM) and relative error of covariance (REC) is defined as

 REMt=∑i|¯¯¯¯¯βti−βoi|∑i|βoi|,RECt=∑ij|Ctij−Coij|∑ij|Coij| (4.1)

where are the sample mean and sample covariance up to time and the ground truth are obtained using a long run (T = 500K samples) of HMC algorithm. Figure 4: Final error of logistic regression at time T versus mixing rate for the mean (top) and covariance (bottom) estimates after 300 (left) and 3000 (right) seconds of computation. Each algorithm is run using different setting of parameters.

Figure 4 shows the relative error at time T = 300, T = 3000 as a function of the time normalized mean ESS, which is a measure of the mixing rate. The results for the mean are shown on the top, and those for the covariance are on the bottom. We run each algorithm with a different setting of parameters that control the mixing rate: number of leap-frog steps for HMC and for VHMC, and stepsizes for SGLD.

As we decrease the stepsize, SGLD becomes less biased in the gradient approximation, resulting in smaller relative error. However, the exploration efficiency drops at the same time and sampling variance gradually dominates the relative error. In contrast, HMC uses a fixed

leap-frog stepsize and therefore maintains high exploration efficiency in parameter space. The down side is the expensive computation of the full gradient and the possible turning back of the trajectories when the number of leap-frog steps is unnecessarily large. Adopting a flexible neural network surrogate, VHMC balances the computation cost and approximation quality much better than subsampling and achieves lower relative error with high mixing rates.

### 4.4 Independent Component Analysis

Finally, we apply our method to sample from the posterior distribution of the unmixing matrix in Independent Component Analysis (ICA)

(Hyvärinen and Oja, 2000). Given -dimensional observations , we model the data as

 p(x|W)=|det(W)|d∏i=1pi(wTix) (4.2)

where is the -th row of and is supposed to capture the true density of the -th independent component. Following Welling and Teh (2011), we use a Gaussian prior over the unmixing matrix and choose with . We evaluate our method using the MEG dataset (Vigário et al., 1997), which has 122 channels and 17730 observations. We extract the first 5 channels for our experiment which leads to samples with dimensions. We then compare our algorithm to standard HMC and SGLD (Welling and Teh, 2011). For SGLD, we use a natural gradient (Amari et al., 1996) which has been found to improve the efficiency of gradient descent significantly. We set for the Gaussian prior. For HMC and Variational HMC, we set the leap-frog stepsize to keep the acceptance ratio around and set to allow an efficient exploration in parameter space. For SGLD, we choose batch size of and use stepsizes from a polynomial annealing schedule , with and . (This setting reduces the stepsize from to during 1e+7 iterations). We choose hidden units and set the transition schedule for our Variational HMC algorithm. Note that the likelihood (4.2) is row-permutation invariant. To measure the convergence of all samplers on this ICA problem, we therefore use the Amari distance (Amari et al., 1996) , where is the sample average and is the true unmixing matrix estimated using a long run (T = 100K samples) of standard HMC algorithm. Figure 5: Convergence of Amari distance on the MEG data for HMC, SGLD and our Variational HMC algorithm.

The Amari distance as a function of runtime is reported for each of these methods in Figure 5. From the graph, we can see that SGLD converges faster than standard HMC. The noise introduced by subsampling is compensated by the fast exploration in parameter space which allows for an early arrival at the high probability domain for SGLD. However, when the variance dominates the bias, the convergence speed slows down since SGLD requires annealing (or small) stepsize that inevitably leads to low exploration efficiency. By maintaining efficient exploration in parameter space (same stepsize as HMC) while reducing the computation in simulating the Hamiltonian flow, VHMC outperforms SGLD, arriving at a lower Amari distance much more rapidly.

## 5 Conclusion

We have presented a novel approach, Variational Hamiltonian Monte Carlo, for approximate Bayesian inference. Our approach builds on the framework of HMC, but uses a flexible and efficient neural network surrogate function to approximate the expensive full gradient. The surrogate keeps refining its approximation by collecting training data while the sampler is exploring the parameter space. This way, our algorithm can be viewed as a free-form variational approach. Unlike subsampling-based MCMC methods, VHMC maintains the relatively high exploration efficiency of its MCMC counterparts while reducing the computation cost. Compared to fixed-form variational approximation, VHMC is more flexible and thus can approximate more general target distribution better.

As the complexity of the model increases (e.g., high dimensional covariates), the computational cost of random network surrogates may increase since it usually requires more random bases to provide adequately accurate approximation in high dimensions. However, the free-style surrogate construction in VHMC allows us to reduce the computation cost by adapting to the model if some special structures exist. For example, one can build surrogates based on graphical models to exploit the conditional independence structures in the model. Alternatively, one can also train the full neural network with sparsity constraints imposed to the input weights. Both approaches could be interesting to explore in future work.

### Acknowledgement

This work is supported by NIH grant R01AI107034 and NSF grants DMS-1418422 and DMS-1622490.

## References

• Ahn et al. (2012) Ahn, S., Korattikara, A., and Welling, M. (2012). “Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring.” In International Conference on Machine Learning.
• Albert (2009) Albert, J. (2009). Bayesian Computing with R. Springer Science, New York.
• Amari et al. (1996) Amari, S. I., Cichocki, A., and Yang, H. (1996). “A new learning algorithm for blind signal separation.” In Advances in Neual Information Processing Systems, 757–763.
• Beal and Ghahramani (2002) Beal, M. J. and Ghahramani, Z. (2002). “The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures.” In Bayesian Statistics 7: Proceeding of the 7th Valencia International Meeting, 453–463.
• Betancourt (2015) Betancourt, M. (2015). “The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling.” In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015).
• Blei et al. (2012) Blei, D. M., Michael, M. I., and Paisley, J. W. (2012). “Variational Bayesian inference with stochastic search.” In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 1367–1374.
• Chen et al. (2014) Chen, T., Fox, E. B., and Guestrin, C. (2014). “Stochastic Gradient Hamiltonian Monte Carlo.” In Proceedings of 31st International Conference on Machine Learning (ICML 2014).
• de Freitas et al. (2001) de Freitas, N., Højen-Sørensen, P., Jordan, M. I., and Russell, S. (2001). “Variational MCMC.” In

Proceedings of the 7th conference on Uncertainty in Artificial Intelligence (UAI 2001)

, 120–127. San Francisco: Morgan Kaufmann.
• Ding et al. (2014) Ding, N., Fang, Y., Babbush, R., Chen, C., Skell, R. D., and Neven, H. (2014). “Bayesian Sampling Using Stochastic Gradient Thermostats.” In Advances in Neural Information Processing Systems 27 (NIPS 2014).
• Duane et al. (1987) Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). “Hybrid Monte Carlo.” Physics Letters B, 195(2): 216 – 222.
• Ferrari and Stengel (2005) Ferrari, S. and Stengel, R. F. (2005). “Smooth function approximation using neural networks.” IEEE Trans. Neural Network, 16(1): 24–38.
• Geyer (1992) Geyer, C. J. (1992). “Practical Markov Chain Monte Carlo.” Statistical Science, 7: 473–483.
• Girolami and Calderhead (2011) Girolami, M. and Calderhead, B. (2011). “Riemann manifold Langevin and Hamiltonian Monte Carlo methods.” Journal of the Royal Statistical Society, (with discussion) 73(2): 123–214.
• Hoffman and Gelman (2011) Hoffman, M. and Gelman, A. (2011). “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” arxiv.org/abs/1111.4246.
• Honkela et al. (2010) Honkela, A., Raiko, T., Kuusela, M., Tornio, M., and Karhunen, J. (2010). “Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes.” Journal of Machine Learning Research, 11: 3235–3268.
• Huang et al. (2006a) Huang, G. B., Chen, L., and Siew, C. K. (2006a). “Universal approximation using incremental constructive feedforward networks with random hidden nodes.” IEEE Trans. Neural Networks, 17(4): 879–892.
• Huang et al. (2006b) Huang, G. B., Zhu, Q. Y., and Siew, C. K. (2006b). “Extreme learning machine: Theory and applications.” Neurocomputing, 70(1-3): 489–501.
• Hyvärinen (2005) Hyvärinen, A. (2005). “Estimation of non-normalized statistical models by score matching.” Journal of Machine Learning Research, 6: 695–709.
• Hyvärinen and Oja (2000) Hyvärinen, A. and Oja, E. (2000). “Independent component analysis: algorithms and applications.” Neural networks, 13: 411–430.
• Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). “An Introduction to Variational Methods for Graphical Methods.” In Machine Learning, 183–233. MIT Press.
• Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). “Auto-Encoding Variational Bayes.” In The 2nd International Conference on Learning Representations (ICLR).
• Lin et al. (2008) Lin, C. J., Weng, R. C., and Keerthi, S. S. (2008). “Trust region Newton method for large-scale logistic regression.” Journal of Machine Learning Research, 9: 627–650.
• Liu (2001) Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer.
• Metropolis et al. (1953) Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). “Equation of State Calculations by Fast Computing Machines.” The Journal of Chemical Physics, 21(6): 1087–1092.
• Neal (2011) Neal, R. M. (2011). “MCMC using Hamiltonian dynamics.” In Brooks, S., Gelman, A., Jones, G., and Meng, X. L. (eds.), Handbook of Markov Chain Monte Carlo, 113–162. Chapman and Hall/CRC.
• Ormerod and Wand (2010) Ormerod, J. T. and Wand, M. P. (2010). “Explaining Variational Approximations.” The American Statistician, 2(64): 140–153.
• Quinonero-Candela and Rasmussen (2005) Quinonero-Candela, J. and Rasmussen, C. E. (2005). “A unifying view of sparse approximate Gaussian process regression.” Journal of Machine Learning Research, 6: 1939–1959.
• Rahimi and Recht (2008) Rahimi, A. and Recht, B. (2008). “Uniform approximation of functions with random bases.” In Proc. 46th Ann. Allerton Conf. Commun., Contr. Comput..
• Ranganath et al. (2013) Ranganath, R., Gerrish, S., and Blei, D. M. (2013). “Black box variational inference.” arxiv.org/abs/1401.0118.
• Rasmussen (2003) Rasmussen, C. E. (2003). “Gaussian Processes to Speed up Hybrid Monte Carlo for Expensive Bayesian Integrals.” Bayesian Statistics, 7: 651–659.
• Salimans et al. (2015) Salimans, T., Kingma, D. P., and Welling, M. (2015). “Markov chain Monte Carlo and variational inference: bridging the gap.” In International Conference on Machine Learning (ICML 2015), 1218–1226.
• Salimans and Knowles (2013) Salimans, T. and Knowles, D. A. (2013). “Fixed-form variational posterior approximation through stochastic linear regression.” Bayesian Analysis, 8(4): 837–882.
• Saul and Jordan (1996) Saul, L. and Jordan, M. I. (1996). “Exploiting tractable substructures in intractable networks.” In Tesauro, G., Touretzky, D. S., and Leen, T. K. (eds.), Advance in neural information processing systems 7 (NIPS 1996), 486–492. Cambridge, MA: MIT Press.
• Snelson and Ghahramani (2006) Snelson, E. and Ghahramani, Z. (2006). “Sparse Gaussian processes using pseduo-input.” In Advances in Neual Information Processing Systems, 1257–1264. Cambridge, MA: MIT Press.
• Vigário et al. (1997) Vigário, R., Särelä, J., and Oja, E. (1997). “MEG data for studies using independent component analysis.”
• Wainwright and Jordan (2008) Wainwright, M. and Jordan, M. (2008). “Graphical models, exponential families, and variational inference.” Foundations and Trends in Machine Learning, 1(1-2): 1–305.
• Wang et al. (2013) Wang, Z., Mohamed, S., and Nando, D. (2013). “Adaptive Hamiltonian and Riemann manifold Monte Carlo.” In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), 1462–1470.
• Welling and Teh (2011) Welling, M. and Teh, Y. W. (2011). “Bayesian Learning via Stochastic Gradient Langevin Dynamics.” In Proceedings of the International Conference on Machine Learning.
• Zhang et al. (2015) Zhang, C., Shahbaba, B., and Zhao, H. K. (2015). “Hamiltonian Monte Carlo Acceleration Using Surrogate Functions with Random Bases.” arxiv.org/abs/1506.05555.