1 Introduction
Bayesian statistics provides a natural way to manage model complexity and control overfitting, with modern problems involving complicated models with a large number of parameters. One of the most powerful advantages of the Bayesian approach is hierarchical modeling, which allows partial pooling across a group of datasets, allowing groups with little data to borrow information from similar groups with larger amounts of data. However, such models pose problems for Markov chain Monte Carlo (MCMC) methods, because the joint posterior distribution is often pathological due to strong correlations between the model parameters and the hyperparameters [3]. For example, one of the most powerful MCMC methods is Hamiltonian Monte Carlo (HMC). However, for hierarchical models even the mixing speed of HMC can be unsatisfactory in practice, as has been noted several times in the literature [3, 4, 11]. Riemannian manifold Hamiltonian Monte Carlo (RMHMC) [7] is a recent extension of HMC that aims to efficiently sample from challenging posterior distributions by exploiting local geometric properties of the distribution of interest. However, it is computationally too expensive to be applicable to large scale problems.
In this work, we propose a simplified RMHMC method, called SemiSeparable Hamiltonian Monte Carlo (SSHMC), in which the joint Hamiltonian over parameters and hyperparameters has special structure, which we call semiseparability, that allows it to be decomposed into two simpler, separable Hamiltonians. This condition allows for a new efficient algorithm which we call the alternating blockwise leapfrog algorithm. Compared to Gibbs sampling, SSHMC can make significantly larger moves in hyperparameter space due to shared terms between the two simple Hamiltonians. Compared to previous RMHMC methods, SSHMC yields simpler and more computationally efficient samplers for many practical Bayesian models.
2 Hierarchical Bayesian Models
Let be a collection of data groups where th data group is a collection of iid observations and their inputs . We assume the data follows a parametric distribution , where is the model parameter for group . The parameters are assumed to be drawn from a prior , where is the hyperparameter with prior distribution . The joint posterior over model parameters and hyperparameters is then
(1) 
This hierarchical Bayesian model is popular because the parameters for each group are coupled, allowing the groups to share statistical strength. However, this property causes difficulties when approximating the posterior distribution. In the posterior, the model parameters and hyperparameters are strongly correlated. In particular,
usually controls the variance of
to promote partial pooling, so the variance of depends strongly on This causes difficulties for many MCMC methods, such as the Gibbs sampler and HMC. An illustrative example of pathological structure in hierarchical models is the Gaussian funnel distribution [11]. Its density function is defined as , whereis the vector of lowlevel parameters and
is the variance hyperparameters. The pathological correlation between and is illustrated by Figure 1.3 Hamiltonian Monte Carlo on Posterior Manifold
Hamiltonian Monte Carlo (HMC) is a gradientbased MCMC method with auxiliary variables. To generate samples from a target density , HMC constructs an ergodic Markov chain with the invariant distribution , where is an auxiliary variable. The most common choice of
with precision matrix . Given the current sample , the transition kernel of the HMC chain includes three steps: first sample , second propose a new sampleby simulating the Hamiltonian dynamics and finally accept the proposed sample with probability
, otherwise leave unchanged. The last step is a MetropolisHastings (MH) correction. Define . The Hamiltonian dynamics is defined by the differential equations , where is called the position and is called the momentum.It is easy to see that , which is called the energy preservation property [11, 10]. In physics, is known as the Hamiltonian energy, and is decomposed into the sum of the potential energy and the kinetic energy . The most used discretized simulation in HMC is the leapfrog algorithm, which is given by the recursion
(2a)  
(2b)  
(2c) 
where is the step size of discretized simulation time. After steps from the current sample , the new sample is proposed as the last point . In Hamiltonian dynamics, the matrix is called the mass matrix. If is constant w.r.t. , then and are independent in . In this case we say that is a separable Hamiltonian. In particular, we use the term standard HMC
to refer to HMC using the identity matrix as
. Although HMC methods often outperform other popular MCMC methods, they may mix slowly if there are strong correlations between variables in the target distribution. Neal [11] showed that HMC can mix faster if is not the identity matrix. Intuitively, such a acts like a preconditioner. However, if the curvature of varies greatly, a global preconditioner can be inadequate.For this reason, recent work, notably that on Riemannian manifold HMC (RMHMC) [7], has considered nonseparable Hamiltonian methods, in which varies with position , so that and are no longer independent in . The resulting Hamiltonian is called a nonseparable
Hamiltonian. For example, for Bayesian inference problems, Girolami et al.
[7] proposed using the Fisher Information Matrix (FIM) of, which is the metric tensor of posterior manifold. However, for a nonseparable Hamiltonian, the simple leapfrog dynamics (
2a)(2c) do not yield a valid MCMC method, as they are no longer reversible. Simulation of general nonseparable systems requires the generalized leapfrog integrator (GLI) [7], which requires computing higher order derivatives to solve a system of nonlinear differential equations. The computational cost of GLI in general is where is the number of parameters, which is prohibitive for large .In hierarchical models, there are two ways to sample the posterior using HMC. One way is to sample the joint posterior directly. The other way is to sample the conditional and , simulating from each conditional distribution using HMC. This strategy is called HMC within Gibbs [11]. In either case, HMC chains tend to mix slowly in hyperparameter space, because the huge variation of potential energy across different hyperparameter values can easily overwhelm the kinetic energy in separable HMC [11]. Hierarchical models also pose a challenge to RMHMC, if we want to sample the model parameters and hyperparameters jointly. In particular, the closedform FIM of the joint posterior is usually unavailable. Due to this problem, even sampling some toy models like the Gaussian funnel using RMHMC becomes challenging. Betancourt [2] proposed a new metric that uses a transformed Hessian matrix of , and Betancourt and Girolami [3] demonstrate the power of this method for efficiently sampling hyperparameters of hierarchical models on some simple benchmarks like Gaussian funnel. However, the transformation requires computing eigendecomposition of the Hessian matrix, which is infeasible in high dimensions.
Because of these technical difficulties, RMHMC for hierarchical models is usually used within a block Gibbs sampling scheme, alternating between and . This RMHMC within Gibbs strategy is useful because the simulation of the nonseparable dynamics for the conditional distributions may have much lower computational cost than that for the joint one. However, as we have discussed, in hierarchical models these variables tend be very strongly correlated, and it is wellknown that Gibbs samplers mix slowly in such cases [13]. So, the Gibbs scheme limits the true power of RMHMC.
4 SemiSeparable Hamiltonian Monte Carlo
In this section we propose a nonseparable HMC method that does not have the limitations of Gibbs sampling and that scales to relatively high dimensions, based on a novel property that we will call semiseparability. We introduce new HMC methods that rely on semiseparable Hamiltonians, which we call semiseparable Hamiltonian Monte Carlo (SSHMC).
4.1 SemiSeparable Hamiltonian
In this section, we define the semiseparable Hamiltonian system. Our target distribution will be the posterior of a hierarchical model (1), where and . Let be the vector of momentum variables corresponding to and respectively. The nonseparable Hamiltonian is defined as
(3) 
where the potential energy is and the kinetic energy is which includes the normalization term . The mass matrix can be an arbitrary p.d. matrix. For example, previous work on RMHMC [7] has chosen to be FIM of the joint posterior , resulting in an HMC method that requires time. This limits applications of RMHMC to large scale problems.
To attack these computational challenges, we introduce restrictions on the mass matrix to enable efficient simulation. In particular, we restrict to have the form
where and are the precision matrices of and , respectively. Importantly, we restrict to be independent of and to be independent of . If has these properties, we call the resulting Hamiltonian a semiseparable Hamiltonian. A semiseparable Hamiltonian is still in general nonseparable, as the two random vectors and are not independent.
The semiseparability property has important computational advantages. First, because is block diagonal, the cost of matrix operations reduces from to . Second, and more important, substituting the restricted mass matrix into (3) results in the potential and kinetic energy:
(4)  
(5) 
If we fix or , the nonseparable Hamiltonian (3) can be seen as a separable Hamiltonian plus some constant terms. In particular, define the notation
Then, considering as fixed, the nonseparable Hamiltonian in (3) is different from the following separable Hamiltonian
(6)  
(7)  
(8) 
only by some constant terms that do not depend on . What this means is that any update to that leaves invariant leaves the joint Hamiltonian invariant as well. An example is the leapfrog dynamics on , where is considered the potential energy, and the kinetic energy.
Similarly, if are fixed, then differs from the following separable Hamiltonian
(9)  
(10)  
(11) 
only by terms that are constant with respect to
Notice that and are coupled by the terms and . Each of these terms appears in the kinetic energy of one of the separable Hamiltonians, but in the potential energy of the other one. We call these terms auxiliary potentials because they are potential energy terms introduced by the auxiliary variables. These auxiliary potentials are key to our method (see Section 4.3).
4.2 Alternating blockwise leapfrog algorithm
Now we introduce an efficient SSHMC method that exploits the semiseparability property. As described in the previous section, any update to that leaves invariant also leaves the joint Hamiltonian invariant, as does any update to that leaves invariant. So a natural idea is simply to alternate between simulating the Hamiltonian dynamics for and that for . Crucially, even though the total Hamiltonian is not separable in general, both and are separable. Therefore when simulating and , the simple leapfrog method can be used, and the more complex GLI method is not required.
We call this method the alternating blockwise leapfrog algorithm (ABLA), shown in Algorithm 1. In this figure the function “leapfrog” returns the result of the leapfrog dynamics (2a)(2c) for the given starting point, Hamiltonian, and step size. We call each iteration of the loop from an ALBA step. For simplicity, we have shown one leapfrog step for and for each ALBA step, but in practice it is useful to use multiple leapfrog steps per ALBA step. ABLA has discretization error due to the leapfrog discretization, so the MH correction is required. If it is possible to simulate and exactly, then is preserved exactly and there is no need for MH correction.
To show that the SSHMC method by ALBA preserves the distribution , we also need to show the ABLA is a timereversible and volumepreserving transformation in the joint space of . Let where and . Obviously, any reversible and volumepreserving transformation in a subspace of is also reversible and volumepreserving in . It is easy to see that each leapfrog step in the ABLA algorithm is reversible and volumepreserving in either or . One more property of integrator of interest is symplecticity. Because each leapfrog integrator is symplectic in the subspace of [10], they are also symplectic in . Then because ABLA is a composition of symplectic leapfrog integrators, and the composition of symplectic transformations is symplectic, we know ABLA is symplectic.
We emphasize that ABLA is actually not a discretized simulation of the semiseparable Hamiltonian system , that is, if starting at a point in the joint space, we run the exact Hamiltonian dynamics for for a length of time , the resulting point will not be the same as that returned by ALBA at time even if the discretized time step is infinitely small. For example, ALBA simulates with step size and with step size where , when that preserves .
4.3 Connection to other methods
Although the SSHMC method may seem similar to RMHMC within Gibbs (RMHMCWG), SSHMC is actually very different. The difference is in the last two terms of (7) and (10); if these are omitted from SSHMC and the Hamiltonians for , then we obtain HMC within Gibbs. Particularly important among these two terms is the auxiliary potential, because it allows each of the separable Hamiltonian systems to borrow energy from the other one. For example, if the previous leapfrog step increases the kinetic energy in , then, in the next leapfrog step for , we see that will have greater potential energy , because the auxiliary potential is shared. That allows the leapfrog step to accommodate a larger change of using . So, the chain will mix faster in . By the symmetry of and , the auxiliary potential will also accelerate the mixing in .
Another way to see this is that the dynamics in RMHMCWG for preserves the distribution but not the joint . That is because the Gibbs sampler does not take into account the effect of on . In other words, the Gibbs step has the stationary distribution rather than The difference between the two is the auxiliary potential. In contrast, the SSHMC methods preserve the Hamiltonian of .
4.4 Choice of mass matrix
The choice of and in SSHMC is usually similar to RMHMCWG. If the Hessian matrix of is independent of and always p.d., it is natural to define
as the inverse of the Hessian matrix. However, for some popular models, e.g., logistic regression, the Hessian matrix of the likelihood function depends on the parameters
. In this case, one can use any approximate Hessian , like the Hessian at the mode, and define , where is the Hessian of the prior distribution. Such a rough approximation is usually good enough to improve the mixing speed, because the main difficulty is the correlation between model parameters and hyperparameters.In general, because the computational bottleneck in HMC and SSHMC is computing the gradient of the target distribution, both methods have the same computational complexity , where is the cost of computing the gradient and is the total number of leapfrog steps per iteration. However, in practice we find it very beneficial to use multiple steps in each blockwise leapfrog update in ALBA; this can cause SSHMC to require more time than HMC. Also, depending on the mass matrix , the cost of leapfrog a step in ABLA may be different from those in standard HMC. For some choices of , the leapfrog step in ABLA can be even faster than one leapfrog step of HMC. For example, in many models the computational bottleneck is the gradient , is the normalization in prior. Recall that is a function of . If , will be canceled out, avoiding computation of . One example is using in Gaussian funnel distribution aforementioned in Section 2. A potential problem of such is that the curvature of the likelihood function is ignored. But when the data in each group is sparse and the parameters are strongly correlated, this can give nearly optimal mixing speed and make SSHMC much faster.
In general, any choice of and that would be valid for separable HMC with Gibbs is also valid for SSHMC.
5 Experimental Results
In this section, we compare the performance of SSHMC with the standard HMC and RMHMC within Gibbs [7] on four benchmark models.^{1}^{1}1Our use of a Gibbs scheme for RMHMC follows standard practice [7]. The step size of all methods are manually tuned so that the acceptance rate is around . The number of leapfrog steps are tuned for each method using preliminary runs. The implementation of RMHMC we used is from [7]. The running time is wallclock time measured after burnin. The performance is evaluated by the minimum Effective Sample Size (ESS) over all dimensions (see [6]). When considering the different computational complexity of methods, our main efficiency metric is time normalized ESS.





HMC with diagonal constant mass  SSHMC (semiseparable mass) 
time(s)  min ESS(, )  min ESS/s (, )  MSE(, )  

HMC  5.16  (302.97, 26.30)  (58.64, 5.09)  (2.28, 1.34) 
RMHMC(Gibbs)  2.7  (2490.98, 8.93)  (895.15, 3.21)  (1.95, 1.33) 
SSHMC  37.35  (3868.79, 1541.67)  (103.57, 41.27)  (0.04, 0.02) 
is the hyperparameter. The last column is the mean squared error of the sample estimated mean and variance of hyperparameter.
running time(s)  ESS (min, med, max)  ESS  min ESS/s  

HMC  378  (2.05, 3.68, 4.79)  815  2.15 
RMHMC(Gibbs)  411  (0.8, 4.08, 4.99)  271  0.6 
SSHMC  385.82  (2.5, 3.42, 4.27)  2266  5.83 
time (s)  ESS (min, med, max)  ESS  min ESS/s  

HMC  162  (1.6, 2.2, 5.2)  (50, 50, 128)  0.31 
RMHMC(Gibbs)  183  (12.1, 18.4, 33.5)  (385, 163, 411)  0.89 
SSHMC  883  (78.4, 98.9, 120.7)  (4434, 1706, 1390)  1.57 



5.1 Demonstration on Gaussian Funnel
We demonstrate SSHMC by sampling the Gaussian Funnel (GF) defined in Section 2. We consider dimensional lowlevel parameters and 1 hyperparameter . RMHMC within Gibbs on GF has block diagonal mass matrix defined as and . We use the same mass matrix in SSHMC, because it is semiseparable. We use 2 leapfrog steps for lowlevel parameters and 1 leapfrog step for the hyperparameter in ABLA and the same leapfrog step size for the two separable Hamiltonians.
We generate 5000 samples from each method after 1000 burnin iterations. The ESS per second (ESS/s) and mean squared error (MSE) of the sample estimated mean and variance of the hyperparameter are given in Table 1. Notice that RMHMC is much more efficient for the lowlevel variable because of the adaptive mass matrix with hyperparameter. Figure 1 illustrates a dramatic difference between HMC and SSHMC. It is clear that HMC suffers from oscillation of the hyperparameter in a narrow region. That is because the kinetic energy limits the change of hyperparameters [11, 3]. In contrast, SSHMC has much wider energy variation and the trajectory spans a larger range of hyperparameter . The energy variation of SSHMC is similar to the RMHMC with SoftAbs metric (RMHMCSoftAbs) reported in [2], an instance of general RMHMC without Gibbs. But compared with [2], each ABLA step is about 100 times faster than each generalized leapfrog step and SSHMC can generate around 2.5 times more effective samples per second than RMHMCSoftAbs. Although RHMC within Gibbs has better ESS/s on the low level variables, its estimation of the mean and variance is biased, indicating that the chain has not yet mixed. More important, Table 1
shows that the samples generated by SSHMC give nearly unbiased estimates of the mean and variance of the hyperparameter, which neither of the other methods are able to do.
5.2 Hierarchical Bayesian Logistic Regression
In this experiment, we consider hierarchical Bayesian logistic regression with exponential prior for the variance hyperparameter , that is
where is the logistic function and is the th data points the th group. We use the Statlog (German credit) dataset from [1]. This dataset includes 1000 data points and each data has categorical features and numeric features. Bayesian logistic regression on this dataset has been considered as a benchmark for HMC [7, 8], but the previous work uses only one group in their experiments. To make the problem more interesting, we partition the dataset into groups according to the feature Purpose. The size of group varies from 9 to 285. There are model parameters ( parameters for each group) and hyperparameter.
We consider the reparameterization of the hyperparameter . For RMHMC within Gibbs, the mass matrix for group is where is the Fisher Information matrix for model parameter and constant mass . In each iteration of the Gibbs sampler, each is sampled from by RMHMC using 6 generalized leapfrog steps and is sampled using 6 leapfrog steps. For SSHMC, and the same constant mass .
The results are shown in Table 2. SSHMC again has much higher ESS/s than the other methods.
5.3 Stochastic Volatility
A stochastic volatility model we consider is studied in [9], in which the latent volatilities are modeled by an autoregressive AR(1) process such that the observations are with latent variable . We consider the distributions , and . The joint probability is defined as
where the prior , and . The FIM of depends on the hyperparameters but not , but the FIM of depends on . For RMHMC within Gibbs we consider FIM as the metric tensor following [7]. For SSHMC, we define as inverse Hessian of , but as an identity matrix. In each ABLA step, we use 5 leapfrog steps for updates of and 2 leapfrog steps for updates of the hyperparameters, so that the running time of SSHMC is about 7 times that of standard HMC.
We generate 20000 samples using each method after 10000 burnin samples. The histogram of hyperparameters is shown in Figure 2. It is clear that all methods approximately converge to the same distribution. But from Table 3, we see that SSHMC generates almost two times as many ESS/s as RMHMC within Gibbs.
5.4 LogGaussian Cox Point Process
The logGaussian Cox Point Process (LGCPP) is another popular testing benchmark [5, 7, 14]. We follow the experimental setting of Girolami and Calderhead [7]. The observations are counts at the location , on a regular spatial grid, which are conditionally independent given a latent intensity process with mean , where , , and . is assigned by a Gaussian process prior, with mean function and covariance function where is the Euclidean distance between and . The log joint probability is given by We consider a grid that has 1024 latent variables. Each latent variable corresponds to a single observation .
We consider RMHMC within Gibbs with FIM of the conditional posteriors. See [7] for the definition of FIM. The generalized leapfrog steps are required for updating , but only the leapfrog steps are required for updating . Each Gibbs iteration takes 20 leapfrog steps for and 1 general leapfrog step for . In SSHMC, we use and . In each ABLA step, the update of takes 2 leapfrog steps and the update of takes 1 leapfrog step. Each SSHMC transition takes 10 ALBA steps. We do not consider HMC on LGCPP, because it mixes extremely slowly for hyperparameters.
The results of ESS are given in Table 4. The mean of sampled latent variable and the histogram of sampled hyperparameters are given in Figure 3. It is clear that the samples of RMHMC and SSHMC are consistent, so both methods are mixing well. However, SSHMC generates about six times as many effective samples per hour as RMHMC within Gibbs.




time(h)  ESS (min, med, max)  ESS()  min ESS/h  

SSHMC  2.6  (7.8, 30, 39)  (2101, 270)  103.8 
RMHMC(Gibbs)  2.64  (1, 29, 38.3)  (200, 46)  16 
6 Conclusion
We have presented SemiSeparable Hamiltonian Monte Carlo (SSHMC), a new version of Riemannian manifold Hamiltonian Monte Carlo (RMHMC) that aims to retain the flexibility of RMHMC for difficult Bayesian sampling problems, while achieving greater simplicity and lower computational complexity. We tested SSHMC on several different hierarchical models, and on all the models we considered, SSHMC outperforms both HMC and RMHMC within Gibbs in terms of number of effective samples produced in a fixed amount of computation time. Future work could consider other choices of mass matrix within the semiseparable framework, or the use of SSHMC within discrete models, following previous work in discrete HMC [15, 12].
References

Bache and Lichman [2013]
K. Bache and M. Lichman.
UCI machine learning repository, 2013.
URL http://archive.ics.uci.edu/ml.  Betancourt [2012] M. J. Betancourt. A General Metric for Riemannian Manifold Hamiltonian Monte Carlo. ArXiv eprints, Dec. 2012.
 Betancourt and Girolami [2013] M. J. Betancourt and M. Girolami. Hamiltonian Monte Carlo for Hierarchical Models. ArXiv eprints, Dec. 2013.

Choo [2000]
K. Choo.
Learning hyperparameters for neural network models using Hamiltonian dynamics
. PhD thesis, Citeseer, 2000.  Christensen et al. [2005] O. F. Christensen, G. O. Roberts, and J. S. Rosenthal. Scaling limits for the transient phase of local Metropolis–Hastings algorithms. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):253–268, 2005.
 Geyer [1992] C. J. Geyer. Practical Markov Chain Monte Carlo. Statistical Science, pages 473–483, 1992.
 Girolami and Calderhead [2011] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011. ISSN 14679868. doi: 10.1111/j.14679868.2010.00765.x. URL http://dx.doi.org/10.1111/j.14679868.2010.00765.x.
 Hoffman and Gelman [In press] M. D. Hoffman and A. Gelman. The noUturn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, In press.
 Kim et al. [1998] S. Kim, N. Shephard, and S. Chib. Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies, 65(3):361–393, 1998.
 Leimkuhler and Reich [2004] B. Leimkuhler and S. Reich. Simulating Hamiltonian dynamics, volume 14. Cambridge University Press, 2004.
 Neal [2011] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–162, 2011.
 Pakman and Paninski [2013] A. Pakman and L. Paninski. Auxiliaryvariable exact Hamiltonian Monte Carlo samplers for binary distributions. In Advances in Neural Information Processing Systems 26, pages 2490–2498. 2013.
 Robert and Casella [2004] C. P. Robert and G. Casella. Monte Carlo statistical methods, volume 319. Citeseer, 2004.
 Wang et al. [2013] Z. Wang, S. Mohamed, and N. de Freitas. Adaptive Hamiltonian and Riemann manifold Monte Carlo samplers. In International Conference on Machine Learning (ICML), pages 1462–1470, 2013. URL http://jmlr.org/proceedings/papers/v28/wang13e.pdf. JMLR W&CP 28 (3): 1462–1470, 2013.
 Zhang et al. [2012] Y. Zhang, C. Sutton, A. Storkey, and Z. Ghahramani. Continuous relaxations for discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems (NIPS), 2012.
Comments
There are no comments yet.