Gradient-based Monte Carlo algorithms are useful tools for sampling posterior distributions. Similar to gradient descent algorithms, gradient-based Monte Carlo generates posterior samples iteratively using the gradient of log-likelihood.
Langevin dynamics (LD) and Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal et al., 2011) are two important examples of gradient-based Monte Carlo sampling algorithms that are widely used in Bayesian inference. Since calculating likelihood on large datasets is expensive, people use stochastic gradients (Robbins and Monro, 1951) in place of full gradient, and have, for both Langevin dynamics and Hamiltonian Monte Carlo, developed their stochastic gradient counterparts(Welling and Teh, 2011; Chen et al., 2014)
. Stochastic gradient Hamiltonian Monte Carlo (SGHMC) usually converges faster than stochastic gradient Langevin dynamics (SGLD) in practical machine learning tasks like covariance estimation of bivariate Gaussian and Bayesian neural networks for classification on MNIST dataset, as demonstrated in(Welling and Teh, 2011). Similar phenomenon was also observed in (Chen et al., 2015) where SGHMC and SGLD were compared on both synthetic and real-world datasets. Intuitively speaking, comparing against SGLD, SGHMC has a momentum term that may enable it to explore the parameter space of posterior distribution much faster when the gradient of log-likelihood becomes smaller.
Very recently, Dubey et al. (2016) borrowed the standard variance reduction techniques from the stochastic optimization literature (Johnson and Zhang, 2013; Defazio et al., 2014) and applied them on SGLD to obtain two variance-reduced SGLD algorithms (called SAGA-LD and SVRG-LD) with improved theoretical results and practical performance. Because of the superiority of SGHMC over SGLD in terms of convergence rate in a wide range of machine learning tasks, it would be a natural question whether such variance reduction techniques can be applied on SGHMC to achieve better results than variance-reduced SGLD. The challenge is that SGHMC is more complicated than SGLD, i.e., the extra momentum term (try to explore faster) and friction term (control the noise caused by SGHMC from HMC) in SGHMC. Note that the friction term in SGHMC is inherently different than SGLD since LD itself already has noise so it can be directly extended to SGLD, while HMC itself is deterministic. To the best of our knowledge, there is even no existing work to prove that SGHMC is better than SGLD. So in this paper we need to give some new approaches and insights in our analysis to prove that variance-reduced SGHMC is better than variance-reduced SGLD due to the existence of momentum term and friction term.
1.1 Our contribution
We propose two variance-reduced versions of Hamiltonian Monte Carlo algorithms (called SVRG-HMC and SAGA-HMC) using the standard approaches from (Johnson and Zhang, 2013; Defazio et al., 2014). Compared with SVRG/SAGA-LD (Dubey et al., 2016), our algorithms guarantee improved theoretical convergence results due to the extra momentum term in HMC (see Corollary 3).
Moreover, we combine the proposed SVRG/SAGA-HMC algorithms with the symmetric splitting scheme (Chen et al., 2015; Leimkuhler and Shang, 2016) to extend them to 2nd-order integrators, which further improve the dependency on step size (see the difference between Theorem 2 and 5). We denote these two algorithms as SVRG2nd-HMC and SAGA2nd-HMC.
Finally, we evaluate our algorithms on real-world datasets and compare them with SVRG/SAGA-LD (Dubey et al., 2016); as it turns out, our algorithms converge markedly faster than the benchmarks (vanilla SGHMC and SVRG/SAGA-LD).
1.2 Related work
Langevin dynamics and Hamiltonian Monte Carlo are two important sampling algorithms that are widely used in Bayesian inference. Many literatures studied how to develop the variants of them to achieve improved performance, especially for scalability for large datasets. Welling and Teh (2011) started this direction with the notable work stochastic gradient Langevin dynamics (SGLD). Ahn et al. (2012) proposed a modification to SGLD reminiscent of Fisher scoring to better estimate the gradient noise variance, with lower classification error rates on HHP dataset and MNIST dataset. Chen et al. (2014) developed the stochastic gradient version of HMC (SGHMC), with a quite nontrivial approach different from SGLD. Ding et al. (2014) further improved SGHMC by a new dynamics to better control the gradient noise, and the proposed stochastic gradient Nosé-Hoover thermostats (SGNHT) outperforms SGHMC on MNIST dataset.
Various settings of Markov Chain Monte Carlo (MCMC) are also considered.Girolami and Calderhead (2011) enhanced LD and HMC by exploring the Riemannian structure of the target distribution, with Riemannian manifold LD and HMC (RMLD and RMHMC, respectively). Byrne and Girolami (2013) developed geodesic Monte Carlo (GMC) that is applicable to Riemannian manifolds with no global coordinate systems. Large-scale variants of RMLD, RMHMC and GMC with stochastic gradient were developed by (Patterson and Teh, 2013), (Ma et al., 2015) and (Liu et al., 2016), respectively. Ahn et al. (2014) studied the behaviour of stochastic gradient MCMC algorithms for distributed posterior inference. Very recently, Zou et al. (2018) used a stochastic variance-reduced HMC for sampling from smooth and strongly log-concave distributions which requires is smooth and strongly convex. In this paper, we do not assume is strongly convex or convex and we also use an efficient discretization scheme to further imporve the convergence results. Their results were measured with 2-Wasserstein distance, while ours are measured with mean square error.
Let be a -dimensional dataset that follows the distribution . Then, we are interested in sampling the posterior distribution based on Hamiltonian Monte Carlo algorithms. Let denote the set . Define , where and . Similar to (Dubey et al., 2016), we assume that each is -smooth and -Lipschitz, for all .
The general algorithmic framework maintains two sequences for by the following discrete time procedure:
and then returns the samples as an approximation to the stationary distribution . is the parameter we wish to sample and is an auxiliary variable conventionally called the “momentum”. Here is step size, is a constant independent of and , and is a mini-batch approximation of the full gradient . If we set , being a -element index set uniformly randomly drawn (with replacement) from as introduced in (Robbins and Monro, 1951), then the algorithm becomes SGHMC.
The above discrete time procedure provides an approximation to the continuous Hamiltonian Monte Carlo diffusion process :
Here is a Wiener process. According to (Chen et al., 2015)
, the stationary joint distribution ofis .
How do we evaluate the quality of the samples ? Assuming is a smooth test function, we wish to upper bound the Mean-Squared Error (MSE) , where is the empirical average, and is the population average. So, the objective of our algorithm is to carefully design to minimize in a faster way, where is a stochastic approximation of .
To study how the choice of influences the value of , define to be the solution to the Poisson equation , being the generator of Hamiltonian Monte Carlo diffusion process. In order to analyze the theoretical convergence results related to the MSE , we inherit the following assumption from (Chen et al., 2015).
Assumption 1 ((Chen et al., 2015))
Function is bounded up to 3rd-order derivatives by some real-valued function , i.e. where is the th order derivative for , and . Furthermore, the expectation of on is bounded, i.e. and that is smooth such that , for some constant .
Define operator for all . When the above assumption holds, we have the following theorem by (Chen et al., 2015).
For the rest of this paper, for any two values , we say if , where the notation only hides a constant factor independent of algorithm parameters .
Theorem 1 ((Chen et al., 2015))
Similar to the [A2] assumption in (Dubey et al., 2016), we also need to make the following assumption which relates to the difference .
for all .
As we mentioned before, if we take to be the Robbins & Monro approximation of (Robbins and Monro, 1951), then it becomes SGHMC and the following corollary holds since all ’s are -Lipschitz and
for any random variable.
3 Variance Reduction for Hamiltonian Monte Carlo
In this subsection, we propose the SVRG-HMC algorithm (see Algorithm 1) which is based on the SVRG algorithm. As can be seen from Line 8 of Algorithm 1, we use , where is , as the stochastic estimation for the full gradient .
Note that we initialize
to be zero vectors in the algorithm only to simplify the theoretical analysis. It would still work with an arbitrary initialization.
We assume . Otherwise the MSE upper bound of SVRG-LD would be equal to SGHMC (see (6) and (8)). We then omit the same terms (i.e., second and third terms) in the RHS of (7) and (8). Then we have the following lemma.
In other words, the SVRG-HMC is times faster than SVRG-LD, in terms of the convergence bound related to the variance reduction (i.e., the first terms in RHS of (7) and (8)). Note that is the size of dataset which can be very large, b is the mini-batch size which is usually a small constant and is the Lipschitz smooth parameter for .
In this subsection, we propose the SAGA-HMC algorithm by applying the SAGA framework (Defazio et al., 2014) to the Hamiltonian Monte Carlo. The details are described in Algorithm 2. Similar to the SVRG-HMC, we initialize to be zero vectors in the algorithm only to simplify the analysis; it would still work with an arbitrary initialization.
The following theorem shows the convergence result for MSE of SAGA-HMC. The proof is deferred to Appendix B.
4 Variance-reduced SGHMC with Symmetric Splitting
Symmetric splitting is a numerically efficient method introduced in (Leimkuhler and Shang, 2016) to accelerate the gradient-based algorithms. We note that one additional advantage of SGHMC over SGLD is that SGHMC can be combined with symmetric splitting while SGLD cannot (Chen et al., 2015). So it is quite natural to combine symmetric splitting with the proposed SVRG-HMC and SAGA-HMC respectively to see if any further improvements can be obtained.
The symmetric splitting scheme breaks the original recursion into 5 steps:
If we eliminate the intermediate variables, then
Same as before . Note that the stochastic gradient is computed at (which is ) instead of (see (1), (12) and (14)). As shown in (Chen et al., 2015), this symmetric splitting scheme is a 2nd-order local integrator. Then it improves the dependency of MSE on step size , i.e., the third term in the RHS of (5) changes to be , which is a higher order term than the original . It means that we can allow larger step size by using this symmetric splitting scheme (note that ).
Similarly, we can further improve the convergence results for SVRG/SAGA-HMC by combining the symmetric splitting scheme. We give the details of the algorithms and theoretical results for SVRG2nd-HMC and SAGA2nd-HMC in the following subsections.
In this subsection, we propose the SAGA2nd-HMC algorithm (see Algorithm 4) by combining our SAGA-HMC (Algorithm 2) with the symmetric splitting scheme. The convergence result and algorithm details are described below.
We present experimental results in this section. We compare the proposed SVRG-HMC (Algorithm 1), as well as its symmetric splitting variant SVRG2nd-HMC (Algorithm 3), against SVRG-LD (Dubey et al., 2016) on Bayesian regression, Bayesian classification and Bayesian Neural Networks. The experimental results of SAGA variants (Algorithm 2 and 4) are almost same as the SVRG variants. We report the corresponding SAGA experiments in Appendix A. In accordance with the theoretical analysis, all algorithms have fixed step size , and all HMC-based algorithms have fixed friction parameter ; a grid search is performed to select the best step size and friction parameter for each algorithm. The minibatch size is chosen to be (same as SVRG/SAGA-LD (Dubey et al., 2016)) for all algorithms, and is set to be .
The experiments are tested on the real-world UCI datasets111The UCI datasets can be downloaded from https://archive.ics.uci.edu/ml/datasets.html. The information of the standard datasets used in our experiments are described in the following Table 1 and Table 2 (Section 5.3). For each dataset (regression or classification), we partition the dataset into training (70%), validation (10%) and test (20%) sets. The validation set is used to select step size as well as friction for HMC-based algorithms in an 8-fold manner.
The Bayesian regression experiments were conducted on the first four UCI regression datasets, the Bayesian classification experiments were conducted on the last three UCI classification datasets, and the more complicated Bayesian Neural Networks experiments were conducted on larger UCI datasets in Table 2.
5.1 Bayesian Regression
In this subsection we study the performance of those aforementioned algorithms on Bayesian linear regression. Say we are provided with inputswhere and . The distribution of given is modelled as , where the unknown parameter follows a prior distribution of . The gradients of log-likelihood can thus be calculated as and . The average test Mean-Squared Error (MSE) is reported in Figure 1.
As can be observed from Figure 1, SVRG-HMC as well its symmetric splitting counterpart SVRG2nd-HMC, converge markedly faster than SVRG-LD in the first pass through the whole dataset. The performance SVRG2nd-HMC is usually similar (no worse) to SVRG-HMC, and it turns out that a slightly larger step size can be chosen for SVRG2nd-HMC, which is also consistent with our theoretical results (i.e., allow larger step size).
5.2 Bayesian Classification
In this subsection we study classification tasks using Bayesian logistic classification. Suppose input data where , . The distribution of the output is modelled as , where the model parameter follows a prior distribution of . Then the gradient of log-likelihood and log-prior can be written as and . The average test log-likelihood is reported in Figure 2.
Similar to the Bayesian regression, SVRG-HMC as well its symmetric splitting counterpart SVRG2nd-HMC, converge markedly faster than SVRG-LD for the Bayesian classification tasks. Also, the experimental results suggest that SVRG2nd-HMC converges more quickly than SVRG-HMC, which is consistent with Theorem 2 and 5.
In sum, for our four algorithms, we recommend SVRG2nd/SAGA2nd-HMC due to the better theoretical results (Theorem 2, 4, 5 and 6) and practical results (Figure 1, 2, 4 and 5) compared with SVRG/SAGA-HMC. Further, we recommend SVRG2nd-HMC since SAGA2nd-HMC needs high memory cost and its implementation is a little bit complicated than SVRG2nd-HMC.
5.3 Bayesian Neural Networks
To show the scalability of variance reduced HMC to larger datasets and its application to nonconvex problems and more complicated models, we study Bayesian neural networks tasks. In our experiments, the model is a neural network with one hidden layer which has 50 hidden units (100 hidden units for ’susy’ dataset) with ReLU activation, which is denoted by. Its unknown parameter follows a prior distribution of . Let denotes output of the neural network with parameter value and input . The experiments are tested on larger UCI regression and classification datasets described in Table 2. Suppose we are provided with inputs where . In regression tasks, , , and the distribution of given is modelled as . In binary classification tasks, , , and . In -class classification tasks (), , , and
. The code for experiments is implemented in TensorFlow. We conduct experiments for vanilla SGHMC and SVRG variants of LD and HMC algorithms. The test Root-Mean-Square Error (RMSE) for regression tasks is reported in Figure 3, and the average test log-likelihood for classification tasks is reported in Figure 4.
The Bayesian neural network regression experiments were conducted on the first two UCI regression datasets and the classification experiments were conducted on the last two UCI classification datasets. The ’letter’ dataset is 26-class and the ’susy’ dataset is binary class.
Experimental results show that SVRG/SVRG2nd-HMC outperforms vanilla SGHMC and SVRG-LD, often by a significant gap. In particularly, this means that variance reduction technique indeed helps the convergence of SGHMC, i.e., the performance gap between SVRG/SVRG2nd-HMC and SVRG-LD in Figure 1–4 is not only coming from the superiority of HMC compared with LD. Similar to previous Section 5.1 and 5.2, the performance SVRG2nd-HMC is usually similar (no worse) to SVRG-HMC, and our experiments found sometimes a slightly larger step size can be chosen for SVRG2nd-HMC (while the same step size brings SVRG-HMC to NaN), which is also consistent with our theoretical results Theorem 5.
In this paper, we propose four variance-reduced Hamiltonian Monte Carlo algorithms, i.e., SVRG-HMC, SAGA-HMC, SVRG2nd-HMC and SAGA2nd-HMC for Bayesian Inference. These proposed algorithms guarantee improved theoretical convergence results and converge markedly faster than the benchmarks (vanilla SGHMC and SVRG/SAGA-LD) in practice. In conclusion, the SVRG2nd/SAGA2nd-HMC are more preferable than SVRG/SAGA-HMC according to our theoretical and experimental results. We would like to note that, our variance-reduced Hamiltonian Monte Carlo samplers are not Markovian procedures, but fortunately our theoretical analysis does not rely on any properties of Markov processes, and so it does not affect the correctness of Theorem 2, 4, 5 and 6.
For future work, it would be interesting to study whether our analysis can be apply to vanilla SGHMC without variance reduction. To the best of our knowledge, there is no existing work to prove that SGHMC is better than SGLD. On the other hand, we note that stochastic thermostat (Ding et al., 2014) could outperform both SGLD and SGHMC. It might be interesting to study if a variance-reduced variant of stochastic thermostat could also beat SVRG-LD and SVRG/SVRG2nd-HMC both theoretically and experimentally.
We would like to thank Chang Liu for useful discussions.
- Ahn et al.  Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient fisher scoring. In Proceedings of the 29th International Conference on Machine Learning, pages 1771–1778, 2012.
- Ahn et al.  Sungjin Ahn, Babak Shahbaba, and Max Welling. Distributed stochastic gradient mcmc. In International Conference on Machine Learning, pages 1044–1052, 2014.
- Byrne and Girolami  Simon Byrne and Mark Girolami. Geodesic monte carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845, 2013.
- Chen et al.  Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient mcmc algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pages 2278–2286, 2015.
- Chen et al.  Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient hamiltonian monte carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.
- Defazio et al.  Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
- Ding et al.  Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211, 2014.
- Duane et al.  Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid monte carlo. Physics letters B, 195(2):216–222, 1987.
- Dubey et al.  Kumar Avinava Dubey, Sashank J Reddi, Sinead A Williamson, Barnabas Poczos, Alexander J Smola, and Eric P Xing. Variance reduction in stochastic gradient langevin dynamics. In Advances in Neural Information Processing Systems, pages 1154–1162, 2016.
- Girolami and Calderhead  Mark Girolami and Ben Calderhead. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
Johnson and Zhang 
Rie Johnson and Tong Zhang.
Accelerating stochastic gradient descent using predictive variance reduction.In Advances in Neural Information Processing Systems, pages 315–323, 2013.
- Leimkuhler and Shang  Benedict Leimkuhler and Xiaocheng Shang. Adaptive thermostats for noisy gradient systems. SIAM Journal on Scientific Computing, 38(2):A712–A736, 2016.
- Liu et al.  Chang Liu, Jun Zhu, and Yang Song. Stochastic gradient geodesic mcmc methods. In Advances in Neural Information Processing Systems, pages 3009–3017, 2016.
- Ma et al.  Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient mcmc. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
- Neal et al.  Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.
Patterson and Teh 
Sam Patterson and Yee Whye Teh.
Stochastic gradient riemannian langevin dynamics on the probability simplex.In Advances in Neural Information Processing Systems, pages 3102–3110, 2013.
- Robbins and Monro  Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
- Welling and Teh  Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pages 681–688, 2011.
- Zou et al.  Difan Zou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced hamilton monte carlo methods. arXiv preprint arXiv:1802.04791, 2018.
Appendix A SAGA Experiments
In this appendix, we report the corresponding experimental results of SAGA variants (i.e., SAGA-LD, SAGA-HMC and SAGA2nd-HMC) for Bayesian regression and Bayesian classification tasks. The settings are the same as those in Section 5.1 and 5.2.
Appendix B Missing Proofs
b.1 Proof of Corollary 2
To prove this corollary, it is sufficient to show . Recall that , being a -element index set uniformly randomly drawn (with replacement) from and . Now, we prove this inequality as follows:
where the last two inequalities hold since for any random variable and is -Lipschitz.
b.2 Proof of Theorem 2
According to Corollary 1, we have:
where is the additive error in estimating the full gradient . By applying the variance reduction technique, we need to upper bound the summation .
Unpacking the definition of and , we have:
The Inequality (21) is due to for any random variable . In the rightmost summation, index is picked uniformly random from .
Then, we bound the RHS of (21) as follows:
The first inequality is by -smoothness of all ’s, and the second one is by Cauchy’s inequality.
By our algorithm, , so we need to upper bound for each .
By the recursion of ,
The third equality holds because and and are independent. The first inequality takes advantage of and .
Define . Then, taking a grand summation over ,