Markov chain Monte Carlo (MCMC) has become a defacto tool for Bayesian posterior inference. However, these methods notoriously mix slowly in complex, high-dimensional models and scale poorly to large datasets. The past decades have seen a rise in MCMC methods that provide more efficient exploration of the posterior, such as Hamiltonian Monte Carlo (HMC) [8, 12] and its Reimann manifold variant . This class of samplers is based on defining a potential energy function in terms of the target posterior distribution and then devising various continuous dynamics to explore the energy landscape, enabling proposals of distant states. The gain in efficiency of exploration often comes at the cost of a significant computational burden in large datasets.
Recently, stochastic gradient variants of such continuous-dynamic samplers have proven quite useful in scaling the methods to large datasets [17, 1, 6, 2, 7]. At each iteration, these samplers use data subsamples—or minibatches—rather than the full dataset. Stochastic gradient Langevin dynamics (SGLD)  innovated in this area by connecting stochastic optimization with a first-order Langevin dynamic MCMC technique, showing that adding the “right amount” of noise to stochastic gradient ascent iterates leads to samples from the target posterior as the step size is annealed. Stochastic gradient Hamiltonian Monte Carlo (SGHMC)  builds on this idea, but importantly incorporates the efficient exploration provided by the HMC momentum term. A key insight in that paper was that the naïve stochastic gradient variant of HMC actually leads to an incorrect stationary distribution (also see ); instead a modification to the dynamics underlying HMC is needed to account for the stochastic gradient noise. Variants of both SGLD and SGHMC with further modifications to improve efficiency have also recently been proposed [1, 13, 7].
In the plethora of past MCMC methods that explicitly leverage continuous dynamics—including HMC, Riemann manifold HMC, and the stochastic gradient methods—the focus has been on showing that the intricate dynamics leave the target posterior distribution invariant. Innovating in this arena requires constructing novel dynamics and simultaneously ensuring that the target distribution is the stationary distribution. This can be quite challenging, and often requires significant physical and geometrical intuition [6, 13, 7]. A natural question, then, is whether there exists a general recipe for devising such continuous-dynamic MCMC methods that naturally lead to invariance of the target distribution. In this paper, we answer this question to the affirmative. Furthermore, and quite importantly, our proposed recipe is complete. That is, any continuous Markov process (with no jumps) with the desired invariant distribution can be cast within our framework, including HMC, Riemann manifold HMC, SGLD, SGHMC, their recent variants, and any future developments in this area. That is, our method provides a unifying framework of past algorithms, as well as a practical tool for devising new samplers and testing the correctness of proposed samplers.
The recipe involves defining a (stochastic) system parameterized by two matrices: a positive semidefinite diffusion matrix,
, and a skew-symmetric curl matrix,, where with our model parameters of interest and a set of auxiliary variables. The dynamics are then written explicitly in terms of the target stationary distribution and these two matrices. By varying the choices of and , we explore the space of MCMC methods that maintain the correct invariant distribution. We constructively prove the completeness of this framework by converting a general continuous Markov process into the proposed dynamic structure.
For any given , , and target distribution, we provide practical algorithms for implementing either full-data or minibatch-based variants of the sampler. In Sec. 3.1, we cast many previous continuous-dynamic samplers in our framework, finding their and . We then show how these existing and building blocks can be used to devise new samplers; we leave the question of exploring the space of and well-suited to the structure of the target distribution as an interesting direction for future research. In Sec. 3.2 we demonstrate our ability to construct new and relevant samplers by proposing stochastic gradient Riemann Hamiltonian Monte Carlo, the existence of which was previously only speculated. We demonstrate the utility of this sampler on synthetic data and in a streaming Wikipedia analysis using latent Dirichlet allocation .
2 A Complete Stochastic Gradient MCMC Framework
We start with the standard MCMC goal of drawing samples from a target distribution, which we take to be the posterior of model parameters given an observed dataset . Throughout, we assume i.i.d. data . We write , with potential function Algorithms like HMC [12, 10] further augment the space of interest with auxiliary variables and sample from , with Hamiltonian
Marginalizing the auxiliary variables gives us the desired distribution on . In this paper, we generically consider as the samples we seek to draw; could represent itself, or an augmented state space in which case we simply discard the auxiliary variables to perform the desired marginalization.
As in HMC, the idea is to translate the task of sampling from the posterior distribution to simulating from a continuous dynamical system which is used to define a Markov transition kernel. That is, over any interval , the differential equation defines a mapping from the state at time to the state at time . One can then discuss the evolution of the distribution under the dynamics, as characterized by the Fokker-Planck equation for stochastic dynamics  or the Liouville equation for deterministic dynamics . This evolution can be used to analyze the invariant distribution of the dynamics, . When considering deterministic dynamics, as in HMC, a jump process must be added to ensure ergodicity. If the resulting stationary distribution is equal to the target posterior, then simulating from the process can be equated with drawing samples from the posterior.
If the stationary distribution is not the target distribution, a Metropolis-Hastings (MH) correction can often be applied. Unfortunately, such correction steps require a costly computation on the entire dataset. Even if one can compute the MH correction, if the dynamics do not nearly lead to the correct stationary distribution, then the rejection rate can be high even for short simulation periods
. Furthermore, for many stochastic gradient MCMC samplers, computing the probability of the reverse path is infeasible, obviating the use of MH. As such, a focus in the literature is on defining dynamics with the right target distribution, especially in large-data scenarios where MH corrections are computationally burdensome or infeasible.
2.1 Devising SDEs with a Specified Target Stationary Distribution
Generically, all continuous Markov processes that one might consider for sampling can be written as a stochastic differential equation (SDE) of the form:
where denotes the deterministic drift and often relates to the gradient of , is a -dimensional Wiener process, and is a positive semidefinite diffusion matrix. Clearly, however, not all choices of and yield the stationary distribution .
When , as in HMC, the dynamics of Eq. (2) become deterministic. Our exposition focuses on SDEs, but our analysis applies to deterministic dynamics as well. In this case, our framework—using the Liouville equation in place of Fokker-Planck—ensures that the deterministic dynamics leave the target distribution invariant. For ergodicity, a jump process must be added, which is not considered in our recipe, but tends to be straightforward (e.g., momentum resampling in HMC).
To devise a recipe for constructing SDEs with the correct stationary distribution, we propose writing directly in terms of the target distribution:
Here, is a skew-symmetric curl matrix representing the deterministic traversing effects seen in HMC procedures. In contrast, the diffusion matrix determines the strength of the Wiener-process-driven diffusion. Matrices and can be adjusted to attain faster convergence to the posterior distribution. A more detailed discussion on the interpretation of and and the influence of specific choices of these matrices is provided in the Supplement.
Importantly, as we show in Theorem 1, sampling the stochastic dynamics of Eq. (2) (according to Itô integral) with as in Eq. (3) leads to the desired posterior distribution as the stationary distribution: . That is, for any choice of positive semidefinite and skew-symmetric parameterizing , we know that simulating from Eq. (2) will provide samples from (discarding any sampled auxiliary variables ) assuming the process is ergodic.
The equivalence of and the target can be shown using the Fokker-Planck description of the probability density evolution under the dynamics of Eq. (2) :
We can verify that is invariant under Eq. (5) by calculating . If the process is ergodic, this invariant distribution is unique. The equivalence of the compact form was originally proved in ; we include a detailed proof in the Supplement for completeness. ∎
2.2 Completeness of the Framework
An important question is what portion of samplers defined by continuous Markov processes with the target invariant distribution can we define by iterating over all possible and ? In Theorem 2, we show that for any continuous Markov process with the desired stationary distribution, , there exists an SDE as in Eq. (2) with defined as in Eq. (3). We know from the Chapman-Kolmogorov equation  that any continuous Markov process with stationary distribution can be written as in Eq. (2), which gives us the diffusion matrix . Theorem 2 then constructively defines the curl matrix . This result implies that our recipe is complete. That is, we cover all possible continuous Markov process samplers in our framework. See Fig. 1.
The integrability condition is usually satisfied when the probability density function uniquely exists. A constructive proof for the existence of is provided in the Supplement.
2.3 A Practical Algorithm
In practice, simulation relies on an -discretization of the SDE, leading to a full-data update rule
Calculating the gradient of involves evaluating the gradient of . For a stochastic gradient method, the assumption is that is too computationally intensive to compute as it relies on a sum over all data points (see Sec. 2). Instead, such stochastic gradient algorithms examine independently sampled data subsets and the corresponding potential for these data:
The specific form of Eq. (7) implies that
is an unbiased estimator of. As such, a gradient computed based on —called a stochastic gradient —is a noisy, but unbiased estimator of the full-data gradient. The key question in many of the existing stochastic gradient MCMC algorithms is whether the noise injected by the stochastic gradient adversely affects the stationary distribution of the modified dynamics (using in place of
). One way to analyze the impact of the stochastic gradient is to make use of the central limit theorem and assume
resulting in a noisy Hamiltonian gradient . Simply plugging in in place of in Eq. (6) results in dynamics with an additional noise term . To counteract this, assume we have an estimate
of the variance of this additional noise satisfying(i.e., positive semidefinite). With small , this is always true since the stochastic gradient noise scales down faster than the added noise. Then, we can attempt to account for the stochastic gradient noise by simulating
This provides our stochastic gradient—or minibatch— variant of the sampler. In Eq. (9), the noise introduced by the stochastic gradient is multiplied by (and the compensation by ), implying that the discrepancy between these dynamics and those of Eq. (6) approaches zero as goes to zero. As such, in this infinitesimal step size limit, since Eq. (6) yields the correct invariant distribution, so does Eq. (9). This avoids the need for a costly or potentially intractable MH correction. However, having to decrease to zero comes at the cost of increasingly small updates. We can also use a finite, small step size in practice, resulting in a biased (but faster) sampler. A similar bias-speed tradeoff was used in [11, 3] to construct MH samplers, in addition to being used in SGLD and SGHMC.
3 Applying the Theory to Construct Samplers
3.1 Casting Previous MCMC Algorithms within the Proposed Framework
We explicitly state how some recently developed MCMC methods fall within the proposed framework based on specific choices of , and in Eq. (2) and (3). For the stochastic gradient methods, we show how our framework can be used to “reinvent” the samplers by guiding their construction and avoiding potential mistakes or inefficiencies caused by naïve implementations.
Hamiltonian Monte Carlo (HMC)
The key ingredient in HMC [8, 12] is Hamiltonian dynamics, which simulate the physical motion of an object with position , momentum , and mass on an frictionless surface as follows (typically, a leapfrog simulation is used instead):
Eq. (10) is a special case of the proposed framework with , , and .
Stochastic Gradient Hamiltonian Monte Carlo (SGHMC)
where the arises from the approximation of Eq. (8). Careful study shows that Eq. (11) cannot be rewritten into our proposed framework, which hints that such a naïve stochastic gradient version of HMC is not correct. Interestingly, the authors of  proved that this naïve version indeed does not have the correct stationary distribution. In our framework, we see that the noise term is paired with a term, hinting that such a term should be added to Eq. (11). Here, , which means we need to add . Interestingly, this is the correction strategy proposed in , but through a physical interpretation of the dynamics. In particular, the term (or, generically, where ) has an interpretation as friction and leads to second order Langevin dynamics:
Here, is an estimate of . This method now fits into our framework with and as in HMC, but with . This example shows how our theory can be used to identify invalid samplers and provide guidance on how to effortlessly correct the mistakes; this is crucial when physical intuition is not available. Once the proposed sampler is cast in our framework with a specific and , there is no need for sampler-specific proofs, such as those of .
Stochastic Gradient Langevin Dynamics (SGLD)
SGLD  proposes to use the following first order (no momentum) Langevin dynamics to generate samples
This algorithm corresponds to taking with , , , and . As motivated by Eq. (9) of our framework, the variance of the stochastic gradient can be subtracted from the sampler injected noise to make the finite stepsize simulation more accurate. This variant of SGLD leads to the stochastic gradient Fisher scoring algorithm .
Stochastic Gradient Riemannian Langevin Dynamics (SGRLD)
SGLD can be generalized to use an adaptive diffusion matrix . Specifically, it is interesting to take , where is the Fisher information metric. The sampler dynamics are given by
Taking , , and , this SGRLD  method falls into our framework with correction term . It is interesting to note that in earlier literature , was taken to be . More recently, it was found that this correction term corresponds to the distribution function with respect to a non-Lebesgue measure ; for the Lebesgue measure, the revised was as determined by our framework . Again, we have an example of our theory providing guidance in devising correct samplers.
Stochastic Gradient Nosé-Hoover Thermostat (SGNHT)
Finally, the SGNHT  method incorporates ideas from thermodynamics to further increase adaptivity by augmenting the SGHMC system with an additional scalar auxiliary variable, . The algorithm uses the following dynamics:
We can take , , , and to place these dynamics within our framework.
In our framework, SGLD and SGRLD take and instead stress the design of the diffusion matrix , with SGLD using a constant and SGRLD an adaptive, -dependent diffusion matrix to better account for the geometry of the space being explored. On the other hand, HMC takes and focuses on the curl matrix . SGHMC combines SGLD with HMC through non-zero and matrices. SGNHT then extends SGHMC by taking to be state dependent. The relationships between these methods are depicted in the Supplement, which likewise contains a discussion of the tradeoffs between these two matrices. In short, can guide escaping from local modes while can enable rapid traversing of low-probability regions, especially when state adaptation is incorporated. We readily see that most of the product space , defining the space of all possible samplers, has yet to be filled.
3.2 Stochastic Gradient Riemann Hamiltonian Monte Carlo
In Sec. 3.1, we have shown how our framework unifies existing samplers. In this section, we now use our framework to guide the development of a new sampler. While SGHMC  inherits the momentum term of HMC, making it easier to traverse the space of parameters, the underlying geometry of the target distribution is still not utilized. Such information can usually be represented by the Fisher information metric , denoted as , which can be used to precondition the dynamics. For our proposed system, we consider , as in HMC/SGHMC methods, and modify the and of SGHMC to account for the geometry as follows:
We refer to this algorithm as stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC). Our theory holds for any positive definite , yielding a generalized SGRHMC (gSGRHMC) algorithm, which can be helpful when the Fisher information metric is hard to compute.
A naïve implementation of a state-dependent SGHMC algorithm might simply (i) precondition the HMC update, (ii) replace by , and (iii) add a state-dependent friction term on the order of the diffusion matrix to counterbalance the noise as in SGHMC, resulting in:
However, as we show in Sec. 4.1, samples from these dynamics do not converge to the desired distribution. Indeed, this system cannot be written within our framework. Instead, we can simply follow our framework and, as indicated by Eq. (9), consider the following update rule:
which includes a correction term , with -th component . The practical implementation of gSGRHMC is outlined in Algorithm 1.
In Sec. 4.1, we show that gSGRHMC can excel at rapidly exploring distributions with complex landscapes. We then apply SGRHMC to sampling in a latent Dirichlet allocation (LDA) model on a large Wikipedia dataset in Sec. 4.2. The Supplement contains details on the specific samplers considered and the parameter settings used in these experiments.
4.1 Synthetic Experiments
In this section we aim to empirically (i) validate the correctness of our recipe and (ii) assess the effectiveness of gSGRHMC. In Fig. 2(left), we consider two univariate distributions (shown in the Supplement) and compare SGLD, SGHMC, the naïve state-adaptive SGHMC of Eq. (16), and our proposed gSGRHMC of Eq. (17). See the Supplement for the form of . As expected, the naïve implementation does not converge to the target distribution. In contrast, the gSGRHMC algorithm obtained via our recipe indeed has the correct invariant distribution and efficiently explores the distributions. In the second experiment, we sample a bivariate distribution with strong correlation. The results are shown in Fig. 2(right). The comparison between SGLD, SGHMC, and our gSGRHMC method shows that both a state-dependent preconditioner and Hamiltonian dynamics help to make the sampler more efficient than either element on its own.
|Original LDA||Expanded Mean|
|Method||Average Runtime per 100 Docs|
4.2 Online Latent Dirichlet Allocation
We also applied SGRHMC (with , the Fisher information metric) to an online latent Dirichlet allocation (LDA)  analysis of topics present in Wikipedia entries. In LDA, each topic is associated with a distribution over words, with the probability of word under topic . Each document is comprised of a mixture of topics, with the probability of topic in document . Documents are generated by first selecting a topic for the th word and then drawing the specific word from the topic as . Typically, and are given Dirichlet priors.
The goal of our analysis here is inference of the corpus-wide topic distributions . Since the Wikipedia dataset is large and continually growing with new articles, it is not practical to carry out this task over the whole dataset. Instead, we scrape the corpus from Wikipedia in a streaming manner and sample parameters based on minibatches of data. Following the approach in , we first analytically marginalize the document distributions and, to resolve the boundary issue posed by the Dirichlet posterior of defined on the probability simplex, use an expanded mean parameterization shown in Figure 3(upper left). Under this parameterization, we then compute and, in our implementation, use boundary reflection to ensure the positivity of parameters . The necessary expectation over word-specific topic indicators is approximated using Gibbs sampling separately on each document, as in . The Supplement contains further details.
For all the methods, we report results of three random runs. When sampling distributions with mass concentrated over small regions, as in this application, it is important to incorporate geometric information via a Riemannian sampler . The results in Fig. 3(right) indeed demonstrate the importance of Riemannian variants of the stochastic gradient samplers. However, there also appears to be some benefits gained from the incorporation of the HMC term for both the Riemmannian and non-Reimannian samplers. The average runtime for the different methods are similar (see Fig. 3(lower left)) since the main computational bottleneck is the gradient evaluation. Overall, this application serves as an important example of where our newly proposed sampler can have impact.
We presented a general recipe for devising MCMC samplers based on continuous Markov processes. Our framework constructs an SDE specified by two matrices, a positive semidefinite and a skew-symmetric . We prove that for any and , we can devise a continuous Markov process with a specified stationary distribution. We also prove that for any continuous Markov process with the target stationary distribution, there exists a and that cast the process in our framework. Our recipe is particularly useful in the more challenging case of devising stochastic gradient MCMC samplers. We demonstrate the utility of our recipe in “reinventing” previous stochastic gradient MCMC samplers, and in proposing our SGRHMC method. The efficiency and scalability of the SGRHMC method was shown on simulated data and a streaming Wikipedia analysis.
This work was supported in part by ONR Grant N00014-10-1-0746, NSF CAREER Award IIS-1350133, and the TerraSwarm Research Center sponsored by MARCO and DARPA. We also thank Mr. Lei Wu for helping with the proof of Theorem 2 and Professors Ping Ao and Hong Qian for many discussions.
S. Ahn, A. Korattikara, and M. Welling.
Bayesian posterior sampling via stochastic gradient Fisher
Proceedings of the 29th International Conference on Machine Learning (ICML’12), 2012.
-  S. Ahn, B. Shahbaba, and M. Welling. Distributed stochastic gradient MCMC. In Proceeding of 31st International Conference on Machine Learning (ICML’14), 2014.
-  R. Bardenet, A. Doucet, and C. Holmes. Towards scaling up Markov chain Monte Carlo: An adaptive subsampling approach. In Proceedings of the 30th International Conference on Machine Learning (ICML’14), 2014.
-  M. Betancourt. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In Proceedings of the 31th International Conference on Machine Learning (ICML’15), 2015.
-  D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March 2003.
-  T. Chen, E.B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceeding of 31st International Conference on Machine Learning (ICML’14), 2014.
-  N. Ding, Y. Fang, R. Babbush, C. Chen, R.D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems 27 (NIPS’14). 2014.
-  S. Duane, A.D. Kennedy, B.J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216 – 222, 1987.
Introduction to Probability Theory and its Applications. John Wiley & Sons, 1950.
-  M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society Series B, 73(2):123–214, 03 2011.
-  A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the 30th International Conference on Machine Learning (ICML’14), 2014.
-  R.M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.
-  S. Patterson and Y.W. Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems 26 (NIPS’13). 2013.
-  H. Risken and T. Frank. The Fokker-Planck Equation: Methods of Solutions and Applications. Springer, 1996.
-  H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 09 1951.
-  J. Shi, T. Chen, R. Yuan, B. Yuan, and P. Ao. Relation of a new interpretation of stochastic differential equations to Itô process. Journal of Statistical Physics, 148(3):579–590, 2012.
-  M. Welling and Y.W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML’11), pages 681–688, June 2011.
-  T. Xifara, C. Sherlock, S. Livingstone, S. Byrne, and M. Girolami. Langevin diffusions and the Metropolis-adjusted Langevin algorithm. Statistics & Probability Letters, 91:14–19, 2014.
-  L. Yin and P. Ao. Existence and construction of dynamical potential in nonequilibrium processes without detailed balance. Journal of Physics A: Mathematical and General, 39(27):8593, 2006.
-  R. Zwanzig. Nonequilibrium Statistical Mechanics. Oxford University Press, 2001.
Supplementary Material: A Complete Recipe for Stochastic Gradient MCMC
Appendix A Proof of Stationary Distribution
The proof is a re-writing of Eq. (S.1):
We can further decompose the second term as follows
The second equality follows because due to anti-symmetry of . Putting these back into the formula, we get
We can then verify that is invariant under the compact form by calculating
This completes the proof of Theorem 1.
The above proof follows directly from  and is provided here for readers’ convenience.
Appendix B Proof of Completeness
In this section, we provide a constructive proof for Theorem 2, the existence of .
The proof is first outlined as follows:
We transform the above equation and its constraint into the frequency domain and obtain a set of linear equations.
Then we construct a solution to the linear equations and use inverse Fourier transform to obtain.
The complete procedure is:
Multiplying on both sides of Eq. (3) in the main paper, and noting that:
we arrive at:
The equation for can now be written as:
Recall that the Fokker-Planck equation for the stochastic process, Eq. (2), is:
The nice forms of Eqs. (S.4) and (S.6) imply that the questions can be transformed into a linear algebra problem once we apply a Fourier transform to them. Denote the Fourier transform of as ; and Fourier transform of as , where is the set of the spectral variables. That is:
Then, is transformed to , and Eq. (S.4) becomes the following equivalent form in Fourier space:
Hence, it is clear that matrix must be a skew-symmetric projection matrix from the span of to the span of , where and are always orthogonal to each other. We thereby construct as combination of two rank projection matrices:
We arrive at the final result that matrix is equal to times the inverse Fourier transform of :
Thus, if belongs to the space of , then any continuous time Markov process, Eq. (2), can be turned into this new formulation. ∎
Entries in the skew-symmetric projector constructed here are real.
Denote , then the inverse Fourier transform of along the partial variable is equal to:
where is the Heaviside function. Because is an even function in , , its total inverse Fourier transform is real.
Therefore, the inverse Fourier transform of is the convolution of two real functions.
Appendix C 2-D Case as a Simple Intuitive Example of the Construction
For 2-dimensional systems, we have:
and hence Eq. (S.9) has a simple form: