Dynamics-based Markov chain Monte Carlo (MCMC) methods in Bayesian inference have drawn great attention because of their wide applicability, efficiency in drawing samples, and scalability for large-scale datasets(Neal et al., 2011; Welling & Teh, 2011; Chen et al., 2014, 2016; Li et al., 2019). They draw samples by simulating a continuous-time dynamics, or more precisely, a diffusion process, that keeps the target distribution invariant. However, they often exhibit slow empirical convergence and relatively small effective sample size, due to a positive sample correlation. An alternative kind of methods, called particle-based variational inference methods (ParVIs), aim to deterministically update samples, or particles as they call them, so that the particle distribution minimizes the KL divergence to the target distribution. They fully exploit the approximation ability of a set of particles by imposing an interaction among them, so they are more particle-efficient. Optimization-based principle also makes them convergence faster. Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is the most famous representative, and the field is under an active development both in theory (Liu, 2017; Chen et al., 2018b, a; Liu et al., 2018) and application (Liu et al., 2017; Pu et al., 2017; Zhuo et al., 2018; Yoon et al., 2018).
The study on the relation between the two families starts from their interpretations on the 2-Wasserstein space supported on some smooth manifold (Villani, 2008; Ambrosio et al., 2008). It is defined as the space of distributions
is a probability measure onand
with the well-known 2-Wasserstein distance. It is very general yet still has necessary structures. With its canonical metric, the gradient flow (steepest descending curves) of the KL divergence is defined. It is well-known that the Langevin dynamics (LD) (Roberts et al., 1996; Welling & Teh, 2011), a particular type of dynamics in MCMC, simulates the gradient flow on (Jordan et al., 1998). Recent analysis reveals that existing ParVIs also simulate the gradient flow (Chen et al., 2018a; Liu et al., 2018), so they simulate the same dynamics as LD. However, besides LD, there are many more types of dynamics in the MCMC field that could converge faster and produce more effective samples (Neal et al., 2011; Chen et al., 2014; Ding et al., 2014), but no ParVI yet simulates them. These more general MCMC dynamics have not been understood as a process on the Wasserstein space , and this poses an obstacle towards a ParVI simulation. On the other hand, the convergence behavior of LD becomes clear when viewing LD as the gradient flow on (Cheng & Bartlett, 2017), which leads a distribution to the target steepestly in terms of KL divergence. However, such knowledge on other MCMC dynamics remains obscure, except a few. In fact, a general MCMC dynamics is only guaranteed to keep the target distribution invariant (Ma et al., 2015), but unnecessarily drives a distribution towards the target steepestly. So it is hard for the gradient flow formulation to cover general MCMC dynamics.
In this work, we propose a theoretical framework that gives a unified view of general MCMC dynamics on the Wasserstein space . We establish the framework by two generalizations over the concept of gradient flow towards a wide coverage: (a) we introduce a novel concept called fiber-Riemannian manifold , where only the Riemannian structure on each fiber (roughly a decomposed submanifold, or a slice of ) is required, and we develop the novel notion of fiber-gradient flow on its Wasserstein space ; (b) we also endow a Poisson structure to the manifold and exploit the corresponding Hamiltonian flow on . Combining both explorations, we define a fiber-Riemannian Poisson (fRP) manifold and a fiber-gradient Hamiltonian (fGH) flow on its Wasserstein space . We then show that any regular MCMC dynamics is the fGH flow on the Wasserstein space of an fRP manifold , and there is a correspondence between the dynamics and the structure of the fRP manifold .
This unified framework gives a clear picture on the behavior of MCMC dynamics. The Hamiltonian flow conserves the KL divergence to the target distribution, while the fiber-gradient flow minimizes it on each fiber, driving each conditional distribution to meet the corresponding conditional target. The target invariant requirement is recovered in which case the fiber-gradient is zero, and moreover, we recognize that the fiber-gradient flow acts as a stabilizing force on each fiber. It enforces convergence fiber-wise, making the dynamics in each fiber robust to simulation with the noisy stochastic gradient, which is crucial for large datasets. This generalizes the discussion of Chen et al. (2014) and Betancourt (2015) on Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal et al., 2011; Betancourt, 2017) to general MCMCs. In our framework, different MCMCs correspond to different fiber structures thus flow components. They can be categorized into three types, each of which has its particular behavior. We make a unified study on 17 existing MCMCs under the three types.
Our framework also bridges the fields of MCMCs and ParVIs, so that on one hand, the gate to the reservoir of MCMC dynamics is opened to the ParVI family and abundant dynamics are enabled beyond LD, and on the other hand, MCMC dynamics can be now simulated in the ParVI flavor, inheriting advantages like particle-efficiency. As an example, we develop two ParVI simulation methods for stochastic gradient Hamiltonian Monte Carlo (SGHMC) (Chen et al., 2014). We demonstrate the merits of using SGHMC dynamics over LD in the ParVI field, and ParVI advantages over conventional stochastic simulation in MCMC.
Related work Ma et al. (2015) give a complete recipe on general MCMC dynamics. Their formulation guarantees the target invariant principle, but leaves the behavior of these dynamics unexplained. Recent analysis towards a broader kind of dynamics via Fokker-Planck equation (Kondratyev & Vorotnikov, 2017; Bruna et al., 2017) is still within the gradient flow formulation, thus not general enough.
On connecting MCMC and ParVI, Chen et al. (2018a) explore the correspondence between LD and Wasserstein gradient flow, and develop new implementations for dynamics simulation. However, their consideration is still confined on LD, leaving more general MCMC dynamics untouched. Gallego & Insua (2018) formulate the dynamics of SVGD as a particular kind of MCMC dynamics, but no existing MCMC dynamics is recognized as a ParVI. More recently, Taghvaei & Mehta (2018) derive an accelerated ParVI that is similar to one of our ParVI simulations of SGHMC. The derivation does not utilize the dynamics and the method connects to SGHMC only algorithmically. Our theory solidates our ParVI simulations of SGHMC, and enables extensions to much more dynamics.
We first introduce the recipe for general MCMC dynamics (Ma et al., 2015), and prior knowledge on flows on a smooth manifold and its Wasserstein space .
A smooth manifold is a topological space that locally behaves like an Euclidean space. Since the recipe describes a general MCMC dynamics in an Euclidean space , it suffices to only consider that is globally diffeomorphic to , which is its global coordinate system. For brevity we use the same notation for a point on
and its coordinates due to their equivalence. A tangent vectorat can be viewed as the differentiation along the curve that is tangent to at , so can be expressed in components under basis of the tangent space at . The cotangent space at is the dual space of , and the cotangent bundle is the union . We adopt Einstein convention to omit the summation symbol for a pair of repeated indices in super- and sub-scripts (e.g. ). We assume the target distribution absolutely continuous so that we have its density function .
2.1 The Complete Recipe of MCMC Dynamics
The fundamental requirement on MCMCs is that the target distribution is kept stationary under the MCMC dynamics. Ma et al. (2015) give a general recipe for such a dynamics expressed as a diffusion process in an Euclidean space :
for any positive semi-definite matrix
(diffusion matrix) and any skew-symmetric matrix(curl matrix), where denotes standard Brownian motion in . The term represents a deterministic drift and denotes a stochastic diffusion. It is also shown that if is positive definite, is the unique stationary distribution. Moreover, the recipe is complete, i.e., any diffusion process with stationary can be cast into this form.
The recipe gives a universal view and a unified way to analyze MCMCs. In large scale Bayesian inference tasks, the stochastic gradient (SG), a noisy estimate of
on a randomly selected data mini-batch, is crucially desired for data scalability. The dynamics is compatible with SG, since the variance of the drift is of higher order of the diffusion part(Ma et al., 2015; Chen et al., 2015). In many MCMC instances, is taken as an augmentation of the target variable by an auxiliary variable . This could encourage the dynamics to explore a broader area to reduce sample autocorrelation and improve efficiency (e.g. Neal et al. (2011); Ding et al. (2014); Betancourt et al. (2017)).
2.2 Flows on a Manifold
The mathematical concept of the flow associated to a vector field on is a set of curves on , , such that the curve through any point satisfies and that its tangent vector at , , coincides with the vector . For any vector field, its flow exists at least locally (Do Carmo (1992), Sec. 0.5). We introduce two particular kinds of flows for our concern.
2.2.1 Gradient Flows
We consider the gradient flow on induced by a Riemannian structure (e.g. Do Carmo (1992)), which gives an inner product in each tangent space . Expressed in coordinates, , and the matrix is required to be symmetric (strictly) positive definite. The gradient of a smooth function on can then be defined as the steepest ascending direction and is expressed in coordinates as:
where is the entry of the inverse matrix of . It is a vector field and determines a gradient flow.
On , a Riemannian structure can be equipped for a Riemannian support (Otto, 2001; Villani, 2008; Ambrosio et al., 2008). The tangent space at is recognized as (Villani (2008), Thm. 13.8; Ambrosio et al. (2008), Thm. 8.3.1):
where is the set of compactly supported smooth functions on , is the Hilbert space with inner product , and the overline means closure. The tangent space inherits an inner product from , which defines a Riemannian structure on and is consistent with the Wasserstein distance (Benamou & Brenier, 2000). With this structure, the gradient of the KL divergence is given explicitly (Villani (2008), Formula 15.2, Thm. 23.18):
Noting that is a linear subspace of the Hilbert space , an orthogonal projection can be uniquely defined. For any , is the unique vector in such that (Ambrosio et al. (2008), Lem 8.4.2), where is the divergence on and when is the density w.r.t. the Lebesgue measure of the coordinate space . The projection can also be explained with a physical intuition. Let be a vector field on
, and let its flow act on the random variableof . The transformed random variable specifies a distribution , and a distribution curve is then induced by . The tangent vector of such at is exactly .
2.2.2 Hamiltonian Flows
Hamiltonian flow is an abstraction of the Hamiltonian dynamics in classical mechanics (Marsden & Ratiu, 2013). It is defined in association to a Poisson structure on a manifold (Fernandes & Marcut (2014)), which can be expressed either as a Poisson bracket , or equivalently as a bivector field via the relation . Expressed in coordinates, , where the matrix is required to be skew-symmetric and satisfy:
The Hamiltonian vector field of a smooth function on is defined as , with coordinate expression:
A Hamiltonian flow is then determined by . Its key property is that it conserves : is constant w.r.t. . The Hamiltonian flow may be more widely known on a symplectic manifold or more particularly a cotangent bundle (e.g. Da Silva (2001); Marsden & Ratiu (2013)), but these cases are not general enough for our purpose (e.g. they require to be even-dimensional).
On , a Poisson structure can be induced by the one of . Consider linear functions on in the form for . A Poisson bracket for these linear functions can be defined as (e.g. Lott (2008), Sec. 6; Gangbo et al. (2010), Sec. 7.2):
This bracket can be extended for any smooth function by its linearization at , i.e. a linear function such that . The extended bracket is then given by (Gangbo et al. (2010), Rem. 7.8), where , are the linearizations of smooth functions , at . The Hamiltonian vector field of is then identified as (Gangbo et al. (2010), Sec. 7.2):
On the same topic, Ambrosio & Gangbo (2008) study the existence and simulation of the Hamiltonian flow on for as a symplectic Euclidean space, and verify the conservation of Hamiltonian under certain conditions. Gangbo et al. (2010) investigate the Poisson structure on the algebraic dual , a superset of , and find that the canonical Poisson structure induced by the Lie structure of coincides with Eq. (8). Their consideration is also for symplectic Euclidean , but the procedures and conclusions can be directly adapted to Riemannian Poisson manifolds. Lott (2008) considers the Poisson structure Eq. (8) on the space of smooth distributions on a Poisson manifold , and find that it is the restriction of the Poisson structure of by Gangbo et al. (2010).
3 Understanding MCMC Dynamics as Flows on the Wasserstein Space
This part presents our main discovery that connects MCMC dynamics and flows on the Wasserstein space . We first work on the two concepts and introduce novel concepts for preparation, then propose the unified framework and analyze existing MCMC instances under the framework.
3.1 Technical Development
We excavate into MCMC and Wasserstein flows and introduce novel concepts in preparation for the framework.
On the MCMC side Noting that flows on are deterministic while MCMCs involve stochastic diffusion, we first reformulate MCMC dynamics as an equivalent deterministic one for unification. Here we say two dynamics are equivalent if they produce the same distribution curve. [Equivalent deterministic MCMC dynamics] The MCMC dynamics Eq. (2) with symmetric diffusion matrix is equivalent to the deterministic dynamics in :
where is the distribution density of at time . Proof is provided in Appendix A.1. For any , the projected vector field can be treated as a tangent vector at , so defines a vector field on . In this way, we give a first view of an MCMC dynamics as a Wasserstein flow. An equivalent flow with a richer structure will be given in Theorem 2.
This expression also helps understanding Barbour’s generator (Barbour, 1990) of an MCMC dynamics, which can be used in Stein’s method (Stein et al., 1972) of constructing distribution metrics. For instance the standard Langevin dynamics induces the Stein’s operator, and it in turn produces a metric called the Stein discrepancy (Gorham & Mackey, 2015), which inspires SVGD, and Liu & Zhu (2017) consider the Riemannian counterparts. The Barbour’s generator maps a function to another , where obeys initial condition (Dirac measure). In terms of the linear function on , we recognize as the directional derivative of along at . This knowledge gives the expression
which meets existing results (e.g. Gorham et al. (2016), Thm. 2). Details are provided in Appendix A.2.
On the Wasserstein flow side We deepen the knowledge on flows on with a Riemannian and Poisson structure of .111 We do not consider the compatibility of the Riemannian and Poisson structure so it is different from a Kähler manifold. The gradient of is given by Eq. (5), but its Hamiltonian vector field is not directly available due to its non-linearity. We first develop an explicit expression for it. [Hamiltonian vector field of KL on ] Let be the bivector field form of a Poisson structure on and endowed with the induced Poisson structure described in Section 2.2.2. Then the Hamiltonian vector field of on is
Proof is provided in Appendix A.3. Note that the projection does not make much difference recalling and produce the same distribution curve through .
For a wider coverage of our framework on MCMC dynamics, we introduce a novel concept called fiber-Riemannian manifold and develop associated objects. This notion generalizes Riemannian manifold, such that the non-degenerate requirement of the Riemannian structure is relaxed. [Fiber-Riemannian manifold] We say that a manifold is a fiber-Riemannian manifold if it is a fiber bundle and there is a Riemannian structure on each fiber.
See Fig. 1 for illustration. Roughly, (of dimension ) is a fiber bundle if there are two smooth manifolds (of dimension ), (of dimension ) such that is locally equivalent to the product space (e.g. Nicolaescu (2007), Def. 2.1.21). Denoting the surjective projection as , the fiber through is defined as the submanifold , which is diffeomorphic to . The coordinate of can be decomposed with this structure: where is the coordinate of and of . Elements in have the same part. We allow or to be zero.
A fiber-Riemannian manifold furnish each fiber with a Riemannian structure that has coordinate expression . It defines a gradient on fiber with coordinate expression . Taking the union over all fibers, we define a vector field on called fiber-gradient given a function on , whose coordinate expression is , where
Note that is tangent to the fiber and its flow moves points within each fiber. We denote the fiber-Riemannian manifold as . It is not a Riemannian manifold for since is singular.
We define a fiber bundle as the manifold that is locally equivalent to . A similar structure can be induced on it. Each of its fiber, , has a Riemannian structure induced by the one of (see Section 2.2.1), and the gradient of the function on evaluated at is the vector field on fiber (see Eq. (5)). So the fiber-gradient of on evaluated at is:
where the last equality holds since only the derivative w.r.t. survives after multiplication with and . After projection by , is a vector field on . Note that we cannot develop the fiber-gradient directly on since it is locally equivalent to thus not a fiber-Riemannian manifold.
3.2 The Unified Framework
We introduce a regularity assumption on MCMC dynamics that our unified framework considers. It is satisfied by almost all existing MCMCs and its relaxation will be discussed at the end of this section.
Assumption (Regular MCMC dynamics).
Now we formally state our unified framework, with an illustration provided in Fig. 2. [Unified framework: equivalence between regular MCMC dynamics and fGH flows on ] We call a fiber-Riemannian Poisson (fRP) manifold, and define the fiber-gradient Hamiltonian (fGH) flow on as the flow induced by the vector field
Then: (a) Any regular MCMC dynamics on targeting is equivalent to the fGF flow on for a certain fRP manifold ; (b) Conversely, for any fRP manifold , the fGF flow on is equivalent to a regular MCMC dynamics targeting in the coordinate space of ; (c) More precisely, in both cases, the coordinate expression of the fiber-Riemannian structure and Poisson structure of coincide respectively with the diffusion matrix and the curl matrix of the regular MCMC dynamics. The idea of proof is to show ( defined in Lemma 3.1) at any so that the two vector fields produce the same evolution rule of distribution. Proof details are presented in Appendix A.4.
This formulation unifies regular MCMC dynamics and flows on the Wasserstein space, and provides a direct explanation on the behavior of general MCMC dynamics. The fundamental requirement on MCMCs that the target distribution is kept stationary, turns obvious in our framework: . The Hamiltonian flow conserves (difference to ), while encourages efficient exploration in the sample space that helps faster convergence and lower autocorrelation (Betancourt et al., 2017). The fiber-gradient flow minimizes on each fiber (with ), driving to and enforcing convergence. Specification of this general behavior is discussed below.
3.3 Existing MCMCs under the Unified Framework
Now we make detailed analysis on existing MCMC methods under our unified framework. Depending on the diffusion matrix , they can be categorized into three types. Each type has a particular fiber structure of the corresponding fRP manifold, thus a particular behavior of the dynamics.
Type 1: is non-singular ( in Eq. (13)).
In this case, the corresponding degenerates and itself is the unique fiber, so is a Riemannian manifold with structure . The fiber-gradient flow on becomes the gradient flow on so
which indicates the convergence of the dynamics: the Hamiltonian flow conserves while the gradient flow minimizes on steepestly, so they jointly minimize monotonically, leading to the unique minimizer . This meets the conclusion in Ma et al. (2015).
The Langevin dynamics (LD) (Roberts et al., 1996), used in both full-batch (Roberts & Stramer, 2002) and stochastic gradient (SG) simulation (Welling & Teh, 2011), falls into this class. Its curl matrix makes its fGH flow comprise purely the gradient flow, allowing a rich study on its convergence (e.g. Durmus & Moulines (2016); Cheng & Bartlett (2017)). Its Riemannian version (Girolami & Calderhead, 2011) chooses as the inverse Fisher metric so that is the distribution manifold in information geometry (Amari, 2016). Patterson & Teh (2013) further explore the simulation with SG.
Type 2: ( in Eq. (13)).
In this case, and fibers degenerate. The fGF flow comprises purely the Hamiltonian flow , which conserves and helps distant exploration. We note that under this case, the decrease of is not guaranteed, so care must be taken in simulation. Particularly, this type of dynamics cannot be simulated with parallel chains unless samples initially distribute as , so they are not suitable for ParVI simulation. The lack of a stabilizing force in the dynamics also explains their vulnerability in face of SG, where the noisy perturbation is uncontrolled. This generalizes the discussion on HMC by Chen et al. (2014) and Betancourt (2015) to dynamics of this type.
The Hamiltonian dynamics (e.g. Marsden & Ratiu (2013), Chap. 2) that HMC simulates is a representative of this kind. To sample from a distribution on manifold of dimension , variable is augmented with a vector called momentum. In our framework, this is to take as the cotangent bundle , whose canonical Poisson structure corresponds to . A conditional distribution is chosen for an augmented target distribution . HMC produces more effective samples than LD with the help of the Hamiltonian flow (Betancourt et al., 2017). As we mentioned, the dynamics of HMC cannot guarantee convergence, so it relies on the ergodicity of its simulation for convergence (Livingstone et al., 2016; Betancourt, 2017). It is simulated in a deliberated way: the second-order symplectic leap-frog integrator is employed, and is successively redrew from .
HMC considers Euclidean and chooses Gaussian , while Zhang et al. (2016) take
as the monomial Gamma distribution. On Riemannian, is chosen as , i.e. the standard Gaussian in the cotangent space (Girolami & Calderhead, 2011). Byrne & Girolami (2013) simulate the dynamics for manifolds with no global coordinates, and Lan et al. (2015) take the Lagrangian form for better simulation, which uses velocity (tangent vector) in place of momentum (covector).
Type 3: and is singular ( in Eq. (13)).
In this case, both the Hamiltonian and fiber-gradient flows take effect. The fiber-gradient flow stabilizes the dynamics only on each fiber , but this is enough for most SG-MCMCs since SG only appears on each fiber.
SGHMC (Chen et al., 2014) is the first instance of this type. Similar to the Hamiltonian dynamics, it takes and shares the same , but its is in the form in Assumption 3.2 with a constant , whose inverse defines a Riemannian structure in every fiber . Viewed in our framework, this makes the fiber bundle structure of coincides with the one of : , , and . Using Lemma 3.1, with a specified , we derive its equivalent deterministic dynamics:
We note that it adds the dynamics to the Hamiltonian dynamics. This added dynamics is essentially the fiber-gradient flow on (Eq. (14)), or the gradient flow on fiber , which pushes towards . In presence of SG, the dynamics for is unaffected, but for in each fiber, a fluctuation is introduced due to the noisy estimate of , which will mislead . The fiber-gradient compensates this by guiding to the correct target, making the dynamics robust to SG.
Another famous example of this kind is the SG Nosé-Hoover thermostats (SGNHT) (Ding et al., 2014). It further augments with the thermostats to better balance the SG noise. In terms of our framework, the thermostats augments , and the fiber is the same as SGHMC.
Both SGHMC and SGNHT choose , while SG monomial Gamma thermostats (SGMGT) (Zhang et al., 2017) uses monomial Gamma, and Lu et al. (2016) choose according to a relativistic energy function to adapt the scale in each dimension. Riemannian extensions of SGHMC and SGNHT on are explored by Ma et al. (2015) and Liu et al. (2016). Viewed in our framework, they induce a Riemannian structure in each fiber .
Discussions Due to the linearity of the equivalent systems (2), (10), (15) w.r.t. , or , , MCMC dynamics can be combined. From the analysis above, SGHMC can be seen as the combination of the Hamiltonian dynamics on the cotangent bundle and the LD in each fiber (cotangent space ). As another example, Zhang et al. (2017) combine SGMGT of Type 3 with LD of Type 1, creating a Type 1 method that decreases on the entire manifold instead of each fiber. This improves the convergence, which meets their empirical observation.
Assumption 3.2(a) is satisfied by all the mentioned MCMC dynamics, and Assumption 3.2(b) is also satisfied by all except SGNHT related dynamics. On this exception, we note from the derivation of Theorem 2 that, Assumption 3.2(b) is only required for thus to be a Poisson manifold, but is not used in the deduction afterwards. Definition of a Hamiltonian vector field and its key property could also be established without the assumption, so it is possible to extend the framework under a more general mathematical concept that relaxes Assumption 3.2(b). Assumption 3.2(a) could also be hopefully relaxed by an invertible transformation from any positive semi-definite into the required form, effectively converting the dynamics into an equivalent regular one. We leave further investigations as future work.
4 Simulation as ParVIs
The unified framework (Theorem 2) recognizes an MCMC dynamics as an fGH flow on the Wasserstein space of an fRP manifold , expressed in Eq. (15) explicitly. Lemma 3.1 gives another equivalent dynamics that leads to the same flow on . These findings enable us to simulate these flow-based dynamics for an MCMC method, using existing finite-particle flow simulation methods in the ParVI field. This hybrid of ParVI and MCMC largely extends the ParVI family with various dynamics, and also gives advantages like particle-efficiency to MCMCs.
The key problem in simulating these flow-based dynamics with finite particles is that the density is unknown. Liu et al. (2018) give a summary on the solutions in the ParVI field, and find that they are all based on a smoothing treatment, in a certain formulation of either smoothing density or smoothing functions. Here we adopt the Blob method (Chen et al., 2018a) that smoothes density. With a set of particles of , Blob makes the following approximation with a kernel function for :
where . Approximation for can be established in a similar way. Note that the gradient at points outwards from , so the estimation effectively poses a repulsive interaction among particles, similar to the behavior of SVGD (Liu & Wang, 2016). The vanilla SGHMC simulates dynamics (18) with replaced by , but dynamics (19) cannot be simulated in a similar stochastic way. More discussions are provided in Appendix B.
)), respectively (“p” for “particle” and “det” for “deterministic”). Compared to the vanilla SGHMC, the proposed methods could converge faster and be more particle-efficient with deterministic update and explicit repulsive interaction. On the other hand, SGHMC could make a more efficient exploration and converges faster than LD, so the behavior also holds for the corresponding ParVI simulations, i.e., our methods could speed up over Blob. One may note that pSGHMC-det resembles a direct application of stochastic gradient descent with momentum (SGDM)(Sutskever et al., 2013) to Blob, but we stress that this derivation is not theoretically sound since Blob optimizes on the infinite-dimensional manifold while SGDM is only for finite-dimensional . Moreover, the two methods can be nourished with advanced techniques in the ParVI field. This includes the HE bandwidth selection method and acceleration frameworks by Liu et al. (2018), and other approximations to like SVGD and GFSD/GFSF (Liu et al., 2018).
Detailed settings are provided in Appendix C.
5.1 Synthetic Experiment
We show in Fig. 3 the equivalence of various dynamics simulations, and the advantages of pSGHMC-det and pSGHMC-fGH. We first find that all methods eventually produce properly distributed particles, demonstrating their equivalence. For ParVI methods, both proposed methods (Rows 3, 4) converge faster than Blob (Row 1), indicating the benefit of using SGHMC dynamics over LD, where the momentum accumulates in the vertical direction. For the same SGHMC dynamics, we see that our ParVI versions (Rows 3, 4) converge faster than the vanilla stochastic version (Row 2), due to the deterministic update rule. Moreover, pSGHMC-fGF (Row 4) enjoys the HE bandwidth selection method (Liu et al., 2018) for ParVIs, which makes the particles neatly and regularly aligned thus more representative for the distribution. pSGHMC-det (Row 3) does not benefit much from HE since the density on particles, , is not directly used in the dynamics (18).
5.2 Latent Dirichlet Allocation (LDA)
We study the advantages of our pSGHMC methods in the real-world task of posterior inference for LDA. We follow the same settings as Liu et al. (2018) and Chen et al. (2014). We see from Fig. 4(a) the saliently faster convergence over Blob, benefited from the usage of SGHMC dynamics in the ParVI field. Particle-efficiency is compared in Fig. 4(b), where we find the better results of pSGHMC methods over vanilla SGHMC under a same particle size. This demonstrates the advantage of ParVI simulation of MCMC dynamics, where particle interaction is directly considered to make full use of a set of particles.
5.3 Bayesian Neural Networks (BNNs)
We investigate our methods in the supervised task of training BNNs. We follow the settings of Chen et al. (2014) with slight modification explained in Appendix. Results in Fig. 5 is consistent with our claim: pSGHMC methods converge faster than Blob due to the usage of SGHMC dynamics. Their slightly better particle-efficiency can also be observed.
We construct a theoretical framework that connects general MCMC dynamics with flows on the Wasserstein space. By introducing novel concepts, we find that a regular MCMC dynamics corresponds to an fGH flow for an fRP manifold. The framework gives a clear picture on the behavior of various MCMC dynamics, and also enables ParVI simulation of MCMC dynamics. We group existing MCMC dynamics into 3 types under the framework and analyse their behavior, and develop two ParVI methods of SGHMC dynamics. We empirically demonstrate the faster convergence by more general MCMC dynamics for ParVIs, and particle-efficiency by ParVI simulation for MCMCs.
- Amari (2016) Amari, S.-I. Information geometry and its applications. Springer, 2016.
- Ambrosio & Gangbo (2008) Ambrosio, L. and Gangbo, W. Hamiltonian odes in the wasserstein space of probability measures. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 61(1):18–53, 2008.
- Ambrosio et al. (2008) Ambrosio, L., Gigli, N., and Savaré, G. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
- Barbour (1990) Barbour, A. D. Stein’s method for diffusion approximations. Probability theory and related fields, 84(3):297–322, 1990.
- Benamou & Brenier (2000) Benamou, J.-D. and Brenier, Y. A computational fluid mechanics solution to the monge-kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
The fundamental incompatibility of scalable hamiltonian monte carlo
and naive data subsampling.
International Conference on Machine Learning, pp. 533–540, 2015.
- Betancourt (2017) Betancourt, M. A conceptual introduction to hamiltonian monte carlo. arXiv preprint arXiv:1701.02434, 2017.
- Betancourt et al. (2017) Betancourt, M., Byrne, S., Livingstone, S., Girolami, M., et al. The geometric foundations of hamiltonian monte carlo. Bernoulli, 23(4A):2257–2298, 2017.
- Bruna et al. (2017) Bruna, M., Burger, M., Ranetbauer, H., and Wolfram, M.-T. Asymptotic gradient flow structures of a nonlinear fokker-planck equation. arXiv preprint arXiv:1708.07304, 2017.
- Byrne & Girolami (2013) Byrne, S. and Girolami, M. Geodesic monte carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845, 2013.
- Chen et al. (2015) Chen, C., Ding, N., and Carin, L. On the convergence of stochastic gradient mcmc algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pp. 2278–2286, 2015.
- Chen et al. (2016) Chen, C., Ding, N., Li, C., Zhang, Y., and Carin, L. Stochastic gradient mcmc with stale gradients. In Advances in Neural Information Processing Systems, pp. 2937–2945, 2016.
- Chen et al. (2018a) Chen, C., Zhang, R., Wang, W., Li, B., and Chen, L. A unified particle-optimization framework for scalable bayesian sampling. arXiv preprint arXiv:1805.11659, 2018a.
- Chen et al. (2014) Chen, T., Fox, E., and Guestrin, C. Stochastic gradient hamiltonian monte carlo. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1683–1691, 2014.
- Chen et al. (2018b) Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C. J. Stein points. arXiv preprint arXiv:1803.10161, 2018b.
- Cheng & Bartlett (2017) Cheng, X. and Bartlett, P. Convergence of langevin mcmc in kl-divergence. arXiv preprint arXiv:1705.09048, 2017.
- Da Silva (2001) Da Silva, A. C. Lectures on symplectic geometry, volume 3575. Springer, 2001.
- Ding et al. (2014) Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., and Neven, H. Bayesian sampling using stochastic gradient thermostats. In Advances in neural information processing systems, pp. 3203–3211, 2014.
- Do Carmo (1992) Do Carmo, M. P. Riemannian Geometry. 1992.
- Duane et al. (1987) Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid monte carlo. Physics Letters B, 195(2):216–222, 1987.
- Durmus & Moulines (2016) Durmus, A. and Moulines, E. High-dimensional bayesian inference via the unadjusted langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.
- Fernandes & Marcut (2014) Fernandes, R. L. and Marcut, I. Lectures on Poisson Geometry. Springer, 2014.
- Gallego & Insua (2018) Gallego, V. and Insua, D. R. Stochastic gradient mcmc with repulsive forces. arXiv preprint arXiv:1812.00071, 2018.
- Gangbo et al. (2010) Gangbo, W., Kim, H. K., and Pacini, T. Differential forms on Wasserstein space and infinite-dimensional Hamiltonian systems. American Mathematical Soc., 2010.
- Girolami & Calderhead (2011) Girolami, M. and Calderhead, B. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
- Gorham & Mackey (2015) Gorham, J. and Mackey, L. Measuring sample quality with stein’s method. In Advances in Neural Information Processing Systems, pp. 226–234, 2015.
- Gorham et al. (2016) Gorham, J., Duncan, A. B., Vollmer, S. J., and Mackey, L. Measuring sample quality with diffusions. arXiv preprint arXiv:1611.06972, 2016.
- Jordan et al. (1998) Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the fokker–planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
- Kondratyev & Vorotnikov (2017) Kondratyev, S. and Vorotnikov, D. Nonlinear fokker-planck equations with reaction as gradient flows of the free energy. arXiv preprint arXiv:1706.08957, 2017.
- Lan et al. (2015) Lan, S., Stathopoulos, V., Shahbaba, B., and Girolami, M. Markov chain monte carlo from lagrangian dynamics. Journal of Computational and Graphical Statistics, 24(2):357–378, 2015.
Li et al. (2019)
Li, C., Chen, C., Pu, Y., Henao, R., and Carin, L.
Communication-efficient stochastic gradient mcmc for neural networks.2019.
- Liu & Zhu (2017) Liu, C. and Zhu, J. Riemannian stein variational gradient descent for bayesian inference. arXiv preprint arXiv:1711.11216, 2017.
- Liu et al. (2016) Liu, C., Zhu, J., and Song, Y. Stochastic gradient geodesic mcmc methods. In Advances In Neural Information Processing Systems, pp. 3009–3017, 2016.
- Liu et al. (2018) Liu, C., Zhuo, J., Cheng, P., Zhang, R., Zhu, J., and Carin, L. Accelerated first-order methods on the wasserstein space for bayesian inference. arXiv preprint arXiv:1807.01750, 2018.
- Liu (2017) Liu, Q. Stein variational gradient descent as gradient flow. In Advances in neural information processing systems, pp. 3118–3126, 2017.
- Liu & Wang (2016) Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pp. 2378–2386, 2016.
- Liu et al. (2017) Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.
- Livingstone et al. (2016) Livingstone, S., Betancourt, M., Byrne, S., and Girolami, M. On the geometric ergodicity of hamiltonian monte carlo. arXiv preprint arXiv:1601.08057, 2016.
- Lott (2008) Lott, J. Some geometric calculations on wasserstein space. Communications in Mathematical Physics, 277(2):423–437, 2008.
- Lu et al. (2016) Lu, X., Perrone, V., Hasenclever, L., Teh, Y. W., and Vollmer, S. J. Relativistic monte carlo. arXiv preprint arXiv:1609.04388, 2016.
- Ma et al. (2015) Ma, Y.-A., Chen, T., and Fox, E. A complete recipe for stochastic gradient mcmc. In Advances in Neural Information Processing Systems, pp. 2917–2925, 2015.
- Marsden & Ratiu (2013) Marsden, J. E. and Ratiu, T. S. Introduction to mechanics and symmetry: a basic exposition of classical mechanical systems, volume 17. Springer Science & Business Media, 2013.
- Neal et al. (2011) Neal, R. M. et al. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.
- Nicolaescu (2007) Nicolaescu, L. I. Lectures on the Geometry of Manifolds. World Scientific, 2007.
- Otto (2001) Otto, F. The geometry of dissipative evolution equations: the porous medium equation. 2001.
- Patterson & Teh (2013) Patterson, S. and Teh, Y. W. Stochastic gradient riemannian langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pp. 3102–3110, 2013.
- Pu et al. (2017) Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., and Carin, L. Vae learning via stein variational gradient descent. In Advances in Neural Information Processing Systems, pp. 4239–4248, 2017.
- Risken (1996) Risken, H. Fokker-planck equation. In The Fokker-Planck Equation, pp. 63–95. Springer, 1996.
- Roberts & Stramer (2002) Roberts, G. O. and Stramer, O. Langevin diffusions and metropolis-hastings algorithms. Methodology and computing in applied probability, 4(4):337–357, 2002.
- Roberts et al. (1996) Roberts, G. O., Tweedie, R. L., et al. Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
- Santambrogio (2017) Santambrogio, F. Euclidean, metric, and wasserstein gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154, 2017.
- Stein et al. (1972) Stein, C. et al. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.
Sutskever et al. (2013)
Sutskever, I., Martens, J., Dahl, G., and Hinton, G.
On the importance of initialization and momentum in deep learning.In International conference on machine learning, pp. 1139–1147, 2013.
Taghvaei & Mehta (2018)
Taghvaei, A. and Mehta, P. G.
Accelerated gradient flow for probability distributions.ICLR 2019 submission, 2018. URL https://openreview.net/forum?id=HkxCEhAqtQ¬eId=H1lvDj2FCX.
- Villani (2008) Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
- Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.
- Yoon et al. (2018) Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 7343–7353, 2018.
- Zhang et al. (2016) Zhang, Y., Wang, X., Chen, C., Henao, R., Fan, K., and Carin, L. Towards unifying hamiltonian monte carlo and slice sampling. In Advances in Neural Information Processing Systems, pp. 1741–1749, 2016.
- Zhang et al. (2017) Zhang, Y., Chen, C., Gan, Z., Henao, R., and Carin, L. Stochastic gradient monomial gamma sampler. arXiv preprint arXiv:1706.01498, 2017.
- Zhuo et al. (2018) Zhuo, J., Liu, C., Shi, J., Zhu, J., Chen, N., and Zhang, B. Message passing stein variational gradient descent. In International Conference on Machine Learning, pp. 6013–6022, 2018.
A.1. Proof of Lemma 3.1
which reduces to
where we have used the symmetry of and skew-symmetry of in the last equality: and similarly ; so and similarly , .
The deterministic dynamics in the theorem with defined in Eq. (10) induces the curve