DeepAI

Is There an Analog of Nesterov Acceleration for MCMC?

We formulate gradient-based Markov chain Monte Carlo (MCMC) sampling as optimization on the space of probability measures, with Kullback-Leibler (KL) divergence as the objective function. We show that an underdamped form of the Langevin algorithm perform accelerated gradient descent in this metric. To characterize the convergence of the algorithm, we construct a Lyapunov functional and exploit hypocoercivity of the underdamped Langevin algorithm. As an application, we show that accelerated rates can be obtained for a class of nonconvex functions with the Langevin algorithm.

• 17 publications
• 2 publications
• 21 publications
• 35 publications
• 21 publications
• 267 publications
02/04/2020

tfp.mcmc: Modern Markov Chain Monte Carlo Tools Built for Modern Hardware

Markov chain Monte Carlo (MCMC) is widely regarded as one of the most im...
11/17/2017

Techniques for proving Asynchronous Convergence results for Markov Chain Monte Carlo methods

Markov Chain Monte Carlo (MCMC) methods such as Gibbs sampling are findi...
07/27/2019

The Wang-Landau Algorithm as Stochastic Optimization and its Acceleration

We show that the Wang-Landau algorithm can be formulated as a stochastic...
06/15/2022

Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling

We introduce repriorisation, a data-dependent reparameterisation which t...
10/21/2019

In this paper, we explore a general Aggregated Gradient Langevin Dynamic...
08/30/2019

On the robustness of gradient-based MCMC algorithms

We analyse the tension between robustness and efficiency for Markov chai...
06/16/2020

Hessian-Free High-Resolution Nesterov Acceleration for Sampling

We propose an accelerated-gradient-based MCMC method. It relies on a mod...

1 Introduction

While optimization methodology has provided much of the underlying algorithmic machinery that has driven the theory and practice of machine learning in recent years, sampling-based methodology, in particular Markov chain Monte Carlo (MCMC), remains of critical importance, given its role in linking algorithms to statistical inference and, in particular, its ability to provide notions of confidence that are lacking in optimization-based methodology. However, the classical theory of MCMC is largely asymptotic and the theory has not developed as rapidly in recent years as the theory of optimization.

Recently, however, a literature has emerged that derives nonasymptotic rates for MCMC algorithms [see, e.g., 9, 12, 10, 8, 6, 14, 21, 22, 2, 5]. This work has explicitly aimed at making use of ideas from optimization; in particular, whereas the classical literature on MCMC focused on reversible Markov chains, the recent literature has focused on non-reversible stochastic processes that are built on gradients [see, e.g., 18, 20, 3, 1]. In particular, the gradient-based Langevin algorithm [33, 32, 13] has been shown to be a form of gradient descent on the space of probabilities [see, e.g., 36].

What has not yet emerged is an analog of acceleration. Recall that the notion of acceleration has played a key role in gradient-based optimization methods [26]. In particular, the Nesterov accelerated gradient descent (AGD) method, an instance of the general family of “momentum methods,” provably achieves faster convergence rate than gradient descent (GD) in a variety of settings [25]. Moreover, it achieves the optimal convergence rate under an oracle model of optimization complexity in the convex setting [24].

This motivates us to ask: Is there an analog of Nesterov acceleration for gradient-based MCMC algorithms? And does it provably accelerate the convergence rate of these algorithms?

This paper answers these questions in the affirmative by showing that an underdamped form of the Langevin algorithm performs accelerated gradient descent. Critically, our work is based on the use of Kullback-Leibler (KL) divergence as the metric. We build on previous work that has studied the underdamped Langevin algorithm and has used coupling methods to establish convergence of the algorithm in the Wasserstein distance [see, e.g., 8, 7, 11]. Our work establishes a direct linkage between the underdamped Langevin algorithm and Nesterov acceleration by working directly in the objective functional, the KL divergence. Combining ideas from optimization theory and diffusion processes, we construct a Lyapunov functional that couples the convergence in the momentum and the original variables. We then prove the overall convergence rate by leveraging the hypocoercivity structure of the underdamped Langevin algorithm [35]. For target distributions satisfying a log-Sobolev inequality, we find that the underdamped Langevin algorithm accelerates the convergence rate of the classical Langevin algorithm from to in terms of KL divergence (See Theorem 1 for formal statement).

2 Problem Setting

Assume that we wish to sample from a target (posterior) probability density,

, where . Consider the KL divergence to this target :

 F(p)=∫p(θ)ln(p(θ)p∗(θ))dθ.

We use this KL divergence as an objective function in an optimization-theoretic formulation of convergence to .

We assume that satisfies the following conditions.

1. For , the gradient of the function is -Lipschitz and its Hessian -Lipschitz.

That is: ; , and . 111It is worth noting that this definition of Hessian Lipschitzness with Frobenius norm is stronger than that with the spectral norm. We postulate here that the requirement of Hessian Lipschitz condition is an artifact of our Lyapunov functional and can possibly be removed in future work.

2. The target density satisfies a log-Sobolev inequality with constant  [15, 28]. That is, for any smooth function , we have

3. Without loss of generality, for , let and (which can be achieved by shifting the function ). Further assume that the normalization constant is bounded: .

As a concrete example, the above assumption is satisfied and instantiated in the “locally nonconvex” case with nonconvex region of radius and strong convexity studied by [19]; see also Assumption 13 in Appendix A.1. Here and . In the proofs that follow, we use this instantiation to study and compare dependence of the convergence guarantees on the smoothness and conditioning of . But the proofs apply directly to Assumptions 13 and the dimension and accuracy dependencies remain the same.

3 Underdamped Langevin Algorithm as Accelerated Gradient Descent

A recent trend of research in optimization makes uses of a continuous dynamical systems framework to decompose the convergence of algorithms into continuous components and discretization components [34, 37]. We follow similar ideas and first propose a continuous dynamics corresponding to the accelerated gradient flow over the KL divergence, . We then construct the underdamped Langevin algorithm as a discretization of the accelerated gradient flow and show that it is precisely accelerated gradient descent over .

3.1 Gradient Descent Dynamics over KL Divergence

We start by defining the dynamics of gradient descent via a consideration of the gradient flow associated with the KL divergence

. We formulate the “vector flow” associated with the following stochastic differential equation with Lipschitz continuous drift

:

 dθt=b(θt)dt+√2dBt, (1)

where is a standard Brownian motion. Without the Brownian motion, the vector flow is defined by on the state space. The evolution of probabilities under its effect is characterized by the continuity (Liouville) equation over :

 ∂∂t¯pt(θ)+∇T(¯pt(θ)b(θ))=0.

With the Brownian motion, the deformation of the probability density function

) follows the transport of probability mass along the vector flow in the state space:

 ∂∂tpt(θ)+∇T(pt(θ)vt(θ))=0,

where

On the other hand, for the objective functional , its time change when follows Eq. (1) is:

 ddtF(pt)=Eθ∼pt[⟨∇δF(pt)δpt(θ),b(θ)+∇lnpt⟩],

where is the strong subdifferential of associated with the -Wasserstein metric [17, Lemma 10.4.1]. Therefore, the gradient descent flow over involves taking . When is the KL divergence, we have:

 vGDt(θ)=−∇δF(pt)δpt(θ)=−∇lnpt(θ)p∗(θ),

or . The time evolution of the KL divergence along is

 ddtF(pt)=−Eθ∼pt[∥∥∥∇δF(pt)δpt(θ)∥∥∥2]=−Eθ∼pt[∥∥∥∇lnpt(θ)p∗(θ)∥∥∥2].

If satisfies Assumption 2 then taking in the log-Sobolev inequality yields:

 (2)

Note the resemblance of this bound to the Polyak-Łojasiewicz condition [31] used in optimization theory for studying the convergence of gradient methods—in both cases the difference from the current iterate to the optimum is upper bounded by the norm of the gradient squared.

With the log-Sobolev inequality, we obtain that

 ddtF(pt)=−Eθ∼pt[∥∥∥∇lnpt(θ)p∗(θ)∥∥∥2]≤−2ρF(pt),

which implies the linear convergence of along the gradient descent dynamics.

3.2 Accelerated Gradient Descent Dynamics over KL Divergence

We now consider an acceleration of the gradient descent dynamics in the space of probabilities via the introduction of a momentum variable . Denote and let the joint target distribution to be: 222We use and to denote marginal distributions of and over .. To design the accelerated gradient descent dynamics over the KL divergence, we leverage the acceleration phenomenon in optimization, which uses the gradient of the expanded objective function to guide the algorithm (see discussion at the end of the current section). We expand the KL divergence (over both the and coordinates) to obtain:

 F(pt(θ,r)) =∫∫pt(θ,r)lnpt(θ,r)p∗(θ)p∗(r)dθdr =KL(pt(θ)∥p∗(θ))+Eθ∼pt(θ)[KL(pt(r|θ)∥p∗(r))],

and form the vector field:

 vAGDt(x) =⎛⎜⎝∇rlnpt(θ,r)+ξr−∇θlnpt(θ,r)−∇U(θ)−γ∇rlnpt(θ,r)p∗(r)⎞⎟⎠. (3)

The corresponding continuity equation defined by this vector field is

 0 =∂∂tpt(θ,r)+∇T(pt(θ,r)vAGDt(θ,r)) =∂∂tpt(θ,r)+(∇Tθ,∇Tr)⎡⎢⎣pt(θ,r)⎛⎜⎝ξr−∇U(θ)−γ∇rlnpt(θ,r)p∗(r)⎞⎟⎠⎤⎥⎦.

This implies that the vector field can be implemented via the following stochastic differential equation

 {dθt=ξrtdtdrt=−∇U(θt)dt−γξrtdt+√2γdBt. (4)

If we only consider the time derivative of the KL divergence, ,

 ddtF(pt) =∫⟨∇xδFδpt,vAGDt(θ,r)⟩pt dx =∫⟨∇xδFδpt,−(0−IIγI)∇xlnptp∗⟩pt dx =−γEpt[∥∥∥∇rlnptp∗∥∥∥2]. (5)

This does not directly provide a linear convergence rate over time. To quantify the convergence rate for this accelerated gradient descent dynamics with KL divergence objective, we design a Lyapunov functional:

 L(pt) =F(pt)+Ept[⟨∇xδD(pt)δpt,S∇xδD(pt)δpt⟩] =Ept[lnptp∗+⟨∇xlnptp∗,S∇xlnptp∗⟩], (6)

where is a measurement of how close and are. Here we take the positive definite matrix . This form is inspired by the finite dimensional accelerated gradient descent dynamics of [37] and a line of work on hypocoercive diffusion operators [35, 4].

Proposition 1.

The time evolution of the Lyapunov functional with respect to the continuous time vector flow is:

 ddtL(pt) =∫⟨∇xδLδpt,vAGDt⟩pt dx =−Ept[⟨∇xln(ptp∗),MC∇xln(ptp∗)⟩F] −2γEpt[⟨∇x∇rln(ptp∗),S∇x∇rln(ptp∗)⟩F],

where for -Lipschitz smooth , and , and for all ,

 MC=1ρ⎛⎜ ⎜ ⎜⎝ξ2 I(1−γ2)ξ I−∇2U(θ)(1−γ2)ξ I−∇2U(θ)γ(4ξ+ρ)I−12∇2U(θ)⎞⎟ ⎟ ⎟⎠⪰λ(S+12ρI2d×2d). (7)

Since is positive definite,

 ddtL(pt) ≤−Ept[⟨∇xln(ptp∗),MC∇xln(ptp∗)⟩F] ≤−λ(Ept[⟨∇xln(ptp∗),S∇xln(ptp∗)⟩F]+12ρEpt[∥∥∥∇xlnptp∗∥∥∥2]).

Using Assumption 2 and the log-Sobolev inequality in Eq. (2), we directly obtain:

 ddtL(pt) ≤−λ(Ept[⟨∇xln(ptp∗),S∇xln(ptp∗)⟩F]+12ρEpt[∥∥∥∇xlnptp∗∥∥∥2]) ≤−λ(Ept[⟨∇xln(ptp∗),S∇xln(ptp∗)⟩F]+Ept[lnptp∗]) =−λL(pt),

which implies the linear convergence of the continuous process.

Accelerated gradient descent dynamics for optimization.

AGD has an exact correspondence with the above derivation. For the accelerated gradient descent dynamics for optimization problems defined on a Euclidean space, the original objective function is also expanded to be and its dynamics follows the vector field:

 dxdt=−(0−IIγI)(∇θH(x)∇rH(x)). (8)

To quantify convergence for the strongly convex objective , we take a Lyapunov function [37] of the form , where is the squared distance from to the optimum of , , and is a positive definite matrix scaling inversely with the strong convexity parameter .

3.3 Underdamped Langevin Algorithm via Second Order Discretization

It has been observed in both the MCMC and optimization communities that the discretization of the continuous dynamics is crucial for fast convergence. For optimization, a stable discretization scheme for accelerated gradient descent dynamics facilitates fast convergence [34, 37]. For MCMC, on the other hand, higher order (and more accurate) discretization schemes are found to accelerate convergence [23, 16, 8, 11, 21, 22].

Following [8], we design an underdamped Langevin algorithm by discretizing the dynamics of Eq. (4) for in the following fashion (denote and define ):

 {dθτ=ξrτdτdrτ=−γξrτdτ−∇U(θn)dτ+√2γdBτ. (9)

[8] provide explicit formulas for given which we state in Appendix A.2

for completeness. We also take the hyperparameters

, , and set the step size as follows:

 h=tn+1−tn=1120min⎧⎨⎩√ϵdρ2L5/2G,√ϵdρ2LGLH,1LG⎫⎬⎭.

The discretized vector field is

 (10)

It can be observed that this is a higher order discretization scheme for than the Euler-Maruyama method.

Other discretization schemes correspond to different choices of .

• The Euler-Maruyama discretization scheme corresponds to:

 ^vE−Mτ=(ξrn−∇U(θn)−γξrn−γ∇rlnp(θτ,rτ)).

After integration, we obtain for (and define ):

 {θτ=θn+νξrnrτ=(1−νγξ)rn−ν∇U(θn)+√2γBν,

where the Brownian motion is defined as . This is the lowest order integration scheme and does not grant faster convergence guarantees.

• For higher order discretization schemes other than our Eq. (10), it is worth noting that decomposes into two parts:

 vAGDt=(ξrt−∇U(θt))+⎛⎜⎝0−γ∇rlnp(θt,rt)p∗(rt)⎞⎟⎠,

where each part preserves as the invariant distribution. This inspires a splitting scheme for integrating . The first part is a Hamiltonian vector flow, which naturally motivates symplectic integration schemes such as the leap-frog method; the second part can be explicitly integrated to be .

Taking , is resampled as: according to the stationary distribution . This is exactly what Hamiltonian Monte Carlo (HMC) does [23]. Relating to the optimization, this momentum resampling step corresponds to a momentum restart method: one periodically restarts momentum from the stationary point [27]. Similar to optimization, it has been empirically observed that not taking at every step increases mixing [29].

4 Convergence of the Underdamped Langevin Algorithm

From Fig. 1, we see that the underdamped Langevin algorithm, Eq. (17), has a similar profile to accelerated gradient descent; it uses oscillatory behavior to greatly increases the convergence speed. In this section, we explicitly prove that the convergence of the underdamped Langevin algorithm is of order in terms of KL divergence. Consider the KL divergence from to as the target functional to minimize:

 KL(pt(θ)∥p∗(θ))≤KL(pt(θ,r)∥p∗(θ)p∗(r))=F(pt(θ,r)).
Theorem 1.

Assume satisfies Assumptions 13. Then if we run the underdamped Langevin algorithm (17) with initial condition for:

 k≥O(√dϵln(dϵ))

steps, we have .

If we further assume that function is locally nonconvex with radius and global strong convexity (Assumption 13), we obtain explicit dependence of convergence rate on other constants:

 k≥O⎛⎝√dϵmax⎧⎨⎩L5/2Gρ3,LGLHρ3⎫⎬⎭ln(dϵLGρ)⎞⎠,

where .

To prove convergence of the underdamped Langevin algorithm, we examine its instantaneous change within each step and integrate it. First consider the time evolution of following the discretized vector flow for :

 ∂p(xτ|xn)∂τ =−∇Tx(p(xτ|xn)⋅^vAGDτ) =−∇Tx(p(xτ|xn)⋅vAGDτ)−∇Tx(p(xτ|xn)⋅(^vAGDτ−vAGDτ)).

Therefore, for ,

 ∂p(xτ)∂τ=−∇Tx(p(xτ)⋅vAGDτ)−Exn∼p(xn)[∇Tx((^vAGDτ−vAGDτ)p(xτ|xn))]. (11)

Just like convergence of the continuous process (see Sec. 3.2), we use the same Lyapunov functional, Eq. (6), to characterize convergence of the underdamped Langevin algorithm: . Along the underdamped Langevin algorithm, instantaneous change of the Lyapunov functional follows the overall vector flow . It is composed of convergence with the continuous vector flow and the discretization error :

 ddtL[p(xτ)] =∫δLδp(xτ)∂p(xτ)∂t dxτ=∫⟨∇xδLδp(xτ),^vAGDτ⟩p(xτ) dxτ =∫⟨∇xδLδp(xτ),vAGDτ⟩p(xτ) dxτ

We analyze them separately and combine the analyses to obtain the overall convergence time.

4.1 Convergence of the Continuous Process

We first use Proposition 1 to quantify convergence of with respect to the continuous vector flow (where is defined in Eq. (7)):

 ∫⟨∇xδLδpt(x),vAGDτ(x)⟩pτ(x) dx =−2γEpt[⟨∇x∇rln(ptp∗),S∇x∇rln(ptp∗)⟩F] (12)

The two terms on the right hand side of Eq. (12) are both less than or equal to zero. We will use the first term to cancel the higher order terms in the discretization error and use the second term to drive the convergence of the process (by way of the log-Sobolev inequality).

4.2 Discretization Error

For the discretization error, we calculate its contribution to in the following Proposition 2.

Proposition 2.

For , the time evolution of the Lyapunov functional with respect to the discretization error is:

 ∫⟨∇xδLδp(xτ),Exn∼p(xtn)[(^vAGDτ−vAGDτ)p(xτ|xn)]⟩ dxτ =ξρ∫⟨∇θlnpτ(xτ)p∗(xτ),Exn∼p(xtn)[(∇U(θτ)−∇U(θn))p(xτ|xn)]⟩F dxτ +(4ξρ+1)∫⟨∇rlnpτ(xτ)p∗(xτ),Exn∼p(xtn)[(∇U(θτ)−∇U(θn))p(xτ|xn)]⟩ dxτ +2∫⟨∇x∇rlnpτ(xτ)p∗(xτ),S∇xτExn∼p(xn|xτ)[∇U(θτ)−∇U(θn)]⟩Fpτ(xτ) dxτ. (13)

It can be observed that there are two types of terms in Eq. (13): terms with first order derivatives (for labeling or ), and terms with the second order derivative .

We first use Young’s inequality to bound the first type of terms:

 ∫⟨∇#lnpτ(xτ)p∗(xτ),Exn∼p(xtn)[(∇U(θτ)−∇U(θn))p(xτ|xn)]⟩ dxτ

The main difficulty is in bounding the term

 ∫⟨∇x∇rlnpτ(xτ)p∗(xτ),S∇xτExn∼p(xn|xτ)[∇U(θτ)−∇U(θn)]⟩Fpτ(xτ) dxτ. (14)

This is the object of the following lemma.

Lemma 3.

Under Assumption 1, we explicitly bound the term in Eq. (14) when , , and :

 ∫⟨∇x∇rlnpτ(xτ)p∗(xτ),S∇xτExn∼p(xn|xτ)[∇U(θτ)−∇U(θn)]⟩Fpτ(xτ) dxτ ≤γEpτ(xτ)[⟨∇x∇rlnpτ(xτ)p∗(xτ),S∇x∇rlnpτ(xτ)p∗(xτ)⟩F]

Therefore, we can bound the overall discretization error as:

 ∫⟨∇xδLδp(xτ),Exn∼p(xtn)[(^vAGDτ−vAGDτ)p(xτ|xn)]⟩ dxτ ≤2γEpτ(xτ)[⟨∇x∇rlnpτ(xτ)p∗(xτ),S∇x∇rlnpτ(xτ)p∗(xτ)⟩F] +ξ4ρEpτ[∥∥∥∇θlnpτ(xτ)p∗(xτ)∥∥∥2]+(2ξρ+12)Epτ[∥∥∥∇rlnpτ(xτ)p∗(xτ)∥∥∥2] +((3ξρ+12)L2G+2L2Hγρ)Ep(xn,xτ)[∥θτ−θn∥2]+16edγρmax{L4Gξ2ν4,L2Gξ2ν2}.

4.3 Overall Convergence of Underdamped Langevin Algorithm

Combining the convergence of the continuous underdamped Langevin dynamics and the discretization error, we obtain that the overall time evolution of the Lyapunov functional with respect to one step of the underdamped Langevin algorithm is:

 dL(pt)dt =∫⟨∇xδLδpt,vAGDt⟩pt dx+∫⟨∇xδLδpt,Exn∼p(xtn)[(^vAGDτ−vAGDτ)p(xτ|xn)]⟩pt dx ≤−Ept[⟨∇xln(ptp∗),M∇xln(ptp∗)⟩F] (15) +((5ξ2ρ+12)L2G+2L2Hγρ)Ep(xn,xτ)[∥θτ−θn∥2]+16edγρmax{L4Gξ2ν4,L2Gξ2ν2},

where

 M=⎛⎜ ⎜ ⎜ ⎜⎝ξ4ρI2ξ−γξ2ρI−1ρ∇2U(θ)2ξ−γξ2ρI−1ρ∇2U(θ)γ(2ξρ+12)I−12ρ∇2U(θ)⎞⎟ ⎟ ⎟ ⎟⎠