Sharp Convergence Rates for Langevin Dynamics in the Nonconvex Setting

05/04/2018 · by Xiang Cheng, et al. · UC Berkeley

We study the problem of sampling from a distribution where the negative logarithm of the target density is L-smooth everywhere and m-strongly convex outside a ball of radius R, but potentially non-convex inside this ball. We study both overdamped and underdamped Langevin MCMC and prove upper bounds on the time required to obtain a sample from a distribution that is within ϵ of the target distribution in 1-Wasserstein distance. For the first-order method (overdamped Langevin MCMC), the time complexity is Õ(e^{cLR²} d/ϵ²), where d is the dimension of the underlying space. For the second-order method (underdamped Langevin MCMC), the time complexity is Õ(e^{cLR²} √d/ϵ) for some explicit positive constant c. Surprisingly, the convergence rate is only polynomial in the dimension d and the target accuracy ϵ. It is, however, exponential in the problem parameter LR², which is a measure of non-logconcavity of the target distribution.


1 Introduction

In this paper, we study the problem of sampling from a target distribution

$$p^*(x) \propto \exp(-U(x)),$$

where $x \in \mathbb{R}^d$ and the potential function $U$ is $L$-smooth everywhere and $m$-strongly convex outside a ball of radius $R$ (see detailed assumptions in Section 1.2.1).

Our focus is on theoretical rates of convergence of sampling algorithms, including analysis of the dependence of these rates on the dimension $d$. Much of the theory of convergence of sampling—for example, sampling based on Markov chain Monte Carlo (MCMC) algorithms—has focused on asymptotic convergence, and has stopped short of providing a detailed study of dimension dependence. In the allied field of optimization algorithms, a significant new literature has emerged in recent years on non-asymptotic rates, including tight characterizations of dimension dependence. The optimization literature, however, generally stops short of the kinds of inferential and decision-theoretic computations that are addressed by sampling, in domains such as Bayesian statistics (Robert and Casella, 2013), bandit algorithms (Cesa-Bianchi and Lugosi, 2006) and adversarial online learning (Bubeck, 2011, Abbasi et al., 2013).

In both optimization and sampling, the classical theory focused on convex problems, while recent work addresses the more broadly useful setting of non-convex problems. While fully general non-convex problems are computationally intractable, it is possible to make reasonable assumptions that allow theory to proceed while still making contact with practice.

We will consider the class of MCMC algorithms that have access to the gradients of the potential, $\nabla U(\cdot)$. A particular algorithm of this kind that has received significant recent attention from theoreticians is the overdamped Langevin MCMC algorithm (Dalalyan, 2017, Durmus and Moulines, 2016, Dalalyan and Karagulyan, 2017). The underlying first-order stochastic differential equation (henceforth SDE) is given by:

$$dx_t = -\nabla U(x_t)\,dt + \sqrt{2}\,dB_t, \qquad (1)$$

where $B_t$ represents a standard Brownian motion in $\mathbb{R}^d$. Overdamped Langevin MCMC (Algorithm 1) is a discretization of this SDE. It is possible to show that, under mild assumptions on $U$, the invariant distribution of the overdamped Langevin diffusion is given by $p^*(x) \propto \exp(-U(x))$.
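
As a quick sanity check on this invariance claim, one can simulate an Euler–Maruyama discretization of Eq. (1) for the quadratic potential $U(x) = \frac{1}{2}\|x\|_2^2$, whose target $p^* \propto e^{-U}$ is a standard Gaussian, and inspect the empirical moments. This is a minimal sketch of our own, not code from the paper; the function name, step size and potential are illustrative choices:

```python
import numpy as np

def euler_maruyama_langevin(grad_U, x0, step, n_steps, rng):
    """Approximately simulate Eq. (1) via Euler--Maruyama:
    x <- x - step * grad_U(x) + sqrt(2 * step) * N(0, I)."""
    x = np.array(x0, dtype=float)
    traj = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x = x - step * grad_U(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.size)
        traj[k] = x
    return traj

rng = np.random.default_rng(0)
# U(x) = 0.5 * x^2, so grad_U(x) = x and the target is N(0, 1).
traj = euler_maruyama_langevin(lambda x: x, x0=[3.0], step=0.01,
                               n_steps=200_000, rng=rng)
samples = traj[50_000:, 0]            # discard burn-in
print(samples.mean(), samples.var())  # should be close to 0 and 1
```

For small step sizes the empirical mean and variance approach 0 and 1, the moments of the invariant distribution, up to a discretization bias that vanishes with the step size.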

The second-order generalization of overdamped Langevin diffusion is underdamped Langevin diffusion, which can be represented by the following SDE:

$$dx_t = v_t\,dt, \qquad dv_t = -\gamma v_t\,dt - u\,\nabla U(x_t)\,dt + \sqrt{2\gamma u}\,dB_t, \qquad (2)$$

where $\gamma, u > 0$ are free parameters. This SDE can also be discretized appropriately to yield a corresponding MCMC algorithm (Algorithm 2). Second-order methods like underdamped Langevin MCMC are particularly interesting, as it has been previously observed both empirically (Neal, 2011) and theoretically (Cheng et al., 2017, Mangoubi and Smith, 2017) that these methods can be faster than the more classical overdamped methods.

In this work, we show that it is possible to sample from $p^*$ in time polynomial in the dimension $d$ and the target accuracy $\epsilon$ (as measured in $1$-Wasserstein distance). We also show that the convergence depends exponentially on the product $LR^2$. Intuitively, $LR^2$ is a measure of the non-convexity of $U$. Our results establish rigorously that as long as the problem is not “too badly non-convex,” sampling is provably tractable.

Our main results are presented in Theorem 2.1 and Theorem 3.1, and can be summarized informally as follows:

Theorem 1.1 (informal).

Given a potential $U$ that is $L$-smooth everywhere and $m$-strongly convex outside a ball of radius $R$, we can output a sample from a distribution which is $\epsilon$-close in $W_1$ to $p^*$ by running $\tilde{O}(e^{cLR^2} d/\epsilon^2)$ steps of overdamped Langevin MCMC (Algorithm 1), or $\tilde{O}(e^{cLR^2} \sqrt{d}/\epsilon)$ steps of underdamped Langevin MCMC (Algorithm 2). Here, $c$ is an explicit constant.

For the case of convex $U$, it has been shown by Cheng et al. (2017) that the iteration complexity of Algorithm 2 is $\tilde{O}(\sqrt{d}/\epsilon)$, quadratically improving upon the best known iteration complexity of $\tilde{O}(d/\epsilon^2)$ for Algorithm 1, as shown by Durmus and Moulines (2016). We will find this quadratic speed-up in $d$ and $\epsilon$ in our setting as well (see Theorem 2.1 versus Theorem 3.1).


1.1 Related Work

The convergence rate of overdamped Langevin diffusion under Assumptions (A1)–(A3) has been established by Eberle (2016), but the continuous-time diffusion studied in that paper is not directly implementable as an algorithm. In a more algorithmic line of work, Dalalyan (2017) bounded the discretization error of overdamped Langevin MCMC, and provided the first non-asymptotic convergence rate for overdamped Langevin MCMC under log-concavity assumptions. This was followed by a sequence of papers in the strongly log-concave setting (see, e.g., Durmus and Moulines, 2016, Cheng and Bartlett, 2017, Dalalyan and Karagulyan, 2017, Dwivedi et al., 2018).

Our result for overdamped Langevin MCMC is in line with this existing work; indeed, we combine the continuous-time convergence rate of Eberle (2016) with a variant of the discretization error analysis of Durmus and Moulines (2016). The final number of timesteps needed is $\tilde{O}(e^{cLR^2} d/\epsilon^2)$, which is expected, as the rate of Eberle (2016) is $\tilde{O}(e^{cLR^2})$ (for the continuous-time process) and the iteration complexity established by Durmus and Moulines (2016) is $\tilde{O}(d/\epsilon^2)$.

On the other hand, convergence of underdamped Langevin MCMC under (strongly) log-concave assumptions was first established by Cheng et al. (2017). Also very relevant to this work is the paper by Eberle et al. (2017) that demonstrated a contraction property of the continuous-time process in (2). That result deals, however, with a much larger class of potential functions, and because of this the distance to the invariant distribution scales exponentially with the dimension $d$. At a high level, our analysis in Section 3 yields a more favorable result by combining ideas from both Eberle et al. (2017) and Cheng et al. (2017), under new assumptions.

Also noteworthy is that the problem of sampling from non-logconcave distributions has been studied by Raginsky et al. (2017), but under weaker assumptions, with a worst-case convergence rate that is exponential in the dimension $d$. On the other hand, Ge et al. (2017) established a polynomial (in $d$ and $1/\epsilon$) convergence rate for sampling from a distribution close to a mixture of Gaussians, where the mixture components have the same variance (which is subsumed by our assumptions).

Finally, there is a large class of sampling algorithms known as Hamiltonian Monte Carlo (HMC), which involve Hamiltonian dynamics in some form. We refer to Ma et al. (2015) for a survey of the results in this area. Among these, the variant studied in this paper (Algorithm 2), based on the discretization of (2), has a natural physical interpretation as the evolution of a particle’s dynamics under a viscous force field. This model was first studied by Kramers (1940) in the context of chemical reactions. The continuous-time process has been studied extensively (Hérau, 2002, Villani, 2009, Eberle et al., 2017, Gorham et al., 2016, Baudoin, 2016, Bolley et al., 2010, Calogero, 2012, Dolbeault et al., 2015, Mischler and Mouhot, 2014). Three recent papers—Mangoubi and Smith (2017), Lee and Vempala (2017) and Mangoubi and Vishnoi (2018)—study the convergence rate of HMC under log-concavity assumptions.

After the completion of this paper, Bou-Rabee et al. (2018) independently published a preprint on arXiv analyzing Hamiltonian Monte Carlo under similar assumptions as ours.

1.2 Notation, Definitions and Assumptions

In this section we present the basic definitions, notational conventions and assumptions used throughout the paper. For a vector $x \in \mathbb{R}^d$, we let $\|x\|_2$ denote its Euclidean ($\ell_2$) norm. Throughout the paper we use $B_t$ to denote standard Brownian motion (Mörters and Peres, 2010).

1.2.1 Assumptions on the potential

We make the following assumptions on the potential function $U$:

  (A1) The function $U$ is continuously differentiable on $\mathbb{R}^d$ and has Lipschitz continuous gradients; that is, there exists a positive constant $L$ such that for all $x, y \in \mathbb{R}^d$,
       $$\|\nabla U(x) - \nabla U(y)\|_2 \leq L\,\|x - y\|_2.$$

  (A2) The function $U$ has a stationary point at zero: $\nabla U(0) = 0$.

  (A3) The function $U$ is strongly convex outside of a ball; that is, there exist constants $m, R > 0$ such that for all $x, y \in \mathbb{R}^d$ with $\|x - y\|_2 \geq R$,
       $$\langle \nabla U(x) - \nabla U(y),\, x - y \rangle \geq m\,\|x - y\|_2^2.$$

Finally, we define the condition number $\kappa := L/m$. Observe that Assumption (A2) is imposed without loss of generality, because we can always find a local minimum in polynomial time and shift the coordinate system so that the local minimum of $U$ is at zero. These conditions are similar to the assumptions made by Eberle (2016). Note that, crucially, Assumption (A3) is strictly stronger than the assumption made in recent papers by Raginsky et al. (2017) and Zhang et al. (2017). To see this, observe that these papers only require Assumption (A3) to hold for a fixed $y$, while we require it to hold for all pairs $x, y$. One can also think of the difference between these two conditions as being analogous to the difference between strong convexity (outside a ball) and one-point strong convexity (outside a ball).
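
As a concrete instance of a potential satisfying these assumptions, consider the negative log-density of an equal-weight Gaussian mixture with unit-variance components centered at $\pm\mu$, the setting of Ge et al. (2017) discussed above. The sketch below is our own illustrative code, not the paper's; the mode location MU and all names are our choices:

```python
import numpy as np

MU = np.array([2.0, 0.0])  # modes at +MU and -MU; illustrative choice

def U(x):
    """Negative log-density (up to an additive constant) of an
    equal-weight, unit-variance Gaussian mixture with modes at +/- MU."""
    a = -0.5 * np.sum((x - MU) ** 2)
    b = -0.5 * np.sum((x + MU) ** 2)
    return -np.logaddexp(a, b)

def grad_U(x):
    """Gradient of U: a convex combination of the gradients of the two
    quadratics, weighted by the responsibility w of the first mode."""
    a = -0.5 * np.sum((x - MU) ** 2)
    b = -0.5 * np.sum((x + MU) ** 2)
    w = 1.0 / (1.0 + np.exp(b - a))
    return w * (x - MU) + (1.0 - w) * (x + MU)
```

A short calculation gives $\nabla^2 U(x) = I - 4w(1-w)\,\mu\mu^\top$ with $w \in (0,1)$: the Hessian is bounded, so the potential is smooth; it has a negative eigenvalue near the midpoint of the modes when $\|\mu\|_2 > 1$, so the potential is non-convex inside a ball; and it approaches the identity far from the modes, where $w(1-w)$ vanishes, matching strong convexity outside a ball.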

1.2.2 Coupling and Wasserstein Distance

Denote by $\mathcal{B}(\mathbb{R}^d)$ the Borel $\sigma$-field of $\mathbb{R}^d$. Given probability measures $\mu$ and $\nu$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, we define a transference plan $\zeta$ between $\mu$ and $\nu$ as a probability measure on $(\mathbb{R}^d \times \mathbb{R}^d, \mathcal{B}(\mathbb{R}^d \times \mathbb{R}^d))$ such that for all sets $A \in \mathcal{B}(\mathbb{R}^d)$, $\zeta(A \times \mathbb{R}^d) = \mu(A)$ and $\zeta(\mathbb{R}^d \times A) = \nu(A)$. We denote by $\Gamma(\mu, \nu)$ the set of all transference plans. A pair of random variables $(X, Y)$ is called a coupling if there exists a $\zeta \in \Gamma(\mu, \nu)$ such that $(X, Y)$ is distributed according to $\zeta$. (With some abuse of notation, we will also refer to $\zeta$ as the coupling.)

Given a function $f: \mathbb{R}^+ \to \mathbb{R}^+$, we define the $f$-Wasserstein distance between a pair of probability measures as follows:

$$W_f(\mu, \nu) := \inf_{\zeta \in \Gamma(\mu, \nu)} \int f(\|x - y\|_2)\, d\zeta(x, y).$$

Finally, we denote by $\Gamma_{\mathrm{opt}}(\mu, \nu)$ the set of transference plans that achieve the infimum in the definition of the Wasserstein distance between $\mu$ and $\nu$ (for more properties of $\Gamma_{\mathrm{opt}}(\mu, \nu)$, see Villani, 2008). For any $p \geq 1$ we define the $p$-Wasserstein distance as

$$W_p(\mu, \nu) := \left( \inf_{\zeta \in \Gamma(\mu, \nu)} \int \|x - y\|_2^p \, d\zeta(x, y) \right)^{1/p}.$$
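
For intuition only (the paper works with $W_f$ and $W_1$ on $\mathbb{R}^d$): in one dimension the $1$-Wasserstein distance equals the integrated absolute difference of the two CDFs and can be estimated from samples; SciPy's wasserstein_distance implements exactly this 1-D case. The sample sizes below are arbitrary choices of ours:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
p_samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
q_samples = rng.normal(loc=0.5, scale=1.0, size=100_000)

# Shifting a distribution by c moves it exactly c in W1, so the true
# distance here is 0.5; the empirical estimate has O(n^{-1/2}) error.
print(wasserstein_distance(p_samples, q_samples))  # ~0.5
```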

1.2.3 Defining $f$ and related inequalities

We follow Eberle (2016) in our specification of the distance function $f$ that is used in the definition of the Wasserstein distance. We first fix two arbitrary constants, which serve as parameters in the definition of $f$. We begin by defining three auxiliary functions, each mapping $\mathbb{R}^+$ to $\mathbb{R}^+$:

(3)

Let us summarize some important properties of these auxiliary functions:

  • is decreasing, , and for any .

  • is decreasing, , and for any .

Finally, we define $f$ as

(4)

We now state some useful properties of the distance function $f$.

Lemma 1.2.

The function $f$ defined in Eq. (4) has the following properties.

  (F1) , .

  (F2) .

  (F3) .

  (F4) For all ,

  (F5) For all , , and when .

  (F6) If , for any , .

These properties follow fairly easily from the definition of the function $f$ above. We present proofs in Appendix A.

2 Overdamped Langevin Diffusion

We first set up the notation specific to the continuous and discrete processes that we use to study overdamped Langevin diffusion:

  1. Consider the exact overdamped Langevin diffusion defined by the SDE in Eq. (1), with an initial condition $x_0 \sim p_0$ for some distribution $p_0$ on $\mathbb{R}^d$. Let $p_t$ denote the distribution of $x_t$ and let $\Phi_t$ denote the operator that maps from $p_0$ to $p_t$:

    $$p_t = \Phi_t p_0 \qquad (5)$$
  2. One step of the overdamped Langevin MCMC is defined by the SDE:

    $$d\tilde{x}_t = -\nabla U(\tilde{x}_0)\,dt + \sqrt{2}\,dB_t \qquad (6)$$

    with an initial condition $\tilde{x}_0 \sim \tilde{p}_0$. We define $\tilde{p}_t$ and $\tilde{\Phi}_t$ analogously for the discrete process.

    Note 1: The discrete update differs from Eq. (1) by using a fixed $\nabla U(\tilde{x}_0)$ instead of $\nabla U(\tilde{x}_t)$ in the drift.

    Note 2: We will only be analyzing the solutions to Eq. (6) for small $t$. Think of an integral solution of Eq. (6) as a single step of the discrete Langevin MCMC.

Input  : Step size $\delta$, number of iterations $N$, initial point $x_0$, and gradient oracle $\nabla U(\cdot)$
1 for $n = 0, 1, \ldots, N-1$ do
2       Sample $x_{n+1} \sim \mathcal{N}\big(x_n - \delta \nabla U(x_n),\, 2\delta I_{d \times d}\big)$
3 end for
Algorithm 1 Overdamped Langevin MCMC
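
In code, each iteration of Algorithm 1 draws $x_{n+1} = x_n - \delta\,\nabla U(x_n) + \sqrt{2\delta}\,\xi_n$ with $\xi_n \sim \mathcal{N}(0, I_{d\times d})$. Below is a minimal sketch (function names and parameter values are our own), run on the two-mode mixture example from Section 1.2.1:

```python
import numpy as np

def overdamped_langevin_mcmc(grad_U, x0, delta, n_iters, rng):
    """Algorithm 1: x_{n+1} ~ N(x_n - delta * grad_U(x_n), 2 * delta * I)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        x = x - delta * grad_U(x) \
            + np.sqrt(2.0 * delta) * rng.standard_normal(x.size)
    return x

MU = np.array([2.0, 0.0])  # two-mode example potential (illustrative)

def grad_U(x):
    a = -0.5 * np.sum((x - MU) ** 2)
    b = -0.5 * np.sum((x + MU) ** 2)
    w = 1.0 / (1.0 + np.exp(b - a))
    return w * (x - MU) + (1.0 - w) * (x + MU)

rng = np.random.default_rng(2)
finals = np.array([overdamped_langevin_mcmc(grad_U, np.zeros(2), 0.05, 2_000, rng)
                   for _ in range(200)])
# By symmetry, the chains should split roughly evenly between the modes.
print((finals[:, 0] > 0).mean())  # fraction of chains near the +MU mode, ~0.5
```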

It can be easily verified that $x_{n+1}$ in Algorithm 1 has the same distribution as $\tilde{x}_\delta$ in Eq. (6). Throughout this section, we denote by $p^*$ the unique distribution which satisfies $p^*(x) \propto \exp(-U(x))$. It can be shown that $p^*$ is the unique invariant distribution of (1) (see, for example, Proposition 6.1 in Pavliotis, 2016). In the discussion that follows we will use $p_n$ to denote the distribution of $x_n$ as defined in Algorithm 1. The main result of this section is Theorem 2.1, which establishes a convergence rate for Algorithm 1.

Theorem 2.1.

Let $p_0$ be the Dirac delta distribution at a point $x_0$ with $\|x_0\|_2 \leq R$. Let $p_N$ be the distribution of the $N^{\text{th}}$ iterate of Algorithm 1 with step size

Let

Then $W_1(p_N, p^*) \leq \epsilon$.

Remark. Note that in most interesting cases, $N$ is constrained by the first term, which shows that it suffices to have $N = \tilde{O}\big(e^{cLR^2}\, d/\epsilon^2\big)$.

Intuitively, $LR^2$ measures the extent of nonconvexity. When this quantity is large, it is possible for $U$ to contain numerous local minima that are very deep. It is reasonable that the runtime of the algorithm should be exponential in this quantity.

2.1 Convergence of Continuous-Time Process

We begin by establishing the convergence of the continuous-time process (1) to the invariant distribution. Following Eberle (2016), we construct a coupling between two processes evolving according to the SDE (1). We accordingly define the first process as:

$$dx_t = -\nabla U(x_t)\,dt + \sqrt{2}\,dB_t,$$

where $x_0 \sim p_0$, and the second process as:

$$dy_t = -\nabla U(y_t)\,dt + \sqrt{2}\,\big(I_{d \times d} - 2\gamma_t \gamma_t^\top \mathbb{1}[x_t \neq y_t]\big)\,dB_t,$$

with $y_0 \sim p^*$, where

$$\gamma_t := \frac{x_t - y_t}{\|x_t - y_t\|_2}.$$

($\mathbb{1}[\cdot]$ is the indicator function, which is 1 if its argument holds and 0 otherwise.)

Additionally, we couple the processes so that the initial joint distribution of $x_0$ and $y_0$ corresponds to the optimal coupling between the two processes under $W_f$. To simplify notation, we define the difference process $z_t := x_t - y_t$, with

$$dz_t = -\big(\nabla U(x_t) - \nabla U(y_t)\big)\,dt + 2\sqrt{2}\,\gamma_t \gamma_t^\top \mathbb{1}[z_t \neq 0]\,dB_t. \qquad (7)$$

With this notation in place, we now show the contraction of the continuous-time process (1) in $W_f$.

Proposition 2.2.

Let $f$ be as defined in Section 1.2.3, with a particular (explicit) choice of its two constants. Then for any $t > 0$, and any probability measure $p_0$,

where $\Phi_t$ is as defined in Eq. (5) and $p^*$ is the invariant distribution of (1).

Proof  We define $r_t := \|z_t\|_2$. By Itô’s Formula (see Theorem E.1; sufficient regularity is established in Lemma E.8),

where $W_t$ is a one-dimensional Brownian motion. Applying Itô’s Formula once again, this time to $f(r_t)$:

Taking an expectation,

(8)

We now complete the argument by considering two cases:

Case 1 ($r_t \leq R$): In this case, we know that, by the smoothness assumption on $U$ (Assumption (A1)),

Combining with Eq. (8),

The second inequality follows from the choice of the two constants given in the statement of Proposition 2.2 and property (F4) in Lemma 1.2.

Case 2 ($r_t > R$): In this case, we know that for points that are far apart, the potential satisfies a strong-convexity-like condition (Assumption (A3)). Also, by Lemma 1.2, the relevant bounds on $f$ and its derivatives hold for any such $r_t$. Thus

Combining the two cases, we get that for any $t \geq 0$,

The claimed result follows by Grönwall’s Inequality (see Corollary 3 in Dragomir, 2003), assuming that the initial distributions are optimally coupled under $W_f$.  
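
To make the reflection coupling concrete, the toy sketch below (our own construction, simplified: both processes are discretized with the same Euler scheme and declared merged once within a tolerance, whereas the continuous-time coupling meets exactly) runs two copies of (1) whose shared noise is reflected across the hyperplane orthogonal to $z_t$:

```python
import numpy as np

def reflection_coupling_time(grad_U, x0, y0, step, n_steps, rng, tol=0.05):
    """Two Euler-discretized copies of Eq. (1); the second chain receives
    the first chain's Gaussian increment reflected across the hyperplane
    orthogonal to z = x - y.  Returns the first step with ||z|| < tol."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    for k in range(n_steps):
        z = x - y
        dist = np.linalg.norm(z)
        if dist < tol:
            return k  # chains have (approximately) coupled
        e = z / dist
        xi = rng.standard_normal(x.size)
        xi_reflected = xi - 2.0 * e * (e @ xi)  # (I - 2 e e^T) xi
        x = x - step * grad_U(x) + np.sqrt(2.0 * step) * xi
        y = y - step * grad_U(y) + np.sqrt(2.0 * step) * xi_reflected
    return n_steps

rng = np.random.default_rng(3)
# Quadratic potential for illustration; coupling typically occurs quickly.
print(reflection_coupling_time(lambda x: x, np.full(5, 4.0),
                               np.full(5, -4.0), 0.01, 200_000, rng))
```

In continuous time, the reflected noise gives the distance process $\|z_t\|_2$ a full-strength one-dimensional Brownian component, which is what drives the two processes together and what the $W_f$ contraction argument above exploits.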

2.2 Convergence of the Discrete-Time Process

We control the discretization error between the continuous and discrete processes using standard arguments (see, for example, Durmus and Moulines, 2016). The main conclusion is that the discretization error in $W_2$ (and consequently in $W_1$) essentially scales as $\sqrt{\delta}$.

Proposition 2.3.

Let the initial distribution $\tilde{p}_0$ be a Dirac-delta distribution at a point $\tilde{x}_0$. Let $\tilde{p}_t$ be the distribution of $\tilde{x}_t$. Then, for all $t > 0$, if the step size $\delta$ is sufficiently small,

The proof of this proposition is in Appendix B.
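
The scaling of the discretization error can be probed numerically in one dimension: run the discrete chain for a fixed physical time at several step sizes and compare against a fine-step reference in $W_1$. This sketch is ours (the potential is the earlier two-mode example, in one dimension; all constants are illustrative), and the finite-sample $W_1$ estimate itself carries $O(n^{-1/2})$ statistical error:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def grad_U(x):  # 1-D two-mode mixture potential, modes at +/- 2
    a = -0.5 * (x - 2.0) ** 2
    b = -0.5 * (x + 2.0) ** 2
    w = 1.0 / (1.0 + np.exp(b - a))
    return w * (x - 2.0) + (1.0 - w) * (x + 2.0)

def final_states(step, t_total, n_chains, seed):
    """Run n_chains independent discretized chains (vectorized) from a
    common overdispersed start, for total physical time t_total."""
    rng = np.random.default_rng(seed)
    x = 3.0 * rng.standard_normal(n_chains)
    for _ in range(int(round(t_total / step))):
        x = x - step * grad_U(x) + np.sqrt(2.0 * step) * rng.standard_normal(n_chains)
    return x

ref = final_states(step=1e-3, t_total=5.0, n_chains=20_000, seed=0)
for step in [0.2, 0.1, 0.05, 0.025]:
    err = wasserstein_distance(final_states(step, 5.0, 20_000, seed=1), ref)
    print(step, err)  # W1 error shrinks as the step size decreases
```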

2.3 Proof of Theorem 2.1

In this section we combine the continuous-time contraction result (Proposition 2.2) with the discretization error bound (Proposition 2.3) to prove Theorem 2.1.

Proof [Proof of Theorem 2.1]  We know that for any measures $p$ and $q$,

as follows from the definition of $f$. We also have $\tilde{p}_0 = p_0$, with $\|x_0\|_2 \leq R$. Thus Proposition 2.3 implies that for any $n$, with our choice of step size,

By the triangle inequality and the concavity of $f$,

By Proposition 2.2, the continuous-time process contracts, so

Unrolling this inequality for $N$ steps:

where the first step follows by summing the geometric series and the second follows by a first-order approximation of the exponential. Finally, for any two measures $p$ and $q$, $W_1(p, q)$ is controlled by $W_f(p, q)$, as noted above. Plugging this into the inequality above gives us the desired result:

We choose the step size $\delta$

to ensure that the second term, corresponding to the discretization error, is small,

and we pick the number of iterations $N$

to ensure that the first term contracts sufficiently.

Finally, by our choice of $p_0$, we can upper bound $W_f(p_0, p^*)$ by

where the first inequality is by the triangle inequality and the last inequality follows from Lemma E.3. Combining the pieces and simplifying gives us the desired result.  

3 Underdamped Langevin Diffusion

For the rest of this paper, we will make use of an absolute constant, which will help clarify the presentation. In this section, we study underdamped Langevin diffusion, a second-order diffusion process given by the following SDE:

$$dx_t = v_t\,dt, \qquad dv_t = -\gamma v_t\,dt - u\,\nabla U(x_t)\,dt + \sqrt{2\gamma u}\,dB_t, \qquad (9)$$

where $\gamma$ and $u$ are as in Eq. (2), and $\kappa = L/m$ is the condition number.

Similar to the case of overdamped Langevin diffusion, it can be readily verified that the invariant distribution of the SDE is $p^*(x, v) \propto \exp\!\big(-U(x) - \tfrac{1}{2u}\|v\|_2^2\big)$. This ensures that the marginal along $x$ is the distribution $\propto e^{-U(x)}$ that we are interested in. Based on Eq. (9), we can define the discretized underdamped Langevin diffusion as

$$d\tilde{x}_t = \tilde{v}_t\,dt, \qquad d\tilde{v}_t = -\gamma \tilde{v}_t\,dt - u\,\nabla U\big(\tilde{x}_{\lfloor t/\delta \rfloor \delta}\big)\,dt + \sqrt{2\gamma u}\,dB_t, \qquad (10)$$

where $\delta$ is the step size of the discretization. In Theorem 3.2, we establish the rate at which (10) converges to $p^*$. The SDE in Eq. (10) is implementable as the following algorithm:

Input  : Step size $\delta$, number of iterations $N$, initial point $(x_0, v_0)$, smoothness parameter $L$, condition number $\kappa$ and gradient oracle $\nabla U(\cdot)$
1 for $n = 0, 1, \ldots, N-1$ do
2       Sample $(x_{n+1}, v_{n+1})$ from the Gaussian conditional distribution specified below
3 end for
Algorithm 2 Underdamped Langevin MCMC

The random vector $(x_{n+1}, v_{n+1})$, conditioned on $(x_n, v_n)$, has a Gaussian distribution with conditional mean and covariance obtained from the following computations:

It can be verified that $(x_{n+1}, v_{n+1})$ from Algorithm 2 and the solution of Eq. (10) at time $\delta$ have the same distribution (see Lemma E.6 for a proof of this statement; this lemma is essentially extracted from the calculations of Cheng et al. (2017), and we include it in the appendix for completeness). In the discussion that follows we will use $p_n$ to denote the distribution of $(x_n, v_n)$ as defined in Algorithm 2. The following theorem establishes the convergence rate of Algorithm 2.
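
Algorithm 2 samples each $(x_{n+1}, v_{n+1})$ exactly from the Gaussian transition of Eq. (10), with the conditional mean and covariance of Lemma E.6. As a simplified stand-in (ours, not the paper's integrator), the sketch below instead applies a plain Euler–Maruyama step to Eq. (2), with the parameter choice $\gamma = 2$, $u = 1/L$ used by Cheng et al. (2017); it conveys the structure of the method but does not enjoy the exact-integration property analyzed in Theorem 3.1:

```python
import numpy as np

def underdamped_langevin_euler(grad_U, x0, L, delta, n_iters, rng, gamma=2.0):
    """Euler--Maruyama discretization of the underdamped SDE (2) with
    u = 1/L.  NOTE: the paper's Algorithm 2 instead draws the exact
    Gaussian transition of Eq. (10); this is a simplified substitute."""
    u = 1.0 / L
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    xs = np.empty((n_iters, x.size))
    for k in range(n_iters):
        xi = rng.standard_normal(x.size)
        v = v - delta * (gamma * v + u * grad_U(x)) \
            + np.sqrt(2.0 * gamma * u * delta) * xi
        x = x + delta * v
        xs[k] = x
    return xs

rng = np.random.default_rng(4)
# U(x) = x^2 / 2 (so L = 1): the x-marginal of the invariant law is N(0, 1).
xs = underdamped_langevin_euler(lambda z: z, [2.0], L=1.0, delta=0.01,
                                n_iters=300_000, rng=rng)
print(xs[50_000:, 0].var())  # close to 1, up to discretization bias
```

For this SDE the invariant law factorizes as $p^*(x, v) \propto \exp(-U(x) - \tfrac{1}{2u}\|v\|_2^2)$, so the velocity marginal is $\mathcal{N}(0, u I)$ and the position marginal is the target.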

Theorem 3.1.

Let $p_0$ be the Dirac delta distribution at $(x_0, v_0)$ with $\|x_0\|_2 \leq R$. Let $p_N$ be the distribution of the $N^{\text{th}}$ iterate of Algorithm 2 with step size

and let

Then $W_1(p_N, p^*) \leq \epsilon$.

Remark. The final expression for $N$ can be simplified to $\tilde{O}\big(e^{cLR^2}\,\sqrt{d}/\epsilon\big)$.

The proof of this theorem relies on an intricate coupling argument. As in the overdamped case, we begin by defining two processes and then coupling them appropriately, using both synchronous and reflection coupling. In the rest of this section we use the variables

(11)

Here the first variable denotes the difference of the position variables and the second the difference of the velocity variables; the third is their sum, and the fourth is the unit vector along that sum. The remaining two denote, respectively, the difference between the gradients at the two positions, and the difference between the gradients introduced because the second process is discretized at a scale of $\delta$.

Similar to Section 2, we initialize the first process according to the invariant distribution $p^*$, and thus when it evolves according to (9), it remains distributed according to the invariant distribution. The second process will denote the path of the iterates of Algorithm 2, and we will use the difference between these processes to track the distance between the distributions. We define a stochastic process

The dynamics of these processes are defined as follows:

(12)
(13)
(14)