Proximal Langevin Algorithm: Rapid Convergence Under Isoperimetry

11/04/2019
by Andre Wibisono

We study the Proximal Langevin Algorithm (PLA) for sampling from a probability distribution ν = e^{-f} on R^n under isoperimetry. We prove a convergence guarantee for PLA in Kullback-Leibler (KL) divergence when ν satisfies a log-Sobolev inequality (LSI) and f has bounded second and third derivatives. This improves on the result for the Unadjusted Langevin Algorithm (ULA), and matches the fastest known rate for sampling under LSI (without Metropolis filter) with a better dependence on the LSI constant. We also prove convergence guarantees for PLA in Rényi divergence of order q > 1 when the biased limit satisfies either LSI or Poincaré inequality.


1 Introduction

Sampling is a fundamental algorithmic task. While logconcave sampling is relatively well studied, there have been recent efforts to understand convergence guarantees for non-logconcave sampling. This is motivated by practical applications that require sampling from complicated distributions in high-dimensional spaces, as well as by recent progress in non-convex optimization.

In this paper we study the Proximal Langevin Algorithm (PLA) for sampling from a probability distribution ν = e^{-f} on R^n:

    x_{k+1} = x_k - ε ∇f(x_{k+1}) + √(2ε) z_k,

where ε > 0 is the step size and z_k ~ N(0, I_n) is an independent Gaussian random variable in R^n. The above is an implicit update, and we assume we can solve for x_{k+1}, for example via the proximal step (see Section 2 for more detail):

    x_{k+1} = prox_{εf}(x_k + √(2ε) z_k) = arg min_{x ∈ R^n} { f(x) + (1/(2ε)) ‖x - x_k - √(2ε) z_k‖² }.

PLA is a discretization of the continuous-time Langevin dynamics that uses the backward (implicit) method for the gradient. It is an implicit variant of the Unadjusted Langevin Algorithm (ULA), which uses the forward (explicit) method for the gradient. PLA was introduced by Pereyra [45] (in a more general form with Metropolis filter) from a smoothing perspective, and it was also proposed by Bernton [3] from an optimization perspective of sampling. PLA has been widely applied in practice, in particular when combined with ULA to sample from composite distributions [20, 9, 47, 38], and analyzed under logconcavity or strong logconcavity [3, 8, 4, 49]. Analogous to implicit vs. explicit methods in optimization, we expect PLA to have better properties than ULA at the cost of a more expensive per-iteration complexity (solving an optimization problem). See also [23, 52] for some recent extensions of PLA.

In this paper we prove convergence guarantees for PLA under isoperimetry, namely, when the target distribution satisfies either a log-Sobolev inequality (LSI) or a Poincaré inequality. Isoperimetry is a natural relaxation of logconcavity that still allows for fast sampling in continuous time. Strong logconcavity (SLC) implies LSI, which in turn implies Poincaré inequality with the same constant. Moreover, logconcavity together with a bounded diameter implies LSI and Poincaré inequality. However, isoperimetry is more general, as it is preserved under Lipschitz mappings and bounded perturbations, whereas logconcavity is not. Therefore, there is a wide class of probability distributions, including multimodal ones, that satisfy isoperimetry.

In continuous time, isoperimetry is sufficient for fast sampling. For example, if ν satisfies LSI with constant α > 0, then along the Langevin dynamics, the Kullback-Leibler (KL) divergence converges exponentially fast at rate 2α. This means the Langevin dynamics reaches KL divergence less than δ in time t = (1/(2α)) log(H_ν(ρ_0)/δ). In particular, there is no dependence on the dimension, and no assumption on the smoothness of f is required beyond the differentiability needed to run the Langevin dynamics. This is analogous to the exponential convergence of gradient flow for optimization in continuous time under a gradient domination condition (via the perspective of sampling as optimization in the space of measures [24, 55]).
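
To spell this out, here is the standard one-line calculation behind this claim, using the identity dH_ν(ρ_t)/dt = -J_ν(ρ_t) along the Langevin dynamics (reviewed in Section 4) together with the LSI:

```latex
\frac{d}{dt} H_\nu(\rho_t) = -J_\nu(\rho_t) \le -2\alpha\, H_\nu(\rho_t)
\;\;\Longrightarrow\;\;
H_\nu(\rho_t) \le e^{-2\alpha t}\, H_\nu(\rho_0),
```

so H_ν(ρ_t) ≤ δ once t ≥ (1/(2α)) log(H_ν(ρ_0)/δ).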

In discrete time, sampling is more challenging. We can discretize continuous-time dynamics to obtain algorithms, such as PLA or ULA above from the Langevin dynamics. We need some smoothness assumptions (bounds on the derivatives of f) to control the discretization error, so the iteration complexity now depends on the condition number. However, the discretization error leads to an asymptotic bias, which means the algorithm converges to the wrong distribution. This bias arises because in algorithms such as PLA or ULA we are applying mismatched splitting methods for solving a composite optimization problem in the space of measures; see [55] for more discussion.

It is possible to remove the bias by applying the Metropolis filter (accept-reject step) in each iteration; this has a geometric interpretation as projection in total variation (TV) distance [5]. With the Metropolis filter, it is possible to prove the algorithm still converges exponentially fast in discrete time, and obtain an iteration complexity with logarithmic dependence on the error δ to reach error δ in TV distance with a warm start, under various conditions such as strong logconcavity, isoperimetry, or distant dissipativity [7, 21, 37, 41]. However, if we want convergence in KL divergence (which is stronger), then the Metropolis filter does not work, because it makes the distributions singular (have point masses). Furthermore, the Metropolis filter can slow down the algorithm in practice when the rejection probability is high.

In this paper we follow another approach, which is to control the convergence of the algorithm and the size of the bias, then choose a small enough step size to make the error less than any given threshold. This approach was pioneered by Dalalyan [15, 14] and Durmus and Moulines [19] to analyze ULA under strong logconcavity, and has been extended to many other algorithms. However, the bias becomes a bottleneck in complexity. The bias scales with some power of the step size ε, resulting in an iteration complexity which is polynomial in 1/δ (rather than logarithmic as in continuous time) to reach error δ in KL divergence. For example, we show in [53] that under LSI and second-order smoothness, the bias of ULA is O(ε), resulting in an iteration complexity of Õ(1/δ) (ignoring dimension dependence for now). However, basic considerations suggest the correct bias is O(ε²), since ULA and PLA are first-order discretizations, which would yield an iteration complexity of Õ(1/√δ). In this paper we show this is indeed the case for PLA under LSI and third-order smoothness.
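
The step-size bookkeeping behind these two complexities, with C denoting a generic constant from the bias bound (not the paper's exact constant):

```latex
\text{bias } C\varepsilon \le \tfrac{\delta}{2}
\;\Rightarrow\; \varepsilon \asymp \delta,
\quad k \gtrsim \frac{1}{\alpha\varepsilon}\log\frac{2H_\nu(\rho_0)}{\delta}
= \tilde{\Theta}\Big(\frac{1}{\alpha\delta}\Big);
\qquad
\text{bias } C\varepsilon^2 \le \tfrac{\delta}{2}
\;\Rightarrow\; \varepsilon \asymp \sqrt{\delta},
\quad k = \tilde{\Theta}\Big(\frac{1}{\alpha\sqrt{\delta}}\Big).
```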

Our main result is the following. We say f is (L, M)-smooth if ‖∇²f(x)‖ ≤ L and ‖∇²f(x) - ∇²f(y)‖ ≤ M ‖x - y‖ for all x, y ∈ R^n. Here H_ν(ρ) = ∫ ρ log(ρ/ν) dx is the KL divergence of ρ with respect to ν. See Theorem 1 in Section 2.3 for detail.

Theorem 1 (main result; see Section 2.3). Assume ν satisfies LSI with constant α > 0 and f is (L, M)-smooth. For any sufficiently small step size ε > 0 (in particular with αε ≤ 1), the iterates x_k ~ ρ_k of PLA satisfy:

(1)  H_ν(ρ_k) ≤ e^{-αεk} H_ν(ρ_0) + (2Cε²)/α,

where C is an explicit constant depending polynomially on L, M, and n. This implies the following iteration complexity for PLA under LSI: to reach H_ν(ρ_k) ≤ δ, it suffices to run PLA with ρ_0 = N(x*, (1/L) I_n) and step size ε = √(αδ/(4C)) for

(2)  k = Θ̃( √C / (α^{3/2} δ^{1/2}) )

iterations. Here x* is a stationary point of f (∇f(x*) = 0), which we can find via gradient descent.

This improves on the result [53] for ULA, in which we show that under α-LSI and L-smoothness, ULA has iteration complexity Θ̃(nL²/(α²δ)). However, as noted above, it is likely the analysis in [53] is not tight, since it only implies a bias of O(ε) for ULA rather than the O(ε²) we prove for PLA in Theorem 1. We prove Theorem 1 by comparing a continuous-time interpolation of PLA with the Langevin dynamics to establish a recurrence for the decrease of KL divergence in each iteration; this technique is similar to [53] and earlier papers [14, 12]. Our improvement comes because we can show a tight error bound for the interpolation of PLA by comparing it with the weighted Langevin dynamics; see Section 2.5. Furthermore, we illustrate in the Gaussian case that the bias is indeed Θ(ε²).

Algorithm | Assumptions | Iterations to H_ν(ρ_k) ≤ δ | Iterations to W₂(ρ_k, ν) ≤ δ
ULA [53] | α-LSI, L-smoothness | Θ̃(nL²/(α²δ)) | Θ̃(nL²/(α³δ²))
Underdamped Langevin dynamics [32] | α-LSI, L-smoothness | Õ(√n L²/(α² δ^{1/2})) | Õ(√n L²/(α^{5/2} δ))
Randomized midpoint for ULD [50] | α-SLC, L-smoothness | — | Õ((L/α)^{7/6} n^{1/6} δ^{-1/3} + (L/α) n^{1/3} δ^{-2/3})
PLA (this paper) | α-LSI, (L, M)-smoothness | Θ̃(√C/(α^{3/2} δ^{1/2})) | Θ̃(√C/(α² δ))

Table 1: Iteration complexities for Langevin algorithms under LSI, and the fastest under SLC. Here H_ν is the KL divergence, W₂ is the Wasserstein distance, δ is the target accuracy, and C is the constant from (1).

We note a recent work [42] improves the analysis of ULA under LSI and third-order smoothness, with an additional dissipativity assumption, and shows an iteration complexity for ULA scaling as δ^{-1/2}, which is similar to our result (2) for PLA.

Currently the fastest (in terms of the error δ in KL divergence) algorithm for sampling under LSI is a discretization of the underdamped Langevin dynamics [32], which has iteration complexity scaling as δ^{-1/2}, with an α^{-2} dependence on the LSI constant, under α-LSI and L-smoothness to reach H_ν(ρ_k) ≤ δ. We see from (2) that PLA has the same dependence on δ but a better dependence on α.

We recall LSI implies Talagrand's inequality W₂(ρ, ν)² ≤ (2/α) H_ν(ρ), which bounds the Wasserstein distance by the KL divergence. Then Theorem 1 also implies the iteration complexity for PLA to reach W₂(ρ_k, ν) ≤ δ under α-LSI and (L, M)-smoothness is Θ̃(√C/(α² δ)), as recorded in (9) below.

A previous analysis [3] shows a polynomial, but slower, iteration complexity for PLA to reach W₂(ρ_k, ν) ≤ δ under α-SLC and L-smoothness. Thus, our result shows a better iteration complexity for PLA under SLC and third-order smoothness.

We note that for sampling under SLC, faster rates are achievable via more advanced algorithms, whose analyses are made possible by coupling techniques. Currently the fastest algorithm is a randomized midpoint discretization of the underdamped Langevin dynamics [50], which reaches W₂(ρ_k, ν) ≤ δ under α-SLC and L-smoothness with the iteration complexity listed in Table 1, and it can be made faster by parallelizing. See also [43] for a higher-order Langevin dynamics that achieves a similar iteration complexity under an additional separability assumption. Previously the fastest results were by Hamiltonian Monte Carlo [27, 11, 35, 36], various discretizations of the overdamped or underdamped Langevin dynamics [17, 18, 12, 14, 16], or higher-order integrators such as stochastic Runge-Kutta [28]. Thus, there is a gap between the known complexity for sampling under LSI and under SLC. It is interesting to understand whether these more advanced algorithms can be analyzed under LSI, where coupling techniques no longer work.

We also note that for the case when ν is logconcave, there are other methods that can be used, including the ball walk and hit-and-run [25, 30, 31, 29], which have iteration complexities with logarithmic dependence on the error in TV distance or χ²-divergence, and no dependence on the condition number.

Our second main result is a convergence guarantee in Rényi divergence of order q > 1 along PLA when the biased limit ν_ε satisfies either LSI or Poincaré inequality. Rényi divergence of order q > 1 is a stronger generalization of KL divergence (which is the case q = 1), with fundamental applications in statistics, physics, and computer science [46, 51, 6, 22]. Under LSI, Rényi divergence converges exponentially fast along the Langevin dynamics. Under Poincaré inequality, Rényi divergence still converges along the Langevin dynamics, but now at a rate which is initially linear, then exponential. We show that when the biased limit ν_ε of PLA satisfies either LSI or Poincaré inequality, Rényi divergence with respect to ν_ε converges along PLA at the same speed as along the Langevin dynamics. We can combine this with a decomposition property of Rényi divergence to obtain an iteration complexity for PLA in Rényi divergence with respect to ν which is controlled by the size of the bias R_q(ν_ε‖ν); see Theorem 1 in Section 3.2 and Theorem 2 in Section 3.3. Furthermore, the iteration complexity under Poincaré inequality is a factor of Θ̃(n) larger than the complexity under LSI. These results are similar to the results of [53] for ULA. However, we illustrate with an example in the Gaussian case that the bias of PLA in Rényi divergence is smaller than the bias of ULA, and is always finite, whereas the bias of ULA can be infinite.

The rest of this paper is organized as follows. In Section 2 we state the algorithm and main result on the convergence of KL divergence under LSI. In Section 3 we state the second result on the convergence of Rényi divergence under LSI or Poincaré inequality. In Section 4 we review the Langevin dynamics. In Sections 5 and 6 we provide proofs and details. We conclude with a discussion in Section 7.

2 Algorithm and main result

Let ν = e^{-f} be the target probability distribution on R^n, where f: R^n → R. We assume f is differentiable.

2.1 Proximal Langevin Algorithm

We study the Proximal Langevin Algorithm (PLA) that starts from any random variable x_0 ~ ρ_0 and maintains the iterates

(3)  x_{k+1} = x_k - ε ∇f(x_{k+1}) + √(2ε) z_k

where ε > 0 is the step size and z_k ~ N(0, I_n) is an independent Gaussian random variable. The above is an implicit update, and we assume we can solve for x_{k+1}, for example via the proximal step:

(4)  x_{k+1} = prox_{εf}(x_k + √(2ε) z_k) = arg min_{x ∈ R^n} { f(x) + (1/(2ε)) ‖x - x_k - √(2ε) z_k‖² }

Indeed, the solution x of (4) satisfies ∇f(x) + (1/ε)(x - x_k - √(2ε) z_k) = 0, which is (3). Note that the formulation (4) also makes sense when f is not differentiable. If f is L-smooth (‖∇²f‖ ≤ L), then for 0 < ε < 1/L, (4) is a strongly convex optimization problem with a unique minimizer, so PLA is well-defined. If f is convex, then the restriction ε < 1/L can be removed.
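
To make the update concrete, here is a minimal numpy sketch of PLA with the proximal step (4) solved by an inner gradient descent loop. The toy potential f(x) = ‖x‖²/2 + Σ_i log cosh(x_i) (which is (2, 1)-smooth) and all numerical choices are illustrative assumptions, not from the paper:

```python
import numpy as np

def grad_f(x):
    # Gradient of the toy potential f(x) = ||x||^2/2 + sum_i log cosh(x_i).
    return x + np.tanh(x)

def prox_step(y, eps, n_inner=200):
    # Solve argmin_x f(x) + ||x - y||^2 / (2 eps) by gradient descent.
    # For eps < 1/L the objective is (1/eps - L)-strongly convex (L = 2 here),
    # so plain gradient descent with a fixed step converges linearly.
    L = 2.0
    lr = 1.0 / (L + 1.0 / eps)
    x = y.copy()                       # warm start at the noisy point y
    for _ in range(n_inner):
        x = x - lr * (grad_f(x) + (x - y) / eps)
    return x

def pla_step(x, eps, rng):
    # One PLA iterate (3): x_{k+1} = prox_{eps f}(x_k + sqrt(2 eps) z_k).
    y = x + np.sqrt(2.0 * eps) * rng.standard_normal(x.shape)
    return prox_step(y, eps)

rng = np.random.default_rng(0)
x = np.zeros(5)
for _ in range(1000):
    x = pla_step(x, eps=0.1, rng=rng)
print(x)   # one approximate sample from nu = e^{-f}
```

The inner loop is where PLA pays its higher per-iteration cost relative to ULA, which would instead take the single explicit step x - ε ∇f(x) + √(2ε) z.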

Example 1.

Let ν = N(μ, Σ) be Gaussian with mean μ ∈ R^n and covariance Σ ≻ 0, so f(x) = ½ (x - μ)ᵀ Σ⁻¹ (x - μ) up to an additive constant. The PLA iteration is x_{k+1} = x_k - ε Σ⁻¹ (x_{k+1} - μ) + √(2ε) z_k, so x_{k+1} = μ + A (x_k - μ) + √(2ε) A z_k, where A = (I + ε Σ⁻¹)⁻¹ and z_k ~ N(0, I_n) is independent. Note that ‖A‖ < 1 for any ε > 0, so x_k converges in distribution as k → ∞. Thus, for any ε > 0, PLA converges to ν_ε = N(μ, Σ_ε) where Σ_ε = Σ (I + (ε/2) Σ⁻¹)⁻¹.
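
A quick numerical check of this limit in one dimension (a sketch; σ² = 2 and ε = 0.5 are arbitrary choices). We iterate the variance recursion implied by the PLA update and compare against σ²/(1 + ε/(2σ²)):

```python
sigma2, eps = 2.0, 0.5
a = 1.0 / (1.0 + eps / sigma2)     # 1-D contraction factor A = (1 + eps/sigma^2)^{-1}
s = 0.0                            # Var(x_k), starting from a point mass at mu
for _ in range(200):
    s = a * a * (s + 2.0 * eps)    # Var(x_{k+1}) = A^2 (Var(x_k) + 2 eps)
print(s)                                        # -> 1.7777...
print(sigma2 / (1.0 + eps / (2.0 * sigma2)))    # predicted limit, also 1.7777...
```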

PLA (in a more general form with a Metropolis filter) was first introduced in [45] from a smoothing perspective. PLA was also studied in [3] from the optimization perspective of sampling under a logconcavity assumption. (Both [45, 3] apply the proximal step before the Gaussian step in each iteration, while PLA applies them in the opposite order; over k iterations, they differ from PLA only by a single proximal or Gaussian step.) PLA is the implicit variant of another popular algorithm, the Unadjusted Langevin Algorithm (ULA), which is the explicit iteration x_{k+1} = x_k - ε ∇f(x_k) + √(2ε) z_k. PLA and ULA are discretizations of the Langevin dynamics in continuous time (see Section 4 for a review), where PLA applies the backward (implicit) method to discretize the gradient, while ULA applies the forward (explicit) method. This makes PLA more expensive to implement in practice, but it offers better behavior and analysis than ULA. For example, in the Gaussian case ν = N(μ, Σ), recall ULA converges to a biased limit ν̃_ε only for 0 < ε < 2/L, where L = ‖Σ⁻¹‖; see also [55]. In this case PLA always converges, for any ε > 0, and its bias is smaller than the bias of ULA.

Example 2 (PLA vs. ULA for Gaussian).

Let ν = N(0, Σ). For any ε > 0, the limit of PLA is ν_ε = N(0, Σ_ε) with Σ_ε = Σ (I + (ε/2) Σ⁻¹)⁻¹. For 0 < ε < 2/L, the limit of ULA is ν̃_ε = N(0, Σ̃_ε) with Σ̃_ε = Σ (I - (ε/2) Σ⁻¹)⁻¹. Let σ_1² ≥ ⋯ ≥ σ_n² > 0 denote the eigenvalues of Σ, and write u_i = ε/(2σ_i²). The bias of PLA in relative entropy is

H_ν(ν_ε) = (1/2) Σ_{i=1}^n ( log(1 + u_i) - u_i/(1 + u_i) ),

while the bias of ULA is

H_ν(ν̃_ε) = (1/2) Σ_{i=1}^n ( u_i/(1 - u_i) + log(1 - u_i) ).

Note that we always have

H_ν(ν_ε) ≤ H_ν(ν̃_ε).

Furthermore, H_ν(ν_ε) = Θ(ε²) and H_ν(ν̃_ε) = Θ(ε²) as ε → 0.
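
A numerical sanity check of these closed forms (a sketch; the eigenvalues of Σ and the values of ε below are arbitrary choices). The last column verifies the Θ(ε²) scaling of the PLA bias:

```python
import numpy as np

def kl_bias(u, sign):
    # Per-eigenvalue bias 0.5 * (lam - 1 - log lam), where lam is the ratio of
    # the limiting variance to the target variance: lam = 1/(1 + sign*u),
    # with sign = +1 for PLA, sign = -1 for ULA, and u_i = eps/(2 sigma_i^2).
    lam = 1.0 / (1.0 + sign * u)
    return 0.5 * np.sum(lam - 1.0 - np.log(lam))

sig2 = np.array([0.5, 1.0, 3.0])          # eigenvalues sigma_i^2 of Sigma
for eps in (0.1, 0.05, 0.025):
    u = eps / (2.0 * sig2)
    pla, ula = kl_bias(u, +1), kl_bias(u, -1)
    print(eps, pla, ula, pla <= ula, pla / eps**2)   # last ratio ~ constant
```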

2.2 Log-Sobolev inequality and smoothness assumption

Before stating our results, we recall some definitions for the analysis.

2.2.1 Log-Sobolev inequality

Let ρ, ν be probability distributions on R^n with smooth densities and finite second moments. We recall the relative entropy (or Kullback-Leibler (KL) divergence) of ρ with respect to ν is

(5)  H_ν(ρ) = ∫_{R^n} ρ(x) log ( ρ(x)/ν(x) ) dx.

Relative entropy has the property that H_ν(ρ) ≥ 0, and H_ν(ρ) = 0 if and only if ρ = ν. The relative Fisher information of ρ with respect to ν is

(6)  J_ν(ρ) = ∫_{R^n} ρ(x) ‖∇ log ( ρ(x)/ν(x) )‖² dx.

The geometric meaning of relative Fisher information is as the squared gradient of relative entropy in the space of probability measures with the Wasserstein metric.

We recall ν satisfies the log-Sobolev inequality (LSI) with constant α > 0 if for all ρ,

(7)  H_ν(ρ) ≤ (1/(2α)) J_ν(ρ).

LSI has the geometric interpretation as a gradient domination condition for relative entropy in the Wasserstein space [44], which ensures the Langevin dynamics converges exponentially fast in continuous time; see Section 4 for a review. We recall ν = e^{-f} is strongly logconcave (SLC) with constant α > 0 if f is α-strongly convex: ∇²f(x) ⪰ α I for all x ∈ R^n. SLC is a strong condition that allows the analysis of many sampling algorithms. However, SLC is brittle, as it is not preserved under perturbation or arbitrary mapping. A classic result by Bakry and Émery [2] shows that SLC with constant α implies LSI with the same constant α. Furthermore, LSI is more stable, as it is preserved under bounded perturbation and Lipschitz mapping. LSI also has an isoperimetric content as a bound on the log-Cheeger constant; see for example [26]. Therefore, LSI provides a natural condition to obtain fast sampling in discrete time.
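
As a standard example (not specific to this paper): for the Gaussian ν = N(0, Σ) we have f(x) = ½ xᵀ Σ⁻¹ x up to an additive constant, so f is α-strongly convex with α = 1/‖Σ‖, and Bakry-Émery gives α-LSI:

```latex
\nabla^2 f(x) = \Sigma^{-1} \succeq \frac{1}{\|\Sigma\|}\, I
\quad\Longrightarrow\quad
H_\nu(\rho) \le \frac{\|\Sigma\|}{2}\, J_\nu(\rho) \quad \text{for all } \rho.
```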

2.2.2 Smoothness assumptions

We say f is (L, M)-smooth if f is three-times differentiable and satisfies the following two conditions:

  1. The gradient ∇f is L-Lipschitz: ‖∇f(x) - ∇f(y)‖ ≤ L ‖x - y‖ for all x, y ∈ R^n,

    or equivalently, -L I ⪯ ∇²f(x) ⪯ L I, which means ‖∇²f(x)‖ ≤ L.

  2. The Hessian ∇²f is M-Lipschitz in the operator norm: ‖∇²f(x) - ∇²f(y)‖ ≤ M ‖x - y‖ for all x, y ∈ R^n.

    This implies ‖∇³f(x)[v]‖ ≤ M for all unit vectors v ∈ R^n, where ∇³f(x)[v] is the matrix with (i, j) entry Σ_k ( ∂³f(x)/∂x_i ∂x_j ∂x_k ) v_k.
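
Two illustrative examples of these conditions (standard facts, not from the paper): a Gaussian potential has M = 0, while the toy potential used in the code sketch above picks up a nonzero M from its third derivative:

```latex
f(x) = \tfrac{1}{2} x^\top \Sigma^{-1} x:
\quad \nabla^2 f \equiv \Sigma^{-1}, \ \text{so } L = \|\Sigma^{-1}\|,\ M = 0;
\qquad
f(x) = \tfrac{1}{2}\|x\|^2 + \sum_{i=1}^n \log\cosh(x_i):
\quad L \le 2,\ M \le 1.
```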

2.3 Main result: Convergence of relative entropy along PLA under LSI

Our first main result is the following convergence guarantee of relative entropy along PLA when the target distribution ν = e^{-f} satisfies LSI and a smoothness assumption. We note that smoothness is only used in the analysis and not for the definition of PLA.

Theorem 1. Assume ν satisfies LSI with constant α > 0 and f is (L, M)-smooth. For any sufficiently small step size ε > 0 (in particular with αε ≤ 1), the iterates x_k ~ ρ_k of PLA satisfy the bound (1).

Here ρ_k is the distribution of x_k along PLA. We provide the proof of Theorem 1 in Section 2.5.

As k → ∞, this implies the bias of PLA is O(ε²). This bias is of the right order, since for a Gaussian ν we have H_ν(ν_ε) = Θ(ε²). This is smaller than the O(ε) bias for ULA from [53], and thus yields a faster iteration complexity for PLA.

Concretely, given δ > 0, to reach error H_ν(ρ_k) ≤ δ, it suffices to run PLA such that the two terms in (1) are each less than δ/2: so we want to run PLA with step size ε = √(αδ/(4C)) for k ≥ (1/(αε)) log(2 H_ν(ρ_0)/δ) iterations. If we start with a Gaussian ρ_0 = N(x*, (1/L) I_n), where x* is a stationary point of f (∇f(x*) = 0, which we can find via gradient descent), then H_ν(ρ_0) = Õ(n); see [53, Lemma 1]. Therefore, Theorem 1 implies the following iteration complexity for PLA.

Corollary 1.

Assume ν satisfies LSI with constant α > 0 and f is (L, M)-smooth. To reach H_ν(ρ_k) ≤ δ, it suffices to run PLA with ρ_0 = N(x*, (1/L) I_n) and step size ε = √(αδ/(4C)) for

(8)  k = Θ̃( √C / (α^{3/2} δ^{1/2}) )

iterations.

This matches the best known rate (in terms of δ) for sampling under LSI, achieved by the underdamped Langevin algorithm [32], but PLA has a better dependence on the LSI constant α.

Furthermore, since LSI implies Talagrand's inequality W₂(ρ, ν)² ≤ (2/α) H_ν(ρ) with the same constant [44], Theorem 1 also implies the iteration complexity for PLA to reach W₂(ρ_k, ν) ≤ δ under α-LSI and (L, M)-smoothness is

(9)  k = Θ̃( √C / (α² δ) ).

2.4 Analysis of relative entropy in one step of PLA

The proof of Theorem 1 relies on the following result, which says relative entropy decreases by a constant factor, up to an additional O(ε³) error term, in each step of PLA; this leads to the O(ε²) bias for PLA as stated in Theorem 1. In contrast, recall the analogous result for ULA [53, Lemma 3] has an O(ε²) error in each iteration, which leads to the O(ε) bias for ULA.

In the following, ρ_k is the probability distribution of the iterate x_k of PLA. We provide the proof of Lemma 1 in Section 5.1.

Lemma 1.

Assume ν satisfies LSI with constant α > 0, f is (L, M)-smooth, and ε > 0 is sufficiently small (in particular αε ≤ 1). In each step of PLA, we have

(10)  H_ν(ρ_{k+1}) ≤ e^{-αε} H_ν(ρ_k) + C ε³,

where C is the explicit constant appearing in Theorem 1.
Proof sketch.

The output x_{k+1} of PLA (3) is the value at time t = ε of the stochastic process

(11)  x_t = prox_{tf}(x_k + √2 W_t), 0 ≤ t ≤ ε,

starting at x_0 = x_k, where (W_t) is the standard Brownian motion in R^n. We show in Lemma 5 that (11) evolves following an SDE of the form

(12)  dx_t = v_t(x_t) dt + √(2 A_t(x_t)) dW_t,

where A_t(x) is a symmetric positive definite matrix and v_t is a drift vector field, both given explicitly in Lemma 5. Recall the weighted Langevin dynamics with covariance A converges to ν if the drift is -A ∇f + ∇·A (see Section 4 for a review). The difference between the drift in (12) and this ideal drift is a discretization error term; in Lemma 6 we bound its moments using the (L, M)-smoothness of f. Using these bounds and the LSI assumption, we can show the time derivative of relative entropy along (12) is bounded by

(13)  (d/dt) H_ν(ρ_{k,t}) ≤ -α H_ν(ρ_{k,t}) + O(ε²),

where ρ_{k,t} is the law of x_t. Integrating (13) for 0 ≤ t ≤ ε yields the desired bound (10). See Section 5.1 for a full proof. ∎

2.5 Proof of Theorem 1

Proof of Theorem 1.

By iterating the bound (10) from Lemma 1, we have

H_ν(ρ_k) ≤ e^{-αεk} H_ν(ρ_0) + C ε³ Σ_{j=0}^{k-1} e^{-αεj} ≤ e^{-αεk} H_ν(ρ_0) + (C ε³)/(1 - e^{-αε}) ≤ e^{-αεk} H_ν(ρ_0) + (2 C ε²)/α,

where in the last step we use 1 - e^{-c} ≥ c/2 for 0 ≤ c ≤ 1, which holds because αε ≤ 1 by assumption. ∎

3 Convergence in Rényi divergence

Before stating our next result, we review the definition and some properties of Rényi divergence.

3.1 Rényi divergence

The Rényi divergence of order q > 0, q ≠ 1, of a probability distribution ρ with respect to ν is

(14)  R_q(ρ‖ν) = (1/(q - 1)) log ∫_{R^n} ( ρ(x)/ν(x) )^q ν(x) dx.

As q → 1, Rényi divergence recovers the relative entropy (KL divergence): lim_{q→1} R_q(ρ‖ν) = H_ν(ρ). Rényi divergence satisfies R_q(ρ‖ν) ≥ 0 for all ρ, and R_q(ρ‖ν) = 0 if and only if ρ = ν. Furthermore, the map q ↦ R_q(ρ‖ν) is increasing. Therefore, Rényi divergence of order q > 1 gives a family of stronger generalizations of KL divergence. Rényi divergence has fundamental applications in statistics, physics, and computer science [46, 22, 1, 39, 13, 40, 51, 6]. We recall Rényi divergence converges exponentially fast along the Langevin dynamics under LSI; see Section 4.
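
To make definition (14) concrete, here is a small quadrature sketch for two one-dimensional Gaussians (the densities and the grid are arbitrary choices); it also exhibits the monotonicity of q ↦ R_q:

```python
import numpy as np

x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
gauss = lambda m, s2: np.exp(-(x - m)**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
rho, nu = gauss(0.5, 1.0), gauss(0.0, 1.5)

def renyi(q):
    # R_q(rho || nu) = (1/(q-1)) log int (rho/nu)^q nu dx, by quadrature.
    return np.log(np.sum((rho / nu)**q * nu) * dx) / (q - 1.0)

for q in (1.01, 1.5, 2.0, 3.0):
    print(q, renyi(q))     # values increase with q
```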

Convergence guarantees in Rényi divergence for sampling in discrete time were first studied in [53], which shows that Rényi divergence converges along ULA to its biased limit at the same rate as along the Langevin dynamics, when the biased limit itself satisfies either LSI or Poincaré inequality. We will show a similar convergence guarantee for PLA in Section 3.2. Thus, the iteration complexity is dominated by the bias R_q(ν_ε‖ν). We recall the bias of ULA can be infinite for large enough q, even in the Gaussian case [53, Example 3]. On the other hand, the bias of PLA in the Gaussian case is always finite and smaller than the bias of ULA, as we show in the following example.

Example 3.

Let ν = N(0, σ² I_n) and write u = ε/(2σ²). For any ε > 0, the limit of PLA is ν_ε = N(0, (σ²/(1 + u)) I_n), and the bias is finite for all q > 1:

R_q(ν_ε‖ν) = (n/(2(q - 1))) ( q log(1 + u) - log(1 + qu) ).

On the other hand, for 0 < ε < 2σ², the limit of ULA is ν̃_ε = N(0, (σ²/(1 - u)) I_n), and the bias is:

R_q(ν̃_ε‖ν) = (n/(2(q - 1))) ( q log(1 - u) - log(1 - qu) ),

which is finite only for q < 1/u = 2σ²/ε, and infinite for q ≥ 2σ²/ε. For all 1 < q < 2σ²/ε, we have R_q(ν_ε‖ν) ≤ R_q(ν̃_ε‖ν).
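
A numerical check of these expressions (a sketch; σ² = 1, ε = 0.4, and n = 1 are arbitrary choices, so u = 0.2 and the ULA bias blows up at q = 1/u = 5):

```python
import numpy as np

sigma2, eps = 1.0, 0.4
u = eps / (2.0 * sigma2)            # = 0.2

def renyi_bias(q, sign):
    # Bias per dimension; sign = +1 for the PLA limit, sign = -1 for the ULA limit.
    if 1.0 + sign * q * u <= 0.0:
        return np.inf               # ULA bias is infinite once q >= 2*sigma2/eps
    return (q * np.log(1.0 + sign * u)
            - np.log(1.0 + sign * q * u)) / (2.0 * (q - 1.0))

for q in (2.0, 3.0, 4.0, 4.9, 5.0, 6.0):
    print(q, renyi_bias(q, +1), renyi_bias(q, -1))   # PLA finite and smaller
```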

3.2 Convergence of Rényi divergence along PLA under LSI

Our second main result is the following convergence guarantee in Rényi divergence along PLA, assuming the biased limit ν_ε satisfies LSI. We provide the proof of Theorem 1 in Section 6.2.

Theorem 1.

Assume the biased limit ν_ε satisfies LSI with constant β > 0, f is (L, M)-smooth, and ε > 0 is sufficiently small. Let q > 1. Then along PLA, for all k ≥ 0,

(15)  R_q(ρ_k‖ν_ε) ≤ e^{-2βεk/q} R_q(ρ_0‖ν_ε).

This result shows the iteration complexity for Rényi divergence along PLA depends on the bias. For δ > 0, assume ε is small enough so the bias R_q(ν_ε‖ν) contributes at most δ/2 to the error. Theorem 1 states that to achieve R_q(ρ_k‖ν) ≤ δ, it suffices to run PLA with step size ε for

(16)  k = Θ( (q/(βε)) log( R_q(ρ_0‖ν_ε) / δ ) )

iterations. If we choose ρ_0 to be a proximal step from a Gaussian centered at a stationary point x* of f (∇f(x*) = 0), then the initial Rényi divergence scales with n, as we show below.

Lemma 2.

Assume f is (L, M)-smooth, and 0 < ε < 1/L. Let ρ_0 be the law of a proximal step from a Gaussian: x_0 = prox_{εf}(y_0) with y_0 ~ N(x*, (1/L) I_n) (concretely, x_0 solves x_0 = y_0 - ε ∇f(x_0), where x* is a stationary point of f). For all q > 1, R_q(ρ_0‖ν_ε) = Õ(n).

Thus, Theorem 1 yields an iteration complexity of

(17)  k = Θ̃( (q/(βε)) log(n/δ) )

for PLA under LSI to reach R_q(ρ_k‖ν) ≤ δ, with ε chosen so the bias R_q(ν_ε‖ν) is at most δ/2. For example, if the bias is Θ(ε), then the iteration complexity is Õ(q/(βδ)). If the bias is Θ(ε²), as in the Gaussian case (Example 3), then the iteration complexity is Õ(q/(β δ^{1/2})). However, in general we do not know how to control this bias.

3.3 Convergence of Rényi divergence along PLA under Poincaré

We recall ν satisfies the Poincaré inequality with constant β > 0 if for all smooth functions g: R^n → R,

Var_ν(g) ≤ (1/β) E_ν[ ‖∇g‖² ],

where Var_ν(g) = E_ν[g²] - E_ν[g]² is the variance of g under ν. Poincaré inequality is an isoperimetry condition which is weaker than LSI. LSI implies Poincaré inequality with the same constant, and in fact Poincaré inequality is a linearization of LSI [48, 54]. Like LSI, Poincaré inequality is preserved under bounded perturbation and Lipschitz mapping. However, Poincaré inequality is more general than LSI; for example, distributions satisfying LSI have sub-Gaussian tails, while distributions satisfying Poincaré inequality can have sub-exponential tails. Whereas LSI is equivalent to a bound on the log-Cheeger constant, Poincaré inequality is equivalent to a bound on the Cheeger constant [26]. We recall that when ν satisfies Poincaré inequality, Rényi divergence converges along the Langevin dynamics at a rate which is initially linear, then exponential; see Section 4 for a review.

Our third main result is the following convergence guarantee in Rényi divergence along PLA, assuming the biased limit ν_ε satisfies Poincaré inequality. We provide the proof of Theorem 2 in Section 6.5.

Theorem 2.

Assume the biased limit ν_ε satisfies Poincaré inequality with constant β > 0, f is (L, M)-smooth, and ε > 0 is sufficiently small. Let q > 1. Then along PLA, for all k ≥ 0,

(18)  R_q(ρ_k‖ν_ε) ≤ R_q(ρ_0‖ν_ε) - (2βε/q) k while R_q(ρ_k‖ν_ε) ≥ 1; once R_q(ρ_{k_0}‖ν_ε) ≤ 1 at some iteration k_0, the divergence decays exponentially: R_q(ρ_k‖ν_ε) ≤ e^{-(2βε/q)(k - k_0)} for k ≥ k_0.

This result shows the iteration complexity for Rényi divergence along PLA depends on the bias. For δ > 0, assume ε is small enough so the bias R_q(ν_ε‖ν) contributes at most δ/2 to the error. Theorem 2 states that to achieve R_q(ρ_k‖ν) ≤ δ, it suffices to run PLA with step size ε for

(19)  k = Θ( (q/(βε)) ( R_q(ρ_0‖ν_ε) + log(1/δ) ) )

iterations. Note the dependence on the initial value R_q(ρ_0‖ν_ε) is now linear, rather than logarithmic as under LSI (16). As in Lemma 2, if we choose ρ_0 to be a proximal step from a Gaussian, then R_q(ρ_0‖ν_ε) = Õ(n). Thus, Theorem 2 yields an iteration complexity of

(20)  k = Θ̃( qn/(βε) )

for PLA under Poincaré to reach R_q(ρ_k‖ν) ≤ δ, with ε chosen so the bias is at most δ/2. This is a factor of Θ̃(n) larger than the complexity under LSI (17).

4 A review on Langevin dynamics

Notation.

For a matrix A ∈ R^{n×n}, let ‖A‖ denote the operator norm and ‖A‖_HS the Hilbert-Schmidt norm. If A is symmetric with eigenvalues λ_1, …, λ_n, then ‖A‖ = max_i |λ_i| and ‖A‖_HS = (Σ_i λ_i²)^{1/2}. Note that ‖A‖ ≤ ‖A‖_HS ≤ √n ‖A‖.

For a differentiable function f: R^n → R, let ∇f denote the gradient vector, ∇²f the Hessian matrix, and ∇³f the tensor of third-order derivatives. Let Δf = Tr(∇²f) = Σ_i ∂²f/∂x_i² denote the Laplacian of f.

Let ∇· denote the divergence operator, which acts on a vector field v: R^n → R^n by (∇·v)(x) = Σ_i ∂v_i(x)/∂x_i. The divergence of the gradient is the Laplacian: ∇·(∇f) = Δf. We will use the integration by parts formula ∫ ⟨v, ∇g⟩ dx = -∫ g (∇·v) dx, where the boundary term is zero if g and v have sufficiently fast decay at infinity.

For a matrix-valued function A: R^n → R^{n×n} with entries A_ij, let ∇·A denote the vector field whose i-th component is (∇·A)_i = Σ_j ∂A_ij/∂x_j; equivalently, (∇·A)_i = ∇·A_i, where A_i is the i-th row of A. Let ∇·(∇·A) = Σ_{i,j} ∂²A_ij/∂x_i ∂x_j.

4.1 Weighted Langevin dynamics

Let A: R^n → R^{n×n} be a differentiable matrix-valued function such that A(x) is symmetric positive definite for each x. Recall the weighted Langevin dynamics for ν = e^{-f} with covariance A is the SDE

(21)  dx_t = ( -A(x_t) ∇f(x_t) + (∇·A)(x_t) ) dt + √(2 A(x_t)) dW_t,

where (W_t)_{t≥0} is the standard Brownian motion on R^n. The drift term above is chosen to ensure that ν is a stationary measure for the weighted Langevin dynamics (21). This is apparent from the following Fokker-Planck equation; see also [33, 28].
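
A minimal one-dimensional Euler-Maruyama simulation of (21) (a sketch; the target ν = N(0, 1) and the covariance a(x) = 1 + ½ sin x are arbitrary choices, made non-constant so that the ∇·A correction in the drift matters):

```python
import numpy as np

rng = np.random.default_rng(1)
a  = lambda x: 1.0 + 0.5 * np.sin(x)      # scalar covariance A(x) > 0
da = lambda x: 0.5 * np.cos(x)            # its divergence (plain derivative in 1-D)
fp = lambda x: x                          # grad f for the target nu = N(0, 1)

dt, x = 2e-3, np.zeros(5000)              # many independent chains in parallel
for _ in range(5000):
    drift = -a(x) * fp(x) + da(x)         # weighted Langevin drift from (21)
    x += drift * dt + np.sqrt(2.0 * a(x) * dt) * rng.standard_normal(x.size)
print(x.mean(), x.var())                  # approx 0 and 1: nu is stationary
```

Dropping the da(x) term from the drift leaves a visibly biased stationary variance, which is one way to see why the ∇·A correction is needed.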

Lemma 3.

If x_t evolves following the weighted Langevin dynamics (21), then its density ρ_t evolves following

(22)  ∂ρ_t/∂t = ∇·( ρ_t A ∇ log(ρ_t/ν) ).
Proof.

Recall that for a general Langevin dynamics dx_t = b(x_t) dt + √(2 A(x_t)) dW_t, the Fokker-Planck equation for the density ρ_t of x_t is (see for example [34, 56]):

(23)  ∂ρ/∂t = -∇·(ρ b) + Σ_{i,j} ∂²(A_ij ρ)/(∂x_i ∂x_j),

where for simplicity we write ρ in place of ρ_t.