Sampling is a fundamental algorithmic task. While the case of logconcave sampling is relatively well-studied, there are recent efforts in understanding the convergence guarantees for non-logconcave sampling. This is motivated by practical applications which require sampling complicated distributions in high-dimensional spaces, as well as by the recent progress in non-convex optimization.
In this paper we study the Proximal Langevin Algorithm (PLA) for sampling from a probability distribution $\nu = e^{-f}$ on $\mathbb{R}^n$:
$$x_{k+1} = x_k - h \nabla f(x_{k+1}) + \sqrt{2h}\, z_k,$$
where $h > 0$ is the step size and $z_k \sim \mathcal{N}(0, I)$ is an independent Gaussian random variable in $\mathbb{R}^n$. The above is an implicit update, and we assume we can solve for $x_{k+1}$, for example via the proximal step (see Section 2 for more detail):
$$x_{k+1} = \arg\min_{x \in \mathbb{R}^n} \Big\{ f(x) + \frac{1}{2h} \big\| x - x_k - \sqrt{2h}\, z_k \big\|^2 \Big\}.$$
PLA is a discretization of the continuous-time Langevin dynamics that uses the backward (implicit) method for the gradient. It is an implicit variant of the Unadjusted Langevin Algorithm (ULA), which uses the forward (explicit) method for the gradient. PLA was introduced by Pereyra  (in a more general form with Metropolis filter) from a smoothing perspective, and it was also proposed by Bernton  from an optimization perspective of sampling. PLA has been widely applied in practice, in particular when combined with ULA to sample from composite distributions [20, 9, 47, 38], and analyzed under logconcavity or strong logconcavity [3, 8, 4, 49]. Analogous to implicit vs. explicit methods in optimization, we expect PLA to have better properties than ULA at the cost of a more expensive per-iteration complexity (solving an optimization problem). See also [23, 52] for some recent extensions of PLA.
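To make the implicit-versus-explicit distinction concrete, here is a minimal sketch of one PLA step next to one ULA step. This is our own illustration, not code from the paper: the inner gradient-descent solver for the proximal subproblem, its step size, and the quadratic test potential below are all illustrative choices.

```python
import math, random

def pla_step(x, h, grad_f, n_inner=200, lr=None):
    """One Proximal Langevin Algorithm step for sampling from nu ~ exp(-f).

    Solves the implicit update  x_new = x - h*grad_f(x_new) + sqrt(2h)*z
    by minimizing the proximal objective
        F(u) = f(u) + ||u - y||^2 / (2h),   y = x + sqrt(2h)*z,
    with plain gradient descent (illustrative inner solver; the default
    learning rate is tuned for the quadratic example below).
    """
    n = len(x)
    z = [random.gauss(0.0, 1.0) for _ in range(n)]
    y = [xi + math.sqrt(2 * h) * zi for xi, zi in zip(x, z)]
    u = list(y)                           # warm start at the noisy point y
    lr = lr if lr is not None else h / 2
    for _ in range(n_inner):
        g = grad_f(u)
        # gradient of F(u) is grad_f(u) + (u - y)/h
        u = [ui - lr * (gi + (ui - yi) / h) for ui, gi, yi in zip(u, g, y)]
    return u

def ula_step(x, h, grad_f):
    """Explicit (forward) counterpart: x_new = x - h*grad_f(x) + sqrt(2h)*z."""
    z = [random.gauss(0.0, 1.0) for _ in range(len(x))]
    return [xi - h * gi + math.sqrt(2 * h) * zi
            for xi, gi, zi in zip(x, grad_f(x), z)]

# For f(x) = ||x||^2 / 2 the proximal step has the closed form
# prox_{hf}(y) = y / (1 + h), which the inner solver recovers.
```

The per-iteration cost difference is visible here: ULA needs one gradient evaluation, while PLA runs an inner optimization loop.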
In this paper we prove convergence guarantees for PLA under isoperimetry, namely, when the target distribution $\nu$ satisfies either the log-Sobolev inequality (LSI) or the Poincaré inequality. Isoperimetry is a natural relaxation of logconcavity that still allows for fast sampling in continuous time. Strong logconcavity (SLC) implies LSI, which in turn implies the Poincaré inequality with the same constant. Moreover, logconcavity and bounded diameter imply LSI and the Poincaré inequality. However, isoperimetry is more general, as it is preserved under Lipschitz mappings and bounded perturbations, whereas logconcavity is not. Therefore, there is a wide class of probability distributions, including multimodal ones, satisfying isoperimetry.
In continuous time, isoperimetry is sufficient for fast sampling. For example, if $\nu$ satisfies LSI with constant $\alpha > 0$, then along the Langevin dynamics, the Kullback-Leibler (KL) divergence converges exponentially fast at rate $2\alpha$. This means the Langevin dynamics reaches KL divergence less than $\varepsilon$ in time $\frac{1}{2\alpha} \log \frac{H_\nu(\rho_0)}{\varepsilon}$. In particular, there is no dependence on dimension, and no assumption on the smoothness of $f$ is required beyond differentiability in order to run the Langevin dynamics. This is analogous to the exponential convergence of gradient flow for optimization in continuous time under a gradient domination condition (via the perspective of sampling as optimization in the space of measures [24, 55]).
In discrete time, sampling is more challenging. We can discretize continuous-time dynamics to obtain algorithms, such as PLA or ULA above from the Langevin dynamics. We need some smoothness assumptions (bounds on derivatives of ) to control the discretization error, so the iteration complexity now depends on the condition number. However, the discretization error leads to an asymptotic bias, which means the algorithm converges to the wrong distribution. This bias arises because in algorithms such as PLA or ULA we are applying mismatched splitting methods for solving a composite optimization problem in the space of measures; see  for more discussion.
It is possible to remove the bias by applying the Metropolis filter (accept-reject step) in each iteration; this has a geometric interpretation as a projection in total variation (TV) distance. With the Metropolis filter, it is possible to prove the algorithm still converges exponentially fast in discrete time, and to obtain an iteration complexity that is polylogarithmic in $\frac{1}{\varepsilon}$ to reach error $\varepsilon$ in TV distance with warm start and under various conditions such as strong logconcavity, isoperimetry, or distant dissipativity [7, 21, 37, 41]. However, if we want convergence in KL divergence—which is stronger—then the Metropolis filter does not work, because it makes the distributions singular (have point masses). Furthermore, the Metropolis filter can slow down the algorithm in practice when the rejection probability is high.
In this paper we follow another approach, which is to control the convergence of the algorithm and the size of the bias, then choose a small enough step size to make the error less than any given threshold. This approach was pioneered by Dalalyan [15, 14] and Durmus and Moulines  to analyze ULA under strong logconcavity, and has been extended to many other algorithms. However, the bias becomes a bottleneck in the complexity. The bias scales with some power of the step size $h$, resulting in an iteration complexity which is polynomial in $\frac{1}{\varepsilon}$ (rather than logarithmic as in continuous time) to reach error $\varepsilon$ in KL divergence. For example, we show in  that under LSI and second-order smoothness, the bias of ULA is $O(h)$, resulting in an iteration complexity of $\tilde{O}(\frac{1}{\varepsilon})$ (ignoring dimension dependence for now). However, basic considerations suggest the correct bias is $\Theta(h^2)$, since ULA and PLA are first-order discretizations, which would yield an iteration complexity of $\tilde{O}(\frac{1}{\sqrt{\varepsilon}})$. In this paper we show this is indeed the case for PLA under LSI and third-order smoothness.
Theorem 1. Assume $\nu = e^{-f}$ satisfies $\alpha$-LSI and $f$ is $(L, \tilde{L})$-smooth. For any sufficiently small step size $h$, the iterates $\rho_k$ of PLA satisfy:
This implies the following iteration complexity for PLA under LSI: to reach $H_\nu(\rho_k) \le \varepsilon$, it suffices to run PLA with $\rho_0 = \mathcal{N}(x^*, \frac{1}{L} I)$ and an appropriately small step size $h$ for
iterations. Here $x^*$ is a stationary point of $f$ ($\nabla f(x^*) = 0$), which we can find via gradient descent.
This improves on the result  for ULA, in which we show that under $\alpha$-LSI and $L$-smoothness, ULA has iteration complexity $\tilde{O}(\frac{1}{\varepsilon})$. However, as noted above, it is likely the analysis in  is not tight, since it only implies a bias of $O(h)$ for ULA, rather than the $O(h^2)$ bias for PLA in Theorem 1. We prove Theorem 1 by comparing a continuous-time interpolation of PLA with the Langevin dynamics to establish a recurrence for the decrease of KL divergence in each iteration; this technique is similar to  and earlier papers [14, 12]. Our improvement comes from the fact that we can show a tight error bound for the interpolation of PLA by comparing it with the weighted Langevin dynamics; see Section 2.5. Furthermore, we illustrate in the Gaussian case that the bias is indeed $\Theta(h^2)$.
| Algorithm | Assumptions | Iterations to $\varepsilon$ in KL | Iterations to $\varepsilon$ in $W_2$ |
|---|---|---|---|
| ULA  | $\alpha$-LSI, $L$-smoothness | | |
| Underdamped Langevin dynamics  | $\alpha$-LSI, $L$-smoothness | | |
| Randomized midpoint for ULD  | $\alpha$-SLC, $L$-smoothness | – | |
| PLA (this paper) | $\alpha$-LSI, $(L, \tilde{L})$-smoothness | | |
We note a recent work  improves the analysis of ULA under LSI and third-order smoothness with an additional dissipativity assumption, and shows an iteration complexity for ULA which is similar to our result (2) for PLA.
Currently the fastest (in terms of error in KL divergence) algorithm for sampling under LSI is a discretization of the underdamped Langevin dynamics , which has iteration complexity  under $\alpha$-LSI and $L$-smoothness to reach KL error $\varepsilon$. We see from (2) that PLA has the same dependence on $\varepsilon$ but a better dependence on the LSI constant $\alpha$.
We recall LSI implies Talagrand's inequality, which bounds the Wasserstein distance by the KL divergence: $W_2(\rho, \nu)^2 \le \frac{2}{\alpha} H_\nu(\rho)$. Then Theorem 1 also implies that the iteration complexity for PLA to reach $W_2(\rho_k, \nu) \le \varepsilon$ under $\alpha$-LSI and $(L, \tilde{L})$-smoothness is obtained from (2) by replacing $\varepsilon$ with $\frac{\alpha \varepsilon^2}{2}$.
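As a sanity check (ours, not from the paper), Talagrand's inequality $W_2(\rho, \nu)^2 \le \frac{2}{\alpha} H_\nu(\rho)$ can be verified in closed form for one-dimensional centered Gaussians, where $\nu = \mathcal{N}(0, t)$ satisfies LSI with $\alpha = 1/t$ by Bakry-Émery:

```python
import math

# nu = N(0, t) satisfies LSI with alpha = 1/t (Bakry-Emery: f is (1/t)-strongly convex).
# For rho = N(0, s):  W2(rho, nu)^2 = (sqrt(s) - sqrt(t))^2
#                     H_nu(rho)     = 0.5 * (s/t - 1 - log(s/t))
def w2_sq(s, t):
    return (math.sqrt(s) - math.sqrt(t)) ** 2

def kl(s, t):
    return 0.5 * (s / t - 1 - math.log(s / t))

t = 1.0             # variance of nu, so the LSI constant is alpha = 1/t
alpha = 1.0 / t
for s in [0.1, 0.5, 0.9, 1.1, 2.0, 10.0]:    # variances of rho
    # Talagrand: W2^2 <= (2/alpha) * KL
    assert w2_sq(s, t) <= (2 / alpha) * kl(s, t) + 1e-12
```

Both sides vanish at $s = t$, and the inequality holds with room to spare away from it.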
A previous analysis  shows an iteration complexity of  for PLA to reach $W_2(\rho_k, \nu) \le \varepsilon$ under $\alpha$-SLC and $L$-smoothness. Thus, our result shows a better iteration complexity for PLA under SLC and third-order smoothness.
We note that for sampling under SLC, faster rates are achievable via more advanced algorithms, whose analyses are made possible by coupling techniques. Currently the fastest algorithm is a randomized midpoint discretization of the underdamped Langevin dynamics , which has iteration complexity  to reach  under -SLC and -smoothness, and it can be made faster by parallelizing. See also  for a higher-order Langevin dynamics that achieves a similar iteration complexity under an additional separability assumption. Previously the fastest results were by Hamiltonian Monte Carlo [27, 11, 35, 36], various discretizations of the overdamped or underdamped Langevin dynamics [17, 18, 12, 14, 16], or using higher-order integrators such as stochastic Runge-Kutta . Thus, there is a gap between the known complexity for sampling under LSI and under SLC. It is interesting to understand whether these more advanced algorithms can be analyzed under LSI, when coupling techniques no longer work.
We also note that for the case when $\nu$ is logconcave, there are other methods that can be used, including the ball walk and hit-and-run [25, 30, 31, 29], which have iteration complexity with logarithmic dependence on the error in TV distance or $\chi^2$-divergence, and no dependence on the condition number.
Our second main result is a convergence guarantee for the Rényi divergence of order $q > 1$ along PLA when the biased limit $\rho_h$ satisfies either LSI or the Poincaré inequality. Rényi divergence of order $q$ is a stronger generalization of KL divergence (which is the case $q = 1$), with fundamental applications in statistics, physics, and computer science [46, 51, 6, 22]. Under LSI, Rényi divergence converges exponentially fast along the Langevin dynamics. Under the Poincaré inequality, Rényi divergence still converges along the Langevin dynamics, but now at a rate which is initially linear, then exponential. We show that when the biased limit $\rho_h$ of PLA satisfies either LSI or the Poincaré inequality, the Rényi divergence with respect to $\rho_h$ converges along PLA at the same speed as along the Langevin dynamics. We can combine this with a decomposition property of Rényi divergence to obtain an iteration complexity for PLA in Rényi divergence which is controlled by the size of the bias; see Theorem 1 in Section 3.2 and Theorem 2 in Section 3.3. Furthermore, the iteration complexity under the Poincaré inequality is a certain factor larger than the complexity under LSI. These results are similar to the result  for ULA. However, we illustrate with an example in the Gaussian case that the bias in Rényi divergence of PLA is smaller (and always finite) than the bias of ULA (which can be infinite).
The rest of this paper is organized as follows. In Section 2 we state the algorithm and main result on the convergence of KL divergence under LSI. In Section 3 we state the second result on the convergence of Rényi divergence under LSI or Poincaré inequality. In Section 4 we review the Langevin dynamics. In Sections 5 and 6 we provide proofs and details. We conclude with a discussion in Section 7.
2 Algorithm and main result
Let $\nu = e^{-f}$ be the target probability distribution on $\mathbb{R}^n$, where $f \colon \mathbb{R}^n \to \mathbb{R}$. We assume $f$ is differentiable.
2.1 Proximal Langevin Algorithm
We study the Proximal Langevin Algorithm (PLA), which starts from any random variable $x_0 \sim \rho_0$ and maintains the iterates
$$x_{k+1} = x_k - h \nabla f(x_{k+1}) + \sqrt{2h}\, z_k, \qquad (3)$$
where $h > 0$ is the step size and $z_k \sim \mathcal{N}(0, I)$ is an independent Gaussian random variable. The above is an implicit update, and we assume we can solve for $x_{k+1}$, for example via the proximal step:
$$x_{k+1} = \arg\min_{x \in \mathbb{R}^n} \Big\{ f(x) + \frac{1}{2h} \big\| x - x_k - \sqrt{2h}\, z_k \big\|^2 \Big\}. \qquad (4)$$
Indeed, the solution of (4) satisfies $\nabla f(x_{k+1}) + \frac{1}{h} \big( x_{k+1} - x_k - \sqrt{2h}\, z_k \big) = 0$, which is (3). Note that the formulation (4) also makes sense when $f$ is not differentiable. If $f$ is $L$-smooth and $h < \frac{1}{L}$, then (4) is a strongly convex optimization problem with a unique minimizer $x_{k+1}$, so PLA is well-defined. If $f$ is convex, then the restriction $h < \frac{1}{L}$ can be removed.
In distribution, each PLA iteration first adds Gaussian noise and then applies the proximal map: if $\rho_k$ is the law of $x_k$, then $\rho_{k+1} = (\mathrm{prox}_{hf})_{\#} \big( \rho_k * \mathcal{N}(0, 2hI) \big)$, where $\mathrm{prox}_{hf}(y) = \arg\min_x \{ f(x) + \frac{1}{2h} \| x - y \|^2 \}$ and $*$ denotes convolution. Under mild conditions this iteration has a stationary distribution, which we denote by $\rho_h$; thus PLA converges to the biased limit $\rho_h$ rather than to $\nu$.
PLA (in a more general form with a Metropolis filter) was first introduced in  from a smoothing perspective. PLA was also studied in  from the optimization perspective of sampling under a logconcavity assumption.¹ PLA is the implicit variant of another popular algorithm, the Unadjusted Langevin Algorithm (ULA), which is the explicit iteration $x_{k+1} = x_k - h \nabla f(x_k) + \sqrt{2h}\, z_k$. PLA and ULA are discretizations of the Langevin dynamics in continuous time (see Section 4 for a review), where PLA applies the backward (implicit) method to discretize the gradient, while ULA applies the forward (explicit) method. This makes PLA more expensive to implement in practice, but it offers better behavior and analysis than ULA. For example, in the Gaussian case $\nu = \mathcal{N}(0, \Sigma)$, recall ULA converges (to a biased limit) only for $h < \frac{2}{L}$, where $L = \| \Sigma^{-1} \|$; see also . In this case PLA always converges, and its bias is smaller than the bias of ULA.

¹Both [45, 3] apply the proximal step before the Gaussian step in each iteration, while PLA applies them in the opposite order. Over $k$ iterations, both [45, 3] and PLA only differ by a single proximal or Gaussian step.
Example 2 (PLA vs. ULA for Gaussian).
For $\nu = \mathcal{N}(0, \Sigma)$, the limit of PLA is $\rho_h = \mathcal{N}\big(0, \Sigma (I + \frac{h}{2} \Sigma^{-1})^{-1}\big)$.
For $h < \frac{2}{\|\Sigma^{-1}\|}$, the limit of ULA is $\tilde{\rho}_h = \mathcal{N}\big(0, \Sigma (I - \frac{h}{2} \Sigma^{-1})^{-1}\big)$.
Let $0 < \lambda_1 \le \dots \le \lambda_n$ denote the eigenvalues of $\Sigma^{-1}$. The bias of PLA in relative entropy is
$$H_\nu(\rho_h) = \frac{1}{2} \sum_{i=1}^n \Big( \frac{1}{1 + \frac{h \lambda_i}{2}} - 1 + \log\Big( 1 + \frac{h \lambda_i}{2} \Big) \Big),$$
while the bias of ULA is
$$H_\nu(\tilde{\rho}_h) = \frac{1}{2} \sum_{i=1}^n \Big( \frac{1}{1 - \frac{h \lambda_i}{2}} - 1 + \log\Big( 1 - \frac{h \lambda_i}{2} \Big) \Big).$$
Note that we always have $H_\nu(\rho_h) \le H_\nu(\tilde{\rho}_h)$.
Furthermore, $H_\nu(\rho_h) = \Theta(h^2)$ and $H_\nu(\tilde{\rho}_h) = \Theta(h^2)$ as $h \to 0$.
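The one-dimensional case of Example 2 can be checked numerically by iterating the exact variance maps of the two chains for $\nu = \mathcal{N}(0, \sigma^2)$. This is an illustrative check of ours; the particular values of $\sigma^2$ and $h$ are arbitrary, and the stationary variances $\sigma^2 / (1 \pm \frac{h}{2\sigma^2})$ are the 1-D specializations of the Gaussian limits above.

```python
import math

sigma2, h = 1.0, 0.2
x = h / (2 * sigma2)        # x = h * lambda / 2 with lambda = 1/sigma2

# Exact one-step variance maps of the two chains for nu = N(0, sigma2):
#   PLA (implicit step): v <- (v + 2h) / (1 + h/sigma2)^2
#   ULA (explicit step): v <- (1 - h/sigma2)^2 * v + 2h
v_pla = v_ula = sigma2
for _ in range(1000):
    v_pla = (v_pla + 2 * h) / (1 + h / sigma2) ** 2
    v_ula = (1 - h / sigma2) ** 2 * v_ula + 2 * h

# Stationary variances match sigma2/(1 + x) and sigma2/(1 - x), respectively.
assert abs(v_pla - sigma2 / (1 + x)) < 1e-12
assert abs(v_ula - sigma2 / (1 - x)) < 1e-12

def kl_to_nu(v):
    """H_nu(N(0, v)) for nu = N(0, sigma2)."""
    r = v / sigma2
    return 0.5 * (r - 1 - math.log(r))

# The PLA bias is smaller than the ULA bias. For h >= 2*sigma2 the ULA
# variance map no longer contracts, while PLA contracts for every h > 0.
assert kl_to_nu(v_pla) < kl_to_nu(v_ula)
```

The implicit step shrinks the variance below $\sigma^2$ while the explicit step inflates it, and the inflated side pays a slightly larger KL penalty.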
2.2 Log-Sobolev inequality and smoothness assumption
Before stating our results, we recall some definitions for the analysis.
2.2.1 Log-Sobolev inequality
Let $\rho, \nu$ be probability distributions on $\mathbb{R}^n$ with smooth densities and finite second moments. We recall the relative entropy (or Kullback-Leibler (KL) divergence) of $\rho$ with respect to $\nu$ is
$$H_\nu(\rho) = \mathbb{E}_\rho\Big[ \log \frac{\rho}{\nu} \Big] = \int_{\mathbb{R}^n} \rho(x) \log \frac{\rho(x)}{\nu(x)}\, dx.$$
Relative entropy has the property that $H_\nu(\rho) \ge 0$, and $H_\nu(\rho) = 0$ if and only if $\rho = \nu$. The relative Fisher information of $\rho$ with respect to $\nu$ is
$$J_\nu(\rho) = \mathbb{E}_\rho\Big[ \Big\| \nabla \log \frac{\rho}{\nu} \Big\|^2 \Big].$$
The geometric meaning of relative Fisher information is as the squared gradient of relative entropy in the space of probability measures with the Wasserstein metric.
We recall $\nu$ satisfies the log-Sobolev inequality (LSI) with constant $\alpha > 0$ if for all $\rho$,
$$H_\nu(\rho) \le \frac{1}{2\alpha} J_\nu(\rho).$$
LSI has the geometric interpretation as the gradient domination condition for relative entropy in the Wasserstein space, which ensures the Langevin dynamics converges exponentially fast in continuous time; see Section 4 for a review. We recall $\nu = e^{-f}$ is strongly logconcave (SLC) with constant $\alpha > 0$ if $f$ is $\alpha$-strongly convex: $\nabla^2 f(x) \succeq \alpha I$ for all $x \in \mathbb{R}^n$. SLC is a strong condition that allows the analysis of many sampling algorithms. However, SLC is brittle, as it is not preserved under perturbation or arbitrary mapping. A classic result by Bakry and Émery  shows that SLC implies LSI with the same constant $\alpha$. Furthermore, LSI is more stable, as it is preserved under bounded perturbation and Lipschitz mapping. LSI also has an isoperimetric content as a bound on the log-Cheeger constant; see for example . Therefore, LSI provides a natural condition to obtain fast sampling in discrete time.
2.2.2 Smoothness assumptions
We say $f$ is $(L, \tilde{L})$-smooth if $f$ is three-times differentiable and satisfies the following two conditions:
The gradient is $L$-Lipschitz: for all $x, y \in \mathbb{R}^n$,
$$\| \nabla f(x) - \nabla f(y) \| \le L \| x - y \|,$$
or equivalently, $-L I \preceq \nabla^2 f(x) \preceq L I$, which means $\| \nabla^2 f(x) \| \le L$.
The Hessian is $\tilde{L}$-Lipschitz in the operator norm: for all $x, y \in \mathbb{R}^n$,
$$\| \nabla^2 f(x) - \nabla^2 f(y) \| \le \tilde{L} \| x - y \|.$$
This implies $\| \nabla^3 f(x)[v] \| \le \tilde{L} \| v \|$ for all $v \in \mathbb{R}^n$, where $\nabla^3 f(x)[v]$ is the matrix with $(i,j)$ entry $\sum_{k=1}^n \frac{\partial^3 f(x)}{\partial x_i \partial x_j \partial x_k} v_k$.
2.3 Main result: Convergence of relative entropy along PLA under LSI
Our first main result is the following convergence guarantee of relative entropy along PLA when the target distribution satisfies LSI and a smoothness assumption. We note that smoothness is only used in the analysis and not for the definition of PLA.
As $k \to \infty$, this implies the bias of PLA is $O(h^2)$. This bias is of the right order, since for Gaussian $\nu$ we have $H_\nu(\rho_h) = \Theta(h^2)$ by Example 2. This is smaller than the $O(h)$ bias for ULA from , and thus yields a faster iteration complexity for PLA.
Concretely, given $\varepsilon > 0$, to reach error $H_\nu(\rho_k) \le \varepsilon$, it suffices to run PLA such that the two terms in (1) are each less than $\frac{\varepsilon}{2}$: so we want to run PLA with a correspondingly small step size $h$ for a number of iterations of order $\frac{1}{\alpha h} \log \frac{2 H_\nu(\rho_0)}{\varepsilon}$. If we start with a Gaussian $\rho_0 = \mathcal{N}(x^*, \frac{1}{L} I)$ where $x^*$ is a stationary point of $f$ ($\nabla f(x^*) = 0$, which we can find via gradient descent), then $H_\nu(\rho_0) \le \frac{n}{2} \log \frac{L}{\alpha}$; see [53, Lemma 1]. Therefore, Theorem 1 implies the following iteration complexity for PLA.
Assume $\nu$ satisfies $\alpha$-LSI and $f$ is $(L, \tilde{L})$-smooth. To reach $H_\nu(\rho_k) \le \varepsilon$, it suffices to run PLA with $\rho_0 = \mathcal{N}(x^*, \frac{1}{L} I)$ and a sufficiently small step size $h$ for
This matches the best known rate (in terms of $\varepsilon$) for sampling under LSI, achieved by the underdamped Langevin algorithm , but PLA has a better dependence on the LSI constant $\alpha$.
2.4 Analysis of relative entropy in one step of PLA
The proof of Theorem 1 relies on the following result, which says relative entropy decreases by a constant factor, with an additional $O(h^3)$ error term, in each step of PLA; this leads to the $O(h^2)$ bias for PLA as stated in Theorem 1. In contrast, recall the analogous result for ULA [53, Lemma 3] has an $O(h^2)$ error in each iteration, which leads to the $O(h)$ bias for ULA.
Assume $\nu$ satisfies $\alpha$-LSI, $f$ is $(L, \tilde{L})$-smooth, and $h$ is sufficiently small. In each step of PLA, we have
The output of PLA (3) is the value at time $t = h$ of the following stochastic process, which interpolates one step of the algorithm:
where $x_0 = x_k$ and $W_t$ is a standard Brownian motion. Recall the weighted Langevin dynamics with covariance $G$ converges to $\nu$ if the drift is $-G \nabla f + \nabla \cdot G$ (see Section 4 for a review). The difference between (12) and this ideal drift is the discretization error. In Lemma 6 we show bounds on this error of the correct order. Using these bounds and the LSI assumption, we can show the time derivative of relative entropy along (12) is bounded by
2.5 Proof of Theorem 1
3 Convergence in Rényi divergence
Before stating our next result, we review the definition and some properties of Rényi divergence.
3.1 Rényi divergence
The Rényi divergence of order $q > 0$, $q \neq 1$, of a probability distribution $\rho$ with respect to $\nu$ is
$$R_q(\rho \,\|\, \nu) = \frac{1}{q-1} \log \mathbb{E}_\nu\Big[ \Big( \frac{\rho}{\nu} \Big)^q \Big] = \frac{1}{q-1} \log \int_{\mathbb{R}^n} \frac{\rho(x)^q}{\nu(x)^{q-1}}\, dx.$$
As $q \to 1$, Rényi divergence recovers the relative entropy (KL divergence): $\lim_{q \to 1} R_q(\rho \,\|\, \nu) = H_\nu(\rho)$. Rényi divergence satisfies $R_q(\rho \,\|\, \nu) \ge 0$ for all $\rho$, and $R_q(\rho \,\|\, \nu) = 0$ if and only if $\rho = \nu$. Furthermore, $q \mapsto R_q(\rho \,\|\, \nu)$ is increasing. Therefore, Rényi divergence of order $q > 1$ is a family of stronger generalizations of KL divergence. Rényi divergence has fundamental applications in statistics, physics, and computer science [46, 22, 1, 39, 13, 40, 51, 6]. We recall Rényi divergence converges exponentially fast along the Langevin dynamics under LSI; see Section 4.
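These two properties can be checked numerically for one-dimensional Gaussians (an illustrative check of ours, using only the standard definition $R_q(\rho \,\|\, \nu) = \frac{1}{q-1} \log \mathbb{E}_\nu[(\rho/\nu)^q]$ and stdlib quadrature): the estimate approaches the closed-form KL divergence as $q \to 1$, and is increasing in $q$.

```python
import math

def renyi_gauss_numeric(q, s, t, lo=-30.0, hi=30.0, n=200000):
    """Quadrature estimate of R_q(N(0,s) || N(0,t)) = (1/(q-1)) log E_nu[(rho/nu)^q].

    The integrand rho^q * nu^(1-q) is evaluated in log space to avoid overflow;
    it is integrable here because q/s + (1-q)/t > 0 for the values used below.
    """
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        u = lo + (i + 0.5) * dx
        log_rho = -u * u / (2 * s) - 0.5 * math.log(2 * math.pi * s)
        log_nu = -u * u / (2 * t) - 0.5 * math.log(2 * math.pi * t)
        total += math.exp(q * log_rho + (1 - q) * log_nu) * dx
    return math.log(total) / (q - 1)

def kl_gauss(s, t):
    """Closed-form H_nu(rho) for rho = N(0, s), nu = N(0, t)."""
    return 0.5 * (s / t - 1 - math.log(s / t))

# Near q = 1 the Renyi divergence approaches the KL divergence,
# and it increases with q.
r_near_1 = renyi_gauss_numeric(1.01, 1.5, 1.0)
r_2 = renyi_gauss_numeric(2.0, 1.5, 1.0)
```

Note the monotonicity: any $R_q$ with $q > 1$ upper-bounds the KL divergence, which is why Rényi guarantees are stronger than KL guarantees.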
Convergence guarantees in Rényi divergence for sampling in discrete time were first studied in , who show that Rényi divergence converges along ULA to its biased limit at the same rate as along the Langevin dynamics when the biased limit itself satisfies either LSI or the Poincaré inequality. We will show a similar convergence guarantee for PLA in Section 3.2. Thus, the iteration complexity is dominated by the bias $R_q(\rho_h \,\|\, \nu)$. We recall the bias of ULA can be infinite for large enough $q$, even in the Gaussian case [53, Example 3]. On the other hand, the bias of PLA in the Gaussian case is always finite and smaller than the bias of ULA, as we show in the following example.
Let $\nu = \mathcal{N}(0, \Sigma)$, and let $0 < \lambda_1 \le \dots \le \lambda_n$ denote the eigenvalues of $\Sigma^{-1}$. For any $h > 0$, the limit of PLA is $\rho_h = \mathcal{N}\big(0, \Sigma (I + \frac{h}{2} \Sigma^{-1})^{-1}\big)$, and the bias is finite for all $q > 1$:
$$R_q(\rho_h \,\|\, \nu) = \frac{1}{2(q-1)} \sum_{i=1}^n \log \frac{\big( 1 + \frac{h \lambda_i}{2} \big)^q}{1 + \frac{q h \lambda_i}{2}}.$$
On the other hand, for $h < \frac{2}{\lambda_n}$, the limit of ULA is $\tilde{\rho}_h = \mathcal{N}\big(0, \Sigma (I - \frac{h}{2} \Sigma^{-1})^{-1}\big)$, and the bias is:
$$R_q(\tilde{\rho}_h \,\|\, \nu) = \frac{1}{2(q-1)} \sum_{i=1}^n \log \frac{\big( 1 - \frac{h \lambda_i}{2} \big)^q}{1 - \frac{q h \lambda_i}{2}} \qquad \text{for } 1 < q < \frac{2}{h \lambda_n}.$$
For $q \ge \frac{2}{h \lambda_n}$, we have $R_q(\tilde{\rho}_h \,\|\, \nu) = \infty$.
3.2 Convergence of Rényi divergence along PLA under LSI
Assume the biased limit $\rho_h$ of PLA satisfies LSI with constant $\alpha_h > 0$, $f$ is $(L, \tilde{L})$-smooth, and $h$ is sufficiently small. Let $q \ge 1$. Then along PLA, for all $k \ge 0$,
This result shows the iteration complexity for Rényi divergence along PLA depends on the bias. For $q > 1$, let $\varepsilon_h = R_q(\rho_h \,\|\, \nu)$ denote the bias, and assume $h$ is small enough so $\varepsilon_h \le \frac{\varepsilon}{2}$. Theorem 1 states that to achieve $R_q(\rho_k \,\|\, \nu) \le \varepsilon$, it suffices to run PLA with step size $h$ for
iterations. If we choose $\rho_0$ to be a proximal step from a Gaussian, then the initial Rényi divergence scales with the dimension $n$, as we show below. Here $x^*$ is a stationary point of $f$ ($\nabla f(x^*) = 0$).
Assume $f$ is $(L, \tilde{L})$-smooth and $h$ is sufficiently small. Let $\rho_0$ be a proximal step from the Gaussian $\mathcal{N}(x^*, \frac{1}{L} I)$ (concretely, $x_0$ solves $x_0 = y_0 - h \nabla f(x_0)$ where $y_0 \sim \mathcal{N}(x^*, \frac{1}{L} I)$). Then for all $q \ge 1$, the initial divergence $R_q(\rho_0 \,\|\, \nu)$ is bounded, scaling with the dimension $n$.
3.3 Convergence of Rényi divergence along PLA under Poincaré
We recall $\nu$ satisfies the Poincaré inequality with constant $\beta > 0$ if for all smooth functions $g$,
$$\mathrm{Var}_\nu(g) \le \frac{1}{\beta}\, \mathbb{E}_\nu\big[ \| \nabla g \|^2 \big],$$
where $\mathrm{Var}_\nu(g) = \mathbb{E}_\nu[g^2] - \mathbb{E}_\nu[g]^2$ is the variance of $g$ under $\nu$. The Poincaré inequality is an isoperimetry condition which is weaker than LSI: LSI implies the Poincaré inequality with the same constant, and in fact the Poincaré inequality is a linearization of LSI [48, 54]. Like LSI, the Poincaré inequality is preserved under bounded perturbation and Lipschitz mapping. However, the Poincaré inequality is more general than LSI; for example, distributions satisfying LSI have sub-Gaussian tails, while distributions satisfying the Poincaré inequality can have sub-exponential tails. Whereas LSI is equivalent to a bound on the log-Cheeger constant, the Poincaré inequality is equivalent to a bound on the Cheeger constant . We recall that when $\nu$ satisfies the Poincaré inequality, Rényi divergence converges along the Langevin dynamics at a rate which is initially linear, then exponential; see Section 4 for a review.
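For intuition, here is an illustrative check of ours: $\nu = \mathcal{N}(0, \sigma^2)$ satisfies the Poincaré inequality with constant $\beta = 1/\sigma^2$, which can be verified on polynomial test functions using exact Gaussian moments.

```python
# Poincare inequality for nu = N(0, sigma2) with constant beta = 1/sigma2:
#     Var_nu(g) <= (1/beta) * E_nu[g'(x)^2] = sigma2 * E_nu[g'(x)^2].
# Gaussian moments used: E[x] = 0, E[x^2] = sigma2, E[x^4] = 3*sigma2^2.
sigma2 = 2.0

# g(x) = x:  Var_nu(g) = sigma2 and E_nu[g'^2] = 1, so the inequality is
# tight (linear functions are extremal, fixing the optimal constant).
lhs_linear, rhs_linear = sigma2, sigma2 * 1.0
assert lhs_linear <= rhs_linear

# g(x) = x^2:  Var_nu(g) = E[x^4] - E[x^2]^2 = 2*sigma2^2,  E_nu[(2x)^2] = 4*sigma2.
lhs_quad = 3 * sigma2 ** 2 - sigma2 ** 2
rhs_quad = sigma2 * (4 * sigma2)
assert lhs_quad <= rhs_quad
```

Linear test functions attain equality, which is why the Gaussian Poincaré constant cannot be improved beyond $1/\sigma^2$.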
Assume the biased limit $\rho_h$ of PLA satisfies the Poincaré inequality with constant $\beta_h > 0$, $f$ is $(L, \tilde{L})$-smooth, and $h$ is sufficiently small. Let $q \ge 1$. Then along PLA, for all $k \ge 0$,
This result shows the iteration complexity for Rényi divergence along PLA depends on the bias. For $q > 1$, let $\varepsilon_h = R_q(\rho_h \,\|\, \nu)$ denote the bias, and assume $h$ is small enough so $\varepsilon_h \le \frac{\varepsilon}{2}$. Theorem 2 states that to achieve $R_q(\rho_k \,\|\, \nu) \le \varepsilon$, it suffices to run PLA with step size $h$ for
iterations. Note the dependence on the initial Rényi divergence is now linear, rather than logarithmic as under LSI (16). As in Lemma 2, if we choose $\rho_0$ to be a proximal step from a Gaussian, then the initial Rényi divergence scales with the dimension $n$. Thus, Theorem 2 yields an iteration complexity of
for PLA under the Poincaré inequality to reach $R_q(\rho_k \,\|\, \nu) \le \varepsilon$ with $q > 1$. This is a factor larger than the complexity under LSI (17).
4 A review on Langevin dynamics
For a matrix $A \in \mathbb{R}^{n \times n}$, let $\|A\|$ denote the operator norm and $\|A\|_{HS}$ the Hilbert-Schmidt norm. If $A$ is symmetric with eigenvalues $\lambda_1, \dots, \lambda_n$, then $\|A\| = \max_i |\lambda_i|$ and $\|A\|_{HS} = \big( \sum_{i=1}^n \lambda_i^2 \big)^{1/2}$. Note that $\|A\| \le \|A\|_{HS} \le \sqrt{n}\, \|A\|$.
For a differentiable function $f \colon \mathbb{R}^n \to \mathbb{R}$, let $\nabla f$ denote the gradient vector, $\nabla^2 f$ the Hessian matrix, and $\nabla^3 f$ the tensor of third-order derivatives. Let $\Delta f = \mathrm{tr}(\nabla^2 f) = \sum_{i=1}^n \frac{\partial^2 f}{\partial x_i^2}$ denote the Laplacian of $f$.
Let $\nabla \cdot$ denote the divergence operator that acts on a vector field $v \colon \mathbb{R}^n \to \mathbb{R}^n$ by $\nabla \cdot v = \sum_{i=1}^n \frac{\partial v_i}{\partial x_i}$. The divergence of the gradient is the Laplacian: $\nabla \cdot \nabla f = \Delta f$. We will use the integration by parts formula: $\int (\nabla \cdot v)\, g\, dx = -\int \langle v, \nabla g \rangle\, dx$, where the boundary term is zero if $v$ and $g$ have sufficiently fast decay at infinity.
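The integration by parts formula can be spot-checked numerically in one dimension with rapidly decaying test functions. This is an illustrative check of ours; the choices $v(x) = e^{-x^2}$ and $g(x) = x e^{-x^2}$ are arbitrary smooth functions with fast decay.

```python
import math

def quad(fn, lo=-10.0, hi=10.0, n=200000):
    """Midpoint-rule quadrature; ample accuracy for rapidly decaying integrands."""
    dx = (hi - lo) / n
    return sum(fn(lo + (i + 0.5) * dx) for i in range(n)) * dx

# Test pair: v(x) = exp(-x^2) (the "vector field") and g(x) = x exp(-x^2).
v  = lambda x: math.exp(-x * x)
dv = lambda x: -2 * x * math.exp(-x * x)            # divergence (derivative) of v
g  = lambda x: x * math.exp(-x * x)
dg = lambda x: (1 - 2 * x * x) * math.exp(-x * x)   # gradient (derivative) of g

lhs = quad(lambda x: dv(x) * g(x))      # integral of (div v) * g
rhs = -quad(lambda x: v(x) * dg(x))     # minus integral of <v, grad g>
assert abs(lhs - rhs) < 1e-6            # boundary terms vanish by fast decay
```

Both sides equal $-\frac{1}{2}\sqrt{\pi/2}$ here, and the agreement fails if the test functions do not decay at infinity, which is exactly the role of the boundary-term condition.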
For a matrix-valued function $G \colon \mathbb{R}^n \to \mathbb{R}^{n \times n}$, let $\nabla \cdot G$ denote the vector whose $i$-th component is $(\nabla \cdot G)_i = \sum_{j=1}^n \frac{\partial G_{ij}}{\partial x_j}$; equivalently, $(\nabla \cdot G)_i = \nabla \cdot G_i$, where $G_i$ is the $i$-th row of $G$.
4.1 Weighted Langevin dynamics
Let $G \colon \mathbb{R}^n \to \mathbb{R}^{n \times n}$ be a differentiable matrix-valued function where each $G(x)$ is positive definite. Recall the weighted Langevin dynamics for $\nu = e^{-f}$ with covariance $G$ is the SDE
$$dX_t = \big( -G(X_t)\, \nabla f(X_t) + \nabla \cdot G(X_t) \big)\, dt + \sqrt{2}\, G(X_t)^{1/2}\, dW_t, \qquad (21)$$
where $W_t$ is the standard Brownian motion on $\mathbb{R}^n$. The drift term above is chosen to ensure $\nu$ is a stationary measure for the weighted Langevin dynamics (21). This is apparent from the following Fokker-Planck equation; see also [33, 28].
If $X_t$ evolves following the weighted Langevin dynamics (21), then its density $\rho_t$ evolves following
$$\frac{\partial \rho_t}{\partial t} = \nabla \cdot \Big( \rho_t\, G\, \nabla \log \frac{\rho_t}{\nu} \Big),$$
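The stationarity of $\nu$ for every choice of weight can be observed in simulation. Below is an illustrative sketch of ours: an Euler-Maruyama discretization in one dimension with a constant scalar covariance $G = g$ and the potential $f(x) = x^2/2$, so the stationary law is $\mathcal{N}(0, 1)$ regardless of $g$ (the weight only rescales time), up to $O(\mathrm{d}t)$ discretization bias and Monte Carlo error.

```python
import math, random

def weighted_langevin_variance(g, steps=500000, dt=0.01, seed=1):
    """Euler-Maruyama for dX = -g f'(X) dt + sqrt(2g) dW with f(x) = x^2 / 2
    (constant scalar covariance G = g). Returns the empirical variance of the
    trajectory after discarding a burn-in; it should be close to 1 for any g > 0."""
    random.seed(seed)
    x, xs = 0.0, []
    for k in range(steps):
        x += -g * x * dt + math.sqrt(2 * g * dt) * random.gauss(0.0, 1.0)
        if k >= steps // 10:      # discard burn-in
            xs.append(x)
    mean = sum(xs) / len(xs)
    return sum((u - mean) ** 2 for u in xs) / len(xs)

# Different weights g give (approximately) the same stationary distribution.
v1 = weighted_langevin_variance(1.0)
v2 = weighted_langevin_variance(3.0)
```

A larger weight $g$ makes the chain mix faster per unit time but does not change where it converges, which is the point of the drift correction in (21).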