Suppose we wish to sample from a probability distributionsupported on a convex set where is differentiable. A popular algorithm is the Unadjusted Langevin Algorithm (ULA), which is a basic discretization of the Langevin Dynamics in continuous time:
Langevin Dynamics has an optimization interpretation as the gradient flow for minimizing relative entropy (KL divergence) with respect to using the Wasserstein metric
in the space of probability distributions on, starting from the seminal work of Jordan et al. (1998); see also Wibisono (2018). In continuous time, Langevin Dynamics has convergence guarantees in various distances, including distance, KL divergence, or -divergence, under various conditions, such as strong log-concavity, or functional inequalities such as the Log-Sobolev Inequality (LSI) or Poincaré inequality. In discrete time, ULA is a biased discretization of the Langevin Dynamics, and it has an asymptotic bias which scales with the step size. In particular, by setting a small enough step size, we can obtain a mixing time bound of ULA which has inverse polynomial dependence on the error threshold; see for example Dalalyan (2017); Durmus and Moulines (2017, 2019); Dalalyan and Karagulyan (2019); Vempala and Wibisono (2019); Li et al. (2019); Li et al. (2021).
In many settings, the problem of interest is non-smooth or constrained (e.g., the ball or a general polytope), and the basic Langevin algorithm does not apply. In optimization, this is handled effectively (and elegantly) by interior-point methods, which use a convex “barrier” function to define a non-Euclidean metric. The resulting metric is given locally by the Hessian of the barrier function. This method results is a convergence bound that scales as for linear and convex optimization.
It is natural to wonder whether such a modification of the geometry could be useful for sampling. Early evidence of this is the Dikin walk, which replaces the ball walk (a discrete-time implementation of constrained Brownian motion) by using an ellipsoid at each step, defined by the Hessian of the logarithmic barrier function. This walk was shown to converge in steps for uniformly sampling a polytope with facets in dimension (Kannan and Narayanan, 2012). It was recently refined using a weighted barrier function to improve the convergence time to (Laddha et al., 2020).
A related approach that also originated in optimization is mirror descent, which uses a mirror map (the gradient of the barrier function) to change the geometry favorably, and in the context of sampling, can be seen as a generalization of the Langevin algorithm by changing the metric. For constrained sampling, the Mirror Langevin Dynamics was introduced by Zhang et al. (2020) using a mirror map to constrain the domain; see also Hsieh et al. (2018) for a related earlier approach. Mirror Langevin Dynamics is the Langevin dynamics for sampling from using the Hessian metric generated by the mirror map. In continuous time, Mirror Langevin Dynamics has nice convergence guarantees under an analogous notion of mirror Poincare inequality relative to the Hessian metric, as shown by Chewi et al. (2020); see also Appendix A for a review. In discrete time, the Mirror Langevin Algorithm (MLA) is a simple discretization of MLD proposed by Zhang et al. (2020), who showed a biased convergence analysis but with a non-vanishing bias (does not go to with step size, but remains a constant). This raised the question of whether we need a better analysis of MLA or a better discretization of MLD to achieve a vanishing bias. This also led to an alternative discretization proposed by Ahn and Chewi (2020) which achieves a vanishing bias, but requires an exact simulation of the Brownian motion with changing covariance.
In this paper, we study the Mirror Langevin Algorithm and show that it indeed has a vanishing bias. The tool we will use is the mean-square analysis framework, proposed by Li et al. (2019) and then refined by Li et al. (2021); the latter version will be used. It will help establish a biased convergence analysis of MLA under relative smoothness, strong convexity and modified self-concordance; these are a subset of the conditions assumed by Zhang et al. (2020). We show that the bias of MLA with step size scales as ; this leads to a mixing time bound for MLA (see Theorem 3.1 and Corollary 3.2).
2 Algorithm and Problem Set-Up
2.1 Problem set-up
Suppose we want to sample from a probability distribution supported on a convex set . We assume is absolutely continuous with respect to the Lebesgue measure on and has density for some differentiable .
Let be a twice-differentiable strictly convex function which is of Legendre type (Rockafellar, 1970). This implies , and in particular the gradient map is bijective. We also have for all . Moreover, we require that and as approaches the boundary of . Using the Hessian metric on will prevent the iterates from leaving the domain . We call the mirror map and the dual space.
Let be the dual function of , defined by . Recall , and we have , so for all . Furthermore, for all .
For a vector, let be the -norm. For a matrix , let be the Hilbert-Schmidt norm.
2.2 Mirror Langevin Algorithm
In this paper we study the Mirror Langevin Algorithm:
where is step size and
is an independent Gaussian random variable in. Here is a square-root of , i.e. any matrix satisfying . This algorithm can be seen as a sampling version of the mirror descent algorithm from optimization, since we can write the update of MLA in the following form which resembles mirror descent:
where is the Bregman divergence of . In particular, in the Euclidean case, i.e. when and , MLA recovers the usual Unadjusted Langevin Algorithm (ULA).
MLA can be seen as a coordinate-transformed Euler-Maruyama discretization of the Mirror Langevin Dynamics in continuous time, given by
See Section 2.3 for a reformulation purely in the dual space, and Appendix A for more details on the continuous-time dynamics. Zhang et al. (2020) studied MLA as a simple discretization of the Mirror Langevin Dynamics, and showed that under certain assumptions, the iterates of MLA converge to a Wasserstein ball around the target with some radius which depends on the modified self-concordance parameter of (see Section 3.1 for more detail). In the Euclidean case (when is quadratic) this radius is , so MLA converges to with sufficiently small step size, recovering the typical bound for ULA. However, for general , the radius is positive. Therefore, the result of Zhang et al. (2020) only guarantees MLA enters a ball around the target, but it may not converge to the target even when the step size goes to . They further conjectured the bias is unavoidable. This raises an interesting question of whether the non-vanishing bias of MLA is indeed unavoidable because we are discretizing a diffusion process with changing covariance, or whether there is a better analysis of MLA with vanishing bias. Here we show that indeed MLA has a vanishing bias, by applying the mean-square analysis framework of Li et al. (2019) and Li et al. (2021).
2.3 Mirror Langevin Algorithm in the Dual Space
Let us work in the dual space via the mirror map . Given , we define the dual variable
and its inverse is given by . The target distribution on the dual space is the pushforward of the original target under the mirror map: . If we write the density as , then we have . Moreover, the Hessian metric on corresponds to the Hessian metric on generated by the dual function ; that is, on is the pullback metric of on under the inverse mirror map . Therefore, the metric space is isometric to .
If follows the Mirror Langevin Algorithm (1), then follows the Mirror Langevin Algorithm in the dual space:
MLA in the dual space (3) can be seen as a discretization of the mirror Langevin dynamics to sample from with the Hessian metric on .
Let us define and by
Note here is any square-root of . Then we can write MLA in the dual space as
As , MLA converges to the Mirror Langevin Dynamics, which is a continuous-time stochastic process following the stochastic differential equation:
where is the standard Brownian motion in ; see Section 4.2.1 for more properties.
2.4 Wasserstein distance in dual space
Along MLA in the dual space (3), let denote the distribution of the random variable . We will show a convergence analysis of MLA in the dual space in terms of the Euclidean Wasserstein distance between and on :
Note that this distance does not use the Hessian metric on . In the original space , this gives a modified distance under the mirror map:
That is, if and , then . This is the same modified Wasserstein distance that is used in Zhang et al. (2020). This corresponds to using the squared Hessian metric on , which is isometric to the Euclidean metric on (rather than the Hessian metric on , which is isometric to the Hessian metric on , and which is used in the continuous-time analysis in Chewi et al. (2020)).
3 Main Result: Mixing Time Bound for MLA
We present our main result on the mixing time bound of MLA. We need the following assumptions.
satisfies the modified self-concordance property with parameter , which means:
Equivalently, is -Lipschitz in the Hilbert-Schmidt norm:
is -smooth with respect to for some , which means:
Equivalently, is -Lipschitz:
is -strongly convex with respect to for some , which means:
Equivalently, is -monotone:
These are a subset of the assumptions in Zhang et al. (2020). In particular, we do not assume a bound on the commutator of and . Our main result is the following.
See Section 4.3 for the proof of Theorem 3.1 and explicit forms of the constant and maximum step size . This result shows MLA has a bias that is vanishing with step size, and thus we can reach an arbitrary accuracy by using a small enough step size. In particular, this improves on the analysis in Zhang et al. (2020), which has a non-vanishing bias and under stronger assumptions. By choosing a small step size, we obtain the following mixing time bound for MLA.
For any (small) error threshold , to reach , it suffices to run MLA in the dual space (3) with step size for iterations where
3.1 Discussion of result
Theorem 3.1 shows that MLA has a biased convergence guarantee where the bias scales as where is dimension and is step size (assuming are independent of for now). This leads to a mixing time bound of for MLA.
Let us compare MLA with ULA (i.e., MLA in the Euclidean case with ). Recall for ULA, the mean-square analysis by Li et al. (2021) yields a biased convergence guarantee where the bias scales as under an additional 3rd-order regularity condition on . This leads to a mixing time bound of for ULA. We see the bias of MLA has a worse dependence on than the bias of ULA. This is because the continuous-time Mirror Langevin Dynamics (26) of MLA has a changing covariance, while the usual continuous-time Langevin Dynamics of ULA has a constant covariance; therefore, MLA incurs an additional stochastic error from the Brownian motion part, which is not incurred by ULA. Formally, this is reflected in the orders of error of the two algorithms: We show below that MLA has local weak and strong errors of orders at least and (note the local weak order of MLA is actually , because it is the Euler-Maruyama discretization of an SDE; the multiplicative noise causes the strong error to lose half an order, but not the weak error (see e.g., (Milstein and Tretyakov, 2013, page 14)); however, we will see that as long as , the order of the final sampling error is determined by but not , and even though our bound is not tight in order, its constants can be made very explicit and hence helpful to later analysis). On the other hand, it is well known that ULA has local weak and strong error of orders and because it is the Euler-Maruyama discretization of an SDE with additive noise (see Milstein and Tretyakov (2013) for the general theory and Li et al. (2021) for details of worked out constants). It would be interesting to understand whether we can improve the local errors and the bias of MLA, perhaps using more sophisticated discretization of MLD to improve the stochastic error.
Our result improves on the analysis of Zhang et al. (2020), who assume stronger assumptions (our assumptions (A1), (A2), (A3)
, along with two assumptions on the moment ofand a bound on the commutator of and ), and prove a biased convergence analysis where the bias scales as , where does not depend on . Note in the Euclidean case (when ), the modified self-concordance parameter is , and thus ; but for general , the asymptotic radius is positive: , so the result of Zhang et al. (2020) does not guarantee convergence to the target. With our mean-square analysis, we have shown that in fact there is no dependence on this radius , and the bias indeed scales as .
We note our result uses the modified self-concordance property, as also used in Zhang et al. (2020). In one-dimension (), modified self-concordance is equivalent to the classical self-concordance property: Both are equivalent to the condition that is a Lipschitz function. However, in higher dimension, they are different. In particular, modified self-concordance is not an affine-invariant property (in contrast to the classical self-concordance), and the parameter can be arbitrarily large; see example in Appendix D. This is problematic since our convergence bound only holds when is less than (the strong convexity parameter). It would be desirable to have an analysis of MLA with the more natural self-concordance property.
Our result in Theorem 3.1 shows that to obtain a consistent algorithm (with a vanishing bias) from MLD, it suffices to apply a simple discretization such as MLA. This shows we do not need to use an exact simulator of the Brownian motion with changing covariance, as proposed by Ahn and Chewi (2020), which allows a nice analysis under self-concordance property. It would be interesting to bridge the analysis technique to MLA.
The relative smoothness (A2) and relative strong convexity (A3) conditions imply that the Hessian of are bounded by the Hessian of :
See (Zhang et al., 2020, Appendix B) for more details. Since we assume is a Legendre function, as ; then for our result to hold, we need as . This restricts the applicability of the result; for example, it does not apply when is a uniform (
) or Gaussian distribution (is quadratic) restricted on a polytope with being the log-barrier function. It is desirable to have a more general convergence analysis of MLA under weaker conditions on and .
4 Proof of main result
The proof of Theorem 3.1 uses the mean-square analysis framework described in Li et al. (2021). We review the mean-square analysis framework in Section 4.1. We verify the conditions hold for MLA in Section 4.2, and apply the mean-square analysis to prove Theorem 3.1 in Section 4.3.
4.1 A review of the mean-square analysis framework
Mean-square analysis was a classical tool for analyzing the integration error of SDEs (e.g., Milstein and Tretyakov (2013)). Li et al. (2019) extended it to obtain non-asymptotic sampling error bound of an algorithm which is a discretization of a decaying stochastic differential equation (SDE). While Li et al. (2019) required the local errors to satisfy uniform bounds, Li et al. (2021) relaxes this requirement and only needs non-uniform bounds. We will establish non-uniform local error bounds for MLA, and thus use the version of mean-square analysis in Li et al. (2021). The results will be reviewed in a simplified setting; see (Li et al., 2021, Section 3) for details.
Consider a continuous-time process which evolves following the SDE:
for some vector field and matrix . We assume and are Lipschitz continuous. Here is the standard Brownian motion in .
Since and are Lipschitz continuous, one can show (Milstein and Tretyakov, 2013, Lemma 1.3) that there exist a maximum time and a constant such that for any solutions with synchronous coupling:
Algorithm and local error.
Suppose we have an algorithm depending on a step size that simulates the solution of the SDE (17) at time .
For any , let denote the solution of the SDE (17) at time , and let denote the output of the algorithm from . We say that the algorithm has (non-uniform) local weak error of order if there exist a maximum step size and constants such that
We say the algorithm has (non-uniform) local strong error of order if there exist a maximum step size and constants such that
Here and are coupled by sharing the same filtration (i.e. the algorithm has access to the realization of the Wiener process that generates ).
When , the bounds are termed as uniform bounds in Li et al. (2019).
Bound on global error.
With the set-up above, the mean-square analysis framework produces the following bound on the global (long-term) error.
Theorem 4.1 ((Li et al., 2021, Theorem 3.3, 3.4)).
Assume the SDE (17) is contractive with rate . Assume the algorithm has local weak error of order and local strong error of order with . Let us define a maximum step size by
and constants and by
Starting from any , suppose we run the algorithm with step size to produce iterates . Let denote the solution to the SDE (17) at time . Then is close to at all time:
Furthermore, the distribution of has the following biased convergence guarantee:
4.2 Application to MLA
For our sampling problem, we wish to apply the mean-square analysis framework to the Mirror Langevin Algorithm in the dual space (3). The continuous-time SDE (17) of MLA is the Mirror Langevin Dynamics, which we review in the next section. We establish the local error orders of MLA in the following section.
4.2.1 Mirror Langevin Dynamics
Consider the Mirror Langevin Dynamics (MLD), which is a stochastic process following the SDE:
By assumptions (A1) and (A2), and are Lipschitz continuous. Let us establish the contractivity and deviation bound on MLD. The proofs are provided in Appendix B.
Assume (A1) and (A2) with . Then MLD (26) is contractive with rate .
Assume (A1), (A2), and (A3) with . Then any two solutions of MLD (26) with synchronous coupling satisfy
We also need the following bound on MLD. Let and .
Assume (A1), (A2), and (A3). Along MLD (26), for ,
4.2.2 Local Errors of the Mirror Langevin Algorithm
Assume (A1), (A2), and (A3). Then MLA (3) has local weak error at least of order , with maximum step size and constants
Assume (A1), (A2), and (A3). Then MLA (3) has local strong error at least of order , with maximum step size and constants
4.3 Proof of Theorem 3.1: Convergence Rate of MLA
Proof of Theorem 3.1.
Assume (A1), (A2), and (A3) with . We have verified that MLA satisfies the conditions in the mean-square analysis framework: In Lemma 4.2 we show MLD is contractive with rate . We derive the deviation bound in Lemma 4.3 with . In Lemmas 4.5 and 4.6 we show MLA has local weak error of order and local strong error of order , and indeed .
Then by Theorem 4.1, we can compute the maximum step size:
Our result leaves open many questions, including the following. It would be interesting to consider a more sophisticated discretization of MLD such that the mean-square analysis framework will show improved local errors and smaller bias.
It would be interesting to have a better analysis of MLA under more natural conditions on , such as self-concordance (rather than modified self-concordance), and under relaxed requirements on and (e.g. that allows us to sample from a uniform or Gaussian distribution on a polytope). It would be desirable to have a convergence analysis of MLA in the Wasserstein distance generated by the Hessian metric rather than the Euclidean metric, or in other measures such as KL or -divergence.
It would be interesting to understand whether we can discretize the Newton Langevin Dynamics (which is the case when as described in Appendix A.4 and which is affine-invariant in continuous time) and obtain a discrete-time algorithm with a convergence guarantee which is also affine-invariant.
It would also be interesting to understand whether we can derive a more general discrete-time analysis framework that works under a relaxed condition, e.g. without requiring contraction in continuous time, but only exponential convergence in function value (which is known for ULA under the log-Sobolev inequality, see for example Vempala and Wibisono (2019)).
- Ahn and Chewi (2020) Kwangjun Ahn and Sinho Chewi. Efficient constrained sampling via the mirror-Langevin algorithm. arXiv preprint arXiv:2010.16212, 2020.
- Brascamp and Lieb (1976) Herm Jan Brascamp and Elliott H Lieb. On extensions of the Brunn-Minkowski and Prékopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. Journal of Functional Analysis, 22(4):366–389, 1976. ISSN 0022-1236.
- Chewi et al. (2020) Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, Philippe Rigollet, and Austin Stromme. Exponential ergodicity of mirror-Langevin diffusions. In Advances in Neural Information Processing Systems, volume 33, pages 19573–19585. Curran Associates, Inc., 2020.
Further and stronger analogy between sampling and optimization:
Langevin Monte Carlo and gradient descent.
In Proceedings of the 2017 Conference on Learning Theory,
volume 65 of
Proceedings of Machine Learning Research, pages 678–689. PMLR, 07–10 Jul 2017. URL http://proceedings.mlr.press/v65/dalalyan17a.html.
- Dalalyan and Karagulyan (2019) Arnak S Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.
- Durmus and Moulines (2017) Alain Durmus and Eric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Annals of Applied Probability, 27(3):1551–1587, 2017.
Durmus and Moulines (2019)
Alain Durmus and Eric Moulines.
High-dimensional bayesian inference via the unadjusted Langevin algorithm.Bernoulli, 25(4A):2854–2882, 2019.
- Fathi (2019) Max Fathi. Quelques applications du transport optimal en analyse et en probabilités. Habilitation à diriger des recherches, Université Paul Sabatier (Toulouse 3), April 2019.
- Hsieh et al. (2018) Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. Mirrored Langevin Dynamics. In Advances in Neural Information Processing Systems 31: NeurIPS 2018, Montréal, Canada, pages 2883–2892, 2018.
- Jordan et al. (1998) Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, January 1998.
Kannan and Narayanan (2012)
Ravindran Kannan and Hariharan Narayanan.
Random walks on polytopes and an affine interior point method for linear programming.Mathematics of Operations Research, 37(1):1–20, 2012. ISSN 0364765X, 15265471.
Laddha et al. (2020)
Aditi Laddha, Yin Tat Lee, and Santosh Vempala.
Strong self-concordance and sampling.
Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 1212–1222, 2020.
- Li et al. (2021) Ruilin Li, Hongyuan Zha, and Molei Tao. Sqrt() dimension dependence of Langevin Monte Carlo. arXiv preprint arXiv:2109.03839, 2021.
- Li et al. (2019) Xuechen Li, Yi Wu, Lester Mackey, and Murat A Erdogdu. Stochastic Runge-Kutta accelerates Langevin Monte Carlo and beyond. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Milstein and Tretyakov (2013) Grigori Noah Milstein and Michael V Tretyakov. Stochastic numerics for mathematical physics. Springer Science & Business Media, 2013.
- Rockafellar (1970) R. Tyrrell Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics and Physics. Princeton University Press, 1970. ISBN 978-1-4008-7317-3.
- Vempala and Wibisono (2019) Santosh Vempala and Andre Wibisono. Rapid convergence of the Unadjusted Langevin Algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Wibisono (2018) Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 2093–3027, 2018. URL http://proceedings.mlr.press/v75/wibisono18a.html.
- Zhang et al. (2020) Kelvin Shuangjian Zhang, Gabriel Peyré, Jalal Fadili, and Marcelo Pereyra. Wasserstein control of Mirror Langevin Monte Carlo. In Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 3814–3841. PMLR, 2020. URL http://proceedings.mlr.press/v125/zhang20a.html.
Appendix A Riemannian and Mirror Langevin Dynamics in Continuous Time
Consider the problem of sampling from on as described in Section 2.
Suppose we endow with a Riemannian metric , which we write as a positive definite matrix: for all . This means at each point we measure local norm using the metric :
for all in the tangent space. We assume is differentiable. Let be the inverse matrix, and let be a square-root of . Let be the divergence of , which is a vector-valued function whose entries are the divergences of the columns of . We assume (equivalently, ) as approaches the boundary of .
a.1 Review for optimization
Recall in optimization, the Riemannian gradient flow (RGF) (or natural gradient flow) for minimizing using the metric is the solution to the differential equation:
Here we use the inverse metric to turn the -gradient into a gradient tangent vector under the Riemannian metric . RGF has nice properties when the objective function satisfies some properties. For example, if is geodesically strongly convex (which means is strongly convex along geodesics generated by the Riemannian metric ), then RGF is exponentially contracting. Moreover, if is gradient dominated with respect to , then the function value converges exponentially fast along RGF.
Consider when the metric is given by the Hessian of a convex Legendre function : . Then the RGF becomes:
In terms of the dual variable , this becomes the mirror flow:
Recall by the mirror map , the metric on becomes the Hessian metric on . The mirror flow is also the Riemannian gradient flow for minimizing the pushforward function under the Hessian metric (because ). Discretizing the mirror flow gives the mirror descent algorithm in optimization.
a.2 Riemannian Langevin Dynamics
The Riemannian Langevin Dynamics (RLD) for sampling from on using the metric is the solution to the stochastic differential equation:
Here is the standard Brownian motion in . Since as , the process does not leave : If , then for all .
The additional drift term accounts for the covariance in the Brownian motion. The stationary distribution for RLD is (the density is with respect to the Lebesgue measure on ). This can be seen, for example, from the following Fokker-Planck equation.
If follows RLD (29), then its density
evolves following the partial differential equation (PDE):
Clearly if then the dynamics is stationary. Furthermore, the PDE above can be interpreted as the gradient flow for minimizing relative entropy with respect to the Wasserstein metric on the metric space .
From the Fokker-Planck equation (30), we can derive how fast the dynamics RLD approaches the target distribution in various measures.
For example, recall the -divergence of a probability distribution with respect to is
Then a standard calculation reveals that the -divergence is decreasing along RLD (30):