Sampling from a target distribution is a fundamental task in machine learning. Consider the Euclidean spaceand a convex function . Assuming that has a positive finite integral w.r.t. the Lebesgue measure , we consider the task of sampling from the distribution whose density is proportional to (we shall write ).
If is smooth, Langevin algorithm produces a sequence of iterates asymptotically distributed according to . Langevin algorithm performs iterations of the form
is a sequence of i.i.d. standard Gaussian vectors in. Each iteration of (1) can be seen as a gradient descent step for , where the gradient of perturbed by a Gaussian vector. Nonasymptotic bounds for Langevin algorithm have been established in dalalyan2017theoretical ; durmus2017nonasymptotic . Moreover, Langevin algorithm can be interpreted as an inexact gradient descent method to minimize the Kullback-Leibler (KL) divergence w.r.t. in the space of probability measures cheng2018convergence ; durmus2018analysis ; bernton2018langevin ; wibisono2018sampling ; ma2019there ; ambrosio2008gradient .
In many applications, the function
is naturally written as the sum of a smooth and a nonsmooth term. In Bayesian statistics for example,typically represents some posterior distribution. In this case, is the sum of the -likelihood (which is itself a sum over the data points) and the possibly nonsmooth potential of the prior distribution welling2011bayesian ; durmus2018efficient ; durmus2018analysis , which plays the role of a regularizer. In some other applications in Bayesian learning, the support of is not the whole space bubeck2018sampling ; brosse2017sampling (i.e., can take the value ). In order to cover these applications, we consider the case where is written as
is a random variable,for every , is smooth and convex and is nonsmooth and convex. We assume to have access to the stochastic gradient (where is a random variable with values in ) and to the proximity operator of . The template (2) covers many log concave densitiesChaux2007 ; durmus2018efficient ; durmus2018analysis ; bubeck2018sampling . In optimization, the minimization of can be efficiently tackled by the proximal stochastic gradient algorithm atc-for-mou-17 . Inspired by this optimization algorithm, the Proximal Stochastic Gradient Langevin Algorithm (PSGLA) durmus2018analysis is the method performing proximal stochastic Langevin steps of the form
where , is a sequence of i.i.d. standard Gaussian random vectors in , and is a sequence of i.i.d. copies of . Remarkably, the iterates of PSGLA remain in the domain of , i.e., the support of , a property that is useful in many contexts. When is Lipschitz continuous, the support of is and PSGLA can be interpreted as an inexact proximal gradient descent method for minimizing KL, with convergence rates proven in terms of the KL divergence durmus2018analysis . However, for general , the KL divergence can take infinite values along PSGLA. Therefore, a new approach is needed.
1.1 Related works
First, various instances of the PSGLA algorithm have already been considered.
The only instance allowing to be infinite (i.e., the support of not to be ) is the Projected Langevin Algorithm bubeck2018sampling , which corresponds to our setting in the special case with (i.e., the indicator function of a convex body111A convex body is a compact convex set with a nonempty interior. ), and (i.e., the full gradient of ). In this case, is the orthogonal projection onto and is supported by . Bubeck et al bubeck2018sampling provide complexity results in terms of sufficient number of iterations to achieve accuracy in terms of the Total Variation between the target distribution and the current iterate distribution. Assuming that is convex and smooth, the complexity of the Projected Langevin Algorithm is 222Our big O notation ignores logarithm factors., and if , the complexity is improved to .
Other instances of PSGLA were proposed in the case where is Lipschitz continuous or smooth (and hence finite). Wibisono wibisono2018sampling considered the case with and , proposing the Symmetrized Langevin Algorithm (SLA), and showed that the current iterate distribution converges linearly in Wasserstein distance to the invariant measure of the SLA, if is strongly convex and smooth. Durmus et al durmus2018analysis considered the case where is Lipschitz continuous, and showed that the complexity of PSGLA is in terms of the KL divergence and in terms of the Total Variation distance if is convex and smooth. If is strongly convex, the complexity is in Wasserstein distance and in KL divergence. Bernton bernton2018langevin studied a setting similar to durmus2018analysis and derived a similar result for the Proximal Langevin Algorithm (i.e., PSGLA without the gradient step) in the strongly convex case. The Proximal Langevin Algorithm was also studied in a recent paper of Wibisono wibisono2019proximal , where a rapid convergence result was proven in the case where is nonconvex but satisfies further smoothness and geometric assumptions.
Second, the task of sampling w.r.t. , where is nonsmooth and possibly takes infinite values, using Langevin algorithm, has also been considered.
When is strongly convex and an indicator function of a bounded convex set, the existence of an algorithm achieving in Wasserstein and Total Variation distances was proven by Hsieh et al (hsieh2018mirrored, , Theorem 3). However, an actual algorithm is only given in a specific, although nonconvex, case. Besides, MYULA (Moreau-Yosida Unadjusted Langevin Algorithm) durmus2018efficient ; brosse2017sampling can tackle the task of sampling from efficiently. MYULA is equivalent to Langevin algorithm (1) applied to sampling from , where is the Moreau-Yosida approximation of atchade2015moreau . By choosing the smoothing parameter appropriately, and making assumptions that allow to control the distance between and (e.g., Lipschitz or ), complexity results for MYULA were established in durmus2018efficient ; brosse2017sampling . For example, if is the indicator function of a convex body, Brosse et al brosse2017sampling show that the complexity of MYULA is in terms of the Total Variation distance (resp. 1-Wasserstein distance) if is convex and smooth (resp., if is strongly convex and smooth), provided that the algorithm is initialized from a minimizer of . Similarly to PSGLA, MYULA involves one proximal step and one gradient step per iteration. However, the support of the smoothed distribution is always (even if is not fully supported), and therefore the iterates of MYULA do not remain in the support of the target distribution , contrary to PSGLA.
Finally, the task of sampling w.r.t. , where is not smooth but finite, has also been considered.
The Perturbed Langevin Algorithm proposed by Chatterji et al chatterji2019langevin allows to sample from in the case when satisfies a weak form of smoothness (generalizing both Lipschitz continuity and smoothness) and without accessing its proximity operator. Finally, if is Lipschitz continuous, the Stochastic Proximal Langevin Algorithm proposed by Salim et al salim2019stochastic and Schechtman et alschechtman2019passty allows to sample from using cheap stochastic proximity operators only.
In summary, the only instance of PSGLA allowing to be infinite has complexity in Total Variation bubeck2018sampling and only applies to the case where is the indicator of a convex body. In this case, assuming moreover that is strongly convex, MYULA has complexity in 1-Wasserstein distance brosse2017sampling , but allows the iterates to leave the support of . Besides, in the latter case, there exists a Langevin algorithm achieving rate in the Wasserstein distance hsieh2018mirrored .
In this paper, we consider other cases where can be infinite. More precisely, we consider a general and we assume that has a mild Sobolev regularity. Under this assumption, we prove that PSGLA has the complexity in Wasserstein distance if is strongly convex. We also show that, if is just convex, PSGLA has the complexity in terms of a newly defined duality gap, which can be seen as the notion that replaces KL, since KL can be infinite.
Our approach follows the line of works cheng2018convergence ; durmus2018analysis ; wibisono2018sampling ; bernton2018langevin ; ma2019there ; wibisono2019proximal ; vempala2019rapid that formulate the task of sampling form as the problem of minimizing the KL divergence w.r.t . In summary, our contributions are the following.
In the first part of the paper, we reformulate the task of sampling from as the resolution of a monotone inclusion defined on the space of probability measures. We subsequently use this reformulation to define a duality gap for the minimization of the KL divergence, and show that strong duality holds.
In the second part of this paper, we use this reformulation to represent PSGLA as a primal dual stochastic Forward Backward algorithm involving monotone operators.
This new representation of PSGLA, along with the strong duality result from the first part, allows us to prove new complexity results for PSGLA that extend and improve the state of the art.
Finally, we conduct some numerical experiments for sampling from a distribution supported by a half space of matrices.
In the first part we combine tools from optimization duality condatreview and optimal transport ambrosio2008gradient and in the second part we combine tools from the analysis of the Langevin algorithm durmus2018analysis , and the analysis of primal dual optimization algorithms drori2015simple ; chambolle2016ergodic .
The remainder is organized as follows. In Section 2 we provide some background knowledge on convex analysis and optimal transport. In Section 3 we develop a primal dual optimality theory for the task of sampling from . In Section 4 we give a new representation of PSGLA using monotone operators. We use it to state our main complexity result on PSGLA in Section 5. We discuss the broader impact of our work in Section 6. Numerical experiments (which are secondary as this is mainly a theoretical work) and all proofs are postponed to the appendix. Therein we also provide further intuitions on PSGLA along with an extension of PSGLA for handling a third (stochastic and proximable) term in the definition of the potential (2).
Throughout this paper, we use the conventions and .
2.1 Convex analysis
In this section, we recall some facts from convex analysis. These facts will be used in the proofs without mention. For more details, the reader is referred to bau17 .
2.1.1 Convex optimization
By we denote the set of proper, convex, lower semicontinuous functions . A function is -smooth if is differentiable and its gradient is -Lipschitz continuous. Consider and denote its domain. Given , a subgradient of at is any vector satisfying
for every . If the set of subgradients of at is not empty, then there exists a unique element of with minimal norm. This particular subgradient is denoted . The set valued map is called the subdifferential. The proximity operator of , denoted , is defined by
By we denote the indicator function of set given by if and if . If , where is a closed convex set, then is the orthogonal projection onto . Moreover, is the only solution to the inclusion . The Fenchel transform of is the function defined by Several properties relate to its Fenchel transform . First, the Fenchel transform of is . Then, the subdifferential is characterized by the relation Finally, is -strongly convex if and only if is -smooth.
2.1.2 Maximal monotone operators
A set valued function is monotone if whenever and . The inverse of , denoted , is defined by the relation , and the set of zeros of is . If is monotone, is maximal if its resolvent, i.e., the map , is single valued. If , then is a maximal monotone operator and . Moreover, and . If
is a skew symmetric matrix on, the operator is maximal monotone. Finally, the sum is also a maximal monotone operator. Many problems in optimization can be cast as the problem of finding a zero of the sum of two maximal monotone operators condatreview . For instance, . To solve this problem, the Forward Backward algorithm is given by the iteration , where is a symmetric positive definite matrix (),333The operators and are not monotone in general, however they are monotone under the inner product induced by . and is single valued. Using the definition of the resolvent, the Forward Backward algorithm can equivalently be written as
2.2 Optimal transport
In this section, we recall some facts from optimal transport theory. These facts will be used in the proofs without mention. For more details, the reader is referred to Ambrosio et al ambrosio2008gradient .
2.2.1 Wasserstein distance
By we denote the -field of Lesbesgue measurable subsets of , and by the set of probability measures over
with finite second moment. Denote the support of . The identity map belongs to the Hilbert space of -square integrable random vectors in . We denote (resp. ) the inner product (resp. the norm) in this space. Given , where is some Euclidean space, the pushforward measure of by , also called the image measure, is defined by for every . Consider . A coupling between and (we shall write ) is a probability measure over such that , where , and , where . In other words, is a random variable such that the distribution of is (we shall write ) and if and only if the distribution of is a coupling. The (2-)Wasserstein distance is then defined by
Let be the set of elements such that is absolutely continuous w.r.t. (we shall write ). Brenier’s theorem asserts that if , then the defining is actually a achieved by a unique minimizer . Moreover, there exists a uniquely determined -almost everywhere (a.e.) map such that , where . In this case, is called the optimal pushforward from to and satisfies
2.2.2 Geodesically convex functionals
We shall consider several functionals defined on the space . For every with density denoted w.r.t. , the entropy is defined by
and if , then . Given , the potential energy is defined for every by
Finally, if such that , the Kullback-Leibler (KL) divergence is defined by
where denotes the density of w.r.t. , and if is not absolutely continuous w.r.t. . The functionals , and satisfy a form of convexity over called geodesic convexity. If is geodesically convex, then for every , , and , Given , a (Wasserstein) subgradient of at is a random variable such that for every ,
Moreover, if is a subgradient of at , then the following monotonicity property holds
If the set of subgradients of at is not empty, then there exists a unique element of with minimal norm. This particular subgradient is denoted . However, the set might be empty. A typical condition for nonemptiness requires the density to have some Sobolev regularity. For every open set , we denote the Sobolev space of -integrable functions admitting a -integrable weak gradient . We say that if for every bounded open set . Obviously, .
2.3 Assumptions on and
Consider and . We make the following assumptions.
The function is convex and -smooth. Moreover, .
Note that . We denote (resp. ) the strong convexity parameter of (resp. ), equal to zero if (resp. ) is not strongly convex.
The integral is positive and finite.
Assumption 2 is needed to define the target distribution , and implies that , where .
The integral is finite.
Assumption 3 is equivalent to requiring , see below. Moreover, we assume the following regularity property for the function .
The function belongs to the space .
Assumption 4 is a necessary condition for , see below. This assumption precludes
from being a uniform distribution. However, Assumption4 is quite general, e.g., need not be continuous or positive (see the numerical experiment section). Finally, we assume that the stochastic gradients of
have a bounded variance. Consider an abstract measurable space, and a random variable with values in .
For every , is integrable and . Moreover, there exists such that for every , .
The last assumption implies that the stochastic gradients are unbiased: for every .
3 Primal dual optimality in Wasserstein space
Let be defined by
Equation (15) says that is the unique minimizer of :
3.1 Subdifferential calculus
The following result is a consequence of (ambrosio2008gradient, , Theorem 10.4.13).
Let be an element of . Then, and . Moreover, if and only if and there exists such that
for -a.e. . In this case, .
Equation (17) can be seen as the first order optimality conditions associated with the minimization of the functional . Consider the "dual" variable defined a.e. Using Assumption 3 and , . We can express the first order optimality condition (17) as , a.e. Besides, , therefore using Denote and . The relationship between and can be summarized as
In the sequel, we fix the probability space , denote the mathematical expectation and the space . The expression "almost surely" (a.s.) will be understood w.r.t. . Recall that is the map and . Using Assumption 3, , , , and a.s.
3.2 Lagrangian function and duality gap
We introduce the following Lagrangian function defined for every and by
where . We also define the duality gap by
The next theorem, which is of independent interest, can be interpreted as a strong duality result for the Lagrangian function , see (rockafellar1970convex, , Lemma 36.2).
Theorem 3 (Strong duality).
4 Forward Backward representation of PSGLA
In this section, we present our viewpoint on PSGLA (3). More precisely, we represent PSGLA as a (stochastic) Forward Backward algorithm involving (stochastic) monotone operators which are not necessarily subdifferentials.
Rigorous Forward Backward representation.
The “monotone” inclusion (23) intuitively suggests the following stochastic Forward Backward algorithm for obtaining samples from (and hence from by marginalizing):
Above, is an appropriately chosen matrix. Indeed, Algorithm (24)-(25) is a stochastic Forward-Backward algorithm rosasco2016stochastic ; com-pes-pafa16 ; bia-hac-16 ; bia-hac-sal-jca17 where the gradient is perturbed by a Gaussian vector, as in the Langevin algorithm (1). In Algorithm (24)-(25), we cannot set to be the identity map of because the inclusion (25) is intractable in this case. We take , i.e., with our notations, . Although the matrix is only semi-definite positive, the next lemma shows that Algorithm (24)-(25) is still well defined. More precisely, the next lemma shows that (by taking in the lemma) and hence the resulting algorithm (24)-(25) is PSGLA. Based on the representation (24)-(25) of PSGLA, the next lemma also provides an important inequality used later in the proof of Theorem 5.
Let . Then if and only if and . Moreover, if is -smooth, then
5 Main results
We now provide our main result on PSGLA (3). For , denote (resp. ) the distribution of (resp. ), defined in the previous section.
The proof of Theorem 5 relies on using Lemma 4 along with (durmus2018analysis, , Lemma 30). Inspecting the proof of Theorem 5, one can see that any can replace 444The proof does not rely on specific properties of the latter like being primal dual optimal.. The situation is similar to primal dual algorithms in optimization chambolle2016ergodic ; drori2015simple and Evolution Variational Inequalities in optimal transport ambrosio2008gradient .
If is Lipschitz continuous (in particular if ), then our Assumptions hold true. Moreover, inequality (28) recovers (durmus2018analysis, , Corollary 18) but with the duality gap instead of the KL divergence. Obtaining a result in terms of KL divergence is hopeless for PSGLA in general because the KL divergence is infinite; see the appendix. To the best of our knowledge, inequality (29) that holds when is just convex is new, even in the particular case where is Lipschitz. Corollary 6 implies the following complexity results. Given , choosing and in inequality (28) leads to . If (i.e., if is smooth), choosing and in inequality (29) leads to . Finally, if (i.e., if is strongly convex), choosing and i.e.,
6 Broader impact
We made a step towards theoretical understanding the properties of the Langevin algorithm in the elusive case where the target distribution is not smooth and not fully supported. The case is known to be difficult to analyze and has many applications brosse2017sampling ; durmus2018efficient ; bubeck2018sampling . Our analysis improves and extends the state of the art.
Moreover, our approach is new. We developed a primal dual theory for a minimization problem over the Wasserstein space, which is of independent interest and may inspire other researchers to push the field further. A broader duality theory for minimization problems in the Wasserstein space would be of practical and theoretical interest.
-  L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2008.
-  Y. F Atchadé. A Moreau-Yosida approximation scheme for a class of high-dimensional posterior distributions. arXiv preprint arXiv:1505.07072, 2015.
-  Y. F Atchadé, G. Fort, and E. Moulines. On perturbed proximal gradient algorithms. Journal of Machine Learning Research, 18(1):310–342, 2017.
-  H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, 2nd edition, 2017.
-  E. Bernton. Langevin Monte Carlo and JKO splitting. arXiv preprint arXiv:1802.08671, 2018.
-  P. Bianchi and W. Hachem. Dynamical behavior of a stochastic Forward-Backward algorithm using random monotone operators. Journal of Optimization Theory and Applications, 171(1):90–120, 2016.
-  P. Bianchi, W. Hachem, and A. Salim. A constant step Forward-Backward algorithm involving random maximal monotone operators. Journal of Convex Analysis, 26(2):397–436, 2019.
-  S. Brazitikos, A. Giannopoulos, P. Valettas, and B.-H. Vritsiou. Geometry of isotropic convex bodies, volume 196. American Mathematical Soc., 2014.
-  N. Brosse, A. Durmus, E. Moulines, and M. Pereyra. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. arXiv preprint arXiv:1705.08964, 2017.
-  S. Bubeck, R. Eldan, and J. Lehec. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete & Computational Geometry, 59(4):757–783, 2018.
-  A. Chambolle and T. Pock. On the ergodic convergence rates of a first-order primal–dual algorithm. Mathematical Programming, 159(1-2):253–287, 2016.
-  N. S Chatterji, J. Diakonikolas, M. I Jordan, and P. L Bartlett. Langevin monte carlo without smoothness. arXiv preprint arXiv:1905.13285, 2019.
-  C. Chaux, P. L. Combettes, J.-C. Pesquet, and V. R. Wajs. A variational formulation for frame-based inverse problems. Inverse Problems, 23(4):1495–1518, June 2007.
-  X. Cheng and P. L Bartlett. Convergence of Langevin MCMC in KL-divergence. PMLR 83, (83):186–211, 2018.
-  P. L. Combettes and J.-C. Pesquet. Stochastic approximations and perturbations in forward-backward splitting for monotone operators. Pure and Applied Functional Analysis, 1(1):13–37, 2016.
-  L. Condat, D. Kitahara, A. Contreras, and A. Hirabayashi. Proximal splitting algorithms: Overrelax them all! arXiv preprint arXiv:1912.00137, 2019.
-  A. S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
-  Y. Drori, S. Sabach, and M. Teboulle. A simple algorithm for a class of nonsmooth convex–concave saddle-point problems. Operations Research Letters, 43(2):209–214, 2015.
-  A. Durmus, S. Majewski, and B. Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
-  A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
A. Durmus, E. Moulines, and M. Pereyra.
Efficient Bayesian computation by proximal Markov Chain Monte Carlo: when Langevin meets Moreau.SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.
-  Y.-P. Hsieh, A. Kavis, P. Rolland, and V. Cevher. Mirrored Langevin dynamics. In Advances in Neural Information Processing Systems, pages 2878–2887, 2018.
-  Y.-A. Ma, N. Chatterji, X. Cheng, N. Flammarion, P. L Bartlett, and M. I Jordan. Is there an analog of Nesterov acceleration for MCMC? arXiv preprint arXiv:1902.00996, 2019.
-  R. T Rockafellar. Convex analysis, volume 28. Princeton university press, 1970.
-  L. Rosasco, S. Villa, and B. C Vũ. Stochastic forward–backward splitting for monotone inclusions. Journal of Optimization Theory and Applications, 169(2):388–406, 2016.
-  A. Salim, D. Kovalev, and P. Richtárik. Stochastic proximal langevin algorithm: Potential splitting and nonasymptotic rates. In Advances in Neural Information Processing Systems, pages 6649–6661, 2019.
-  S. Schechtman, A. Salim, and P. Bianchi. Passty Langevin. In CAp, 2019.
Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.
-  P. Toulis, T. Horel, and E. M Airoldi. Stable Robbins-Monro approximations through stochastic proximal updates. arXiv preprint arXiv:1510.00967, 2015.
-  S. Vempala and A. Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems, pages 8092–8104, 2019.
-  M. Welling and Y. W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pages 681–688, 2011.
-  A. Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. arXiv preprint arXiv:1802.08089, 2018.
-  A. Wibisono. Proximal Langevin algorithm: Rapid convergence under isoperimetry. arXiv preprint arXiv:1911.01469, 2019.
Appendix A Numerical experiments
In this section, we illustrate our results on PSGLA through numerical experiments.
Sampling a posteriori.
We consider a statistical framework where i.i.d. random vectors (data) with distribution are observed. We adopt a Bayesian strategy where we assume the distribution to be indexed by a random vector with values in . Denote the density of (a.k.a. the likelihood function) w.r.t. some reference measure. Given a prior distribution for with density w.r.t. , our goal is construct samples from the posterior distribution
in order e.g.
to estimate the meana posteriori via Monte Carlo approximations,
In the experiments, the Euclidean space is a spac