The task of sampling with respect to some target distribution $\pi$ has become a prerequisite for many inference procedures. We consider the case where $\pi$ admits a density proportional to $\exp(-V)$ with respect to the Lebesgue measure over $\mathbb{R}^d$ (we shall write $\pi \propto \exp(-V)$). The function $V$, referred to as the potential function, is assumed convex and smooth. Our sampling problem can be approached from an optimization point of view. Indeed, the target distribution $\pi$ is the solution to the optimization problem defined on the set $\mathcal{P}_2(\mathbb{R}^d)$ of probability measures $\mu$ such that $\int \|x\|^2 \, d\mu(x) < \infty$ by
$$\pi = \operatorname{argmin}_{\mu \in \mathcal{P}_2(\mathbb{R}^d)} \operatorname{KL}(\mu \mid \pi), \tag{1}$$
where $\operatorname{KL}$ denotes the Kullback–Leibler divergence, or relative entropy; see [8, Lemma 2.2.1].
Problem (1) can be solved in continuous time by the gradient flow of the relative entropy [19, 1], a family of distributions $(\mu_t)_{t \geq 0}$ indexed by $\mathbb{R}_+$, with values in $\mathcal{P}_2(\mathbb{R}^d)$ and converging to $\pi$. In discrete time, some sampling algorithms can be seen as discretization schemes of the gradient flow of the relative entropy [15, 31, 6]. For example, the Langevin algorithm [14, 16] can be seen as an inexact gradient descent algorithm to solve Problem (1) over $\mathcal{P}_2(\mathbb{R}^d)$. Because of this inexactness, the Langevin algorithm converges at the rate $\mathcal{O}(1/\sqrt{n})$ (where $n$ is the number of iterations), whereas gradient descent in Euclidean spaces converges at the rate $\mathcal{O}(1/n)$. However, exact discretization schemes of the gradient flow of the entropy exist and are reviewed in [28]. These schemes typically rely on iterative applications of the JKO operator, a map from $\mathcal{P}_2(\mathbb{R}^d)$ to $\mathcal{P}_2(\mathbb{R}^d)$. The JKO operator was introduced in [19] to construct the gradient flow as the limit of piecewise constant functions with values in $\mathcal{P}_2(\mathbb{R}^d)$. Indeed, the JKO operator is similar to the proximity operator in optimization, a map from $\mathbb{R}^d$ to $\mathbb{R}^d$. Many nonsmooth optimization algorithms rely on iterative applications of the proximity operator. Moreover, the proximity operator can also be used to construct continuous time flows as a limit of piecewise constant functions, see [26, 7].
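For concreteness, the (unadjusted) Langevin algorithm mentioned above can be sketched in a few lines. This is a generic illustration: the step size, target, and iteration counts below are arbitrary choices of ours, not taken from the paper.

```python
import numpy as np

def ula(grad_V, x, step, n_iters, rng):
    """Unadjusted Langevin algorithm: one noisy (inexact) gradient step per iteration.

    x holds parallel chains; each chain follows
    x <- x - step * grad_V(x) + sqrt(2 * step) * standard_gaussian.
    """
    for _ in range(n_iters):
        x = x - step * grad_V(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x

# Toy target pi = N(0, 1), i.e. V(x) = x^2 / 2 and grad_V(x) = x.
rng = np.random.default_rng(0)
chains = ula(lambda x: x, np.zeros(4000), step=0.05, n_iters=400, rng=rng)
```

The empirical distribution of `chains` approximates the target, up to a bias governed by the step size; this bias is the "inexactness" referred to above.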
Our first contribution is to study the Forward Backward (FB) discretization scheme of the gradient flow of the relative entropy [31, Section 4.1] as an optimization algorithm to solve Problem (1). We show convergence rates for the FB scheme similar to those of the proximal gradient algorithm in Euclidean spaces. As recalled below, the relative entropy is the sum of a smooth and a nonsmooth functional, respectively the potential energy $\mathcal{E}_V$ and the entropy $\mathcal{H}$. The FB scheme alternates a forward step for the potential energy and a backward step for the entropy, and can be seen as an exact proximal gradient algorithm to solve Problem (1) over $\mathcal{P}_2(\mathbb{R}^d)$ [31, Section 4.1]. More precisely, we provide a non-asymptotic analysis of the FB scheme: we show that it converges at the rate $\mathcal{O}(1/n)$, and linearly if $V$ is strongly convex. These rates match the convergence rates of the proximal gradient algorithm in Euclidean spaces. Our analysis relies on the theory of gradient flows in the space of probability measures [1]. The main challenge in the convergence proof is to address the nonconvexity of the entropy functional. Indeed, the entropy functional satisfies a property weaker than convexity, called generalized geodesic convexity. Identifying the optimal transport maps between the iterates of the FB scheme, we exploit this geodesic convexity of the entropy to establish a discrete Evolution Variational Inequality (EVI) for the FB scheme, similarly to the analysis of the Langevin algorithm in [15]. The convergence rates are corollaries of the EVI.
The FB scheme is more challenging to implement than the Langevin algorithm because of the backward (or proximal, or JKO) step. The practical implementation of the JKO operator has been considered in several papers, see [28, Section 4.8]. In the specific case of the JKO of the entropy functional, [31, Section G.1] provides a closed form formula for the Gaussian case.
Our second contribution is to provide a closed form formula for the JKO of the entropy between discrete measures. When the space $\mathbb{R}^d$ is discretized, the JKO of the entropy has to be redefined to fit this situation. Using a duality approach similar to the one used for regularized optimal transport [13], we are able to give a closed form formula for the JKO of the entropy.
We finally mention two lines of work using proximal techniques in settings close to (but different from) ours. Proximity operators have been used within the Langevin algorithm to sample from target distributions with composite potentials [15, 6, 31, 27]. Moreover, the space $\mathcal{P}_2(\mathbb{R}^d)$ can be interpreted as a Riemannian manifold [25, 23, 30], and proximal gradient algorithms over finite-dimensional Riemannian manifolds have recently been investigated in [11, 18].
The remainder is organized as follows. In Section 2 we provide some background knowledge in optimal transport and gradient flows. In Section 3, we introduce the Forward Backward Euler discretization scheme. This FB scheme is studied as an optimization algorithm in Section 4. We provide non-asymptotic rates for the resolution of (1). Then, a closed form formula for the JKO of the entropy between discrete measures is given in Section 5. Finally, we conclude in Section 6. The convergence proofs are postponed to the appendix.
In this section, we introduce the notations and recall fundamental definitions and properties on optimal transport and gradient flows that will be used throughout the paper.
In the sequel, $\mathcal{P}_2(\mathbb{R}^d)$ is the space of probability measures on $\mathbb{R}^d$ with finite second-order moment. Denote by $\mathcal{B}(\mathbb{R}^d)$ the Borel $\sigma$-field over $\mathbb{R}^d$. For any $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, $L^2(\mu)$ is the space of measurable functions $f : \mathbb{R}^d \to \mathbb{R}^d$ such that $\int \|f\|^2 \, d\mu < \infty$. Note that the identity map $I$ is an element of $L^2(\mu)$. For any $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, we denote by $\|\cdot\|_\mu$ and $\langle \cdot, \cdot \rangle_\mu$ respectively the norm and the inner product of the space $L^2(\mu)$. For any measures $\mu, \nu$, we write $\mu \ll \nu$ if $\mu$ is absolutely continuous with respect to $\nu$, and we denote by $\mathrm{Leb}$ the Lebesgue measure over $\mathbb{R}^d$. The set of regular distributions of the Wasserstein space, i.e., those absolutely continuous with respect to $\mathrm{Leb}$, is denoted $\mathcal{P}_2^r(\mathbb{R}^d)$. For $f, g \in L^2(\mu)$, the composition of $f$ with $g$ is sometimes denoted $f \circ g$.
2.2 Optimal transport
For every measurable map $T$ defined on $\mathbb{R}^d$ and for every $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, we denote by $T \# \mu$ the pushforward measure of $\mu$ by $T$, characterized by the transfer lemma:
$$\int \varphi(y) \, d(T \# \mu)(y) = \int \varphi(T(x)) \, d\mu(x), \quad \text{for every measurable bounded } \varphi.$$
Consider the $2$-Wasserstein distance $W_2$ defined for every $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$ by
$$W_2^2(\mu, \nu) = \inf_{\upsilon \in \Gamma(\mu, \nu)} \int \|x - y\|^2 \, d\upsilon(x, y), \tag{3}$$
where $\Gamma(\mu, \nu)$ is the set of couplings between $\mu$ and $\nu$, i.e., the set of nonnegative measures $\upsilon$ over $\mathbb{R}^d \times \mathbb{R}^d$ such that $p \# \upsilon = \mu$ (resp. $q \# \upsilon = \nu$), where $p : (x, y) \mapsto x$ (resp. $q : (x, y) \mapsto y$).
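Between discrete measures the infimum above is a finite linear program over couplings. The following sketch (1-D supports and names of our own choosing, not from the paper) solves it with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein2_discrete(x, a, y, b):
    """W2 between sum_i a_i delta_{x_i} and sum_j b_j delta_{y_j} (1-D supports)."""
    n, m = len(x), len(y)
    cost = (np.asarray(x)[:, None] - np.asarray(y)[None, :]) ** 2  # squared distances
    # Equality constraints on the raveled coupling P (row-major):
    # row sums of P equal a, column sums of P equal b.
    rows = np.zeros((n, n * m))
    for i in range(n):
        rows[i, i * m:(i + 1) * m] = 1.0
    cols = np.zeros((m, n * m))
    for j in range(m):
        cols[j, j::m] = 1.0
    res = linprog(cost.ravel(), A_eq=np.vstack([rows, cols]),
                  b_eq=np.concatenate([a, b]), bounds=(0, None))
    return float(np.sqrt(res.fun))
```

For instance, for two unit point masses at $0$ and $3$ this returns $3$, the distance the mass must travel.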
Let $\mu \in \mathcal{P}_2^r(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_2(\mathbb{R}^d)$. Then, there exists a unique minimizer $\upsilon$ of (3). Besides, there exists a uniquely determined $\mu$-almost everywhere (a.e.) map $T : \mathbb{R}^d \to \mathbb{R}^d$ such that $\upsilon = (I, T) \# \mu$, where $(I, T) : x \mapsto (x, T(x))$. Finally, there exists a convex function $\varphi$ such that $T = \nabla \varphi$ $\mu$-a.e.
As a corollary, $W_2^2(\mu, \nu) = \int \|x - T(x)\|^2 \, d\mu(x) = \|I - T\|_\mu^2$.
If $\varphi$ is convex, then $\nabla \varphi$ is well defined $\mathrm{Leb}$-a.e., and if $\mu \in \mathcal{P}_2^r(\mathbb{R}^d)$, then $\nabla \varphi$ is well defined $\mu$-a.e.
If $\nu \in \mathcal{P}_2^r(\mathbb{R}^d)$ as well, then $T$ is invertible $\mu$-a.e., with $T^{-1} \circ T = I$ $\mu$-a.e. and $T \circ T^{-1} = I$ $\nu$-a.e.
Under the assumptions of Theorem 2.1, the map $T$ is called the optimal transport (OT) map from $\mu$ to $\nu$. In this paper, as is common in the literature, we may refer to the space $\mathcal{P}_2(\mathbb{R}^d)$ of probability distributions equipped with the $2$-Wasserstein distance as the Wasserstein space.
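On the real line, the OT map is particularly simple: it is the monotone (quantile) rearrangement, which is indeed the gradient of a convex function. A minimal sketch for uniform discrete measures (a toy setup of ours, not from the paper):

```python
import numpy as np

def w2_1d_uniform(x, y):
    """W2 between two uniform discrete measures on the real line.

    Sorting realizes the monotone rearrangement: the i-th smallest atom of x
    is transported onto the i-th smallest atom of y.
    """
    xs, ys = np.sort(np.asarray(x, dtype=float)), np.sort(np.asarray(y, dtype=float))
    return float(np.sqrt(np.mean((xs - ys) ** 2)))
```

Translating a measure by a constant $c$ gives $W_2 = |c|$, e.g. `w2_1d_uniform([3, 0, 1, 2], [5, 6, 7, 8])` equals $5$.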
2.3 Review of Gradient Flows and their discretizations
2.3.1 In a Euclidean space
Assume that $\mathsf{X}$ is a Euclidean space, consider a proper lower semi-continuous function $f : \mathsf{X} \to (-\infty, +\infty]$ and denote by $D(f)$ its domain. We assume that $f$ is convex, i.e., for every $x, y \in \mathsf{X}$ and for every $t \in [0, 1]$, we have:
$$f(t x + (1 - t) y) \leq t f(x) + (1 - t) f(y).$$
Given $x \in D(f)$, $g \in \mathsf{X}$ is a subgradient of $f$ at $x$ if for every $y \in \mathsf{X}$,
$$f(y) \geq f(x) + \langle g, y - x \rangle.$$
The (possibly empty) set of subgradients of $f$ at $x$ is denoted $\partial f(x)$, and the map $x \mapsto \partial f(x)$ is called the subdifferential. If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$, where $\nabla f(x)$ is the gradient of $f$ at $x$. The subdifferential of the convex function $f$ allows one to define the gradient flow of $f$: for every initial condition $x_0 \in D(f)$, there exists a unique absolutely continuous function $x : \mathbb{R}_+ \to \mathsf{X}$ solution to the differential inclusion [10, Th. 3.1], [26, Th. 2.7]:
$$\dot{x}(t) \in -\partial f(x(t)), \qquad x(0) = x_0. \tag{6}$$
One can check that the gradient flow of $f$ is also characterized by the following system of Evolution Variational Inequalities (EVI):
$$\frac{1}{2} \frac{d}{dt} \|x(t) - y\|^2 \leq f(y) - f(x(t)), \quad \text{for every } y \in D(f).$$
In contrast to (6), the latter characterization allows one to define the gradient flow without using the notion of subdifferential, a property that is practical in nonsmooth settings. Moreover, the non-asymptotic analysis of discretized gradient flows in the optimization literature often relies on discrete versions of the EVI.
Given a step size $\gamma > 0$ and an initial point $x_0$, the proximal scheme is written
$$x_{k+1} = \operatorname{prox}_{\gamma f}(x_k) := \operatorname{argmin}_{y \in \mathsf{X}} \; f(y) + \frac{1}{2\gamma} \|y - x_k\|^2, \tag{8}$$
which corresponds to the proximal point algorithm to minimize the function $f$, see [22]. The proximal scheme can be seen as a Backward Euler discretization of the gradient flow. Indeed, writing the first order conditions of (8), we have
$$\frac{x_{k+1} - x_k}{\gamma} \in -\partial f(x_{k+1}).$$
Hence, each iteration of the proximal scheme requires solving an inclusion, which can be intractable in many cases. The Forward Euler scheme is a more tractable integrator of the gradient flow of $f$, but it is less stable and requires the differentiability of $f$. Under this assumption, this scheme is written
$$x_{k+1} = x_k - \gamma \nabla f(x_k),$$
which corresponds to the well-known gradient descent algorithm to minimize the function $f$. Consider now the case where the function $f$ can be decomposed as $f = g + h$, where $g$ is convex and smooth and $h$ is convex and nonsmooth. To integrate the gradient flow of $f$, another approach is to use the Forward and the Backward Euler schemes for the smooth term and the nonsmooth term respectively. This approach is also motivated by the fact that in many situations, the function $h$ is simple enough to implement its proximity operator $\operatorname{prox}_{\gamma h}$. With this decomposition, the Forward Backward Euler scheme is written
$$x_{k+1/2} = x_k - \gamma \nabla g(x_k), \qquad x_{k+1} = \operatorname{prox}_{\gamma h}(x_{k+1/2}).$$
Recalling the definition of the proximity operator, this scheme can be rewritten as
$$x_{k+1} = \operatorname{prox}_{\gamma h}\big( x_k - \gamma \nabla g(x_k) \big),$$
which corresponds to the proximal gradient algorithm to minimize the composite function $f = g + h$.
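As a Euclidean illustration, take $g(x) = \tfrac{1}{2}\|Ax - y\|^2$ (smooth) and $h = \lambda \|\cdot\|_1$, whose proximity operator is the soft-thresholding map. The names below are ours; this is a generic sketch of the proximal gradient iteration, not a formula from the paper:

```python
import numpy as np

def soft_threshold(x, t):
    """Proximity operator of t * ||.||_1 (coordinatewise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gradient(A, y, lam, step, n_iters):
    """Forward-Backward iterations for 0.5 * ||Ax - y||^2 + lam * ||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x_half = x - step * A.T @ (A @ x - y)    # forward (explicit) step on the smooth part
        x = soft_threshold(x_half, step * lam)   # backward (proximal) step on the nonsmooth part
    return x

# With A = I the minimizer is soft_threshold(y, lam) = (2, 0) for y = (3, 0.1), lam = 1.
x_star = proximal_gradient(np.eye(2), np.array([3.0, 0.1]), lam=1.0, step=0.5, n_iters=200)
```

The step size must satisfy $\gamma \leq 1/L$ where $L$ is the smoothness constant of $g$ (here $L = 1$).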
2.3.2 In the Wasserstein space
Consider a proper lower semi-continuous functional $\mathcal{F} : \mathcal{P}_2(\mathbb{R}^d) \to (-\infty, +\infty]$ and denote by $D(\mathcal{F})$ its domain. We assume that $\mathcal{F}$ is convex along the generalized geodesics defined by the 2-Wasserstein distance [1, Chap. 9], i.e., for every $\sigma \in \mathcal{P}_2^r(\mathbb{R}^d)$, every $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$ and for every $t \in [0, 1]$,
$$\mathcal{F}\big( ((1 - t) T_\sigma^\mu + t T_\sigma^\nu) \# \sigma \big) \leq (1 - t) \mathcal{F}(\mu) + t \mathcal{F}(\nu),$$
where $T_\sigma^\mu$ and $T_\sigma^\nu$ are the optimal transport maps from $\sigma$ to $\mu$ and from $\sigma$ to $\nu$ respectively. Given $\mu \in D(\mathcal{F})$, $\xi \in L^2(\mu)$ is a strong Fréchet subgradient of $\mathcal{F}$ at $\mu$ [1, Chap. 10] if for every $T \in L^2(\mu)$,
$$\mathcal{F}(T \# \mu) \geq \mathcal{F}(\mu) + \langle \xi, T - I \rangle_\mu + o(\|T - I\|_\mu).$$
The (possibly empty) set of strong Fréchet subgradients of $\mathcal{F}$ at $\mu$ is denoted $\partial \mathcal{F}(\mu)$. The map $\mu \mapsto \partial \mathcal{F}(\mu)$ is called the strong Fréchet subdifferential. Conveniently, a subdifferential notion similar to the strong Fréchet subdifferential makes it possible to define the gradient flow of the functional $\mathcal{F}$ [1, Chap. 11]. However, in the nonsmooth setting considered in this paper, the characterization of gradient flows through EVI will be more practical. The gradient flow of $\mathcal{F}$ is the solution $(\mu_t)_{t \geq 0}$ of the following system of Evolution Variational Inequalities (EVI) [1, Th. 11.1.4]:
$$\frac{1}{2} \frac{d}{dt} W_2^2(\mu_t, \nu) \leq \mathcal{F}(\nu) - \mathcal{F}(\mu_t), \quad \text{for every } \nu \in D(\mathcal{F}).$$
We shall perform a non-asymptotic analysis of a discretized gradient flow scheme to minimize the functional . Our approach is to prove a discrete EVI for this scheme.
Given a step size $\gamma > 0$ and an initial distribution $\mu_0$, the JKO scheme is written
$$\mu_{k+1} = \operatorname{JKO}_{\gamma \mathcal{F}}(\mu_k) := \operatorname{argmin}_{\nu \in \mathcal{P}_2(\mathbb{R}^d)} \; \mathcal{F}(\nu) + \frac{1}{2\gamma} W_2^2(\nu, \mu_k).$$
The JKO operator can be seen as the analogue of the proximity operator, obtained by replacing the Euclidean distance with the Wasserstein distance. Moreover, the JKO scheme can be seen as a Backward Euler discretization of the gradient flow. More precisely, under some assumptions on the functional $\mathcal{F}$, using [1, Lemma 10.1.2] and denoting by $T$ the optimal transport map from $\mu_{k+1}$ to $\mu_k$ given by Brenier's theorem, there exists a strong Fréchet subgradient of $\mathcal{F}$ at $\mu_{k+1}$, denoted $\nabla_W \mathcal{F}(\mu_{k+1})$, such that
$$\frac{T - I}{\gamma} = \nabla_W \mathcal{F}(\mu_{k+1}).$$
Each iteration of the JKO scheme thus requires the minimization of a functional, which can be intractable in many cases. As previously, the Forward Euler scheme is more tractable and additionally enjoys a simpler geometrical interpretation, see e.g. [31, Section 3.1.3]. Assume $\partial \mathcal{F}(\mu)$ is a singleton $\{\nabla_W \mathcal{F}(\mu)\}$ for any $\mu$ (some examples are given in [1, Sec. 10.4]). The Forward Euler scheme for the gradient flow of $\mathcal{F}$ is written:
$$\mu_{k+1} = \big( I - \gamma \nabla_W \mathcal{F}(\mu_k) \big) \# \mu_k, \tag{14}$$
and corresponds to the iterations of the gradient descent algorithm over the Wasserstein space to minimize $\mathcal{F}$. Although the Wasserstein space is not a Riemannian manifold, it can still be endowed with a formal Riemannian structure and interpretation [23, 25]. In particular, the Forward Euler scheme can be seen as a Riemannian gradient descent where the exponential map at $\mu$ is the map $v \mapsto (I + v) \# \mu$ defined on $L^2(\mu)$.
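On particles, the Forward Euler scheme is a pointwise update: sample $\mu_0$ by particles and push each one through $I - \gamma \nabla_W \mathcal{F}$. For the potential energy alone (whose Wasserstein gradient is $\nabla V$) this reads as follows; the names are ours, and the collapse of the particles onto the minimizer of $V$ illustrates why a sampling scheme also needs an entropy term:

```python
import numpy as np

def forward_euler_particles(grad_V, particles, step, n_iters):
    """Forward Euler for the potential energy E_V: push particles through I - step*grad_V.

    Updating every particle realizes the pushforward of the empirical measure.
    """
    x = np.asarray(particles, dtype=float)
    for _ in range(n_iters):
        x = x - step * grad_V(x)
    return x

# V(x) = x^2 / 2: all particles contract geometrically toward the minimizer 0.
out = forward_euler_particles(lambda x: x, np.linspace(-2.0, 2.0, 5), step=0.1, n_iters=200)
```

Without the entropy, the scheme minimizes $\mathcal{E}_V$ and the empirical measure degenerates to a Dirac mass, instead of approximating $\pi$.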
3 A Forward Backward Euler scheme for the relative entropy
Given two nonnegative measures $\mu, \nu$ over $\mathbb{R}^d$, the Kullback–Leibler divergence (or relative entropy) of $\mu$ from $\nu$ is defined as:
$$\operatorname{KL}(\mu \mid \nu) = \int \log\!\left( \frac{d\mu}{d\nu}(x) \right) d\mu(x)$$
if $\mu \ll \nu$ with density $\frac{d\mu}{d\nu}$, and $+\infty$ else. Then, for $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, we define the potential energy $\mathcal{E}_V$ and the entropy $\mathcal{H}$ as:
$$\mathcal{E}_V(\mu) = \int V(x) \, d\mu(x), \qquad \mathcal{H}(\mu) = \int \rho(x) \log \rho(x) \, dx$$
if $\mu \ll \mathrm{Leb}$ with density $\rho$, and $\mathcal{H}(\mu) = +\infty$ else. Throughout this paper, we assume the following on the potential function $V$: there exist $0 \leq \lambda \leq L$ such that
A1. $V$ is $L$-smooth, i.e., $V$ is differentiable and $\nabla V$ is $L$-Lipschitz continuous; for all $x, y \in \mathbb{R}^d$:
$$\|\nabla V(x) - \nabla V(y)\| \leq L \|x - y\|.$$
A2. $V$ is $\lambda$-strongly convex (we allow $\lambda = 0$); for all $x, y \in \mathbb{R}^d$:
$$V(y) \geq V(x) + \langle \nabla V(x), y - x \rangle + \frac{\lambda}{2} \|y - x\|^2.$$
We first recall that the KL divergence with respect to $\pi$ can be expressed (up to an additive constant) as the sum of the potential energy and the entropy functionals.
[15, Lemma 1] Let $\pi \propto \exp(-V)$ be the target distribution. Then, $\pi \in \mathcal{P}_2^r(\mathbb{R}^d)$, $\mathcal{E}_V(\pi) < \infty$ and $\mathcal{H}(\pi) < \infty$. Moreover, for every $\mu \in \mathcal{P}_2(\mathbb{R}^d)$ satisfying $\mu \ll \mathrm{Leb}$,
$$\operatorname{KL}(\mu \mid \pi) = \mathcal{E}_V(\mu) + \mathcal{H}(\mu) + \log \int \exp(-V(x)) \, dx.$$
Since the target distribution $\pi$ is a minimizer of the composite functional $\mathcal{F} := \mathcal{E}_V + \mathcal{H}$, we use a Forward Backward Euler scheme to integrate the gradient flow of $\mathcal{F}$. As $\mathcal{H}$ is nonsmooth, see [31, Section 3.1.1], this scheme relies on the implementation of the JKO operator of the entropy $\mathcal{H}$, treated separately in Section 5. The proposed Forward Backward Euler scheme is written, for a step size $\gamma > 0$ and every $n \geq 0$ [31, Section 4.1]:
$$\nu_{n+1} = (I - \gamma \nabla V) \# \mu_n, \tag{20}$$
$$\mu_{n+1} = \operatorname{JKO}_{\gamma \mathcal{H}}(\nu_{n+1}). \tag{21}$$
This scheme can be seen as a proximal gradient algorithm over the Wasserstein space to minimize the composite functional $\mathcal{F} = \mathcal{E}_V + \mathcal{H}$.
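To make the scheme concrete in a case where every step is explicit, consider the 1-D Gaussian target $\pi = \mathcal{N}(0, 1)$, i.e. $V(x) = x^2/2$. The forward step maps $\mathcal{N}(m, s^2)$ to $\mathcal{N}((1-\gamma)m, (1-\gamma)^2 s^2)$, and a standard computation over Gaussians (consistent in spirit with the Gaussian closed form of [31, Section G.1], whose exact constants should be checked there) shows that the JKO of the entropy keeps the mean and maps the standard deviation $\sigma$ to the positive root $\rho$ of $\rho(\rho - \sigma) = \gamma$. A sketch:

```python
import numpy as np

def fb_gaussian(m, s, gamma, n_iters):
    """Forward-Backward scheme for the target N(0,1), tracking (mean, std) of the iterate."""
    for _ in range(n_iters):
        m, s = (1.0 - gamma) * m, (1.0 - gamma) * s   # forward: pushforward by (1 - gamma) * Id
        # backward: JKO of the entropy (assumed Gaussian form) keeps the mean and
        # solves rho * (rho - s) = gamma for the new standard deviation rho > 0.
        s = (s + np.sqrt(s**2 + 4.0 * gamma)) / 2.0
    return m, s

m_n, s_n = fb_gaussian(m=3.0, s=0.5, gamma=0.1, n_iters=300)
```

In this toy computation the iterates converge to $(0, 1)$, i.e. exactly to the target, for any $\gamma \in (0, 1)$: the fixed point equation $s^2 \gamma = \gamma$ forces $s = 1$.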
Remark 1 (Comparison with Forward Euler schemes)
The Forward Euler scheme (14) for $\mathcal{F}$ was considered in [31, Section 3.1.3]. This scheme is notably difficult to analyze because of the nonsmoothness of $\mathcal{H}$ [31, Section 3.1.1]. Moreover, the Stein Variational Gradient Descent (SVGD) algorithm introduced recently [21, 20] can be seen as a Forward Euler scheme under the inner product of a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_k$ dense in $L^2(\mu)$ for every $\mu$. Indeed, SVGD is written
$$\mu_{n+1} = \big( I - \gamma S_{\mu_n} \nabla_W \mathcal{F}(\mu_n) \big) \# \mu_n,$$
where $S_\mu$ is the kernel integral operator defined by $S_\mu f = \int k(x, \cdot) f(x) \, d\mu(x)$, where $k$ is the kernel of $\mathcal{H}_k$. The $\mathcal{H}_k$ inner product can be seen as the inner product induced by $S_\mu$ on $L^2(\mu)$. Indeed, for every $f, g$, $\langle S_\mu f, g \rangle_{\mathcal{H}_k} = \langle f, g \rangle_\mu$, see e.g. [3, Section 2.1]. Moreover, $S_\mu \nabla_W \mathcal{F}(\mu)$ can be seen as the gradient of $\mathcal{F}$ at $\mu$ under the $\mathcal{H}_k$ inner product, since $\langle S_\mu \nabla_W \mathcal{F}(\mu), f \rangle_{\mathcal{H}_k} = \langle \nabla_W \mathcal{F}(\mu), f \rangle_\mu$ for every $f$. Finally, to the best of our knowledge, the non-asymptotic performance of the SVGD algorithm remains unclear.
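A particle sketch of one SVGD update with a Gaussian kernel follows the standard update of [21, 20]; the bandwidth, step size, and particle count below are arbitrary choices of ours:

```python
import numpy as np

def svgd_step(x, grad_log_p, step, h):
    """One SVGD update for 1-D particles x with a Gaussian kernel of bandwidth h.

    phi(x_i) = mean_j [ k(x_j, x_i) * grad_log_p(x_j) + (x_i - x_j)/h^2 * k(x_j, x_i) ],
    i.e. a kernel-smoothed drift toward the target plus a particle repulsion term.
    """
    diff = x[:, None] - x[None, :]          # diff[i, j] = x_i - x_j
    K = np.exp(-diff**2 / (2.0 * h**2))     # symmetric kernel matrix
    phi = (K @ grad_log_p(x) + (diff / h**2 * K).sum(axis=1)) / len(x)
    return x + step * phi

# Target N(0, 1): grad log p(x) = -x.
x = np.linspace(-3.0, 3.0, 50)
for _ in range(500):
    x = svgd_step(x, lambda z: -z, step=0.1, h=1.0)
```

The repulsion term plays, informally, the role the entropy plays in the FB scheme: it prevents the particles from collapsing onto the mode.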
In the next section, we study the non-asymptotic properties of the FB scheme.
4 Non-asymptotic analysis
We consider a fixed step size $\gamma > 0$ and an initial probability distribution $\mu_0 \in \mathcal{P}_2^r(\mathbb{R}^d)$. Our main result (Proposition 4.7) combines two ingredients: the identification of the optimal transport maps between the iterates $\mu_n$, $\nu_{n+1}$ and $\mu_{n+1}$ (see Equations (20) and (21)), and a proof of a discrete EVI for the proximal gradient algorithm.
4.1 Identification of optimal transport maps
Lemmas 4.1 and 4.2 identify the optimal transport maps between consecutive iterates of the Forward Backward Euler scheme, as soon as the step size $\gamma$ is sufficiently small. In particular, Lemma 4.2 is a consequence of [1, Lemma 10.1.2].
Let $\mu \in \mathcal{P}_2^r(\mathbb{R}^d)$ and $\nu = (I - \gamma \nabla V) \# \mu$. Then if $\gamma < 1/L$, the optimal transport map from $\mu$ to $\nu$ corresponds to $I - \gamma \nabla V$.
Let $\nu \in \mathcal{P}_2(\mathbb{R}^d)$ and $\mu = \operatorname{JKO}_{\gamma \mathcal{H}}(\nu)$. If $\gamma > 0$, then $\mu \in \mathcal{P}_2^r(\mathbb{R}^d)$ and the optimal transport map $T$ from $\mu$ to $\nu$ satisfies $(T - I)/\gamma \in \partial \mathcal{H}(\mu)$. In other words, there exists a strong Fréchet subgradient of $\mathcal{H}$ at $\mu$, denoted $\nabla_W \mathcal{H}(\mu)$, such that
$$T = I + \gamma \nabla_W \mathcal{H}(\mu).$$
Using Lemmas 4.1 and 4.2, if $\gamma < 1/L$, then $\mu_n, \nu_n \in \mathcal{P}_2^r(\mathbb{R}^d)$ for every $n$ by induction. This remark allows us to consider optimal transport maps from $\mu_n$ and $\nu_n$ to any element of $\mathcal{P}_2(\mathbb{R}^d)$. The next lemma extends [1, 10.1.1.B] using the generalized geodesic convexity of $\mathcal{H}$ and the strong Fréchet subdifferential of $\mathcal{H}$.
Let $\nu \in \mathcal{P}_2(\mathbb{R}^d)$, and let $T_1$ and $T_2$ be the optimal transport maps from $\mu_{n+1}$ to $\nu$ and from $\mu_{n+1}$ to $\nu_{n+1}$ respectively. If $\gamma < 1/L$, then
4.2 A descent lemma
Theorem 4.4 (Descent)
Assume $\mu_0 \in \mathcal{P}_2^r(\mathbb{R}^d)$, $\gamma \leq 1/L$ and A1. Then for every $n \geq 0$, there exists a strong Fréchet subgradient of $\mathcal{H}$ at $\mu_{n+1}$, denoted $\nabla_W \mathcal{H}(\mu_{n+1})$, such that:
where we use the notation $\nabla_W \mathcal{F}(\mu_{n+1})$ to denote $\nabla V + \nabla_W \mathcal{H}(\mu_{n+1})$.
Hence, the sequence $(\mathcal{F}(\mu_n))_{n \geq 0}$ is decreasing as soon as the step size $\gamma$ is small enough.
4.3 Discrete EVI
To prove a discrete EVI and obtain convergence rates, we need the additional convexity assumption A2 on the potential function $V$. We first prove the following two lemmas.
Assume $\mu_0 \in \mathcal{P}_2^r(\mathbb{R}^d)$ and $\gamma \leq 1/L$. Then for every $n \geq 0$ and every $\nu \in \mathcal{P}_2(\mathbb{R}^d)$, there exists a strong Fréchet subgradient of $\mathcal{H}$ at $\mu_{n+1}$, denoted $\nabla_W \mathcal{H}(\mu_{n+1})$, such that:
Assume $\mu_0 \in \mathcal{P}_2^r(\mathbb{R}^d)$, $\gamma \leq 1/L$, A1 and A2 with $\lambda \geq 0$. Then for every $n \geq 0$ and every $\nu \in \mathcal{P}_2(\mathbb{R}^d)$,
We can now provide a discrete EVI for the functional .
Proposition 4.7 (discrete EVI)
Assume $\mu_0 \in \mathcal{P}_2^r(\mathbb{R}^d)$, $\gamma \leq 1/L$, A1 and A2 with $\lambda \geq 0$. Then for every $n \geq 0$ and every $\nu \in \mathcal{P}_2(\mathbb{R}^d)$, there exists a strong Fréchet subgradient of $\mathcal{H}$ at $\mu_{n+1}$, denoted $\nabla_W \mathcal{H}(\mu_{n+1})$, such that the Forward Backward Euler scheme verifies:
4.4 Convergence rates
When the potential function $V$ is convex, we easily get rates from the discrete EVI provided above. Theorem 4.8 is a direct consequence of Proposition 4.7 obtained by taking $\nu = \pi$, and its corollaries provide rates depending on the strong convexity parameter $\lambda$ of $V$.
Assume $\mu_0 \in \mathcal{P}_2^r(\mathbb{R}^d)$, $\gamma \leq 1/L$, A1 and A2 with $\lambda \geq 0$. Then for every $n \geq 0$,
Corollary 4.9 (Convex case rate)
Under the assumptions of Theorem 4.8, for $\lambda = 0$:
Corollary 4.10 (Strongly convex case rate)
Under the assumptions of Theorem 4.8, if $\lambda > 0$, then for every $n \geq 0$:
Hence, as soon as $V$ is convex, we get sublinear rates in terms of the Kullback–Leibler divergence, while when $V$ is $\lambda$-strongly convex with $\lambda > 0$, we get linear rates in the Wasserstein distance for the iterates of the Forward Backward Euler scheme.
These rates match the rates of the proximal gradient algorithm in the convex and strongly convex cases, replacing the squared Euclidean norm by the squared Wasserstein distance and the objective function by the KL divergence $\operatorname{KL}(\cdot \mid \pi)$. Moreover, these rates are discrete-time analogues of the continuous-time rates obtained in [1, Th. 11.2.1] for the gradient flow of the relative entropy.
The Forward Backward Euler scheme thus enjoys strong theoretical guarantees. However, the implementation of the JKO operator of the entropy functional remains a technical challenge. In the next section, we propose a practical implementation of this operator for discrete measures.
5 JKO operator of the entropy functional
The most demanding step of the FB scheme is the backward step (21). As explained above, the implementation of this step has been investigated in several papers. In this section, we derive an approximate implementation inspired by [13]. First, note that
$$\operatorname{JKO}_{\gamma \mathcal{H}}(\mu) = \operatorname{argmin}_{\nu \in \mathcal{P}_2(\mathbb{R}^d)} \; \mathcal{H}(\nu) + \frac{1}{2\gamma} \inf_{\upsilon \in \Gamma(\mu, \nu)} \int \|x - y\|^2 \, d\upsilon(x, y),$$
where $\Gamma(\mu, \nu)$ and the couplings $\upsilon$ are defined in Section 2.2. Consider the case where the FB scheme is approximated by discretizing the space $\mathbb{R}^d$ (by a grid $\{x_1, \ldots, x_N\}$ for example). In this case, all probability distributions are approximated by discrete measures (supported by the nodes of the grid). From (5), we can obtain a discrete formulation of the JKO of the entropy as follows. Suppose that $\mu$ and $\nu$ are discrete, i.e., of the form $\mu = \sum_{i=1}^N a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^N b_j \delta_{x_j}$, where the coefficients $(a_i)$ are known but not the $(b_j)$. Then, a discrete reformulation of (5) is
$$\min_{P \in \mathbb{R}_+^{N \times N}} \; \frac{1}{2\gamma} \sum_{i,j} C_{ij} P_{ij} + \sum_j b_j \log b_j \quad \text{s.t.} \quad \sum_j P_{ij} = a_i, \quad b_j = \sum_i P_{ij}, \tag{26}$$
where $C$ is the cost matrix, $C_{ij} = \|x_i - x_j\|^2$ for every $i, j$. Moreover, the discretized $\operatorname{JKO}_{\gamma \mathcal{H}}(\mu)$ is $\sum_j b_j^\star \delta_{x_j}$, where $b^\star$ is the solution to (26). The discrete distribution $b^\star$ can be computed in closed form. Introduce the Lagrange multiplier $\lambda \in \mathbb{R}^N$ associated to the marginal constraint of (26). The Lagrangian reads:
Now fix $i$ and $j$. Noting that $b_j = \sum_i P_{ij}$, the first order optimality condition results in
Summing Equation (28) over all indices $j$, we get successively
Let $i$ be fixed. Since the normalizing term does not depend on $j$,
Moreover, as $b^\star$ is a probability measure, $\sum_j b_j^\star = 1$. Therefore, the multiplier $\lambda$ can be rewritten as
Finally, the discretized $\operatorname{JKO}_{\gamma \mathcal{H}}(\mu)$ is the probability measure $\sum_j b_j^\star \delta_{x_j}$ such that
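To illustrate, here is a sketch of the resulting discrete JKO under the assumption that the closed form takes a Gibbs-kernel shape, $b_j = \sum_i a_i K_{ij} / \sum_k K_{ik}$ with $K = \exp(-C/(2\gamma))$, i.e. each atom spreads its mass through a row-normalized Gaussian kernel. This assumed form and all names are ours; the exact constants should be checked against the derivation above.

```python
import numpy as np

def jko_entropy_discrete(a, grid, gamma):
    """Hypothetical closed-form discrete JKO of the entropy on a 1-D grid.

    Assumed form: atom a_i spreads through the row-normalized Gibbs kernel
    K_ij = exp(-|x_i - x_j|^2 / (2 * gamma)).  Returns a probability vector b.
    """
    C = (grid[:, None] - grid[None, :]) ** 2             # cost matrix C_ij = |x_i - x_j|^2
    K = np.exp(-C / (2.0 * gamma))
    P = a[:, None] * K / K.sum(axis=1, keepdims=True)    # coupling: row i carries mass a_i
    return P.sum(axis=0)                                  # second marginal b

grid = np.arange(5.0)
b = jko_entropy_discrete(np.array([1.0, 0.0, 0.0, 0.0, 0.0]), grid, gamma=1.0)
```

As expected of an entropy step, a point mass is smoothed out: the output `b` is a probability vector spreading mass away from the original atom.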
We provided a non-asymptotic analysis of a Forward Backward Euler scheme to sample from $\pi \propto \exp(-V)$. The FB scheme can be seen as a proximal gradient algorithm in the Wasserstein space and enjoys the same convergence rates. Therefore, in terms of number of iterations, the FB scheme might be faster than its alternatives, like the Langevin algorithm. However, the FB scheme is challenging to implement in practice, and its iteration complexity is higher than that of the Langevin algorithm. Indeed, the implementation of the FB scheme relies on the computation of the JKO operator of the entropy in the Wasserstein space. Therefore, we studied the JKO operator of the entropy between discrete measures and proved a closed form formula.
The entropy is the regularizing term in Problem (1) for every sampling problem formulated as the minimization of $\operatorname{KL}(\cdot \mid \pi)$. Therefore, we stress that the JKO operator of the entropy deserves further investigation. Note that our non-asymptotic analysis of the FB scheme could have been performed for any functional convex along generalized geodesics and satisfying conditions similar to those of the entropy, see [1, Eq. 10.1.1a, 10.1.1b]. Finally, in Euclidean spaces, proximal methods go beyond the proximal gradient algorithm, and include for example accelerated, stochastic, variance reduced and primal dual algorithms [12, 29, 5, 17]. The analogues of these methods in the Wasserstein space should lead to improvements and extensions of the FB scheme.
-  Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
-  Luigi Ambrosio, Giuseppe Savaré, and Lorenzo Zambotti. Existence and stability for Fokker–Planck equations with log-concave reference measure. Probability Theory and Related Fields, 145(3-4):517–564, 2009.
-  Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714–751, 2017.
-  Heinz H Bauschke, Patrick L Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011.
-  Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
-  Espen Bernton. Langevin Monte Carlo and JKO splitting. arXiv preprint arXiv:1802.08671, 2018.
-  Pascal Bianchi, Walid Hachem, and Adil Salim. A constant step Forward-Backward algorithm involving random maximal monotone operators. Journal of Convex Analysis, 2019.
-  Silouanos Brazitikos, Apostolos Giannopoulos, Petros Valettas, and Beatrice-Helen Vritsiou. Geometry of isotropic convex bodies, volume 196. American Mathematical Soc., 2014.
-  Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.
-  Haïm Brézis. Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert. North-Holland mathematics studies. Elsevier Science, Burlington, MA, 1973.
-  Shixiang Chen, Shiqian Ma, Anthony Man-Cho So, and Tong Zhang. Proximal gradient method for manifold optimization. arXiv preprint arXiv:1811.00980, 2018.
-  Laurent Condat, Daichi Kitahara, Andrés Contreras, and Akira Hirabayashi. Proximal splitting algorithms: Overrelax them all! arXiv preprint arXiv:1912.00137, 2019.
-  Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
-  Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
-  Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
-  Alain Durmus and Eric Moulines. Sampling from strongly log-concave distributions with the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.
-  Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of SGD: variance reduction, sampling, quantization and coordinate descent. arXiv preprint arXiv:1905.11261, 2019.
-  Wen Huang and Ke Wei. Riemannian proximal gradient methods. arXiv preprint arXiv:1909.06065, 2019.
-  Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
-  Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in neural information processing systems, pages 3115–3123, 2017.
-  Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378–2386, 2016.
-  Bernard Martinet. Brève communication. régularisation d’inéquations variationnelles par approximations successives. Revue française d’informatique et de recherche opérationnelle. Série rouge, 4(R3):154–158, 1970.
-  Robert J McCann. Polar factorization of maps on Riemannian manifolds. Geometric & Functional Analysis GAFA, 11(3):589–608, 2001.
-  Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.
-  Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1-2):101–174, 2001.
-  Juan Peypouquet and Sylvain Sorin. Evolution equations for maximal monotone operators: asymptotic analysis in continuous and discrete time. arXiv preprint arXiv:0905.1270, 2009.
-  Adil Salim, Dmitry Kovalev, and Peter Richtárik. Stochastic proximal langevin algorithm: Potential splitting and nonasymptotic rates. In Advances in Neural Information Processing Systems, pages 6649–6661, 2019.
-  Filippo Santambrogio. Euclidean, metric, and Wasserstein gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154, 2017.
-  Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
-  Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
-  Andre Wibisono. Sampling as optimization in the space of measures: The langevin dynamics as a composite optimization problem. arXiv preprint arXiv:1802.08089, 2018.
Appendix A Proof of Lemma 4.1
The map $I - \gamma \nabla V$ pushes $\mu$ forward to $\nu$. Moreover, denoting $\varphi(x) = \frac{1}{2}\|x\|^2 - \gamma V(x)$, we have $I - \gamma \nabla V = \nabla \varphi$.
By elementary algebra, for any $x, y \in \mathbb{R}^d$ we have:
and from the $L$-smoothness of $V$,
In other words,
Therefore, if $\gamma < 1/L$, then $\varphi$ is convex and, using Brenier's theorem, $\nabla \varphi = I - \gamma \nabla V$ is the optimal transport map from $\mu$ to $\nu$. Moreover, if $\gamma < 1/L$, then $\varphi$ is $(1 - \gamma L)$-strongly convex. In consequence,
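In symbols, the Hessian computation behind this claim reads (a sketch under A1 and A2 for twice differentiable $V$, with $\varphi(x) = \tfrac{1}{2}\|x\|^2 - \gamma V(x)$ as above; the general case uses the first-order inequalities of A1 and A2 instead):

```latex
\nabla^2 \varphi(x) \;=\; I - \gamma \nabla^2 V(x),
\qquad
(1 - \gamma L)\, I \;\preceq\; \nabla^2 \varphi(x) \;\preceq\; (1 - \gamma \lambda)\, I,
```

so $\varphi$ is convex as soon as $\gamma \leq 1/L$, and strongly convex when $\gamma < 1/L$.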
Appendix B Proof of Lemma 4.2
Appendix C Proof of Lemma 4.3
Appendix D Proof of Theorem 4.4
Denote by $T$ the optimal transport map between $\mu_{n+1}$ and $\nu_{n+1}$ and by $\nabla_W \mathcal{H}(\mu_{n+1})$ the strong Fréchet subgradient of the entropy evaluated at $\mu_{n+1}$ defined by Lemma 4.2: $T = I + \gamma \nabla_W \mathcal{H}(\mu_{n+1})$. Since $\mu_{n+1} \in \mathcal{P}_2^r(\mathbb{R}^d)$, such a map exists using Brenier's theorem. We thus have $\mu_{n+1}$-a.e.:
We first bound the entropy term. By taking