Wasserstein Proximal Gradient

We consider the task of sampling from a log-concave probability distribution. This target distribution can be seen as a minimizer of the relative entropy functional defined on the space of probability distributions. The relative entropy can be decomposed as the sum of a functional called the potential energy, assumed to be smooth, and a nonsmooth functional called the entropy. We adopt a Forward Backward (FB) Euler scheme for the discretization of the gradient flow of the relative entropy. This FB algorithm can be seen as a proximal gradient algorithm to minimize the relative entropy over the space of probability measures. Using techniques from convex optimization and optimal transport, we provide a non-asymptotic analysis of the FB algorithm. The convergence rate of the FB algorithm matches the convergence rate of the classical proximal gradient algorithm in Euclidean spaces. The practical implementation of the FB algorithm can be challenging. In practice, the user may choose to discretize the space and work with empirical measures. In this case, we provide a closed form formula for the proximity operator of the entropy.

1 Introduction

The task of sampling with respect to some target distribution π has become a prerequisite for many inference procedures. We consider the case where π admits a density proportional to exp(−V) with respect to the Lebesgue measure over ℝ^d (we shall write π ∝ exp(−V)). The function V, referred to as the potential function, is assumed convex and smooth. Our sampling problem can be approached from an optimization point of view. Indeed, the target distribution π is the solution to the optimization problem, posed over the set P_2(ℝ^d) of probability measures with finite second moment,

min_{μ ∈ P_2(ℝ^d)} KL(μ | π), (1)

where KL(· | π) denotes the Kullback-Leibler divergence, or relative entropy (see [8, Lemma 2.2.1]).

Problem (1) can be solved in continuous time by the gradient flow of the relative entropy [19, 1], a family of probability measures indexed by time and converging to π. In discrete time, some sampling algorithms can be seen as discretization schemes of the gradient flow of the relative entropy [15, 31, 6]. For example, the Langevin algorithm [14, 16] can be seen as an inexact gradient descent algorithm to solve Problem (1) over P_2(ℝ^d). Because of this inexactness, the Langevin algorithm converges at the rate O(1/√n) (where n is the number of iterations), whereas gradient descent in Euclidean spaces converges at the rate O(1/n). However, exact discretization schemes of the gradient flow of the relative entropy exist and are reviewed in [31]. These schemes typically rely on iterative applications of the JKO operator, a map from P_2(ℝ^d) to P_2(ℝ^d). The JKO operator was introduced in [19] to construct the gradient flow as the limit of piecewise constant functions with values in P_2(ℝ^d). Indeed, the JKO operator is analogous to the proximity operator in optimization, a map from ℝ^d to ℝ^d [4]. Many nonsmooth optimization algorithms rely on iterative applications of the proximity operator. Moreover, the proximity operator can also be used to construct continuous time flows as limits of piecewise constant functions, see [26, 7].
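For illustration, here is a minimal particle-based sketch of the (unadjusted) Langevin algorithm mentioned above; the quadratic potential, step size and particle count are arbitrary choices, not taken from the paper.

```python
import numpy as np

def langevin_step(x, grad_V, gamma, rng):
    """One unadjusted Langevin step: an explicit gradient step on the potential V
    plus Gaussian noise that (inexactly) accounts for the entropy term."""
    noise = rng.standard_normal(x.shape)
    return x - gamma * grad_V(x) + np.sqrt(2.0 * gamma) * noise

# Toy example: sample from a standard Gaussian target, i.e. V(x) = ||x||^2 / 2.
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 2))   # 1000 particles in R^2
for _ in range(500):
    x = langevin_step(x, lambda y: y, gamma=0.05, rng=rng)
```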

Our first contribution is to study the Forward Backward (FB) discretization scheme of the gradient flow of the relative entropy [31, Section 4.1] as an optimization algorithm to solve Problem (1). We show convergence rates for the FB scheme similar to those of the proximal gradient algorithm in Euclidean spaces. As in [31], we observe that the relative entropy is the sum of a smooth and a nonsmooth functional, respectively the potential energy and the entropy. The FB scheme alternates a forward (explicit) step on the potential energy and a backward (implicit) step on the entropy. The FB scheme can be seen as an exact proximal gradient algorithm to solve Problem (1) over P_2(ℝ^d) [31, Section 4.1]. More precisely, we provide a non-asymptotic analysis of the FB scheme: we show that it converges at the rate O(1/n), and linearly if the potential V is strongly convex. These rates match the convergence rates of the proximal gradient algorithm in Euclidean spaces. Our analysis relies on the theory of gradient flows in the space of probability measures [1]. The main challenge in the convergence proof is to address the nonconvexity of the entropy functional. Indeed, the entropy functional satisfies a property weaker than convexity, called generalized geodesic convexity. Identifying the optimal transport maps between the iterates of the FB scheme, we exploit this generalized geodesic convexity of the entropy to establish a discrete Evolution Variational Inequality (EVI) for the FB scheme, similarly to the analysis of the Langevin algorithm in [15]. The convergence rates are corollaries of the EVI.

The FB scheme is more challenging to implement than the Langevin algorithm because of the backward (or proximal, or JKO) step. The practical implementation of the JKO operator has been considered in several papers, see [28, Section 4.8]. In the specific case of the JKO of the entropy functional, [31, Section G.1] provides a closed form formula in the Gaussian case.

Our second contribution is to provide a closed form formula for the JKO of the entropy between discrete measures. When the space is discretized, the JKO of the entropy has to be redefined to fit the situation. Using a duality approach similar to the one used for regularized optimal transport [13], we are able to give a closed form formula for the JKO of the entropy.

We finally mention two lines of work using proximal techniques in settings close to, but different from, ours. Proximity operators have been used within the Langevin algorithm to sample from target distributions with composite potentials [15, 6, 31, 27]. Moreover, the Wasserstein space can be interpreted as a Riemannian manifold [25, 23, 30], and proximal gradient algorithms over finite-dimensional Riemannian manifolds have recently been investigated in [11, 18].

The remainder is organized as follows. In Section 2 we provide some background knowledge in optimal transport and gradient flows. In Section 3, we introduce the Forward Backward Euler discretization scheme. This FB scheme is studied as an optimization algorithm in Section 4. We provide non-asymptotic rates for the resolution of (1). Then, a closed form formula for the JKO of the entropy between discrete measures is given in Section 5. Finally, we conclude in Section 6. The convergence proofs are postponed to the appendix.

2 Preliminaries

In this section, we introduce the notations and recall fundamental definitions and properties on optimal transport and gradient flows that will be used throughout the paper.

2.1 Notations

In the sequel, P_2(ℝ^d) is the space of probability measures on ℝ^d with finite second-order moment. Denote by B(ℝ^d) the Borel σ-field over ℝ^d. For any μ ∈ P_2(ℝ^d), L²(μ) is the space of measurable functions f : ℝ^d → ℝ^d such that ∫ ‖f‖² dμ < ∞. Note that the identity map I is an element of L²(μ). For any μ, we denote by ‖·‖_μ and ⟨·,·⟩_μ respectively the norm and the inner product of the space L²(μ). For any measures μ, ν, we write μ ≪ ν if μ is absolutely continuous with respect to ν, and we denote by Leb the Lebesgue measure over ℝ^d. The set of regular distributions (i.e., those absolutely continuous with respect to Leb) of the Wasserstein space is denoted P_2^r(ℝ^d). For maps f, g ∈ L²(μ), the composition of f with g is sometimes denoted f ∘ g.

2.2 Optimal transport

For every measurable map T defined on ℝ^d and for every μ ∈ P_2(ℝ^d), we denote by T#μ the pushforward measure of μ by T, characterized by the transfer lemma:

∫ φ(T(x)) dμ(x) = ∫ φ(y) dT#μ(y) for every bounded measurable φ. (2)

Consider the 2-Wasserstein distance defined for every μ, ν ∈ P_2(ℝ^d) by

W_2²(μ, ν) = inf_{υ ∈ Γ(μ, ν)} ∫ ‖x − y‖² dυ(x, y), (3)

where Γ(μ, ν) is the set of couplings between μ and ν [30], i.e. the set of nonnegative measures υ over ℝ^d × ℝ^d such that p#υ = μ (resp. q#υ = ν), where p (resp. q) denotes the projection onto the first (resp. second) factor.
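For discrete measures, (3) reduces to a finite linear program over couplings. The sketch below is an illustrative NumPy/SciPy computation of W_2², not part of the paper's method; the toy measures at the end are arbitrary.

```python
import numpy as np
from scipy.optimize import linprog

def w2_squared(x, p, y, q):
    """W_2^2 between sum_i p_i delta_{x_i} and sum_j q_j delta_{y_j},
    obtained by minimizing <C, w> over couplings w with marginals p and q."""
    n, m = len(p), len(q)
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # cost c_ij = ||x_i - y_j||^2
    # Marginal constraints: rows of w sum to p, columns of w sum to q.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

x = np.array([[0.0], [1.0]]); p = np.array([0.5, 0.5])
y = np.array([[0.5]]);        q = np.array([1.0])
print(w2_squared(x, p, y, q))  # 0.25
```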

We now recall the well-known Brenier theorem [9],[1, Section 6.2.3].

Theorem 2.1

Let and . Then,

  1. There exists a unique minimizer of (3). Besides, there exists a uniquely determined -almost everywhere (a.e.) map such that where . Finally, there exists a convex function such that -a.e.

  2. As a corollary,

    (4)
  3. If is convex, then is well defined -a.e. and if , then -a.e.

  4. If , then -a.e. and -a.e.

Under the assumptions of Theorem 2.1, the map is called the optimal transport (OT) map from to . In this paper, as is commonly the case in the literature, we may refer to the space of probability distributions equipped with the 2-Wasserstein distance as the Wasserstein space.

2.3 Review of Gradient Flows and their discretizations

2.3.1 In a Euclidean space

Assume that the underlying space is a Euclidean space, consider a proper lower semi-continuous function and denote its domain. We assume that the function is convex, i.e., for every and for every , we have:

(5)

Given , is a subgradient of at if for every ,

The (possibly empty) set of subgradients of at is denoted , and the map is called the subdifferential. If is differentiable at , then where is the gradient of at . The subdifferential of the convex function makes it possible to define the gradient flow of : for every initial condition , there exists a unique absolutely continuous function that solves the differential inclusion [10, Th. 3.1], [26, Th. 2.7]:

(6)

One can check that the gradient flow of is also characterized by the following system of Evolution Variational Inequalities (EVI) :

In contrast to (6), this characterization makes it possible to define the gradient flow without using the notion of subdifferential, a property that can be practical in nonsmooth settings. Moreover, the non-asymptotic analysis of discretized gradient flows in the optimization literature often relies on discrete versions of the EVI.

The existence of gradient flows can also be established as the limit of a proximal scheme [26, Th. 2.14], [7, Th. 5.1] when the step-size tends to zero. Defining the proximity operator of a function as:

(7)

the proximal scheme is written

(8)

which corresponds to the proximal point algorithm to minimize the function , see [22]. The proximal scheme can be seen as a Backward Euler discretization of the gradient flow. Indeed, writing the first order conditions of (8), we have

Hence, each iteration of the proximal scheme requires solving an equation which can be intractable in many cases. The Forward Euler scheme is a more tractable integrator of the gradient flow of , but is less stable and requires the differentiability of . Under this assumption, this scheme is written

(9)

which corresponds to the well-known gradient descent algorithm to minimize the function . Consider now the case where the function can be decomposed as a sum of two terms, one convex and smooth and the other convex and nonsmooth. To integrate the gradient flow of this composite function, another approach is to use the Forward and the Backward Euler schemes for the smooth and the nonsmooth term, respectively [26]. This approach is also motivated by the fact that, in many situations, the nonsmooth term is simple enough for its proximity operator to be implemented. In this case, the Forward Backward Euler scheme is written

(10)

Recalling the definition of the proximity operator, this scheme can be rewritten as

(11)

which corresponds to the proximal gradient algorithm to minimize the composite function .
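As a concrete Euclidean instance of (11), the sketch below runs the proximal gradient algorithm on a LASSO-type objective, with a least-squares smooth term and an l1 nonsmooth term whose proximity operator is soft-thresholding; the data and parameters are illustrative and not taken from the paper.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximity operator of tau * ||.||_1 (the backward step)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def proximal_gradient(A, b, lam, gamma, n_iter=500):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 with the Forward Backward scheme (11)."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                            # forward (explicit) step
        x = soft_threshold(x - gamma * grad, gamma * lam)   # backward (proximal) step
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = A @ np.concatenate([np.ones(3), np.zeros(17)]) + 0.01 * rng.standard_normal(50)
gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # step size <= 1/L, with L = ||A||^2
print(proximal_gradient(A, b, lam=0.1, gamma=gamma)[:5])
```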

2.3.2 In the Wasserstein space

Consider a proper lower semi-continuous functional and denote its domain. We assume that the functional is convex along generalized geodesics defined by the 2-Wasserstein distance [1, Chap. 9], i.e. for every and for every ,

(12)

where and are the optimal transport maps from to and from to respectively. Given , is a strong Fréchet subgradient of at [1, Chap. 10] if for every ,

The (possibly empty) set of strong Fréchet subgradients of at is denoted . The map is called the strong Fréchet subdifferential. Conveniently, a subdifferential notion similar to the strong Fréchet subdifferential makes it possible to define the gradient flow of the functional [1, Chap. 11]. However, in the nonsmooth setting considered in this paper, the characterization of gradient flows through the EVI will be more practical. The gradient flow of the functional is the solution of the following system of Evolution Variational Inequalities (EVI) [1, Th. 11.1.4]:

We shall perform a non-asymptotic analysis of a discretized gradient flow scheme to minimize the functional . Our approach is to prove a discrete EVI for this scheme.

The existence of gradient flows can be established as the limit of a minimizing movement scheme [1, Th. 11.2.1][19]. Defining the JKO operator of as:

(13)

the JKO scheme is written

The JKO operator is the analogue of the proximity operator, with the Euclidean distance replaced by the Wasserstein distance. Moreover, the JKO scheme can be seen as a Backward Euler discretization of the gradient flow. More precisely, under some assumptions on the functional, using [1, Lemma 10.1.2],

Since, using Brenier’s theorem, , there exists a strong Fréchet subgradient of at denoted such that

Each iteration of the JKO scheme thus requires the minimization of a functional over the Wasserstein space, which can be intractable in many cases. As previously, the Forward Euler scheme is more tractable and additionally enjoys a simpler geometric interpretation, see e.g. [31, Section 3.1.3]. Assume that the strong Fréchet subdifferential is a singleton at every point of the domain (some examples are given in [1, Sec. 10.4]). The Forward Euler scheme for the gradient flow of the functional is written:

(14)

and corresponds to the iterations of the gradient descent algorithm over the Wasserstein space to minimize the functional. Although the Wasserstein space is not a Riemannian manifold, it can still be endowed with a Riemannian structure and interpretation [23, 25]. In particular, the Forward Euler scheme can be seen as a Riemannian gradient descent where the exponential map at a measure μ is the map v ↦ (I + v)#μ defined on L²(μ).
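To illustrate (14) in the simplest case, assume the functional is the potential energy μ ↦ ∫ V dμ, whose Wasserstein gradient at μ is ∇V; the Forward Euler step then reduces to an ordinary gradient step applied to each particle of an empirical measure. The sketch below makes this assumption explicit and uses an arbitrary quadratic potential.

```python
import numpy as np

def wasserstein_forward_euler(particles, grad_V, gamma, n_iter):
    """Forward Euler scheme (14) when the functional is the potential energy:
    each step pushes the empirical measure forward by I - gamma * grad V."""
    x = particles.copy()
    for _ in range(n_iter):
        x = x - gamma * grad_V(x)   # pushforward of the empirical measure
    return x

# Example with V(x) = ||x||^2 / 2, whose potential energy is minimized at delta_0.
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 2))
x = wasserstein_forward_euler(x, lambda y: y, gamma=0.1, n_iter=100)
print(np.abs(x).max())   # particles collapse toward the origin
```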

3 A Forward Backward Euler scheme for the relative entropy

Given two nonnegative measures μ, ν over ℝ^d, the Kullback-Leibler divergence (or relative entropy) of μ from ν is defined as:

KL(μ | ν) = ∫ log(dμ/dν(x)) dμ(x)

if μ ≪ ν with density dμ/dν, and +∞ else. Then, for μ ∈ P_2(ℝ^d), we define the potential energy as:

E_V(μ) = ∫ V(x) dμ(x), (15)

and the entropy as:

H(μ) = ∫ log(ρ(x)) ρ(x) dx

if μ ≪ Leb with density ρ, and +∞ else. Throughout this paper, we assume the following on the potential function V: there exist constants L > 0 and λ ≥ 0 such that

  • A1. V is L-smooth, i.e., V is differentiable and ∇V is L-Lipschitz continuous; for all x, y ∈ ℝ^d:

    ‖∇V(x) − ∇V(y)‖ ≤ L ‖x − y‖. (16)
  • A2. V is λ-strongly convex (we allow λ = 0); for all x, y ∈ ℝ^d:

    V(y) ≥ V(x) + ⟨∇V(x), y − x⟩ + (λ/2) ‖x − y‖². (17)

We first recall that the KL divergence can be expressed (up to an additive constant) as the sum of the potential energy and the entropy functionals.

Lemma 3.1

[15, Lemma 1] Let π ∝ exp(−V) be the target distribution. Then π ∈ P_2(ℝ^d), and the functionals E_V and H are well defined on P_2(ℝ^d). Moreover, for every μ ∈ P_2(ℝ^d) satisfying KL(μ | π) < ∞,

KL(μ | π) = E_V(μ) + H(μ) + C, (18)

where C is a constant that does not depend on μ.

Based on Lemma 3.1, Problem (1) can be rewritten as the minimization of the potential energy regularized by the entropy:

min_{μ ∈ P_2(ℝ^d)} E_V(μ) + H(μ). (19)

Since the target distribution π is a minimizer of the composite functional E_V + H, we use a Forward Backward Euler scheme to integrate the gradient flow of this functional. As the entropy H is nonsmooth, see [31, Section 3.1.1], this scheme relies on the implementation of the JKO operator of the entropy, treated separately in Section 5. The proposed Forward Backward Euler scheme is written, for a step-size γ > 0 and every n ≥ 0 [31, Section 4.1],

ν_{n+1} = (I − γ∇V)#μ_n, (20)
μ_{n+1} = JKO_{γH}(ν_{n+1}). (21)

This scheme can be seen as a proximal gradient algorithm over the Wasserstein space to minimize the composite function .
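To make the structure of (20)-(21) concrete on discrete measures, the following sketch performs one FB iteration: the forward step pushes the support points by I − γ∇V, and the backward step delegates the JKO of the entropy to a routine such as the numerical jko_entropy sketched at the end of Section 5. The grid-based setup and the function names are illustrative assumptions, not the paper's reference implementation.

```python
def fb_step(points, weights, grid, grad_V, gamma, jko_entropy):
    """One Forward Backward iteration (20)-(21) on a discrete measure.

    points, weights : NumPy arrays holding the support and weights of mu_n
    grid            : fixed nodes (array) carrying the output of the backward step
    grad_V          : gradient of the potential, applied row-wise
    jko_entropy     : routine computing the discrete JKO of the entropy,
                      e.g. the numerical sketch at the end of Section 5
    """
    # Forward step (20): pushforward of mu_n by I - gamma * grad V
    # (only the support points move; the weights are unchanged).
    pushed = points - gamma * grad_V(points)
    # Backward step (21): JKO of the entropy applied to the pushed measure,
    # with squared-distance costs between its support and the grid nodes.
    cost = ((pushed[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    new_weights = jko_entropy(weights, cost, gamma)
    return grid, new_weights
```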

Remark 1 (Comparison with Forward Euler schemes)

The Forward Euler scheme (14) for the relative entropy was considered in [31, Section 3.1.3]. This scheme is notably difficult to analyze because of the nonsmoothness of the entropy [31, Section 3.1.1]. Moreover, the Stein Variational Gradient Descent (SVGD) algorithm, introduced recently [21, 20], can be seen as a Forward Euler scheme under the inner product of a Reproducing Kernel Hilbert Space (RKHS) assumed dense in L²(μ) for every μ. Indeed, SVGD is written

where the update direction involves the kernel integral operator associated with the kernel of the RKHS. This operator induces an inner product on L²(μ) under which the RKHS inner product can be recovered, see e.g. [3, Section 2.1], and the SVGD update direction can be seen as the gradient of the objective functional at the current measure under this inner product. Finally, to the best of our knowledge, the non-asymptotic performance of the SVGD algorithm remains unclear.
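For completeness, the standard SVGD update of [21] with a Gaussian kernel is sketched below; the kernel bandwidth, step size and target are illustrative choices, and the sketch is only meant to make the forward-Euler-in-RKHS reading above concrete.

```python
import numpy as np

def svgd_step(x, grad_V, gamma, bandwidth=1.0):
    """One SVGD update: x_i <- x_i + gamma * mean_j [ k(x_j, x_i) * (-grad V(x_j))
    + grad_{x_j} k(x_j, x_i) ], with a Gaussian (RBF) kernel."""
    diff = x[:, None, :] - x[None, :, :]                  # x_i - x_j
    sq_dists = (diff ** 2).sum(-1)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))        # k(x_j, x_i)
    attraction = -K @ grad_V(x)                           # kernel-smoothed drift
    # grad_{x_j} k(x_j, x_i) = (x_i - x_j) / bandwidth^2 * k(x_j, x_i)
    repulsion = (K[:, :, None] * diff).sum(axis=1) / bandwidth ** 2
    return x + gamma * (attraction + repulsion) / x.shape[0]

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 2))
for _ in range(300):
    x = svgd_step(x, lambda y: y, gamma=0.1)              # target: standard Gaussian
```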

In the next section, we study the non-asymptotic properties of the FB scheme.

4 Non-asymptotic analysis

We consider a fixed step size and a probability distribution. Our main result (Proposition 4.7) combines two ingredients: the identification of the optimal transport maps between consecutive iterates of the scheme (see Equations (20) and (21)), and a proof of a discrete EVI for the proximal gradient algorithm.

4.1 Identification of optimal transport maps

Lemmas 4.1 and 4.2 identify the optimal transport maps from μ_n to ν_{n+1} and from ν_{n+1} to μ_{n+1} in the Forward Backward Euler scheme, as soon as the step size is sufficiently small. In particular, Lemma 4.2 is a consequence of [1, Lemma 10.1.2].

Lemma 4.1

Let and . Then if , the optimal transport map from to corresponds to

Moreover,

Lemma 4.2

Let . If , then and the optimal transport map from to satisfies . In other words, there exists a strong Fréchet subgradient at denoted such that

(22)

Using Lemmas 4.1 and 4.2, if , then for every by induction. This remark makes it possible to consider optimal transport maps from and to any . The next lemma extends [1, 10.1.1.B] using the generalized geodesic convexity of and the strong Fréchet subdifferential of .

Lemma 4.3

Let , and the optimal transport maps from to and from to respectively. If , then

(23)

Equipped with these preliminary results, we can establish a discrete EVI for the FB scheme (20)- (21) in the next subsection.

4.2 A descent lemma

Without using any convexity assumption on , we first obtain a descent lemma. We denote the optimal transport map between and in the Forward Backward Euler scheme (20), (21), and .

Theorem 4.4 (Descent)

Assume , and A1. Then for , there exists a strong Fréchet subgradient at denoted such that:

where we use the notation to denote .

Hence, the sequence is decreasing as soon as the step-size is small enough.

4.3 Discrete EVI

To prove a discrete EVI and obtain convergence rates, we need the additional convexity assumption A2 on the potential function V. We first prove the following two lemmas.

Lemma 4.5

Assume , . Then for and , there exists a strong Fréchet subgradient at denoted such that:

Lemma 4.6

Assume and , A1 and A2 with . Then for , and

We can now provide a discrete EVI for the functional .

Proposition 4.7 (discrete EVI)

Assume , , A1 and A2 with . Then for and , there exists a strong Fréchet subgradient at denoted such that the Forward Backward Euler scheme verifies:

(24)

4.4 Convergence rates

When the potential function is convex, we easily obtain rates from the discrete EVI provided above. Theorem 4.8 is a direct consequence of Proposition 4.7, obtained by taking the comparison measure in the EVI equal to the target distribution π, and its corollaries provide rates depending on the strong convexity parameter of V.

Theorem 4.8

Assume , A1 and A2 with . Then for every ,

Corollary 4.9 (Convex case rate)

Under the assumptions of Theorem 4.8, for :

Corollary 4.10 (Strongly convex case rate)

Under the assumptions of Theorem 4.8, if , then for :

Hence, as soon as is convex, we get sublinear rates in terms of the Kullback-Leibler divergence, while when is -strongly convex with , we get linear rates in the Wasserstein distance for the iterates of the Forward Backward Euler scheme.

Remark 2

The rates match those of the proximal gradient algorithm in the convex and strongly convex cases, with the squared Euclidean norm replaced by the squared Wasserstein distance and the objective function by the KL divergence [24]. Moreover, these rates are discrete-time analogues of the continuous-time rates obtained in [1, Th. 11.2.1] for the gradient flow of the relative entropy.

The Forward Backward Euler scheme thus enjoys strong theoretical guarantees. However, the implementation of the JKO operator of the entropy functional remains a technical challenge. In the next section, we propose a practical implementation of this operator for discrete measures.

5 JKO operator of the entropy functional

The most demanding step of the FB scheme is the backward step (21). As explained above, the implementation of this step has been investigated in several papers. In this section, we derive an approximate implementation inspired by [13]. First, note that

(25)

where the couplings and projections are those defined in Section 2.2. Consider the case where the FB scheme is approximated by discretizing the space (by a grid, for example). In this case, all probability distributions are approximated by discrete measures (supported on the nodes of the grid). From (25), we can obtain a discrete formulation of the JKO of the entropy as follows. Suppose that the input and output measures are discrete, i.e., of the form ν = Σ_i p_i δ_{y_i} and μ = Σ_j q_j δ_{x_j}, where the coefficients p_i are known but not the q_j. Then, a discrete reformulation of (25) is

(26)

where c denotes the cost matrix between the support points of the two measures. Moreover, the discretized JKO of the entropy is obtained from the solution to (26). The discrete distribution can be computed in closed form. Introduce the Lagrange multiplier associated with the constraint of (26). The Lagrangian reads:

(27)

Now fix and . Noting that , the first order optimality condition results in

(28)

Summing Equation (28) for all indices , we get successively

(29)

and

(30)

Let . Since does not depend on ,

(31)

Moreover as is a probability measure, . Therefore, can be rewritten as

(32)

Finally, the discretized is the probability measure such that

(33)
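Since the intermediate equations above are not fully reproduced here, the following sketch does not implement the closed form (33); instead it solves a discrete JKO-of-the-entropy problem of the form suggested by (25) with a generic constrained solver, which can serve as a numerical check. The objective used (transport cost over a coupling plus the entropy of its second marginal, with the first marginal fixed) is an assumption on the exact form of (26).

```python
import numpy as np
from scipy.optimize import minimize

def jko_entropy(p, cost, gamma, eps=1e-12):
    """Numerically solve a discrete JKO step for the entropy:
        min_{w >= 0, sum_j w_ij = p_i}  (1/(2*gamma)) * sum_ij cost_ij * w_ij
                                        + sum_j q_j log q_j,  with q_j = sum_i w_ij.
    Returns the weights q of the output measure supported on the grid nodes.
    NOTE: generic solver for illustration only; the paper derives a closed form."""
    n, m = cost.shape

    def objective(w_flat):
        w = w_flat.reshape(n, m)
        q = w.sum(axis=0)
        return (cost * w).sum() / (2.0 * gamma) + (q * np.log(q + eps)).sum()

    # Feasible, strictly positive starting point: spread each p_i uniformly.
    w0 = np.outer(p, np.ones(m) / m).ravel()
    constraints = [{"type": "eq",
                    "fun": lambda w_flat, i=i: w_flat.reshape(n, m)[i].sum() - p[i]}
                   for i in range(n)]
    res = minimize(objective, w0, bounds=[(0.0, None)] * (n * m),
                   constraints=constraints, method="SLSQP")
    return res.x.reshape(n, m).sum(axis=0)

# Toy example: two input atoms, three grid nodes.
p = np.array([0.5, 0.5])
cost = np.array([[0.0, 1.0, 4.0],
                 [4.0, 1.0, 0.0]])
print(jko_entropy(p, cost, gamma=0.5))   # weights of the discretized JKO output
```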

6 Conclusion

We provided a non-asymptotic analysis of a Forward Backward Euler scheme to sample from π ∝ exp(−V). The FB scheme can be seen as a proximal gradient algorithm in the Wasserstein space and enjoys the same convergence rates. Therefore, in terms of number of iterations, the FB scheme might be faster than its alternatives, such as the Langevin algorithm. However, the FB scheme is challenging to implement in practice, and its per-iteration complexity is higher than that of the Langevin algorithm. Indeed, the implementation of the FB scheme relies on the computation of the JKO operator of the entropy in the Wasserstein space. For this reason, we studied the JKO operator of the entropy between discrete measures and proved a closed form formula.

The entropy appears as the regularization term in Problem (1) for every sampling problem of the form considered here. Therefore, we stress that the JKO operator of the entropy deserves further investigation. Note that our non-asymptotic analysis of the FB scheme could have been carried out for any functional that is convex along generalized geodesics and satisfies conditions similar to those of the entropy, see [1, Eq. 10.1.1a, 10.1.1b]. Finally, in Euclidean spaces, proximal methods go beyond the proximal gradient algorithm and include, for example, accelerated, stochastic, variance-reduced and primal-dual algorithms [12, 29, 5, 17]. The analogues of these methods in the Wasserstein space should lead to improvements and extensions of the FB scheme.

References

Appendix A Proof of Lemma 4.1

The map is a pushforward from to . Moreover, denoting , .

By elementary algebra, for any we have:

(34)

and from the smoothness of ,

(35)

Therefore, combining (34) and (35) multiplied by gives:

(36)

In other words,

(37)

Therefore, if , then is convex and using Brenier’s theorem. Moreover, if then is -strongly convex. In consequence,

Therefore, is injective. Furthermore, using the strong convexity of and [1, Lemma 5.5.3] (see also [1, Th. 6.2.3, Th. 6.2.7]), .

Appendix B Proof of Lemma 4.2

Using [2, Prop. 4.2], is a singleton . Since [1, Lemma 10.1.2] implies and is a strong subgradient of at .

Appendix C Proof of Lemma 4.3

Since , for every ,

Applying the last inequality to and using the transfer lemma (2), we have

and

(38)

Using the generalized geodesic convexity of the entropy (see [1, Chap. 9] and Equation (12)),

Plugging the last inequality into (38),

(39)

We get the conclusion by letting .

Appendix D Proof of Theorem 4.4

Denote the optimal transport map between and and the strong Fréchet subgradient of the entropy evaluated at defined by Lemma 4.2: . Since , using Brenier’s theorem. We thus have -a.e.:

(40)

We first bound the entropy term. By taking