SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence

06/03/2020 · by Sinho Chewi, et al. · MIT

Stein Variational Gradient Descent (SVGD), a popular sampling algorithm, is often described as the kernelized gradient flow for the Kullback-Leibler divergence in the geometry of optimal transport. We introduce a new perspective on SVGD that instead views SVGD as the (kernelized) gradient flow of the chi-squared divergence which, we show, exhibits a strong form of uniform exponential ergodicity under conditions as weak as a Poincaré inequality. This perspective leads us to propose an alternative to SVGD, called Laplacian Adjusted Wasserstein Gradient Descent (LAWGD), that can be implemented from the spectral decomposition of the Laplacian operator associated with the target density. We show that LAWGD exhibits strong convergence guarantees and good practical performance.



1 Introduction

The seminal paper of Jordan, Kinderlehrer, and Otto [jordan1998variational] has profoundly reshaped our understanding of sampling algorithms. What is now commonly known as the JKO scheme interprets the evolution of the marginal distributions of a Langevin diffusion as a gradient flow of the Kullback-Leibler (KL) divergence over the Wasserstein space of probability measures. This optimization perspective on Markov Chain Monte Carlo (MCMC) has not only renewed our understanding of algorithms based on Langevin diffusions [Dal17, Ber18, cheng2018langevin, Wib18, durmus2019lmcconvex, vempala2019langevin], but has also fueled the discovery of new MCMC algorithms inspired by the diverse and powerful optimization toolbox [martin2012newtonmcmc, simsekli2016quasinewtonlangevin, cheng2017underdamped, Ber18, hsieh2018mirrored, Wib18, ma2019there, Wib19prox, chewi2020exponential, DalRiou2020, zhang2020mirror].

The Unadjusted Langevin Algorithm (ULA) [dalalyan2017theoretical, durmus2017nonasymptoticlangevin] is the most common discretization of the Wasserstein gradient flow for the KL divergence, but it is unclear whether it is the most effective one. In fact, ULA is asymptotically biased, which results in slow convergence and often requires ad-hoc adjustments [dwivedi2019log]. To overcome this limitation, various methods that track the Wasserstein gradient flow more closely have been recently developed [Ber18, Wib18, salim2020proximal].

Let $\mathcal{F}$ denote a functional over the Wasserstein space of distributions. The Wasserstein gradient flow of $\mathcal{F}$ may be described as the deterministic and time-inhomogeneous Markov process $(X_t)_{t \ge 0}$ started at a random variable $X_0 \sim \mu_0$ and evolving according to $\dot X_t = -\nabla_{W_2}\mathcal{F}(\mu_t)(X_t)$, where $\mu_t$ denotes the distribution of $X_t$. Here $\nabla_{W_2}\mathcal{F}(\mu_t)$ is the Wasserstein gradient of $\mathcal{F}$ at $\mu_t$. If $\mathcal{F} = \mathsf{KL}(\cdot \,\|\, \pi)$, where $\pi \propto e^{-V}$ is a given target distribution on $\mathbb{R}^d$, it is known [ambrosio2008gradient, villani2009ot, santambrogio2017euclidean] that $\nabla_{W_2}\mathsf{KL}(\mu \,\|\, \pi) = \nabla \ln \frac{\mu}{\pi}$. Therefore, a natural discretization of the Wasserstein gradient flow with step size $h > 0$, albeit one that cannot be implemented since it depends on the distribution $\mu_{kh}$ of $X_{kh}$, is:

$$X_{(k+1)h} = X_{kh} - h \, \nabla \ln \frac{\mu_{kh}}{\pi}(X_{kh}), \qquad k = 0, 1, 2, \dots$$

While $\mu_{kh}$ can, in principle, be estimated by evolving a large number of particles $X_{kh}[1], \dots, X_{kh}[N]$, estimation of the score $\nabla \ln \mu_{kh}$ is hindered by the curse of dimensionality, and this approach still faces significant computational challenges despite attempts to improve the original JKO scheme [salim2020proximal, wang2020informationnewton].

A major advance in this direction was achieved by allowing for approximate Wasserstein gradients. More specifically, Stein Variational Gradient Descent (SVGD), recently proposed by [liu2016stein] (see Section 2 for more details), consists in replacing $\nabla \ln \frac{\mu_t}{\pi}$ by its image under the integral operator $\mathcal{K}_{\mu_t}$ associated to a chosen kernel $K$ and defined by $\mathcal{K}_\mu f := \int K(\cdot, y)\, f(y)\, \mathrm{d}\mu(y)$ for $f \in L^2(\mu)$. This leads to the following process:

$$\dot X_t = -\mathcal{K}_{\mu_t} \nabla \ln \frac{\mu_t}{\pi}(X_t), \tag{SVGD}$$

where we apply the integral operator individually to each coordinate of the Wasserstein gradient. In turn, this kernelization trick overcomes most of the above computational bottleneck. Building on this perspective, [duncan2019geometrysvgd] introduced a new geometry, different from the Wasserstein geometry and which they call the Stein geometry, in which (SVGD) becomes the gradient flow of the KL divergence. However, despite this recent advance, the theoretical properties of SVGD as a sampling algorithm, as well as guidelines for the choice of the kernel $K$, are still largely unexplored.

In this work, we revisit the above view of SVGD as a kernelized gradient flow of the KL divergence over Wasserstein space that was put forward in [liu2017stein].

Our contributions. We introduce, in Section 2.3, a new perspective on SVGD by viewing it as a kernelized gradient flow of the chi-squared divergence rather than the KL divergence. This perspective is fruitful in two ways. First, it uses a single integral operator $\mathcal{K}_\pi$ (as opposed to (SVGD), which requires a family of integral operators $\mathcal{K}_{\mu_t}$, $t \ge 0$), providing a conceptually clear guideline for choosing $K$, namely: $K$ should be chosen to make $\mathcal{K}_\pi$ approximately equal to the identity operator. Second, under the idealized choice $\mathcal{K}_\pi = \mathrm{id}$, we show that this gradient flow converges exponentially fast in KL divergence as soon as the target distribution $\pi$ satisfies a Poincaré inequality. In fact, our results are stronger than exponential convergence and they highlight strong uniform ergodicity: the gradient flow forgets the initial distribution after a finite time that is at most half of the Poincaré constant. To establish this exponential convergence under a relatively weak condition (a Poincaré inequality), we employ the following technique. While the gradient flow aims at minimizing the chi-squared divergence by following the curve of steepest descent in Wasserstein space, we do not track its progress with the objective function itself, the chi-squared divergence; instead, we track it with the KL divergence. This is in a sense dual to the argument employed in [chewi2020exponential], where the chi-squared divergence is used to track the progress of a gradient flow on the KL divergence. A more standard analysis relying on Łojasiewicz inequalities also yields rates of convergence in chi-squared divergence under stronger assumptions, such as a log-Sobolev inequality or log-concavity. These results establish the first finite-time theoretical guarantees for SVGD in an idealized setting.

Figure 1: Sampling from a mixture of two 2D Gaussians. See Appendix C.

Beyond providing a better understanding of SVGD, our novel perspective is instrumental in the development of a new sampling algorithm, which we call Laplacian Adjusted Wasserstein Gradient Descent (LAWGD) and present in Section 4. Although LAWGD is challenging to implement in high dimensions, we show that it possesses a striking theoretical property: assuming that the target distribution $\pi$ satisfies a Poincaré inequality, LAWGD converges exponentially fast, with no dependence on the Poincaré constant. This scale invariance has been recently demonstrated for the Newton-Langevin diffusion [chewi2020exponential], but under the additional assumption that $\pi$ is log-concave. A successful implementation of LAWGD hinges on the spectral decomposition of a certain differential operator, which is within reach of modern PDE solvers. As a proof of concept, we show that LAWGD performs well in one or two dimensions using a naïve finite differences method, and we leave the question of applying more sophisticated numerical solvers open for future research.

Related work. Since its introduction in [liu2016stein], a number of variants of SVGD have been considered. They include a stochastic version [li2019stochastic], a version that approximates the Newton direction in Wasserstein space [detommaso2018stein], a version that uses matrix kernels [wang2019stein], an accelerated version [liu2019understanding], and a hybrid with Langevin [zhang2018stochastic]. Several works have studied theoretical properties of SVGD, including its interpretation as a gradient flow under a modified geometry [liu2017stein, duncan2019geometrysvgd], and its asymptotic convergence [lu2019scaling].

Notation. In this paper, all probability measures are assumed to have densities w.r.t. the Lebesgue measure; therefore, we frequently abuse notation by identifying a probability measure with its Lebesgue density. For a differentiable kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we denote by $\nabla_1 K$ (resp. $\nabla_2 K$) the gradient of the kernel w.r.t. its first (resp. second) argument. When describing particle algorithms, we use a subscript to denote the time index and brackets to denote the particle index, i.e., $X_t[i]$ refers to the $i$th particle at time (or iteration number) $t$.

2 SVGD as a kernelized Wasserstein gradient flow

2.1 Wasserstein gradient flows

In this section, we review the theory of gradient flows on the space $\mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ of probability measures that are absolutely continuous w.r.t. the Lebesgue measure and possess a finite second moment, equipped with the 2-Wasserstein metric $W_2$. We refer readers to [villani2003topics, santambrogio2015ot, santambrogio2017euclidean] for introductory treatments of optimal transport, and to [ambrosio2008gradient, villani2009ot] for detailed treatments of Wasserstein gradient flows.

Let $\mathcal{F}$ be a functional defined on Wasserstein space. We say that a curve of probability measures $(\mu_t)_{t \ge 0}$ is a Wasserstein gradient flow for the functional $\mathcal{F}$ if it satisfies

$$\partial_t \mu_t = \operatorname{div}\bigl(\mu_t \nabla_{W_2}\mathcal{F}(\mu_t)\bigr) \tag{2.1}$$

in a weak sense. Here, $\nabla_{W_2}\mathcal{F}(\mu) := \nabla \delta\mathcal{F}(\mu)$ is the Wasserstein gradient of the functional $\mathcal{F}$ at $\mu$, where $\delta\mathcal{F}(\mu)$ is the first variation of $\mathcal{F}$ at $\mu$, defined by

$$\lim_{\varepsilon \searrow 0} \frac{\mathcal{F}(\mu + \varepsilon\nu) - \mathcal{F}(\mu)}{\varepsilon} = \int \delta\mathcal{F}(\mu)\, \mathrm{d}\nu \qquad \text{for all signed measures } \nu \text{ with } \nu(\mathbb{R}^d) = 0,$$

and $\nabla$ denotes the usual (Euclidean) gradient. Hence, the Wasserstein gradient at each $\mu$ is a map from $\mathbb{R}^d$ to $\mathbb{R}^d$.

Using the continuity equation, we can give an Eulerian interpretation to the evolution equation (2.1) (see [santambrogio2015ot, §4] and [ambrosio2008gradient, §8]). Given a family of vector fields $(v_t)_{t \ge 0}$, let $(X_t)_{t \ge 0}$ be a curve in $\mathbb{R}^d$ with random initial point $X_0 \sim \mu_0$, and such that $(X_t)_{t \ge 0}$ is an integral curve of the vector fields $(v_t)_{t \ge 0}$, that is, $\dot X_t = v_t(X_t)$. If we let $\mu_t$ denote the law of $X_t$, then $(\mu_t)_{t \ge 0}$ evolves according to the continuity equation

$$\partial_t \mu_t + \operatorname{div}(\mu_t v_t) = 0. \tag{2.2}$$

Comparing (2.1) and (2.2), we see that (2.1) describes the evolution of the marginal law of the curve $(X_t)_{t \ge 0}$ with $X_0 \sim \mu_0$ and $v_t = -\nabla_{W_2}\mathcal{F}(\mu_t)$.

Wasserstein calculus provides the following (formal) calculation rule: the Wasserstein gradient flow for the functional $\mathcal{F}$ dissipates $\mathcal{F}$ at the rate $\partial_t \mathcal{F}(\mu_t) = -\int \|\nabla_{W_2}\mathcal{F}(\mu_t)\|^2 \, \mathrm{d}\mu_t$. More generally, for a curve $(\mu_t)_{t \ge 0}$ evolving according to the continuity equation (2.2), the time-derivative of $\mathcal{F}(\mu_t)$ is given by $\partial_t \mathcal{F}(\mu_t) = \int \langle \nabla_{W_2}\mathcal{F}(\mu_t), v_t\rangle \, \mathrm{d}\mu_t$.
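
For completeness, here is a one-line formal derivation of this rule (assuming enough regularity to differentiate under the integral and integrate by parts), using the definition of the first variation and the continuity equation (2.2):

$$\partial_t \mathcal{F}(\mu_t) = \int \delta\mathcal{F}(\mu_t)\, \partial_t \mu_t = -\int \delta\mathcal{F}(\mu_t)\, \operatorname{div}(\mu_t v_t) = \int \langle \nabla\delta\mathcal{F}(\mu_t), v_t\rangle \, \mathrm{d}\mu_t = \int \langle \nabla_{W_2}\mathcal{F}(\mu_t), v_t\rangle \, \mathrm{d}\mu_t;$$

taking $v_t = -\nabla_{W_2}\mathcal{F}(\mu_t)$ recovers the stated dissipation rate.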

In this paper, we are primarily concerned with two functionals: the Kullback-Leibler (KL) divergence $\mathsf{KL}(\mu \,\|\, \pi) := \int \ln\frac{\mu}{\pi}\,\mathrm{d}\mu$, and the chi-squared divergence $\chi^2(\mu \,\|\, \pi) := \int \bigl(\frac{\mu}{\pi} - 1\bigr)^2 \,\mathrm{d}\pi$ (see, e.g., [tsybakov2009nonparametric]). It is a standard exercise [ambrosio2008gradient, santambrogio2015ot] to check that the Wasserstein gradients of these functionals are, respectively,

$$\nabla_{W_2}\mathsf{KL}(\mu \,\|\, \pi) = \nabla\ln\frac{\mu}{\pi}, \qquad \nabla_{W_2}\chi^2(\mu \,\|\, \pi) = 2\nabla\frac{\mu}{\pi}. \tag{2.3}$$

2.2 SVGD as a kernelized gradient flow of the KL divergence

SVGD (throughout this paper, we call SVGD the generalization of the original method of [liu2016stein, liu2017stein] that was introduced in [wang2019stein]) is achieved by replacing the Wasserstein gradient $\nabla\ln\frac{\mu_t}{\pi}$ of the KL divergence with $\mathcal{K}_{\mu_t}\nabla\ln\frac{\mu_t}{\pi}$, leading to the particle evolution equation (SVGD).

Recalling that $\pi \propto e^{-V}$, we get

$$\mathcal{K}_{\mu}\nabla\ln\frac{\mu}{\pi} = \mathbb{E}_{Y\sim\mu}\Bigl[K(\cdot, Y)\,\nabla\ln\frac{\mu}{\pi}(Y)\Bigr] = \mathbb{E}_{Y\sim\mu}\bigl[K(\cdot, Y)\,\nabla V(Y) - \nabla_2 K(\cdot, Y)\bigr], \tag{2.4}$$

where, in the second identity, we used integration by parts. This expression shows that rather than having to estimate the distribution $\mu$, it is sufficient to estimate an expectation over $Y \sim \mu$. This is the key to the computational tractability of SVGD. Indeed, the kernelized gradient flow can be implemented by drawing $N$ particles $X_0[1], \dots, X_0[N] \sim \mu_0$ and following the coupled dynamics

$$\dot X_t[i] = -\frac{1}{N}\sum_{j=1}^N \bigl[K(X_t[i], X_t[j])\,\nabla V(X_t[j]) - \nabla_2 K(X_t[i], X_t[j])\bigr], \qquad i = 1, \dots, N.$$

With this, we simply estimate the expectation with respect to $\mu_t$ by an average over all particles. Discretizing the resulting process in time with step size $h > 0$, we obtain the SVGD algorithm:

$$X_{t+1}[i] = X_t[i] - \frac{h}{N}\sum_{j=1}^N \bigl[K(X_t[i], X_t[j])\,\nabla V(X_t[j]) - \nabla_2 K(X_t[i], X_t[j])\bigr]. \tag{2.5}$$
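
To make the update (2.5) concrete, the following is a minimal NumPy sketch of one SVGD step with a Gaussian (RBF) kernel. It is an illustrative implementation only: the function names, the RBF bandwidth, the step size, and the standard Gaussian example target are our own choices, not prescribed above.

```python
import numpy as np

def svgd_step(X, grad_V, h=0.05, bandwidth=1.0):
    """One SVGD update (2.5) with the RBF kernel K(x, y) = exp(-|x - y|^2 / (2 * bandwidth^2)).

    X: (N, d) array of particles; grad_V: function mapping (N, d) -> (N, d).
    """
    diffs = X[:, None, :] - X[None, :, :]             # (N, N, d), entries X[i] - X[j]
    sq_dists = np.sum(diffs ** 2, axis=-1)            # (N, N)
    K = np.exp(-sq_dists / (2 * bandwidth ** 2))      # kernel matrix K(X[i], X[j])
    # nabla_2 K(X[i], X[j]) = K(X[i], X[j]) * (X[i] - X[j]) / bandwidth^2
    grad2_K = K[:, :, None] * diffs / bandwidth ** 2  # (N, N, d)
    drift = K @ grad_V(X) - grad2_K.sum(axis=1)       # sum_j [K * grad V(X[j]) - nabla_2 K]
    return X - (h / X.shape[0]) * drift

# Example: standard Gaussian target, V(x) = |x|^2 / 2, so grad V(x) = x.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + 3.0                   # particles initialized away from the target
for _ in range(500):
    X = svgd_step(X, grad_V=lambda x: x)
print(X.mean(axis=0), X.std(axis=0))                  # should be roughly [0, 0] and [1, 1]
```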

2.3 SVGD as a kernelized gradient flow of the chi-squared divergence

Recall from Section 2.1 that, by the continuity equation, the particle evolution equation (SVGD) translates into the following PDE that describes the evolution of the distribution $\mu_t$ of $X_t$:

$$\partial_t \mu_t = \operatorname{div}\Bigl(\mu_t\, \mathcal{K}_{\mu_t}\nabla\ln\frac{\mu_t}{\pi}\Bigr).$$

We make the simple observation that

$$\mu_t \nabla\ln\frac{\mu_t}{\pi} = \pi\, \nabla\frac{\mu_t}{\pi}, \qquad \text{so that} \qquad \mathcal{K}_{\mu_t}\nabla\ln\frac{\mu_t}{\pi} = \mathcal{K}_{\pi}\nabla\frac{\mu_t}{\pi}.$$

Thus, the continuous-time dynamics of SVGD can equivalently be expressed as

$$\partial_t \mu_t = \operatorname{div}\Bigl(\mu_t\, \mathcal{K}_{\pi}\nabla\frac{\mu_t}{\pi}\Bigr). \tag{KCSF}$$

To interpret this equation, we recall that the Wasserstein gradient of the chi-squared divergence at $\mu$ is $2\nabla\frac{\mu}{\pi}$ (by (2.3)), so the gradient flow for the chi-squared divergence is

$$\partial_t \mu_t = 2\operatorname{div}\Bigl(\mu_t\, \nabla\frac{\mu_t}{\pi}\Bigr). \tag{CSF}$$

Comparing (KCSF) and (CSF), we see that, up to a factor of $2$, SVGD can be understood as the flow obtained by replacing the gradient of the chi-squared divergence, $\nabla\frac{\mu_t}{\pi}$, by $\mathcal{K}_{\pi}\nabla\frac{\mu_t}{\pi}$.

Although (SVGD) and (KCSF) are equivalent ways of expressing the same dynamics, the formulation (KCSF) presents a significant advantage: it involves a kernel integral operator $\mathcal{K}_{\pi}$ that does not change with time and depends only on the target distribution $\pi$.

3 Chi-squared gradient flow

In this section, we study the idealized case where $\mathcal{K}_\pi$ is taken to be the identity operator. In this case, (KCSF) reduces (up to a rescaling of time by a factor of $2$) to the chi-squared gradient flow (CSF). The existence, uniqueness, and regularity of this flow are studied in [ohta2011generalizedentropies, ohta2013generalizedentropiesii] and [ambrosio2008gradient, Theorem 11.2.1].

The rate of convergence of the gradient flow of the KL divergence is closely related to two functional inequalities: the Poincaré inequality controls the rate of exponential convergence in chi-squared divergence ([pavliotis2014stochastic, Theorem 4.4], [chewi2020exponential]), while a log-Sobolev inequality characterizes the rate of exponential convergence of the KL divergence [bakry2014markov, Theorem 5.2.1]. In this section, we show that these inequalities also guarantee exponential rates of convergence for the chi-squared gradient flow (CSF).

Recall that $\pi$ satisfies a Poincaré inequality with constant $C_P > 0$ if

$$\operatorname{Var}_\pi(f) \le C_P \, \mathbb{E}_\pi\bigl[\|\nabla f\|^2\bigr], \tag{PI}$$

while $\pi$ satisfies a log-Sobolev inequality with constant $C_{\mathrm{LS}} > 0$ if

$$\operatorname{Ent}_\pi(f^2) := \mathbb{E}_\pi\Bigl[f^2 \ln\frac{f^2}{\mathbb{E}_\pi[f^2]}\Bigr] \le 2 C_{\mathrm{LS}} \, \mathbb{E}_\pi\bigl[\|\nabla f\|^2\bigr], \tag{LSI}$$

for all locally Lipschitz $f$ for which $\mathbb{E}_\pi[f^2] < \infty$.

We briefly review some facts regarding the strength of these assumptions. It is standard that the log-Sobolev inequality is stronger than the Poincaré inequality: (LSI) implies (PI) with constant $C_P \le C_{\mathrm{LS}}$. In turn, if $\pi \propto e^{-V}$ is $\alpha$-strongly log-concave, i.e. $\nabla^2 V \succeq \alpha I_d$ for some $\alpha > 0$, then (LSI) holds with $C_{\mathrm{LS}} \le 1/\alpha$, and thus a Poincaré inequality holds too. However, a Poincaré inequality is in general much weaker than strong log-concavity. For instance, if $\sigma^2$ denotes the largest eigenvalue of the covariance matrix of $\pi$, then it is currently known that $\pi$ satisfies a Poincaré inequality as soon as it is log-concave, with $C_P \le \psi_d\, \sigma^2$, where $\psi_d$ is a dimensional constant [bobkovIsoperimetricAnalyticInequalities1999, alonso2015kls, leevempala2017poincare], and the well-known Kannan-Lovász-Simonovits (KLS) conjecture [kls1995] asserts that $\psi_d$ does not actually depend on the dimension.

Our first result shows that a Poincaré inequality suffices to establish exponential decay of the KL divergence along (CSF). In fact, we establish a remarkable property, which we call strong uniform ergodicity: under a Poincaré inequality, (CSF) forgets its initial distribution after a time of no more than $C_P/2$. Uniform ergodicity is central in the theory of Markov processes [MeyTwe09, Ch. 16] but is often limited to compact state spaces. Moreover, this theory largely focuses on total variation, for which the distance from the initial distribution to the target distribution is trivially bounded by $1$.

Theorem 1.

Assume that $\pi$ satisfies a Poincaré inequality (PI) with constant $C_P$ and let $\mu_t$ denote the law of (CSF) at time $t$. Assume that $\mathsf{KL}(\mu_0 \,\|\, \pi) < \infty$. Then,

$$\mathsf{KL}(\mu_t \,\|\, \pi) \le \mathsf{KL}(\mu_0 \,\|\, \pi)\, e^{-2t/C_P}. \tag{3.1}$$

In fact, a stronger convergence result holds: for all $t \ge \frac{C_P \ln 2}{2}$,

$$\mathsf{KL}(\mu_t \,\|\, \pi) \le \bigl(\mathsf{KL}(\mu_0 \,\|\, \pi) \wedge 2\bigr)\, e^{-2t/C_P}. \tag{3.2}$$
Proof.

Given the Wasserstein gradients (2.3) and the dissipation rule of Section 2.1, we get that, along (CSF), the KL divergence satisfies

$$\partial_t \mathsf{KL}(\mu_t \,\|\, \pi) = -2\int \Bigl\langle \nabla\ln\frac{\mu_t}{\pi},\, \nabla\frac{\mu_t}{\pi}\Bigr\rangle \, \mathrm{d}\mu_t = -2\, \mathbb{E}_\pi\Bigl[\Bigl\|\nabla\frac{\mu_t}{\pi}\Bigr\|^2\Bigr].$$

Applying the Poincaré inequality (PI) with $f = \frac{\mu_t}{\pi}$, we get

$$\partial_t \mathsf{KL}(\mu_t \,\|\, \pi) \le -\frac{2}{C_P}\operatorname{Var}_\pi\Bigl(\frac{\mu_t}{\pi}\Bigr) = -\frac{2}{C_P}\, \chi^2(\mu_t \,\|\, \pi) \le -\frac{2}{C_P}\, \mathsf{KL}(\mu_t \,\|\, \pi),$$

where, in the last inequality, we use the fact that $\mathsf{KL} \le \chi^2$ (see [tsybakov2009nonparametric, §2.4]). The bound (3.1) follows by applying Grönwall's inequality.

To prove (3.2), we use the stronger inequality $\mathsf{KL} \le \ln(1 + \chi^2)$ (see [tsybakov2009nonparametric, §2.4]). Our differential inequality now reads:

$$\dot K_t \le -\frac{2}{C_P}\bigl(e^{K_t} - 1\bigr),$$

where $K_t := \mathsf{KL}(\mu_t \,\|\, \pi)$. Grönwall's inequality now yields

$$1 - e^{-K_t} \le \bigl(1 - e^{-K_0}\bigr)\, e^{-2t/C_P} \le e^{-2t/C_P}.$$

Note that $-\ln(1 - x) \le 2x$ whenever $x \le \frac12$. Thus, if $t \ge \frac{C_P \ln 2}{2}$, we get $e^{-2t/C_P} \le \frac12$, so

$$K_t \le -\ln\bigl(1 - e^{-2t/C_P}\bigr) \le 2\, e^{-2t/C_P},$$

which, together with (3.1), completes the proof of (3.2). ∎

Remark 1.

In [chewi2020exponential], it was observed that the chi-squared divergence decays exponentially fast along the gradient flow for the KL divergence, provided that $\pi$ satisfies a Poincaré inequality. This observation is made precise and more general in [matthes2009fourthorder], where it is noted that the gradient flow of a functional $\mathcal{F}$ dissipates a different functional $\mathcal{G}$ at the same rate that the gradient flow of $\mathcal{G}$ dissipates the functional $\mathcal{F}$. A similar method is used to study the thin film equation in [carrillo2002thinfilm] and [carlen2011functional, §5].

Since we are studying the gradient flow of the chi-squared divergence, it is natural to ask whether $\mu_t$ converges to $\pi$ in chi-squared divergence as well. In the next results, we show quantitative decay of the chi-squared divergence along the gradient flow under a Poincaré inequality (PI), but we obtain only a polynomial rate of decay. However, if we additionally assume either that $\pi$ is log-concave or that it satisfies a log-Sobolev inequality (LSI), then we obtain exponential decay of the chi-squared divergence along (CSF).

Theorem 2.

Suppose that $\pi$ satisfies a Poincaré inequality (PI) with constant $C_P$. Then, provided $\chi^2(\mu_0 \,\|\, \pi) < \infty$, the law $\mu_t$ of (CSF) satisfies a bound on $\chi^2(\mu_t \,\|\, \pi)$ which decays polynomially in $t$.

If we further assume that $\pi$ is log-concave, then $\chi^2(\mu_t \,\|\, \pi)$ decays exponentially fast in $t/C_P$.

Proof.

The proof is deferred to Appendix B. ∎

Under the stronger assumption of a log-Sobolev inequality (LSI), we can show strong uniform ergodicity as in Theorem 1.

Theorem 3.

Assume that $\pi$ satisfies a log-Sobolev inequality (LSI) with constant $C_{\mathrm{LS}}$. Let $\mu_t$ denote the law of (CSF), and assume that $\chi^2(\mu_0 \,\|\, \pi) < \infty$. Then, for all $t$ beyond a time proportional to $C_{\mathrm{LS}}$, the chi-squared divergence $\chi^2(\mu_t \,\|\, \pi)$ decays exponentially in $t/C_{\mathrm{LS}}$, with a bound that does not depend on $\mu_0$.

Proof.

The proof is deferred to Appendix B. ∎

Convergence in chi-squared divergence was studied in recent works such as [cao2019renyi, vempala2019langevin, chewi2020exponential]. From standard comparisons between information divergences (see [tsybakov2009nonparametric, §2.4]), it implies convergence in total variation distance, Hellinger distance, and KL divergence. Moreover, recent works have shown that the Poincaré inequality (PI) yields transportation-cost inequalities which bound the 2-Wasserstein distance by powers of the chi-squared divergence [ding2015quadratictransport, ledoux2018remarks, chewi2020exponential, liu2020quadratictransport], so we obtain convergence in the 2-Wasserstein distance as well. In particular, we mention that [chewi2020exponential] uses the chi-squared gradient flow (CSF) to prove a transportation-cost inequality.

4 Laplacian Adjusted Wasserstein Gradient Descent (LAWGD)

While the previous section leads to a better understanding of the convergence properties of SVGD in the case where $\mathcal{K}_\pi$ is the identity operator, it is still unclear how to choose the kernel $K$ to approach this idealized setup. For (KCSF) with a general kernel $K$, the calculation rules of Section 2.1, together with the method of the previous section, yield the formula

$$\partial_t \mathsf{KL}(\mu_t \,\|\, \pi) = -\mathbb{E}_\pi\Bigl[\Bigl\langle \nabla\frac{\mu_t}{\pi},\, \mathcal{K}_\pi\nabla\frac{\mu_t}{\pi}\Bigr\rangle\Bigr]$$

for the dissipation of the KL divergence along (KCSF). From this, a natural way to proceed is to seek an inequality of the form

$$\mathbb{E}_\pi\bigl[f\, \mathcal{K}_\pi f\bigr] \ge c\, \mathbb{E}_\pi\bigl[f^2\bigr] \qquad \text{for some } c > 0. \tag{4.1}$$

Applying this inequality to each coordinate of $\nabla\frac{\mu_t}{\pi}$ separately and using a Poincaré inequality would then allow us to conclude as in the proof of Theorem 1. The inequality (4.1) can be interpreted as a positive lower bound on the smallest eigenvalue of the operator $\mathcal{K}_\pi$. However, this approach is doomed to fail; under mild conditions on the kernel $K$, it is a standard fact that the eigenvalues of $\mathcal{K}_\pi$ form a sequence converging to $0$, so no such spectral gap can hold. (It suffices that $K$ be a symmetric kernel with $\iint K^2 \, \mathrm{d}\pi \otimes \mathrm{d}\pi < \infty$ and that $\pi$ not be discrete, so that $L^2(\pi)$ is infinite-dimensional; see [bakry2014markov, Appendix A.6].)

This suggests that any approach which seeks to prove finite-time convergence results for (KCSF) in the spirit of Theorem 1 must exploit finer properties of the eigenspaces of the operator $\mathcal{K}_\pi$. Motivated by this observation, we develop a new algorithm, called Laplacian Adjusted Wasserstein Gradient Descent (LAWGD), in which the kernel $K$ is chosen carefully so that $\mathcal{K}_\pi$ is the inverse of the generator of the Langevin diffusion that has $\pi$ as its invariant measure.

More precisely, the starting point for our approach is the following integration-by-parts formula, which is a crucial component of the theory of Markov semigroups [bakry2014markov]:

$$\mathbb{E}_\pi\bigl[\langle \nabla f, \nabla g\rangle\bigr] = \mathbb{E}_\pi\bigl[f\, \mathcal{L} g\bigr], \tag{4.2}$$

where $\mathcal{L} := -\Delta + \langle \nabla V, \nabla \cdot\rangle$. The operator $\mathcal{L}$ is the (negative) generator of the standard Langevin diffusion with stationary distribution $\pi \propto e^{-V}$ [pavliotis2014stochastic, §4.5]. We refer readers to Appendix A for background on the spectral theory of $\mathcal{L}$.

In order to use (4.2), we replace $\nabla\frac{\mu_t}{\pi}$ by the vector field $\nabla\mathcal{K}_\pi\frac{\mu_t}{\pi}$. The new dynamics follow the evolution equation

$$\partial_t \mu_t = \operatorname{div}\Bigl(\mu_t\, \nabla\mathcal{K}_\pi\frac{\mu_t}{\pi}\Bigr). \tag{LAWGD}$$

The vector field in the above continuity equation may also be written

$$\nabla\mathcal{K}_\pi\frac{\mu_t}{\pi}(x) = \nabla_x \int K(x, y)\, \frac{\mu_t}{\pi}(y)\, \pi(y)\, \mathrm{d}y = \mathbb{E}_{Y \sim \mu_t}\bigl[\nabla_1 K(x, Y)\bigr].$$

Replacing the expectation over $\mu_t$ by an empirical average over particles and discretizing the process in time, we again obtain an implementable algorithm, which we give as Algorithm 1.

A careful inspection of Algorithm 1 reveals that the update equation for the particles does not involve the potential $V$ directly, unlike the SVGD algorithm (2.5); thus, the kernel for LAWGD must contain all the information about $\pi$.

Our choice for the kernel is guided by the following observation (based on (4.2)):

$$\Bigl\langle \nabla\frac{\mu_t}{\pi},\, \nabla\mathcal{K}_\pi\frac{\mu_t}{\pi}\Bigr\rangle_{L^2(\pi)} = \Bigl\langle \frac{\mu_t}{\pi},\, \mathcal{L}\mathcal{K}_\pi\frac{\mu_t}{\pi}\Bigr\rangle_{L^2(\pi)}.$$

As a result, we choose $K$ to ensure that $\mathcal{K}_\pi = \mathcal{L}^{-1}$. This choice yields

$$\partial_t \mathsf{KL}(\mu_t \,\|\, \pi) = -\Bigl\langle \nabla\frac{\mu_t}{\pi},\, \nabla\mathcal{K}_\pi\frac{\mu_t}{\pi}\Bigr\rangle_{L^2(\pi)} = -\chi^2(\mu_t \,\|\, \pi). \tag{4.3}$$
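
To spell out why this choice yields (4.3) (a short formal verification, under the spectral assumptions of Appendix A, which give $\mathcal{L}\mathcal{K}_\pi\frac{\mu_t}{\pi} = \frac{\mu_t}{\pi} - 1$):

$$\Bigl\langle \nabla\frac{\mu_t}{\pi},\, \nabla\mathcal{K}_\pi\frac{\mu_t}{\pi}\Bigr\rangle_{L^2(\pi)} \overset{(4.2)}{=} \Bigl\langle \frac{\mu_t}{\pi},\, \frac{\mu_t}{\pi} - 1\Bigr\rangle_{L^2(\pi)} = \mathbb{E}_\pi\Bigl[\Bigl(\frac{\mu_t}{\pi} - 1\Bigr)^2\Bigr] = \chi^2(\mu_t \,\|\, \pi),$$

where the middle identity uses $\mathbb{E}_\pi\bigl[\frac{\mu_t}{\pi} - 1\bigr] = 0$.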
1: procedure LAWGD(kernel $K_{\mathcal{L}}$, step size $h$, number of particles $N$, number of iterations $T$)
2:     draw particles $X_0[1], \dots, X_0[N] \sim \mu_0$
3:     for $t = 0, \dots, T - 1$ do
4:         for $i = 1, \dots, N$ do
5:             $X_{t+1}[i] \leftarrow X_t[i] - \frac{h}{N} \sum_{j=1}^{N} \nabla_1 K_{\mathcal{L}}(X_t[i], X_t[j])$
6:         end for
7:     end for
8:     return $X_T[1], \dots, X_T[N]$
9: end procedure
Algorithm 1 LAWGD
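
For readers who prefer code, here is a minimal NumPy sketch of Algorithm 1. It assumes access to a helper grad1_K(X, Y) returning the array of kernel gradients $\nabla_1 K_{\mathcal{L}}(X[i], Y[j])$; two illustrative ways to build such a helper are sketched in Section 5. The function name, step size, and iteration count are our own illustrative choices.

```python
import numpy as np

def lawgd(grad1_K, X0, h=0.1, n_iters=1000):
    """Minimal sketch of Algorithm 1 (LAWGD).

    grad1_K(X, Y): callable returning an (N, N, d) array with entries nabla_1 K_L(X[i], Y[j]).
    X0: (N, d) array of initial particles drawn from mu_0.
    """
    X = X0.copy()
    N = X.shape[0]
    for _ in range(n_iters):
        # each particle moves along -(h / N) * sum_j nabla_1 K_L(X[i], X[j])
        X = X - (h / N) * grad1_K(X, X).sum(axis=1)
    return X
```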

It remains to see which kernel implements $\mathcal{K}_\pi = \mathcal{L}^{-1}$. To that end, assume that $\mathcal{L}$ has a discrete spectrum and let $(\lambda_i, \phi_i)_{i \ge 0}$ be its eigenvalue-eigenfunction pairs, where the $\lambda_i$'s are arranged in nondecreasing order. Assume further that $\lambda_1 > 0$ (which amounts to a Poincaré inequality; see Appendix A) and define the following spectral kernel:

$$K_{\mathcal{L}}(x, y) := \sum_{i \ge 1} \frac{\phi_i(x)\, \phi_i(y)}{\lambda_i}. \tag{4.4}$$

We now show that this choice of kernel endows (LAWGD) with a remarkable property: it converges to the target distribution exponentially fast, with a rate which has no dependence on the Poincaré constant. Moreover, akin to the bound (3.2) for (CSF), it also exhibits strong uniform ergodicity.

Theorem 4.

Assume that $\mathcal{L}$ has a discrete spectrum and that $\pi$ satisfies a Poincaré inequality (PI) with some finite constant. Let $\mu_t$ be the law of (LAWGD) run with the kernel $K_{\mathcal{L}}$ described above. Then, for all $t \ge 0$,

$$\mathsf{KL}(\mu_t \,\|\, \pi) \le \mathsf{KL}(\mu_0 \,\|\, \pi)\, e^{-t}.$$

Proof.

In light of (4.3), the proof is identical to that of Theorem 1. ∎

The convergence rate in Theorem 4 has no dependence on the target measure. This scale-invariant convergence also appears in [chewi2020exponential], where it is shown for the Newton-Langevin diffusion with a strictly log-concave target measure $\pi$. In Theorem 4, we obtain similar guarantees under the much weaker assumption of a Poincaré inequality; indeed, there are many examples of non-log-concave distributions which satisfy a Poincaré inequality [vempala2019langevin].

5 Experiments

Figure 2: Samples from the standard Gaussian distribution generated by LAWGD, with the kernel approximated by Hermite polynomials. For details, see Appendix C.

To implement Algorithm 1, we numerically approximate the kernel $K_{\mathcal{L}}$ given in (4.4). When $\pi$ is the standard Gaussian distribution, the eigendecomposition of the operator $\mathcal{L}$ in (4.2) is known explicitly in terms of the Hermite polynomials [bakry2014markov, §2.7.1], and we approximate the kernel via a truncated sum involving the smallest eigenvalues of $\mathcal{L}$ (Figure 2).
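
As an illustration of this truncation in one dimension, the sketch below builds $\nabla_1 K_{\mathcal{L}}$ for the standard Gaussian target from the well-known eigenpairs of the Ornstein-Uhlenbeck generator ($\lambda_k = k$ with normalized probabilists' Hermite polynomials $\mathrm{He}_k/\sqrt{k!}$, so that $\mathrm{He}_k' = k\,\mathrm{He}_{k-1}$). This is our own illustrative code, not the authors' implementation; the truncation level and the commented usage line are arbitrary choices.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from math import factorial

def hermite_kernel_grad(m=20):
    """grad1_K for a truncation of (4.4) with the m smallest nonzero eigenvalues,
    for the 1D standard Gaussian target (lambda_k = k, phi_k = He_k / sqrt(k!))."""
    def He(k, x):
        c = np.zeros(k + 1)
        c[k] = 1.0
        return hermeval(x, c)          # evaluates the probabilists' Hermite polynomial He_k

    def grad1_K(X, Y):
        x, y = X[:, 0], Y[:, 0]        # particles stored as (N, 1) arrays
        # d/dx K_L(x, y) = sum_{k>=1} He_k'(x) He_k(y) / (k * k!) = sum_k He_{k-1}(x) He_k(y) / k!
        g = sum(np.outer(He(k - 1, x), He(k, y)) / factorial(k) for k in range(1, m + 1))
        return g[:, :, None]           # shape (N, N, 1)

    return grad1_K

# Illustrative usage with the lawgd sketch above:
# rng = np.random.default_rng(0)
# X = lawgd(hermite_kernel_grad(), rng.uniform(1.0, 3.0, size=(100, 1)), h=0.05, n_iters=2000)
```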

In the general case, we implement a basic finite difference (FD) method to approximate the eigenvalues and eigenfunctions of $\mathcal{L}$. We obtain better numerical results by first transforming $\mathcal{L}$ into the Schrödinger operator $\mathcal{H} := -\Delta + W$, where $W := \frac{\|\nabla V\|^2}{4} - \frac{\Delta V}{2}$. If $\psi$ is an eigenfunction of $\mathcal{H}$ with eigenvalue $\lambda$ (normalized such that $\int \psi^2 = 1$), then $\phi := \psi\, e^{V/2}$ is an eigenfunction of $\mathcal{L}$, also with eigenvalue $\lambda$ (and normalized such that $\mathbb{E}_\pi[\phi^2] = 1$ when $\pi = e^{-V}$ is normalized); see [bakry2014markov, §1.15.7].

Figure 3: LAWGD and SVGD run with constant step size for a mixture of three Gaussians. Both kernel density estimators use the same bandwidth.

On a grid of $n$ points (with spacing $\delta$), if we replace the Laplacian with the standard second-order FD operator $f \mapsto \bigl(f(\cdot + \delta) - 2f(\cdot) + f(\cdot - \delta)\bigr)/\delta^2$ (in 1D), then the FD Schrödinger operator can be represented as a sparse matrix, and its eigenvalues and (unit) eigenvectors are found with standard linear algebra solvers.

When the potential $V$ is known only up to an additive constant, the approximate eigenfunctions produced by this method are not normalized correctly; instead, they satisfy $\mathbb{E}_\pi[\phi_i^2] = c$ for some constant $c$ (which is the same for each eigenfunction). In turn, this causes the kernel in (4.4) to be off by a multiplicative constant. For implementation purposes, however, this constant is absorbed into the step size of Algorithm 1. We also note that the eigenfunctions are differentiated using a FD approximation.
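
To make this construction concrete, here is a rough one-dimensional sketch under the conventions above. The grid, the double-well example potential, and the nearest-grid-point lookup are illustrative choices of ours (not the authors' implementation), and values of the eigenfunctions far in the tails may be numerically inaccurate.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def fd_kernel_grad(V, dV, ddV, grid):
    """Approximate grad1_K on a 1D grid via the Schrodinger operator H = -Laplacian + W."""
    delta = grid[1] - grid[0]
    W = dV(grid) ** 2 / 4.0 - ddV(grid) / 2.0        # W = |V'|^2 / 4 - V'' / 2
    diag = 2.0 / delta ** 2 + W                      # FD Laplacian: (f(x+d) - 2 f(x) + f(x-d)) / d^2
    off = -np.ones(len(grid) - 1) / delta ** 2
    lam, psi = eigh_tridiagonal(diag, off)           # eigenpairs of the tridiagonal FD matrix
    lam, psi = lam[1:], psi[:, 1:]                   # drop the bottom eigenpair (constant mode of L)
    phi = psi * np.exp(V(grid) / 2.0)[:, None]       # eigenfunctions of L, up to a common constant
    K = (phi / lam) @ phi.T                          # K_L(x, y) = sum_i phi_i(x) phi_i(y) / lambda_i
    dK = np.gradient(K, delta, axis=0)               # FD derivative in the first argument

    def grad1_K(X, Y):
        ia = np.abs(X[:, 0][:, None] - grid[None, :]).argmin(axis=1)   # nearest grid points
        ib = np.abs(Y[:, 0][:, None] - grid[None, :]).argmin(axis=1)
        return dK[np.ix_(ia, ib)][:, :, None]        # shape (N, N, 1)

    return grad1_K

# Illustrative double-well target pi proportional to exp(-V), with V(x) = (x^2 - 1)^2:
V   = lambda x: (x ** 2 - 1.0) ** 2
dV  = lambda x: 4.0 * x * (x ** 2 - 1.0)
ddV = lambda x: 12.0 * x ** 2 - 4.0
grad1_K = fd_kernel_grad(V, dV, ddV, np.linspace(-3.0, 3.0, 256))
```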

To demonstrate, we sample from a mixture of three Gaussians in one dimension. We compare LAWGD with SVGD using the RBF kernel and median-based bandwidth as in [liu2016stein]. We approximate the eigenfunctions and eigenvalues of $\mathcal{L}$ using a finite difference scheme on 256 evenly spaced grid points. Constant step sizes for LAWGD and SVGD are tuned, the algorithms are run for 5000 iterations, and the samples are initialized uniformly on an interval. The results are displayed in Figure 3. All 256 discrete eigenfunctions and eigenvalues are used.

6 Open questions

We conclude this paper with some interesting open questions. The introduction of the chi-squared divergence as an objective function allows us to obtain both theoretical insights about SVGD and a new algorithm, LAWGD. This perspective opens the possibility of identifying other functionals defined over Wasserstein space that yield gradient flows amenable to mathematical analysis and efficient computation. Toward this goal, an intriguing direction is to develop alternative methods, besides kernelization, which provide effective implementations of Wasserstein gradient flows. Finally, we note that LAWGD provides a hitherto unexplored connection between sampling and computing the spectral decomposition of the Schrödinger operator, the latter of which has been intensively studied in numerical PDEs. We hope our work further stimulates research at the intersection of these communities.

Acknowledgments.
Philippe Rigollet was supported by NSF awards IIS-1838071, DMS-1712596, DMS-TRIPODS-1740751, and ONR grant N00014-17-1-2147. Sinho Chewi and Austin J. Stromme were supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program. Thibaut Le Gouic was supported by ONR grant N00014-17-1-2147 and NSF IIS-1838071.

Appendix A Review of spectral theory

In this paper, we consider elliptic differential operators of the form $\mathcal{L} = -\Delta + \langle \nabla V, \nabla \cdot\rangle$, where $V$ is a continuously differentiable potential. In this section, we provide a brief review of the spectral theory of these operators, and we refer to [evans2010pde, §6.5] for a standard treatment.

The operator $\mathcal{L}$ (when suitably interpreted) is a linear operator defined on a domain $\mathcal{D}(\mathcal{L}) \subseteq L^2(\pi)$. For any locally Lipschitz function $f$ in the domain, integration by parts shows that

$$\mathbb{E}_\pi[f\, \mathcal{L} f] = \mathbb{E}_\pi\bigl[\|\nabla f\|^2\bigr] \ge 0.$$

Therefore, $\mathcal{L}$ has a non-negative spectrum. Also, we have $\mathcal{L}\mathbf{1} = 0$, so that $0$ is always an eigenvalue of $\mathcal{L}$. We say that $\mathcal{L}$ has a discrete spectrum if it has a countable sequence of eigenvalues $0 = \lambda_0 \le \lambda_1 \le \lambda_2 \le \cdots$ and corresponding eigenfunctions $(\phi_i)_{i \ge 0}$ which form a basis of $L^2(\pi)$. The eigenfunctions can be chosen to be orthogonal and normalized such that $\mathbb{E}_\pi[\phi_i^2] = 1$; we always assume this is the case. Then, $\mathcal{L}$ can be expressed as

$$\mathcal{L} f = \sum_{i \ge 0} \lambda_i\, \langle f, \phi_i\rangle_{L^2(\pi)}\, \phi_i.$$

The operator $\mathcal{L}$ has a discrete spectrum under the following condition ([friedrichsSpektraltheorieHalbbeschrankterOperatoren1934][reedsimon1978vol4, Theorem XIII.67][bakry2014markov, Corollary 4.10.9]):

$$W(x) \to \infty \quad \text{as } \|x\| \to \infty, \qquad \text{where } W := \frac{\|\nabla V\|^2}{4} - \frac{\Delta V}{2}.$$

Moreover, under this condition we also have $\lambda_i \to \infty$ as $i \to \infty$. For example, this condition is satisfied for $V(x) = \|x\|^\alpha$ with $\alpha > 1$, but not for $\alpha \le 1$. In fact, for $V(x) = \|x\|$, the spectrum of $\mathcal{L}$ is not discrete [bakry2014markov, §4.1.1].

The Poincaré inequality (PI) is interpreted as a spectral gap inequality, since it asserts that $\lambda_1 \ge 1/C_P > 0$. Thus, under a Poincaré inequality, $\mathcal{L}$ is bijective on the subspace of mean-zero functions in $L^2(\pi)$. Moreover, if it has a discrete spectrum, its inverse satisfies

$$\mathcal{L}^{-1} f = \sum_{i \ge 1} \frac{\langle f, \phi_i\rangle_{L^2(\pi)}}{\lambda_i}\, \phi_i \qquad \text{for } f \text{ with } \mathbb{E}_\pi[f] = 0,$$

which is the integral operator $\mathcal{K}_\pi$ with kernel $K_{\mathcal{L}}$ given in (4.4).
Appendix B Proofs of the convergence guarantees for the chi-squared flow

In this section, we give proofs of the convergence results we stated in Section 3.

Proof of Theorem 2 (non-log-concave case).

According to [chewi2020exponential, Proposition 1], the Poincaré inequality implies a Łojasiewicz-type inequality for the chi-squared divergence, which we refer to as (B.1): it bounds a power of $\chi^2(\mu \,\|\, \pi)$ by a constant multiple (depending on $C_P$) of $\mathbb{E}_\mu\bigl[\|\nabla\frac{\mu}{\pi}\|^2\bigr]$.

Since the Wasserstein gradient of $\chi^2(\cdot \,\|\, \pi)$ at $\mu$ is given by $2\nabla\frac{\mu}{\pi}$ (see Section 2.1), combining (B.1) with the dissipation rule $\partial_t \chi^2(\mu_t \,\|\, \pi) = -4\, \mathbb{E}_{\mu_t}\bigl[\|\nabla\frac{\mu_t}{\pi}\|^2\bigr]$ yields a differential inequality for $\chi^2(\mu_t \,\|\, \pi)$.

Solving this differential inequality yields a bound on $\chi^2(\mu_t \,\|\, \pi)$ which decays polynomially in $t$, which implies the desired result. ∎

We now prepare for the proof of exponentially fast convergence in chi-squared divergence for log-concave measures. The key to proving such results lies in inequalities of the form

$$\chi^2(\mu \,\|\, \pi) \le c\, \mathbb{E}_\mu\Bigl[\Bigl\|\nabla\frac{\mu}{\pi}\Bigr\|^2\Bigr], \tag{B.2}$$

which may be interpreted as a Polyak-Łojasiewicz (PL) inequality [karimi2016linear] for $\chi^2(\cdot \,\|\, \pi)$. PL inequalities are well-known in the optimization literature, and can even be used when the objective is not convex [CheMauRig20]. In contrast, the preceding proof uses the weaker inequality (B.1), which may be interpreted as a Łojasiewicz inequality [lojasiewicz1963propriete].

To see that a PL inequality readily yields exponential convergence, observe that, along (CSF), the dissipation rule of Section 2.1 gives

$$\partial_t \chi^2(\mu_t \,\|\, \pi) = -4\, \mathbb{E}_{\mu_t}\Bigl[\Bigl\|\nabla\frac{\mu_t}{\pi}\Bigr\|^2\Bigr].$$

Together with Grönwall's inequality, the differential inequality obtained from (B.2) yields $\chi^2(\mu_t \,\|\, \pi) \le \chi^2(\mu_0 \,\|\, \pi)\, e^{-4t/c}$.

In order to prove a PL inequality of the type (B.2), we require two ingredients. The first one is a transportation-cost inequality for the chi-squared divergence proven in [liu2020quadratictransport], building on the works [ding2015quadratictransport, ledoux2018remarks]. It asserts that if $\pi$ satisfies a Poincaré inequality (PI), then the following inequality holds:

$$W_2^2(\mu, \pi) \lesssim C_P\, \chi^2(\mu \,\|\, \pi). \tag{B.3}$$

For the second ingredient, we use an argument of [otto2000generalization] to show that if $\pi$ satisfies a chi-squared transportation-cost inequality such as (B.3), and in addition is log-concave, then it satisfies an inequality of the type (B.2). We remark that the converse statement, that is, if $\pi$ satisfies a PL inequality (B.2) then it satisfies an appropriate chi-squared transportation-cost inequality, was proven in [chewi2020exponential] without the additional assumption of log-concavity. It implies that for log-concave distributions, the PL inequality (B.2) and the chi-squared transportation-cost inequality (B.3) are, in fact, equivalent.

Theorem 5.

Let $\pi$ be log-concave, and assume that $\pi$ satisfies a transportation-cost inequality which bounds a power of $W_2(\mu, \pi)$ by a constant multiple of $\chi^2(\mu \,\|\, \pi)$.

Then, $\pi$ satisfies a Łojasiewicz-type inequality (B.4) relating $\chi^2(\mu \,\|\, \pi)$ to $\mathbb{E}_\mu\bigl[\|\nabla\frac{\mu}{\pi}\|^2\bigr]$, with constant and exponent determined by those of the assumed transportation-cost inequality; in particular, under (B.3) it yields a PL inequality of the form (B.2).

Proof.

Following [otto2000generalization], let $T$ be the optimal transport map from $\mu$ to $\pi$. Since $\chi^2(\cdot \,\|\, \pi)$ is displacement convex for log-concave $\pi$ [ohta2011generalizedentropies, ohta2013generalizedentropiesii] and has Wasserstein gradient $2\nabla\frac{\mu}{\pi}$ at $\mu$ (c.f. Section 2.1), the "above-tangent" formulation of displacement convexity ([villani2003topics, Proposition 5.29]) yields

$$\chi^2(\mu \,\|\, \pi) \le \Bigl\langle 2\nabla\frac{\mu}{\pi},\, \mathrm{id} - T\Bigr\rangle_{L^2(\mu)} \le 2\, \Bigl\|\nabla\frac{\mu}{\pi}\Bigr\|_{L^2(\mu)}\, W_2(\mu, \pi),$$

where we used the Cauchy-Schwarz inequality for the last inequality. Rearranging the above display and using the transportation-cost inequality assumed in the statement of the theorem to bound $W_2(\mu, \pi)$ by a power of $\chi^2(\mu \,\|\, \pi)$, we obtain an inequality of the form (B.4). The result follows by rearranging the terms. ∎

Proof of Theorem 2 (log-concave case).

From the transportation-cost inequality (B.3) and Theorem 5, we obtain a PL inequality of the form (B.2) with a constant proportional to $C_P$.

This PL inequality, together with Grönwall's inequality (as explained after (B.2)), readily yields the result. ∎

We conclude this section with the proof of Theorem 3, which shows exponential convergence of (CSF) in chi-squared divergence under the assumption of a log-Sobolev inequality (LSI) (but without the assumption of log-concavity).

Proof of Theorem 3.

We first claim that

(B.5)

Indeed, applying the log-Sobolev inequality (LSI), we obtain