The seminal paper of Jordan, Kinderlehrer, and Otto [jordan1998variational] has profoundly reshaped our understanding of sampling algorithms. What is now commonly known as the JKO scheme
interprets the evolution of marginal distributions of a Langevin diffusion as a gradient flow of a Kullback-Leibler (KL) divergence over the Wasserstein space of probability measures. This optimization perspective on Markov Chain Monte Carlo (MCMC) has not only renewed our understanding of algorithms based on Langevin diffusions [Dal17, Ber18, cheng2018langevin, Wib18, durmus2019lmcconvex, vempala2019langevin], but has also fueled the discovery of new MCMC algorithms inspired by the diverse and powerful optimization toolbox [martin2012newtonmcmc, simsekli2016quasinewtonlangevin, cheng2017underdamped, Ber18, hsieh2018mirrored, Wib18, ma2019there, Wib19prox, chewi2020exponential, DalRiou2020, zhang2020mirror].
The Unadjusted Langevin Algorithm (ULA) [dalalyan2017theoretical, durmus2017nonasymptoticlangevin] is the most common discretization of the Wasserstein gradient flow for the KL divergence, but it is unclear whether it is the most effective one. In fact, ULA is asymptotically biased, which results in slow convergence and often requires ad-hoc adjustments [dwivedi2019log]. To overcome this limitation, various methods that track the Wasserstein gradient flow more closely have been recently developed [Ber18, Wib18, salim2020proximal].
Let $\mathcal F$ denote a functional over the Wasserstein space of distributions. The Wasserstein gradient flow of $\mathcal F$ may be described as the deterministic and time-inhomogeneous Markov process $(X_t)_{t \ge 0}$
started at a random variable $X_0 \sim \mu_0$ and evolving according to $\dot X_t = -\nabla_{W_2}\mathcal F(\mu_t)(X_t)$, where $\mu_t$ denotes the distribution of $X_t$. Here $\nabla_{W_2}\mathcal F(\mu_t)$ is the Wasserstein gradient of $\mathcal F$ at $\mu_t$. If $\mathcal F = \mathrm{KL}(\cdot\,\|\,\pi)$, where $\pi \propto e^{-V}$ is a given target distribution on $\mathbb R^d$, it is known [ambrosio2008gradient, villani2009ot, santambrogio2017euclidean] that $\nabla_{W_2}\mathrm{KL}(\mu\|\pi) = \nabla\ln\frac{\mu}{\pi}$. Therefore, a natural discretization of the Wasserstein gradient flow with step size $h > 0$, albeit one that cannot be implemented since it depends on the distribution $\mu_k$ of the iterate $X_k$, is:
$$X_{k+1} = X_k - h\,\nabla\ln\frac{\mu_k}{\pi}(X_k).$$
While $\mu_k$ can, in principle, be estimated by evolving a large number of particles, estimation of $\nabla\ln\frac{\mu_k}{\pi}$ is hindered by the curse of dimensionality, and this approach still faces significant computational challenges despite attempts to improve the original JKO scheme [salim2020proximal, wang2020informationnewton].
A major advance in this direction was achieved by allowing for approximate Wasserstein gradients. More specifically, Stein Variational Gradient Descent (SVGD), recently proposed by [liu2016stein] (see Section 2 for more details), consists in replacing $\nabla\ln\frac{\mu_t}{\pi}$ by its image under the integral operator $\mathcal K_{\mu_t}$ associated with a chosen kernel $K$ and defined by $\mathcal K_\mu f := \int K(\cdot, x)\,f(x)\,\mathrm d\mu(x)$ for $f \in L^2(\mu)$. This leads to the following process:
$$\dot X_t = -\mathcal K_{\mu_t}\nabla\ln\frac{\mu_t}{\pi}(X_t),$$
where we apply the integral operator $\mathcal K_{\mu_t}$ individually to each coordinate of the Wasserstein gradient. In turn, this kernelization trick overcomes most of the computational bottleneck described above. Building on this perspective, [duncan2019geometrysvgd] introduced a new geometry, different from the Wasserstein geometry, which they call the Stein geometry, in which this process becomes the gradient flow of the KL divergence. However, despite this recent advance, the theoretical properties of SVGD as a sampling algorithm, as well as guidelines for the choice of the kernel, are still largely unexplored.
In this work, we revisit the above view of SVGD as a kernelized gradient flow of the KL divergence over Wasserstein space that was put forward in [liu2017stein].
Our contributions. We introduce, in Section 2.3, a new perspective on SVGD by viewing it as a kernelized gradient flow of the chi-squared divergence rather than the KL divergence. This perspective is fruitful in two ways. First, it uses a single integral operator $\mathcal K_\pi$ (as opposed to the SVGD dynamics, which require a family of integral operators $\mathcal K_{\mu_t}$ indexed by $t \ge 0$), providing a conceptually clear guideline for choosing the kernel $K$, namely: $K$ should be chosen to make $\mathcal K_\pi$ approximately equal to the identity operator. Second, under the idealized choice $\mathcal K_\pi = \mathrm{id}$, we show that this gradient flow converges exponentially fast in KL divergence as soon as the target distribution $\pi$ satisfies a Poincaré inequality. In fact, our results are stronger than exponential convergence, and they highlight strong uniform ergodicity: the gradient flow forgets the initial distribution after a finite time that is at most half of the Poincaré constant. To establish this exponential convergence under a relatively weak condition (a Poincaré inequality), we employ the following technique. While the gradient flow aims at minimizing the chi-squared divergence by following the curve of steepest descent in Wasserstein space, we do not track its progress with the objective function itself, the chi-squared divergence; instead, we track it with the KL divergence. This is, in a sense, dual to the argument employed in [chewi2020exponential], where the chi-squared divergence is used to track the progress of a gradient flow on the KL divergence. A more standard analysis relying on Łojasiewicz inequalities also yields rates of convergence in chi-squared divergence under stronger assumptions, such as a log-Sobolev inequality or log-concavity. These results establish the first finite-time theoretical guarantees for SVGD in an idealized setting.
Beyond providing a better understanding of SVGD, our novel perspective is instrumental in the development of a new sampling algorithm, which we call Laplacian Adjusted Wasserstein Gradient Descent (LAWGD) and present in Section 4. Although LAWGD is challenging to implement in high dimensions, we show that it possesses a striking theoretical property: assuming that the target distribution satisfies a Poincaré inequality, LAWGD converges exponentially fast, with no dependence on the Poincaré constant. This scale invariance has been recently demonstrated for the Newton-Langevin diffusion [chewi2020exponential], but under the additional assumption that $\pi$ is log-concave. A successful implementation of LAWGD hinges on the spectral decomposition of a certain differential operator, which is within reach of modern PDE solvers. As a proof of concept, we show that LAWGD performs well in one or two dimensions using a naïve finite differences method, and we leave the question of applying more sophisticated numerical solvers open for future research.
Related work. Since its introduction in [liu2016stein], a number of variants of SVGD have been considered. They include a stochastic version [li2019stochastic], a version that approximates the Newton direction in Wasserstein space [detommaso2018stein], a version that uses matrix kernels [wang2019stein], an accelerated version [liu2019understanding], and a hybrid with Langevin [zhang2018stochastic]. Several works have studied theoretical properties of SVGD, including its interpretation as a gradient flow under a modified geometry [liu2017stein, duncan2019geometrysvgd], and its asymptotic convergence [lu2019scaling].
Notation. In this paper, all probability measures are assumed to have densities w.r.t. Lebesgue measure; therefore, we frequently abuse notation by identifying a probability measure with its Lebesgue density. For a differentiable kernel $K$, we denote by $\nabla_1 K$ (resp. $\nabla_2 K$) the gradient of the kernel w.r.t. its first (resp. second) argument. When describing particle algorithms, we use a subscript to denote the time index and brackets to denote the particle index, i.e., $X_t[i]$ refers to the $i$th particle at time (or iteration number) $t$.
2 SVGD as a kernelized Wasserstein gradient flow
2.1 Wasserstein gradient flows
In this section, we review the theory of gradient flows on the space $\mathcal P_2^{\mathrm{ac}}(\mathbb R^d)$ of probability measures absolutely continuous w.r.t. Lebesgue measure and possessing a finite second moment, equipped with the 2-Wasserstein metric $W_2$. We refer readers to [villani2003topics, santambrogio2015ot, santambrogio2017euclidean] for introductory treatments of optimal transport, and to [ambrosio2008gradient, villani2009ot] for detailed treatments of Wasserstein gradient flows.
Let $\mathcal F$ be a functional defined on Wasserstein space. We say that a curve of probability measures $(\mu_t)_{t \ge 0}$ is a Wasserstein gradient flow for the functional $\mathcal F$ if it satisfies
$$\partial_t\mu_t = \operatorname{div}\bigl(\mu_t\,\nabla_{W_2}\mathcal F(\mu_t)\bigr) \tag{2.1}$$
in a weak sense. Here, $\nabla_{W_2}\mathcal F(\mu) := \nabla\,\delta\mathcal F(\mu)$ is the Wasserstein gradient of the functional $\mathcal F$ at $\mu$, where $\delta\mathcal F(\mu)$ is the first variation of $\mathcal F$ at $\mu$, defined by
$$\lim_{\varepsilon \searrow 0} \frac{\mathcal F(\mu + \varepsilon\xi) - \mathcal F(\mu)}{\varepsilon} = \int \delta\mathcal F(\mu)\,\mathrm d\xi$$
for signed perturbations $\xi$ with $\int \mathrm d\xi = 0$, and $\nabla$ denotes the usual (Euclidean) gradient. Hence, the Wasserstein gradient, at each $\mu$, is a map from $\mathbb R^d$ to $\mathbb R^d$.
Using the continuity equation, we can give an Eulerian interpretation to the evolution equation (2.1) (see [santambrogio2015ot, §4] and [ambrosio2008gradient, §8]
). Given a family $(v_t)_{t \ge 0}$ of vector fields, let $(X_t)_{t \ge 0}$ be a curve in $\mathbb R^d$ with random initial point $X_0 \sim \mu_0$, and such that $(X_t)_{t \ge 0}$ is an integral curve of the vector fields $(v_t)_{t \ge 0}$, that is, $\dot X_t = v_t(X_t)$. If we let $\mu_t$ denote the law of $X_t$, then $(\mu_t)_{t \ge 0}$ evolves according to the continuity equation
$$\partial_t\mu_t + \operatorname{div}(\mu_t v_t) = 0. \tag{2.2}$$
Wasserstein calculus provides the following (formal) calculation rule: the Wasserstein gradient flow for the functional $\mathcal F$ dissipates $\mathcal F$ at the rate $\partial_t\mathcal F(\mu_t) = -\mathbb E_{\mu_t}[\|\nabla_{W_2}\mathcal F(\mu_t)\|^2]$. More generally, for a curve evolving according to the continuity equation (2.2), the time-derivative of $\mathcal F$ is given by $\partial_t\mathcal F(\mu_t) = \mathbb E_{\mu_t}\langle\nabla_{W_2}\mathcal F(\mu_t), v_t\rangle$.
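For completeness, here is a sketch of the calculation behind this rule, using the first-variation notation of this section (an informal derivation, assuming enough regularity to integrate by parts):

```latex
\partial_t \mathcal{F}(\mu_t)
  = \int \delta\mathcal{F}(\mu_t)\,\partial_t \mu_t
  = -\int \delta\mathcal{F}(\mu_t)\,\operatorname{div}(\mu_t v_t)
  = \int \bigl\langle \nabla \delta\mathcal{F}(\mu_t),\, v_t \bigr\rangle \,\mathrm{d}\mu_t
  = \mathbb{E}_{\mu_t}\,\langle \nabla_{W_2}\mathcal{F}(\mu_t),\, v_t \rangle .
```

Taking $v_t = -\nabla_{W_2}\mathcal F(\mu_t)$ recovers the dissipation rate $-\mathbb E_{\mu_t}[\|\nabla_{W_2}\mathcal F(\mu_t)\|^2]$ stated above.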
In this paper, we are primarily concerned with two functionals: the Kullback-Leibler (KL) divergence $\mathrm{KL}(\mu\|\pi) := \int \mu \ln\frac{\mu}{\pi}$, and the chi-squared divergence $\chi^2(\mu\|\pi) := \int \bigl(\frac{\mu}{\pi}\bigr)^2\,\mathrm d\pi - 1$ (see, e.g., [tsybakov2009nonparametric]). It is a standard exercise [ambrosio2008gradient, santambrogio2015ot] to check that the Wasserstein gradients of these functionals are, respectively,
$$\nabla_{W_2}\,\mathrm{KL}(\mu\|\pi) = \nabla\ln\frac{\mu}{\pi} \qquad\text{and}\qquad \nabla_{W_2}\,\chi^2(\mu\|\pi) = 2\,\nabla\frac{\mu}{\pi}. \tag{2.3}$$
2.2 SVGD as a kernelized gradient flow of the KL divergence
SVGD (throughout this paper, we call SVGD the generalization of the original method of [liu2016stein, liu2017stein] that was introduced in [wang2019stein]) is obtained by replacing the Wasserstein gradient $\nabla\ln\frac{\mu_t}{\pi}$ of the KL divergence with $\mathcal K_{\mu_t}\nabla\ln\frac{\mu_t}{\pi}$, leading to the particle evolution equation $\dot X_t = -\mathcal K_{\mu_t}\nabla\ln\frac{\mu_t}{\pi}(X_t)$.
Recalling that $\mathcal K_\mu f = \int K(\cdot, x)\,f(x)\,\mathrm d\mu(x)$, we get
$$\mathcal K_{\mu}\nabla\ln\frac{\mu}{\pi} = \int K(\cdot, x)\,\nabla\ln\frac{\mu}{\pi}(x)\,\mathrm d\mu(x) = \mathbb E_{X\sim\mu}\bigl[K(\cdot, X)\,\nabla V(X) - \nabla_2 K(\cdot, X)\bigr],$$
where, in the second identity, we used integration by parts. This expression shows that rather than having to estimate the distribution $\mu_t$, it is sufficient to estimate an expectation with respect to $\mu_t$. This is the key to the computational tractability of SVGD. Indeed, the kernelized gradient flow can be implemented by drawing $N$ particles $X_0[1], \dots, X_0[N] \sim \mu_0$ and following the coupled dynamics
$$\dot X_t[i] = -\frac1N \sum_{j=1}^N \bigl[K(X_t[i], X_t[j])\,\nabla V(X_t[j]) - \nabla_2 K(X_t[i], X_t[j])\bigr], \qquad i = 1, \dots, N.$$
With this, we can simply estimate the expectation with respect to $\mu_t$ with an average over all particles. Discretizing the resulting process in time with step size $h > 0$, we obtain the SVGD algorithm:
$$X_{k+1}[i] = X_k[i] - \frac{h}{N} \sum_{j=1}^N \bigl[K(X_k[i], X_k[j])\,\nabla V(X_k[j]) - \nabla_2 K(X_k[i], X_k[j])\bigr], \qquad i = 1, \dots, N.$$
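The discrete update above admits a very short implementation. The following numpy sketch uses an RBF kernel; the kernel choice, bandwidth, and step size are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def svgd_update(X, grad_V, h=0.1, bandwidth=1.0):
    """One SVGD step for particles X of shape (N, d), targeting pi ∝ exp(-V).

    Implements X[i] <- X[i] - (h/N) * sum_j [ K(X[i], X[j]) grad_V(X[j])
                                              - grad_2 K(X[i], X[j]) ]
    with the RBF kernel K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)).
    """
    diffs = X[:, None, :] - X[None, :, :]           # (N, N, d), entry [i, j] = X[i] - X[j]
    sq = np.sum(diffs**2, axis=-1)                  # squared pairwise distances
    K = np.exp(-sq / (2 * bandwidth**2))            # (N, N) kernel matrix
    # grad_2 K(x, y) = K(x, y) * (x - y) / bandwidth^2  (gradient w.r.t. y)
    grad2_K = K[:, :, None] * diffs / bandwidth**2  # (N, N, d)
    drift = K @ grad_V(X) - grad2_K.sum(axis=1)     # (N, d)
    return X - (h / len(X)) * drift
```

As a sanity check, with a single particle the update reduces to plain gradient descent on $V$, since $K(x,x) = 1$ and $\nabla_2 K(x,x) = 0$.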
2.3 SVGD as a kernelized gradient flow of the chi-squared divergence
Recall from Section 2.1 that, by the continuity equation, the SVGD particle evolution equation translates into the following PDE that describes the evolution of the distribution $\mu_t$ of $X_t$:
$$\partial_t\mu_t = \operatorname{div}\Bigl(\mu_t\,\mathcal K_{\mu_t}\nabla\ln\frac{\mu_t}{\pi}\Bigr).$$
We make the simple observation that
$$\mathcal K_{\mu_t}\nabla\ln\frac{\mu_t}{\pi} = \mathcal K_\pi\nabla\frac{\mu_t}{\pi}, \qquad\text{since}\qquad \mu_t\,\nabla\ln\frac{\mu_t}{\pi} = \pi\,\nabla\frac{\mu_t}{\pi}.$$
To interpret this equation, we recall that the Wasserstein gradient of the chi-squared divergence at $\mu$ is $2\,\nabla\frac{\mu}{\pi}$ (by (2.3)), so the gradient flow for the chi-squared divergence is
$$\partial_t\mu_t = 2\operatorname{div}\Bigl(\mu_t\,\nabla\frac{\mu_t}{\pi}\Bigr).$$
Hence, up to the factor $2$, SVGD may be viewed as a kernelized gradient flow of the chi-squared divergence, driven by the single integral operator $\mathcal K_\pi$.
3 Chi-squared gradient flow
In this section, we study the idealized case where $\mathcal K_\pi$ is taken to be the identity operator. In this case, the dynamics reduce to the gradient flow of the chi-squared divergence, $\partial_t\mu_t = 2\operatorname{div}(\mu_t\,\nabla\frac{\mu_t}{\pi})$. The existence, uniqueness, and regularity of this flow are studied in [ohta2011generalizedentropies, ohta2013generalizedentropiesii] and [ambrosio2008gradient, Theorem 11.2.1].
The rate of convergence of the gradient flow of the KL divergence is closely related to two functional inequalities: the Poincaré inequality controls the rate of exponential convergence in chi-squared divergence ([pavliotis2014stochastic, Theorem 4.4], [chewi2020exponential]), while a log-Sobolev inequality characterizes the rate of exponential convergence of the KL divergence [bakry2014markov, Theorem 5.2.1]. In this section, we show that these inequalities also guarantee exponential rates of convergence of the chi-squared gradient flow.
Recall that $\pi$ satisfies a Poincaré inequality with constant $C_P > 0$ if
$$\operatorname{var}_\pi(f) \le C_P\,\mathbb E_\pi\bigl[\|\nabla f\|^2\bigr] \qquad\text{for all locally Lipschitz } f \in L^2(\pi),$$
while $\pi$ satisfies a log-Sobolev inequality with constant $C_{\mathrm{LS}} > 0$ if
$$\operatorname{ent}_\pi(f^2) \le 2\,C_{\mathrm{LS}}\,\mathbb E_\pi\bigl[\|\nabla f\|^2\bigr]$$
for all locally Lipschitz $f$ for which $\operatorname{ent}_\pi(f^2) := \mathbb E_\pi\bigl[f^2 \ln\frac{f^2}{\mathbb E_\pi[f^2]}\bigr] < \infty$.
We briefly review some facts regarding the strength of these assumptions. It is standard that the log-Sobolev inequality is stronger than the Poincaré inequality: a log-Sobolev inequality with constant $C_{\mathrm{LS}}$ implies a Poincaré inequality with constant $C_P \le C_{\mathrm{LS}}$. In turn, if $\pi \propto e^{-V}$ is $\alpha$-strongly log-concave, i.e. $\nabla^2 V \succeq \alpha I$, then a log-Sobolev inequality holds with $C_{\mathrm{LS}} \le 1/\alpha$, and thus a Poincaré inequality holds too. However, a Poincaré inequality is in general much weaker than strong log-concavity. For instance, if
$\lambda_{\max}(\Sigma)$ denotes the largest eigenvalue of the covariance matrix $\Sigma$ of $\pi$, then it is currently known that $\pi$ satisfies a Poincaré inequality as soon as it is log-concave, with $C_P \le \psi_d\,\lambda_{\max}(\Sigma)$, where $\psi_d$ is a dimensional constant [bobkovIsoperimetricAnalyticInequalities1999, alonso2015kls, leevempala2017poincare], and the well-known Kannan-Lovász-Simonovitz (KLS) conjecture [kls1995] asserts that $\psi_d$ does not actually depend on the dimension.
Our first result shows that a Poincaré inequality suffices to establish exponential decay of the KL divergence along the chi-squared gradient flow. In fact, we establish a remarkable property, which we call strong uniform ergodicity: under a Poincaré inequality, the flow forgets its initial distribution after a time of no more than $C_P/2$. Uniform ergodicity is central in the theory of Markov processes [MeyTwe09, Ch. 16] but is often limited to compact state spaces. Moreover, this theory largely focuses on total variation, in which the distance from the initial distribution to the target distribution is trivially bounded.
where, in the last inequality, we use the fact that $\mathrm{KL}(\mu\|\pi) \le \chi^2(\mu\|\pi)$ (see [tsybakov2009nonparametric, §2.4]). The bound (3.1) follows by applying Grönwall's inequality.
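For the reader's convenience, the chain of (in)equalities behind this argument can be sketched as follows (a reconstruction from the definitions above, where $\mu_t$ follows the chi-squared gradient flow and $C_P$ denotes the Poincaré constant of $\pi$):

```latex
\partial_t \operatorname{KL}(\mu_t \,\|\, \pi)
  = \mathbb{E}_{\mu_t}\Bigl\langle \nabla \ln\frac{\mu_t}{\pi},\,
      -2\,\nabla\frac{\mu_t}{\pi} \Bigr\rangle
  = -2\,\mathbb{E}_{\pi}\Bigl[\Bigl\| \nabla\frac{\mu_t}{\pi} \Bigr\|^2\Bigr]
  \le -\frac{2}{C_P}\,\operatorname{var}_{\pi}\Bigl(\frac{\mu_t}{\pi}\Bigr)
  = -\frac{2}{C_P}\,\chi^2(\mu_t \,\|\, \pi)
  \le -\frac{2}{C_P}\,\operatorname{KL}(\mu_t \,\|\, \pi).
```

Grönwall's inequality then gives $\operatorname{KL}(\mu_t\|\pi) \le e^{-2t/C_P}\,\operatorname{KL}(\mu_0\|\pi)$.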
In [chewi2020exponential], it was observed that the chi-squared divergence decays exponentially fast along the gradient flow for the KL divergence, provided that $\pi$ satisfies a Poincaré inequality. This observation is made precise and more general in [matthes2009fourthorder], where it is noted that the gradient flow of a functional $\mathcal F$ dissipates a different functional $\mathcal G$ at the same rate that the gradient flow of $\mathcal G$ dissipates the functional $\mathcal F$. A similar method is used to study the thin film equation in [carrillo2002thinfilm] and [carlen2011functional, §5].
Since we are studying the gradient flow of the chi-squared divergence, it is natural to ask whether $\mu_t$ converges to $\pi$ in chi-squared divergence as well. In the next results, we show quantitative decay of the chi-squared divergence along the gradient flow under a Poincaré inequality, but we obtain only a polynomial rate of decay. However, if we additionally assume either that $\pi$ is log-concave or that it satisfies a log-Sobolev inequality, then we obtain exponential decay of the chi-squared divergence along the flow.
The proof is deferred to Appendix B. ∎
Under the stronger assumption of a log-Sobolev inequality, we can show strong uniform ergodicity as in Theorem 1.
The proof is deferred to Appendix B. ∎
Convergence in chi-squared divergence was studied in recent works such as [cao2019renyi, vempala2019langevin, chewi2020exponential]. From standard comparisons between information divergences (see [tsybakov2009nonparametric, §2.4]), it implies convergence in total variation distance, Hellinger distance, and KL divergence. Moreover, recent works have shown that the Poincaré inequality yields transportation-cost inequalities which bound the 2-Wasserstein distance by powers of the chi-squared divergence [ding2015quadratictransport, ledoux2018remarks, chewi2020exponential, liu2020quadratictransport], so we obtain convergence in the 2-Wasserstein distance as well. In particular, we mention that [chewi2020exponential] uses the chi-squared gradient flow to prove a transportation-cost inequality.
4 Laplacian Adjusted Wasserstein Gradient Descent (LAWGD)
While the previous section leads to a better understanding of the convergence properties of SVGD in the case where $\mathcal K_\pi$ is the identity operator, it is still unclear how to choose the kernel $K$ to approach this idealized setup. For $\mathcal K_\pi$ associated with a general kernel $K$, the calculation rules of Section 2.1 together with the method of the previous section yield the formula
$$\partial_t\,\mathrm{KL}(\mu_t\|\pi) = -\,\mathbb E_\pi\Bigl\langle \nabla\frac{\mu_t}{\pi},\; \mathcal K_\pi\nabla\frac{\mu_t}{\pi}\Bigr\rangle.$$
To proceed as before, one would then need an inequality of the form
$$\mathbb E_\pi\langle \nabla f, \mathcal K_\pi\nabla f\rangle \ge c\,\mathbb E_\pi\bigl[\|\nabla f\|^2\bigr] \qquad\text{for some } c > 0. \tag{4.1}$$
Applying this inequality to each coordinate of $\nabla\frac{\mu_t}{\pi}$ separately and using a Poincaré inequality would then allow us to conclude as in the proof of Theorem 1. The inequality (4.1) can be interpreted as a positive lower bound on the smallest eigenvalue of the operator $\mathcal K_\pi$. However, this approach is doomed to fail: under mild conditions on the kernel $K$, it is a standard fact that the eigenvalues of $\mathcal K_\pi$ form a sequence converging to $0$, so no such spectral gap can hold. (It is enough that $K$ is a symmetric kernel with $\iint K^2 \,\mathrm d(\pi\otimes\pi) < \infty$, and that $\pi$ is not discrete, so that $L^2(\pi)$ is infinite-dimensional; see [bakry2014markov, Appendix A.6].)
This suggests that any approach which seeks to prove finite-time convergence results in the spirit of Theorem 1
must exploit finer properties of the eigenspaces of the operator $\mathcal K_\pi$. Motivated by this observation, we develop a new algorithm called Laplacian Adjusted Wasserstein Gradient Descent (LAWGD), in which the kernel is chosen carefully so that $\mathcal K_\pi$ is the inverse of the generator of the Langevin diffusion that has $\pi$ as its invariant measure.
More precisely, the starting point for our approach is the following integration-by-parts formula, which is a crucial component of the theory of Markov semigroups [bakry2014markov]:
$$\mathbb E_\pi\langle\nabla f, \nabla g\rangle = \mathbb E_\pi[f\,\mathcal L g], \tag{4.2}$$
where $\mathcal L g := \langle\nabla V, \nabla g\rangle - \Delta g$. The operator $\mathcal L$ is the (negative) generator of the standard Langevin diffusion with stationary distribution $\pi$ [pavliotis2014stochastic, §4.5]. We refer readers to Appendix A for background on the spectral theory of $\mathcal L$.
In order to use (4.2), we replace the vector field $\mathcal K_\pi\nabla\frac{\mu_t}{\pi}$ by the vector field $\nabla\mathcal K_\pi\frac{\mu_t}{\pi}$. The new dynamics follow the evolution equation
$$\partial_t\mu_t = \operatorname{div}\Bigl(\mu_t\,\nabla\mathcal K_\pi\frac{\mu_t}{\pi}\Bigr).$$
The vector field in the above continuity equation may also be written
$$\nabla\mathcal K_\pi\frac{\mu_t}{\pi} = \int \nabla_1 K(\cdot, x)\,\frac{\mu_t}{\pi}(x)\,\mathrm d\pi(x) = \mathbb E_{X\sim\mu_t}\bigl[\nabla_1 K(\cdot, X)\bigr].$$
Replacing the expectation over $\mu_t$ by an empirical average over particles and discretizing the process in time, we again obtain an implementable algorithm, which we give as Algorithm 1.
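In particle form, this update needs only the gradient of the kernel in its first argument. The numpy sketch below assumes such a routine (`grad1_K`, a hypothetical user-supplied function, e.g. obtained from the spectral construction discussed next) and implements one discretized step.

```python
import numpy as np

def lawgd_update(X, grad1_K, h=0.1):
    """One LAWGD step: X[i] <- X[i] - (h/N) * sum_j grad1_K(X[i], X[j]).

    X has shape (N, d); grad1_K(x, y) returns the gradient of the kernel
    w.r.t. its first argument and must broadcast to shape (N, N, d).
    Note: unlike SVGD, the potential V enters only through the kernel.
    """
    G = grad1_K(X[:, None, :], X[None, :, :])   # (N, N, d), entry [i, j] = grad1_K(X[i], X[j])
    return X - (h / len(X)) * G.sum(axis=1)
```

This is only an interface sketch: the quality of the sampler is entirely determined by the kernel passed in as `grad1_K`.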
A careful inspection of Algorithm 1 reveals that the update equation for the particles does not involve the potential $V$ directly, unlike the SVGD algorithm; thus, the kernel for LAWGD must contain all the information about $\pi$.
Our choice for the kernel is guided by the following observation (based on (4.2)): along the dynamics above,
$$\partial_t\,\mathrm{KL}(\mu_t\|\pi) = -\,\mathbb E_\pi\Bigl\langle\nabla\frac{\mu_t}{\pi},\,\nabla\mathcal K_\pi\frac{\mu_t}{\pi}\Bigr\rangle = -\,\mathbb E_\pi\Bigl[\frac{\mu_t}{\pi}\,\mathcal L\mathcal K_\pi\frac{\mu_t}{\pi}\Bigr]. \tag{4.3}$$
As a result, we choose the kernel to ensure that $\mathcal K_\pi = \mathcal L^{-1}$. This choice yields
$$\partial_t\,\mathrm{KL}(\mu_t\|\pi) = -\,\chi^2(\mu_t\|\pi),$$
where we used the fact that $\mathcal L^{-1}$ is defined on mean-zero functions, so that $\mathcal L\mathcal K_\pi f = f - \mathbb E_\pi f$, together with $\mathbb E_\pi\bigl[\frac{\mu_t}{\pi}\bigl(\frac{\mu_t}{\pi} - 1\bigr)\bigr] = \chi^2(\mu_t\|\pi)$.
It remains to see which kernel implements $\mathcal K_\pi = \mathcal L^{-1}$. To that end, assume that $\mathcal L$ has a discrete spectrum and let $(\lambda_i, \varphi_i)$, $i \ge 0$,
be its eigenvalue-eigenfunction pairs, where the eigenvalues $(\lambda_i)_{i \ge 0}$ are arranged in nondecreasing order (with $\lambda_0 = 0$ and $\varphi_0 \equiv 1$). Assume further that $\lambda_1 > 0$ (which amounts to a Poincaré inequality; see Appendix A) and define the following spectral kernel:
$$K_\pi(x, y) := \sum_{i \ge 1} \frac{\varphi_i(x)\,\varphi_i(y)}{\lambda_i}. \tag{4.4}$$
We now show that this choice of kernel endows LAWGD with a remarkable property: it converges to the target distribution exponentially fast, with a rate that has no dependence on the Poincaré constant. Moreover, akin to the chi-squared gradient flow (see (3.2)), it also exhibits strong uniform ergodicity.
The convergence rate in Theorem 4 has no dependence on the target measure. This scale-invariant convergence also appears in [chewi2020exponential], where it is shown for the Newton-Langevin diffusion with a strictly log-concave target measure. In Theorem 4, we obtain similar guarantees under the much weaker assumption of a Poincaré inequality; indeed, there are many examples of non-log-concave distributions which satisfy a Poincaré inequality [vempala2019langevin].
To implement Algorithm 1, we numerically approximate the kernel $K_\pi$ given in (4.4). When $\pi$ is the standard Gaussian distribution, the eigendecomposition of the operator $\mathcal L$ in (4.2) is known explicitly in terms of the Hermite polynomials [bakry2014markov, §2.7.1], and we approximate the kernel via a truncated sum (Figure 2) involving the smallest eigenvalues of $\mathcal L$.
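For the one-dimensional standard Gaussian, this truncated kernel can be written down directly: the probabilists' Hermite polynomials $\mathrm{He}_i$ satisfy $\mathcal L\,\mathrm{He}_i = i\,\mathrm{He}_i$ with $\mathbb E_\pi[\mathrm{He}_i^2] = i!$, so the normalized eigenfunctions are $\varphi_i = \mathrm{He}_i/\sqrt{i!}$ with eigenvalues $\lambda_i = i$. The sketch below is illustrative; in particular, the truncation level `m` is an assumption, not a value from the text.

```python
import numpy as np
from numpy.polynomial import hermite_e as He  # probabilists' Hermite polynomials He_n
from math import factorial

def gaussian_spectral_kernel(x, y, m=10):
    """Truncated LAWGD kernel (4.4) for pi = N(0, 1):
    K_pi(x, y) ≈ sum_{i=1}^{m} phi_i(x) phi_i(y) / lambda_i,
    with phi_i = He_i / sqrt(i!) and lambda_i = i."""
    total = np.zeros(np.broadcast(x, y).shape)
    for i in range(1, m + 1):
        c = np.zeros(i + 1)
        c[i] = 1.0                                  # coefficient vector selecting He_i
        phi_x = He.hermeval(x, c) / np.sqrt(factorial(i))
        phi_y = He.hermeval(y, c) / np.sqrt(factorial(i))
        total += phi_x * phi_y / i
    return total
```

By orthonormality of the $\varphi_i$ under $\pi$, this truncated kernel acts as $\mathcal L^{-1}$ on the span of the first `m` non-constant eigenfunctions.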
In the general case, we implement a basic finite difference (FD) method to approximate the eigenvalues and eigenfunctions of $\mathcal L$. We obtain better numerical results by first transforming the operator into the Schrödinger operator $\mathcal H := -\Delta + W$, where $W := \frac{\|\nabla V\|^2}{4} - \frac{\Delta V}{2}$. If $\psi$ is an eigenfunction of $\mathcal H$ with eigenvalue $\lambda$ (normalized such that $\int \psi^2 = 1$), then $\varphi := e^{V/2}\psi$ is an eigenfunction of $\mathcal L$, also with eigenvalue $\lambda$ (and normalized such that $\int \varphi^2 \,\mathrm d\pi = 1$); see [bakry2014markov, §1.15.7].
On a grid of points (with spacing $h$), if we replace the Laplacian with the FD operator $(\Delta_h f)(x) := \frac{f(x+h) - 2f(x) + f(x-h)}{h^2}$ (in 1D), then the FD Schrödinger operator
can be represented as a sparse matrix, and its eigenvalues and (unit) eigenvectors are found with standard linear algebra solvers.
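A minimal 1D illustration of this finite-difference construction, for the standard Gaussian target ($V(x) = x^2/2$ up to an additive constant, so $W(x) = x^2/4 - 1/2$); the grid size and domain below are illustrative choices. In this case the exact eigenvalues of $\mathcal L$ are $0, 1, 2, \dots$, which the discretization should approximately recover.

```python
import numpy as np

def fd_schrodinger_eigs(V_grad, V_lap, grid):
    """Eigendecomposition of the FD Schrödinger operator
    H = -Delta_h + W, with W = |V'|^2 / 4 - V'' / 2,
    on a uniform grid with Dirichlet boundary conditions."""
    h = grid[1] - grid[0]
    W = V_grad(grid)**2 / 4.0 - V_lap(grid) / 2.0
    n = len(grid)
    H = (np.diag(np.full(n, 2.0 / h**2) + W)          # -Delta_h main diagonal + potential
         - np.diag(np.full(n - 1, 1.0 / h**2), k=1)   # off-diagonals of -Delta_h
         - np.diag(np.full(n - 1, 1.0 / h**2), k=-1))
    return np.linalg.eigh(H)  # eigenvalues in ascending order, orthonormal eigenvectors

# Standard Gaussian: V'(x) = x, V''(x) = 1
grid = np.linspace(-8.0, 8.0, 400)
evals, evecs = fd_schrodinger_eigs(lambda x: x, lambda x: np.ones_like(x), grid)
```

For large grids, the same tridiagonal matrix can be stored sparsely and only the smallest eigenpairs computed with an iterative solver.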
When the potential $V$ is known only up to an additive constant, the approximate eigenfunctions produced by this method are not normalized correctly; instead, they satisfy $\int \varphi_i^2 \,\mathrm d\pi = c$ for some constant $c$ (which is the same for each eigenfunction). In turn, this causes the kernel in (4.4) to be off by a multiplicative constant. For implementation purposes, however, this constant is absorbed into the step size of Algorithm 1. We also note that the eigenfunctions are differentiated using a FD approximation.
To demonstrate, we sample from a mixture of three Gaussians. We compare LAWGD with SVGD using the RBF kernel and median-based bandwidth as in [liu2016stein]. We approximate the eigenfunctions and eigenvalues using a finite difference scheme on 256 evenly spaced grid points, and all 256 discrete eigenfunctions and eigenvalues are used. Constant step sizes for LAWGD and SVGD are tuned, the algorithms are run for 5000 iterations, and the samples are initialized uniformly. The results are displayed in Figure 3.
6 Open questions
We conclude this paper with some interesting open questions. The introduction of the chi-squared divergence as an objective function allows us to obtain both theoretical insights about SVGD and a new algorithm, LAWGD. This perspective opens the possibility of identifying other functionals over Wasserstein space whose gradient flows are amenable to mathematical analysis and efficient computation. Toward this goal, an intriguing direction is to develop alternative methods, besides kernelization, which provide effective implementations of Wasserstein gradient flows. Finally, we note that LAWGD provides a hitherto unexplored connection between sampling and computing the spectral decomposition of the Schrödinger operator, the latter of which has been intensively studied in numerical PDEs. We hope our work further stimulates research at the intersection of these communities.
Philippe Rigollet was supported by NSF awards IIS-1838071, DMS-1712596, DMS-TRIPODS-1740751, and ONR grant N00014-17-1-2147. Sinho Chewi and Austin J. Stromme were supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program. Thibaut Le Gouic was supported by ONR grant N00014-17-1-2147 and NSF IIS-1838071.
Appendix A Review of spectral theory
In this paper, we consider elliptic differential operators of the form $\mathcal L f = \langle\nabla V, \nabla f\rangle - \Delta f$, where $V$ is a continuously differentiable potential. In this section, we provide a brief review of the spectral theory of these operators, and we refer to [evans2010pde, §6.5] for a standard treatment.
The operator $\mathcal L$ (when suitably interpreted) is a linear operator defined on a domain $\mathcal D(\mathcal L) \subseteq L^2(\pi)$. For any locally Lipschitz function $f \in \mathcal D(\mathcal L)$, integration by parts shows that
$$\mathbb E_\pi[f\,\mathcal L f] = \mathbb E_\pi\bigl[\|\nabla f\|^2\bigr] \ge 0.$$
Therefore, $\mathcal L$ has a non-negative spectrum. Also, we have $\mathcal L\mathbf 1 = 0$, so that $0$ is always an eigenvalue of $\mathcal L$. We say that $\mathcal L$ has a discrete spectrum if it has a countable sequence of eigenvalues $(\lambda_i)_{i \ge 0}$ and corresponding eigenfunctions $(\varphi_i)_{i \ge 0}$ which form a basis of $L^2(\pi)$. The eigenfunctions can be chosen to be orthogonal and normalized such that $\mathbb E_\pi[\varphi_i^2] = 1$; we always assume this is the case. Then, $\mathcal L$ can be expressed as
$$\mathcal L f = \sum_{i \ge 0} \lambda_i\,\langle f, \varphi_i\rangle_{L^2(\pi)}\,\varphi_i.$$
The operator $\mathcal L$ has a discrete spectrum under the following condition ([friedrichsSpektraltheorieHalbbeschrankterOperatoren1934], [reedsimon1978vol4, Theorem XIII.67], [bakry2014markov, Corollary 4.10.9]): $W(x) \to \infty$ as $\|x\| \to \infty$,
where $W := \frac{\|\nabla V\|^2}{4} - \frac{\Delta V}{2}$. Moreover, under this condition we also have $\lambda_i \to \infty$ as $i \to \infty$. For example, this condition is satisfied for $V(x) = \|x\|^\alpha$ with $\alpha > 1$, but not for $\alpha = 1$. In fact, for $V(x) = \|x\|$, the spectrum of $\mathcal L$ is not discrete [bakry2014markov, §4.1.1].
Appendix B Proofs of the convergence guarantees for the chi-squared flow
In this section, we give proofs of the convergence results we stated in Section 3.
Proof of Theorem 2 (non-log-concave case).
According to [chewi2020exponential, Proposition 1], the Poincaré inequality implies the following inequality for the chi-squared divergence:
Since the Wasserstein gradient of $\chi^2(\cdot\,\|\,\pi)$ at $\mu$ is given by $2\,\nabla\frac{\mu}{\pi}$ (see Section 2.1), it yields
Solving the above differential inequality yields
which implies the desired result. ∎
We now prepare for the proof of exponentially fast convergence in chi-squared divergence for log-concave measures. The key to proving such results lies in differential inequalities of the form
$$\mathbb E_\mu\Bigl[\Bigl\|\nabla\frac{\mu}{\pi}\Bigr\|^2\Bigr] \ge c\,\chi^2(\mu\|\pi) \qquad\text{for some constant } c > 0, \tag{B.2}$$
which may be interpreted as a Polyak-Łojasiewicz (PL) inequality [karimi2016linear] for $\chi^2(\cdot\,\|\,\pi)$. PL inequalities are well known in the optimization literature, and can be used even when the objective is not convex [CheMauRig20]. In contrast, the preceding proof uses the weaker inequality (B.1), which may be interpreted as a Łojasiewicz inequality [lojasiewicz1963propriete].
To see that a PL inequality readily yields exponential convergence, observe that the gradient flow dissipates the chi-squared divergence at the rate
$$\partial_t\,\chi^2(\mu_t\|\pi) = -\,\mathbb E_{\mu_t}\bigl[\|\nabla_{W_2}\chi^2(\mu_t\|\pi)\|^2\bigr] = -4\,\mathbb E_{\mu_t}\Bigl[\Bigl\|\nabla\frac{\mu_t}{\pi}\Bigr\|^2\Bigr].$$
Together with Grönwall's inequality, a PL inequality with constant $c$ then yields the exponential decay $\chi^2(\mu_t\|\pi) \le e^{-4ct}\,\chi^2(\mu_0\|\pi)$.
In order to prove a PL inequality of the type (B.2), we require two ingredients. The first one is a transportation-cost inequality for the chi-squared divergence proven in [liu2020quadratictransport], building on the works [ding2015quadratictransport, ledoux2018remarks]. It asserts that if $\pi$ satisfies a Poincaré inequality, then the following inequality holds:
For the second ingredient, we use an argument of [otto2000generalization] to show that if satisfies a chi-squared transportation-cost inequality such as (B.3), and in addition is log-concave, then it satisfies an inequality of the type (B.2). We remark that the converse statement, that is, if satisfies a PL inequality (B.2) then it satisfies an appropriate chi-squared transportation-cost inequality, was proven in [chewi2020exponential] without the additional assumption of log-concavity. It implies that for log-concave distributions, the PL inequality (B.2) and the chi-squared transportation-cost inequality (B.3) are, in fact, equivalent.
Let $\pi$ be log-concave, and assume that for some exponent and a constant,
where satisfies .
Following [otto2000generalization], let $T$ be the optimal transport map from $\mu$ to $\pi$. Since $\chi^2(\cdot\,\|\,\pi)$ is displacement convex [ohta2011generalizedentropies, ohta2013generalizedentropiesii] and has Wasserstein gradient $2\,\nabla\frac{\mu}{\pi}$ at $\mu$ (cf. Section 2.1), the "above-tangent" formulation of displacement convexity ([villani2003topics, Proposition 5.29]) yields
where we used the Cauchy-Schwarz inequality in the last step. Rearranging the above display and using the transportation-cost inequality assumed in the statement of the theorem, we get
The result follows by rearranging the terms. ∎
Proof of Theorem 2 (log-concave case).
We conclude this section with the proof of Theorem 3, which shows exponential convergence of the chi-squared gradient flow in chi-squared divergence under the assumption of a log-Sobolev inequality (but without the assumption of log-concavity).