Accelerated First-order Methods on the Wasserstein Space for Bayesian Inference

07/04/2018 · by Chang Liu, et al.

We consider performing Bayesian inference by minimizing the KL divergence on the 2-Wasserstein space P_2. By exploiting the Riemannian structure of P_2, we develop two inference methods that simulate the gradient flow on P_2 via updating particles, and an acceleration method that speeds up all such particle-simulation-based inference methods. Moreover, we analyze the approximation flexibility of such methods and devise a novel bandwidth selection method for the kernel that they use. We note that P_2 is abstract and general enough for our methods to make a closer approximation, while it still has a rich structure that enables practical implementation. Experiments show the effectiveness of the two proposed methods and the improvement in convergence brought by the acceleration method.


1 Introduction

Bayesian inference is an important topic in the machine learning community for its power in modeling and reasoning about uncertainty in data. Its task is to access the posterior distribution of latent variables given data under a Bayesian model. As the posterior is intractable for general Bayesian models, various approximation methods have been developed. Variational inference methods (VIs) approximate the posterior by a tractable distribution q that minimizes the Kullback-Leibler divergence (KLD) to the posterior. A parametric distribution family is typically chosen for tractability (e.g. wainwright2008graphical; hoffman2013stochastic), and the problem can then be solved efficiently by classical optimization methods. But this restricts the flexibility of the approximation, so the closeness to the posterior suffers. Markov chain Monte Carlo methods (MCMCs) (e.g. geman1987stochastic; neal2011mcmc; welling2011bayesian; ding2014bayesian) aim to draw samples directly from the posterior. Although they are asymptotically accurate, a large sample size is required due to the autocorrelation between samples, which incurs a large cost at test time.

Recently, methods have appeared that minimize the KLD on distribution spaces more abstract and general than parametric families. Algorithmically, such methods iteratively update a set of samples, or particles, so that the set gradually becomes representative of the posterior. They can achieve a better approximation than classical VIs because of the greater flexibility of particles over parametric forms, and they are more particle-efficient than MCMCs, in the sense that a better approximation can be achieved with the same sample size, since they take particle interaction into account and focus on finite-particle performance. Stein Variational Gradient Descent (SVGD) liu2016stein is a remarkable instance. It updates particles along the gradient flow (GF, the steepest descending curves) of the KLD on a distribution manifold whose tangent space is the reproducing kernel Hilbert space of a kernel liu2017stein. Its unique benefits make it a popular method: it has been modified for Riemannian support spaces liu2017riemannian and structured posteriors zhuo2017message, and applied to deep generative models wang2016learning; pu2017vae and Bayesian reinforcement learning liu2017steinpolicy; haarnoja2017reinforcement. Another instance is the particle optimization method (PO) chen2017particle, where particles are updated by solving an optimization problem known as the variational formulation of the Langevin dynamics jordan1998variational. The algorithm resembles the Polyak's-momentum polyak1964some version of SVGD. We call such methods particle-simulation-based variational inference methods (PSVIs) to highlight the difference from VIs and MCMCs.

In this work, we propose to minimize the KLD on the 2-Wasserstein space P_2, which is general enough to contain all distributions with finite second-order moment, while still tractable for computation thanks to its Riemannian structure. We contribute to the field of PSVI in three folds: (i) We make a theoretical analysis of the flexibility of PSVIs and find that the particle update rule can be expressed by either a density function or an optimization problem over a family of test functions, and PSVIs have flexibility up to either a smoothed density or a smoothed test-function family. Typically a kernel is used to satisfy this restriction (Section 3.1). (ii) We propose two PSVIs, Gradient Flow with Smoothed Density / test Functions (GFSD/GFSF), by simulating the GF of the KLD on P_2 (Section 3.2). We also develop a principled bandwidth selection method (Section 3.2.3) for them based on their relation to the Langevin dynamics (Remark 4). (iii) We develop an acceleration method for PSVIs based on the Riemannian version liu2017accelerated of Nesterov's acceleration method nesterov1983method, by studying the exponential map and the parallel transport on P_2 (Section 4). The effectiveness of GFSD/GFSF and the acceleration effect are observed in experiments.

Related work   As mentioned, SVGD simulates the GF on an RKHS-related distribution space, while ours is on P_2. We note in Remark 3 that P_2 is a more natural and well-defined manifold of distributions, whereas on the space used by SVGD the existence of a tangent vector for every smooth curve is not guaranteed. Moreover, P_2 provides the essential ingredients for developing the bandwidth selection method as well as the acceleration method. Algorithmically, GFSD/GFSF do not average gradients over particles, enabling every particle to find the high-probability region near it more efficiently in the early stage.

The idea of the PO method is essentially to simulate the GF of the KLD on P_2 as well, but its simulation is based on a discretization using the minimal movement scheme (MMS) (ambrosio2008gradient, Def 2.0.6) on P_2, while ours is based on simulating the associated dynamics on the support space (Section 2.1), which has an intuitive physical picture and is easy to implement. To find the desired tangent vector on P_2, the PO method adds noise to the SVGD tangent vector, which is unnatural and may affect convergence. Although the PO algorithm takes the form of Polyak's momentum acceleration polyak1964some of SVGD for each particle, further theoretical interpretation is required for it to be an actual acceleration method, since SVGD is not an optimization method for each particle. Additionally, Nesterov's acceleration is known to be more stable than Polyak's momentum sutskever2013importance.

2 Preliminaries

In this part, we briefly introduce the Wasserstein space as a Riemannian manifold and the gradient flow on it, as well as the theory of SVGD. We use bold symbols to represent probability measures and regular ones for the corresponding density functions when they exist.

2.1 P_2 as a Riemannian Manifold

The concept of the Wasserstein space arises from the optimal transport problem. Let X be the support space, d the typical Euclidean distance on it, and P(X) the space of probability measures on X. We define W_2(μ, ν)² := inf_{π ∈ Π(μ,ν)} ∫ d(x, y)² dπ(x, y), where Π(μ,ν) is the space of joint probability measures with μ and ν as the marginals. It is well known that the space P_2(X) of measures with finite second-order moment, equipped with W_2, is a metric space (villani2008optimal, Def 6.4), which is called the 2-Wasserstein space, and W_2 is called the 2-Wasserstein distance. We use P_2 to denote P_2(X) for brevity. More precisely, P_2 is a length space (villani2008optimal, Thm 7.21), which inspires the exploration of possible Riemannian structures on P_2 (e.g. otto2001geometry). This is an appealing task since a Riemannian structure provides rich ingredients that enable explicit calculation, and thus practical implementation, of quantities of interest, e.g. the gradient of a function on P_2.

In the sense of a Riemannian manifold, a smooth curve defines a tangent vector at each point on it as the differentiation along the curve (e.g. do1992riemannian, Def 2.6). Specifically, an absolutely continuous (AC) curve (μ_t)_t on P_2 (roughly a "smooth" curve) defines a tangent vector at μ_t as the operator f ↦ (d/dt) ∫ f dμ_t acting on compactly supported smooth functions f ∈ C_c^∞(X). All such operators form a vector space: the tangent space at μ_t. On the other hand, another notion of differentiation is the weak derivative ∂_t μ_t, the generalization of the usual derivative through the rule of integration by parts. The two notions are related by viewing the tangent vector as a weak derivative. The weak derivative description leads to the vector field representation of tangent vectors on P_2:

Lemma 1 (Continuity Equation (villani2008optimal , Thm 13.8; ambrosio2008gradient , Thm 8.3.1, Prop 8.4.5)).

Denote the Lebesgue measure for the time variable as λ. For any AC curve (μ_t)_t on P_2, there exists a time-dependent Borel vector field v_t on X such that for λ-a.e. t, v_t ∈ cl{∇f : f ∈ C_c^∞(X)} (where the closure is taken in L²(μ_t)), and the continuity equation ∂_t μ_t + ∇·(v_t μ_t) = 0 holds in the weak sense. Moreover, such a v_t is μ_t-a.e. unique.

We write v_t ≃ ∂_t μ_t (and similarly for other representations) if both notions refer to the same tangent vector on P_2 from their own perspectives. Furthermore, the above space of vector fields v_t is recognized as the tangent space at μ_t (ambrosio2008gradient, Def 8.4.1). With the inner product ⟨u, v⟩_μ := ∫ u · v dμ on the tangent space, the Riemannian distance recovers the metric-space-sense distance due to the Benamou-Brenier formula benamou2000computational: W_2(μ_0, μ_1)² = inf { ∫_0^1 ‖v_t‖²_{L²(μ_t)} dt : ∂_t μ_t + ∇·(v_t μ_t) = 0 }. Now P_2 can be treated as a Riemannian manifold.

The vector field representation has a solid physical intuition. Consider a set of particles {x_t^{(i)}}_i at time t that distributes as μ_t. Let the i-th particle move with velocity v_t(x_t^{(i)}) at every moment t. If v_t ≃ ∂_t μ_t, then the particles will distribute as μ_{t'} at any later time t'. This intuition can be formally stated:

Lemma 2 (Simulating a curve on P_2 (ambrosio2008gradient, Prop 8.1.8)).

Let (μ_t)_t be an AC curve of AC measures (wrt the Lebesgue measure on X, if not specified) with tangent vector fields v_t. Then for λ-a.e. s, there exists a globally defined differentiable map T_t: X → X such that T_s = id, ∂_t T_t = v_t ∘ T_t, and μ_t = (T_t)_# μ_s, which means μ_t(B) = μ_s(T_t^{-1}(B)) for any Borel subset B of X.

2.2 Gradient Flows on P_2

There are various ways to define the gradient flow (GF) of a function on metric spaces (e.g. ambrosio2008gradient, Def 11.1.1; villani2008optimal, Def 23.7) following the intuition of steepest descending curves, and they all coincide (e.g. villani2008optimal, Prop 23.1, Rem 23.4) on Riemannian manifolds. In particular, the GF defined by the MMS, which the PO method uses, is equivalent to the GF on Riemannian manifolds (ambrosio2008gradient, Thm 11.1.6; erbar2010heat, Lemma 2.7). Now consider the Kullback-Leibler divergence (KLD) (or relative entropy in some literature) that we want to minimize in this paper:

KL(q ‖ p) = ∫ log(dq/dp) dq,

where dq/dp is the Radon-Nikodym derivative. A curve (q_t)_t of the GF of the KLD is characterized (villani2008optimal, Thm 23.18; ambrosio2008gradient, Example 11.1.2) by the tangent vector field v_t = −∇ log(dq_t/dp), i.e. v_t = ∇ log p − ∇ log q_t when densities exist. Note that the GF of the KLD cannot be defined at measures that are not AC wrt p. When KL(·‖p) is geodesically λ-convex on P_2 with λ > 0 (e.g. when p is λ-log-concave on X (villani2008optimal, Thm 17.15)), the GF enjoys exponential convergence (villani2008optimal, Thm 23.25, Thm 24.7; ambrosio2008gradient, Thm 11.1.4), as expected. We write the KLD interchangeably in terms of measures or their densities when the measures are AC.
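For intuition, the tangent vector field above can be recovered from the first variation of the KLD; a standard sketch of the computation, assuming densities exist:

\[
\frac{\delta}{\delta q}\,\mathrm{KL}(q\,\|\,p) = \log\frac{q}{p} + 1,
\qquad
v = -\nabla\,\frac{\delta\,\mathrm{KL}}{\delta q}
  = -\nabla\log\frac{q}{p}
  = \nabla\log p - \nabla\log q,
\]
\[
\partial_t q_t + \nabla\cdot(q_t\, v_t) = 0 .
\]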

2.3 Stein Variational Gradient Descent (SVGD)

SVGD liu2016stein is a Bayesian inference method that updates particles by a proper vector field φ so that the distribution q of the particles (assumed AC) approaches the posterior p (typically AC) as fast as possible in terms of the KLD. Specifically, φ is found by maximizing the instantaneous decreasing rate of KL(q‖p) when the particles move with velocity φ, where the varying density of the moving particles is tracked by the continuity equation. By choosing φ from (the product space of) the reproducing kernel Hilbert space (RKHS) of a kernel K, the optimal φ can be solved explicitly: φ*(·) ∝ E_{x∼q}[ K(x, ·) ∇_x log p(x) + ∇_x K(x, ·) ]. The samples are then updated along φ*, which is estimated by averaging over the samples.
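The closed-form direction above is straightforward to implement with a finite particle set. Below is a minimal NumPy sketch of the SVGD update with an RBF kernel and a fixed bandwidth h; the function names and the toy target are ours, not from the paper.

import numpy as np

def rbf_kernel(X, h):
    """RBF kernel matrix and its gradients w.r.t. the first argument."""
    diff = X[:, None, :] - X[None, :, :]           # (N, N, D)
    sq = np.sum(diff ** 2, axis=-1)                # (N, N)
    K = np.exp(-sq / (2.0 * h ** 2))               # k(x_i, x_j)
    gradK = -diff / (h ** 2) * K[:, :, None]       # grad_{x_i} k(x_i, x_j), (N, N, D)
    return K, gradK

def svgd_direction(X, grad_logp, h):
    """phi(x_j) = 1/N sum_i [ k(x_i, x_j) grad log p(x_i) + grad_{x_i} k(x_i, x_j) ]."""
    K, gradK = rbf_kernel(X, h)
    G = grad_logp(X)                               # (N, D) score of the target at each particle
    return (K @ G + gradK.sum(axis=0)) / X.shape[0]

# Toy usage: standard-normal target, a few gradient steps on the particles.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(loc=3.0, size=(50, 2))          # particles
    grad_logp = lambda X: -X                       # score of N(0, I)
    for _ in range(200):
        X = X + 0.1 * svgd_direction(X, grad_logp, h=1.0)
    print(X.mean(axis=0), X.var(axis=0))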

The SVGD vector field is then interpreted as the tangent vector of the GF of the KLD on the H-Wasserstein space P_H liu2017stein, whose tangent space is taken to be induced by the RKHS H of the kernel (in the weak-derivative description; in the vector-field description, a corresponding quotient space of vector fields). However, we note that this may not be a natural tangent space:

Remark 3.

Lemma 1 guarantees, for any AC curve on P_2 that passes through a point μ, the unique existence of a tangent vector of the curve at μ in the tangent space of P_2. There is no such guarantee in P_H. Therefore tangent vectors in P_H may not be able to exactly fit a GF on P_H at every point.

3 Bayesian Inference via Gradient Flow on P_2

We aim to do Bayesian inference by simulating the gradient flow (GF) of KL(·‖p) on P_2, which approaches p, the target posterior distribution. For typical inference tasks p is AC, and thus so is any measure q_t on the GF, since it has to be AC wrt p, as mentioned. Therefore, we can use density functions to write the vector field form of the GF of the KLD introduced in Section 2.2:

v_t(x) = ∇ log p(x) − ∇ log q_t(x).

The simulation of the GF follows the physical intuition detailed in Section 2.1, i.e. update samples {x^{(i)}}_i of q_t by x^{(i)} ← x^{(i)} + ε v_t(x^{(i)}) for a small step size ε; the updated particles then form (approximately) a set of samples of q_{t+ε}.
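In code, this simulation is just a forward-Euler loop over the particles; a minimal sketch, assuming some estimator of the vector field at the particles is available (the estimators of Sections 2.3 and 3.2 all fit this interface; names are ours):

import numpy as np

def simulate_gradient_flow(X0, vector_field, step_size=0.1, n_iters=500):
    """Forward-Euler simulation of a Wasserstein gradient flow on particles.

    X0            : (N, D) initial particles.
    vector_field  : callable X -> (N, D), an estimate of v_t at the particles
                    (e.g. the SVGD, GFSD, or GFSF estimates sketched elsewhere).
    """
    X = np.array(X0, dtype=float)
    for _ in range(n_iters):
        X = X + step_size * vector_field(X)   # x_i <- x_i + eps * v(x_i)
    return X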

Remark 4.

We note that the deterministic dynamics dx_t = (∇ log p(x_t) − ∇ log q_t(x_t)) dt, with v_t as above, produces the same evolution rule for the probability density q_t of x_t as the Langevin dynamics dx_t = ∇ log p(x_t) dt + √2 dB_t, where B_t is the Brownian motion, due to the Fokker-Planck equation. This matches the known recognition of the Langevin dynamics as the GF of the KLD on P_2 (e.g. jordan1998variational).
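For comparison, here is a minimal sketch of the corresponding (unadjusted) Langevin discretization, which realizes the same density evolution stochastically rather than deterministically; the step size and target are illustrative:

import numpy as np

def langevin_step(X, grad_logp, step_size, rng):
    """One unadjusted Langevin step: x <- x + eps * grad log p(x) + sqrt(2 eps) * noise."""
    noise = rng.standard_normal(X.shape)
    return X + step_size * grad_logp(X) + np.sqrt(2.0 * step_size) * noise

# Usage: sample from N(0, I) starting from an offset particle cloud.
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, size=(1000, 2))
for _ in range(2000):
    X = langevin_step(X, lambda X: -X, step_size=1e-2, rng=rng)
print(X.mean(axis=0), X.var(axis=0))   # close to 0 and 1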

Now we engage in estimating the vector field v_t. We first analyze the requirements of two possible ways to do this, and further the flexibility of PSVIs; we then propose our methods GFSD/GFSF following the two ways. We fix an arbitrary time t and drop the subscript for notational brevity.

3.1 Smoothing Density or Smoothing Test Functions

The key obstacle to estimating v is the approximation of ∇ log q. We know a finite set of samples {x^{(i)}}_{i=1}^N of q, but we cannot directly approximate q by the empirical distribution q̂ = (1/N) Σ_i δ_{x^{(i)}}, where δ is the Dirac delta function (q̂ is not AC and rigorously has no density; we informally treat δ as a special function to avoid verbose language), since q̂ is not AC and thus the GF of the KLD is undefined at q̂. To well-define the vector field, a straightforward way is to smooth q̂ with a smooth kernel K on X, leading to the AC approximation q̃ = q̂ * K (where "*" denotes convolution).

Besides the expression in density, we shall show in Section 3.2.2 that the vector field can also be characterized by an optimization problem over a family of test functions. By noting the equality E_{q̂*K}[φ] = E_{q̂}[K*φ] (due to the interchangeability of integration and summation), we claim that smoothing the density is equivalent to smoothing the test functions in this optimization formulation. Furthermore, we show that when K is a Gaussian kernel, smoothing the test functions with K is equivalent to taking test functions from the RKHS of a Gaussian kernel.
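The equality behind this equivalence is elementary; for a symmetric kernel K,

\[
\mathbb{E}_{\hat q * K}[\varphi]
= \int \varphi(y)\,\frac{1}{N}\sum_{i=1}^{N} K\bigl(y - x^{(i)}\bigr)\,\mathrm{d}y
= \frac{1}{N}\sum_{i=1}^{N} (K*\varphi)\bigl(x^{(i)}\bigr)
= \mathbb{E}_{\hat q}[K*\varphi].
\]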

Proposition 5.

For a Gaussian kernel K and a suitable family of test functions, the family of smoothed test functions {K * φ} is isometrically isomorphic to the RKHS of a Gaussian kernel (with an appropriately scaled bandwidth).

Proof is provided in Appendix A1. As we shall analyze below, other PSVIs also have to face this issue. We claim that for all PSVIs, either the density or the test functions have to be smoothed, and PSVIs have flexibility up to this extent. This is an intrinsic requirement of the KLD that they minimize.

Case study: SVGD   As introduced in Section 2.3, SVGD identifies its vector field by solving

max_{φ ∈ F, ‖φ‖_F ≤ 1} E_{x∼q}[ ∇ log p(x) · φ(x) + ∇ · φ(x) ],   (1)

where F is the test-function family and φ is required to satisfy the boundary condition of Stein's identity liu2017stein. We first show that if neither the density nor the test functions are smoothed, i.e. q is taken as the empirical q̂ and F is a generic family of smooth functions, the problem fails to identify a valid vector field, just as the GF of the KLD is not defined at q̂.

Proposition 6.

For q taken as the empirical q̂ and test functions taken from a generic family of smooth functions without kernel smoothing, problem (1) has no optimal solution. In fact the supremum of the objective is infinite, indicating that a maximizing sequence of test functions tends to be ill-posed.

Proof is provided in Appendix A2. Intuitively, without a restriction on the sharpness of the test functions, they could be extremely peaked around the samples, which produces unreasonably large vectors at the samples. If we choose to smooth the test functions with a Gaussian kernel K, we only need to choose them from the corresponding RKHS according to Proposition 5, which is exactly what SVGD does. This in turn is equivalent to smoothing the density, by the equality E_{q̂*K}[φ] = E_{q̂}[K*φ] noted above.

SVGD makes no assumption on the form of the density as long as its samples are known, but it actually transfers the restriction on the density to the test functions. The choice of taking test functions from the RKHS is not just for a tractable solution but, more importantly, for guaranteeing a valid vector field. There is no free lunch in approximation flexibility. This claim also holds for the particle optimization (PO) method chen2017particle, since it adopts the SVGD vector field for the KLD term in its optimization objective at each step.

3.2 Simulating the Gradient Flow on P_2

Now we propose two practical methods, based on smoothing the density and smoothing the test functions respectively, for simulating the gradient flow (GF) on P_2 as Bayesian inference methods. A novel bandwidth selection method is also proposed.

3.2.1 Gradient Flow with Smoothed Density (GFSD)

By directly estimating q with the smoothed density q̃ = q̂ * K, we compute the vector field as v(x) = ∇ log p(x) − ∇ log q̃(x). We call this method the Gradient Flow with Smoothed Density (GFSD).
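A minimal NumPy sketch of the GFSD direction with a fixed-bandwidth Gaussian kernel (the kernel-density score is available in closed form; function names are ours):

import numpy as np

def gfsd_direction(X, grad_logp, h):
    """GFSD vector field estimate: v(x_i) = grad log p(x_i) - grad log q~(x_i),
    where q~ is a Gaussian kernel density estimate built from the particles."""
    diff = X[:, None, :] - X[None, :, :]                          # x_i - x_j, (N, N, D)
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h ** 2))      # (N, N)
    # score of the KDE at x_i:  sum_j grad_{x_i} K(x_i - x_j) / sum_j K(x_i - x_j)
    grad_logq = -(diff / h ** 2 * K[:, :, None]).sum(axis=1) / K.sum(axis=1, keepdims=True)
    return grad_logp(X) - grad_logq

# Usage with a standard-normal target (score -x):
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, size=(50, 2))
for _ in range(300):
    X = X + 0.05 * gfsd_direction(X, lambda X: -X, h=0.5)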

We note that GFSD is closely related to SVGD: by the method of variation, the optimal solution to problem (1) with the smoothed density q̃ and unrestricted smooth test functions is proportional to the GFSD vector field. This indicates that the smoothed-density version of SVGD coincides with GFSD. In fact, the revised optimization problem characterizes the gradient of the KLD on P_2.

3.2.2 Gradient Flow with Smoothed Test Functions (GFSF)

For the problematic part ∇ log q, or equivalently ∇q/q, we treat the identity defining it as an equality that holds in the weak sense, which means E_q[∇ log q(x) · φ(x)] = −E_q[∇ · φ(x)] for test functions φ (we also consider scalar-valued test functions and smooth them, which gives the same result, as shown in Appendix A4). We take q as the empirical q̂ and smooth the test functions, which means taking φ from the RKHS of K according to Proposition 5. To enforce the equality, we require the desired estimate of ∇ log q to solve the following optimization problem:

(2)

The closed-form solution of the problem is in matrix form and involves the inverse of the kernel Gram matrix over the particles (see Appendix A3), so the total vector field can also be expressed in matrix form. We call this method the Gradient Flow with Smoothed test Functions (GFSF). Note that the inverse does not incur a serious computational cost since PSVIs are particle-efficient and we do not need a large particle size.
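Since the closed form is not reproduced above, the following sketch implements the kernelized score estimator of li2017gradient that, as noted below, the GFSF estimate coincides with; the ridge regularizer eta and the exact matrix convention are our assumptions for this sketch:

import numpy as np

def stein_score_estimate(X, h, eta=1e-3):
    """Kernelized score estimate at the particles, in the spirit of GFSF / li2017gradient.

    Solves (in ridge-regularized form) for G ~= grad log q at the particles using
    the weak-derivative identity E_q[G phi + grad phi] = 0 with RKHS test functions.
    The ridge term eta and this exact convention are assumptions of this sketch.
    """
    N, D = X.shape
    diff = X[:, None, :] - X[None, :, :]                   # (N, N, D)
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h ** 2))
    # sum over the first argument of grad_{x_i} K(x_i, x_j), stacked per particle j
    gradK = (-diff / h ** 2 * K[:, :, None]).sum(axis=0)   # (N, D)
    return -np.linalg.solve(K + eta * np.eye(N), gradK)

def gfsf_direction(X, grad_logp, h, eta=1e-3):
    """GFSF-style vector field: grad log p minus the kernelized score estimate."""
    return grad_logp(X) - stein_score_estimate(X, h, eta)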

An interesting relation of GFSF to SVGD is that the SVGD vector field can be expressed as a kernel-weighted average of the GFSF vector field over the particles (multiplication by the kernel Gram matrix divided by the particle number). We also note that the GFSF estimate of ∇ log q coincides with the method of li2017gradient, which is derived by using Stein's identity (a more general form of the weak derivative) and choosing one particular test function. In that work, the gradient in the data space is estimated to train implicit generative models.

For both GFSD and GFSF, the only information needed about the target p is the gradient ∇ log p, which is tractable for Bayesian inference tasks since it does not require the normalizing constant of the posterior. Unlike SVGD, the gradient term in both our methods is not weight-averaged among all particles, so each particle moves more efficiently towards the high-probability region around it. Moreover, due to the relation to the Langevin dynamics (LD) (Remark 4), both methods can adopt stochastic gradients for large-scale inference tasks, in the same way that LD does welling2011bayesian.

3.2.3 Bandwidth Selection via Heat Equation

Since a kernel is involved to smooth either the density or the test functions, its bandwidth affects the estimation of the vector field. SVGD chooses the bandwidth with the median method, which is based on a numerical consideration. Here we propose a novel method for selecting the bandwidth by exploiting the dynamics of the gradient flow.

As noted in Remark 4, the deterministic dynamics of the GF and the Langevin dynamics realize the same rule of density evolution. In particular, both the deterministic dynamics driven by −∇ log q_t and the Brownian motion produce the evolution rule of the heat equation (HE), ∂_t q_t = Δ q_t. So we would like to select the bandwidth of the smoothing kernel by enforcing the vector field to recover the effect of the HE. Specifically, we explicitly express the dependence of the smoothed density on the samples as q̃(·; {x^{(i)}}_i) when needed, which is an approximation of the current density q_t. After an infinitesimal time ε, q_{t+ε} should be approximated both by q̃ + ε Δ q̃ due to the HE, and by the smoothed density built on the samples moved by −ε ∇ log q̃ due to the deterministic dynamics. As a result, we require the two approximations to be equal. So in practice, a reasonable bandwidth can be selected by minimizing the discrepancy between them:

(3)

We call this the HE method. Due to the equivalence between smoothing the density and smoothing the test functions analyzed in Section 3.1, the HE method can also be applied to methods based on smoothing test functions, e.g. SVGD and GFSF. Implementation details are provided in Appendix A5.
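Since the objective (3) is not reproduced above, the following sketch implements one plausible instantiation of the matching principle: move the particles by −ε∇ log q̃, rebuild the kernel density estimate, and compare it with the heat-equation prediction q̃ + εΔq̃ over a grid of bandwidths. All names, the finite-difference Laplacian, and the grid search are our choices, not the paper's exact procedure (see Appendix A5 for that):

import numpy as np

def kde(X_query, X_particles, h):
    """Gaussian kernel density estimate of the particle distribution at the query points."""
    sq = np.sum((X_query[:, None, :] - X_particles[None, :, :]) ** 2, axis=-1)
    D = X_particles.shape[1]
    return np.mean(np.exp(-sq / (2 * h ** 2)), axis=1) / (2 * np.pi * h ** 2) ** (D / 2)

def he_residual(h, X, eps=1e-3, delta=1e-3):
    """Heat-equation matching residual for bandwidth h (one plausible instantiation).

    Compares, at the particles, the KDE after moving the particles by
    -eps * grad log q~ (deterministic dynamics) against the heat-equation
    prediction q~ + eps * laplacian(q~), the latter via finite differences.
    """
    N, D = X.shape
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))
    grad_logq = -(diff / h ** 2 * K[:, :, None]).sum(axis=1) / K.sum(axis=1, keepdims=True)
    moved = kde(X, X - eps * grad_logq, h)            # density after the deterministic move
    lap = np.zeros(N)                                  # finite-difference Laplacian of the KDE
    for d in range(D):
        e = np.zeros(D); e[d] = delta
        lap += (kde(X + e, X, h) - 2 * kde(X, X, h) + kde(X - e, X, h)) / delta ** 2
    target = kde(X, X, h) + eps * lap                  # heat-equation prediction
    return np.sum((moved - target) ** 2)

def select_bandwidth(X, grid):
    """Pick the bandwidth minimizing the heat-equation residual over a grid."""
    return min(grid, key=lambda h: he_residual(h, X))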

4 Accelerated First-order Methods on P_2

The gradient flow methods described in the above section can be regarded as the gradient descent method on a Riemannian manifold. Beyond that, there exist accelerated first-order methods on Riemannian manifolds, e.g. the Riemannian Nesterov's Accelerated Gradient (RNAG) liu2017accelerated, that enjoy a faster convergence rate. To apply the RNAG method to the Riemannian manifold P_2, two more ingredients are required: the exponential map (and its inverse) and the parallel transport. Intuitively, the exponential map acts as the addition (displacement) of a point with a tangent vector to reach another point, and the parallel transport relates tangent vectors at different points; it should be invoked before using a tangent vector at a different point.

Exponential map on P_2    As indicated by Corollary 7.22 and Theorem 10.38 of villani2008optimal, for an AC measure μ and a tangent vector field v, the exponential map is Exp_μ(v) = (id + v)_# μ (push-forward notation as in Lemma 2); at the particle level it simply moves each sample x^{(i)} of μ to x^{(i)} + v(x^{(i)}). For its inverse, consider two AC measures μ and ν that are close. Let {x^{(i)}}_i be a set of samples of μ and {y^{(i)}}_i of ν, and assume that they are pairwise close: y^{(i)} ≈ x^{(i)}. Then the optimal transport map T from μ to ν satisfies T(x^{(i)}) ≈ y^{(i)}. According to Theorem 7.2.2 of ambrosio2008gradient, the geodesic (in the Riemannian-manifold sense) from μ to ν is obtained by interpolating with T, and Proposition 8.4.6 further gives its tangent vector at μ in vector-field form as T − id, while by definition this tangent vector is Exp_μ^{-1}(ν). So we have Exp_μ^{-1}(ν)(x^{(i)}) ≈ y^{(i)} − x^{(i)}. Note that knowledge of the samples alone is sufficient for our purpose of updating the samples. Also note that whenever we apply Exp^{-1} in the following, the two measures are close since one is derived from an infinitesimal displacement of the other, and so is each pair of their samples x^{(i)} and y^{(i)} (see Appendix A6.1, A6.2).

Parallel Transport on P_2    We first note that there has been formal research on parallel transport on P_2 lott2008some; lott2017intrinsic, but the result requires differentiating the vector field and is thus hard to implement.

Figure 1: Illustration of the Schild’s ladder method. Figure inspired by kheyfets2000schild .

We propose to use the Schild's ladder method ehlers1972geometry; kheyfets2000schild, which provides a tractable first-order approximation of the parallel transport from μ to ν. As shown in Fig. 1, given μ, ν and a tangent vector u at μ, the procedure to approximate the transported vector is: (i) find the point μ' = Exp_μ(u); (ii) find the midpoint m of the geodesic from μ' to ν: m = Exp_{μ'}(Exp_{μ'}^{-1}(ν)/2); (iii) extrapolate the geodesic from μ through m by doubling its length to find ν' = Exp_μ(2 Exp_μ^{-1}(m)); (iv) take Exp_ν^{-1}(ν') as the approximation. Note that the Schild's ladder method only requires the exponential map and its inverse.
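At the particle level, with the exponential map and its inverse reduced to pairwise displacements as above, the ladder is a few lines of code; a minimal sketch (names are ours):

import numpy as np

# Particle-level exponential map on P_2 and its inverse, as described above:
# Exp moves each particle along its tangent vector; Exp^{-1} of a nearby particle
# set is the pairwise displacement (assuming the particles are paired and close).

def exp_map(X, V):
    return X + V                      # (N, D) particles, (N, D) tangent vectors

def exp_map_inv(X, Y):
    return Y - X                      # tangent vector at X pointing towards Y

def schild_ladder(X, U, Y):
    """First-order Schild's ladder approximation of transporting U from X to Y."""
    P = exp_map(X, U)                               # (i)   endpoint of U
    M = exp_map(P, 0.5 * exp_map_inv(P, Y))         # (ii)  midpoint of geodesic P -> Y
    Z = exp_map(X, 2.0 * exp_map_inv(X, M))         # (iii) extrapolate X -> M to twice its length
    return exp_map_inv(Y, Z)                        # (iv)  transported vector at Y

With these flat particle-level maps the ladder returns U unchanged, i.e. the approximate transport simply carries the vector values over to the paired particles, which is consistent with the observation below.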

Following the procedure, we find that for AC measures with pairwise close samples, the approximate parallel transport simply carries the vector value at each particle of one measure to the corresponding particle of the other (see Appendix A6.1). By applying the RNAG algorithm liu2017accelerated on P_2 (details in Appendix A6.2), we obtain the Wasserstein Nesterov's Accelerated Gradient method (WNAG), as presented in Alg. 1. To highlight the difference, we call the vanilla implementation of PSVIs the Wasserstein Gradient Descent method (WGD). We emphasize that one cannot directly apply the vanilla Nesterov's acceleration method nesterov1983method to the update rule of every single particle, since each particle is not optimizing any function. Although more investigation is needed to see whether the algorithm suits SVGD, we can apply it algorithmically, and it also makes a salient improvement in experiments.

The tools developed here also make it possible to apply other optimization techniques on Riemannian manifolds to benefit PSVIs, e.g. Riemannian BFGS gabay1982minimizing; qi2010riemannian; yuan2016riemannian and Riemannian stochastic variance-reduced gradient zhang2016riemannian. We leave these extensions as future work.

1:  Select the acceleration factor; initialize particles {x^{(i)}}_{i=1}^N with distinct values; initialize the auxiliary (momentum) quantities;
2:  for k = 1, 2, … do
3:     Determine the kernel bandwidth using either the median or the HE method;
4:     for i = 1, …, N do
5:        Find the value of the vector field at particle i by SVGD/GFSD/GFSF;
6:        Update the momentum of particle i from its transported previous value and the vector field (RNAG rule, Appendix A6.2);
7:        Move particle i by the exponential map with the updated momentum.
8:     end for
9:  end for
Algorithm 1 Wasserstein Nesterov's Accelerated Gradient method (WNAG)
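Since the exact update rule in lines 6-7 follows the RNAG scheme of liu2017accelerated (Appendix A6.2), which is not reproduced here, the following is only a generic Nesterov-style sketch of an accelerated particle loop using the flat particle-level exponential map and the (near-identity) transport described above; the momentum schedule (k−1)/(k+2) and all names are our assumptions, not the paper's exact rule:

import numpy as np

def wnag_like_loop(X0, vector_field, step_size=0.05, n_iters=500):
    """Generic Nesterov-style accelerated particle loop (a sketch, not the exact WNAG rule).

    X0           : (N, D) initial particles.
    vector_field : callable X -> (N, D), e.g. the SVGD/GFSD/GFSF estimates.
    """
    X = np.array(X0, dtype=float)
    X_prev = X.copy()
    for k in range(1, n_iters + 1):
        momentum = (k - 1.0) / (k + 2.0)
        # look-ahead point via the particle-level exponential map; the previous
        # displacement is reused directly, mirroring the near-identity transport
        Y = X + momentum * (X - X_prev)
        X_prev = X
        X = Y + step_size * vector_field(Y)     # gradient step at the look-ahead point
    return X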

5 Experiments

Code and data available at http://ml.cs.tsinghua.edu.cn/~changliu/awgf/.

5.1 Toy Experiments

We first investigate the validity of GFSD and GFSF as well as the benefit of the HE method by showing 200 samples that they produce for a toy bimodal distribution (following rezende2015variational), and compare with SVGD and with the particle optimization method (PO) chen2017particle as an independent PSVI. As shown in Fig. 2, when using the median method for the bandwidth, samples of GFSD and GFSF tend to collapse, since the median method cannot make the −∇ log q̃ term act the same as the Brownian motion, which is responsible for diversity. With the HE method for the bandwidth, GFSD and GFSF produce well-aligned samples with a reasonable diversity. Remarkably, the samples are uniformly distributed along the contour of the target density, indicating that they form a better set of samples to represent the density. We also note that the PO method does not improve the ultimate sample distribution of SVGD, while our HE method also makes the samples of SVGD and PO align more neatly.

Figure 2: The validity of GFSD/GFSF and the effect of the HE method. Samples are plotted as grey dots and the common target distribution as blue shade. The columns correspond to SVGD, PO, GFSD, GFSF methods respectively, and the rows correspond to the median and HE method for kernel bandwidth. All methods are run for 400 iterations with the same initialization.

5.2 Bayesian Logistic Regression

We conduct the standard Bayesian logistic regression experiment on the Covertype dataset, following the same settings as liu2016stein, except that we average the results over 10 random trials. The SVGD-WGD method uses AdaGrad with momentum to adjust the step size so that it is identical to the vanilla SVGD method. Other methods use a shrinking step-size scheme to cope with the stochastic gradient. We select parameters for all methods for the fastest convergence with a stable result. Although the PO method does not strictly count as an acceleration method for PSVIs, we treat it as an empirical acceleration implementation here.

We first find that the median method and the HE method for bandwidth selection achieve similar results in all cases, so we skip showing the comparison. We then show the improvement in iteration-wise convergence rate by the WNAG acceleration over WGD, and the comparison with the PO acceleration, in Fig. 3. WNAG gains a salient speed-up over WGD for all of the SVGD, GFSD and GFSF methods, and it outperforms the PO method, which does not make a significant improvement over WGD, as also shown in chen2017particle. We also note that the GFSD and GFSF methods achieve results comparable to SVGD, indicating their validity.

Figure 3: Acceleration effect of WNAG over WGD with comparison to PO on Bayesian logistic regression on the Covertype dataset; panels: (a) SVGD, (b) GFSD, (c) GFSF. Each curve is averaged over 10 runs with a random 80% train / 20% test split of the dataset, and the mini-batch size is taken as 50, following liu2016stein.

5.3 Bayesian Neural Network

We test all the methods on Bayesian neural networks, following the settings described in liu2016stein, except that we run all methods for the same number of iterations. For the SVGD-WGD method, we take either the result reported in liu2016stein or the result we observe, whichever is better. Table 1 presents the results on the Kin8nm dataset, one of the datasets used by liu2016stein; Appendix A7 shows more results on other datasets. We observe that the WNAG acceleration method saliently outperforms the WGD and PO methods for all of SVGD, GFSD and GFSF on all datasets. The PO method, as an empirical acceleration method, also improves performance, but not to the extent of WNAG. The GFSD and GFSF methods also achieve better performance than SVGD in most cases.

Method       | Avg. Test RMSE                              | Avg. Test LL
             | SVGD         GFSD (Ours)  GFSF (Ours)       | SVGD         GFSD (Ours)  GFSF (Ours)
WGD          | 0.084±0.002  0.080±0.003  0.083±0.002       | 1.042±0.016  1.087±0.029  1.044±0.016
PO           | 0.078±0.002  0.081±0.002  0.080±0.002       | 1.114±0.022  1.067±0.017  1.073±0.016
WNAG (Ours)  | 0.070±0.002  0.071±0.001  0.070±0.001       | 1.167±0.015  1.167±0.017  1.190±0.014
Table 1: Results on Bayesian neural network on the Kin8nm dataset. Results are averaged over 20 runs with a random 90% train / 10% test split, and the mini-batch size is taken as 100, following liu2016stein.

6 Conclusion

We consider performing Bayesian inference by minimizing the KLD on the 2-Wasserstein space P_2, whose Riemannian structure enables us to develop various methods. We analyze the flexibility of such particle-simulation-based variational inference methods (PSVIs), which extends up to a smoothed density or a smoothed test-function family, and develop two PSVIs by simulating the gradient flow on P_2, smoothing the density and the test functions respectively, together with a principled kernel bandwidth selection method (HE). We also propose an acceleration method, WNAG, for PSVIs by exploiting the Riemannian structure of P_2. Experiments show the validity of GFSD/GFSF with the HE method, and the improved iteration-wise convergence rate of WNAG for all PSVIs.

References

  • [1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • [2] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the monge-kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
  • [3] Changyou Chen and Ruiyi Zhang. Particle optimization in stochastic gradient mcmc. arXiv preprint arXiv:1711.10927, 2017.
  • [4] Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in neural information processing systems, pages 3203–3211, 2014.
  • [5] Manfredo Perdigao Do Carmo. Riemannian Geometry. 1992.
  • [6] J Ehlers, F Pirani, and A Schild. The geometry of free fall and light propagation, in the book “general relativity”(papers in honour of jl synge), 63–84, 1972.
  • [7] Matthias Erbar et al. The heat equation on manifolds as a gradient flow in the wasserstein space. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 46, pages 1–23. Institut Henri Poincaré, 2010.
  • [8] Daniel Gabay. Minimizing a differentiable function over a differential manifold. Journal of Optimization Theory and Applications, 37(2):177–219, 1982.
  • [9] Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. In Readings in Computer Vision, pages 564–584. Elsevier, 1987.
  • [10] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
  • [11] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • [12] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker–planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
  • [13] Arkady Kheyfets, Warner A Miller, and Gregory A Newton. Schild’s ladder parallel transport procedure for an arbitrary connection. International Journal of theoretical Physics, 39(12):2891–2898, 2000.
  • [14] Ondrej Kováčik and Jiří Rákosník. On spaces L^{p(x)} and W^{k,p(x)}. Czechoslovak Mathematical Journal, 41(4):592–618, 1991.
  • [15] Yingzhen Li and Richard E Turner. Gradient estimators for implicit models. arXiv preprint arXiv:1705.07107, 2017.
  • [16] Chang Liu and Jun Zhu. Riemannian stein variational gradient descent for bayesian inference. arXiv preprint arXiv:1711.11216, 2017.
  • [17] Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in neural information processing systems, pages 3118–3126, 2017.
  • [18] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pages 2378–2386, 2016.
  • [19] Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.
  • [20] Yuanyuan Liu, Fanhua Shang, James Cheng, Hong Cheng, and Licheng Jiao. Accelerated first-order methods for geodesically convex optimization on riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4875–4884, 2017.
  • [21] John Lott. Some geometric calculations on wasserstein space. Communications in Mathematical Physics, 277(2):423–437, 2008.
  • [22] John Lott. An intrinsic parallel transport in wasserstein space. Proceedings of the American Mathematical Society, 145(12):5329–5340, 2017.
  • [23] Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.
  • [24] Yurii Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
  • [25] Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. 2001.
  • [26] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • [27] Yunchen Pu, Zhe Gan, Ricardo Henao, Chunyuan Li, Shaobo Han, and Lawrence Carin. Vae learning via stein variational gradient descent. In Advances in Neural Information Processing Systems, pages 4239–4248, 2017.
  • [28] Chunhong Qi, Kyle A Gallivan, and P-A Absil. Riemannian bfgs algorithm with applications. In Recent advances in optimization and its applications in engineering, pages 183–192. Springer, 2010.
  • [29] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
  • [30] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.
  • [31] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
  • [32] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • [33] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
  • [34] Dilin Wang and Qiang Liu. Learning to draw samples: With application to amortized mle for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.
  • [35] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
  • [36] Xinru Yuan, Wen Huang, P-A Absil, and Kyle A Gallivan. A riemannian limited-memory bfgs algorithm for computing the matrix geometric mean. Procedia Computer Science, 80:2147–2157, 2016.
  • [37] Hongyi Zhang, Sashank J Reddi, and Suvrit Sra. Riemannian svrg: fast stochastic optimization on riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4592–4600, 2016.
  • [38] Ding-Xuan Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of computational and Applied Mathematics, 220(1):456–463, 2008.
  • [39] Jingwei Zhuo, Chang Liu, Jiaxin Shi, Jun Zhu, Ning Chen, and Bo Zhang. Message passing stein variational gradient descent. arXiv preprint arXiv:1711.04425, 2017.

Appendix

A1: Proof of Proposition 5

Proof.

Note that (e.g. [14], Thm 2.11), and the map is continuous. So we have . On the other hand, due to Proposition 4.46 and Theorem 4.47 of [30], the map is an isometric isomorphism between and , the reproducing kernel Hilbert space of . This completes the proof. ∎

A2: Proof of Proposition 6

Proof.

To show that the problem below has no solution,

(4)

we will find a sequence of functions satisfying conditions in (4) while the objective goes to infinity.

We assume that there exists a neighborhood of some sample on which the density is bounded away from zero, which is reasonable because it is almost impossible to draw a sample around which the density vanishes in every neighborhood.

Denoting for any -dimensional vector function and for any real-valued function , the objective can be written as

(5)

For every , we can define a function correspondingly, such that , which means and

Rewriting (5) in terms of this function,

(6)

where and . We will now construct a sequence to show the problem

(7)

has no solution, and then induce a corresponding sequence of test functions for problem (4).

Define a sequence of functions

We have with , where is the Gamma function. Note that when ,

(8)

therefore,