 # Riemannian Stein Variational Gradient Descent for Bayesian Inference

We develop Riemannian Stein Variational Gradient Descent (RSVGD), a Bayesian inference method that generalizes Stein Variational Gradient Descent (SVGD) to Riemann manifolds. The benefits are twofold: (i) for inference tasks in Euclidean spaces, RSVGD has the advantage over SVGD of utilizing information geometry, and (ii) for inference tasks on Riemann manifolds, RSVGD brings the unique advantages of SVGD to the Riemannian world. To appropriately transfer to Riemann manifolds, we conceive novel and non-trivial techniques for RSVGD, required by the intrinsically different characteristics of general Riemann manifolds from Euclidean spaces. We also derive Riemannian Stein's identity and Riemannian kernelized Stein discrepancy. Experimental results show the advantages over SVGD of exploring distribution geometry, and the advantages of particle-efficiency, iteration-effectiveness and approximation flexibility over other inference methods on Riemann manifolds.


## Introduction

Bayesian inference is the central task for learning a Bayesian model to extract knowledge from data. The task is to estimate the posterior distribution of the latent variables of the model given observed data. It has been a focus of machine learning for decades, with a great number of methods emerging. Variational inference methods (VIs) aim to approximate the posterior by a tractable variational distribution. Traditional VIs typically use a statistical model, usually a parametric distribution family, as the variational distribution; we call them model-based VIs (M-VIs). They cast the inference problem as an optimization problem, which can be efficiently solved by various techniques. However, due to the restricted coverage of the chosen distribution family (e.g. the mean-field form), there can be a persistent gap blocking the approximation from getting any closer to the posterior. Monte Carlo methods (MCs) estimate the posterior by directly drawing samples from it. Asymptotically accurate as they are, their performance with finite samples is not guaranteed, and they usually take effect slowly, especially the widely applicable and thus commonly used Markov chain MCs (MCMCs), due to the positive autocorrelation of their samples.

Recently, a set of particle-based VIs (P-VIs) has been proposed. P-VIs use a certain number of samples, or particles, to represent the variational distribution, and update the particles by solving an optimization problem. Similar to MCs, this non-parametric particle form gives them great flexibility to close the gap of M-VIs; and unlike MCs, the optimization-based update rule makes them effective in iteration: every iteration is guaranteed to make progress. Although there are convergence analyses for some particular MCMCs from non-stationary to stationary, the principle of a general MCMC only guarantees that the sampler will remain in the stationary distribution. Moreover, MCs usually require a large sample size to take effect, while P-VIs achieve similar performance with much fewer particles, since their principle aims at finite-sample performance. This particle-efficiency saves storage of the inference result and reduces the time of subsequent tasks such as testing and prediction. Table 1 presents a comparison of the three kinds of inference methods.

Stein variational gradient descent (SVGD) [Liu and Wang 2016] is an outstanding example of P-VIs. SVGD updates particles by imposing a continuous-time dynamics on them that drives the variational distribution to evolve towards the posterior. Although the dynamics is restricted to a kernel-related space for tractability, the theory of SVGD makes no assumption on the variational distribution, giving it the greatest flexibility. SVGD has been applied to develop advanced inference methods [Wang and Liu 2016; Pu et al. 2017] as well as reinforcement learning [Liu et al. 2017; Haarnoja et al. 2017]. Other instances of P-VIs include normalizing flows (NF) [Rezende and Mohamed 2015] and particle mirror descent (PMD) [Dai et al. 2016]. NF uses a series of invertible transformations to adjust particles from a simple tractable distribution to fit the posterior; however, the invertibility requirement restricts its flexibility. PMD adopts the mirror descent method to formulate an optimization problem and uses a weighted kernel density estimator for the variational distribution, which is still a restrictive assumption.

Another issue of Bayesian inference is its interplay with Riemann manifolds. This consideration bears its importance in two ways: (i) the posterior of some models is itself a distribution on a given Riemann manifold, e.g. the spherical admixture model (SAM) [Reisinger et al. 2010] has its posterior on hyperspheres; (ii) the latent variables of a Bayesian model form a natural coordinate system (c.s.) of the Riemann manifold of likelihood distributions, so we can also conduct Bayesian inference on the distribution manifold with improved efficiency with the help of information geometry [Amari and Nagaoka 2007; Amari 2016]. Much progress has been made recently for both Riemannian considerations. In the spirit of (i), [Bonnabel 2013] and [Zhang, Reddi, and Sra 2016] develop scalable and stable optimization methods on Riemann manifolds to enhance inference for M-VIs, and [Brubaker, Salzmann, and Urtasun 2012], [Byrne and Girolami 2013] and [Liu, Zhu, and Song 2016] develop efficient and scalable MCMCs on Riemann manifolds. In the spirit of (ii), [Hoffman et al. 2013] use the natural gradient for M-VI, [Girolami and Calderhead 2011] and [Ma, Chen, and Fox 2015] develop efficient MCMCs with Riemann structure, and [Li et al. 2016] apply Riemannian MCMCs to the inference of Bayesian neural networks. Little work has been done to enhance P-VIs with Riemann structure. [Gemici, Rezende, and Mohamed 2016] attempt to generalize NF to Riemann manifolds, but their method cannot be used for manifolds with no global c.s., such as hyperspheres. Although the method can be implemented in an almost-global c.s., unbounded distortion near the boundary of the c.s. would unavoidably occur, causing numerical instability.

In this work, we develop Riemannian Stein Variational Gradient Descent (RSVGD), the extension of SVGD to Riemann manifolds. Our method can be applied both for approximating posteriors on Riemann manifolds (i), and for efficient inference by exploring the Riemann structure of distributions (ii). RSVGD inherits the significant advantages of SVGD, such as particle-efficiency and the absence of variational assumptions, bringing these benefits to the field of inference on Riemann manifolds. Technically, it is highly non-trivial to extend the idea to Riemann manifolds, as many subtle properties of Riemann manifolds must be carefully considered, which may lead to completely different treatments. We first review SVGD as an evolution under the dynamics of a flow and generalize the deduction to Riemann manifolds. Then we solve for the optimal dynamics by a novel method, where the treatment of SVGD fails due to the intrinsically different properties of general Riemann manifolds from Euclidean spaces. The expression in the embedded space is also derived for application to manifolds with no global c.s., like hyperspheres; it does not require choosing a c.s. and introduces no numerical problems. As side products, we also develop Riemannian Stein's identity and Riemannian kernelized Stein discrepancy, as extensions of the corresponding Euclidean concepts. Finally, we apply RSVGD to the troublesome inference task of SAM, with its unique advantages validated in experiments.

## Preliminaries

### Riemann Manifolds

We briefly introduce basic concepts of Riemann manifolds. For more details please refer to common textbooks, e.g. [do Carmo 1992; Abraham, Marsden, and Ratiu 2012].

#### Basics

Denote by $\mathcal{M}$ an $m$-dimensional Riemann manifold. By definition, at every point $x \in \mathcal{M}$ there exists a local coordinate system (c.s.) $(U, \phi)$, where $U \subseteq \mathcal{M}$ is open and contains $x$, and $\phi$ is a homeomorphism between $U$ and an open subset of $\mathbb{R}^m$. Denote by $C^\infty(x)$ and $C^\infty(\mathcal{M})$ the sets of functions that are smooth around $x$ and all over $\mathcal{M}$, respectively.

A tangent vector $v$ at $x$ is a linear functional on $C^\infty(x)$ that satisfies the Leibniz rule $v[fg] = f(x)\,v[g] + g(x)\,v[f]$. Intuitively, $v[f]$ is the directional derivative of $f \in C^\infty(x)$ at $x$ along the direction of $v$. All such $v$ form an $m$-dimensional linear space $T_x\mathcal{M}$, called the tangent space at $x$. Its natural basis under $(U, \phi)$ is defined as $\{\partial/\partial x^i\}_{i=1}^m$ (also denoted as $\{\partial_i\}_i$). We can then express $v$ in components: $v = v^i\partial_i$, where we adopt Einstein's convention that a duplicated subscript and superscript are summed out. A vector field $X$ on $\mathcal{M}$ specifies at every $x \in \mathcal{M}$ a tangent vector $X_x \in T_x\mathcal{M}$, smoothly with respect to (w.r.t.) $x$. Denote by $\mathfrak{X}(\mathcal{M})$ the set of all such $X$. A Riemann structure is equipped to $\mathcal{M}$ if every $T_x\mathcal{M}$ is endowed with an inner product $\langle\cdot,\cdot\rangle_x$ (smooth in $x$). In a c.s., for $u = u^i\partial_i$ and $v = v^j\partial_j$, $\langle u, v\rangle_x = g_{ij}(x)\,u^i v^j$, where $g_{ij}(x) = \langle\partial_i, \partial_j\rangle_x$.

An object called the Riemann volume form $\mu_g$ can be used to define a measure on $\mathcal{M}$ via integration: the measure of a compact region $U \subseteq \mathcal{M}$ is defined as $\int_U \mu_g$. Thus for any probability distribution absolutely continuous w.r.t. this measure, we can define its probability density function (p.d.f.) $p$ w.r.t. $\mu_g$. In the sequel, we require distributions whose p.d.f.s are smooth functions on $\mathcal{M}$, and we refer to "the distribution with p.d.f. $p$" simply as "distribution $p$".

#### Flows, Dynamics and Evolving Distributions

These concepts constitute the fundamental idea of SVGD and our RSVGD. The notion of flow arises from the following fact: for a vector field $X \in \mathfrak{X}(\mathcal{M})$ and a fixed point $x \in \mathcal{M}$, there exist a subset $V \subseteq \mathcal{M}$ containing $x$ and a one-parameter family of transformations $F_t: V \to \mathcal{M}$, $t \in (-\varepsilon, \varepsilon)$, such that $F_0$ is the identity map on $V$, and $\frac{d}{dt}(f \circ F_t)(y) = X_{F_t(y)}[f]$ for $y \in V$ and $f \in C^\infty(\mathcal{M})$ [do Carmo 1992, Page 28]. We call $F$ the local flow of $X$ around $x$. Under some conditions, e.g. when $X$ has compact support, there exists a local flow that is global (i.e. $V = \mathcal{M}$ and $t \in \mathbb{R}$), which is called the flow of $X$.

We refer to a dynamics here as a rule governing the motion on $\mathcal{M}$ over time $t$. Specifically, a dynamics gives the position $x(t)$ at any future time $t$ for any given initial position $x(0)$. If $x(t)$ is smooth in $t$, we call it a continuous-time dynamics. Obviously, a flow $F$ can be used to define a continuous-time dynamics: $x(t) = F_t(x(0))$. Due to the correspondence between a vector field and a flow, we can also define a dynamics by a vector field $X$. Thus we abbreviate "the dynamics defined by the flow of $X$" as "dynamics $X$" in the following.

Let a random variable (r.v.) obeying some distribution $q_0$ move under dynamics $X$ from time $0$ to $t$, which acts as a transformation on the r.v. Then the distribution of the transformed r.v. also evolves along time $t$; we denote it as $q_t$ and call it an evolving distribution under dynamics $X$. Suppose there is a set of particles distributed as $q_0$. If each particle moves individually under dynamics $X$ for time $t$, then the new set of particles is distributed as $q_t$.

### Reynolds Transport Theorem

Reynolds transport theorem helps us relate an evolving distribution to the corresponding dynamics. It is a generalization of the rule of differentiation under the integral sign, and is the foundation of fluid mechanics. Let $X \in \mathfrak{X}(\mathcal{M})$ and let $F$ be its flow. For $f_t(x)$ smooth in both $t$ and $x$, and any open subset $U \subseteq \mathcal{M}$, the theorem states that

 \[\frac{d}{dt}\int_{F_t(U)} f_t\,\mu_g = \int_{F_t(U)}\Big(\frac{\partial f_t}{\partial t} + \mathrm{div}(f_t X)\Big)\mu_g,\]

where $\mathrm{div}(\cdot)$ is the divergence of a vector field: in any local c.s., $\mathrm{div}(X) = \frac{1}{\sqrt{|G|}}\partial_i\big(\sqrt{|G|}\,X^i\big)$, where $G$, commonly called the Riemann metric tensor, is the $m \times m$ matrix comprised of $(g_{ij})$ in that c.s., and $|G|$ is its determinant. More details can be found in e.g. [Romano 2007, Page 164; Frankel 2011, Page 142; Abraham, Marsden, and Ratiu 2012, Page 469].

### Stein Variational Gradient Descent (SVGD)

We review SVGD [Liu and Wang 2016] from a perspective that inspires our generalization to Riemann manifolds. SVGD is a particle-based variational inference method. It updates particles by applying an appropriate dynamics on them so that the distribution of the particles, an evolving distribution, approaches (in the KL-divergence sense) the target distribution. Note that in this case the manifold is the most common Euclidean space $\mathbb{R}^m$. Denote the evolving distribution of the particles under dynamics $X$ as $q_t$, and let $p$ be the target distribution. The first key result of SVGD is

 \[-\frac{d}{dt}\,\mathrm{KL}(q_t\,\|\,p) = \mathbb{E}_{q_t}\big[X^\top\nabla\log p + \nabla^\top X\big], \tag{1}\]

which measures the rate at which $q_t$ approaches $p$. As our desideratum is to minimize $\mathrm{KL}(q_t\,\|\,p)$, a desirable dynamics should maximize this approaching rate. This is analogous to finding the gradient when we want to maximize a function $f$: the gradient is the direction that maximizes the directional derivative $\nabla_v f$ over unit directions $v$. Similarly, we call the left-hand side of Eqn. (1) the directional derivative along vector field $X$, and [Liu and Wang 2016] claim that the direction and magnitude of the functional gradient of $-\mathrm{KL}(q_t\,\|\,p)$ coincide with the maximizer and maximum of the directional derivative, and that the dynamics of the functional gradient (itself a vector field) is the desired one. Simulating this dynamics and repeating the procedure updates the particles to be more and more representative of the target distribution.

The above result reveals the relation between the directional derivative and the dynamics $X$. To find the functional gradient, the next task is to solve the maximization problem. Note that in the Euclidean case, the tangent space at any point is isometrically isomorphic to $\mathbb{R}^m$, so vector fields can be described as smooth functions from $\mathbb{R}^m$ to $\mathbb{R}^m$. [Liu and Wang 2016] take $X$ from the product space $\mathcal{H}^m$ of the reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ of some smooth kernel $K$ on $\mathbb{R}^m$. $\mathcal{H}$ is a Hilbert space of some smooth functions on $\mathbb{R}^m$. Its key property is the reproducing property: for any $f \in \mathcal{H}$, $f(x) = \langle f, K(x,\cdot)\rangle_{\mathcal{H}}$, and $\partial_i f(x) = \langle f, \partial_{x^i}K(x,\cdot)\rangle_{\mathcal{H}}$. In $\mathcal{H}^m$ the maximization problem can be solved in closed form, which gives the functional gradient:

 \[X^*(\cdot) = \mathbb{E}_{q_t(A)}\big[K(A,\cdot)\,\nabla_A\log p(A) + \nabla_A K(A,\cdot)\big]. \tag{2}\]

Note that the sample distribution $q_t$ appears only in the form of an expectation, which can be estimated using merely samples from $q_t$. This frees us from assumptions on the explicit form of $q_t$ and thus gives the method great flexibility. It is a benefit of using the KL-divergence to measure the difference between two distributions.
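To make the update concrete, here is a minimal numpy sketch of the SVGD iteration of Eqn. (2) with an RBF kernel; the function names (`rbf_kernel`, `svgd_step`) and the toy standard-normal target are our own illustrative choices, not part of the original method description.

```python
import numpy as np

def rbf_kernel(X, h):
    """Pairwise RBF kernel: K[j, i] = exp(-||x_j - x_i||^2 / (2 h^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * h ** 2))

def svgd_step(X, grad_logp, h=1.0, eps=0.1):
    """One SVGD update, Eqn. (2) with the expectation replaced by a particle average:
    x_i += eps * mean_j [K(x_j, x_i) grad_logp(x_j) + grad_{x_j} K(x_j, x_i)]."""
    K = rbf_kernel(X, h)
    diff = X[:, None, :] - X[None, :, :]            # diff[j, i] = x_j - x_i
    grad_K = -diff / h ** 2 * K[:, :, None]         # grad_{x_j} K(x_j, x_i)
    phi = (K[:, :, None] * grad_logp(X)[:, None, :] + grad_K).mean(axis=0)
    return X + eps * phi

# Toy usage: particles initialized far away drift towards a standard normal target.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(50, 2))
for _ in range(200):
    X = svgd_step(X, lambda Z: -Z)                  # score of N(0, I)
```

The kernel-weighted score term drives particles towards high-density regions, while the kernel-gradient term acts as a repulsive force that prevents them from collapsing onto the mode.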

## Riemannian SVGD

We now present Riemannian Stein Variational Gradient Descent (RSVGD), which has many non-trivial considerations beyond SVGD and requires novel treatments. We first derive the Riemannian counterpart of the directional derivative, then conceive a novel technique to find the functional gradient, in which case the SVGD technique fails. Finally we express RSVGD in the embedded space of the manifold and give an instance for hyperspheres, which is directly used for our application.

### Derivation of the Directional Derivative

Now $\mathcal{M}$ is a general Riemann manifold. We first present a useful tool for deriving the directional derivative.

###### Lemma 1 (Continuity Equation on Riemann Manifold).

Let $p_t$ be the evolving distribution under dynamics $X$, where $X \in \mathfrak{X}(\mathcal{M})$. Then

 \[\frac{\partial p_t}{\partial t} = -\mathrm{div}(p_t X) = -X[p_t] - p_t\,\mathrm{div}(X). \tag{3}\]

See Appendix A1 (available at http://ml.cs.tsinghua.edu.cn/~changliu/rsvgd/Liu-Zhu-appendix.pdf) for the derivation, which involves Reynolds transport theorem.

We now present our first key result, the directional derivative of the KL-divergence (Appendix A2 presents the well-definedness of the KL-divergence on Riemann manifolds) w.r.t. a vector field.

###### Theorem 2 (Directional Derivative).

Let $q_t$ be the evolving distribution under dynamics $X$, and let $p$ be a fixed distribution. Then the directional derivative is

 \[-\frac{d}{dt}\,\mathrm{KL}(q_t\,\|\,p) = \mathbb{E}_{q_t}\big[\mathrm{div}(pX)/p\big] = \mathbb{E}_{q_t}\big[X[\log p] + \mathrm{div}(X)\big].\]

See Appendix A3 for proof. Theorem 2 is an extension of the key result of SVGD Eqn. (1) to general Riemann manifold case. In SVGD terms, we call the generalized Stein’s operator.
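For intuition, the proof essentially combines the continuity equation (3) with integration by parts on the manifold; a condensed sketch (under the assumption that divergence terms integrate to zero, e.g. on a compact boundaryless manifold) is:

```latex
% Sketch of Theorem 2, assuming \int_{\mathcal{M}} \mathrm{div}(hX)\,\mu_g = 0
% for the relevant h (e.g. compact boundaryless \mathcal{M}).
\begin{align*}
-\frac{d}{dt}\,\mathrm{KL}(q_t\,\|\,p)
  &= -\int_{\mathcal{M}} \frac{\partial q_t}{\partial t}
      \Big(\log\frac{q_t}{p} + 1\Big)\,\mu_g
   && \text{(differentiation under the integral)} \\
  &= \int_{\mathcal{M}} \mathrm{div}(q_t X)
      \Big(\log\frac{q_t}{p} + 1\Big)\,\mu_g
   && \text{(continuity equation, Eqn.\ (3))} \\
  &= -\int_{\mathcal{M}} q_t\, X\Big[\log\frac{q_t}{p}\Big]\,\mu_g
   && \text{(integration by parts)} \\
  &= \mathbb{E}_{q_t}\big[X[\log p] + \mathrm{div}(X)\big],
\end{align*}
% The last step uses
% E_{q_t}[-X[\log q_t]] = \int_{\mathcal{M}} (q_t\,\mathrm{div}(X) - \mathrm{div}(q_t X))\,\mu_g
%                       = E_{q_t}[\mathrm{div}(X)].
```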

Appendix A4 further discusses in detail the corresponding condition for Stein's identity to hold in this case. Stein's identity refers to the equality $\mathbb{E}_p\big[X[\log p] + \mathrm{div}(X)\big] = 0$ for any $X$ in a suitable class. The Stein class of $p$ is the set of $X$ that makes Stein's identity hold. Stein's identity indicates that when $q_t$ reaches its optimal configuration $q_t = p$, the directional derivative along any direction in the Stein class is zero, in analogy to the well-known zero-gradient condition in optimization. These concepts play an important role when using the functional gradient to measure the distance between two distributions, i.e. the Stein discrepancy [Liu, Lee, and Jordan 2016].

### Derivation of the Functional Gradient

Now that we have the directional derivative expressed in terms of $X$, we face the maximization problem that defines the functional gradient:

 \[\max_{X \in \mathcal{X},\ \|X\|_{\mathcal{X}} = 1} J(X) := \mathbb{E}_q\big[X[\log p] + \mathrm{div}(X)\big], \tag{4}\]

where we omit the subscript $t$ of $q_t$ since it is fixed when optimizing w.r.t. $X$. Ideally $\mathcal{X}$ should be $\mathfrak{X}(\mathcal{M})$, but for a tractable solution we may restrict $\mathcal{X}$ to some subset of $\mathfrak{X}(\mathcal{M})$ that is at least a normed space. Once we get the maximizer of $J$, denoted by $X^*$, we have the functional gradient $J(X^*)\,X^*$.

#### Requirements

Before we choose an appropriate $\mathcal{X}$, we first list three requirements on a reasonable $X^*$ (or equivalently on the functional gradient, since they differ only in scale). The first two requirements arise from special properties of general Riemann manifolds, which are so different from Euclidean spaces that they make the SVGD technique fail.

• R1: $X^*$ is a valid vector field on $\mathcal{M}$;

• R2: $X^*$ is coordinate invariant;

• R3: $X^*$ can be expressed in closed form, where $q$ appears only in terms of expectations.

We require R1 since all the above deductions are based on vector fields. In the Euclidean case, any set of smooth functions satisfies R1. But on a general Riemann manifold, it is not enough that all the components of a vector field are smooth in any coordinate system (c.s.). For example, due to the hairy ball theorem [Abraham, Marsden, and Ratiu 2012, Theorem 8.5.13], any vector field on an even-dimensional hypersphere must have a zero-vector-valued point (critical point), which goes beyond the above condition. This disables the idea of using the SVGD technique in the coordinate space of a manifold.

R2 is required to avoid ambiguity or arbitrariness of the solution. Coordinate invariance is a key concept in the area of differential manifolds. By definition, the most basic way to access a general manifold is via a c.s., so we can define an object on the manifold by its expression in a c.s. If the expression has the same form in every c.s., or equivalently the expression in a certain form refers to the same object in any c.s., we say that the form of the object is coordinate invariant. For intersecting c.s., if the expressions in the two c.s. with the same form give different results, then the object is ambiguous on the intersection. One may argue that it is possible to use one certain c.s. to uniquely define the object, whatever its form in that c.s., but firstly, for manifolds without a global c.s., e.g. hyperspheres, we cannot define an object globally on the manifold in this manner, and secondly, for manifolds that do have a global c.s., the choice of that c.s. may be arbitrary, since all c.s. are equivalent to each other: there is no objective reason for choosing one specific c.s. over the others.

As an example, the tuple of partial derivatives $(\partial_1 f, \dots, \partial_m f)$, read as the components of a vector field, is not coordinate invariant, while the gradient components $g^{ij}\partial_j f$ are: under a change of c.s., partial derivatives transform covariantly rather than contravariantly, so the former expression refers to different objects in different c.s. Consequently, the functional gradient of SVGD, Eqn. (2), is not coordinate invariant. Again, the SVGD technique does not meet our demand here.

R3 cuts the computational burden of optimizing w.r.t. $X$, and releases the assumption on the form of $q$ so as to inherit SVGD's flexibility.

Before presenting our solution, we emphasize the difficulty by listing some possible attempts that actually fail. First note that $\mathfrak{X}(\mathcal{M})$ is a linear space over both $\mathbb{R}$ and $C^\infty(\mathcal{M})$, but in either case it is infinite dimensional, and it is intractable to express vector fields in it. Secondly, following the treatment of SVGD, one may consider expressing vector fields component-wise. This idea specifies a tangent vector at each point, and would easily violate R1 and R2, due to an intrinsic difference between general Riemann manifolds and Euclidean space: tangent spaces at different points are not the same. Since R1 and R2 concern global properties, one has to link tangent spaces at different points. Taking a c.s. could uniformly express the tangent vectors on it, but this is not enough, as stated above. One may also consider transporting tangent vectors at different points to one certain pivot point by parallel transport, but the arbitrariness of the pivot point would violate R2. The third possibility is to view a vector field as a map from $\mathcal{M}$ to $T\mathcal{M}$, the tangent bundle of $\mathcal{M}$, but $T\mathcal{M}$ is generally only a manifold, not a linear space, so we cannot apply theories of learning vector-valued functions (e.g. [Micchelli and Pontil 2005]).

#### Solution

We first present our solution, then check the above requirements. Note that the theories of kernels and reproducing kernel Hilbert spaces (RKHS) (see e.g. [Steinwart and Christmann 2008, Chapter 4]) also apply to the manifold case. Let $K$ be a smooth kernel on $\mathcal{M}$ and $\mathcal{H}$ its RKHS. We require $K$ to be such that the zero function is the only constant function in $\mathcal{H}$. Such kernels include the commonly used Gaussian kernel [Steinwart and Christmann 2008, Corollary 4.44].

We choose $\mathcal{X} = \{\mathrm{grad} f : f \in \mathcal{H}\}$, where $\mathrm{grad} f$ is the gradient of a smooth function, which is a valid vector field. In any c.s., $(\mathrm{grad} f)^i = g^{ij}\partial_j f$, where $g^{ij}$ is the $(i,j)$-th entry of $G^{-1}$. The following result indicates that $\mathcal{X}$ can be made into an inner product space.

###### Lemma 3.

With a proper inner product, $\mathcal{X}$ is isometrically isomorphic to $\mathcal{H}$, thus a Hilbert space.

###### Proof.

Define $\Phi: \mathcal{H} \to \mathcal{X}$, $f \mapsto \mathrm{grad} f$, which is linear. For any $f_1, f_2 \in \mathcal{H}$ that satisfy $\Phi(f_1) = \Phi(f_2)$, we have $\mathrm{grad}(f_1 - f_2) = 0$, so $f_1 - f_2$ is a constant function, and $f_1 = f_2$, where the last equality holds by our requirement on $K$. So $\Phi$ is injective. By the definition of $\mathcal{X}$, $\Phi$ is surjective, so it is an isomorphism between $\mathcal{H}$ and $\mathcal{X}$.

Define $\langle \Phi(f_1), \Phi(f_2)\rangle_{\mathcal{X}} := \langle f_1, f_2\rangle_{\mathcal{H}}$ for $f_1, f_2 \in \mathcal{H}$. By the linearity of $\Phi$ and the fact that $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ is an inner product, one can easily verify that $\langle\cdot,\cdot\rangle_{\mathcal{X}}$ is an inner product on $\mathcal{X}$, and that $\Phi$ is an isometric isomorphism. ∎

Next we present our second key result: the objective can be cast as an inner product in $\mathcal{X}$.

###### Theorem 4.

For $\mathcal{X}$ defined above and the objective $J$ in Eqn. (4), we have $J(\mathrm{grad} f) = \langle \mathrm{grad} f, \mathrm{grad} \hat f\rangle_{\mathcal{X}} = \langle f, \hat f\rangle_{\mathcal{H}}$, where

 \[\hat f(\cdot) = \mathbb{E}_{q(x)}\Big[\big(\mathrm{grad}\log p\big)_x[K(x,\cdot)] + \Delta_x K(x,\cdot)\Big], \tag{5}\]

and $\Delta$ is the Beltrami-Laplace operator.

###### Proof (sketch).

Use $f$ to express the vector field $\mathrm{grad} f$. Expand $J(\mathrm{grad} f)$ in some c.s., then cast the partial derivatives of $f$ into the form of inner products in $\mathcal{H}$ based on the results of [Zhou 2008]. Rearranging terms by the linearity of the inner product and the expectation, $J(\mathrm{grad} f)$ can be written as an inner product in $\mathcal{H}$, thus an inner product in $\mathcal{X}$ due to their isomorphism. See Appendix A5 for details. ∎

In the form of an inner product, the maximization problem (4) is solved by $X^* = \mathrm{grad} \hat f / \|\hat f\|_{\mathcal{H}}$. The functional gradient is then $J(X^*)\,X^* = \mathrm{grad} \hat f$.

###### Corollary 5.

The optimal solution $\mathrm{grad} \hat f$ satisfies all three requirements.

Now we check our solution against the three requirements. Since $K$, $\log p$, the Beltrami-Laplace operator, the expectation, and the action of a tangent vector on a smooth function are all coordinate-invariant objects, $\hat f$ is coordinate invariant and thus a valid smooth function on $\mathcal{M}$. So its gradient is a valid vector field (R1) that is also coordinate invariant (R2). R3 is obvious from Eqn. (5).

As a bonus, we present the optimal value of the objective, $J(X^*) = \|\hat f\|_{\mathcal{H}}$, whose square admits the expression

 \[\|\hat f\|_{\mathcal{H}}^2 = \mathbb{E}_{q(x)q(x')}\Big[\big(\mathrm{grad}\log p\big)_x\big[\big(\mathrm{grad}'\log p'\big)_{x'}[K]\big] + \big(\mathrm{grad}\log p\big)_x\big[\Delta' K\big] + \Delta\big(\mathrm{grad}'\log p'\big)_{x'}[K] + \Delta\Delta' K\Big],\]

where $K = K(x, x')$, and notations with a prime "$'$" take $x'$ as argument while the others take $x$. We call this quantity the Riemannian Kernelized Stein Discrepancy (RKSD) between distributions $q$ and $p$, a generalization of the Kernelized Stein Discrepancy [Liu, Lee, and Jordan 2016; Chwialkowski, Strathmann, and Gretton 2016].

After deriving the functional gradient $\mathrm{grad} \hat f$, we can simulate its dynamics in any c.s. by $x^b \leftarrow x^b + \varepsilon\,(\mathrm{grad} \hat f)^b$ for each component $b$, which is a 1st-order approximation of the flow of $\mathrm{grad} \hat f$, where

 \[\hat f(\cdot) = \mathbb{E}_{q(x)}\Big[g^{ab}\,\partial_a\log\big(p\sqrt{|G|}\big)\,\partial_b K(x,\cdot) + \partial_a\big(g^{ab}\big)\,\partial_b K(x,\cdot) + g^{ab}\,\partial_a\partial_b K(x,\cdot)\Big]. \tag{6}\]

This is the update rule of particles for inference tasks on Euclidean space, where the expectation is estimated by averaging over the current particles, and the Riemann metric tensor is taken as the Fisher information of the likelihood minus the Hessian of the log-prior density, as adopted by [Girolami and Calderhead 2011]. Note that $g^{ab}$ is the $(a,b)$-th entry of the inverse metric matrix $G^{-1}$.
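For intuition, the sketch below illustrates only the special case of a constant metric tensor, where the metric-derivative terms of Eqn. (6) vanish and the update behaves like a $G^{-1}$-preconditioned (matrix-kernel) SVGD step. This is a simplification rather than the full RSVGD update, and the function name and toy Gaussian target are our own illustrative assumptions.

```python
import numpy as np

def preconditioned_svgd_step(X, grad_logp, G_inv, h=1.0, eps=0.05):
    """One step of a G^{-1}-preconditioned (matrix-kernel) SVGD update:
    x_i += eps * G_inv @ mean_j [K(x_j,x_i) grad_logp(x_j) + grad_{x_j} K(x_j,x_i)].
    A constant-metric simplification of the RSVGD coordinate update."""
    diff = X[:, None, :] - X[None, :, :]            # diff[j, i] = x_j - x_i
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))
    grad_K = -diff / h ** 2 * K[:, :, None]         # grad_{x_j} K(x_j, x_i)
    phi = (K[:, :, None] * grad_logp(X)[:, None, :] + grad_K).mean(axis=0)
    return X + eps * phi @ G_inv.T

# Toy target: zero-mean Gaussian with ill-conditioned covariance; the metric is
# its Fisher information (the precision), so G_inv is the covariance.
Cov = np.diag([10.0, 0.1])
P = np.linalg.inv(Cov)
grad_logp = lambda X: -X @ P.T
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2)) * 0.5 + 4.0
for _ in range(300):
    X = preconditioned_svgd_step(X, grad_logp, G_inv=Cov)
```

Preconditioning by the inverse metric equalizes step sizes along directions of very different posterior scales, which is the intuition that Eqn. (6) implements exactly in the general, position-dependent case.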

### Expression in the Embedded Space

From the above discussion, we can simulate the optimal dynamics in a c.s. But this is not always the most convenient approach. Some manifolds, like hyperspheres or Stiefel manifolds [James 1976], have no global c.s., so we would have to change c.s. constantly while simulating, and compute the metric-related quantities in each c.s.; it is even hard to find a c.s. of a Stiefel manifold. On the other hand, such manifolds are defined as subsets of some Euclidean space, which serves as a natural embedded space. This motivates us to express the dynamics of the functional gradient in the embedded space and simulate it there.

Formally, an embedding of a manifold $\mathcal{M}$ is a smooth injection $\Pi: \mathcal{M} \to \mathbb{R}^n$ for some $n \ge m$. For a Riemann manifold, $\Pi$ is isometric if $g_{ij} = \sum_{k=1}^n (\partial_i \Pi^k)(\partial_j \Pi^k)$ in any c.s., where $\Pi^k$ is the $k$-th component of $\Pi$. The Hausdorff measure on $\Pi(\mathcal{M}) \subseteq \mathbb{R}^n$ induces a measure on $\mathcal{M}$, w.r.t. which we have a p.d.f. We recognize that for an isometric embedding, this p.d.f. coincides with the one w.r.t. $\mu_g$.

Since $\Pi$ is injective, functions and distributions on $\mathcal{M}$ can be identified with their counterparts on $\Pi(\mathcal{M})$, in any c.s. The following proposition gives a general expression of the functional gradient in an isometrically embedded space.

###### Proposition 6.

Let all the symbols take arguments in the isometrically embedded space $\mathbb{R}^n$ (with orthonormal basis $\{e_i\}_{i=1}^n$) via composition with $\Pi$ or $\Pi^{-1}$. We have, for $y \in \Pi(\mathcal{M})$,

 \[\hat f' = \mathbb{E}_q\Big[\big(\nabla\log\big(p\sqrt{|G|}\big)\big)^\top(I_n - NN^\top)(\nabla K) + \nabla^\top\nabla K - \mathrm{tr}\big(N^\top(\nabla\nabla^\top K)N\big) + \big((M^\top\nabla)^\top(G^{-1}M^\top)\big)(\nabla K)\Big], \tag{7}\]

where $I_n$ is the identity matrix, $M$ is the Jacobian matrix of $\Pi$ in the chosen c.s., $N = (N_1, \dots, N_{n-m})$ is an orthonormal basis of the orthogonal complement of the $\Pi$-pushed-forward tangent space (an $m$-dimensional linear subspace of $\mathbb{R}^n$), and $\mathrm{tr}(\cdot)$ is the trace of a matrix.

See Appendix A6.1 for the derivation. Note that $N$ does not depend on the choice of c.s. of $\mathcal{M}$ while $M$ and $G$ do, but the final result does not. Simulating the dynamics is quite different from the coordinate-space case due to the constraint of staying on $\Pi(\mathcal{M})$. A 1st-order approximation of the flow is

 \[y(t+\varepsilon) = \mathrm{Exp}_{y(t)}\big(\varepsilon\,\hat X(y(t))\big),\]

where $\mathrm{Exp}_y$ is the exponential map at $y$, which maps a tangent vector $v$ at $y$ to the end point of moving along the geodesic determined by $(y, v)$ for unit time (in the parameter of the geodesic). In $\mathbb{R}^n$, $y$ is moved along a straight line in the direction of $v$ for length $\|v\|$, and on $S^{n-1}$, $y$ is moved along the great circle (orthodrome) tangent to $v$ at $y$ for length $\|v\|$.

#### Instantiation for Hyperspheres

We demonstrate the above result with the instance of the hypersphere $S^{n-1}$.

###### Proposition 7.

For $S^{n-1}$ isometrically embedded in $\mathbb{R}^n$ with orthonormal basis $\{e_i\}_{i=1}^n$, we have, for $y \in S^{n-1}$,

 \[\hat f' = \mathbb{E}_q\Big[(\nabla\log p)^\top(\nabla K) + \nabla^\top\nabla K - y^\top\big(\nabla\nabla^\top K\big)y - \big(y^\top\nabla\log p + n - 1\big)\,y^\top\nabla K\Big]. \tag{8}\]

Note that the form of the expression does not depend on any c.s. of $S^{n-1}$. The exponential map on $S^{n-1}$ is given by

 \[\mathrm{Exp}_y(v) = y\cos(\|v\|) + \frac{v}{\|v\|}\sin(\|v\|).\]

See Appendix A6.2 for more details. Appendix A6.3 further provides the expression for a product manifold of hyperspheres, which is the specific manifold on which the inference task of the Spherical Admixture Model is defined.
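A minimal numpy sketch of the update primitive on the hypersphere (the exponential map above, plus the tangent-space projection needed to keep update directions tangent); the helper names are our own:

```python
import numpy as np

def project_tangent(y, u):
    """Project an ambient vector u onto the tangent space of the sphere at y."""
    return u - np.dot(y, u) * y

def sphere_exp(y, v):
    """Exponential map on the unit hypersphere: Exp_y(v) = y cos||v|| + (v/||v||) sin||v||.
    Assumes y is a unit vector and v is tangent at y (v . y = 0)."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return y
    return y * np.cos(norm) + (v / norm) * np.sin(norm)

# One first-order step y <- Exp_y(eps * X(y)) for an arbitrary ambient direction u;
# the result stays exactly on the sphere, unlike a plain Euclidean step y + eps*u.
y = np.array([1.0, 0.0, 0.0])
u = np.array([0.0, 2.0, 1.0])
y_new = sphere_exp(y, 0.1 * project_tangent(y, u))
```

Following the geodesic instead of taking a straight-line step is what removes the need for any re-normalization of the particles.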

## Experiments

We find that a naïve mini-batch implementation of RSVGD does not perform well in experiments, which may require further investigation into the impact of gradient noise on the dynamics. So we focus only on the full-batch performance of RSVGD in experiments (codes and data available at http://ml.cs.tsinghua.edu.cn/~changliu/rsvgd/).

### Bayesian Logistic Regression

We test the empirical performance of RSVGD on Euclidean space (Eqn. 6) on the inference task of Bayesian logistic regression (BLR). BLR generates a latent variable from a Gaussian prior, and for each datum draws its label from a Bernoulli distribution whose parameter is the logistic sigmoid of a linear function of the datum and the latent variable. The inference task is to estimate the posterior of the latent variable given the data. Note that for this model, as shown in Appendix A7, all the quantities involving the Riemann metric tensor have closed forms, except the inverse metric, which can be obtained by either direct numerical inversion or an iterative method over the dataset.

Kernel As stated before, in the Euclidean case we essentially conduct RSVGD on the distribution manifold. To specify a kernel on it, we note that the Euclidean space of the latent variable is a global c.s. of the manifold. Via the coordinate mapping, any kernel on the Euclidean space is a kernel on the manifold [Steinwart and Christmann 2008, Lemma 4.3]. In this sense we choose the Gaussian kernel. Furthermore, we implement the kernel for both SVGD and RSVGD as a sum of several Gaussian kernels with different bandwidths, which is still a valid kernel [Steinwart and Christmann 2008, Lemma 4.5]. We find this slightly better than the median trick of SVGD in our experiments.
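The summed-kernel trick is straightforward to implement; a small sketch (the function name and the bandwidth grid are our own illustrative choices):

```python
import numpy as np

def summed_gaussian_kernel(X, Y, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Sum of Gaussian kernels over several bandwidths; a sum of kernels
    is itself a valid positive-definite kernel."""
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return sum(np.exp(-sq / (2 * h ** 2)) for h in bandwidths)

X = np.random.default_rng(0).normal(size=(5, 3))
K = summed_gaussian_kernel(X, X)   # Gram matrix of the summed kernel
```

Mixing bandwidths covers several interaction scales at once, which is why it can replace per-iteration bandwidth heuristics such as the median trick.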

Setups We compare RSVGD with SVGD (full-batch) for test accuracy along iterations. We use 100 particles for both methods. We use the Splice19 dataset (1,000 training entries, 60 features), one of the benchmark datasets compiled by [Mika et al. 1999], and the Covertype dataset (581,012 entries, 54 features) also used by [Liu and Wang 2016]. RSVGD updates particles by the aforementioned 1st-order flow approximation, which is effectively vanilla gradient descent, while SVGD uses the recommended AdaGrad with momentum. We also tried gradient descent for SVGD, with no better outcomes.

Results We see from Fig. 1 that RSVGD makes more effective updates than SVGD on both datasets, indicating the benefit of utilizing the distribution geometry for Bayesian inference. Although AdaGrad with momentum, which SVGD uses, can be viewed as a method for estimating the geometry empirically [Duchi, Hazan, and Singer 2011], our method provides a more principled solution, with more precise results.

### Spherical Admixture Model

We investigate the advantages of RSVGD on Riemann manifolds (Eqn. 7) on the inference task of the Spherical Admixture Model (SAM) [Reisinger et al. 2010], a topic model for data on hyperspheres, e.g. the tf-idf features of documents. The model first generates the corpus mean and topics, then for each document generates its topic proportions and content, where the von Mises-Fisher distribution [Mardia and Jupp 2000] models random variables on hyperspheres, and the Dirichlet distribution models topic proportions.

The inference task of SAM is to estimate the posterior of the topics. Note that the topics lie in a product manifold of hyperspheres.

Kernel Like the Gaussian kernel in the Euclidean case, we use the von Mises-Fisher (vMF) kernel on hyperspheres. Note that the vMF kernel on $S^{n-1}$ is, up to a constant factor, the restriction of a Gaussian kernel on $\mathbb{R}^n$: for unit vectors $y, y'$, $\exp\big(-\|y-y'\|^2/(2h^2)\big) = e^{-1/h^2}\exp\big(y^\top y'/h^2\big)$, so it is a valid kernel on $S^{n-1}$ [Steinwart and Christmann 2008, Lemma 4.3]. We also recognized that the arcsine kernel is a kernel on $S^{n-1}$ as well, due to the non-negative Taylor coefficients of the arcsine function [Steinwart and Christmann 2008, Lemma 4.8]. But unfortunately it does not work well in experiments.

Given a kernel on the hypersphere, we construct from it a kernel on the product manifold of hyperspheres. Again we use the summed-kernel trick for RSVGD.

Setups The manifold constraint isolates the task from most prevalent inference methods, including SVGD. A mean-field variational inference method (VI) was proposed in the original work of SAM. [Liu, Zhu, and Song 2016] present more methods based on advanced Markov chain Monte Carlo (MCMC) methods on manifolds, including Geodesic Monte Carlo (GMC) [Byrne and Girolami 2013], and their proposed Stochastic Gradient Geodesic Monte Carlo (SGGMC), a scalable mini-batch method. We implement the two MCMCs in both the standard sequential (-seq) way and the parallel (-par) way: run multiple chains and collect the last sample on each chain. To apply RSVGD for inference, as in the GMC and SGGMC cases, we adopt the framework of [Liu, Zhu, and Song 2016], which directly provides an estimate of all the information we need. We run all the methods on the 20News-different dataset (1,666 training entries, 5,000 features) with the same default hyperparameters as [Liu, Zhu, and Song 2016]. We measure progress in epochs (the number of passes over the dataset) instead of iterations, since SGGMCb is run with mini-batches.

Results Fig. 2(a) shows that RSVGD makes the most effective progress along epochs, indicating its iteration-effectiveness. VI converges fast but to a less satisfying state due to its restrictive variational assumption, while RSVGD, with its high approximation flexibility, achieves a better result comparable to the MCMC results. GMC methods make consistent but limited progress; 200 epochs is too short for them. Although SGGMC methods perform outstandingly in [Liu, Zhu, and Song 2016], they are hampered here by the limited particle size. Both fed on full-batch gradients, SGGMCf has a more active dynamics than GMC and thus can explore a broader region, but it is still not as efficient as RSVGD. The -seq and -par versions of an MCMC behave similarly, although -par seems better at early stages since it has more particles.

Fig. 2(b) presents the results over various numbers of particles, where we observe the particle-efficiency of RSVGD. With fewer particles, SGGMCf methods have the chance to converge well within 200 epochs and can thus outperform RSVGD. With more particles, MCMCs make less salient progress, since the positive autocorrelation of their samples limits the representativeness of a limited number of particles.

## Conclusion

We develop Riemannian Stein Variational Gradient Descent (RSVGD), an extension of SVGD [Liu and Wang2016] to Riemann manifolds. We generalize the idea of SVGD and derive the directional derivative on Riemann manifolds. To solve for a valid and closed-form functional gradient, we first analyze the requirements imposed by Riemann manifolds and illustrate the failure of the SVGD techniques in our case, then propose our solution and validate it. Experiments show the benefit of utilizing distribution geometry on inference tasks in Euclidean space, and the advantages of particle-efficiency, iteration-effectiveness and approximation flexibility on Riemann manifolds.
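As a point of reference for the method being generalized, the Euclidean SVGD update of [Liu and Wang2016] can be sketched in a few lines. This is a minimal NumPy sketch, assuming a fixed-bandwidth RBF kernel and a user-supplied `grad_log_p`; it omits the Riemannian structure of RSVGD and the bandwidth heuristics a practical implementation would use.

```python
import numpy as np

def svgd_step(particles, grad_log_p, stepsize=0.1, bandwidth=1.0):
    """One Euclidean SVGD update (Liu & Wang 2016):
    x_i <- x_i + eps/n * sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ].
    """
    n, _ = particles.shape
    diffs = particles[:, None, :] - particles[None, :, :]   # diffs[i, j] = x_i - x_j
    sq_dists = np.sum(diffs ** 2, axis=-1)
    k = np.exp(-sq_dists / (2 * bandwidth ** 2))            # RBF kernel matrix k[i, j]
    grads = grad_log_p(particles)                           # (n, d): grad log p at each particle
    drive = k @ grads                                       # pulls particles toward high density
    repulse = np.sum(k[:, :, None] * diffs, axis=1) / bandwidth ** 2  # keeps particles apart
    return particles + stepsize * (drive + repulse) / n
```

Iterating this step transports an arbitrary set of particles toward the target; the repulsion term is what distinguishes it from plain gradient ascent on log p.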

Possible future directions include exploiting the Riemannian Kernelized Stein Discrepancy, which would be more appropriate with a properly chosen manifold. A mini-batch version of RSVGD for scalability is also an interesting direction, since the naïve implementation does not work well, as mentioned. Applying RSVGD on a broader stage is also promising, including Euclidean-space tasks such as deep generative models and Bayesian neural networks (with the Riemann metric tensor estimated in the way of [Li et al.2016]), and Riemann-manifold tasks such as Bayesian matrix completion [Song and Zhu2016] on the Stiefel manifold.

#### Acknowledgements

This work is supported by the National NSF of China (Nos. 61620106010, 61621136008, 61332007) and Tiangong Institute for Intelligent Computing. We thank Jingwei Zhuo and Yong Ren for inspiring discussions.

## References

• [Abraham, Marsden, and Ratiu2012] Abraham, R.; Marsden, J. E.; and Ratiu, T. 2012. Manifolds, tensor analysis, and applications, volume 75. Springer Science & Business Media.
• [Amari and Nagaoka2007] Amari, S.-I., and Nagaoka, H. 2007. Methods of information geometry, volume 191. American Mathematical Soc.
• [Amari2016] Amari, S.-I. 2016. Information geometry and its applications. Springer.
• [Bonnabel2013] Bonnabel, S. 2013. Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control 58(9):2217–2229.
• [Brubaker, Salzmann, and Urtasun2012] Brubaker, M. A.; Salzmann, M.; and Urtasun, R. 2012. A family of mcmc methods on implicitly defined manifolds. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 161–172.
• [Byrne and Girolami2013] Byrne, S., and Girolami, M. 2013. Geodesic monte carlo on embedded manifolds. Scandinavian Journal of Statistics 40(4):825–845.
• [Chwialkowski, Strathmann, and Gretton2016] Chwialkowski, K.; Strathmann, H.; and Gretton, A. 2016. A kernel test of goodness of fit. In International Conference on Machine Learning, 2606–2615.
• [Dai et al.2016] Dai, B.; He, N.; Dai, H.; and Song, L. 2016. Provable bayesian inference via particle mirror descent. In Artificial Intelligence and Statistics, 985–994.
• [Do Carmo1992] Do Carmo, M. P. 1992. Riemannian Geometry.
• [Duchi, Hazan, and Singer2011] Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul):2121–2159.
• [Frankel2011] Frankel, T. 2011. The geometry of physics: an introduction. Cambridge University Press.
• [Gemici, Rezende, and Mohamed2016] Gemici, M. C.; Rezende, D.; and Mohamed, S. 2016. Normalizing flows on riemannian manifolds. arXiv preprint arXiv:1611.02304.
• [Girolami and Calderhead2011] Girolami, M., and Calderhead, B. 2011. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(2):123–214.
• [Haarnoja et al.2017] Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165.
• [Hoffman et al.2013] Hoffman, M. D.; Blei, D. M.; Wang, C.; and Paisley, J. 2013. Stochastic variational inference. The Journal of Machine Learning Research 14(1):1303–1347.
• [James1976] James, I. M. 1976. The topology of Stiefel manifolds, volume 24. Cambridge University Press.
• [Li et al.2016] Li, C.; Chen, C.; Carlson, D. E.; and Carin, L. 2016. Preconditioned stochastic gradient langevin dynamics for deep neural networks. In AAAI, volume 2,  4.
• [Liu and Wang2016] Liu, Q., and Wang, D. 2016. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in Neural Information Processing Systems, 2370–2378.
• [Liu et al.2017] Liu, Y.; Ramachandran, P.; Liu, Q.; and Peng, J. 2017. Stein variational policy gradient. arXiv preprint arXiv:1704.02399.
• [Liu, Lee, and Jordan2016] Liu, Q.; Lee, J. D.; and Jordan, M. I. 2016. A kernelized stein discrepancy for goodness-of-fit tests. In Proceedings of the International Conference on Machine Learning (ICML).
• [Liu, Zhu, and Song2016] Liu, C.; Zhu, J.; and Song, Y. 2016. Stochastic gradient geodesic mcmc methods. In Advances In Neural Information Processing Systems, 3009–3017.
• [Ma, Chen, and Fox2015] Ma, Y.-A.; Chen, T.; and Fox, E. 2015. A complete recipe for stochastic gradient mcmc. In Advances in Neural Information Processing Systems, 2917–2925.
• [Mardia and Jupp2000] Mardia, K. V., and Jupp, P. E. 2000. Distributions on spheres. Directional Statistics 159–192.
• [Micchelli and Pontil2005] Micchelli, C. A., and Pontil, M. 2005. On learning vector-valued functions. Neural computation 17(1):177–204.
• [Mika et al.1999] Mika, S.; Ratsch, G.; Weston, J.; Scholkopf, B.; and Mullers, K.-R. 1999. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop., 41–48. IEEE.
• [Pu et al.2017] Pu, Y.; Gan, Z.; Henao, R.; Li, C.; Han, S.; and Carin, L. 2017. Stein variational autoencoder. arXiv preprint arXiv:1704.05155.
• [Reisinger et al.2010] Reisinger, J.; Waters, A.; Silverthorn, B.; and Mooney, R. J. 2010. Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 903–910.
• [Rezende and Mohamed2015] Rezende, D., and Mohamed, S. 2015. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, 1530–1538.
• [Romano2007] Romano, G. 2007. Continuum mechanics on manifolds. Lecture notes University of Naples Federico II, Naples, Italy 1–695.
• [Sherman and Morrison1950] Sherman, J., and Morrison, W. J. 1950. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Annals of Mathematical Statistics 21(1):124–127.
• [Song and Zhu2016] Song, Y., and Zhu, J. 2016. Bayesian matrix completion via adaptive relaxed spectral regularization. In The 30th AAAI Conference on Artificial Intelligence (AAAI-16).
• [Steinwart and Christmann2008] Steinwart, I., and Christmann, A. 2008. Support vector machines. Springer Science & Business Media.
• [Wang and Liu2016] Wang, D., and Liu, Q. 2016. Learning to draw samples: With application to amortized mle for generative adversarial learning. arXiv preprint arXiv:1611.01722.
• [Zhang, Reddi, and Sra2016] Zhang, H.; Reddi, S. J.; and Sra, S. 2016. Riemannian svrg: Fast stochastic optimization on riemannian manifolds. In Advances in Neural Information Processing Systems, 4592–4600.
• [Zhou2008] Zhou, D.-X. 2008. Derivative reproducing properties for kernel methods in learning theory. Journal of computational and Applied Mathematics 220(1):456–463.

## Appendix

### A1. Proof of Lemma 1 (Continuity Equation on Riemann Manifold)

Let $F_t$ be the flow of $X$. For compact $U \subset \mathcal{M}$, consider the integral $\int_{F_t(U)} p_t\,\mu_g$. Since a particle in $U$ at time $0$ will always be in $F_t(U)$ at time $t$ and vice versa, the integral, i.e. the portion of particles in $F_t(U)$ at time $t$, is equal to the portion of particles in $U$ at time $0$, for any time $t$. So it is a constant. Reynolds' transport theorem gives

$$0 = \frac{\mathrm{d}}{\mathrm{d}t} \int_{F_t(U)} p_t\,\mu_g = \int_{F_t(U)} \Big( \frac{\partial p_t}{\partial t} + \mathrm{div}(p_t X) \Big)\,\mu_g$$

for any $U$ and $t$, so the integrand must be zero, and we derive the conclusion.
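In the Euclidean special case the continuity equation reads $\partial p_t/\partial t + \mathrm{div}(p_t X) = 0$, and it can be verified symbolically for a simple flow. The following is a toy illustration, not part of the proof: a constant drift $X(x) = c$ transports a density as $p_t(x) = p_0(x - ct)$.

```python
import sympy as sp

x, t, c = sp.symbols('x t c', real=True)
p0 = sp.exp(-x ** 2 / 2) / sp.sqrt(2 * sp.pi)   # initial density: standard normal
pt = p0.subs(x, x - c * t)                      # density transported by the flow of X(x) = c
# Euclidean continuity equation: dp_t/dt + d(p_t * X)/dx should vanish identically
residual = sp.simplify(sp.diff(pt, t) + sp.diff(pt * c, x))
print(residual)   # 0
```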

### A2. Well-definedness of KL-divergence on Riemann Manifold

We define the KL-divergence between two distributions on $\mathcal{M}$ by their p.d.f.s $q_\mu$ and $p_\mu$ w.r.t. a volume form $\mu$ as:

$$\mathrm{KL}(q\,\|\,p) := \int_{\mathcal{M}} q_\mu \log(q_\mu / p_\mu)\,\mu.$$

To make this notion well-defined, we need to show that the right-hand side of the definition is invariant of the choice of $\mu$. Let $\nu$ be another volume form. Since, at every point, $\mu$ and $\nu$ lie on the same 1-dimensional linear space (the space of top-degree forms at that point), we have $h$ s.t. $\nu = h\mu$. Such a construction gives a smooth function $h$. By the definition of p.d.f., $q_\nu = q_\mu / h$ and $p_\nu = p_\mu / h$. So $q_\nu \log(q_\nu / p_\nu)\,\nu = (q_\mu / h) \log(q_\mu / p_\mu)\,h\mu = q_\mu \log(q_\mu / p_\mu)\,\mu$, which indicates that the integral is independent of the chosen volume form.
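The same cancellation can be observed numerically in one dimension. This is a toy check with a hypothetical alternative volume form $\nu = (1 + x^2)\,dx$; since the ratio of densities is invariant, both integrals agree.

```python
import numpy as np
from scipy.integrate import quad

# densities of q = N(0.5, 1) and p = N(0, 1) w.r.t. the Lebesgue volume form dx
q = lambda x: np.exp(-(x - 0.5) ** 2 / 2) / np.sqrt(2 * np.pi)
p = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
h = lambda x: 1.0 + x ** 2    # another (hypothetical) volume form: nu = h(x) dx

# KL computed w.r.t. dx
kl_mu, _ = quad(lambda x: q(x) * np.log(q(x) / p(x)), -10, 10)
# KL computed w.r.t. nu: densities become q/h and p/h, and we integrate against h(x) dx
kl_nu, _ = quad(lambda x: (q(x) / h(x)) * np.log((q(x) / h(x)) / (p(x) / h(x))) * h(x), -10, 10)

print(kl_mu, kl_nu)   # both equal KL(N(0.5,1) || N(0,1)) = 0.125
```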

### A3. Proof of Theorem 2

To formally prove Theorem 2, we first deduce a lemma, which gives the p.d.f. of the distribution transformed by a diffeomorphism on $\mathcal{M}$ (an invertible smooth transformation of $\mathcal{M}$).

###### Lemma 8 (Transformed p.d.f.).

Let $\phi$ be an orientation-preserving diffeomorphism on $\mathcal{M}$, and $p$ the p.d.f. of a distribution on $\mathcal{M}$. Denote by $p_{[\phi]}$ the p.d.f. of the distribution of the $\phi$-transformed random variable from the one obeying $p$, i.e. the transformed p.d.f. Then in any local coordinate system (c.s.),

$$p_{[\phi]} = \frac{(p\sqrt{|G|}) \circ \phi^{-1}}{\sqrt{|G|}}\,\big|\mathrm{Jac}\,\phi^{-1}\big|, \qquad (9)$$

where $G$ is the Riemann metric tensor in the c.s., $|G|$ is its determinant, and $\mathrm{Jac}\,\phi^{-1}$ is the Jacobian determinant of $\phi^{-1}$. The right-hand side is coordinate invariant.

###### Proof.

Let $U$ be a compact subset of $\mathcal{M}$, and take a local c.s. of $\mathcal{M}$ with its coordinate chart. On one hand, due to the definition of $p_{[\phi]}$, we have $\int_{\phi(U)} p_{[\phi]}\,\mu_g = \int_U p\,\mu_g$. On the other hand, we can invoke the theorem of global change of variables on manifolds (Abraham, Marsden, and Ratiu 2012, Theorem 8.1.7), which gives

$$\int_U p\,\mu_g = \int_{\phi(U)} \phi^{-1*}(p\,\mu_g) = \int_{\phi(U)} (p \circ \phi^{-1})\,\phi^{-1*}(\mu_g) \qquad (10)$$
$$= \int_{\phi(U)} \frac{(p\sqrt{|G|}) \circ \phi^{-1}}{\sqrt{|G|}}\,\big|\mathrm{Jac}\,\phi^{-1}\big|\,\mu_g, \qquad (11)$$

where $\phi^{-1*}$ is the pull-back of $\phi^{-1}$ acting on the top-degree forms on $\mathcal{M}$. Combining the two results and noting the arbitrariness of $U$, we get the desired conclusion. ∎
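In Euclidean coordinates with $G = I$, Eqn. (9) reduces to the familiar change-of-variables formula $p_{[\phi]} = (p \circ \phi^{-1})\,|\mathrm{Jac}\,\phi^{-1}|$, which can be checked numerically. This toy check uses $\phi(x) = e^x$, under which a standard normal becomes a standard lognormal.

```python
import numpy as np
from scipy.stats import norm, lognorm

# Lemma 8 in Euclidean coordinates (G = I): p_[phi] = (p o phi^{-1}) * |Jac phi^{-1}|
phi_inv = np.log                     # phi(x) = exp(x), so phi^{-1}(y) = log(y)
jac_phi_inv = lambda y: 1.0 / y      # d/dy log(y)

y = np.linspace(0.1, 5.0, 200)
p_transformed = norm.pdf(phi_inv(y)) * np.abs(jac_phi_inv(y))
# a standard normal pushed through exp is exactly the standard lognormal
assert np.allclose(p_transformed, lognorm.pdf(y, s=1))
```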

Let $F_t$ be the flow of $X$. For any distribution $q_t$ evolving under the dynamics $X$, by its definition we have $q_{t_0+t} = (q_{t_0})_{[F_t]}$. Due to the property of flows that $F_s \circ F_t = F_{s+t}$ for any $s$ and $t$, we have $F_t^{-1} = F_{-t}$.

Now, for a fixed $t_0$, we let $p_t$ be the evolving distribution under $X$ that satisfies $p_{t_0} = p$, the target distribution. For sufficiently small $t$, $F_t$ is a diffeomorphism on $\mathcal{M}$. Equipped with all this knowledge, we begin the final deduction:

$$-\frac{\mathrm{d}}{\mathrm{d}t}\Big|_{t=t_0} \mathrm{KL}(q_t\,\|\,p) = -\frac{\mathrm{d}}{\mathrm{d}t}\Big|_{t=0} \int_{\mathcal{M}} q_{t_0+t} \log\frac{q_{t_0+t}}{p_{t_0}}\,\mu_g$$

(Treat $q_{t_0+t}$ as $(q_{t_0})_{[F_t]}$ and apply Eqn. (9))

$$= -\frac{\mathrm{d}}{\mathrm{d}t}\Big|_{t=0} \int_{\mathcal{M}} \frac{(q_{t_0}\sqrt{|G|}) \circ F_t^{-1}}{\sqrt{|G|}} \big|\mathrm{Jac}\,F_t^{-1}\big| \cdot \Big( \log\frac{(q_{t_0}\sqrt{|G|}) \circ F_t^{-1}}{\sqrt{|G|}} + \log\big|\mathrm{Jac}\,F_t^{-1}\big| - \log p_{t_0} \Big)\,\mu_g$$

(Apply $F_t^{-1}$ to the entire integral and invoke the theorem of global change of variables, Eqn. (10))

$$= -\frac{\mathrm{d}}{\mathrm{d}t}\Big|_{t=0} \int_{F_t^{-1}(\mathcal{M})} \bigg( \Big[ \frac{(q_{t_0}\sqrt{|G|}) \circ F_t^{-1}}{\sqrt{|G|}} \big|\mathrm{Jac}\,F_t^{-1}\big| \cdot \Big( \log\frac{(q_{t_0}\sqrt{|G|}) \circ F_t^{-1}}{\sqrt{|G|}} + \log\big|\mathrm{Jac}\,F_t^{-1}\big| - \log p_{t_0} \Big) \Big] \circ F_t \bigg)\,F_t^{*}(\mu_g)$$

($F_t^{-1}(\mathcal{M}) = \mathcal{M}$ since $F_t^{-1}$ is a diffeomorphism on $\mathcal{M}$; $|\mathrm{Jac}\,F_t^{-1}| \circ F_t = |\mathrm{Jac}\,F_t|^{-1}$; see Eqn. (11) for the expression of $F_t^{*}(\mu_g)$, the pull-back of $F_t$ on $\mu_g$)

$$= -\frac{\mathrm{d}}{\mathrm{d}t}\Big|_{t=0} \int_{\mathcal{M}} \frac{q_{t_0}\sqrt{|G|}}{\sqrt{|G|} \circ F_t} \big|\mathrm{Jac}\,F_t\big|^{-1} \cdot \Big( \log\frac{q_{t_0}\sqrt{|G|}}{\sqrt{|G|} \circ F_t} - \log\big|\mathrm{Jac}\,F_t\big| - \log(p_{t_0} \circ F_t) \Big) \cdot \frac{\sqrt{|G|} \circ F_t}{\sqrt{|G|}} \big|\mathrm{Jac}\,F_t\big|\,\mu_g$$

(Rearrange terms)

$$= -\frac{\mathrm{d}}{\mathrm{d}t}\Big|_{t=0} \int_{\mathcal{M}} q_{t_0} \Big[ \log q_{t_0} - \log\Big( \frac{(p_{t_0}\sqrt{|G|}) \circ F_t}{\sqrt{|G|}}\,\big|\mathrm{Jac}\,F_t\big| \Big) \Big]\,\mu_g$$

(Note the property of flow: $F_t = F_{-t}^{-1}$. Treat $p_{t_0-t}$ as $(p_{t_0})_{[F_{-t}]}$ and apply Eqn. (9) inversely)

$$= -\frac{\mathrm{d}}{\mathrm{d}t}\Big|_{t=0} \int_{\mathcal{M}} q_{t_0} \big[ \log q_{t_0} - \log p_{t_0-t} \big]\,\mu_g$$

($\mathcal{M}$ is unchanged over time $t$; otherwise an integral over the boundary would appear)

$$= \int_{\mathcal{M}} q_{t_0} \frac{\partial}{\partial t}\big( \log p_{t_0-t} \big)\Big|_{t=0}\,\mu_g = -\int_{\mathcal{M}} q_{t_0} \frac{\partial}{\partial t}\big( \log p_{t_0+t} \big)\Big|_{t=0}\,\mu_g$$

(Refer to Eqn. (3))

$$= \int_{\mathcal{M}} (q_{t_0}/p_{t_0})\,\mathrm{div}(p_{t_0} X)\,\mu_g = \mathbb{E}_{q_{t_0}}\big[ \mathrm{div}(p_{t_0} X)/p_{t_0} \big].$$
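Setting $q_{t_0} = p_{t_0}$ in the final expression gives $\mathbb{E}_p[\mathrm{div}(pX)/p] = 0$, a Stein-type identity (it holds provided $pX$ decays appropriately). Its Euclidean special case, $\mathbb{E}_p[X'(x) + X(x)\,\nabla \log p(x)] = 0$, is easy to verify by Monte Carlo; the following toy check uses $p = N(0,1)$ and a test field $X(x) = \sin x$, both chosen purely for illustration.

```python
import numpy as np

# Euclidean Stein identity: E_p[ div(p X) / p ] = E_p[ X'(x) + X(x) * d/dx log p(x) ] = 0
rng = np.random.default_rng(0)
xs = rng.standard_normal(2_000_000)          # samples from p = N(0, 1); d/dx log p(x) = -x
stein_values = np.cos(xs) + np.sin(xs) * (-xs)
print(stein_values.mean())                   # close to 0, up to Monte Carlo error
```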