1 Introduction
A flexible approach to density estimation is to parameterize an unnormalized density function, or energy function. In particular, unnormalized models with energy parameterized by deep neural networks have been successfully applied to density estimation
(Wenliang et al., 2019; Saremi et al., 2018) and learning implicit autoencoding models (Song et al., 2019). Parameter estimation for such unnormalized models is highly nontrivial: the maximum likelihood objective is intractable, due to the presence of a normalization term. Score matching (Hyvärinen, 2005) is a popular alternative, yet applying score matching to complex unnormalized models can be difficult, as the objective involves the second-order derivative of the energy, rendering gradient-based optimization infeasible. In practice, practitioners turn to scalable approximations of the score matching objective (Song et al., 2019; Hyvarinen, 2007; Vincent, 2011; Raphan and Simoncelli, 2011), or to other objectives such as the kernelized Stein discrepancy (KSD; Liu et al., 2016b; Liu and Wang, 2017). So far, approximations to these objectives have been developed on a case-by-case basis, leaving important applications unaddressed; for example, there is a lack of scalable learning methods for unnormalized models on manifolds (Mardia et al., 2016).
In this work, we present a unifying perspective to this problem, and derive scalable approximations for a variety of learning objectives including score matching. We start by interpreting these objectives as the initial velocity
of certain distribution-space gradient flows, which are simulated by common samplers. This novel interpretation leads to a scalable approximation algorithm for all such objectives, reminiscent of single-step contrastive divergence (CD-1).
We refer to any objective with the above interpretation as a “minimum velocity learning objective”, a term coined in the unpublished work of Movellan (2007). Movellan (2007) focused on the specific case of score matching; in contrast, our formulation generalizes theirs by lifting the concept of velocity from data space to distribution space, and thus applies to different objectives as the choice of distribution space varies. For example, our method applies to score matching and Riemannian score matching when we choose the 2-Wasserstein space, and to KSD when we choose the Wasserstein space of Liu (2017); we can also derive instances of the minimum velocity learning objective when the distribution-space gradient flow corresponds to less well-studied samplers, such as those of Zhang et al. (2018) and Lu et al. (2019). Another gap we fill is the development of a practically applicable algorithm, which we will discuss shortly.
Our algorithm is connected to previous work using CD-1 to estimate the gradient of certain objectives (Hyvarinen, 2007; Movellan, 2007; Liu and Wang, 2017); however, there are important differences. From a theoretical perspective, we provide a unified derivation for all such objectives, including those not considered in previous work; our gradient-flow-based derivation is also simpler, and leads to an improved understanding of this approach. From an algorithmic perspective, we directly approximate the objective function instead of its gradient, enabling the use of regularization techniques such as early stopping. More importantly, we identify an
infinite-variance problem
in the approximate score matching objective, which has previously rendered the approximation impractical (Hyvarinen, 2007; Saremi et al., 2018); we further present a simple fix. As a side product of our work, our fix also applies to denoising score matching (Raphan and Simoncelli, 2011; Vincent, 2011), another score matching approximation that suffers from this problem.
One important application of our method is in learning unnormalized models on manifolds, as our method leads to a scalable approximation of the Riemannian score matching objective. Density estimation on manifolds is needed in areas such as image analysis (Srivastava et al., 2007), geology (Davis and Sampson, 1986) and bioinformatics (Boomsma et al., 2008). Moreover, our approximation leads to flexible inference schemes for variational and Wasserstein autoencoders with manifold-valued latent variables, as it enables gradient estimation for implicit variational distributions on manifolds. Autoencoders with a manifold-valued latent space can better capture the distribution of certain types of data. For example, a hyperbolic latent space could be more suitable when the data has a hierarchical structure (Mathieu et al., 2019; Ovinnikov, 2019), and a hyperspherical prior could be more suitable for directional data (Davidson et al., 2018). As we shall see in experiments, our method improves the performance of manifold-latent VAEs and WAEs.
The rest of this paper is organized as follows: Section 2 reviews the preliminary knowledge: manifolds, gradient flows and their connection to common sampling algorithms. We present our method in Section 3 and its applications in Section 4. Section 5 contains a review of the related work, and Section 6 contains experiments. We provide our conclusions in Section 7.
2 Preliminaries
2.1 Manifolds, Flows and the 2-Wasserstein Space
We recall concepts from the theory of differentiable manifolds that will be needed below.
A (differential) manifold is a topological space locally diffeomorphic to a Euclidean or Hilbert space. A manifold is covered by a set of charts, which enable the use of coordinates locally and specify a basis of the local tangent space. A Riemannian manifold further possesses a Riemannian structure, which assigns to each tangent space an inner product. The Riemannian structure can be described using coordinates w.r.t. local charts.
The manifold structure enables us to differentiate a function along curves. Specifically, consider a curve on the manifold and a smooth real-valued function. At each point, a tangent vector describes the velocity of the curve passing through that point; the differential of the function at the point is the linear map from the tangent space to the reals that sends each tangent vector to the derivative of the function along any curve with that velocity. A tangent vector field assigns to each point a tangent vector. It determines a flow: a set of curves which all have the field as their velocity. On Riemannian manifolds, the gradient of a smooth function is the tangent vector field whose inner product with any tangent vector equals the differential applied to that vector. It determines the gradient flow, which generalizes the Euclidean-space notion $\dot{x}(t) = -\nabla f(x(t))$.
We will work with two types of manifolds: the data space, when we apply our method to manifold-valued data, and the space of probability distributions over the data space. On the space of distributions, we are mostly interested in the 2-Wasserstein space, a Riemannian manifold. The following properties of the 2-Wasserstein space will be useful for our purposes (Villani, 2008):
- The tangent space at a distribution $q$ can be identified as a subspace of the space of vector fields on the data space; the Riemannian metric is defined as
(1) $\langle u, v \rangle_{T_q} := \mathbb{E}_{x \sim q}\, \langle u(x), v(x) \rangle_x$
for all tangent vectors $u, v$; the inner product on the right-hand side above is determined by the Riemannian structure of the data space.
- The gradient of the KL divergence functional $q \mapsto \mathrm{KL}(q \,\|\, p)$ in the 2-Wasserstein space is
(2) $\operatorname{grad}\, \mathrm{KL}(q \,\|\, p) = \nabla \log \frac{q}{p}.$
We will also consider a few other spaces of distributions, including the Wasserstein-Fisher-Rao space (Lu et al., 2019) and the Wasserstein space introduced in Liu (2017).
On the data space, we need to introduce the notion of density, i.e. the Radon-Nikodym derivative w.r.t. a suitable base measure. The Hausdorff measure is one such choice; it reduces to the Lebesgue measure when the data space is Euclidean. In most cases, distributions on manifolds are specified using their density w.r.t. the Hausdorff measure; e.g. “uniform” distributions have constant densities in this sense.
Finally, the data space will be embedded in a Euclidean space. We refer to real-valued functions on the space of distributions as functionals. We adopt the Einstein summation convention, omitting the summation symbol when an index appears both as subscript and superscript on one side of an equation.
2.2 Posterior Sampling by Simulation of Gradient Flows
Now we review the sampling algorithms considered in this work. They include diffusion-based MCMC, particle-based variational inference, and other stochastic interacting particle systems.
Riemannian Langevin Dynamics
Suppose our target distribution has density w.r.t. the Hausdorff measure of the data manifold. In a local chart, consider the coordinate matrix of its Riemannian metric. Then Riemannian Langevin dynamics corresponds to the following stochastic differential equation in the chart (our definition (3) differs from those in some works, e.g. Ma et al. (2015), because we define the density w.r.t. the Hausdorff measure of the manifold, while they use the Lebesgue measure; see also Xifara et al. (2014); Hsu (2008)):
(3) 
where
(4) 
and the coefficients above are the coordinates of the inverse metric matrix. It is known (Villani, 2008) that Riemannian Langevin dynamics is the gradient flow of the KL functional in the 2-Wasserstein space.
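For reference, the Euclidean special case of this dynamics (flat metric, where the Hausdorff and Lebesgue measures coincide) is the familiar Langevin diffusion, discretized with the Euler-Maruyama scheme. Below is a minimal NumPy sketch; the standard-normal target, step size, and particle count are illustrative choices of ours, not settings from the paper.

```python
import numpy as np

def langevin_step(x, grad_log_p, eps, rng):
    """One Euler-Maruyama step of Euclidean Langevin dynamics:
    dX_t = grad log p(X_t) dt + sqrt(2) dW_t, with step size eps."""
    noise = rng.standard_normal(x.shape)
    return x + eps * grad_log_p(x) + np.sqrt(2.0 * eps) * noise

# Illustration: sample a standard normal, whose score is grad log p(x) = -x.
rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((2000, 2))   # start far from the target scale
for _ in range(2000):
    x = langevin_step(x, lambda y: -y, 1e-2, rng)
# After many steps the particle variance should be close to the target's unit variance.
```

Iterating this step drives the particle distribution toward the target; it is exactly this single-step behavior that the approximation in Section 3 exploits.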
Particle-based Samplers
A range of samplers approximate the gradient flow of the KL functional in various spaces, using deterministic or stochastic interacting particle systems. (There are also particle-based samplers corresponding to accelerated gradient flows (Liu et al., 2019b, a; Taghvaei and Mehta, 2019); however, as we will be interested in the initial velocity of the flow, they do not lead to new MVL objectives.) For instance, Stein variational gradient descent (SVGD; Liu and Wang, 2016) simulates the gradient flow in the so-called Wasserstein space (Liu, 2017), which replaces the Riemannian structure in the 2-Wasserstein space with the RKHS inner product. Birth-death accelerated Langevin dynamics (Lu et al., 2019) is a stochastic interacting particle system that simulates the gradient flow of the KL functional in the Wasserstein-Fisher-Rao space. Finally, the stochastic particle-optimization sampler (SPOS; Zhang et al., 2018; Chen et al., 2018) combines the dynamics of SVGD and Langevin dynamics; as we will show in Appendix B.1, SPOS also has a gradient flow structure.
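To make the SVGD dynamics concrete, one update with an RBF kernel can be sketched as follows; the kernel bandwidth, target, and initialization are illustrative choices of ours.

```python
import numpy as np

def svgd_update(x, grad_log_p, eps, h=0.5):
    """One SVGD step with an RBF kernel k(a,b) = exp(-||a-b||^2 / (2h)):
    x_i <- x_i + eps * mean_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]            # diff[j, i] = x_j - x_i
    k = np.exp(-(diff ** 2).sum(-1) / (2 * h))      # k[j, i] = k(x_j, x_i)
    drive = k.T @ grad_log_p(x) / n                 # attraction toward high density
    repulse = -(k[:, :, None] * diff).sum(0) / (h * n)   # repulsion between particles
    return x + eps * (drive + repulse)

# Illustration: transport particles initialized around 5 toward a N(0, 1) target.
rng = np.random.default_rng(1)
x = rng.standard_normal((50, 1)) + 5.0
for _ in range(1000):
    x = svgd_update(x, lambda y: -y, 0.05)
```

The drive term moves each particle toward regions of high target density, while the repulsion term keeps the particle set spread out.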
3 Wasserstein Minimum Velocity Learning
In this section, we present our framework, which concerns all learning objectives of the following form:
(5) $\mathcal{L}(\theta) := -\frac{1}{2} \left.\frac{d}{dt}\right|_{t=0} \mathrm{KL}(q_t \,\|\, p_\theta), \qquad q_0 = p_{\mathrm{data}},$
where $(q_t)$ is defined as the gradient flow of $\mathrm{KL}(\cdot \,\|\, p_\theta)$ in a suitable space of probability measures (e.g. the 2-Wasserstein space). We refer to any such objective as a “minimum velocity learning (MVL) objective”; as we shall see below, $\mathcal{L}(\theta)$ equals, up to a constant factor, the squared initial velocity of the gradient flow $(q_t)$ in the corresponding distribution space.
In the following subsections, we will first set up the problem, and motivate the use of (5) by connecting it to score matching; then we present our approximation to (5) and its variance-reduced version; we also address the infinite-variance issue in two previous approximators of the score matching objective. Finally, we briefly discuss other instances of the MVL objective to which our method can be applied.
3.1 Score Matching and a Wasserstein Space View
Consider parameter estimation in the unnormalized model $p_\theta(x) = \tilde{p}_\theta(x) / Z_\theta$. Maximum likelihood estimation is intractable, due to the presence of the normalizing constant $Z_\theta$. Score matching circumvents this issue by minimizing the Fisher divergence
(6) $D_F(p_{\mathrm{data}} \,\|\, p_\theta) := \frac{1}{2}\, \mathbb{E}_{p_{\mathrm{data}}} \left[ \left\| \nabla_x \log p_{\mathrm{data}}(x) - \nabla_x \log p_\theta(x) \right\|^2 \right],$
which does not depend on the normalization constant. While (6) involves the unknown data score $\nabla_x \log p_{\mathrm{data}}$, Hyvärinen (2005) shows that it equals
(7) $\mathbb{E}_{p_{\mathrm{data}}} \left[ \frac{1}{2} \left\| \nabla_x \log p_\theta(x) \right\|^2 + \Delta_x \log p_\theta(x) \right]$
plus a constant independent of $\theta$, where $\Delta_x$ denotes the Laplacian. Thus we can estimate the Fisher divergence, at the cost of introducing a second-order derivative.
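To make (7) concrete, the following sketch estimates it by Monte Carlo for a toy model whose score and Laplacian are available in closed form (a standard normal; for a neural energy, both would instead come from automatic differentiation). The toy model and sample size are our illustrative choices.

```python
import numpy as np

def score_matching_loss(x, score, laplacian_log_p):
    """Monte Carlo estimate of Hyvarinen's objective (7):
    E[ 0.5 * ||grad_x log p_theta(x)||^2 + Laplacian_x log p_theta(x) ],
    which equals the Fisher divergence (6) up to a theta-independent constant."""
    return (0.5 * (score(x) ** 2).sum(-1) + laplacian_log_p(x)).mean()

# Toy model: standard normal in d dims, so score(x) = -x and Laplacian = -d.
rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal((200_000, d))   # data drawn from the model itself
loss = score_matching_loss(x, lambda y: -y, lambda y: np.full(len(y), -d))
# When model == data, (6) is zero, so (7) attains its minimum E[0.5 ||x||^2] - d = -d/2.
```

Note that the estimate only needs the model's score and Laplacian, never the normalizing constant.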
Unfortunately, optimization w.r.t. second-order derivatives is prohibitively expensive when the energy is parameterized by deep neural networks, and scalable approximations to the score matching objective must be developed. Our work starts by observing that
$D_F(p_{\mathrm{data}} \,\|\, p_\theta) = \frac{1}{2} \left\| \operatorname{grad}\, \mathrm{KL}(\cdot \,\|\, p_\theta) \big|_{p_{\mathrm{data}}} \right\|^2,$
where the gradient and norm are defined in the 2-Wasserstein space, and this manifold inherits its Riemannian metric from the data space. This follows directly from (1) and (2).
Now let $(q_t)$ be the gradient flow of $\mathrm{KL}(\cdot \,\|\, p_\theta)$ started at $q_0 = p_{\mathrm{data}}$. Then
(8) $-\frac{1}{2} \left.\frac{d}{dt}\right|_{t=0} \mathrm{KL}(q_t \,\|\, p_\theta) = \frac{1}{2} \left\| \operatorname{grad}\, \mathrm{KL}(\cdot \,\|\, p_\theta) \big|_{q_0} \right\|^2.$
Therefore, score matching is a special case of the MVL objective (5), when the space of distributions is chosen as the 2-Wasserstein space.
3.2 Approximating the MVL Objective
While the MVL objective has a closed-form expression, it usually involves second-order derivatives. In this subsection, we derive an efficient approximation scheme for the MVL objective. Our approximation only involves first-order derivatives, so it can be easily implemented using automatic differentiation software (e.g. TensorFlow).
First, observe that (8) holds regardless of the chosen space of distributions. After rearranging terms, we can transform the above into
(9) 
As the first term in (9) is independent of $\theta$, the MVL objective is always equivalent to the second term. We approximate the second term by simulating a modified gradient flow: let $q_\epsilon$ be the distribution obtained by running the sampler targeting the model for time $\epsilon$. Then
(10) 
(10) can be approximated by replacing the limit with a fixed step size $\epsilon$, and running the corresponding sampler for a single step starting from a minibatch of training data. The approximation becomes unbiased as $\epsilon \to 0$.
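For the Langevin case, the resulting naive estimator can be sketched as follows; the Gaussian model, Gaussian data, and step size are illustrative choices of ours, and this version deliberately omits the control variate of Section 3.2.1.

```python
import numpy as np

def mvl_naive(x, log_p_tilde, grad_log_p_tilde, eps, rng):
    """Finite-eps estimate: run one Langevin step targeting the model, then
    difference the unnormalized log density over the step and divide by eps."""
    noise = rng.standard_normal(x.shape)
    x_eps = x + eps * grad_log_p_tilde(x) + np.sqrt(2.0 * eps) * noise
    return (log_p_tilde(x_eps) - log_p_tilde(x)) / eps

# Toy setting: unnormalized N(0, 1) model, data drawn from N(0, 2^2).
rng = np.random.default_rng(0)
x = 2.0 * rng.standard_normal((200_000, 1))
est = mvl_naive(x, lambda y: -0.5 * (y ** 2).sum(-1), lambda y: -y, 1e-3, rng)
# By Taylor expansion, E[est] -> E[ ||grad log p~||^2 + Laplacian log p~ ]
#                             = E[x^2] - 1 = 3 here,
# but the per-sample variance blows up as eps -> 0 (see Section 3.2.1).
```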
3.2.1 A Control Variate
We have derived an estimator of (10) with vanishing bias. However, the estimator suffers from high variance when the sampler used in the MVL objective is an Itô diffusion. Fortunately, we can solve this problem with a control variate.
To illustrate the problem as well as our solution, suppose the sampler is Langevin dynamics, and (without loss of generality) we use a batch size of 1 in estimation. Our estimator is then the single-step finite difference of the unnormalized log density divided by $\epsilon$,
where the starting point is sampled from the training data and the endpoint is produced by one Langevin step. By a Taylor expansion (we need to expand to second order, since the increment is a discretization of an Itô diffusion), the estimator equals
(11) 
and as $\epsilon \to 0$, the variance of the noise term above grows without bound.
Now we can see the need for a control variate. In this Langevin dynamics example, the control variate removes the infinite-variance term; more generally, our control variate is always (a rescaling of) the inner product of the model score and the diffusion term in the sampler.
Wrapping up, our approximate MVL objective is calculated as follows:
1. Sample a minibatch of inputs from the training data.
2. Run a single step of the sampling algorithm targeting the model distribution, with step size $\epsilon$, starting from the minibatch; denote the resulting state as the perturbed batch.
3. Return the finite-difference estimate over the step, plus the control variate.
The approximation becomes unbiased as $\epsilon \to 0$, and has bounded variance regardless of $\epsilon$ (under mild assumptions controlling the growth of the energy, e.g. bounded by a polynomial, so that the residual term in (11) has bounded variance when averaged over the data).
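In code, the fix amounts to subtracting the rescaled inner product of the model score with the injected diffusion noise, a zero-mean quantity that cancels the exploding term. The sketch below compares the naive and corrected estimators on an illustrative Gaussian toy (model, data, and step size are our choices):

```python
import numpy as np

def mvl_estimates(x, log_p_tilde, grad_log_p_tilde, eps, rng):
    """Single-step MVL estimates, with and without the control variate."""
    g = grad_log_p_tilde(x)
    diffusion = np.sqrt(2.0 * eps) * rng.standard_normal(x.shape)
    x_eps = x + eps * g + diffusion             # one Langevin step
    naive = (log_p_tilde(x_eps) - log_p_tilde(x)) / eps
    cv = (g * diffusion).sum(-1) / eps          # zero mean, variance O(1/eps)
    return naive, naive - cv

rng = np.random.default_rng(0)
x = 2.0 * rng.standard_normal((100_000, 1))     # data ~ N(0, 2^2)
naive, fixed = mvl_estimates(
    x, lambda y: -0.5 * (y ** 2).sum(-1), lambda y: -y, 1e-4, rng)
# Both versions share the same limiting mean, but only the corrected
# version keeps O(1) variance as eps shrinks.
```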
3.3 On CD-1 and Denoising Score Matching: Pitfalls and Fixes
As a side product, we show that our variance analysis explains the pitfalls of two well-known approximations to the score matching objective: CD-1 (Hyvarinen, 2007) and denoising score matching (DSM; Vincent, 2011). Both approximations become unbiased as a step-size hyperparameter tends to 0, but neither matches the performance of exact score matching in practice, as witnessed in Hyvarinen (2007); Saremi et al. (2018); Song et al. (2019). We propose novel control variates for these approximators. As we will show in Section 6.1, the variance-reduced versions of the approximations have performance comparable to the exact score matching objective.
DSM
DSM considers the objective
(12) 
The first two terms inside the norm represent a noise-corrupted sample, and the last represents a “single-step denoising direction” (Raphan and Simoncelli, 2011). It can be proved that the optimal denoising direction recovers the score of the noise-corrupted data distribution (Raphan and Simoncelli, 2011; Vincent, 2011).
Consider the stochastic estimator of (12). We assume a batch size of 1, and denote the data sample and its noise-corrupted version consistently with Section 3.2.1. The estimator is then
As in Section 3.2.1, we can show by a Taylor expansion (see Appendix A) that
(13)  
(14) 
furthermore, the variance-reduced objective
(15) 
is unbiased with finite variance.
CD-1 with Langevin Dynamics
Proposed as an approximation to maximum likelihood estimation, the $k$-step contrastive divergence (CD-$k$) learning rule updates the model parameter with
(16) 
where the update is scaled by a learning rate, and the negative sample is obtained from the data sample by running $k$ steps of MCMC. (16) does not define a valid objective, since the negative sample also depends on $\theta$; however, Hyvarinen (2007) proved that when the step size tends to 0 and the sampler is Langevin dynamics, (16) recovers the gradient of the score matching objective.
Using the same derivation as in Section 3.2.1, we can see that as the step size of the sampler approaches 0 (and the update is rescaled appropriately), the gradient produced by CD also suffers from infinite variance, and this can be fixed using the same control variate.
However, the practical utility of CD-1 is still hindered by the fact that it does not correspond to a valid learning objective; consequently, it is impossible to monitor the training process of CD-1, or to introduce regularizations such as early stopping. (In practice, the energy difference between data and negative samples is often used to track the training process of CD. It is not a proper loss; we can see from (9) that it differs from the proper score matching (MVL) loss by a nonvanishing term.)
3.4 Instances of MVL Objectives
As the previous derivation is independent of the choice of distribution space, we can derive approximations to other learning objectives using samplers other than Langevin dynamics. An important example is the Riemannian score matching objective, which corresponds to Riemannian Langevin dynamics; we will discuss it in detail in Section 4.1. Another example arises when we choose the sampler as SVGD: in this case, we obtain an approximation to the kernelized Stein discrepancy, generalizing the derivation in Liu and Wang (2017). When the sampling algorithm is chosen as SPOS, the corresponding MVL objective is an interpolation between KSD and the Fisher divergence. See Appendix B.2 for derivations. Finally, the use of birth-death accelerated Langevin dynamics leads to a novel learning objective.
In terms of applications, our work focuses on learning neural energy-based models, and these objectives do not improve over score matching in this respect. However, these derivations are useful since they generalize previous discussions, and establish new connections between sampling algorithms and learning objectives. It is also possible that these approximate objectives could be useful in other scenarios, such as learning kernel exponential family models (Sriperumbudur et al., 2017), improving the training of GANs (Liu and Wang, 2017), or amortized variational inference methods (Ruiz and Titsias, 2019).
4 Applications
We now present applications of our work, including a scalable learning algorithm for unnormalized models on manifolds, as well as its application to learning implicit autoencoders with manifold-valued priors.
4.1 MVL on Riemannian Manifolds
Density estimation on manifolds is needed in many application areas. While it is natural to consider unnormalized models on manifolds, there has been a lack of scalable learning methods. Here we address this issue, by applying our method to obtain a scalable approximation to the Riemannian score matching objective (Mardia et al., 2016).
Given the data manifold, we define an unnormalized model on it by parameterizing the unnormalized log density w.r.t. the Hausdorff measure. The Riemannian score matching objective has the same form as (6), although the norm in (6) is now determined by the metric of the data manifold, and the base measure of the densities has changed.
It is easy to verify that the derivation in Section 3.1 still applies in the manifold case. Thus, the Riemannian score matching objective is a special case of the MVL objective, in which the distribution space is still chosen as the 2-Wasserstein space. The difference is that the space is now defined with the nontrivial data-space metric, and the gradient flow of the KL functional becomes the Riemannian Langevin dynamics (3). We can approximate the objective by taking a single step of Riemannian Langevin dynamics with a small step size $\epsilon$:
(17) 
In (17), we use the local coordinates of a sampled data point and the Riemannian metric, and the perturbed point is obtained by running one step of Riemannian Langevin dynamics (18) targeting the model. (While readers familiar with Riemannian Brownian motion may notice that (18) is only defined before the particle escapes the local chart, this is good enough for our purpose: we are only concerned with infinitesimal time, and the escape probability vanishes as $\epsilon \to 0$. See Appendix C.)
(18)  
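For an embedded hypersphere, a common extrinsic alternative to chart-based simulation is to project the drift and noise onto the tangent space and retract back by normalization. The sketch below implements this retraction-based approximation (our illustrative choice, not the chart-based update (18)) for a von Mises-Fisher-style target; the concentration and step size are also our choices.

```python
import numpy as np

def sphere_langevin_step(z, grad_log_p, eps, rng):
    """One projected/retracted Langevin step on the unit sphere S^{d-1}."""
    proj = lambda v: v - (v * z).sum(-1, keepdims=True) * z   # tangent projection at z
    step = eps * proj(grad_log_p(z)) \
        + np.sqrt(2.0 * eps) * proj(rng.standard_normal(z.shape))
    z_new = z + step
    return z_new / np.linalg.norm(z_new, axis=-1, keepdims=True)  # retract to sphere

# Target: unnormalized vMF-style density p~(z) = exp(kappa * <mu, z>) on S^2.
rng = np.random.default_rng(0)
kappa, mu = 5.0, np.array([0.0, 0.0, 1.0])
z = rng.standard_normal((500, 3))
z /= np.linalg.norm(z, axis=-1, keepdims=True)   # roughly uniform initialization
for _ in range(1000):
    z = sphere_langevin_step(z, lambda y: kappa * mu + 0.0 * y, 1e-2, rng)
```

By construction every iterate stays exactly on the sphere, and the particles concentrate around the mode direction mu.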
4.2 Learning Implicit AEs with Manifold Prior
Recently, there has been a surge of interest in autoencoding models with manifold-valued priors. In this section, we present a new training method for implicit autoencoders with manifold priors, based on the Riemannian score matching approximation above.
Formally, autoencoders model the observed data by marginalizing out a latent code variable. To enable tractable learning, they define an additional “encoder” distribution. We will consider two types of autoencoders:

- VAEs with implicit encoders, which maximize the evidence lower bound. The encoder is a reparameterized implicit distribution: for a fixed input, it is defined as the pushforward measure of a simple distribution by a DNN that takes the input and the noise as arguments.
- Wasserstein autoencoders (WAEs), which minimize the 1-Wasserstein distance between the model and data distributions by minimizing a user-specified reconstruction error plus a divergence between the aggregated posterior and the prior, weighted by a hyperparameter, where the decoder is deterministic and the divergence is arbitrary. We use the exclusive KL divergence as the divergence term.
Both objectives are intractable, as they include the entropy of a latent-space distribution with intractable density: the conditional entropy of the encoder for VAEs, and the entropy of the aggregated posterior for WAEs. However, it is known that to obtain the gradient of such an entropy term, it suffices to estimate the score function of the corresponding distribution. Specifically, let $q_\phi$ be the pushforward of a simple distribution $p_\epsilon$ by a network $f_\phi$, i.e. $z = f_\phi(\epsilon)$. Then we have
(19) $\nabla_\phi H(q_\phi) = -\mathbb{E}_{\epsilon \sim p_\epsilon}\left[ \left( \nabla_z \log q_\phi(z) \big|_{z = f_\phi(\epsilon)} \right)^{\top} \partial_\phi f_\phi(\epsilon) \right].$
Score estimation can be done by fitting an unnormalized model to the latent-space distribution, and replacing the score in (19) with the learned model's score. (For VAEs, we fit a conditional unnormalized model to approximate the conditional entropy.)
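The identity behind (19) — that the entropy gradient of a reparameterized distribution equals minus the expected product of the score and the parameter-Jacobian of the sampler — is easy to check on a toy scale family where both sides are closed-form; the sketch below is an illustrative example of ours.

```python
import numpy as np

# Reparameterized family: z = phi * eps with eps ~ N(0, 1), so q_phi = N(0, phi^2).
#   Entropy:  H(q_phi) = log(phi) + const   =>  dH/dphi = 1/phi
#   Score:    d/dz log q_phi(z) = -z / phi^2
rng = np.random.default_rng(0)
phi = 2.0
e = rng.standard_normal(1_000_000)
z = phi * e
score = -z / phi ** 2
grad_H = -np.mean(score * e)     # e = dz/dphi; Monte Carlo estimate of dH/dphi
# grad_H should be close to 1 / phi = 0.5.
```

In the autoencoder setting, the analytic score above is replaced by the score of a learned unnormalized model.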
A variant of this idea is explored in Song et al. (2019), and outperforms existing learning algorithms for implicit AEs. As argued by Shi et al. (2018) and Li and Turner (2018), this method is advantageous as it directly estimates the score function of the latent-space distribution, instead of obtaining gradients from density (ratio) estimates; the latter could introduce arbitrary variations in the gradient estimate.
When the latent variables are defined on an embedded manifold (e.g. a hypersphere), we can no longer use Euclidean score estimators to approximate the learning objective, as the entropy of the latent-space distribution w.r.t. the Lebesgue measure is usually undefined. However, we can still approximate the objective by doing score estimation inside the manifold: take densities w.r.t. the Hausdorff measure, and consider the corresponding relative entropy functional. Then (19) still holds; see Appendix D. We can estimate the score function in (19) with an unnormalized model on the manifold, learned with the objective (17).
Wrapping up, we obtain an efficient algorithm to train autoencoders with a manifold-valued prior.
5 Related Work
Our work concerns scalable learning algorithms for unnormalized models. This is a long-standing problem in the literature, and some of the previous work is discussed in Section 1. Other notable work includes noise contrastive estimation (Gutmann and Hyvärinen, 2010) and Parzen score matching (Raphan and Simoncelli, 2011); however, to our knowledge, these have not been applied to complex unnormalized models parameterized by DNNs.
Apart from the MVL formulation used in this work, there exists other work on connections between learning objectives of unnormalized models and infinitesimal actions of sampling dynamics (or other processes):

- The minimum probability flow framework (Sohl-Dickstein et al., 2011) studies a slightly different objective, defined through the trajectory of the sampler. It recovers score matching as a special instance, and leads to a tractable learning objective for discrete models.
- Lyu (2009) observes a different connection between score matching and the (time derivative of the) KL divergence, relating the score matching objective to the derivative of the KL divergence between noise-perturbed versions of the data and model distributions, obtained by running Brownian motion starting from each.
As those formulations have different motivations compared with ours, they do not lead to scalable learning objectives for continuous models.
6 Evaluation
6.1 Synthetic Experiments
To demonstrate that the proposed estimators have small bias and variance, we first evaluate them on low-dimensional synthetic data. We also verify that our control variates in Section 3.3 improve the performance of CD-1 and DSM.
6.1.1 Approximations to Score Matching
In this section, we evaluate our MVL approximation to the Euclidean score matching objective (7), as well as the variance-reduced DSM objective. An experiment evaluating the variance-reduced CD-1 objective is presented in Appendix E.1.2.
We evaluate the bias and variance of our estimators by comparing them to sliced score matching (SSM), an unbiased estimator of (7). We choose the data distribution as the 2D banana dataset from Wenliang et al. (2018), and the model distribution as an EBM trained on that dataset. We estimate the squared bias with a stochastic upper bound; see Appendix E.1.1 for details.
The results are shown in Figure 1. We can see that for both estimators, the bias is negligible at small step sizes. We further use a z-test to compare the mean of the two estimators with the mean of SSM; in both cases, the p-value indicates no significant difference. The variance of the estimators, with and without our control variate, is shown in Figure 1 (right). As expected, the variance grows unbounded in the absence of the control variate, and is approximately constant when it is added. From the scale of the variance, we can see that it is exactly this variance problem that causes the failure of the original DSM estimator.
6.1.2 Density Estimation on Manifolds
We now evaluate our approximation to the Riemannian score matching objective, by learning neural energy-based models on spheres. The target distributions are mixtures of von Mises-Fisher distributions. In Figure 2, we plot the log densities of the ground truth distribution as well as the learned model. We can see that the two functions match closely, suggesting our method is suitable for density estimation on manifolds. Results on the second manifold are similar and are presented in Appendix E.1.3; detailed setups are deferred to Appendix E.1.1.
6.2 Implicit AEs with Manifold Prior
We now apply our method to train implicit autoencoding models with manifold-valued priors. Experiment setups mainly follow Song et al. (2019); see Appendix E.2.
Note that there is an important difference from Song et al. (2019) in our implementation: for (conditional) score estimation, we parameterize a scalar energy function and use its gradient as the score estimate, while Song et al. (2019) directly parameterize a vector-valued network. Since directly using a feed-forward network for the energy does not work well in practice, we parameterize the energy function in terms of a vector-valued network, parameterized in the same way as in Song et al. (2019). This can be seen as correcting an initial score approximation to make it conservative. In addition to being conceptually desirable (as score functions are conservative fields), this approach leads to significant improvements in the WAE experiments.
6.2.1 Implicit VAEs
We apply our method to train hyperspherical VAEs (Davidson et al., 2018) with implicit encoders on the MNIST dataset. Our encoder and decoder architectures follow Song et al. (2019), with the exception that we normalize the latent code so it lies on a hypersphere.
We consider two latent dimensionalities. Baseline methods include hyperspherical VAEs with explicit encoders and Euclidean VAEs. We report the test log likelihood estimated with annealed importance sampling (Wu et al., 2016; Neal, 2001), as well as its standard deviation across 10 runs.
        Euc.    Sph.    Euc.    Sph.
Exp.    96.45   95.47   90.28   91.32
Imp.    95.84   94.72   90.33   88.81
The results are summarized in Table 1. We can see that the implicit hyperspherical VAE trained with our method outperforms all other baselines. Interestingly, the explicit hyperspherical VAE could not match the performance of Euclidean VAE in higher dimensions. This is also observed in Davidson et al. (2018), who (incorrectly) conjectured that the hyperspherical prior is unsuitable in higher dimensions. From our results, we can see that the problem actually lies in the flexibility of variational posteriors. Our method thus unleashes the potential of VAEs with manifoldvalued priors, and might lead to improvements in downstream tasks.
6.2.2 Hyperspherical WAEs
We first evaluate our method on MNIST. We use the uniform distribution as the prior, and choose cross-entropy as the reconstruction error. We use the encoder and decoder architectures of Song et al. (2019); the architecture of the energy network is also similar to theirs. We report the Fréchet Inception Distance (FID; Heusel et al., 2017).
As the choice of divergence measure in the WAE objective is arbitrary, there are several methods to train WAEs with a manifold latent space: using the Jensen-Shannon divergence approximated with a GAN-like discriminator (WAE-GAN), or using the maximum mean discrepancy (WAE-MMD). We choose WAE-GAN as the baseline method, as it outperforms WAE-MMD in Tolstikhin et al. (2017). To demonstrate the utility of hyperspherical priors, we also compare with models using normal priors.
Method     Euc.    Sph.
WAE-GAN    24.59   19.81
Ours       23.80   18.36
The FID scores are reported in Table 2. We can see that the hyperspherical prior leads to better sample quality than the Euclidean prior, and that our method improves the training of WAEs.
To demonstrate that our method scales to higher dimensions, we also train hyperspherical WAEs on CIFAR-10 and CelebA, with larger latent dimensionality. We find that our method is comparable or superior to WAE-GAN and WAE-MMD; see Appendix E.2.1.
7 Conclusion
We present a scalable approximation to a general family of learning objectives for unnormalized models, based on a new connection between these objectives and gradient flows. Our method can be applied to manifold density estimation and training implicit autoencoders with manifold priors.
Acknowledgement
J.Z. is the corresponding author. We thank Chang Liu and Jiaxin Shi for comments. This work was supported by the National Key Research and Development Program of China (No. 2017YFA0700904), NSFC Projects (Nos. 61620106010, U1811461), Beijing NSF Project (No. L172037), Beijing Academy of Artificial Intelligence (BAAI), a grant from Tsinghua Institute for Guo Qiang, and the NVIDIA NVAIL Program with GPU/DGX Acceleration.
References
 Barp et al. (2019) Barp, A., Briol, F.-X., Duncan, A. B., Girolami, M., and Mackey, L. (2019). Minimum Stein discrepancy estimators.
 Boomsma et al. (2008) Boomsma, W., Mardia, K. V., Taylor, C. C., Ferkinghoff-Borg, J., Krogh, A., and Hamelryck, T. (2008). A generative, probabilistic model of local protein structure. Proceedings of the National Academy of Sciences, 105(26):8932–8937.
 Byrne and Girolami (2013) Byrne, S. and Girolami, M. (2013). Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845.
 Chen et al. (2018) Chen, C., Zhang, R., Wang, W., Li, B., and Chen, L. (2018). A unified particle-optimization framework for scalable Bayesian sampling. arXiv preprint arXiv:1805.11659.
 Davidson et al. (2018) Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. (2018). Hyperspherical variational autoencoders. arXiv preprint arXiv:1804.00891.
 Davis and Sampson (1986) Davis, J. C. and Sampson, R. J. (1986). Statistics and data analysis in geology, volume 646. Wiley New York et al.
 Federer (2014) Federer, H. (2014). Geometric measure theory. Springer.
 Gorham et al. (2019) Gorham, J., Duncan, A. B., Vollmer, S., and Mackey, L. (2019). Measuring sample quality with diffusions. Annals of Applied Probability.
 Gutmann and Hyvärinen (2010) Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304.
 Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637.
 Hsu (2002) Hsu, E. P. (2002). Stochastic analysis on manifolds, volume 38. American Mathematical Soc.
 Hsu (2008) Hsu, E. P. (2008). A brief introduction to brownian motion on a riemannian manifold. lecture notes.
 Hutchinson (1990) Hutchinson, M. F. (1990). A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2):433–450.

 Hyvärinen (2005) Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709.
 Hyvarinen (2007) Hyvarinen, A. (2007). Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18(5):1529–1531.
 Li and Turner (2018) Li, Y. and Turner, R. E. (2018). Gradient estimators for implicit models. In International Conference on Learning Representations.
 Liu et al. (2016a) Liu, C., Zhu, J., and Song, Y. (2016a). Stochastic gradient geodesic mcmc methods. In Advances in Neural Information Processing Systems, pages 3009–3017.
 Liu et al. (2019a) Liu, C., Zhuo, J., Cheng, P., Zhang, R., Zhu, J., and Carin, L. (2019a). Understanding and accelerating particlebased variational inference. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4082–4092, Long Beach, California USA. PMLR.
 Liu et al. (2019b) Liu, C., Zhuo, J., and Zhu, J. (2019b). Understanding mcmc dynamics as flows on the wasserstein space. arXiv preprint arXiv:1902.00282.
 Liu (2017) Liu, Q. (2017). Stein variational gradient descent as gradient flow. In Advances in neural information processing systems, pages 3115–3123.
 Liu et al. (2016b) Liu, Q., Lee, J., and Jordan, M. (2016b). A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, pages 276–284.

 Liu and Wang (2016) Liu, Q. and Wang, D. (2016). Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, pages 2378–2386.
 Liu and Wang (2017) Liu, Q. and Wang, D. (2017). Learning deep energy models: Contrastive divergence vs. amortized mle. arXiv preprint arXiv:1707.00797.
 Lu et al. (2019) Lu, Y., Lu, J., and Nolen, J. (2019). Accelerating Langevin Sampling with Birth-death. arXiv e-prints, page arXiv:1905.09863.
 Lyu (2009) Lyu, S. (2009). Interpretation and generalization of score matching. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 359–366. AUAI Press.
 Ma et al. (2015) Ma, Y.A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient mcmc. In Advances in Neural Information Processing Systems, pages 2917–2925.
 Mardia et al. (2016) Mardia, K. V., Kent, J. T., and Laha, A. K. (2016). Score matching estimators for directional distributions. arXiv preprint arXiv:1604.08470.
 Mathieu et al. (2019) Mathieu, E., Lan, C. L., Maddison, C. J., Tomioka, R., and Teh, Y. W. (2019). Continuous hierarchical representations with poincaré variational autoencoders. In Advances in neural information processing systems.
 Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. In International Conference on Learning Representations.
 Movellan (2007) Movellan, J. R. (2007). A minimum velocity approach to learning. unpublished.
 Neal (2001) Neal, R. M. (2001). Annealed importance sampling. Statistics and computing, 11(2):125–139.
 Otto (2001) Otto, F. (2001). The geometry of dissipative evolution equations: the porous medium equation.
 Ovinnikov (2019) Ovinnikov, I. (2019). Poincaré wasserstein autoencoder. arXiv preprint arXiv:1901.01427.
 Ramachandran et al. (2017) Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941.
 Raphan and Simoncelli (2011) Raphan, M. and Simoncelli, E. P. (2011). Least squares estimation without priors or supervision. Neural computation, 23(2):374–420.
 Ruiz and Titsias (2019) Ruiz, F. J. and Titsias, M. K. (2019). A contrastive divergence for combining variational inference and mcmc. arXiv preprint arXiv:1905.04062.
 Saremi et al. (2018) Saremi, S., Mehrjou, A., Schölkopf, B., and Hyvärinen, A. (2018). Deep energy estimator networks. arXiv preprint arXiv:1805.08306.
 Shi et al. (2018) Shi, J., Sun, S., and Zhu, J. (2018). A spectral approach to gradient estimation for implicit distributions. In Proceedings of the 35th International Conference on Machine Learning, pages 4651–4660.
 Sohl-Dickstein et al. (2011) Sohl-Dickstein, J., Battaglino, P., and DeWeese, M. R. (2011). Minimum probability flow learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 905–912. Omnipress.
 Song et al. (2019) Song, Y., Garg, S., Shi, J., and Ermon, S. (2019). Sliced score matching: A scalable approach to density and score estimation. arXiv preprint arXiv:1905.07088.
 Sriperumbudur et al. (2017) Sriperumbudur, B., Fukumizu, K., Gretton, A., Hyvärinen, A., and Kumar, R. (2017). Density estimation in infinite dimensional exponential families. The Journal of Machine Learning Research, 18(1):1830–1888.

 Srivastava et al. (2007) Srivastava, A., Jermyn, I., and Joshi, S. (2007). Riemannian analysis of probability density functions with applications in vision. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE.
 Taghvaei and Mehta (2019) Taghvaei, A. and Mehta, P. (2019). Accelerated flow for probability distributions. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6076–6085, Long Beach, California, USA. PMLR.
 Tolstikhin et al. (2017) Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. (2017). Wasserstein autoencoders. arXiv preprint arXiv:1711.01558.
 Villani (2008) Villani, C. (2008). Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg.

 Vincent (2011) Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674.
 Wenliang et al. (2018) Wenliang, L., Sutherland, D., Strathmann, H., and Gretton, A. (2018). Learning deep kernels for exponential family densities. arXiv preprint arXiv:1811.08357.
 Wenliang et al. (2019) Wenliang, L., Sutherland, D., Strathmann, H., and Gretton, A. (2019). Learning deep kernels for exponential family densities. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6737–6746, Long Beach, California, USA. PMLR.
 Wu et al. (2016) Wu, Y., Burda, Y., Salakhutdinov, R., and Grosse, R. (2016). On the quantitative analysis of decoderbased generative models. arXiv preprint arXiv:1611.04273.
 Xifara et al. (2014) Xifara, T., Sherlock, C., Livingstone, S., Byrne, S., and Girolami, M. (2014). Langevin diffusions and the metropolis-adjusted langevin algorithm. Statistics & Probability Letters, 91:14–19.
 Zhang et al. (2018) Zhang, J., Zhang, R., and Chen, C. (2018). Stochastic particle-optimization sampling and the non-asymptotic convergence theory. arXiv preprint arXiv:1809.01293.
Supplementary Material
Appendix A Derivation of (13)–(15)
Denote .
(20)  
(21)  
(22)  
(23) 
Notice
which is known as Hutchinson's trick (Hutchinson, 1990), so is two times the Fisher divergence . But , so as , the rescaled estimator becomes unbiased with infinite variance; subtracting (B) from (A) results in a finite-variance estimator.
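Hutchinson's trick replaces an exact trace with a Monte Carlo average of quadratic forms over random probe vectors. A minimal NumPy sketch (the dense matrix `A` is only a stand-in for the Hessian-vector products used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))  # stand-in for a Jacobian/Hessian

def hutchinson_trace(A, n_probes=200_000):
    """Estimate tr(A) as the average of v^T A v over Rademacher probes v."""
    d = A.shape[0]
    v = rng.choice([-1.0, 1.0], size=(n_probes, d))
    # Each probe contributes v^T A v; E[v v^T] = I makes the mean tr(A).
    return np.einsum('ni,ij,nj->n', v, A, v).mean()

print(hutchinson_trace(A), np.trace(A))  # the two values agree closely
```

The estimator is unbiased for any probe distribution with identity covariance; Rademacher probes are a common low-variance choice.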
Appendix B On SPOS and MVL
Notations
In this section, let the parameter space be dimensional, and define as the space of dimensional functions .
While in the main text we identified the tangent space of as a subspace of for clarity, here we use the equivalent definition following Otto (2001). The two definitions are connected by the transform for . Using the new definition, the differential of the KL divergence functional is then
B.1 SPOS as Gradient Flow
In this section, we give a formal derivation of SPOS as the gradient flow of the KL divergence functional, with respect to a new metric.
Recall that the SPOS sampler targeting distribution (with density ) corresponds to the following density evolution:
where
is a hyperparameter, and
is the SVGD update direction (Liu and Wang, 2016; Liu, 2017). Fix , define the integral operator
and define the tensor product operator
accordingly. Then the SVGD update direction satisfies (24)
which we will derive at the end of this subsection for completeness. Following (24) we have
(25) 
The rest of our derivation follows Otto (2001) and Liu (2017): consider the function space , where is any square-integrable and differentiable function. It connects to the tangent space of if we consider for any . Define on the inner product
(26) 
It then determines a Riemannian metric on the function space. For and , by (25) we have
i.e., with respect to the metric (26), SPOS is the gradient flow minimizing the KL divergence functional.
Derivation of (24)
Let be its eigendecomposition (i.e., the Mercer representation). For , let , where is the coordinate basis in , so that becomes an orthonormal basis of . We now compute the coordinates of in this basis.
(27) 
is known to satisfy Stein's identity
for all . Thus, we can subtract from the right-hand side of (27) without changing its value, and it becomes
As the equality holds for all , this completes the derivation of (24).
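The SVGD update direction of Liu and Wang (2016) used above can be written out concretely for an RBF kernel. A minimal sketch; the kernel choice and bandwidth `h` are illustrative assumptions, not the paper's:

```python
import numpy as np

def svgd_direction(x, score_fn, h=1.0):
    """SVGD direction: phi(x_i) = (1/n) sum_j [k(x_j, x_i) score(x_j)
    + grad_{x_j} k(x_j, x_i)], with an RBF kernel of bandwidth h."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]              # diff[i, j] = x_i - x_j
    k = np.exp(-(diff ** 2).sum(-1) / (2 * h ** 2))   # kernel Gram matrix
    drive = k @ score_fn(x)                           # attraction toward high density
    repulse = (k[:, :, None] * diff).sum(1) / h ** 2  # repulsion between particles
    return (drive + repulse) / n

# For a single particle targeting N(0, I) (score = -x), the repulsion term
# vanishes and the direction is exactly -x, so gradient ascent contracts
# the particle toward the mode.
x = np.array([[3.0, -2.0]])
for _ in range(200):
    x = x + 0.5 * svgd_direction(x, lambda p: -p)
```

With many particles the antisymmetric repulsion term cancels in the mean, so only the score term drives the population mean, matching the gradient-flow picture.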
B.2 MVL Objective Derived from SPOS
By (25) and (26), the MVL objective derived from SPOS is
On the right-hand side above, the first term in the sum is the Fisher divergence, and the second is the kernelized Stein discrepancy (Liu et al., 2016b, Definition 3.2).
We note that a similar result for SVGD was derived by Liu and Wang (2017), and our derivation connects to the observation that Langevin dynamics can be viewed as SVGD with a Dirac delta kernel (thus SPOS also corresponds to SVGD with generalized-function-valued kernels).
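The kernelized Stein discrepancy term can be estimated directly from samples via the Stein kernel of Liu et al. (2016b). A hedged NumPy sketch with an RBF base kernel; the bandwidth and the V-statistic (rather than U-statistic) form are illustrative choices:

```python
import numpy as np

def ksd_rbf(x, score_fn, h=1.0):
    """V-statistic KSD estimate with the RBF Stein kernel
    u(x, x') = s(x)^T s(x') k + s(x)^T grad_{x'} k
             + s(x')^T grad_x k + tr(grad_x grad_{x'} k)."""
    n, d = x.shape
    s = score_fn(x)                               # model scores at the samples
    diff = x[:, None, :] - x[None, :, :]          # diff[i, j] = x_i - x_j
    sq = (diff ** 2).sum(-1)
    k = np.exp(-sq / (2 * h ** 2))
    dot = (s @ s.T) * k                           # s(x)^T s(x') k
    cross = ((s[:, None, :] - s[None, :, :]) * diff).sum(-1) * k / h ** 2
    trace = k * (d / h ** 2 - sq / h ** 4)        # tr of the mixed derivative
    return (dot + cross + trace).mean()

rng = np.random.default_rng(0)
score = lambda z: -z                              # score of N(0, I)
good = ksd_rbf(rng.standard_normal((300, 2)), score)
bad = ksd_rbf(rng.standard_normal((300, 2)) + 2.0, score)
# KSD is near zero when the samples match the model and larger otherwise.
```

Because the Stein kernel is positive semidefinite, the V-statistic is nonnegative, and it grows when the sample distribution departs from the model.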
Appendix C Justification of the Use of Local Coordinates in (17)
In this section, we prove in Proposition C.1 that the local coordinate representation leads to a valid approximation of the MVL objective in the compact case. We also argue in Remark C.2 that the use of local coordinates does not lead to numerical instability.
Remark C.1.
While a result more general than Proposition C.1 is likely attainable (e.g., by replacing the compactness of with a quadratic-growth condition on the energy), this is beyond the scope of our work; for our purpose, it suffices to note that the proposition covers manifolds like , and that the local coordinate issue does not arise on manifolds possessing a global chart, such as .
Lemma C.1.
(Theorem 3.6.1 in Hsu (2002)) For any manifold , , and a normal neighborhood of , there exists a constant such that the first exit time from of the Riemannian Brownian motion starting from satisfies
for any .
Proposition C.1.
Proof.
By the tower property of conditional expectation, it suffices to prove the result when for some . Choose a normal neighborhood centered at such that is contained in our current chart and has distance from the boundary of the chart bounded by some . Let be defined as in Lemma C.1. Recall that the Riemannian LD is the sum of a drift term and the Riemannian BM. Since is compact and is in , the drift term in the SDE has norm bounded by some finite . Thus the first exit time of the Riemannian LD is greater than .
Let follow the true Riemannian LD when , and be such that afterwards. (Footnote 7: This is conceptually similar to the standard augmentation used in stochastic process texts; from an algorithmic perspective, it can be implemented by modifying the algorithm so that, in the very unlikely event that escapes the chart, we return as the corresponding energy. We note that this is unnecessary for manifolds like , since the charts can be extended to and hence .) By Hsu (2008), until , follows the local coordinate representation of the Riemannian LD (3); thus on the event , would correspond to in (18). As is compact, the continuous energy function is bounded by for some finite . Then, for sufficiently small ,
In the above, the first term converges to as , and when . This completes the proof. ∎
Remark C.2.
It has been argued that simulating diffusion-based MCMC in local coordinates leads to numerical instabilities (Byrne and Girolami, 2013; Liu et al., 2016a). We stress that in our setting of approximating MVL objectives, this is not the case: we only need a single step of MCMC, with an arbitrarily small step size. Therefore, we can use a different step size for each sample, based on the magnitude of and at their locations. We can also choose a different local chart for each sample, which is justified by the proposition above.
Appendix D Derivation of (19) in the Manifold Case
In this section we derive (19) for the case where the latent-space distribution is defined on a dimensional manifold embedded in some Euclidean space, and is the relative entropy w.r.t. the Hausdorff measure. The derivation is largely similar to the Euclidean case; we include it here only for completeness.
Appendix E Experiment Details and Additional Results
Code will be available at https://github.com/thuml/wmvl.
E.1 Synthetic Experiments
E.1.1 Experiment Details
Experiment Details in Section 6.1.1
The (squared) bias is estimated as follows: denote the SSM estimator and ours by and , respectively. One can verify that both methods estimate (7). Our estimate for the squared bias is then , where
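Since the formula above is not fully recoverable here, the following is only a generic sketch of the standard recipe for estimating squared bias from independent replications; the toy estimator of E[X^2] is a hypothetical stand-in for the SSM/MVL estimators, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimator(batch):
    # Hypothetical minibatch estimator of E[X^2] (stand-in for the
    # SSM / MVL estimators, which are not reproduced here).
    return (batch ** 2).mean()

true_value = 1.0  # E[X^2] for X ~ N(0, 1), known in closed form here
reps = [estimator(rng.standard_normal(64)) for _ in range(10_000)]
# Squared bias: square of the gap between the average estimate over
# many independent replications and the true value.
sq_bias = (np.mean(reps) - true_value) ** 2
```

For an unbiased estimator the quantity shrinks toward zero as the number of replications grows; a persistent gap indicates genuine bias.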