# A Wasserstein Minimum Velocity Approach to Learning Unnormalized Models

Score matching provides an effective approach to learning flexible unnormalized models, but its scalability is limited by the need to evaluate a second-order derivative. In this paper, we present a scalable approximation to a general family of learning objectives including score matching, by observing a new connection between these objectives and Wasserstein gradient flows. We present applications with promise in learning neural density estimators on manifolds, and training implicit variational and Wasserstein auto-encoders with a manifold-valued prior.

## Authors

• 25 publications
• 2 publications
• 1 publication
• 103 publications
• 92 publications
• ### Sliced Score Matching: A Scalable Approach to Density and Score Estimation

Score matching is a popular method for estimating unnormalized statistic...
05/17/2019 ∙ by Yang Song, et al. ∙ 6

• ### TPFA Finite Volume Approximation of Wasserstein Gradient Flows

Numerous infinite dimensional dynamical systems arising in different fie...
01/20/2020 ∙ by Andrea Natale, et al. ∙ 0

• ### On the Latent Space of Wasserstein Auto-Encoders

We study the role of latent space dimensionality in Wasserstein auto-enc...
02/11/2018 ∙ by Paul K. Rubenstein, et al. ∙ 0

• ### Approximate inference with Wasserstein gradient flows

We present a novel approximate inference method for diffusion processes,...
06/12/2018 ∙ by Charlie Frogner, et al. ∙ 0

• ### Generative Modeling with Denoising Auto-Encoders and Langevin Sampling

We study convergence of a generative modeling method that first estimate...
01/31/2020 ∙ by Adam Block, et al. ∙ 1

• ### Wasserstein Distributional Robustness and Regularization in Statistical Learning

A central question in statistical learning is to design algorithms that ...
12/17/2017 ∙ by Rui Gao, et al. ∙ 0

• ### Interpretation and Generalization of Score Matching

Score matching is a recently developed parameter learning method that is...
05/09/2012 ∙ by Siwei Lyu, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A flexible approach to density estimation is to parameterize an unnormalized density function, or energy function. In particular, unnormalized models with energy parameterized by deep neural networks have been successfully applied to density estimation

(Wenliang et al., 2019; Saremi et al., 2018) and learning implicit auto-encoding models (Song et al., 2019).

Parameter estimation for such unnormalized models is highly non-trivial: the maximum likelihood objective is intractable, due to the presence of a normalization term. Score matching (Hyvärinen, 2005) is a popular alternative, yet applying score matching to complex unnormalized models can be difficult, as the objective involves the second-order derivative of the energy, rendering gradient-based optimization infeasible. In practice, people turn to scalable approximations of the score matching objective (Song et al., 2019; Hyvarinen, 2007; Vincent, 2011; Raphan and Simoncelli, 2011), or other objectives such as the kernelized Stein discrepancy (KSD; Liu et al., 2016b; Liu and Wang, 2017). So far, approximations to these objectives are developed on a case-by-case basis, leaving important applications unaddressed; for example, there is a lack of scalable learning methods for unnormalized models on manifolds (Mardia et al., 2016).

In this work, we present a unifying perspective to this problem, and derive scalable approximations for a variety of learning objectives including score matching. We start by interpreting these objectives as the initial velocity

of certain distribution-space gradient flows, which are simulated by common samplers. This novel interpretation leads to a scalable approximation algorithm for all such objectives, reminiscent to single-step contrastive divergence (CD-1).

We refer to any objective with the above interpretation as above as a “minimum velocity learning objective”, a term coined in the unpublished work (Movellan, 2007). Movellan (2007) focused on the specific case of score matching; in contrast, our formulation generalizes theirs by lifting the concept of velocity from data space to distribution space, thus applies to different objectives as the choice of distribution space varies. For example, our method applies to score matching and Riemannian score matching when we choose the 2-Wasserstein space, and to KSD when we choose the -Wasserstein space (Liu, 2017); we can also derive instances of the minimum velocity learning objective when the distribution-space gradient flow corresponds to less well-studied samplers, such as (Zhang et al., 2018; Lu et al., 2019). Another gap we fill in is the development of a practically applicable algorithm, which we will discuss shortly.

Our algorithm is connected to previous work using CD-1 to estimate the gradient of certain objectives (Hyvarinen, 2007; Movellan, 2007; Liu and Wang, 2017); however, there are important differences. From a theoretical perspective, we provide a unified derivation for all such objectives, including those not considered in previous work; our gradient-flow-based derivation is also simpler, and leads to an improved understanding of this approach. From an algorithmic perspective, we directly approximate the objective function instead of its gradient, enabling the use of regularization like early-stopping. More importantly, we identify an

infinite-variance problem

in the approximate score matching objective, which has previously rendered the approximation impractical (Hyvarinen, 2007; Saremi et al., 2018); we further present a simple fix. As a side product of our work, our fix also applies to denoising score matching (Raphan and Simoncelli, 2011; Vincent, 2011), another score matching approximation that suffers from this problem.

One important application of our method is in learning unnormalized models on manifolds, as our method leads to a scalable approximation for the Riemannian score matching objective. Density estimation on manifolds is needed in areas such as image analysis (Srivastava et al., 2007), geology (Davis and Sampson, 1986) and bioinformatics (Boomsma et al., 2008). Moreover, our approximation leads to flexible inference schemes for variational and Wasserstein auto-encoders with manifold-valued latent variables, as it enables gradient estimation for implicit variational distributions on manifolds. Auto-encoders with a manifold-valued latent space can capture the distribution of certain types of data better. For example, a hyperbolic latent space could be more suitable when the data has a hierarchical structure (Mathieu et al., 2019; Ovinnikov, 2019), and a hyper-spherical prior could be more suitable for directional data (Davidson et al., 2018). As we shall see in experiments, our method improves the performance of manifold-latent VAEs and WAEs.

The rest of this paper is organized as follows: Section 2 reviews the preliminary knowledge: manifolds, gradient flows and their connection to common sampling algorithms. We present our method in Section 3 and its applications in Section 4. Section 5 contains a review of the related work, and Section 6 contains experiments. We provide our conclusions in Section 7.

## 2 Preliminaries

### 2.1 Manifolds, Flows and the 2-Wasserstein Space

We recall concepts from differential manifolds that will be needed below.

A (differential) manifold is a topological space locally diffeomorphic to an Euclidean or Hilbert space. A manifold is covered by a set of charts, which enables the use of coordinates locally, and specifies a set of basis in the local tangent space. A Riemannian manifold further possesses a Riemannian structure, which assigns to each tangent space an inner product structure. The Riemannian structure can be described using coordinates w.r.t. local charts.

The manifold structure enables us to differentiate a function along curves. Specifically, consider a curve , and a smooth function . At , a

tangent vector

describes the velocity of passing ; the differential of the function at , denoted as , is a linear map from to , such that for all

 (df)c(t0)(dcdt∣∣∣t0)=ddtf(c(t))∣∣∣t0.

A tangent vector field assigns to each a tangent vector . It determines a flow, a set of curves which all have as their velocity. On Riemannian manifolds, the gradient of a smooth function is a tangent vector field such that for all . It determines the gradient flow, which generalizes the Euclidean-space notion .

We will work with two types of manifolds: the data space

when we apply our method to manifold-valued data, and the space of probability distributions over

. On the space of distributions, we are mostly interested in the 2-Wasserstein space , a Riemannian manifold. The following properties of will be useful for our purposes (Villani, 2008):

1. [leftmargin=*]

2. Its tangent space can be identified as a subspace of the space of vector fields on ; the Riemannian metric of is defined as

 ⟨X,Y⟩p:=Ep(u)⟨X(u),Y(u)⟩u, (1)

for all ; the inner product on the right hand side above is determined by the Riemannian structure of .

3. The gradient of the KL divergence functional in is

We will also consider a few other spaces of distributions, including the Wasserstein-Fisher-Rao space (Lu et al., 2019), and the -Wasserstein space introduced in (Liu, 2017).

On the data space, we need to introduce the notion of density, i.e. the Radon–Nikodym derivative w.r.t. a suitable base measure. The Hausdorff measure is one such choice; it reduces to the Lebesgue measure when

. In most cases, distributions on manifolds are specified using their density w.r.t. the Hausdorff measure; e.g. “uniform” distributions has constant densities in this sense.

Finally, the data space will be embedded in ; we refer to real-valued functions on the space of distributions as functionals; we denote the functional as ; we adopt the Einstein summation convention, and omit the summation symbol when an index appears both as subscript and superscript on one side of an equation, e.g. .

### 2.2 Posterior Sampling by Simulation of Gradient Flows

Now we review the sampling algorithms considered in this work. They include diffusion-based MCMC, particle-based variational inference, and other stochastic interacting particle systems.

##### Riemannian Langevin Dynamics

Suppose our target distribution has density w.r.t. the Hausdorff measure of . In a local chart , let be the coordinate matrix of its Riemannian metric. Then the Riemannian Langevin dynamics corresponds to the following stochastic differential equation in the chart111 (3) differs from definitions in some works (e.g. Ma et al., 2015). This is because we define as the density w.r.t. the Hausdorff measure of , while they use the Lebesgue measure. See also (Xifara et al., 2014; Hsu, 2008). :

 dx=V(x)dt+√2G−1(x)dBt (3)

where

 Vi(x)= gij∂j(logp(x)−log|G(x)|2)+∂jgij, (4)

and is the coordinate of the matrix . It is known (Villani, 2008) that the Riemannian Langevin dynamics is the gradient flow of the KL functional in the 2-Wasserstein space .

##### Particle-based Samplers

A range of samplers approximate the gradient flow of in various spaces, using deterministic or stochastic interacting particle systems.222 There are other particle-based samplers (Liu et al., 2019b, a; Taghvaei and Mehta, 2019) corresponding to accelerated gradient flows. However, as we will be interested in the initial velocity of the flow, they do not lead to new MVL objectives. For instance, Stein variational gradient descent (SVGD; Liu and Wang, 2016) simulates the gradient flow in the so-called -Wasserstein space (Liu, 2017), which replaces the Riemannian structure in with the RKHS inner product. Birth-death accelerated Langevin dynamics (Lu et al., 2019) is a stochastic interacting particle system that simulates to the gradient flow of in the Wasserstein-Fisher-Rao space. Finally, the stochastic particle-optimization sampler (SPOS; Zhang et al., 2018; Chen et al., 2018) combines the dynamics of SVGD and Langevin dynamics; as we will show in Appendix B.1, SPOS also has a gradient flow structure.

## 3 Wasserstein Minimum Velocity Learning

In this section, we present our framework, which concerns all learning objectives of the following form:

 Lmvl(θ):=−ddtKL(pt∥qθ)∣∣∣t=0, (5)

where is defined as the gradient flow of in a suitable space of probability measures (e.g. the 2-Wasserstein space). We refer to any such objective as a “minimum velocity learning (MVL) objective”; as we shall see below, equals the initial velocity of the gradient flow , in the corresponding distribution space.

In the following subsections, we will first set up the problem, and motivate the use of (5) by connecting it to score matching; then we present our approximation to (5), and its variance-reduced version; we also address the infinite-variance issue in two previous approximators for the score matching objective. Finally, we briefly discuss other instances of the MVL objective that our method can be applied to.

### 3.1 Score Matching and a Wasserstein Space View

Consider parameter estimation in the unnormalized model . Maximum likelihood estimation is intractable, due to the presence of the normalizing constant . Score matching circumvents this issue by minimizing the Fisher divergence

 DF(p|q):= 12Ep(x)[∥∇xlogp(x)−∇xlogq(x;θ)∥2], (6)

which does not depend on the normalization constant. While (6) involves the unknown term, Hyvärinen (2005) shows that it equals

 Ep(x)[Δlogq(x;θ)+12∥∇logq(x;θ)∥2], (7)

plus a constant independent of . Thus we can estimate the Fisher divergence at the cost of introducing a second-order derivative.

Unfortunately, optimization w.r.t. second-order derivatives is prohibitively expensive when the energy is parameterized by deep neural networks, and scalable approximation to the score matching objective must be developed. Our work starts by observing

where the gradient and norm are defined in , and the manifold inherits the Riemannian metric from . This follows directly from (1)-(2).

Now let be the gradient flow of , i.e. . Then

Therefore, score matching is a special case of the MVL objective (5), when the space of distributions is chosen as .

### 3.2 Approximating the MVL Objective

While the MVL objective has a closed-form expression, it usually involves second-order derivatives. In this subsection, we will derive an efficient approximation scheme for the MVL objective. Our approximation will only involve first-order derivatives, thus it can be easily implemented using automatic differentiation softwares (e.g. TensorFlow).

First, observe that (8) holds regardless of the chosen space of distributions. Denote , , so , then we can transform the above into

As the first term in (9) is independent of , the MVL objective is always equivalent to the second term. We will approximate the second term by simulating a modified gradient flow: let be the distribution obtained by running the sampler targeting . Then

(10) can be approximated by replacing the limit with a fixed , and running the corresponding sampler starting from a mini-batch of training data. The approximation becomes unbiased when .

#### 3.2.1 A Control Variate

We have derived an estimator of (10) with vanishing bias. However, the estimator will suffer from high variance when the sampler used in the MVL objective consists of Itô diffusion. Fortunately, we can solve this problem with a control variate.

To illustrate the problem as well as our solution, suppose corresponds to Langevin dynamics, and (without loss of generality) we use a batch size of in estimation. Our estimator is then

where is sampled from the training data, and . By Taylor expansion333We need to expand to the second order when the increment is a discretization of some Itô diffusion., equals

 12∥∇xE(x+)∥2−Z⊤∇2xE(x+)Z−√2ϵZ⊤∇xE(x+) +o(1), (11)

and as ,

Now we can see the need for a control variate. In this LD example, the control variate will remove the infinite-variance term; More generally, our control variate is always the inner product of and the diffusion term in the sampler.

Wrapping up, our approximate MVL objective is calculated as follows:

1. [leftmargin=*]

2. Sample a mini-batch of input .

3. Run a single step of the sampling algorithm on targeting , with a step-size of . Denote the resulted state as .

4. Return plus the control variate.

The approximation becomes unbiased as , and has variance444 under mild assumptions controlling the growth of (e.g. bounded by a polynomial), so that the residual term in (11) will have bounded variance when averaged over . regardless of .

### 3.3 On CD-1 and Denoising Score Matching: Pitfalls and Fixes

As a side product, we show that our variance analysis explains the pitfall of two well-known approximations to the score matching objective: CD-1 (Hyvarinen, 2007) and denoising score matching (Vincent, 2011, DSM). Both approximations become unbiased as a step-size hyper-parameter , but did not match the performance of exact score matching in practice, as witnessed in Hyvarinen (2007); Saremi et al. (2018); Song et al. (2019). We propose novel control variates for these approximators. As we will show in Section 6.1, the variance-reduced versions of the approximations have comparable performance to the exact score matching objective.

##### Dsm

DSM considers the objective

 Ldsm(θ)=Ep(x)N(z|0,I)∥x+σz−(x+ψθ(x+σz))∥2. (12)

The first two terms inside the norm represent a noise corrupted sample, and represents a “single-step denoising direction” (Raphan and Simoncelli, 2011). It is proved that the optimal satisfies , where is the density of the corrupted distribution (Raphan and Simoncelli, 2011; Vincent, 2011).

Consider the stochastic estimator of (12). We assume a batch size of , and denote the data sample as . To keep notations consistent, denote , . Then the estimator is

 ^Ldsm=∥x+√ϵz−ϵ∇xE(x+√ϵz;θ)−x∥2.

As is similar to Section 3.2.1, we can show by Taylor expansion (see Appendix A) that

 limϵ→0ϵ−2E^Ldsm =2DFisher(p|q)+(const), (13) limϵ→0ϵ−2Var^Ldsm =∞; (14)

furthermore, the variance reduced objective

 ^Ldsm−(ϵ∥z∥2−2ϵ3/2z⊤∇E(x))ϵ2 (15)

is unbiased with finite variance.

##### CD-1 with Langevin Dynamics

Proposed as an approximation to the maximum likelihood estimate, the -step contrastive divergence (CD-) learning rule updates the model parameter with

 θℓ+1←θℓ+ν[Ep∂θE−EpK∂θE], (16)

where is the learning rate, and is obtained from by running steps of MCMC. (16) does not define a valid objective, since also depends on ; however, Hyvarinen (2007) proved that when and the sampler is the Langevin dynamics, (16) recovers the gradient of the score matching objective.

Using the same derivation as in Section 3.2.1, we can see that as the step-size of the sampler approaches 0 (and is re-scaled appropriately), the gradient produced by CD- also suffers from infinite variance, and this can be fixed using the same control variate.

However, practical utility of CD-1 is still hindered by the fact that it does not correspond to a valid learning objective; consequently, it is impossible to monitor the training process for CD-1, or introduce regularizations such as early stopping555 In practice, the term is often used to tract the training process of CD-. It is not a proper loss; we can see from (9) that when and , is significantly different from the proper score matching (MVL) loss, by a term of . .

### 3.4 Instances of MVL Objectives

As the previous derivation is independent of the distribution space of choice, we can derive approximations to other learning objectives using samplers other than LD. An important example is the Riemannian score matching objective, which corresponds to Riemannian LD; we will discuss it in detail in Section 4.1. Another example is when we choose the sampler as SVGD. In this case, we will obtain an approximation to the kernelized Stein discrepancy, generalizing the derivation in (Liu and Wang, 2017)

. When the sampling algorithm is chosen as SPOS, the corresponding MVL objective will be an interpolation between KSD and the Fisher divergence. See Appendix

B.2 for derivations. Finally, the use of birth-death accelerated Langevin dynamics leads to a novel learning objective.

In terms of applications, our work focuses on learning neural energy-based models, and these objectives do not improve over score matching in this aspect. However, these derivations are useful since they generalize previous discussions, and establish new connections between sampling algorithms and learning objectives. It is also possible that these approximate objectives could be useful in other scenarios, such as learning kernel exponential family models

(Sriperumbudur et al., 2017), improving the training of GANs (Liu and Wang, 2017) or amortized variational inference methods (Ruiz and Titsias, 2019).

## 4 Applications

We now present applications of our work, including a scalable learning algorithm for unnormalized models on manifolds, as well as its application on learning implicit auto-encoders with manifold-valued priors.

### 4.1 MVL on Riemannian Manifolds

Density estimation on manifolds is needed in many application areas. While it is natural to consider unnoramlized models on manifolds, there has been a lack of scalable learning methods. Here we address this issue, by applying our method to obtain a scalable approximation to the Riemannian score matching objective (Mardia et al., 2016).

Given the data manifold , we define an unnormalized model on it by parameterizing the log density w.r.t. the Hausdorff measure, and define the density as . The Riemannian score matching objective will have the same form as (6); although the norm in (6) is now determined by the metric on , and the base measure of the densities has changed.

It is easy to verify that the derivation in Section 3.1 still applies in the manifold case. Thus, the Riemannian score matching objective is a special case of the MVL objective, in which the distribution space is still chosen as . The difference is that is now defined with the non-trivial data-space metric, and the gradient flow of becomes the Riemannian Langevin dynamics (3). We can approximate the objective by doing a single step of Riemannian LD for small :

 Lmvl-rld=2ϵ(E(y−;θ)−E(y;θ)−√2ϵ∂iE(y)zicontrol variate). (17)

In (17), is the local coordinates of a sampled data point, is the Riemannian metric, and is obtained by running Riemannian Langevin dynamics666 While readers familiar with Riemannian Brownian motion may notice that (18) is only defined before the particle escapes the local chart, this is good enough for our purpose: we are only concerned with infinitesimal time, and escape probability approaches as . See Appendix C. targeting :

 (y−)i= yi+ϵ(−gij∂jE(y;θ)+log|G(y)|2+∂kgik) +√2ϵzi, (18) z∼ N(0,G−1(y)).

### 4.2 Learning Implicit AEs with Manifold Prior

Recently, there is a surge of interest in auto-encoding models with manifold-valued priors. In this section, we present a new training method for implicit auto-encoders with manifold priors, based on the above Riemannian score matching algorithm.

Formally, auto-encoders model the observed data by marginalizing out a latent code variable, . To enable tractable learning, they define an additional “encoder” distribution . We will consider two types of auto-encoders:

1. [leftmargin=*]

2. VAEs with implicit encoder, which maximizes the evidence lower bound. is a reparameterized implicit distribution, i.e. for fixed , is defined as the pushforward measure of a simple distribution , by a DNN that takes and as input.

3. Wasserstein auto-encoders (WAEs), which minimizes the 1-Wasserstein distance between the model and data distributions by minimizing where is the deterministic decoder, i.e. ; is a user-specified reconstruction error, is the aggregated prior, is a hyperparamter, and is an arbitrary divergence. We use the exclusive KL divergence as .

Both objectives are intractable, as they include the entropy of a latent-space distribution with intractable density: for VAE, and for WAE. However, it is known that to obtain , it suffices to estimate the score function . Specifically, let be the pushforward of by . Then we have

 ∇ϕH[q(z)]=−Eϵ[∇zlogq(z)∇ϕfq(ϵ;ϕ)]. (19)

Score estimation can be done by fitting an unnormalized model on the distribution , and approximating above with . (For VAE, we will fit a conditional unnormalized model to approximate the conditional entropy.)

A variant of this idea is explored in Song et al. (2019), and outperforms existing learning algorithms for implicit AEs. As argued by (Shi et al., 2018; Li and Turner, 2018), this method is advantageous as it directly estimates the score function of the latent-space distribution, instead of obtaining gradient from density (ratio) estimations; the latter could lead to arbitrary variations in the gradient estimate.

When the latent variables are defined on an embedded manifold (e.g. hyper-spheres), we can no longer use the Euclidean score estimators to approximate the learning objective, as the entropy of the latent-space distribution w.r.t. the Lebesgue measure is usually undefined. However, we can still approximate the objective by doing score estimation inside the manifold: let be the density w.r.t. the Hausdorff measure, and be the corresponding relative entropy functional. Then (19) will still hold; see Appendix D. We can estimate the score function in (19) by with an unnormalized model on manifold, learned with the objective (17).

Wrapping up, we obtain an efficient algorithm to train auto-encoders with a manifold-valued prior.

## 5 Related Work

Our work concerns scalable learning algorithms for unnormalized models. This is a longstanding problem in literature, and some of the previous work is discussed in Section 1. Other notable work includes noise contrastive estimation

(Gutmann and Hyvärinen, 2010) and Parzen score matching (Raphan and Simoncelli, 2011). However, to our knowledge, they have not been applied to complex unnormalized models parameterized by DNNs.

Apart from the MVL formulation used in this work, there exists other work on the connection between learning objectives of unnormalized model and infinitesimal actions of sampling dynamics (or other processes):

• [leftmargin=*]

• The minimum probability flow framework (Sohl-Dickstein et al., 2011) studies the slightly different objective , where is the trajectory of the sampler. It recovers score matching as a special instance, and leads to a tractable learning objective for discrete models.

• Many of the objective functions we have considered are also instances of the Stein discrepancy. This interpretation is helpful in establishing theoretical properties (Gorham et al., 2019) and deriving new objectives (Barp et al., 2019).

• Lyu (2009) observes a different connection between score matching and (derivative of) KL divergence; specifically they showed , where are obtained by doing Brownian motion starting from or .

As those formulations have different motivations compared with ours, they do not lead to scalable learning objectives for continuous models.

## 6 Evaluation

### 6.1 Synthetic Experiments

To demonstrate the proposed estimators have small bias and variance, we first evaluate them on low-dimensional synthetic data. We will also verify that our control variate in Section 3.3 improves the performance of CD-1 and DSM.

#### 6.1.1 Approximations to Score Matching

In this section, we evaluate our MVL approximation to the Euclidean score matching objective (7), as well as the variance-reduced DSM objective. An experiment evaluating the variance-reduced CD-1 objective is presented in Appendix E.1.2.

We evaluate the bias and variance of our estimators by comparing them to sliced score matching (SSM), an unbiased estimator for (

7). We choose the data distribution as the 2-D banana dataset from Wenliang et al. (2018), and the model distribution as an EBM trained on that dataset. We estimate the squared bias with a stochastic upper bound using samples; see Appendix E.1.1 for details.

The results are shown in Figure 1. We can see that for both estimators, the bias is negligible at . We further use a z-test to compare the mean of the two estimators (for ) with the mean of SSM. The p value is for our estimator and for DSM, indicating there is no significant difference in either case. The variance of the estimators, with and without our control variate, are shown in Fig.1 right. As expected, the variance grows unbounded in absence of the control variate, and is approximately constant when it is added. From the scale of the variance, we can see that it is exactly this variance problem that causes the failure of the original DSM estimator.

#### 6.1.2 Density Estimation on Manifolds

We now evaluate our approximation to the Riemannian score matching objective, by learning neural energy-based models on and . The target distributions are mixtures of von-Mises-Fisher distributions. In Figure 2, we plot the log densities of the ground truth distribution as well as the learned model on . We can see the two functions matches closely, suggesting our method is suitable for density estimation on manifolds. Results on are similar and will be presented in E.1.3; detailed setups are deferred to Appendix E.1.1.

### 6.2 Implicit AEs with Manifold Prior

We now apply our method to train implicit auto-encoding models with manifold-valued prior. Experiment setups mainly follow Song et al. (2019); see Appendix E.2.

Note that there is an important difference from Song et al. (2019) in our implementation: for (conditional) score estimation, we parameterize an scalar energy function and use as the score estimate, while Song et al. (2019) directly parameterize a vector-value network . Since directly using a feed-forward network (FFN) for does not work well in practice, we parameterize the energy function as , where is parameterized in the same way as Song et al. (2019). This can be seen as correcting an initial score approximation to make it conservative. In addition to being conceptually desirable (as score functions are conservative fields), this approach leads to significant improvements in the WAE experiments.

#### 6.2.1 Implicit VAEs

We apply our method to train hyperspherical VAEs (Davidson et al., 2018) with implicit encoders on the MNIST dataset. Our encoder and decoder architecture follows Song et al. (2019), with the exception that we normalize so it lies on .

We consider . Baseline methods include hyperspherical VAE with explicit encoders and Euclidean VAEs. We report the test log likelihood estimated with annealed importance sampling (Wu et al., 2016; Neal, 2001)

, as well as its standard deviation across 10 runs.

The results are summarized in Table 1. We can see that the implicit hyperspherical VAE trained with our method outperforms all other baselines. Interestingly, the explicit hyperspherical VAE could not match the performance of Euclidean VAE in higher dimensions. This is also observed in Davidson et al. (2018), who (incorrectly) conjectured that the hyperspherical prior is unsuitable in higher dimensions. From our results, we can see that the problem actually lies in the flexibility of variational posteriors. Our method thus unleashes the potential of VAEs with manifold-valued priors, and might lead to improvements in downstream tasks.

#### 6.2.2 Hyperspherical WAEs

We first evaluate our method on MNIST. We use the uniform distribution as , and choose cross entropy as the reconstruction error. We choose . We use the encoder and decoder architecture in Song et al. (2019); the architecture of the energy network is also similar to their work. We report the Frechet Inception Distance (FID; Heusel et al., 2017).

As the choice of divergence measure in the WAE objective is arbitrary, there are several methods to train WAEs with manifold latent space: using the Jensen-Shannon divergence approximated with a GAN-like discriminator (WAE-GAN), and using the maximum mean discrepancy (MMD) divergence. We choose WAE-GAN as the baseline method, as it outperforms WAE-MMD in Tolstikhin et al. (2017). To demonstrate the utility of hyperspherical priors, we also compare with models using normal priors.

The FID scores are reported in Table 2. We can see that hyperspherical prior leads to better sample quality compared with Euclidean prior, and our method improves the training of WAEs.

To demonstrate our method scales to higher dimensions, we also train hyperspherical WAEs on CIFAR-10 and CelebA, with larger . We find that our method is comparable or better than WAE-GAN and WAE-MMD; see Appendix E.2.1.

## 7 Conclusion

We present a scalable approximation to a general family of learning objectives for unnormalized models, based on a new connection between these objectives and gradient flows. Our method can be applied to manifold density estimation and training implicit auto-encoders with manifold priors.

## Acknowledgement

J.Z is the corresponding author. We thank Chang Liu and Jiaxin Shi for comments. This work was supported by the National Key Research and Development Program of China (No. 2017YFA0700904), NSFC Project (Nos. 61620106010, U1811461), Beijing NSF Project (No. L172037), Beijing Academy of Artificial Intelligence (BAAI), a grant from Tsinghua Institute for Guo Qiang, and the NVIDIA NVAIL Program with GPU/DGX Acceleration.

## References

Supplementary Material

## Appendix A Derivation of (13)-(15)

Denote .

 ^Ldsm =∥x+√ϵz−ϵ∇E(x+√ϵz)−x∥2 (20) =ϵ∥z∥2+ϵ2∥∇E(~x)∥2−2ϵ3/2⟨z,∇E(~x)⟩ (21) =ϵ∥z∥2+ϵ2∥∇E(~x)∥2−2ϵ3/2⟨z,∇E(x)+(∇2E(x))(√ϵz)+O(ϵ)⟩ (22) =ϵ2(∥∇E(~x)∥2−2z⊤(∇2E(x))z)A+ϵ∥z∥2−2ϵ3/2z⊤∇E(x)B+o(ϵ2), (23)

Notice

 Ez(z⊤∇2E(x)z)=ΔE(x)

which is known as the Hutchinson’s trick (Hutchinson, 1990), so is two times the Fisher divergence . But , so as , the rescaled estimator becomes unbiased with infinite variance; and subtracting (B) from (A) results in a finite-variance estimator.

## Appendix B On SPOS and MVL

##### Notations

In this section, let the parameter space be -dimensional, and define as the space of -dimensional functions .

While in the main text, we identified the tangent space of as a subspace of for clarity, here we use the equivalent definition following (Otto, 2001). The two definition are connected by the transform for . Using the new definition, the differential of the KL divergence functional is then

### b.1 SPOS as Gradient Flow

In this section, we give a formal derivation of SPOS as the gradient flow of the KL divergence functional, with respect to a new metric.

Recall the SPOS sampler targeting distribution (with density) corresponds to the following density evolution:

 ∂tρt=−∇⋅(ρt(x)(ϕ∗ρt,ϕ(x)+α∇log(ϕ/ρ))νt(x))

where

is a hyperparameter, and

 ϕ∗ρt,ϕ(x):=Eρt(x′)(Sϕ⊗k)(x′,x):=Eρt(x′)[(∇x′logϕ(x′))k(x′,x)+∇x′k(x′,x)]

is the SVGD update direction (Liu and Wang, 2016; Liu, 2017). Fix , define the integral operator

 Kρ[f](x):=Eρ(x′)k(x′,x)f(x),

and define the tensor product operator

accordingly. Then the SVGD update direction satisfies

 ϕ∗ρ,ϕ=K⊗dρ[∇log(ϕ/ρ)], (24)

which we will derive at the end of this subsection for completeness. Following (24) we have

 νt(x)=(αId+K⊗dρ)[∇log(ϕ/ρ)]. (25)

The rest of our derivation follows (Otto, 2001; Liu, 2017): consider the function space where is any square integrable and differentiable function. It connects to the tangent space of if we consider for any . Define on the inner product

 ⟨f,g⟩Hρ,α:=⟨f,(αId+K⊗dρ)−1[g]⟩L2(ρX→Rd). (26)

It then determines a Riemannian metric on the function space. For and , by (25) we have

 ⟨νt,~p⟩Hρ,α=Eρt(x)⟨∇log(ϕ/ρt)(x),~p(x)⟩=−∫logϕρt(∇⋅(~pρ))dx=−(dKLϕ)(s),

i.e. with respect to the metric (26), SPOS is the gradient flow minimizing the KL divergence functional.

##### Derivation of (24)

let be its eigendecomposition (i.e. the Mercer representation). For let where is the coordinate basis in , so becomes an orthonormal basis in . Now we calculate the coordinate of in this basis.

 ⟨ϕ∗ρ,ϕ,ψi,j⟩L2(ρ) =Eρ(x)Eρ(x′)⟨(∇x′logϕ(x′))k(x′,x)+∇x′k(x′,x),ψi,j(x)⟩ =Eρ(x′)[⟨∇x′logϕ(x′),(Kρ[ψi,j])(x′)⟩+∇⋅((Kρ[ψi,j])(x′))] =:Eρ(x′)[Sϕ(Kρ[ψi,j])(x′)]. (27)

is known to satisfy the Stein’s identity

 EρSρ(g)=0

for all . Thus, we can subtract from the right hand side of (27) without changing its value, and it becomes

 Eρ(x′)[Sϕ(Kρ[ψi,j])(x′)]−Eρ(x′)[Sρ(Kρ[ψi,j])(x′)] = Eρ(x′)[⟨∇x′logϕ(x′)ρ(x′),(Kρ[ψi,j])(x′)⟩] = λkEρ(x′)[⟨∇x′logϕ(x′)ρ(x′),ψi,j(x′)⟩].

As the equality holds for all , we completed the derivation of (24).

### b.2 MVL Objective Derived from SPOS

By (25) and (26), the MVL objective derived from SPOS is

In the right hand side above, the first term in the summation is the Fisher divergence, and the second is the kernelized Stein discrepancy (Liu et al., 2016b, Definition 3.2).

We note that a similar result for SVGD has been derived in (Liu and Wang, 2017), and our derivations connect to the observation that Langevin dynamics can be viewed as SVGD with a Dirac function kernel (thus SPOS also corresponds to SVGD with generalized-function-valued kernels).

## Appendix C Justification of the Use of Local Coordinates in (17)

In this section, we prove in Proposition C.1 that the local coordinate representation lead to valid approximation to the MVL objective in the compact case. We also argue in Remark C.2 that the use of local coordinate does not lead to numerical instability.

###### Remark C.1.

While a result more general than Proposition C.1 is likely attainable (e.g. by replacing compactness of with quadratic growth of the energy), this is out of the scope of our work; for our purpose, it is sufficient to note that the proposition covers manifolds like , and the local coordinate issue will not exist in manifolds possessing a global chart, such as .

###### Lemma C.1.

(Theorem 3.6.1 in (Hsu, 2002)) For any manifold , , and a normal neighborhood of , there exists constant such that the first exit time from , of the Riemannian Brownian motion starting from , satisfies

 P(τ≤CL)≤e−L/2

for any .

###### Proposition C.1.

Assume the data manifold is compact, and for all , is in . Let be defined as in (17), following the true Riemannian Langevin dynamics targeting . Then

i.e. (17) recovers true WMVL objective.

###### Proof.

By the tower property of conditional expectation, it suffices to prove the result when for some . Choose a normal neighborhood centered at such that is contained by our current chart, and has distance from the boundary of the chart bounded by some . Let be defined as in Lemma C.1. Recall the Riemannian LD is the sum of a drift and the Riemannian BM. Since is compact and is in , the drift term in the SDE will have norm bounded by some finite . Thus the first exit time of the Riemannian LD is greater than .

Let follow the true Riemannian LD, when , and be such that afterwards.777 This is conceptually similar to the standard augmentation used in stochastic process texts; from a algorithmic perspective it can be implemented by modifying the algorithm so that in the very unlikely event when escapes the chart, we return as the corresponding energy. We note that this is unnecessary for manifolds like , since the charts can be extended to and hence . By Hsu (2008), until , follows the local coordinate representation of Riemannian LD (3), thus on the event , would correspond to in (18). As is compact, the continuous energy function is bounded by for some finite . Then for sufficiently small ,

 12E(~Lmvl_rld)=E(E(¯Xϵ)−E(X0))ϵ =E(E(Xϵ)−E(X0))ϵ+E(E(¯Xϵ)−E(Xϵ))ϵ =E(E(Xϵ)−E(X0))ϵ+E(−E(Xϵ)1{τ≤ϵ})ϵ.

In the above the first term converges to as , and when . Hence the proof is complete. ∎

###### Remark C.2.

It is argued that simulating diffusion-based MCMC in local coordinates leads to numeric instabilities (Byrne and Girolami, 2013; Liu et al., 2016a). We stress that in our setting of approximating MVL objectives, this is not the case. The reason is that we only need to do a single step of MCMC, with arbitrarily small step-size. Therefore, we could use different step-size for each sample, based on the magnitude of and in their locations. We can also choose different local charts for each sample, which is justified by the proposition above.

## Appendix D Derivation of (19) in the Manifold Case

In this section we derive (19), when the latent-space distribution is defined on a -dimensional manifold embedded in some Euclidean space, and is the relative entropy w.r.t. the Hausdorff measure. The derivation is largely similar to the Euclidean case, and we only include it here for completeness.

(19) holds because

 ∇ϕH[qϕ(z)] (i)=−∇ϕEp(ϵ)[logqϕ(f(ϵ,ϕ))] =−Ep(ϵ)[∇ϕlogqϕ(f(ϵ,ϕ))] =−Ep(ϵ)[∇ϕlogqϕ(z)∣∣z=f(ϵ,ϕ)+∇flogq(f(ϵ,ϕ))∇ϕf(ϵ,ϕ)] (ii)=−Ep(ϵ)[∇zlogqϕ(z)∇ϕf(ϵ,ϕ)],

where (i) follows from Theorem 2.10.10 in Federer (2014), and (ii) follows from the same theorem as well as the fact that .

## Appendix E Experiment Details and Additional Results

Code will be available at https://github.com/thu-ml/wmvl.

### e.1 Synthetic Experiments

#### e.1.1 Experiment Details

##### Experiment Details in Section 6.1.1

The (squared) bias is estimated as follows: denote the SSM estimator and ours as and , respectively. One could verify that both methods estimate (7). Our estimate for the squared bias is now where