Sliced Score Matching: A Scalable Approach to Density and Score Estimation

05/17/2019 ∙ by Yang Song, et al. ∙ Stanford University

Score matching is a popular method for estimating unnormalized statistical models. However, it has been so far limited to simple models or low-dimensional data, due to the difficulty of computing the trace of Hessians for log-density functions. We show this difficulty can be mitigated by sliced score matching, a new objective that matches random projections of the original scores. Our objective only involves Hessian-vector products, which can be easily implemented using reverse-mode auto-differentiation. This enables scalable score matching for complex models and higher dimensional data. Theoretically, we prove the consistency and asymptotic normality of sliced score matching. Moreover, we demonstrate that sliced score matching can be used to learn deep score estimators for implicit distributions. In our experiments, we show that sliced score matching greatly outperforms competitors on learning deep energy-based models, and can produce accurate score estimates for applications such as variational inference with implicit distributions and training Wasserstein Auto-Encoders.




1 Introduction

Score matching (Hyvärinen, 2005) is a statistical estimation method particularly suitable for learning unnormalized models. Unlike maximum likelihood estimation (MLE), the objective of score matching depends only on the derivative of the log-density function (a.k.a. the score), which is independent of the normalizing term. However, score matching has so far been limited to small density models and low-dimensional data, because its objective involves the trace of the Hessian of the log-density function, which is expensive to compute for general densities (Martens et al., 2012), and especially for those parameterized by neural networks, such as deep energy-based models (LeCun et al., 2006; Wenliang et al., 2018).

There have been several attempts to alleviate this difficulty: Kingma & LeCun (2010) proposes approximate backpropagation for computing the trace of the Hessian; Martens et al. (2012) develops curvature propagation, a fast stochastic estimator for the trace in score matching; and Vincent (2011) transforms score matching into a denoising problem that avoids second-order derivatives. These methods have achieved some success, but may suffer from one or more of the following problems: inconsistent parameter estimation, large estimation variance, and cumbersome implementation.

To avoid these problems, we propose sliced score matching, a variant of score matching that can scale to large unnormalized models and high-dimensional data. The key intuition is that instead of directly matching the high-dimensional scores, we match their projections along random directions. Our theoretical results show that under some regularity conditions, sliced score matching is a well-defined statistical estimation criterion that yields consistent and asymptotically normal parameter estimates. Moreover, compared to the methods of Kingma & LeCun (2010) and Martens et al. (2012), whose implementations require customized backpropagation for deep networks, the objective of sliced score matching involves only Hessian-vector products, and thus can be easily implemented in automatic differentiation frameworks such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017).

Since score matching is based on the distance between model and data scores, it can be naturally used as an objective for estimating data score functions (Sasaki et al., 2014; Strathmann et al., 2015). We show that given a set of samples, sliced score matching can be used to train a deep neural network as a score estimator. This observation opens up a new direction of applications for sliced score matching. For example, we show that sliced score estimation can be used to provide accurate score estimates for many tasks, such as variational inference with implicit distributions (Huszár, 2017) and learning Wasserstein Auto-Encoders (WAE) (Tolstikhin et al., 2018).

Finally, we evaluate the performance of sliced score matching on both density and score estimation tasks. For density estimation, experiments on deep kernel exponential families (Wenliang et al., 2018) and NICE flow models (Dinh et al., 2015) show that our method is either more scalable or more accurate than existing score matching variants. When applied to score estimation, our method improves the performance of variational auto-encoders (VAE) with implicit encoders, and can train WAEs without a discriminator or MMD loss by directly optimizing the KL divergence between aggregated posteriors and the prior. In both settings we outperform kernel-based score estimators (Li & Turner, 2018; Shi et al., 2018), achieving better test likelihoods and better quality of randomly generated images.

2 Background

Given i.i.d. samples $\{x_1, \dots, x_N\}$ from a data distribution $p_d(x)$, our task is to learn an unnormalized density $\tilde{p}_m(x; \theta)$, where $\theta$ is from some parameter space $\Theta$. The model's normalizing constant (a.k.a. partition function) is denoted $Z_\theta$, which is assumed to be intractable. A deep energy-based model is an unnormalized model that uses a deep feedforward neural network to parameterize $\tilde{p}_m(x; \theta)$. Letting $p_m(x; \theta)$ be the normalized density determined by our model, we have
$$p_m(x; \theta) = \frac{\tilde{p}_m(x; \theta)}{Z_\theta}.$$

For convenience, we denote the (Stein) score function as $\nabla_x \log p_m(x; \theta)$, the gradient of the log-density function w.r.t. the data point. Note that since $\log p_m(x; \theta) = \log \tilde{p}_m(x; \theta) - \log Z_\theta$, we immediately conclude that the score function of our model does not depend on the intractable partition function $Z_\theta$.
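As a quick sanity check (our illustration, not code from the paper), the following minimal NumPy sketch differentiates a toy unnormalized Gaussian numerically; the hypothetical constant `c` plays the role of $Z_\theta$ and drops out of the score:

```python
import numpy as np

def log_unnormalized(x, c):
    # log of an unnormalized Gaussian density, scaled by an arbitrary constant c
    return -0.5 * np.sum(x ** 2) + np.log(c)

def numerical_score(f, x, eps=1e-5):
    # central-difference estimate of the score grad_x f(x)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([0.3, -1.2, 0.7])
s1 = numerical_score(lambda y: log_unnormalized(y, 1.0), x)
s2 = numerical_score(lambda y: log_unnormalized(y, 123.4), x)
# both match the analytic score -x, independent of the normalizing constant
print(np.allclose(s1, -x, atol=1e-4), np.allclose(s1, s2, atol=1e-6))
```

Changing `c` leaves the estimated score unchanged, which is exactly why score-based objectives avoid the partition function.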

2.1 Score Matching

Learning unnormalized models with maximum likelihood estimation (MLE) is difficult because of the intractability of the partition function $Z_\theta$. To avoid the partition function, score matching (Hyvärinen, 2005) minimizes the Fisher divergence between $p_d$ and $p_m$, which is defined as
$$L(\theta) \triangleq \frac{1}{2}\,\mathbb{E}_{p_d}\big[\|\nabla_x \log p_d(x) - \nabla_x \log p_m(x; \theta)\|_2^2\big]. \tag{1}$$

Since $\nabla_x \log p_m(x; \theta)$ is independent of $Z_\theta$, the Fisher divergence avoids the intractable partition function. However, Eq. (1) is still not readily usable for learning unnormalized models, as we usually do not have access to the score function of the data, $\nabla_x \log p_d(x)$.

By applying integration by parts, Hyvärinen (2005) shows that $L(\theta)$ can be written as (cf., Theorem 1 in Hyvärinen (2005)):
$$J(\theta) \triangleq \mathbb{E}_{p_d}\Big[\operatorname{tr}\big(\nabla_x^2 \log p_m(x; \theta)\big) + \frac{1}{2}\|\nabla_x \log p_m(x; \theta)\|_2^2\Big] + C, \tag{2}$$
where $C$ is a constant that does not depend on $\theta$, $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, and $\nabla_x^2 \log p_m(x; \theta)$ is the Hessian of the log-density function. The constant can be ignored, and the following unbiased estimator of the remaining terms is used to train our unnormalized model:
$$\hat{J}(\theta; x_1^N) \triangleq \frac{1}{N}\sum_{n=1}^{N}\Big[\operatorname{tr}\big(\nabla_x^2 \log p_m(x_n; \theta)\big) + \frac{1}{2}\|\nabla_x \log p_m(x_n; \theta)\|_2^2\Big], \tag{3}$$
where $x_1^N \triangleq \{x_1, \dots, x_N\}$ is a shorthand used throughout the paper for a collection of $N$ data points sampled i.i.d. from $p_d(x)$.

Computational difficulty

While the score matching objective avoids computing $Z_\theta$ for unnormalized models, it introduces a new computational difficulty: computing the trace of the Hessian of the log-density function, $\operatorname{tr}(\nabla_x^2 \log p_m(x; \theta))$. Suppose the input data has dimension $D$. A naïve approach to computing the trace of the Hessian requires $D$ times more backward passes than computing the gradient (see Alg. 2 in the appendix): for example, the trace could be computed by applying backpropagation $D$ times to $\nabla_x \log p_m(x; \theta)$ to obtain each diagonal entry of the Hessian sequentially. In practice, $D$ can be many thousands, which can render score matching too slow for practical purposes. Moreover, Martens et al. (2012) argues from a theoretical perspective that it is unlikely that there exists a computationally efficient algorithm (at the level of a constant number of forward and backward passes) for computing the diagonal of the Hessian defined by an arbitrary computation graph.
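To make the cost concrete, here is a small illustrative NumPy sketch (ours, not the paper's Alg. 2) that estimates the Hessian trace of a toy Gaussian log-density with one finite-difference probe per input dimension, i.e., $D$ extra passes over the score; in an autodiff framework, each probe would instead be a full backward pass:

```python
import numpy as np

D = 5
theta = 2.0  # precision of a toy Gaussian model: log p(x) = -theta * ||x||^2 / 2 + const

def score(x):
    # analytic gradient of the toy log-density (what one backward pass would return)
    return -theta * x

def hessian_trace_naive(score_fn, x, eps=1e-5):
    # one extra probe per input dimension: D finite-difference passes over the score
    tr = 0.0
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        tr += (score_fn(x + e)[i] - score_fn(x - e)[i]) / (2 * eps)
    return tr

x = np.random.randn(D)
tr_est = hessian_trace_naive(score, x)
print(tr_est)  # approximately -theta * D = -10 for this toy model
```

The loop over coordinates is exactly the $O(D)$ overhead the text describes; the sliced objective below replaces it with a constant number of passes.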

2.2 Score Estimation for Implicit Distributions

Besides parameter estimation in unnormalized models, score matching can also be used to estimate the scores of implicit distributions, which are distributions that have a tractable sampling process but no tractable density. Implicit distributions arise in many situations, such as the marginal distribution of a non-conjugate model (Sun et al., 2019), non-invertible transformations of random variables (Goodfellow et al., 2014), and models defined by complex simulation processes (Tran et al., 2017). In these cases, learning and inference often become intractable due to the need to optimize an objective that involves the intractable density.

In these cases, directly estimating the score function $\nabla_x \log q(x)$ based on i.i.d. samples from an implicit density $q(x)$ can be useful. For example, suppose our learning problem involves optimizing the entropy of an implicit distribution $q_\theta(x)$; this situation is common when dealing with variational free energies (Kingma & Welling, 2014). Suppose $q_\theta(x)$ can be reparameterized as $x = f_\theta(\epsilon)$, where $\epsilon$ is a simple random variable independent of $\theta$, such as a standard normal, and $f_\theta$ is a deterministic mapping. We can write the gradient of the entropy $H(q_\theta)$ with respect to $\theta$ as
$$\nabla_\theta H(q_\theta) = -\mathbb{E}_{p(\epsilon)}\Big[\big(\nabla_\theta f_\theta(\epsilon)\big)^\top \nabla_x \log q_\theta(x)\big|_{x = f_\theta(\epsilon)}\Big],$$
where $\nabla_\theta f_\theta(\epsilon)$ is usually easy to compute. The score $\nabla_x \log q_\theta(x)$ is intractable, but can be approximated by score estimation.
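As a Monte Carlo check of this identity (our toy example, not from the paper), take a 1-D Gaussian $q_\theta = \mathcal{N}(0, \theta^2)$ with $f_\theta(\epsilon) = \theta\epsilon$, whose entropy gradient is $1/\theta$ in closed form; the analytic score below stands in for an estimated one:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
eps = rng.standard_normal(200000)

# reparameterization: x = f_theta(eps) = theta * eps, so q_theta = N(0, theta^2)
x = theta * eps
score_x = -x / theta ** 2   # analytic score of q_theta (stand-in for a score estimator)
df_dtheta = eps             # d f_theta(eps) / d theta

# grad_theta H(q_theta) = -E[ (df/dtheta) * score(f_theta(eps)) ]
grad_H = -np.mean(df_dtheta * score_x)
print(grad_H, 1.0 / theta)  # closed form: dH/dtheta = 1/theta
```

In practice the analytic `score_x` would be replaced by a score estimate, which is precisely where sliced score matching enters.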

Score matching is an attractive solution for score estimation since (2) naturally serves as an objective if we replace $\nabla_x \log p_m(x; \theta)$ with a score model. There are also kernel-based score estimators (Li & Turner, 2018; Shi et al., 2018), which we discuss in detail in Section 5.2. Score estimation of this kind can be applied to training VAEs and WAEs, as we shall see in the experiments.

3 Method

In this section, we introduce sliced score matching (SSM), our scalable method designed for density and score estimation with deep energy-based models.

3.1 Sliced Score Matching

Our intuition is that one-dimensional problems are usually much easier to solve than high-dimensional ones. Inspired by the idea of the sliced Wasserstein distance (Rabin et al., 2012), we consider projecting $\nabla_x \log p_d(x)$ and $\nabla_x \log p_m(x; \theta)$ onto a random direction $v$ and propose to compare their average difference along that random direction. More specifically, we consider the following objective as a replacement of the Fisher divergence in Eq. (1):
$$L(\theta; p_v) \triangleq \frac{1}{2}\,\mathbb{E}_{p_v}\mathbb{E}_{p_d}\Big[\big(v^\top \nabla_x \log p_d(x) - v^\top \nabla_x \log p_m(x; \theta)\big)^2\Big]. \tag{4}$$

Here $v \sim p_v$ and $x \sim p_d$ are independent, and we require $\mathbb{E}_{p_v}[v v^\top] \succ 0$ and $\mathbb{E}_{p_v}[\|v\|_2^2] < \infty$. Many examples of $p_v$ satisfy these requirements. For instance, $p_v$ can be a multivariate standard normal ($\mathcal{N}(0, I_D)$), a multivariate Rademacher distribution (the uniform distribution over $\{\pm 1\}^D$), or a uniform distribution on the hypersphere (recall that $D$ refers to the dimension of $x$).
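The following NumPy sketch (an illustration we add, not code from the paper) draws from each of these candidate $p_v$ and checks the second moment empirically; the hypersphere samples are scaled to radius $\sqrt{D}$ so that all three give $\mathbb{E}[v v^\top] = I$, whereas the unit sphere would give $I/D$, which still satisfies the positive-definiteness requirement:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 4, 100000

gaussian = rng.standard_normal((n, D))                     # N(0, I_D)
rademacher = rng.choice([-1.0, 1.0], size=(n, D))          # uniform over {+1, -1}^D
g = rng.standard_normal((n, D))
sphere = np.sqrt(D) * g / np.linalg.norm(g, axis=1, keepdims=True)  # radius-sqrt(D) sphere

moments = [v.T @ v / n for v in (gaussian, rademacher, sphere)]
for m in moments:
    print(np.round(m, 2))  # each is approximately the D x D identity
```

Any of these samplers can supply the projection vectors in the estimators below; the Rademacher and Gaussian choices additionally enable the variance-reduced form.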

To eliminate the dependence of $L(\theta; p_v)$ on $\nabla_x \log p_d(x)$, we use integration by parts as in score matching (cf., the equivalence between Eq. (1) and (2)). Defining
$$J(\theta; p_v) \triangleq \mathbb{E}_{p_v}\mathbb{E}_{p_d}\Big[v^\top \nabla_x^2 \log p_m(x; \theta)\, v + \frac{1}{2}\big(v^\top \nabla_x \log p_m(x; \theta)\big)^2\Big], \tag{5}$$
the equivalence is summarized in the following theorem.

Theorem 1.

Under some regularity conditions (Assumptions 1-3), we have
$$L(\theta; p_v) = J(\theta; p_v) + C,$$
where $C$ is a constant w.r.t. $\theta$.

Other than our requirements on $p_v$, the assumptions are exactly the same as in Theorem 1 of Hyvärinen (2005). We refer interested readers to Appendix B.2 for technical statements of the assumptions and a rigorous proof of the theorem.

In practice, given a dataset $x_1^N \triangleq \{x_1, \dots, x_N\}$, we draw $M$ projection vectors independently for each point $x_n$. The collection of all such vectors is denoted $\{v_{nm}\}$. We then use the following unbiased estimator of $J(\theta; p_v)$:
$$\hat{J}(\theta) \triangleq \frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M}\Big[v_{nm}^\top \nabla_x^2 \log p_m(x_n; \theta)\, v_{nm} + \frac{1}{2}\big(v_{nm}^\top \nabla_x \log p_m(x_n; \theta)\big)^2\Big]. \tag{6}$$

Note that when $p_v$ is a multivariate standard normal or multivariate Rademacher distribution, we have $\mathbb{E}_{p_v}[v v^\top] = I$, in which case the second term of $\hat{J}$ can be integrated analytically, which may lead to an estimator with reduced variance:
$$\hat{J}_{\text{vr}}(\theta) \triangleq \frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M}v_{nm}^\top \nabla_x^2 \log p_m(x_n; \theta)\, v_{nm} + \frac{1}{2N}\sum_{n=1}^{N}\|\nabla_x \log p_m(x_n; \theta)\|_2^2. \tag{7}$$

Empirically, we do find that $\hat{J}_{\text{vr}}$ has better performance than $\hat{J}$. We refer to this version as sliced score matching with variance reduction (SSM-VR). In fact, this analytic integration can be leveraged to create a control variate for guaranteed reduction of variance (Appendix D). $\hat{J}_{\text{vr}}$ is also closely related to Hutchinson's trace estimator (Hutchinson, 1990), which we analyze later in Section 4.3.

For sliced score matching, the second-derivative term $v^\top \nabla_x^2 \log p_m(x; \theta)\, v$ is much less computationally expensive than $\operatorname{tr}(\nabla_x^2 \log p_m(x; \theta))$. Using auto-differentiation systems that support higher-order gradients, we can compute it using two gradient operations for a single $v$, as shown in Alg. 1. Similarly, when there are $M$ projection vectors, the total number of gradient operations is $M + 1$. In contrast, assuming the dimension of $x$ is $D$ and $M \ll D$, we typically need $D$ gradient operations to compute $\operatorname{tr}(\nabla_x^2 \log p_m(x; \theta))$, because each diagonal entry of the Hessian needs to be computed separately (see Alg. 2 in the appendix).

Algorithm 1 Sliced Score Matching
1: $s \leftarrow \nabla_x \log p_m(x; \theta)$
2: $a \leftarrow v^\top s$
3: $b \leftarrow v^\top \nabla_x a$ (a Hessian-vector product)
4: return $b + \frac{1}{2}a^2$ (or $b + \frac{1}{2}\|s\|_2^2$ for SSM-VR)
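To illustrate Alg. 1, the sketch below (our toy example, not the paper's code) evaluates $\hat{J}$ and $\hat{J}_{\text{vr}}$ for an isotropic Gaussian model whose score $s(x; \theta) = -\theta x$ admits a closed-form Hessian-vector product, standing in for the double backprop an autodiff framework would perform; a grid search over $\theta$ then recovers the data precision:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, true_precision = 10, 50000, 2.0
x = rng.standard_normal((N, D)) / np.sqrt(true_precision)  # data ~ N(0, I / theta*)
v = rng.choice([-1.0, 1.0], size=(N, D))                   # one Rademacher projection per point (M = 1)

def ssm_loss(theta, variance_reduction=False):
    # toy model: log p_m(x; theta) = -theta * ||x||^2 / 2 + const
    score = -theta * x            # s = grad_x log p_m(x; theta)
    hvp = -theta * v              # grad_x (v^T s): closed form here, double backprop in general
    quad = np.sum(v * hvp, axis=1)                      # v^T (Hessian) v
    if variance_reduction:
        sq = 0.5 * np.sum(score ** 2, axis=1)           # analytic second term of J_vr
    else:
        sq = 0.5 * np.sum(v * score, axis=1) ** 2       # (v^T s)^2 / 2
    return np.mean(quad + sq)

thetas = np.linspace(0.5, 4.0, 351)
best = thetas[np.argmin([ssm_loss(t, variance_reduction=True) for t in thetas])]
print(best)  # close to the true precision 2.0
```

With a neural-network log-density, `score` and `hvp` would come from two calls to the framework's gradient routine, matching the $M + 1$ gradient operations described above.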

In practice, we can tune $M$ to find the sweet spot between variance and computation. In our experiments, we find that $M = 1$ is often a good choice.

3.2 Sliced Score Estimation

As mentioned in Section 2.2, the task of score estimation is to predict $\nabla_x \log q(x)$ at some test point $x$, given a set of i.i.d. samples $\{x_1, \dots, x_N\}$ from $q(x)$. Our sliced score matching objective is particularly suitable for this task since it depends only on the scores.

In practice, we propose to use a vector-valued deep neural network $h(x; \theta)$ as our score model. Substituting $h(x; \theta)$ into $\hat{J}$ in place of $\nabla_x \log p_m(x; \theta)$, we get
$$\hat{J}(\theta) = \frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M}\Big[v_{nm}^\top \nabla_x \big(v_{nm}^\top h(x_n; \theta)\big) + \frac{1}{2}\big(v_{nm}^\top h(x_n; \theta)\big)^2\Big],$$
and we can optimize this objective to get $\hat\theta$. Afterwards, $h(x; \hat\theta)$ can be used as an approximation to $\nabla_x \log q(x)$.
¹Note that $h(x; \theta)$ may not correspond to the gradient of any scalar function. For $h$ to represent a gradient, one necessary condition is $\partial h_i(x; \theta)/\partial x_j = \partial h_j(x; \theta)/\partial x_i$ for all $i, j$, which may not be satisfied by general networks. However, because Theorem 1 does not depend on the fact that the score model is a gradient, we can still use the estimator in practice.

By the same integration-by-parts argument (cf., Eq. (4), (5) and (6)), we have
$$\hat{J}(\theta) \simeq \frac{1}{2}\,\mathbb{E}_{p_v}\mathbb{E}_{q}\Big[\big(v^\top h(x; \theta) - v^\top \nabla_x \log q(x)\big)^2\Big] + C,$$
which implies that minimizing $\hat{J}$ with $\nabla_x \log p_m$ replaced by $h$ is approximately minimizing the average projected difference between $h(x; \theta)$ and $\nabla_x \log q(x)$. Hence, $h(x; \hat\theta)$ should be close to $\nabla_x \log q(x)$ and can serve as a score estimator.

4 Analysis

In this section, we present several theoretical results to justify sliced score matching as a principled objective. We also discuss the connection of sliced score matching to other methods. For readability, we will defer rigorous statements of assumptions and theorems to the Appendix.

4.1 Consistency

One important question to ask of any statistical estimation objective is whether the estimated parameter is consistent under reasonable assumptions. Our results confirm that for any number of projections $M \in \mathbb{N}^+$, the estimator $\hat\theta_{N,M}$ obtained by minimizing our objective is consistent under suitable assumptions as $N \to \infty$.

We need several standard assumptions to prove the results rigorously. Let $p_m(x; \theta)$ be the normalized density induced by our unnormalized model $\tilde{p}_m(x; \theta)$. First, we assume $\Theta$ is compact (Assumption 6) and that our model family is well-specified and identifiable (Assumption 4). These are standard assumptions used for proving the consistency of MLE (van der Vaart, 1998). We also adopt the assumption in Hyvärinen (2005) that all densities are positive (Assumption 5). Finally, we assume that the model's log-density has some Lipschitz properties (Assumption 7, which always holds if its first and second derivatives are uniformly bounded w.r.t. $\theta$), and that $p_v$ has bounded higher-order moments (Assumption 2, true for all $p_v$ considered in the experiments). Then, we can prove the consistency of $\hat\theta_{N,M}$:

Theorem 2 (Consistency).

Assume the conditions of Theorem 1 are satisfied, and further that the assumptions discussed above hold. Let $\theta^*$ be the true parameter of the data distribution. Then for every $M \in \mathbb{N}^+$, $\hat\theta_{N,M} \to \theta^*$ in probability as $N \to \infty$.

Sketch of proof.

We first prove that $\theta^*$ is the unique minimizer of the population objective (Lemma 1 and Theorem 1). We then prove the uniform convergence of the finite-sample objective (Lemma 3), which holds regardless of $M$. These two facts together lead to consistency. For a complete proof, see Appendix B.3. ∎

Remark 1.

In Hyvärinen (2005), the authors only showed that the population objective vanishes exactly at $\theta^*$, which leads to "local consistency" of score matching. This is a weaker notion of consistency than in our setting.

4.2 Asymptotic Normality

Asymptotic normality results can be very useful for approximate hypothesis testing and for comparing different estimators. Below we show that $\hat\theta_{N,M}$ is asymptotically normal as $N \to \infty$.

In addition to the assumptions in Section 4.1, we need one extra assumption to prove asymptotic normality: the model's log-density must satisfy a stronger Lipschitz property (Assumption 9). The assumption holds if the relevant quantities, as functions of $\theta$, have uniformly bounded third derivatives.

For simplicity, we write $\mathbb{E}_{p_d}[f(x)]$ as $\mathbb{E}[f]$, where $f$ is an arbitrary function. In the following, we show the asymptotic normality result only for a specific $p_v$; more results can be found in Appendix B.4.

Theorem 3 (Asymptotic normality, special case).

Assume the assumptions discussed above hold. If $p_v$ is the multivariate Rademacher distribution, then
$$\sqrt{N}\big(\hat\theta_{N,M} - \theta^*\big) \xrightarrow{d} \mathcal{N}(0, V_M).$$

Here the asymptotic covariance $V_M$ depends on $M$ and the data dimension $D$, and is built from p.s.d. matrices depending on $\theta^*$; the precise definitions can be found in Appendix B.4.

Sketch of proof.

Once we have consistency (Theorem 2), the rest closely follows the proof of asymptotic normality of MLE (van der Vaart, 1998). A rigorous proof can be found in Appendix B.4. ∎

Remark 2.

As expected, a larger $M$ leads to smaller asymptotic variance.

Remark 3.

As far as we know, there is no published proof of asymptotic normality for score matching. However, by using the same techniques as in our proofs, and under the same assumptions, we can derive the asymptotic variance of the score matching estimator (Corollary 1), which is always smaller than that of sliced score matching with multivariate Rademacher projections. The gap, however, shrinks as $M$ increases.

4.3 Connection to Other Methods

Sliced score matching is closely connected to many other methods, and can be motivated from several different perspectives. Here we discuss a few of them.

Connection to NCE

Noise contrastive estimation (NCE), proposed by Gutmann & Hyvärinen (2010), is another principle for training unnormalized statistical models. The method works by comparing $p_m(x; \theta)$ with a noise distribution. We consider a special form of NCE that minimizes an objective, Eq. (10), in which each data point $x$ is contrasted against a version perturbed by a small amount $\epsilon$ along a direction $v$. Assuming $\epsilon$ to be small, Eq. (10) can be written, by Taylor expansion, as an expression (11) whose leading term matches the sliced score matching objective; for the derivation, see Proposition 1 in the appendix. A similar derivation can also be found in Gutmann & Hirayama (2011). As a result of (11), if we choose some $p_v$ and take the expectation of (10) w.r.t. $v$, the objective becomes equivalent to sliced score matching in the small-$\epsilon$ limit.

Connection to Hutchinson’s Trick

Hutchinson's trick (Hutchinson, 1990) is a stochastic algorithm for approximating $\operatorname{tr}(A)$ for any square matrix $A$. The idea is to choose a distribution $p_v$ of a random vector $v$ such that $\mathbb{E}_{p_v}[v v^\top] = I$, which gives $\operatorname{tr}(A) = \mathbb{E}_{p_v}[v^\top A v]$. Hence, using Hutchinson's trick, we can replace $\operatorname{tr}(\nabla_x^2 \log p_m(x; \theta))$ with $\mathbb{E}_{p_v}[v^\top \nabla_x^2 \log p_m(x; \theta)\, v]$ in the score matching objective. The finite-sample objective of score matching then becomes
$$\frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M} v_{nm}^\top \nabla_x^2 \log p_m(x_n; \theta)\, v_{nm} + \frac{1}{2N}\sum_{n=1}^{N}\|\nabla_x \log p_m(x_n; \theta)\|_2^2,$$
which is exactly the sliced score matching objective with variance reduction, $\hat{J}_{\text{vr}}$.
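A minimal NumPy sketch of Hutchinson's trick itself (our illustration), using Rademacher probes on an arbitrary square matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
A = rng.standard_normal((D, D))  # any square matrix works

def hutchinson_trace(A, n_probes):
    # tr(A) = E[v^T A v] for any v with E[v v^T] = I; Rademacher probes used here
    v = rng.choice([-1.0, 1.0], size=(n_probes, A.shape[0]))
    return np.mean(np.einsum('ni,ij,nj->n', v, A, v))

est = hutchinson_trace(A, 200000)
print(est, np.trace(A))  # the two agree closely for many probes
```

In sliced score matching, each probe's quadratic form is computed with a Hessian-vector product rather than an explicit matrix, which is what makes the estimator scalable.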

5 Related Work

5.1 Scalable Score Matching

To the best of our knowledge, there are three existing methods that are able to scale up score matching to learning deep models on high dimensional data.

Denoising Score Matching

Vincent (2011) proposes denoising score matching, a variant of score matching that completely circumvents the Hessian. Specifically, consider a noise distribution $q_\sigma(\tilde{x} \mid x)$, and let $q_\sigma(\tilde{x}) \triangleq \int q_\sigma(\tilde{x} \mid x)\, p_d(x)\, dx$. Denoising score matching applies the original score matching to the noise-corrupted data distribution $q_\sigma(\tilde{x})$, and the objective can be proven equivalent to
$$\mathbb{E}_{q_\sigma(\tilde{x} \mid x)\, p_d(x)}\Big[\frac{1}{2}\big\|\nabla_{\tilde{x}} \log p_m(\tilde{x}; \theta) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)\big\|_2^2\Big],$$
which can be estimated without computing the Hessian. Although denoising score matching is much faster than score matching, it has obvious drawbacks. First, it can only recover the noise-corrupted data distribution. Furthermore, choosing the parameters of the noise distribution is highly non-trivial; in practice, performance can be very sensitive to $\sigma$, and people often have to resort to heuristics. For example, Saremi et al. (2018) proposes a heuristic of choosing $\sigma$ based on Parzen windows.
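A small NumPy sketch (ours, using a toy 1-D Gaussian model with score $-\theta\tilde{x}$) makes the first drawback concrete: the DSM fit recovers the precision of the noise-corrupted distribution, not that of the clean data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, true_precision, sigma = 200000, 2.0, 0.5
x = rng.standard_normal(N) / np.sqrt(true_precision)  # clean data ~ N(0, 1/theta*)
x_tilde = x + sigma * rng.standard_normal(N)          # noise-corrupted samples

# DSM objective for the toy model score s(x_tilde; theta) = -theta * x_tilde:
#   E || s(x_tilde; theta) - grad log q_sigma(x_tilde | x) ||^2 / 2,
# with grad log q_sigma(x_tilde | x) = -(x_tilde - x) / sigma^2 for Gaussian noise.
target = -(x_tilde - x) / sigma ** 2
theta_hat = -np.sum(x_tilde * target) / np.sum(x_tilde ** 2)  # closed-form least-squares fit

corrupted_precision = 1.0 / (1.0 / true_precision + sigma ** 2)
print(theta_hat, corrupted_precision)  # theta_hat matches the corrupted precision, not 2.0
```

The fitted precision equals $1/(1/\theta^* + \sigma^2)$, i.e. that of the smoothed distribution, illustrating why the choice of $\sigma$ matters so much in practice.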

Approximate Backpropagation

Kingma & LeCun (2010) proposes a backpropagation method to approximately compute the trace of the Hessian. Because backpropagation of the full Hessian scales quadratically with the layer size, the authors restrict backpropagation to the diagonal so that it has the same cost as the usual backpropagation for computing loss gradients. However, there are no theoretical guarantees on the approximation error. In fact, the authors only ran experiments on networks with a single hidden layer, in which case the approximation is exact. Moreover, there is no direct support for the proposed approximate backpropagation method in modern automatic differentiation frameworks.

Curvature Propagation

Martens et al. (2012) estimates the trace term in score matching by applying curvature propagation to compute an unbiased, complex-valued estimator of the diagonal of the Hessian. The work claims that curvature propagation has the smallest variance among a class of estimators that includes the Hutchinson estimator; however, their proof incorrectly evaluates the pseudo-variance of the complex-valued estimator instead of its variance. In practice, curvature propagation can have large variance when the number of nodes in the network is large, because it injects noise at each node of the network. Finally, implementing curvature propagation requires manually modifying the backpropagation code and handling complex numbers in neural networks, and it will be inefficient for networks with more general structures, such as recurrent neural networks.

5.2 Kernel Score Estimators

Two prior methods for score estimation are based on a generalized form of Stein's identity (Stein, 1981; Gorham & Mackey, 2017):
$$\mathbb{E}_{q}\big[h(x)\, \nabla_x \log q(x)^\top + \nabla_x h(x)\big] = 0, \tag{12}$$
where $q(x)$ is a continuously differentiable density and $h$ is a function satisfying some regularity conditions. Li & Turner (2018) propose to set $h$ as the feature map of some kernel in the Stein class (Liu et al., 2016) of $q$, and solve a finite-sample version of (12) to obtain estimates of $\nabla_x \log q(x)$ at the sample points. We refer to this method as Stein in the experiments. Shi et al. (2018) adopt a different approach but also exploit (12): they build their estimator by expanding $\nabla_x \log q(x)$ as a spectral series of eigenfunctions and solving for the coefficients using (12). Compared to Stein, their method is argued to have theoretical guarantees and principled out-of-sample estimation at an unseen test point. We refer to their method as Spectral in the experiments.

6 Experiments

Our experiments include two parts: (1) testing the effectiveness of SSM in learning deep models for density estimation, and (2) testing the ability of SSM to provide score estimates for applications such as VAEs with implicit encoders and WAEs. Unless specified explicitly, we choose $M = 1$ by default.

6.1 Density Estimation

Figure 1:

Score matching loss after training DKEF models on UCI datasets with different loss functions; lower is better. Results for approximate backpropagation are not shown because its losses were far larger than those of the other methods.

Figure 2: The running time of score matching grows linearly with the data dimension, while the efficient approaches all scale at a relatively similar rate.

We evaluate SSM and its variance-reduced version (SSM-VR) for density estimation and compare against score matching (SM) and its three scalable baselines: denoising score matching (DSM), approximate backpropagation (approx BP), and curvature propagation (CP). All SSM methods use the multivariate Rademacher distribution as $p_v$. Our results demonstrate that: (1) SSM is comparable in performance to score matching, (2) SSM outperforms other computationally efficient approximations to score matching, and (3) unlike score matching, SSM scales effectively to high-dimensional data.

6.1.1 Deep Kernel Exponential Families


Deep kernel exponential families (DKEF) are unnormalized density models trained using score matching (Wenliang et al., 2018). DKEFs parameterize the unnormalized log-density as a mixture of Gaussian kernels evaluated at a set of inducing points, where the kernel is defined on the feature space of a neural network and the network parameters are trained along with the inducing points. Further details of the model, which is identical to that in Wenliang et al. (2018), are presented in Appendix C.1.


Following the settings of Wenliang et al. (2018), we trained models on three UCI datasets (Parkinsons, RedWine, and WhiteWine; Dua & Graff, 2017), and used their original code for score matching. To compute the trace term exactly, Wenliang et al. (2018)'s manual implementation of backpropagation takes over one thousand lines of code for a model that is four layers deep. For denoising score matching, the value of $\sigma$ is chosen by grid search using the score matching loss on a validation dataset. All models are trained with 15 different random seeds, and training is stopped when the validation loss does not improve for 200 steps.


Results in Fig. 1 demonstrate that SSM-VR performs comparably to score matching when evaluated using the score matching loss on a held-out test set. We observe that variance reduction substantially improves the performance of SSM. In addition, SSM outperforms the other computationally efficient approaches. DSM can perform comparably to SSM on RedWine, but selecting $\sigma$ for DSM is challenging: models trained using the $\sigma$ from the heuristic in Saremi et al. (2018) are far from optimal (on both score matching losses and likelihoods), and extensive grid search is needed to find the best $\sigma$. CP performs substantially worse, likely because it injects noise for each node in the computation graph, and the amount of noise introduced is too large for a neural-network-based kernel evaluated at 200 inducing points; this supports our hypothesis that CP does not work effectively for deep models. Results for approximate backpropagation are omitted because its losses were far larger than those of the other methods on all datasets, as approx BP provides a biased estimate of the Hessian without any error guarantees. All results are similar when evaluating with log-likelihoods (Appendix C.1).

Scalability to high dimensional data

We evaluate the computational efficiency of the different losses on data of increasing dimensionality by fitting DKEF models to a multivariate standard normal in high-dimensional spaces. The average running time per minibatch of 100 examples is reported in Fig. 2. The running time of score matching grows linearly with the dimension of the input data due to the computation of the Hessian, and it triggers out-of-memory errors on a 12GB GPU once the dimension exceeds 400. SSM runs in time similar to DSM, approx BP, and CP, all of which scale to high-dimensional data.

                      Test SM Loss   Test LL
MLE                           -579      -791
SSM-VR                       -8054     -3355
SSM                          -2428     -2039
DSM (grid-search σ)          -3035     -4363
DSM (heuristic σ)              -97     -8082
CP                           -1694     -1517
Approx BP                      -48     -2288

Table 1: Score matching losses and log-likelihoods for NICE models on MNIST. One value of σ is chosen by grid search; the other comes from the heuristic of Saremi et al. (2018).

6.1.2 Deep Flow Models


As a sanity check, we also evaluate SSM by training a NICE flow model (Dinh et al., 2015), whose likelihood is tractable and can be compared to ground truths obtained by MLE. The model we use has 20 hidden layers and 1000 units per layer. Models are trained to fit MNIST handwritten digits, which are 784-dimensional images. Data are dequantized by adding uniform noise and then transformed using a logit transformation. We provide additional details in Appendix C.2.

Training with score matching is extremely computationally expensive in this case: our SM implementation based on auto-differentiation takes around 7 hours to finish one epoch of training, and a 12GB GPU cannot hold a batch size larger than 24, so we do not include these results. Since NICE has tractable likelihoods, we also evaluate MLE as a surrogate objective for minimizing the score matching loss. Notably, likelihoods and score matching losses might be uncorrelated when the model is mis-specified, which is likely the case for a complex dataset like MNIST.


Score matching losses and log-likelihoods on the test dataset are reported in Tab. 1, where models are evaluated using the best checkpoint in terms of validation score matching loss over 100 epochs of training. SSM-VR greatly outperforms all other methods on the score matching loss. DSM performs worse than SSM-VR, and is still hard to tune: the $\sigma$ given by the heuristic in Saremi et al. (2018) performed the worst (on both log-likelihood and SM loss) of the eight choices of $\sigma$ in our grid search. Approx BP has more success on NICE than on DKEF models, probably because the Hessians of the hidden layers of NICE are closer to diagonal, which results in a smaller approximation error for approx BP. As in the DKEF experiments, CP performs worse, likely because injecting noise into all hidden units leads to large variance for a network as big as NICE. Unlike the DKEF experiments, we find that good log-likelihoods are less correlated with good score matching losses, probably due to model mis-specification.

6.2 Score Estimation

We consider two typical tasks that require accurate score estimates: 1) training VAEs with an implicit encoder, and 2) training Wasserstein Auto-Encoders (WAEs). We show that in both tasks SSM outperforms previous score estimators. Samples generated by the various algorithms can be found in Appendix A.

                 VAE                     WAE
Latent Dim       8        32             8        32
ELBO             96.87    89.06          N/A      N/A
SSM              95.50    89.25 (88.29)  98.24    90.37
Stein            96.71    91.84          99.05    91.70
Spectral         96.60    94.67          98.81    92.55

Table 2: Negative log-likelihoods on MNIST, estimated by AIS. The value in parentheses is SSM with more projections per data point, matching the computational cost of the kernel methods, which used 100 samples for each data point.

      Method \ Iteration   10k       40k       70k       100k
VAE   ELBO                 96.20     73.70     69.42     66.32
      SSM                  108.52    70.28     66.52     62.50
      Stein                126.60    118.87    120.51    126.76
      Spectral             131.90    125.04    128.36    133.93
WAE   SSM                  84.11     61.09     56.23     54.33
      Stein                82.93     63.46     58.53     57.61
      Spectral             82.30     62.47     58.03     55.96

Table 3: FID scores of different methods versus the number of training iterations on the CelebA dataset.

6.2.1 VAE with Implicit Encoders


Consider a latent variable model $p_\theta(x, z) = p(z)\, p_\theta(x \mid z)$, where $x$ and $z$ denote observed and latent variables respectively. A variational auto-encoder (VAE) is composed of two parts: 1) a decoder modeling the conditional distribution of $x$ given $z$; and 2) an encoder $q_\phi(z \mid x)$ that approximates the posterior distribution of the latent variable. A VAE is typically trained by maximizing the following evidence lower bound (ELBO):
$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\, p(z)\big),$$
where $p(z)$ denotes a pre-specified prior distribution. The expressive power of $q_\phi(z \mid x)$ is critical to the success of variational learning. Typically, $q_\phi(z \mid x)$ is chosen to be a simple distribution so that the bound is tractable; we call this traditional approach "ELBO" in the experiments. With score estimation techniques, we can directly approximate the score of $q_\phi(z \mid x)$ for implicit distributions, which enables more flexible choices of $q_\phi$. We consider three score estimation techniques: sliced score matching (SSM), Stein (Li & Turner, 2018), and Spectral (Shi et al., 2018).

Given a data point $x$, kernel score estimators need multiple samples from $q_\phi(z \mid x)$ to estimate its score. On MNIST, we use 100 samples, as done in Shi et al. (2018). On CelebA, however, we can only take 20 samples due to GPU memory limitations. In contrast, SSM learns a score network along with $q_\phi$, which amortizes the cost of score estimation. Unless noted explicitly, we use one projection per data point ($M = 1$) for SSM, which is much more scalable to large networks.


We consider VAE training on MNIST and CelebA (Liu et al., 2015). All images in CelebA are first center-cropped and then resized to a common resolution. We report negative likelihoods on MNIST, as estimated by AIS (Neal, 2001) with 1000 intermediate distributions. We evaluate sample quality on CelebA with FID scores (Heusel et al., 2017). For fast AIS evaluation on MNIST, we use shallow fully-connected networks with 3 hidden layers; for the CelebA experiments we use deep convolutional networks. The architectures of the implicit encoders and score networks are straightforward modifications to the encoders used for ELBO. More details are provided in Appendix C.3.


The negative likelihoods of the different methods on MNIST are reported in the left part of Tab. 2. We note that SSM outperforms Stein and Spectral in all cases. When the latent dimension is 8, SSM outperforms ELBO, which indicates that the extra expressive power of an implicit $q_\phi$ pays off. When the latent dimension is 32, the gaps between SSM and the kernel methods are even larger, and the performance of SSM remains comparable to ELBO. Moreover, when using more projections per data point (matching the computational cost of the kernel methods), SSM outperforms ELBO.

For CelebA, we provide FID scores of samples in the top part of Tab. 3. We observe that after 40k training iterations, SSM outperforms all other baselines. Kernel methods perform poorly in this case because only 20 samples per data point can be used under the GPU memory limit. Earlier in training, SSM does not perform as well: since the score network is trained along with the encoder and decoder, a moderate number of training steps is needed before it produces accurate score estimates (and hence better learning of the VAE).

6.2.2 WAEs


WAE is another method for learning latent variable models, and it generally produces better samples than VAEs. Like a VAE, it contains an encoder $q_\phi(z \mid x)$ and a decoder $p_\theta(x \mid z)$, both of which can be implicit distributions. Let $p(z)$ be a pre-defined prior distribution, and let $q_\phi(z) \triangleq \mathbb{E}_{p_{\text{data}}(x)}[q_\phi(z \mid x)]$ denote the aggregated posterior distribution. Using a metric function $c(\cdot, \cdot)$ and the KL divergence between $q_\phi(z)$ and $p(z)$, WAE minimizes the following objective

$$\mathbb{E}_{p_{\text{data}}(x)}\, \mathbb{E}_{q_\phi(z \mid x)}\, \mathbb{E}_{p_\theta(\hat{x} \mid z)} \big[ c(x, \hat{x}) \big] + \lambda\, D_{\mathrm{KL}}\big(q_\phi(z) \,\|\, p(z)\big).$$
Here the score $\nabla_z \log q_\phi(z)$, which is needed for the gradient of the KL term, can again be approximated by score estimators. Note that in this case kernel methods can work efficiently, as samples from $q_\phi(z)$ can be collected by sampling one $z$ for each $x$ in a mini-batch. For SSM, we use a score network and train it alongside the encoder and decoder.
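To make the plug-in point concrete, the following toy 1-D check (an assumed setup for illustration, not the paper's code) shows how the score of $q_\phi(z)$ enters the KL gradient. With $q(z) = \mathcal{N}(\mu, \sigma^2)$ and prior $p(z) = \mathcal{N}(0, 1)$, the reparameterized gradient of $D_{\mathrm{KL}}(q \| p)$ with respect to $\mu$ is $\mathbb{E}_q[\nabla_z \log q(z) - \nabla_z \log p(z)] = \mu$; a score estimator (SSM score network, Stein, or Spectral) would replace the analytic `score_q` when $q$ is implicit:

```python
import numpy as np

mu, s2 = 0.7, 0.5
rng = np.random.default_rng(1)
z = mu + np.sqrt(s2) * rng.normal(size=200_000)  # reparameterized samples from q

score_q = -(z - mu) / s2   # intractable for implicit q; estimated in practice
score_p = -z               # score of the prior N(0, 1)

# Monte Carlo estimate of dKL/dmu; should be close to mu = 0.7.
grad_mu = np.mean(score_q - score_p)
assert abs(grad_mu - mu) < 0.01
```

The only quantity a trained score network must supply here is `score_q`; everything else in the gradient is known in closed form, which is why the quality of the score estimate directly controls the quality of learning.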


The setup for the WAE experiments is largely the same as for the VAEs. The architectures are very similar to those used in the VAE experiments; the only difference is that we make the decoders implicit, as suggested in Tolstikhin et al. (2018). More details can be found in Appendix C.3.


The negative log-likelihoods on MNIST are provided in the right part of Tab. 2. SSM outperforms both kernel methods, with a larger performance gap at the higher latent dimension. The likelihoods are lower than in the VAE experiments, as the WAE objective does not directly maximize likelihood.

We show FID scores for the CelebA experiments in the bottom part of Tab. 3. As expected, kernel methods perform much better than before, because samples from $q_\phi(z)$ are cheap to obtain. The FID scores are generally lower than those in the VAE experiments, which supports the folklore that WAEs produce better samples. SSM outperforms both kernel methods when the number of iterations exceeds 40k. This shows the advantage of training a deep, expressive score network over using a fixed kernel for score estimation.

7 Conclusion

We propose sliced score matching, a scalable method for learning unnormalized statistical models and providing score estimates for implicit distributions. Compared to the original score matching and its variants, our estimator can scale to large-data scenarios, while remaining cheap to implement in modern auto-differentiation frameworks. Theoretically, our estimator is consistent and asymptotically normal under some regularity conditions. Experimental results demonstrate that our method outperforms competitors in learning deep energy-based models and provides more accurate estimates than kernel score estimators in training implicit VAEs and WAEs.


  • Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
  • Paszke et al. (2017) Paszke, A., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
  • Canu & Smola (2006) Canu, S. and Smola, A. Kernel methods and the exponential family. Neurocomputing, 69(7):714 – 720, 2006. ISSN 0925-2312. New Issues in Neurocomputing: 13th European Symposium on Artificial Neural Networks.
  • Dinh et al. (2015) Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. International Conference on Learning Representations, Workshop Track, 2015.
  • Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017.
  • Dudley (1967) Dudley, R. M. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
  • Gorham & Mackey (2017) Gorham, J. and Mackey, L. Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning, pp. 1292–1301, 2017.
  • Gutmann & Hyvärinen (2010) Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.
  • Gutmann & Hirayama (2011) Gutmann, M. U. and Hirayama, J.-i. Bregman divergence as general framework to estimate unnormalized statistical models. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pp. 283–290. AUAI Press, 2011.
  • Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
  • Huszár (2017) Huszár, F. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.
  • Hutchinson (1990) Hutchinson, M. F. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 19(2):433–450, 1990.
  • Hyvärinen (2005) Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.
  • Kingma & LeCun (2010) Kingma, D. P. and LeCun, Y. Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems, pp. 1126–1134, 2010.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.
  • LeCun et al. (2006) LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M. A., and Huang, F. J. A tutorial on energy-based learning. In Predicting structured data. MIT Press, 2006.
  • Li & Turner (2018) Li, Y. and Turner, R. E. Gradient estimators for implicit models. In International Conference on Learning Representations, 2018.
  • Liu et al. (2016) Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of The 33rd International Conference on Machine Learning, pp. 276–284, 2016.
  • Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  • Martens et al. (2012) Martens, J., Sutskever, I., and Swersky, K. Estimating the Hessian by back-propagating curvature. Proceedings of the 29th International Conference on Machine Learning, 2012.
  • Neal (2001) Neal, R. M. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
  • Owen (2013) Owen, A. B. Monte Carlo theory, methods and examples. 2013.
  • Rabin et al. (2012) Rabin, J., Peyré, G., Delon, J., and Bernot, M. Wasserstein barycenter and its application to texture mixing. In Bruckstein, A. M., ter Haar Romeny, B. M., Bronstein, A. M., and Bronstein, M. M. (eds.), Scale Space and Variational Methods in Computer Vision, pp. 435–446, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-24785-9.
  • Saremi et al. (2018) Saremi, S., Mehrjou, A., Schölkopf, B., and Hyvärinen, A. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018.
  • Sasaki et al. (2014) Sasaki, H., Hyvärinen, A., and Sugiyama, M. Clustering via mode seeking by direct estimation of the gradient of a log-density. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 19–34. Springer, 2014.
  • Shi et al. (2018) Shi, J., Sun, S., and Zhu, J. A spectral approach to gradient estimation for implicit distributions. In Proceedings of the 35th International Conference on Machine Learning, pp. 4651–4660, 2018.
  • Sriperumbudur et al. (2017) Sriperumbudur, B., Fukumizu, K., Gretton, A., Hyvärinen, A., and Kumar, R. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 18(57):1–59, 2017.
  • Stein (1981) Stein, C. M. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pp. 1135–1151, 1981.
  • Strathmann et al. (2015) Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., and Gretton, A. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In Advances in Neural Information Processing Systems, pp. 955–963, 2015.
  • Sun et al. (2019) Sun, S., Zhang, G., Shi, J., and Grosse, R. Functional variational Bayesian neural networks. In International Conference on Learning Representations, 2019.
  • Sutherland et al. (2018) Sutherland, D., Strathmann, H., Arbel, M., and Gretton, A. Efficient and principled score estimation with Nyström kernel exponential families. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pp. 652–660, 2018.
  • Tolstikhin et al. (2018) Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.
  • Tran et al. (2017) Tran, D., Ranganath, R., and Blei, D. Hierarchical implicit models and likelihood-free variational inference. In Advances in Neural Information Processing Systems, pp. 5523–5533, 2017.
  • van der Vaart (1998) van der Vaart, A. W. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998. doi: 10.1017/CBO9780511802256.
  • Vincent (2011) Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
  • Wenliang et al. (2018) Wenliang, L., Sutherland, D., Strathmann, H., and Gretton, A. Learning deep kernels for exponential family densities. arXiv e-prints, art. arXiv:1811.08357, Nov 2018.

Appendix A Samples

A.1 VAE with Implicit Encoders

A.1.1 MNIST

Latent Dim 8 Latent Dim 32
Table 4: VAE samples on MNIST.

A.1.2 CelebA

(a) ELBO (b) SSM
(c) Stein (d) Spectral
Table 5: VAE samples on CelebA.

A.2 WAE

A.2.1 MNIST

Latent Dim 8 Latent Dim 32
Table 6: WAE samples on MNIST.

A.2.2 CelebA

Table 7: WAE samples on CelebA.

Appendix B Proofs

B.1 Notations

Below we provide a summary of the most commonly used notations in the proofs. First, we denote the data distribution as $p_d(x)$ and assume that the training/test data are i.i.d. samples from $p_d$. The model is denoted as $p_m(x; \theta)$, where $\theta$ is restricted to a parameter space $\Theta$. Note that $p_m$ can be an unnormalized energy-based model. We use $v$ to represent a random vector with the same dimension as the input $x$. This vector is often called the projection vector, and we use $p_v$ to denote its distribution.

Next, we introduce several shorthand notations for quantities related to $p_m$ and $p_d$. The log-densities $\log p_m(x; \theta)$ and $\log p_d(x)$ are respectively denoted as the model and data log-likelihoods. The (Stein) score functions $\nabla_x \log p_m(x; \theta)$ and $\nabla_x \log p_d(x)$ are written as $s_m(x; \theta)$ and $s_d(x)$, and finally the Hessian of $\log p_m(x; \theta)$ w.r.t. $x$ is denoted as $\nabla_x s_m(x; \theta)$.

We also adopt some convenient notations for collections. In particular, we use $v_{1:M}$ to denote a collection of vectors $\{v_1, \dots, v_M\}$, and similarly $x_{1:N}$ to denote the vectors $\{x_1, \dots, x_N\}$.

B.2 Basic Properties

The following regularity conditions are needed for integration by parts and identifiability.

Assumption 1 (Regularity of score functions).

The model score function $s_m(x; \theta)$ and data score function $s_d(x)$ are both differentiable. They additionally satisfy $\mathbb{E}_{p_d}[\|s_m(x; \theta)\|_2^2] < \infty$ for all $\theta \in \Theta$, and $\mathbb{E}_{p_d}[\|s_d(x)\|_2^2] < \infty$.

Assumption 2 (Regularity of projection vectors).

The projection vectors satisfy $\mathbb{E}_{p_v}[v v^\top] \succ 0$ and $\mathbb{E}_{p_v}[\|v\|_2^2] < \infty$.

Assumption 3 (Boundary conditions).

For any $\theta \in \Theta$, $s_m(x; \theta)\, p_d(x) \to 0$ as $\|x\|_2 \to \infty$.

Assumption 4 (Identifiability).

The model family is well-specified, i.e., there exists $\theta^* \in \Theta$ such that $p_m(x; \theta^*) = p_d(x)$. Furthermore, $p_m(x; \theta) \neq p_m(x; \theta^*)$ whenever $\theta \neq \theta^*$.

Assumption 5 (Positiveness).

$p_m(x; \theta) > 0$ for all $\theta \in \Theta$ and all $x$, and $p_d(x) > 0$ for all $x$.

Theorem 1.

Assume $s_m$, $s_d$ and $p_v$ satisfy some regularity conditions (Assumption 1, Assumption 2). Under proper boundary conditions (Assumption 3), we have

$$L(\theta; p_v) = J(\theta; p_v) + C,$$

where
$$L(\theta; p_v) \triangleq \frac{1}{2}\, \mathbb{E}_{p_v} \mathbb{E}_{p_d} \Big[ \big( v^\top s_m(x; \theta) - v^\top s_d(x) \big)^2 \Big], \quad J(\theta; p_v) \triangleq \mathbb{E}_{p_v} \mathbb{E}_{p_d} \Big[ v^\top \nabla_x s_m(x; \theta)\, v + \frac{1}{2} \big( v^\top s_m(x; \theta) \big)^2 \Big],$$

and $C$ is a constant w.r.t. $\theta$.


The basic idea of this proof is similar to that of Theorem 1 in Hyvärinen (2005). First, note that $L(\theta; p_v)$ can be expanded to

$$L(\theta; p_v) = \mathbb{E}_{p_v} \mathbb{E}_{p_d} \Big[ \frac{1}{2} (v^\top s_m)^2 + \frac{1}{2} (v^\top s_d)^2 - (v^\top s_d)(v^\top s_m) \Big], \quad (14)$$

where the expansion is valid due to the assumptions of bounded expectations. We have absorbed the second term in the bracket of (14) into $C$ since it does not depend on $\theta$. Now what we need to prove is

$$-\mathbb{E}_{p_v} \mathbb{E}_{p_d} \big[ (v^\top s_d)(v^\top s_m) \big] = \mathbb{E}_{p_v} \mathbb{E}_{p_d} \big[ v^\top \nabla_x s_m\, v \big]. \quad (16)$$
This can be shown by first observing that

$$\mathbb{E}_{p_d} \big[ (v^\top s_d)(v^\top s_m) \big] = \int p_d(x) \big( v^\top \nabla_x \log p_d(x) \big) (v^\top s_m) \,\mathrm{d}x = \int \big( v^\top \nabla_x p_d(x) \big) (v^\top s_m) \,\mathrm{d}x, \quad (17)$$

where we assume the integrals are finite. Then, applying multivariate integration by parts (cf., Lemma 4 in Hyvärinen (2005)), we obtain

$$\int \big( v^\top \nabla_x p_d(x) \big) (v^\top s_m) \,\mathrm{d}x = -\int p_d(x)\, v^\top \nabla_x s_m\, v \,\mathrm{d}x = -\mathbb{E}_{p_d} \big[ v^\top \nabla_x s_m\, v \big],$$

where the boundary term vanishes because $|v^\top s_m| \le \|v\|_2 \|s_m\|_2$ by the Cauchy–Schwarz inequality, and $s_m(x; \theta)\, p_d(x)$ vanishes at infinity (Assumption 3).

Now returning to (17), we have

$$\mathbb{E}_{p_v} \mathbb{E}_{p_d} \big[ (v^\top s_d)(v^\top s_m) \big] = -\mathbb{E}_{p_v} \mathbb{E}_{p_d} \big[ v^\top \nabla_x s_m\, v \big],$$

which proves (16), and the proof is complete. ∎
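As a quick numerical sanity check of this result (a sketch under an assumed 1-D setup, not from the paper): for data $p_d = \mathcal{N}(0, 1)$, model $p_m = \mathcal{N}(0, \sigma^2)$, and projection $v = 1$, the objective involving the data score and the tractable objective involving the Hessian term should differ only by a constant in $\sigma^2$, namely $\tfrac{1}{2}\mathbb{E}[s_d(x)^2] = \tfrac{1}{2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500_000)          # samples from p_d = N(0, 1)
s_d = -x                              # data score

def L_and_J(s2):
    """L needs the (usually unknown) data score; J only needs the model."""
    s_m = -x / s2                     # model score for N(0, s2)
    ds_m = -1.0 / s2                  # model score derivative (1-D Hessian)
    L = 0.5 * np.mean((s_m - s_d) ** 2)
    J = np.mean(ds_m + 0.5 * s_m ** 2)
    return L, J

# The gap L - J should be (nearly) the same constant for every model parameter.
diffs = [L - J for L, J in (L_and_J(s2) for s2 in (0.5, 1.0, 2.0))]
assert all(abs(d - 0.5) < 0.02 for d in diffs)
```

The Monte Carlo gap matches the constant up to sampling error, illustrating why minimizing the tractable objective is equivalent to minimizing the intractable one.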

Lemma 1.

Assume our model family is well-specified and identifiable (Assumption 4). Assume further that the densities are all positive (Assumption 5). When $p_v$ satisfies some regularity conditions (Assumption 2), we have

$$L(\theta; p_v) = 0 \iff \theta = \theta^*.$$
First, since $\mathbb{E}_{p_v}[v v^\top] \succ 0$ (Assumption 2), $L(\theta; p_v) = 0$ implies