1 Introduction
Score matching (Hyvärinen, 2005) is a statistical estimation method particularly suitable for learning unnormalized models. Unlike maximum likelihood estimation (MLE), the objective of score matching depends only on the derivative of the log-density function (a.k.a., the score), which is independent of the normalizing term. However, score matching has so far been limited to small density models and low-dimensional data, because its objective involves the trace of the Hessian of the log-density function, which is expensive to compute for general densities (Martens et al., 2012), and especially for those parameterized by neural networks, such as deep energy-based models (LeCun et al., 2006; Wenliang et al., 2018). There are several attempts to alleviate this difficulty: Kingma & LeCun (2010) propose approximate backpropagation for computing the trace of the Hessian; Martens et al. (2012) develop curvature propagation, a fast stochastic estimator for the trace in score matching; and Vincent (2011) transforms score matching into a denoising problem that avoids second-order derivatives. These methods have achieved some success, but may suffer from one or more of the following problems: inconsistent parameter estimation, large estimation variance, and cumbersome implementation.
To avoid these problems, we propose sliced score matching, a variant of score matching that can scale up to large unnormalized models and high-dimensional data. The key intuition is that instead of directly matching the high-dimensional scores, we match their projections along random directions. Our theoretical results show that under some regularity conditions, sliced score matching is a well-defined statistical estimation criterion that yields consistent and asymptotically normal parameter estimates. Moreover, compared to the methods of Kingma & LeCun (2010) and Martens et al. (2012), whose implementations require customized backpropagation for deep networks, the objective of sliced score matching only involves Hessian-vector products, and thus can be easily implemented in automatic differentiation frameworks such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017). Since score matching is based on the distance between model and data scores, it can be naturally used as an objective for estimating data score functions (Sasaki et al., 2014; Strathmann et al., 2015). We show that given a set of samples, sliced score matching can be used to train a deep neural network as a score estimator. This observation opens up a new direction of applications for sliced score matching. For example, we show that sliced score estimation can provide accurate score estimates for many tasks, such as variational inference with implicit distributions (Huszár, 2017) and learning Wasserstein Auto-Encoders (WAE) (Tolstikhin et al., 2018).
Finally, we evaluate the performance of sliced score matching on both density and score estimation tasks. For density estimation, experiments on deep kernel exponential families (Wenliang et al., 2018) and NICE flow models (Dinh et al., 2015) show that our method is either more scalable or more accurate than existing score matching variants. When applied to score estimation, our method improves the performance of variational autoencoders (VAE) with implicit encoders, and can train WAEs without a discriminator or MMD loss by directly optimizing the KL divergence between aggregated posteriors and the prior. In both settings we outperform kernel-based score estimators (Li & Turner, 2018; Shi et al., 2018), achieving better test likelihoods and better quality of randomly generated images.
2 Background
Given i.i.d. samples $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ from a data distribution $p_d(\mathbf{x})$, our task is to learn an unnormalized density $\tilde{p}_m(\mathbf{x}; \theta)$, where $\theta$ is from some parameter space $\Theta$. The model's normalizing constant (a.k.a., partition function) is denoted $Z_\theta$, which is assumed to be intractable. A deep energy-based model is an unnormalized model that uses a deep feedforward neural network to parameterize $\log \tilde{p}_m(\mathbf{x}; \theta)$. Letting $p_m(\mathbf{x}; \theta)$ be the normalized density determined by our model, we have
$$ p_m(\mathbf{x}; \theta) = \frac{\tilde{p}_m(\mathbf{x}; \theta)}{Z_\theta}. $$
For convenience, we denote the (Stein) score function as $\mathbf{s}_m(\mathbf{x}; \theta) \triangleq \nabla_{\mathbf{x}} \log p_m(\mathbf{x}; \theta)$, which is the gradient of the log-density function w.r.t. the data point; similarly, the score of the data distribution is $\mathbf{s}_d(\mathbf{x}) \triangleq \nabla_{\mathbf{x}} \log p_d(\mathbf{x})$. Note that since $\log p_m(\mathbf{x}; \theta) = \log \tilde{p}_m(\mathbf{x}; \theta) - \log Z_\theta$, we immediately conclude that the score function of our model does not depend on the intractable partition function $Z_\theta$.
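A quick numerical sanity check of this fact (our own toy example, not from the paper): for an unnormalized 1-D Gaussian model, the numerical derivative of the unnormalized log-density matches the analytic score of the normalized density, because the partition function does not depend on $\mathbf{x}$.

```python
import numpy as np

# Sketch (toy example): for p̃(x; θ) = exp(-(x - θ)²/2), the normalized
# density is p(x; θ) = p̃(x; θ)/Z_θ with Z_θ = sqrt(2π). Since
# log p = log p̃ - log Z_θ and Z_θ is constant in x, the scores coincide.

def log_unnormalized(x, theta):
    return -0.5 * (x - theta) ** 2          # log p̃: no partition function

def score_normalized(x, theta):
    return -(x - theta)                      # ∇_x log p for the Gaussian

def numerical_score(x, theta, eps=1e-5):
    # central-difference derivative of the *unnormalized* log-density
    return (log_unnormalized(x + eps, theta)
            - log_unnormalized(x - eps, theta)) / (2 * eps)

x, theta = 1.3, 0.4
assert abs(numerical_score(x, theta) - score_normalized(x, theta)) < 1e-6
```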
2.1 Score Matching
Learning unnormalized models with maximum likelihood estimation (MLE) is difficult because of the intractability of the partition function $Z_\theta$. To avoid the partition function, score matching (Hyvärinen, 2005) minimizes the Fisher divergence between $p_d$ and $p_m$, which is defined as
$$ L(\theta) \triangleq \frac{1}{2} \mathbb{E}_{p_d}\big[ \| \mathbf{s}_m(\mathbf{x}; \theta) - \mathbf{s}_d(\mathbf{x}) \|_2^2 \big]. \quad (1) $$
Since $\mathbf{s}_m(\mathbf{x}; \theta)$ is independent of $Z_\theta$, the Fisher divergence avoids the intractable partition function. However, Eq. (1) is still not readily usable for learning unnormalized models, as we usually do not have access to the score function of the data, $\mathbf{s}_d(\mathbf{x})$.
By applying integration by parts, Hyvärinen (2005) shows that $L(\theta)$ can be written as (cf., Theorem 1 in Hyvärinen (2005)):
$$ L(\theta) = \mathbb{E}_{p_d}\Big[ \operatorname{tr}\!\big(\nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}; \theta)\big) + \frac{1}{2} \| \mathbf{s}_m(\mathbf{x}; \theta) \|_2^2 \Big] + C, \quad (2) $$
where $C$ is a constant that does not depend on $\theta$, $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, and
$$ \nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}; \theta) = \nabla_{\mathbf{x}}^2 \log p_m(\mathbf{x}; \theta) \quad (3) $$
is the Hessian of the log-density function. The constant can be ignored, and the following unbiased estimator of the remaining terms is used to train our unnormalized model:
$$ \hat{L}(\theta; \mathbf{x}_1^N) \triangleq \frac{1}{N} \sum_{i=1}^N \Big[ \operatorname{tr}\!\big(\nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}_i; \theta)\big) + \frac{1}{2} \| \mathbf{s}_m(\mathbf{x}_i; \theta) \|_2^2 \Big], $$
where $\mathbf{x}_1^N \triangleq \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ is a shorthand used throughout the paper for a collection of data points sampled i.i.d. from $p_d(\mathbf{x})$.
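As a small illustration (our own toy example, not from the paper), the estimator above can be evaluated in closed form for a Gaussian model $\tilde{p}_m(\mathbf{x}; \theta) \propto \exp(-\|\mathbf{x} - \theta\|^2/2)$, whose score is $\theta - \mathbf{x}$ and whose Hessian trace is the constant $-D$; minimizing it over $\theta$ recovers the data mean, as expected.

```python
import numpy as np

# Toy check of the finite-sample score matching objective
#   (1/N) Σ_i [ tr(∇_x s_m(x_i; θ)) + 0.5 ||s_m(x_i; θ)||² ]
# for the Gaussian model with score s_m(x; θ) = θ - x and tr(∇_x s_m) = -D.

rng = np.random.default_rng(0)
D, N = 5, 10_000
data = rng.normal(loc=2.0, scale=1.0, size=(N, D))   # samples from p_d

def sm_objective(theta):
    scores = theta - data                  # s_m(x_i; θ) for every sample
    trace_term = -D                        # Hessian trace is -D here
    return np.mean(trace_term + 0.5 * np.sum(scores**2, axis=1))

# the objective should be minimized near the true mean (θ* = 2 in every dim)
thetas = [np.full(D, t) for t in np.linspace(0.0, 4.0, 81)]
best = thetas[int(np.argmin([sm_objective(t) for t in thetas]))]
assert np.all(np.abs(best - 2.0) < 0.1)
```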
Computational difficulty
While the score matching objective avoids the computation of $Z_\theta$ for unnormalized models, it introduces a new computational difficulty: computing the trace of the Hessian of a log-density function, $\operatorname{tr}(\nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}; \theta))$. Suppose that the input data has dimension $D$. A naïve approach to computing the trace of the Hessian requires $D$ times more backward passes than computing the gradient (see Alg. 2 in the appendix). For example, the trace could be computed by applying backpropagation $D$ times to obtain each diagonal entry of the Hessian sequentially. In practice, $D$ can be many thousands, which can render score matching too slow for practical purposes. Moreover, Martens et al. (2012) argue from a theoretical perspective that it is unlikely that there exists a computationally efficient algorithm (at the level of a constant number of forward and backward passes) for computing the diagonal of the Hessian defined by an arbitrary computation graph.
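To make the linear-in-$D$ cost concrete, here is a stand-in illustration (ours, not the paper's Alg. 2) using finite differences in place of backward passes: estimating each diagonal entry of the Hessian requires its own perturbed evaluation of the score, so the total number of score evaluations grows with the dimension.

```python
import numpy as np

# Toy illustration: naive Hessian-trace of a black-box score function costs
# one extra evaluation per input dimension.

calls = 0
def score(x):                      # score of a standard normal: s(x) = -x
    global calls
    calls += 1
    return -x

def naive_hessian_trace(x, eps=1e-5):
    # finite-difference stand-in for per-dimension backward passes:
    # each diagonal entry ∂s_i/∂x_i needs its own perturbed evaluation
    s0 = score(x)
    trace = 0.0
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        trace += (score(x + e)[i] - s0[i]) / eps
    return trace

x = np.ones(50)
t = naive_hessian_trace(x)
assert abs(t - (-50.0)) < 1e-3     # true trace of -I in 50 dims
assert calls == 51                 # D + 1 evaluations: cost scales with D
```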
2.2 Score Estimation for Implicit Distributions
Besides parameter estimation in unnormalized models, score matching can also be used to estimate scores of implicit distributions, which are distributions that have a tractable sampling process but no tractable density. Implicit distributions can arise in many situations, such as the marginal distribution of a non-conjugate model (Sun et al., 2019), non-invertible transformations of random variables (Goodfellow et al., 2014), and models defined by complex simulation processes (Tran et al., 2017). In these cases, learning and inference often become intractable due to the need to optimize an objective that involves the intractable density.

In such situations, directly estimating the score function based on i.i.d. samples from an implicit density can be useful. For example, suppose our learning problem involves optimizing the entropy of an implicit distribution $q_\phi(\mathbf{x})$. This situation is common when dealing with variational free energies (Kingma & Welling, 2014). Suppose $\mathbf{x} \sim q_\phi$ can be reparameterized as $\mathbf{x} = g_\phi(\boldsymbol{\epsilon})$, where $\boldsymbol{\epsilon}$ is a simple random variable independent of $\phi$, such as a standard normal, and $g_\phi$ is a deterministic mapping. We can write the gradient of the entropy $H(q_\phi)$ with respect to $\phi$ as
$$ \nabla_\phi H(q_\phi) = -\nabla_\phi \, \mathbb{E}_{q_\phi}[\log q_\phi(\mathbf{x})] = -\mathbb{E}_{\boldsymbol{\epsilon}}\Big[ \nabla_\phi g_\phi(\boldsymbol{\epsilon})^\top \, \nabla_{\mathbf{x}} \log q_\phi(\mathbf{x}) \big|_{\mathbf{x} = g_\phi(\boldsymbol{\epsilon})} \Big], $$
where $\nabla_\phi g_\phi(\boldsymbol{\epsilon})$ is usually easy to compute. The score $\nabla_{\mathbf{x}} \log q_\phi(\mathbf{x})$ is intractable but can be approximated by score estimation.
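A Monte Carlo check of this entropy-gradient identity (our own toy example, where the score is known in closed form so no score estimation is needed): for $q_\phi = \mathcal{N}(0, \phi^2)$ with $g_\phi(\epsilon) = \phi\epsilon$, the entropy is $H = \frac{1}{2}\log(2\pi e \phi^2)$, so $\mathrm{d}H/\mathrm{d}\phi = 1/\phi$.

```python
import numpy as np

# Verify ∇_φ H(q_φ) = -E_ε[ ∇_φ g_φ(ε)ᵀ ∇_x log q_φ(x)|_{x=g_φ(ε)} ]
# for q_φ = N(0, φ²), g_φ(ε) = φ ε, exact score -x/φ², true dH/dφ = 1/φ.

rng = np.random.default_rng(0)
phi = 2.0
eps = rng.normal(size=200_000)
x = phi * eps                          # reparameterized samples from q_φ
score_q = -x / phi**2                  # exact score of N(0, φ²)
grad_H = -np.mean(eps * score_q)       # ∇_φ g_φ(ε) = ε for this mapping
assert abs(grad_H - 1.0 / phi) < 0.01  # dH/dφ = 1/φ = 0.5
```

In practice $\nabla_{\mathbf{x}} \log q_\phi$ is unknown, and the point of this section is that a score estimator can be plugged into the same expression.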
Score matching is an attractive solution for score estimation, since (2) naturally serves as an objective if we replace $\mathbf{s}_m(\mathbf{x}; \theta)$ with a score model of the data. There are also kernel-based score estimators (Li & Turner, 2018; Shi et al., 2018), which we discuss in detail in Section 5.2. Score estimation of this kind can be applied to training VAEs and WAEs, as we shall see in the experiments.
3 Method
In this section, we introduce sliced score matching (SSM), our scalable method for density and score estimation with deep energy-based models.
3.1 Sliced Score Matching
Our intuition is that one-dimensional problems are usually much easier to solve than high-dimensional ones. Inspired by the idea of the Sliced Wasserstein distance (Rabin et al., 2012), we consider projecting $\mathbf{s}_m(\mathbf{x}; \theta)$ and $\mathbf{s}_d(\mathbf{x})$ onto some random direction $\mathbf{v}$ and propose to compare their average difference along that random direction. More specifically, we consider the following objective as a replacement of the Fisher divergence in Eq. (1):
$$ L(\theta; p_v) \triangleq \frac{1}{2} \mathbb{E}_{p_v} \mathbb{E}_{p_d} \big[ (\mathbf{v}^\top \mathbf{s}_m(\mathbf{x}; \theta) - \mathbf{v}^\top \mathbf{s}_d(\mathbf{x}))^2 \big]. \quad (4) $$
Here $\mathbf{v} \sim p_v$ and $\mathbf{x} \sim p_d$ are independent, and we require $\mathbb{E}_{p_v}[\mathbf{v}\mathbf{v}^\top] \succ 0$ and $\mathbb{E}_{p_v}[\|\mathbf{v}\|_2^2] < \infty$. Many choices of $p_v$ satisfy these requirements. For instance, $p_v$ can be a multivariate standard normal ($\mathcal{N}(0, I_D)$), a multivariate Rademacher distribution (the uniform distribution over $\{\pm 1\}^D$), or a uniform distribution on the hypersphere (recall that $D$ refers to the dimension of $\mathbf{x}$).

To eliminate the dependence of $L(\theta; p_v)$ on the unknown $\mathbf{s}_d(\mathbf{x})$, we use integration by parts as in score matching (cf., the equivalence between Eq. (1) and (2)). Defining
$$ J(\theta; p_v) \triangleq \mathbb{E}_{p_v} \mathbb{E}_{p_d} \Big[ \mathbf{v}^\top \nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}; \theta) \, \mathbf{v} + \frac{1}{2} \big(\mathbf{v}^\top \mathbf{s}_m(\mathbf{x}; \theta)\big)^2 \Big], \quad (5) $$
the equivalence is summarized in the following theorem.
Theorem 1. Under some regularity conditions,
$$ J(\theta; p_v) = L(\theta; p_v) + C, \quad (6) $$
where $C$ is a constant that does not depend on $\theta$.
Other than our requirements on $p_v$, the assumptions are exactly the same as in Theorem 1 of Hyvärinen (2005). We refer interested readers to Appendix B.2 for technical statements of the assumptions and a rigorous proof of the theorem.
In practice, given a dataset $\mathbf{x}_1^N$, we draw $M$ projection vectors $\{\mathbf{v}_{i1}, \dots, \mathbf{v}_{iM}\} \stackrel{\text{i.i.d.}}{\sim} p_v$ independently for each point $\mathbf{x}_i$. The collection of all such vectors is denoted $\mathbf{v}_{11}^{NM}$. We then use the following unbiased estimator of $J(\theta; p_v)$:
$$ \hat{J}(\theta; \mathbf{x}_1^N, \mathbf{v}_{11}^{NM}) \triangleq \frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M \Big[ \mathbf{v}_{ij}^\top \nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}_i; \theta) \, \mathbf{v}_{ij} + \frac{1}{2} \big(\mathbf{v}_{ij}^\top \mathbf{s}_m(\mathbf{x}_i; \theta)\big)^2 \Big]. \quad (7) $$
Note that when $p_v$ is a multivariate standard normal or multivariate Rademacher distribution, we have $\mathbb{E}_{p_v}[\mathbf{v}\mathbf{v}^\top] = I$, in which case the second term of $\hat{J}$ can be integrated analytically, which may lead to an estimator with reduced variance:
$$ \hat{J}_{\mathrm{vr}}(\theta; \mathbf{x}_1^N, \mathbf{v}_{11}^{NM}) \triangleq \frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M \mathbf{v}_{ij}^\top \nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}_i; \theta) \, \mathbf{v}_{ij} + \frac{1}{2N} \sum_{i=1}^N \| \mathbf{s}_m(\mathbf{x}_i; \theta) \|_2^2. \quad (8) $$
Empirically, we do find that $\hat{J}_{\mathrm{vr}}$ has better performance than $\hat{J}$. We refer to this version as sliced score matching with variance reduction (SSMVR). In fact, we can leverage $\mathbb{E}_{p_v}[\mathbf{v}\mathbf{v}^\top] = I$ to create a control variate for guaranteed reduction of variance (Appendix D). $\hat{J}_{\mathrm{vr}}$ is also closely related to Hutchinson's trace estimator (Hutchinson, 1990), which we analyze later in Section 4.3.
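As a toy numerical check (our own example, not from the paper), the following compares the two estimators for a Gaussian model whose score and Hessian are known in closed form: with Rademacher projections, $\mathbf{v}^\top \nabla_{\mathbf{x}}\mathbf{s}_m \mathbf{v} = -\|\mathbf{v}\|^2 = -D$, and since $\mathbb{E}[\mathbf{v}\mathbf{v}^\top] = I$ the stochastic term $(\mathbf{v}^\top \mathbf{s}_m)^2$ averages to $\|\mathbf{s}_m\|^2$, so $\hat{J}$ concentrates around $\hat{J}_{\mathrm{vr}}$ as $M$ grows.

```python
import numpy as np

# Toy model: s_m(x; θ) = θ - x, so v∇_x s_m v = -D for Rademacher v.
rng = np.random.default_rng(0)
D, N, M = 5, 500, 200
theta = np.zeros(D)
data = rng.normal(size=(N, D))
scores = theta - data                                   # s_m(x_i; θ)

v = rng.choice([-1.0, 1.0], size=(N, M, D))             # Rademacher projections
proj = np.einsum('nmd,nd->nm', v, scores)               # vᵀ s_m per (i, j)
J_hat = np.mean(-D + 0.5 * proj**2)                     # Eq. (7) with both terms
J_vr = np.mean(-D + 0.5 * np.sum(scores**2, axis=1))    # Eq. (8): analytic 2nd term

assert abs(J_hat - J_vr) < 0.05                         # they agree for large M
```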
For sliced score matching, the second-derivative term $\mathbf{v}^\top \nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}; \theta) \, \mathbf{v}$ is much less computationally expensive than $\operatorname{tr}(\nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}; \theta))$. Using auto-differentiation systems that support higher-order gradients, we can compute it using two gradient operations for a single $\mathbf{v}$, as shown in Alg. 1. Similarly, when there are $M$ different $\mathbf{v}$'s, the total number of gradient operations is $M + 1$. In contrast, assuming the dimension of $\mathbf{x}$ is $D$ and $M \ll D$, we typically need $D + 1$ gradient operations to compute $\operatorname{tr}(\nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}; \theta))$, because each diagonal entry of $\nabla_{\mathbf{x}} \mathbf{s}_m$ needs to be computed separately (see Alg. 2 in the appendix).

In practice, we can tune $M$ to find the sweet spot between variance and computation. In our experiments, we find that $M = 1$ is often a good choice.
3.2 Sliced Score Estimation
As mentioned in Section 2.2, the task of score estimation is to predict $\nabla_{\mathbf{x}} \log q(\mathbf{x})$ at some test point $\mathbf{x}$, given a set of i.i.d. samples from $q(\mathbf{x})$. Our sliced score matching objective is particularly suitable for this task since it depends only on scores.

In practice, we propose to use a vector-valued deep neural network $\mathbf{h}(\mathbf{x}; \theta)$ as our score model. Substituting $\mathbf{h}$ for $\mathbf{s}_m$ in $\hat{J}$, we get
$$ \hat{J}(\theta) = \frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M \Big[ \mathbf{v}_{ij}^\top \nabla_{\mathbf{x}} \mathbf{h}(\mathbf{x}_i; \theta) \, \mathbf{v}_{ij} + \frac{1}{2} \big(\mathbf{v}_{ij}^\top \mathbf{h}(\mathbf{x}_i; \theta)\big)^2 \Big], $$
and we can optimize this objective to get $\hat{\theta}$. Afterwards, $\mathbf{h}(\mathbf{x}; \hat{\theta})$ can be used as an approximation to $\nabla_{\mathbf{x}} \log q(\mathbf{x})$.¹

¹Note that $\mathbf{h}(\mathbf{x}; \hat{\theta})$ may not correspond to the gradient of any scalar function. For $\mathbf{h}$ to represent a gradient, one necessary condition is $\partial_i h_j(\mathbf{x}) = \partial_j h_i(\mathbf{x})$ for all $i, j$, which may not be satisfied by general networks. However, because Theorem 1 does not depend on the fact that $\mathbf{s}_m$ is a gradient, we can still use the estimator in practice.
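A minimal sketch of this idea (our own toy example; a linear score model and plain gradient descent stand in for the deep score network and autodiff training loop): we fit $h(x; a, b) = ax + b$ to 1-D Gaussian samples with the SSM objective with variance reduction. In 1-D with Rademacher $v$ ($v^2 = 1$), the per-sample loss is $a + \frac{1}{2}(ax + b)^2$, whose gradients are available in closed form; the minimizer recovers the true score $-(x - \mu)/\sigma^2$.

```python
import numpy as np

# Fit a linear score model h(x) = a x + b by gradient descent on the
# SSM-VR loss  a + 0.5 (a x + b)²  averaged over samples from p_d.
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=50_000)     # samples from the "implicit" p_d

a, b, lr = 0.0, 0.0, 0.2
for _ in range(500):
    h = a * x + b
    a -= lr * (1.0 + np.mean(h * x))       # ∂/∂a [a + 0.5 h²]
    b -= lr * np.mean(h)                   # ∂/∂b [0.5 h²]

# true score of N(μ, σ²) is -(x - μ)/σ², i.e. a* = -1/σ², b* = μ/σ²
assert abs(a + 1.0 / sigma**2) < 0.01
assert abs(b - mu / sigma**2) < 0.01
```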
4 Analysis
In this section, we present several theoretical results to justify sliced score matching as a principled objective. We also discuss the connection of sliced score matching to other methods. For readability, we will defer rigorous statements of assumptions and theorems to the Appendix.
4.1 Consistency
One important question to ask of any statistical estimation objective is whether the estimated parameter is consistent under reasonable assumptions. Our results confirm that for any $M \in \mathbb{N}_+$, the estimator minimizing our objective is consistent under suitable assumptions as $N \to \infty$.
We need several standard assumptions to prove the results rigorously. Let $p_m(\cdot\,; \theta)$ be the normalized density induced by our unnormalized model $\tilde{p}_m(\cdot\,; \theta)$. First, we assume the parameter space $\Theta$ is compact (Assumption 6), and that our model family is well-specified and identifiable (Assumption 4). These are standard assumptions used for proving the consistency of MLE (van der Vaart, 1998). We also adopt the assumption in Hyvärinen (2005) that all densities are positive (Assumption 5). Finally, we assume that the model score has some Lipschitz properties (Assumption 7, which always holds if the score and the Hessian of the log-density have uniformly bounded derivatives w.r.t. $\theta$), and that $p_v$ has bounded higher-order moments (Assumption 2, true for all $p_v$ considered in the experiments). Then, we can prove the consistency of the sliced score matching estimator $\hat{\theta}_{N,M}$:

Theorem 2 (Consistency). Assume the conditions of Theorem 1 are satisfied, and that the assumptions discussed above hold. Let $\theta^*$ be the true parameter of the data distribution. Then for every $M \in \mathbb{N}_+$,
$$ \hat{\theta}_{N,M} \xrightarrow{p} \theta^* \quad \text{as } N \to \infty. $$
Sketch of proof.
Remark 1.
In Hyvärinen (2005), the authors only showed that the score matching objective vanishes exactly at the true parameter, which leads to "local consistency" of score matching. This is a weaker notion of consistency than the one considered in our setting.
4.2 Asymptotic Normality
Asymptotic normality results can be very useful for approximate hypothesis testing and for comparing different estimators. Below we show that $\hat{\theta}_{N,M}$ is asymptotically normal as $N \to \infty$.
In addition to the assumptions in Section 4.1, we need one extra assumption to prove asymptotic normality: the model score must satisfy a stronger Lipschitz property (Assumption 9), which holds if the relevant quantities, as functions of $\theta$, have uniformly bounded third derivatives.
For simplicity, we denote as , where is an arbitrary function. In the following, we will only show the asymptotic normality result for a specific . More results can be found in Appendix B.4.
Theorem 3 (Asymptotic normality, special case).
Assume the assumptions discussed above hold. If $p_v$ is the multivariate Rademacher distribution, then the estimator is asymptotically normal: $\sqrt{N}(\hat{\theta}_{N,M} - \theta^*)$ converges in distribution to a centered Gaussian whose covariance is given by
(9) 
Here $D$ is the dimension of the data, and the matrices appearing in (9) are p.s.d. and depend on $\theta^*$; their definitions can be found in Appendix B.4.
Sketch of proof.
Remark 2.
As expected, a larger $M$ leads to smaller asymptotic variance.
Remark 3.
As far as we know, there is no published proof of asymptotic normality for score matching. However, by using the same techniques as in our proofs, and under the same assumptions, we can conclude that the asymptotic variance of the score matching estimator (Corollary 1) is always smaller than that of sliced score matching with multivariate Rademacher projections. However, the gap shrinks as $M$ increases.
4.3 Connection to Other Methods
Sliced score matching is closely connected to several other methods, and can be motivated from different perspectives. Here we discuss a few of them.
Connection to NCE
Noise contrastive estimation (NCE), proposed by Gutmann & Hyvärinen (2010), is another principle for training unnormalized statistical models. The method works by contrasting the model distribution with a noise distribution. We consider a special form of NCE which minimizes the following objective
(10) 
where , and . Assuming the noise magnitude to be small, Eq. (10) can be written as the following by Taylor expansion
(11) 
For derivation, see Proposition 1 in the appendix. A similar derivation can also be found in Gutmann & Hirayama (2011). As a result of (11), if we choose some and take the expectation of (10) w.r.t. , the objective will be equivalent to sliced score matching whenever .
Connection to Hutchinson’s Trick
Hutchinson's trick (Hutchinson, 1990) is a stochastic algorithm for approximating $\operatorname{tr}(A)$ for any square matrix $A$. The idea is to choose a distribution $p_v$ of a random vector $\mathbf{v}$ such that $\mathbb{E}_{p_v}[\mathbf{v}\mathbf{v}^\top] = I$; then $\operatorname{tr}(A) = \mathbb{E}_{p_v}[\mathbf{v}^\top A \mathbf{v}]$. Hence, using Hutchinson's trick, we can replace $\operatorname{tr}(\nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}; \theta))$ with $\mathbb{E}_{p_v}[\mathbf{v}^\top \nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}; \theta) \, \mathbf{v}]$ in the score matching objective. The finite-sample objective of score matching then becomes
$$ \frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M \mathbf{v}_{ij}^\top \nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x}_i; \theta) \, \mathbf{v}_{ij} + \frac{1}{2N} \sum_{i=1}^N \| \mathbf{s}_m(\mathbf{x}_i; \theta) \|_2^2, $$
which is exactly the sliced score matching objective with variance reduction, $\hat{J}_{\mathrm{vr}}$.
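A quick numerical check of Hutchinson's trick itself (our own example): with Rademacher vectors, the Monte Carlo average of $\mathbf{v}^\top A \mathbf{v}$ converges to the true trace of an arbitrary square matrix.

```python
import numpy as np

# E[vᵀ A v] = tr(A) when E[v vᵀ] = I; check with Rademacher v.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 30))
v = rng.choice([-1.0, 1.0], size=(200_000, 30))
estimate = np.mean(np.sum((v @ A) * v, axis=1))   # Monte Carlo average of vᵀAv
assert abs(estimate - np.trace(A)) < 0.5
```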
5 Related Work
5.1 Scalable Score Matching
To the best of our knowledge, there are three existing methods that are able to scale up score matching to learning deep models on high dimensional data.
Denoising Score Matching
Vincent (2011) proposes denoising score matching, a variant of score matching that completely circumvents the Hessian. Specifically, consider a noise distribution $q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})$, and let $q_\sigma(\tilde{\mathbf{x}}) \triangleq \int q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) \, p_d(\mathbf{x}) \, \mathrm{d}\mathbf{x}$. Denoising score matching applies the original score matching to the noise-corrupted data distribution $q_\sigma(\tilde{\mathbf{x}})$, and the objective can be proven to be equivalent to the following (up to a constant):
$$ \frac{1}{2} \, \mathbb{E}_{q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) p_d(\mathbf{x})} \big[ \| \mathbf{s}_m(\tilde{\mathbf{x}}; \theta) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) \|_2^2 \big], $$
which can be estimated without computing the Hessian. Although denoising score matching is much faster than score matching, it has obvious drawbacks. First, it can only recover the noise-corrupted data distribution. Furthermore, choosing the parameters of the noise distribution is highly non-trivial, and in practice the performance can be very sensitive to $\sigma$, so people often have to resort to heuristics. For example, Saremi et al. (2018) propose a heuristic for choosing $\sigma$ based on Parzen windows.

Approximate Backpropagation
Kingma & LeCun (2010) proposes a backpropagation method to approximately compute the trace of the Hessian. Because the backpropagation of the full Hessian scales quadratically w.r.t. the layer size, the authors limit backpropagation only to the diagonal so that it has the same cost as the usual backpropagation for computing loss gradients. However, there are no theoretical guarantees for approximation errors. In fact, the authors only did experiments on networks with a single hidden layer, in which case the approximation is exact. Moreover, there is no direct support for the proposed approximate backpropagation method in modern automatic differentiation frameworks.
Curvature Propagation
Martens et al. (2012) estimate the trace term in score matching by applying curvature propagation to compute an unbiased, complex-valued estimator of the diagonal of the Hessian. The work claims that curvature propagation has the smallest variance among a class of estimators that includes the Hutchinson estimator. However, their proof incorrectly evaluates the pseudo-variance of the complex-valued estimator instead of the variance. In practice, curvature propagation can have large variance when the number of nodes in the network is large, because it introduces noise for each node in the network. Finally, implementing curvature propagation requires manually modifying the backpropagation code and handling complex numbers in neural networks, and it will be inefficient for networks with more general structures, such as recurrent neural networks.
5.2 Kernel Score Estimators
Two prior methods for score estimation are based on a generalized form of Stein's identity (Stein, 1981; Gorham & Mackey, 2017):
$$ \mathbb{E}_q \big[ \mathbf{h}(\mathbf{x}) \, \nabla_{\mathbf{x}} \log q(\mathbf{x})^\top + \nabla_{\mathbf{x}} \mathbf{h}(\mathbf{x}) \big] = 0, \quad (12) $$
where $q(\mathbf{x})$ is a continuously differentiable density and $\mathbf{h}$ is a function satisfying some regularity conditions. Li & Turner (2018) propose to set $\mathbf{h}$ as the feature map of some kernel in the Stein class (Liu et al., 2016) of $q$, and solve a finite-sample version of (12) to obtain estimates of $\nabla_{\mathbf{x}} \log q(\mathbf{x})$ at the sample points. We refer to this method as Stein in the experiments. Shi et al. (2018) adopt a different approach but also exploit (12). They build their estimator by expanding $\nabla_{\mathbf{x}} \log q(\mathbf{x})$ as a spectral series of eigenfunctions and solving for the coefficients using (12). Compared to Stein, their method is argued to have theoretical guarantees and principled out-of-sample estimation at unseen test points. We refer to their method as Spectral in the experiments.

6 Experiments
Our experiments include two parts: (1) testing the effectiveness of SSM in learning deep models for density estimation, and (2) testing the ability of SSM to provide score estimates for applications such as VAEs with implicit encoders and WAEs. Unless specified explicitly, we choose $M = 1$ by default.
6.1 Density Estimation
We evaluate SSM and its variance-reduced version (SSMVR) for density estimation and compare against score matching (SM) and its three scalable baselines: denoising score matching (DSM), approximate backpropagation (approx BP), and curvature propagation (CP). All SSM methods use the multivariate Rademacher distribution as $p_v$. Our results demonstrate that: (1) SSM is comparable in performance to score matching, (2) SSM outperforms other computationally efficient approximations to score matching, and (3) unlike score matching, SSM scales effectively to high-dimensional data.
6.1.1 Deep Kernel Exponential Families
Model
Deep kernel exponential families (DKEF) are unnormalized density models trained using score matching (Wenliang et al., 2018). DKEFs parameterize the unnormalized log-density as $\log \tilde{p}_m(\mathbf{x}) = f(\mathbf{x}) + \log q_0(\mathbf{x})$, where $q_0$ is a base density and $f$ is a mixture of Gaussian kernels evaluated at different inducing points: $f(\mathbf{x}) = \sum_a \alpha_a \, k(\mathbf{x}, \mathbf{z}_a)$. The kernel is defined on the feature space of a neural network, and the network parameters are trained along with the inducing points $\mathbf{z}_a$. Further details of the model, which is identical to that in Wenliang et al. (2018), are presented in Appendix C.1.
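To make the structure concrete, here is a rough sketch of a DKEF-style unnormalized log-density (details such as the one-layer feature map, the Gaussian base density, and the kernel bandwidth are placeholders of ours, not the exact architecture of Wenliang et al., 2018):

```python
import numpy as np

# DKEF-style unnormalized log-density:
#   log p̃(x) = Σ_a α_a k(φ(x), φ(z_a)) + log q0(x)
# with a Gaussian kernel on the features of a small network φ.
rng = np.random.default_rng(0)
D, H, n_inducing = 4, 8, 5
W = rng.normal(size=(H, D)) / np.sqrt(D)        # toy feature-network weights
alpha = rng.normal(size=n_inducing)             # kernel mixture weights
Z = rng.normal(size=(n_inducing, D))            # trainable inducing points

def phi(x):
    return np.tanh(W @ x)                       # one-layer feature map (toy)

def log_p_tilde(x, bandwidth=1.0):
    feats = phi(x)
    k = np.array([np.exp(-np.sum((feats - phi(z))**2) / (2 * bandwidth**2))
                  for z in Z])
    log_q0 = -0.5 * np.sum(x**2)                # simple Gaussian base density
    return float(alpha @ k + log_q0)

val = log_p_tilde(rng.normal(size=D))
assert np.isfinite(val)
```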
Setup
Following the settings of Wenliang et al. (2018), we train models on three UCI datasets: Parkinsons, RedWine, and WhiteWine (Dua & Graff, 2017), and use their original code for score matching. To compute the trace term exactly, the manual implementation of backpropagation in Wenliang et al. (2018) takes over one thousand lines of code for a model that is four layers deep. For denoising score matching, the value of $\sigma$ is chosen by grid search using the score matching loss on a validation dataset. All models are trained with 15 different random seeds, and training is stopped when the validation loss does not improve for 200 steps.
Results
Results in Fig. 1 demonstrate that SSMVR performs comparably to score matching when evaluated using the score matching loss on a held-out test set. We observe that variance reduction substantially improves the performance of SSM. In addition, SSM outperforms the other computationally efficient approaches. DSM can perform comparably to SSM on RedWine; however, it is challenging to select $\sigma$ for DSM. Models trained using the $\sigma$ given by the heuristic in Saremi et al. (2018) are far from optimal (on both score matching losses and likelihoods), and extensive grid search is needed to find the best $\sigma$. CP performs substantially worse, likely because it injects noise for each node in the computation graph, and the amount of noise introduced is too large for a neural-network-based kernel evaluated at 200 inducing points; this supports our hypothesis that CP does not work effectively for deep models. Results for approximate backpropagation are omitted because its losses were far larger on all datasets; approx BP provides a biased estimate of the Hessian without any error guarantees. All results are similar when evaluating with log-likelihoods (Appendix C.1).
Scalability to high dimensional data
We evaluate the computational efficiency of the different losses on data of increasing dimensionality by fitting DKEF models to a multivariate standard normal in high-dimensional spaces. The average running time per minibatch of 100 examples is reported in Fig. 2. The cost of score matching grows linearly with the dimension of the input data due to the computation of the Hessian, and it triggers out-of-memory errors on a 12GB GPU once the dimension increases beyond 400. SSM performs similarly to DSM, approx BP, and CP, all of which scale to large dimensions.
Table 1. Score matching losses and log-likelihoods on the test set for NICE.

Method       Test SM Loss    Test LL
MLE          579             791
SSMVR        8054            3355
SSM          2428            2039
DSM ()       3035            4363
DSM ()       97              8082
CP           1694            1517
Approx BP    48              2288
6.1.2 Deep Flow Models
Setup
As a sanity check, we also evaluate SSM by training a NICE flow model (Dinh et al., 2015), whose likelihood is tractable and can be compared to ground truths obtained by MLE. The model we use has 20 hidden layers and 1000 units per layer. Models are trained to fit MNIST handwritten digits, which are 784-dimensional images. Data are dequantized by adding uniform noise and transformed using a logit transformation. We provide additional details in Appendix C.2.

Training with score matching is extremely computationally expensive in this case. Our SM implementation based on auto-differentiation takes around 7 hours to finish one epoch of training, and a 12 GB GPU cannot hold a batch size larger than 24, so we do not include these results. Since NICE has tractable likelihoods, we also evaluate MLE as a surrogate objective for minimizing the score matching loss. Notably, likelihoods and score matching losses might be uncorrelated when the model is misspecified, which is likely to be the case for a complex dataset like MNIST.
Results
Score matching losses and log-likelihoods on the test dataset are reported in Tab. 1, where models are evaluated using the best checkpoint in terms of the score matching loss on a validation dataset over 100 epochs of training. SSMVR greatly outperforms all other methods on the score matching loss. DSM performs worse than SSMVR and is still hard to tune; in particular, the $\sigma$ given by the heuristic in Saremi et al. (2018) performed the worst (on both log-likelihood and SM loss) of the eight choices of $\sigma$ in our grid search. Approx BP has more success on NICE than on DKEF models, probably because the Hessians of the hidden layers of NICE are closer to diagonal matrices, which results in a smaller approximation error for approx BP. As in the DKEF experiments, CP performs worse, likely due to injecting noise into all hidden units, which leads to large variance for a network as big as NICE. Unlike the DKEF experiments, we find that good log-likelihoods are less correlated with good score matching losses, probably due to model misspecification.
6.2 Score Estimation
We consider two typical tasks that require accurate score estimation: 1) training VAEs with implicit encoders, and 2) training Wasserstein Auto-Encoders. We show that in both tasks SSM outperforms previous score estimators. Samples generated by the various algorithms can be found in Appendix A.
6.2.1 VAE with Implicit Encoders
Background
Consider a latent variable model $p_\theta(\mathbf{x}, \mathbf{z})$, where $\mathbf{x}$ and $\mathbf{z}$ denote observed and latent variables respectively. A variational autoencoder (VAE) is composed of two parts: 1) an encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$ modeling the conditional distribution of $\mathbf{z}$ given $\mathbf{x}$, which approximates the posterior distribution of the latent variable; and 2) a decoder $p_\theta(\mathbf{x} \mid \mathbf{z})$ modeling the conditional distribution of $\mathbf{x}$ given $\mathbf{z}$. A VAE is typically trained by maximizing the following evidence lower bound (ELBO):
$$ \mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[\log p_\theta(\mathbf{x} \mid \mathbf{z})\big] - D_{\mathrm{KL}}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big), $$
where $p(\mathbf{z})$ denotes a pre-specified prior distribution. The expressive power of $q_\phi$ is critical to the success of variational learning. Typically, $q_\phi$ is chosen to be a simple distribution so that the ELBO is tractable. We call this traditional approach "ELBO" in the experiments. With score estimation techniques, we can directly estimate $\nabla_{\mathbf{z}} \log q_\phi(\mathbf{z} \mid \mathbf{x})$ for implicit distributions, which enables more flexible options for $q_\phi$. We consider 3 score estimation techniques: sliced score matching (SSM), Stein (Li & Turner, 2018), and Spectral (Shi et al., 2018).
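As a self-contained illustration of the ELBO (our own toy model, not the paper's architectures): for a Gaussian encoder the KL term is closed-form, and the bound never exceeds $\log p(\mathbf{x})$; with an implicit encoder the entropy/KL piece loses its closed form, which is exactly where the score estimate enters the training gradient.

```python
import numpy as np

# Toy 1-D VAE pieces: prior N(0,1), decoder p(x|z) = N(x; z, 1), so the
# marginal is p(x) = N(0, 2) and the exact posterior is N(x/2, 1/2).
rng = np.random.default_rng(0)

def elbo(x, mu_q, sigma_q, n_samples=10_000):
    z = mu_q + sigma_q * rng.normal(size=n_samples)        # q(z|x) samples
    log_lik = -0.5 * (x - z) ** 2 - 0.5 * np.log(2 * np.pi)
    # KL(N(mu_q, sigma_q²) || N(0, 1)) in closed form
    kl = 0.5 * (mu_q**2 + sigma_q**2 - 1.0 - 2.0 * np.log(sigma_q))
    return np.mean(log_lik) - kl

x = 0.7
log_px = -0.5 * x**2 / 2.0 - 0.5 * np.log(2 * np.pi * 2.0)
val_elbo = elbo(x, mu_q=x / 2.0, sigma_q=np.sqrt(0.5))     # exact posterior
assert abs(val_elbo - log_px) < 0.02                        # ELBO is tight here
```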
Given a data point $\mathbf{x}$, kernel score estimators need multiple samples from $q_\phi(\mathbf{z} \mid \mathbf{x})$ to estimate its score. On MNIST, we use 100 samples, as done in Shi et al. (2018). On CelebA, however, we can only take 20 samples due to GPU memory limitations. In contrast, SSM learns a score network along with $q_\phi$, which amortizes the cost of score estimation. Unless noted explicitly, we use one projection per data point ($M = 1$) for SSM, which is much more scalable to large networks.
Setup
We consider VAE training on MNIST and CelebA (Liu et al., 2015). All images in CelebA are first center-cropped and then resized to a fixed resolution. We report negative likelihoods on MNIST, as estimated by AIS (Neal, 2001) with 1000 intermediate distributions. We evaluate sample quality on CelebA with FID scores (Heusel et al., 2017). For fast AIS evaluation on MNIST, we use shallow fully-connected networks with 3 hidden layers. For the CelebA experiments we use deep convolutional networks. The architectures of the implicit encoders and score networks are straightforward modifications of the ELBO encoders. More details are provided in Appendix C.3.
Results
The negative likelihoods of different methods on MNIST are reported in the left part of Tab. 2. We note that SSM outperforms Stein and Spectral in all cases. When the latent dimension is 8, SSM outperforms ELBO, which indicates that the extra expressive power of an implicit $q_\phi$ pays off. When the latent dimension is 32, the gaps between SSM and the kernel methods are even larger, and the performance of SSM is still comparable to ELBO. Moreover, when $M$ is increased to match the computation of the kernel methods, SSM outperforms ELBO.
For CelebA, we provide FID scores of samples in the top part of Tab. 3. We observe that after 40k training iterations, SSM outperforms all other baselines. Kernel methods perform poorly in this case because only 20 samples per data point can be used due to limited amount of GPU memory. Early during training, SSM does not perform as well. Since the score network is trained along with the encoder and decoder, a moderate number of training steps is needed to give an accurate score estimation (and better learning of the VAE).
6.2.2 WAEs
Background
WAE is another method for learning latent variable models, which generally produces better samples than VAEs. Similar to a VAE, it contains an encoder and a decoder, and both can be implicit distributions. Let $p(\mathbf{z})$ be a predefined prior distribution, and let $q_\phi(\mathbf{z}) \triangleq \int q_\phi(\mathbf{z} \mid \mathbf{x}) \, p_d(\mathbf{x}) \, \mathrm{d}\mathbf{x}$ denote the aggregated posterior distribution. Using a metric function $c(\cdot, \cdot)$ and the KL divergence between $q_\phi(\mathbf{z})$ and $p(\mathbf{z})$, WAE minimizes the following objective
$$ \mathbb{E}_{p_d(\mathbf{x})} \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \big[ c(\mathbf{x}, g_\theta(\mathbf{z})) \big] + \lambda \, D_{\mathrm{KL}}\big( q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z}) \big), $$
where $g_\theta$ denotes the decoder. Here the gradient of the KL term involves the intractable score $\nabla_{\mathbf{z}} \log q_\phi(\mathbf{z})$, which can again be approximated by score estimators. Note that in this case kernel methods can work efficiently, as samples from $q_\phi(\mathbf{z})$ can be collected by sampling one $\mathbf{z}$ for each $\mathbf{x}$ in a minibatch. For SSM, we use a score network and train it alongside the autoencoder.
Setup
Results
The negative likelihoods on MNIST are provided in the right part of Tab. 2. SSM outperforms both kernel methods, and achieves a larger performance gap when the latent dimension is higher. The likelihoods are lower than the VAE results as the WAE objective does not directly maximize likelihoods.
We show FID scores for the CelebA experiments in the bottom part of Tab. 3. As expected, kernel methods perform much better than before, because sampling from $q_\phi(\mathbf{z})$ is fast. The FID scores are generally lower than those in the VAE experiments, which supports the folklore that WAEs generally obtain better samples. SSM outperforms both kernel methods when the number of iterations exceeds 40k. This shows the advantage of training a deep, expressive score network over using a fixed kernel in score estimation tasks.
7 Conclusion
We propose sliced score matching, a scalable method for learning unnormalized statistical models and providing score estimates for implicit distributions. Compared to the original score matching and its variants, our estimator can scale to large-data scenarios while remaining cheap to implement in modern auto-differentiation frameworks. Theoretically, our estimator is consistent and asymptotically normal under some regularity conditions. Experimental results demonstrate that our method outperforms competitors in learning deep energy-based models and provides more accurate estimates than kernel score estimators in training implicit VAEs and WAEs.
References

Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.  Paszke et al. (2017) Paszke, A., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
 Canu & Smola (2006) Canu, S. and Smola, A. Kernel methods and the exponential family. Neurocomputing, 69(7):714 – 720, 2006. ISSN 09252312. New Issues in Neurocomputing: 13th European Symposium on Artificial Neural Networks.
 Dinh et al. (2015) Dinh, L., Krueger, D., and Bengio, Y. NICE: Nonlinear independent components estimation. International Conference in Learning Representations Workshop Track, 2015.
 Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Dudley (1967) Dudley, R. M. The sizes of compact subsets of hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Gorham & Mackey (2017) Gorham, J. and Mackey, L. Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning, pp. 1292–1301, 2017.

Gutmann & Hyvärinen (2010)
Gutmann, M. and Hyvärinen, A.
Noisecontrastive estimation: A new estimation principle for
unnormalized statistical models.
In
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics
, pp. 297–304, 2010.  Gutmann & Hirayama (2011) Gutmann, M. U. and Hirayama, J.i. Bregman divergence as general framework to estimate unnormalized statistical models. In Proceedings of the TwentySeventh Conference on Uncertainty in Artificial Intelligence, pp. 283–290. AUAI Press, 2011.
 Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two timescale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
 Huszár (2017) Huszár, F. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.
 Hutchinson (1990) Hutchinson, M. F. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in StatisticsSimulation and Computation, 19(2):433–450, 1990.
 Hyvärinen (2005) Hyvärinen, A. Estimation of nonnormalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.
 Kingma & LeCun (2010) Kingma, D. P. and LeCun, Y. Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems, pp. 1126–1134, 2010.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Autoencoding variational bayes. International Conference on Learning Representations, 2014.
 LeCun et al. (2006) LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M. A., and Huang, F. J. A tutorial on energybased learning. In Predicting structured data. MIT Press, 2006.
 Li & Turner (2018) Li, Y. and Turner, R. E. Gradient estimators for implicit models. In International Conference on Learning Representations, 2018.
 Liu et al. (2016) Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodnessoffit tests. In Proceedings of The 33rd International Conference on Machine Learning, pp. 276–284, 2016.

Liu et al. (2015)
Liu, Z., Luo, P., Wang, X., and Tang, X.
Deep learning face attributes in the wild.
In
Proceedings of International Conference on Computer Vision (ICCV)
, 2015.  Martens et al. (2012) Martens, J., Sutskever, I., and Swersky, K. Estimating the hessian by backpropagating curvature. Proceedings of the 29th International Conference on Machine Learning, 2012.
 Neal (2001) Neal, R. M. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
 Owen (2013) Owen, A. B. Monte Carlo theory, methods and examples. 2013.
 Rabin et al. (2012) Rabin, J., Peyré, G., Delon, J., and Bernot, M. Wasserstein barycenter and its application to texture mixing. In Bruckstein, A. M., ter Haar Romeny, B. M., Bronstein, A. M., and Bronstein, M. M. (eds.), Scale Space and Variational Methods in Computer Vision, pp. 435–446, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 9783642247859.
 Saremi et al. (2018) Saremi, S., Mehrjou, A., Schölkopf, B., and Hyvärinen, A. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018.
 Sasaki et al. (2014) Sasaki, H., Hyvärinen, A., and Sugiyama, M. Clustering via mode seeking by direct estimation of the gradient of a logdensity. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 19–34. Springer, 2014.
 Shi et al. (2018) Shi, J., Sun, S., and Zhu, J. A spectral approach to gradient estimation for implicit distributions. In Proceedings of the 35th International Conference on Machine Learning, pp. 4651–4660, 2018.
 Sriperumbudur et al. (2017) Sriperumbudur, B., Fukumizu, K., Gretton, A., Hyvärinen, A., and Kumar, R. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 18(57):1–59, 2017.

Stein (1981)
Stein, C. M.
Estimation of the mean of a multivariate normal distribution.
The annals of Statistics, pp. 1135–1151, 1981.  Strathmann et al. (2015) Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., and Gretton, A. Gradientfree hamiltonian monte carlo with efficient kernel exponential families. In Advances in Neural Information Processing Systems, pp. 955–963, 2015.
 Sun et al. (2019) Sun, S., Zhang, G., Shi, J., and Grosse, R. Functional variational Bayesian neural networks. In International Conference on Learning Representations, 2019.
 Sutherland et al. (2018) Sutherland, D., Strathmann, H., Arbel, M., and Gretton, A. Efficient and principled score estimation with nyström kernel exponential families. In Proceedings of the TwentyFirst International Conference on Artificial Intelligence and Statistics, pp. 652–660, 2018.
 Tolstikhin et al. (2018) Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein autoencoders. In International Conference on Learning Representations, 2018.
 Tran et al. (2017) Tran, D., Ranganath, R., and Blei, D. Hierarchical implicit models and likelihoodfree variational inference. In Advances in Neural Information Processing Systems, pp. 5523–5533, 2017.
 van der Vaart (1998) van der Vaart, A. W. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998. doi: 10.1017/CBO9780511802256.

Vincent (2011)
Vincent, P.
A connection between score matching and denoising autoencoders.
Neural computation, 23(7):1661–1674, 2011.  Wenliang et al. (2018) Wenliang, L., Sutherland, D., Strathmann, H., and Gretton, A. Learning deep kernels for exponential family densities. arXiv eprints, art. arXiv:1811.08357, Nov 2018.
Appendix A Samples
A.1 VAE with Implicit Encoders
A.1.1 MNIST
[Figure: MNIST samples from VAEs with implicit encoders trained with the SSM, Stein, and Spectral score estimators, for latent dimensions 8 and 32.]
A.1.2 CelebA
[Figure: CelebA samples from models trained with (a) ELBO, (b) SSM, (c) Stein, and (d) Spectral.]
A.2 WAE
A.2.1 MNIST
[Figure: MNIST samples from WAEs trained with the SSM, Stein, and Spectral score estimators, for latent dimensions 8 and 32.]
A.2.2 CelebA
[Figure: CelebA samples from WAEs trained with the SSM, Stein, and Spectral score estimators.]
Appendix B Proofs
B.1 Notations
Below we provide a summary of the most commonly used notations in the proofs. First, we denote the data distribution as $p_d(\mathbf{x})$ and assume that the training/test data are i.i.d. samples of $p_d(\mathbf{x})$. The model is denoted as $p_m(\mathbf{x};\theta)$, where $\theta$ is restricted to a parameter space $\Theta$. Note that $p_m(\mathbf{x};\theta)$ can be an unnormalized energy-based model. We use $\mathbf{v}$ to represent a random vector with the same dimension as the input $\mathbf{x}$. This vector is often called the projection vector, and we use $p_v$ to denote its distribution.
Next, we introduce several shorthand notations for quantities related to $p_m$ and $p_d$. The (Stein) score functions $\nabla_{\mathbf{x}} \log p_m(\mathbf{x};\theta)$ and $\nabla_{\mathbf{x}} \log p_d(\mathbf{x})$ are written as $\mathbf{s}_m(\mathbf{x};\theta)$ and $\mathbf{s}_d(\mathbf{x})$ respectively, and finally the Hessian of $\log p_m(\mathbf{x};\theta)$ w.r.t. $\mathbf{x}$ is denoted as $\nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x};\theta)$.
We also adopt some convenient notations for collections. In particular, we use $\mathbf{x}_1^n$ to denote a collection of vectors $\{\mathbf{x}_1, \cdots, \mathbf{x}_n\}$, and $\mathbf{v}_1^m$ is defined analogously.
B.2 Basic Properties
The following regularity conditions are needed for integration by parts and identifiability.
Assumption 1 (Regularity of score functions).
The model score function $\mathbf{s}_m(\mathbf{x};\theta)$ and data score function $\mathbf{s}_d(\mathbf{x})$ are both differentiable. They additionally satisfy $\mathbb{E}_{p_d}[\lVert \mathbf{s}_m(\mathbf{x};\theta) \rVert_2^2] < \infty$ for all $\theta \in \Theta$, and $\mathbb{E}_{p_d}[\lVert \mathbf{s}_d(\mathbf{x}) \rVert_2^2] < \infty$.
Assumption 2 (Regularity of projection vectors).
The projection vectors satisfy $\mathbb{E}_{p_v}[\lVert \mathbf{v} \rVert_2^2] < \infty$, and $\mathbb{E}_{p_v}[\mathbf{v}\mathbf{v}^\top] \succ 0$.
Assumption 3 (Boundary conditions).
$\mathbf{s}_m(\mathbf{x};\theta)\, p_d(\mathbf{x}) \to \mathbf{0}$ as $\lVert \mathbf{x} \rVert_2 \to \infty$, for all $\theta \in \Theta$.
Assumption 4 (Identifiability).
The model family is well-specified, i.e., $p_d(\mathbf{x}) = p_m(\mathbf{x};\theta^*)$ for some $\theta^* \in \Theta$. Furthermore, $p_m(\cdot\,;\theta) \neq p_m(\cdot\,;\theta^*)$ whenever $\theta \neq \theta^*$.
Assumption 5 (Positiveness).
$p_m(\mathbf{x};\theta) > 0$ for all $\theta$ and $\mathbf{x}$, and $p_d(\mathbf{x}) > 0$.
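As a concrete instance (the specific check below is ours, not from the paper), the standard multivariate Gaussian $p_v = \mathcal{N}(\mathbf{0}, I)$ satisfies Assumption 2: $\mathbb{E}[\lVert \mathbf{v} \rVert_2^2] = D < \infty$ and $\mathbb{E}[\mathbf{v}\mathbf{v}^\top] = I \succ 0$. A quick Monte Carlo sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 5, 100_000
v = rng.standard_normal((n, D))            # samples from p_v = N(0, I)

mean_sq_norm = (v ** 2).sum(axis=1).mean() # estimates E[||v||^2], should be near D
second_moment = v.T @ v / n                # estimates E[v v^T], should be near I
eigvals = np.linalg.eigvalsh(second_moment)

print(mean_sq_norm)       # close to D = 5
print(eigvals.min() > 0)  # positive definite, as Assumption 2 requires
```

Rademacher vectors (i.i.d. $\pm 1$ entries) pass the same check, since both conditions only constrain the second moments of $p_v$.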
Theorem 1.
Under Assumptions 1–3, $L(\theta; p_v) = J(\theta; p_v) + C$, where $L(\theta; p_v) \triangleq \frac{1}{2}\mathbb{E}_{p_v}\mathbb{E}_{p_d}\big[(\mathbf{v}^\top \mathbf{s}_m(\mathbf{x};\theta) - \mathbf{v}^\top \mathbf{s}_d(\mathbf{x}))^2\big]$ is the sliced Fisher divergence, $J(\theta; p_v) \triangleq \mathbb{E}_{p_v}\mathbb{E}_{p_d}\big[\mathbf{v}^\top \nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x};\theta)\, \mathbf{v} + \frac{1}{2}(\mathbf{v}^\top \mathbf{s}_m(\mathbf{x};\theta))^2\big]$ is the sliced score matching objective, and $C$ is a constant that does not depend on $\theta$.
Proof.
The basic idea of this proof is similar to that of Theorem 1 in Hyvärinen (2005). First, note that $L(\theta; p_v)$ can be expanded to
$L(\theta; p_v) = \mathbb{E}_{p_v}\mathbb{E}_{p_d}\big[\tfrac{1}{2}(\mathbf{v}^\top \mathbf{s}_m(\mathbf{x};\theta))^2 - (\mathbf{v}^\top \mathbf{s}_m(\mathbf{x};\theta))(\mathbf{v}^\top \mathbf{s}_d(\mathbf{x})) + \tfrac{1}{2}(\mathbf{v}^\top \mathbf{s}_d(\mathbf{x}))^2\big]$  (14)
$= \mathbb{E}_{p_v}\mathbb{E}_{p_d}\big[\tfrac{1}{2}(\mathbf{v}^\top \mathbf{s}_m(\mathbf{x};\theta))^2 - (\mathbf{v}^\top \mathbf{s}_m(\mathbf{x};\theta))(\mathbf{v}^\top \mathbf{s}_d(\mathbf{x}))\big] + C,$  (15)
where the expansion in (14) is valid due to the assumptions of bounded expectations. We have absorbed the last term in the bracket of (14) into $C$ since it does not depend on $\theta$. Now what we need to prove is
$\mathbb{E}_{p_v}\mathbb{E}_{p_d}\big[-(\mathbf{v}^\top \mathbf{s}_m(\mathbf{x};\theta))(\mathbf{v}^\top \mathbf{s}_d(\mathbf{x}))\big] = \mathbb{E}_{p_v}\mathbb{E}_{p_d}\big[\mathbf{v}^\top \nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x};\theta)\, \mathbf{v}\big].$  (16)
This can be shown by first observing that
$\mathbb{E}_{p_d}\big[(\mathbf{v}^\top \mathbf{s}_d(\mathbf{x}))(\mathbf{v}^\top \mathbf{s}_m(\mathbf{x};\theta))\big] = \int (\mathbf{v}^\top \nabla_{\mathbf{x}} p_d(\mathbf{x}))(\mathbf{v}^\top \mathbf{s}_m(\mathbf{x};\theta))\, \mathrm{d}\mathbf{x},$  (17)
where we assume $p_d(\mathbf{x}) > 0$ so that $p_d(\mathbf{x}) \nabla_{\mathbf{x}} \log p_d(\mathbf{x}) = \nabla_{\mathbf{x}} p_d(\mathbf{x})$. Then, applying multivariate integration by parts (cf., Lemma 4 in Hyvärinen (2005)), we obtain
$\int (\mathbf{v}^\top \nabla_{\mathbf{x}} p_d(\mathbf{x}))(\mathbf{v}^\top \mathbf{s}_m(\mathbf{x};\theta))\, \mathrm{d}\mathbf{x} = \sum_{i,j} \int v_i v_j \frac{\partial p_d(\mathbf{x})}{\partial x_i} \frac{\partial \log p_m(\mathbf{x};\theta)}{\partial x_j}\, \mathrm{d}\mathbf{x} \overset{(i)}{=} -\sum_{i,j} \int v_i v_j\, p_d(\mathbf{x}) \frac{\partial^2 \log p_m(\mathbf{x};\theta)}{\partial x_i \partial x_j}\, \mathrm{d}\mathbf{x} = -\mathbb{E}_{p_d}\big[\mathbf{v}^\top \nabla_{\mathbf{x}} \mathbf{s}_m(\mathbf{x};\theta)\, \mathbf{v}\big],$
where $v_i$ denotes the $i$-th component of $\mathbf{v}$. In the above derivation, the finiteness of all expectations is due to the Cauchy-Schwarz inequality together with Assumptions 1 and 2, and step $(i)$ discards the boundary term by the assumption that $\mathbf{s}_m(\mathbf{x};\theta)\, p_d(\mathbf{x})$ vanishes at infinity. Taking the expectation w.r.t. $p_v$ and negating both sides yields (16), which completes the proof. ∎
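The integration-by-parts identity at the heart of this proof can be sanity-checked numerically in a simple case we choose for illustration: $p_d = \mathcal{N}(0, 1)$ in one dimension, so the data score is $s_d(x) = -x$, and a hypothetical model score $s_m(x;\theta) = -\theta x$ (the score of an unnormalized Gaussian with precision $\theta$). The identity predicts $\mathbb{E}_{p_d}[(v s_d)(v s_m)] = -\mathbb{E}_{p_d}[v\, s_m'(x;\theta)\, v] = \theta v^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, v = 1.7, 0.8                    # illustrative parameter and fixed projection scalar
x = rng.standard_normal(200_000)       # samples from p_d = N(0, 1)

s_d = -x                               # data score of N(0, 1)
s_m = -theta * x                       # model score, derivative d s_m / dx = -theta
hess_m = -theta

lhs = np.mean((v * s_d) * (v * s_m))                   # E[(v s_d)(v s_m)]
rhs = -np.mean(v * hess_m * v * np.ones_like(x))       # -E[v (d s_m/dx) v]
print(lhs, rhs)                                        # both near theta * v**2
```

The Monte Carlo estimate of the left-hand side matches the exact right-hand side up to sampling noise, as the identity requires.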
Lemma 1.
Assume $\mathbb{E}_{p_v}[\mathbf{v}\mathbf{v}^\top] \succ 0$ and that Assumptions 4 and 5 hold. Then $L(\theta; p_v) = 0$ if and only if $\theta = \theta^*$.
Proof.
First, since $\mathbb{E}_{p_v}[\mathbf{v}\mathbf{v}^\top] \succ 0$, $L(\theta; p_v) = 0$ implies $\mathbf{s}_m(\mathbf{x};\theta) = \mathbf{s}_d(\mathbf{x})$ for $p_d$-almost every $\mathbf{x}$.