1 Introduction
Universal density estimators—
Learning the probability density of complex highdimensional data is a challenging problem in machine learning. Our goal is to build a universal engine with a hierarchical structure to solve this problem robustly and efficiently. This quest has a rich history, going back to the Boltzmann machine at the birth of the connectionist movement
(Hinton & Sejnowski, 1986), but an efficient, robust, and generalpurpose method for continuousvalued data is still missing.The curse of dimensionality— The most fundamental road block for universal density estimation is the curse of dimensionality, the exponential growth of the “volume” of the data manifold with the dimension of the ambient space. In fact, this exponential growth is at the core of the shortcomings of kernel methods in density estimation. The curse can however be overcome by having strong priors on the data distribution. Perhaps, the strongest prior we currently have for natural distributions is that they are hierarchical and must be composed with deep architectures (Bengio, 2009). Guided by this prior, we demand the density estimation to be formulated in a deep hierarchical architecture.
The curse of inference—
One approach is to use latent variable models with hierarchical structure. The biggest challenge in training such models is that for complex distributions, the inference of the latent variables (i.e., over hidden units in the neural network) is intractable. While Markov chain Monte Carlo (MCMC) methods could be used, the problem is that natural distributions, for example natural images, have an intriguing property, also observed in physical systems near secondorder phase transitions, that probability mass is concentrated in sharp ridges that are separated by large regions of low probability. This complex probability landscape is a major road block for learning; variational and MCMC methods have yet to meet the challenge
(Bengio et al., 2013). An alternative is to use energybased models, i.e. define an energy function which is essentially an unnormalized logdensity function for the data. Again, however, such models tend to be intractable in the sense that the partition function (normalization constant) and its derivative cannot be computed easily — their approximations by MCMC runs into the same problem as the inference of latent variables.
Deterministic objectives— These challenges facing MCMC methods gave rise to an important area of research in using deterministic objectives for parameter estimation of probabilistic models. One of the first examples of such frameworks for energybased models is score matching (Hyvärinen, 2005), which will be reviewed in a great depth below. A very early method was pseudolikelihood (Besag, 1974). More recently, minimum probability flow, with deep connections to score matching, was used to learn Ising spin glasses (SohlDickstein et al., 2011). These methods define deterministic objective functions, although the optimization may be performed with stochastic methods.
Our contributions—
In this work, with score matching as the foundation, we introduce a scalable and efficient algorithm to learn the energy of any data distribution in an inferencefree hierarchical framework. We revisit the Bayesian estimation interpretation of the score function and Parzen score matching, and construct an objective, which is optimized with stochastic gradient descent (SGD). The resulting Deep Energy Estimator Network (DEEN) is designed as products of experts (PoE), while the energy is efficiently estimated without resorting to MCMC computations and contrastive divergence approximations. Our experiments were performed on twodimensional synthetic data, MNIST, and the van Heteren data base of natural images. In addition, we diagnosed the problems with score function estimation in MLPs that bypass energy estimation; our framework avoids those problems as the score function is computed from the energy function itself.
2 Background
Energybased models—
Assume we observe a random vector
which has a probability density function (pdf) denoted by
. We want to use a parameterized density model with , and estimate the parameter vector from observations of ; i.e. we want to approximate by for the estimated parameter value . In energybased modelling, we start by defining a function which we use to define the model pdf as(1) 
Here we necessarily see the appearance of the partition function
(2) 
which normalizes the pdf to have unit integral. In general, its computation is extremely difficult.
Score matching— Score matching estimates the parameters while completely avoiding the evaluation of the partition function . The trick is that any multiplicative constant is eliminated in the score function defined as (Hyvärinen, 2005):
(3) 
since the partition function does not depend on . Hyvärinen showed that under mild assumptions, the following score matching objective
(4) 
is (locally) consistent. In addition, he showed that we do not need to know the data score as integration by parts reduces to:
(5) 
The term is the divergence of the vector field , equal to the negative trace of the Hessian of . The last term is not a function of and is dropped in the parameter estimation. For the finite dataset
(6) 
we obtain a finitesample version by replacing by the empirical distribution:
(7) 
So, in practice, was estimated by:
(8) 
KullbackLeibler vs. Fisher— There is a formal link between maximum likelihood and score matching established by Lyu (2009). By perturbing the observed data with the “scale parameter” : , where , he proved that the Fisher divergence is the derivative of the KL divergence with respect to the scale factor . His interpretation of this result was that the score matching is more robust than maximum likelihood in parameter estimation as it searches for parameters that are relatively more stable to small noise perturbations in the training data.
Score matching energy estimation— The finitesample scorematching objective of Eq. 8 written in terms of the energy and expanded in the vector coordinates reduces to
(9) 
(Kingma & LeCun, 2010) considered regularizing this objective where they arrived at regularized form of the score matching objective given by:
(10) 
The second term above was obtained by a perturbative expansion of after adding a Gaussian cloud to the data points and ignoring the offdiagonal elements of the Hessian of the energy. Kingma & LeCun further explored density estimation with the objective and parameterizing the energy
with a MLP. However, the backpropagation of the full Hessian scales like
, where is the total number of the parameters of the MLP. They found further approximations to the diagonal of the Hessian but the approximations were only valid for MLPs with a single hidden layer. Along these lines, approximation methods have been developed to estimate Hessians in neural networks (Martens et al., 2012b). We will see below a more principled way of regularizing score matching with Parzen score matching, and a stochastic version where the secondorder estimations are avoided altogether. The objective we arrived at can also be motivated by the intriguing Bayesian interpretation of the score function that is discussed next.Bayesian interpretation—
Score matching was originally motivated for its computational efficiency in the parameter estimation of unnormalized densities because of the key property that the partition function is eliminated in the score function. However, there are also intriguing connections between score matching and a very different problem of Bayesian inference for signal restoration. This was established by
Hyvärinen (2008) for small measurement noise, and generalized by Raphan & Simoncelli (2011). One way of looking at this is the Miyasawa denoising theorem: given noisy measurementsof an underlying random variable
, when the measurement noise is , the Bayesian leastsquare estimator reduces to(11) 
valid for any level of noise , with being the score function of the noisy data . Thus, estimating the score function is essential for denoising. Raphan & Simoncelli referred to this as singlestep denoising, and generalized the result for a large class of measurement processes beyond Gaussian noise.
Parzen density score matching— Vincent (2011) studied score matching using a nonparametric estimator, replacing by the Parzen windows density estimator :
(12) 
where is the smoothing kernel. The integration of Eq. 4 is then taken with respect to , and is replaced accordingly:
(13) 
He then proved that the Parzen score matching objective is equivalent to the following “denoising” objective
(14) 
with
(15) 
for any smoothing kernel (see the appendix in Vincent (2011)
for the proof.) The difficult kernel density estimation
was thus replaced with , which opens up the possibility of using a simpler stochastic implementation as we will see below.Singlestep denoising interpretation— Next we explain the denoising interpretation of the Parzen score objective (Vincent, 2011). This becomes clear by taking the Parzen kernel to be the conditional Gaussian density:
(16) 
where after scaling by , the denoising objective in (14) reduces to:
(17) 
Note that from Eq. 15, is clamped to the training data, and can be interpreted as Gaussian “noise” added to the training data. Optimization of learns a score function such that comes as close as possible to , which is why it was dubbed as the denoising score matching objective by Vincent. That being said, we prefer here to interpret the objective defined in Eq. 17 as Parzen score matching instead of denoising. This is more principled since in Parzen estimation there is a notion of optimality in smoothing the distribution: The optimal kernel width is likely to be nonzero and noninfinitesimal, irrespective of whether the measurements are noisefree or not. This is in contrast to denoising, where is a free parameter, and its optimal value is a less meaningful quantity.
3 Deep Energy Estimator Network
MLP parameterization of the energy— Our goal is to build a generalpurpose, efficient estimator that closely approximates the true energy of any complex data distribution, up to an additive constant. We construct this energy estimator based on the score matching framework. For many reallife data distributions, the energy landscape is very complex with many canyons/ridges of lowenergy/highdensity being separated by large flat regions of highenergy/lowdensity. Deep neural networks are natural candidates for shaping such complex energy landscapes and thus beating the curse of dimensionality. In fact, to obtain a model of great generality, we can leverage the universal approximation capacity of neural networks and use a neural network as the model of the energy (logdensity):
(18) 
where means the output of a multilayer perceptron with input and (weight) parameters ; it takes the inputs in and outputs a scalar in .
Scalable score matching— One way to find an estimate for the energy would be to simply apply the basic definition of score matching. However, as discussed above, such an approach is often considered to be computationally prohibitive due to the appearance of secondorder derivatives in the objective function (Kingma & LeCun, 2010) and one must resort to approximations (Martens et al., 2012a). Our key insight here is to build on the Bayesian interpretation of the score function of Raphan & Simoncelli and the Parzen score matching of Vincent. This leads to a method which scales well to high dimensions, as well as to large datasets.
The DEEN objective— Thus, we use a multilayer perceptron that parametrizes and the score is given by . In this work we use the Gaussian kernel, which gives, based on Eq. 14, the objective
(19) 
For the MLP parameterization of the energy, the term is an acyclic directed computation graph mapping
, which can be built with automatic differentiation. The variance
is the smoothing hyperparameter of the Parzen kernel, which we choose by optimizing the test likelihood of the Parzen density over a validation set in our experiments. The objective is then optimized:
(20) 
Hyperparameter — Parzen score matching gave us an objective for energy estimation that did not involve estimating the Hessian of the energy. In addition, there is also an automatic regularization in DEEN because of the hyperparameter of the Parzen smoothed density. Note that is dataset dependent: as the dataset size increases, decreases. Also, is smaller for datasets with lower complexity (measured by entropy). However, the theoretical limit is never within reach for complex datasets due to the curse of dimensionality as has to grow exponentially in dimension .
SGD optimization of — For the dataset , is approximated by constructing the joint set :
(21)  
(22) 
where are iid samples from the smoothing Kernel , i.e. data with noise added. The objective is then approximated by an empirical version:
(23) 
After constructing the set with (for a better approximation), we would arrive at a “deterministic” objective, which is optimized with SGD with two sources of stochasticity: the random choice of mini batches and . In our doubleSGD implementation we set the mini batch size of the second set to be one, i.e. a single noisy observation for each data point. We emphasize that the computations are scalable, unlike a straightforward application of the original score matching.
DEEN as PoE— In our MLP architecture, the energy is constructed linearly from the last hidden layer :
(24) 
where the sums are over the hidden units in the last layer. The factorization corresponds to the products of experts (PoE) where
are the experts parametrized by a neural network. PoE is powerful in partitioning the space in a distributed representation formed by
, in contrast to mixture models that partition the space with one region per expert (Hinton, 1999). Despite PoE’s immense potentials in beating the curse of dimensionality, the estimation of the derivative of the partition function plagues the learning, and approximations like the contrastive divergence (Hinton, 2006; Hinton et al., 2006) has yet to overcome the curse of inference for complex distributions. In DEEN, powered by score matching, no such approximations are made. In this work, all the experts share parameters, but our framework contains “mixtures of DEENs”, where the energy is constructed by averaging the outputs of neural networks if we loosely consider each hidden unit as a neural network in itself. Although a single DEEN is by itself a universal estimator, but imposing structures by breaking the neural network into subnetworks would help us better understand the experts specilizations, which presumably will be crucial for representation learning with the emergent PoE.DEEN and denoising autoencoders— An important result in the score matching literature is due to Alain & Bengio (2014), who discovered a strong and surprising link between regularized (contractive/denoising) autoencoders and the score function itself. Denoting as the optimal reconstruction of the regularized autoencoder, they arrived at the result:
(25) 
which is similar to the singlestep denoising interpretation of DEEN on the surface. But DEEN is not an autoencoder, it learns the energy and thus the score function without a decoder, and as we emphasized, the optimal in the kernel density estimator will not be small for any complex dataset. In addition, denoising autoencoders have stability issues in learning the score function, which we will elaborate further shortly. We think DEEN is better regularized in that respect, since it has no constraint on which is a hyperparameter (dataset dependent). In fact, a natural estimator for is obtained by optimizing the test likelihood of the Parzen density over a validation set.
Energy estimation vs score estimation— What if, alternatively, one is only interested in estimating the score function and not the energy? A multilayer perceptron that maps to — referred here as deep score matching (DSM) — can be trained to learn the score by optimizing the objective given in Eq. 17. However, the “direct” score function estimation with MLPs are not robust — this problem was first observed in Alain & Bengio (2014) for denoising autoencoders (see their Fig. 7). We provide a detailed diagnosis of this problem in our 2D experiments below. DEEN does not suffer from such stability issues of the score function estimation by construction.
4 Experiments
Twodimensional experiments— We estimated the energy function of a spiral toy data set as well as a Gaussian mixture data set presented in Fig. 1(a)–(d). The MLP architecture is fully connected with three hidden layers of tanh nonlinearities, and a linear last layer: , . DEEN estimated the energy accurately with wellbehaved training curves.
Failures of estimating the score function directly— We experimented with learning the score function directly in our DSM network discussed at the end of Sec. 3. Whether the learned vector field can be written as a gradient of a scalar field can be diagnosed and visualized in 2D by using the curl operator: . The identity is a necessary and sufficient condition for the existence of a scalar field such that : checking is therefore an easy sanity check that the learned score must pass. But, in practice, training the MLP with the DSM objective learns a vector field such that , far from zero in fact as we show in our experiments in Fig. 1 (g)(h). Note that DEEN is immune to this type of instability, as the score function is computed from the learned energy function itself.
DEEN on MNIST— In MNIST experiments, the hyperparameter was first chosen by maximizing the test likelihood of the Parzen estimator. Fig. 2(a) shows the running average of vs. SGD iterations for . We tested DEEN on singlestep denoising (SSD) experiments, presented in Fig. 2 (b)(c). Theoretically, SSD should be done with the same value of , but we also tested it with different values, presented here are the experiments with . The MLP architecture is the same as 2D experiments but with .
DEEN on natural images— We tested DEEN’s performance on patches extracted from the van Hateren database of natural images (Van Hateren & van der Schaaf, 1998). This time, after training, we also tested the denoising for additive patchdependent noise where . Fig. 3(a) shows examples of the noisy patches in the test set. Fig. 3(b) is the denoising by median + Gaussian filters, a baseline in signal processing (Gonzalez & Woods, 2012). Fig. 3 (c) is the singlestep denoising from the energy function. The err/pixel reported here is equal to normalized by the number of pixels. The MLP architecture here is again fully connected, but with five hidden layers of tanh nonlinearities, and
. All experiments presented so far were implemented in Tensorflow
(Abadi et al., 2016) using Adam (Kingma & Ba, 2014) as the optimizer with the learning rate of 0.001.CD failures— We also experimented with an alternative formulation of score matching by means of contrastive divergence (CD) with Langevin updates (Hyvärinen, 2007). However, the equivalence of Langevin dynamics CD and score matching is only exact in asymptotic regimes, and in our 2D experiments the CD updates showed severe mode collapse that we report here. We trained the same MLP architecture as the DEEN experiments in 2D using the CD updates with negative samples:
(26)  
(27) 
where , and is the learning rate set to . The above dynamics does not optimize a welldefined objective, and in our experiments, we observed severe mode collapse in energy estimation for both mixtures of Gaussians and spiral. Examples of the failures are reported in Fig. 1 (i)(j). We did some experimentations with the hyperparameter , but the experiments shown here were for the same value of as in the other 2D experiments in that figure.
5 Discussion
Our main contribution— We presented a hierarchical, yet efficient solution for approximating the energy of data distributions. This was done in terms of a multilayer perceptron, thus allowing universal approximation. The objective is principled, rooted in score matching with consistent estimation guarantees, and the training has the complexity of backpropagation. We further diagnosed instabilities of the denoising autoencoders in learning the score function. DEEN is immune to those particular instabilities since the score function is directly computed from the learned energy function.
Scalable unsupervised learning algorithms—
Our approach fits within the general movement, prominent in recent years, which employs the SGD’s “unreasonable effectiveness” in training deep neural networks at the service of unsupervised learning. An important step is to formulate a learning objective that allows automatic differentiation and some variant of backpropagation to work with linear complexity in the number of dimensions and parameters. With score matching, we had to resort to the Bayesian interpretation and the Parzen kernel formulation of score matching to achieve this.
Different aspects of unsupervised learning— It is important to notice that unsupervised learning is not a single welldefined problem; in fact there is a number of different goals, and different methods excel in different goals (see for example Theis et al. (2015) for discussions on generative models.) One example is the problem of generative modeling and generative adversarial networks are designed for this task (Goodfellow et al., 2014). Another goal is posterior inference for hierarchical graphical models. Variational autoencoders were designed to tackle this problem with a variational objective (Kingma & Welling, 2013). Another problem is the problem of disentangling factors of variation, i.e. the problem of nonlinear ICA. This problem was recently reformulated for time series, solving the nonidentifiability problem (Hyvärinen & Morioka, 2016). In our work here, we looked at another important problem in unsupervised learning, density estimation of unnormalized densities. The score matching objective which is the foundation for our work is surprisingly simple, but its power is displayed when the energy is parameterized by a deep neural network.
Future directions—
Two important applications of energy estimators include semisupervised learning and generative modeling. Presumably, the network has learned a useful representation of the data in the last hidden layer(s), and such a hidden representation would enable very efficient supervised learning. In that regard, understanding the PoE underlying DEEN is an important next step. In addition, the energy function could be directly pluggedin in various MCMC methods. The second problem is perhaps harder since Hamiltonian Monte Carlo (HMC) methods are typically quite slow. But there are important new developments on the HMC front
(Girolami & Calderhead, 2011; Levy et al., 2017), which DEEN can build on. Among closely related work to our present work, we should mention noisecontrastive estimation (NCE), which also has the potential for estimating the energy function using general function approximation by deep learning
(Gutmann & Hyvärinen, 2012). A further important avenue would be to see if NCE provides energy estimates of the same quality as score matching; presumably, an intelligent way of choosing the “noise” distribution may be important.Stein contrastive divergence— In preparing this submission, we came across the work by Liu & Wang (2017) and the Stein contrastive divergence introduced there for learning energy models. There are deep connections between score matching and their approach, which was emphasized in their paper, where at the same time they also avoided secondorder derivatives. However, Stein contrastive divergence is fundamentally based on infinitesimal negativesample generations and CDtype updates (Eq. 13 and Algorithm 1 in the reference). DEEN by contrast is fundamentally based on backpropagation and SGD on an objective. It is important to understand the possible links between the two frameworks and compare the performance and the efficiency of the two algorithms.
References
 Abadi et al. (2016) Abadi, Martín, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, et al. Tensorflow: A system for largescale machine learning. In OSDI, volume 16, pp. 265–283, 2016.
 Alain & Bengio (2014) Alain, Guillaume and Bengio, Yoshua. What regularized autoencoders learn from the datagenerating distribution. Journal of Machine Learning Research, 15(1):3563–3593, 2014.
 Bengio (2009) Bengio, Yoshua. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
 Bengio et al. (2013) Bengio, Yoshua, Courville, Aaron, and Vincent, Pascal. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
 Besag (1974) Besag, Julian. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B, pp. 192–236, 1974.
 Girolami & Calderhead (2011) Girolami, Mark and Calderhead, Ben. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 73(2):123–214, 2011.
 Gonzalez & Woods (2012) Gonzalez, Rafael C and Woods, Richard E. Digital image processing. Prentice Hall, 2012.
 Goodfellow et al. (2014) Goodfellow, Ian, PougetAbadie, Jean, Mirza, Mehdi, Xu, Bing, WardeFarley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Gutmann & Hyvärinen (2012) Gutmann, Michael U and Hyvärinen, Aapo. Noisecontrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(Feb):307–361, 2012.
 Hinton et al. (2006) Hinton, Geoffrey, Osindero, Simon, Welling, Max, and Teh, YeeWhye. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive science, 30(4):725–731, 2006.
 Hinton (1999) Hinton, Geoffrey E. Products of experts. 1999.
 Hinton (2006) Hinton, Geoffrey E. Training products of experts by minimizing contrastive divergence. Training, 14(8), 2006.
 Hinton & Sejnowski (1986) Hinton, Geoffrey E and Sejnowski, Terrence J. Learning and relearning in Boltzmann machines. Parallel Distributed Processing, 1, 1986.
 Hyvärinen (2005) Hyvärinen, Aapo. Estimation of nonnormalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.
 Hyvärinen (2007) Hyvärinen, Aapo. Connections between score matching, contrastive divergence, and pseudolikelihood for continuousvalued variables. IEEE Transactions on Neural Networks, 18(5):1529–1531, 2007.
 Hyvärinen (2008) Hyvärinen, Aapo. Optimal approximation of signal priors. Neural Computation, 20(12):3087–3110, 2008.

Hyvärinen & Morioka (2016)
Hyvärinen, Aapo and Morioka, Hiroshi.
Unsupervised feature extraction by timecontrastive learning and nonlinear ICA.
In Advances in Neural Information Processing Systems, pp. 3765–3773, 2016.  Kingma & Ba (2014) Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma & LeCun (2010) Kingma, Diederik P and LeCun, Yann. Regularized estimation of image statistics by score matching. In Advances in neural information processing systems, pp. 1126–1134, 2010.
 Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Autoencoding variational Bayes. arXiv:1312.6114, 2013.
 Levy et al. (2017) Levy, Daniel, Hoffman, Matthew D, and SohlDickstein, Jascha. Generalizing Hamiltonian Monte Carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.
 Liu & Wang (2017) Liu, Qiang and Wang, Dilin. Learning deep energy models: contrastive divergence vs. amortized MLE. arXiv preprint arXiv:1707.00797, 2017.

Lyu (2009)
Lyu, Siwei.
Interpretation and generalization of score matching.
In
Proceedings of the TwentyFifth Conference on Uncertainty in Artificial Intelligence
, pp. 359–366. AUAI Press, 2009.  Martens et al. (2012a) Martens, James, Sutskever, Ilya, and Swersky, Kevin. Estimating the hessian by backpropagating curvature. arXiv:1206.6464, 2012a.
 Martens et al. (2012b) Martens, James, Sutskever, Ilya, and Swersky, Kevin. Estimating the Hessian by backpropagating curvature. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012b.
 Raphan & Simoncelli (2011) Raphan, Martin and Simoncelli, Eero P. Least squares estimation without priors or supervision. Neural Computation, 23(2):374–420, 2011.
 SohlDickstein et al. (2011) SohlDickstein, Jascha, Battaglino, Peter B, and DeWeese, Michael R. New method for parameter estimation in probabilistic models: minimum probability flow. Physical Review Letters, 107(22):220601, 2011.
 Theis et al. (2015) Theis, Lucas, Oord, Aäron van den, and Bethge, Matthias. A note on the evaluation of generative models. arXiv:1511.01844, 2015.
 Van Hateren & van der Schaaf (1998) Van Hateren, J Hans and van der Schaaf, Arjen. Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B: Biological Sciences, 265(1394):359–366, 1998.
 Vincent (2011) Vincent, Pascal. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
Comments
There are no comments yet.