Learning Generative Models using Denoising Density Estimators

Learning generative probabilistic models that can estimate the continuous density given a set of samples, and that can sample from that density, is one of the fundamental challenges in unsupervised machine learning. In this paper we introduce a new approach to obtain such models based on what we call denoising density estimators (DDEs). A DDE is a scalar function, parameterized by a neural network, that is efficiently trained to represent a kernel density estimator of the data. Leveraging DDEs, our main contribution is a novel approach to obtain generative models that sample from given densities. We prove that our algorithms to obtain both DDEs and generative models are guaranteed to converge to the correct solutions. Advantages of our approach include that we do not require specific network architectures as in normalizing flows, ordinary differential equation solvers as in continuous normalizing flows, or adversarial training as in generative adversarial networks (GANs). Finally, we provide experimental results that demonstrate practical applications of our technique.
Learning generative probabilistic models from raw data is one of the fundamental problems in unsupervised machine learning. The defining property of such models is that they provide functionality to sample from the probability density represented by the input data. In other words, such models can generate new content, which has applications in image and video synthesis, for example. In addition, generative probabilistic models may include capabilities to perform density estimation or inference of latent variables. Recently, the use of deep neural networks has led to significant advances in this area. For example, generative adversarial networks (Goodfellow et al., 2014) can be trained to sample very high dimensional densities, but they provide neither density estimation nor inference. Inference in Boltzmann machines (Salakhutdinov and Hinton, 2009) is tractable only under approximations (Welling and Teh, 2003). Variational autoencoders (Kingma and Welling, 2014) provide functionality for both (approximate) inference and sampling. Finally, normalizing flows (Dinh et al., 2014) perform all three operations (sampling, density estimation, inference) efficiently.
In this paper we introduce a novel type of generative model based on what we call denoising density estimators (DDEs), which supports efficient sampling and density estimation. Our approach to constructing a sampler is straightforward: assuming we have a density estimator that can be efficiently trained and evaluated, we learn a sampler by forcing its generated density to match the input data density, minimizing their Kullback-Leibler (KL) divergence. A core component of this approach is the density estimator, which we derive from the theory of denoising autoencoders, hence our term denoising density estimator. Compared to normalizing flows, a key advantage of our theory is that it does not require any specific network architecture, except differentiability, and we do not need to solve ordinary differential equations (ODEs) as in continuous normalizing flows. In contrast to GANs, we do not require adversarial training. In summary, our contributions are as follows:
A novel approach to obtain a generative model by explicitly estimating the energy (un-normalized density) of the generated and true data distributions and minimizing the statistical divergence of these densities.
A density estimator based on denoising autoencoders called denoising density estimator (DDE), and its parameterization using neural networks, which we leverage to train our novel generative model.
Generative adversarial networks (Goodfellow et al., 2014) are currently the most widely studied type of generative probabilistic model for very high dimensional data such as images or videos. However, they are often difficult to train in practice, they can suffer from mode collapse, and they only support sampling, but neither inference nor density estimation. Hence, there has been renewed interest in alternative approaches to learning generative models. A common approach is to formulate these models as mappings between a latent space and the data domain, and one way to categorize these techniques is to consider the constraints on this mapping. For example, in normalizing flows (Dinh et al., 2014; Rezende and Mohamed, 2015) the mapping is invertible and differentiable, such that the data density can be estimated using the determinant of its Jacobian, and inference can be performed by applying the inverse mapping. Normalizing flows can be trained simply using maximum likelihood estimation (Dinh et al., 2017). The challenge for these techniques is to design computational structures whose inverses and Jacobians, including their determinants, can be computed efficiently (Huang et al., 2018; Kingma and Dhariwal, 2018). Chen et al. (2018) and Grathwohl et al. (2019) derive continuous normalizing flows by parameterizing the dynamics (the time derivative) of an ODE using a neural network. They show that this implies that the time derivative of the log density can also be expressed as an ODE, which involves only the trace (not the determinant) of the Jacobian of the network. This makes it possible to use arbitrary network architectures to obtain normalizing flows, but it comes at the computational cost of solving ODEs to produce outputs.
In contrast, in variational techniques the relation between the latent variables and the data is probabilistic, usually expressed as a Gaussian likelihood function. Hence computing the marginal likelihood requires integration over the latent space. To make this tractable, it is common to bound the marginal likelihood using the evidence lower bound (Kingma and Welling, 2014). As an advantage over normalizing flows, variational methods do not require an invertible mapping between latent and data space. However, Gaussian likelihood functions correspond to a squared reconstruction error, which arguably leads to blurriness artifacts. Recently, Li and Malik (2018) have shown that an approximate form of maximum likelihood estimation, which they call implicit maximum likelihood estimation, can also be performed without requiring invertible mappings. A disadvantage of their approach is that it requires nearest neighbor queries in (high dimensional) data space.
Not all generative models include a latent space, for example autoregressive models (van den Oord et al., 2016) or denoising autoencoders (DAEs) (Alain and Bengio, 2014). In particular, Alain and Bengio (2014) and Saremi and Hyvärinen (2019) use the well known relation between DAEs and the score of the corresponding data distributions (Vincent, 2011; Raphan and Simoncelli, 2011) to construct an approximate Markov chain sampling procedure. Similarly, Bigdeli and Zwicker (2017) and Bigdeli et al. (2017) use DAEs to learn the gradient of image densities for optimizing maximum a-posteriori problems in image restoration. Our approach also builds on DAEs, but we formulate an estimator for the un-normalized, scalar density, rather than for the score (a vector field). This is crucial to allow us to train a generator instead of requiring Markov chain sampling, which has the disadvantages of sequential sampling and correlated samples. In concurrent work, Song and Ermon (2019) formulate a generative model using Langevin dynamics based on estimating gradients of the data distribution via score matching, which also requires an iterative sampling procedure and can sample the data density exactly only asymptotically. Table 1 summarizes the similarities and differences of our approach with respect to generative adversarial networks (GAN), score matching (SM), and normalizing flows (NF).
| | Ours | SM | NF | GAN |
| --- | --- | --- | --- | --- |
| Forward sampling model | ✓ | iterative | ✓ | ✓ |
| Free net architecture | ✓ | ✓ | - | ✓ |
Here we show how to estimate a density using a variant of denoising autoencoders (DAEs). More precisely, our approach allows us to obtain the density smoothed by a Gaussian kernel, which is equivalent to kernel density estimation (Parzen, 1962), up to a normalizing factor. Originally, the optimal DAE (Vincent, 2011; Alain and Bengio, 2014) is defined as the function $r^*$ minimizing the following denoising loss,

$$\mathcal{L}_{\mathrm{DAE}}(r) = \mathbb{E}_{x,n}\left[\|x - r(x+n)\|^2\right], \tag{1}$$

where the data $x$ is distributed according to a density $p(x)$ over $\mathbb{R}^d$, and $n \sim \mathcal{N}(0, \sigma^2 I_d)$ represents $d$-dimensional, isotropic additive Gaussian noise with variance $\sigma^2$. It has been shown (Robbins, 1956; Raphan and Simoncelli, 2011; Bigdeli and Zwicker, 2017) that the optimal DAE minimizing $\mathcal{L}_{\mathrm{DAE}}$ can be expressed as follows, which is also known as Tweedie's formula,

$$r^*(x) = x + \sigma^2 \nabla_x \log[p * k_\sigma](x), \tag{2}$$

where $\nabla_x$ is the gradient with respect to the input $x$, and $p * k_\sigma$ denotes the convolution between the data distribution $p$ and the noise distribution $k_\sigma = \mathcal{N}(0, \sigma^2 I_d)$. Inspired by this result, we reformulate the DAE loss as a noise estimation loss,

$$\mathcal{L}_{\mathrm{NE}}(g) = \mathbb{E}_{x,n}\left[\|n + \sigma^2 g(x+n)\|^2\right]. \tag{3}$$

Proposition 1. There is a unique minimizer $g^*$ of Eq. 3 that satisfies

$$g^*(x) = \nabla_x \log[p * k_\sigma](x). \tag{4}$$

That is, the optimal noise estimator corresponds to the gradient of the logarithm of the Gaussian-smoothed density $p * k_\sigma$, that is, the score of the smoothed density.
A key observation is that the desired vector field $g^*$ is the gradient of a scalar function and hence conservative. We can therefore write the noise estimation loss in terms of a scalar function $s$ instead of the vector field $g$, which we call the denoising density estimation loss,

$$\mathcal{L}_{\mathrm{DDE}}(s) = \mathbb{E}_{x,n}\left[\|n + \sigma^2 \nabla_x s(x+n)\|^2\right]. \tag{5}$$
A similar formulation has recently been proposed by Saremi and Hyvärinen (2019). Our terminology is motivated by the following corollary:
Corollary 1. The minimizer $s^* = \operatorname{argmin}_s \mathcal{L}_{\mathrm{DDE}}(s)$ satisfies

$$s^*(x) = \log[p * k_\sigma](x) + C$$

with some constant $C$, where $p * k_\sigma$ is the Gaussian-smoothed data density.

Proof. From Proposition 1 and the definition of $\mathcal{L}_{\mathrm{DDE}}$ we know that $\nabla_x s^*(x) = g^*(x) = \nabla_x \log[p * k_\sigma](x)$, which leads immediately to the corollary. ∎
In summary, we have shown how modifying the denoising autoencoder loss (Eq. 1) into a noise estimation loss based on the gradients of a scalar function (Eq. 5) allows us to derive a density estimator (Corollary 1), which we call the denoising density estimator (DDE).
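As a quick sanity check of Proposition 1, consider 1D standard normal data, where the score of the smoothed density is available in closed form. The following sketch (our illustration, not code from the paper) compares the optimal noise estimator, recovered empirically from the conditional mean of the noise, with that closed-form score:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
N = 2_000_000

x = rng.standard_normal(N)          # data from p = N(0, 1)
n = sigma * rng.standard_normal(N)  # isotropic Gaussian noise
y = x + n                           # noisy samples, distributed as p * k_sigma = N(0, 1 + sigma^2)

# The optimal noise estimator is g*(y) = -E[n | y] / sigma^2.
# Estimate E[n | y ~ y0] empirically in a narrow bin around y0.
y0 = 0.8
mask = np.abs(y - y0) < 0.02
g_empirical = -n[mask].mean() / sigma**2

# Proposition 1: g*(y0) equals the score of the smoothed density N(0, 1 + sigma^2).
g_closed_form = -y0 / (1.0 + sigma**2)

print(g_empirical, g_closed_form)  # the two values agree closely
```

With enough samples the empirical estimate matches the analytic score, which is exactly the content of the proposition in this toy case.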
In practice, we approximate the DDE using a neural network $s_\theta(x)$. Assuming that the network has enough capacity and is everywhere differentiable, both with respect to $x$ and its parameters $\theta$, we can find the unique minimum of Eq. 5 using standard stochastic gradient descent techniques. For illustration, Figure 1 shows 2D distribution examples, which we approximate using a DDE implemented as a multi-layer perceptron. We only use Softplus activations in our networks, since Softplus is differentiable everywhere.
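To make the loss concrete, the toy sketch below (ours, with made-up variable names) fits a DDE with a quadratic ansatz s(x) = a*x^2 + b*x to 1D Gaussian data. The DDE loss is then linear in (a, b), so the minimizer can be found by least squares instead of SGD, and compared against the closed-form logarithm of the smoothed density, as Corollary 1 predicts:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
N = 500_000

x = rng.standard_normal(N)          # data from p = N(0, 1)
n = sigma * rng.standard_normal(N)
y = x + n                           # noisy samples

# Quadratic DDE: s(x) = a*x^2 + b*x, so grad_x s(x) = 2*a*x + b.
# The DDE loss E||n + sigma^2 * grad_x s(x + n)||^2 is linear in (a, b),
# so its minimizer solves a least-squares problem.
A = sigma**2 * np.stack([2 * y, np.ones_like(y)], axis=1)
a, b = np.linalg.lstsq(A, -n, rcond=None)[0]

# Corollary 1: s* = log[p * k_sigma] + const. Here p * k_sigma = N(0, 1 + sigma^2),
# whose log-density has quadratic coefficient -1 / (2 * (1 + sigma^2)) and no linear term.
print(a, -1.0 / (2 * (1 + sigma**2)), b)  # a matches the coefficient, b is near zero
```

In the paper the DDE is a general MLP trained by SGD; the quadratic model is only chosen here so the optimum is known in closed form.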
By leveraging DDEs, our key contribution is to formulate a novel training algorithm to obtain generators for given densities, which can be represented by a set of samples or as a continuous function. In either case, we denote the smoothed data density $q_\sigma = q * k_\sigma$, which is obtained by training a DDE in case the input is given as a set of samples, as described in Section 3. We express our samplers using mappings $G : \mathbb{R}^l \to \mathbb{R}^d$ (usually $l \le d$), where $z \in \mathbb{R}^l$ is typically a latent variable with a standard normal distribution. In contrast to normalizing flows, $G$ does not need to be invertible. Let us denote the distribution of $G(z)$ induced by the generator as $p_G$, and its Gaussian-smoothed version as $p_{G,\sigma} = p_G * k_\sigma$.
We obtain the generator by minimizing the KL divergence between the smoothed density $p_{G,\sigma}$ induced by the generator and the smoothed data density $q_\sigma$. Our algorithm is based on the following observation:
Proposition 2. Given the smoothed generated density $p_{G,\sigma}$, the smoothed data density $q_\sigma$, and a scalar function $\epsilon(x)$ that satisfies the following conditions:

1. $\langle \epsilon, \log(p_{G,\sigma}/q_\sigma) \rangle < 0$,
2. $\int \epsilon(x)\,dx = 0$,
3. $p_{G,\sigma} + \epsilon$ remains a valid (non-negative) density,

then $KL(p_{G,\sigma} + \epsilon \,\|\, q_\sigma) < KL(p_{G,\sigma} \,\|\, q_\sigma)$ for small enough $\epsilon$.
Proof. We use the first-order approximation $\log\frac{p_{G,\sigma}+\epsilon}{p_{G,\sigma}} \approx \frac{\epsilon}{p_{G,\sigma}}$, where the division is pointwise. Using $\langle \cdot,\cdot\rangle$ to denote the inner product $\langle f, g\rangle = \int f(x)g(x)\,dx$, we can write

$$KL(p_{G,\sigma}+\epsilon \,\|\, q_\sigma) = \left\langle p_{G,\sigma}+\epsilon,\, \log\tfrac{p_{G,\sigma}+\epsilon}{q_\sigma} \right\rangle \approx KL(p_{G,\sigma}\,\|\,q_\sigma) + \left\langle \epsilon, \log\tfrac{p_{G,\sigma}}{q_\sigma} \right\rangle + \left\langle p_{G,\sigma}, \tfrac{\epsilon}{p_{G,\sigma}} \right\rangle + \left\langle \epsilon, \tfrac{\epsilon}{p_{G,\sigma}} \right\rangle < KL(p_{G,\sigma}\,\|\,q_\sigma),$$

because the first term after $KL(p_{G,\sigma}\|q_\sigma)$ on the right-hand side is negative (first assumption), the second term equals $\int \epsilon(x)\,dx = 0$ (second assumption), and the last term is quadratic in $\epsilon$ and can be ignored when $\epsilon$ is small enough. ∎
Based on the above observation, Algorithm 1 minimizes the divergence $KL(p_{G,\sigma}\,\|\,q_\sigma)$ between the smoothed generated and data densities by iteratively computing updated densities $p_{G,\sigma} + \epsilon$ that satisfy the conditions from Proposition 2, hence the divergence decreases in each iteration. This iteration is guaranteed to converge to a global minimum, because the KL divergence is convex as a function of its first argument $p_{G,\sigma}$.
At the beginning of each iteration in Algorithm 1, by definition $p_G$ is the density obtained by sampling our generator $G(z)$, with $z$ drawn from an $l$-dimensional standard normal distribution, and the generator is a neural network with parameters $\theta$. In addition, $p_{G,\sigma}$ is defined as the density obtained by sampling $G(z) + n$. Finally, the generator DDE correctly estimates $p_{G,\sigma}$, that is, it equals $\log p_{G,\sigma}(x) + C$ for some constant $C$.
In each iteration, we update the generator such that its density is changed by a small $\epsilon$ that satisfies the conditions from Proposition 2. We achieve this by computing a gradient descent step of $KL(p_{G,\sigma}\,\|\,q_\sigma)$, expressed via the two DDEs, with respect to the generator parameters $\theta$. The constant $C$ can be ignored since we only need the gradient. A small enough learning rate guarantees that condition one in Proposition 2 is satisfied. The second condition is satisfied because we update the distribution by updating its generator, so the updated distribution still integrates to one, and the third condition is also satisfied under a small enough learning rate (assuming the generator network is Lipschitz continuous). After updating the generator, we update the DDE to correctly estimate the new density produced by the updated generator.
Note that it is crucial in the first step of the iteration in Algorithm 1 that we sample using $G(z) + n$ and not $G(z)$. This allows us, in the second step, to use the updated samples to train a DDE that exactly (up to a constant) matches the smoothed density $p_{G,\sigma}$ generated by $G$. Even though in this approach we only minimize the KL divergence with the "noisy" input density $q_\sigma$, the sampler still converges to a sampler of the underlying density $q$ in theory (Section 4.1).
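The iteration can be illustrated end to end in a 1D toy setting. In the sketch below (our own construction, not the paper's implementation), the generator is linear, both DDEs use a quadratic ansatz so each refit is a closed-form least-squares solve, and the generator follows the gradient of the KL divergence expressed through the two DDEs:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, lr, batch = 0.5, 0.05, 8000

def fit_dde(samples):
    """Closed-form DDE fit with a quadratic ansatz s(x) = a*x^2 + b*x (1D toy).

    Minimizes E||n + sigma^2 * s'(x + n)||^2 over (a, b) by least squares."""
    n = sigma * rng.standard_normal(samples.shape)
    y = samples + n
    A = sigma**2 * np.stack([2 * y, np.ones_like(y)], axis=1)
    return np.linalg.lstsq(A, -n, rcond=None)[0]  # coefficients (a, b)

def s_prime(coeff, x):
    a, b = coeff
    return 2 * a * x + b

# Target density q = N(2, 1); its DDE is trained once up front.
data = 2.0 + rng.standard_normal(200_000)
dde_data = fit_dde(data)

# Generator G(z) = w*z + m with latent z ~ N(0, 1).
w, m = 0.5, 0.0
for _ in range(800):
    # Step 1: refit the generator DDE on fresh noisy generator samples G(z) + n.
    z = rng.standard_normal(batch)
    dde_gen = fit_dde(w * z + m)

    # Step 2: descend KL(smoothed generated || smoothed data) w.r.t. the generator
    # parameters; its gradient only needs the difference of the two DDE derivatives.
    z = rng.standard_normal(batch)
    xg = w * z + m + sigma * rng.standard_normal(batch)
    diff = s_prime(dde_gen, xg) - s_prime(dde_data, xg)
    w -= lr * np.mean(diff * z)   # d(xg)/dw = z
    m -= lr * np.mean(diff)       # d(xg)/dm = 1

print(w, m)  # the generator converges to w ~ +/-1, m ~ 2, i.e. it samples N(2, 1)
```

The quadratic ansatz replaces the neural DDEs only to keep the example fully self-contained; the alternation between DDE refits and generator steps mirrors the structure of Algorithm 1.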
Our objective involves reducing the KL divergence between the Gaussian-smoothed generated density $p_{G,\sigma}$ and the smoothed data density $q_\sigma$. This also implies that the density $p_G$ obtained from sampling the generator is identical to the data density $q$, without Gaussian smoothing, which can be expressed as the following corollary:

Corollary 2. Let $p_{G,\sigma}$ and $q_\sigma$ be related to the densities $p_G$ and $q$, respectively, via convolution with a Gaussian $k_\sigma$, that is, $p_{G,\sigma} = p_G * k_\sigma$ and $q_\sigma = q * k_\sigma$. Then the smoothed densities $p_{G,\sigma}$ and $q_\sigma$ are the same if and only if the data density $q$ and the generated density $p_G$ are the same.

Proof. This follows immediately from the convolution theorem and the fact that the Fourier transform of a Gaussian is non-zero everywhere; that is, Gaussian blur is invertible. ∎
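The invertibility of Gaussian smoothing can be demonstrated numerically: on a periodic grid, the FFT of the Gaussian kernel is nowhere zero, so dividing in the Fourier domain undoes the blur up to floating-point error. A small sketch (assuming periodic boundary conditions, which the FFT imposes):

```python
import numpy as np

n = 128
xs = np.linspace(0, 1, n, endpoint=False)

# An arbitrary unnormalized density with two modes.
p = np.exp(-0.5 * ((xs - 0.3) / 0.05) ** 2) + 0.5 * np.exp(-0.5 * ((xs - 0.7) / 0.08) ** 2)

# Periodic Gaussian kernel centered at index 0, so circular convolution keeps alignment.
sigma = 0.015
d = np.minimum(xs, 1 - xs)
k = np.exp(-0.5 * (d / sigma) ** 2)
k /= k.sum()

p_smooth = np.real(np.fft.ifft(np.fft.fft(p) * np.fft.fft(k)))      # p * k
p_rec = np.real(np.fft.ifft(np.fft.fft(p_smooth) / np.fft.fft(k)))  # deconvolution

print(np.abs(p_smooth - p).max())  # the blur visibly changes the density
print(np.abs(p_rec - p).max())     # ...but is inverted almost exactly
```

As the corollary notes, this inversion is exact in theory; the later experiments show that it becomes numerically ill-posed once the smoothing is too strong relative to the features of the density.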
Similar to prior work (Grathwohl et al., 2019), we perform experiments for 2D density estimation and visualization over three datasets. Additionally, we use these datasets to learn generative models. For our DDE networks, we use multi-layer perceptrons with residual connections. All networks have 25 layers, each with 32 channels and Softplus activations. Training uses 2048 samples per iteration. Figure 1 compares our method with Glow (Kingma and Dhariwal, 2018), BNAF (De Cao et al., 2019), and FFJORD (Grathwohl et al., 2019). Our DDEs estimate the densities accurately and capture the underlying complexity of each density. Due to the inherent kernel density estimation (KDE), our method induces a small blur in the estimated density compared to BNAF. However, our DDE estimates the density coherently across the data domain, whereas BNAF produces a noisy approximation across the data manifold, where the estimated density is sometimes too small or too large. To illustrate this trade-off, we show DDEs trained with both a small and a large noise standard deviation, $\sigma = 0.05$ and $\sigma = 0.2$.
Generator training and sampling are demonstrated in Figure 1. The sharp edges of the checkerboard samples imply that the generator learns to sample from the target density even though the DDEs estimate noisy (smoothed) densities. The generator update requires the DDE networks to be near-optimal at each gradient step; for faster convergence, we take 10 DDE gradient descent steps for each generator update. In Figure 2 we illustrate the influence of the noise level $\sigma$ on the generated densities. This shows that in practice larger $\sigma$ do not lead to accurate sampling, since inverting the Gaussian blur becomes ill-posed. We summarize the training parameters used in these experiments in Appendix E.
|Real samples||Glow||BNAF||FFJORD||Ours ($\sigma = 0.05$)||Ours ($\sigma = 0.2$)||Ours generated|
Figure 3 illustrates our generative training on the MNIST (LeCun, 1998) dataset using Algorithm 1. We use a dense block architecture with fully connected layers here and refer to Appendix B for the network and training details, including additional results for Fashion-MNIST (Xiao et al., 2017). Figure 3
shows qualitatively that our generator is able to replicate the underlying distributions. In addition, latent-space interpolation demonstrates that the network learns an intuitive and interpretable mapping from noise to samples of the distribution.
|(a) Generated samples||(b) Real samples|
(c) Interpolated samples using our model
Figure 4 shows additional experiments on the CelebA dataset (Liu et al., 2015). We normalize the pixel values of the dataset images to a fixed range. To show the flexibility of our algorithm with respect to neural network architectures, here we use a style-based generator (Karras et al., 2019) architecture for our generator network. Please refer to Appendix C for network and training details. Figure 4 shows that our approach can produce natural-looking images, and that the model has learned to replicate the global distribution with a diverse set of images and varied characteristics.
|(a) Generated samples||(b) Real samples|
We perform a quantitative evaluation of our approach on the synthetic Stacked-MNIST (Metz et al., 2016) dataset, which was designed to analyse mode collapse in generative models. The dataset is constructed by stacking three randomly chosen digit images from MNIST to generate samples of size 28×28×3. This augments the number of classes to $10^3 = 1000$, which are considered distinct modes of the dataset. Mode collapse can be quantified by counting the number of modes generated by a model. Additionally, the quality of the distribution can be measured by computing the KL divergence between the generated class distribution and that of the original dataset, which is uniform in terms of class labels. Similar to prior work (Metz et al., 2016), we use an external classifier to measure the number of classes that each generator produces, by separately inferring the digit class of each channel of the generated images.
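The metric itself can be sketched as follows (our illustration; the external classifier is replaced by random stand-in labels, which mimics a perfectly uniform generator):

```python
import numpy as np

def mode_stats(channel_labels, n_digits=10):
    """channel_labels: int array (batch, 3) with the classifier's digit
    prediction for each of the three stacked channels."""
    # Combine the three digit predictions into one class id in [0, 999].
    classes = (channel_labels * n_digits ** np.arange(3)).sum(axis=1)
    n_modes = np.unique(classes).size

    # Reverse KL between the generated class distribution and the uniform one.
    counts = np.bincount(classes, minlength=n_digits**3)
    q = counts / counts.sum()
    p_uniform = 1.0 / n_digits**3
    nz = q > 0
    kl = float(np.sum(q[nz] * np.log(q[nz] / p_uniform)))
    return n_modes, kl

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=(512, 3))  # stand-in for classifier output
n_modes, kl = mode_stats(labels)
print(n_modes, kl)
```

Even a perfectly uniform generator covers only a few hundred of the 1000 modes in a batch of 512 and has non-zero empirical KL, which is why the metrics are compared across methods at a fixed batch size.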
Figure 5 reports the quantitative results for this experiment by comparing our method with well-tuned GAN models. DCGAN (Radford et al., 2015) implements a basic GAN training strategy using a stable architecture. WGAN uses the Wasserstein distance (Arjovsky et al., 2017), and WGAN+GP includes a gradient penalty to regularize the discriminator (Gulrajani et al., 2017). For a fair comparison, all methods use the DCGAN network architecture. Since our method requires two DDE networks, we have used fewer parameters in the DDEs so that in total we preserve the same number of parameters and capacity as the other methods. For each method, we generate batches of 512 samples per training iteration and count the number of classes within each batch (that is, the maximum number of different labels in each batch is 512). We also plot the reverse KL-divergence to the uniform ground truth class distribution. Using the two measurements we can see how well each method replicates the distribution in terms of diversity and balance. Without fine-tuning and changing the capacity of our network models, our approach is comparable to modern GANs such as WGAN and WGAN+GP, which outperform DCGAN by a large margin in this experiment.
We also report results for sampling techniques based on score matching. We trained a Noise Conditional Score Network (NCSN) parameterized with a UNET architecture (Ronneberger et al., 2015), followed by a sampling algorithm using Annealed Langevin Dynamics (ALD) as described by Song and Ermon (2019). We refer to this method as UNET+ALD. We also implemented a model based on our approach, called DDE+ALD, where we used our DDE network. While our training loss is identical to the score matching objective, the DDE network outputs a scalar and explicitly enforces the score to be a conservative vector field by computing it as the gradient of its scalar output. DDE+ALD uses the spatial gradient of the DDE for sampling with ALD (Song and Ermon, 2019), instead of our proposed direct, one-step generator. We observe that DDE+ALD is more stable compared to the UNET+ALD baseline, even though the UNET achieves a lower loss during training. We believe that this is because DDEs guarantee conservativeness of the distribution gradients (i.e., scores), which leads to more diverse and stable data generation, as we see in Figure 5. Furthermore, our approach with direct sampling outperforms both UNET+ALD and DDE+ALD.
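For reference, annealed Langevin dynamics can be sketched as below; an analytic score of a smoothed 1D Gaussian stands in for the spatial gradient of a trained DDE, and the per-level step size follows the usual proportionality to the squared noise level (Song and Ermon, 2019):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x, sig):
    # Stand-in for the spatial gradient of a DDE trained at noise level sig:
    # the analytic score of N(2, 1) smoothed by a Gaussian, i.e. of N(2, 1 + sig^2).
    return -(x - 2.0) / (1.0 + sig**2)

x = 4.0 * rng.standard_normal(5000)      # diffuse initialization, 5000 chains
sigmas = np.geomspace(2.0, 0.1, 10)      # annealing schedule, largest to smallest
eps0 = 0.01
for sig in sigmas:
    step = eps0 * (sig / sigmas[-1]) ** 2
    for _ in range(100):
        x = x + 0.5 * step * score(x, sig) + np.sqrt(step) * rng.standard_normal(x.shape)

print(x.mean(), x.std())  # samples roughly follow N(2, 1)
```

The contrast with our method is visible in the structure of the loop: ALD needs many sequential score evaluations per sample, whereas our generator produces a sample in a single forward pass.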
|(a) Generated modes per batch||(b) KL-divergence|
We follow the experiments in BNAF (De Cao et al., 2019) for density estimation on real-world data. This includes the POWER, GAS, HEPMASS, and MINIBOONE datasets (Asuncion and Newman, 2007). Since DDEs estimate densities only up to their normalizing constant, we approximate the normalizing constant using Monte Carlo estimation for these experiments. We show average log-likelihoods over test sets and compare to state-of-the-art methods for normalized density estimation in Table 2. We have omitted results on the BSDS300 dataset (Martin et al., 2001), since we could not estimate the normalizing constant reliably (due to the high dimensionality of the data).
To train our DDEs, we used multi-layer perceptrons (MLPs) with residual connections between layers. All networks have 25 layers with 64 channels and Softplus activations, except for GAS and HEPMASS, which use 128 channels. We trained the models for 400 epochs, decaying the learning rate linearly every 100 epochs. Similarly, we started training with a larger noise standard deviation and decreased it linearly to a dataset-specific value for POWER, GAS, HEPMASS, and MINIBOONE. We estimate the normalizing constant via importance sampling, using a Gaussian proposal with the mean and variance of the DDE input distribution. We average 5 estimates using 51200 samples each (10 times more samples for GAS), and we indicate the variance of this average in Table 2.
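The normalizing-constant estimate can be sketched as follows (our illustration; a hand-picked quadratic stands in for the trained DDE so that the exact constant is known):

```python
import numpy as np

rng = np.random.default_rng(0)

def s(x):
    # Stand-in for a trained DDE: an unnormalized log-density.
    # With s(x) = -x^2/2, the true normalizing constant is sqrt(2*pi).
    return -0.5 * x**2

# Importance sampling with a Gaussian proposal g matched to the input
# distribution's moments: Z = E_{x ~ g}[ exp(s(x)) / g(x) ].
mu, var = 0.0, 2.0  # proposal moments (taken from the data in practice)
x = mu + np.sqrt(var) * rng.standard_normal(200_000)
log_g = -0.5 * (x - mu) ** 2 / var - 0.5 * np.log(2 * np.pi * var)
Z = np.mean(np.exp(s(x) - log_g))

print(Z, np.sqrt(2 * np.pi))  # estimate vs. exact value
```

Matching the proposal's moments to the data keeps the importance weights well behaved; in high dimensions this becomes unreliable, which is why the BSDS300 results are omitted.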
Table 2 compares against the following baselines: RealNVP (Dinh et al., 2017), Glow (Kingma and Dhariwal, 2018), MADE MoG (Germain et al., 2015), MAF-affine (Papamakarios et al., 2017), MAF-affine MoG (Papamakarios et al., 2017), FFJORD (Grathwohl et al., 2019), NAF-DDSF (Huang et al., 2018), TAN (Oliva et al., 2018) (12.06), and BNAF (De Cao et al., 2019) (12.06).
Our approach relies on a key hyperparameter, the noise standard deviation $\sigma$, that determines the training noise for the DDE and that we currently set manually. In the future we will investigate thorough strategies to determine this parameter in a data-dependent manner. Another challenge is to obtain high-quality results using extremely high-dimensional data such as high-resolution images. In practice, one strategy is to combine our approach with latent embedding learning methods (Bojanowski et al., 2018), in a similar fashion as proposed by Hoshen et al. (2019). Finally, our framework uses three networks to learn a generator based on input samples (a DDE for the samples, the generator, and a DDE for the generator). Our generator training approach, however, is independent of the type of density estimator, and techniques other than DDEs could also be used in this step.
In conclusion, we presented a new approach to learn generative models using a novel density estimator, called the denoising density estimator (DDE). We developed simple training algorithms, and our theoretical analysis proves their convergence to a unique optimum. Our technique is derived from a reformulation of denoising autoencoders, and it requires neither specific neural network architectures, nor ODE integration, nor adversarial training. We achieve state-of-the-art results on a standard log-likelihood evaluation benchmark compared to recent techniques based on normalizing flows, continuous flows, and autoregressive models.
- LeCun, Y. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §5.
- Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33(3), pp. 1065–1076. Cited by: §3.
- Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941. Cited by: Appendix C.
- Salakhutdinov, R., and Hinton, G. (2009). Deep Boltzmann machines. In Artificial Intelligence and Statistics, pp. 448–455. Cited by: §1.
This is a proof of Proposition 1 in the main paper.

Proof. Clearly $\mathcal{L}_{\mathrm{NE}}$ is convex in $g$, hence the minimizer is unique. We can rewrite the noise estimation loss from Equation 3 as

$$\mathcal{L}_{\mathrm{NE}}(g) = \int \int \|n + \sigma^2 g(x+n)\|^2\, p(x)\, k_\sigma(n)\, dn\, dx,$$

which we minimize with respect to the vector-valued function $g$. Substituting $y = x + n$ yields

$$\mathcal{L}_{\mathrm{NE}}(g) = \int \int \|n + \sigma^2 g(y)\|^2\, p(y-n)\, k_\sigma(n)\, dn\, dy.$$

We can minimize this pointwise with respect to $g(y)$ by differentiating and setting the derivative to zero, which leads to

$$g^*(y) = -\frac{1}{\sigma^2}\, \frac{\int n\, p(y-n)\, k_\sigma(n)\, dn}{\int p(y-n)\, k_\sigma(n)\, dn} = \nabla_y \log[p * k_\sigma](y),$$

where the last equality follows from basic calculus, using $\nabla k_\sigma(n) = -(n/\sigma^2)\, k_\sigma(n)$, and has also been used by Raphan and Simoncelli (2011). ∎
For the experiments on MNIST and Fashion-MNIST, we used the dense block architecture (Huang et al., 2017) with 15 fully connected layers and 256 additional neurons each. The last layer of the DDE network maps all its inputs to one scalar value, which we train to approximate the density of input images. For the generator network, we also used dense blocks with 15 fully connected layers and 256 additional neurons each, with a last layer that maps all outputs to the image size of 28×28. For the input of the generator, we used noise with a 16-dimensional standard normal distribution. In addition, the DDEs were trained with a fixed noise standard deviation, with pixel values scaled to range between 0 and 1.
In addition to the MNIST results, here we include visual results on the Fashion-MNIST dataset, where we have used the exact setup as in our experiments on MNIST for training our generator. Figure 6 shows our generated images and interpolations in the latent space of Fashion-MNIST.
|(a) Generated samples||(b) Real samples|
(c) Interpolated samples using our model
For our experiments on CelebA we use a style-based generator (Karras et al., 2019) architecture. We use Swish activations (Ramachandran et al., 2017) in all hidden layers of our networks, except for the last layer of each network, which we set to be linear. Additionally, we normalize each output of the generator to lie in the accepted pixel range. We used an equalized learning rate (Karras et al., 2018) for the DDEs, and a slightly lower learning rate for the generator. We trained our DDEs with a fixed noise standard deviation, and we used truncation in the style-based generator when feeding it with random noise (Karras et al., 2019) at test time.
In our experiments with Stacked-MNIST, our generative networks are trained using the Adam optimizer, and the generator update takes place after every 10th DDE step. We use standard parameters for the other methods (DCGAN, WGAN, WGAN+GP), also trained with the Adam optimizer, where the generator is trained every 5th iteration of the discriminator training.
The NCSN models are trained to remove Gaussian noise at ten different noise standard deviations spaced by geometric interpolation. The input to the NCSN models also includes the noise level. To further improve the quality of the networks, we use separate last layers for each noise standard deviation during training and test. This way we can increase the capacity of the network significantly while keeping the same order of parameters as in the other methods. We used the Adam optimizer with its original parameters.
Table 3 lists the hyper-parameters we used for the different experiments on 2D datasets.

| Experiment | Noise std. $\sigma$ | Dataset |
| --- | --- | --- |
| Figure 1 (density estimation) | 0.05, 0.2 | |
| Figure 1 (generative), Figure 2 | 0.2 | Checkerboard |