At a high level, this work brings together empirical Bayes as formulated by Robbins (1956), its recent interpretation for learning unnormalized densities (Saremi and Hyvärinen, 2019), and variational Bayes in its modern formulation (Jordan et al., 1999; Kingma and Welling, 2013; Rezende et al., 2014). This unification, as will become clear, is quite general, but it has been particularly inspired by the problem of learning the energy function of the random variable Y = X + N(0, σ²I_d) based on an i.i.d. sequence sampled from p(y) in ℝᵈ. Our plan for the introduction is to first briefly review empirical Bayes and the empirical Bayes denoising methodology for learning unnormalized densities. We then present our main contribution in bringing variational Bayes and variational autoencoders into the picture, together with a highlight of the main empirical results in the paper.
1.1 Empirical Bayes
The story goes back to (Robbins, 1956), where one starts with the i.i.d. sequence y₁, …, yₙ (the noisy observations that one measures), and the simple yet profound question is whether one can estimate x given a single noisy measurement y, using the least-squares estimator of X, assuming that we do not know the distribution of X; i.e. we only know the measurement kernel p(y|x), also known as the noise model. There are two main results. The first one is not surprising and is the fact that the least-squares estimator of X is the Bayes estimator:

x̂(y) = E[X|Y = y] = ∫ x p(x|y) dx.   (1)
The second result is remarkable and it is the fact that the Bayes estimator can be written in closed form purely based on the distribution of Y. For Gaussian kernels the estimator takes the form (Miyasawa, 1961):

x̂(y) = y + σ²∇_y log p(y).   (2)
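Miyasawa's formula can be checked numerically in a toy case where everything is available in closed form. The sketch below (our illustration; the names mu0, tau, sigma are ours) uses a scalar conjugate Gaussian model, where both the posterior mean and the score of the smoothed density are known analytically:

```python
import numpy as np

# Toy conjugate model: X ~ N(mu0, tau^2), Y = X + N(0, sigma^2).
mu0, tau, sigma = 0.5, 2.0, 1.5

def bayes_estimator(y):
    # E[X | Y = y] for the conjugate Gaussian model (posterior mean).
    return (tau**2 * y + sigma**2 * mu0) / (tau**2 + sigma**2)

def score_of_y(y):
    # d/dy log p(y), where p(y) = N(mu0, tau^2 + sigma^2) is the smoothed density.
    return -(y - mu0) / (tau**2 + sigma**2)

# Miyasawa (Eq. 2): xhat(y) = y + sigma^2 * d/dy log p(y), using only p(y), not p(x).
ys = np.linspace(-5.0, 5.0, 101)
assert np.allclose(bayes_estimator(ys), ys + sigma**2 * score_of_y(ys))
```

The point of the check is exactly the "surprising" part of the result: the right-hand side never touches p(x), only the distribution of Y.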
This result is indeed surprising since, on its face, one would expect that knowing p(x) is a must to carry out the integral in (1) in closed form, but that is in fact not necessary for a large class of noise models; see (Raphan and Simoncelli, 2011) for a literature review and a unified formalism on empirical Bayes estimators. [On notation] A derivation of Miyasawa’s result (Eq. 2) consistent with the notation here is given in (Saremi and Hyvärinen, 2019). The only difference is that in this paper we switch from denoting densities by f (used in the mathematical statistics literature) to p (common in the variational Bayes literature). As this work moves towards variational Bayes, this switch is appropriate. On the other hand, we use f to denote energy functions. In addition, we follow the convention of dropping the subscripts from densities when the arguments to the density functions are present: p(y) ≡ p_Y(y), p(y|x) ≡ p_{Y|X}(y|x), etc.
1.2 Neural Empirical Bayes
With this background on empirical Bayes, consider a different problem: approximating the density of Y based on the i.i.d. sequence x₁, …, xₙ. This is a simpler problem than (Robbins, 1956) since we now observe the clean samples. It is also a relaxation of the problem of estimating the density of X since, to say the least, p(y) is smoother than p(x). In addition, in approximating the score function ∇_y log p(y) we only require approximating log p(y) modulo an additive constant. In other words, the learned density is unnormalized.
In neural empirical Bayes—referred to as DEEN due to its origin in (Saremi et al., 2018)[1]—the learning problem is set up by parametrizing the energy function of Y with a neural network with parameters φ, denoted by f_φ(y). We remind the reader that the energy function of Y is defined as f(y) = −log p(y) modulo an additive constant. With the energy function parametrization, the Bayes estimator (Eq. 2) takes the following form:

x̂_φ(y) = y − σ²∇_y f_φ(y).

[1] “Deep Energy Estimator Networks” was based prominently on denoising score matching (Vincent, 2011), itself rooted in score matching (Hyvärinen, 2005), and with connections to denoising autoencoders (Alain and Bengio, 2014); but as expressed in (Saremi and Hyvärinen, 2019) we maintain the view that for this problem empirical Bayes is the more fundamental formulation. Also, regarding combining score matching and latent variable models, we should point out (Swersky et al., 2011; Vértes and Sahani, 2016), whose approaches are fundamentally different from ours as they are both based on Hyvärinen’s score matching: they are formulated for learning the energy function of X—not Y—and therefore algorithmically they require the computation of second-order derivatives of the energy function.
DEEN’s learning algorithm is based on the following least-squares objective, minimized with SGD:

L(φ) = E ‖x − x̂_φ(y)‖²,

where the expectation is over the empirical distribution on (x, y) pairs, with y = x + ε, ε ∼ N(0, σ²I_d).
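The learning loop above can be sketched in a deliberately tiny setting (our toy, not the paper's ConvNet): a quadratic energy f(y) = ‖y − m‖²/(2s) with a single learnable location m, so that x̂(y) = y − σ²∇f(y) and its gradients are available analytically, and SGD on the least-squares loss recovers the data mean:

```python
import numpy as np

# Data model: X ~ N(mu, tau^2), Y = X + N(0, sigma^2) (scalar, batched).
rng = np.random.default_rng(0)
mu, tau, sigma = 3.0, 1.0, 2.0
s = tau**2 + sigma**2          # fix the energy scale at its known optimum for clarity
m, lr = 0.0, 0.05              # learnable parameter and SGD step size

for _ in range(4000):
    x = mu + tau * rng.standard_normal(64)       # clean mini-batch
    y = x + sigma * rng.standard_normal(64)      # noisy mini-batch
    xhat = y - sigma**2 * (y - m) / s            # empirical Bayes estimator, Eq. (2)
    # SGD on the least-squares loss E||x - xhat||^2; here d xhat / d m = sigma^2 / s.
    m += lr * np.mean(2.0 * (x - xhat)) * sigma**2 / s

assert abs(m - mu) < 0.2       # m recovers the data mean, as the theory predicts
```

In DEEN proper, m is replaced by the weights of a deep network and ∇_y f is computed by automatic differentiation; the structure of the loop is the same.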
In summary, there are two main “relaxations” in formulating density estimation in DEEN: (i) we are modeling the smoothed density p(y) that is associated with Y, not X (see Figure 1a);[2] (ii) the statistical model is unnormalized (Hyvärinen, 2005).

[2] The learned energy function has an implicit dependence on σ, the hyperparameter in the noise model p(y|x).
1.3.1 Big Picture
DEEN appears to be a powerful and efficient learning framework for what it was designed to solve: learning unnormalized smoothed densities with the methodology of empirical Bayes. The “power” comes from the neural network parametrization (Bengio, 2009; Lu et al., 2017) and the “efficiency” comes from SGD and the fact that the energy is computed deterministically, short-cutting the inference step. But can we do better?
Conceptually, one insight in (Saremi and Hyvärinen, 2019) was to make a case in favor of probabilistic modeling of Y instead of X. But what is missing there is a latent space in which one can reason about noisy data. A natural question arises: can we set up the learning problem in terms of the denoising methodology of empirical Bayes, but also have a latent space? This quest was conceptual/aesthetic at first—i.e. to make the model “more Bayesian”—but to our surprise it also came with a significant boost to DEEN itself.
1.3.2 Learning Algorithm
The sketch of the learning algorithm is as follows:
Following up on (Saremi, 2020), we define a latent variable model for the random variable Y, where the joint density p(y, z) is set up by taking the prior p(z) to be the isotropic Gaussian N(0, I_{d_z}) and taking the conditional density to be

p(y|z) = N(y; x̂(z), σ²I_d).   (6)
The evidence lower bound (ELBO) for log p(y) follows immediately:

log p(y) ≥ E_{q(z|y)}[log p(y|z)] − D(q(z|y) ‖ p(z)).
(Saremi, 2020) started with the setup above with the goal of formulating Bayesian inference for very noisy data, and introduced the notion of an imaginary noise model, defined for the clean data, studying the ELBO for log p(x) instead. In short, in that framework Y is out of the picture, i.e. the noisy data is not seen in training the model. The noise model was called “imaginary” since the measurement kernel (for real noise) dictated how the latent variable model for the clean data was set up. In short, the idea was to implicitly model noise with Bayesian inference.
Here we take a very different route where we aim at explicitly modeling the noisy data with the noise model given in Eq. 6. The main idea is to bring back empirical Bayes, but now we parametrize the energy function with the negative of the ELBO (modulo an additive constant). We therefore arrive at the following form for the energy function:

f(y) = E_{q(z|y)}[−log p(y|z)] + D(q(z|y) ‖ p(z)).
The negative ELBO is known as the variational free energy. It is called the energy function here since, crucially, instead of minimizing it (maximizing the ELBO), we follow (Saremi and Hyvärinen, 2019) and set up the empirical Bayes learning objective:

L = E ‖x − x̂(y)‖², where x̂(y) = y − σ²∇_y f(y).
This framework for approximating p(y) with the unified machinery of empirical Bayes and variational Bayes is named unnormalized variational Bayes: “unnormalized”, since we only approximate log p(y) modulo an additive constant; and “variational Bayes”, since it is indeed the workhorse for learning, even though the learning objective itself is not based on maximum likelihood estimation but on the empirical Bayes least-squares estimation.
1.3.3 Empirical Results
Empirically, the following are the two highlights in the paper:
One may postulate that nothing should be gained here empirically: “nothing”, in terms of obtaining a lower loss. The end result is approximating the score function ∇_y log p(y), and if a neural network can do just that, why should one go through this lengthy program? But as it turns out—given equal “capacity” as defined by the dimension of the parameters—there is a significant gap in the loss between parametrizing the energy function with an MLP (as was done in DEEN) versus the parametrization in the variational Bayes machinery of UVB.
We also revisit “walk-jump sampling” proposed in (Saremi and Hyvärinen, 2019), which we explore here for σ = 1 (the random variable X takes values in [0, 1]⁷⁸⁴). Why such a large σ? We would like to explore a new paradigm in generative modeling, a middle ground between obtaining high-quality samples and having (very) fast-mixing MCMC samplers. Walk-jump sampling is designed to be such a “middle ground”. On MNIST, for σ = 1, we demonstrate that in a single run—without a restart—we can traverse all classes of digits in a variety of styles with a fast-mixing sampler, sampling y_t, which are mapped to x̂_t by the jumps via the Bayes estimator after every (only) 10 MCMC steps.
The remainder of the paper is organized as follows. In Sec. 2.1 we introduce variational autoencoders. In Sec. 2.2 we review the x̂-VAE, which this work builds on. In Sec. 2.3 we present our main contribution in unifying empirical Bayes and variational Bayes which, algorithmically speaking, brings together DEEN and the x̂-VAE. We then present the empirical results. In Sec. 3.1 we present the network architecture used for both DEEN and UVB, chosen such that the dimensions of the parameters match. In Sections 3.2 and 3.3 we present one of the most surprising results in the paper: UVB’s significant improvements over DEEN, accentuated for larger σ. In Sec. 3.4 we present the walk-jump sampling results on MNIST, illustrating the fast-mixing yet good-quality sample generation capability of UVB in the very noisy regime. In Sec. 3.5 we look more closely at the encoder/decoder underlying UVB, emphasizing UVB ≠ VAE. In Sec. 4 we discuss open questions, and we finish with a summary.
2 Unnormalized Variational Bayes (UVB)
Our plan for this section is to first introduce variational Bayes with a perspective suited for its unification with empirical Bayes, where one is interested in modeling p(y), not p(x). This paper also builds on (Saremi, 2020) in formulating a notion of smoothed variational inference, which we briefly review. This is then followed by our main contribution.
2.1 Auto-Encoding Variational Bayes
Consider the random variable X in ℝᵈ in the context of latent variable models, where we introduce the (latent) random variable Z in ℝᵈᶻ with a parametrized joint density p_θ(x, z), and our goal is to learn θ such that p_θ(x) is a good approximation to p(x). This is an intractable problem in general (in the absence of a priori knowledge on the distribution of X). Taking the relative entropy D(p(x) ‖ p_θ(x)) as the metric of choice to measure the approximation, and given the i.i.d. sequence x₁, …, xₙ, the problem of learning θ is then equivalent to maximizing the log-likelihood Σᵢ log p_θ(xᵢ). In directed graphical models, one considers factorizing the joint density p_θ(x, z) = p_θ(x|z) p(z), but brute-force maximum likelihood estimation of the parameters is in general intractable due to the computation of p_θ(x) = ∫ p_θ(x|z) p(z) dz.
In variational inference, one approaches this problem by studying yet another intractable problem: approximating the posterior p_θ(z|x). Indeed, the posterior itself is also intractable (in general) due to the Bayes rule p_θ(z|x) = p_θ(x|z) p(z) / p_θ(x), where the only tractable density is the numerator (assuming we have designed a tractable factorization). In other words, in probabilistic graphical models, the problem of modeling X and the problem of the posterior inference over Z are duals, and the complexity of the two problems “match” in some loose sense (see Remark 2.1 below).
Taking a flexible yet tractable q_φ(z|x) as the candidate to approximate the posterior, it is straightforward to derive a lower bound for log p_θ(x) using Jensen’s inequality:

log p_θ(x) ≥ E_{q_φ(z|x)}[log p_θ(x|z)] − D(q_φ(z|x) ‖ p(z)),
where D(q_φ(z|x) ‖ p(z)) is the relative entropy between the approximate posterior and the prior, and the first term measures the reconstruction performance of the autoencoder. The right-hand side of the inequality is referred to as the evidence lower bound (ELBO), and the framework of choice for learning and inference is the variational autoencoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014), with the important invention of the reparameterization trick, developed to pass gradients through the noise in the inference network and crucial for having low-variance estimates of the gradients of the ELBO with respect to φ.
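The chain of steps behind this bound, written out for completeness (a standard derivation; q_φ(z|x) can be any density with the right support):

```latex
\log p_\theta(x)
  = \log \int p_\theta(x \mid z)\, p(z)\, \mathrm{d}z
  = \log \mathbb{E}_{q_\phi(z \mid x)}
      \left[ \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)} \right]
  \geq \mathbb{E}_{q_\phi(z \mid x)}
      \left[ \log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)} \right]
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
      - D\!\left( q_\phi(z \mid x) \,\middle\|\, p(z) \right),
```

where the inequality is Jensen's applied to the concave logarithm, and the gap is exactly D(q_φ(z|x) ‖ p_θ(z|x)), so the bound is tight when the approximate posterior matches the true one.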
The “duality” presented above is in fact another motivation for setting up a latent variable model for Y: since p(y) is more tractable than p(x), the posterior inference over Z given Y must be more tractable than the one given X. More specifically, the standard assumptions made in the literature—(i) the Gaussian prior and (ii) the factorized Gaussian posterior—both appear to be very natural in studying Y, more so in higher dimensions and/or large-σ regimes. However, there are subtleties, as we discuss next.
2.2 Imaginary Noise Models and the x̂-VAE
Now consider learning and inference in VAEs, where one is particularly interested in a formulation suited for Y instead of X (see Remark 2.1). At a high level, this problem is also motivated by formalizing a notion of smoothed variational inference as outlined in (Saremi, 2020), where the goal is to have a machinery to “reason” about noisy data and thus be robust to noise, where smoothness and robustness are taken as duals, one implying the other.
On the surface, it seems straightforward to set up a latent variable model for the random variable Y, where X in the presentation of the previous section simply needs to be changed to Y. First, one needs to set up the joint density p_θ(y, z). Due to the definition Y = X + N(0, σ²I_d), a natural choice is given by
together with the standard isotropic Gaussian prior (see Remark 2.1). It is straightforward to derive the following lower bound for log p_θ(y):

log p_θ(y) ≥ −E_{q_φ(z|y)}[‖y − x̂_θ(z)‖²/(2σ²)] − D(q_φ(z|y) ‖ p(z)),
where the additive constant (d/2) log(2πσ²) is dropped from the r.h.s. as it does not affect the optimization of the ELBO (σ is fixed). As stated in the introduction, we are especially interested in this problem in the regime of large noise (large σ). From the start, variational inference formulated as such is “doomed to failure”, as the stochastic estimators of the gradients of the ELBO will now have a high variance, amplified by the Gaussian noise. In contrast to the denoising machinery in neural empirical Bayes, here the clean data is completely left out of the picture.
In (Saremi, 2020), a different route was taken by defining a notion of an imaginary noise model. In summary, one starts with the real noise model as defined by p(y|x) = N(y; x, σ²I_d) and then uses it as a template for defining the noise model for the clean data:

p_θ(x|z) = N(x; x̂_θ(z), σ²I_d),
where the rationale for the notation x̂ is that it is indeed the Bayes estimator of X given z: x̂(z) = E[X|z]. The model was named the x̂-VAE.
After learning, if the x̂-VAE is shown noisy data, say sampled from p(y|x)—or any other noise process—it first infers z, which is then used to estimate X by the decoder x̂(z). The x̂-VAE was trained only seeing clean samples, but quite surprisingly, the imaginary noise model was robust to large levels of (real) noise. Informally speaking, it appears that imaginary noise models are very expressive in dealing with noise, but at the present time they cannot compete with DEEN with its denoising engine. This motivated the model we present next, which brings DEEN and the x̂-VAE closer together.
2.3 UVB = x̂-VAE + DEEN
In the space of models, neural empirical Bayes (Saremi and Hyvärinen, 2019) (referred to as DEEN) and the x̂-VAE (Saremi, 2020) are very different, as summarized below, where we highlight their key features both conceptually and from an algorithmic standpoint:
DEEN is based on the denoising machinery of empirical Bayes, with the Bayes estimator of X given by x̂(y) = y − σ²∇_y f(y). The learning is formulated by parametrizing the energy function f (the negative log-density, modulo a constant) with a neural network, and the error signal ‖x − x̂(y)‖² is minimized in expectation using stochastic gradients. There are two types of noise present in the learning:
the noise in the SGD,
the noisy training data sampled from p(y|x),
but there is no latent space, and the model itself is fully deterministic, e.g. given y, the model always returns the same result for x̂(y). A key feature of DEEN, inherited from empirical Bayes, is that learning is not based on maximum likelihood estimation but on least-squares estimation.
The x̂-VAE is based on variational Bayes. There, the noisy data is absent during learning, and the model is purely based on maximizing the ELBO for the clean data as stated in (10), but the model is structured such that the inference is smoothed and robust to noise, “capable of denoising”. There are two types of noise present in learning:
the noise in the SGD,
the noise in the inference network (sampling z from the approximate posterior).
In contrast to DEEN, the latent space is at the heart of the learning framework, e.g. given noisy data at test time, the model infers z by the encoder and then returns x̂(z) via the decoder. Finally, the x̂-VAE is based on maximum likelihood estimation, which is the starting point in the standard formulation of variational Bayes.
Here, our starting point is the same as (Saremi, 2020), where we set up the latent variable model for Y as defined by

p_θ(y|z) = N(y; x̂_θ(z), σ²I_d),

where x̂_θ(z) is parameterized by a neural network and we take the prior to be p(z) = N(0, I_{d_z}).
In this work, we take the approximate posterior to be the factorized Gaussian (see Remark 2.1):

q_φ(z|y) = Πᵢ N(zᵢ; μᵢ(y), σᵢ²(y)),   (13)

where i is the index over the dimensions of the latent space ℝᵈᶻ. With this setup, one obtains the following lower bound for log p_θ(y):

log p_θ(y) ≥ −E_{q_φ(z|y)}[‖y − x̂_θ(z)‖²/(2σ²)] − D(q_φ(z|y) ‖ p(z)).   (14)
With the choices of (i) the Gaussian prior and (ii) the factorized Gaussian posterior, the KL divergence term is easily derived in closed form (Kingma and Welling, 2013):

D(q_φ(z|y) ‖ p(z)) = ½ Σᵢ (μᵢ² + σᵢ² − log σᵢ² − 1).
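The closed form can be sanity-checked against a Monte Carlo estimate of the same divergence (our illustration; kl_gauss and the test values are ours):

```python
import numpy as np

# Closed-form KL between a diagonal Gaussian posterior q = N(mu, diag(sd^2))
# and the isotropic prior p = N(0, I):
#   D(q || p) = 1/2 * sum_i (mu_i^2 + sd_i^2 - log sd_i^2 - 1).
def kl_gauss(mu, sd):
    return 0.5 * np.sum(mu**2 + sd**2 - np.log(sd**2) - 1.0)

# Zero iff q equals the prior, positive otherwise.
assert np.isclose(kl_gauss(np.zeros(8), np.ones(8)), 0.0)
assert kl_gauss(np.full(8, 0.5), np.full(8, 2.0)) > 0.0

# Monte Carlo cross-check: KL = E_q[log q(z) - log p(z)].
rng = np.random.default_rng(0)
mu, sd = np.array([0.3, -0.2]), np.array([0.8, 1.5])
z = mu + sd * rng.standard_normal((200_000, 2))     # reparameterized samples
log_q = np.sum(-0.5 * ((z - mu) / sd)**2 - np.log(sd) - 0.5 * np.log(2 * np.pi), axis=1)
log_p = np.sum(-0.5 * z**2 - 0.5 * np.log(2 * np.pi), axis=1)
assert abs(np.mean(log_q - log_p) - kl_gauss(mu, sd)) < 0.02
```

Note that the Monte Carlo estimate itself uses the reparameterization z = μ + σ ⊙ ε discussed next.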
The important invention in variational autoencoders was the reparameterization trick, devised to pass gradients through the noise present in the inference network (Kingma and Welling, 2013; Rezende et al., 2014). For the factorized Gaussian posterior (13), the reparametrization takes the form:

z = μ(y) + σ(y) ⊙ ε, ε ∼ N(0, I_{d_z}).   (16)
Main Idea 1
How can we bring empirical Bayes into the picture? The idea is somewhat straightforward conceptually: use the VAE machinery to parametrize the energy function of Y, identifying it with the negative of the ELBO for log p(y). In the simple setup here, the energy function takes the form

f(y) = E_{q_φ(z|y)}[‖y − x̂_θ(z)‖²/(2σ²)] + D(q_φ(z|y) ‖ p(z)),
and the learning is again based on the empirical Bayes least-squares objective E ‖x − x̂(y)‖², with x̂(y) = y − σ²∇_y f(y), which is minimized with SGD. At optimality, f(y) = −log p(y) modulo an additive constant.
In plain words, we only use the VAE machinery to parameterize the energy function: the learning objective is not based on maximum likelihood estimation but on the empirical Bayes least-squares estimation. This framework, designed to approximate p(y) with the unified machinery of empirical Bayes and variational Bayes, is named unnormalized variational Bayes (UVB): “unnormalized”, since we can only approximate log p(y) modulo an additive constant; “variational Bayes”, since it is indeed the workhorse for learning the unnormalized density.
There are three types of noise present in the learning of the energy function:
The noise in SGD,
The (very) noisy training data sampled from p(y|x),
The noise in the inference sampled with the reparametrization trick (16).
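The energy evaluation described above—a negative ELBO computed with reparameterized posterior samples—can be sketched as follows. This is a toy stand-in (the weight matrices below play the role of the paper's encoder/decoder networks; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, d, dz = 1.0, 16, 4

# Toy linear stand-ins for the three networks (the paper uses ConvNets/MLPs):
W_mu  = rng.standard_normal((dz, d)) * 0.1    # encoder mean "network"
W_lsd = rng.standard_normal((dz, d)) * 0.1    # encoder log-std "network"
W_dec = rng.standard_normal((d, dz)) * 0.1    # decoder "network"

def uvb_energy(y, n_samples=1):
    """Negative ELBO of y, evaluated with reparameterized posterior samples.
    This is the quantity UVB treats as the energy f(y), modulo a constant."""
    mu, sd = W_mu @ y, np.exp(W_lsd @ y)               # q(z|y) = N(mu, diag(sd^2))
    kl = 0.5 * np.sum(mu**2 + sd**2 - np.log(sd**2) - 1.0)
    recon = 0.0
    for _ in range(n_samples):
        z = mu + sd * rng.standard_normal(dz)          # reparameterization trick
        recon += np.sum((y - W_dec @ z)**2) / (2 * sigma**2)
    return recon / n_samples + kl

y = rng.standard_normal(d)
assert np.isfinite(uvb_energy(y)) and uvb_energy(y) > 0.0
# In UVB this energy is *not* minimized directly; its gradient w.r.t. y defines
# the estimator xhat(y) = y - sigma^2 * grad_y f(y) inside the least-squares loss.
```

In the full model, ∇_y f is obtained by automatic differentiation through the encoder, the sampled z, and the decoder, which is where the third source of noise enters the gradients.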
Informally speaking, empirical Bayes and variational Bayes play an “equally” important role here. From the perspective of learning dynamics, the presence of the three types of noise (described above) is to our knowledge novel, but also troubling at first: one may suspect that the model should not work at all! Yet the most surprising empirical result in the paper is that the new machinery significantly improves on (Saremi and Hyvärinen, 2019), where the energy function was parametrized with a very wide (highly overparametrized) ConvNet. As we demonstrate, the significant quantitative improvements in the loss become qualitative, as visualized by the Bayes estimator in the very noisy (ultraviolet) regime.
3.1 Network architecture
The encoder/decoder architecture here is the same as the one used in (Saremi, 2020). There are three neural networks in UVB: (i) the encoder mean μ(y), (ii) the encoder variance σ²(y), (iii) the decoder x̂(z). The architectures for the two encoder networks were the same, without any weight sharing: both were ConvNets with expanding channels, without pooling, one fully connected layer with 200 neurons, and a linear readout. The decoder had one hidden layer with 2000 neurons and a logistic readout. Throughout, the activation function was x ↦ x · sigmoid(x), a smoothed-out ReLU that comes with at least two different names: “SiLU” (Elfwing et al., 2017) and “Swish” (Ramachandran et al., 2017).
We chose the Adam optimizer (Kingma and Ba, 2014) with a batch size of 128 and a small constant learning rate. The standard constant learning rate was not stable, which we think is due to the three types of noise present in the learning, as discussed at the end of Section 2.3. We also ran some experiments with smaller batch sizes of 64 and 32, and the learning was stable, arriving at similar losses after 400 epochs in the range reported here.
In both UVB and DEEN (see below), the architectures for MNIST and CIFAR10 were the same. The architecture search for this problem is important (especially in exploring large σ for complex distributions), but that is beyond the scope of this paper. One crucial choice in both UVB and DEEN is the activation function. A smooth activation function is important here[3] since ∇_y f must be computed first, before computing the stochastic gradients for updating the parameters (the loss is evaluated on mini-batches).

[3] With ReLU, the optimizer saturates at a significantly higher loss (compared to SiLU/Swish).
There is only one neural network in DEEN, f_φ(y), which parametrizes the energy function. We used a ConvNet with a similar architecture as in (Saremi and Hyvärinen, 2019), but smaller, with expanding channels = (128, 256, 512) and a fully connected layer of size 100. In Table 1 we compare the dimensions of the parameters in DEEN vs. UVB.
3.2 UVB’s capacity in approximating the score function vs. DEEN’s
We start with UVB’s improvements over DEEN, where we report both the training and test losses at the end of training with the constant learning rate stated in the previous section. The results are reported in Tables 2 and 3, which clearly demonstrate that UVB has a higher capacity in approximating the score function, highlighted by the significant gaps in the losses between the two for larger noise levels. UVB’s improvement over DEEN is an empirical observation and is not a fundamental result per se: DEEN is founded upon the universal approximation capability of neural networks with its rich theory (Cybenko, 1989; Hornik et al., 1989; Lu et al., 2017), and the end goal in both DEEN and UVB is approximating ∇_y log p(y). Therefore, if the neural network has large enough capacity, one should be able to arrive at a solution arbitrarily close to the score function pointwise. However, that is a big if! Ultimately, the neural network is finite with a finite capacity, and the parametrization and the optimization both play important roles in learning.
In Tables 2 and 3, one also notices a clear qualitative difference between UVB and DEEN in terms of the generalization gap (in an abuse of terminology, taken to be the difference between the test loss and the training loss). In DEEN, there is practically no difference between training and test losses, while in UVB there is a gap which, quite surprisingly, persists to very high levels of noise. This gives yet another perspective on UVB’s higher capacity in optimizing the empirical Bayes denoising loss.
For a sanity check, we also experimented with UVB with a Laplace decoder as outlined in (Saremi, 2020), where we assumed (wrongly) that the noise model had been the Laplace distribution even though the training data consisted of Gaussian-corrupted samples. Not surprisingly, the learning capacity of the model became very limited, with the training loss on MNIST saturating at a much higher value. This experiment validates our expectation that the design of the VAE itself plays an important role in learning the energy function.
3.3 The ultraviolet regime
We are especially interested in the regime of large σ. Here we present visualization experiments for σ = 1 on MNIST (LeCun et al., 1998). To have a sense of this scale, note that the largest distance in the hypercube [0, 1]⁷⁸⁴ is the diagonal, of length √784 = 28, while the norm of the Gaussian noise itself concentrates at σ√784 = 28 for σ = 1.[4] The noise also happens to be quite high for our visual system, as illustrated in Figure 2. Informally speaking, we call this the ultraviolet regime of noise, where the noisy data become hardly recognizable.[5] The empirical Bayes estimator of X given noisy samples, for randomly selected samples from the MNIST test set, is presented in Figure 2 after learning the energy function with both UVB and DEEN.

[4] The geometric interpretation of the noise level in relation to pairwise distances in the dataset is discussed in (Saremi and Hyvärinen, 2019) around the notion of the “σ-sphere”. See Figure 2 in the reference.

[5] Ultraviolet is defined at the onset of 750 THz in electromagnetic radiation, just above (in frequency) the violet range of the spectrum, which is invisible to humans although visible to many insects. Source: https://en.wikipedia.org/wiki/ultraviolet
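The scale argument above is easy to verify numerically (our illustration, with d = 784 as for MNIST): the norm of isotropic Gaussian noise concentrates tightly around σ√d, which for σ = 1 matches the length of the hypercube diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 784, 1.0

# Diagonal of [0,1]^784 has length sqrt(784) = 28 exactly.
assert abs(np.sqrt(d) - 28.0) < 1e-12

# Norms of 10,000 isotropic Gaussian noise vectors in 784 dimensions.
norms = np.linalg.norm(sigma * rng.standard_normal((10_000, d)), axis=1)
assert abs(norms.mean() - sigma * np.sqrt(d)) < 0.1   # concentrates at sigma*sqrt(d)
assert norms.std() < 1.0                              # fluctuations are O(1), not O(sqrt(d))
```

In other words, at σ = 1 every noisy MNIST digit sits (with high probability) at a distance from its clean counterpart comparable to the largest distance available in the data hypercube.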
3.3.1 Poor man’s NEBULA
Saremi and Hyvärinen (2019) defined a probabilistic notion of associative memory called NEBULA as attractors (strict local minima) of the energy function, where the attractor dynamics is dictated by the gradient flow of the learned energy f(y). We adopt the same definition here, where now the energy function is learned by UVB. Strictly speaking, to arrive at the attractors one should run the continuous gradient-flow dynamics, which in practice is simulated with gradient descent with small step sizes. In this work, we experimented with a “poor man’s” version of NEBULA, where we assume we only have the budget to take two steps. The results are reported in Figure 2d: if we were to take more steps, the denoised signals would be cleaned up even more, but that comes with unexpected consequences: the style of the digits drifts toward more prototypical modes, and the digit classes themselves could also change.
3.3.2 Failures of the Vanilla (Gaussian) VAE in Modeling p(y)
Below we report the failures of the vanilla Gaussian VAE in modeling p(y) for high levels of noise. Here, the learning objective is to maximize the ELBO given in the r.h.s. of (14). In the figure below, the top row shows clean samples from the MNIST test set, the second row shows the corresponding noisy samples y = x + ε, and the third row shows the corresponding reconstructions x̂(z), z ∼ q(z|y). For completeness, in the last row we have also given samples from the decoder, x̂(z), z ∼ N(0, I).
3.4 Fast-mixing walk-jump sampling in the ultraviolet regime
What we have established so far empirically is that unnormalized variational Bayes (UVB) appears to have a much higher capacity (as measured by the loss obtained) in learning unnormalized densities than neural empirical Bayes (DEEN). Due to the significant improvements of UVB over DEEN—especially in the high-noise regime—we revisited the walk-jump sampling introduced in (Saremi and Hyvärinen, 2019). Walk-jump sampling finds a “happy medium” between the desire to generate sharp (high-quality) samples and the quest for MCMC to mix faster. The algorithm is quite simple:
Sample y_t after learning the energy function f(y). Here, as in (Saremi and Hyvärinen, 2019), we consider (overdamped) Langevin MCMC:

y_{t+1} = y_t − δ ∇_y f(y_t) + √(2δ) ε_t,   (19)

where ε_t ∼ N(0, I_d), δ is the step size, and t is the discrete time. One typically starts the MCMC at an arbitrary initial point. In the very noisy regime (and/or in very high dimensions) we can visualize (19) as walking on/in the high-dimensional “ultraviolet manifold” where Y is concentrated (see Figure 1a).
At arbitrary times, use the Bayes estimator of X to jump from y_t to

x̂_t = y_t − σ²∇_y f(y_t).   (20)

In comparison with the Langevin MCMC, these are not statistical samples (taken from p(x)) but, going back to the definition of the Bayes estimator, they are the mean over the posterior: x̂(y) = E[X|Y = y].
Essentially, with the jumps we get much sharper “denoised” samples than the noisy samples generated by the Langevin MCMC (if our goal is to see sharper samples).
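The walk and jump steps can be sketched on a toy model where everything is exact (our illustration; X ~ N(0, τ²I), Y = X + N(0, σ²I), so the energy of Y is f(y) = ‖y‖²/(2s) with s = τ² + σ², and τ, δ are our toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, tau, sigma = 10, 1.0, 2.0
s = tau**2 + sigma**2
grad_f = lambda y: y / s                  # gradient of the known energy of Y

delta = 0.1                               # Langevin step size
y = np.zeros(d)
walk, jump = [], []
for t in range(40_000):
    # Walk: overdamped Langevin step in y-space (Eq. 19).
    y = y - delta * grad_f(y) + np.sqrt(2 * delta) * rng.standard_normal(d)
    if t % 10 == 9:                       # every 10 steps, also record a
        walk.append(y.copy())             # Jump (Eq. 20): the Bayes estimate
        jump.append(y - sigma**2 * grad_f(y))   # of X given the current y

walk, jump = np.array(walk[100:]), np.array(jump[100:])
# The walk samples approximately p(y) = N(0, s I); the jumps are contracted
# toward the clean distribution, with Var[xhat(Y)] = tau^4 / s < tau^2.
assert abs(walk.var() - s) < 0.5
assert abs(jump.var() - tau**4 / s) < 0.1
```

The variance contraction in the last line is the toy analogue of the "sharpening" effect: jumps land closer to where X is concentrated than the walk itself.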
As is clear from the above, walk-jump sampling finds a compromise between having fast-mixing MCMC samplers (more on that in Remark 3.4 below) and obtaining sharp samples “close to” the data manifold where the random variable X is concentrated. From our perspective, the first quest is the more important one, i.e. in this paradigm, one is easily willing to sacrifice sample quality for faster mixing (see Remark 3.4). On MNIST, we present results for σ = 1. One could do experiments with smaller σ, but σ = 1 was chosen due to the (visually) good denoising performance of UVB presented in Figure 2c and to push the model to its limits, informally speaking bringing the model close to the schematic in Figure 1a.
The results are presented in Fig. 3, where we set the step size δ (see Eq. 19) and demonstrate jump samples (Eq. 20) after every 10 Langevin MCMC steps. There was no warm-up phase in the experiment presented, i.e. the first sample shown is obtained after only 10 MCMC steps (the samples are viewed top-left to bottom-right). The experiment demonstrates the fast-mixing property of the Langevin MCMC in the ultraviolet regime while retaining some sharpness in the jump steps. In addition, the samples shown in Fig. 3 could be sharpened (cleaned up more) by a follow-up jump as described in Sec. 3.3.1 and visualized in Fig. 2d, but here we opted for the less sharp version to highlight the transitions between the digits in the Langevin MCMC more clearly.
There have been recent developments in analyzing Langevin MCMC for smooth and strongly convex energy functions (Cheng et al., 2018b), and related results on statistical models where the energy function is not convex globally but m-strongly convex outside a ball of radius R (Cheng et al., 2018a). In the latter non-convex case (relevant here), the assumption is that there exist m > 0 and R ≥ 0 such that for all x, y with ‖x − y‖ ≥ R:

⟨∇f(x) − ∇f(y), x − y⟩ ≥ m ‖x − y‖².
As expected, one gains in the time complexity of the convergence of Langevin MCMC to the true distribution for models that satisfy the condition above for smaller R. On the other hand, it is clear that larger m leads to lower time complexity. Unfortunately, it is not tractable to analytically bound (m, R) for the energy function parameterized by UVB, but it is an open question whether one can analyze this problem for p(y) and study the role of σ in the time complexity analysis.
In this presentation we emphasized a regime of interest with both fast-mixing samplers and good-quality samples—“good enough,” that is! But finding that middle ground is clearly dataset- and application-dependent. On that note, generative models are sometimes motivated by the ability of humans to dream or to plan actions, but what is usually left out is that we (humans) typically do not plan/dream in high resolution. On the other hand, stretching the (informal) analogy between planning/dreaming and sampling probabilistic models further, we appear to have very fast-mixing samplers built into our brain.
3.5 UVB ≠ VAE
In this section, we look more closely at UVB “under the hood”. In particular, we show empirically that the encoder/decoder underlying UVB, learned by least-squares estimation, is very different from the one learned by maximum likelihood estimation: symbolically, UVB ≠ VAE. This must be clear since the learning paradigms are very different, but it is worth emphasizing through experiments, by looking at both the encoder and the decoder of the VAE learned by UVB’s least-squares objective and comparing them with the ones learned by maximizing the ELBO as discussed in Sec. 3.3.2.
3.5.1 UVB’s encoder
We first report the KL term D(q(z|y) ‖ p(z)) of the encoder of UVB and compare it with the one learned by maximizing the ELBO, as discussed in Sec. 3.3.2. The encoder-decoder architectures are the same; the learning objectives, vastly different! The results are given in Table 4, showing an astronomical difference of 688 nats between the two.
3.5.2 UVB’s decoder
This section repeats the experiments of the previous section, but here we compare the reconstruction term E_{q(z|y)}[log p(y|z)] between the two models, evaluated on the test set. The results are reported in Table 5.
Note that neither the reconstruction term nor the KL divergence term itself enters the learning objective in UVB: only their gradients do. To shed more light on this, in Figure 4 we have also included visualizations of samples from the decoder for both MNIST and CIFAR10, giving yet another perspective on the fact that UVB ≠ VAE.
4 Open Questions
We highlight two open questions:
1. Why does UVB have a higher capacity than DEEN for approximating the score function?
This issue can be looked at from the perspective of implicit parametrization. Note that in both DEEN and UVB the learning objective is set up to approximate the score function, but in both cases we parametrized the energy function instead. One could view the energy function as implicitly parametrizing the score function; “implicit”, since the score is computed with automatic differentiation and (in practice) its symbolic form is not known (Griewank, 2003; Baydin et al., 2017). This implicit-parametrization view of DEEN for approximating the score function was discussed in (Saremi, 2019). In UVB this implicit parametrization is taken to a “higher level”, where the energy function itself is not parametrized by an MLP but by the ELBO computed by a VAE.
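The implicit-parametrization view can be sketched in a few lines of code. The sketch below, assuming PyTorch and an illustrative two-layer MLP (not the architecture used in the paper), parametrizes a scalar energy and obtains the score by automatic differentiation inside a DEEN-style least-squares objective built on Miyasawa's estimator: the symbolic form of the score is never written down.

```python
# A minimal sketch, assuming PyTorch; `Energy` and `deen_loss` are illustrative
# names, not the paper's code. The score -dE/dy exists only implicitly, as the
# output of autodiff applied to the energy MLP.
import torch
import torch.nn as nn

class Energy(nn.Module):
    """Small MLP mapping a noisy input y to a scalar energy E(y)."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, y):
        return self.net(y).squeeze(-1)

def deen_loss(energy, x, sigma):
    """Least-squares objective || x - xhat(y) ||^2 with the empirical Bayes
    estimator xhat(y) = y - sigma^2 * grad_y E(y), averaged over the batch."""
    y = (x + sigma * torch.randn_like(x)).requires_grad_(True)
    E = energy(y)
    # implicit parametrization of the score via automatic differentiation;
    # create_graph=True so gradients flow back into the MLP parameters
    score = -torch.autograd.grad(E.sum(), y, create_graph=True)[0]
    xhat = y + sigma ** 2 * score
    return ((x - xhat) ** 2).sum(dim=-1).mean()

# usage on a toy batch
x = torch.randn(16, 8)
energy = Energy(dim=8)
loss = deen_loss(energy, x, sigma=0.5)
loss.backward()  # updates the energy parameters through the implicit score
```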
2. Why does the generalization gap in UVB persist to very high levels of noise?
In DEEN, there is no distinction between the test set and the training set at higher levels of noise: on CIFAR10 starting at and on MNIST starting at . However, for UVB a gap persists all the way to . We should point out that the test loss in our experiments did not go up during training for the range of reported here, so the generalization gap is not a sign of overfitting. (In general, in this denoising framework, it must be very difficult to overfit at large noise levels.) For CIFAR10, a natural next step is to repeat the experiments here on the 80 million tiny images database (Torralba et al., 2008), of which CIFAR10 (Krizhevsky, 2009) is a small subset, and check whether this gap closes. (It appears that the database has been taken down at the present time.)
5 Conclusion
We introduced unnormalized variational Bayes (UVB) as a unification of empirical Bayes and variational Bayes for approximating unnormalized densities. Algorithmically, the energy function was parametrized and computed by a variational autoencoder, but the learning itself was formulated via empirical Bayes. What did we gain in this “unification”? In variational Bayes, one should be mindful of how tight the evidence lower bound (ELBO) is; there are no such worries here, since the learned density is unnormalized. In addition, as we demonstrated empirically, and to our surprise, the learning itself is boosted significantly by the variational autoencoder (VAE) parametrization. At the present time, it is difficult to tease apart whether this “boost” is mostly due to the parametrization or due to the optimization being more regularized by the inner workings of variational Bayes. In this paper, the VAE was set up with an isotropic Gaussian prior, a factorized Gaussian posterior (for the encoder), and a Gaussian conditional density (for the decoder), but this is just one instantiation of UVB; more expressive priors/encoders/decoders can be used, where the only overhead to keep in mind is the computation of the gradient of the ELBO with respect to noisy inputs.
Acknowledgments
I am grateful to Aapo Hyvärinen and Francis Bach for their comments on the manuscript and to my colleagues Christian Osendorfer and Rupesh Srivastava for discussions.
References
- Alain and Bengio (2014) Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. Journal of Machine Learning Research, 15(1):3563–3593, 2014.
- Baydin et al. (2017) Atılım Günes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research, 18(1):5595–5637, 2017.
- Bengio (2009) Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
- Cheng et al. (2018a) Xiang Cheng, Niladri S Chatterji, Yasin Abbasi-Yadkori, Peter L Bartlett, and Michael I Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648, 2018a.
- Cheng et al. (2018b) Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Conference on Learning Theory, pages 300–323, 2018b.
- Cybenko (1989) George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
- Elfwing et al. (2017) Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. arXiv preprint arXiv:1702.03118, 2017.
- Griewank (2003) Andreas Griewank. A mathematical view of automatic differentiation. Acta Numerica, 12(1):321–398, 2003.
- Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
- Hyvärinen (2005) Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.
- Jordan et al. (1999) Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Lu et al. (2017) Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pages 6231–6239, 2017.
- Miyasawa (1961) Koichi Miyasawa. An empirical Bayes estimator of the mean of a normal population. Bulletin of the International Statistical Institute, 38(4):181–188, 1961.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
- Ramachandran et al. (2017) Prajit Ramachandran, Barret Zoph, and Quoc V Le. Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941, 7, 2017.
- Raphan and Simoncelli (2011) Martin Raphan and Eero P Simoncelli. Least squares estimation without priors or supervision. Neural computation, 23(2):374–420, 2011.
- Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
- Robbins (1956) Herbert Robbins. An empirical Bayes approach to statistics. In Proc. Third Berkeley Symp., volume 1, pages 157–163, 1956.
- Saremi (2019) Saeed Saremi. On approximating ∇f with neural networks. arXiv preprint arXiv:1910.12744, 2019.
- Saremi (2020) Saeed Saremi. Learning and inference in imaginary noise models. arXiv preprint arXiv:2005.09047, 2020.
- Saremi and Hyvärinen (2019) Saeed Saremi and Aapo Hyvärinen. Neural empirical Bayes. Journal of Machine Learning Research, 20(181):1–23, 2019.
- Saremi et al. (2018) Saeed Saremi, Arash Mehrjou, Bernhard Schölkopf, and Aapo Hyvärinen. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018.
- Swersky et al. (2011) Kevin Swersky, Marc’Aurelio Ranzato, David Buchman, Benjamin M Marlin, and Nando de Freitas. On autoencoders and score matching for energy based models. In ICML, 2011.
- Torralba et al. (2008) Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
- Vértes and Sahani (2016) Eszter Vértes and Maneesh Sahani. Learning doubly intractable latent variable models via score matching. Technical report, Gatsby Unit, UCL, 2016.
- Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.