Recent advances in deep likelihood based generative models [vae, rezende, pixelcnn, pixelcnn++, glow] enable the modeling of very high dimensional and complicated data such as natural images, sequences [wavenet] and graphs [graphvae]. Compared to Generative Adversarial Networks (GANs) [goodfellow2014generative], these models can evaluate the likelihood of input data easily, return latent variables that are useful for downstream tasks and do not suffer from the instability and mode dropping issues [salimans2016improved] of GAN training. However, GANs are still the state-of-the-art generative models for many generative tasks, because they can produce sharper and more realistic samples than non-adversarial models [brock2018large]. Much effort has been devoted to improving the sample quality of non-adversarial, likelihood based generative models [van2017neural, behrmann2018invertible, huang2020augmented, dai2019diagnosing] by modifying the training objectives.
This paper aims at a slightly different task: we want to improve the sample quality of an existing generative model through a better sampling procedure. Note that in deep generative models, samples are usually obtained by sending latent variables through a deterministic transformation, where the latent variables are sampled from a pre-defined prior distribution. Controlling the temperature of sampling from the prior may produce better samples, but this comes at the cost of less diversity [glow]. Recently, a line of work has improved the sample quality of pre-trained GANs [tanaka2019discriminator, che2020your]. These methods use the information contained in the discriminator of a GAN to obtain better samples of the latent variable. In particular, [che2020your] samples latent variables from an energy-based model (EBM) defined jointly by the generator and discriminator.
Extension of these ideas to likelihood based models is nontrivial, because the discriminator of GANs is critical to provide guidance for moving the latent variables. We extend these methods to generative models without adversarial training by constructing a latent variable EBM that consists of the pre-trained generative model and an energy function. The EBM can be trained efficiently by maximizing the data likelihood, and we observe that training the EBM only adds a slight computational overhead, as the convergence is fast. After convergence, new samples are produced by latent variables sampled from the EBM. We show that our method can effectively improve the sample quality of a variety of pre-trained generative models, including normalizing flows, VAEs and a combination of the two.
2 Background on Energy-Based Models
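An energy-based model (EBM) assumes a Gibbs distribution

$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)} \tag{1}$$

over data $x$. Here $E_\theta$ is the energy function with parameter $\theta$, and $Z(\theta) = \int \exp(-E_\theta(x))\,dx$ is the normalizing constant. When $x$ is an image, $E_\theta$ is usually chosen to be a convolutional neural network with scalar output [du2019implicit]. The EBM can be trained by the maximum likelihood principle, namely minimizing the negative log likelihood $L(\theta) = -\mathbb{E}_{x \sim p_{\mathrm{data}}}[\log p_\theta(x)]$. This objective is known to have derivative

$$\nabla_\theta L(\theta) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\nabla_\theta E_\theta(x)] - \mathbb{E}_{x^- \sim p_\theta}[\nabla_\theta E_\theta(x^-)] \tag{2}$$

where $x^-$ represents a sample drawn from the EBM. However, it is difficult to draw samples from such complex unnormalized distributions, and one typically needs to employ MCMC algorithms. One efficient MCMC algorithm in high-dimensional continuous state spaces is Stochastic Gradient Langevin dynamics, popularized in the statistics literature in the early 90's in [amit-jasa] and introduced in the deep learning literature in [10.5555/3104482.3104568]. This algorithm initializes at $x_0$ and runs updates

$$x_{t+1} = x_t - \frac{\epsilon}{2}\nabla_x E_\theta(x_t) + \sqrt{\epsilon}\,\omega_t, \qquad \omega_t \sim N(0, I) \tag{3}$$

where $\epsilon$ is the step size. The continuous Langevin dynamics is guaranteed to produce samples from the target distribution. In practice we use a discrete approximation, which yields a Markov chain with invariant distribution close to the original target distribution.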
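To make the discretized update concrete, the following minimal sketch runs it on a one-dimensional toy energy $E(x) = x^2/2$, whose Gibbs distribution is the standard normal. It assumes the common form $x_{t+1} = x_t - \frac{\epsilon}{2}\nabla_x E(x_t) + \sqrt{\epsilon}\,\omega_t$ with $\omega_t \sim N(0, 1)$; the function names, step size and chain length are illustrative choices, not settings from our experiments.

```python
import math
import random


def langevin_chain(grad_energy, x0, step_size, n_steps, rng):
    """Run n_steps of discretized Langevin dynamics and return the final state."""
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Gradient step toward low energy plus Gaussian noise.
        x = x - 0.5 * step_size * grad_energy(x) + math.sqrt(step_size) * noise
    return x


def grad_toy_energy(x):
    """Gradient of the toy energy E(x) = x^2 / 2 (hypothetical example)."""
    return x


rng = random.Random(0)
samples = [
    langevin_chain(grad_toy_energy, rng.uniform(-3.0, 3.0), 0.1, 200, rng)
    for _ in range(2000)
]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(f"mean={sample_mean:.3f}, var={sample_var:.3f}")
```

For this linear-Gaussian case the invariant variance of the discrete chain is $1/(1 - \epsilon/4)$ rather than exactly 1, which illustrates the small bias introduced by the discretization.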
3.1 Exponential tilting of Generative Models
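Suppose we have a pre-trained probabilistic generative model $p_g(x)$ over the data space $\mathcal{X}$. We can define a new model $p_\theta$ by "exponential tilting" of $p_g$ with an energy function $E_\theta$:

$$p_\theta(x) = \frac{1}{Z(\theta)}\exp(-E_\theta(x))\,p_g(x), \tag{4}$$

where $Z(\theta) = \int \exp(-E_\theta(x))\,p_g(x)\,dx$ is the normalizing constant, and the Langevin dynamics to generate samples from $p_\theta$ is

$$x_{t+1} = x_t - \frac{\epsilon}{2}\nabla_x\big(E_\theta(x_t) - \log p_g(x_t)\big) + \sqrt{\epsilon}\,\omega_t, \qquad \omega_t \sim N(0, I). \tag{5}$$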
Note that the derivative of the log likelihood of the original generative model appears in the update, driving the Langevin dynamics to samples with large likelihood and low energy simultaneously. However, this only works for generative models with tractable likelihood. More importantly, operating in the pixel space may be inefficient as it completely ignores the latent variables.
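As a worked illustration of tilting (a toy case chosen here for exposition, not one of our experiments), assume the tilted density is proportional to $\exp(-E_\theta(x))\,p_g(x)$, take a one-dimensional base model $p_g = N(0, 1)$ and a hypothetical quadratic energy $E_\theta(x) = \theta x^2/2$ with $\theta > -1$:

```latex
\begin{aligned}
p_\theta(x) &\propto \exp\!\left(-\tfrac{\theta}{2}x^2\right)\exp\!\left(-\tfrac{1}{2}x^2\right)
  = \exp\!\left(-\tfrac{1+\theta}{2}x^2\right), \\
p_\theta &= N\!\left(0, \tfrac{1}{1+\theta}\right), \qquad
Z(\theta) = \int \exp\!\left(-\tfrac{\theta}{2}x^2\right) p_g(x)\,dx = (1+\theta)^{-1/2}.
\end{aligned}
```

In this example the Langevin drift is $-\nabla_x\big(E_\theta(x) - \log p_g(x)\big) = -(1+\theta)x$, which adds the energy gradient to the score of the base model; a positive $\theta$ shrinks the variance of the base model toward the region the energy favors.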
3.2 EBM in Latent Space
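Many types of probabilistic generative models, including normalizing flows and VAEs, adopt a decoder structure in their generation process: there is a pre-defined prior distribution $p(z)$, and samples are generated by

$$x = g(z), \qquad z \sim p(z).$$

We can therefore re-parametrize the EBM in (4) with the latent variable $z$ and obtain

$$p_\theta(z) = \frac{1}{Z(\theta)}\exp(-E_\theta(g(z)))\,p(z). \tag{6}$$

When $p_\theta$ is trained by maximizing the likelihood, the second term of (2) can also be re-parametrized to $z$ space:

$$\mathbb{E}_{x^- \sim p_\theta(x)}[\nabla_\theta E_\theta(x^-)] = \mathbb{E}_{z^- \sim p_\theta(z)}[\nabla_\theta E_\theta(g(z^-))]. \tag{7}$$

See Appendix A for a simple derivation. Similarly, samples from $p_\theta(z)$ can be obtained through running the Langevin dynamics

$$z_{t+1} = z_t - \frac{\epsilon}{2}\nabla_z\big(E_\theta(g(z_t)) - \log p(z_t)\big) + \sqrt{\epsilon}\,\omega_t, \qquad \omega_t \sim N(0, I). \tag{8}$$

Sometimes $z$ has lower dimensionality than $x$, and $\log p(z)$ can often be computed easily; therefore training and sampling from $p_\theta(z)$ is more efficient.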
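The latent-space sampler can be sketched on a toy problem. The snippet below assumes a standard normal prior, a hypothetical linear decoder $g(z) = az$ and a quadratic energy $E_\theta(x) = \theta x^2/2$, so that the latent density proportional to $\exp(-E_\theta(g(z)))\,p(z)$ is Gaussian with variance $1/(1 + \theta a^2)$ and the result can be checked analytically. All names and constants are illustrative assumptions; in practice $g$ and $E_\theta$ are neural networks and the gradient comes from automatic differentiation.

```python
import math
import random

A = 2.0      # slope of the hypothetical linear decoder g(z) = A * z
THETA = 1.0  # parameter of the toy energy E(x) = THETA * x^2 / 2


def decode(z):
    return A * z


def grad_latent_potential(z):
    """d/dz of E(g(z)) - log p(z) for the toy model with a N(0, 1) prior."""
    grad_energy = THETA * decode(z) * A  # chain rule through the decoder
    grad_neg_log_prior = z               # -d/dz log N(z; 0, 1)
    return grad_energy + grad_neg_log_prior


def sample_latent(step_size, n_steps, rng):
    """Short Langevin chain in latent space, initialized from the prior."""
    z = rng.gauss(0.0, 1.0)
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        z = z - 0.5 * step_size * grad_latent_potential(z) + math.sqrt(step_size) * noise
    return z


rng = random.Random(0)
xs = [decode(sample_latent(0.05, 100, rng)) for _ in range(2000)]
mean_x = sum(xs) / len(xs)
var_x = sum((x - mean_x) ** 2 for x in xs) / len(xs)
exact_var = A ** 2 / (1.0 + THETA * A ** 2)  # variance of x under the tilted model
print(f"sample var={var_x:.3f}, exact var={exact_var:.3f}")
```

The decoded samples have a much smaller variance than the base model (variance $a^2 = 4$), showing how the energy reshapes the distribution purely through the choice of latent variables.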
4 Related Work
Our work is closely related to recent literature that uses a discriminator to improve the sample quality of an existing GAN. Discriminator rejection sampling [azadi2018discriminator] and Metropolis-Hastings GANs [turner2018metropolis] use the discriminator as a criterion for accepting or rejecting samples from the generator. They are inefficient, as many samples may be rejected. Discriminator optimal transport (DOT) [tanaka2019discriminator] and Discriminator Driven Latent Sampling (DDLS) [che2020your] both move the latent variable to make samples better fool the discriminator. In particular, [tanaka2019discriminator] uses deterministic gradient descent in the latent space, while [che2020your] formulates an EBM on latent variables and uses Langevin dynamics to sample latent variables. All these methods rely on the fact that the discriminator of a trained GAN is a good classifier of real versus fake images, which is not applicable to likelihood based generative models.
When applied to VAEs, our method shares similarity with some recent literature that trains an auxiliary model to match the empirical latent distribution of an existing VAE, and samples are produced by latent variables generated by the auxiliary model. The auxiliary model can be another VAE [dai2019diagnosing], a normalizing flow [xiao2019generative] or an auto-regressive model [van2017neural]. Our method uses an EBM as the auxiliary model, but its purpose is to define the energy for generated samples. More importantly, other methods can improve the sample quality only when increasing the weight on the reconstruction term in the objective, which makes the latent representation less structured. In contrast, our method can improve the sample quality of a VAE trained without modifying the objective.
Our method relies heavily on progress in training deep EBMs. Recently, [du2019implicit, nijkamp2019anatomy, nijkamp2019learning] successfully scaled up the maximum likelihood learning of EBMs to high-dimensional images. In particular, we follow [nijkamp2019anatomy, nijkamp2019learning] and use short-run non-convergent MCMC to train our EBMs.
5.1 Toy dataset
To give a quick proof-of-concept, we apply our method to toy datasets (25-Gaussians and Swiss Roll) following the setting of [tanaka2019discriminator]. We first train a VAE on the training data, and then we fix the VAE and train a latent EBM as described in Section 3.2. The decoder and the energy function (which corresponds to the discriminator in GANs) have a simple fully connected structure as described in [tanaka2019discriminator]. Note that we do not use normalizing flows on toy datasets, because a vanilla flow is heavily constrained by the manifold structure of the prior distribution, making it very hard to model distributions like the 25-Gaussians.
We show qualitative results in Figure 5.1. We observe that although samples from VAEs can basically cover the shape of the true distribution, many samples still appear at low density regions. In contrast, by sampling and decoding latent variables obtained from the post-trained latent EBM, we can accurately preserve all modes in the target distribution while eliminating spurious modes in the 25-Gaussians case. In the Swiss Roll case, it is also clear that the EBM better captures the underlying data distribution.
5.2 Image dataset
In this section we evaluate the performance of the proposed latent EBM on the MNIST, Fashion MNIST and CIFAR-10 datasets. We use different decoder based generative models $g$, including normalizing flow, VAE and GLF, a latent flow model [xiao2019generative] that combines a deterministic auto-encoder and a normalizing flow on the latent variables. Note that our main focus is on the relative improvements of sampling from the EBMs over sampling from base generative models, and therefore the performances of the base generative models may not be state-of-the-art. In fact, we adopt relatively simple network structures for convenience.
As in [du2019implicit], we use a convolutional network with scalar outputs as $E_\theta$. We adopt short-run non-persistent MCMC, so the Langevin dynamics (8) is run with $z_0$ initialized from $p(z)$ for a small number of steps. We fix $g$, and $E_\theta$ is trained by maximum likelihood. For details on the settings of our experiments, see Appendix B.
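This training procedure can be sketched on a toy problem. The following is a minimal illustration under stated assumptions, not our experimental code: a fixed hypothetical decoder $g(z) = 2z$ with a $N(0, 1)$ prior (so the base model has variance 4), data drawn from $N(0, 1)$, and a one-parameter energy $E_\theta(x) = \theta x^2/2$. Each update contrasts the energy gradient on a data batch with the energy gradient on decoded short-run samples, whose chains are re-initialized from the prior at every step (non-persistent).

```python
import math
import random

A = 2.0  # hypothetical fixed decoder g(z) = A * z; the base model has variance A^2


def short_run_latent_sample(theta, step_size, n_steps, rng):
    """Non-persistent short-run Langevin chain started from the N(0, 1) prior."""
    z = rng.gauss(0.0, 1.0)
    for _ in range(n_steps):
        # Gradient of E_theta(g(z)) - log p(z) for E_theta(x) = theta * x^2 / 2.
        grad = theta * A * A * z + z
        z = z - 0.5 * step_size * grad + math.sqrt(step_size) * rng.gauss(0.0, 1.0)
    return z


def dE_dtheta(x):
    """Derivative of the toy energy with respect to theta."""
    return 0.5 * x * x


rng = random.Random(0)
theta, lr, batch_size = 0.0, 0.2, 128
for _ in range(200):
    data = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]  # toy data ~ N(0, 1)
    negatives = [A * short_run_latent_sample(theta, 0.05, 25, rng) for _ in range(batch_size)]
    # Maximum likelihood gradient: positive phase minus negative phase.
    grad = (sum(dE_dtheta(x) for x in data) - sum(dE_dtheta(x) for x in negatives)) / batch_size
    theta -= lr * grad

# With exact sampling, the maximum likelihood solution would be theta = 0.75.
print(f"learned theta={theta:.3f}")
```

The learned value is only close to 0.75, not equal to it, because the short discretized chain approximates the latent EBM rather than sampling from it exactly and because the gradient is estimated from mini-batches.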
We show some qualitative results of training our proposed latent EBMs on top of a GLOW [glow] model in Figure 5.2. From Figure 5.2, we clearly observe that samples generated by latent variables obtained from the latent EBMs have higher quality than samples from the base generative model (i.e., decoding latent variables from the prior distribution). On MNIST and Fashion MNIST, samples obtained through the latent EBM have smoother shapes than samples from the GLOW. On CIFAR-10, the latent EBM effectively corrects the noisy backgrounds of the samples generated by the GLOW. We illustrate the process of Langevin dynamics sampling from the latent EBM in Figure 5.3, where we generate a sample every 10 iterations. The Langevin dynamics visibly moves towards latent variables that produce sharper and more semantically meaningful samples.
More qualitative examples, including results of training latent EBMs on top of VAE and GLF are presented in Appendix C. In Figure C.2, we observe that the VAE+latent EBM generates sharper samples than VAE alone. However, it should be noted that, since our EBM operates on the latent space, the overall sample quality is constrained by the capacity of the base generative models.
Our observation on the improvements of sample quality is confirmed by the quantitative results in Table 1, where we compare the FID scores [heusel2017gans] of different models. We see that sampling latent variables from the latent EBM significantly improves the quality of generated samples over directly sampling from $p(z)$. In addition, we see that the methods mentioned in Section 4 do not improve the sample quality of VAEs without placing a large weight on the reconstruction term, while our method can generate better samples without changing the VAE's objective, leading to good sample quality and structured latent representation. For completeness, we also present results of training EBMs directly on pixel space using the same model structures. The results are in the same range as the latent EBM models, but we observe that training EBMs on data is more sensitive to hyper-parameter settings and more computationally expensive.
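Table 1: FID scores of different models (lower is better).

| Model | MNIST | Fashion MNIST | CIFAR-10 |
|---|---|---|---|
| GLOW + EBM | 12.3 | 41.6 | 67.8 |
| VAE + flow | 18.6 | 52.3 | 128.2 |
| VAE + EBM | 16.0 | 38.1 | 108.4 |
| GLF + EBM | 12.1 | 25.3 | 85.1 |
| EBM on data | 25.5 | 39.3 | 80.6 |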
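For reference, the FID [heusel2017gans] is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images, so lower values indicate generated feature statistics closer to those of real data. Writing $(\mu_{\mathrm{data}}, \Sigma_{\mathrm{data}})$ and $(\mu_{\mathrm{gen}}, \Sigma_{\mathrm{gen}})$ for the feature means and covariances (notation introduced here for illustration only):

```latex
\mathrm{FID} = \lVert \mu_{\mathrm{data}} - \mu_{\mathrm{gen}} \rVert_2^2
  + \operatorname{Tr}\!\left(\Sigma_{\mathrm{data}} + \Sigma_{\mathrm{gen}}
  - 2\,(\Sigma_{\mathrm{data}} \Sigma_{\mathrm{gen}})^{1/2}\right)
```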
5.3 Overfitting issue of training latent EBMs
As pointed out in [grathwohl2019your, nijkamp2019anatomy], instability and overfitting are frequently observed when training energy-based models. Overfitting happens when the energy of training samples is much lower than that of samples drawn from the EBM, which causes the model to produce worse samples. We find that heuristic approaches such as energy regularization and gradient clipping are not helpful in preventing overfitting, so we simply stop the training of the latent EBM when sample quality deteriorates. We believe that future studies on improving the stability of EBM training can further boost the performance of our method.
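One way to implement such a stopping rule is sketched below. It is an illustration under assumptions rather than our training code: `quality_history` is a hypothetical list of sample-quality scores recorded at regular intervals (for example FID on a batch of generated samples, lower is better), and `patience` is an illustrative tolerance for noisy evaluations.

```python
def should_stop(quality_history, patience=3):
    """Return True once the score has failed to improve for `patience` evaluations.

    Scores are assumed to be "lower is better" (e.g. FID).
    """
    if len(quality_history) <= patience:
        return False
    best_before = min(quality_history[:-patience])
    recent_best = min(quality_history[-patience:])
    return recent_best >= best_before


history = []
stopped_at = None
# Hypothetical scores: quality improves, then deteriorates as the EBM overfits.
for step, score in enumerate([60.0, 45.0, 40.0, 41.0, 44.0, 50.0, 58.0]):
    history.append(score)
    if should_stop(history, patience=3):
        stopped_at = step
        break
print(stopped_at)
```

In such a scheme one would keep the checkpoint with the best recorded score rather than the parameters at the stopping step.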
5.4 Training time
One training step of the EBM requires obtaining a sample by running multiple steps of MCMC, and therefore it is much slower than one training step of the base generative model. However, we find that very few training iterations are needed to achieve the results in Table 1. Specifically, we only train latent EBMs for 200 steps when $g$ is a GLOW or GLF, and 1000 steps when $g$ is a VAE. Here a step refers to one batch, not an entire epoch. These numbers are several orders of magnitude smaller than the training steps needed for the base generative models. Therefore, our method does not add much computational overhead. As a comparison, training EBMs on pixel space typically requires more than k steps.
In this paper, we propose to train an Energy-based model on the latent space of pre-trained generative models. We show that with little computational overhead, we can improve the sample quality of a variety of generative models, including normalizing flow and VAE, by sampling latent variables from the EBM. Our method also provides a general framework that connects Energy-based models and other likelihood based generative models. We believe this connection is an interesting direction for future research.
Appendix A Proof of (7)
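Since $x$ is generated through a deterministic mapping $g$ from the latent space $\mathcal{Z}$ to the observation space $\mathcal{X}$, for any function $f$ on $\mathcal{X}$ we have:

$$\mathbb{E}_{x \sim p_g(x)}[f(x)] = \mathbb{E}_{z \sim p(z)}[f(g(z))].$$

Applying this with $f = \exp(-E_\theta)$ gives $Z(\theta) = \mathbb{E}_{z \sim p(z)}[\exp(-E_\theta(g(z)))]$. Taking the derivative of $\log Z(\theta)$ w.r.t. $\theta$, we get

$$\nabla_\theta \log Z(\theta) = -\frac{\mathbb{E}_{z \sim p(z)}[\exp(-E_\theta(g(z)))\,\nabla_\theta E_\theta(g(z))]}{Z(\theta)} = -\mathbb{E}_{z \sim p_\theta(z)}[\nabla_\theta E_\theta(g(z))].$$

Since the second term of (2) equals $-\nabla_\theta \log Z(\theta)$, this is exactly (7).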
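The identity can also be checked numerically on a toy model. The sketch below assumes a $N(0, 1)$ prior, a hypothetical decoder $g(z) = 2z$ and energy $E_\theta(x) = \theta x^2/2$, for which the normalizing constant of the tilted model, $Z(\theta) = \mathbb{E}_{x \sim p_g}[\exp(-E_\theta(x))]$, equals $(1 + 4\theta)^{-1/2}$ in closed form. It estimates $Z(\theta)$ and the negative-phase expectation purely from latent samples; the names and constants are illustrative.

```python
import math
import random

A, THETA = 2.0, 1.0  # hypothetical decoder slope and energy parameter


def energy(x):
    return 0.5 * THETA * x * x


rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # z ~ p(z)
weights = [math.exp(-energy(A * z)) for z in zs]  # exp(-E(g(z)))

# Z(theta) as an expectation over the prior instead of over data space.
z_hat = sum(weights) / len(weights)
z_exact = (1.0 + THETA * A * A) ** -0.5

# Negative-phase term E_{p_theta(z)}[dE/dtheta], via self-normalized importance weights.
neg_phase = sum(w * 0.5 * (A * z) ** 2 for w, z in zip(weights, zs)) / sum(weights)
neg_phase_exact = 0.5 * A * A / (1.0 + THETA * A * A)  # equals -d/dtheta log Z(theta)
print(z_hat, z_exact, neg_phase, neg_phase_exact)
```

Importance weighting from the prior is used here only because the toy latent space is one-dimensional; in high dimensions the expectation under the latent EBM is instead approximated with Langevin samples.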
Appendix B Experimental Settings
B.1 Base generative models
We first introduce the training settings of the base generative models used in our experiments. We train GLOW largely following the settings provided in [nalisnick2018deep]. For MNIST and Fashion MNIST, we use a GLOW architecture of 2 blocks of 16 affine coupling layers, squeezing the spatial dimension in between the 2 blocks. For the coupling function, we use a 3-layer Highway network with 64 hidden channels. For CIFAR-10, we use 3 blocks of 32 affine coupling layers, applying the multi-scale architecture between each block. The coupling function is a 3-layer Highway network with 256 hidden channels. Note that we modify the model size to fit in a single GPU for training. For MNIST and Fashion MNIST, we train the GLOW for 128 epochs with batch size 64 and Adam optimizer with fixed learning rate . For CIFAR-10, we train the GLOW for 256 epochs with batch size 64 and Adam optimizer with fixed learning rate .
We use the DCGAN [radford2015unsupervised] structure for the decoders of our VAEs, and the encoders are designed to be symmetric to the decoders. We use latent dimension for all experiments. For the MNIST and Fashion MNIST datasets, we use binary cross entropy as the reconstruction loss, while for CIFAR-10, we use MSE loss. All VAEs are trained for 256 epochs with batch size 128 and Adam optimizer with fixed learning rate .
For GLF, we adopt the same encoder-decoder structure as in our settings for training VAEs. We use latent dimension for all experiments. The normalizing flow for matching the latent distribution is a simple GLOW network with 4 affine coupling layers, each consisting of one fully connected layer with 256 units. The AE and the flow are jointly trained for 256 epochs with batch size 128 and Adam optimizer with fixed learning rate .
B.2 Energy-based models
We use a simplified version of the network structure described in [du2019implicit] to define our $E_\theta$. In particular, our energy network consists of 3 ResNet blocks with 64 hidden channels and 3 ResNet blocks with 128 hidden channels, followed by global sum pooling and a fully connected layer. We also find that the network structure in [nalisnick2018deep], which has far fewer parameters, leads to only slightly worse performance. Therefore, their network can be used as the energy function when parameter efficiency matters.
Unlike [du2019implicit, nijkamp2019anatomy], where the Langevin dynamics is dominated by the gradient, we find that our latent EBMs work well with balanced noise and gradient in (8). For Langevin dynamics, we use and run the chain for 60 steps. We find that adding a small amount (with coefficient ) of energy regularization is helpful for avoiding over-fitting early in the training. After training, we find that sampling latent variables with a longer chain leads to better performance. We generate samples from $p_\theta(z)$ by running the chain for 100 steps.
For EBMs on the pixel space, we find short-run non-persistent training as described in [nijkamp2019anatomy] hard to converge on MNIST and Fashion MNIST, so we follow the setting in [du2019implicit], which uses persistent initialization for the Langevin dynamics. A sample replay buffer is maintained during training, and samples from the buffer are used to initialize the chain. We follow the hyper-parameter settings in [du2019implicit], and we train EBMs on MNIST and Fashion MNIST for k steps, and on CIFAR-10 for k steps. Note that we train for fewer steps on CIFAR-10 than the open source implementation of [du2019implicit], because full training takes prohibitively long on our hardware. We find our samples qualitatively comparable to those of [du2019implicit] (see Figure C.4), but we are unable to match their reported FID scores on CIFAR-10, possibly due to not training long enough. After training, new samples are generated from chains initialized from the replay buffer.
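Persistent initialization with a replay buffer can be sketched as follows. This is a generic illustration, not the implementation of [du2019implicit]: the class name, the capacity, and the probability `p_reinit` of restarting a chain from noise are hypothetical choices, and chain states are represented by plain floats for brevity.

```python
import random


class ReplayBuffer:
    """Stores final states of past MCMC chains to initialize future chains."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.rng = rng
        self.items = []

    def sample_init(self, batch_size, noise_fn, p_reinit=0.05):
        """Draw initial states: mostly from the buffer, occasionally fresh noise."""
        inits = []
        for _ in range(batch_size):
            if not self.items or self.rng.random() < p_reinit:
                inits.append(noise_fn())
            else:
                inits.append(self.rng.choice(self.items))
        return inits

    def add(self, states):
        """Insert chain end states, evicting the oldest entries when full."""
        self.items.extend(states)
        overflow = len(self.items) - self.capacity
        if overflow > 0:
            self.items = self.items[overflow:]


rng = random.Random(0)
buffer = ReplayBuffer(capacity=100, rng=rng)
# The buffer is empty at first, so every chain starts from noise.
first = buffer.sample_init(8, noise_fn=lambda: rng.uniform(-1.0, 1.0))
# Stand-in for the chain states after running the Langevin updates.
buffer.add([x + 10.0 for x in first])
second = buffer.sample_init(8, noise_fn=lambda: rng.uniform(-1.0, 1.0), p_reinit=0.0)
```

Because chains resume from earlier end states, each training step needs fewer Langevin updates to reach low-energy regions, at the cost of keeping the buffer in memory.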
Appendix C Additional Qualitative Results
In this section, we show some additional qualitative results. In Figure C.1, we present more examples of samples from GLOW and GLOW + latent EBM, in addition to Figure 5.2 in the main text. In Figure C.2, we show samples from VAE and VAE + latent EBM. In Figure C.3, we show samples from GLF and GLF + latent EBM. In all of these experiments, we clearly observe that latent EBMs improve the sample quality of base generative models. Finally, in Figure C.4, we show samples from EBMs trained on pixel space.