Generative Adversarial Networks (GANs) (goodfellow2014generative) are state-of-the-art models for a large variety of tasks such as image generation, semi-supervised learning (dai2017good), image editing (Yi_2017_ICCV), image translation (zhu2017unpaired), and imitation learning (ho2016generative). In a nutshell, the GAN framework consists of two neural networks, the generator $G$ and the discriminator $D$. The optimization process is formulated as an adversarial game, with the generator trying to fool the discriminator and the discriminator trying to better classify real from fake samples.
Despite the ability of GANs to generate high-resolution, sharp samples, these models are notoriously hard to train. Previous work on GANs focused on improving stability and mode dropping issues (Metz2016UnrolledGA; che2016mode; arjovsky2017wasserstein; gulrajani2017improved; roth2017stabilizing), which are believed to be the main difficulties in training GANs.
Besides instability and mode dropping, the samples of state-of-the-art GAN models sometimes contain bad artifacts or are not even recognizable (karras2019analyzing). It is conjectured that this is due to the inherent difficulty of generating high-dimensional complex data, such as natural images, and the optimization challenge of the adversarial formulation. In order to improve sample quality, conventional sampling techniques, such as increasing the temperature, are commonly adopted for GAN models (biggan). Recently, new sampling methods such as Discriminator Rejection Sampling (DRS) (azadi2018discriminator), Metropolis-Hastings Generative Adversarial Network (MH-GAN) (pmlr-v97-turner19a-mhgan) and Discriminator Optimal Transport (DOT) (tanaka2019discriminator) have shown promising results by utilizing the information provided by both the generator and the discriminator. However, these sampling techniques are either inefficient or lack theoretical guarantees, possibly reducing the sample diversity and making the mode dropping problem more severe.
In this paper, we show that GANs can be better understood through the lens of Energy-Based Models (EBM). In our formulation, GAN generators and discriminators collaboratively learn an “implicit” energy-based model. However, efficient sampling from this energy-based model directly in pixel space is extremely challenging for several reasons. One is that there is no tractable closed form for the implicit energy function in pixel space. This motivates an intriguing possibility: that Markov Chain Monte Carlo (MCMC) sampling may prove more tractable in the GAN’s latent space.
Surprisingly, we find that the implicit energy based model defined jointly by a GAN generator and discriminator takes on a simpler, tractable form when it is written as an energy-based model over the generator’s latent space. In this way, we propose a theoretically grounded way of getting high quality samples from GANs through what we call Discriminator Driven Latent Sampling (DDLS).
DDLS leverages the information contained in the discriminator to re-weight and correct the biases and errors in the generator. Through experiments, we show that our proposed method is highly efficient in terms of mixing time, is generally applicable to a variety of GAN models (Minimax, Non-Saturating, Wasserstein GANs), and is robust against a wide range of hyper-parameters.
We highlight our main contributions as follows:
We propose and prove that it is beneficial to sample from the energy-based model defined both by the generator and the discriminator instead of from the generator only.
We derive an equivalent formulation of the pixel-space energy-based model in the latent space, and show that sampling is much more efficient in the latent space.
We show experimentally that samples from this energy-based model are of higher quality than samples from the generator alone.
We show that our method can approximately extend to other GAN formulations, such as Wasserstein GANs.
In this section we present the background methodology of GANs and EBMs on which our method is based.
2.1 Generative Adversarial Networks
GANs (goodfellow2014generative) are a powerful class of generative models defined through an adversarial minimax game between a generator network $G$ and a discriminator network $D$. The generator $G$ takes a latent code $z$ from a prior distribution $p_0(z)$ and produces a sample $G(z) \in \mathcal{X}$. The discriminator takes a sample $x \in \mathcal{X}$ as input and aims to classify real data from fake samples produced by the generator, while the generator is asked to fool the discriminator as much as possible. We use $p_d$ to denote the true data-generating distribution and $p_g$ to denote the implicit distribution induced by the prior and the generator network. The standard non-saturating training objective for the generator and discriminator is defined as:
$$L_D = -\mathbb{E}_{x \sim p_d}[\log D(x)] - \mathbb{E}_{z \sim p_0}[\log(1 - D(G(z)))], \qquad L_G = -\mathbb{E}_{z \sim p_0}[\log D(G(z))].$$
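To make the objective concrete, the following Python sketch evaluates the commonly used non-saturating losses, $L_D = -\mathbb{E}[\log D(x)] - \mathbb{E}[\log(1 - D(G(z)))]$ and $L_G = -\mathbb{E}[\log D(G(z))]$, on small batches of discriminator outputs. The function names and batch values are illustrative assumptions and are not taken from the experiments.

```python
import math


def discriminator_loss(d_real, d_fake):
    """L_D = -mean(log D(x)) - mean(log(1 - D(G(z)))) over the given batches."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating L_G = -mean(log D(G(z)))."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs D(.) in (0, 1) on small batches.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.2, 0.1, 0.3]
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

When the discriminator outputs 0.5 everywhere (it cannot tell real from fake), the two losses reduce to $2\log 2$ and $\log 2$ respectively.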
Wasserstein GANs (WGAN) (arjovsky2017wasserstein) are a special family of GAN models. Instead of targeting a Jensen-Shannon divergence, they target the 1-Wasserstein distance $W(p_g, p_d)$. The WGAN discriminator objective function is constructed using the Kantorovich duality
$$\max_{D \in \mathcal{L}} \; \mathbb{E}_{x \sim p_d}[D(x)] - \mathbb{E}_{x \sim p_g}[D(x)],$$
where $\mathcal{L}$ is the set of 1-Lipschitz functions.
The WGAN discriminator is motivated in terms of defining a critic function whose gradient with respect to its input is better behaved (smoother) than in original GANs, making the optimization of the generator easier (lucic2018gans).
2.2 Energy-Based Models and Langevin Dynamics
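An energy-based model (EBM) is defined by the Boltzmann distribution
$$p(x) = \frac{e^{-E(x)}}{Z},$$
where $x \in \mathcal{X}$, $\mathcal{X}$ is the state space, and $E(x): \mathcal{X} \to \mathbb{R}$ is the energy function. Samples are typically generated from $p(x)$ by an MCMC algorithm. One common MCMC algorithm in continuous state spaces is Langevin dynamics, with the update equation
$$x_{i+1} = x_i - \frac{\epsilon}{2} \nabla_x E(x_i) + \sqrt{\epsilon}\, n, \qquad n \sim \mathcal{N}(0, I).$$
Langevin dynamics are guaranteed to exactly sample from the target distribution $p(x)$ as $\epsilon \to 0$ and $i \to \infty$. (In practice we will use a small, finite value for $\epsilon$ in our experiments.)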
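As an illustration of Langevin dynamics, the sketch below runs the standard unadjusted update $x \leftarrow x - \frac{\epsilon}{2}\nabla_x E(x) + \sqrt{\epsilon}\, n$ on a one-dimensional Gaussian energy. The target (mean 2, standard deviation 0.5), the step size, and the chain length are hypothetical choices made only so that the result is easy to check.

```python
import math
import random


def langevin_sample(grad_energy, x0, step_size, n_steps, rng):
    """Run unadjusted Langevin dynamics and return the final state.

    Update: x <- x - (step_size / 2) * grad_energy(x) + sqrt(step_size) * n,
    with n ~ N(0, 1).
    """
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - 0.5 * step_size * grad_energy(x) + math.sqrt(step_size) * noise
    return x


def gaussian_grad_energy(x, mean=2.0, std=0.5):
    """Gradient of E(x) = (x - mean)^2 / (2 std^2), i.e. a Gaussian target."""
    return (x - mean) / (std ** 2)


def run_chains(n_chains=2000, seed=0):
    """Run independent chains from x0 = 0 and return their final states."""
    rng = random.Random(seed)
    return [
        langevin_sample(gaussian_grad_energy, 0.0, 0.01, 500, rng)
        for _ in range(n_chains)
    ]


samples = run_chains()
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
```

With a finite step size the chain is only approximately correct: the empirical mean and standard deviation land close to, but not exactly on, the target values.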
One solution to the problem of slow-sampling Markov chains is to perform sampling using a carefully crafted latent space (DBLP:conf/icml/BengioMDR13; hoffman2019neutra). However, in the unsupervised learning setting, it is not clear how to simultaneously train a latent representation together with a Markov chain. Our method shows how one can execute such latent space MCMC in GAN models.
3.1 GANs as an Energy-Based Model
Suppose we have a GAN model trained on a data distribution $p_d$ with a generator $G(z)$ with generator distribution $p_g$ and a discriminator $D(x)$. We assume that $p_g$ and $p_d$ have the same support. This can be guaranteed by adding small Gaussian noise to these two distributions.
GAN training is an adversarial game that rarely converges to the optimal generator, so usually $p_g$ and $p_d$ do not match perfectly at the end of training. However, the discriminator provides a quantitative estimate for how much these two distributions (mis)match. Let’s assume the discriminator is near optimality, namely (goodfellow2014generative):
$$D(x) \approx \frac{p_d(x)}{p_d(x) + p_g(x)}.$$
From the above equation, let $d(x)$ be the logit of $D(x)$, in which case
$$\frac{p_d(x)}{p_d(x) + p_g(x)} = \frac{1}{1 + \frac{p_g(x)}{p_d(x)}} \approx \frac{1}{1 + \exp(-d(x))},$$
and we have $e^{d(x)} \approx p_d(x)/p_g(x)$, and $p_d(x) \approx p_g(x)\, e^{d(x)}$. Normalization of $p_g(x)\, e^{d(x)}$ is not guaranteed, and so it may not be a valid probabilistic model. We therefore consider the energy-based model:
$$p_d^*(x) = \frac{p_g(x)\, e^{d(x)}}{Z_0},$$
where $Z_0$ is a normalization constant. Intuitively, this formulation has two desirable properties. First, as we elaborate later, if $D = D^*$ where $D^*$ is the optimal discriminator, then $p_d^* = p_d$. Secondly, it corrects the bias in the generator via weighting and normalization. If we can sample from this distribution, it should improve our samples.
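The reweighting can be checked numerically on a finite state space. The sketch below assumes the corrected model has the form $p_g(x)\, e^{d(x)} / Z_0$ and that the optimal logit is $d(x) = \log p_d(x) - \log p_g(x)$; the two four-state distributions are hypothetical.

```python
import math


def reweight(p_g, logits):
    """Return p_g(x) * exp(d(x)) / Z_0 on a finite state space."""
    unnormalized = [pg * math.exp(d) for pg, d in zip(p_g, logits)]
    z0 = sum(unnormalized)
    return [u / z0 for u in unnormalized]


def optimal_logit(p_d, p_g):
    """Logit of the optimal discriminator: d(x) = log p_d(x) - log p_g(x)."""
    return [math.log(pd) - math.log(pg) for pd, pg in zip(p_d, p_g)]


p_d = [0.1, 0.2, 0.3, 0.4]   # hypothetical data distribution
p_g = [0.4, 0.3, 0.2, 0.1]   # hypothetical (biased) generator distribution

logits = optimal_logit(p_d, p_g)
corrected = reweight(p_g, logits)

# A constant offset in the logit is absorbed by the normalization constant.
shifted = reweight(p_g, [d + 0.7 for d in logits])
```

Both `corrected` and `shifted` coincide with `p_d`, which shows why the explicit normalization matters: a miscalibrated constant offset in the logit does not change the corrected distribution.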
There are two difficulties in sampling efficiently from $p_d^*$:
Doing MCMC in pixel space to sample from the model is impractical due to the high dimensionality and long mixing time.
$p_g(x)$ is implicitly defined and its density cannot be computed directly.
In the next section we show how to overcome these two difficulties.
3.2 Rejection Sampling and MCMC in Latent Space
Our approach to the above two problems is to formulate an equivalent energy-based model in the latent space. To derive this formulation, we first review rejection sampling (10.2307/4356322). With $p_g$ as the proposal distribution, we have $e^{d(x)}/Z_0 = p_d^*(x)/p_g(x)$. Denote $M = \max_x p_d^*(x)/p_g(x)$ (this is well-defined if we add a Gaussian noise to the output of the generator and $x$ is in a compact space). If we accept samples $x$ from the proposal distribution $p_g$ with probability $p_d^*(x)/(M p_g(x))$, then the samples we produce have the distribution $p_d^*$.
We can alternatively interpret the rejection sampling procedure above as occurring in the latent space $z$. In this interpretation, we first sample $z$ from the prior $p_0(z)$, and then perform rejection sampling on $z$ with acceptance probability $e^{d(G(z))}/(M Z_0)$. Only once a latent sample $z$ has been accepted do we generate the pixel level sample $x = G(z)$.
This rejection sampling procedure on $z$ induces a new probability distribution $p_t(z)$. To explicitly compute this distribution we need to conceptually reverse the definition of rejection sampling. We formally write down the “reverse” lemma of rejection sampling as Lemma 1, to be used in our main theorem.
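Lemma 1. On a space $\mathcal{X}$ there is a probability distribution $p(x)$. $r(x): \mathcal{X} \to [0, 1]$ is a measurable function on $\mathcal{X}$. We consider sampling from $p$, accepting with probability $r(x)$, and repeating this procedure until a sample is accepted. We denote the resulting probability measure of the accepted samples $q(x)$. Then we have:
$$q(x) = \frac{p(x)\, r(x)}{Z}, \qquad Z = \mathbb{E}_{p}[r(x)].$$
Proof. From the definition of rejection sampling, we can see that in order to get the distribution $q(x)$, we can sample $x$ from $p(x)$ and do rejection sampling with probability $r'(x) = q(x)/(M p(x))$, where $M \ge q(x)/p(x)$ for all $x$. So we have $r'(x) = r(x)/(Z M)$. If we choose $M = 1/Z$, then from $r(x) \le 1$ for all $x$, we can see that $M$ satisfies $M \ge q(x)/p(x) = r(x)/Z$ for all $x$. So we can choose $M = 1/Z$, resulting in $r'(x) = r(x)$. ∎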
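The closed form stated by the lemma, $q(x) = p(x)\, r(x) / \mathbb{E}_p[r(x)]$, can be compared against a direct simulation of the accept/reject loop. The sketch below does this on a three-state space; the proposal `p` and acceptance probabilities `r` are hypothetical values.

```python
import random


def rejection_sample(p, r, rng):
    """Draw one index from p, accept with probability r[i], repeat until accepted."""
    states = range(len(p))
    while True:
        i = rng.choices(states, weights=p)[0]
        if rng.random() < r[i]:
            return i


def lemma_density(p, r):
    """Closed form: q(x) = p(x) r(x) / Z with Z = E_p[r(x)]."""
    z = sum(pi * ri for pi, ri in zip(p, r))
    return [pi * ri / z for pi, ri in zip(p, r)]


p = [0.5, 0.3, 0.2]   # hypothetical proposal distribution
r = [0.2, 0.9, 0.5]   # hypothetical acceptance probabilities in [0, 1]

rng = random.Random(0)
n = 20000
counts = [0] * len(p)
for _ in range(n):
    counts[rejection_sample(p, r, rng)] += 1
empirical = [c / n for c in counts]
expected = lemma_density(p, r)
```

The empirical frequencies of accepted states agree with `expected` up to Monte Carlo error.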
Namely, we have the prior proposal distribution $p_0(z)$ and an acceptance probability $r(z) = e^{d(G(z))}/(M Z_0)$. We want to compute the distribution after the rejection sampling procedure with $r(z)$. With Lemma 1, we can see that $p_t(z) = p_0(z)\, r(z)/Z'$, where $Z'$ is a normalization constant. We expand on the details in our main theorem.
Interestingly, $p_t(z)$ has the form of an energy-based model, $p_t(z) = e^{-E(z)}/Z$, with tractable energy function $E(z) = -\log p_0(z) - d(G(z))$. In order to sample from this Boltzmann distribution, one can use an MCMC sampler, such as Langevin dynamics or Hamiltonian Monte Carlo. The algorithm using Langevin dynamics is given in Alg. 1.
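A minimal sketch of latent-space Langevin sampling on a toy problem follows. It assumes the latent energy $E(z) = -\log p_0(z) - d(G(z))$ with a standard normal prior, and uses a hypothetical linear generator $G(z) = 2z + 1$ (so $p_g = \mathcal{N}(1, 4)$), a hypothetical data distribution $\mathcal{N}(3, 1)$, and the corresponding optimal logit with hand-written gradients. In a real GAN these gradients would come from automatic differentiation through the networks; this is not the authors' implementation.

```python
import math
import random


def generator(z):
    """Toy generator G(z) = 2z + 1, so z ~ N(0, 1) gives p_g = N(1, 4)."""
    return 2.0 * z + 1.0


def generator_grad(z):
    """dG/dz for the toy generator."""
    return 2.0


def logit_grad(x):
    """d'(x) for the optimal logit d(x) = log N(x; 3, 1) - log N(x; 1, 4)."""
    return -(x - 3.0) + (x - 1.0) / 4.0


def energy_grad(z):
    """Gradient of E(z) = -log p_0(z) - d(G(z)) with p_0 = N(0, 1)."""
    return z - logit_grad(generator(z)) * generator_grad(z)


def ddls_sample(step_size, n_steps, rng):
    """Initialise z from the prior, run latent Langevin dynamics, return G(z)."""
    z = rng.gauss(0.0, 1.0)
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        z = z - 0.5 * step_size * energy_grad(z) + math.sqrt(step_size) * noise
    return generator(z)


rng = random.Random(0)
xs = [ddls_sample(0.01, 500, rng) for _ in range(2000)]
mean_x = sum(xs) / len(xs)
std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / len(xs))
```

For this toy setup the latent energy simplifies to $2(z - 1)^2$ plus a constant, so the corrected samples $G(z)$ have mean close to 3 and standard deviation close to 1, matching the assumed data distribution rather than the generator's $\mathcal{N}(1, 4)$.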
3.3 Main Theorem
Summarizing the arguments and results above, we have the following theorem:
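Theorem 1. Assume $p_d$ is the data generating distribution, and $p_g$ is the generator distribution induced by the generator $G: \mathcal{Z} \to \mathcal{X}$, where $\mathcal{Z}$ is the latent space with prior distribution $p_0(z)$. Define $p_d^* = e^{\log p_g(x) + d(x)}/Z_0$, where $Z_0$ is the normalization constant.
Assume $p_g$ and $p_d$ have the same support. This assumption is typically satisfied when $\dim(z) \ge \dim(x)$. We address the case that $\dim(z) < \dim(x)$ in Corollary 1. Further, let $D(x)$ be the discriminator, and $d(x)$ be the logit of $D$, namely $D(x) = \sigma(d(x))$. We define the energy function $E(z) = -\log p_0(z) - d(G(z))$, and its Boltzmann distribution $p_t(z) = e^{-E(z)}/Z$. Then we have:
(1) $p_d^* = p_d$ when $D$ is the optimal discriminator.
(2) If we sample $z \sim p_t$, and $x = G(z)$, then we have $x \sim p_d^*$. Namely, the induced probability measure $G \circ p_t = p_d^*$.
Proof. (1) follows from the fact that when $D$ is optimal, $D(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}$, so $D(x) = \sigma(\log p_d(x) - \log p_g(x))$, which implies that $d(x) = \log p_d(x) - \log p_g(x)$ (which is finite on the support of $p_g$ due to the fact that they have the same support). Thus $p_d^*(x) = p_d(x)/Z_0$, and we must have $Z_0 = 1$ for normalization, so $p_d^* = p_d$.
For (2), for samples $x \sim p_g$, if we do rejection sampling with probability $p_d^*(x)/(M p_g(x)) = e^{d(x)}/(M Z_0)$ (where $M$ is a constant with $M \ge p_d^*(x)/p_g(x)$), we get samples from the distribution $p_d^*$. We can view this rejection sampling as a rejection sampling on the latent space $\mathcal{Z}$, where we perform rejection sampling on $p_0(z)$ with acceptance probability $r(z) = e^{d(G(z))}/(M Z_0)$. Applying Lemma 1, we can see that this rejection sampling procedure induces a probability distribution $p_t(z) = p_0(z)\, r(z)/C$ on the latent space $\mathcal{Z}$, where $C$ is the normalization constant. Thus sampling from $p_d^*(x)$ is equivalent to sampling $z$ from $p_t(z)$ and generating $x = G(z)$. ∎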
3.4 Sampling Wasserstein GANs with Langevin Dynamics
Wasserstein GANs are different from original GANs in that they target the Wasserstein loss. Although the discriminator, when trained to optimality, can recover the Kantorovich dual (arjovsky2017wasserstein) of the optimal transport between $p_g$ and $p_d$, the target distribution $p_d$ cannot be exactly recovered using the information in $G$ and $D$. (In tanaka2019discriminator, the authors claim that it is possible to recover $p_d$ with $D$ and $G$ in WGAN in certain metrics, but we show in the Appendix that their assumption does not hold, and that in the metric which WGAN uses it is not possible to recover $p_d$.) However, in the following we show that in practice, the optimization of WGAN can be viewed as an approximation of an energy-based model, which can also be sampled with our method.
The objectives of Wasserstein GANs can be summarized as:
$$L_D = \mathbb{E}_{x \sim p_g}[D(x)] - \mathbb{E}_{x \sim p_d}[D(x)], \qquad L_G = -\mathbb{E}_{z \sim p_0}[D(G(z))],$$
where $D$ is restricted to be a $K$-Lipschitz function.
On the other hand, consider a new energy-based generative model (which also has a generator and a discriminator) trained with the following objectives:
Discriminator training phase (D-phase). Unlike GANs, our energy-based model tries to match the distribution $p_t(x) = p_g(x)\, e^{D_\phi(x)}/Z$ with the data distribution $p_d$, where $p_t(x)$ can be interpreted as an EBM with energy $-\log p_g(x) - D_\phi(x)$. In this phase, the generator is kept fixed, and the discriminator is trained.
Generator training phase (G-phase). The generator is trained such that $p_g(x)$ matches $p_t(x)$; in this phase we treat $D_\phi$ as fixed and train $G$.
In the D-phase, we are training an EBM with data from $p_d$. The gradient of the KL-divergence can be written as (mackay2003information):
$$\nabla_\phi \mathrm{KL}(p_d \,\|\, p_t) = \mathbb{E}_{p_t}[\nabla_\phi D_\phi(x)] - \mathbb{E}_{p_d}[\nabla_\phi D_\phi(x)]. \tag{11}$$
Namely we are trying to maximize $D_\phi$ on real data and trying to minimize it on fake data. Note that the fake data distribution $p_t$ is a function of both the generator and discriminator, and cannot be sampled directly. As with other energy-based models, we can use an MCMC procedure such as Langevin dynamics to generate samples from $p_t$. Although in practice it would be difficult to generate equilibrium samples by MCMC for every training step, historically training of energy-based models has been successful even when negative samples are generated using only a small number of MCMC steps (tieleman2008training).
In the G-phase, we can train the model with KL-divergence. Let $p_g'$ be a fixed copy of $p_g$; we have (see the Appendix for more details):
$$\nabla_\theta \mathrm{KL}(p_g \,\|\, p_t) = -\mathbb{E}_{z \sim p_0}[\nabla_\theta D_\phi(G_\theta(z))]. \tag{12}$$
Note that the losses above coincide with what we are optimizing in WGANs, with two differences:
In WGAN, we optimize $D_\phi$ on $p_g$ instead of $p_t$. This may not be a big difference in practice, since as training progresses $p_t$ is expected to approach $p_g$, as the optimizing loss for the generator explicitly acts to bring $p_g$ closer to $p_t$ (Equation 12). Moreover, it has recently been found in LOGAN (wu2019logan) that optimizing $D_\phi$ on $p_t$ rather than $p_g$ can lead to better performance.
In WGAN, we impose a Lipschitz constraint on $D_\phi$. This constraint can be viewed as a smoothness regularizer. Intuitively it will make the distribution $p_t(x) = p_g(x)\, e^{D_\phi(x)}/Z$ more “flat” than $p_d$, but $p_t$ (which lies in a distribution family parameterized by $D_\phi$) still remains an approximator to $p_d$.
Thus, we can conclude that for a Wasserstein GAN with discriminator $D$, WGAN approximately optimizes the KL divergence of $p_t(x) = p_g(x)\, e^{D(x)}/Z$ with $p_d$, with the constraint that $D$ is $K$-Lipschitz. This suggests that one can also perform discriminator Langevin sampling on the latent space to get better samples with energy function $E(z) = -\log p_0(z) - D(G(z))$.
3.5 Practical Issues and the Mode Dropping Problem
Mode dropping is a major problem in training GANs. In our main theorem it is assumed that $p_g$ and $p_d$ have the same support. We also assumed that $G: \mathcal{Z} \to \mathcal{X}$ is a deterministic function. Thus, if $G$ cannot recover some of the modes in $p_d$, $p_d^*$ also cannot recover these modes.
However, we can partially solve the mode dropping problem by introducing an additional Gaussian noise $z'$ to the output of the generator, namely we define the new deterministic generator $G^*(z, z') = G(z) + \epsilon z'$. We treat $z'$ as a part of the generator, and do DDLS on the joint latent variables $(z, z')$. The Langevin dynamics will help the model to move data points that are a little bit off-mode to the data manifold, yielding the following corollary of the main theorem.
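Corollary 1. Assume $p_d$ is the data generating distribution with small Gaussian noise added. The generator $G: \mathcal{Z} \to \mathcal{X}$ is deterministic, where $\mathcal{Z}$ is the latent space endowed with prior distribution $p_0(z)$. Assume $z'$ is an additional Gaussian noise variable with density $p_1(z')$ and $\dim z' = \dim \mathcal{X}$. Let $\epsilon > 0$, and denote the distribution of the extended generator $G^*(z, z') = G(z) + \epsilon z'$ as $p_g^*$.
Define $p_d^* = e^{\log p_g^*(x) + d(x)}/Z^*$, where $Z^*$ is the normalization constant.
$D(x)$ is the discriminator trained between $p_g^*$ and $p_d$. Let $d(x)$ be the pre-sigmoid function of $D$, namely $D(x) = \sigma(d(x))$. We define the energy function $E(z, z') = -\log p_0(z) - \log p_1(z') - d(G^*(z, z'))$, and its Boltzmann distribution $p_t(z, z') = e^{-E(z, z')}/Z$. Then we have:
(1) $p_d^* = p_d$ when $D$ is the optimal discriminator.
(2) If we sample $(z, z') \sim p_t$, and $x = G^*(z, z')$, then we have $x \sim p_d^*$. Namely, the induced probability measure $G^* \circ p_t = p_d^*$.
Proof. Let $G^*(z, z')$ be the generator defined in Theorem 1; we can see that $p_g^*$ and $p_d$ have the same support. Applying Theorem 1, we deduce the corollary. ∎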
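The extended generator and its joint energy can be sketched as follows. The sketch assumes $G^*(z, z') = G(z) + \epsilon z'$ and a joint energy $-\log p_0(z) - \log p_1(z') - d(G^*(z, z'))$ with standard normal densities for both $z$ and $z'$; the scalar toy generator and logit are hypothetical stand-ins for trained networks.

```python
import math


def log_std_normal(v):
    """Log-density of a standard normal at the scalar v."""
    return -0.5 * v * v - 0.5 * math.log(2.0 * math.pi)


def extended_generator(generator, z, z_noise, eps):
    """G*(z, z') = G(z) + eps * z'."""
    return generator(z) + eps * z_noise


def joint_energy(generator, logit, z, z_noise, eps):
    """E(z, z') = -log p_0(z) - log p_1(z') - d(G*(z, z')), Gaussian p_0 and p_1."""
    x = extended_generator(generator, z, z_noise, eps)
    return -log_std_normal(z) - log_std_normal(z_noise) - logit(x)


def toy_generator(z):
    """Hypothetical generator, for illustration only."""
    return 2.0 * z + 1.0


def toy_logit(x):
    """Hypothetical discriminator logit that prefers samples near x = 3."""
    return -abs(x - 3.0)


e_on = joint_energy(toy_generator, toy_logit, 1.0, 0.0, 0.1)   # G*(z, z') = 3
e_off = joint_energy(toy_generator, toy_logit, 0.0, 0.0, 0.1)  # G*(z, z') = 1
```

Latent pairs whose output the logit prefers receive lower energy, so Langevin dynamics on $(z, z')$ can use the small additive noise $\epsilon z'$ to nudge slightly off-mode outputs toward regions the discriminator rates as realistic.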
Additionally, one can also add this Gaussian noise to all the intermediate layers in the generator and rely on Langevin dynamics to correct the resulting data distribution.
4 Related Works
Previous work has considered utilizing the discriminator to achieve better sampling for GANs. Discriminator rejection sampling (azadi2018discriminator) and Metropolis-Hastings GANs (pmlr-v97-turner19a-mhgan) use $p_g$ as the proposal distribution and $D$ as the criterion of acceptance or rejection. However, these methods are inefficient as they may need to reject a lot of samples. Intuitively, one major drawback of these methods is that since they operate in the pixel space, their algorithm can use discriminators to reject samples when they are bad, but cannot easily guide latent space updates which would improve these samples.
Discriminator optimal transport (DOT) (tanaka2019discriminator) is another way of sampling GANs. They use deterministic gradient descent in the latent space to get samples with higher $D$-values. However, since $D$ and $G$ cannot recover the data distribution exactly, DOT has to make the optimization local in a small neighborhood of generated samples (they use a small constant to prevent over-optimization), which hurts the sample performance. Also, DOT is not guaranteed to converge to the data distribution even under ideal assumptions ($D$ is optimal).
Energy-based models have gained significant attention in recent years. Most work focuses on the maximum likelihood learning of energy-based models (lecun2006tutorial; du2019implicit; salakhutdinov2009deep). The primary difficulty in training energy-based models comes from effectively estimating and sampling the partition function. The contribution to training from the partition function can be estimated via MCMC (du2019implicit; hinton2002training; nijkamp2019learning), via training another generator network (kim2016deep; kumar2019maximum), or via surrogate objectives to maximum likelihood (hyvarinen2005estimation; GutmannM2010; sohl2011new).
The connection between GANs and EBMs has been studied by many authors (zhao2016energy; finn2016connection). Our paper can be viewed as establishing a new connection between GANs and EBMs which allows efficient latent MCMC sampling.
5 Experimental results
In this section we present a set of experiments demonstrating the effectiveness of our method on both synthetic and real-world datasets. In section 5.1 we illustrate how the proposed method, DDLS, can improve the distribution modeling of a trained GAN and compare with other baseline methods. In section 5.2 we show that DDLS can improve the sample quality on real world datasets, both qualitatively and quantitatively. Code is available at https://github.com/sodabeta7/gan_as_ebm.
5.1 Synthetic dataset
Following the same setting used in azadi2018discriminator; pmlr-v97-turner19a-mhgan; tanaka2019discriminator, we apply DDLS to a WGAN model trained on two synthetic datasets, 25-gaussians and swiss roll, and investigate the effect and performance of the proposed sampling method.
The 25-Gaussians dataset is generated by a mixture of twenty-five two-dimensional isotropic Gaussian distributions arranged in a grid. The Swiss Roll dataset is a standard dataset for testing various dimensionality reduction algorithms. We use the specific implementation from scikit-learn, and rescale the coordinates as suggested by tanaka2019discriminator. We train a Wasserstein GAN model with the standard WGAN-GP objective. Both the generator and discriminator are fully connected neural networks with ReLU as non-linear activations and we follow the same architecture design proposed by DOT (tanaka2019discriminator).
Qualitative results With the trained generator and discriminator, we generate samples from the generator, then apply DDLS in latent space to obtain enhanced samples. We also apply the DOT method as a baseline. All results are depicted in Figure 1 and Figure 2 together with the target dataset samples. For the 25-Gaussian dataset we can see that DDLS recovered and preserved all modes while significantly eliminating spurious modes compared to a vanilla generator and DOT. For the Swiss Roll dataset we can also observe that DDLS successfully improved the distribution and recovered the underlying low-dimensional manifold of the data distribution. This qualitative evidence supports the hypothesis that our GANs as energy based model formulation outperforms the noisy implicit distribution induced by the generator only.
Quantitative results We first examine the performance of DDLS quantitatively by using the metrics proposed by DRS (azadi2018discriminator). We generate samples with the DDLS algorithm, and each sample is assigned to its closest mixture component. A sample is of “high quality” if it is within four standard deviations of its assigned mixture component, and a mode is successfully “recovered” if at least one high-quality sample is assigned to it.
Table 1 (columns: recovered modes, “high quality”, std “high quality”; rows: Generator only, DRS, GAN w. DDLS).
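The two metrics described above can be computed as in the following sketch. The grid coordinates, the component standard deviation, and the stand-in "generated" samples are hypothetical; only the assignment rule (nearest component, four standard deviations, at least one high-quality sample per recovered mode) follows the text.

```python
import math
import random


def make_modes(grid=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Centers of a 5x5 grid of mixture components (hypothetical coordinates)."""
    return [(gx, gy) for gx in grid for gy in grid]


def mode_metrics(samples, modes, std):
    """Return (number of recovered modes, fraction of high-quality samples)."""
    recovered = set()
    high_quality = 0
    for sx, sy in samples:
        dists = [math.hypot(sx - mx, sy - my) for mx, my in modes]
        nearest = min(range(len(modes)), key=dists.__getitem__)
        if dists[nearest] <= 4.0 * std:
            high_quality += 1
            recovered.add(nearest)
    return len(recovered), high_quality / len(samples)


rng = random.Random(0)
modes = make_modes()
std = 0.05  # hypothetical component standard deviation

# Stand-in samples: 40 draws around each of the first 20 modes (5 modes are
# dropped on purpose), plus 25 spurious points placed between grid cells.
samples = [
    (rng.gauss(mx, std), rng.gauss(my, std))
    for mx, my in modes[:20]
    for _ in range(40)
]
samples += [(mx + 0.5, my + 0.5) for mx, my in modes]

n_recovered, hq_ratio = mode_metrics(samples, modes, std)
```

Here 20 of the 25 modes are recovered, and the spurious in-between points lower the high-quality ratio, which is the behaviour the two metrics are meant to expose.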
As shown in Table 1, our proposed model achieves a higher “high-quality” ratio. We also investigate the distance between the distribution induced by our GAN as EBM formulation and the true data distribution. We use the Earth Mover’s Distance (EMD) between the two corresponding empirical distributions as a surrogate, as proposed in DOT (tanaka2019discriminator). As shown in Table 2, the EMD between our sampling distribution and the ground-truth distribution is significantly below the baselines. Note that we use our own re-implementation, and numbers differ slightly from those previously published.
5.2 CIFAR-10 Dataset
In this section we evaluate the performance of the proposed DDLS method on the CIFAR-10 dataset.
|DCGAN w/o DRS or MH-GAN||-|
|DCGAN w/ DRS(cal) (azadi2018discriminator)||-|
|DCGAN w/ MH-GAN(cal) (pmlr-v97-turner19a-mhgan)||-|
|ResNet-SAGAN w/o DOT|
|ResNet-SAGAN w/ DOT|
|SNGAN w/o DDLS|
|Ours: SNGAN w/ DDLS|
|Ours: SNGAN w/ DDLS(cal)|
Implementation details We adopt the Spectral Normalization GAN (SN-GAN) (sngan) as our baseline GAN model. We take the publicly available pre-trained models of unconditional SN-GAN and apply DDLS. We first sample latent codes from the prior distribution, then run the Langevin dynamics procedure with an initial step size up to iterations to generate enhanced samples. Following the practice in (sgld) we separately set the standard deviation of the Gaussian noise as . We optionally fine-tune the pre-trained discriminator with an additional fully-connected layer and a logistic output layer using the binary cross-entropy loss to calibrate the discriminator as suggested by azadi2018discriminator; pmlr-v97-turner19a-mhgan.
Quantitative results We evaluate the quality and diversity of generated samples via the Inception Score (improved_gan) and Fréchet Inception Distance (FID) (fid-ttur). We applied DDLS to the unconditional generator of SN-GAN to generate samples and report all results in Table 4. We found that the proposed method significantly improves the Inception Score of the baseline SN-GAN model from to and reduces the FID from to . Our unconditional model outperforms previous state-of-the-art GANs and other sampling-enhanced GANs (azadi2018discriminator; pmlr-v97-turner19a-mhgan; tanaka2019discriminator) and even approaches the performance of conditional BigGANs (biggan) which achieves an Inception Score and an FID of , without the need of additional class information, training and parameters.
Qualitative results We illustrate the process of Langevin dynamics sampling in latent space in Figure 3 by generating samples for every iterations. We find that our method helps correct the errors in the original generated image, and makes changes towards more semantically meaningful and sharp output by leveraging the pre-trained discriminator. We include more generated samples for visualizing the Langevin dynamics in the appendix. To demonstrate that our model is not simply over-fitting to the CIFAR-10 dataset, we find the nearest neighbors of generated samples in the training dataset and show the results in Figure 4.
Mixing time evaluation MCMC sampling methods often suffer from extremely long mixing times, especially for high-dimensional multi-modal data. For example, more than MCMC iterations are needed to obtain the most performance gain in MH-GAN (pmlr-v97-turner19a-mhgan) on real data. We demonstrate the sampling efficiency of our method by showing that we can expect a much shorter mixing time by migrating the Langevin sampling process to the latent space, compared to sampling in high-dimensional multi-modal pixel space. We evaluate the Inception Score and the energy function for every iterations of Langevin dynamics and depict the results in Figure 5.
5.3 ImageNet Dataset
In this section we evaluate the performance of the proposed DDLS method on the ImageNet dataset.
Implementation details As with CIFAR-10, we adopt the Spectral Normalization GAN (SN-GAN) (sngan) as our baseline GAN model. We take the publicly available pre-trained models of SN-GAN and apply DDLS. We first sample latent codes from the prior distribution, then run the Langevin dynamics procedure with an initial step size up to iterations to generate enhanced samples. Following the practice in (sgld) we separately set the standard deviation of the Gaussian noise as . We fine-tune the pre-trained discriminator with an additional fully-connected layer and a logistic output layer using the binary cross-entropy loss to calibrate the discriminator as suggested by azadi2018discriminator; pmlr-v97-turner19a-mhgan.
|cGAN w/o DOT|
|cGAN w/ DOT|
|SNGAN w/o DDLS|
|Ours: SNGAN w/ DDLS|
Due to space constraints, we defer additional experiments and details to the appendix.
6 Conclusion and Future Work
In this paper, we have shown that a GAN’s discriminator can enable better modeling of the data distribution with Discriminator Driven Latent Sampling (DDLS). The intuition behind our model is that learning a generative model to do structured generative prediction is usually more difficult than learning a classifier, so the errors made by the generator can be significantly corrected by the discriminator. The major advantage in DDLS comes from latent space Langevin sampling, which enables efficient sampling and better mixing in latent space.
For future work, we are exploring the inclusion of additional Gaussian noise variables in each layer of the generator, treated as latent variables, such that DDLS can be used to provide a correcting signal for each layer of the generator. We believe that this will lead to further sampling improvements, via correcting small artifacts in the generated samples. Also, the underlying idea behind DDLS is widely applicable to other generative models, if we train an additional discriminator together with the generator. It would be interesting to explore whether VAE-based models can be improved by training an additional discriminator.
Appendix A An analysis of WGAN
A.1 An analysis of the DOT algorithm
In this section, we first give an example showing that in WGAN, given the optimal discriminator $D$ and $p_g$, it is not possible to recover $p_d$.
Consider the following case: the underlying space is the one-dimensional space of real numbers $\mathbb{R}$. $p_g$ is the Dirac $\delta$-distribution $\delta_{-1}$ and the data distribution $p_d$ is the Dirac $\delta$-distribution $\delta_{x_0}$, where $x_0 > 0$ is a constant.
We can easily see that the identity function $D(x) = x$ is an optimal 1-Lipschitz function which separates $p_g$ and $p_d$. Namely, we let $D(x) = x$ be the optimal discriminator.
However, $D$ is not a function of $x_0$. Namely, we cannot recover $p_d$ with the information provided by $G$ and $D$. This is the main reason that collaborative sampling algorithms based on the WGAN formulation, such as DOT, cannot provide exact theoretical guarantees, even if the discriminator is optimal.
A.2 Mathematical Details of approximate WGAN with EBMs
In the paper, we outlined an approximation result of WGAN. In the following, we prove them. For eq. 11:
For eq. 12:
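One way to carry out both derivations is sketched below. It assumes the EBM has the form $p_t(x) = p_g'(x)\, e^{D_\phi(x)} / Z_\phi$ with $Z_\phi = \mathbb{E}_{p_g'}[e^{D_\phi(x)}]$, where $p_g'$ is a fixed copy of $p_g$ that carries no gradient with respect to the generator parameters $\theta$; the subscripted symbols $Z_\phi$ and $G_\theta$ are notation introduced here for the sketch.

```latex
% Eq. 11: gradient with respect to the discriminator parameters \phi
\begin{aligned}
\mathrm{KL}(p_d \,\|\, p_t)
  &= \mathbb{E}_{p_d}\!\left[\log p_d(x) - \log p_g'(x) - D_\phi(x)\right] + \log Z_\phi, \\
\nabla_\phi \log Z_\phi
  &= \frac{\mathbb{E}_{p_g'}\!\left[e^{D_\phi(x)}\, \nabla_\phi D_\phi(x)\right]}{Z_\phi}
   = \mathbb{E}_{p_t}\!\left[\nabla_\phi D_\phi(x)\right], \\
\nabla_\phi \mathrm{KL}(p_d \,\|\, p_t)
  &= \mathbb{E}_{p_t}\!\left[\nabla_\phi D_\phi(x)\right]
   - \mathbb{E}_{p_d}\!\left[\nabla_\phi D_\phi(x)\right].
\end{aligned}

% Eq. 12: gradient with respect to the generator parameters \theta
\begin{aligned}
\mathrm{KL}(p_g \,\|\, p_t)
  &= \mathrm{KL}(p_g \,\|\, p_g') - \mathbb{E}_{z \sim p_0}\!\left[D_\phi(G_\theta(z))\right] + \log Z_\phi, \\
\nabla_\theta \mathrm{KL}(p_g \,\|\, p_t)
  &= -\,\mathbb{E}_{z \sim p_0}\!\left[\nabla_\theta D_\phi(G_\theta(z))\right].
\end{aligned}
```

The last step uses two facts: $\mathrm{KL}(p_g \,\|\, p_g')$ attains its minimum at $p_g = p_g'$, so its gradient with respect to $\theta$ vanishes there, and $Z_\phi$ depends only on the fixed copy $p_g'$ and on $\phi$.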
Appendix B Experimental details
Source code of all experiments of this work is included in the supplemental material and is available at https://github.com/sodabeta7/gan_as_ebm, where all detailed hyper-parameters can be found.
We show more generated samples of DDLS during Langevin dynamics in Figure 6. We run steps of Langevin dynamics and plot generated samples for every iterations. We include more randomly generated samples in the supplemental material.
We introduce more details of the preliminary experimental results on the ImageNet dataset here. We run the Langevin dynamics sampling algorithm with an initial step size up to iterations. We decay the step size with a factor for every iterations. The standard deviation of the Gaussian noise is annealed simultaneously with the step size. The discriminator is not yet calibrated in this preliminary experiment.
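A step-size schedule of this kind could be written as in the sketch below. It assumes that "annealed simultaneously" means the noise standard deviation is multiplied by the same decay factor at the same iterations; every numeric value is a placeholder and not a hyper-parameter from the experiments.

```python
def annealed_schedule(initial_step, initial_noise_std, decay_factor, decay_every, n_steps):
    """Return per-iteration (step_size, noise_std) pairs decayed by the same factor."""
    schedule = []
    for i in range(n_steps):
        scale = decay_factor ** (i // decay_every)
        schedule.append((initial_step * scale, initial_noise_std * scale))
    return schedule


# All numbers below are placeholders, not the values used in the experiments.
schedule = annealed_schedule(
    initial_step=0.01,
    initial_noise_std=0.1,
    decay_factor=0.5,
    decay_every=100,
    n_steps=300,
)
```

Each Langevin iteration would then read its step size and noise scale from `schedule[i]`, which keeps the noise level set separately from, but decayed together with, the gradient step.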