Metropolis-Hastings Generative Adversarial Networks

11/28/2018 ∙ by Ryan Turner, et al.

We introduce the Metropolis-Hastings generative adversarial network (MH-GAN), which combines aspects of Markov chain Monte Carlo and GANs. The MH-GAN draws samples from the distribution implicitly defined by a GAN's discriminator-generator pair, as opposed to standard GAN sampling, which draws from the distribution defined by the generator alone. It uses the discriminator from GAN training to build a wrapper around the generator for improved sampling. With a perfect discriminator, this wrapped generator samples exactly from the true data distribution even when the generator is imperfect. We demonstrate the benefits of the improved generator on multiple benchmark datasets, including CIFAR-10 and CelebA, using DCGAN and WGAN.







1 Introduction

Traditionally, density estimation is done with a model that can compute the data likelihood. Generative adversarial networks (GANs) 

(Goodfellow et al., 2014)

present a radically new way to do density estimation: They implicitly represent the density of the data via a classifier that distinguishes real from generated data.

GANs iterate between updating a discriminator D and a generator G, where G generates new (synthetic) samples of data, and D attempts to distinguish samples of G from the real data. In the typical setup, D is thrown away at the end of training, and only G is kept for generating new synthetic data points. In this work we propose the Metropolis-Hastings GAN (MH-GAN), a GAN that constructs a new generator G' that “wraps” G using the information contained in D. This principle is illustrated in Figure 1.¹

¹The code for this project is available at:

The MH-GAN uses Markov chain Monte Carlo (MCMC) methods to sample from the distribution implicitly defined by the discriminator D learned for the generator G. This is built upon the notion that the discriminator classifies between the generator and the data distribution:

D(x) = p_D(x) / (p_D(x) + p_G(x)) ,   (1)

where p_G is the (intractable) density of samples from the generator G, and p_D is the data density implied by the discriminator D with respect to p_G. If GAN training reaches its global optimum then this discriminator distribution is equal to the data distribution and the generator distribution (p_D = p_data = p_G) (Goodfellow et al., 2014). Furthermore, if the discriminator is optimal for a fixed imperfect generator G then the implied distribution still equals the data distribution (p_D = p_data ≠ p_G).

(a) GAN value function
(b) G' wraps G
Figure 1: (a) We diagram how the training of D and G in GANs performs coordinate descent on the joint minimax value function, shown by the solid black arrow. If GAN training produces a perfect D for an imperfect G, the MH-GAN wraps G to produce a perfect generator G', as shown by the final dashed arrow. The generator moves vertically towards the orange region while the discriminator moves horizontally towards the purple. (b) We illustrate how the MH-GAN is essentially a selector from multiple draws of G. In the MH-GAN, the selector is built using a Metropolis-Hastings (MH) acceptance rule computed from the discriminator scores.

We use an MCMC independence sampler (Tierney, 1994) to sample from p_D by taking multiple samples from G. Remarkably, using the algorithm we present, one can show that given a perfect discriminator D and a decent (but imperfect) generator G, one can obtain exact samples from the true data distribution p_data. Standard MCMC implementations require (unnormalized) densities for the target p_D and the proposal p_G, which are both unavailable for GANs. However, the Metropolis-Hastings (MH) algorithm requires only the ratio

p_D(x') p_G(x) / (p_D(x) p_G(x')) ,

which we can obtain using only evaluations of D, since (1) implies p_D(x)/p_G(x) = D(x)/(1 − D(x)).
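As a concrete sketch of this observation (plain NumPy, with hypothetical helper names), the ratio reduces to an expression in discriminator scores alone:

```python
import numpy as np

def density_ratio(d):
    """Recover p_D(x) / p_G(x) from a discriminator score D(x) in (0, 1),
    via the relation D = p_D / (p_D + p_G)."""
    return d / (1.0 - d)

def mh_ratio(d_current, d_proposal):
    """MH acceptance ratio for the independence sampler with target p_D and
    proposal p_G, computed purely from the two discriminator scores."""
    return density_ratio(d_proposal) / density_ratio(d_current)
```

A proposal the discriminator rates no differently from the current state gives a ratio of one; a proposal rated more "real" gives a ratio above one and is always accepted.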

2 Related Work

A few other works combine GANs and MCMC in some way. Song et al. (2017) use a GAN-like procedure to train a RealNVP (Dinh et al., 2016) MCMC proposal for sampling an externally provided target density. Whereas Song et al. (2017) use GANs to accelerate MCMC, we use MCMC to enhance the samples from a GAN. Kempinska and Shawe-Taylor (2017) take a similar approach to Song et al. (2017), but improve proposals in particle filters rather than MCMC. Song et al. (2017) was recently generalized by Neklyudov et al. (2018).

The GAN approach to density estimation is complementary to the earlier density ratio estimation (DRE) approach (Sugiyama et al., 2012). In DRE the generator G is fixed, and the density is found by combining Bayes' rule and the learned classifier D. In GANs, the key is learning G well; in DRE, the key is learning D well. The MH-GAN has flavors of both in that it uses both D and G to build G'.

2.1 Discriminator Rejection Sampling

A very similar concurrent work from Azadi et al. (2018) proposes discriminator rejection sampling (DRS) for GANs, which performs rejection sampling on the outputs of G by using the probabilities given by D. This approach is conceptually simpler than ours at first glance, but it suffers from two major shortcomings in practice. First, it is necessary to find an upper bound on the density ratio over all possible samples in order to obtain a valid proposal distribution for rejection sampling. Because this is not possible in general, one must instead estimate the bound by drawing many pilot samples. Second, even with a good bound, the acceptance rate becomes very low due to the high dimensionality of the sampling space. This leads Azadi et al. (2018) to use an extra heuristic that shifts the logit scores, making the model sample from a distribution different from p_D even when D is perfect. We use MCMC instead, which was invented precisely as a replacement for rejection sampling in higher dimensions. We further improve the robustness of MCMC by calibrating the discriminator to get more accurate probabilities when computing acceptance probabilities.
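To make the bound problem concrete, here is a minimal NumPy sketch (all names hypothetical; the logit of the discriminator score serves as an estimate of log p_D/p_G) showing how a bound estimated from pilot samples yields invalid acceptance probabilities as soon as a later sample exceeds the pilot maximum:

```python
import numpy as np

def rejection_accept_probs(disc_scores, log_bound):
    """Basic rejection-sampling acceptance probabilities w(x) / M, where
    w(x) = D(x) / (1 - D(x)) estimates p_D(x)/p_G(x) and log M = `log_bound`
    is estimated from pilot samples rather than known exactly."""
    log_w = np.log(disc_scores) - np.log1p(-disc_scores)
    return np.exp(log_w - log_bound)

# Estimate the bound from a small pilot set of discriminator scores ...
pilot = np.array([0.60, 0.70, 0.80])
log_m = np.max(np.log(pilot) - np.log1p(-pilot))
# ... then a sample scoring above the pilot maximum breaks validity
# (its "acceptance probability" exceeds 1).
probs = rejection_accept_probs(np.array([0.70, 0.95]), log_m)
```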

3 Background and Notation

In this section, we briefly review the notation and equations for MCMC and GANs.

3.1 MCMC Methods

MCMC methods attempt to draw a chain of samples x_1, …, x_K that marginally come from a target distribution p. We refer to the initial distribution as p_0 and the proposal for the independence sampler as q, with x' ∼ q. The proposal x' is accepted with probability

α(x', x_k) = min(1, p(x') q(x_k) / (p(x_k) q(x'))) .   (2)

If x' is accepted, x_{k+1} = x'; otherwise x_{k+1} = x_k. Note that when estimating the distribution p, one must include the duplicates that result from rejections in the chain.
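The update rule can be sketched as follows (plain NumPy, with an illustrative Gaussian target and proposal; all names are hypothetical):

```python
import numpy as np

def independence_sampler_step(x_k, target, proposal, sample_proposal, rng):
    """One MH independence-sampler update.  `target` and `proposal` are
    (unnormalized) densities p and q; `sample_proposal` draws x' ~ q."""
    x_prime = sample_proposal()
    alpha = min(1.0, (target(x_prime) * proposal(x_k))
                   / (target(x_k) * proposal(x_prime)))
    if rng.uniform() <= alpha:
        return x_prime  # accept
    return x_k          # reject: the duplicate counts as a sample

# Illustrative usage: target N(0, 1), wider proposal N(0, 2^2).
rng = np.random.default_rng(0)
p = lambda x: np.exp(-0.5 * x * x)
q = lambda x: np.exp(-x * x / 8.0)
draw_q = lambda: rng.normal(0.0, 2.0)
chain = [draw_q()]
for _ in range(5000):
    chain.append(independence_sampler_step(chain[-1], p, q, draw_q, rng))
```

After discarding a short initial segment, the chain's samples (including duplicates from rejections) have the target's moments.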

Independent samples

Each chain samples x_0 ∼ p_0 and then does K MH iterations to get x_K as the output of the chain; in this case x_K is also the output of G'. Therefore, each MCMC chain results in a single sample x_K, and independent chains are used to draw multiple samples from G'.

Detailed balance

The detailed balance condition implies that if x_k ∼ p exactly then x_{k+1} ∼ p exactly as well. Additionally, even if x_k is not exactly distributed according to p, the Kullback-Leibler (KL) divergence between the implied density it is drawn from and p will always decrease as k increases (Murray and Salakhutdinov, 2008; Cover and Thomas, 2012).²

²The curious reader may wonder why we do not simply stop the chain after the first accept. In general, allowing chain length to be conditioned in any way on the state of the chain (including which samples were accepted or rejected) has the potential to introduce bias into the samples (Cowles et al., 1999).

3.2 GANs

GANs implicitly model the data x via a synthetic data generator G:

x = G(z) , z ∼ p(z) .   (3)

This implies an (intractable) distribution on the data, x ∼ p_G. We refer to the unknown true distribution on the data as p_data. The discriminator D(x) is a soft classifier predicting whether a data point is real as opposed to being sampled from p_G. If D converges optimally for a fixed G, then D(x) = p_data(x)/(p_data(x) + p_G(x)), and if both D and G converge then p_G = p_data (Goodfellow et al., 2014). GAN training forms a game between D and G. In practice, D is often better at estimating the density ratio than G is at generating high-fidelity samples (Shibuya, 2017). This motivates wrapping an imperfect G to obtain an improved generator G' by using the density ratio information encapsulated in D.

Figure 2: Illustration comparing the MH-GAN setup with the formulation of DRS on a univariate example. This figure uses a p_data with four Gaussian mixture components, while p_G is missing one of the components. The top row shows the resulting density of samples, while the bottom row shows the typical number of rejects before accepting a sample at that value. The MH-GAN recovers the true density except in the far right tail, where there is an exponentially small chance of getting a sample from the proposal p_G. DRS without the shift should also be able to recover the density exactly, but it has an even larger error in the right tail. These errors arise because the maximum score must be approximated using only pilot samples, as in Azadi et al. (2018). Additionally, due to the large maximum density ratio, it needs a large number of draws before a single accept. DRS with the shift is much more sample efficient, but completely misses the right mode because the shift invalidates the rejection sampling equations. The MH-GAN is more adaptive in that it quickly accepts samples in the areas p_G models well; more MCMC rejections occur before accepting a sample in the poorly modeled right mode. In all cases the MH-GAN is more efficient than DRS without the shift. Presumably, this effect becomes greater in high dimensions.

4 Methods

In this section we show how to sample from the distribution p_D implied by the discriminator D. We apply (2) and (3) for a target of p_D and proposal p_G:

α(x', x_k) = min(1, (1/D(x_k) − 1) / (1/D(x') − 1)) .   (5)

The ratio is computed entirely from the discriminator scores D. If D is perfect, p_D = p_data, so the sampler will marginally sample from the data distribution. A toy one-dimensional example with just such a perfect discriminator is shown in Figure 2.


The probabilities from D must not merely provide a good AUC score; they must also be well calibrated. Put another way, if one were to warp the probabilities of the perfect discriminator in (1), it might still suffice for standard GAN training, but it will not work in the MCMC procedure defined in (5), as it will result in erroneous density ratio values.

To calibrate D, we use a held-out calibration set (10% of the training data) and either logistic, isotonic, or beta (Kull et al., 2017) regression to warp the output of D. Furthermore, for WGAN, calibration is required because WGAN does not learn a typical GAN probabilistic discriminator.
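As an illustrative stand-in for those calibrators (not the paper's exact procedure), a two-parameter logistic (Platt-style) recalibration can be fit on the held-out set in plain NumPy; all names here are hypothetical:

```python
import numpy as np

def platt_calibrate(scores, labels, lr=0.1, steps=2000):
    """Fit sigmoid(a * logit(s) + b) to held-out (score, label) pairs by
    gradient descent on the log loss -- a minimal stand-in for the
    logistic / isotonic / beta calibrators discussed in the text."""
    z = np.log(scores) - np.log1p(-scores)  # logits of the raw scores
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * z + b)))
        g = p - labels                      # gradient of log loss w.r.t. logit
        a -= lr * np.mean(g * z)
        b -= lr * np.mean(g)
    def calibrated(s):
        zs = np.log(s) - np.log1p(-s)
        return 1.0 / (1.0 + np.exp(-(a * zs + b)))
    return calibrated

# Synthetic check: overconfident scores (logits inflated 3x) get pulled back.
rng = np.random.default_rng(0)
true_p = rng.uniform(0.05, 0.95, 4000)
labels = (rng.uniform(size=4000) < true_p).astype(float)
logit_p = np.log(true_p) - np.log1p(-true_p)
raw = 1.0 / (1.0 + np.exp(-3.0 * logit_p))  # miscalibrated scores
cal = platt_calibrate(raw, labels)
```

The calibrated map pulls the overconfident scores back toward the true event probabilities, which is exactly what the density ratio computation in (5) needs.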

We also detect miscalibration of D using the statistic of Dawid (1997) on held-out samples x_i and real/fake labels y_i. If D is well calibrated, i.e., y_i is statistically indistinguishable from y_i ∼ Bernoulli(D(x_i)), then

Z = (Σ_i y_i − Σ_i D(x_i)) / sqrt(Σ_i D(x_i)(1 − D(x_i))) ∼ N(0, 1) .   (6)

This means that for large values of |Z|, such as |Z| > 2, we reject the hypothesis that D is well calibrated.
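A NumPy sketch of this style of calibration diagnostic (the standardized difference between the observed count of positives and the count the scores predict; function name is hypothetical):

```python
import numpy as np

def calibration_z(scores, labels):
    """Standardized difference between the observed number of real labels
    and the number the scores predict; approximately N(0, 1) when the
    classifier is well calibrated."""
    expected = np.sum(scores)
    variance = np.sum(scores * (1.0 - scores))
    return (np.sum(labels) - expected) / np.sqrt(variance)

# Calibrated labels match their scores; biased labels inflate the statistic.
rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, 10000)
y_calibrated = (rng.uniform(size=10000) < p).astype(float)
y_biased = (rng.uniform(size=10000) < np.clip(p + 0.2, 0.0, 1.0)).astype(float)
```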


We also avoid the burn-in issues that usually plague MCMC methods. Recall that via the detailed balance property (Gilks et al., 1996, Ch. 1), if the marginal distribution of the Markov chain state at time step k matches the target (x_k ∼ p), then the marginal at time step k + 1 will also follow the target (x_{k+1} ∼ p). In most MCMC applications it is not possible to get an initial sample x_0 from the target distribution. However, in MH-GANs, we are sampling from the data distribution, so we may use a trick: initialize the chain at a sample of real data (the correct distribution) to avoid burn-in. If no generated sample is accepted by the end of the chain, restart sampling from a synthetic sample to ensure the initial real sample is never output. To make these restarts rare, we set K large (often K = 640).
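Putting the pieces together, the wrapped generator G' can be sketched as below (plain NumPy; `generator` and `disc` are hypothetical callables, and the restart-from-synthetic logic follows the trick just described):

```python
import numpy as np

def run_chain(x0, generator, disc, K, rng):
    """K MH iterations with proposal G and acceptance computed from
    (calibrated) discriminator scores; returns (x_K, any_accept)."""
    x, accepted = x0, False
    for _ in range(K):
        x_prime = generator()
        # Acceptance ratio from discriminator scores alone, as in (5).
        alpha = min(1.0, (1.0 / disc(x) - 1.0) / (1.0 / disc(x_prime) - 1.0))
        if rng.uniform() <= alpha:
            x, accepted = x_prime, True
    return x, accepted

def mh_gan_sample(x_real, generator, disc, K, rng):
    """One draw from the wrapped generator G': start at real data to avoid
    burn-in; if nothing was accepted, restart from a synthetic sample so
    the real seed is never output."""
    x, ok = run_chain(x_real, generator, disc, K, rng)
    if not ok:
        x, _ = run_chain(generator(), generator, disc, K, rng)
    return x

# 1-D toy: data N(0, 1), imperfect generator N(2, 1), perfect discriminator.
rng = np.random.default_rng(0)
p_data = lambda x: np.exp(-0.5 * x * x)
p_gen = lambda x: np.exp(-0.5 * (x - 2.0) ** 2)
disc = lambda x: p_data(x) / (p_data(x) + p_gen(x))
gen = lambda: rng.normal(2.0, 1.0)
samples = [mh_gan_sample(rng.normal(0.0, 1.0), gen, disc, 100, rng)
           for _ in range(400)]
```

With the perfect discriminator, the wrapped samples land near the data mean (0) rather than the generator mean (2), illustrating the correction.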

Perfect Discriminator

The assumption of a perfect D may be weakened for two reasons: (A) Because we recalibrate the discriminator, its raw probabilities can be incorrect as long as the decision boundary between real and fake is correct. (B) Because the discriminator is only ever evaluated at samples from G or at the initial real sample x_0, D only needs to be accurate on the manifold of generator samples and real data.

Figure 3:

The 25 Gaussians example. We show the state of the generators at epoch 30 (when the MH-GAN begins showing large gains) on the top row and at epoch 150 (the final epoch) on the bottom row. The MH-GAN corrects areas of mis-assigned mass in the original GAN. DRS appears visually closer to the original GAN than to the data, whereas the MH-GAN appears closer to the actual data.

5 Results

We first show an illustrative synthetic mixture model example followed by real data with images.

5.1 Mixture of 25 Gaussians

We consider the grid of two-dimensional Gaussians used in Azadi et al. (2018), which has become a popular toy example in the GAN literature (Dumoulin et al., 2016). The means are arranged on a 5×5 grid, and all components share a small, fixed standard deviation.


Experimental setup

Following Azadi et al. (2018), we use four fully connected layers with ReLU activations for both the generator and discriminator. The final output layer of the discriminator is a sigmoid, and no nonlinearity is applied to the final generator layer. The latent z and observed x vectors both have dimension two. All hidden layers have size 100. We used a large set of training points and generated a fresh set of points for testing. The training data was standardized before training.

Visual results

In Figure 3, we show the original data along with samples generated by the GAN. We also show samples enhanced via the MH-GAN (with calibration) and with DRS. The standard GAN creates spurious links along the grid lines between modes. It also misses some modes along the bottom row. DRS is able to “clean up” some of the spurious links, but not fill in the missing modes. The MH-GAN recovers these under-estimated modes and further cleans up the spurious links.

Quantitative results

These results are made more quantitative in Figure 4, where we follow some of the metrics for this example from Azadi et al. (2018). We consider the standard deviations within each mode in Figure 4(a) and the rate of “high quality” samples in Figure 4(b). A sample is assigned to a mode if its distance is within four standard deviations (4σ) of the mode's mean. Samples within four standard deviations of any mixture component are considered “high quality”. The within-mode standard deviation plot (Figure 4(a)) shows a slight improvement for the MH-GAN, and the high quality sample rate (Figure 4(b)) approaches 100% faster for the MH-GAN than for the GAN or DRS.

To test the spread of the distribution, we inspect the categorical distribution of the closest mode. Far away (non-high-quality) samples are assigned to a 26th unassigned category. This categorical distribution should be uniform over the 25 real modes for a perfect generator. To assess generator quality, we look at the Jensen-Shannon divergence (JSD) between the sample mode distribution and a uniform distribution. This is a much more stringent test of the appropriate spread of probability mass than checking whether a single sample is produced near a mode (as was done in Azadi et al. (2018)).
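These metrics can be sketched in plain NumPy as follows (hypothetical names; `modes` is the (25, 2) array of mixture means and `sigma` the shared component standard deviation):

```python
import numpy as np

def mode_metrics(samples, modes, sigma):
    """Assign each sample to its nearest mode, flag it high quality if it
    lies within 4 sigma of that mode, and measure the Jensen-Shannon
    divergence of the (26-bin) assignment histogram from the ideal
    uniform distribution over the 25 real modes."""
    dists = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=-1)
    nearest = np.argmin(dists, axis=1)
    high_quality = dists[np.arange(len(samples)), nearest] < 4.0 * sigma
    bins = np.where(high_quality, nearest, len(modes))  # 26th bin: far away
    p = np.bincount(bins, minlength=len(modes) + 1) / len(samples)
    q = np.full_like(p, 1.0 / len(modes))
    q[-1] = 0.0  # the ideal generator puts no mass in the far-away bin
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    m = 0.5 * (p + q)
    return float(high_quality.mean()), 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Usage on an idealized sample: 40 draws at each mode of a 5x5 grid.
rng = np.random.default_rng(0)
modes = np.stack(np.meshgrid(np.arange(5.0), np.arange(5.0)), -1).reshape(25, 2)
samples = np.repeat(modes, 40, axis=0) + rng.normal(0.0, 0.01, (1000, 2))
hq_rate, jsd = mode_metrics(samples, modes, 0.05)
```

A perfectly balanced generator yields a high-quality rate of one and a JSD of zero; missing or over-weighted modes inflate the JSD.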

(a) mode std. dev.
(b) high quality rate
(c) Jensen-Shannon divergence
Figure 4: Results of the MH-GAN experiments on the mixture of 25 Gaussians example. On the left, we show the standard deviation of samples within a single mode. The black lines represent values for the true distribution. In the center, we show the high quality rate (samples near a real mode) across different GAN setups. On the right, we show the Jensen-Shannon divergence (JSD) between the distribution on the nearest mode and a uniform distribution, which is the generating distribution on mixture components. The MH-GAN shows, on average, a clear improvement in JSD over DRS. We considered adding error bars to these plots via a bootstrap analysis, but the error bars are too small to be visible.
(a) performance by epoch
(b) performance by MCMC iteration
(c) epoch 13 scores
Figure 5: Results of the MH-GAN experiments on CIFAR-10 using the DCGAN. On the left, we show the Inception score by training epoch of the DCGAN after the full K MCMC iterations. “MH-GAN” denotes using the raw discriminator scores and “MH-GAN (cal)” the calibrated scores. The error bars on MH-GAN performance (in gray) are computed using a t-test on the variation per batch across 80 splits of the Inception score. In the center, we show the Inception score vs. MCMC iteration k for the GAN at epoch 15. On the right, we show the scores at epoch 13, where there is some overlap between the scores of fake and real images. When there is overlap, the MH-GAN corrects the distribution to have scores similar to the real data. DRS fails to fully shift the distribution because 1) it does not use calibration and 2) its score-shift setup violates the validity of rejection sampling.

In Figure 4(c), we see that the MH-GAN improves the JSD over DRS on average, meaning it achieves a much more balanced spread across modes. DRS fails to make gains after epoch 30. Using the principled approach of the MH-GAN along with calibrated probabilities ensures a correct spread of probability mass.

5.2 Real Data

For real data experiments we considered the CelebA (Liu et al., 2015) and CIFAR-10 (Torralba et al., 2008) data sets modeled using the DCGAN (Radford et al., 2015) and WGAN (Arjovsky et al., 2017; Gulrajani et al., 2017). To evaluate the wrapped generator G', we plot Inception scores (Salimans et al., 2016) per epoch in Figure 5(a) after K MCMC iterations. The actual performance boost realized by the MH-GAN oscillates from one epoch to the next, perhaps due to fluctuations in density ratio estimation performance per epoch. Accordingly, the statistical significance of the boost from calibration varies by epoch, from no significant change to a clearly significant improvement. Figure 5(b) shows Inception score per MCMC iteration k: most gains are made in the first iterations, but smaller gains continue until the end of the chain.

(a) CIFAR-10
(b) CelebA
Figure 6: We show the calibration statistic Z (6) for the discriminator on held-out data for the DCGAN. The results for CIFAR-10 are shown on the left, and CelebA on the right. The raw discriminator is clearly miscalibrated, being far outside the region expected by chance (dashed black), even after multiple comparison correction (dotted black). All the calibration methods give roughly equivalent results. CelebA has a period of training instability during epochs 30–50, which gives trivially calibrated classifiers.
             DCGAN                                    WGAN
             CIFAR-10    p        CelebA     p        CIFAR-10   p        CelebA     p
GAN          2.8789      --       2.3317     --       3.0734     --       2.7876     --
DRS          2.977(77)   0.0131   2.511(50)  <0.0001  --         --       --         --
DRS (cal)    3.073(80)   <0.0001  2.869(67)  <0.0001  3.137(64)  0.0497   2.861(66)  0.0277
MH-GAN       3.113(69)   <0.0001  2.682(50)  <0.0001  --         --       --         --
MH-GAN (cal) 3.379(66)   <0.0001  3.106(64)  <0.0001  3.305(83)  <0.0001  2.889(89)  0.0266
Table 1: Results showing Inception score improvements from the MH-GAN on DCGAN and WGAN at epoch 60. As in Figure 5(a), the error bars and p-values are computed using a paired t-test across Inception score batches. All results except for DCGAN on CelebA are significant at the stated level. WGAN does not learn a typical GAN discriminator that outputs a probability, so calibration is actually required in this case.

In Table 1, we summarize performance (Inception score) across all experiments, running MCMC with K iterations in all cases. Behavior is qualitatively similar to that in Figure 5(a). While DRS improves on the direct GAN, the MH-GAN improves the Inception score more in every case. Calibration helps in every case. There was not a substantial difference between the calibration methods, but we found a slight advantage for isotonic regression. Results are computed at epoch 60, and as in Figure 5(a), error bars and p-values are computed using a paired t-test across Inception score batches. All results are significantly better than the baseline GAN.

Score distribution

In Figure 5(c), we visualize what G' does to the distribution of discriminator scores. MCMC shifts the distribution of the fakes to match the distribution of the true images.

Calibration results

Figure 6 shows the calibration results per epoch for both CIFAR-10 and CelebA. It shows that the raw discriminator is highly miscalibrated, but can be fixed with any of the standard calibration methods. The statistic for the raw discriminator (DCGAN on CIFAR-10) is consistently large over the first 60 epochs, falling well outside the range expected with 95% confidence for a calibrated classifier, even after Bonferroni correction. The calibrated discriminator's statistic stays within that range, which shows it is almost perfectly calibrated. Accordingly, it is unsurprising that the calibrated discriminator significantly boosts performance in the MH-GAN.

(a) GAN
(b) DRS
(c) MH-GAN
(d) MH-GAN (cal)
Figure 7: Example images on CIFAR-10 for different GAN setups. The different selectors (MH-GAN and DRS) are run on the same batch of images, so the same images may appear for both generators. The calibrated MH-GAN shows a greater preference for animal-like images with four legs.
(a) GAN
(b) DRS
(c) MH-GAN
(d) MH-GAN (cal)
Figure 8: Example images on CelebA for different GAN setups. Like Figure 7, the same batch of images goes into each selector.

Visual results

Finally, we also show example images from the CIFAR-10 and CelebA setups in Figures 7 and 8. The selectors (such as the MH-GAN) result in a wider spread of probability mass across background colors. For CIFAR-10, the MH-GAN enhances modes with animal-like outlines and vehicles.

6 Conclusions

We have shown how to incorporate the knowledge in the discriminator D into an improved generator G'. Our method is based on the premise that D is better at density ratio estimation than G is at sampling data, which is inherently a harder task. The principled MCMC setup selects among samples from G to correct biases in G. This is the only method in the literature with the property that, given a perfect D, one can recover a generator G' that samples exactly from p_data. We have also shown that the raw discriminators in GANs and DRS are poorly calibrated. To our knowledge, this is the first work to evaluate the discriminator in this way and to rigorously show its poor calibration. The MH-GAN has great potential for extension.