Discriminator Rejection Sampling

10/16/2018 ∙ by Samaneh Azadi, et al. ∙ berkeley college

We propose a rejection sampling scheme using the discriminator of a GAN to approximately correct errors in the GAN generator distribution. We show that under quite strict assumptions, this will allow us to recover the data distribution exactly. We then examine where those strict assumptions break down and design a practical algorithm, called Discriminator Rejection Sampling (DRS), that can be used on real data-sets. Finally, we demonstrate the efficacy of DRS on a mixture of Gaussians and on the state-of-the-art SAGAN model. On ImageNet, we train an improved baseline that increases the best published Inception Score from 52.52 to 62.36 and reduces the Fréchet Inception Distance from 18.65 to 14.79. We then use DRS to further improve on this baseline, improving the Inception Score to 76.08 and the FID to 13.75.




1 Introduction

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a powerful tool for image synthesis. They have also been applied successfully to semi-supervised and unsupervised learning (Springenberg, 2015; Odena, 2016; Kumar et al., 2017), image editing (Yu et al., 2018; Ledig et al., 2017), and image style transfer (Zhu et al., 2017; Isola et al., 2017; Yi et al., 2017; Azadi et al., 2018). Informally, the GAN training procedure pits two neural networks against each other, a generator and a discriminator. The discriminator is trained to distinguish between samples from the target distribution and samples from the generator. The generator is trained to fool the discriminator into thinking its outputs are real. The GAN training procedure is thus a two-player differentiable game, and the game dynamics are largely what distinguishes the study of GANs from the study of other generative models. These game dynamics have well-known and heavily studied stability issues. Addressing these issues is an active area of research (Mao et al., 2017; Arjovsky et al., 2017; Gulrajani et al., 2017; Odena et al., 2018; Li et al., 2017).

However, we are interested in studying something different: Instead of trying to improve the training procedure, we (temporarily) accept its flaws and attempt to improve the quality of trained generators by post-processing their samples using information from the trained discriminator. It’s well known that (under certain very strict assumptions) the equilibrium of this training procedure is reached when sampling from the generator is identical to sampling from the target distribution and the discriminator always outputs $1/2$. However, these assumptions don’t hold in practice. In particular, GANs as presently trained don’t learn to reproduce the target distribution (Arora & Zhang, 2017). Moreover, trained GAN discriminators aren’t just identically $1/2$: they can even be used to perform chess-type skill ratings of other trained generators (Olsson et al., 2018).

We ask if the information retained in the weights of the discriminator at the end of the training procedure can be used to “improve” the generator. At face value, this might seem unlikely. After all, if there is useful information left in the discriminator, why doesn’t it find its way into the generator via the training procedure? Further reflection reveals that there are many possible reasons. First, the assumptions made in various analyses of the training procedure surely don’t hold in practice (e.g. the discriminator and generator have finite capacity and are optimized in parameter space rather than density-space). Second, due to the concrete realization of the discriminator and the generator as neural networks, it may be that it is harder for the generator to model a given distribution than it is for the discriminator to tell that this distribution is not being modeled precisely. Finally, we may simply not train GANs long enough in practice for computational reasons.

In this paper, we focus on using the discriminator as part of a probabilistic rejection sampling scheme. In particular, this paper makes the following contributions:

  • We propose a rejection sampling scheme using the GAN discriminator to approximately correct errors in the GAN generator distribution.

  • We show that under quite strict assumptions, this scheme allows us to recover the data distribution exactly.

  • We then examine where those strict assumptions break down and design a practical algorithm – called DRS – that takes this into account.

  • We conduct experiments demonstrating the effectiveness of DRS. First, as a baseline, we train an improved version of the Self-Attention GAN, improving its performance from the best published Inception Score of 52.52 up to 62.36, and from a Fréchet Inception Distance of 18.65 down to 14.79. We then show that DRS yields further improvement over this baseline, increasing the Inception Score to 76.08 and decreasing the Fréchet Inception Distance to 13.75.

2 Background

2.1 Generative Adversarial Networks

A generative adversarial network (GAN) (Goodfellow et al., 2014) consists of two separate neural networks, a generator $G$ and a discriminator $D$, trained in tandem. The generator takes as input a sample $z$ from the prior $p_z$ and produces a sample $G(z)$. The discriminator takes an observation $x$ as input and produces a probability $D(x)$ that the observation is real. The observation is sampled either according to the density $p_d$ (the data generating distribution) or $p_g$ (the implicit density given by the generator and the prior). Using the standard non-saturating variant, the discriminator and generator are then trained using the following loss functions:

$$L_D = -\mathbb{E}_{x \sim p_d}[\log D(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

$$L_G = -\mathbb{E}_{z \sim p_z}[\log D(G(z))]$$
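As a minimal sketch of the standard non-saturating losses, the following computes the discriminator loss $-\mathbb{E}[\log D(x)] - \mathbb{E}[\log(1 - D(G(z)))]$ and the generator loss $-\mathbb{E}[\log D(G(z))]$ from batches of discriminator outputs. The function names and the example probabilities are hypothetical and are not taken from the authors' implementation.

```python
import math

def discriminator_loss(d_real, d_fake):
    """L_D = -E[log D(x)] - E[log(1 - D(G(z)))], each expectation a batch mean."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating L_G = -E[log D(G(z))]."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs D(x) on a real batch and a generated batch.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.2, 0.1, 0.3]
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)

# A discriminator that outputs 1/2 everywhere has loss 2 * log(2).
loss_d_chance = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

The chance-level value $2 \log 2$ is a convenient reference point: a discriminator that still carries information about the generator's errors scores below it.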

2.2 Evaluation metrics: Inception Score (IS) and Fréchet Inception Distance (FID)

The two most popular techniques for evaluating GANs on image synthesis tasks are the Inception Score and the Fréchet Inception Distance. The Inception Score (Salimans et al., 2016) is given by $\exp(\mathbb{E}_{x \sim p_g}[\mathrm{KL}(p(y|x) \,\|\, p(y))])$, where $p(y|x)$ is the output of a pre-trained Inception classifier (Szegedy et al., 2014) and $p(y)$ is its marginal over generated samples. This measures the ability of the GAN to generate samples that the pre-trained classifier confidently assigns to a particular class, and also the ability of the GAN to generate samples from all classes. The Fréchet Inception Distance (FID) (Heusel et al., 2017) is computed by passing samples through an Inception network to yield “semantic embeddings”, after which the Fréchet distance is computed between Gaussians with moments given by these embeddings.
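To make the Inception Score concrete, here is a small sketch that evaluates $\exp(\mathbb{E}_x[\mathrm{KL}(p(y|x) \,\|\, p(y))])$ from a matrix of class probabilities. The probability rows are hand-written stand-ins for classifier outputs (an assumption for illustration); a real evaluation would use a pre-trained Inception network. FID is not covered here.

```python
import math

def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)); probs[i][j] plays the role of p(y_j | x_i)."""
    n, k = len(probs), len(probs[0])
    # Marginal class distribution p(y), averaged over the samples.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    total_kl = 0.0
    for row in probs:
        # Terms with p = 0 contribute nothing to the KL divergence.
        total_kl += sum(p * math.log(p / marginal[j])
                        for j, p in enumerate(row) if p > 0.0)
    return math.exp(total_kl / n)

# Three toy cases over 3 classes.
confident_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uninformative = [[1 / 3, 1 / 3, 1 / 3]] * 3
collapsed = [[1.0, 0.0, 0.0]] * 3

score_diverse = inception_score(confident_diverse)
score_uninformative = inception_score(uninformative)
score_collapsed = inception_score(collapsed)
```

The score reaches the number of classes only when predictions are both confident and spread over all classes; uninformative predictions and a collapse onto a single class both give the minimum value of 1.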

2.3 Self-Attention GAN

We use a Self-Attention GAN (SAGAN) (Zhang et al., 2018) in our experiments. We do so because SAGAN is considered state of the art on the ImageNet conditional-image-synthesis task (in which images are synthesized conditioned on class identity). SAGAN differs from a vanilla GAN in the following ways: First, it uses large residual networks (He et al., 2016) instead of normal convolutional layers. Second, it uses spectral normalization (Miyato et al., 2018) in the generator and the discriminator and a much lower learning rate for the generator than is conventional (Heusel et al., 2017). Third, SAGAN makes use of self-attention layers (Wang et al., ), in order to better model long range dependencies in natural images. Finally, this whole model is trained using a special hinge version of the adversarial loss (Lim & Ye, 2017; Miyato & Koyama, 2018; Tran et al., 2017):

$$L_D = \mathbb{E}_{x \sim p_d}[\max(0, 1 - D(x))] + \mathbb{E}_{z \sim p_z}[\max(0, 1 + D(G(z)))]$$

$$L_G = -\mathbb{E}_{z \sim p_z}[D(G(z))]$$
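For illustration, the hinge adversarial loss in its commonly used form, $\mathbb{E}[\max(0, 1 - D(x))] + \mathbb{E}[\max(0, 1 + D(G(z)))]$ for the discriminator and $-\mathbb{E}[D(G(z))]$ for the generator, can be sketched as below. Here the discriminator outputs are unbounded scores rather than probabilities; the function names and the example scores are hypothetical.

```python
def hinge_d_loss(real_scores, fake_scores):
    """Discriminator hinge loss: penalize real scores below +1 and fake scores above -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def hinge_g_loss(fake_scores):
    """Generator loss: raise the discriminator's score on generated samples."""
    return -sum(fake_scores) / len(fake_scores)

# Scores that already clear the margin incur no discriminator loss.
loss_separated = hinge_d_loss([2.0, 1.5], [-1.2, -3.0])
# Scores of zero sit inside the margin on both sides.
loss_undecided = hinge_d_loss([0.0], [0.0])
loss_g = hinge_g_loss([0.5, -0.5, 3.0])
```

Because these scores are not the output of a sigmoid, they cannot be read directly as the logit of a probability, which is relevant when a rejection scheme expects a sigmoid output.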

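Data: generator $G$ and discriminator $D$
Result: Filtered samples from $G$
$D^*$ ← KeepTraining($D$);
$\bar{M}$ ← BurnIn($G$, $D^*$);
samples ← ∅;
while |samples| < $N$ do
       $x$ ← sample from $G$;
       ratio ← $e^{\tilde{D}^*(x)}$;
       $\bar{M}$ ← Maximum($\bar{M}$, ratio);
       $p$ ← $\sigma(\hat{F}(x, \bar{M}, \epsilon, \gamma))$;
       $\psi$ ← uniform random number in $[0, 1]$;
       if $\psi \le p$ then
             Append($x$, samples);
       end if
end while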
Figure 1: Left: For a uniform proposal distribution and Gaussian target distribution, the blue points are the result of rejection sampling and the red points are the result of naively throwing out samples for which the density ratio ($p_d(x)/p_g(x)$) is below a threshold. The naive method underrepresents the density of the tails. Right: the DRS algorithm. KeepTraining continues training using early stopping on the validation set. BurnIn computes a large number of density ratios to estimate their maximum. $\tilde{D}^*$ is the logit of $D^*$. $\hat{F}$ is as in Equation 8. $\bar{M}$ is an empirical estimate of the true maximum $M$.

2.4 Rejection Sampling

Rejection sampling is a method for sampling from a target distribution $p_d(x)$ which may be hard to sample from directly. Samples are instead drawn from a proposal distribution $p_g(x)$, which is easier to sample from, and which is chosen such that there exists a finite value $M$ such that $M p_g(x) > p_d(x)$ for all $x$ in the domain of $p_d$. A given sample $y$ drawn from $p_g$ is kept with acceptance probability $p_d(y) / (M p_g(y))$, and rejected otherwise. See the blue points in Figure 1 (Left) for a visualization. Ideally, $p_g(x)$ should be close to $p_d(x)$, otherwise many samples will be rejected, reducing the efficiency of the algorithm (MacKay, 2003). In Section 3 we explain how to apply this rejection sampling algorithm to the GAN framework: in brief, we draw samples from the trained generator, $p_g(x)$, and then reject some of those samples using the discriminator to attain a closer approximation to the true data distribution, $p_d(x)$.
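The setting of Figure 1 (Left) can be imitated with a short sketch: a uniform proposal, a Gaussian target, and a comparison between proper rejection sampling and naive thresholding of the density ratio. The interval $[-4, 4]$, the threshold of 0.5, and the sample counts are arbitrary choices for illustration, and the target is effectively a standard normal truncated to that interval.

```python
import math
import random

def target_pdf(x):
    """Standard normal density (the target)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

LOW, HIGH = -4.0, 4.0                  # support of the uniform proposal
PROPOSAL_PDF = 1.0 / (HIGH - LOW)
M = target_pdf(0.0) / PROPOSAL_PDF     # max of target/proposal, attained at x = 0

def rejection_sample(n, rng):
    """Keep each proposal with probability target / (M * proposal)."""
    kept = []
    while len(kept) < n:
        x = rng.uniform(LOW, HIGH)
        if rng.random() < target_pdf(x) / (M * PROPOSAL_PDF):
            kept.append(x)
    return kept

def naive_threshold_sample(n, rng, threshold=0.5):
    """Keep x only when the normalized density ratio exceeds a hard threshold."""
    kept = []
    while len(kept) < n:
        x = rng.uniform(LOW, HIGH)
        if target_pdf(x) / (M * PROPOSAL_PDF) > threshold:
            kept.append(x)
    return kept

rng = random.Random(0)
exact = rejection_sample(5000, rng)
naive = naive_threshold_sample(5000, rng)

# Fraction of samples in the tails |x| > 1.5 under each scheme.
tail_exact = sum(abs(x) > 1.5 for x in exact) / len(exact)
tail_naive = sum(abs(x) > 1.5 for x in naive) / len(naive)
```

With this threshold the naive scheme can never produce a point with $|x| > \sqrt{2 \ln 2} \approx 1.18$, so its tails are empty, whereas rejection sampling keeps tail points with the appropriate small probability.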

3 Rejection sampling for GANs

In this section we introduce our proposed rejection sampling scheme for GANs (which we call Discriminator Rejection Sampling, or DRS). We’ll first derive an idealized version of the algorithm that will rely on assumptions that don’t necessarily hold in realistic settings. We’ll then discuss the various ways in which these assumptions might break down. Finally, we’ll describe the modifications we made to the idealized version in order to overcome these challenges.

3.1 Rejection sampling for GANs: the idealized version

Suppose that we have a GAN and our generator $G$ has been trained to the point that $p_g$ and $p_d$ have the same support. That is, for all $x \in X$, $p_g(x) \neq 0$ if and only if $p_d(x) \neq 0$. If desired, we can make $p_d$ and $p_g$ have support everywhere in $X$ if we add low-variance Gaussian noise to the observations. Now further suppose that we have some way to compute $p_d(x)/p_g(x)$. Then, if $M = \max_x p_d(x)/p_g(x)$, then $p_d(x) \le M p_g(x)$ for all $x$, so we can perform rejection sampling with $p_g$ as the proposal distribution and $p_d$ as the target distribution as long as we can evaluate the quantity $p_d(x) / (M p_g(x))$ (see Footnote 1). In this case, we can exactly sample from $p_d$ (Casella et al., 2004), though we may have to reject many samples to do so.

Footnote 1: Why go through all this trouble when we could instead just pick some threshold $T$ and throw out $x$ when $D^*(x) < T$? This doesn’t allow us to recover $p_d$ in general. If, for example, there is $x'$ s.t. $p_g(x') > p_d(x') > 0$, we still want some probability of observing $x'$. See the red points in Figure 1 (Left) for a visual explanation.
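A short worked calculation may help show why this procedure samples from the data distribution exactly under the idealized assumptions. Write $p_g$ for the generator density (the proposal), $p_d$ for the data density (the target), and $M$ for the maximum of $p_d(x)/p_g(x)$. A point $x$ is proposed with density $p_g(x)$ and kept with probability $p_d(x) / (M p_g(x))$, so:

```latex
\begin{aligned}
p(x, \text{accept}) &= p_g(x) \cdot \frac{p_d(x)}{M \, p_g(x)} = \frac{p_d(x)}{M}, \\
p(\text{accept}) &= \int \frac{p_d(x)}{M} \, dx = \frac{1}{M}, \\
p(x \mid \text{accept}) &= \frac{p_d(x) / M}{1 / M} = p_d(x).
\end{aligned}
```

The overall acceptance rate is $1/M$, which is why a large $M$ means that many samples have to be rejected.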

But how can we evaluate $p_d(x) / (M p_g(x))$? $p_g(x)$ is defined only implicitly. One thing we can do is to borrow an analysis from the original GAN paper (Goodfellow et al., 2014), which assumes that we can optimize the discriminator in the space of density functions rather than via changing its parameters. If we make this assumption, as well as the assumption that the discriminator is defined by a sigmoid applied to some function of $x$ and trained with a cross-entropy loss, then by Proposition 1 of that paper, we have that, for any fixed generator and in particular for the generator $G$ that we have when we stop training, training the discriminator to completely minimize its own loss yields

$$D^*(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}$$

We will discuss the validity of these assumptions later, but for now consider that this allows us to solve for $p_d(x)/p_g(x)$ as follows. As noted above, we can assume the discriminator is defined as:

$$D(x) = \sigma(\tilde{D}(x)) = \frac{1}{1 + e^{-\tilde{D}(x)}}$$

where $D(x)$ is the final discriminator output after the sigmoid, and $\tilde{D}(x)$ is the logit. Thus,

$$D^*(x) = \frac{1}{1 + e^{-\tilde{D}^*(x)}} = \frac{p_d(x)}{p_d(x) + p_g(x)}$$

$$1 + e^{-\tilde{D}^*(x)} = \frac{p_d(x) + p_g(x)}{p_d(x)}$$

$$p_d(x) \, e^{-\tilde{D}^*(x)} = p_g(x)$$

$$\frac{p_d(x)}{p_g(x)} = e^{\tilde{D}^*(x)}$$

Now suppose one last thing, which is that we can tractably compute $M = \max_x p_d(x)/p_g(x)$. We would find that $M = p_d(x^*)/p_g(x^*) = e^{\tilde{D}^*(x^*)}$ for some (not necessarily unique) $x^*$. Given all these assumptions, we can now perform rejection sampling as promised. If we define $\tilde{D}^*_M := \tilde{D}^*(x^*)$, then for any input $x$, the acceptance probability $p_d(x) / (M p_g(x))$ can be written as $e^{\tilde{D}^*(x) - \tilde{D}^*_M} \in [0, 1]$. To decide whether to keep any particular example, we can just draw a random number $\psi$ uniformly from $[0, 1]$ and accept the sample if $\psi < e^{\tilde{D}^*(x) - \tilde{D}^*_M}$.
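The idealized scheme can be checked on a toy problem where both densities are known, so that the optimal logit is available in closed form as the log density ratio $\log p_d(x) - \log p_g(x)$. The sketch below assumes a target $p_d = N(0, 1)$ and a hypothetical "generator" $p_g = N(0.5, 1.5^2)$; these choices and all names are illustrative only. A sample from $p_g$ is accepted with probability $e^{\tilde{D}^*(x) - \tilde{D}^*_M}$.

```python
import math
import random

MU_G, SIGMA_G = 0.5, 1.5   # hypothetical generator density p_g = N(0.5, 1.5^2)
                           # the target is p_d = N(0, 1)

def optimal_logit(x):
    """Logit of the optimal discriminator: log p_d(x) - log p_g(x)."""
    log_pd = -0.5 * x * x
    log_pg = -0.5 * ((x - MU_G) / SIGMA_G) ** 2 - math.log(SIGMA_G)
    return log_pd - log_pg   # the shared -0.5 * log(2 * pi) terms cancel

# For these two Gaussians the logit is a concave quadratic in x, maximized at
# x* = MU_G / (1 - SIGMA_G**2) = -0.4.
X_STAR = MU_G / (1.0 - SIGMA_G ** 2)
LOGIT_MAX = optimal_logit(X_STAR)      # so that M = exp(LOGIT_MAX)

def idealized_drs(n, rng):
    """Draw from p_g and accept with probability exp(logit - LOGIT_MAX)."""
    kept, drawn = [], 0
    while len(kept) < n:
        x = rng.gauss(MU_G, SIGMA_G)
        drawn += 1
        if rng.random() < math.exp(optimal_logit(x) - LOGIT_MAX):
            kept.append(x)
    return kept, drawn

rng = random.Random(0)
samples, drawn = idealized_drs(5000, rng)
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
acceptance_rate = len(samples) / drawn   # should be close to 1 / M
```

The accepted samples should have mean near 0 and variance near 1, matching $p_d$ rather than $p_g$, and the empirical acceptance rate should be close to $1/M = e^{-\tilde{D}^*_M}$.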

3.2 Discriminator Rejection Sampling: the practical scheme

As we hinted at, the above analysis has a number of practical issues. In particular:

  1. Since we can’t actually perform optimization over density functions, we can’t actually compute $D^*$. Thus, our acceptance probability won’t necessarily be proportional to $p_d(x)/p_g(x)$.

  2. At least on large datasets, it’s quite obvious that the supports of $p_g$ and $p_d$ are not the same. If the supports of $p_g$ and $p_d$ have a low-volume intersection, we may not even want to compute $D^*$, because then $p_d(x)/p_g(x)$ would just evaluate to 0 most places.

  3. The analysis yielding the formula for $D^*$ also assumes that we can draw infinite samples from $p_d$, which is not true in practice. If we actually optimized $D$ all the way given a finite data-set, it would give nonzero results on a set of measure 0.

  4. In general it won’t be tractable to compute $M$.

  5. Rejection sampling is known to have too low an acceptance probability when the target distribution is high dimensional (MacKay, 2003).

This section describes the Discriminator Rejection Sampling (DRS) procedure, which is an adjustment of the idealized procedure, meant to address the above issues.

On the difficulty of actually computing $D^*$:

Given that items 2 and 3 suggest we may not want to compute $D^*$ exactly, we should perhaps not be too concerned with item 1, which suggests that we can’t. The best argument we can make that it is OK to approximate $D^*$ is that doing so seems to be successful empirically. We speculate that training a regularized $D$ with SGD gives a final result that is further from $D^*$ but perhaps is less over-fit to the finite sample from $p_d$ used for training. We also hypothesize that the $D$ we end up with will distinguish between “good” and “bad” samples, even if those samples would both have zero density under the true $p_d$. We qualitatively evaluate this hypothesis in Figures 4 and 5. We suspect that more could be done theoretically to quantify the effect of this approximation, but we leave this to future work.

On the difficulty of actually computing $M$:

It’s nontrivial to compute $M$, at the very least because we can’t compute $D^*$. In practice, we get around this issue by estimating $M$ from samples. We first run an estimation phase, in which 10,000 samples are used to estimate $\tilde{D}^*_M$. We then use this estimate in the sampling phase. Throughout the sampling phase we update our estimate of $\tilde{D}^*_M$ if a larger value is found. This results in slight overestimates of the acceptance probability for samples that were processed before a new maximum was found, but we accept this bias: we rarely have to increase the maximum in the sampling phase, and the increase is very small when it does happen.
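The two-phase estimate of the maximum logit can be sketched as follows. The logits here are drawn from a Gaussian purely as a stand-in for discriminator logits on generated samples (an assumption for illustration), and the function names are hypothetical.

```python
import random

def burn_in(logit_sampler, n_burn_in=10_000):
    """Estimation phase: the largest logit seen over n_burn_in generated samples."""
    return max(logit_sampler() for _ in range(n_burn_in))

def sampling_phase(logit_sampler, logit_max, n_samples):
    """Sampling phase: raise the running maximum whenever a larger logit appears."""
    updates = 0
    for _ in range(n_samples):
        logit = logit_sampler()
        if logit > logit_max:
            logit_max = logit
            updates += 1
    return logit_max, updates

rng = random.Random(0)

def sampler():
    # Hypothetical stand-in for the discriminator logit of a fresh generated sample.
    return rng.gauss(0.0, 1.0)

initial_max = burn_in(sampler)
final_max, updates = sampling_phase(sampler, initial_max, n_samples=10_000)
```

For independent draws, the chance that the $k$-th sample sets a new record is $1/k$, so after a burn-in of 10,000 samples only a handful of updates are expected over the next 10,000, which is consistent with the observation that the maximum rarely needs to be increased.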

Dealing with acceptance probabilities that are too low:

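Item 5 suggests that we may end up with acceptance probabilities that are too low to be useful when performing this technique on realistic data-sets. If $\tilde{D}^*_M$ is very large, the acceptance probability $e^{\tilde{D}^*(x) - \tilde{D}^*_M}$ will be close to zero, and almost all samples will be rejected, which is undesirable. One simple way to avoid this problem is to compute some $F(x)$ such that the acceptance probability can be written as follows:

$$\frac{1}{1 + e^{-F(x)}} = e^{\tilde{D}^*(x) - \tilde{D}^*_M}$$

If we solve for $F(x)$ in the above equation we can then perform the following rearrangement:

$$F(x) = \tilde{D}^*(x) - \log\left(e^{\tilde{D}^*_M} - e^{\tilde{D}^*(x)}\right) = \tilde{D}^*(x) - \tilde{D}^*_M - \log\left(1 - e^{\tilde{D}^*(x) - \tilde{D}^*_M}\right)$$

In practice, we instead compute

$$\hat{F}(x) = \tilde{D}^*(x) - \tilde{D}^*_M - \log\left(1 - e^{\tilde{D}^*(x) - \tilde{D}^*_M - \epsilon}\right) - \gamma$$

where $\epsilon$ is a small constant added for numerical stability and $\gamma$ is a hyperparameter modulating overall acceptance probability. For very positive $\gamma$, all samples will be rejected. For very negative $\gamma$, all samples will be accepted. See Figure 2 for an analysis of the effect of adding $\gamma$. A summary of our proposed algorithm is presented in Figure 1 (Right).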
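The practical acceptance rule can be sketched end to end on toy logits. The code below computes the shifted quantity (logit minus maximum logit, minus a log term stabilized by a small constant, minus the offset $\gamma$), passes it through a sigmoid, and sets $\gamma$ to a percentile of the unshifted values. The Gaussian logits, the value of the stability constant, the 80th percentile, and the nearest-rank percentile convention are all assumptions made for illustration, not settings taken from the paper.

```python
import math
import random

def f_hat(logit, logit_max, gamma, eps=1e-8):
    """logit - logit_max - log(1 - exp(logit - logit_max - eps)) - gamma."""
    delta = logit - logit_max            # <= 0 when logit_max is the true maximum
    return delta - math.log(1.0 - math.exp(delta - eps)) - gamma

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]; a simple convention chosen here."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

def drs_filter(logits, gamma_percentile, rng):
    """Return the indices of accepted samples and the gamma that was used."""
    logit_max = max(logits)              # stands in for the burn-in estimate
    f_zero = [f_hat(l, logit_max, 0.0) for l in logits]
    gamma = percentile(f_zero, gamma_percentile)
    kept = [i for i, f in enumerate(f_zero) if rng.random() <= sigmoid(f - gamma)]
    return kept, gamma

rng = random.Random(0)
logits = [rng.gauss(0.0, 3.0) for _ in range(2000)]   # hypothetical logits
kept, gamma = drs_filter(logits, gamma_percentile=80, rng=rng)

# Expected acceptance rate of the unshifted rule exp(logit - logit_max).
rate_without_gamma = sum(math.exp(l - max(logits)) for l in logits) / len(logits)
rate_with_gamma = len(kept) / len(logits)
```

With $\gamma = 0$ and a negligible stability constant, the sigmoid of this quantity reduces to $e^{\tilde{D}^*(x) - \tilde{D}^*_M}$, the idealized acceptance probability. Choosing $\gamma$ as a percentile means that samples above that percentile are accepted with probability at least 1/2, which keeps the acceptance rate usable even when the maximum logit is large.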

Figure 2: (A) Histogram of the sigmoid inputs, $\hat{F}(x)$ (left plot), and acceptance probabilities, $\sigma(\hat{F}(x))$ (center plot), on fake samples before (purple) and after (green) adding the constant $\gamma$ to all $F(x)$. Before adding $\gamma$, 98.9% of the samples had an acceptance probability < 1e-4. (B) Histogram of $\max_j p(y_j|x_i)$ from a pre-trained Inception network where $p(y_j|x_i)$ is the predicted probability of sample $x_i$ belonging to the $y_j$ category (from 1,000 ImageNet categories). The green bars correspond to 25,000 accepted samples and the red bars correspond to 25,000 rejected samples. The rejected images are less recognizable as belonging to a distinct class.

4 Experiments

In this section we justify the modifications made to the idealized algorithm. We do this by conducting two experiments in which we show that (according to popular measures of how well a GAN has learned the target distribution) Discriminator Rejection Sampling yields improvements for actual GANs. We start with a toy example that yields insight into how DRS can help, after which we demonstrate DRS on the ImageNet dataset (Russakovsky et al., 2015).

4.1 Mixture of 25 Gaussians

We investigate the impact of DRS on a low-dimensional synthetic data set consisting of a mixture of twenty-five 2D isotropic Gaussian distributions (each with standard deviation of 0.05) arranged in a grid (Dumoulin et al., 2016; Srivastava et al., 2017; Lin et al., 2017). We train a GAN model where the generator and discriminator are neural networks with four fully connected layers with ReLU activations. The prior is a 2D Gaussian with mean of 0 and standard deviation of 1 and the GAN is trained using the standard loss function. We generate 10,000 samples from the generator with and without DRS. The target distribution and both sets of generated samples are depicted in Figure 3. Here, we have set $\gamma$ dynamically to a percentile of $F(x)$ for all $x$ in a pool of samples.

Figure 3: Real samples from 25 2D-Gaussian Distributions (left) as well as fake samples generated from a trained GAN model without (middle) and with DRS (right). Results are computed as an average over five models randomly initialized and trained independently.

To measure performance, we assign each generated sample to its closest mixture component. As in Srivastava et al. (2017), we define a sample as “high quality” if it is within four standard deviations of its assigned mixture component. As shown in Table 1, DRS increases the fraction of high-quality samples. As in Dumoulin et al. (2016) and Srivastava et al. (2017) we call a mode “recovered” if at least one high-quality sample was assigned to it. Table 1 shows that DRS does not reduce the number of recovered modes; that is, it does not trade off quality for mode coverage. It does reduce the standard deviation of the high-quality samples slightly, but this is a good thing in this case (since the standard deviation of the target Gaussian distribution is 0.05). It also confirms that DRS does not accept samples only near the center of each Gaussian but near the tails as well.
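The two metrics described above can be sketched as follows: assign each sample to its nearest mixture component, call it high quality if it lies within four standard deviations of that component, and count a mode as recovered if it receives at least one high-quality sample. The grid coordinates below (integer positions from -2 to 2) are a hypothetical layout, and Euclidean distance is assumed for the "within four standard deviations" test.

```python
import math

STD = 0.05
# Hypothetical 5 x 5 grid of mixture centers; the actual spacing is an assumption.
CENTERS = [(float(i), float(j)) for i in range(-2, 3) for j in range(-2, 3)]

def evaluate(samples, centers=CENTERS, std=STD, k=4.0):
    """Return (number of recovered modes, fraction of high-quality samples)."""
    recovered = set()
    high_quality = 0
    for x, y in samples:
        dists = [math.hypot(x - cx, y - cy) for cx, cy in centers]
        idx = min(range(len(centers)), key=dists.__getitem__)   # closest component
        if dists[idx] <= k * std:
            high_quality += 1
            recovered.add(idx)
    return len(recovered), high_quality / len(samples)

# Toy check: 20 samples sitting exactly on 20 distinct modes, plus 5 samples
# that fall between modes and are therefore not high quality.
good = CENTERS[:20]
bad = [(cx + 0.5, cy + 0.5) for cx, cy in CENTERS[:5]]
modes_recovered, high_quality_fraction = evaluate(good + bad)
```

In this toy check the result is 20 recovered modes and a high-quality fraction of 0.8; the five off-grid points are assigned to a component but do not count towards either metric.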

# of recovered modes | % “high quality” | std of “high quality” samples
Without DRS
With DRS
Table 1: Results with and without DRS on 10,000 generated samples from a model of a 2D grid of Gaussian components.

4.2 ImageNet Dataset

Since it is presently the state-of-the-art model on the conditional ImageNet synthesis task, we have reimplemented the Self-Attention GAN (Zhang et al., 2018) as a baseline. After reproducing the results reported by Zhang et al. (2018), we fine-tuned a trained SAGAN with a much lower learning rate for both generator and discriminator. This improved both the Inception Score and FID significantly, as can be seen in the Improved-SAGAN column in Table 2. Plots of Inception Score and FID during training are given in Figure 5(A).

Since SAGAN uses a hinge loss and DRS requires a sigmoid output, we added a fully-connected layer “on top of” the trained discriminator and trained it to distinguish real images from fake ones using the binary cross-entropy loss. We trained this extra layer with 10,000 generated samples from the model and 10,000 examples from ImageNet.

We then generated 50,000 samples from normal SAGAN and Improved SAGAN with and without DRS, repeating the sampling process 4 times. We set $\gamma$ dynamically to a percentile of the $F(x)$ values. The averages of Inception Score and FID over these four trials are presented in Table 2. Both scores were substantially improved for both models, indicating that DRS can indeed be useful in realistic settings involving large data-sets and sophisticated GAN variants.

Without DRS
With DRS
Table 2: Results with and without DRS on ImageNet samples. Low FID and high IS are better.
Qualitative Analysis of ImageNet results:

From a pool of 50,000 samples, we visualize the “best” and the “worst” 100 samples based on their acceptance probabilities. Figure 4 shows that the subjective visual quality of samples with high acceptance probability is considerably better. Figure 2(B) also shows that the accepted images are on average more recognizable as belonging to a distinct class.

We also study the behavior of the discriminator in another way. We choose an ImageNet category randomly, then generate samples from that category until we have found two images $G(z_1)$ and $G(z_2)$ such that $G(z_1)$ appears visually realistic and $G(z_2)$ appears visually unrealistic. Here, $z_1$ and $z_2$ are the input latent vectors. We then generate many images by interpolating in latent space between the two images according to $z = \alpha z_1 + (1 - \alpha) z_2$ with $\alpha \in [0, 1]$. In Figure 5, the first and last columns correspond with $\alpha = 1$ and $\alpha = 0$, respectively. The color bar in the figure represents the acceptance probability assigned to each sample. In general, acceptance probabilities decrease from left to right. There is no reason to expect a priori that the acceptance probability should decrease monotonically as a function of the interpolated $z$, so it says something interesting about the discriminator that most rows basically follow this pattern.
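The interpolation experiment can be sketched under the assumption of a linear path $z = \alpha z_1 + (1 - \alpha) z_2$ between the two latent vectors, with $\alpha$ running from 1 (the realistic end) down to 0 (the unrealistic end). The number of steps and the example latent vectors are hypothetical.

```python
def interpolate(z1, z2, num_steps):
    """Latent vectors alpha * z1 + (1 - alpha) * z2 for alpha going from 1 down to 0."""
    alphas = [1.0 - i / (num_steps - 1) for i in range(num_steps)]
    return [[a * u + (1.0 - a) * v for u, v in zip(z1, z2)] for a in alphas]

# Hypothetical 3-dimensional latent vectors for the two endpoint images.
z_realistic = [0.2, -1.0, 0.5]
z_unrealistic = [1.2, 1.0, -0.5]
path = interpolate(z_realistic, z_unrealistic, num_steps=11)
```

Each vector in `path` would then be passed through the generator and scored by the discriminator, so that the acceptance probability can be plotted along the row from the first column to the last.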

Figure 4: Synthesized images with the highest (left) and lowest (right) acceptance probability scores.
Figure 5: (A) Inception Score and FID during ImageNet training, computed on 50,000 samples. (B) Each row shows images synthesized by interpolating in latent space. The color bar above each row represents the acceptance probabilities for each sample: red for high and white for low. Subjective visual quality of samples with high acceptance probability is considerably better: objects are more coherent and more recognizable as belonging to a specific class. There are fewer indistinct textures, and fewer scenes without recognizable objects.

5 Conclusion

We have proposed a rejection sampling scheme using the GAN discriminator to approximately correct errors in the GAN generator distribution. We’ve shown that under strict assumptions, we can recover the data distribution exactly. We’ve also examined where those assumptions break down and designed a practical algorithm (Discriminator Rejection Sampling) to address that. Finally, we have demonstrated the efficacy of this algorithm on a mixture of Gaussians and on the state-of-the-art SAGAN model.

Opportunities for future work include the following:

  • There is a newly published work improving the image generation task (Brock et al., 2018) which is not available yet but we expect DRS to be compatible with it too.

  • There’s no reason that our scheme can only be applied to GAN generators. It seems worth investigating whether rejection sampling can improve e.g. VAE decoders. This seems like it might help, because VAEs may have trouble with “spreading mass around” too much.

  • In one ideal case, the critic used for rejection sampling would be a human. Can we use better proxies for the human visual system to improve rejection sampling’s effect on image synthesis models?

  • It would be interesting to theoretically characterize the efficacy of rejection sampling under the breakdown-of-assumptions that we have described earlier. For instance, if one can’t recover $D^*$ but can train some other critic that has bounded divergence from $D^*$, how does the efficacy depend on this bound?


We thank Colin Raffel for helpful discussions.



To confirm that our Discriminator Rejection Sampling is not duplicating the training samples, we show the nearest neighbor of a few visually-realistic generated samples in the ImageNet training data in Figures 6-13. The nearest neighbors are found based on their fc7 features from the pre-trained VGG16 model.
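A nearest-neighbor lookup of this kind can be sketched with plain feature vectors. The code below assumes Euclidean distance between feature vectors, which is an assumption since the distance measure is not stated here, and uses tiny hand-written vectors in place of real fc7 features.

```python
def nearest_neighbors(query, bank, k=3):
    """Indices of the k vectors in `bank` closest to `query` (squared L2 distance)."""
    scored = sorted(
        (sum((q - b) ** 2 for q, b in zip(query, vec)), idx)
        for idx, vec in enumerate(bank)
    )
    return [idx for _, idx in scored[:k]]

# Hypothetical 2-dimensional stand-ins for training-set feature vectors.
train_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
generated_feature = [0.9, 0.1]
neighbors = nearest_neighbors(generated_feature, train_features, k=2)
```

A generated sample whose nearest training neighbor sits at distance zero in feature space would be a candidate duplicate; visibly different neighbors suggest the sample is not a copy of training data.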

In addition, we plot Inception Score as a function of acceptance rate in Figure 14-left. Different acceptance rates are achieved by varying the percentile of $F(x)$ to which $\gamma$ is set. Decreasing the acceptance rate filters out more non-realistic samples and increases the final Inception Score. Beyond a certain rate, rejecting more samples yields no further benefit in collecting a better pool of samples.

Moreover, Figure 14-right shows the correlation between the acceptance probabilities that DRS assigns to the synthesized samples and the recognizability of those samples from the view-point of a pre-trained Inception network. The latter is measured by computing $\max_j p(y_j|x_i)$, which is the probability of sample $x_i$ belonging to the category $y_j$ from the 1,000 ImageNet classes. As expected, there is a large mass of the recognizable images accepted with high acceptance probabilities on the top right corner. The small mass of images which cannot be easily classified into one of the 1,000 categories while having high acceptance probability scores (the top left corner of the graph) can be due to the non-optimal GAN discriminator in practice. Therefore, we expect that improving the discriminator performance would boost the final Inception Score even more substantially.

Figure 6: Nearest neighbors of the top left generated image in ImageNet training set in terms of VGG16 fc7 features
Figure 7: Nearest neighbors of the top left generated image in ImageNet training set in terms of VGG16 fc7 features
Figure 8: Nearest neighbors of the top left generated image in ImageNet training set in terms of VGG16 fc7 features
Figure 9: Nearest neighbors of the top left generated image in ImageNet training set in terms of VGG16 fc7 features
Figure 10: Nearest neighbors of the top left generated image in ImageNet training set in terms of VGG16 fc7 features
Figure 11: Nearest neighbors of the top left generated image in ImageNet training set in terms of VGG16 fc7 features
Figure 12: Nearest neighbors of the top left generated image in ImageNet training set in terms of VGG16 fc7 features
Figure 13: Nearest neighbors of the top left generated image in ImageNet training set in terms of VGG16 fc7 features
Figure 14: Inception Score versus the rate of accepting samples on average (left), and the acceptance probability assigned to each sample by DRS versus the maximum probability of belonging to one of the 1K categories based on a pre-trained Inception network, $\max_j p(y_j|x_i)$ (right).