Tensorflow implementation of Discriminator Rejection Sampling
We propose a rejection sampling scheme using the discriminator of a GAN to approximately correct errors in the GAN generator distribution. We show that under quite strict assumptions, this will allow us to recover the data distribution exactly. We then examine where those strict assumptions break down and design a practical algorithm - called Discriminator Rejection Sampling (DRS) - that can be used on real data-sets. Finally, we demonstrate the efficacy of DRS on a mixture of Gaussians and on the state-of-the-art SAGAN model. On ImageNet, we train an improved baseline that increases the best published Inception Score from 52.52 to 62.36 and reduces the Fréchet Inception Distance from 18.65 to 14.79. We then use DRS to further improve on this baseline, improving the Inception Score to 76.08 and the FID to 13.75.
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a powerful tool for image synthesis. They have also been applied successfully to semi-supervised and unsupervised learning (Springenberg, 2015; Odena, 2016; Kumar et al., 2017), image editing (Yu et al., 2018; Ledig et al., 2017), and image style transfer (Zhu et al., 2017; Isola et al., 2017; Yi et al., 2017; Azadi et al., 2018). Informally, the GAN training procedure pits two neural networks against each other, a generator and a discriminator. The discriminator is trained to distinguish between samples from the target distribution and samples from the generator. The generator is trained to fool the discriminator into thinking its outputs are real. The GAN training procedure is thus a two-player differentiable game, and the game dynamics are largely what distinguishes the study of GANs from the study of other generative models. These game dynamics have well-known and heavily studied stability issues. Addressing these issues is an active area of research (Mao et al., 2017; Arjovsky et al., 2017; Gulrajani et al., 2017; Odena et al., 2018; Li et al., 2017).
However, we are interested in studying something different: Instead of trying to improve the training procedure, we (temporarily) accept its flaws and attempt to improve the quality of trained generators by post-processing their samples using information from the trained discriminator. It’s well known that (under certain very strict assumptions) the equilibrium of this training procedure is reached when sampling from the generator is identical to sampling from the target distribution and the discriminator always outputs $\frac{1}{2}$. However, these assumptions don’t hold in practice. In particular, GANs as presently trained don’t learn to reproduce the target distribution (Arora & Zhang, 2017). Moreover, trained GAN discriminators aren’t just identically $\frac{1}{2}$: they can even be used to perform chess-type skill ratings of other trained generators (Olsson et al., 2018).
We ask if the information retained in the weights of the discriminator at the end of the training procedure can be used to “improve” the generator. At face value, this might seem unlikely. After all, if there is useful information left in the discriminator, why doesn’t it find its way into the generator via the training procedure? Further reflection reveals that there are many possible reasons. First, the assumptions made in various analyses of the training procedure surely don’t hold in practice (e.g. the discriminator and generator have finite capacity and are optimized in parameter space rather than density-space). Second, due to the concrete realization of the discriminator and the generator as neural networks, it may be that it is harder for the generator to model a given distribution than it is for the discriminator to tell that this distribution is not being modeled precisely. Finally, we may simply not train GANs long enough in practice for computational reasons.
In this paper, we focus on using the discriminator as part of a probabilistic rejection sampling scheme. In particular, this paper makes the following contributions:
We propose a rejection sampling scheme using the GAN discriminator to approximately correct errors in the GAN generator distribution.
We show that under quite strict assumptions, this scheme allows us to recover the data distribution exactly.
We then examine where those strict assumptions break down and design a practical algorithm – called DRS – that takes this into account.
We conduct experiments demonstrating the effectiveness of DRS. First, as a baseline, we train an improved version of the Self-Attention GAN, improving its performance from the best published Inception Score of 52.52 up to 62.36, and from a Fréchet Inception Distance of 18.65 down to 14.79. We then show that DRS yields further improvement over this baseline, increasing the Inception Score to 76.08 and decreasing the Fréchet Inception Distance to 13.75.
A generative adversarial network (GAN) (Goodfellow et al., 2014) consists of two separate neural networks, a generator and a discriminator, trained in tandem. The generator $G$ takes as input a sample $z$ from the prior $p_z$ and produces a sample $G(z)$. The discriminator $D$ takes an observation $x$ as input and produces a probability $D(x)$ that the observation is real. The observation is sampled either according to the density $p_d$ (the data generating distribution) or $p_g$ (the implicit density given by the generator and the prior). Using the standard non-saturating variant, the discriminator and generator are then trained using the following loss functions:

$$L_D = -\mathbb{E}_{x \sim p_d}[\log D(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

$$L_G = -\mathbb{E}_{z \sim p_z}[\log D(G(z))]$$
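As a minimal sketch of the standard non-saturating losses (binary cross-entropy for the discriminator, $-\log D(G(z))$ for the generator), the following computes both from lists of discriminator output probabilities. The function names and the example values are hypothetical and chosen only for illustration.

```python
import math

def discriminator_loss(d_real, d_fake):
    """-mean(log D(x)) - mean(log(1 - D(G(z)))) over two batches of probabilities."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -real_term - fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: -mean(log D(G(z)))."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that a sample is real).
d_real = [0.9, 0.8, 0.95]
d_fake = [0.2, 0.1, 0.3]
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

When the discriminator outputs $\frac{1}{2}$ everywhere, these losses evaluate to $2\log 2$ and $\log 2$ respectively.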
The two most popular techniques for evaluating GANs on image synthesis tasks are the Inception Score and the Fréchet Inception Distance. The Inception Score (Salimans et al., 2016) is given by $\exp(\mathbb{E}_{x}[\mathrm{KL}(p(y|x) \,\|\, p(y))])$, where $p(y|x)$ is the output of a pre-trained Inception classifier (Szegedy et al., 2014). This measures the ability of the GAN to generate samples that the pre-trained classifier confidently assigns to a particular class, and also the ability of the GAN to generate samples from all classes. The Fréchet Inception Distance (FID) (Heusel et al., 2017) is computed by passing samples through an Inception network to yield “semantic embeddings”, after which the Fréchet distance is computed between Gaussians with moments given by these embeddings.
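To make the Inception Score concrete, here is a small sketch that computes $\exp(\mathbb{E}_x[\mathrm{KL}(p(y|x) \,\|\, p(y))])$ from a list of per-sample class-probability rows. It assumes the rows are already available (in practice they would come from a pre-trained Inception classifier), and it estimates the marginal $p(y)$ as the average of the rows. The function name is hypothetical.

```python
import math

def inception_score(probs):
    """exp of the mean KL divergence between each row p(y|x) and the marginal p(y)."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution, estimated by averaging over samples.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    total_kl = 0.0
    for row in probs:
        for j in range(k):
            if row[j] > 0.0:  # 0 * log(0) is treated as 0
                total_kl += row[j] * math.log(row[j] / marginal[j])
    return math.exp(total_kl / n)

# Confident and diverse predictions give the maximum score (number of classes).
confident = inception_score([[1.0, 0.0], [0.0, 1.0]])
# Uninformative predictions give the minimum score of 1.
uniform = inception_score([[0.5, 0.5], [0.5, 0.5]])
```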
We use a Self-Attention GAN (SAGAN) (Zhang et al., 2018) in our experiments. We do so because SAGAN is considered state of the art on the ImageNet conditional-image-synthesis task (in which images are synthesized conditioned on class identity). SAGAN differs from a vanilla GAN in the following ways: First, it uses large residual networks (He et al., 2016) instead of normal convolutional layers. Second, it uses spectral normalization (Miyato et al., 2018) in the generator and the discriminator and a much lower learning rate for the generator than is conventional (Heusel et al., 2017). Third, SAGAN makes use of self-attention layers (Wang et al., ), in order to better model long range dependencies in natural images. Finally, this whole model is trained using a special hinge version of the adversarial loss (Lim & Ye, 2017; Miyato & Koyama, 2018; Tran et al., 2017):

$$L_D = -\mathbb{E}_{(x,y) \sim p_d}[\min(0, -1 + D(x, y))] - \mathbb{E}_{z \sim p_z, y \sim p_d}[\min(0, -1 - D(G(z), y))]$$

$$L_G = -\mathbb{E}_{z \sim p_z, y \sim p_d}[D(G(z), y)]$$
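The hinge version of the adversarial loss can be sketched as follows, operating on raw (unbounded) discriminator scores rather than probabilities. This is an illustrative sketch, not the SAGAN implementation: class conditioning is omitted, and the function names are hypothetical.

```python
def hinge_discriminator_loss(real_scores, fake_scores):
    """-mean(min(0, -1 + D(x))) - mean(min(0, -1 - D(G(z))))."""
    real_term = sum(min(0.0, -1.0 + s) for s in real_scores) / len(real_scores)
    fake_term = sum(min(0.0, -1.0 - s) for s in fake_scores) / len(fake_scores)
    return -real_term - fake_term

def hinge_generator_loss(fake_scores):
    """-mean(D(G(z))): the generator pushes scores on its samples upward."""
    return -sum(fake_scores) / len(fake_scores)

# Scores beyond the margin (real >= 1, fake <= -1) incur no discriminator loss.
no_loss = hinge_discriminator_loss([2.0], [-2.0])
# Undecided scores of 0 incur a loss of 1 from each term.
undecided = hinge_discriminator_loss([0.0], [0.0])
```

Because the discriminator output here is not a sigmoid probability, its scores cannot be read directly as the logit used in the rejection sampling derivation.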
Rejection sampling is a method for sampling from a target distribution $p_d(x)$ which may be hard to sample from directly. Samples are instead drawn from a proposal distribution $p_g(x)$, which is easier to sample from, and which is chosen such that there exists a finite value $M$ such that $M p_g(x) \geq p_d(x)$ for all $x$ in the domain of $p_d$. A given sample $y$ drawn from $p_g$ is kept with acceptance probability $p_d(y)/(M p_g(y))$, and rejected otherwise. See the blue points in Figure 1 (Left) for a visualization. Ideally, $p_g(x)$ should be close to $p_d(x)$, otherwise many samples will be rejected, reducing the efficiency of the algorithm (MacKay, 2003). In Section 3 we explain how to apply this rejection sampling algorithm to the GAN framework: in brief, we draw samples from the trained generator, $p_g(x)$, and then reject some of those samples using the discriminator to attain a closer approximation to the true data distribution, $p_d(x)$.
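A minimal sketch of classical rejection sampling, using a toy problem that is not from the paper: the target density is $p(x) = 2x$ on $[0, 1]$, the proposal is uniform on $[0, 1]$, and $M = 2$ bounds their ratio. All names are hypothetical.

```python
import random

def target_density(x):
    return 2.0 * x  # triangular density on [0, 1]

def proposal_density(x):
    return 1.0  # uniform density on [0, 1]

def rejection_sample(n, m, rng):
    """Draw n samples from the target; also return how many proposals were needed."""
    accepted = []
    proposed = 0
    while len(accepted) < n:
        y = rng.random()  # sample from the proposal
        proposed += 1
        # Keep y with probability target(y) / (M * proposal(y)).
        if rng.random() < target_density(y) / (m * proposal_density(y)):
            accepted.append(y)
    return accepted, proposed

rng = random.Random(0)
samples, proposed = rejection_sample(20000, 2.0, rng)
sample_mean = sum(samples) / len(samples)   # target mean is 2/3
acceptance_rate = len(samples) / proposed   # expected to be about 1/M
```

The expected acceptance rate is $1/M$, which is why a proposal far from the target (large $M$) wastes most samples.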
In this section we introduce our proposed rejection sampling scheme for GANs (which we call Discriminator Rejection Sampling, or DRS). We’ll first derive an idealized version of the algorithm that will rely on assumptions that don’t necessarily hold in realistic settings. We’ll then discuss the various ways in which these assumptions might break down. Finally, we’ll describe the modifications we made to the idealized version in order to overcome these challenges.
Suppose that we have a GAN and our generator has been trained to the point that $p_g$ and $p_d$ have the same support. That is, for all $x \in X$, $p_g(x) \neq 0$ if and only if $p_d(x) \neq 0$. If desired, we can make $p_d$ and $p_g$ have support everywhere in $X$ if we add low-variance Gaussian noise to the observations. Now further suppose that we have some way to compute $p_d(x)/p_g(x)$. Then, if $M = \max_x p_d(x)/p_g(x)$, then $M p_g(x) \geq p_d(x)$ for all $x$, so we can perform rejection sampling with $p_g$ as the proposal distribution and $p_d$ as the target distribution as long as we can evaluate the quantity $p_d(x)/(M p_g(x))$. In this case, we can exactly sample from $p_d$ (Casella et al., 2004), though we may have to reject many samples to do so.

Footnote: Why go through all this trouble when we could instead just pick some threshold $T$ and throw out $x$ when $D^*(x) < T$? This doesn’t allow us to recover $p_d$ in general. If, for example, there is $x'$ s.t. $p_g(x') > p_d(x') > 0$, we still want some probability of observing $x'$. See the red points in Figure 1 (Left) for a visual explanation.
But how can we evaluate $p_d(x)/(M p_g(x))$? $p_g$ is defined only implicitly. One thing we can do is to borrow an analysis from the original GAN paper (Goodfellow et al., 2014), which assumes that we can optimize the discriminator in the space of density functions rather than via changing its parameters. If we make this assumption, as well as the assumption that the discriminator is defined by a sigmoid applied to some function of $x$ and trained with a cross-entropy loss, then by Proposition 1 of that paper, we have that, for any fixed generator and in particular for the generator that we have when we stop training, training the discriminator to completely minimize its own loss yields

$$D^*(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}$$

We will discuss the validity of these assumptions later, but for now consider that this allows us to solve for $p_d(x)/p_g(x)$ as follows: As noted above, we can assume the discriminator is defined as:

$$D(x) = \sigma(\tilde{D}(x)) = \frac{1}{1 + e^{-\tilde{D}(x)}}$$

where $D(x)$ is the final discriminator output after the sigmoid, and $\tilde{D}(x)$ is the logit. Thus,

$$D^*(x) = \frac{1}{1 + e^{-\tilde{D}^*(x)}} = \frac{p_d(x)}{p_d(x) + p_g(x)}$$

$$1 + e^{-\tilde{D}^*(x)} = \frac{p_d(x) + p_g(x)}{p_d(x)}$$

$$p_d(x)\, e^{-\tilde{D}^*(x)} = p_g(x)$$

$$\frac{p_d(x)}{p_g(x)} = e^{\tilde{D}^*(x)}$$

Now suppose one last thing, which is that we can tractably compute $M = \max_x p_d(x)/p_g(x)$. We would find that $M = p_d(x^*)/p_g(x^*) = e^{\tilde{D}^*(x^*)}$ for some (not necessarily unique) $x^*$. Given all these assumptions, we can now perform rejection sampling as promised. If we define $\tilde{D}^*_M := \tilde{D}^*(x^*)$, then for any input $x$, the acceptance probability $p_d(x)/(M p_g(x))$ can be written as $e^{\tilde{D}^*(x) - \tilde{D}^*_M} \in [0, 1]$. To decide whether to keep any particular example, we can just draw a random number $\psi$ uniformly from $[0, 1]$ and accept the sample if $\psi < e^{\tilde{D}^*(x) - \tilde{D}^*_M}$.
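The idealized acceptance rule can be sketched on a toy problem where the optimal logit is known in closed form. The setup here is an assumption made for illustration and is not from the paper: the data density is $N(0, 1)$ and the generator density is $N(0, 2^2)$, so that $\log(p_d(x)/p_g(x)) = \log 2 - 3x^2/8$, which is maximized at $x = 0$. Function and variable names are hypothetical.

```python
import math
import random

def optimal_logit(x):
    """log(p_d(x) / p_g(x)) for p_d = N(0, 1) and p_g = N(0, 2^2) (toy assumption)."""
    return math.log(2.0) - 3.0 * x * x / 8.0

# The density ratio is maximized at x = 0, so this plays the role of the maximum logit.
LOGIT_MAX = optimal_logit(0.0)

def accept(x, rng):
    """Accept x if a uniform draw falls below exp(logit(x) - max logit)."""
    psi = rng.random()
    return psi < math.exp(optimal_logit(x) - LOGIT_MAX)

rng = random.Random(0)
proposals = [rng.gauss(0.0, 2.0) for _ in range(40000)]  # samples from p_g
kept = [x for x in proposals if accept(x, rng)]          # approximately samples from p_d

kept_mean = sum(kept) / len(kept)
kept_var = sum(x * x for x in kept) / len(kept) - kept_mean ** 2
acceptance_rate = len(kept) / len(proposals)
```

In this toy case the accepted samples should have variance close to 1 (the data variance rather than the generator variance of 4), with roughly half of the proposals kept.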
As we hinted at, the above analysis has a number of practical issues. In particular:
1. Since we can’t actually perform optimization over density functions, we can’t actually compute $D^*$. Thus, our acceptance probability won’t necessarily be proportional to $p_d(x)/p_g(x)$.
2. At least on large datasets, it’s quite obvious that the supports of $p_g$ and $p_d$ are not the same. If the supports of $p_g$ and $p_d$ have a low-volume intersection, we may not even want to compute $D^*$, because then $p_d(x)/p_g(x)$ would just evaluate to 0 most places.
3. The analysis yielding the formula for $D^*$ also assumes that we can draw infinite samples from $p_d$, which is not true in practice. If we actually optimized $D$ all the way given a finite data-set, it would give nonzero results on a set of measure 0.
4. In general it won’t be tractable to compute $M$.
5. Rejection sampling is known to have too low an acceptance probability when the target distribution is high dimensional (MacKay, 2003).
This section describes the Discriminator Rejection Sampling (DRS) procedure, which is an adjustment of the idealized procedure, meant to address the above issues.
Given that items 2 and 3 suggest we may not want to compute $D^*$ exactly, we should perhaps not be too concerned with item 1, which suggests that we can’t. The best argument we can make that it is OK to approximate $D^*$ is that doing so seems to be successful empirically. We speculate that training a regularized $D$ with SGD gives a final result that is further from $D^*$ but perhaps is less over-fit to the finite sample from $p_d$ used for training. We also hypothesize that the $D$ we end up with will distinguish between “good” and “bad” samples, even if those samples would both have zero density under the true $p_d$. We qualitatively evaluate this hypothesis in Figures 4 and 5. We suspect that more could be done theoretically to quantify the effect of this approximation, but we leave this to future work.
It’s nontrivial to compute $M$, at the very least because we can’t compute $D^*$. In practice, we get around this issue by estimating $M$ from samples. We first run an estimation phase, in which 10,000 samples are used to estimate $\tilde{D}^*_M$. We then use this estimate in the sampling phase. Throughout the sampling phase we update our estimate of $\tilde{D}^*_M$ if a larger value is found. It’s true that this will result in slight overestimates of the acceptance probability for samples that were processed before a new maximum was found, but we choose not to worry about this too much, since we don’t find that we have to increase the maximum very often in the sampling phase, and the increase is very small when it does happen.
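The two-phase estimate of the maximum logit can be sketched as a small tracker: it is initialized from a batch of logits collected in the estimation phase, and it raises its estimate whenever the sampling phase encounters a larger value. The class and method names are hypothetical, and the logits are assumed to come from some trained discriminator.

```python
import math

class MaxLogitTracker:
    """Running estimate of the maximum discriminator logit."""

    def __init__(self, estimation_phase_logits):
        # Estimation phase: take the largest logit seen in an initial batch.
        self.max_logit = max(estimation_phase_logits)

    def acceptance_probability(self, logit):
        # Sampling phase: raise the estimate if a larger logit shows up.
        if logit > self.max_logit:
            self.max_logit = logit
        return math.exp(logit - self.max_logit)

tracker = MaxLogitTracker([0.0, 1.0, 2.0])
p_at_max = tracker.acceptance_probability(2.0)      # equals 1.0
p_below = tracker.acceptance_probability(1.0)       # equals exp(-1)
p_new_max = tracker.acceptance_probability(3.0)     # estimate rises to 3.0
p_after = tracker.acceptance_probability(2.0)       # now exp(-1) instead of 1.0
```

The last two calls illustrate the overestimate mentioned above: the sample with logit 2.0 was assigned probability 1.0 before the new maximum was found and $e^{-1}$ afterwards.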
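Item 5 suggests that we may end up with acceptance probabilities that are too low to be useful when performing this technique on realistic data-sets. If $\tilde{D}^*_M$ is very large, the acceptance probability $e^{\tilde{D}^*(x) - \tilde{D}^*_M}$ will be close to zero, and almost all samples will be rejected, which is undesirable. One simple way to avoid this problem is to compute some $F(x)$ such that the acceptance probability can be written as follows:

$$\frac{1}{1 + e^{-F(x)}} = e^{\tilde{D}^*(x) - \tilde{D}^*_M}$$

If we solve for $F(x)$ in the above equation we can then perform the following rearrangement:

$$F(x) = \tilde{D}^*(x) - \log\left(e^{\tilde{D}^*_M} - e^{\tilde{D}^*(x)}\right)$$

$$= \tilde{D}^*(x) - \log\left(e^{\tilde{D}^*_M} - e^{\tilde{D}^*_M} e^{\tilde{D}^*(x) - \tilde{D}^*_M}\right)$$

$$= \tilde{D}^*(x) - \tilde{D}^*_M - \log\left(1 - e^{\tilde{D}^*(x) - \tilde{D}^*_M}\right)$$

In practice, we instead compute

$$\hat{F}(x) = \tilde{D}^*(x) - \tilde{D}^*_M - \log\left(1 - e^{\tilde{D}^*(x) - \tilde{D}^*_M - \epsilon}\right) - \gamma$$

where $\epsilon$ is a small constant added for numerical stability and $\gamma$ is a hyperparameter modulating overall acceptance probability. For very positive $\gamma$, all samples will be rejected. For very negative $\gamma$, all samples will be accepted. See Figure 2 for an analysis of the effect of adding $\gamma$. A summary of our proposed algorithm is presented in Figure 1 (Right).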
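The practical acceptance probability can be sketched as a sigmoid applied to the shifted quantity described above: the logit minus the maximum logit, minus a stabilized log term, minus $\gamma$. This is a sketch under stated assumptions, not the authors' code: the function names and the default value of `eps` are hypothetical, and the logit is assumed not to exceed the current maximum.

```python
import math

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0.0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def f_hat(logit, max_logit, gamma, eps=1e-8):
    """(logit - max_logit) - log(1 - exp(logit - max_logit - eps)) - gamma.

    Assumes logit <= max_logit; eps keeps the log argument positive at equality.
    """
    delta = logit - max_logit
    return delta - math.log(1.0 - math.exp(delta - eps)) - gamma

def acceptance_probability(logit, max_logit, gamma, eps=1e-8):
    return sigmoid(f_hat(logit, max_logit, gamma, eps))

# With gamma = 0 this recovers (up to eps) the idealized value exp(logit - max_logit).
p_ideal = acceptance_probability(-1.0, 0.0, 0.0)
# Very positive gamma rejects almost everything; very negative gamma accepts almost everything.
p_reject = acceptance_probability(-1.0, 0.0, 50.0)
p_accept = acceptance_probability(-1.0, 0.0, -50.0)
```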
In this section we justify the modifications made to the idealized algorithm. We do this by conducting two experiments in which we show that (according to popular measures of how well a GAN has learned the target distribution) Discriminator Rejection Sampling yields improvements for actual GANs. We start with a toy example that yields insight into how DRS can help, after which we demonstrate DRS on the ImageNet dataset (Russakovsky et al., 2015).
We investigate the impact of DRS on a low-dimensional synthetic data set consisting of a mixture of twenty-five 2D isotropic Gaussian distributions (each with standard deviation of 0.05) arranged in a grid (Dumoulin et al., 2016; Srivastava et al., 2017; Lin et al., 2017). We train a GAN model where the generator and discriminator are neural networks with four fully connected layers with ReLU activations. The prior is a 2D Gaussian with mean of 0 and standard deviation of 1 and the GAN is trained using the standard loss function. We generate 10,000 samples from the generator with and without DRS. The target distribution and both sets of generated samples are depicted in Figure 3. Here, we have set $\gamma$ dynamically to the percentile of $F(x)$ for all $x$ in a pool of samples.
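Setting $\gamma$ from a percentile of a pool of values can be sketched as below. The percentile level used in the example (90) is an arbitrary illustrative choice, not a value taken from the text, and the pool of values is synthetic. The helper uses linear interpolation between order statistics, which is one common convention among several.

```python
import math

def percentile(values, q):
    """q-th percentile (0 to 100), linearly interpolating between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

# Hypothetical pool of F(x) values, one per generated sample.
f_values = [float(i) for i in range(101)]

# Illustrative percentile level; the level is a tunable hyperparameter.
gamma = percentile(f_values, 90.0)

# After subtracting gamma, samples above the percentile have acceptance probability > 0.5.
num_likely_accepted = sum(1 for f in f_values if f - gamma > 0.0)
```

Choosing a higher percentile lowers the overall acceptance rate, since fewer samples end up with a positive shifted value.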
To measure performance, we assign each generated sample to its closest mixture component. As in Srivastava et al. (2017), we define a sample as “high quality” if it is within four standard deviations of its assigned mixture component. As shown in Table 1, DRS increases the fraction of high-quality samples from to . As in Dumoulin et al. (2016) and Srivastava et al. (2017) we call a mode “recovered” if at least one high-quality sample was assigned to it. Table 1 shows that DRS does not reduce the number of recovered modes – that is, it does not trade off quality for mode coverage. It does reduce the standard deviation of the high-quality samples slightly, but this is a good thing in this case (since the standard deviation of the target Gaussian distribution is 0.05). It also confirms that DRS does not accept samples only near the center of each Gaussian but near the tails as well.
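Table 1: Results with and without DRS on the mixture of Gaussians. Columns: number of recovered modes, fraction of “high quality” samples, standard deviation of “high quality” samples.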
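The “high quality” and “recovered modes” measures can be sketched as follows. The grid coordinates (integer centers from -2 to 2 on each axis) are an assumption made for illustration, since the exact layout is not stated here; the standard deviation of 0.05 and the four-standard-deviation threshold follow the text. Names are hypothetical.

```python
import math

STD = 0.05
# Assumed 5 x 5 grid layout of mode centers (illustrative only).
MODES = [(float(i), float(j)) for i in range(-2, 3) for j in range(-2, 3)]

def evaluate(samples, modes=MODES, std=STD, num_stds=4.0):
    """Return (fraction of high-quality samples, number of recovered modes)."""
    recovered = set()
    high_quality = 0
    for x, y in samples:
        dists = [math.hypot(x - mx, y - my) for mx, my in modes]
        nearest = min(range(len(modes)), key=dists.__getitem__)
        # High quality: within num_stds standard deviations of the nearest mode.
        if dists[nearest] <= num_stds * std:
            high_quality += 1
            recovered.add(nearest)  # a mode is recovered by any high-quality sample
    return high_quality / len(samples), len(recovered)

# Three samples near a mode and one sample between modes.
samples = [(0.0, 0.05), (1.0, 1.1), (0.5, 0.5), (-2.0, -2.0)]
fraction_high_quality, num_recovered = evaluate(samples)
```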
Since it is presently the state-of-the-art model on the conditional ImageNet synthesis task, we have reimplemented the Self-Attention GAN (Zhang et al., 2018) as a baseline. After reproducing the results reported by Zhang et al. (2018) (with the learning rate of ), we fine-tuned a trained SAGAN with a much lower learning rate () for both generator and discriminator. This improved both the Inception Score and FID significantly as can be seen in the Improved-SAGAN column in Table 2. Plots of Inception score and FID during training are given in Figure 5(A).
Since SAGAN uses a hinge loss and DRS requires a sigmoid output, we added a fully-connected layer “on top of” the trained discriminator and trained it to distinguish real images from fake ones using the binary cross-entropy loss. We trained this extra layer with 10,000 generated samples from the model and 10,000 examples from ImageNet.
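The idea of training a sigmoid output layer on top of fixed discriminator features can be sketched as a small logistic regression fitted with the binary cross-entropy loss. Everything concrete here is an assumption for illustration: the one-dimensional synthetic features stand in for the frozen discriminator's activations, and the learning rate, epoch count, and function names are hypothetical.

```python
import math
import random

def sigmoid(v):
    if v >= 0.0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def train_sigmoid_layer(features, labels, lr=0.1, epochs=200):
    """Fit weights w and bias b by full-batch gradient descent on binary cross-entropy."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # derivative of the cross-entropy with respect to the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

rng = random.Random(0)
# Synthetic stand-ins for discriminator features of real (label 1) and fake (label 0) images.
real = [[rng.gauss(1.0, 0.5)] for _ in range(200)]
fake = [[rng.gauss(-1.0, 0.5)] for _ in range(200)]
features, labels = real + fake, [1.0] * 200 + [0.0] * 200

w, b = train_sigmoid_layer(features, labels)
logits = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in features]
accuracy = sum((l > 0.0) == (y == 1.0) for l, y in zip(logits, labels)) / len(labels)
```

The pre-sigmoid value computed in `logits` is the quantity that plays the role of the discriminator logit in the acceptance probability.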
We then generated 50,000 samples from normal SAGAN and Improved SAGAN with and without DRS, repeating the sampling process 4 times. We set $\gamma$ dynamically to the percentile of the $F(x)$ values. The averages of Inception Score and FID over these four trials are presented in Table 2. Both scores were substantially improved for both models, indicating that DRS can indeed be useful in realistic settings involving large data-sets and sophisticated GAN variants.
From a pool of 50,000 samples, we visualize the “best” and the “worst” 100 samples based on their acceptance probabilities. Figure 4 shows that the subjective visual quality of samples with high acceptance probability is considerably better. Figure 2(B) also shows that the accepted images are on average more recognizable as belonging to a distinct class.
We also study the behavior of the discriminator in another way. We choose an ImageNet category randomly, then generate samples from that category until we have found two images $G(z_1)$ and $G(z_2)$ such that $G(z_1)$ appears visually realistic and $G(z_2)$ appears visually unrealistic. We then interpolate in latent space between $z_1$ and $z_2$. In Figure 5, the first and last columns correspond with $G(z_1)$ and $G(z_2)$, respectively. The color bar in the figure represents the acceptance probability assigned to each sample. In general, acceptance probabilities decrease from left to right. There is no reason to expect a priori that the acceptance probability should decrease monotonically as a function of the interpolated $z$, so it says something interesting about the discriminator that most rows basically follow this pattern.
We have proposed a rejection sampling scheme using the GAN discriminator to approximately correct errors in the GAN generator distribution. We’ve shown that under strict assumptions, we can recover the data distribution exactly. We’ve also examined where those assumptions break down and designed a practical algorithm (Discriminator Rejection Sampling) to address that. Finally, we have demonstrated the efficacy of this algorithm on a mixture of Gaussians and on the state-of-the-art SAGAN model.
Opportunities for future work include the following:
Brock et al. (2018) recently published work that improves image generation. It is not available yet, but we expect DRS to be compatible with it as well.
There’s no reason that our scheme can only be applied to GAN generators. It seems worth investigating whether rejection sampling can improve e.g. VAE decoders. This seems like it might help, because VAEs may have trouble with “spreading mass around” too much.
In one ideal case, the critic used for rejection sampling would be a human. Can we use better proxies for the human visual system to improve rejection sampling’s effect on image synthesis models?
It would be interesting to theoretically characterize the efficacy of rejection sampling under the breakdown-of-assumptions that we have described earlier. For instance, if one can’t recover but can train some other critic that has bounded divergence from , how does the efficacy depend on this bound?
We thank Colin Raffel for helpful discussions.
Improved Semi-supervised Learning with GANs using Manifold Invariances. ArXiv e-prints, May 2017.
Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, volume 2, pp. 4, 2017.
Generative image inpainting with contextual attention. arXiv preprint, 2018.
To confirm that our Discriminator Rejection Sampling is not duplicating the training samples, we show the nearest neighbor of a few visually-realistic generated samples in the ImageNet training data in Figures 6-13. The nearest neighbors are found based on their fc7 features from the pre-trained VGG16 model.
In addition, we plot the Inception Score as a function of acceptance rate in Figure 14-left. Different acceptance rates are achieved by changing $\gamma$ from the percentile of $F(x)$ (acceptance rate = ) to its percentile (acceptance rate = ). Decreasing the acceptance rate filters out more non-realistic samples and increases the final Inception Score. Beyond a specific rate, rejecting more samples yields no further improvement in the pool of samples.
Moreover, Figure 14-right shows the correlation between the acceptance probabilities that DRS assigns to the synthesized samples and the recognizability of those samples from the viewpoint of a pre-trained Inception network. The latter is measured by computing $\max_j p(y_j | x_i)$, which is the probability of sample $x_i$ belonging to the category $y_j$ from the 1,000 ImageNet classes. As expected, there is a large mass of recognizable images accepted with high acceptance probabilities in the top right corner. The small mass of images which cannot be easily classified into one of the 1,000 categories while having high acceptance probability scores (the top left corner of the graph) may be due to the GAN discriminator being non-optimal in practice. Therefore, we expect that improving the discriminator's performance would boost the final Inception Score even more substantially.