Investigating Under and Overfitting in Wasserstein Generative Adversarial Networks

10/30/2019 · by Ben Adlam, et al.

We investigate under and overfitting in Generative Adversarial Networks (GANs), using discriminators unseen by the generator to measure generalization. We find that the model capacity of the discriminator has a significant effect on the generator's model quality, and that the generator's poor performance coincides with the discriminator underfitting. Contrary to our expectations, we find that generators with large model capacities relative to the discriminator do not show evidence of overfitting on CIFAR10, CIFAR100, and CelebA.







1 Introduction

Generative adversarial networks (GANs) are a widely used type of generative model that have found success in many data modalities Goodfellow (2016). For image datasets, GANs have been able to generate diverse, high-fidelity samples that are almost indistinguishable from real images Brock et al. (2018); Karras et al. (2017, 2018). However, achieving such results remains difficult due to many types of training failure Salimans et al. (2016), a lack of effective measurements of model quality Theis et al. (2015); Barratt and Sharma (2018), and the need for substantial hyperparameter tuning Lucic et al. (2018); Kurach et al. (2018).

There is a growing body of work centered on investigating the statistical issues faced by GANs Arora et al. (2017); Arora and Zhang (2017); Bai et al. (2018) and optimization challenges specific to GANs Mescheder et al. (2017); Nagarajan and Kolter (2017); Hsieh et al. (2018); Rafique et al. (2018). We add to this work by analyzing the effect of function class complexity on the model quality of the generator and GAN generalization. We begin by discussing the theoretical underpinnings of under and overfitting for GANs. We then introduce the auxiliary discriminator and independent discriminator as mechanisms to help measure GAN performance. We use these tools to empirically probe how model complexity impacts GAN outcomes.

2 Preliminaries

In generative modeling, the goal is to find model parameters θ that minimize a divergence between the true data distribution p and the model distribution q_θ. These divergences can be specified as the solution to a variational problem. For example, the Kantorovich-Rubinstein duality states that the Wasserstein distance is

W(p, q_θ) = sup_{‖f‖_L ≤ 1} E_{x∼p}[f(x)] − E_{x∼q_θ}[f(x)],     (1)

where the supremum is over all 1-Lipschitz functions f, that is, functions with Lipschitz constant ‖f‖_L at most 1. In general, these divergences cannot be easily estimated from samples (let alone optimized with respect to) unless the distributions p and q_θ have a specific parametric form.

In practice, optimizing over all 1-Lipschitz functions is infeasible, so GANs replace this class with a function class F of neural nets that are constrained to be 1-Lipschitz by weight clipping Arjovsky et al. (2017), gradient penalization Gulrajani et al. (2017), or spectral normalization Miyato et al. (2018). This leads to the notion of neural net divergence Arora et al. (2017),

d_F(p, q_θ) = sup_{f_φ ∈ F} E_{x∼p}[f_φ(x)] − E_{x∼q_θ}[f_φ(x)].     (2)

The learning task of the discriminator is to find parameters φ that maximize (2).

Despite relaxing the supremum from all 1-Lipschitz functions to F, solving for d_F(p, q_θ) is still difficult, since the expectations in (2) cannot be computed exactly and the parameter space of F is non-convex. We can at least compute the expectations empirically for a training set X_train and samples from the generator,

(1/|X_train|) Σ_{x ∈ X_train} f_φ(x) − (1/m) Σ_i f_φ(g(z_i)),     (3)

optimize (3) with respect to φ to obtain some f̂, and then consistently estimate E_{x∼p}[f̂(x)] − E_{x∼q_θ}[f̂(x)] using a test set. Note that even if f̂ approximately achieves the supremum on the training set and has no generalization gap to the test set, we cannot conclude that the resulting estimate is approximately d_F(p, q_θ), only that it provides a lower bound. All of this conspires to make using d_F as an objective measure of the generator's model quality challenging.
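To make the lower-bound estimation concrete, here is a minimal numpy sketch (ours, not the paper's code) on a toy problem: a linear critic f(x) = ⟨w, x⟩ with ‖w‖ ≤ 1 is 1-Lipschitz, and projected gradient ascent on the empirical objective estimates the divergence between two unit Gaussians whose means differ by 1 (so the true Wasserstein distance is exactly 1). All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" (p) and "model" (q) distributions: unit Gaussians in R^2
# whose means differ by 1, so the true Wasserstein distance is 1.
p_train = rng.normal([1.0, 0.0], 1.0, size=(2000, 2))
p_test = rng.normal([1.0, 0.0], 1.0, size=(2000, 2))
q_train = rng.normal([0.0, 0.0], 1.0, size=(2000, 2))
q_eval = rng.normal([0.0, 0.0], 1.0, size=(2000, 2))

def train_critic(p, q, steps=200, lr=0.1):
    """Maximize the empirical objective E_p f - E_q f over linear critics
    f(x) = <w, x> with ||w|| <= 1 (which makes f 1-Lipschitz)."""
    w = np.zeros(2)
    for _ in range(steps):
        w += lr * (p.mean(axis=0) - q.mean(axis=0))  # gradient ascent step
        norm = np.linalg.norm(w)
        if norm > 1.0:
            w /= norm  # project back onto the unit ball
    return w

w_hat = train_critic(p_train, q_train)
# Training-set estimate, and a consistent estimate on held-out samples.
# Both are lower bounds on the supremum over all 1-Lipschitz functions.
train_est = (p_train @ w_hat).mean() - (q_train @ w_hat).mean()
test_est = (p_test @ w_hat).mean() - (q_eval @ w_hat).mean()
```

Both estimates come out near 1 here because the optimal 1-Lipschitz critic for two translated Gaussians happens to be linear; for neural net critics only the lower-bound property survives.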

2.1 Under and Overfitting

For any metric, a model's generalization gap is the difference between the metric's value on the true data distribution less its value on the training set. As a model changes, classical machine learning theory divides the behavior of the generalization gap into two regimes: underfitting and overfitting. Overfitting is when the metric is improved on the training set at the expense of its value on the true data distribution. In supervised learning, metrics such as accuracy can be easily measured on the training set and estimated on the true data distribution using an independent test set, but, as we have discussed, estimating divergences is more difficult.

Generative models can also suffer from under and overfitting. Specifically, for GANs, we say a generator is overfitting if it is minimizing the divergence estimated on the training set at the expense of increasing the true divergence d_F(p, q_θ); that is, the generalization gap between the two increases faster than the training-set divergence decreases. We say a discriminator is overfitting if it is increasing its objective on the training set at the expense of decreasing its value on the true data distribution. This can only happen when F is too complex or the training set size n is too small, since otherwise standard Rademacher complexity arguments can be used to show that (1/n) Σ_{x ∈ X_train} f(x) is close to E_{x∼p}[f(x)] for all f ∈ F, and hence the suprema must also be close. See Arora et al. (2017) for an example where this generalization fails because F is too complex.
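For completeness, one standard form of the uniform convergence bound alluded to above (stated for functions with range in [0, 1]; constants change with the range, and R_n denotes the Rademacher complexity of F on n samples) is:

```latex
\Pr\left[\,\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n} f(x_i)-\mathbb{E}_{x\sim p}[f(x)]\right|
\le 2\,\mathcal{R}_n(\mathcal{F})+\sqrt{\frac{\log(2/\delta)}{2n}}\,\right]\ge 1-\delta .
```

When R_n(F) is small relative to n, the empirical and population objectives are uniformly close, so neither discriminator objective can be inflated on the training set alone.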

Underfitting is harder to define, but the comparison of the neural net divergence d_F(p, q_θ) to the Wasserstein distance W(p, q_θ) is relevant, as we motivated the former as a tractable version of the latter. Specifically, if F has insufficient capacity, W(p, q_θ) may be significantly larger than d_F(p, q_θ), and we argue this corresponds to underfitting.

2.2 Additional discriminators

To compare to the original discriminator (the one the generator learns from), we train two additional discriminators that, from the perspective of neural net divergence, are solving the same variational problem (1). The first, called the auxiliary discriminator, has a different, random initialization and learns with the same loss, but does not provide gradients to the generator. The idea is that its optimization is similar to the original discriminator's (it has the same training data, and it is learning a non-stationary objective). The second, called the independent discriminator, attempts to abstract away many of the details of GAN training and simply compute the neural net divergence. Using either the half of the training set that the generator learned from or the held-out half that was not used during training of the GAN, together with an equal number of samples from an already trained generator, the independent discriminator optimizes (3). Moreover, we can either use the same architecture for all three discriminators, or we can vary the original discriminator's architecture while fixing the auxiliary and independent discriminators to act as baselines.

Since they all solve (1), the divergences computed by these discriminators (the original, auxiliary, and independent divergences) can be compared. The auxiliary discriminator provides an additional measure of the model quality of the generator, and by comparing it to the original discriminator we can detect underfitting: since each divergence is a lower bound on the neural net divergence, a significant gap between the original and auxiliary divergences means the smaller of the two has failed to approach the supremum.

The independent discriminator's divergence can be used in the same ways as outlined above, but it can also be used to probe for overfitting in the generator. We use the difference between its divergence computed with the held-out data and with the generator's training data to approximate the generator's generalization gap; a large gap could suggest overfitting.

3 Experimental Setup

We consider the image datasets CIFAR10, CIFAR100, and CelebA (CIFAR10 and CIFAR100 at their native 32×32 resolution). We split the training data in half: the GAN is trained on one half, and the other half is held out for the independent discriminator. Additionally, we use a test set to monitor overfitting in the discriminators.

We train each Wasserstein GAN for 500K steps to reach an equilibrium and use a batch size of 64. As per Miyato et al. (2018), optimization is done with Adam using a learning rate of 0.0001, β1 of 0.5, and β2 of 0.999. The discriminator and generator are both updated once in each training step. In contrast, it sufficed to optimize the independent discriminator using SGD with cosine learning rate decay over 100K steps to reach convergence.

We use the losses

L_D = (1/m) Σ_i [ softplus(−f(x_i)) + softplus(f(g(z_i))) ],
L_G = (1/m) Σ_i softplus(−f(g(z_i))),

for the discriminator and generator respectively, where f and g are the functions computed by the GAN's discriminator and generator and the sums are over a batch of real data x_i and noise z_i. This softplus loss can be thought of as a smooth version of the hinge loss and a monotonic function of (3) Miyato et al. (2018). Since we interpret the discriminators as computing a divergence between distributions, we plot the negation of L_D without the softplus in all figures.
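A minimal numpy sketch of these losses (illustrative only; the arrays f_real and f_fake stand in for the discriminator's outputs on a batch of real and generated images):

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def discriminator_loss(f_real, f_fake):
    # Pushes f up on real images and down on generated images.
    return float(np.mean(softplus(-f_real) + softplus(f_fake)))

def generator_loss(f_fake):
    # Non-saturating generator loss: pushes f(g(z)) up.
    return float(np.mean(softplus(-f_fake)))

# A confident discriminator (large positive scores on real images, large
# negative scores on fakes) drives its own loss toward zero while the
# generator's loss is large.
f_real = np.array([5.0, 6.0])
f_fake = np.array([-5.0, -4.0])
```

Using `logaddexp` rather than `log(1 + exp(x))` avoids overflow for large discriminator outputs.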

Figure 1: Results for CIFAR10. We observe little generalization gap (difference between solid and dashed lines) for the auxiliary discriminator, whereas the original discriminator's gap appears to increase with model capacity. Note that it completely fails to generalize to the test set. The decreasing auxiliary divergence suggests that the model quality of the generator improves with additional discriminator model capacity, but this trend is not seen in the original discriminator; in fact, it is reversed on the training set. The significant gap between the original and auxiliary divergences is symptomatic of underfitting, but interestingly, since the original and auxiliary discriminators have the same architecture, we find the cause must be an interaction between model capacity and optimization.

We use a DCGAN as in Miyato et al. (2018). The generator has batch normalization, no spectral normalization, and ingests samples from a standard 128-dimensional Gaussian. The generator's architecture is fixed throughout all experiments. The discriminator has spectral normalization (to enforce the Lipschitz constraint) and no batch normalization. The number of channels in the discriminator's convolutional layers is varied during experiments to change model capacity from 8 to 256.
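Spectral normalization divides each weight matrix by an estimate of its largest singular value obtained by power iteration. A rough numpy sketch for a single dense matrix (many iterations for clarity; real implementations run one iteration per training step and reuse a persistent u vector):

```python
import numpy as np

def spectral_normalize(w, u, n_iters=1):
    """Estimate the largest singular value sigma of w by power iteration
    and return w / sigma, whose spectral norm is approximately 1, along
    with the updated left singular vector estimate u."""
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v  # Rayleigh-quotient estimate of the top singular value
    return w / sigma, u

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32))
u = rng.normal(size=64)
w_sn, u = spectral_normalize(w, u, n_iters=100)
top_sv = np.linalg.svd(w_sn, compute_uv=False)[0]  # close to 1
```

Bounding every layer's spectral norm by 1 bounds the Lipschitz constant of the whole (compositional) discriminator, which is what makes the network an admissible critic in (1).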

Figure 2: Samples from generators trained on CIFAR10 ordered in increasing number of discriminator channels. Qualitatively the generator’s image quality seems to improve as the discriminator’s model capacity increases.

We also compare the divergences computed by the discriminators with the Fréchet Inception distance (FID), a commonly used metric for evaluating GANs Salimans et al. (2016); Heusel et al. (2017). We compute FID using 10K samples, both for the training set and for the test set.
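FID fits a Gaussian to the Inception features of each sample set and computes the Fréchet distance between the two Gaussians. A numpy-only sketch operating on raw feature arrays (feature extraction omitted; names are illustrative):

```python
import numpy as np

def psd_sqrt(a):
    # Square root of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(a)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def fid(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2))."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    c_a = np.cov(feats_a, rowvar=False)
    c_b = np.cov(feats_b, rowvar=False)
    s = psd_sqrt(c_b)
    # Tr((C_a C_b)^(1/2)) equals the trace of the symmetric matrix
    # (s C_a s)^(1/2), since C_a C_b and s C_a s are similar.
    cross = psd_sqrt(s @ c_a @ s)
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(c_a) + np.trace(c_b) - 2.0 * np.trace(cross))

rng = np.random.default_rng(0)
feats = rng.normal(size=(5000, 4))
# Identical sets give FID ~ 0; shifting one set's mean by 1 gives FID ~ 1.
```

The symmetrized form avoids taking the square root of the non-symmetric product C_a C_b directly, which is the usual numerical pitfall in FID implementations.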

4 Results

To begin with, we discuss under and overfitting in the discriminators. In general, we observed no overfitting in any of the discriminators, and mostly the gap between a discriminator's divergence on the training and test sets was small but sometimes noisy. The only consistent generalization gap was found in the original discriminator's divergence on the training and test sets. The original discriminator's divergence on the training set decreased consistently with the number of channels, whereas its divergence on the test set continued to fluctuate around zero, which implies no overfitting. Note that a divergence of zero is what is achieved by a random discriminator. The same behavior was observed on CIFAR100 (see supplementary figures). For CelebA (see supplementary figures), the test divergence decreased with the number of channels, which caused the generalization gap to stay relatively constant. It is puzzling that the generator is able to learn from, and produce a better model using, a discriminator that completely fails to generalize. We did, however, see a significant difference between the original discriminator and the auxiliary and independent discriminators (see Figure 1) that suggested underfitting in the original discriminator.

For CIFAR10 and CIFAR100, the divergences computed by the discriminators for each generator were broadly similar, regardless of the discriminator’s architecture. On CelebA, the discriminators agreed in the rank ordering of divergences, but the independent discriminators often reported larger divergences than the auxiliary discriminators.

Figure 3: The independent discriminators generalize well for CIFAR10; see the difference between test (solid) and training (dashed) sets. Their divergences are broadly similar whether the architecture matches the original discriminator or is fixed at a baseline of 64 channels. We note that for 8 channels, the divergence on the test set was larger than on the training set, which would suggest a generalization gap for the generator, but this was not replicated in the baseline 64-channel independent discriminators.

Figure 4: For CIFAR10, the auxiliary and independent discriminators achieve broadly similar divergences despite differences in how they are optimized, but at the extremes the auxiliary discriminators' divergences are smaller.

Turning to the generator, we note that the divergences computed by the auxiliary and independent discriminators generally decreased with the number of channels, indicating that the generator is actually producing a better model. Note that this was true whether the architectures used for the auxiliary and independent discriminators matched that of the original discriminator, or were fixed at a consistent baseline of 64 channels.

For the generator, we observed no overfitting when evaluated by the independent discriminator or by FID. This is somewhat intuitive given the samples in Figure 2. Taking a different perspective, the gap between the original and auxiliary discriminators could be viewed as a type of overfitting by the generator in the following sense: when the discriminator has few channels and the generator is relatively overparameterized, the generator lowers its loss, as computed by the original discriminator, to zero without decreasing the auxiliary discriminator's divergence. This is similar to a classifier decreasing its loss at the expense of accuracy. Interestingly, the difference between the original and auxiliary discriminators reduces as the original discriminator's generalization gap increases.

Figure 5: FID correlates relatively well with the independent discriminator’s divergence on CIFAR10. There is little difference between the independent discriminator’s divergence or FID on the training set and test set, indicating that the generator is not overfitting with respect to this metric.

5 Conclusion and further research

We find that the relative model capacity of the discriminator has a significant effect on the model quality of the generator. However, we do not observe any overfitting in either the generator or discriminator.

We saw that the original discriminator either does not generalize at all or has a large generalization gap. Moreover, its divergence does not correlate with the generator's model quality, whereas the auxiliary and independent discriminators' divergences do. These divergences also correlate with FID. We wonder whether these divergences might be used to evaluate GANs for data modalities where a canonical trained embedding, like Inception, is not available.

In the future, we plan to extend this analysis to investigate the effects of dataset size and batch size on GANs. While we have focused here on Wasserstein GANs, we hope to extend our analysis to other types of GAN.


  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875.
  • S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang (2017) Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, pp. 224–232.
  • S. Arora and Y. Zhang (2017) Do GANs actually learn the distribution? An empirical study. arXiv preprint arXiv:1706.08224.
  • Y. Bai, T. Ma, and A. Risteski (2018) Approximability of discriminators implies diversity in GANs. arXiv preprint arXiv:1806.10586.
  • S. Barratt and R. Sharma (2018) A note on the Inception Score. arXiv preprint arXiv:1801.01973.
  • A. Brock, J. Donahue, and K. Simonyan (2018) Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
  • I. Goodfellow (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
  • Y. Hsieh, C. Liu, and V. Cevher (2018) Finding mixed Nash equilibria of generative adversarial networks. arXiv preprint arXiv:1811.02002.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
  • T. Karras, S. Laine, and T. Aila (2018) A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948.
  • K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly (2018) The GAN landscape: losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720.
  • M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2018) Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems, pp. 700–709.
  • L. Mescheder, S. Nowozin, and A. Geiger (2017) The numerics of GANs. In Advances in Neural Information Processing Systems, pp. 1825–1835.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
  • V. Nagarajan and J. Z. Kolter (2017) Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pp. 5585–5595.
  • H. Rafique, M. Liu, Q. Lin, and T. Yang (2018) Non-convex min-max optimization: provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
  • L. Theis, A. v. d. Oord, and M. Bethge (2015) A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844.

Appendix A Additional figure for CIFAR10

Appendix B Figures for CIFAR100

Appendix C Figures for CelebA