Generative adversarial networks (GANs) are a widely used type of generative model that has found success in many data modalities Goodfellow (2016). For image datasets, GANs have been able to generate diverse, high-fidelity samples that are almost indistinguishable from real images Brock et al. (2018); Karras et al. (2017, 2018). However, achieving such results remains difficult due to many types of training failure Salimans et al. (2016), a lack of effective measurements of model quality Theis et al. (2015); Barratt and Sharma (2018), and the need for substantial hyperparameter tuning Lucic et al. (2018); Kurach et al. (2018).
There is a growing body of work centered on investigating the statistical issues faced by GANs Arora et al. (2017); Arora and Zhang (2017); Bai et al. (2018) and optimization challenges specific to GANs Mescheder et al. (2017); Nagarajan and Kolter (2017); Hsieh et al. (2018); Rafique et al. (2018). We add to this work by analyzing the effect of function class complexity on the model quality of the generator and GAN generalization. We begin by discussing the theoretical underpinnings of under and overfitting for GANs. We then introduce the auxiliary discriminator and independent discriminator as mechanisms to help measure GAN performance. We use these tools to empirically probe how model complexity impacts GAN outcomes.
In generative modeling, the goal is to find model parameters $\theta$ that minimize a divergence between the true data distribution $P$ and the model distribution $Q_\theta$. These divergences can be specified as the solution to a variational problem. For example, the Kantorovich-Rubinstein duality states that the Wasserstein distance is
$$W(P, Q_\theta) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q_\theta}[f(x)], \tag{1}$$
where the supremum is over all 1-Lipschitz functions $f$ and $\|f\|_L$ denotes the Lipschitz constant of $f$. In general, these divergences cannot be easily estimated from samples (let alone optimized) unless the distributions $P$ and $Q_\theta$ have a specific parametric form.
In practice, optimizing over all 1-Lipschitz functions is infeasible, so GANs replace them with a function class of neural nets that are constrained to be 1-Lipschitz by weight clipping Arjovsky et al. (2017), gradient penalization Gulrajani et al. (2017), or spectral normalization Miyato et al. (2018). This leads to the notion of neural net divergence Arora et al. (2017), in which the supremum is taken over this function class only. The learning task of the discriminator is to find parameters that maximize the corresponding objective.
Despite relaxing the Wasserstein distance to the neural net divergence, solving for the optimal discriminator is still difficult, since the expectations in the objective cannot be computed exactly and the optimization over the parameters of the neural net is non-convex. We can at least compute the expectations on a training set, optimize with respect to them to obtain some discriminator parameters, and then consistently estimate the resulting objective value using a test set. Note that even if the learned discriminator approximately achieves the supremum on the training set and has no generalization gap to the test set, we cannot conclude that its value is approximately the true divergence, only that it provides a lower bound. All of this conspires to make using the discriminator’s divergence as an objective measure of the generator’s model quality challenging.
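The train/test estimation procedure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not part of our experiments: the data are one-dimensional Gaussians, and the critic class is the hypothetical family f_w(x) = w·x with |w| ≤ 1, each member of which is 1-Lipschitz. The critic is selected on training samples and then re-evaluated on fresh samples.

```python
import random
import statistics


def critic_gap(critic, real, fake):
    """Empirical E_real[f(x)] - E_fake[f(x)] for a fixed critic f."""
    return statistics.fmean(critic(x) for x in real) - statistics.fmean(
        critic(x) for x in fake
    )


rng = random.Random(0)

# Toy 1-D data (an assumption for illustration): real ~ N(1, 1), model ~ N(0, 1).
# The two differ by a shift of 1, so their Wasserstein-1 distance is exactly 1.
real_train = [rng.gauss(1.0, 1.0) for _ in range(5000)]
real_test = [rng.gauss(1.0, 1.0) for _ in range(5000)]
fake_train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
fake_test = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Hypothetical critic class: f_w(x) = w * x with |w| <= 1 (every member is 1-Lipschitz).
candidate_slopes = [k / 10 for k in range(-10, 11)]


def train_gap(w):
    """Objective value of the critic f_w on the training samples."""
    return critic_gap(lambda x: w * x, real_train, fake_train)


# "Training" the critic: pick the slope that maximizes the empirical objective.
best_w = max(candidate_slopes, key=train_gap)

train_estimate = train_gap(best_w)
test_estimate = critic_gap(lambda x: best_w * x, real_test, fake_test)
print(f"best slope={best_w}, train={train_estimate:.3f}, test={test_estimate:.3f}")
```

In this toy setting the restricted class happens to contain an optimal critic, so both estimates land near the true distance. With a richer data distribution the same procedure would in general only yield a lower bound.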
2.1 Under and Overfitting
For any metric, a model’s generalization gap is the metric’s value on the true data distribution less its value on the training set. As a model changes, classical machine learning theory divides the behavior of the generalization gap into two regimes: underfitting and overfitting. Overfitting is when the metric is improved on the training set at the expense of its value on the true data distribution. In supervised learning, metrics such as accuracy can be easily measured on the training set and estimated on the true data distribution using an independent test set, but, as we have discussed, estimating divergences is more difficult.
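Generative models can also suffer from under- and overfitting. Specifically, for GANs, we say a generator is overfitting if it is minimizing the divergence on the training set at the expense of increasing the divergence on the true data distribution; that is, the generalization gap between the two increases faster than the training divergence decreases. We say a discriminator is overfitting if it is increasing the divergence on the training set at the expense of decreasing it on the true data distribution. This can happen when the discriminator’s function class is too complex or the training set size is too small: standard Rademacher complexity arguments show that the empirical objective is close to its population value for every function in the class, and hence that the suprema are also close, only when the class is simple enough relative to the training set size. See Arora et al. (2017) for an example where this generalization fails because the function class is too complex.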
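Underfitting is harder to define, but the comparison of the neural net divergence to the Wasserstein distance is relevant, as we motivated the former as a tractable version of the latter. Specifically, if the discriminator’s function class has insufficient capacity, the Wasserstein distance may be significantly larger than the neural net divergence, and we argue this corresponds to underfitting.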
2.2 Additional discriminators
To compare to the original discriminator (that the generator learns from), we train two additional discriminators that, from the perspective of NN divergence, are solving the same variational problem (1). The first, called an auxiliary discriminator, has a different random initialization and learns with the same loss, but does not provide gradients to the generator. The idea is that its optimization is similar to that of the original discriminator: it has the same training data and is learning a non-stationary objective. The second, called an independent discriminator, attempts to abstract away many of the details of GAN training and simply compute NN divergence. Using either the training set, which the generator learned from, or a held-out set, which was not used during training of the GAN, and an equal number of samples from an already trained generator, the independent discriminator optimizes the same objective. Moreover, we can either use the same architecture for all three discriminators, or we can vary the original discriminator’s architecture while fixing the auxiliary and independent discriminators to act as baselines.
Since they all solve (1), the divergences computed by the original, auxiliary, and independent discriminators can be compared. The auxiliary discriminator provides an additional measure of the model quality of the generator, and by comparing it to the original discriminator we can detect underfitting when the auxiliary discriminator’s divergence is greater than the original discriminator’s.
The independent discriminator’s divergence can be used in the same ways as outlined above, but it can also be used to probe for overfitting in the generator. We use the difference between its value on the training set and its value on the held-out set to approximate the generator’s generalization gap; a large gap could suggest overfitting.
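The two comparisons described in this section can be sketched as a small diagnostic. This is only an illustration under stated assumptions: the `Divergences` container, the `diagnose` function, and the tolerance `tol` are hypothetical names and values, and the numbers in the usage example are made up rather than taken from our experiments.

```python
from dataclasses import dataclass


@dataclass
class Divergences:
    """Divergence values reported for one trained generator (hypothetical container)."""

    original: float             # discriminator that provides gradients to the generator
    auxiliary: float            # same loss and data, but no gradients to the generator
    independent_train: float    # independent discriminator on the generator's training set
    independent_heldout: float  # independent discriminator on data unseen by the GAN


def diagnose(d, tol=0.05):
    """Flag possible underfitting and generator overfitting; `tol` is an arbitrary threshold."""
    # Underfitting signal: the auxiliary discriminator sees more divergence than the original.
    discriminator_shortfall = d.auxiliary - d.original
    # Generalization gap: value on unseen data less the value on the training set.
    generator_gap = d.independent_heldout - d.independent_train
    return {
        "discriminator_shortfall": discriminator_shortfall,
        "possible_underfitting": discriminator_shortfall > tol,
        "generator_gap": generator_gap,
        "possible_generator_overfitting": generator_gap > tol,
    }


# Made-up values, for illustration only.
report = diagnose(Divergences(original=0.0, auxiliary=0.4,
                              independent_train=0.38, independent_heldout=0.41))
print(report)
```

The threshold is a design choice, not something the analysis prescribes; in practice the two differences would be read off plots across architectures rather than cut at a fixed value.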
3 Experimental Setup
We consider the image datasets CIFAR10, CIFAR100, and CelebA at the resolutions , , and respectively. We split the training data in half, denoted and , where for CIFAR10 and CIFAR100, and for CelebA. Additionally, we use a test set, denoted where to monitor overfitting in the discriminators.
We train each Wasserstein GAN for 500K steps to reach an equilibrium and use a batch size of 64. As per Miyato et al. (2018), optimization is done with Adam using a learning rate of 0.0001, $\beta_1$ of 0.5, and $\beta_2$ of 0.999. The discriminator and generator are both updated once in each training step. In contrast, it sufficed to optimize the independent discriminator using SGD with cosine decay over 100K steps to reach convergence.
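For concreteness, one common form of cosine decay is a single half-cosine from the initial learning rate down to zero. The sketch below assumes that form; the function name `cosine_decay` and the initial rate `BASE_LR` are hypothetical, since the initial SGD learning rate is not stated here.

```python
import math


def cosine_decay(step, total_steps, base_lr):
    """Learning rate after `step` updates under a single half-cosine decay to zero."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 100_000  # from the text: 100K steps for the independent discriminator
BASE_LR = 0.1          # hypothetical: the initial SGD learning rate is not stated

for step in (0, 25_000, 50_000, 75_000, 100_000):
    print(step, round(cosine_decay(step, TOTAL_STEPS, BASE_LR), 5))
```

The rate starts at `BASE_LR`, passes through half of it at the midpoint, and reaches zero at the final step.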
We use the losses
for the discriminator and generator respectively, where and are the functions computed by the GAN and the sum is over a batch of real data and noise . This softplus loss can be thought of as a smooth version of hinge loss and a monotonic function of (3) Miyato et al. (2018). Since we interpret the discriminators as computing a divergence between distributions, we plot the negation of without the softplus in all figures.
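To make the softplus losses concrete, the sketch below assumes a common form: softplus(-D(x)) + softplus(D(G(z))) for the discriminator and softplus(-D(G(z))) for the generator, averaged over the batch. This form, the batch averaging, and all function names are assumptions for illustration and may differ in detail from the losses used in our experiments. The inputs are plain lists of discriminator outputs on real and generated samples.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def discriminator_loss(real_scores, fake_scores):
    """Assumed form: batch mean of softplus(-D(x)) + softplus(D(G(z)))."""
    terms = [softplus(-r) + softplus(f) for r, f in zip(real_scores, fake_scores)]
    return sum(terms) / len(terms)


def generator_loss(fake_scores):
    """Assumed form: batch mean of softplus(-D(G(z)))."""
    return sum(softplus(-f) for f in fake_scores) / len(fake_scores)


def divergence_estimate(real_scores, fake_scores):
    """Negated discriminator loss with the softplus removed: mean of D(x) - D(G(z))."""
    terms = [r - f for r, f in zip(real_scores, fake_scores)]
    return sum(terms) / len(terms)


# Made-up discriminator outputs for a batch of two real and two generated samples.
real_scores, fake_scores = [1.0, 3.0], [0.0, 1.0]
print(discriminator_loss(real_scores, fake_scores))
print(generator_loss(fake_scores))
print(divergence_estimate(real_scores, fake_scores))
```

Under this assumed form, dropping the softplus and negating the discriminator loss gives the difference of mean scores on real and generated data, which matches the shape of the Wasserstein objective.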
We use a DCGAN as in Miyato et al. (2018). The generator has batch normalization, no spectral normalization, and ingests samples from a 128-dimensional standard Gaussian. The generator’s architecture is fixed through all experiments. The discriminator has spectral normalization (to enforce Lipschitz continuity) and no batch normalization. The number of channels in the discriminator’s convolutional layers is varied from 8 to 256 across experiments to change model capacity.
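As a rough illustration of how spectral normalization bounds a layer’s Lipschitz constant, the sketch below estimates the largest singular value of a weight matrix by power iteration and divides the weights by it. It is a simplified stand-in and not the implementation used in our experiments: it works on a small dense matrix given as nested lists, runs many iterations per call, and assumes a nonzero matrix. The function names are hypothetical.

```python
import math


def spectral_norm(w, iters=50):
    """Estimate the largest singular value of a nonzero matrix `w` by power iteration."""
    rows, cols = len(w), len(w[0])
    v = [1.0 / math.sqrt(cols)] * cols
    sigma = 0.0
    for _ in range(iters):
        # u = W v, normalized to unit length.
        u = [sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        u_norm = math.sqrt(sum(x * x for x in u))
        u = [x / u_norm for x in u]
        # v = W^T u; its length converges to the top singular value.
        v = [sum(w[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma


def spectrally_normalize(w, iters=50):
    """Rescale `w` so that its largest singular value is approximately 1."""
    sigma = spectral_norm(w, iters)
    return [[x / sigma for x in row] for row in w]


weights = [[1.0, 2.0], [3.0, 4.0]]
normalized = spectrally_normalize(weights)
print(spectral_norm(weights), spectral_norm(normalized))
```

A linear map whose largest singular value is 1 is 1-Lipschitz in the Euclidean norm, so composing such layers with 1-Lipschitz activations keeps the whole discriminator 1-Lipschitz.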
To begin with, we discuss under- and overfitting in the discriminators. In general, we observed no overfitting in any of the discriminators; the gap between a discriminator’s divergence on the training and test sets was mostly small, though sometimes noisy. The only consistent generalization gap was found between the original discriminator’s divergence on the training set and on the test set. Its divergence on the training set decreased consistently with the number of channels, whereas its divergence on the test set continued to fluctuate around zero, which implies no overfitting. Note that a divergence of zero is what is achieved by a random discriminator. The same behavior was observed on CIFAR100 (see supplementary figures). For CelebA (see supplementary figures), the test divergence decreased with the number of channels, which caused the generalization gap to stay relatively constant. It is puzzling that the generator is able to learn from, and produce a better model using, a discriminator that completely fails to generalize. We did, however, see a significant difference between the original discriminator and the auxiliary and independent discriminators (see Figure 1) that suggested underfitting in the original discriminator.
For CIFAR10 and CIFAR100, the divergences computed by the discriminators for each generator were broadly similar, regardless of the discriminator’s architecture. On CelebA, the discriminators agreed in the rank ordering of divergences, but the independent discriminators often reported larger divergences than the auxiliary discriminators.
Turning to the generator, we note that the divergences computed by the auxiliary and independent discriminators generally decreased with the number of channels, indicating that the generator is actually producing a better model. Note that this was true whether the architectures used for the auxiliary and independent discriminators matched that of the original discriminator or had a consistent baseline of 64 channels.
For the generator, we observed no overfitting when evaluated by the independent discriminator or by FID. This is somewhat intuitive given the samples in Figure 2. Taking a different perspective, the gap between the original and auxiliary discriminators could be viewed as a type of overfitting by the generator in the following sense: when the discriminator has few channels and the generator is relatively overparameterized, the generator lowers its loss, as computed by the original discriminator, to zero without decreasing the auxiliary discriminator’s divergence. This is similar to a classifier decreasing its loss at the expense of accuracy. Interestingly, the difference between the original and auxiliary discriminators shrinks as the original discriminator’s generalization gap increases.
5 Conclusion and further research
We find that the relative model capacity of the discriminator has a significant effect on the model quality of the generator. However, we do not observe any overfitting in either the generator or discriminator.
We saw that the original discriminator either does not generalize at all or has a large generalization gap. Moreover, its divergence does not correlate with the generator’s model quality, whereas the auxiliary and independent discriminators’ divergences do. These divergences also correlate with FID. We wonder whether these divergences might be used to evaluate GANs for data modalities where a canonical trained embedding, like Inception, is not available.
In the future, we plan to extend this analysis to investigate the effects of dataset size and batch size on GANs. While we have focused here on Wasserstein GANs, we hope to extend our analysis to other types of GAN.
- Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §2.
- Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 224–232. Cited by: §1, §2.1, §2.
- Do gans actually learn the distribution? an empirical study. arXiv preprint arXiv:1706.08224. Cited by: §1.
- Approximability of discriminators implies diversity in gans. arXiv preprint arXiv:1806.10586. Cited by: §1.
- A note on the inception score. arXiv preprint arXiv:1801.01973. Cited by: §1.
- Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §1.
- NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160. Cited by: §1.
- Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777. Cited by: §2.
- Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §3.
- Finding mixed nash equilibria of generative adversarial networks. arXiv preprint arXiv:1811.02002. Cited by: §1.
- Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §1.
- A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948. Cited by: §1.
- The gan landscape: losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720. Cited by: §1.
- Are gans created equal? a large-scale study. In Advances in neural information processing systems, pp. 700–709. Cited by: §1.
- The numerics of gans. In Advances in Neural Information Processing Systems, pp. 1825–1835. Cited by: §1.
- Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §2, §3, §3, §3.
- Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems, pp. 5585–5595. Cited by: §1.
- Non-convex min-max optimization: provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060. Cited by: §1.
- Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §1, §3.
- A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844. Cited by: §1.