Generative Adversarial Networks (GANs) are a class of unsupervised generative models (Goodfellow et al., 2014). GANs involve training a generator and discriminator
network in an adversarial game, such that the generator learns to produce samples from a desired data distribution. Training GANs is challenging because one searches for a Nash equilibrium of a non-convex game in a high-dimensional parameter space. GANs are typically trained with alternating stochastic gradient descent, but this procedure is unstable and lacks guarantees (Salimans et al., 2016). Training may exhibit instability, divergence, cyclic behaviour, or mode collapse (Mescheder et al., 2018). As a result, many works propose techniques to stabilize GAN training (Mao et al., 2016; Gulrajani et al., 2017; Miyato et al., 2018; Chen et al., 2018; Radford et al., 2016; Zhang et al., 2018; Karras et al., 2017).
Recently, several papers frame GAN training as a continual, or online, learning problem (Thanh-Tung et al., 2018; Anonymous, 2019; Seff et al., 2017; Shrivastava et al., 2017; Kim et al., 2018; Grnarova et al., 2017). GANs may be viewed as online learning because each network, the generator and the discriminator, must learn in a non-stationary environment. The discriminator, for example, is a binary classifier for which one class (the fake samples) is non-stationary. With this view, Thanh-Tung et al. (2018) study catastrophic forgetting in the discriminator and mode collapse, relating these to training instability. Anonymous (2019) counter discriminator forgetting by directly leveraging techniques from continual learning, namely Elastic Weight Consolidation (Kirkpatrick et al., 2017) and Intelligent Synapses (Zenke et al., 2017). Other works address forgetting in GANs by retraining on old data or by adding an explicit memory (Shrivastava et al., 2017; Kim et al., 2018). Finally, Seff et al. (2017) extend continual learning for GANs by also training on non-stationary true-data distributions.
We follow this line of work, but propose an alternative approach. Instead of adding explicit replay or memorization strategies, we apply representation learning: we train the discriminator to learn stable representations that are useful for identifying real images. In particular, we propose to add self-supervision to the discriminator via a rotation-based loss (Gidaris et al., 2018).
Conditional GANs, which use side information such as class labels, are state-of-the-art in generating high-fidelity complex images (Mirza and Osindero, 2014; Odena et al., 2017; Miyato and Koyama, 2018). A major contributor to their success is the augmentation of the discriminator with supervised information, which encourages it to learn stable and diverse representations. However, labels are not always available, and unconditional GANs perform much worse. Our self-supervised GAN requires no additional information, and closes the image-generation performance gap between conditional and unconditional GANs. As intended, we observe that the self-supervised GAN learns better discriminator representations than other GANs, as measured by image classification. We hope that this work encourages further investigation into representation and continual learning to improve GANs.
2 The Self-supervised GAN
We first discuss discriminator forgetting and motivate representation learning for GANs. We then present our solution using rotation based self-supervision.
2.1 Discriminator Forgetting
The original value function for GAN training is:

V(G, D) = \mathbb{E}_{x \sim P_{\mathrm{data}}}[\log P_D(S = 1 \mid x)] + \mathbb{E}_{x \sim P_G}[\log P_D(S = 0 \mid x)],  (1)

where P_{\mathrm{data}} is the true data distribution, P_G is the data distribution induced by feeding noise z, drawn from a simple distribution P(z), through the generator G, and P_D(S \mid x) is the discriminator's Bernoulli distribution over the source S (real or fake) of x.
The generator maximizes Equation 1, while the discriminator minimizes it. Training is typically performed via alternating stochastic gradient descent. The discriminator therefore classifies between two data distributions, P_{\mathrm{data}} and P_G^{(t)}, where t indexes training iterations. The latter is non-stationary because the generator's parameters are updated over time. This is an online learning problem, for which explicit temporal dependencies have been proposed to improve training (Anonymous, 2019; Shrivastava et al., 2017; Grnarova et al., 2017; Salimans et al., 2016).
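Viewed as code, the alternating update scheme described above looks roughly as follows. This is a framework-free sketch; `gen_step`, `disc_step`, and `disc_iters` are illustrative names, not from any particular implementation.

```python
# Schematic alternating stochastic gradient descent for a GAN: the
# discriminator is updated against the *current* generator distribution
# P_G^(t), which shifts every time the generator is updated -- this is
# what makes the discriminator's classification task non-stationary.
def alternating_sgd(gen_step, disc_step, num_iters, disc_iters=1):
    for t in range(num_iters):
        for _ in range(disc_iters):
            disc_step(t)  # update D on real data vs. fakes from P_G^(t)
        gen_step(t)       # update G; P_G^(t) -> P_G^(t+1), shifting D's task

# Toy usage: count how often each player is updated.
counts = {"g": 0, "d": 0}
alternating_sgd(lambda t: counts.__setitem__("g", counts["g"] + 1),
                lambda t: counts.__setitem__("d", counts["d"] + 1),
                num_iters=100, disc_iters=2)
```

With two discriminator steps per generator step, 100 iterations yield 100 generator and 200 discriminator updates, matching the multi-step schedules considered later in the appendix.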
In online learning, neural networks tend to forget previous tasks (Kirkpatrick et al., 2017; McCloskey and Cohen, 1989; French, 1999). We demonstrate forgetting in a toy setting, Figure 1(a). Here, a classifier performs a sequence of 1-vs.-all classification tasks, one per class of cifar10. The classifier is trained for a fixed number of iterations on each task before switching to the next. Figure 1(a) shows substantial forgetting, despite the tasks being similar: the classifier returns almost to chance accuracy each time the task switches, and accuracy even drops substantially when the cycle returns to the original task. This demonstrates that the model does not retain generalizable representations during online learning.
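The sequential 1-vs.-all setup can be sketched as follows; the data here is a random stand-in rather than cifar10, and all names are illustrative.

```python
import numpy as np

# Toy non-stationary labeling: the inputs stay fixed, but the binary target
# changes at every task switch (task t = "class t vs. the rest"), so a
# classifier that only tracks the current task can forget the previous ones.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 32))     # stand-in for image features
classes = rng.integers(0, 10, size=1000)   # stand-in class ids (10 classes)

def task_labels(task_id):
    return (classes == task_id).astype(int)  # 1-vs.-all relabeling

schedule = []
for task_id in range(10):                  # one task per class, in sequence
    y = task_labels(task_id)
    schedule.append(y.mean())              # fraction of positives per task
    # ... train the binary classifier on (features, y) for a fixed budget ...
```

Because each example is positive for exactly one task, the positive fractions across the ten tasks sum to one, and each individual task is heavily imbalanced, mirroring the discriminator's real-vs.-fake setting.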
Consideration of convergence provides further indication that discriminators may forget. Goodfellow et al. (2014) show that the optimal discriminator estimates the likelihood ratio between the generated and real data distributions, D^*(x) = P_{\mathrm{data}}(x) / (P_{\mathrm{data}}(x) + P_G(x)). Therefore, given a perfect generator, where P_G = P_{\mathrm{data}}, the optimal discriminator simply outputs 1/2, and has no requirement to retain any meaningful representations.
Discriminator forgetting may cause training difficulties because (1) the discriminator does not learn meaningful representations to guide the generator, and (2) the generator can revert to generating old images to fool it (Thanh-Tung et al., 2018). This is particularly problematic if the data is diverse and complex and the generator exhibits mode collapse. Therefore, we add self-supervision to encourage the discriminator to retain useful representations.
2.2 The Self-Supervised GAN
Self-supervised learning is a family of methods for building representations from unsupervised data. Self-supervision works by creating artificial supervised tasks from unsupervised data, training on these tasks, and then extracting representations from the resulting networks (Dosovitskiy et al., 2014; Doersch et al., 2015; Zhang et al., 2016). Here, we apply the successful image-rotation self-supervision method (Gidaris et al., 2018), in which the network predicts the angle of rotation of an image. In Figure 1(b) we motivate this loss using our toy problem. When we add the self-supervised loss, the network learns features that transfer across tasks; performance continually improves, and does not drop to chance level when the distribution shifts.
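The rotation pretext task can be constructed with a few lines of array code. This sketch only builds the (rotated image, rotation label) pairs; the function name and shapes are illustrative.

```python
import numpy as np

# Build the four-way rotation classification task of Gidaris et al. (2018):
# every image is copied at 0/90/180/270 degrees and labeled by its rotation.
def make_rotation_batch(images):
    """images: (N, H, W, C) -> rotated copies (4N, H, W, C), labels (4N,)."""
    rotated = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    labels = [np.full(len(images), k) for k in range(4)]  # label k = k*90 deg
    return np.concatenate(rotated), np.concatenate(labels)

imgs = np.arange(8 * 32 * 32 * 3, dtype=np.float32).reshape(8, 32, 32, 3)
x_rot, y_rot = make_rotation_batch(imgs)
```

A classifier head trained on these pairs must recognize object structure (e.g. which way is "up") to predict the rotation, which is the source of the useful representations.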
For the self-supervised GAN, the specific losses we use for the generator and discriminator are:

L_G = -V(G, D) - \alpha \, \mathbb{E}_{x \sim P_G} \mathbb{E}_{r \sim \mathcal{R}}[\log Q_D(R = r \mid x^r)],  (2)

L_D = V(G, D) - \beta \, \mathbb{E}_{x \sim P_{\mathrm{data}}} \mathbb{E}_{r \sim \mathcal{R}}[\log Q_D(R = r \mid x^r)],  (3)

where V(G, D) is the original GAN loss in Equation 1, r is a rotation selected from a set \mathcal{R} of possible rotations, and \alpha and \beta weight the rotation terms. We use \mathcal{R} = \{0°, 90°, 180°, 270°\}, as in Gidaris et al. (2018). Q_D(R \mid x^r) is the discriminator's distribution over rotations, and x^r is the image x transformed by rotation r. Architecturally, we use a single discriminator network with two heads to compute P_D and Q_D. See supplementary Figure 4 for an overview.
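A minimal numpy sketch of how the adversarial and rotation losses combine, assuming self-supervision weights `alpha` and `beta` and toy stand-ins for the outputs of the discriminator's two heads (all names here are illustrative, not from any codebase):

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def rotation_nll(rot_logits, rot_labels):
    # -E[log Q_D(R = r | x^r)]: cross-entropy of the rotation head.
    return -log_softmax(rot_logits)[np.arange(len(rot_labels)), rot_labels].mean()

def gan_value(d_real_prob, d_fake_prob, eps=1e-8):
    # V(G, D) = E[log P_D(S=1 | x_real)] + E[log P_D(S=0 | x_fake)]
    return np.log(d_real_prob + eps).mean() + np.log(1 - d_fake_prob + eps).mean()

alpha, beta = 0.2, 1.0  # illustrative self-supervision weights
rng = np.random.default_rng(0)
d_real, d_fake = rng.uniform(0.6, 0.9, 64), rng.uniform(0.1, 0.4, 64)
rot_logits, rot_labels = rng.normal(size=(64, 4)), rng.integers(0, 4, 64)

V = gan_value(d_real, d_fake)
# Generator's rotation term uses rotated fake images; the discriminator's
# uses rotated real images -- here one stand-in batch serves both roles.
loss_g = -V + alpha * rotation_nll(rot_logits, rot_labels)
loss_d = V + beta * rotation_nll(rot_logits, rot_labels)
```

Note the sign pattern: adding the (negated) rotation log-likelihood to both losses means both players benefit from rotations being detectable, while the adversarial term keeps its minimax structure.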
A note on convergence
With a nonzero discriminator self-supervision weight, convergence to the true data distribution is not guaranteed, even under optimal conditions. This may not be a concern in practice, because current GANs are far from attaining the optimal solution; if it is, one could anneal the weight to zero during training. Our intuition is that the proposed loss encourages the discriminator to learn and retain meaningful representations that allow it to distinguish rotations as well as true/fake images. The generator is then trained to match distributions in this feature space, which encourages the generation of realistic objects.
We show that self-supervision improves GAN training, even matching the performance of conditional GANs. We also demonstrate, by way of representation evaluation, that the features learned by the self-supervised GAN are more meaningful than those of a vanilla GAN.
Datasets and Settings
Our main result uses the imagenet dataset. For the self-supervised GAN (SsGAN), we adopt the architectures of, and compare to, the unconditional GAN of Miyato et al. (2018) (SN-GAN) and the conditional GAN of Miyato and Koyama (2018) (P-cGAN). Both the generator and discriminator are ResNet architectures. The conditional generator in P-cGAN applies label-conditional batch norm. SsGAN does not use labels, so by default its generator does not use conditional batch norm, as in SN-GAN. We therefore also consider a variant of the SsGAN with self-modulated batch norm (sBN) (Chen et al., 2018) in the generator to replace label-conditional batch norm. We compare image generation quality using the Fréchet Inception Distance (FID) (Heusel et al., 2017). We train the model for 1M steps on a single P100 GPU. Further details and results can be found in the supplementary material.
We also test three smaller datasets: cifar10, lsun-bedroom, and celeba-hq. Labels for P-cGAN are only available for cifar10. For all datasets we set the generator and discriminator self-supervision weights using a small hyperparameter sweep. For all other hyperparameters, we use the values in Miyato et al. (2018) and Miyato and Koyama (2018). We train on these datasets for a smaller number of steps, on a single P100 GPU.
Figure 2 shows FID training curves on imagenet. The unconditional GAN is unstable, and sometimes diverges. The conditional P-cGAN substantially outperforms the unconditional SN-GAN. The SsGAN is stable, and even performs as well as the P-cGAN.
Table 1 shows the FID of the best run across three random seeds for each dataset/model combination. Again, for cifar10 the SsGAN (with sBN) improves unconditional performance, and even equals P-cGAN. For lsun-bedroom, we also see some improvement. Self-supervision appears not to help on celeba-hq; we believe that this dataset is too simple and that face orientations do not provide a useful signal.
We compare each model's discriminator representations. We also ablate the GAN loss from SsGAN to assess the representation learned from self-supervision alone. We follow the evaluation protocol of Zhang et al. (2016): we train a logistic regression classifier to predict class labels from the feature maps of the discriminator and report top-1 classification accuracy.
Figure 3 shows the classification performance using the final ResNet block as a function of GAN training iterations. SsGAN produces better representations than SN-GAN. The same is true for all other blocks (see supplementary Figures 8 and 9). Interestingly, during training the SsGAN overtakes P-cGAN, which sees the class labels; this indicates that P-cGAN is overfitting the training data. Using the rotation loss alone substantially decreases the representation quality of the SsGAN. Note that our model uses hyperparameters tuned for image generation; therefore, we do not seek state-of-the-art representations. Instead, we use representation evaluation to understand and motivate our strategy. Indeed, it seems that self-supervision improves both discriminator representations and image generation.
We would like to thank Sylvain Gelly, Ilya Tolstikhin, Mario Lucic, Alexander Kolesnikov and Lucas Beyer for help with and discussions on this project.
- Thanh-Tung et al.  Hoang Thanh-Tung, Truyen Tran, and Svetha Venkatesh. On catastrophic forgetting and mode collapse in generative adversarial networks. ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
- Anonymous  Anonymous. Generative adversarial network training is a continual learning problem. In Submitted to International Conference on Learning Representations (ICLR), 2019. URL https://openreview.net/forum?id=SJzuHiA9tQ. under review.
- Seff et al.  Ari Seff, Alex Beatson, Daniel Suo, and Han Liu. Continual learning in generative adversarial nets. arXiv preprint arXiv:1705.08395, 2017.
- Shrivastava et al.  Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In Computer Vision and Pattern Recognition (CVPR), 2017.
- Kim et al.  Youngjin Kim, Minjung Kim, and Gunhee Kim. Memorization precedes generation: Learning unsupervised gans with memory networks. International Conference on Learning Representations (ICLR), 2018.
- Grnarova et al.  Paulina Grnarova, Kfir Y Levy, Aurelien Lucchi, Thomas Hofmann, and Andreas Krause. An online learning approach to generative adversarial networks. arXiv preprint arXiv:1706.03269, 2017.
- Goodfellow et al.  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
- Salimans et al.  Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems (NIPS), 2016.
- Mescheder et al.  Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? International Conference on Machine Learning (ICML), pages 3478–3487, 2018.
- Mao et al.  Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. International Conference on Computer Vision (ICCV), 2016.
- Gulrajani et al.  Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. Advances in Neural Information Processing Systems (NIPS), 2017.
- Miyato et al.  Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. International Conference on Learning Representations (ICLR), 2018.
- Chen et al.  Ting Chen, Mario Lucic, Neil Houlsby, and Sylvain Gelly. On self modulation for generative adversarial networks. arXiv preprint arXiv:1810.01365, 2018.
- Radford et al.  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. International Conference on Learning Representations (ICLR), 2016.
- Zhang et al.  Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
- Karras et al.  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. Advances in Neural Information Processing Systems (NIPS), 2017.
- Kirkpatrick et al.  James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 2017.
- Zenke et al.  Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. International Conference on Machine Learning (ICML), 2017.
- Gidaris et al.  Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), 2018.
- Mirza and Osindero  Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- Odena et al.  Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning (ICML), 2017.
- Miyato and Koyama  Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. International Conference on Learning Representations (ICLR), 2018.
- McCloskey and Cohen  Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation. 1989.
- French  Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 1999.
- Dosovitskiy et al.  Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. Advances in Neural Information Processing Systems (NIPS), 2014.
- Doersch et al.  Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In International Conference on Computer Vision (ICCV), 2015.
- Zhang et al.  Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision (ECCV), 2016.
- Heusel et al.  Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In Advances in Neural Information Processing Systems (NIPS), 2017.
- Yu et al.  Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
- Sajjadi et al.  Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In To appear in Advances in Neural Information Processing Systems (NIPS), 2018.
- Lucic et al.  Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs Created Equal? A Large-scale Study. In To appear in Advances in Neural Information Processing Systems (NIPS), 2018.
- Zhou et al.  Zhiming Zhou, Yuxuan Song, Lantao Yu, and Yong Yu. Understanding the effectiveness of lipschitz constraint in training of gans via gradient analysis. arXiv preprint arXiv:1807.00751, 2018.
- Donahue et al.  Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. International Conference on Learning Representations (ICLR), 2017.
Appendix A Additional Details of The Proposed Method
Figure 4 depicts the proposed method. Real and fake (generated) images are rotated in all four directions. The shared discriminator D aims to predict the true/fake labels of the non-rotated images, as well as to detect the rotation of the real images. The generator aims to generate images that fool the discriminator's true/fake prediction and, at the same time, are images whose rotations the discriminator can detect (a trade-off adjusted by the self-supervision weight).
Appendix B Further Experiments and Details
We consider four datasets: cifar10, celeba-hq, lsun-bedroom, and imagenet. The lsun-bedroom dataset (Yu et al., 2015) contains around 3M images; we partition the images randomly into a test set containing 30588 images and a train set containing the rest. celeba-hq contains 30k images (Karras et al., 2017); we use the version obtained by running the code provided by the authors (available at https://github.com/tkarras/progressive_growing_of_gans), with 3000 examples as the test set and the remaining examples as the training set. cifar10 contains 60k images (32×32), partitioned into 50000 training instances and 10000 test instances. Finally, we evaluate our method on imagenet, which contains 1.3M training images and 50k test images. We resize the imagenet images to 128×128, as done in Miyato and Koyama (2018) and Zhang et al. (2018).
To quantitatively evaluate generated samples from different methods, we mainly use the Fréchet Inception Distance (FID) (Heusel et al., 2017). In FID, the true data and generated samples are first embedded in a feature space (a specific layer of Inception). Then, a multivariate Gaussian is fit to each set of embeddings, and the distance is computed as

FID = \|\mu_x - \mu_g\|_2^2 + \mathrm{Tr}(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}),

where \mu and \Sigma denote the empirical mean and covariance, and subscripts x and g denote the true and generated data, respectively. As shown in Sajjadi et al. (2018) and Lucic et al. (2018), FID is sensitive both to the addition of spurious modes and to mode dropping.
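Given the fitted Gaussian statistics, the FID computation can be sketched as below. The matrix square root uses an eigendecomposition (valid for symmetric PSD covariances), and the trace identity Tr((Σx Σg)^{1/2}) = Tr((Σx^{1/2} Σg Σx^{1/2})^{1/2}) avoids taking the square root of a non-symmetric product. Names are illustrative; in practice the statistics come from Inception embeddings.

```python
import numpy as np

def sqrtm_psd(a):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid_from_stats(mu_x, sig_x, mu_g, sig_g):
    # FID = ||mu_x - mu_g||^2 + Tr(Sig_x + Sig_g - 2 (Sig_x Sig_g)^{1/2})
    root = sqrtm_psd(sig_x)
    tr_covmean = np.trace(sqrtm_psd(root @ sig_g @ root))
    return float(np.sum((mu_x - mu_g) ** 2)
                 + np.trace(sig_x) + np.trace(sig_g) - 2.0 * tr_covmean)

# Identical Gaussians give (near) zero; a pure mean shift adds ||shift||^2.
mu, sig = np.zeros(4), np.eye(4)
d0 = fid_from_stats(mu, sig, mu, sig)
d1 = fid_from_stats(mu, sig, mu + np.array([3.0, 0, 0, 0]), sig)
```

The two toy evaluations illustrate the metric's behaviour: matching distributions score zero, and a mean shift of length 3 with identical unit covariances scores exactly 9.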
Additional experimental details
When the rotation loss is added, we take one quarter of the images in each batch and rotate them in all four directions, yielding synthetic images with known rotation labels. This subset-rotation trick keeps the extra computation small. It is worth noting that, other than the extra loss term, everything else (architectures, hyper-parameters) is kept the same for our method and the baselines (both unconditional and conditional GANs). For simplicity, we fix the generator's self-supervision weight and compare a small set of values for the discriminator's weight, with a single default value used elsewhere.
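The subset-rotation trick can be sketched as follows (illustrative names): rotating one quarter of the batch in four directions keeps the number of rotation examples equal to the original batch size, instead of quadrupling it.

```python
import numpy as np

def quarter_batch_rotations(batch):
    # Rotate only the first quarter of the batch in all four directions,
    # so the rotation head sees batch_size examples, not 4 * batch_size.
    n = len(batch) // 4
    subset = batch[:n]
    imgs = np.concatenate([np.rot90(subset, k, axes=(1, 2)) for k in range(4)])
    labels = np.repeat(np.arange(4), n)  # rotation class of each copy
    return imgs, labels

batch = np.zeros((64, 32, 32, 3))
x_rot, y_rot = quarter_batch_rotations(batch)  # 64 rotated images from 16
```

With a batch size of 64, the rotation head therefore also processes 64 images (16 originals × 4 rotations), so the forward-pass cost roughly doubles rather than growing fivefold.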
We use the Adam optimizer with a fixed learning rate for all datasets. All models are trained with a batch size of 64 on a single Nvidia P100 GPU, for the same fixed number of steps (results in the main paper for imagenet are obtained by training for 1M steps). Reported results are, by default, the mean over 3 independent runs with different random seeds.
We test the training stabilization brought by the SsGAN under two types of hyper-parameter settings. First, we control the Lipschitz constant of the discriminator, a central quantity analyzed in the GAN literature (Miyato et al., 2018; Zhou et al., 2018), using two state-of-the-art techniques: Gradient Penalty (Gulrajani et al., 2017) and Spectral Normalization (Miyato et al., 2018); for the Gradient Penalty regularizer we sweep the regularization strength. Second, we consider different Adam optimizer hyper-parameters, testing two popular settings of (β₁, β₂). Previous studies find that multiple discriminator steps per generator step help training (Goodfellow et al., 2014; Salimans et al., 2016), so we also consider both 1 and 2 discriminator steps per generator step (we also experimented with 5 steps, which did not outperform the 2-step setting). In total, this amounts to three different sets of hyper-parameters for (β₁, β₂, and the number of discriminator steps).
b.1 Quantitative Comparisons of the Generative Samples
Figure 5 shows the convergence curves of the different methods. The proposed SsGAN clearly and significantly improves over the other unconditional GAN, SN-GAN, and even matches the conditional P-cGAN.
The self-supervised loss improves performance not only under our default hyper-parameters, but across multiple hyper-parameter types. Table 2 shows the comparison with Gradient Penalty under various hyper-parameters, and Table 3 the comparison with Spectral Normalization under various hyper-parameters. Both tables demonstrate that SsGAN reduces the sensitivity of GAN training to hyper-parameters, making training more robust.
Selection of the hyper-parameters
We further compare different choices of the discriminator self-supervision weight (while fixing the generator's weight). As discussed, an inappropriate value could bias the generator, while a reasonable one helps. As shown in Table 6, a well-chosen weight is indeed effective. Among the values compared, the optimal weight is 1 for cifar10 and 0.2 for imagenet; by default we use 0.2 for all datasets for simplicity.
b.2 Qualitative Comparisons of the Generative Samples
Figure 7 shows generated examples from the different GANs on imagenet. Although the baseline unconditional GAN generates sharp images, the majority of the generated objects, e.g. dogs, are distorted. With supervised labels, P-cGAN's samples improve, e.g. we see dog faces and diverse objects. SsGAN generates images of similar quality to the conditional P-cGAN.
b.3 Comparison of Representation Quality
We compare the representation quality of the discriminator's intermediate layers. The comparison is performed by training logistic regression classifiers on top of the feature maps from each ResNet block to perform the 1000-way imagenet classification task, as proposed in Zhang et al. (2016). The representation features are spatially resized to around 9000 dimensions with adaptive max pooling, and we report top-1 classification accuracy.
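The pooling step can be sketched as follows; the choice of 6×6 spatial bins is a hypothetical one that lands near the ~9000-dimension budget for a 256-channel feature map (6·6·256 = 9216), and all names are illustrative.

```python
import numpy as np

def adaptive_max_pool(fmap, out_hw):
    """fmap: (H, W, C) -> (out_hw, out_hw, C), max over near-equal regions."""
    h, w, c = fmap.shape
    hs = np.linspace(0, h, out_hw + 1, dtype=int)  # row bin boundaries
    ws = np.linspace(0, w, out_hw + 1, dtype=int)  # column bin boundaries
    out = np.empty((out_hw, out_hw, c))
    for i in range(out_hw):
        for j in range(out_hw):
            out[i, j] = fmap[hs[i]:hs[i + 1], ws[j]:ws[j + 1]].max(axis=(0, 1))
    return out

# A 16x16x256 feature map (stand-in for a ResNet block output) pooled to
# a fixed-size vector for the logistic regression classifier.
fmap = np.random.default_rng(0).normal(size=(16, 16, 256))
pooled = adaptive_max_pool(fmap, 6)
features = pooled.reshape(-1)  # 6 * 6 * 256 = 9216 dimensions
```

Pooling to a fixed output size makes the feature dimension independent of the block's spatial resolution, so the same linear-classifier protocol applies to every block.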
We compare the SN-GAN, P-cGAN, and SsGAN (sBN) models with the best FID score from Section B.1. We also consider ablating the GAN loss from SsGAN, yielding a rotation-only model (Rotation) that uses the same architecture and hyperparameters as the SsGAN discriminator. Evaluation is performed on three independent models with different random seeds, and we report the mean accuracy and standard deviation. We tune the learning rate for the logistic regression on a validation set created by holding out part of the training set; the results are insensitive to these hyperparameters. On cifar10, we use SGD with batch size 256, momentum 0.9, and learning rate 0.1, training for 100 epochs in total and dropping the learning rate by a factor of 10 after every 30 epochs. On imagenet, we use SGD with batch size 64, momentum 0.9, and learning rate 0.01, training for 30 epochs in total and dropping the learning rate by a factor of 10 after every 10 epochs.
Table 4 shows the top-1 classification accuracy on cifar10, and Figure 8 shows the results as a function of the training steps of the original model. We plot the quality of representations extracted from each of the 4 blocks of the cifar10 ResNet architecture. The conditional model, P-cGAN, produces results similar to the other models, SN-GAN and Rotation. The SsGAN outperforms the other models under most conditions.
Table 5 and Figure 9 show results on imagenet. There are six blocks in the imagenet ResNet architecture. For SN-GAN and Rotation, block 3 performs best; for P-cGAN and SsGAN, block 5 performs best. It is expected that P-cGAN benefits from representations closer to the classification layer, because it sees the actual labels used in the downstream representation evaluation. In block 5, we observe that SN-GAN and P-cGAN representation quality stops improving after 200k steps, while SsGAN continues to improve over all 1M training steps. Overall, representations are improved by the self-supervised and conditional GANs, which correlates with their improved samples.
Table 6 shows the comparison of our method to other self-supervised learning methods; we select the best run according to the validation results. Overall, the SsGAN model achieves comparable results on imagenet. Note that our method is tuned for image generation, not for representation quality. Among these methods, BiGAN (Donahue et al., 2017) also learns a representation with GANs; BiGAN learns the representation with an additional encoder network, while our method is arguably simpler because it extracts the representation directly from the discriminator network.