1 Introduction
Deep generative models have received a great deal of attention due to their power to learn complex high-dimensional distributions, such as distributions over natural images (Zhang et al., 2018; Brock et al., 2019), videos (Kalchbrenner et al., 2017), and audio (Van Den Oord et al., 2016). Recent progress was driven by scalable training of large-scale models (Brock et al., 2019; Menick & Kalchbrenner, 2019), architectural modifications (Zhang et al., 2018; Chen et al., 2019a; Karras et al., 2018), and normalization techniques (Miyato et al., 2018).

High-fidelity natural image generation (typically trained on ImageNet) hinges upon having access to vast quantities of labeled data. This is unsurprising, as labels inject rich side information into the training process, effectively dividing the extremely challenging image generation task into semantically meaningful sub-tasks.
However, this dependence on vast quantities of labeled data is at odds with the fact that most data is unlabeled, and labeling itself is often costly and error-prone. Despite the recent progress on unsupervised image generation, the gap between conditional and unsupervised models in terms of sample quality is significant.
In this work, we take a significant step towards closing the gap between conditional and unsupervised generation of high-fidelity images using generative adversarial networks (GANs). We leverage two simple yet powerful concepts:
- Self-supervised learning: A semantic feature extractor for the training data can be learned via self-supervision, and the resulting feature representation can then be employed to guide the GAN training process.
- Semi-supervised learning: Labels for the entire training set can be inferred from a small subset of labeled training images, and the inferred labels can be used as conditional information for GAN training.
Our contributions In this work, we
- propose and study various approaches to reduce or fully omit ground-truth label information for natural image generation tasks,
- achieve a new state of the art (SOTA) in unsupervised generation on ImageNet, match the SOTA on ImageNet using only 10% of the labels, and set a new SOTA using only 20% of the labels (measured by FID), and
- open-source all the code used for the experiments at github.com/google/compare_gan.
2 Background and related work
High-fidelity GANs on ImageNet
Besides BigGAN (Brock et al., 2019), only a few prior methods have managed to scale GANs to ImageNet, most of them relying on class-conditional generation using labels. One of the earliest attempts are GANs with auxiliary classifier (AC-GANs) (Odena et al., 2017), which feed one-hot encoded label information together with the latent code to the generator and equip the discriminator with an auxiliary head predicting the image class in addition to whether the input is real or fake. More recent approaches rely on a label projection layer in the discriminator, essentially resulting in per-class real/fake classification (Miyato & Koyama, 2018), and self-attention in the generator (Zhang et al., 2018). Both methods use modulated batch normalization (De Vries et al., 2017) to provide label information to the generator. On the unsupervised side, Chen et al. (2019b) showed that an auxiliary rotation loss added to the discriminator has a stabilizing effect on training. Finally, appropriate gradient regularization enables scaling MMD-GANs to ImageNet without using labels (Arbel et al., 2018).
Semi-supervised GANs
Several recent works leveraged GANs for semi-supervised learning of classifiers. Both Salimans et al. (2016) and Odena (2016) train a discriminator that classifies its input into K+1 classes: K image classes for real images, and one class for generated images. Similarly, Springenberg (2016) extends the standard GAN objective to K classes. This approach was also considered by Li et al. (2017), where separate discriminator and classifier models are applied. Other approaches incorporate inference models to predict missing labels (Deng et al., 2017) or harness joint distribution (of labels and data) matching for semi-supervised learning (Gan et al., 2017). We emphasize that this line of work focuses on training a classifier from a few labels, rather than on using few labels to improve the quality of the generative model. To the best of our knowledge, improvements in sample quality through partial label information are reported in Li et al. (2017), Deng et al. (2017), and Sricharan et al. (2017), all of which consider only low-resolution data sets from a restricted domain.
Self-supervised learning
Self-supervised learning methods employ a label-free auxiliary task to learn a semantic feature representation of the data. This approach has been successfully applied to different data modalities, such as images (Doersch et al., 2015; Caron et al., 2018), video (Agrawal et al., 2015; Lee et al., 2017), and robotics (Jang et al., 2018; Pinto & Gupta, 2016). The current state-of-the-art method on ImageNet is due to Gidaris et al. (2018), who propose predicting the rotation angle of rotated training images as an auxiliary task. This simple self-supervision approach yields representations that are useful for downstream image classification tasks. Other forms of self-supervision include predicting the relative location of disjoint image patches of a given image (Doersch et al., 2015; Mundhenk et al., 2018) or estimating the permutation of randomly swapped image patches on a regular grid (Noroozi & Favaro, 2016). A study on self-supervised learning with modern neural architectures is provided in Kolesnikov et al. (2019).
3 Reducing the appetite for labeled data
In a nutshell, instead of providing hand-annotated ground truth labels for real images to the discriminator, we will provide inferred ones. To obtain these labels we will make use of recent advancements in self- and semi-supervised learning. Before introducing these methods in detail, we first discuss how label information is used in state-of-the-art GANs. The following exposition assumes familiarity with the basics of the GAN framework (Goodfellow et al., 2014).
Incorporating the labels To provide the label information to the discriminator we employ a linear projection layer, as proposed by Miyato & Koyama (2018). To make the exposition self-contained, we briefly recall the main ideas. In a "vanilla" (unconditional) GAN, the discriminator learns to predict whether the image $x$ at its input is real or generated by the generator $G$. We decompose the discriminator into a learned discriminator representation $\tilde{D}$, which is fed into a linear classifier $c_{r/f}$, i.e., the discriminator is given by $c_{r/f}(\tilde{D}(x))$. In the projection discriminator, one additionally learns an embedding for each class of the same dimension as the representation $\tilde{D}(x)$. Then, for a given image-label input $(x, y)$, the decision on whether the sample is real or generated is based on two components: (a) whether the representation $\tilde{D}(x)$ itself is consistent with the real data, and (b) whether the representation $\tilde{D}(x)$ is consistent with the real data from class $y$. More formally, the discriminator takes the form
$$D(x, y) = c_{r/f}(\tilde{D}(x)) + P(\tilde{D}(x), y),$$
where $P$ is a linear projection layer applied to the feature vector $\tilde{D}(x)$ and the one-hot encoded label $y$. As for the generator, the label information is incorporated through class-conditional BatchNorm (Dumoulin et al., 2017; De Vries et al., 2017). The conditional GAN with projection discriminator is illustrated in Figure 3. We proceed with describing the pre-trained and co-training approaches to infer labels for GAN training in Sections 3.1 and 3.2, respectively.
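To make the projection layer concrete, here is a minimal NumPy sketch of the discriminator logit $c_{r/f}(\tilde{D}(x)) + y^\top W \tilde{D}(x)$. The variable names (`features`, `w_rf`, `class_embeddings`) are our own and do not correspond to the authors' TensorFlow implementation in compare_gan.

```python
import numpy as np

def projection_discriminator_logit(features, label, w_rf, class_embeddings):
    """Projection discriminator of Miyato & Koyama (2018), illustrative only.

    features:         (d,) representation D~(x) produced by the discriminator body.
    label:            integer class id y.
    w_rf:             (d,) weights of the linear real/fake classifier c_{r/f}.
    class_embeddings: (num_classes, d) learned per-class embedding matrix W.
    """
    unconditional = features @ w_rf                    # c_{r/f}(D~(x))
    conditional = features @ class_embeddings[label]   # y^T W D~(x)
    return unconditional + conditional

# Toy usage with random weights.
rng = np.random.default_rng(0)
d, num_classes = 1536, 1000
print(projection_discriminator_logit(rng.normal(size=d), label=3,
                                     w_rf=rng.normal(size=d),
                                     class_embeddings=rng.normal(size=(num_classes, d))))
```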
3.1 Pre-trained approaches

Unsupervised clustering-based method We first learn a representation of the real training data using a state-of-the-art self-supervised approach (Gidaris et al., 2018; Kolesnikov et al., 2019), perform clustering on this representation, and use the cluster assignments as a replacement for labels. Following Gidaris et al. (2018), we learn the feature extractor $F$ (typically a convolutional neural network) by minimizing the following self-supervision loss
$$L_R = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\,\mathbb{E}_{r \sim \mathcal{R}}\big[-\log c_R(r \mid F(x^r))\big], \qquad (1)$$
where $\mathcal{R}$ is the set of the four rotation degrees $\{0°, 90°, 180°, 270°\}$, $x^r$ is the image $x$ rotated by $r$, and $c_R$ is a linear classifier predicting the rotation degree $r$. After learning the feature extractor $F$, we apply mini-batch $k$-Means clustering (Sculley, 2010) to the representations of the training images. Finally, given the cluster assignment function $\hat{c}$, we train the GAN using the hinge loss, alternately minimizing the discriminator loss $L_D$ and generator loss $L_G$, namely
$$L_D = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\max(0, 1 - D(x, \hat{c}(x)))\big] + \mathbb{E}_{z \sim p(z),\, y \sim \hat{q}(y)}\big[\max(0, 1 + D(G(z, y), y))\big],$$
$$L_G = -\mathbb{E}_{z \sim p(z),\, y \sim \hat{q}(y)}\big[D(G(z, y), y)\big],$$
where $p(z, y) = p(z)\,\hat{q}(y)$ is the prior distribution with $p(z) = \mathcal{N}(0, I)$ and $\hat{q}(y)$ the empirical distribution of the cluster labels over the training set. We call this approach Clustering and illustrate it in Figure 4.
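A minimal sketch of the Clustering pipeline, assuming the pre-trained self-supervised feature extractor has already been applied to all training images; we use scikit-learn's MiniBatchKMeans (Sculley, 2010) and return the cluster indices, which then play the role of labels, together with their empirical distribution used as the label prior for the generator.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_pseudo_labels(features, num_clusters=50, seed=0):
    """Cluster self-supervised representations and return pseudo-labels.

    features: (n, d) array of representations F(x) for all real training images.
    Returns the cluster assignments and their empirical distribution.
    """
    kmeans = MiniBatchKMeans(n_clusters=num_clusters, random_state=seed, batch_size=1024)
    labels = kmeans.fit_predict(features)
    # Empirical distribution of cluster labels, used as the label prior for G.
    prior = np.bincount(labels, minlength=num_clusters) / len(labels)
    return labels, prior

# Toy example with random "features" standing in for F(x).
feats = np.random.default_rng(0).normal(size=(10_000, 128))
labels, prior = cluster_pseudo_labels(feats, num_clusters=50)
print(labels[:10], prior.sum())
```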
Semi-supervised method
While semi-supervised learning is an active area of research and a large variety of algorithms has been proposed, we follow Beyer et al. (2019) and simply extend the self-supervised approach described in the previous paragraph with a semi-supervised loss. This ensures that the two approaches are comparable in terms of model capacity and computational cost. Assuming we are provided with labels for a subset of the training data, we attempt to learn a good feature representation via self-supervision and simultaneously train a good linear classifier on the so-obtained representation (using the provided labels). (Note that an even simpler approach would be to first learn the representation via self-supervision and subsequently the linear classifier, but we observed that learning the representation and the classifier simultaneously leads to better results.) More formally, we minimize the loss
$$L = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\,\mathbb{E}_{r \sim \mathcal{R}}\big[-\log c_R(r \mid F(x^r))\big] + \gamma\,\mathbb{E}_{(x,y) \sim p_{\mathrm{data}}(x,y)}\big[-\log c_L(y \mid F(x))\big], \qquad (2)$$
where $c_R$ and $c_L$ are linear classifiers predicting the rotation angle $r$ and the label $y$, respectively, and $\gamma$ balances the two loss terms. The first term in (2) corresponds to the self-supervision loss from (1) and the second term to a (semi-supervised) cross-entropy loss. During training, the latter expectation is replaced by the empirical average over the subset of labeled training examples, whereas the former is set to the empirical average over the entire training set (this convention is followed throughout the paper). After we obtain $F$ and $c_L$ we proceed with GAN training, where we label the real images as $\hat{y}(x) = \arg\max_y c_L(y \mid F(x))$. In particular, we alternately minimize the same generator and discriminator losses as for Clustering, except that we use $F$ and $c_L$ obtained by minimizing (2):
$$L_D = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\max(0, 1 - D(x, \hat{y}(x)))\big] + \mathbb{E}_{z \sim p(z),\, y \sim p(y)}\big[\max(0, 1 + D(G(z, y), y))\big],$$
$$L_G = -\mathbb{E}_{z \sim p(z),\, y \sim p(y)}\big[D(G(z, y), y)\big],$$
where $p(z, y) = p(z)\,p(y)$ with $p(z) = \mathcal{N}(0, I)$ and $p(y)$ uniform categorical. We use the abbreviation S2GAN for this method.
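A NumPy sketch of one evaluation of the combined loss (2): the rotation term is computed on (rotated) images from the entire training set, the cross-entropy term only on the labeled subset. The names `rot_logits`, `label_logits`, and the weight `gamma` are illustrative placeholders; a real implementation backpropagates through the shared feature extractor $F$.

```python
import numpy as np

def softmax_xent(logits, targets):
    """Mean cross-entropy of integer targets under softmax(logits)."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def s2_pretraining_loss(rot_logits, rot_targets, label_logits, label_targets, gamma=1.0):
    """Loss (2): rotation self-supervision on all images plus supervised
    cross-entropy on the labeled subset, weighted by gamma (placeholder name)."""
    rotation_loss = softmax_xent(rot_logits, rot_targets)        # c_R on F(x^r), all images
    supervised_loss = softmax_xent(label_logits, label_targets)  # c_L on F(x), labeled subset only
    return rotation_loss + gamma * supervised_loss

# Toy example: 8 rotated unlabeled images (4 rotation classes), 2 labeled images (1000 classes).
rng = np.random.default_rng(0)
print(s2_pretraining_loss(rng.normal(size=(8, 4)), rng.integers(0, 4, size=8),
                          rng.normal(size=(2, 1000)), rng.integers(0, 1000, size=2)))
```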
3.2 Co-training approach
The main drawback of the pre-trained (transfer-based) methods is that one needs to first train a feature extractor via self-supervision and then learn an inference mechanism for the labels (a linear classifier or a clustering). In what follows we detail co-training approaches that avoid this two-step procedure and instead learn to infer label information during GAN training.
Unsupervised method We consider two approaches. In the first, we remove the labels completely by labeling all real and generated examples with the same (single) label and removing the projection layer from the discriminator, i.e., we set $D(x, y) = c_{r/f}(\tilde{D}(x))$. (Note that this is not necessarily equivalent to replacing class-conditional BatchNorm with standard, unconditional BatchNorm, as the variant of conditional BatchNorm used in this paper also takes chunks of the latent code as input, besides the label information.) We use the abbreviation Single label for this method. In the second approach, we assign random labels to the (unlabeled) real images. While random labels for the real images do not provide any useful signal to the discriminator, the sampled labels could potentially help the generator by providing additional randomness with different statistics than the latent code $z$, as well as additional trainable parameters due to the embedding matrices in class-conditional BatchNorm. Furthermore, the labels for the fake data could facilitate discrimination as they provide side information about the fake images to the discriminator. We term this method Random label.
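The two baselines differ only in how labels for the real images are produced; a minimal sketch of the label assignment (with a hypothetical `num_classes`):

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 1000  # hypothetical; any label vocabulary works here

def single_label(batch_size):
    """Single label: every real (and generated) image receives the same dummy label,
    and the projection layer is removed from the discriminator."""
    return np.zeros(batch_size, dtype=np.int64)

def random_label(batch_size):
    """Random label: unlabeled real images receive uniformly sampled labels; generated
    images keep the labels that were fed to the generator."""
    return rng.integers(0, num_classes, size=batch_size)

print(single_label(4), random_label(4))
```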
Semi-supervised method When labels are available for a subset of the real data, we train an auxiliary linear classifier $c_D$ directly on the feature representation $\tilde{D}(x)$ of the discriminator during GAN training, and use it to predict labels for the unlabeled real images. In this case the discriminator loss takes the form
$$L_D = \mathbb{E}_{(x,y) \sim p^{\mathrm{lab}}(x,y)}\big[\max(0, 1 - D(x, y))\big] + \lambda\,\mathbb{E}_{(x,y) \sim p^{\mathrm{lab}}(x,y)}\big[-\log c_D(y \mid \tilde{D}(x))\big]$$
$$\qquad + \mathbb{E}_{x \sim p^{\mathrm{unl}}(x)}\big[\max(0, 1 - D(x, \hat{y}(x)))\big] + \mathbb{E}_{z \sim p(z),\, y \sim p(y)}\big[\max(0, 1 + D(G(z, y), y))\big], \qquad (3)$$
where the first term corresponds to standard conditional training on the (k%) labeled real images, the second term is the cross-entropy loss (with weight $\lambda$) for the auxiliary classifier $c_D$ on the labeled real images, the third term is an unsupervised discriminator loss in which the labels $\hat{y}(x)$ for the unlabeled real images are predicted by $c_D$, and the last term is the standard conditional discriminator loss on the generated data. We use the abbreviation S2GAN-CO for this method. See Figure 5 for an illustration.
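A sketch of the four terms of (3) using hinge losses, assuming the individual logits and the auxiliary cross-entropy have already been computed by the model; the function and argument names are placeholders, not the authors' API.

```python
import numpy as np

def hinge_real(logits):
    """E[max(0, 1 - D(x, y))] for real images."""
    return np.maximum(0.0, 1.0 - logits).mean()

def hinge_fake(logits):
    """E[max(0, 1 + D(G(z, y), y))] for generated images."""
    return np.maximum(0.0, 1.0 + logits).mean()

def s2gan_co_d_loss(d_real_labeled, aux_xent_labeled, d_real_unlabeled, d_fake, lam=1.0):
    """Discriminator loss (3) for S2GAN-CO.

    d_real_labeled:   D(x, y) logits on labeled real images (ground-truth y).
    aux_xent_labeled: cross-entropy of the auxiliary classifier c_D on the same images.
    d_real_unlabeled: D(x, y_hat) logits on unlabeled real images, y_hat predicted by c_D.
    d_fake:           D(G(z, y), y) logits on generated images.
    """
    return (hinge_real(d_real_labeled)
            + lam * aux_xent_labeled
            + hinge_real(d_real_unlabeled)
            + hinge_fake(d_fake))

# Toy call with random logits and a dummy auxiliary cross-entropy value.
rng = np.random.default_rng(0)
print(s2gan_co_d_loss(rng.normal(size=64), 6.9, rng.normal(size=448), rng.normal(size=512)))
```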
3.3 Self-supervision during GAN training
So far we leveraged self-supervision either to craft good feature representations or to learn a semi-supervised model (cf. Section 3.1). However, given that the discriminator itself is just a classifier, one may benefit from augmenting this classifier with an auxiliary task, namely self-supervision through rotation prediction. This approach was already explored by Chen et al. (2019b), where it was observed to stabilize GAN training. Here we want to assess its impact when combined with the methods introduced in Sections 3.1 and 3.2. To this end, similarly to the training of $c_R$ in (1) and (2), we train an additional linear classifier $c_T$ on the discriminator feature representation to predict the rotation $r$ of rotated real images $x^r$ and rotated fake images $G(z, y)^r$. The corresponding loss terms added to the discriminator and generator losses are
$$L_D^{\mathrm{SS}} = -\alpha\,\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\,\mathbb{E}_{r \sim \mathcal{R}}\big[\log c_T(r \mid \tilde{D}(x^r))\big] \qquad (4)$$
and
$$L_G^{\mathrm{SS}} = -\beta\,\mathbb{E}_{z \sim p(z),\, y \sim p(y)}\,\mathbb{E}_{r \sim \mathcal{R}}\big[\log c_T(r \mid \tilde{D}(G(z, y)^r))\big], \qquad (5)$$
respectively, where $\alpha$ and $\beta$ are weights used to balance these terms against the main losses. This approach is illustrated in Figure 6.
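The added terms (4) and (5) are plain rotation cross-entropies on real and fake images, respectively. A NumPy sketch, where `rotation_logits_*` stand for $c_T$ applied to the discriminator features of the rotated images (placeholder names), and the images are assumed to be square so that all four rotations have the same shape:

```python
import numpy as np

def softmax_xent(logits, targets):
    """Mean cross-entropy of integer targets under softmax(logits)."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def rotate_batch(images):
    """Return the four rotations (0/90/180/270 degrees) of a batch of square NHWC
    images, together with the corresponding rotation targets."""
    rotations = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    targets = np.repeat(np.arange(4), len(images))
    return np.concatenate(rotations, axis=0), targets

def ss_discriminator_term(rotation_logits_real, targets_real, alpha):
    """Loss (4): rotation prediction on rotated real images, weighted by alpha."""
    return alpha * softmax_xent(rotation_logits_real, targets_real)

def ss_generator_term(rotation_logits_fake, targets_fake, beta):
    """Loss (5): rotation prediction on rotated generated images, weighted by beta."""
    return beta * softmax_xent(rotation_logits_fake, targets_fake)
```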
4 Experimental setup
Architecture and hyperparameters
GANs are notoriously unstable to train and their performance strongly depends on the capacity of the neural architecture, optimization hyperparameters, and appropriate regularization
(Lucic et al., 2018; Kurach et al., 2018). We implemented the conditional BigGAN architecture (Brock et al., 2019), which achieves state-of-the-art results on ImageNet. (We dissected the model checkpoints released by Brock et al. (2019) to obtain exact counts of trainable parameters and their dimensions, and match them down to the byte level; cf. Tables 10 and 11.) We want to emphasize that this methodology is still bleeding-edge and that successful state-of-the-art methods require careful architecture-level tuning. To foster reproducibility we meticulously detail this architecture at tensor level in Appendix B and open-source our code at https://github.com/google/compare_gan. We use exactly the same optimization hyper-parameters as Brock et al. (2019). Specifically, we employ the Adam optimizer with the generator and discriminator learning rates of Brock et al. (2019). We train for 250k generator steps with 2 discriminator iterations before each generator step. The batch size is fixed to 2048, and we use a latent code $z$ of the same dimension as in Brock et al. (2019). We employ spectral normalization in both generator and discriminator. In contrast to BigGAN, we do not apply orthogonal regularization, as this was observed to only marginally improve sample quality (cf. Table 1 in Brock et al. (2019)), and we do not use the truncation trick.
Datasets We focus primarily on ImageNet, the largest and most diverse image data set commonly used to evaluate GANs. ImageNet contains approximately 1.3M training images and 50k test images, each corresponding to one of 1k object classes. We resize the images to 128×128×3 as done in Miyato & Koyama (2018) and Zhang et al. (2018). Partially labeled data sets for the semi-supervised approaches are obtained by randomly selecting k% of the samples from each class.
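The partially labeled data sets can be constructed with a simple per-class subsampling routine. A sketch follows; the exact selection procedure and seed handling are our own assumptions beyond what is stated above.

```python
import numpy as np

def labeled_subset_indices(labels, fraction, seed=0):
    """Return indices of a class-balanced labeled subset: `fraction` of the
    samples from each class, selected uniformly at random."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_keep = max(1, int(round(fraction * len(idx))))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Toy example: keep 10% of the labels of a 20-class dataset.
toy_labels = np.random.default_rng(1).integers(0, 20, size=5000)
subset = labeled_subset_indices(toy_labels, fraction=0.10)
print(len(subset), len(toy_labels))
```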
Evaluation metrics
We use the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Inception Score (IS) (Salimans et al., 2016) to evaluate the quality of the generated samples. To compute the FID, the real data and generated samples are first embedded into a specific layer of a pre-trained Inception network. Then, a multivariate Gaussian is fit to each of the two sets of embeddings and the distance is computed as
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),$$
where $\mu$ and $\Sigma$ denote the empirical mean and covariance, and the subscripts $r$ and $g$ denote the real and generated data, respectively. FID was shown to be sensitive both to the addition of spurious modes and to mode dropping (Sajjadi et al., 2018; Lucic et al., 2018). The Inception Score posits that the conditional label distribution $p(y \mid x)$ of samples containing meaningful objects should have low entropy, while the variability of the samples should be high, leading to the formulation $\mathrm{IS} = \exp\big(\mathbb{E}_{x}\big[D_{\mathrm{KL}}(p(y \mid x)\,\|\,p(y))\big]\big)$. Although it has some flaws (Barratt & Sharma, 2018), we report it to enable comparison with existing methods. Following Brock et al. (2019), the FID is computed using the 50k ImageNet testing images and 50k randomly sampled fake images, and the IS is computed from 50k randomly sampled fake images. All metrics are computed for 5 different randomly sampled sets of fake images and we report the mean.
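For reference, a NumPy/SciPy sketch of the FID formula above, operating on pre-computed Inception activations (the Inception embedding itself is not reproduced here):

```python
import numpy as np
from scipy import linalg

def fid_from_features(real_feats, fake_feats, eps=1e-6):
    """Frechet Inception Distance between two sets of Inception activations."""
    mu_r, mu_g = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(fake_feats, rowvar=False)
    # Matrix square root of the product of covariances; add eps*I for numerical stability.
    covmean, _ = linalg.sqrtm((sigma_r + eps * np.eye(len(mu_r))) @
                              (sigma_g + eps * np.eye(len(mu_g))), disp=False)
    covmean = np.real(covmean)  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Sanity check: identical feature sets give (near) zero FID.
x = np.random.default_rng(0).normal(size=(2000, 64))
print(fid_from_features(x, x))
```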
Methods We conduct an extensive comparison of the methods detailed in Table 1, namely: the unmodified BigGAN, the unsupervised methods Single label, Random label, and Clustering, and the semi-supervised methods S2GAN and S2GAN-CO. In all S2GAN-CO experiments we use soft labels, i.e., the soft-max output of the auxiliary classifier $c_D$ instead of one-hot encoded hard estimates, as we observed in preliminary experiments that this stabilizes training. For S2GAN we use hard labels by default, but investigate the effect of soft labels in separate experiments. For all semi-supervised methods we have access to only k% of the ground-truth labels, where k ∈ {5, 10, 20}. As an additional baseline, we retain the k% labeled real images, discard all unlabeled real images, and use the remaining labeled images to train BigGAN (the resulting model is designated BigGAN-k%). Finally, we explore the effect of self-supervision during GAN training on the unsupervised and semi-supervised methods.
We train every model three times with a different random seed and report the median FID and the median IS. With the exception of Single label and BigGAN-k%, the standard deviation across the three runs is very low. We therefore defer tables with the mean FID and IS values and standard deviations to Appendix D. All models are trained on 128 cores of a Google TPU v3 Pod with BatchNorm statistics synchronized across cores.
Unsupervised approaches For Clustering we simply use the best available self-supervised rotation model from Kolesnikov et al. (2019). The number of clusters for Clustering is treated as a hyper-parameter (the best results were obtained with 50 clusters, cf. Section 5.1). The other unsupervised approaches do not have hyper-parameters.
Pre-trained and co-training approaches We employ the wide ResNet-50 v2 architecture with widening factor 16 (Zagoruyko & Komodakis, 2016) for the feature extractor in the pre-trained approaches described in Section 3.1.
Method | Description
---|---
BigGAN | Conditional (Brock et al., 2019)
Single label | Co-training: Single label
Random label | Co-training: Random labels
Clustering | Pre-trained: Clustering
BigGAN-k% | Drop all but the k% labeled data
S2GAN-CO | Co-training: Semi-supervised
S2GAN | Pre-trained: Semi-supervised
S3GAN | S2GAN with self-supervision
S3GAN-CO | S2GAN-CO with self-supervision
We optimize the loss in (2) using SGD for 65 epochs. Each batch is composed of 1536 unlabeled examples and a set of labeled examples. Following the recommendations of Goyal et al. (2017) for training with large batch sizes, we use a linear learning-rate warm-up during the initial 5 epochs. The learning rate is then decayed twice by a factor of 10, at epoch 45 and at epoch 55. The weight $\gamma$ in (2) and the other hyper-parameters are tuned on labeled examples held out from the training set. The accuracy of the so-obtained classifier on the ImageNet validation set is reported in Table 3. The weight $\lambda$ in the S2GAN-CO loss (3) is likewise treated as a hyper-parameter.
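The learning-rate schedule described above (linear warm-up over the first 5 epochs of a 65-epoch run, followed by two 10x decays at epochs 45 and 55) can be sketched as follows; the base learning rate is a placeholder, as the exact value is not restated here.

```python
def learning_rate(epoch, base_lr, warmup_epochs=5, decay_epochs=(45, 55), total_epochs=65):
    """Linear warm-up followed by step decay (factor 10 at each decay epoch)."""
    assert 0 <= epoch < total_epochs
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    lr = base_lr
    for boundary in decay_epochs:
        if epoch >= boundary:
            lr /= 10.0
    return lr

# Example: print the schedule for a (hypothetical) base learning rate of 0.1.
print([round(learning_rate(e, 0.1), 4) for e in (0, 4, 5, 44, 45, 55, 64)])
```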
5 Results and discussion
Recall that the main goal of this work is to match (or outperform) the fully supervised BigGAN in an unsupervised fashion, or with a small subset of labeled data. In the following, we discuss the advantages and drawbacks of the analyzed approaches with respect to this goal.
As a baseline, our reimplementation of BigGAN obtains an FID of 8.4 and IS of 75.0, and hence reproduces the result reported by Brock et al. (2019) in terms of FID. We observed some differences in training dynamics, which we discuss in detail in Section 5.4.
5.1 Unsupervised approaches
Method | FID | IS
---|---|---
Random label | 26.5 | 20.2
Single label | 25.3 | 20.4
Single label (SS) | 23.7 | 22.2
Clustering | 23.2 | 22.7
Clustering (SS) | 22.0 | 23.5
The results for the unsupervised approaches are summarized in Figure 7 and Table 2. The fully unsupervised Random label and Single label models achieve similar scores, with an FID of about 25–26 and an IS of about 20 (cf. Table 2). This is a considerable gap compared to BigGAN and indicates that additional supervision is necessary. We note that one of the three Single label models collapsed, whereas all three Random label models trained stably for 250k generator iterations.
Pre-training a semantic representation using self-supervision and clustering the training data on this representation, as done by Clustering, reduces the FID to 23.2 and increases the IS to 22.7 (cf. Table 2). These results were obtained with 50 clusters; all other numbers of clusters we tried led to worse results. While this performance is still considerably worse than that of BigGAN, this result is the current state of the art in unsupervised image generation (Chen et al. (2019b) report an FID of 33 for unsupervised generation).
Example images from the clustering are shown in Figures 14, 15, and 16 in the supplementary material. The clustering is clearly meaningful and groups similar objects within the same cluster. Furthermore, the objects generated by Clustering conditionally on a given cluster index reflect the distribution of the training data belonging to the corresponding cluster. On the other hand, we can clearly observe multiple classes being present in the same cluster. This is to be expected when under-clustering the 1000 ImageNet classes into only 50 clusters. Interestingly, clustering into many more clusters yields results similar to Single label.

5.2 Semi-supervised approaches
Pre-trained The S2GAN model, in which we use the classifier pre-trained with both a self-supervised and a semi-supervised loss (cf. Section 3.1), suffers only a very minor increase in FID for 5% and 10% labeled real training data, and matches BigGAN both in terms of FID and IS when 20% of the labels are used. We stress that this holds despite the fact that the classifier used to infer the labels is substantially less accurate than the ground-truth labels (its top-1 and top-5 errors on the ImageNet validation set are reported in Table 3). The results are shown in Table 4 and Figure 8, and random samples as well as interpolations can be found in Figures 9–17 in the supplementary material.

Labels | 5% | 10% | 20%
---|---|---|---
Top-1 error | | |
Top-5 error | | |
Method | FID (5%) | FID (10%) | FID (20%) | IS (5%) | IS (10%) | IS (20%)
---|---|---|---|---|---|---
S2GAN | 10.8 | 8.9 | 8.4 | 57.6 | 73.4 | 77.4
S2GAN-CO | 21.8 | 17.7 | 13.9 | 30.0 | 37.2 | 49.2
S3GAN | 10.4 | 8.0 | 7.7 | 59.6 | 78.7 | 83.1
S3GAN-CO | 20.2 | 16.6 | 12.7 | 31.0 | 38.5 | 53.1
Co-trained
The results for our co-trained model S2GAN-CO, which trains a linear classifier in semi-supervised fashion on top of the discriminator representation during GAN training (cf. Section 3.2), are shown in Table 4. S2GAN-CO outperforms all fully unsupervised approaches for all considered label percentages. While the gap between S2GAN-CO with 5% labels and Clustering in terms of FID is small, S2GAN-CO has a considerably larger IS. When using 20% labeled training examples, S2GAN-CO obtains an FID of 13.9 and an IS of 49.2, which is remarkably close to BigGAN and S2GAN given the simplicity of the S2GAN-CO approach. As the percentage of labels decreases, the gap between S2GAN and S2GAN-CO increases.
Interestingly, S2GAN-CO does not seem to train less stably than the S2GAN approaches, even though it is forced to learn the classifier during GAN training. This is particularly remarkable as the BigGAN-k% approaches, where we retain only the labeled data for training and discard all unlabeled data, are very unstable and collapse after 60k to 120k iterations, for all three random seeds and for both of the label fractions considered.
5.3 Self-supervision during GAN training
So far we have seen that the pre-trained semi-supervised approach, namely S2GAN, is able to achieve state-of-the-art performance when 20% of the labels are available. Here we investigate whether self-supervision during GAN training, as described in Section 3.3, can lead to further improvements. Table 4 and Figure 8 show the experimental results for S3GAN, namely S2GAN coupled with self-supervision in the discriminator.
Self-supervision leads to a reduction in FID and an increase in IS across all considered settings. In particular, we can match the state-of-the-art BigGAN with only 10% of the labels and outperform it using 20% of the labels, both in terms of FID and IS.
For S3GAN the improvements in FID due to self-supervision during GAN training are considerable (cf. Table 4). Tuning the weight $\alpha$ of the discriminator self-supervision loss in (4) did not dramatically increase the benefits of self-supervision during GAN training, at least for the range of values considered. As shown in Tables 2 and 4, self-supervision during GAN training (with default values for $\alpha$ and $\beta$) also leads to improvements of 1 to 2 FID points for both S2GAN-CO and Single label. In summary, self-supervision during GAN training with default parameters leads to a stable improvement across all approaches.

5.4 Other insights
Effect of soft labels
A design choice available to practitioners is whether to use hard labels (i.e., the argmax over the logits) or soft labels (the softmax over the logits) for S2GAN and S3GAN (recall that we use soft labels by default for S2GAN-CO and S3GAN-CO). Our initial expectation was that soft labels should help when very little labeled data is available, as soft labels carry more information which can potentially be exploited by the projection discriminator. Surprisingly, the results presented in Table 5 clearly show that the opposite is true. Our current hypothesis is that this is due to the way labels are incorporated in the projection discriminator, but we do not have empirical evidence yet.
Optimization dynamics Brock et al. (2019) report the FID and IS of the model just before collapse, which can be seen as a form of early stopping. In contrast, we manage to stably train the proposed models for 250k generator iterations. In particular, we also observe stable training for our vanilla BigGAN implementation. The evolution of the FID and IS as a function of the training steps is shown in Figure 21 in the appendix. At this point we can only speculate about the origin of this difference.
Method | FID (5%) | FID (10%) | FID (20%) | IS (5%) | IS (10%) | IS (20%)
---|---|---|---|---|---|---
S2GAN | 10.8 | 8.9 | 8.4 | 57.6 | 73.4 | 77.4
+ soft labels | 15.4 | 12.9 | 10.4 | 40.3 | 49.8 | 62.1
Higher resolution and going below 5% labels Training these models at higher resolution becomes computationally harder and necessitates tuning the learning rate. We trained several S3GAN models at higher resolution and show the resulting samples in Figures 12–13 and interpolations in Figures 19–20. We also conducted S3GAN experiments using fewer than 5% of the labels; the resulting models still significantly outperform the unsupervised approaches in terms of FID and IS (cf. Figure 7), indicating that even a very small number of labels helps considerably.
6 Conclusion and future work
In this work we investigated several avenues to reduce the appetite for labeled data in state-of-the-art generative adversarial networks. We showed that recent advances in self- and semi-supervised learning can be used to achieve a new state of the art, both for unsupervised and supervised natural image synthesis.
We believe that this is a great first step towards the ultimate goal of few-shot high-fidelity image synthesis. There are several important directions for future work: (i) investigating the applicability of these techniques to even larger and more diverse data sets, (ii) investigating the impact of other self- and semi-supervised approaches on the model quality, and (iii) investigating the impact of self-supervision in other deep generative models. Finally, we would like to emphasize that further progress might be hindered by the engineering challenges related to training large-scale generative adversarial networks. To help alleviate this issue and to foster reproducibility, we have open-sourced all the code used for the experiments.
Acknowledgments
We would like to thank Ting Chen and Neil Houlsby for fruitful discussions on self-supervision and its application to GANs. We would like to thank Lucas Beyer, Alexander Kolesnikov, and Avital Oliver for helpful discussions on self-supervised semi-supervised learning. We would like to thank Karol Kurach and Marcin Michalski for their major contributions to the Compare GAN library. We would also like to thank the BigGAN team (Andy Brock, Jeff Donahue, and Karen Simonyan) for their insights into training GANs on TPUs. Finally, we are grateful for the support of members of the Google Brain team in Zurich.
References
- Agrawal et al. (2015) Agrawal, P., Carreira, J., and Malik, J. Learning to see by moving. In International Conference on Computer Vision, 2015.
- Arbel et al. (2018) Arbel, M., Sutherland, D., Bińkowski, M., and Gretton, A. On gradient regularizers for MMD GANs. In Advances in Neural Information Processing Systems, 2018.
- Barratt & Sharma (2018) Barratt, S. and Sharma, R. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
- Beyer et al. (2019) Beyer, L., Kolesnikov, A., Oliver, A., Xiaohua, Z., and Gelly, S. Self-supervised Semi-supervised Learning. In Manuscript in preparation, 2019.
- Brock et al. (2019) Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
- Caron et al. (2018) Caron, M., Bojanowski, P., Joulin, A., and Douze, M. Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision, 2018.
- Chen et al. (2019a) Chen, T., Lucic, M., Houlsby, N., and Gelly, S. On self modulation for generative adversarial networks. In International Conference on Learning Representations, 2019a.
- Chen et al. (2019b) Chen, T., Zhai, X., Ritter, M., Lucic, M., and Houlsby, N. Self-supervised GANs via auxiliary rotation loss. In Computer Vision and Pattern Recognition, 2019b.
- De Vries et al. (2017) De Vries, H., Strub, F., Mary, J., Larochelle, H., Pietquin, O., and Courville, A. C. Modulating early visual processing by language. In Advances in Neural Information Processing Systems, 2017.
- Deng et al. (2017) Deng, Z., Zhang, H., Liang, X., Yang, L., Xu, S., Zhu, J., and Xing, E. P. Structured Generative Adversarial Networks. In Advances in Neural Information Processing Systems, 2017.
- Doersch et al. (2015) Doersch, C., Gupta, A., and Efros, A. A. Unsupervised visual representation learning by context prediction. In International Conference on Computer Vision, 2015.
- Dumoulin et al. (2017) Dumoulin, V., Shlens, J., and Kudlur, M. A learned representation for artistic style. In International Conference on Learning Representations, 2017.
- Gan et al. (2017) Gan, Z., Chen, L., Wang, W., Pu, Y., Zhang, Y., Liu, H., Li, C., and Carin, L. Triangle generative adversarial networks. In Advances in Neural Information Processing Systems, 2017.
- Gidaris et al. (2018) Gidaris, S., Singh, P., and Komodakis, N. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, 2018.
- Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
- Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
- Jang et al. (2018) Jang, E., Devin, C., Vanhoucke, V., and Levine, S. Grasp2Vec: Learning Object Representations from Self-Supervised Grasping. In Conference on Robot Learning, 2018.
- Kalchbrenner et al. (2017) Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., and Kavukcuoglu, K. Video pixel networks. In International Conference on Machine Learning, 2017.
- Karras et al. (2018) Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
- Kolesnikov et al. (2019) Kolesnikov, A., Zhai, X., and Beyer, L. Revisiting Self-supervised Visual Representation Learning. In Computer Vision and Pattern Recognition, 2019.
- Kurach et al. (2018) Kurach, K., Lucic, M., Zhai, X., Michalski, M., and Gelly, S. The GAN Landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.
- Lee et al. (2017) Lee, H.-Y., Huang, J.-B., Singh, M., and Yang, M.-H. Unsupervised representation learning by sorting sequences. In International Conference on Computer Vision, 2017.
- Li et al. (2017) Li, C., Xu, T., Zhu, J., and Zhang, B. Triple generative adversarial nets. In Advances in Neural Information Processing Systems. 2017.
- Lucic et al. (2018) Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs Created Equal? A Large-scale Study. In Advances in Neural Information Processing Systems, 2018.
- Menick & Kalchbrenner (2019) Menick, J. and Kalchbrenner, N. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In International Conference on Learning Representations, 2019.
- Miyato & Koyama (2018) Miyato, T. and Koyama, M. cgans with projection discriminator. In International Conference on Learning Representations, 2018.
- Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. International Conference on Learning Representations, 2018.
- Mundhenk et al. (2018) Mundhenk, T. N., Ho, D., and Chen, B. Y. Improvements to context based self-supervised learning. In Computer Vision and Pattern Recognition, 2018.
- Noroozi & Favaro (2016) Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, 2016.
- Odena (2016) Odena, A. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.
- Odena et al. (2017) Odena, A., Olah, C., and Shlens, J. Conditional Image Synthesis with Auxiliary Classifier GANs. In International Conference on Machine Learning, 2017.
- Pinto & Gupta (2016) Pinto, L. and Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In IEEE International Conference on Robotics and Automation, 2016.
- Sajjadi et al. (2018) Sajjadi, M. S., Bachem, O., Lucic, M., Bousquet, O., and Gelly, S. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, 2018.
- Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.
- Sculley (2010) Sculley, D. Web-scale k-means clustering. In International Conference on World Wide Web. ACM, 2010.
- Springenberg (2016) Springenberg, J. T. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In International Conference on Learning Representations, 2016.
- Sricharan et al. (2017) Sricharan, K., Bala, R., Shreve, M., Ding, H., Saketh, K., and Sun, J. Semi-supervised conditional GANs. arXiv preprint arXiv:1708.05789, 2017.
- Van Den Oord et al. (2016) Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. 2016.
- Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. British Machine Vision Conference, 2016.
- Zhang et al. (2018) Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-Attention Generative Adversarial Networks. arXiv preprint arXiv:1805.08318, 2018.
Appendix A Additional samples and interpolations








Appendix B Architectural details
The ResNet architecture, implemented following Brock et al. (2019), is described in Tables 6 and 7. We use the abbreviations RS for resample and BN for batch normalization. In the resample column, we indicate the downscale (D) / upscale (U) / none (-) setting. In Table 7, $y$ stands for the labels and $h$ for the output of the layer before (i.e., the pre-logit layer). Tables 8 and 9 show the ResBlock details. The addition layer merges the shortcut path and the convolution path by adding them. $H$ and $W$ denote the input height and width of a ResBlock, and $C_{in}$ and $C_{out}$ its input and output channels. For the last ResBlock in the discriminator, which does not resample, we simply drop the shortcut layer. We list all trainable variables and their shapes in Tables 10 and 11.
Layer | RS | Output |
---|---|---|
- | ||
Dense | - | |
ResBlock | U | |
ResBlock | U | |
ResBlock | U | |
ResBlock | U | |
Non-local block | - | |
ResBlock | U | |
BN, ReLU | - |
Conv | - | |
Tanh | - |
Layer | RS | Output |
---|---|---|
Input image | - | |
ResBlock | D | |
Non-local block | - | |
ResBlock | D | |
ResBlock | D | |
ResBlock | D | |
ResBlock | D | |
ResBlock | - | |
ReLU | - | |
Global sum pooling | - | |
Sum(embed(y)·h) + dense(h) | - |
Layer | Kernel | RS | Output |
---|---|---|---|
Shortcut | D | ||
BN, ReLU | - | - | |
Conv | - | ||
BN, ReLU | - | - | |
Conv | D | ||
Addition | - | - |
Layer | Kernel | RS | Output |
---|---|---|---|
Shortcut | U | ||
BN, ReLU | - | - | |
Conv | U | ||
BN, ReLU | - | - | |
Conv | - | ||
Addition | - | - |
Name | Shape | Size |
---|---|---|
discriminator/B1/same_conv1/kernel:0 | (3, 3, 3, 96) | 2,592 |
discriminator/B1/same_conv1/bias:0 | (96,) | 96 |
discriminator/B1/down_conv2/kernel:0 | (3, 3, 96, 96) | 82,944 |
discriminator/B1/down_conv2/bias:0 | (96,) | 96 |
discriminator/B1/down_conv_shortcut/kernel:0 | (1, 1, 3, 96) | 288 |
discriminator/B1/down_conv_shortcut/bias:0 | (96,) | 96 |
discriminator/non_local_block/conv2d_theta/kernel:0 | (1, 1, 96, 12) | 1,152 |
discriminator/non_local_block/conv2d_phi/kernel:0 | (1, 1, 96, 12) | 1,152 |
discriminator/non_local_block/conv2d_g/kernel:0 | (1, 1, 96, 48) | 4,608 |
discriminator/non_local_block/sigma:0 | () | 1 |
discriminator/non_local_block/conv2d_attn_g/kernel:0 | (1, 1, 48, 96) | 4,608 |
discriminator/B2/same_conv1/kernel:0 | (3, 3, 96, 192) | 165,888 |
discriminator/B2/same_conv1/bias:0 | (192,) | 192 |
discriminator/B2/down_conv2/kernel:0 | (3, 3, 192, 192) | 331,776 |
discriminator/B2/down_conv2/bias:0 | (192,) | 192 |
discriminator/B2/down_conv_shortcut/kernel:0 | (1, 1, 96, 192) | 18,432 |
discriminator/B2/down_conv_shortcut/bias:0 | (192,) | 192 |
discriminator/B3/same_conv1/kernel:0 | (3, 3, 192, 384) | 663,552 |
discriminator/B3/same_conv1/bias:0 | (384,) | 384 |
discriminator/B3/down_conv2/kernel:0 | (3, 3, 384, 384) | 1,327,104 |
discriminator/B3/down_conv2/bias:0 | (384,) | 384 |
discriminator/B3/down_conv_shortcut/kernel:0 | (1, 1, 192, 384) | 73,728 |
discriminator/B3/down_conv_shortcut/bias:0 | (384,) | 384 |
discriminator/B4/same_conv1/kernel:0 | (3, 3, 384, 768) | 2,654,208 |
discriminator/B4/same_conv1/bias:0 | (768,) | 768 |
discriminator/B4/down_conv2/kernel:0 | (3, 3, 768, 768) | 5,308,416 |
discriminator/B4/down_conv2/bias:0 | (768,) | 768 |
discriminator/B4/down_conv_shortcut/kernel:0 | (1, 1, 384, 768) | 294,912 |
discriminator/B4/down_conv_shortcut/bias:0 | (768,) | 768 |
discriminator/B5/same_conv1/kernel:0 | (3, 3, 768, 1536) | 10,616,832 |
discriminator/B5/same_conv1/bias:0 | (1536,) | 1,536 |
discriminator/B5/down_conv2/kernel:0 | (3, 3, 1536, 1536) | 21,233,664 |
discriminator/B5/down_conv2/bias:0 | (1536,) | 1,536 |
discriminator/B5/down_conv_shortcut/kernel:0 | (1, 1, 768, 1536) | 1,179,648 |
discriminator/B5/down_conv_shortcut/bias:0 | (1536,) | 1,536 |
discriminator/B6/same_conv1/kernel:0 | (3, 3, 1536, 1536) | 21,233,664 |
discriminator/B6/same_conv1/bias:0 | (1536,) | 1,536 |
discriminator/B6/same_conv2/kernel:0 | (3, 3, 1536, 1536) | 21,233,664 |
discriminator/B6/same_conv2/bias:0 | (1536,) | 1,536 |
discriminator/final_fc/kernel:0 | (1536, 1) | 1,536 |
discriminator/final_fc/bias:0 | (1,) | 1 |
discriminator_projection/kernel:0 | (1000, 1536) | 1,536,000 |
Name | Shape | Size |
---|---|---|
generator/embed_y/kernel:0 | (1000, 128) | 128,000 |
generator/fc_noise/kernel:0 | (20, 24576) | 491,520 |
generator/fc_noise/bias:0 | (24576,) | 24,576 |
generator/B1/bn1/condition/gamma/kernel:0 | (148, 1536) | 227,328 |
generator/B1/bn1/condition/beta/kernel:0 | (148, 1536) | 227,328 |
generator/B1/up_conv1/kernel:0 | (3, 3, 1536, 1536) | 21,233,664 |
generator/B1/up_conv1/bias:0 | (1536,) | 1,536 |
generator/B1/bn2/condition/gamma/kernel:0 | (148, 1536) | 227,328 |
generator/B1/bn2/condition/beta/kernel:0 | (148, 1536) | 227,328 |
generator/B1/same_conv2/kernel:0 | (3, 3, 1536, 1536) | 21,233,664 |
generator/B1/same_conv2/bias:0 | (1536,) | 1,536 |
generator/B1/up_conv_shortcut/kernel:0 | (1, 1, 1536, 1536) | 2,359,296 |
generator/B1/up_conv_shortcut/bias:0 | (1536,) | 1,536 |
generator/B2/bn1/condition/gamma/kernel:0 | (148, 1536) | 227,328 |
generator/B2/bn1/condition/beta/kernel:0 | (148, 1536) | 227,328 |
generator/B2/up_conv1/kernel:0 | (3, 3, 1536, 768) | 10,616,832 |
generator/B2/up_conv1/bias:0 | (768,) | 768 |
generator/B2/bn2/condition/gamma/kernel:0 | (148, 768) | 113,664 |
generator/B2/bn2/condition/beta/kernel:0 | (148, 768) | 113,664 |
generator/B2/same_conv2/kernel:0 | (3, 3, 768, 768) | 5,308,416 |
generator/B2/same_conv2/bias:0 | (768,) | 768 |
generator/B2/up_conv_shortcut/kernel:0 | (1, 1, 1536, 768) | 1,179,648 |
generator/B2/up_conv_shortcut/bias:0 | (768,) | 768 |
generator/B3/bn1/condition/gamma/kernel:0 | (148, 768) | 113,664 |
generator/B3/bn1/condition/beta/kernel:0 | (148, 768) | 113,664 |
generator/B3/up_conv1/kernel:0 | (3, 3, 768, 384) | 2,654,208 |
generator/B3/up_conv1/bias:0 | (384,) | 384 |
generator/B3/bn2/condition/gamma/kernel:0 | (148, 384) | 56,832 |
generator/B3/bn2/condition/beta/kernel:0 | (148, 384) | 56,832 |
generator/B3/same_conv2/kernel:0 | (3, 3, 384, 384) | 1,327,104 |
generator/B3/same_conv2/bias:0 | (384,) | 384 |
generator/B3/up_conv_shortcut/kernel:0 | (1, 1, 768, 384) | 294,912 |
generator/B3/up_conv_shortcut/bias:0 | (384,) | 384 |
generator/B4/bn1/condition/gamma/kernel:0 | (148, 384) | 56,832 |
generator/B4/bn1/condition/beta/kernel:0 | (148, 384) | 56,832 |
generator/B4/up_conv1/kernel:0 | (3, 3, 384, 192) | 663,552 |
generator/B4/up_conv1/bias:0 | (192,) | 192 |
generator/B4/bn2/condition/gamma/kernel:0 | (148, 192) | 28,416 |
generator/B4/bn2/condition/beta/kernel:0 | (148, 192) | 28,416 |
generator/B4/same_conv2/kernel:0 | (3, 3, 192, 192) | 331,776 |
generator/B4/same_conv2/bias:0 | (192,) | 192 |
generator/B4/up_conv_shortcut/kernel:0 | (1, 1, 384, 192) | 73,728 |
generator/B4/up_conv_shortcut/bias:0 | (192,) | 192 |
generator/non_local_block/conv2d_theta/kernel:0 | (1, 1, 192, 24) | 4,608 |
generator/non_local_block/conv2d_phi/kernel:0 | (1, 1, 192, 24) | 4,608 |
generator/non_local_block/conv2d_g/kernel:0 | (1, 1, 192, 96) | 18,432 |
generator/non_local_block/sigma:0 | () | 1 |
generator/non_local_block/conv2d_attn_g/kernel:0 | (1, 1, 96, 192) | 18,432 |
generator/B5/bn1/condition/gamma/kernel:0 | (148, 192) | 28,416 |
generator/B5/bn1/condition/beta/kernel:0 | (148, 192) | 28,416 |
generator/B5/up_conv1/kernel:0 | (3, 3, 192, 96) | 165,888 |
generator/B5/up_conv1/bias:0 | (96,) | 96 |
generator/B5/bn2/condition/gamma/kernel:0 | (148, 96) | 14,208 |
generator/B5/bn2/condition/beta/kernel:0 | (148, 96) | 14,208 |
generator/B5/same_conv2/kernel:0 | (3, 3, 96, 96) | 82,944 |
generator/B5/same_conv2/bias:0 | (96,) | 96 |
generator/B5/up_conv_shortcut/kernel:0 | (1, 1, 192, 96) | 18,432 |
generator/B5/up_conv_shortcut/bias:0 | (96,) | 96 |
generator/final_norm/gamma:0 | (96,) | 96 |
generator/final_norm/beta:0 | (96,) | 96 |
generator/final_conv/kernel:0 | (3, 3, 96, 3) | 2,592 |
generator/final_conv/bias:0 | (3,) | 3 |
Appendix C FID and IS training curves

Appendix D FID and IS: Mean and standard deviations
Method | FID (5%) | FID (10%) | FID (20%) | IS (5%) | IS (10%) | IS (20%)
---|---|---|---|---|---|---
S2GAN | 11.0±0.31 | 9.0±0.30 | 8.4±0.02 | 57.6±0.86 | 72.9±1.41 | 77.7±1.24
S2GAN-CO | 21.6±0.64 | 17.6±0.27 | 13.8±0.48 | 29.8±0.21 | 37.1±0.54 | 50.1±1.45
S3GAN | 10.3±0.16 | 8.1±0.14 | 7.8±0.20 | 59.9±0.74 | 78.3±1.08 | 82.1±1.89
S3GAN-CO | 20.2±0.14 | 16.5±0.12 | 12.8±0.51 | 31.1±0.18 | 38.7±0.36 | 52.7±1.08
Method | FID (5%) | FID (10%) | FID (20%) | IS (5%) | IS (10%) | IS (20%)
---|---|---|---|---|---|---
S2GAN | 11.0±0.31 | 9.0±0.30 | 8.4±0.02 | 57.6±0.86 | 72.9±1.41 | 77.7±1.24
S2GAN Soft | 15.6±0.58 | 13.3±1.71 | 11.3±1.42 | 40.1±0.97 | 49.3±4.67 | 58.5±5.84
Method | FID | IS
---|---|---
Clustering | 22.7±0.80 | 22.8±0.42
Clustering (SS) | 21.9±0.08 | 23.6±0.19
Random label | 27.2±1.46 | 20.2±0.33
Single label | 71.7±66.32 | 15.4±7.57
Single label (SS) | 23.6±0.14 | 22.2±0.10