High-Fidelity Image Generation With Fewer Labels

03/06/2019 · by Mario Lucic, et al.

Deep generative models are becoming a cornerstone of modern machine learning. Recent work on conditional generative adversarial networks has shown that learning complex, high-dimensional distributions over natural images is within reach. While the latest models are able to generate high-fidelity, diverse natural images at high resolution, they rely on a vast quantity of labeled data. In this work we demonstrate how one can benefit from recent work on self- and semi-supervised learning to outperform the state of the art (SOTA) both in unsupervised ImageNet synthesis and in the conditional setting. In particular, the proposed approach is able to match the sample quality (as measured by FID) of the current state-of-the-art conditional model BigGAN on ImageNet using only 10% of the labels.


1 Introduction

Deep generative models have received a great deal of attention due to their power to learn complex high-dimensional distributions, such as distributions over natural images (Zhang et al., 2018; Brock et al., 2019), videos (Kalchbrenner et al., 2017), and audio (Van Den Oord et al., 2016). Recent progress was driven by scalable training of large-scale models (Brock et al., 2019; Menick & Kalchbrenner, 2019), architectural modifications (Zhang et al., 2018; Chen et al., 2019a; Karras et al., 2018), and normalization techniques (Miyato et al., 2018).

Figure 1: FID of the baselines and the proposed method. The vertical line indicates the baseline (BigGAN) which uses all the labeled data. The proposed method (S3GAN) is able to match the state of the art while using only 10% of the labeled data, and to outperform it with 20%.

High-fidelity natural image generation (typically trained on ImageNet) hinges upon having access to vast quantities of labeled data. This is unsurprising as labels induce rich side information into the training process, effectively dividing the extremely challenging image generation task into semantically meaningful sub-tasks.

However, this dependence on vast quantities of labeled data is at odds with the fact that most data is unlabeled, and labeling itself is often costly and error-prone. Despite the recent progress on unsupervised image generation, the gap between conditional and unsupervised models in terms of sample quality is significant.

In this work, we take a significant step towards closing the gap between conditional and unsupervised generation of high-fidelity images using generative adversarial networks (GANs). We leverage two simple yet powerful concepts:

  1. Self-supervised learning: A semantic feature extractor for the training data can be learned via self-supervision, and the resulting feature representation can then be employed to guide the GAN training process.

  2. Semi-supervised learning: Labels for the entire training set can be inferred from a small subset of labeled training images and the inferred labels can be used as conditional information for GAN training.

Our contributions In this work, we

  1. propose and study various approaches to reduce or fully omit ground-truth label information for natural image generation tasks,

  2. achieve a new SOTA in unsupervised generation on ImageNet, match the SOTA on ImageNet using only 10% of the labels, and set a new SOTA using only 20% of the labels (measured by FID), and

  3. open-source all the code used for the experiments at github.com/google/compare_gan.

2 Background and related work

High-fidelity GANs on ImageNet

Besides BigGAN (Brock et al., 2019), only a few prior methods have managed to scale GANs to ImageNet, most of them relying on class-conditional generation using labels. One of the earliest attempts is the GAN with an auxiliary classifier (AC-GAN) (Odena et al., 2017), which feeds one-hot encoded label information together with the latent code to the generator and equips the discriminator with an auxiliary head predicting the image class in addition to whether the input is real or fake. More recent approaches rely on a label projection layer in the discriminator, essentially resulting in per-class real/fake classification (Miyato & Koyama, 2018), and self-attention in the generator (Zhang et al., 2018). Both methods use modulated batch normalization (De Vries et al., 2017) to provide label information to the generator. On the unsupervised side, Chen et al. (2019b) showed that an auxiliary rotation loss added to the discriminator has a stabilizing effect on the training. Finally, appropriate gradient regularization enables scaling MMD-GANs to ImageNet without using labels (Arbel et al., 2018).

Semi-supervised GANs  Several recent works leveraged GANs for semi-supervised learning of classifiers. Both Salimans et al. (2016) and Odena (2016) train a discriminator that classifies its input into K + 1 classes: K image classes for real images, and one class for generated images. Similarly, Springenberg (2016) extends the standard GAN objective to K classes. This approach was also considered by Li et al. (2017), where separate discriminator and classifier models are applied. Other approaches incorporate inference models to predict missing labels (Deng et al., 2017) or harness joint distribution (of labels and data) matching for semi-supervised learning (Gan et al., 2017). We emphasize that this line of work focuses on training a classifier from few labels, rather than on using few labels to improve the quality of the generative model. To the best of our knowledge, improvements in sample quality through partial label information are reported only in Li et al. (2017); Deng et al. (2017); Sricharan et al. (2017), all of which consider low-resolution data sets from a restricted domain.

Self-supervised learning  Self-supervised learning methods employ a label-free auxiliary task to learn a semantic feature representation of the data. This approach has been successfully applied to different data modalities, such as images (Doersch et al., 2015; Caron et al., 2018), video (Agrawal et al., 2015; Lee et al., 2017), and robotics (Jang et al., 2018; Pinto & Gupta, 2016). The current state-of-the-art method on ImageNet is due to Gidaris et al. (2018), who proposed predicting the rotation angle of rotated training images as an auxiliary task. This simple self-supervision approach yields representations which are useful for downstream image classification tasks. Other forms of self-supervision include predicting relative locations of disjoint image patches of a given image (Doersch et al., 2015; Mundhenk et al., 2018) or estimating the permutation of randomly swapped image patches on a regular grid (Noroozi & Favaro, 2016). A study on self-supervised learning with modern neural architectures is provided in Kolesnikov et al. (2019).

3 Reducing the appetite for labeled data

In a nutshell, instead of providing hand-annotated ground truth labels for real images to the discriminator, we will provide inferred ones. To obtain these labels we will make use of recent advancements in self- and semi-supervised learning. Before introducing these methods in detail, we first discuss how label information is used in state-of-the-art GANs. The following exposition assumes familiarity with the basics of the GAN framework (Goodfellow et al., 2014).

Figure 2: Top row: samples from the fully supervised current state-of-the-art model BigGAN. Bottom row: samples from the proposed S3GAN, which matches BigGAN in terms of FID and IS using only 10% of the ground-truth labels.

Incorporating the labels  To provide the label information to the discriminator we employ a linear projection layer as proposed by Miyato & Koyama (2018). To make the exposition self-contained, we briefly recall the main ideas. In a "vanilla" (unconditional) GAN, the discriminator $D$ learns to predict whether the image at its input is real or generated by the generator $G$. We decompose the discriminator into a learned discriminator representation $\tilde D$, which is fed into a linear classifier $c$, i.e., the discriminator is given by $D = c \circ \tilde D$. In the projection discriminator, one additionally learns an embedding $P_y$ for each class $y$, of the same dimension as the representation $\tilde D(x)$. Then, for a given image-label pair $(x, y)$, the decision on whether the sample is real or generated is based on two components: (a) on whether the representation $\tilde D(x)$ itself is consistent with the real data, and (b) on whether the representation is consistent with the real data from class $y$. More formally, the discriminator takes the form $D(x, y) = c(\tilde D(x)) + P(\tilde D(x), y)$, where $P(\tilde D(x), y) = \langle P_y, \tilde D(x) \rangle$ is a linear projection layer applied to the feature vector $\tilde D(x)$ and the one-hot encoded label $y$. As for the generator, the label information is incorporated through class-conditional BatchNorm (Dumoulin et al., 2017; De Vries et al., 2017). The conditional GAN with projection discriminator is illustrated in Figure 3. We proceed with describing the pre-trained and co-training approaches to infer labels for GAN training in Sections 3.1 and 3.2, respectively.
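To make the projection mechanism concrete, the following is a minimal NumPy sketch of the logit computation under the notation above. The function name, the toy shapes, and the random inputs are ours for illustration and are not part of the released compare_gan code.

```python
import numpy as np

def projection_logit(d_repr, label_onehot, c_weights, embedding_matrix):
    """Projection-discriminator logit D(x, y) = c(D~(x)) + <P_y, D~(x)>.

    d_repr:           discriminator representation D~(x), shape (dim,)
    label_onehot:     one-hot label y, shape (num_classes,)
    c_weights:        weights of the unconditional linear classifier c, shape (dim,)
    embedding_matrix: class embeddings P, shape (num_classes, dim)
    """
    unconditional = c_weights @ d_repr       # c(D~(x))
    p_y = embedding_matrix.T @ label_onehot  # embedding P_y of class y
    conditional = p_y @ d_repr               # projection term <P_y, D~(x)>
    return unconditional + conditional

# Toy usage with random tensors of BigGAN-sized shapes (1536-d features,
# 1000 classes, cf. Table 10); the values are arbitrary, only the shapes matter.
rng = np.random.default_rng(0)
dim, num_classes = 1536, 1000
logit = projection_logit(rng.normal(size=dim),
                         np.eye(num_classes)[3],
                         rng.normal(size=dim),
                         rng.normal(size=(num_classes, dim)))
```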

Figure 3: Conditional GAN with projection discriminator. The discriminator tries to predict from the representation $\tilde D(x)$ whether a real image (with its ground-truth label) or a generated image (with its sampled label) is at its input, by combining an unconditional classifier $c$ and a class-conditional classifier implemented through the projection layer $P$. This form of conditioning is used in BigGAN. Outward-pointing arrows feed into losses.

3.1 Pre-trained approaches

Unsupervised clustering-based method

We first learn a representation of the real training data using a state-of-the-art self-supervised approach (Gidaris et al., 2018; Kolesnikov et al., 2019), perform clustering on this representation, and use the cluster assignments as a replacement for labels. Following Gidaris et al. (2018), we learn the feature extractor $\tilde F$ (typically a convolutional neural network) by minimizing the following self-supervision loss

$\min_{\tilde F, c_R} \; \mathbb{E}_{x \sim p_{\mathrm{data}}} \, \mathbb{E}_{r \sim \mathcal{R}} \big[ -\log c_R(r \mid \tilde F(x^r)) \big], \qquad (1)$

where $\mathcal{R}$ is the set of the four rotation degrees $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$, $x^r$ is the image $x$ rotated by $r$, and $c_R$ is a linear classifier predicting the rotation degree $r$. After learning the feature extractor $\tilde F$, we apply mini-batch $k$-means clustering (Sculley, 2010) to the representations of the training images. Finally, given the cluster assignment function $c_C$, we train the GAN using the hinge loss, alternately minimizing the discriminator loss $\mathcal{L}_D$ and generator loss $\mathcal{L}_G$, namely

$\mathcal{L}_D = \mathbb{E}_{x \sim p_{\mathrm{data}}} \big[ \max\!\big(0,\, 1 - D(x, c_C(\tilde F(x)))\big) \big] + \mathbb{E}_{(z, y) \sim p(z, y)} \big[ \max\!\big(0,\, 1 + D(G(z, y), y)\big) \big],$
$\mathcal{L}_G = - \, \mathbb{E}_{(z, y) \sim p(z, y)} \big[ D(G(z, y), y) \big],$

where $p(z, y) = p(z)\,p(y)$ is the prior distribution, with $p(z)$ the latent prior and $p(y)$ the empirical distribution of the cluster labels over the training set. We call this approach Clustering and illustrate it in Figure 4.
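The label-inference step of Clustering can be sketched as follows, assuming the self-supervised representations of the training images have already been computed; the function name, the use of scikit-learn's MiniBatchKMeans, and the toy data are our illustrative choices, not the exact pipeline of the released code.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def infer_cluster_labels(features, num_clusters=50, seed=0):
    """Cluster self-supervised representations F~(x) of the real training
    images and return (a) per-image cluster ids used in place of labels and
    (b) the empirical cluster-label distribution used as the prior p(y)."""
    kmeans = MiniBatchKMeans(n_clusters=num_clusters, random_state=seed)
    cluster_ids = kmeans.fit_predict(features)
    prior = np.bincount(cluster_ids, minlength=num_clusters) / len(cluster_ids)
    return cluster_ids, prior

# Toy usage on random stand-ins for the self-supervised representations.
features = np.random.default_rng(0).normal(size=(1000, 64))
cluster_ids, label_prior = infer_cluster_labels(features)
```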

Figure 4: Clustering: unsupervised approach based on clustering the representations obtained by solving a self-supervised task. $\tilde F$ corresponds to the feature extractor learned via self-supervision and $c_C$ is the cluster assignment function. After learning $\tilde F$ and $c_C$ on the real training images in the pre-training step, we proceed with conditional GAN training by inferring the labels as $c_C(\tilde F(x))$.

Semi-supervised method

While semi-supervised learning is an active area of research and a large variety of algorithms has been proposed, we follow Beyer et al. (2019) and simply extend the self-supervised approach described in the previous paragraph with a semi-supervised loss. This ensures that the two approaches are comparable in terms of model capacity and computational cost. Assuming we are provided with labels for a subset of the training data, we attempt to learn a good feature representation via self-supervision and simultaneously train a good linear classifier on the so-obtained representation (using the provided labels). (Note that an even simpler approach would be to first learn the representation via self-supervision and subsequently the linear classifier, but we observed that learning the representation and classifier simultaneously leads to better results.) More formally, we minimize the loss

$\min_{\tilde F, c_R, c_L} \; \mathbb{E}_{x \sim p_{\mathrm{data}}} \, \mathbb{E}_{r \sim \mathcal{R}} \big[ -\log c_R(r \mid \tilde F(x^r)) \big] + \gamma \, \mathbb{E}_{(x, y) \sim p^{\mathrm{lab}}_{\mathrm{data}}} \big[ -\log c_L(y \mid \tilde F(x)) \big], \qquad (2)$

where $c_R$ and $c_L$ are linear classifiers predicting the rotation angle $r$ and the label $y$, respectively, and $\gamma$ balances the loss terms. The first term in (2) corresponds to the self-supervision loss from (1) and the second term to a (semi-supervised) cross-entropy loss. During training, the latter expectation is replaced by the empirical average over the subset of labeled training examples, whereas the former is set to the empirical average over the entire training set (this convention is followed throughout the paper). After we obtain $\tilde F$ and $c_L$, we proceed with GAN training where we label the real images as $c_L(\tilde F(x))$. In particular, we alternately minimize the same generator and discriminator losses as for Clustering, except that we use $\tilde F$ and $c_L$ obtained by minimizing (2):

$\mathcal{L}_D = \mathbb{E}_{x \sim p_{\mathrm{data}}} \big[ \max\!\big(0,\, 1 - D(x, c_L(\tilde F(x)))\big) \big] + \mathbb{E}_{(z, y) \sim p(z, y)} \big[ \max\!\big(0,\, 1 + D(G(z, y), y)\big) \big],$
$\mathcal{L}_G = - \, \mathbb{E}_{(z, y) \sim p(z, y)} \big[ D(G(z, y), y) \big],$

where $p(z, y) = p(z)\,p(y)$, with $p(z)$ the latent prior and $p(y)$ uniform categorical. We use the abbreviation S2GAN for this method.
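For illustration, here is a minimal NumPy sketch of the combined objective in (2), operating directly on pre-computed logits; the helper names and the example value of gamma are ours and do not reflect the tuned setting.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy between integer targets and rows of logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def s2_pretraining_loss(rot_logits_all, rot_targets_all,
                        cls_logits_labeled, cls_targets_labeled, gamma=0.5):
    """Rotation term over the full batch plus a gamma-weighted label term over
    the labeled subset, mirroring the structure of Eq. (2). The logit arrays
    stand for c_R(F~(x^r)) and c_L(F~(x)); gamma=0.5 is an illustrative value."""
    return (cross_entropy(rot_logits_all, rot_targets_all)
            + gamma * cross_entropy(cls_logits_labeled, cls_targets_labeled))
```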

3.2 Co-training approach

The main drawback of the pre-trained (transfer-based) methods is that one needs to train a feature extractor via self-supervision and learn an inference mechanism for the labels (a linear classifier or a clustering) before GAN training. In what follows we detail co-training approaches that avoid this two-step procedure and learn to infer label information during GAN training.

Figure 5: S2GAN-CO: during GAN training we learn an auxiliary classifier $c_D$ on the discriminator representation $\tilde D$, based on the labeled real examples, to predict labels for the unlabeled ones. This avoids training a feature extractor and classifier prior to GAN training as in S2GAN.

Unsupervised method  We consider two approaches. In the first one, we completely remove the labels by simply labeling all real and generated examples with the same (single) label and removing the projection layer from the discriminator, i.e., we set the projection term to zero. (Note that this is not necessarily equivalent to replacing class-conditional BatchNorm with standard (unconditional) BatchNorm, as the variant of conditional BatchNorm used in this paper also takes chunks of the latent code as input, besides the label information.) We use the abbreviation Single label for this method. For the second approach, we assign random labels to (unlabeled) real images. While the labels for the real images do not provide any useful signal to the discriminator, the sampled labels could potentially help the generator by providing additional randomness with different statistics than the latent code, as well as additional trainable parameters due to the embedding matrices in class-conditional BatchNorm. Furthermore, the labels for the fake data could facilitate the discrimination as they provide side information about the fake images to the discriminator. We term this method Random label.

Semi-supervised method  When labels are available for a subset of the real data, we train an auxiliary linear classifier $c_D$ directly on the feature representation $\tilde D$ of the discriminator, during GAN training, and use it to predict labels for the unlabeled real images. In this case the discriminator loss takes the form

$\mathcal{L}_D = \mathbb{E}_{(x, y) \sim p^{\mathrm{lab}}_{\mathrm{data}}} \big[ \max\!\big(0,\, 1 - D(x, y)\big) \big] + \lambda \, \mathbb{E}_{(x, y) \sim p^{\mathrm{lab}}_{\mathrm{data}}} \big[ -\log c_D(y \mid \tilde D(x)) \big] + \mathbb{E}_{x \sim p^{\mathrm{unl}}_{\mathrm{data}}} \big[ \max\!\big(0,\, 1 - D(x, c_D(\tilde D(x)))\big) \big] + \mathbb{E}_{(z, y) \sim p(z, y)} \big[ \max\!\big(0,\, 1 + D(G(z, y), y)\big) \big], \qquad (3)$

where the first term corresponds to standard conditional training on the labeled real images, the second term is the cross-entropy loss (with weight $\lambda$) for the auxiliary classifier on the labeled real images, the third term is an unsupervised discriminator loss in which the labels for the unlabeled real images are predicted by $c_D$, and the last term is the standard conditional discriminator loss on the generated data. We use the abbreviation S2GAN-CO for this method. See Figure 5 for an illustration.
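The structure of (3) can be sketched as follows, assuming the hinge loss used elsewhere in the paper and operating on pre-computed discriminator and classifier logits; the function names and the example value of lambda are ours.

```python
import numpy as np

def hinge_d_real(logits):
    """Hinge term for real images: mean of max(0, 1 - D(x, y))."""
    return np.maximum(0.0, 1.0 - logits).mean()

def hinge_d_fake(logits):
    """Hinge term for generated images: mean of max(0, 1 + D(G(z, y), y))."""
    return np.maximum(0.0, 1.0 + logits).mean()

def cross_entropy(logits, targets):
    """Mean cross-entropy between integer targets and rows of logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def s2gan_co_d_loss(d_logits_labeled, aux_logits_labeled, labels,
                    d_logits_unlabeled, d_logits_fake, lam=0.1):
    """Sketch of the four terms described for Eq. (3). The unlabeled-image
    logits are assumed to already condition on labels predicted by c_D;
    lam is an illustrative weight for the auxiliary classifier loss."""
    loss = hinge_d_real(d_logits_labeled)                    # conditional loss, labeled reals
    loss += lam * cross_entropy(aux_logits_labeled, labels)  # auxiliary classifier loss
    loss += hinge_d_real(d_logits_unlabeled)                 # reals with predicted labels
    loss += hinge_d_fake(d_logits_fake)                      # conditional loss, fakes
    return loss
```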

3.3 Self-supervision during GAN training

Figure 6: Self-supervision by rotation prediction during GAN training. In addition to predicting whether the images at its input are real or generated, the discriminator is trained to predict rotations of both rotated real and fake images via an auxiliary linear classifier $c'_R$. This approach was successfully applied by Chen et al. (2019b) to stabilize GAN training. Here we combine it with our pre-trained and co-training approaches, replacing the ground-truth labels with predicted ones.

So far we leveraged self-supervision to either craft good feature representations, or to learn a semi-supervised model (cf. Section 3.1). However, given that the discriminator itself is just a classifier, one may benefit from augmenting this classifier with an auxiliary task, namely self-supervision through rotation prediction. This approach was already explored by Chen et al. (2019b), where it was observed to stabilize GAN training. Here we want to assess its impact when combined with the methods introduced in Sections 3.1 and 3.2. To this end, similarly to the training of $c_R$ in (1) and (2), we train an additional linear classifier $c'_R$ on the discriminator feature representation $\tilde D$ to predict rotations $r$ of the rotated real images $x^r$ and rotated fake images $G(z, y)^r$. The corresponding loss terms added to the discriminator and generator losses are

$\alpha \, \mathbb{E}_{x \sim p_{\mathrm{data}}} \, \mathbb{E}_{r \sim \mathcal{R}} \big[ -\log c'_R(r \mid \tilde D(x^r)) \big] \qquad (4)$

and

$\beta \, \mathbb{E}_{(z, y) \sim p(z, y)} \, \mathbb{E}_{r \sim \mathcal{R}} \big[ -\log c'_R(r \mid \tilde D(G(z, y)^r)) \big], \qquad (5)$

respectively, where $\alpha$ and $\beta$ are weights to balance the loss terms. This approach is illustrated in Figure 6.
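Below is a minimal sketch of these auxiliary terms, assuming image batches stored as NumPy arrays and pre-computed rotation logits; the helper names are ours, and the weight argument stands for the balancing parameters $\alpha$ in (4) and $\beta$ in (5).

```python
import numpy as np

def rotate_batch(images):
    """Return the four 90-degree rotations of a batch of square images
    with shape (N, H, W, C), together with integer rotation targets."""
    rotated = np.concatenate([np.rot90(images, k=k, axes=(1, 2)) for k in range(4)])
    targets = np.repeat(np.arange(4), len(images))
    return rotated, targets

def rotation_aux_loss(rot_logits, rot_targets, weight):
    """Weighted cross-entropy of the auxiliary rotation classifier c'_R;
    applied to rotated real images (weight alpha) it mirrors (4), and applied
    to rotated fake images (weight beta) it mirrors (5)."""
    shifted = rot_logits - rot_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return weight * (-log_probs[np.arange(len(rot_targets)), rot_targets].mean())
```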

4 Experimental setup

Architecture and hyperparameters

GANs are notoriously unstable to train and their performance strongly depends on the capacity of the neural architecture, the optimization hyper-parameters, and appropriate regularization (Lucic et al., 2018; Kurach et al., 2018). We implemented the conditional BigGAN architecture (Brock et al., 2019), which achieves state-of-the-art results on ImageNet. (We dissected the model checkpoints released by Brock et al. (2019) to obtain exact counts of trainable parameters and their dimensions, and match them to byte level; cf. Tables 10 and 11.) We want to emphasize that at this point this methodology is bleeding-edge and successful state-of-the-art methods require careful architecture-level tuning. To foster reproducibility, we meticulously describe this architecture at tensor level in Appendix B and open-source our code at https://github.com/google/compare_gan.

We use exactly the same optimization hyper-parameters as Brock et al. (2019). Specifically, we employ the Adam optimizer with separate learning rates for the generator and the discriminator. We train for 250k generator steps with 2 discriminator iterations before each generator step. The batch size is fixed to 2048, and we use a 120-dimensional latent code (cf. Table 11). We employ spectral normalization in both the generator and the discriminator. In contrast to BigGAN, we do not apply orthogonal regularization, as this was observed to only marginally improve sample quality (cf. Table 1 in Brock et al. (2019)), and we do not use the truncation trick.
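The alternating update schedule described above can be summarized by the following skeleton, a minimal sketch assuming user-supplied update callables and a data-sampling function; optimizer settings and model code are deliberately left out and live in the compare_gan repository.

```python
# Skeleton of the alternating optimization: 250k generator steps,
# 2 discriminator iterations per generator step, batch size 2048.
# `d_update`, `g_update`, and `sample_batch` are placeholder callables.

TOTAL_G_STEPS = 250_000
D_STEPS_PER_G_STEP = 2
BATCH_SIZE = 2048

def train(d_update, g_update, sample_batch):
    for _ in range(TOTAL_G_STEPS):
        for _ in range(D_STEPS_PER_G_STEP):
            d_update(sample_batch(BATCH_SIZE))   # discriminator step
        g_update(sample_batch(BATCH_SIZE))       # generator step
```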

Datasets  We focus primarily on ImageNet, the largest and most diverse image data set commonly used to evaluate GANs. ImageNet contains 1.3M training images and 50k test images, each corresponding to one of 1,000 object classes. We resize the images to 128×128 as done in Miyato & Koyama (2018) and Zhang et al. (2018). Partially labeled data sets for the semi-supervised approaches are obtained by randomly selecting the desired fraction of samples from each class.

Evaluation metrics

We use the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Inception Score (IS) (Salimans et al., 2016) to evaluate the quality of the generated samples. To compute the FID, the real data and generated samples are first embedded into a specific layer of a pre-trained Inception network. Then, a multivariate Gaussian is fit to each set of embeddings and the distance is computed as $\mathrm{FID} = \lVert \mu_r - \mu_f \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_f - 2 (\Sigma_r \Sigma_f)^{1/2}\big)$, where $\mu$ and $\Sigma$ denote the empirical mean and covariance, and the subscripts $r$ and $f$ denote the real and generated (fake) data, respectively. FID was shown to be sensitive to both the addition of spurious modes and to mode dropping (Sajjadi et al., 2018; Lucic et al., 2018). The Inception Score posits that the conditional label distribution $p(y \mid x)$ of samples containing meaningful objects should have low entropy, while the variability of the samples should be high, leading to the formulation $\mathrm{IS} = \exp\big(\mathbb{E}_{x \sim p_G}\big[ D_{\mathrm{KL}}(p(y \mid x) \,\Vert\, p(y)) \big]\big)$. Although it has some flaws (Barratt & Sharma, 2018), we report it to enable comparison with existing methods. Following Brock et al. (2019), the FID is computed using the 50k ImageNet testing images and 50k randomly sampled fake images, and the IS is computed from 50k randomly sampled fake images. All metrics are computed for 5 different randomly sampled sets of fake images and we report the mean.
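Given the Inception-embedding statistics, the FID formula above reduces to a few lines; this is a generic sketch of the metric, not the exact evaluation code used in the experiments.

```python
import numpy as np
from scipy import linalg

def fid(mu_r, cov_r, mu_f, cov_f):
    """Frechet distance between the Gaussians fitted to real (r) and fake (f)
    Inception embeddings: ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2})."""
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # drop tiny imaginary parts introduced by sqrtm
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```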

Methods  We conduct an extensive comparison of the methods detailed in Table 1, namely: unmodified BigGAN, the unsupervised methods Single label, Random label, and Clustering, and the semi-supervised methods S2GAN and S2GAN-CO. In all S2GAN-CO experiments we use soft labels, i.e., the softmax output of the classifier instead of one-hot encoded hard estimates, as we observed in preliminary experiments that this stabilizes training. For S2GAN we use hard labels by default, but investigate the effect of soft labels in separate experiments. For all semi-supervised methods we have access only to $k\%$ of the ground-truth labels, where $k \in \{5, 10, 20\}$. As an additional baseline, we retain the $k\%$ labeled real images, discard all unlabeled real images, and use the remaining labeled images to train BigGAN (the resulting model is designated BigGAN-$k\%$). Finally, we explore the effect of self-supervision during GAN training on the unsupervised and semi-supervised methods.

We train every model three times with a different random seed and report the median FID and the median IS. With the exception of Single label and BigGAN-$k\%$, the standard deviation across the three runs is very low. We therefore defer tables with the mean FID and IS values and their standard deviations to Appendix D. All models are trained on 128 cores of a Google TPU v3 Pod with BatchNorm statistics synchronized across cores.

Unsupervised approaches  For Clustering we simply use the best available self-supervised rotation model from Kolesnikov et al. (2019). The number of clusters for Clustering is treated as a hyper-parameter and selected by a small sweep (cf. Section 5.1); the other unsupervised approaches do not have hyper-parameters.

Pre-trained and co-training approaches  We employ the wide ResNet-50 v2 architecture with widening factor 16 (Zagoruyko & Komodakis, 2016) for the feature extractor $\tilde F$ in the pre-trained approaches described in Section 3.1.

Method Description
BigGAN Conditional (Brock et al., 2019)
Single label Co-training: Single label
Random label Co-training: Random labels
Clustering Pre-trained: Clustering
BigGAN-k% Drop all but the k% labeled data
S2GAN-CO Co-training: Semi-supervised
S2GAN Pre-trained: Semi-supervised
S3GAN S2GAN with self-supervision
S3GAN-CO S2GAN-CO with self-supervision
Table 1: A short summary of the analyzed methods. The detailed descriptions of pre-training and co-trained approaches can be found in Sections 3.1 and 3.2, respectively. Self-supervision during GAN training is described in Section 3.3.

We optimize the loss in (2) using SGD for 65 epochs. Each batch is composed of unlabeled and labeled examples, with 1536 unlabeled examples per batch. Following the recommendations of Goyal et al. (2017) for training with a large batch size, we (i) scale the learning rate with the batch size, and (ii) use linear learning-rate warm-up during the initial 5 epochs. The learning rate is decayed twice by a factor of 10, at epoch 45 and epoch 55. The parameter $\gamma$ in (2) is tuned on labeled examples held out from the training set. The accuracy of the so-obtained classifier on the ImageNet validation set is reported in Table 3. The parameter $\lambda$ in the loss used for S2GAN-CO in (3) is selected from a small set of candidate values.

Self-supervision during GAN training  For all approaches we use the value of $\beta$ in (5) recommended by Chen et al. (2019b) and do a small sweep for $\alpha$ in (4). For the values tried we do not see a large effect, and we fix a single value of $\alpha$ for S3GAN. For S3GAN-CO we did not repeat the sweep and used the same value of $\alpha$.

5 Results and discussion

Recall that the main goal of this work is to match (or outperform) the fully supervised BigGAN in an unsupervised fashion, or with a small subset of labeled data. In the following, we discuss the advantages and drawbacks of the analyzed approaches with respect to this goal.

As a baseline, our reimplementation of BigGAN obtains an FID of 8.4 and IS of 75.0, and hence reproduces the result reported by Brock et al. (2019) in terms of FID. We observed some differences in training dynamics, which we discuss in detail in Section 5.4.

5.1 Unsupervised approaches

FID IS
Random label 26.5 20.2
Single label 25.3 20.4
Single label (SS) 23.7 22.2
Clustering 23.2 22.7
Clustering (SS) 22.0 23.5
Table 2: Median FID and IS for the unsupervised approaches (see Table 14 in the appendix for mean and standard deviation).

The results for the unsupervised approaches are summarized in Figure 7 and Table 2. The fully unsupervised Random label and Single label models both achieve a similar FID of about 26 and IS of about 20. This is a considerable gap compared to BigGAN and indicates that additional supervision is necessary. We note that one of the three Single label models collapsed, whereas all three Random label models trained stably for 250k generator iterations.

Pre-training a semantic representation using self-supervision and clustering the training data on this representation, as done by Clustering, reduces the FID by about 2 to 3 points and increases the IS by about 2 points (cf. Table 2). These results were obtained with 50 clusters; all other options led to worse results. While this performance is still considerably worse than that of BigGAN, this result is the current state of the art in unsupervised image generation (Chen et al. (2019b) report an FID of 33 for unsupervised generation).

Example images from the clustering are shown in Figures 14, 15, and 16 in the supplementary material. The clustering is clearly meaningful and groups similar objects within the same cluster. Furthermore, the objects generated by Clustering conditionally on a given cluster index reflect the distribution of the training data belonging to the corresponding cluster. On the other hand, we can clearly observe multiple classes being present in the same cluster. This is to be expected when under-clustering the data into only 50 clusters. Interestingly, clustering into many more clusters yields results similar to Single label.

Figure 7: Median FID obtained by our unsupervised approaches. The vertical line indicates the median FID of our BigGAN implementation, which uses labels for all training images. While the gap between unsupervised and fully supervised approaches remains significant, using a pre-trained self-supervised representation (Clustering) improves the sample quality compared to Single label and Random label, leading to a new state of the art in unsupervised generation on ImageNet.

5.2 Semi-supervised approaches

Pre-trained  The S2GAN model, which uses the classifier pre-trained with both a self-supervised and a semi-supervised loss (cf. Section 3.1), suffers only a very minor increase in FID for 5% and 10% labeled real training data, and matches BigGAN both in terms of FID and IS when 20% of the labels are used (cf. Table 4). We stress that this is despite the fact that the classifier used to infer the labels has a relatively low top-1 accuracy for 5%, 10%, and 20% labeled data (cf. Table 3) compared to the original ground-truth labels. The results are shown in Table 4 and Figure 8, and random samples as well as interpolations can be found in Figures 9 to 17 in the supplementary material.

Table 3: Top-1 and top-5 error rates (%) on the ImageNet validation set of the classifier trained using both self- and semi-supervised losses as described in Section 3.1. While these models are clearly not state-of-the-art for the fully supervised ImageNet classification task, the quality of the inferred labels is sufficient to match, and in some cases improve, the state of the art in GAN-based natural image synthesis.
FID (5% / 10% / 20% labels) IS (5% / 10% / 20% labels)
S2GAN 10.8 8.9 8.4 57.6 73.4 77.4
S2GAN-CO 21.8 17.7 13.9 30.0 37.2 49.2
S3GAN 10.4 8.0 7.7 59.6 78.7 83.1
S3GAN-CO 20.2 16.6 12.7 31.0 38.5 53.1
Table 4: Pre-trained vs co-training approaches, and the effect of self-supervision during GAN training (see Table 12 in the appendix for mean and standard deviation). While co-training approaches outperform fully unsupervised approaches, they are clearly outperformed by the pre-trained approaches. Self-supervision during GAN training helps in all cases.

Co-trained

The results for our co-trained model S2GAN-CO, which trains a linear classifier in semi-supervised fashion on top of the discriminator representation during GAN training (cf. Section 3.2), are shown in Table 4. It can be seen that S2GAN-CO outperforms all fully unsupervised approaches for all considered label percentages. While the gap between S2GAN-CO with 5% labels and Clustering in terms of FID is small, S2GAN-CO has a considerably larger IS. When using 20% labeled training examples, S2GAN-CO obtains an FID of 13.9 and an IS of 49.2, which is remarkably close to BigGAN and S2GAN given the simplicity of the S2GAN-CO approach. As the percentage of labels decreases, the gap between S2GAN and S2GAN-CO increases.

Interestingly, S2GAN-CO does not seem to train less stably than S2GAN, even though it is forced to learn the classifier during GAN training. This is particularly remarkable as the BigGAN-k% baselines, in which we retain only the labeled data for training and discard all unlabeled data, are very unstable and collapse after 60k to 120k iterations, for all three random seeds and for both label percentages considered.

5.3 Self-supervision during GAN training

So far we have seen that the pre-trained semi-supervised approach, namely S2GAN, is able to achieve state-of-the-art performance with 20% labeled data. Here we investigate whether self-supervision during GAN training, as described in Section 3.3, can lead to further improvements. Table 4 and Figure 8 show the experimental results for S3GAN, namely S2GAN coupled with self-supervision in the discriminator.

Self-supervision leads to a reduction in FID and an increase in IS across all considered settings. In particular, we can match the state-of-the-art BigGAN with only 10% of the labels and outperform it using 20% of the labels, both in terms of FID and IS.

For S3GAN, the improvements in FID due to self-supervision during GAN training are considerable, between 0.4 and 0.9 FID points (cf. Table 4). Tuning the parameter $\alpha$ of the discriminator self-supervision loss in (4) did not dramatically increase the benefits of self-supervision during GAN training, at least for the range of values considered. As shown in Tables 2 and 4, self-supervision during GAN training (with default parameters) also leads to improvements of roughly 1 to 2 FID points for both S2GAN-CO and Single label. In summary, self-supervision during GAN training with default parameters leads to a stable improvement across all approaches.

Figure 8: The vertical line indicates the median FID of our BigGAN implementation, which uses all labeled data. The proposed S3GAN approach is able to match the performance of the state-of-the-art BigGAN model using 10% of the ground-truth labels and outperforms it using 20%.

5.4 Other insights

Effect of soft labels

A design choice available to practitioners is whether to use hard labels (i.e., the argmax over the logits) or soft labels (the softmax over the logits) for S2GAN and S3GAN (recall that we use soft labels by default for S2GAN-CO and S3GAN-CO). Our initial expectation was that soft labels should help when very little labeled data is available, as soft labels carry more information which can potentially be exploited by the projection discriminator. Surprisingly, the results presented in Table 5 clearly show that the opposite is true. Our current hypothesis is that this is due to the way labels are incorporated in the projection discriminator, but we do not have empirical evidence yet.

Optimization dynamics  Brock et al. (2019) report the FID and IS of the model just before the collapse, which can be seen as a form of early stopping. In contrast, we manage to stably train the proposed models for 250k generator iterations. In particular, we also observe stable training for our vanilla BigGAN implementation. The evolution of the FID and IS as a function of the training steps is shown in Figure 21 in the appendix. At this point we can only speculate about the origin of this difference.

FID (5% / 10% / 20% labels) IS (5% / 10% / 20% labels)
S2GAN 10.8 8.9 8.4 57.6 73.4 77.4
+soft 15.4 12.9 10.4 40.3 49.8 62.1
Table 5: Training with hard (predicted) labels leads to better models than training with soft (predicted) labels (see Table 13 in the appendix for mean and standard deviation).

Higher resolution and going below 5% labels  Training these models at higher resolution becomes computationally harder and necessitates tuning the learning rate. We trained several S3GAN models at higher resolution and show the resulting samples in Figures 12 and 13 and interpolations in Figures 19 and 20. We also conducted S3GAN experiments in which fewer than 5% of the labels are used and observed that, even with such a small number of labeled samples, one can significantly outperform the unsupervised approaches (cf. Figure 7).

6 Conclusion and future work

In this work we investigated several avenues to reduce the appetite for labeled data in state-of-the-art generative adversarial networks. We showed that recent advances in self- and semi-supervised learning can be used to achieve a new state of the art, both for unsupervised and supervised natural image synthesis.

We believe that this is a great first step towards the ultimate goal of few-shot high-fidelity image synthesis. There are several important directions for future work: (i) investigating the applicability of these techniques to even larger and more diverse data sets, (ii) investigating the impact of other self- and semi-supervised approaches on model quality, and (iii) investigating the impact of self-supervision in other deep generative models. Finally, we would like to emphasize that further progress might be hindered by the engineering challenges related to training large-scale generative adversarial networks. To help alleviate this issue and to foster reproducibility, we have open-sourced all the code used for the experiments.

Acknowledgments

We would like to thank Ting Chen and Neil Houlsby for fruitful discussions on self-supervision and its application to GANs. We would like to thank Lucas Beyer, Alexander Kolesnikov, and Avital Oliver for helpful discussions on self-supervised semi-supervised learning. We would like to thank Karol Kurach and Marcin Michalski for their major contributions to the Compare GAN library. We would also like to thank the BigGAN team (Andy Brock, Jeff Donahue, and Karen Simonyan) for their insights into training GANs on TPUs. Finally, we are grateful for the support of members of the Google Brain team in Zurich.

References

  • Agrawal et al. (2015) Agrawal, P., Carreira, J., and Malik, J. Learning to see by moving. In International Conference on Computer Vision, 2015.
  • Arbel et al. (2018) Arbel, M., Sutherland, D., Bińkowski, M. a., and Gretton, A. On gradient regularizers for mmd gans. In Advances in Neural Information Processing Systems. 2018.
  • Barratt & Sharma (2018) Barratt, S. and Sharma, R. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
  • Beyer et al. (2019) Beyer, L., Kolesnikov, A., Oliver, A., Xiaohua, Z., and Gelly, S. Self-supervised Semi-supervised Learning. In Manuscript in preparation, 2019.
  • Brock et al. (2019) Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
  • Caron et al. (2018) Caron, M., Bojanowski, P., Joulin, A., and Douze, M. Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision, 2018.
  • Chen et al. (2019a) Chen, T., Lucic, M., Houlsby, N., and Gelly, S. On self modulation for generative adversarial networks. In International Conference on Learning Representations, 2019a.
  • Chen et al. (2019b) Chen, T., Zhai, X., Ritter, M., Lucic, M., and Houlsby, N. Self-Supervised GANs via Auxiliary Rotation Loss. In Computer Vision and Pattern Recognition, 2019b.
  • De Vries et al. (2017) De Vries, H., Strub, F., Mary, J., Larochelle, H., Pietquin, O., and Courville, A. C. Modulating early visual processing by language. In Advances in Neural Information Processing Systems, 2017.
  • Deng et al. (2017) Deng, Z., Zhang, H., Liang, X., Yang, L., Xu, S., Zhu, J., and Xing, E. P. Structured Generative Adversarial Networks. In Advances in Neural Information Processing Systems, 2017.
  • Doersch et al. (2015) Doersch, C., Gupta, A., and Efros, A. A. Unsupervised visual representation learning by context prediction. In International Conference on Computer Vision, 2015.
  • Dumoulin et al. (2017) Dumoulin, V., Shlens, J., and Kudlur, M. A learned representation for artistic style. In International Conference on Learning Representations, 2017.
  • Gan et al. (2017) Gan, Z., Chen, L., Wang, W., Pu, Y., Zhang, Y., Liu, H., Li, C., and Carin, L. Triangle generative adversarial networks. In Advances in Neural Information Processing Systems, 2017.
  • Gidaris et al. (2018) Gidaris, S., Singh, P., and Komodakis, N. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, 2018.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
  • Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
  • Jang et al. (2018) Jang, E., Devin, C., Vanhoucke, V., and Levine, S. Grasp2Vec: Learning Object Representations from Self-Supervised Grasping. In Conference on Robot Learning, 2018.
  • Kalchbrenner et al. (2017) Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., and Kavukcuoglu, K. Video pixel networks. In International Conference on Machine Learning, 2017.
  • Karras et al. (2018) Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
  • Kolesnikov et al. (2019) Kolesnikov, A., Zhai, X., and Beyer, L. Revisiting Self-supervised Visual Representation Learning. In Computer Vision and Pattern Recognition, 2019.
  • Kurach et al. (2018) Kurach, K., Lucic, M., Zhai, X., Michalski, M., and Gelly, S. The GAN Landscape: Losses, architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.
  • Lee et al. (2017) Lee, H.-Y., Huang, J.-B., Singh, M., and Yang, M.-H. Unsupervised representation learning by sorting sequences. In International Conference on Computer Vision, 2017.
  • Li et al. (2017) Li, C., Xu, T., Zhu, J., and Zhang, B. Triple generative adversarial nets. In Advances in Neural Information Processing Systems. 2017.
  • Lucic et al. (2018) Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs Created Equal? A Large-scale Study. In Advances in Neural Information Processing Systems, 2018.
  • Menick & Kalchbrenner (2019) Menick, J. and Kalchbrenner, N. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In International Conference on Learning Representations, 2019.
  • Miyato & Koyama (2018) Miyato, T. and Koyama, M. cgans with projection discriminator. In International Conference on Learning Representations, 2018.
  • Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. International Conference on Learning Representations, 2018.
  • Mundhenk et al. (2018) Mundhenk, T. N., Ho, D., and Chen, B. Y. Improvements to context based self-supervised learning. In Computer Vision and Pattern Recognition, 2018.
  • Noroozi & Favaro (2016) Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, 2016.
  • Odena (2016) Odena, A. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.
  • Odena et al. (2017) Odena, A., Olah, C., and Shlens, J. Conditional Image Synthesis with Auxiliary Classifier GANs. In International Conference on Machine Learning, 2017.
  • Pinto & Gupta (2016) Pinto, L. and Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In IEEE International Conference on Robotics and Automation, 2016.
  • Sajjadi et al. (2018) Sajjadi, M. S., Bachem, O., Lucic, M., Bousquet, O., and Gelly, S. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, 2018.
  • Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.
  • Sculley (2010) Sculley, D. Web-scale k-means clustering. In International Conference on World Wide Web. ACM, 2010.
  • Springenberg (2016) Springenberg, J. T. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In International Conference on Learning Representations, 2016.
  • Sricharan et al. (2017) Sricharan, K., Bala, R., Shreve, M., Ding, H., Saketh, K., and Sun, J. Semi-supervised conditional GANs. arXiv preprint arXiv:1708.05789, 2017.
  • Van Den Oord et al. (2016) Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. 2016.
  • Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. British Machine Vision Conference, 2016.
  • Zhang et al. (2018) Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-Attention Generative Adversarial Networks. arXiv preprint arXiv:1805.08318, 2018.

Appendix A Additional samples and interpolations

Figure 9: Samples obtained from S3GAN (20% labels) when interpolating in the latent space (left to right).

Figure 10: Samples obtained from S3GAN (20% labels) when interpolating in the latent space (left to right).

Figure 11: Samples obtained from S3GAN (20% labels) when interpolating in the latent space (left to right).

Figure 12: Samples obtained from S3GAN (10% labels) when interpolating in the latent space (left to right).

Figure 13: Samples obtained from S3GAN (10% labels) when interpolating in the latent space (left to right).
Real images. Generated images.
Figure 14: Real and generated images for one of the 50 clusters produced by Clustering. Both real and generated images show mostly underwater scenes.
Real images. Generated images.
Figure 15: Real and generated images for one of the 50 clusters produced by Clustering. Both real and generated images show mostly outdoor scenes featuring different animals.
Real images. Generated images.
Figure 16: Real and generated images for one of the 50 clusters produced by Clustering. In contrast to the examples shown in Figures 14 and 15, the clusters show diverse indoor and outdoor scenes.

Figure 17: Samples generated by S3GAN (20% labels) for a single class. The model captures the great diversity within the class. Human faces and more dynamic scenes present challenges.

Figure 18: Samples generated by S3GAN (20% labels) for different classes. The model correctly learns the different classes and we did not observe class leakage.
Figure 19: Samples generated by S3GAN (10% labels) for a single class. The model captures the diversity within the class.
Figure 20: Samples generated by S3GAN (10% labels) for a single class. The model captures the diversity within the class.

Appendix B Architectural details

The ResNet architecture, implemented following Brock et al. (2019), is described in Tables 6 and 7. We use the abbreviations RS for resample and BN for batch normalization. In the resample column, we indicate the downscale (D) / upscale (U) / none (-) setting. In Table 7, the final layer combines the label embedding with the pre-logit output of the preceding layer. Tables 8 and 9 show the ResBlock details. The addition layer merges the shortcut path and the convolution path by adding them. The ResBlock shapes are stated in terms of the input height and width and the input and output channel counts. For the last ResBlock in the discriminator, which does not resample, we simply drop the shortcut layer. We list all the trainable variables and their shapes in Tables 10 and 11.

Layer RS Output
-
Dense -
ResBlock U
ResBlock U
ResBlock U
ResBlock U
Non-local block -
ResBlock U

BN, ReLU -
Conv -
Tanh -
Table 6: ResNet generator architecture. “ch” represents the channel width multiplier and is set to 96 (cf. Table 11).
Layer RS Output
Input image -
ResBlock D
Non-local block -
ResBlock D
ResBlock D
ResBlock D
ResBlock D
ResBlock -
ReLU -
Global sum pooling -
Sum(embed())+(dense) -
Table 7: ResNet discriminator architecture. “ch” represents the channel width multiplier and is set to 96 (cf. Table 10).
Layer Kernel RS Output
Shortcut D
BN, ReLU - -
Conv -
BN, ReLU - -
Conv D
Addition - -
Table 8: ResBlock discriminator.
Layer Kernel RS Output
Shortcut U
BN, ReLU - -
Conv U
BN, ReLU - -
Conv -
Addition - -
Table 9: ResBlock generator.
Name Shape Size
discriminator/B1/same_conv1/kernel:0 (3, 3, 3, 96) 2,592
discriminator/B1/same_conv1/bias:0 (96,) 96
discriminator/B1/down_conv2/kernel:0 (3, 3, 96, 96) 82,944
discriminator/B1/down_conv2/bias:0 (96,) 96
discriminator/B1/down_conv_shortcut/kernel:0 (1, 1, 3, 96) 288
discriminator/B1/down_conv_shortcut/bias:0 (96,) 96
discriminator/non_local_block/conv2d_theta/kernel:0 (1, 1, 96, 12) 1,152
discriminator/non_local_block/conv2d_phi/kernel:0 (1, 1, 96, 12) 1,152
discriminator/non_local_block/conv2d_g/kernel:0 (1, 1, 96, 48) 4,608
discriminator/non_local_block/sigma:0 () 1
discriminator/non_local_block/conv2d_attn_g/kernel:0 (1, 1, 48, 96) 4,608
discriminator/B2/same_conv1/kernel:0 (3, 3, 96, 192) 165,888
discriminator/B2/same_conv1/bias:0 (192,) 192
discriminator/B2/down_conv2/kernel:0 (3, 3, 192, 192) 331,776
discriminator/B2/down_conv2/bias:0 (192,) 192
discriminator/B2/down_conv_shortcut/kernel:0 (1, 1, 96, 192) 18,432
discriminator/B2/down_conv_shortcut/bias:0 (192,) 192
discriminator/B3/same_conv1/kernel:0 (3, 3, 192, 384) 663,552
discriminator/B3/same_conv1/bias:0 (384,) 384
discriminator/B3/down_conv2/kernel:0 (3, 3, 384, 384) 1,327,104
discriminator/B3/down_conv2/bias:0 (384,) 384
discriminator/B3/down_conv_shortcut/kernel:0 (1, 1, 192, 384) 73,728
discriminator/B3/down_conv_shortcut/bias:0 (384,) 384
discriminator/B4/same_conv1/kernel:0 (3, 3, 384, 768) 2,654,208
discriminator/B4/same_conv1/bias:0 (768,) 768
discriminator/B4/down_conv2/kernel:0 (3, 3, 768, 768) 5,308,416
discriminator/B4/down_conv2/bias:0 (768,) 768
discriminator/B4/down_conv_shortcut/kernel:0 (1, 1, 384, 768) 294,912
discriminator/B4/down_conv_shortcut/bias:0 (768,) 768
discriminator/B5/same_conv1/kernel:0 (3, 3, 768, 1536) 10,616,832
discriminator/B5/same_conv1/bias:0 (1536,) 1,536
discriminator/B5/down_conv2/kernel:0 (3, 3, 1536, 1536) 21,233,664
discriminator/B5/down_conv2/bias:0 (1536,) 1,536
discriminator/B5/down_conv_shortcut/kernel:0 (1, 1, 768, 1536) 1,179,648
discriminator/B5/down_conv_shortcut/bias:0 (1536,) 1,536
discriminator/B6/same_conv1/kernel:0 (3, 3, 1536, 1536) 21,233,664
discriminator/B6/same_conv1/bias:0 (1536,) 1,536
discriminator/B6/same_conv2/kernel:0 (3, 3, 1536, 1536) 21,233,664
discriminator/B6/same_conv2/bias:0 (1536,) 1,536
discriminator/final_fc/kernel:0 (1536, 1) 1,536
discriminator/final_fc/bias:0 (1,) 1
discriminator_projection/kernel:0 (1000, 1536) 1,536,000
Table 10: Tensor-level description of the discriminator containing a total of 87,982,370 parameters.
Name Shape Size
generator/embed_y/kernel:0 (1000, 128) 128,000
generator/fc_noise/kernel:0 (20, 24576) 491,520
generator/fc_noise/bias:0 (24576,) 24,576
generator/B1/bn1/condition/gamma/kernel:0 (148, 1536) 227,328
generator/B1/bn1/condition/beta/kernel:0 (148, 1536) 227,328
generator/B1/up_conv1/kernel:0 (3, 3, 1536, 1536) 21,233,664
generator/B1/up_conv1/bias:0 (1536,) 1,536
generator/B1/bn2/condition/gamma/kernel:0 (148, 1536) 227,328
generator/B1/bn2/condition/beta/kernel:0 (148, 1536) 227,328
generator/B1/same_conv2/kernel:0 (3, 3, 1536, 1536) 21,233,664
generator/B1/same_conv2/bias:0 (1536,) 1,536
generator/B1/up_conv_shortcut/kernel:0 (1, 1, 1536, 1536) 2,359,296
generator/B1/up_conv_shortcut/bias:0 (1536,) 1,536
generator/B2/bn1/condition/gamma/kernel:0 (148, 1536) 227,328
generator/B2/bn1/condition/beta/kernel:0 (148, 1536) 227,328
generator/B2/up_conv1/kernel:0 (3, 3, 1536, 768) 10,616,832
generator/B2/up_conv1/bias:0 (768,) 768
generator/B2/bn2/condition/gamma/kernel:0 (148, 768) 113,664
generator/B2/bn2/condition/beta/kernel:0 (148, 768) 113,664
generator/B2/same_conv2/kernel:0 (3, 3, 768, 768) 5,308,416
generator/B2/same_conv2/bias:0 (768,) 768
generator/B2/up_conv_shortcut/kernel:0 (1, 1, 1536, 768) 1,179,648
generator/B2/up_conv_shortcut/bias:0 (768,) 768
generator/B3/bn1/condition/gamma/kernel:0 (148, 768) 113,664
generator/B3/bn1/condition/beta/kernel:0 (148, 768) 113,664
generator/B3/up_conv1/kernel:0 (3, 3, 768, 384) 2,654,208
generator/B3/up_conv1/bias:0 (384,) 384
generator/B3/bn2/condition/gamma/kernel:0 (148, 384) 56,832
generator/B3/bn2/condition/beta/kernel:0 (148, 384) 56,832
generator/B3/same_conv2/kernel:0 (3, 3, 384, 384) 1,327,104
generator/B3/same_conv2/bias:0 (384,) 384
generator/B3/up_conv_shortcut/kernel:0 (1, 1, 768, 384) 294,912
generator/B3/up_conv_shortcut/bias:0 (384,) 384
generator/B4/bn1/condition/gamma/kernel:0 (148, 384) 56,832
generator/B4/bn1/condition/beta/kernel:0 (148, 384) 56,832
generator/B4/up_conv1/kernel:0 (3, 3, 384, 192) 663,552
generator/B4/up_conv1/bias:0 (192,) 192
generator/B4/bn2/condition/gamma/kernel:0 (148, 192) 28,416
generator/B4/bn2/condition/beta/kernel:0 (148, 192) 28,416
generator/B4/same_conv2/kernel:0 (3, 3, 192, 192) 331,776
generator/B4/same_conv2/bias:0 (192,) 192
generator/B4/up_conv_shortcut/kernel:0 (1, 1, 384, 192) 73,728
generator/B4/up_conv_shortcut/bias:0 (192,) 192
generator/non_local_block/conv2d_theta/kernel:0 (1, 1, 192, 24) 4,608
generator/non_local_block/conv2d_phi/kernel:0 (1, 1, 192, 24) 4,608
generator/non_local_block/conv2d_g/kernel:0 (1, 1, 192, 96) 18,432
generator/non_local_block/sigma:0 () 1
generator/non_local_block/conv2d_attn_g/kernel:0 (1, 1, 96, 192) 18,432
generator/B5/bn1/condition/gamma/kernel:0 (148, 192) 28,416
generator/B5/bn1/condition/beta/kernel:0 (148, 192) 28,416
generator/B5/up_conv1/kernel:0 (3, 3, 192, 96) 165,888
generator/B5/up_conv1/bias:0 (96,) 96
generator/B5/bn2/condition/gamma/kernel:0 (148, 96) 14,208
generator/B5/bn2/condition/beta/kernel:0 (148, 96) 14,208
generator/B5/same_conv2/kernel:0 (3, 3, 96, 96) 82,944
generator/B5/same_conv2/bias:0 (96,) 96
generator/B5/up_conv_shortcut/kernel:0 (1, 1, 192, 96) 18,432
generator/B5/up_conv_shortcut/bias:0 (96,) 96
generator/final_norm/gamma:0 (96,) 96
generator/final_norm/beta:0 (96,) 96
generator/final_conv/kernel:0 (3, 3, 96, 3) 2,592
generator/final_conv/bias:0 (3,) 3
Table 11: Tensor-level description of the generator containing a total of 70,433,988 parameters.

Appendix C FID and IS training curves

Figure 21: Mean FID and IS (3 runs) on ImageNet for the models considered in this paper, as a function of the number of generator steps. All models train stably, except Single label (where one run collapsed).

Appendix D FID and IS: Mean and standard deviations

FID (5% / 10% / 20% labels) IS (5% / 10% / 20% labels)
S2GAN 11.0±0.31 9.0±0.30 8.4±0.02 57.6±0.86 72.9±1.41 77.7±1.24
S2GAN-CO 21.6±0.64 17.6±0.27 13.8±0.48 29.8±0.21 37.1±0.54 50.1±1.45
S3GAN 10.3±0.16 8.1±0.14 7.8±0.20 59.9±0.74 78.3±1.08 82.1±1.89
S3GAN-CO 20.2±0.14 16.5±0.12 12.8±0.51 31.1±0.18 38.7±0.36 52.7±1.08
Table 12: Pre-trained vs co-training approaches, and the effect of self-supervision during GAN training. While co-training approaches outperform fully unsupervised approaches, they are clearly outperformed by the pre-trained approaches. Self-supervision during GAN training helps in all cases.
FID (5% / 10% / 20% labels) IS (5% / 10% / 20% labels)
S2GAN 11.0±0.31 9.0±0.30 8.4±0.02 57.6±0.86 72.9±1.41 77.7±1.24
S2GAN Soft 15.6±0.58 13.3±1.71 11.3±1.42 40.1±0.97 49.3±4.67 58.5±5.84
Table 13: Training with hard (predicted) labels leads to better models than training with soft (predicted) labels.
FID IS
Clustering 22.7±0.80 22.8±0.42
Clustering (SS) 21.9±0.08 23.6±0.19
Random label 27.2±1.46 20.2±0.33
Single label 71.7±66.32 15.4±7.57
Single label (SS) 23.6±0.14 22.2±0.10
Table 14: Mean FID and IS for the unsupervised approaches.