An Improved Self-supervised GAN via Adversarial Training

05/14/2019 ∙ by Ngoc-Trung Tran, et al. ∙ 0

We propose to improve unconditional Generative Adversarial Networks (GAN) by training self-supervised learning with the adversarial process. In particular, we apply self-supervised learning via geometric transformations of input images and assign pseudo-labels to the transformed images. (i) In addition to the GAN task, which distinguishes data (real) from generated (fake) samples, we train the discriminator to predict the correct pseudo-labels of real transformed samples (classification task). Importantly, we find that simultaneously training the discriminator to separate a fake class from the pseudo-classes of real samples in the classification task improves the discriminator and subsequently provides better guidance for training the generator. (ii) The generator is trained to confuse the discriminator not only on the GAN task but also on the classification task. For the classification task, the generator tries to make the discriminator recognize the transformations of its outputs as one of the real transformed classes. In particular, we find that when the generator creates samples whose cross-entropy loss is similar to that of the real ones, training is more stable and the generator distribution tends to match the data distribution better. When integrated into a state-of-the-art Auto-Encoder (AE) based GAN model, our techniques significantly boost the model's performance and establish new state-of-the-art Fréchet Inception Distance (FID) scores among unconditional GANs on the CIFAR-10 and STL-10 datasets.




1 Introduction

Generative Adversarial Networks (GAN) [13] have become the most popular approach to training generative models. They attract much attention from the community because they generate visually appealing samples without requiring an explicit analytic form of the objective function. The idea behind GAN is to use a binary classifier, the so-called discriminator. The discriminator learns to distinguish data (real) from generated (fake) samples and, as a result, represents the data manifold via its scalar scores in the form of a likelihood. Training the generator of a GAN means maximizing the discriminator’s likelihood scores computed over fake samples. In other words, the generator confuses the discriminator into accepting its outputs as real ones. Training a GAN is an adversarial process in which the discriminator and the generator compete with each other to improve themselves. Although GAN is an attractive approach, using the real/fake label to train it is challenging because this supervisory signal is a weak constraint. Hence, the generator can easily cheat the discriminator by, e.g., always creating identical samples that the discriminator nevertheless recognizes with high likelihood. This explains why GAN has many serious issues, such as gradient vanishing and mode collapse [12, 1], which prevent the model from covering all modes of the data distribution. Many variants of GAN have proposed new constraints to overcome this ill-posed problem.

In the literature, many constraints have been proposed for the discriminator. These constraints keep the discriminator’s gradients from vanishing so that the generator can use them to learn and improve itself. Intuitively, they smooth the decision boundary of the discriminator between real and fake samples in order to avoid sharp gradients along this region and to let distant samples contribute more to generator training. Among the most notable regularization techniques are those that enforce Lipschitz conditions [2, 14, 29, 17, 27, 19, 23]. However, these techniques have their own disadvantages, for example, the divergence issue [33] that arises when the regularization becomes too strong towards the end of training. Overcoming this requires a carefully designed training procedure [32, 15].

An alternative and also commonly used constraint is imposed via an auto-encoder. The auto-encoder reconstructs the real samples and hence guides the generator to produce samples resembling the real modes. This increases the chance that the discriminator and the generator compete on many modes of the data distribution. Therefore, it potentially encourages the discriminator to create better gradients, which lead to a better generator. However, the downside of the auto-encoder is blurriness. Although some recent works [18, 31] overcame this problem by using high-level features of the discriminator, the texture and shape of objects in the generated images still do not look realistic.

A recent GAN [6] proposed new constraints via a self-supervised learning strategy [11]. The authors augment new samples via image rotation and assign pseudo-labels to them. In addition to training the discriminator to distinguish real and fake samples, they train it to predict the correct labels of rotated images. They train the generator to minimize the classification loss of the discriminator on the transformations of the generator’s outputs. In other words, they train the generator to create images whose transformed versions are easily assigned the correct pseudo-labels by the discriminator. Although the results are encouraging, the discriminator does not take the generated samples into account for the classification task, and it is not clear how the self-supervised tasks helped to improve GAN in that work. In fact, the generator objective proposed in [6] minimizes the cross-entropy loss, which does not necessarily help to create samples resembling real samples. For example, as in the original GAN, the generator may create collapsed samples that are nevertheless recognized as real by the discriminator with high probability and whose rotated versions are still classified correctly according to their ground-truth pseudo-labels.

In this work, we propose an improved self-supervised GAN, which uses self-supervised learning in an adversarial way. In particular, we first propose to train the discriminator to classify the correct pseudo-labels of real transformed samples (obtained from data samples via geometric transformation) as the classification task. This classification task improves the GAN model when combined with the original GAN task [13], which learns to distinguish data (real) from generated (fake) samples. We then propose two further improvements. (i) We train the discriminator to simultaneously separate the class of generated samples from the pseudo-classes of real samples. We consider this adversarial training for the discriminator. It significantly improves the discriminator and hence the generator and the overall model performance. (ii) In addition to confusing the GAN task, we propose a new generator objective that fools the classification task of the discriminator by creating samples whose transformed versions the discriminator recognizes as real pseudo-classes. Importantly, instead of minimizing the cross-entropy of transformed fake samples as in the previous work, we match the cross-entropy loss computed over fake transformed samples to that of the real transformed ones. We find that this stabilizes training and significantly boosts performance when combined with the adversarial training of the discriminator. We investigate our proposed techniques with the state-of-the-art AE-based GAN model [31]. Although [31] demonstrated that combining auto-encoder and gradient penalty constraints improves the training of GAN and achieves state-of-the-art performance, integrating our techniques further boosts the performance of this baseline model. We see that combining all these kinds of constraints appropriately stabilizes GAN and establishes a new state-of-the-art performance on the CIFAR-10 and STL-10 datasets.

2 Related Work

While training GAN with conditional signals (e.g., class labels) [25, 33, 4] is attaining promising results, training GAN in the unconditional setting is still challenging. In the original GAN [13], a single signal (real or fake) per sample is provided to train the discriminator, which in turn guides the generator. With this signal alone, the generator or discriminator may fall into ill-posed settings, where they easily get stuck at bad local minima while still satisfying the signal constraints. Therefore, many regularizations have been proposed to reduce this problem. The most popular technique is to enforce (or move towards) the Lipschitz condition of the discriminator by weight clipping [1], gradient penalty constraints [14, 29, 17, 27, 19], consensus constraints [22, 21], or spectral normalization [23]. Constraining the discriminator in such ways prevents its gradients from vanishing and avoids a sharp decision boundary between the real and fake classes. Otherwise, because the data points are very sparse in a high-dimensional manifold, a sufficiently powerful discriminator without strong constraints can always find a perfect decision boundary between real and generated data points. This is likely the main cause of the gradient vanishing issues of GAN.

Although regularizations improve the stability of GAN, using a single supervisory signal as in the original GAN [13] still leads to challenging optimization problems. This is because the discriminator scores depend heavily on the generated samples. If the generator collapses to some particular modes of the data distribution, it can only create samples around these modes. Subsequently, there is no competition to train the discriminator around the other modes. As a result, the gradients at these modes may vanish, and it becomes impossible to guide the generator to model the entire data distribution. Using more supervisory signals simplifies the optimization process, for example, self-supervised learning in the form of an auto-encoder. AAE [20] guides the generator towards creating more realistic samples. It is a potential solution that partly prevents the generator from generating identical samples. It steers the generated samples towards real samples to reduce the disjointness between the two distributions and therefore suffers less from over-fitting and gradient vanishing. However, the problem with the auto-encoder is that pixel-wise norm-based reconstruction causes blurriness. VAE/GAN [18], which combines VAE [16] and GAN, suggests a better solution: while the discriminator of GAN enables feature-wise reconstruction to overcome the blur, the VAE constrains the generator better to reduce mode collapse. ALI [10] and BiGAN [9] jointly train the data/latent samples in the GAN framework to put more constraints on the discriminator and the generator. InfoGAN [7] infers a disentangled representation of the latent code by maximizing mutual information. In addition to using feature-wise reconstruction, [31, 30] combine two different types of supervisory signals, namely the real/fake signal and a self-supervised signal in the form of an auto-encoder, which leads to stable convergence and better generated images and prevents the model from mode collapse. Although the feature-wise distance for the auto-encoder often reconstructs sharper images, the reconstructed images still lack realistic detail in textures or shapes.

Recently, self-supervised learning has been getting much attention from the community because it helps to close the gap between supervised and unsupervised models in classification tasks [8, 26, 34, 35, 24, 11]. This technique encourages classifiers to learn better feature representations with pseudo-labels and has also been applied to GAN [6]. However, the use of the self-supervised task in that work simply follows the idea of [11], and it is unclear how the classification task helps the model. Moreover, although using self-supervised learning to train the discriminator is simple, making self-supervised learning effective for the generator is not trivial.

3 Proposed Method

Figure 1: The diagram of our model. E, G, and D are the encoder, the generator, and the discriminator, respectively. The parameters of G are shared between the generator and the decoder of the auto-encoder. The two discriminator branches share parameters except for their heads: one output dimension for the real/fake classes and a set of output dimensions for the pseudo-classes of the geometric transformation. The real image is encoded and decoded into its reconstruction. We show the reconstruction here for clarity; in our implementation we use the features of the discriminator D. The reconstruction is regularized by a constraint as in [31] and is considered a “real” sample when optimizing the discriminator objective. The input is transformed into new samples with their pseudo-labels, and the discriminator is trained to recognize the correct labels and also to separate the fake samples from the K real classes.

In our work, we adopt an auto-encoder based method, Dist-GAN [31], as our baseline model because it has already demonstrated that the combination of gradient penalty and auto-encoder constraints achieves state-of-the-art results for GAN. We discuss adversarial self-supervised learning (short for training self-supervised learning with the adversarial process) for the discriminator and the generator, and how to integrate both into the baseline model. Our model consists of three main components: we use a regularized auto-encoder (consisting of the encoder (E) and the decoder (G)) as in [31], and we propose new objectives for the discriminator (D) and the generator (G) to improve the model. The decoder and the generator share all parameters. In each iteration, we first train the auto-encoder. We then train the discriminator to distinguish real and fake samples (GAN task) and to predict the correct augmented labels (classification task). Finally, we train the generator to match real and fake scores in combination with matching the cross-entropy losses computed over transformations of these samples. Our components and the training algorithm are shown in Fig. 1 and Alg. 1. To highlight our main contributions, we first discuss our proposed discriminator and generator objectives and then recall the regularized auto-encoder.

Initialize the parameters of the discriminator, the encoder, and the generator, and fix the number of iterations.
repeat
    Sample a mini-batch of data samples from the dataset.
    Augment the samples with the image transformation task.
    Sample a mini-batch of noise vectors from the noise distribution.
    Train the auto-encoder according to Eqn. 7.
    Train the discriminator/classifier according to Eqn. 1.
    Train the generator according to Eqn. 4.
until the number of iterations is reached
return the trained parameters
Algorithm 1 Our training algorithm
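The order of updates in Algorithm 1 can be sketched as a small driver loop. This is only an illustration under stated assumptions: the function `run_training`, the step names, and the idea of passing each update as a callable are hypothetical and do not come from the paper; the actual updates (Eqn. 7, Eqn. 1, Eqn. 4) are left to the caller.

```python
from typing import Callable, Dict, List

# Update order taken from Algorithm 1; the step names are hypothetical labels.
UPDATE_ORDER = ("autoencoder", "discriminator", "generator")


def run_training(num_iters: int, steps: Dict[str, Callable[[], None]]) -> List[str]:
    """Call the three update steps once per iteration, in the order of Algorithm 1.

    `steps` maps each name in UPDATE_ORDER to a zero-argument callable that
    performs the corresponding parameter update (assumed to be supplied by
    the caller). Returns the sequence of executed step names.
    """
    history: List[str] = []
    for _ in range(num_iters):
        # A real implementation would sample a data mini-batch, augment it,
        # and sample noise vectors here, before the three updates.
        for name in UPDATE_ORDER:
            steps[name]()
            history.append(name)
    return history
```

Keeping the three updates as separate callables makes it easy to switch one of them off, as done in the ablation studies.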

3.1 Discriminator Objective

Our discriminator objective (Eq. 1) consists of two parts: (i) the GAN objective, which trains the discriminator to distinguish between real and fake samples, and (ii) the classification objective, which trains the classifier to predict the correct labels of the samples augmented via image transformations. The discriminator and the classifier are the same network (shared parameters) except for two different heads: the last fully-connected layer returns one dimension (real or fake) for the discriminator, and the other head returns the dimensions of the pseudo-classes for the classifier. The weight of the classification objective is a constant selected through empirical experiments.


3.1.1 GAN-based Objective

The GAN part of the discriminator objective is written in Eq. 2. It differs from the original GAN objective [13] in that our model treats the reconstructed samples as “real” through a dedicated reconstruction term, so that the gradients from the discriminator do not saturate too quickly. This constraint slows down the convergence of the discriminator and couples it with the convergence of the auto-encoder. It acts as another regularization technique, with a goal similar to that of [3], [23], and [31, 30]. In our method, the reconstruction term receives a small weight in the discriminator objective. We observe that this term is important at the beginning of training. Towards the end, however, especially for complex image datasets, the reconstructed samples are less useful because they may be of lower quality than the real samples. We also observe that, after a certain number of training iterations, most models no longer improve image quality noticeably if training continues with the same weight. From that point on, we decay the weight as a function of the iteration count, measured from the point where the decay starts. In Eq. 2, the expectations are taken over the data distribution and the prior noise distribution and are abbreviated for short. The regularization term, which involves a constant weight and a uniform random number, enforces sufficient gradients from the discriminator to train the generator. For the hinge loss, the corresponding terms of Eq. 2 are replaced by their hinge counterparts.


3.1.2 Classification-based Objective

The second part of the discriminator objective is for the classification task. We apply self-supervised learning techniques to augment samples with geometric transformations and train the discriminator to predict the correct pseudo-labels of these samples. In particular, we apply geometric transformations to the original inputs to create new samples and assign pseudo-labels to the transformed samples. We consider these augmented data samples as the real transformation classes (from the 1st to the K-th class) and, simultaneously, the generated samples as the fake transformation class (the (K+1)-th class). In order to train D as a multi-class classifier, we add another head to D in addition to the conventional real/fake output. It is a fully-connected (FC) layer with soft-max outputs. Therefore, the discriminator can also be called a classifier in this case. The goal in this section is to train the classifier to predict the geometric transformation applied to the image. We train the classifier to distinguish the real classes and the fake class by minimizing the objective of Eq. 3. Note that we do not rotate the generated samples when training the discriminator, because forcing the discriminator to recognize the correct classes of transformed fake samples makes it worse. This is because the generated samples themselves can be very noisy, especially at the beginning of training. In addition, the GAN task and the classification task seem to overlap because both learn to classify the fake samples. It is nevertheless important to keep both, because each has its own responsibility: the GAN task distinguishes real from fake samples to approximate the distribution, and the classification task learns a useful feature representation that improves the GAN task. Indeed, if one of them is removed, the performance gets significantly worse. It is also worth noting that [6] only proposes the first term of our objective (Eq. 3) and does not benefit from generated samples in the training.
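A minimal sketch of the classification objective described above, written in plain Python so that it runs without dependencies. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, labels are 0-indexed (real pseudo-classes 0..K-1, fake class K) whereas the text counts from 1, and the two terms are summed with equal weight because the exact form of Eq. 3 is not available here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable soft-max over one vector of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_batch: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean negative log-probability of the given labels."""
    losses = [-math.log(softmax(l)[y]) for l, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)


def classifier_loss(real_logits, real_labels, fake_logits, num_transforms: int) -> float:
    """(K+1)-way classification loss for the discriminator's second head.

    real_logits: logits for real transformed samples, each of length K + 1.
    real_labels: their pseudo-labels in 0..K-1.
    fake_logits: logits for generated samples (not transformed), length K + 1.
    Generated samples all receive the extra fake label K.
    Equal weighting of the two terms is an assumption of this sketch.
    """
    fake_labels = [num_transforms] * len(fake_logits)
    return cross_entropy(real_logits, real_labels) + cross_entropy(fake_logits, fake_labels)
```

Dropping the second term recovers the setting where only real transformed samples are classified, which is the non-adversarial variant discussed in the text.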


Here, the classifier output is the soft-max predicted probability of the k-th class for a data sample that has been transformed by the selected geometric transformation. Training the classifier to predict the pseudo-labels of the real transformation classes encourages the discriminator to learn a useful feature representation of images and therefore leads to better decisions when distinguishing real and fake samples. In addition, we train the classifier to simultaneously separate the fake transformation class, which is a type of adversarial training like the original GAN [13]. A classifier that learns to tell the fake samples apart from the pseudo-classes of real ones is likely to create better gradients to guide the generator. This is adversarial training because the discriminator and the generator compete on the classification task. It is an important finding of our work, which further improves the baseline model. Note that when we discuss adversarial training in this work, we mean adversarial training for self-supervised learning (the classification task); adversarial training for the GAN task is the default. In our model, a well-trained discriminator/classifier also produces a good feature-wise distance for the reconstruction task (Section 3.3), which yields a better auto-encoder, because we use discriminator features to form the reconstruction objective. Previous experiments [31] showed that on synthetic data, where the reconstruction is nearly perfect, this auto-encoder based model approximates the data distribution well. Because we constrain the discriminator by the reconstruction, a higher-quality reconstruction leads to better quality and convergence of the discriminator, and hence to more realistic generated samples.

3.2 Generator Objective

A recent work [6] proposed a way to integrate the self-supervised technique into GAN via image rotations [11]. However, it is unclear how much the discriminator and the generator each contribute to the reported improvements. Moreover, this technique is not always applicable to other GAN methods. For example, applying the self-supervised technique of [6] to our generator causes our model to diverge and reduces the quality of the generated images (Section 4.1).


In this work, we propose a new generator objective (Eq. 4) consisting of two terms. The first term is the GAN task, which is motivated by [31] and shown in Eq. 5. The intuition behind this term is that the discriminator can model the data manifold through its scalar values. To approximate the data distribution in general, we would match the two manifolds directly. However, this is challenging in high dimensions. Therefore, we indirectly align the distribution of the discriminator scores on real samples with the distribution of the discriminator scores on generated samples.


The second term is the classification task. In [6], the generator aims to create samples for which the discriminator can easily predict the pseudo-labels of the transformed samples. In contrast, our term matches the self-supervised tasks to train the generator. Our intuition is that if the generator distribution is similar to the real distribution, the classification performance on its transformed samples should be similar to that on samples transformed from real data. In other words, if real and fake samples come from similar distributions, the same task applied to real and fake samples should result in similar behavior. In particular, given the cross-entropy loss computed on real samples, we train the generator to create samples that match this loss. We form the cross-entropy loss of multi-class classification as shown in Eq. 6. Here, we train the generator to confuse the classifier into recognizing fake transformed samples with the same performance as it recognizes transformed classes obtained from the real ones. When the classifier learns to distinguish the real from the fake transformation classes, it learns to create good gradients, and the generator benefits from these gradients to learn and to confuse the classifier. This adversarial process is similar to the original GAN [13] but is now applied to multiple classes. The weight of the classification term is a constant selected through empirical experiments, and we use the same norm for both the GAN task and the classification task. In our implementation, we randomly select a geometric transformation for each data sample when training the discriminator. The same transformations are applied to the generated samples when matching the self-supervised tasks to train the generator.
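The two-term generator objective described above can be sketched as follows. This is a hedged illustration rather than the paper's Eq. 4: the function name, the argument names, and the weight `lambda_g` are hypothetical, and the absolute difference is an assumed choice of norm because the norm symbol is missing from the text.

```python
from statistics import fmean
from typing import Sequence


def generator_loss(
    d_real_scores: Sequence[float],
    d_fake_scores: Sequence[float],
    ce_real: float,
    ce_fake: float,
    lambda_g: float,
) -> float:
    """Match discriminator scores and match classification losses.

    d_real_scores / d_fake_scores: discriminator outputs on real and
        generated samples (the GAN task).
    ce_real / ce_fake: cross-entropy of the classifier on transformed real
        and transformed generated samples (the classification task).
    lambda_g: weight of the classification term (hypothetical name).
    The absolute difference is an assumption of this sketch.
    """
    gan_term = abs(fmean(d_real_scores) - fmean(d_fake_scores))
    class_term = abs(ce_real - ce_fake)
    return gan_term + lambda_g * class_term
```

Note the contrast with minimizing `ce_fake` directly: the matching term is zero whenever the classifier behaves the same on real and generated samples, even if the cross-entropy itself is not small.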


3.3 Regularized Auto-encoder

We use the regularized auto-encoder (AE) in our model to prevent the generator from collapsing severely and to guide it towards producing samples that resemble real ones, as shown in recent works [31, 30]. We use an auto-encoder objective function similar to that of [31]:


Eq. 7 is the objective of our regularized AE. The first term is the reconstruction error of a conventional AE. The second term is the distance constraint, similar to [31], which regularizes the mapping from latent samples to data samples. Here, G is the GAN generator (the decoder in the AE), E is the encoder, and the weighting constant is set as suggested by the original work. The features of a sample are computed through the last convolution layer of the discriminator D, and the constraint also depends on the dimension of the latent samples. We re-use the auto-encoder parameters of the original model and focus the analysis on our main contributions discussed in the previous sections (3.1, 3.2).

4 Experimental Results

We conduct experiments to investigate the effectiveness of our proposed adversarial self-supervised learning on the CIFAR-10 and STL-10 datasets. Images of STL-10 are resized as in [23]. We use the DCGAN [28] architecture with the standard “log” loss, and the SN-GAN [23] and ResNet [14] architectures with the “hinge” loss. We use the “hinge” loss for SN-GAN and ResNet because it attains better performance than the standard “log” loss, as shown in [23]. We describe these networks in the supplementary material. In our model, the encoder network is the mirror of the generator network. We measure the diversity and quality of generated samples via the Fréchet Inception Distance (FID) [15]. Unless stated otherwise, FID is computed with 10K real samples and 5K generated samples as in SN-GAN [23]. FID is computed every 10K iterations during training and visualized with a smoothing window of 5. We train our method for 300K iterations and report the FID of the last iteration, except for the standard SN-GAN on CIFAR-10, where we report it at about 120K iterations because continuing the training does not improve the FID. We conduct the ablation studies and tune the parameters on the DCGAN, SN-GAN, and ResNet architectures, and use their best settings to compare with the state-of-the-art methods. Dist-GAN [31] is our main baseline in the ablation studies. We train the models using the Adam optimizer, with one setting of the learning rate and momentum parameters for the DCGAN and SN-GAN architectures and another for the ResNet architecture [14]. The mini-batch size is 64, and the latent dimension is the same in all our experiments.
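For reference, FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception features of real and generated samples. The symbols below (feature means and covariances for real samples, subscript r, and generated samples, subscript g) are introduced here for illustration and are not taken from the text above.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

A smaller value means that the feature statistics of the generated samples are closer to those of the real samples, which is why lower FID is better in the tables and plots of this section.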

4.1 Ablation Study

Figure 2: The ablation study with the DCGAN architecture on the CIFAR-10 dataset. (a) Tuning the weight of the discriminator's classification task (without adversarial training), with the other weight held fixed. (b) Tuning the weight of the generator's classification task with the discriminator weight fixed. We also train the generator with an objective similar to that of SSGAN [6], with and without adversarial training for the discriminator. (c) Our experiment on the combination of the GAN and classification tasks. When we remove the GAN task from our model, or rotate the fake samples when training the discriminator, the model collapses or the quality of the generated samples decreases. Both weights are fixed for this experiment.

First, we aim to find good values for the two weights of our proposed method, one for the discriminator's classification task and one for the generator's. However, estimating both at the same time is expensive. Therefore, we first seek a good weight for the classification task of the discriminator (Eq. 1). We train the classification task of the discriminator with only the real transformed samples, as in [6]. To augment images and their pseudo-labels, we follow the geometric transformation of [11], which is simple but effective and achieved the best performance in self-supervised tasks. In particular, we train the discriminator to recognize the 2D rotations that are applied to the input image. We rotate the input image with K rotations and assign them the pseudo-labels from 1 to K. The generator's weight is held fixed in this experiment. This ablation study uses DCGAN on CIFAR-10. Fig. 2a shows that training the discriminator with the self-supervised learning task stabilizes the baseline and makes the model converge faster. This technique improves the FID score over our baseline. The study suggests a good discriminator weight for the DCGAN architecture.
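The rotation-based augmentation with pseudo-labels can be sketched as below. The concrete choices are assumptions of this sketch, since the number of rotations and the angles are missing from the text above: it uses four rotations in multiples of 90 degrees, represents an image as a nested list of pixel values, and numbers the labels from 0 rather than from 1. The function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]


def rotate90(img: Image) -> Image:
    """Rotate a 2D image by 90 degrees clockwise."""
    # Reversing the row order and transposing turns the left column
    # (read bottom to top) into the new top row.
    return [list(row) for row in zip(*img[::-1])]


def rotations_with_labels(img: Image, num_rotations: int = 4) -> List[Tuple[Image, int]]:
    """Return (rotated image, pseudo-label) pairs.

    Label k means the image was rotated clockwise by k * 90 degrees.
    The default of four rotations is an assumption of this sketch.
    """
    samples: List[Tuple[Image, int]] = []
    current = [list(row) for row in img]
    for k in range(num_rotations):
        samples.append((current, k))
        current = rotate90(current)
    return samples
```

For example, `rotations_with_labels([[1, 2], [3, 4]])` yields the original image with label 0, followed by its 90, 180, and 270 degree rotations with labels 1, 2, and 3. During training, one of these pairs would be picked at random per sample, in line with the random selection described in Section 3.2.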

The second study seeks a good weight for the self-supervised task of the generator (Eq. 4). The experimental conditions are exactly the same as in the first study, except that we fix the best discriminator weight from the previous experiment. We consider the version with the best discriminator weight from the previous study as the self-supervised baseline (SS). First, we investigate the influence of our self-supervised task for the generator on the overall performance. For that, we train the classification task of the discriminator without adversarial training (no fake class), as in the previous experiment, and train the generator in two ways: with an objective similar to that of SSGAN [5] and with our proposed objective in Eq. 6. We carry out the investigation with the DCGAN architecture on the CIFAR-10 dataset, as shown in Fig. 2b. The results show that training our generator with an objective similar to that of [5] causes divergence and does not improve the performance. However, when we use our proposed generator objective (Eq. 4), the performance is better than that of the self-supervised baseline. This confirms the usefulness of our proposed generator objective.

Figure 3: The ablation study with the SN-GAN and ResNet architectures on the CIFAR-10 and STL-10 datasets. The first row shows the tuning of the discriminator weight and the second row the tuning of the generator weight. From the first to the fourth column: SN-GAN on CIFAR-10, ResNet on CIFAR-10, SN-GAN on STL-10, and ResNet on STL-10. The results suggest one discriminator weight for SN-GAN and another for ResNet. SS is the self-supervised baseline (with the best discriminator weight). Adversarial training is applied to the discriminator whenever the generator's classification term is used.

Method | CIFAR-10 | STL-10 | CIFAR-10 (R) | STL-10 (R) | CIFAR-10 (R) (10K-10K)
GAN-GP [23] | 37.7 | - | - | - | -
WGAN-GP [23] | 40.2 | 55.1 | - | - | -
SN-GAN [23] | 25.5 | 43.2 | 21.70 ± .21 | 40.10 ± .50 | 19.73
SS-GAN [6] | - | - | - | - | 15.65
Dist-GAN [31] | 22.95 | 36.19 | 17.61 ± .30 | 28.50 ± .49 | 13.01
GN-GAN [30] | 21.70 | 30.80 | 16.47 ± .28 | - | -
Ours (SS) | 21.40 | 29.79 | 14.97 ± .29 | 27.98 ± .38 | 12.37
Ours (SS + adversarial, G) | 19.05 | 28.70 | 14.75 ± .28 | 28.24 ± .23 | 12.15

Table 1: Comparison of our best FID scores with the state of the art (smaller is better). Methods use the SN-GAN [23] and ResNet (R) [14, 23] architectures. FID scores of SN-GAN, Dist-GAN, and our method are reported with the hinge loss. The performance of the compared methods is taken from [23, 30]. We also compare with SS-GAN [6] using the same 10K-10K FID scores on the CIFAR-10 dataset with ResNet (R). (SS + adversarial, G) denotes our model with adversarial training and the new generator objective.

Third, we want to understand the influence of adversarial training (with a fake class) for the classification task, given the best weights from the previous studies. This experiment also uses the DCGAN architecture on the CIFAR-10 dataset (Fig. 2b). We now train the discriminator with adversarial learning (simultaneously distinguishing the fake class in the classification task). Note that whenever we mention adversarial training in our experiments, we mean using it for the classification task. We first carry out experiments with our proposed generator objective. With adversarial training (with the fake class) for the discriminator, our method improves FID significantly compared with the non-adversarial version. We also try the generator objective of [5]. Although FID improves slightly over the self-supervised baseline, the model with this generator objective still collapses. In contrast, our proposed objective is stable and achieves the best FID among all versions (Fig. 2b). This confirms the importance of combining our adversarial self-supervised learning with our proposed generator objective. The reason why training with a generator objective like that of [5] leads to corruption is perhaps that optimizing it conflicts with the GAN task of our generator: it does not support matching D(x) and D(G(z)) in the first term. This conflict is similar to that of the gradient penalty [14], which may be useful at the beginning but leads to divergence at the end. Intuitively, our new objective (Eq. 6) causes no such conflict, because when the data and generator distributions match, their classification behavior should match as well. This study again supports the hypothesis behind our proposed techniques.

Fourth, the previous experiments showed how the classification task helps to improve the GAN task. Although the GAN task and the classification task seemingly overlap, since both classify the same fake samples, having both tasks in the model is important. For instance, if we remove the GAN task from our model (for both the discriminator and the generator), the model collapses within the first iterations, as shown in Fig. 2c. This means that the GAN task still plays an important role in our model. We also consider adversarial training with a discriminator objective like Eq. 3 in which we now rotate the fake samples and treat these rotated samples as belonging to the fake class. The result in Fig. 2c speaks against rotating the fake samples when training the discriminator, because doing so likely creates noise and degrades the discriminator's learning. We conduct this study with DCGAN on CIFAR-10 in the same experimental setup as the previous studies, with both weights fixed whenever the classification task is used.

Fifth, we also investigate the proposed techniques with other network architectures on the CIFAR-10 and STL-10 datasets. First, we repeat the first experiment for the SN-GAN and ResNet architectures to select their best , as shown in the first row of Fig. 3. The results suggest for SN-GAN and for ResNet. We observe that when the network is more powerful (e.g., ResNet), the best gets smaller. Perhaps a more powerful network is better able to learn good feature representations via the GAN task. In contrast, the smaller networks (DCGAN, SN-GAN) are harder to train and therefore need a larger contribution from the classification task. Then, we study a good for these architectures, as shown in the second row of Fig. 3. Here, we seek for our generator objective in the case where the classifier is trained with the fake class (adversarial training), similar to our third study with the DCGAN architecture. The generator objective significantly boosts the performance (if the choice of is good), especially for SN-GAN on CIFAR-10. Our proposed techniques also reduce the divergence issue, as shown in the first column of Fig. 3. Although the baseline with ResNet achieves almost saturated performance, our techniques are still able to improve this model further. It is worth noting that the FID of our self-supervised baseline (SS) already reaches performance similar to that of SAGAN [33], the state-of-the-art conditional GAN (see the discussion in the supplementary material), so it is hard to obtain a larger improvement even when it is combined with adversarial training and our proposed generator objective. This study again confirms the effectiveness and robustness of our proposed techniques on various architectures. We observe that are good choices for DCGAN and ResNet on CIFAR-10, and is good for SN-GAN on CIFAR-10 and STL-10.
With the combination of adversarial self-supervised learning for the discriminator and our proposed generator objective, our best versions significantly outperform the baseline for various network architectures and datasets.

4.2 Compared to state-of-the-art methods

In this section, we compare the best settings of our proposed method to the state of the art on the benchmark datasets CIFAR-10 and STL-10, as shown in Table 1. We compare results obtained with the SN-GAN [23] and ResNet [14, 23] architectures. As shown in Table 1, our method significantly outperforms the baseline Dist-GAN and other GAN methods, especially on the STL-10 dataset. This confirms the effectiveness of combining the GAN task and the classification task in a single model. It is worth noting that SN-GAN attains its best results at about 100K iterations, yet this model diverges if training continues. A similar observation is discussed in [6]. We also compare to the recent work SSGAN [6], which likewise integrates a self-supervised technique to improve the GAN model. For a fair comparison in this case, we compute FID with 10K real samples and 10K fake samples, as in that work. Our model achieves a much better FID score than SSGAN with the same ResNet architecture on the CIFAR-10 dataset. Fig. 4 shows some examples generated by our model with ResNet architectures on the CIFAR-10 and STL-10 datasets.

Figure 4: Samples generated by our method on the CIFAR-10 and STL-10 datasets with ResNet architectures.
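Since the comparison in this section rests on FID, a short sketch of the underlying Fréchet distance between two Gaussians fitted to feature sets may be helpful. It assumes NumPy is available and that the features have already been extracted (in practice from an Inception network, which is not shown here); the function name and the random stand-in features are illustrative only and do not reproduce the evaluation code used for Table 1.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim), feature_dim >= 2.
    Computes ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^(1/2)).
    """
    mu_a = feats_a.mean(axis=0)
    mu_b = feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of C_a C_b, which are real and non-negative for
    # covariance matrices; tiny negative or imaginary parts are numerical noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for features of real and generated samples (random numbers).
    real_feats = rng.normal(size=(1000, 8))
    fake_feats = rng.normal(loc=0.5, size=(1000, 8))
    print(frechet_distance(real_feats, fake_feats))
```

Because the estimate depends on the number of samples used to fit the means and covariances, scores are only comparable when computed with the same sample counts, which is why the 10K/10K protocol of [6] is matched above.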

5 Conclusion

We propose to train the model with adversarial self-supervised learning. First, we show that self-supervised learning helps to improve the discriminator (self-supervised baseline) and hence enhances the quality of generated images. Then we propose to train the discriminator with adversarial training and to train the generator with a new objective that matches the cross-entropy loss between real and fake samples. The combination of adversarial training (for the discriminator) and cross-entropy matching (for the generator) further boosts the performance of the self-supervised baseline with various network architectures on the CIFAR-10 and STL-10 datasets. The best version of our proposed method significantly outperforms the baseline and establishes new state-of-the-art FID scores on these benchmark datasets. Although we investigated our proposed techniques mainly within an auto-encoder GAN model, we believe that they are orthogonal to, and have the potential to improve, other GAN methods.