1 Introduction
Generative Adversarial Networks (GAN). GANs goodfellownisp2014 have become one of the most important methods for learning generative models. GANs have shown remarkable results in various tasks, such as image generation karrasiclr2018; brockiclr2018; karrascvpr2019, image transformation isolacvpr2017; zhucvpr2017; ledigcvpr2017, text-to-image synthesis reedarxiv2016; zhang2cvpr2017, and anomaly detection schleglipmi2017; limicdm2018. The idea behind GAN is a minimax game: a binary classifier, the so-called discriminator, learns to distinguish real data samples from generated (fake) ones, while the generator is trained to fool the discriminator into classifying generated samples as real. By having the generator and discriminator compete in this adversarial process, both are able to improve, and the end goal is a generator that captures the data distribution. Although considerable improvement has been made for GANs in the conditional setting odenaicml2017; zhangarxiv2018; brockiclr2018, i.e., using ground-truth labels to support learning, the unconditional setup remains very challenging. Fundamentally, using only a single (real/fake) signal to guide the generator toward a high-dimensional, complex data distribution is very difficult goodfellownips2016; arjovskyarxiv2017a; chearxiv2016; chenarxiv2016; metzarxiv2016; salimansnisp2016.

Self-supervised Learning.
Self-supervised learning is an active research area doerschcvpr2015; pathakcvpr2016; zhangeccv2016; zhang1cvpr2017; norooziiccv2017; gidarisiclr2018. It is a paradigm of unsupervised learning in which a classifier is encouraged to learn better feature representations via pseudo-labels. In particular, these methods learn image features by training the model to recognize the geometric transformation that was applied to its input image. A simple yet powerful method proposed in gidarisiclr2018 uses image rotations by 0, 90, 180, and 270 degrees as the geometric transformations; the model is trained on the 4-way classification task of recognizing which of the four rotations was applied. This task is referred to as the self-supervised task. This simple method is able to close the gap between supervised and unsupervised image classification gidarisiclr2018.

Self-supervised Learning for GAN. Recently, self-supervised learning has been applied to GAN training chenarxiv2018; tranarxiv2019. These works propose auxiliary self-supervised classification tasks to assist the main GAN task (Figure 1). In particular, their objective functions for learning the discriminator and generator are multi-task losses, shown in (1) and (2) respectively:
\max_{D,C} \; V(D,G) + \lambda_d \Psi(D,C) \qquad (1)

\min_{G} \; V(D,G) - \lambda_g \Phi(G,C) \qquad (2)

V(D,G) = \mathbb{E}_{x \sim P_d} \log D(x) + \mathbb{E}_{x \sim P_g} \log\big(1 - D(x)\big) \qquad (3)
Here, V(D,G) in (3) is the GAN task, the original value function proposed by Goodfellow et al. goodfellownisp2014. P_d is the true data distribution, and P_g is the distribution induced by the generator mapping. \Psi and \Phi are the self-supervised (SS) tasks for discriminator and generator learning, respectively (details to be discussed). C is the classifier for the self-supervised task, e.g., the rotation classifier of gidarisiclr2018. Based on this framework, Chen et al. chenarxiv2018 apply a self-supervised task to help the discriminator counter catastrophic forgetting. Empirically, they show that the self-supervised task enables the discriminator to learn more stable and improved representations. Tran et al. tranarxiv2019 propose to improve self-supervised learning with adversarial training.
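The rotation-based self-supervised task of gidarisiclr2018 is simple to set up in practice. The sketch below is a minimal NumPy illustration (our own, not the authors' code; the HWC array layout and K = 4 rotation classes are assumptions): it builds the four rotated copies of a batch together with their pseudo-labels.

```python
import numpy as np

def make_rotation_batch(images):
    """Given a batch of images (N, H, W, C), return the four rotated
    copies (0, 90, 180, 270 degrees) and their pseudo-labels 0..3."""
    rotated, labels = [], []
    for k in range(4):  # K = 4 rotation classes
        # np.rot90 rotates in the (H, W) plane; k is the number of 90-degree turns
        rotated.append(np.rot90(images, k=k, axes=(1, 2)))
        labels.append(np.full(len(images), k, dtype=np.int64))
    return np.concatenate(rotated), np.concatenate(labels)

batch = np.random.rand(8, 32, 32, 3)
x_rot, y_rot = make_rotation_batch(batch)
```

The classifier head is then trained with ordinary cross-entropy on `(x_rot, y_rot)`.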
Despite the encouraging empirical results, an in-depth analysis of the interaction between the SS tasks (\Psi and \Phi) and the GAN task (V) has not been done before. On one hand, the application of the SS task for discriminator learning is reasonable: the goal of the discriminator is to classify real/fake images, and an additional SS classification task could assist feature learning and enhance the GAN task. On the other hand, the motivation and design of the SS task for generator learning is rather subtle: the goal of generator learning is to capture the data distribution in P_g, and it is unclear exactly how an additional SS classification task could help.
In this work, we conduct in-depth empirical and theoretical analysis to understand the interaction between the self-supervised tasks (\Psi and \Phi) and the learning of the generator G. Interestingly, our analysis reveals issues in existing works. Specifically, the SS tasks of existing works have a "loophole" that, during generator learning, G can exploit to maximize \Phi without truly learning the data distribution. We show analytically and empirically that even a severely mode-collapsed generator can excel at \Phi. To address this issue, we propose new SS tasks based on a multi-class minimax game. Our proposed SS tasks for the discriminator and generator compete with each other to reach an equilibrium point, and through this competition they support the GAN task better. Specifically, our analysis shows that our proposed SS tasks enhance the matching between P_d and P_g by leveraging the transformed samples used in the SS classification (rotated images when gidarisiclr2018 is applied). In addition, our design couples the GAN task and the SS task. To validate our design, we provide theoretical analysis of the convergence properties of our proposed SS tasks. Training a GAN with our proposed self-supervised tasks based on the multi-class minimax game significantly improves baseline models. Overall, our system establishes state-of-the-art Fréchet Inception Distance (FID) scores. In summary, our contributions are:

We conduct in-depth empirical and theoretical analysis to understand the issues of the self-supervised tasks in existing works.

Based on the analysis, we propose new self-supervised tasks based on a multi-class minimax game.

We conduct extensive experiments to validate our proposed self-supervised tasks.
2 Related works
While training GANs with conditional signals (e.g., ground-truth class labels) has made good progress odenaicml2017; zhangarxiv2018; brockiclr2018, training GANs in the unconditional setting is still very challenging. In the original GAN goodfellownisp2014, only the single (real or fake) signal of samples is provided to train the discriminator and generator. With such signals, the generator or discriminator may fall into ill-posed settings and get stuck at bad local minima while still satisfying the signal constraints. To overcome these problems, many regularizations have been proposed. One of the most popular approaches is to enforce (or encourage) a Lipschitz condition on the discriminator. These methods include weight clipping arjovskyarxiv2017a, gradient penalty constraints gulrajaniarxiv2017; rothnips2017; kodaliarxiv2017; petzkaarxiv2017; liuarxiv2018, and spectral normalization miyatoiclr2018. Constraining the discriminator mitigates vanishing gradients and avoids a sharp decision boundary between the real and fake classes.
Using Lipschitz constraints improves the stability of GAN. However, the challenging optimization problem remains when using a single supervisory signal, similar to the original GAN goodfellownisp2014. In particular, the learning of the discriminator depends heavily on generated samples. If the generator collapses to some particular modes of the data distribution, it only creates samples around these modes, and there is no competition to train the discriminator around the other modes. As a result, the gradients around these modes may vanish, making it impossible for the generator to model the entire data distribution well. Using additional supervisory signals helps the optimization process. For example, self-supervised learning in the form of an auto-encoder has been proposed. AAE makhzaniarxiv2015 guides the generator towards resembling realistic samples. However, an issue with auto-encoders is that pixel-wise reconstruction with an ℓ2 norm causes blurry artifacts. VAE/GAN larsenarxiv2015, which combines VAE kingmaarxiv2013 and GAN, is an improved solution: the discriminator of GAN enables feature-wise reconstruction to overcome the blur, while the VAE constrains the generator to mitigate mode collapse. ALI dumoulinarxiv2016 and BiGAN donahuearxiv2016 jointly train the data/latent samples in the GAN framework. InfoGAN chenarxiv2016 infers disentangled latent representations by maximizing mutual information. The works traneccv2018; tranaaai2018 combine two different types of supervisory signals: real/fake signals and a self-supervised signal in the form of an auto-encoder. In addition, auto-encoder-based methods, including larsenarxiv2015; traneccv2018; tranaaai2018, can be considered an approach to mitigate catastrophic forgetting, because they regularize the generated samples to resemble the real ones. This is similar to EWC kirkpatrick2017nas or IS zenkearxiv2017, but the regularization is achieved via the output rather than the parameters themselves. Although feature-wise distances in auto-encoders can reconstruct sharper images, it is still challenging to produce very realistic details of textures or shapes.
Several other types of supervisory signal have been proposed. Instead of using only one discriminator or generator, some works propose ensemble models, such as multiple discriminators tunips2017, a mixture of generators hoangarxiv2018; ghoshcvpr2018, or an attacker as a new player in GAN training liucvpr2019. Recently, training the model with auxiliary self-supervised constraints chenarxiv2018; tranarxiv2019 via multiple pseudo-classes gidarisiclr2018 has helped improve the stability of the optimization process. This approach is appealing: it is simple to implement and does not require additional parameters in the networks (except a small head for the classifier).
3 GAN with Auxiliary Self-Supervised Tasks
In chenarxiv2018, a self-supervised (SS) value function (also referred to as a "self-supervised task") was proposed for GAN goodfellownisp2014 via image rotation prediction gidarisiclr2018. In their work, they showed that the SS task is useful to mitigate the catastrophic forgetting problem of the GAN discriminator. The objectives of the discriminator and generator in chenarxiv2018 are shown in Eqs. 4 and 5. Essentially, the SS task of the discriminator (denoted by \Psi) is to train the classifier C to maximize the performance of predicting the rotation applied to real samples. Given this classifier C, the SS task of the generator (denoted by \Phi) is to train the generator to produce fake samples that maximize classification performance. The discriminator and classifier share parameters, except the last layer, in order to implement two different heads: one last fully-connected layer returns a one-dimensional output (real or fake) for the discriminator, and the other returns a K-dimensional softmax over pseudo-classes for the classifier. \lambda_d and \lambda_g are constants.
\max_{D,C} \; V(D,G) + \lambda_d \, \mathbb{E}_{x \sim P_d} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_k\big(T_k(x)\big) \qquad (4)

\min_{G} \; V(D,G) - \lambda_g \, \mathbb{E}_{x \sim P_g} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_k\big(T_k(x)\big) \qquad (5)
Here, the GAN value function V(D,G) (also referred to as the "GAN task") can be the original minimax GAN objective goodfellownisp2014 or one of its improved versions. \mathcal{T} is the set of transformations and T_k is the k-th transformation. The rotation SS task proposed in gidarisiclr2018 is applied, and T_1, T_2, T_3, T_4 are the 0, 90, 180, 270 degree image rotations, respectively. P_d and P_g are the distributions of real and fake data samples, respectively. P_d^{T_k} and P_g^{T_k} are the distributions of real and fake samples transformed by T_k, respectively. Let C_k(x) be the k-th softmax output of classifier C; we have \sum_{k=1}^{K} C_k(x) = 1. The models are shown in Fig. 1a. In chenarxiv2018, empirical evidence of improvements has been provided.
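As a concrete illustration of the multi-task objective (4), the following sketch (our own simplified NumPy version with placeholder network outputs; in the real implementations D and C are deep CNNs with shared weights) combines the GAN value function with the λ-weighted rotation term:

```python
import numpy as np

def bce_gan_loss(d_real, d_fake):
    """Original GAN value function: E log D(x) + E log(1 - D(G(z)))."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def rotation_ce(probs, labels):
    """Cross-entropy of the K-way rotation classifier head."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Placeholder outputs for a batch of 8 rotated real images (K = 4).
rng = np.random.default_rng(0)
c_probs = rng.dirichlet(np.ones(4), size=8)   # softmax outputs, rows sum to 1
y = rng.integers(0, 4, size=8)                # rotation pseudo-labels
d_real = rng.uniform(0.6, 0.9, size=8)        # D outputs on real samples
d_fake = rng.uniform(0.1, 0.4, size=8)        # D outputs on fake samples

lambda_d = 1.0
# Discriminator ascends the GAN value function plus the SS term (Eq. 4):
d_objective = bce_gan_loss(d_real, d_fake) + lambda_d * (-rotation_ce(c_probs, y))
```

The generator side (5) is symmetric: it descends the GAN loss while ascending the same rotation term computed on rotated fake samples.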
Note that the goal of \Phi is to encourage the generator to produce realistic images: the classifier C is trained on real images, so it captures features that allow detection of the rotations of realistic images. However, the interaction of \Phi with the GAN task has not been adequately analyzed.
4 Analysis on Auxiliary Self-supervised Tasks
We analyze the SS tasks in chenarxiv2018 (Figure 1a). We assume that all networks have enough capacity goodfellownisp2014. Refer to Appendix A for the full derivations. Let D^* and C^* be the optimal discriminator and optimal classifier, respectively, at an equilibrium point. We assume that we have an optimal D^* for the GAN task and focus on the C^* of the SS task. Let p^{T_k}(x) be the probability of sample x under transformation by T_k (Figure 2); p_d^{T_k}(x) and p_g^{T_k}(x) denote this probability for data samples and generated samples, respectively.

Proposition 1
The optimal classifier C^* of Eq. 4 is:

C^*_k(x) = \frac{p_d^{T_k}(x)}{\sum_{j=1}^{K} p_d^{T_j}(x)} \qquad (6)

Proof. Refer to our proof of the optimal C^* in Appendix A.
Theorem 1
Given the optimal classifier C^* for the SS task, at the equilibrium point, maximizing the SS task \Phi(G, C^*) of Eq. 5 is equal to maximizing:

\Phi(G, C^*) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{x \sim P_g^{T_k}} \log \frac{p_d^{T_k}(x)}{\sum_{j=1}^{K} p_d^{T_j}(x)} \qquad (7)

Proof. Refer to our proof in Appendix A.
Theorem 1 depicts the learning of the generator given the optimal C^*: selecting G (hence P_g) to maximize \Phi(G, C^*). As C^* is trained on real data, \Phi encourages G to learn to generate realistic samples. However, we argue that G can maximize \Phi without actually learning the data distribution P_d. In particular, it is sufficient for G to maximize \Phi by simply learning to produce images whose rotated versions are rare in the real data (near-zero probability). Some example images are shown in Figure 3a. Intuitively, for these images, the applied rotation can be easily recognized.
The argument can be developed from Theorem 1. From (7), it can be shown that \log \big( p_d^{T_k}(x) / \sum_j p_d^{T_j}(x) \big) \le 0, with the maximum 0 attained when the ratio equals 1. One way for G to achieve this maximum is to generate x such that p_d^{T_1}(x) > 0 and p_d^{T_k}(x) \approx 0 for k \ne 1. For such x, the maximum of \Phi is attained. Note that T_1 corresponds to the 0-degree rotation, i.e., no rotation. Recall that P_d^{T_k} is the probability distribution of the data transformed by T_k. Therefore, the condition p_d^{T_k}(x) \approx 0 for k \ne 1 means that no other rotated real image resembles x, or equivalently, rotated x does not resemble any other image (Figure 2). Therefore, the generator can exploit this "loophole" to maximize \Phi without actually learning the data distribution. In particular, even a mode-collapsed generator can achieve the maximum of \Phi by generating only such images.
Empirical evidence. Empirically, our experiments (in Appendix B.2.1) show that the FID of models trained with \Phi is poor except for very small \lambda_g. We further illustrate this issue with a toy empirical example on CIFAR-10. We augment the training images with the rotation transformations to train the classifier C to predict the rotation T_k applied to x. This is the SS task of the discriminator in Figure 1a. Given this classifier C, we simulate the SS task of generator learning as follows. To simulate the output of a good generator that produces diverse realistic samples, we take the full test set of CIFAR-10 (10 classes) and compute the cross-entropy loss when these images are fed into C. To simulate the output of a mode-collapsed generator, we select samples from a single class, e.g., "horse", and compute the cross-entropy loss when they are fed into C. Fig. 3b shows that some mode-collapsed generators can outperform the diverse one and achieve a smaller loss; e.g., a generator that produces only "horse" samples outperforms the diverse one under \Phi. This example illustrates that, while \Phi may help the generator create more realistic samples, it does not help the generator prevent mode collapse. In fact, as part of the multi-task loss (see (5)), \Phi would undermine the learning of synthesizing diverse samples in the GAN task V.
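The loophole can also be reproduced numerically. Below is a small synthetic check (our own toy construction, not the CIFAR-10 experiment): on a discrete sample space we build the optimal classifier of Eq. 6 from the rotated-data probabilities, then compare the SS objective (7) of a generator that matches the data against a collapsed generator that only emits the point whose rotations are unambiguous.

```python
import numpy as np

K = 4
# Discrete sample space with 5 points. p_d_T[k, x] = probability of point x
# under the data distribution rotated by T_k. Points 0..3 are the four
# rotations of an asymmetric item; point 4 is rotation-symmetric (ambiguous).
p_d_T = np.array([
    [0.25, 0.00, 0.00, 0.00, 0.25],   # T_1 (0 deg)
    [0.00, 0.25, 0.00, 0.00, 0.25],   # T_2 (90 deg)
    [0.00, 0.00, 0.25, 0.00, 0.25],   # T_3 (180 deg)
    [0.00, 0.00, 0.00, 0.25, 0.25],   # T_4 (270 deg)
])

# Optimal classifier of Eq. 6: C*_k(x) = p_d^{T_k}(x) / sum_j p_d^{T_j}(x)
c_star = p_d_T / p_d_T.sum(axis=0, keepdims=True)

def ss_objective(p_g_T):
    """Phi(G, C*) = (1/K) sum_k E_{x ~ P_g^{T_k}} log C*_k(x)  (Eq. 7)."""
    mask = p_g_T > 0
    return np.sum(p_g_T[mask] * np.log(c_star[mask])) / K

# Collapsed generator: emits only point 0; its rotations are points 0..3.
p_g_collapsed = np.eye(K, 5)
# Diverse generator: matches the data distribution exactly.
p_g_diverse = p_d_T / p_d_T.sum(axis=1, keepdims=True)

phi_collapsed = ss_objective(p_g_collapsed)  # attains the maximum, 0
phi_diverse = ss_objective(p_g_diverse)      # strictly worse under Phi
```

The collapsed generator scores strictly higher under (7) than the generator that reproduces the data distribution, which is exactly the loophole described above.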
5 Proposed method
5.1 Auxiliary Self-Supervised Tasks with Multi-class Minimax Game
In this section, we propose improved SS tasks to address this issue (Fig. 1b). Based on a multi-class minimax game, our classifier learns to distinguish the rotated samples from real data versus those from generated data. Our proposed SS tasks are shown in (8) and (9), respectively. Our discriminator objective is:

\max_{D,C} \; V(D,G) + \lambda_d \Big( \mathbb{E}_{x \sim P_d} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_k\big(T_k(x)\big) + \mathbb{E}_{x \sim P_g} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_{K+1}\big(T_k(x)\big) \Big) \qquad (8)

Eq. 8 means that we simultaneously distinguish generated samples, as the (K+1)-th class, from the K rotated real sample classes. Here, C_{K+1}(x) is the (K+1)-th output of classifier C, for the fake class.
While the rotated real samples are fixed samples that help prevent the classifier (discriminator) from forgetting, the fake class K+1 serves as the connecting point between the generator and classifier, through which the generator can directly challenge the classifier. Our technique resembles the original GAN by Goodfellow et al. goodfellownisp2014, but we generalize it to a multi-class minimax game. Our generator objective is:

\min_{G} \; V(D,G) - \lambda_g \, \mathbb{E}_{x \sim P_g} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_k\big(T_k(x)\big) \qquad (9)

Eqs. 8 and 9 form a multi-class minimax game. Note that when we mention the multi-class minimax game (or multi-class adversarial training), we refer to the SS tasks. The game for the GAN task V(D,G) is the original one by Goodfellow et al. goodfellownisp2014.
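A minimal sketch of the two sides of this game (our own illustration with placeholder classifier outputs; the networks and batch sizes are assumptions): the classifier head now has K + 1 outputs, the discriminator pushes rotated fake samples toward class K + 1 as in (8), and the generator pushes them back toward their true rotation class k as in (9).

```python
import numpy as np

K = 4
FAKE = K  # index of the (K+1)-th "fake" class

def ce(probs, labels):
    """Mean cross-entropy; probs: (N, K+1) softmax rows, labels: (N,) ints."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

rng = np.random.default_rng(1)
probs_real = rng.dirichlet(np.ones(K + 1), size=8)  # classifier on rotated real
probs_fake = rng.dirichlet(np.ones(K + 1), size=8)  # classifier on rotated fake
y_rot = rng.integers(0, K, size=8)                  # rotation pseudo-labels

# Discriminator SS term (Eq. 8): rotated real samples to class k,
# rotated fake samples to the fake class K+1.
ss_d = ce(probs_real, y_rot) + ce(probs_fake, np.full(8, FAKE))

# Generator SS term (Eq. 9): rotated fakes should be classified as rotation k,
# directly opposing the classifier's fake class -- the multi-class minimax game.
ss_g = ce(probs_fake, y_rot)
```

In training, `ss_d` is added to the discriminator loss and `ss_g` to the generator loss with weights λ_d and λ_g.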
5.1.1 Theoretical Analysis
Proposition 2
For a fixed generator G, the optimal classifier C^* under Eq. 8 is:

C^*_k(x) = \frac{p_d^{T_k}(x)}{\sum_{j=1}^{K} \big( p_d^{T_j}(x) + p_g^{T_j}(x) \big)}, \quad k = 1, \dots, K \qquad (10)

where \sum_{j} p_d^{T_j}(x) and \sum_{j} p_g^{T_j}(x) are proportional to the probabilities of sample x under the mixture distributions of rotated real and rotated fake samples, respectively.

Proof. Refer to our proof of the optimal C^* in Appendix A.
Theorem 2
Given the optimal classifier C^* obtained from the multi-class minimax training of Eq. 8, at the equilibrium point, maximizing the generator SS task of Eq. 9 is equal to maximizing Eq. 11:

\frac{1}{K} \sum_{k=1}^{K} \bigg( -\mathrm{KL}\big(P_g^{T_k} \,\|\, P_d^{T_k}\big) + \mathbb{E}_{x \sim P_g^{T_k}} \log \frac{p_g^{T_k}(x)}{\sum_{j=1}^{K} \big( p_d^{T_j}(x) + p_g^{T_j}(x) \big)} \bigg) \qquad (11)

Proof. Refer to our proof in Appendix A.
Note that the proposed SS task objective (11) differs from the original SS task objective (7) by the KL divergence term. Furthermore, note that \mathrm{KL}(P_g^{T_k} \| P_d^{T_k}) = \mathrm{KL}(P_g \| P_d), as rotation is an invertible affine transform and the KL divergence is invariant under such transforms (our proof in Appendix A). Therefore, the improvement is clear: the proposed SS tasks work together to improve the matching of P_g to P_d by leveraging the rotated samples. For a given G, feedback is computed not only from the GAN task but also via the rotated samples; therefore, G has more feedback to improve P_g. We investigate the improvement of our method on the toy example of Section 4. The setup is the same, except that we replace the models/cost functions with our proposed ones (the designs of D and G are unchanged). The losses are shown in Fig. 3c. Comparing Fig. 3c with Fig. 3b, the improvement of our proposed model can be observed: the diverse generator has the lowest loss under our proposed model. Note that, since optimizing the KL divergence is not easy because it is asymmetric and could be biased to one direction tunips2017, in our implementation we use a slightly modified version, as described in the Appendix.
6 Experiments
We measure the diversity and quality of generated samples via the Fréchet Inception Distance (FID) heuselarxiv2017. FID is computed with 10K real samples and 5K generated samples, exactly as in miyatoiclr2018, unless otherwise mentioned. We report the best FID attained in 300K iterations, as in xiangarxiv2017; linips2017; traneccv2018; yaziciarxiv2018. We integrate our proposed techniques into two baseline models (SSGAN chenarxiv2018 and Dist-GAN traneccv2018). We conduct experiments mainly on CIFAR-10 and STL-10 (resized to 48×48 as in miyatoiclr2018). We also provide additional experiments on CIFAR-100, ImageNet 32×32, and Stacked MNIST.
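For reference, FID compares the Gaussian statistics of Inception features of real and generated samples: FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). Below is a minimal sketch over precomputed feature matrices (our own illustration; feature extraction is omitted, and in practice the features come from the Inception network as in heuselarxiv2017):

```python
import numpy as np

def fid(feats_real, feats_fake):
    """Frechet distance between Gaussians fit to two feature matrices (N, d).
    Uses Tr((S1 S2)^{1/2}) = sum of sqrt of the (real, nonnegative) eigenvalues
    of S1 S2, which holds for positive semi-definite covariance matrices."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    eigs = np.linalg.eigvals(cov_r @ cov_f).real
    tr_sqrt = np.sqrt(np.clip(eigs, 0.0, None)).sum()
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_f) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
feats_a = rng.normal(0.0, 1.0, size=(500, 8))
feats_b = rng.normal(0.5, 1.0, size=(500, 8))  # shifted features -> larger FID
```

Identical feature sets give a distance of (numerically) zero, and the distance grows as the two feature distributions drift apart.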
For Dist-GAN traneccv2018, we evaluate three versions implemented with different network architectures: the DCGAN architecture radfordarxiv2015, the CNN architectures of SN-GAN miyatoiclr2018 (referred to as the SN-GAN architecture), and the ResNet architecture gulrajaniarxiv2017. We recall these network architectures in Appendix C. We use the ResNet architecture gulrajaniarxiv2017 for the CIFAR-100 and ImageNet 32×32 experiments, and the tiny K/4 and K/2 architectures metzarxiv2016 for Stacked MNIST. We keep all parameters suggested in the original works and focus on understanding the contribution of our proposed techniques. For SSGAN chenarxiv2018, we use the ResNet architecture as implemented in the official code^1.

^1 https://github.com/google/compare_gan
In our experiments, we use SS to denote the original self-supervised tasks proposed in chenarxiv2018, and MS to denote our proposed self-supervised tasks ("Multi-class minimax game based Self-supervised tasks"). Details of the experimental setup and network parameters are discussed in Appendix B.
We have conducted extensive experiments. Setup and results are discussed in Appendix B. In this section, we highlight the main results:

Comparison between SS and our proposed MS using the same baseline.

Comparison between our proposed baseline + MS and other state-of-the-art unconditional and conditional GANs. We emphasize that our proposed baseline + MS is unconditional and does not use any labels.
6.1 Comparison between SS and our proposed MS using the same baseline
Results are shown in Fig. 4 using Dist-GAN traneccv2018 as the baseline. For each experiment and each approach (SS or MS), we obtain the best \lambda_d and \lambda_g via an extensive search (see Appendix B.4 for details), and we use these best values in the comparison depicted in Fig. 4. In our experiments, we observe that Dist-GAN has stable convergence, which is why we use it here. As shown in Fig. 4, our proposed MS consistently outperforms the original SS. More details can be found in Appendix B.4.
6.2 Comparison between our proposed method and other state-of-the-art GANs
Main results are shown in Table 1; details of this comparison can be found in Appendix B.4. The best \lambda_d and \lambda_g as in Figure 4 are used in this comparison. The best FID attained in 300K iterations is reported, as in xiangarxiv2017; linips2017; traneccv2018; yaziciarxiv2018. Note that the SN-GAN method miyatoiclr2018 attains its best FID at about 100K iterations with ResNet and diverges afterward. A similar observation is also discussed in chenarxiv2018.
As shown in Table 1, our method (Dist-GAN + MS) consistently outperforms the Dist-GAN baseline and other state-of-the-art GANs. These results confirm the effectiveness of our proposed self-supervised tasks based on the multi-class minimax game.
Methods  CIFAR-10 (SN-GAN arch.)  STL-10 (SN-GAN arch.)  CIFAR-10 (ResNet)  STL-10 (ResNet)  CIFAR-10 (ResNet, 10K-10K FID)
GAN-GP miyatoiclr2018  37.7  –  –  –  –
WGAN-GP miyatoiclr2018  40.2  55.1  –  –  –
SN-GAN miyatoiclr2018  25.5  43.2  21.70 ± .21  40.10 ± .50  19.73
SSGAN chenarxiv2018  –  –  –  –  15.65
Dist-GAN traneccv2018  22.95  36.19  17.61 ± .30  28.50 ± .49  13.01
GN-GAN tranaaai2018  21.70  30.80  16.47 ± .28  –  –
SAGAN zhangarxiv2018 (cond.)  –  –  13.4 (best)  –  –
BigGAN brockiclr2018 (cond.)  –  –  14.73  –  –
SSGAN (our reproduction)  –  –  –  –  20.47
Ours (SSGAN + MS)  –  –  –  –  19.89
Dist-GAN + SS  21.40  29.79  14.97 ± .29  27.98 ± .38  12.37
Ours (Dist-GAN + MS)  18.88  27.95  13.90 ± .22  27.10 ± .34  11.40
We have also extracted the FID reported in chenarxiv2018, i.e., SSGAN with the original SS tasks proposed there. In this case, we follow exactly their settings and compute FID using 10K real samples and 10K fake samples. Our model achieves a better FID score than SSGAN with exactly the same ResNet architecture on the CIFAR-10 dataset; see the results under the last CIFAR-10 column in Table 1.
Note that we tried to reproduce the results of SSGAN using its published code, but we were unable to achieve results similar to those reported in the original paper chenarxiv2018. We performed an extensive parameter search and use the best parameters obtained to report the results as SSGAN in Table 1 (i.e., this SSGAN entry uses the published code and the best parameters we found). We use this code and setup to compare SS and MS: we replace the SS code in the system with MS code and obtain "SSGAN + MS". As shown in Table 1, our "SSGAN + MS" achieves a better FID than SSGAN. The improvement is consistent with Figure 4, where Dist-GAN is the baseline. More detailed experiments can be found in the Appendix. We have also compared SSGAN and our system (SSGAN + MS) on CelebA. In this experiment, we use the small DCGAN architecture provided in the authors' code. Our proposed MS outperforms the original SS with an improved FID. This experiment again confirms the effectiveness of our proposed MS.
We conduct additional experiments on CIFAR-100 and ImageNet 32×32 to compare SS and MS with the Dist-GAN baseline. We use the same ResNet architecture as in Section B.4 on CIFAR-10 for this study, with the best parameters \lambda_d and \lambda_g selected in Section B.4 for the ResNet architecture. Experimental results in Table 2 show that our MS consistently outperforms SS on all benchmark datasets. For ImageNet 32×32 we report the best FID for SS, because that model suffers serious mode collapse at the end of training; our MS achieves its best performance at the end of training.
Datasets  SS  MS
CIFAR-100 (10K-5K FID)  21.02  19.74
ImageNet 32×32 (10K-10K FID)  17.1  12.3
We also evaluate the diversity of our generator on Stacked MNIST metzarxiv2016. Each image of this dataset is synthesized by stacking three random MNIST digits, giving 1000 possible modes. We follow exactly the same experimental setup, with the tiny K/4 and K/2 architectures, and the evaluation protocol of metzarxiv2016. We measure the quality of the methods by the number of covered modes (higher is better) and the KL divergence between the generated mode distribution and the uniform distribution (lower is better); refer to metzarxiv2016 for details. Table 3 shows that our proposed MS outperforms SS in both mode number and KL divergence, and that our approach significantly outperforms the state of the art traneccv2018; karrasiclr2018. The means and standard deviations of MS and SS are computed from eight runs (we retrain our GAN model from scratch for each run). The results are reported with the best \lambda_d and \lambda_g for each architecture.

Arch  Unrolled GAN metzarxiv2016  WGAN-GP gulrajaniarxiv2017  Dist-GAN traneccv2018  Pro-GAN karrasiclr2018  traneccv2018 + SS  Ours (traneccv2018 + MS)
K/4, # modes  372.2 ± 20.7  640.1 ± 136.3  859.5 ± 68.7  859.5 ± 36.2  906.75 ± 26.15  926.75 ± 32.65
K/4, KL  4.66 ± 0.46  1.97 ± 0.70  1.04 ± 0.29  1.05 ± 0.09  0.90 ± 0.13  0.78 ± 0.13
K/2, # modes  817.4 ± 39.9  772.4 ± 146.5  917.9 ± 69.6  919.8 ± 35.1  957.50 ± 31.23  976.00 ± 10.04
K/2, KL  1.43 ± 0.12  1.35 ± 0.55  1.06 ± 0.23  0.82 ± 0.13  0.61 ± 0.15  0.52 ± 0.07
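The mode-coverage metric can be sketched as follows. This is our own simplified illustration: the actual protocol of metzarxiv2016 runs a pretrained MNIST classifier on each of the three stacked channels to obtain a 3-digit mode id per sample, which we replace here with random stand-in ids.

```python
import numpy as np

def count_modes(mode_ids, total_modes=1000):
    """Number of covered modes and KL(empirical mode dist || uniform)."""
    counts = np.bincount(mode_ids, minlength=total_modes)
    covered = int((counts > 0).sum())
    p = counts / counts.sum()
    nz = p > 0
    # KL(p || u) with u uniform over total_modes; zero-probability modes drop out.
    kl = float(np.sum(p[nz] * np.log(p[nz] * total_modes)))
    return covered, kl

# Each Stacked MNIST image encodes a mode id 100*d1 + 10*d2 + d3 in 0..999.
rng = np.random.default_rng(0)
samples = rng.integers(0, 1000, size=26000)  # stand-in for classified samples
covered, kl = count_modes(samples)
```

A generator covering all 1000 modes uniformly yields `covered == 1000` and a KL of zero; mode collapse shrinks the first number and inflates the second.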
Finally, in Table 1, we compare our FID to SAGAN zhangarxiv2018 (a state-of-the-art conditional GAN) and BigGAN brockiclr2018. We perform the experiments under the same conditions, using the ResNet architecture on the CIFAR-10 dataset. We report the best FID that SAGAN can achieve; as the SAGAN paper does not report CIFAR-10 results zhangarxiv2018, we run the published SAGAN code and select the best parameters to obtain results for CIFAR-10. For BigGAN, we extract the best FID from the original paper. Although our method is unconditional, our best FID is very close to those of these state-of-the-art conditional GANs. This validates the effectiveness of our design. Generated images from our system can be found in Figures 5 and 6 of Appendix B.
7 Conclusion
We provide theoretical and empirical analysis of the auxiliary self-supervised tasks for GAN. Our analysis reveals the limitation of existing work. To address this limitation, we propose multi-class minimax game based self-supervised tasks. Our proposed self-supervised tasks leverage the rotated samples to provide better feedback in matching the data and generator distributions. Our theoretical and empirical analysis supports the improved convergence of our design. Our proposed SS tasks can be easily incorporated into existing GAN models. Experimental results suggest that they help boost the performance of baselines implemented with various network architectures on the CIFAR-10, CIFAR-100, STL-10, CelebA, ImageNet 32×32, and Stacked MNIST datasets. The best version of our proposed method establishes state-of-the-art FID scores on all these benchmark datasets.
Acknowledgements
This work was supported by ST Electronics and the National Research Foundation (NRF), Prime Minister's Office, Singapore under the Corporate Laboratory @ University Scheme (Programme Title: STEE Infosec - SUTD Corporate Laboratory). This research was also supported by the National Research Foundation Singapore under its AI Singapore Programme [Award Number: AISG-100E-2018-005]. This project was also supported by SUTD project PIE-SGP-AI-2018-01.
References
Appendix A: Proofs for Sections 4 and 5
Proposition 1 (Proof.)
Let T_k be the k-th type of transformation, and let P_d^{T_k} be the distribution of the transformed real samples. This section shows the proof for the optimal C^*. C_k(x) is the k-th softmax output of C, hence \sum_{k=1}^{K} C_k(x) = 1. \Psi(C) can be rewritten as:

\Psi(C) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{x \sim P_d^{T_k}} \log C_k(x) = \int_x \frac{1}{K} \sum_{k=1}^{K} p_d^{T_k}(x) \log C_k(x) \, dx \qquad (12)

where p_d^{T_k}(x) / \sum_{j} p_d^{T_j}(x) is the probability that x belongs to class k, which can be considered as the k-th output of the "ground-truth" classifier on sample x that we expect the classifier to predict. Assume that \Psi has first-order derivatives with respect to C_k. The optimal solution can be obtained by setting the derivative of the Lagrangian

\mathcal{L}(C, \lambda) = \int_x \frac{1}{K} \sum_{k=1}^{K} p_d^{T_k}(x) \log C_k(x) \, dx + \int_x \lambda(x) \Big( 1 - \sum_{k=1}^{K} C_k(x) \Big) dx \qquad (13)

equal to zero. For any k, setting \partial \mathcal{L} / \partial C_k(x) = \frac{1}{K} \frac{p_d^{T_k}(x)}{C_k(x)} - \lambda(x) = 0, the optimal C^*_k has the form:

C^*_k(x) = \frac{p_d^{T_k}(x)}{K \lambda(x)} \qquad (14)

Note that \sum_{k=1}^{K} C^*_k(x) = 1, and that by Bayes' theorem the probabilities P(T_k) = 1/K of applying each transformation to a sample are equal; hence K\lambda(x) = \sum_{j=1}^{K} p_d^{T_j}(x). We finally obtain the optimal C^* from Eq. 14: C^*_k(x) = p_d^{T_k}(x) / \sum_{j=1}^{K} p_d^{T_j}(x). That concludes our proof.

Theorem 1 (Proof.) Substituting the optimal C^* obtained above into \Phi(G, C):
\Phi(G, C^*) = \mathbb{E}_{x \sim P_g} \mathbb{E}_{T_k \sim \mathcal{T}} \log C^*_k\big(T_k(x)\big) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{x \sim P_g^{T_k}} \log C^*_k(x) \qquad (15)
Substituting C^*_k(x) from Eq. 6 into (15), we have:
\Phi(G, C^*) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{x \sim P_g^{T_k}} \log \frac{p_d^{T_k}(x)}{\sum_{j=1}^{K} p_d^{T_j}(x)} \qquad (16)
That concludes our proof.
Proposition 2 (Proof.) Training the self-supervised task with the minimax game is similar to the previous objective, except for the additional term for the fake class, as below:

\Psi(C) = \int_x \frac{1}{K} \bigg( \sum_{k=1}^{K} p_d^{T_k}(x) \log C_k(x) + \sum_{k=1}^{K} p_g^{T_k}(x) \log C_{K+1}(x) \bigg) dx \qquad (17)

Assume that \Psi has first-order derivatives with respect to C_k. The optimal C^* can be derived by setting the derivative of the Lagrangian

\mathcal{L}(C, \lambda) = \Psi(C) + \int_x \lambda(x) \Big( 1 - \sum_{k=1}^{K+1} C_k(x) \Big) dx \qquad (18)

equal to zero. For any k \le K, we have the derivative:

\frac{\partial \mathcal{L}}{\partial C_k(x)} = \frac{1}{K} \frac{p_d^{T_k}(x)}{C_k(x)} - \lambda(x) \qquad (19)

Setting \partial \mathcal{L} / \partial C_k(x) = 0, we get the optimal C^*_k, k \le K:

C^*_k(x) = \frac{p_d^{T_k}(x)}{K \lambda(x)} \qquad (20)

With k = K+1, we obtain the derivative:

\frac{\partial \mathcal{L}}{\partial C_{K+1}(x)} = \frac{1}{K} \frac{\sum_{j=1}^{K} p_g^{T_j}(x)}{C_{K+1}(x)} - \lambda(x) \qquad (21)

Setting \partial \mathcal{L} / \partial C_{K+1}(x) = 0, we get the optimal C^*_{K+1}:

C^*_{K+1}(x) = \frac{\sum_{j=1}^{K} p_g^{T_j}(x)}{K \lambda(x)} \qquad (22)

Because \sum_{k=1}^{K+1} C^*_k(x) = 1, we have K\lambda(x) = \sum_{j=1}^{K} \big( p_d^{T_j}(x) + p_g^{T_j}(x) \big), and we finally obtain the optimal C^* from Eq. 20: C^*_k(x) = p_d^{T_k}(x) / \sum_{j=1}^{K} \big( p_d^{T_j}(x) + p_g^{T_j}(x) \big). That concludes the proof.
Theorem 2 (Proof.) Substituting the optimal C^* obtained above into \Phi(G, C):

\Phi(G, C^*) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{x \sim P_g^{T_k}} \log \frac{p_d^{T_k}(x)}{\sum_{j=1}^{K} \big( p_d^{T_j}(x) + p_g^{T_j}(x) \big)} \qquad (23)

The summand can be written as:

\mathbb{E}_{x \sim P_g^{T_k}} \log \frac{p_d^{T_k}(x)}{p_g^{T_k}(x)} + \mathbb{E}_{x \sim P_g^{T_k}} \log \frac{p_g^{T_k}(x)}{\sum_{j=1}^{K} \big( p_d^{T_j}(x) + p_g^{T_j}(x) \big)} \qquad (24)

Note that the first term of Eq. 24 is exactly -\mathrm{KL}\big(P_g^{T_k} \,\|\, P_d^{T_k}\big). Substituting (24) into (23) yields Eq. 11, which concludes the proof.
Theorem 3
KL divergence is invariant under an invertible affine transform.

Proof. Let X \in \mathbb{R}^n be a random variable and P a distribution defined on \mathbb{R}^n. Let T be an affine transform, i.e., T(x) = Ax + b, where A is a full-rank matrix and b \in \mathbb{R}^n. Then for the random variable Y = T(X), we have p_Y(y) = p_X\big(T^{-1}(y)\big) \, |\det J_{T^{-1}}(y)|, where J_{T^{-1}} is the Jacobian matrix of T^{-1}, with its (i,j)-th entry defined as:

\big[J_{T^{-1}}(y)\big]_{ij} = \frac{\partial \big(T^{-1}(y)\big)_i}{\partial y_j} \qquad (25)

Obviously, J_{T^{-1}}(y) = A^{-1}. Then we have p_Y(y) = p_X\big(T^{-1}(y)\big) \, |\det A|^{-1}.

Let P_1 and P_2 be two distributions defined on \mathbb{R}^n for X, and let Q_1 and Q_2 be the corresponding distributions for Y = T(X). Then we have q_1(y) = p_1\big(T^{-1}(y)\big) |\det A|^{-1} and q_2(y) = p_2\big(T^{-1}(y)\big) |\det A|^{-1}.

Using the definition of the KL divergence between Q_1 and Q_2, we have:

\mathrm{KL}(Q_1 \| Q_2) = \int q_1(y) \log \frac{q_1(y)}{q_2(y)} \, dy \qquad (26)

= \int p_1\big(T^{-1}(y)\big) |\det A|^{-1} \log \frac{p_1\big(T^{-1}(y)\big) |\det A|^{-1}}{p_2\big(T^{-1}(y)\big) |\det A|^{-1}} \, dy \qquad (27)

= \int p_1\big(T^{-1}(y)\big) |\det A|^{-1} \log \frac{p_1\big(T^{-1}(y)\big)}{p_2\big(T^{-1}(y)\big)} \, dy \qquad (28)

since the factor |\det A|^{-1} cancels in the ratio. By the change of variables x = T^{-1}(y), dy = |\det A| \, dx, we have:

\mathrm{KL}(Q_1 \| Q_2) = \int p_1(x) \log \frac{p_1(x)}{p_2(x)} \, dx \qquad (29)

= \mathrm{KL}(P_1 \| P_2) \qquad (30)

It concludes our proof.

Corollary 1
The KL divergence between the real and fake distributions is equal to that between the rotated real and rotated fake distributions (by T_k): \mathrm{KL}(P_g \| P_d) = \mathrm{KL}\big(P_g^{T_k} \,\|\, P_d^{T_k}\big).

Note that we apply the above theorem on the invariance of KL with P_g, P_d as the two distributions and the image rotation T_k as the affine transform.
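Corollary 1 can be checked numerically on a discrete analogue (our own sanity check, not part of the paper's proofs): rotating a grid by 90 degrees is a bijection of the sample space, so it only relabels the support of a distribution, and the KL divergence between two distributions over the grid is unchanged when both are rotated.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence between two distributions on the same grid."""
    p, q = p.ravel(), q.ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
p = rng.random((4, 4)); p = p / p.sum()  # "real" distribution over a 4x4 grid
q = rng.random((4, 4)); q = q / q.sum()  # "fake" distribution over the same grid

# Rotating both distributions permutes the grid cells identically,
# so the divergence is invariant, mirroring KL(P_g || P_d) = KL(P_g^T || P_d^T).
kl_original = kl(p, q)
kl_rotated = kl(np.rot90(p), np.rot90(q))
```

The same check passes for any number of quarter turns, matching the four rotations T_1, ..., T_4 used in the SS tasks.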
A.1 Implementation
Here, we discuss details of our implementation. For the SS tasks, we follow the geometric transformations of gidarisiclr2018 to augment images and compute pseudo-labels. This choice is simple yet effective, and is currently the state of the art in self-supervised tasks. In particular, we train the discriminator to recognize the 2D rotation that was applied to the input image. We rotate the input image with the K = 4 rotations (0, 90, 180, 270 degrees) and assign them the pseudo-labels from 1 to K.
To implement our model, the GAN objectives for the discriminator and generator can be the ones in the original GAN by Goodfellow et al. goodfellownisp2014 or other variants. In our work, we conduct experiments to show improvements with two baseline models: the original SSGAN chenarxiv2018 and Dist-GAN traneccv2018. We integrate our SS tasks into Dist-GAN traneccv2018 and conduct our study with this baseline. In our experiments, we observe that Dist-GAN has good convergence properties, which is important for our ablation study.
\min_{G} \; \mathcal{L}_G + \lambda_g \Big| \mathbb{E}_{x \sim P_d} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_k\big(T_k(x)\big) - \mathbb{E}_{x \sim P_g} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_k\big(T_k(x)\big) \Big| \qquad (34)

Second, in practice, achieving the equilibrium point for the optimal D, G, C is difficult. Therefore, inspired by traneccv2018, we propose a new generator objective to improve Eq. 9, written in Eq. 34. It couples the convergence of the GAN task and the SS task, which makes learning more stable. Our intuition is that if the generator distribution is similar to the real distribution, the classification performance on its transformed fake samples should be similar to that on transformed real samples. Therefore, we propose to match the self-supervised tasks of real and fake samples to train the generator. In other words, if real and fake samples are from similar distributions, the same task applied to real and fake samples should result in similar behaviors. In particular, given the cross-entropy loss computed on real samples, we train the generator to create samples that match this loss. Here, we use the ℓ1 norm for the matching term, and \mathcal{L}_G is the objective of the GAN task traneccv2018. In our implementation, we randomly select one geometric transformation T_k for each data sample when training the discriminator, and the same T_k is applied to the generated samples when matching the self-supervised tasks to train the generator.
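The coupled generator objective of Eq. 34 can be sketched as follows (a simplified NumPy illustration with placeholder losses; `l_gan` standing in for the Dist-GAN generator loss and the value of λ_g are assumptions):

```python
import numpy as np

def rotation_ce(probs, labels):
    """Cross-entropy of the rotation classifier head (probs: (N, K) rows)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

rng = np.random.default_rng(2)
K = 4
y = rng.integers(0, K, size=8)              # shared rotations T_k for the batch
c_real = rng.dirichlet(np.ones(K), size=8)  # classifier on rotated real samples
c_fake = rng.dirichlet(np.ones(K), size=8)  # classifier on rotated fake samples

lambda_g = 0.1
l_gan = 1.0  # placeholder GAN-task generator loss from the baseline

# Eq. 34: the generator matches the SS loss of fake samples to that of real
# ones (l1 matching), instead of simply minimizing the fake cross-entropy.
l_g = l_gan + lambda_g * abs(rotation_ce(c_real, y) - rotation_ce(c_fake, y))
```

The matching term vanishes when the classifier behaves identically on rotated real and rotated fake samples, i.e., when P_g has moved close to P_d.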
For this generator objective, similar to Eq. 24, we have: