Generative Adversarial Networks (GAN). GAN goodfellow-nisp-2014 have become one of the most important methods for learning generative models. GAN has shown remarkable results in various tasks, such as image generation karras-iclr-2018; brock-iclr-2018; karras-cvpr-2019, image transformation isola-cvpr-2017; zhu-cvpr-2017, super-resolution ledig-cvpr-2017, text to image reed-arxiv-2016; zhang2-cvpr-2017, and anomaly detection schlegl-ipmi-2017; lim-icdm-2018. The idea behind GAN is the minimax game. It uses a binary classifier, the so-called discriminator, to distinguish the data (real) from generated (fake) samples. The generator of GAN is trained to confuse the discriminator into classifying the generated samples as real ones. By having the generator and discriminator compete with each other in this adversarial process, both are able to improve. The end goal is for the generator to capture the data distribution. Although considerable improvement has been made for GAN under the conditional setting odena-icml-2017; zhang-arxiv-2018; brock-iclr-2018, i.e., using ground-truth labels to support the learning, the unconditional setup remains very challenging. Fundamentally, using only a single signal (real/fake) to guide the generator to learn a high-dimensional, complex data distribution is very challenging goodfellow-nips-2016; arjovsky-arxiv-2017a; che-arxiv-2016; chen-arxiv-2016; metz-arxiv-2016; salimans-nisp-2016.
Self-supervised learning is an active research area doersch-cvpr-2015; pathak-cvpr-2016; zhang-eccv-2016; zhang1-cvpr-2017; noroozi-iccv-2017; gidaris-iclr-2018. Self-supervised learning is a paradigm of unsupervised learning. Self-supervised methods encourage the classifier to learn better feature representations with pseudo-labels. In particular, these methods learn image features by training the model to recognize a geometric transformation applied to the input image. A simple yet powerful method proposed in gidaris-iclr-2018 uses image rotations by 0, 90, 180, 270 degrees as the geometric transformation. The model is trained on the 4-way classification task of recognizing which of the four rotations was applied. This task is referred to as the self-supervised task. This simple method is able to close the gap between supervised and unsupervised image classification gidaris-iclr-2018.
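The rotation pretext task described above can be sketched in a few lines. This is a minimal illustration, not the implementation of gidaris-iclr-2018: the function name `make_rotation_batch` is hypothetical, images are assumed to be square and stored as an `(N, H, W, C)` NumPy array, and pseudo-labels are zero-based (0 to 3) rather than 1 to 4.

```python
import numpy as np

def make_rotation_batch(images):
    """Build the 4-way rotation task from a batch of square images.

    images: array of shape (N, H, W, C) with H == W (assumption of this sketch).
    Returns the images rotated by 0, 90, 180, 270 degrees, stacked along the
    batch axis, and the matching pseudo-labels 0..3.
    """
    # np.rot90 rotates in the plane given by `axes`; here the (H, W) plane.
    rotated = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    # Pseudo-label k marks "rotated by k * 90 degrees".
    labels = [np.full(len(images), k) for k in range(4)]
    return np.concatenate(rotated), np.concatenate(labels)
```

A classifier trained with a standard cross-entropy loss on the returned pairs solves the self-supervised task without any human annotation.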
Self-supervised Learning for GAN. Recently, self-supervised learning has been applied to GAN training chen-arxiv-2018; tran-arxiv-2019. These works propose auxiliary self-supervised classification tasks to assist the main GAN task (Figure 1). In particular, their objective functions for learning discriminator and generator are multi-task loss as shown in (1) and (2) respectively:
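One way to write objectives of this multi-task form is sketched below. The symbols are an assumed reconstruction chosen to match the description that follows: $V(D,G)$ for the GAN value function, $\Psi$ and $\Phi$ for the SS tasks, and $\lambda_d$, $\lambda_g$ for the weighting constants. Only (3), the value function of goodfellow-nisp-2014, is standard.

```latex
\begin{align}
\max_{D,C} V_D(D,C,G) &= V(D,G) + \lambda_d \Psi(G,C) \tag{1} \\
\min_{G} V_G(D,C,G) &= V(D,G) - \lambda_g \Phi(G,C) \tag{2} \\
V(D,G) &= \mathbb{E}_{x \sim P_d} \log D(x) + \mathbb{E}_{x \sim P_g} \log\bigl(1 - D(x)\bigr) \tag{3}
\end{align}
```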
Here, $V(D,G)$ in (3) is the GAN task, which is the original value function proposed in Goodfellow et al. goodfellow-nisp-2014. $P_d$ is the true data distribution, and $P_g$ is the distribution induced by the generator mapping. $\Psi$ and $\Phi$ are the self-supervised (SS) tasks for discriminator and generator learning, respectively (details to be discussed). $C$ is the classifier for the self-supervised task, e.g. the rotation classifier discussed in gidaris-iclr-2018. Based on this framework, Chen et al. chen-arxiv-2018 apply a self-supervised task to help the discriminator counter catastrophic forgetting. Empirically, they have shown that the self-supervised task enables the discriminator to learn a more stable and improved representation. Tran et al. tran-arxiv-2019 propose to improve self-supervised learning with adversarial training.
Despite the encouraging empirical results, in-depth analysis of the interaction between the SS tasks ($\Psi$ and $\Phi$) and the GAN task ($V$) has not been done before. On one hand, the application of the SS task for discriminator learning is reasonable: the goal of the discriminator is to classify real/fake images; an additional SS classification task could assist feature learning and enhance the GAN task. On the other hand, the motivation and design of the SS task for generator learning is rather subtle: the goal of generator learning is to capture the data distribution in $G$, and it is unclear exactly how an additional SS classification task could help.
In this work, we conduct in-depth empirical and theoretical analysis to understand the interaction between the self-supervised tasks ($\Psi$ and $\Phi$) and the learning of generator $G$. Interestingly, our analysis reveals issues in existing works. Specifically, the SS tasks of existing works have a “loophole” that, during generator learning, $G$ could exploit to maximize $\Phi$ without truly learning the data distribution. We show analytically and empirically that a severely mode-collapsed generator can excel at $\Phi$. To address this issue, we propose new SS tasks based on a multi-class minimax game. Our proposed new SS tasks of discriminator and generator compete with each other to reach the equilibrium point. Through this competition, our proposed SS tasks are able to support the GAN task better. Specifically, our analysis shows that our proposed SS tasks enhance matching between $P_d$ and $P_g$ by leveraging the transformed samples used in the SS classification (rotated images when gidaris-iclr-2018 is applied). In addition, our design couples the GAN task and the SS task. To validate our design, we provide theoretical analysis on the convergence property of our proposed SS tasks. Training a GAN with our proposed self-supervised tasks based on the multi-class minimax game significantly improves baseline models. Overall, our system establishes state-of-the-art Fréchet Inception Distance (FID) scores. In summary, our contributions are:
We conduct in-depth empirical and theoretical analysis to understand the issues of self-supervised tasks in existing works.
Based on the analysis, we propose new self-supervised tasks based on a multi-class minimax game.
We conduct extensive experiments to validate our proposed self-supervised tasks.
2 Related works
While training GAN with conditional signals (e.g., ground-truth class labels) has made good progress odena-icml-2017; zhang-arxiv-2018; brock-iclr-2018, training GAN in the unconditional setting is still very challenging. In the original GAN goodfellow-nisp-2014, a single signal (real or fake) per sample is provided to train the discriminator and the generator. With these signals, the generator or discriminator may fall into ill-posed settings, and they may get stuck at bad local minima while still satisfying the signal constraints. To overcome these problems, many regularizations have been proposed. One of the most popular approaches is to enforce (or move towards) a Lipschitz condition on the discriminator. These methods include weight clipping arjovsky-arxiv-2017a, gradient penalty constraints gulrajani-arxiv-2017; roth-nips-2017; kodali-arxiv-2017; petzka-arxiv-2017; liu-arxiv-2018 and spectral norm miyato-iclr-2018. Constraining the discriminator mitigates vanishing gradients and avoids a sharp decision boundary between the real and fake classes.
Using Lipschitz constraints improves the stability of GAN. However, the challenging optimization problem remains when using a single supervisory signal, as in the original GAN goodfellow-nisp-2014. In particular, the learning of the discriminator is highly dependent on generated samples. If the generator collapses to some particular modes of the data distribution, it is only able to create samples around these modes. There is no competition to train the discriminator around other modes. As a result, the gradients of these modes may vanish, and it is impossible for the generator to model the entire data distribution well. Using additional supervisory signals helps the optimization process. For example, self-supervised learning in the form of an auto-encoder has been proposed. AAE makhzani-arxiv-2015 guides the generator towards resembling realistic samples. However, an issue with using an auto-encoder is that pixel-wise reconstruction with the $\ell_2$-norm causes blurry artifacts. VAE/GAN larsen-arxiv-2015, which combines VAE kingma-arxiv-2013 and GAN, is an improved solution: while the discriminator of GAN enables the use of feature-wise reconstruction to overcome the blur, the VAE constrains the generator to mitigate mode collapse. ALI dumoulin-arxiv-2016 and BiGAN donahue-arxiv-2016 jointly train the data/latent samples in the GAN framework. InfoGAN chen-arxiv-2016 infers a disentangled latent representation by maximizing mutual information. The works tran-eccv-2018; tran-aaai-2018 combine two different types of supervisory signals: real/fake signals and a self-supervised signal in the form of an auto-encoder. In addition, auto-encoder based methods, including larsen-arxiv-2015; tran-eccv-2018; tran-aaai-2018, can be considered an approach to mitigate catastrophic forgetting because they regularize the generator to produce samples resembling the real ones. This is similar to EWC kirkpatrick-2017-nas or IS zenke-arxiv-2017, but the regularization is achieved via the output, not the parameters themselves. Although using feature-wise distance in an auto-encoder could reconstruct sharper images, it is still challenging to produce very realistic details of textures or shapes.
Several different types of supervisory signal have been proposed. Instead of using only one discriminator or generator, some works propose ensemble models, such as multiple discriminators tu-nips-2017, a mixture of generators hoang-arxiv-2018; ghosh-cvpr-2018, or an attacker applied as a new player in GAN training liu-cvpr-2019. Recently, training models with auxiliary self-supervised constraints chen-arxiv-2018; tran-arxiv-2019 via multiple pseudo-classes gidaris-iclr-2018 has been shown to improve the stability of the optimization process. This approach is appealing: it is simple to implement and does not require more parameters in the networks (except a small head for the classifier).
3 GAN with Auxiliary Self-Supervised tasks
In chen-arxiv-2018, a self-supervised (SS) value function (also referred to as “self-supervised task”) was proposed for GAN goodfellow-nisp-2014 via image rotation prediction gidaris-iclr-2018. In their work, they showed that the SS task was useful for mitigating the catastrophic forgetting problem of the GAN discriminator. The objectives of the discriminator and generator in chen-arxiv-2018 are shown in Eq. 4 and 5. Essentially, the SS task of the discriminator (denoted by $\Psi$) is to train the classifier $C$ that maximizes the performance of predicting the rotation applied to the real samples. Given this classifier $C$, the SS task of the generator (denoted by $\Phi$) is to train the generator $G$ to produce fake samples that maximize classification performance. The discriminator $D$ and classifier $C$ are the same (shared parameters), except for the last layer, which implements two different heads: a fully-connected layer that returns a one-dimensional output (real or fake) for the discriminator, and another that returns a $K$-dimensional softmax over pseudo-classes for the classifier. $\lambda_d$ and $\lambda_g$ are constants.
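Written out, objectives matching this description could take the following form. This is a reconstruction under assumed notation: $\mathcal{T}$ is the set of transformations, $T_k$ its $k$-th element, $P_d^T$ and $P_g^T$ the mixtures of rotated real and fake samples, and $C_k$ the $k$-th softmax output of $C$.

```latex
\begin{align}
\max_{D,C} V_D(D,C,G) &= V(D,G) + \lambda_d \Bigl( \mathbb{E}_{x \sim P_d^T} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_k(x) \Bigr) \tag{4} \\
\min_{G} V_G(D,C,G) &= V(D,G) - \lambda_g \Bigl( \mathbb{E}_{x \sim P_g^T} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_k(x) \Bigr) \tag{5}
\end{align}
```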
Here, the GAN value function $V(D,G)$ (also referred to as “GAN task”) can be the original minimax GAN objective goodfellow-nisp-2014 or other improved versions. $\mathcal{T}$ is the set of transformations, and $T_k \in \mathcal{T}$ is the $k$-th transformation. The rotation SS task proposed in gidaris-iclr-2018 is applied, and $T_1, T_2, T_3, T_4$ are the 0, 90, 180, 270 degree image rotations, respectively. $P_d, P_g$ are the distributions of real and fake data samples, respectively. $P_d^T, P_g^T$ are the mixture distributions of rotated real and fake data samples (by $T_k \in \mathcal{T}$), respectively. Let $C_k(x)$ be the $k$-th softmax output of classifier $C$, so that $\sum_{k=1}^{K} C_k(x) = 1$. The models are shown in Fig. 1a. In chen-arxiv-2018, empirical evidence of improvements has been provided.
Note that the goal of $\Phi$ is to encourage the generator to produce realistic images. This is because classifier $C$ is trained with real images and captures features that allow detection of rotation. However, the interaction of $\Phi$ with the GAN task $V$ has not been adequately analyzed.
4 Analysis on Auxiliary Self-supervised Tasks
We analyze the SS tasks in chen-arxiv-2018 (Figure 1a). We assume that all networks have enough capacity goodfellow-nisp-2014. Refer to Appendix A for the full derivation. Let $D^*$ and $C^*$ be the optimal discriminator and optimal classifier, respectively, at an equilibrium point. We assume that we have an optimal $D^*$ of the GAN task. We focus on $C^*$ of the SS task. Let $p^{T_k}(x)$ be the probability of sample $x$ under transformation by $T_k$ (Figure 2). $p_d^{T_k}(x)$ and $p_g^{T_k}(x)$ denote the probability $p^{T_k}(x)$ of a data sample ($x \sim P_d^{T_k}$) or a generated sample ($x \sim P_g^{T_k}$), respectively.
The optimal classifier of Eq. 4 is:
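In the notation assumed here ($p_d^{T_k}(x)$ is the density of real samples rotated by $T_k$), the closed form that follows from Bayes' rule with equally likely rotations is:

```latex
\begin{equation}
C_k^*(x) = \frac{p_d^{T_k}(x)}{\sum_{j=1}^{K} p_d^{T_j}(x)} \tag{6}
\end{equation}
```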
Proof. Refer to our proof in Appendix A for optimal .
Given the optimal classifier $C^*$ for SS task $\Psi$, at the equilibrium point, maximizing SS task $\Phi$ of Eq. 5 is equal to maximizing:
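Substituting the optimal classifier into the generator's SS term gives an expression of the following form. This is a sketch under the same assumed notation, with the expectation over the rotation mixture expanded as an average over the $K$ rotations:

```latex
\begin{equation}
\Phi(G, C^*) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{x \sim P_g^{T_k}} \left[ \log \frac{p_d^{T_k}(x)}{\sum_{j=1}^{K} p_d^{T_j}(x)} \right] \tag{7}
\end{equation}
```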
Proof. Refer to our proof in Appendix A.
Theorem 1 depicts the learning of generator $G$ given the optimal $C^*$: selecting $G$ (hence $P_g$) to maximize $\Phi(G, C^*)$. As $C^*$ is trained on real data, $\Phi$ encourages $G$ to learn to generate realistic samples. However, we argue that $G$ can maximize $\Phi$ without actually learning the data distribution $P_d$. In particular, it is sufficient for $G$ to maximize $\Phi$ by simply learning to produce images whose rotated versions are rare (near zero probability). Some example images are shown in Figure 3a. Intuitively, for these images, rotation can be easily recognized.
The argument can be developed from Theorem 1. From (7), it can be shown that the objective is non-positive, since it is an expectation of $\log C_k^*(x)$ and $0 \le C_k^*(x) \le 1$. One way for $G$ to achieve the maximum is to generate $x$ such that $p_d^{T_1}(x) \neq 0$ and $p_d^{T_j}(x) = 0$ for $j \neq 1$. For these $x$, the maximum value of zero is attained. Note that $T_1$ corresponds to 0 degree rotation, i.e., no rotation. Recall that $p_d^{T_k}(x)$ is the probability distribution of data transformed by $T_k$. Therefore the condition $p_d^{T_1}(x) \neq 0$ and $p_d^{T_j}(x) = 0$, $j \neq 1$, means that there is no other rotated image resembling $x$, or equivalently, rotated $x$ does not resemble any other images (Figure 2). Therefore, the generator can exploit this “loophole” to maximize $\Phi$ without actually learning the data distribution. In particular, even a mode-collapsed generator can achieve the maximum of $\Phi$ by generating such images.
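The loophole can be checked numerically on a small discrete example. The sketch below is an illustration constructed for this purpose, not an experiment from the paper: samples are hypothetical `(item, orientation)` pairs, the "horse" item only appears upright in the data, the "ball" item looks the same in every orientation, and the classifier is assumed to be the Bayes-optimal one, $C_k^*(x) = p_d^{T_k}(x) / \sum_j p_d^{T_j}(x)$.

```python
import math

K = 4  # number of rotations (0, 90, 180, 270 degrees)

def rotate(x, k):
    """Apply the k-th rotation to a toy sample x = (item, orientation)."""
    item, orientation = x
    return (item, (orientation + k) % K)

# Toy real-data distribution over (item, orientation) pairs (made-up numbers).
# "horse" is only ever upright; "ball" is equally likely in all orientations.
p_d = {("horse", 0): 0.3}
p_d.update({("ball", r): 0.7 / K for r in range(K)})

def p_rotated(p, x, k):
    """Density at x of the distribution obtained by rotating samples of p by k."""
    return p.get(rotate(x, -k), 0.0)

def optimal_classifier(x):
    """Bayes-optimal rotation classifier: C*_k(x) proportional to p_d^{T_k}(x)."""
    scores = [p_rotated(p_d, x, k) for k in range(K)]
    total = sum(scores)
    return [s / total for s in scores]

def phi(p_g):
    """SS objective of the generator: (1/K) sum_k E_{x~P_g} log C*_k(T_k(x))."""
    value = 0.0
    for x, prob in p_g.items():
        for k in range(K):
            value += prob * math.log(optimal_classifier(rotate(x, k))[k]) / K
    return value

good = dict(p_d)                  # generator that matches the data exactly
collapsed = {("horse", 0): 1.0}   # generator that only produces upright horses

print("Phi(good)      =", phi(good))       # 0.7 * log(1/4), about -0.970
print("Phi(collapsed) =", phi(collapsed))  # 0.0, the maximum possible value
```

In this toy setting the collapsed generator attains the maximum of the objective while the generator that matches the data exactly does not, which is the behaviour the argument above predicts.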
Empirical evidence. Empirically, our experiments (in Appendix B.2.1) show that the FID of the models when using $\Phi$ is poor except for very small $\lambda_g$. We further illustrate this issue with a toy empirical example using CIFAR-10. We augment the training images $x$ with transformed data $T_k(x)$ to train the classifier $C$ to predict the rotation applied to $x$. This is the SS task of the discriminator in Figure 1a. Given this classifier $C$, we simulate the SS task of generator learning as follows. To simulate the output of a good generator $G_{good}$, which generates diverse realistic samples, we choose the full test set of CIFAR-10 (10 classes) images and compute the cross-entropy loss, i.e. $-\Phi$, when they are fed into $C$. To simulate the output of a mode-collapsed generator $G_{collapsed}$, we select samples from one class, e.g. “horse”, and compute the cross-entropy loss when they are fed into $C$. Fig. 3b shows that some $G_{collapsed}$ can outperform $G_{good}$ and achieve a smaller $-\Phi$. E.g. a $G_{collapsed}$ that produces only “horse” samples outperforms $G_{good}$ under $\Phi$. This example illustrates that, while $\Phi$ may help the generator to create more realistic samples, it does not help the generator to prevent mode collapse. In fact, as part of the multi-task loss (see (5)), $\Phi$ would undermine the learning of synthesizing diverse samples in the GAN task $V$.
5 Proposed method
5.1 Auxiliary Self-Supervised Tasks with Multi-class Minimax Game
In this section, we propose improved SS tasks to address the issue (Fig. 1b). Based on a multi-class minimax game, our classifier learns to distinguish the rotated samples from real data versus those from generated data. Our proposed SS tasks are $\Psi^+$ and $\Phi^+$ in (8) and (9) respectively. Our discriminator objective is:
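An objective consistent with the description that follows (generated samples assigned to an extra $(K+1)$-th class) can be written as below; the exact symbols are an assumed reconstruction:

```latex
\begin{equation}
\max_{D,C} V_D(D,C,G) = V(D,G) + \lambda_d \underbrace{\Bigl( \mathbb{E}_{x \sim P_d^T} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_k(x) + \mathbb{E}_{x \sim P_g^T} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_{K+1}(x) \Bigr)}_{\Psi^+(G,C)} \tag{8}
\end{equation}
```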
Eq. 8 means that we simultaneously distinguish generated samples, as the $(K+1)$-th class, from the rotated real sample classes. Here, $C_{K+1}$ is the $(K+1)$-th output of classifier $C$, for the fake class.
While rotated real samples are fixed samples that help prevent the classifier (discriminator) from forgetting, the class $K+1$ serves as the connecting point between generator and classifier, and the generator can directly challenge the classifier. Our technique resembles the original GAN by Goodfellow et al. goodfellow-nisp-2014, but we generalize it to a multi-class minimax game. Our generator objective is:
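A generator objective of the matching minimax form, in which the generator pushes its rotated samples towards the rotation classes and away from the fake class, could be written as follows (again an assumed reconstruction rather than a verbatim equation):

```latex
\begin{equation}
\min_{G} V_G(D,C,G) = V(D,G) - \lambda_g \underbrace{\Bigl( \mathbb{E}_{x \sim P_g^T} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_k(x) - \mathbb{E}_{x \sim P_g^T} \mathbb{E}_{T_k \sim \mathcal{T}} \log C_{K+1}(x) \Bigr)}_{\Phi^+(G,C)} \tag{9}
\end{equation}
```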
$\Psi^+$ and $\Phi^+$ form a multi-class minimax game. Note that, when we mention the multi-class minimax game (or multi-class adversarial training), we refer to the SS tasks. The game for the GAN task is the original one by Goodfellow et al. goodfellow-nisp-2014.
5.1.1 Theoretical Analysis
For a fixed generator $G$, the optimal solution $C^*$ under Eq. 8 is:
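Maximizing the $(K+1)$-class log-likelihood pointwise, subject to the softmax outputs summing to one, gives the following closed form. It is derived here under the assumption that the $K$ rotations are applied with equal probability, so that $p_d^T(x) = \frac{1}{K}\sum_{j} p_d^{T_j}(x)$:

```latex
\begin{equation}
C_k^*(x) = \frac{p_d^{T}(x)}{p_g^{T}(x) + p_d^{T}(x)} \cdot \frac{p_d^{T_k}(x)}{\sum_{j=1}^{K} p_d^{T_j}(x)}, \quad k = 1, \dots, K,
\qquad
C_{K+1}^*(x) = \frac{p_g^{T}(x)}{p_g^{T}(x) + p_d^{T}(x)} \tag{10}
\end{equation}
```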
where $p_d^T(x)$ and $p_g^T(x)$ are the probabilities of sample $x$ in the mixture distributions $P_d^T$ and $P_g^T$, respectively.
Proof. Refer to our proof in Appendix A for optimal .
Given the optimal classifier $C^*$ obtained from multi-class minimax training $\Psi^+$, at the equilibrium point, maximizing $\Phi^+$ is equal to maximizing Eq. 11:
Proof. Refer to our proof in Appendix A.
Note that the proposed SS task objective (11) differs from the original SS task objective (7) by the KL divergence term. Furthermore, note that $\mathrm{KL}(P_g^{T_k} \| P_d^{T_k}) = \mathrm{KL}(P_g \| P_d)$, as rotation is an affine transform and KL divergence is invariant under affine transforms (our proof in Appendix A). Therefore, the improvement is clear: the proposed SS tasks work together to improve the matching of $P_g$ and $P_d$ by leveraging the rotated samples. For a given $P_g$, feedback is computed not only from $P_g$ but also from $P_g^{T_k}$ via the rotated samples. Therefore, $G$ has more feedback to improve $P_g$. We investigate the improvement of our method on the toy dataset as in Section 4. The setup is the same, except that now we replace the models/cost functions of the SS tasks with our proposed ones (the designs of $G_{good}$ and $G_{collapsed}$ are the same). The loss is now shown in Fig. 3c. Comparing Fig. 3c and Fig. 3b, the improvement using our proposed model can be observed: $G_{good}$ has the lowest loss under our proposed model. Note that, since optimizing KL divergence is not easy because it is asymmetric and could be biased to one direction tu-nips-2017, in our implementation we use a slightly modified version as described in the Appendix.
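The effect of the extra fake class can be illustrated on a small discrete example with made-up numbers. The sketch below is not the CIFAR-10 experiment of Fig. 3c: samples are hypothetical `(item, orientation)` pairs, the classifier is assumed to be the optimal $(K+1)$-way classifier for the given generator, and the generator term is assumed to be $\mathbb{E}[\log C_k - \log C_{K+1}]$ on rotated generated samples. The ordering it prints holds for the probabilities chosen here and may change for other choices.

```python
import math

K = 4  # number of rotations

def rotate(x, k):
    """Apply the k-th rotation to a toy sample x = (item, orientation)."""
    item, orientation = x
    return (item, (orientation + k) % K)

# Toy real-data distribution (made-up numbers): "horse" is always upright,
# "ball" is equally likely in every orientation.
p_d = {("horse", 0): 0.3}
p_d.update({("ball", r): 0.7 / K for r in range(K)})

def p_rotated(p, x, k):
    """Density at x of the distribution obtained by rotating samples of p by k."""
    return p.get(rotate(x, -k), 0.0)

def mixture(p, x):
    """Density at x of the equal-weight mixture over the K rotations of p."""
    return sum(p_rotated(p, x, k) for k in range(K)) / K

def optimal_classifier_plus(p_g, x):
    """Optimal (K+1)-way classifier: K rotation classes plus one fake class."""
    denom = mixture(p_d, x) + mixture(p_g, x)
    real = [p_rotated(p_d, x, k) / K / denom for k in range(K)]
    fake = mixture(p_g, x) / denom
    return real + [fake]

def phi_plus(p_g):
    """(1/K) sum_k E_{x~P_g} [log C*_k(T_k x) - log C*_{K+1}(T_k x)]."""
    value = 0.0
    for x, prob in p_g.items():
        for k in range(K):
            c = optimal_classifier_plus(p_g, rotate(x, k))
            value += prob * (math.log(c[k]) - math.log(c[K])) / K
    return value

good = dict(p_d)                  # generator that matches the data exactly
collapsed = {("horse", 0): 1.0}   # generator that only produces upright horses

print("Phi+(good)      =", phi_plus(good))       # about -0.970
print("Phi+(collapsed) =", phi_plus(collapsed))  # log(0.3), about -1.204
```

Here the collapsed generator no longer scores best: the fake-class output penalizes the mismatch between the generated and real rotated samples.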
6 Experiments
We measure the diversity and quality of generated samples via the Fréchet Inception Distance (FID) heusel-arxiv-2017. Unless stated otherwise, FID is computed with 10K real samples and 5K generated samples exactly as in miyato-iclr-2018. We report the best FID attained in 300K iterations as in xiang-arxiv-2017; li-nips-2017; tran-eccv-2018; yazici-arxiv-2018. We integrate our proposed techniques into two baseline models (SSGAN chen-arxiv-2018 and Dist-GAN tran-eccv-2018). We conduct experiments mainly on CIFAR-10 and STL-10 (resized to 48×48 as in miyato-iclr-2018). We also provide additional experiments on CIFAR-100, ImageNet 32×32 and Stacked-MNIST.
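For readers unfamiliar with the metric, the Fréchet distance underlying FID can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments: it assumes the Inception features of the real and generated images have already been extracted into two `(N, d)` arrays, and the function name `frechet_distance` is hypothetical.

```python
import numpy as np

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_real, feats_fake: arrays of shape (N, d), one feature vector per row.
    """
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # tr(sqrt(cov_r cov_f)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_f) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_f) - 2.0 * tr_sqrt)
```

A lower value indicates that the generated feature distribution is closer to the real one in both mean and covariance, which is why the score is sensitive to both sample quality and diversity.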
For Dist-GAN tran-eccv-2018, we evaluate three versions implemented with different network architectures: the DCGAN architecture radford-arxiv-2015, the CNN architecture of SN-GAN miyato-iclr-2018 (referred to as the SN-GAN architecture) and the ResNet architecture gulrajani-arxiv-2017. We recall these network architectures in Appendix C. We use the ResNet architecture gulrajani-arxiv-2017 for the experiments on CIFAR-100 and ImageNet 32×32, and the tiny K/4, K/2 architectures metz-arxiv-2016 for Stacked MNIST. We keep all parameters suggested in the original work and focus on understanding the contribution of our proposed techniques. For SSGAN chen-arxiv-2018, we use the ResNet architecture as implemented in the official code (https://github.com/google/compare_gan).
In our experiments, we use SS to denote the original self-supervised tasks proposed in chen-arxiv-2018, and we use MS to denote our proposed self-supervised tasks, “Multi-class minimax game based Self-supervised tasks”. Details of the experimental setup and network parameters are discussed in Appendix B.
We have conducted extensive experiments. Setup and results are discussed in Appendix B. In this section, we highlight the main results:
Comparison between SS and our proposed MS using the same baseline.
Comparison between our proposed baseline + MS and other state-of-the-art unconditional and conditional GAN. We emphasize that our proposed baseline + MS is unconditional and does not use any label.
6.1 Comparison between SS and our proposed MS using the same baseline
Results are shown in Fig. 4 using Dist-GAN tran-eccv-2018 as the baseline. For each experiment and for each approach (SS or MS), we obtain the best $\lambda_d$ and $\lambda_g$ using extensive search (see Appendix B.4 for details), and we use the best $\lambda_d$ and $\lambda_g$ in the comparison depicted in Fig. 4. In our experiments, we observe that Dist-GAN has stable convergence. Therefore, we use it in these experiments. As shown in Fig. 4, our proposed MS outperforms the original SS consistently. More details can be found in Appendix B.4.
6.2 Comparison between our proposed method and other state-of-the-art GAN
Main results are shown in Table 1. Details of this comparison can be found in Appendix B.4. The best $\lambda_d$ and $\lambda_g$ as in Figure 4 are used in this comparison. The best FID attained in 300K iterations is reported as in xiang-arxiv-2017; li-nips-2017; tran-eccv-2018; yazici-arxiv-2018. Note that the SN-GAN method miyato-iclr-2018 attains the best FID at about 100K iterations with ResNet and diverges afterward. A similar observation is also discussed in chen-arxiv-2018.
As shown in Table 1, our method (Dist-GAN + MS) consistently outperforms the baseline Dist-GAN and other state-of-the-art GAN. These results confirm the effectiveness of our proposed self-supervised tasks based on multi-class minimax game.
|Methods||SN-GAN arch., CIFAR-10||SN-GAN arch., STL-10||ResNet arch., CIFAR-10||ResNet arch., STL-10||ResNet arch., CIFAR-10 (10K-10K FID)|
|SN-GAN miyato-iclr-2018||25.5||43.2||21.70 ± .21||40.10 ± .50||19.73|
|Dist-GAN tran-eccv-2018||22.95||36.19||17.61 ± .30||28.50 ± .49||13.01|
|GN-GAN tran-aaai-2018||21.70||30.80||16.47 ± .28||-||-|
|SAGAN zhang-arxiv-2018 (cond.)||-||-||13.4 (best)||-||-|
|BigGAN brock-iclr-2018 (cond.)||-||-||14.73||-||-|
|Ours (SSGAN + MS)||-||-||-||-||19.89|
|Dist-GAN + SS||21.40||29.79||14.97 ± .29||27.98 ± .38||12.37|
|Ours (Dist-GAN + MS)||18.88||27.95||13.90 ± .22||27.10 ± .34||11.40|
We have also extracted the FID reported in chen-arxiv-2018, i.e. SSGAN with the original SS tasks proposed there. In this case, we follow exactly their settings and compute FID using 10K real samples and 10K fake samples. Our model achieves better FID score than SSGAN with exactly the same ResNet architecture on CIFAR-10 dataset. See results under the column CIFAR-10 in Table 1.
Note that we have tried to reproduce the results of SSGAN using its published code, but we were unable to achieve similar results as reported in the original paper chen-arxiv-2018. We have performed extensive search and we use the obtained best parameter to report the results as SSGAN in Table 1 (i.e., SSGAN uses the published code and the best parameters we obtained). We use this code and setup to compare SS and MS, i.e. we replace the SS code in the system with MS code, and obtain “SSGAN + MS”. As shown in Table 1, our “SSGAN + MS” achieves better FID than SSGAN. The improvement is consistent with Figure 4 when Dist-GAN is used as the baseline. More detailed experiments can be found in the Appendix. We have also compared SSGAN and our system (SSGAN + MS) on CelebA (). In this experiment, we use a small DCGAN architecture provided in the authors’ code. Our proposed MS outperforms the original SS, with FID improved from to . This experiment again confirms the effectiveness of our proposed MS.
We conduct additional experiments on CIFAR-100 and ImageNet 32×32 to compare SS and MS with the Dist-GAN baseline. We use the same ResNet architecture as in Section B.4 on CIFAR-10 for this study, and we use the best parameters $\lambda_d$ and $\lambda_g$ selected in Section B.4 for the ResNet architecture. Experimental results in Table 2 show that our MS consistently outperforms SS on all benchmark datasets. For ImageNet 32×32 we report the best FID for SS because the model suffers serious mode collapse at the end of training. Our MS achieves its best performance at the end of training.
|Datasets||SS||MS|
|CIFAR-100 (10K-5K FID)||21.02||19.74|
|ImageNet 32×32 (10K-10K FID)||17.1||12.3|
We also evaluate the diversity of our generator on Stacked MNIST metz-arxiv-2016. Each image of this dataset is synthesized by stacking any three random MNIST digits. We follow exactly the same experimental setup, with the tiny K/4 and K/2 architectures, and the evaluation protocol of metz-arxiv-2016. We measure the quality of methods by the number of covered modes (higher is better) and KL divergence (lower is better). Refer to metz-arxiv-2016 for more details. Table 3 shows that our proposed MS outperforms SS for both mode number and KL divergence. Our approach significantly outperforms the state of the art tran-eccv-2018; karras-iclr-2018. The means and standard deviations of MS and SS are computed from eight runs (we re-train our GAN model from scratch for each run). The results are reported with best of : for architecture and for architecture. Similarly, best of : for architecture and for architecture.
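The two Stacked MNIST metrics can be sketched as follows. This is an illustrative outline rather than the protocol code of metz-arxiv-2016: it assumes a pretrained digit classifier has already labelled the three channels of every generated image, that the 1000 possible digit triples are equally likely in the data, and that the KL divergence is taken from the generated mode histogram to that uniform distribution. Both function names are hypothetical.

```python
import math
from collections import Counter

def mode_id(d0, d1, d2):
    """Map the three predicted digits of a stacked image to a mode in 0..999."""
    return 100 * d0 + 10 * d1 + d2

def mode_coverage_and_kl(predicted_modes, num_modes=1000):
    """Return (number of covered modes, KL(generated || uniform))."""
    counts = Counter(predicted_modes)
    n = len(predicted_modes)
    uniform = 1.0 / num_modes
    # Modes that were never generated contribute zero to the KL sum.
    kl = sum((c / n) * math.log((c / n) / uniform) for c in counts.values())
    return len(counts), kl
```

A generator that covers every mode evenly gives 1000 modes and a KL close to zero, while a generator collapsed onto a single digit triple gives one mode and a KL of log(1000).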
|Arch||Unrolled GAN metz-arxiv-2016||WGAN-GP gulrajani-arxiv-2017||Dist-GAN tran-eccv-2018||Pro-GAN karras-iclr-2018||tran-eccv-2018+SS||Ours (tran-eccv-2018+MS)|
|K/4, #||372.2 ± 20.7||640.1 ± 136.3||859.5 ± 68.7||859.5 ± 36.2||906.75 ± 26.15||926.75 ± 32.65|
|K/4, KL||4.66 ± 0.46||1.97 ± 0.70||1.04 ± 0.29||1.05 ± 0.09||0.90 ± 0.13||0.78 ± 0.13|
|K/2, #||817.4 ± 39.9||772.4 ± 146.5||917.9 ± 69.6||919.8 ± 35.1||957.50 ± 31.23||976.00 ± 10.04|
|K/2, KL||1.43 ± 0.12||1.35 ± 0.55||1.06 ± 0.23||0.82 ± 0.13||0.61 ± 0.15||0.52 ± 0.07|
Finally, in Table 1, we compare our FID to SAGAN zhang-arxiv-2018 (a state-of-the-art conditional GAN) and BigGAN brock-iclr-2018. We perform the experiments under the same conditions using the ResNet architecture on the CIFAR-10 dataset. We report the best FID that SAGAN can achieve. As the SAGAN paper zhang-arxiv-2018 does not have CIFAR-10 results, we run the published SAGAN code and select the best parameters to obtain the results for CIFAR-10. For BigGAN, we take the best FID from the original paper. Although our method is unconditional, our best FID is very close to that of these state-of-the-art conditional GANs. This validates the effectiveness of our design. Generated images using our system can be found in Figures 5 and 6 of Appendix B.
7 Conclusion
We provide theoretical and empirical analysis of auxiliary self-supervised tasks for GAN. Our analysis reveals the limitation of the existing work. To address the limitation, we propose multi-class minimax game based self-supervised tasks. Our proposed self-supervised tasks leverage the rotated samples to provide better feedback in matching the data and generator distributions. Our theoretical and empirical analysis supports the improved convergence of our design. Our proposed SS tasks can be easily incorporated into existing GAN models. Experimental results suggest that they help boost the performance of baselines implemented with various network architectures on the CIFAR-10, CIFAR-100, STL-10, CelebA, ImageNet 32×32, and Stacked-MNIST datasets. The best version of our proposed method establishes state-of-the-art FID scores on all these benchmark datasets.
This work was supported by ST Electronics and the National Research Foundation (NRF), Prime Minister’s Office, Singapore under Corporate Laboratory @ University Scheme (Programme Title: STEE Infosec - SUTD Corporate Laboratory). This research was also supported by the National Research Foundation Singapore under its AI Singapore Programme [Award Number: AISG-100E-2018-005]. This project was also supported by SUTD project PIE-SGP-AI-2018-01.
Proposition 1 (Proof.)
Let $T_k$ be the $k$-th type of transformation, and let $P_d^{T_k}$ be the distribution of the transformed real samples. This section shows the proof for the optimal $C^*$. $C_k(x)$ is the $k$-th softmax output of $C$, hence $\sum_{k=1}^{K} C_k(x) = 1$. $\Psi$ can be rewritten as:
where $p(T_k \mid x)$ is the probability that $x$ belongs to class $k$, which can be considered as the $k$-th output of the “ground-truth” classifier on sample $x$ that we expect the classifier to predict. Assume that $\Psi$ has a first-order derivative with respect to $C_k(x)$. The optimal solution of $C_k(x)$ can be obtained by setting this derivative equal to zero:
For any $k$, setting the derivative to zero, the optimal $C_k^*(x)$ has the following form:
$p(T_k \mid x) = p(x \mid T_k)\, p(T_k) / p(x)$ according to Bayes’ theorem, and $p(T_k) = 1/K$ (the probabilities with which we apply the transformations to a sample are equal). We finally obtain the optimal $C_k^*(x)$ from Eq. 14: $C_k^*(x) = p_d^{T_k}(x) / \sum_{j=1}^{K} p_d^{T_j}(x)$. That concludes our proof.
Theorem 1 (Proof.) Substitute $C^*$ obtained above into $\Phi$:
Substitute into (15) we have:
That concludes our proof.
Proposition 2 (Proof.) Training the self-supervised task with the minimax game is similar to the previous objective, except for the additional fake-class term, as below:
Assume that $\Psi^+$ has a first-order derivative with respect to $C_k(x)$. The optimal $C_k^*(x)$ can be derived by setting the derivative of $\Psi^+$ equal to zero as follows:
Similar to above, for any , we have the derivative :
Setting , and we get optimal , :
With , we obtain the derivative of :
Setting , and finally we get optimal , :
Because , we finally obtain the optimal from Eq. 20: . That concludes the proof.
Theorem 2 (Proof.) Substitute optimal obtained above into :
The first term can be written as:
With the note that and . Moving the first term of Eq. 24 from the right side to left side, it concludes the proof.
KL divergence is invariant to affine transform.
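Let $x \in \mathbb{R}^n$ be a random variable, and let $p_x(x)$ be a distribution defined on $x$. Let $T$ be an affine transform, i.e., $T(x) = Ax + b$, where $A$ is a full-rank matrix and $b \in \mathbb{R}^n$. Then for the random variable $y = T(x)$, $p_y(y) = p_x(x)\, |\det \mathcal{J}|^{-1}$, where $\mathcal{J}$ is the Jacobian matrix, with its $(i,j)$-th entry defined as:
Obviously, $\mathcal{J} = A$. Then we have $p_y(y) = p_x(x)\, |\det A|^{-1}$.
Let $p_x$ and $q_x$ be two distributions defined on $x$. Then let $p_y$ and $q_y$ be the corresponding distributions defined on $y$. Then we have $p_y(y) = p_x(x)\, |\det A|^{-1}$ and $q_y(y) = q_x(x)\, |\det A|^{-1}$.
Using the definition of the KL divergence between $p_y$ and $q_y$, we have: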
As , then we have:
According to the property of multiple integral, we have:
It concludes our proof.
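For reference, the change-of-variables argument can be condensed into a single chain of equalities. The notation ($p_x$, $q_x$ on $x$, $p_y$, $q_y$ on $y = Ax + b$) is assumed here for illustration; the two Jacobian factors inside the logarithm cancel, and $dy = |\det A|\, dx$ cancels the remaining one:

```latex
\begin{align}
\mathrm{KL}(p_y \,\|\, q_y)
&= \int p_y(y) \log \frac{p_y(y)}{q_y(y)} \, dy \\
&= \int p_x(x)\, |\det A|^{-1} \log \frac{p_x(x)\, |\det A|^{-1}}{q_x(x)\, |\det A|^{-1}} \, |\det A| \, dx \\
&= \int p_x(x) \log \frac{p_x(x)}{q_x(x)} \, dx
 = \mathrm{KL}(p_x \,\|\, q_x)
\end{align}
```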
The KL divergence between the real and fake distributions is equal to that between the real and fake distributions rotated by $T_k$:
Note that we apply the above theorem of invariance of KL, with being respectively, and image rotation as the transform.
Here, we discuss details of our implementation. For the SS tasks, we follow the geometric transformation of gidaris-iclr-2018 to augment images and compute pseudo-labels. It is simple yet effective and currently the state of the art among self-supervised tasks. In particular, we train the discriminator to recognize the 2D rotations that were applied to the input image. We rotate the input image with $K = 4$ rotations (0°, 90°, 180°, 270°) and assign them the pseudo-labels from 1 to $K$.
To implement our model, the GAN objectives for the discriminator and generator can be the ones in the original GAN by Goodfellow et al. goodfellow-nisp-2014, or other variants. In our work, we conduct experiments to show improvements with two baseline models: the original SSGAN chen-arxiv-2018 and Dist-GAN tran-eccv-2018.
We integrate SS tasks into Dist-GAN tran-eccv-2018 and conduct study with this baseline. In our experiments, we observe that Dist-GAN has good convergence property and this is important for our ablation study.
Moreover, in practice, achieving the equilibrium point for the optimal D, G, C is difficult. Therefore, inspired by tran-eccv-2018, we propose a new generator objective to improve Eq. 9, as written in Eq. 34. It couples the convergence of $P_g$ and $P_d$, which makes the learning more stable. Our intuition is that if the generator distribution is similar to the real distribution, the classification performance on its transformed fake samples should be similar to that on transformed real samples. Therefore, we propose to match the self-supervised tasks of real and fake samples to train the generator. In other words, if real and fake samples are from similar distributions, the same tasks applied to real and fake samples should result in similar behaviors. In particular, given the cross-entropy loss computed on real samples, we train the generator to create samples that are able to match this loss. Here, we use the $\ell_1$-norm for $|\cdot|$, and $V(D,G)$ is the objective of the GAN task tran-eccv-2018. In our implementation, we randomly select a geometric transformation $T_k$ for each data sample when training the discriminator. The same $T_k$ are applied to the generated samples when matching the self-supervised tasks to train the generator.
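The loss-matching idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact form of Eq. 34: it takes the classifier's softmax outputs on rotated real and rotated fake batches as given NumPy arrays, uses only the $K$ rotation classes, and measures the mismatch with an absolute difference. The function names are hypothetical.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross-entropy of softmax outputs `probs` (N, K) against integer labels (N,)."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

def generator_ss_matching_loss(probs_real, labels_real, probs_fake, labels_fake):
    """Absolute gap between the SS losses on rotated real and rotated fake samples.

    The generator is trained to shrink this gap, so that the rotation task
    behaves the same way on generated data as on real data.
    """
    loss_real = cross_entropy(probs_real, labels_real)
    loss_fake = cross_entropy(probs_fake, labels_fake)
    return abs(loss_real - loss_fake)
```

In a training loop this term would be weighted and added to the GAN loss of the generator, with the real-sample loss treated as a fixed target for the current step.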
For this objective of generator, similar to Eq. 24, we have: