code for paper "on positive-unlabeled classification in gan"
This paper defines a positive and unlabeled classification problem for standard GANs, which leads to a novel technique for stabilizing the training of the discriminator. Traditionally, real data are taken as positive while generated data are negative. This positive-negative classification criterion is kept fixed throughout the learning process of the discriminator, without considering the gradually improving quality of generated data, even though they can at times be more realistic than real data. In contrast, it is more reasonable to treat the generated data as unlabeled, which could be positive or negative according to their quality. The discriminator is thus a classifier for this positive and unlabeled classification problem, and we derive a new Positive-Unlabeled GAN (PUGAN). We theoretically discuss the global optimality that the proposed model achieves and the equivalent optimization goal. Empirically, we find that PUGAN achieves comparable or even better performance than sophisticated discriminator stabilization methods.
Recently, deep generative models have achieved remarkable results in image generation tasks [11, 18, 21, 4]. As a representative generative model, GANs approximate a target distribution by playing a min-max game. In the standard framework of GAN [4, 19], a generator takes noise vectors from a prior distribution (e.g. a Gaussian or uniform distribution) as input and tries to produce data that follow the distribution of the reference natural images, while the discriminator aims to distinguish the generated data from the real data. GAN methods have been developed for many interesting applications. For example, in the image-to-image translation task, the generator maps an input image to an output image. Representative methods include Pix2pix, trained on paired images, and CycleGAN, trained in an unsupervised way.
In vanilla GANs, training often lacks stability, and the quality of generated images is not always satisfactory (e.g. mode collapse). Several lines of work address these problems. DCGAN carefully designed the neural architectures of the generator and the discriminator. Progressive GAN generated high-resolution images by progressively deepening the network. BigGAN produced high-quality images by improving training methods, e.g. enlarging the batch size and truncating the latent space. WGAN and WGAN-GP optimized an approximation of the Wasserstein distance to stabilize the generation process. SNGAN showed the necessity and benefits of imposing Lipschitz continuity on the discriminator.
These methods for stabilizing GANs can be roughly divided into two categories: designing stable network structures and training strategies, and developing new, effective optimization goals. However, neither has stepped away from the positive-negative classification problem initially established in the standard GAN. Although WGAN and WGAN-GP no longer treat the discriminator as a classifier of real and generated data, the aim of the discriminator is still to separate the real and generated data as far as possible. To the best of our knowledge, existing GAN models attempt to strictly distinguish between generated and real data and ignore the fact that generated samples differ in quality. It is unfair to treat high-quality samples the same as low-quality samples, especially when the high-quality samples are sufficiently realistic. Although many theoretical results have been proposed to justify the final equilibrium, such as the vanilla GAN proving the existence of the equilibrium and WGAN replacing the JS divergence with the Wasserstein distance, these analyses mainly focus on the final outcome rather than the intermediate states of the training process.
In this paper, we suggest that, instead of an ordinary positive-negative classification problem (i.e. real vs. fake), GAN actually faces a positive-unlabeled classification problem. With adequate training, generated data can look real and may at times appear even more realistic than real data. It would then be illogical to stereotype all generated data as fake. To keep up with the continuously improving quality of generated data, we treat them as unlabeled data consisting of low-quality and high-quality samples. The high-quality samples are considered to be close to, or even better than, some real data. Within the framework of positive-unlabeled classification, the classification objective of the standard GAN can be redefined, and different variants can easily be obtained by considering different scoring functions (e.g. those in LSGAN and HingeGAN). In addition, we drop the class balance constraint (i.e. that half of the samples are fake) and observe an impressive performance improvement when increasing the share of generated data in the mini-batch. Our theoretical analysis suggests that the proposed algorithm has a guaranteed final equilibrium. Experimental results on benchmark datasets demonstrate that it enjoys more stable training and thus produces better generated samples.
In this section, we first review the standard Generative Adversarial Network (SGAN). Then we analyze the problem existing in GAN and define a new role for the discriminator $D$. We develop this idea into a new algorithm within the framework of SGAN, and then extend this algorithm to general GANs, which shows the flexibility of our method.
The standard GAN was introduced by Goodfellow et al. (2014). It consists of two neural networks: a discriminator network $D$ and a generator network $G$. The discriminator $D$ aims to distinguish the provided real data from the fake data generated by the generator $G$. The generator $G$, in turn, aims to generate fake data that can fool the discriminator $D$. Through this adversarial process, we expect that $G$ can generate high-quality data in the end. Formally, the objective function of GAN can be written as

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))],$$

where $p_{data}$ indicates the distribution of real data, $z$ is the random noise sampled from a prior distribution $p_z$ (e.g. the Gaussian distribution), and $D(x)$ is the probability, predicted by the discriminator, that $x$ is real. Since the minimax objective may lead to vanishing gradients for $G$ when $D$ can perfectly distinguish the two data sets, many GAN variants (e.g. WGAN and LSGAN) transform this minimax game into a non-saturating game. In general, the objective functions of these GANs can be summarized with two loss terms: $f_1$, the loss of classifying an input as real, and $f_2$, the loss of classifying an input as fake.
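As a sketch of this general form, the two objectives can be written with one loss per class. The symbols $L_D$, $L_G$, $f_1$, and $f_2$, as well as the non-saturating generator term, are notational assumptions here rather than a transcription of the original display.

```latex
\begin{aligned}
\min_D \; L_D &= \mathbb{E}_{x \sim p_{data}}\big[f_1(D(x))\big]
               + \mathbb{E}_{z \sim p_z}\big[f_2(D(G(z)))\big], \\
\min_G \; L_G &= \mathbb{E}_{z \sim p_z}\big[f_1(D(G(z)))\big].
\end{aligned}
```

With $f_1(t) = -\log t$ and $f_2(t) = -\log(1 - t)$, minimizing $L_D$ is the same as maximizing the standard GAN value function over $D$.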
As shown above, existing GAN variants are trained to strictly separate the real and generated data. However, this does not match the actual situation in training. Some generated samples achieve higher quality and are more realistic than others, and this usually persists until the end of training. As a result, the quality of samples generated by $G$ varies widely: there are many high-quality samples alongside a considerable proportion of low-quality ones. For instance, the well-known mode collapse problem can be viewed as a generation space consisting of some high-quality, non-repetitive samples and a remainder of repetitive samples. These duplicate samples can be considered low-quality samples that still need to be improved. Even a well-trained generation space contains a certain proportion of unsatisfactory samples, and the quality gap between generated samples is relatively large. Therefore, the traditional approach of strictly distinguishing real samples from generated samples does not conform to the actual training situation. In this paper, we propose an algorithm dedicated to picking out low-quality samples from those generated by $G$ and improving them, unlike traditional discriminators, which are dedicated to distinguishing real samples from generated ones. Our algorithm encourages the discriminator to divide the generation space of $G$ into high-quality and low-quality samples so that the generator can improve the low-quality ones.
By doing so, the proposed method enjoys several desirable properties: i) the discriminator pays more attention to poor-quality samples, allowing the generator to focus on improving them, so the quality of generated samples is more balanced and the overall quality is expected to improve; ii) the training strategy is more in line with the actual state of the samples generated by $G$ during training, so the training process is expected to be more stable; iii) more importantly, the proposed algorithm is flexible, meaning that it can easily be integrated into a variety of existing GAN frameworks, as shown in the following subsection; and iv) although the proposed method changes the function of the discriminator $D$, the theoretical results in Section 3 demonstrate that it enjoys the same equilibrium condition as other GANs, which provides some guarantee of performance.
Above, we discussed the problem in the current GAN framework and proposed allowing some good samples to be recognized as real data. In this part, we first show how to achieve this in the framework of the standard GAN, and then extend it to general GANs.
As mentioned above, we propose to allow the discriminator to treat high-quality generated samples as real data and to focus on the bad generated samples. The discriminator is required to learn how to distinguish high-quality samples from low-quality ones. Identifying high-quality samples among generated samples under the guidance of real samples is very similar to the Positive-Unlabeled (PU) classification problem [3, 12, 23], where only some positive samples are labeled, and the classifier tries to find positive samples among unlabeled samples consisting of both positive and negative samples. Following the solution of the PU classification problem, we develop an algorithm that learns a discriminator to recognize high-quality samples. First, we denote the generated data as $x_g$, consisting of high-quality samples $x_{gr}$ and bad samples $x_{gf}$. We consider both $x_{gr}$ and the real data $x_{data}$ to be real (the positive class), while $x_{gf}$ is considered fake (the negative class). In addition, denote by $p_g(x)$ the marginal density of $x_g$, and let $p_{gr}(x)$ and $p_{gf}(x)$ be the class-conditional densities of $x_{gr}$ and $x_{gf}$, respectively. Then $p_g(x)$ can be obtained as

$$p_g(x) = \pi\, p_{gr}(x) + (1 - \pi)\, p_{gf}(x),$$

where $\pi$ is the unknown class prior (i.e., the proportion of $x_{gr}$ in $x_g$). We have now separated the generated space into two parts. To classify $x_{gr}$ and $x_{gf}$ within $x_g$, with $D$ as a binary classifier that learns the distribution of $x_{gr}$ from $x_{data}$, we need to minimize its expected misclassification rate $R(D)$. For a given $D$, the loss to minimize is the sum of two expectations: the loss $f_1$ incurred when the ground-truth label is positive, taken over $p_{gr}$ and weighted by $\pi$, and the loss $f_2$ incurred when the ground-truth label is negative, taken over $p_{gf}$ and weighted by $1 - \pi$. However, we do not know which generated samples belong to $x_{gr}$. In our definition, the good samples are similar to the real data, which means that $p_{gr}$ can be replaced by $p_{data}$, so the first expectation can be computed on real data.

Similarly, the bad generated samples $x_{gf}$ are unknown, and we can only access the generated samples $x_g$ and the real data $x_{data}$. $R(D)$ should therefore be modified to avoid the term in $p_{gf}$. From Eq. (5), the low-quality part can be expressed through the mixture density as $(1 - \pi)\, p_{gf}(x) = p_g(x) - \pi\, p_{gr}(x)$. Replacing $p_{gr}$ by $p_{data}$ once more, the expectation over $p_{gf}$ becomes an expectation over $p_g$ minus $\pi$ times an expectation over $p_{data}$.
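The chain of substitutions described above can be sketched as follows. This is a reconstruction under stated assumptions, not a transcription: $f_1$ and $f_2$ denote the losses for the positive and negative labels, $R(D)$ denotes the expected misclassification risk, and $p_{gr}$ is approximated by $p_{data}$.

```latex
\begin{aligned}
R(D) &= \pi\,\mathbb{E}_{x \sim p_{gr}}\big[f_1(D(x))\big]
      + (1-\pi)\,\mathbb{E}_{x \sim p_{gf}}\big[f_2(D(x))\big] \\
     &\approx \pi\,\mathbb{E}_{x \sim p_{data}}\big[f_1(D(x))\big]
      + (1-\pi)\,\mathbb{E}_{x \sim p_{gf}}\big[f_2(D(x))\big], \\
(1-\pi)\,p_{gf}(x) &= p_g(x) - \pi\,p_{gr}(x) \;\approx\; p_g(x) - \pi\,p_{data}(x), \\
R(D) &\approx \pi\,\mathbb{E}_{x \sim p_{data}}\big[f_1(D(x))\big]
      + \mathbb{E}_{x \sim p_g}\big[f_2(D(x))\big]
      - \pi\,\mathbb{E}_{x \sim p_{data}}\big[f_2(D(x))\big].
\end{aligned}
```

In the last line, the second and third terms together estimate the loss on the low-quality part using only samples from $p_g$ and $p_{data}$.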
By minimizing Eq. (8), the discriminator can distinguish not only $x_{data}$ but also $x_{gr}$ from $x_{gf}$, by learning only the distribution of $x_{gr}$ based on $x_{data}$ and the distribution of $x_g$ (the generated samples) from $G$. The second and third terms of Eq. (8) are introduced from Eq. (6) and estimate the loss over $x_{gf}$. The original loss is non-negative by construction, but the replacement estimate, a difference of two expectations, may become negative. Such an abnormal loss value may lead to over-fitting, so it is important to prevent it from becoming negative. The final discriminator objective therefore modifies Eq. (8) by clamping this difference at zero, i.e. by taking the maximum of zero and the difference.
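The non-negative objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard GAN cross-entropy losses $f_1(t) = -\log t$ and $f_2(t) = -\log(1 - t)$, takes discriminator outputs as plain Python floats in (0, 1), and uses the hypothetical function name `pu_discriminator_loss`.

```python
import math


def pu_discriminator_loss(d_real, d_fake, prior, eps=1e-12):
    """Non-negative PU loss for a discriminator (illustrative sketch).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples.
    prior:  class prior pi, the assumed share of high-quality generated data.
    eps:    small constant that keeps the logarithms finite.
    """
    def mean(values):
        return sum(values) / len(values)

    # f1: loss of labelling a sample as real; f2: loss of labelling it as fake.
    f1_real = mean([-math.log(d + eps) for d in d_real])
    f2_real = mean([-math.log(1.0 - d + eps) for d in d_real])
    f2_fake = mean([-math.log(1.0 - d + eps) for d in d_fake])

    positive_term = prior * f1_real
    # Estimated loss on the low-quality part; it can go negative, so clamp at 0.
    negative_term = max(0.0, f2_fake - prior * f2_real)
    return positive_term + negative_term
```

With `prior=0` the positive term vanishes and the loss reduces to the usual fake-sample cross-entropy term, which is one way to sanity-check the sketch.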
Here we obtain the proposed objective function of the discriminator $D$. Since there is also an adversarial game between the discriminator $D$ and the generator $G$, the generator should still be trained to deceive the discriminator. As a result, the objective function of $G$ in Eq. (10) follows directly from this adversarial requirement.
With Eqs. (9) and (10), we reach our goal of treating generated data in different ways rather than treating all of it as negative samples. Although this yields a new loss function, in the next section we prove theoretically that the proposed algorithm still minimizes the distance between the generated distribution and the real distribution, which provides a theoretical basis for its effectiveness.
Above, we derived our objective function within the standard GAN framework. The proposed method can also be integrated flexibly into other general GAN frameworks. In this part, we combine the proposed method with other discriminator loss functions used in GANs.
In general, the objective function in Eq. (2) contains two loss functions, $f_1$ and $f_2$. The concrete loss functions differ across GAN variants, but they all follow the same concept: $f_1$ and $f_2$ try to separate the real data from the generated data as far as possible. As in SGAN, we propose that the discriminator should instead focus on low-quality generated samples and recognize the high-quality samples among the generated ones. Following this concept, we apply the proposed method to the general GAN framework by substituting the corresponding $f_1$ and $f_2$ into the positive-unlabeled objective, which yields Eq. (11).
With the help of Eq. (11), we can integrate the proposed method into various GAN frameworks, such as WGAN-GP, LSGAN, and SNGAN. This ability to combine with other models gives the proposed method a chance to further improve existing strong models. The loss functions corresponding to each specific model can be found in the supplementary material.
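To illustrate this flexibility, the same estimator can be written with pluggable scoring functions. The sketch below is an illustration under assumptions: the names `pu_loss`, `LSGAN_LOSSES`, and `HINGE_LOSSES` are hypothetical, the least-squares targets (1 for real, 0 for fake) and the unit hinge margin are commonly used choices rather than values taken from the supplementary material, and discriminator outputs are plain floats.

```python
def pu_loss(d_real, d_fake, prior, f1, f2):
    """Non-negative PU discriminator loss with arbitrary per-sample losses.

    f1(t): loss of classifying a discriminator output t as real.
    f2(t): loss of classifying a discriminator output t as fake.
    """
    def mean(values):
        return sum(values) / len(values)

    positive_term = prior * mean([f1(t) for t in d_real])
    # Loss on generated data minus the share explained by real-like samples.
    negative_term = mean([f2(t) for t in d_fake]) - prior * mean([f2(t) for t in d_real])
    return positive_term + max(0.0, negative_term)


# Least-squares losses with targets 1 (real) and 0 (fake); an assumed choice.
LSGAN_LOSSES = (lambda t: (t - 1.0) ** 2, lambda t: t ** 2)

# Hinge losses with unit margin; an assumed choice.
HINGE_LOSSES = (lambda t: max(0.0, 1.0 - t), lambda t: max(0.0, 1.0 + t))

if __name__ == "__main__":
    real_scores = [0.9, 1.1, 0.8]
    fake_scores = [-0.7, 0.2, -1.0]
    for name, (f1, f2) in (("lsgan", LSGAN_LOSSES), ("hinge", HINGE_LOSSES)):
        print(name, pu_loss(real_scores, fake_scores, prior=0.3, f1=f1, f2=f2))
```

Only the pair of per-sample losses changes between variants; the weighting by the class prior and the clamp at zero stay the same.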
In the proposed method, the discriminator is encouraged not only to distinguish real samples from generated samples but also to allow a certain proportion of high-quality generated samples to be recognized as real data, which reduces instability during training. Following this principle, we obtained the novel loss functions in Eq. (9) and Eq. (10) within the framework of the standard GAN. Although Eq. (9) and Eq. (10) are designed to achieve the desirable characteristics mentioned above, it is not yet clear whether the final convergence of the proposed method satisfies the requirements of the generation task. In this section, we provide a formal analysis of the convergence of the proposed objective function and prove that the proposed algorithm leads the generated distribution to the real one. See the supplementary material for the proofs.
We now consider the framework based on the standard GAN and analyze the optimal discriminator and generator. The discriminator is optimized by Eq. (9). Following the analysis proposed for GAN, the optimal discriminator balances between the true distribution and the learned distribution.
Theorem 1. For a fixed generator $G$, the optimal discriminator $D^*$ is expressed in terms of $p_{gf}$, the distribution of low-quality generated samples of $G$.
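A closed form consistent with this description can be derived pointwise. The following is a sketch under stated assumptions, not a transcription of the theorem: it takes the cross-entropy losses of the standard GAN, ignores the clamping at zero, and starts from the risk in which $p_{gr}$ has been replaced by $p_{data}$.

```latex
\begin{aligned}
R(D) &= -\int \Big[\pi\,p_{data}(x)\,\log D(x)
        + (1-\pi)\,p_{gf}(x)\,\log\big(1 - D(x)\big)\Big]\,dx, \\
D^{*}(x) &= \frac{\pi\,p_{data}(x)}{\pi\,p_{data}(x) + (1-\pi)\,p_{gf}(x)}.
\end{aligned}
```

The second line follows because, for $a, b > 0$, the function $a \log y + b \log(1 - y)$ attains its maximum on $(0, 1)$ at $y = a / (a + b)$, the same argument used for the optimal discriminator of the standard GAN.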
With the optimal discriminator fixed, we can reformulate the objective function by replacing $D$ in Eq. (10) according to Theorem 1. By doing so, we can summarize the behavior of $G$ in the following theorem.
Theorem 2. With the optimal discriminator fixed, optimizing the generator $G$ is equivalent to minimizing a divergence-based objective over the generated and real distributions.
Theorem 2 suggests that the optimal generator pays attention to reducing the divergence within the generated space. Moreover, the generator also guides the generated distribution $p_g$ as close as possible to the real distribution $p_{data}$, which ensures the quality of the generated samples. Combining Theorem 1 and Theorem 2, we obtain the following corollary.
Corollary 1. The global minimum of the proposed objective function is achieved if and only if $p_g = p_{data}$. At that point, the objectives of $D$ and $G$ attain their respective optimal values.
The above theoretical results prove that the proposed method achieves equilibrium if and only if $p_g = p_{data}$, which shows that our method enjoys the same global equilibrium point as other GAN frameworks. These results justify our approach. The experiments in the next section further illustrate its effectiveness.
In this section, we evaluate the proposed method on a range of datasets including MNIST, FMNIST, CIFAR-10, CAT, and LSUN-bedroom. We resize images in the MNIST and FMNIST datasets to $32 \times 32$ for convenience. Datasets that are used at more than one resolution are marked with the resolution, such as CAT-64. Experiments on the CAT-128, CAT-256, CelebA-128, and LSUN-128 datasets are also conducted to evaluate the high-resolution generation ability of the proposed approach. Due to limited computational resources, for LSUN-128 we randomly sample 100,000 images from the dataset as the training set instead of using all of them. The experiments are implemented in PyTorch, and we use FID (Fréchet Inception Distance) as the quantitative indicator of the quality of the generated results (a lower FID indicates higher generation quality). FID scores are calculated with 10,000 generated samples and 10,000 real images randomly sampled from the dataset in advance.
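For reference, FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images, $\lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$. The sketch below evaluates this formula only in the simplified case of diagonal covariances, where the matrix square root reduces to an element-wise one. The function name `frechet_distance_diag` is illustrative, and an actual FID computation uses full covariance matrices of Inception activations.

```python
import math


def frechet_distance_diag(mean_r, var_r, mean_g, var_g):
    """Fréchet distance between two Gaussians with diagonal covariances.

    Each argument is a list of per-dimension means or variances.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_r, mean_g))
    # With diagonal covariances, Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) becomes
    # the sum over dimensions of (sqrt(var_r) - sqrt(var_g))^2.
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g))
    return mean_term + cov_term
```

Identical statistics give a distance of zero, and the value grows as either the means or the spreads of the two feature distributions drift apart.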
As mentioned in Section 2, our approach is highly flexible and can be integrated into most kinds of GAN frameworks. We choose several GAN variants as basic frameworks, namely the standard GAN (SGAN), LSGAN, WGAN-GP, and HingeGAN, and integrate our method into these frameworks for comparison. For a fair comparison, we keep the same settings and architectures when integrating our method and make sure that the loss functions are the only part that changes. We also compare our method with Relativistic GAN, which is another flexible GAN framework, using its average version (RaGAN) in the experiments. All objective functions of the proposed frameworks can be found in the supplementary material.
All models used in the experiments are trained with the Adam optimizer, and the random seed is set to 1. The discriminator follows the CNN structure described by Miyato et al. (2018), while the generator follows the structure of the standard DCGAN for all models that generate images at resolutions below 128, except WGAN-GP, whose structure is given in the supplementary material. We use the stable setting of DCGAN as the basic training setting: the learning rate is set to 0.0002, the Adam momentum parameters $\beta_1$ and $\beta_2$ follow the same setting, and the discriminator and the generator are each updated once per iteration. Batch normalization is also used. Moreover, we define a general growth pattern for the hyper-parameter $\pi$, called the basic pattern in this experiment. In the basic pattern, $\pi$ is initialized to 0.1 and increases smoothly at each iteration until it reaches 0.7. Detailed network structures used on other datasets can be found in the supplementary material.
| 64×64 images (N=9304) | 128×128 images (N=6645) | 256×256 images (N=2011) |
In this section, we evaluate the generation ability of the proposed method on the multi-category image sets MNIST, FMNIST, and CIFAR-10 and on the single-category image sets CAT and LSUN-bedroom. In this part, the image resolution is 32 for MNIST, FMNIST, and CIFAR-10 and 64 for the remaining datasets. We choose four representative adversarial models, SGAN, HingeGAN, LSGAN, and WGAN-GP, as basic frameworks and compare them with our algorithms, which are denoted as PUSGAN, PUHingeGAN, PULSGAN, and PUWGAN-GP, respectively.
Table 1 compares the FID scores obtained by the proposed method and the basic models. Because our models can be combined with most GAN variants, they achieve the best performance on most datasets. Table 1 shows that most of our methods outperform their corresponding basic frameworks, which demonstrates the effectiveness and flexibility of our approach. In Figure 1, we show a few images generated by the PUSGAN models. We observe that the proposed method generates high-quality images on various datasets, which is consistent with the quantitative results in Table 1.
We evaluate the stability of the proposed method on three resolutions of the CAT dataset, namely 64×64, 128×128, and 256×256 pixels. As there are only 6,645 and 2,011 samples in the CAT-128 and CAT-256 datasets, respectively, some GAN variants are unable to converge on these datasets. We choose SGAN, LSGAN, and WGAN-GP as basic models and compare the proposed method with both these basic models and the corresponding Relativistic GAN (RaGAN). For each model, we calculate the FID score of the current model every 10,000 iterations. The results are reported as the minimum, maximum, mean, and standard deviation (SD) of the obtained FID values. Table 2 shows the FID results for the different networks at the different dataset resolutions.
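The summary statistics described above can be computed as in the following sketch. The FID values in the usage example are made-up placeholders, not results from the paper, and the choice of the sample standard deviation (rather than the population one) is an assumption.

```python
import statistics


def summarize_fid(fid_values):
    """Return the min, max, mean, and sample SD of a list of FID checkpoints."""
    return {
        "min": min(fid_values),
        "max": max(fid_values),
        "mean": statistics.mean(fid_values),
        "sd": statistics.stdev(fid_values),
    }


if __name__ == "__main__":
    # Placeholder FID scores, one per 10,000-iteration checkpoint.
    print(summarize_fid([40.0, 30.0, 20.0, 30.0]))
```

A low maximum and a low SD indicate that training rarely diverges between checkpoints, which is why these two columns are read as stability indicators.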
On the 64×64 dataset, all models trained with the proposed method, except PULSGAN, achieve a much lower minimum, maximum, and mean FID than their original versions, and even lower FID values than their relativistic versions, which indicates that our algorithm effectively improves both training stability and generation quality.
On the higher-resolution datasets, SGAN fails to converge at 128×128 and 256×256, while LSGAN gets stuck at an early stage at 256×256. The standard version of our model (PUSGAN) is more stable, with lower maximum, mean, and SD values at all three resolutions. On the most challenging 256×256 dataset, the proposed method achieves both satisfactory quality and stability. In our experiments, we found that although PULSGAN can converge on the CAT-256 dataset, its convergence is much slower than that of other GANs. Nevertheless, PULSGAN still achieves competitive results compared with other state-of-the-art GANs such as WGAN-GP and RaGAN.
Overall, our algorithm shows desirable stability on all three datasets and achieves similar or even better results compared to the relativistic versions. Nearly all of the GANs trained with the proposed method improve over their original versions. We therefore conclude that the proposed method stabilizes training for a variety of GANs and thus improves the quality of the generated images.
As claimed above, our approach focuses on improving low-quality samples and leads to more stable training. Thus our approach should generalize to many training settings. To demonstrate this, we evaluate the proposed method under several hard training settings and compare it with other GAN frameworks. In this part, we implement SGAN (RaSGAN, PUSGAN), LSGAN (RaLSGAN, PULSGAN), and HingeGAN (RaHingeGAN, PUHingeGAN) on the CIFAR-10 dataset. The experiment is conducted under one basic setting and three hard settings. The basic setting is the same as above, and the three hard settings are i) increasing the learning rate to 0.001 (lr=.001), ii) removing the batch normalization layers in the generator and the discriminator (No BN), and iii) replacing all activation functions in the generator and the discriminator with Tanh (Tanh).
The results are shown in Table 3. In the stable setting, PUSGAN performs better than SGAN and its other variants. PUHingeGAN improves substantially over the original HingeGAN, with a gap of 14, and also performs better than RaHingeGAN. On the other hand, the performance of PULSGAN is slightly worse than that of the other LSGAN versions.
When the learning rate is increased to 0.001, all three PU versions perform well compared with the original models, and PUSGAN and PUHingeGAN perform better than their relativistic versions. However, when optimization settings are changed, for example by removing batch normalization or replacing the ReLU activation function with Tanh (the No BN and Tanh columns, respectively), the performance of the PUGANs becomes worse. This might indicate that PUGANs rely on such optimization techniques for stable training.
Generating high-resolution images is a complicated task. To demonstrate the generation ability of our algorithm, we evaluate the proposed method on the CAT-128, CelebA-128, and LSUN-128 datasets at 128×128 pixels and on the CAT-256 dataset at 256×256 pixels. There are 202,599 images in the CelebA-128 dataset and 3,033,042 in the LSUN-128 dataset (of which only 100,000 samples are used for training). As mentioned above, CAT-128 and CAT-256 are considered more challenging high-resolution datasets because they contain only 6,645 and 2,011 samples, respectively. The images shown in Figure 2 are generated by the proposed method within the architecture of SGAN (PUSGAN), while SGAN fails to generate such high-resolution images, especially on the CAT-256 dataset. Moreover, interpolation is an informative property of generative models: smooth interpolation indicates that the model has learned to fit the distribution of natural images instead of overfitting to the training samples. We show high-resolution interpolation results obtained by the proposed method in Figure 3, where our model generates smooth interpolation images. Figures 2 and 3 demonstrate that the proposed method improves the quality of generated samples.
In Eq. (3), we introduce a class prior $\pi$ into our algorithm. $\pi$ indicates the proportion of high-quality fake data among all fake data, and we treat it as a hyper-parameter. In this section, we further evaluate the impact of the class prior $\pi$ with the PUSGAN framework on the CAT-64 dataset. The structure and training settings are the same as in the previous sections. We set four different patterns for $\pi$ during training. The first pattern is the basic pattern used in the previous sections. The second pattern sets $\pi$ to 0.3 at the beginning and increases it by 0.1 every 10k iterations until 0.7 is reached; it is used to evaluate the impact of fast growth of $\pi$. In the third and fourth patterns, $\pi$ is fixed at 0.3 and 0.5, respectively, during the whole training process.
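The four patterns can be sketched as simple schedule functions. This is an illustration under assumptions: the function names are hypothetical, and because the text only says that the basic pattern increases "smoothly", a linear ramp over a made-up horizon `ramp_iters` is assumed here.

```python
def basic_prior(iteration, ramp_iters=100_000, start=0.1, end=0.7):
    """Smooth ramp from start to end; linear shape and ramp_iters are assumed."""
    fraction = min(max(iteration / ramp_iters, 0.0), 1.0)
    return start + (end - start) * fraction


def fast_prior(iteration, start=0.3, step=0.1, every=10_000, end=0.7):
    """Start at 0.3 and add 0.1 every 10k iterations until 0.7 is reached."""
    # Count whole steps with integer division; round to hide float drift.
    return min(end, round(start + step * (iteration // every), 10))


def fixed_prior(value):
    """Return a schedule that keeps the prior constant during training."""
    return lambda iteration: value


if __name__ == "__main__":
    schedules = {
        "basic": basic_prior,
        "fast": fast_prior,
        "fixed-0.3": fixed_prior(0.3),
        "fixed-0.5": fixed_prior(0.5),
    }
    for name, schedule in schedules.items():
        print(name, [schedule(i) for i in (0, 10_000, 40_000, 100_000)])
```

Each schedule maps an iteration index to the value of the class prior that would be passed to the discriminator loss at that step.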
In Figure 4, we find that the fast-growing pattern achieves the worst average performance, and its FID scores remain relatively high. In comparison, the basic pattern reaches lower FID values than the fast pattern and is also relatively stable in the later training stage. The fixed value of 0.3 is more stable and produces a competitive result compared with the previous two patterns. The fourth pattern has the worst performance at the beginning, but it keeps improving and achieves the best final performance among all four patterns. Its shortcomings are large fluctuations and slow convergence. Interestingly, the first three patterns have similar FID values at the early stage, while the performance with a higher $\pi$ is much worse at the same stage. This may be because the proportion of high-quality samples at the beginning of training is far from 0.5, so setting $\pi$ to 0.5 contradicts the real situation and leads to unstable training. On the other hand, the fourth pattern achieves a lower FID value in the end, while the others have similar values at that point. This might indicate that a value of $\pi$ that is too large (e.g. 0.7) or too small (e.g. 0.3) reduces performance. The results show that $\pi$ can affect generation performance. We also present the result obtained by RaGAN for comparison. All four versions of the model obtained by our algorithm produce higher-quality images and show better stability. As a result, the proposed method enjoys considerable tolerance in the selection of hyper-parameters.
Normally, the amount of real data is much smaller than the amount of generated data available in an adversarial generative task. How to make the most of this large amount of generated data is an interesting problem. In the usual GAN training procedure, the batch sizes of real samples and generated samples are the same. Here, we increase the batch size of the generated data while keeping that of the real data fixed, in order to take advantage of the large number of generated samples, and investigate the impact on training. The evaluation is based on three versions of PUSGAN with different batch sizes of fake data. The first version is the basic version, in which the batch sizes of real and fake data are the same. The second and third versions use two and three times as much fake data as real data, respectively. The structure and training settings are the same as in the previous sections. We report the results in Figure 5.
From the results, we find that the second and third versions of PUSGAN reach their best performance very early, which suggests that PUSGAN converges faster when the batch size of fake data is increased. Although an increase in the number of fake samples provides a small performance boost and faster convergence, it seems to be detrimental to stability. For the sake of stability and a fair comparison, we use the same batch size for real and generated data in the experiments above.
In this paper, we present a positive-unlabeled generative adversarial network (PUGAN), in which the discriminator is trained to recognize high-quality samples among the generated data in order to obtain more stable training. The proposed method addresses two problems that traditional methods neglect, namely the gradual increase in sample quality and the imbalance in the quality of generated samples, and thereby provides more stable training and higher generation quality. We further demonstrate that our approach can be combined with most existing GAN frameworks without additional computational cost. Experiments on real-world image datasets suggest that the proposed method improves both training stability and the quality of generated samples. We also provide theoretical results that justify our approach.
Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
Positive-unlabeled learning with non-negative risk estimator. In Advances in neural information processing systems, 2017.
Proc. of International Conference on Computer Vision, 2017.
Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.