Mining GOLD Samples for Conditional GANs

10/21/2019 ∙ by Sangwoo Mo, et al. ∙ 9

Conditional generative adversarial networks (cGANs) have gained considerable attention in recent years due to their class-wise controllability and superior quality for complex generation tasks. We introduce a simple yet effective approach to improving cGANs by measuring the discrepancy between the data distribution and the model distribution on given samples. The proposed measure, coined the gap of log-densities (GOLD), provides an effective self-diagnosis for cGANs while being efficiently computed from the discriminator. We propose three applications of the GOLD: example re-weighting, rejection sampling, and active learning, which improve the training, inference, and data selection of cGANs, respectively. Our experimental results demonstrate that the proposed methods outperform corresponding baselines for all three applications on different image datasets.




Code Repositories


Mining GOLD Samples for Conditional GANs (NeurIPS 2019)


1 Introduction

The generative adversarial network (GAN) (goodfellow2014generative, ) is arguably the most successful generative model in recent years, and has shown remarkable progress across a broad range of applications, e.g., image synthesis (brock2018large, ; karras2018style, ; park2019semantic, ), data augmentation (shrivastava2016training, ; hosseini2018augmented, ) and style transfer (zhu2017unpaired, ; choi2018stargan, ; mo2018instagan, ). In particular, its advanced variant, the conditional GAN (cGAN) (mirza2014conditional, ), has gained considerable attention due to its class-wise controllability (chen2016infogan, ; reed2016learning, ; choi2018stargan, ) and superior quality for complex generation tasks (odena2017conditional, ; miyato2018cgans, ; brock2018large, ). Training GANs (including cGANs), however, is known to be hard and often highly unstable (salimans2016improved, ). Numerous techniques have thus been proposed to tackle the issue from different angles, e.g., improving architectures (miyato2018spectral, ; zhang2018self, ; chen2018on, ), losses and regularizers (gulrajani2017improved, ; odena2018generator, ; jolicoeur2018relativistic, ), and other training heuristics (salimans2016improved, ; sonderby2016amortised, ; chen2018self, ).

One promising direction for improving GANs would be to make GANs diagnose their own training and prescribe proper remedies. This is related to another branch of research on evaluating the performance of GANs, i.e., measuring the discrepancy between the data distribution and the model distribution. One may utilize the measure to identify better models (lucic2018gans, ) or directly use it as an objective function to optimize (nowozin2016f, ; arjovsky2017wasserstein, ). However, measuring the discrepancy of GANs (and cGANs) is another challenging problem, since the data distribution remains unknown and the distribution GANs learn is implicit (mohamed2016learning, ). Common approaches to the discrepancy measurement of GANs include estimating the variational bounds of statistical distances (nowozin2016f, ; arjovsky2017wasserstein, ) and using an external pre-trained network as a surrogate evaluator (salimans2016improved, ; heusel2017gans, ; sajjadi2018assessing, ). Most previous methods along this line focus on classic unconditional GANs (i.e., data-only densities), whereas discrepancy measures specialized for cGANs (i.e., data-attribute joint densities) have rarely been explored.


In this paper, we propose a novel discrepancy measure for cGANs, that estimates the gap of log-densities (GOLD) of data and model distributions on given samples, thus being called the GOLD estimator. We show that it decomposes into two terms, marginal and conditional ones, that can be efficiently computed by two branches of the discriminator of cGAN. The two terms represent generation quality and class accuracy of generated samples, respectively, and the overall estimator measures the quality of conditional generation. We also propose a simple heuristic to balance the two terms, considering suboptimality levels of the two branches.

We present three applications of the GOLD estimator: example re-weighting, rejection sampling, and active learning, which improve the training, inference, and data selection of cGANs, respectively. All proposed methods require only a few lines of modification of the original code. We conduct our experiments on various datasets including MNIST (lecun1998gradient, ), SVHN (netzer2011reading, ), and CIFAR-10 (krizhevsky2009learning, ), and show that the GOLD-based schemes improve over the corresponding baselines for all three applications. For example, the GOLD-based re-weighting and rejection sampling schemes improve the fitting capacity (ravuri2019seeing, ) of a cGAN trained on SVHN from 74.43 to 76.71 (+3.06%) and from 73.58 to 75.06 (+2.01%), respectively. The GOLD-based active learning strategy improves the fitting capacity of a cGAN trained on MNIST from 92.65 to 94.60 (+2.10%).


In Section 2, we briefly revisit cGAN models. In Section 3, we propose our main method, the gap of log-densities (GOLD) and its applications. In Section 4, we present the experimental results. Finally, in Section 5, we discuss more related work and conclude this paper.

2 Preliminary: Conditional GANs

The goal of cGANs is to learn a model distribution $p_g(x, c)$ that matches the attribute-augmented data distribution $p_{\text{data}}(x, c)$, where $x$ is a sample and $c$ is its attribute. To this end, a variety of architectures have been proposed to incorporate additional attributes (mirza2014conditional, ; salimans2016improved, ; odena2017conditional, ; zhou2017activation, ; miyato2018cgans, ). The generator $G$ maps a pair of a latent $z$ and an attribute $c$ to a sample $x$, whereas the discriminator $D$ guides the generator to learn the joint distribution $p_{\text{data}}(x, c)$. Typically, there are two ways to use the attribute information: (a) providing it as an additional input to the discriminator (i.e., $D$ takes the pair $(x, c)$) (mirza2014conditional, ; miyato2018cgans, ), or (b) using it to train an auxiliary classifier for the attribute (i.e., $D$ takes only $x$ and additionally predicts $c$) (salimans2016improved, ; odena2017conditional, ; zhou2017activation, ). The main difference between the two approaches can be viewed as whether to directly learn the joint distribution $p(x, c)$ or to separately learn the marginal $p(x)$ and the conditional $p(c \mid x)$. (Footnote 1: The projection discriminator (miyato2018cgans, ) is of type (a), but it decomposes the marginal and conditional terms in its architecture. This results in another estimator form of the gap of log-densities.)

In this paper, we address training cGANs in a semi-supervised setting where a large amount of unlabeled data is available with only a small amount of labeled data. It is more attractive and practical than a fully-supervised setting in the sense that labeling attributes of all samples is often expensive while unlabeled data can be easily obtained. It is thus natural to utilize unlabeled data for improving the model, e.g., via semi-supervised learning and active learning (see Section 3.2). While both approaches above, (a) and (b), can be used in a semi-supervised setting, cGANs of type (b) provide a more natural framework for using both labeled and unlabeled data (Footnote 2: Type (a) requires some modifications in the architecture and/or the loss function (sricharan2017semi, ; lucic2019high, ).): one can use the unlabeled data to learn the marginal $p(x)$, and the labeled data to learn both $p(x)$ and the conditional $p(c \mid x)$. Therefore, we focus on evaluating the second type of architectures, e.g., the auxiliary classifier GAN (ACGAN) (odena2017conditional, ). We remark that our main idea in this paper is applicable to both types of cGANs in general.

The ACGAN model consists of the generator $G$ and the discriminator, which in turn consists of the real/generated part $D_G$ and the auxiliary classifier part $D_C$. ACGAN is then trained by optimizing both the GAN loss and the auxiliary classifier loss:


where the balancing factor is a hyper-parameter. Here, the generator and the discriminator each minimize their own combination of the two losses. (Footnote 3: In experiments, we use the non-saturating GAN loss (goodfellow2014generative, ) to improve the stability in training.) The original work (odena2017conditional, ) simply uses a fixed balancing factor, but we empirically observe that using a smaller value often improves the performance: a large value strengthens the wrong signal from the generator when the generator produces bad samples with incorrect attributes. Such an issue has also been reported in the related work on AMGAN (zhou2017activation, ), where the authors thus set it to zero. On the other hand, under a small amount of labeled data, a strictly positive value can be effective as it provides an effect of data augmentation for training the classifier $D_C$. In our experiments, we indeed observe that using a properly chosen value improves the performance of ACGAN, depending on the dataset.

3 Gap of Log-Densities (GOLD)

In this section, we introduce a general formula of the gap of log-densities (GOLD) that measures the discrepancy between the data distribution and the model distribution on given samples. We then propose three applications: example re-weighting, rejection sampling, and active learning.

3.1 GOLD estimator: Measuring the discrepancy of cGANs

While cGANs can theoretically converge to the true joint distribution (goodfellow2014generative, ; nowozin2016f, ), they are often far from optimal in practice, particularly when trained with limited labels. The degree of suboptimality can be measured by the discrepancy between the true distribution $p_{\text{data}}(x, c)$ and the model distribution $p_g(x, c)$. Here, we consider the gap of log-densities (GOLD), $\log p_{\text{data}}(x, c) - \log p_g(x, c)$ (Footnote 4: We measure the gap of log-densities since it leads to a computationally efficient estimator.), which can be rewritten as the sum of two log-ratio terms, marginal and conditional ones:

$$\log \frac{p_{\text{data}}(x, c)}{p_g(x, c)} = \underbrace{\log \frac{p_{\text{data}}(x)}{p_g(x)}}_{\text{marginal}} + \underbrace{\log \frac{p_{\text{data}}(c \mid x)}{p_g(c \mid x)}}_{\text{conditional}}. \tag{2}$$

Recall that cGANs are designed to achieve two goals jointly: generating a sample $x$ drawn from $p_{\text{data}}(x)$, and making the distribution of its class follow $p_{\text{data}}(c \mid x)$. The marginal and conditional terms measure the discrepancy on those two goals, respectively.

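The exact computation of (2) is infeasible because we have no direct access to the true distribution or to the implicit model distribution. Hence, we propose the GOLD estimator as follows. First, the marginal term is approximated by $\log \frac{D_G(x)}{1 - D_G(x)}$, where $D_G$ is the real/generated part of the discriminator, since the optimal discriminator satisfies $D_G^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$ (goodfellow2014generative, ). Second, we estimate the conditional term using the classifier $D_C$ as follows. When a generated sample $x$ is given with its ground-truth label $c_x$, $p_g(c_x \mid x)$ is assumed to be 1 and $p_{\text{data}}(c_x \mid x)$ is approximated by $D_C(c_x \mid x)$. When a real sample $x$ is given with the ground-truth label $c_x$, $p_{\text{data}}(c_x \mid x)$ is assumed to be 1 and $p_g(c_x \mid x)$ is approximated by $D_C(c_x \mid x)$. To sum up, the GOLD estimator can be defined as

$$d(x, c_x) = \log \frac{D_G(x)}{1 - D_G(x)} + \begin{cases} \log D_C(c_x \mid x) & \text{if } x \text{ is a generated sample}, \\ -\log D_C(c_x \mid x) & \text{if } x \text{ is a real sample}. \end{cases} \tag{3}$$

Note that the conditional terms above for generated and real samples have opposite signs. This matches the signs of the marginal and conditional terms for both generated and real samples, as their marginal terms tend to be negative and positive, respectively. (Footnote 5: The discriminator is trained to predict 0 and 1 for generated and real samples, respectively.) Hence, (3) is a reasonable measure of the joint quality of the two effects of conditional generation.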
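As a short worked step, the approximation of the marginal term follows from the standard optimal-discriminator identity. The derivation below is a sketch in assumed notation ($D^*$ for the optimal real/generated output, $p_{\text{data}}$ and $p_g$ for the data and model densities); it only holds exactly for the ideal discriminator.

```latex
\begin{align*}
D^*(x) &= \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}
\;\Longrightarrow\;
1 - D^*(x) = \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}, \\
\frac{D^*(x)}{1 - D^*(x)} &= \frac{p_{\text{data}}(x)}{p_g(x)}
\;\Longrightarrow\;
\log \frac{p_{\text{data}}(x)}{p_g(x)} = \log \frac{D^*(x)}{1 - D^*(x)}.
\end{align*}
```

In other words, the marginal log-ratio is the logit of the discriminator output, which is why it can be read off the discriminator without any density model.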

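For the derivation of (3), we assume the ideal (or optimal) discriminator, which does not hold in practice. We often observe that the scale of the marginal term is significantly larger than that of the conditional term, because the density $p(x)$ is harder to learn than the class-predictive distribution $p(c \mid x)$ (see Figure 1(a)). This biases the GOLD estimator toward the generation part (marginal term), so that it ignores the class-condition part (conditional term). To address this imbalance, we develop a balanced variant of the GOLD estimator:

$$d_{\text{bal}}(x, c_x) = \frac{1}{\sigma_G} \log \frac{D_G(x)}{1 - D_G(x)} + \frac{1}{\sigma_C} \begin{cases} \log D_C(c_x \mid x) & \text{if } x \text{ is a generated sample}, \\ -\log D_C(c_x \mid x) & \text{if } x \text{ is a real sample}, \end{cases} \tag{4}$$

where $\sigma_G$ and $\sigma_C$ are the standard deviations of the marginal and conditional terms (among samples), respectively.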
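The estimator and its balanced variant can be sketched in a few lines of plain Python. This is a minimal sketch under assumptions, not the authors' code: `d_g` stands for the real/generated output of the discriminator in (0, 1), `d_c` for the classifier probability of the ground-truth class, and all function names are hypothetical. The balanced variant is assumed to divide each term by its standard deviation over the batch.

```python
import math
from statistics import pstdev


def marginal_term(d_g):
    """Logit of the real/generated output d_g, which must lie in (0, 1)."""
    return math.log(d_g) - math.log(1.0 - d_g)


def conditional_term(d_c, generated):
    """log d_c for generated samples, -log d_c for real samples."""
    return math.log(d_c) if generated else -math.log(d_c)


def gold(d_g, d_c, generated=True):
    """Sum of the marginal and conditional terms for one sample."""
    return marginal_term(d_g) + conditional_term(d_c, generated)


def balanced_gold(batch, generated=True):
    """batch: list of (d_g, d_c) pairs; each term is scaled by its batch std."""
    m = [marginal_term(g) for g, _ in batch]
    c = [conditional_term(p, generated) for _, p in batch]
    # Fall back to 1.0 when a term is constant over the batch (std of zero).
    s_m = pstdev(m) or 1.0
    s_c = pstdev(c) or 1.0
    return [a / s_m + b / s_c for a, b in zip(m, c)]
```

For instance, `gold(0.5, 1.0)` is zero: the discriminator is undecided and the classifier is fully confident in the conditioned class, so neither term reports a discrepancy.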

3.2 Applications of the GOLD estimator

Example re-weighting.

A high value of the GOLD estimator suggests that the sample is under-estimated with respect to the joint distribution $p_{\text{data}}(x, c)$, and vice versa. Motivated by this, we propose an example re-weighting scheme for cGAN training that guides the generator to focus on under-estimated samples during training. Formally, we consider the following re-weighted loss:


where a hyper-parameter controls the level of re-weighting, with a separate convention for negative values of the estimator. Our intuition is that minimizing the re-weighted loss encourages the discriminator to learn stronger feedback from the under-estimated (generated) samples, thus indirectly guiding the generator to emphasize their region. When the GOLD estimator is negative, the discriminator is trained to suppress the over-estimated samples, which indirectly regularizes the generator to focus less on the corresponding region.

Since the GOLD estimator only becomes meaningful with a sufficiently trained discriminator, we apply the re-weighting scheme with the loss of (5) after sufficiently training the model with the original loss of (1). We find that the GOLD estimator of generated samples stably converges to zero with the re-weighting scheme, while it does not converge with the original loss alone (see Figure 1(b)). Note that one may also use the balanced version of the GOLD estimator in (5). In our experiments, however, we simply use the unbalanced estimator of (3), because the balanced version requires computing the standard deviations of the marginal and conditional terms throughout training, which significantly increases the computational burden. Improving the scheduling and/or re-weighting for training would be an interesting future direction.
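The two ingredients described above, a per-sample weight derived from the GOLD value and a schedule that switches from the baseline loss to the re-weighted loss, can be sketched as follows. This is a minimal sketch under stated assumptions: the names `gold_weight`, `use_reweighting`, and `beta` are hypothetical, the half-and-half split follows the setup reported in Section 4.1, and the sign-preserving power used for negative values is an assumed convention rather than a confirmed detail of loss (5).

```python
import math


def gold_weight(d, beta=1.0):
    """Sign-preserving power of a GOLD value: sign(d) * |d|**beta (assumed)."""
    return math.copysign(abs(d) ** beta, d)


def use_reweighting(epoch, total_epochs):
    """Baseline loss for the first half of training, re-weighting afterwards."""
    return epoch >= total_epochs // 2


def sample_weights(gold_values, epoch, total_epochs, beta=1.0):
    """Per-sample weights for generated samples at a given epoch."""
    if not use_reweighting(epoch, total_epochs):
        # Before the switch every sample contributes equally.
        return [1.0 for _ in gold_values]
    return [gold_weight(d, beta) for d in gold_values]
```

Under this convention an under-estimated sample (positive GOLD value) receives a positive weight that grows with the discrepancy, while an over-estimated sample receives a negative weight.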

Rejection sampling.

Rejection sampling (robert2013monte, ) is a useful technique to improve the inference of generative models, i.e., the quality of generated samples. Instead of directly sampling from a target distribution $p(x)$, we first obtain a sample from a (reasonably good) proposal distribution $q(x)$, and then accept it with probability $\frac{p(x)}{M q(x)}$ for some constant $M > 0$, rejecting it otherwise. Given a proper estimator for the discrepancy, this can improve the quality of generated samples by rejecting unrealistic ones. For a given generated sample $x$ with the corresponding class $c_x$, the GOLD rejection sampling is defined by the following acceptance rate:

$$r(x, c_x) = \frac{1}{M} \cdot \frac{D_G(x)}{1 - D_G(x)} \cdot D_C(c_x \mid x), \tag{6}$$

where $D_G$ and $D_C$ are the real/generated and classifier parts of the discriminator, and $M$ is set to be the maximum of $\frac{D_G(x)}{1 - D_G(x)} \cdot D_C(c_x \mid x)$ among samples. This helps in recovering the true data distribution $p_{\text{data}}(x, c)$ even though the model distribution $p_g(x, c)$ is suboptimal. (Footnote 6: One may use an advanced sampling strategy, e.g., Metropolis-Hastings GAN (MH-GAN) (turner2018metropolis, ). As MH-GAN requires the density ratio information to run, one can naturally apply the GOLD estimator.)

While recent work (azadi2018discriminator, ) studies rejection sampling for unconditional GANs, we focus on improving cGANs, and our formula (6) for the acceptance rate is different. We also remark that, in order to avoid extremely low acceptance rates, we follow the strategy in (azadi2018discriminator, ): we first pull back the ratio with $\sigma^{-1}$ ($\sigma$ is the sigmoid function), subtract a constant $\gamma$, and push forward with $\sigma$. As in (azadi2018discriminator, ), we set the constant $\gamma$ to be the $p$-th percentile of the batch, where $p$ is tuned per dataset. Note that $p$ controls the precision-recall trade-off (sajjadi2018assessing, ) of samples, as a low acceptance rate (high $p$) improves the quality and a high acceptance rate (low $p$) improves the diversity.
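The acceptance rule described above can be sketched in plain Python. This is a minimal sketch under assumptions, not the authors' implementation: the density ratio is taken as the odds of the real/generated output times the classifier probability, the shift is applied in logit space and mapped back with the sigmoid, and the helper names, the simple percentile rule, and the clamping constant `eps` are all hypothetical choices made for illustration.

```python
import math


def logit(r):
    """Inverse sigmoid for r in (0, 1)."""
    return math.log(r) - math.log(1.0 - r)


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def percentile(values, p):
    """Simple rank-based percentile for p in [0, 1] (an assumed rule)."""
    s = sorted(values)
    return s[min(len(s) - 1, int(p * len(s)))]


def acceptance_probs(d_g, d_c, p=0.5, eps=1e-6):
    """Acceptance probability per generated sample.

    d_g: real/generated outputs in (0, 1); d_c: class probabilities in (0, 1].
    """
    ratios = [g / (1.0 - g) * c for g, c in zip(d_g, d_c)]
    m = max(ratios)
    # Clamp just below 1 so the logit of the best sample stays finite.
    raw = [min(r / m, 1.0 - eps) for r in ratios]
    logits = [logit(r) for r in raw]
    gamma = percentile(logits, p)
    return [sigmoid(t - gamma) for t in logits]
```

A sample would then be kept when a uniform random draw falls below its acceptance probability. With this rule a larger `p` shifts every probability down, which matches the trade-off described above: fewer but higher-scoring samples survive.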

Active learning.

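The goal of active learning (settles2009active, ) is to reduce the cost of labeling by predicting the best real samples (i.e., queries) to label in order to improve the current model. In training cGANs with active learning, it is natural to find and label samples with high GOLD values, since they can be viewed as under-estimated with respect to the current model. For unlabeled samples, however, we do not have access to the ground-truth class $c_x$, and thus to the conditional term. To tackle this issue, we take an expectation of the conditional term over the class probability given by the classifier $D_C$ and estimate the conditional term as

$$\log \frac{p_{\text{data}}(c \mid x)}{p_g(c \mid x)} \approx \mathbb{E}_{c \sim D_C(c \mid x)} \left[ -\log D_C(c \mid x) \right] = H\left[ D_C(c \mid x) \right], \tag{7}$$

where $H$ is the entropy function. Using the approximation above, the GOLD estimator for unlabeled real samples and its balanced version can be defined as:

$$d_{\text{unlabel}}(x) = \log \frac{D_G(x)}{1 - D_G(x)} + H\left[ D_C(c \mid x) \right], \tag{8}$$

$$d_{\text{unlabel-bal}}(x) = \frac{1}{\sigma_G} \log \frac{D_G(x)}{1 - D_G(x)} + \frac{1}{\sigma_C} H\left[ D_C(c \mid x) \right], \tag{9}$$

where $\sigma_G$ and $\sigma_C$ are the standard deviations of the marginal and conditional (i.e., entropy) terms.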
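Query selection with the entropy-based estimator can be sketched as follows. This is a minimal sketch under assumptions: `d_g` is the real/generated output for an unlabeled real sample, `class_probs` is the classifier's predicted class distribution, and the function names and the top-k selection rule are hypothetical illustrations of "label the samples with the highest values".

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def unlabeled_gold(d_g, class_probs):
    """Marginal logit plus the entropy of the predicted class distribution."""
    return math.log(d_g) - math.log(1.0 - d_g) + entropy(class_probs)


def select_queries(pool, k):
    """pool: list of (d_g, class_probs); returns indices of the k top scores."""
    scores = [unlabeled_gold(g, p) for g, p in pool]
    order = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

A sample is therefore preferred when the discriminator finds it very real (poorly covered by the generator) and the classifier is unsure about its class.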

As in the conventional active learning for classifiers, one can view the first term in (8) as a density (or representativeness) score (gissin2019discriminative, ; sinha2019variational, ), which measures how well the sample represents the data distribution. The second term is an uncertainty (or informativeness) score (gal2017deep, ; beluch2018power, ), which measures how informative the label is for the current model. Hence, our method can be interpreted as a combination of the density and uncertainty scores (huang2010active, ) in a principled, yet scalable way. We finally remark that we also utilize all unlabeled samples in the pool to train our model, i.e., semi-supervised learning, which can be naturally done in the cGAN framework of our interest.


Figure 1: (a) Histogram of the marginal/conditional terms of the GOLD estimator. (b) Training curve of the mean of the GOLD estimator (of generated samples) and (c) training curve of the fitting capacity, for the baseline model and the model trained with the re-weighting scheme (GOLD) on the MNIST dataset.

4 Experiments

In this section, we demonstrate the effectiveness of the GOLD estimator for three applications: example re-weighting, rejection sampling, and active learning. We conduct experiments on one synthetic point dataset and six image datasets: MNIST (lecun1998gradient, ), FMNIST (xiao2017fashion, ), SVHN (netzer2011reading, ), CIFAR-10 (krizhevsky2009learning, ), STL-10 (coates2011analysis, ), and LSUN (yu2015lsun, ). The synthetic dataset consists of random samples drawn from a Gaussian mixture with 6 clusters, where we assign the clusters binary labels to obtain 2 groups of 3 clusters (see Figure 3). As the cGAN models to evaluate, we use the InfoGAN (chen2016infogan, ) model for 1-channel images (MNIST and FMNIST), the ACGAN (odena2017conditional, ) model for 3-channel images (SVHN, CIFAR-10, STL-10, and LSUN), and the GAN model of (gulrajani2017improved, ) with an auxiliary classifier for the synthetic dataset. For all experiments, spectral normalization (SN) (miyato2018spectral, ) is used for more stable training. We use a fixed default value of the balancing factor in most of our experiments but lower it when training cGANs on small datasets. (Footnote 7: This is because the generator is more likely to produce bad samples with incorrect attributes for small datasets, which strengthens the wrong signal.) For all experiments on example re-weighting and rejection sampling, we use the default value. For experiments on active learning, we use one value for the synthetic and MNIST datasets and another for the FMNIST and SVHN datasets. The reported results are averaged over 5 trials for image datasets and 25 trials for the synthetic dataset.

As the evaluation metric for data generation, we use the fitting capacity recently proposed in (ravuri2019seeing, ; lesort2018training, ). It measures the accuracy on real samples of a classifier trained with samples generated by the cGAN, where we use LeNet (lecun1998gradient, ) as the classifier. (Footnote 8: We use training data to train ACGAN and test data to evaluate the fitting capacity, except for LSUN, where we use validation data for both training and evaluation due to the class imbalance of the training data.) Intuitively, the fitting capacity should match the 'true' classifier accuracy (trained with real samples) if the model distribution perfectly matches the real distribution. It is a natural evaluation metric for cGANs, as it directly measures the performance of conditional generation. Here, one may also suggest other popular metrics, e.g., Inception score (IS) (salimans2016improved, ) or Fréchet Inception distance (FID) (heusel2017gans, ), but the work of (ravuri2019seeing, ) has recently shown that even when the IS/FID of generated samples match those of real ones, the fitting capacity is often much lower than the real classifier accuracy (i.e., low correlation between IS/FID and fitting capacity). Furthermore, IS/FID are not suitable for non-ImageNet-like images, e.g., MNIST or SVHN. Nevertheless, we provide some FID results in the Supplementary Material for interested readers.
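The fitting capacity protocol (train a classifier on generated samples only, then measure its accuracy on real samples) can be illustrated with a toy sketch. The code below is an assumption-laden stand-in: it replaces LeNet with a nearest-centroid classifier on small feature vectors so that it runs without any deep learning library, and all function names are hypothetical.

```python
def train_centroids(samples, labels):
    """Fit a nearest-centroid classifier (a stand-in for LeNet)."""
    sums, counts = {}, {}
    for x, c in zip(samples, labels):
        acc = sums.setdefault(c, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    def sq_dist(mu):
        return sum((a - b) ** 2 for a, b in zip(x, mu))
    return min(centroids, key=lambda c: sq_dist(centroids[c]))


def fitting_capacity(gen_x, gen_c, real_x, real_c):
    """Accuracy (%) on real data of a classifier trained on generated data."""
    centroids = train_centroids(gen_x, gen_c)
    hits = sum(predict(centroids, x) == c for x, c in zip(real_x, real_c))
    return 100.0 * hits / len(real_x)
```

The key design point is that the real samples never enter training: any mismatch between generated and real class-conditional distributions shows up directly as lost accuracy.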

Table 1: Fitting capacity (%) (ravuri2019seeing, ) for example re-weighting under various datasets.

| Method | MNIST | FMNIST | SVHN | CIFAR-10 | STL-10 | LSUN |
|---|---|---|---|---|---|---|
| Baseline | 96.43±0.17 | 77.97±1.24 | 74.43±0.71 | 36.76±0.99 | 36.73±0.64 | 26.35±0.82 |
| GOLD | 96.62±0.15 | 78.34±1.11 | 76.71±0.94 | 37.06±1.38 | 37.65±0.71 | 28.21±0.86 |

Table 2: Fitting capacity (%) for example re-weighting under various levels of supervision.

| Dataset | Method | 1% | 5% | 10% | 20% | 50% | 100% |
|---|---|---|---|---|---|---|---|
| SVHN | Baseline | 72.41±1.30 | 72.99±1.65 | 73.15±0.96 | 73.18±1.28 | 74.04±1.26 | 74.33±0.71 |
| SVHN | GOLD | 75.01±1.93 | 75.58±0.86 | 75.78±0.74 | 76.04±1.93 | 76.25±1.40 | 76.71±0.94 |
| CIFAR-10 | Baseline | 17.99±0.78 | 18.42±0.71 | 21.84±1.14 | 23.13±1.95 | 35.41±1.03 | 36.76±0.99 |
| CIFAR-10 | GOLD | 18.28±0.65 | 19.15±0.97 | 21.91±2.56 | 23.89±2.02 | 34.95±1.11 | 37.06±1.38 |

4.1 Example re-weighting

We first evaluate the effect of the re-weighting scheme using the loss (5). We train the model for 20 and 200 epochs for 1-channel and 3-channel images, respectively. We use the baseline loss (1) for the first half of the epochs and the re-weighting scheme for the second half. We use a smaller re-weighting level for the generator loss than for the discriminator loss, because a large value for the generator loss destabilizes training by incurring high variance of gradients. (Footnote 9: We do not make much effort in choosing the re-weighting level, as this simple choice is enough to show the improvement.) We train the LeNet classifier (for fitting capacity) for 40 epochs, using 10,000 newly generated samples for each epoch. Figure 1(b) and Figure 1(c) report the training curves of the GOLD estimator (of generated samples) and the fitting capacity, respectively, on the MNIST dataset. Figure 1(b) shows that the GOLD estimator under the re-weighting scheme stably converges to zero, while that of the baseline model monotonically decreases. As a result, in Figure 1(c), one can observe that the re-weighting scheme improves the fitting capacity, while that of the baseline model becomes worse as training proceeds. Table 1 and Table 2 report the fitting capacity for fully-supervised settings (i.e., using the full labels of the datasets to train cGANs) and semi-supervised settings (i.e., using only a fraction of the labels to train cGANs), respectively. In most reported cases, our method outperforms the baseline model. For example, ours improves the fitting capacity from 74.43 to 76.71 (+3.06%) under the full labels of SVHN.
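The percentages quoted next to the fitting capacities are relative improvements over the baseline, not differences in percentage points. A small sketch of that arithmetic (the function name `relative_gain` is hypothetical) is:

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# SVHN, example re-weighting with full labels: 74.43 -> 76.71
print(round(relative_gain(74.43, 76.71), 2))  # 3.06
```

The absolute gain in this case is 2.28 points, which corresponds to the reported +3.06% when divided by the baseline value.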

4.2 Rejection sampling

Next, we demonstrate the effect of rejection sampling. We use the model trained with the original loss (1) on fully labeled datasets. (Footnote 10: One can also use the model trained with the re-weighting scheme of loss (5) for further improvement.) To emphasize the sampling effect, we use a fixed set of 50,000 samples instead of re-sampling for each epoch. We use separate values of $p$ for 1-channel and 3-channel images. Table 3 presents the fitting capacity of rejection sampling under various datasets. Our method shows a consistent improvement over the baseline (random sampling without rejection), e.g., ours improves from 73.58 to 75.06 (+2.01%) for SVHN. We also study the effect of $p$, the control parameter of the acceptance ratio for rejection sampling (a high $p$ rejects more samples). As a high $p$ harms the diversity and a low $p$ harms the quality, a properly chosen $p$ (e.g., 0.5 for CIFAR-10) shows the best performance. Table 4 and Figure 5 (in the Supplementary Material) present the fitting capacity and the precision and recall on distributions (PRD) (sajjadi2018assessing, ) plot, respectively, under CIFAR-10 and various $p$ values. Indeed, both low ($p = 0.1$) and high ($p = 0.9$) values harm the performance, and $p = 0.5$ is the best choice among them.

We also qualitatively analyze the effect of rejection sampling. The first row of Figure 2 visualizes the generated samples with high marginal, conditional, and combined (GOLD) values. We observe that the random samples (without rejection) often contain low-quality samples with uncertain and/or wrong classes. On the other hand, samples with high marginal values improve the quality (or vividness), and samples with high conditional values improve the class accuracy (but lose diversity). The samples with high GOLD values get the best of both worlds, and produce diverse images with only a few wrong classes.

Table 3: Fitting capacity (%) for rejection sampling under various datasets.

| Method | MNIST | FMNIST | SVHN | CIFAR-10 | STL-10 | LSUN |
|---|---|---|---|---|---|---|
| Baseline | 96.05±0.41 | 77.94±0.83 | 73.58±0.72 | 35.15±0.51 | 34.33±0.30 | 26.43±0.14 |
| GOLD | 96.17±0.63 | 78.25±0.30 | 75.06±0.71 | 35.98±1.15 | 35.21±1.02 | 26.79±0.42 |

Table 4: Fitting capacity (%) for rejection sampling under CIFAR-10 and various $p$ values.

| Baseline | p = 0.1 | p = 0.3 | p = 0.5 | p = 0.7 | p = 0.9 |
|---|---|---|---|---|---|
| 35.15±0.51 | 35.80±0.42 | 35.87±0.61 | 35.98±1.15 | 35.85±0.53 | 35.33±0.53 |


Figure 2: Generated and real samples with high marginal, conditional, and combined (GOLD) values. Generated samples are aligned by the class (each row), and the red box indicates the uncertain and/or wrong classes. See Section 4.2 and Section 4.3 for the detailed explanation.

4.3 Active learning

Finally, we demonstrate the active learning results. We conduct our experiments on a synthetic dataset and 3 image datasets (MNIST, FMNIST, SVHN). We train in the semi-supervised setting, as we have a large pool of unlabeled samples. We run 4 query acquisition steps (i.e., 5 training steps), where the triplets of initial (labeled) training set size, query size, and final (labeled) training set size are set to (4,1,8), (10,2,18), (20,5,40), and (20,20,100) for synthetic, MNIST, FMNIST, and SVHN, respectively. We train the model for 100 epochs, and choose the model with the best fitting capacity on the validation set (of size 100) to compute the GOLD estimator for the query acquisition. Interestingly, we found that keeping the parameters of the generator (while re-initializing the discriminator) for the next model in the active learning scenario improves the performance. This is because the discriminator is easily overfitted and has difficulty escaping from local optima, whereas the generator can spread out the generated samples relatively easily. We use this re-initialization scheme (i.e., keep the generator and re-initialize the discriminator) for all active learning experiments. For query acquisition, we use the vanilla version of the GOLD estimator (8) for image datasets, but use the balanced version (9) for the synthetic dataset, as the synthetic dataset suffers from the over-confidence problem.
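The (initial size, query size, final size) triplets above are consistent with 4 acquisition steps each. A small sketch of that bookkeeping (the function name `labeled_set_sizes` is hypothetical) is:

```python
def labeled_set_sizes(initial, query, steps=4):
    """Labeled-set size at each training step, with `steps` query rounds."""
    return [initial + i * query for i in range(steps + 1)]


# One entry per training step (5 in total) for the MNIST configuration.
print(labeled_set_sizes(10, 2))  # [10, 12, 14, 16, 18]
```

Each of the four configurations ends at the stated final size, e.g. 20 + 4 × 20 = 100 labeled samples for SVHN.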

Figure 3 visualizes the selected queries based on the GOLD estimator under the synthetic dataset. The GOLD estimator has high values on the uncovered or the uncertain (i.e., samples are not obtained) regions, in which high marginal and conditional values occur, respectively. See the leftmost region of column 2 and the uppermost region of column 3 for each case. Indeed, both components of the GOLD estimator contribute to the query selection. Consequently, the GOLD estimator effectively selects queries and learns the true joint distribution. In contrast, random selection often picks redundant or less important regions, which makes the convergence slower. Figure 4 presents the quantitative results. Our method outperforms random query selection, e.g., the final fitting capacity of our method on MNIST is 94.60, which improves on the baseline's 92.65 by 2.10%.

In addition, we qualitatively analyze the effect of the two (marginal and conditional) terms of the GOLD estimator. The second row of Figure 2 presents the real samples with high marginal, conditional, and combined (GOLD) values. We observe that samples picked under high marginal values have multiple digits (which are hard to generate) and those picked under high conditional values have uncertain classes. On the other hand, the GOLD estimator picks the uncertain samples with multiple digits, which takes advantage of both.


Figure 3: Visualization of the query selection based on the GOLD estimator. The first and second rows show the selected queries and the generated samples, respectively. The third row shows the GOLD estimator values; the sample with the highest value is selected for the next iteration.

Figure 4: Fitting capacity for active learning under various datasets, in panels (a) through (d), from (a) Synthetic to (d) SVHN.

5 Discussion and Conclusion

We have proposed a novel yet simple GOLD estimator that measures the discrepancy between the data distribution and the model distribution on given samples, and that can be efficiently computed under the conditional GAN (cGAN) framework. We also propose three applications of the GOLD estimator: example re-weighting, rejection sampling, and active learning, which improve the training, inference, and data selection of cGANs, respectively. We are the first to study these problems for cGANs, while those for classification models or the (original unconditional) GAN have been investigated in the literature. First, re-weighting (ren2018learning, ) or re-sampling (chang2017active, ; katharopoulos2018not, ) of examples has been studied to improve the performance, convergence speed, and/or robustness of convolutional neural networks (CNNs). Following this line of research, we show that the re-weighting scheme can also improve the performance of cGANs. To this end, we use higher weights for samples with larger discrepancy, which resembles prior work on hard example mining (shrivastava2016training, ; lin2017focal, ) for classifiers/detectors. Designing a better re-weighting scheme or a better scheduling technique (bengio2009curriculum, ; kumar2010self, ) would be an interesting future research direction. Second, active learning (settles2009active, ) has also been well studied for classification models (gal2017deep, ; sener2017active, ). Finally, there is recent work that proposes rejection sampling (robert2013monte, ) for the original (unconditional) GANs (azadi2018discriminator, ). In contrast to the prior work, we focus on conditional generation, i.e., we consider both the generation quality and the class accuracy. We finally remark that investigating other applications of the GOLD estimator, e.g., outlier detection (lee2017training, ) or training under noisy labels (ren2018learning, ), would also be an interesting future direction.


This research was supported by the Information Technology Research Center (ITRC) support program (IITP-2019-2016-0-00288), Next-Generation Information Computing Development Program (NRF-2017M3C4A7069369), and Institute of Information & communications Technology Planning & Evaluation (IITP) grant (No.2017-0-01779, A machine learning and statistical inference framework for explainable artificial intelligence), funded by the Ministry of Science and ICT, Korea (MSIT). We also appreciate GPU support from Brain Cloud team at Kakao Brain.


Appendix A PRD plot for Rejection Sampling

We provide the precision and recall on distributions (PRD) [45] plot in Figure 5 as a complement to Table 4. As a high $p$ harms the diversity and a low $p$ harms the quality, a properly chosen $p$ (e.g., 0.5 for CIFAR-10) shows the best PRD curve among the various $p$ values.

Figure 5: PRD plot [45] for rejection sampling under CIFAR-10 and various $p$ values.

Appendix B FID scores for Rejection Sampling

We provide the Fréchet Inception distance (FID) [17] scores in Table 5 as a complement to Table 3. Similar to the fitting capacity [41], our method shows a consistent improvement over the baseline.

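Table 5: FID scores [17] for rejection sampling under various datasets.

| Method | MNIST | FMNIST | SVHN | CIFAR-10 | STL-10 | LSUN |
|---|---|---|---|---|---|---|
| Baseline | 10.78±0.04 | 12.38±0.03 | 8.28±0.07 | 9.46±0.04 | 14.47±0.04 | 14.38±0.03 |
| GOLD | 10.70±0.05 | 12.32±0.06 | 8.12±0.06 | 9.44±0.02 | 14.44±0.04 | 14.35±0.06 |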

Appendix C Robustness to the Mode Collapse

We provide results on a highly unstable training scenario in which the model severely suffers from mode collapse. To this end, we use only 10 labeled samples (and no additional unlabeled samples) from the MNIST dataset. In this case, we observe that mode collapse occurs at around the 1,000-th epoch, and we apply our method for the next 1,000 epochs. Figure 6 shows that our method can mitigate the instability issue, significantly improving the FID score during training.


Figure 6: FID scores for the highly unstable scenario.

Appendix D Larger-scale Experiments

We provide larger-scale experimental results. To this end, we use the ACGAN [39] model of size 128×128 designed for the ImageNet [12] dataset. Following the experimental setting of [39], we train the ACGAN model on 10 subclasses of ImageNet. (Footnote 11: The authors of [39] split ImageNet into 100 groups of 10 classes, and train 100 ACGAN models.) In particular, we use the Imagenette and Imagewoof datasets, which sample the easiest and hardest 10 classes, respectively. We follow the same experimental setting as for the other vision datasets, but train for 100 epochs with a separately chosen hyper-parameter value. Table 6 and Table 7 report the fitting capacity for example re-weighting and rejection sampling, respectively. Indeed, our methods outperform the corresponding baselines.

Table 6: Fitting capacity (%) for example re-weighting under larger-scale datasets.

| Method | Imagenette | Imagewoof |
|---|---|---|
| Baseline | 20.84±3.39 | 18.60±1.89 |
| GOLD | 23.64±3.60 | 19.76±0.90 |

Table 7: Fitting capacity (%) for rejection sampling under larger-scale datasets.

| Method | Imagenette | Imagewoof |
|---|---|---|
| Baseline | 16.89±3.51 | 16.44±1.97 |
| GOLD | 17.21±2.08 | 16.71±2.46 |