# Robustness of Conditional GANs to Noisy Labels

We study the problem of learning conditional generators from noisy labeled samples, where the labels are corrupted by random noise. Standard training of conditional GANs on such data will not only produce samples with wrong labels, but also generate poor quality samples. We consider two scenarios, depending on whether the noise model is known or not. When the distribution of the noise is known, we introduce a novel architecture which we call Robust Conditional GAN (RCGAN). The main idea is to corrupt the label of the generated sample before feeding it to the adversarial discriminator, forcing the generator to produce samples with clean labels. This approach of passing through a matching noisy channel is justified by corresponding multiplicative approximation bounds between the loss of the RCGAN and the distance between the clean real distribution and the generator distribution. This shows that the proposed approach is robust when used with a carefully chosen discriminator architecture, known as the projection discriminator. When the distribution of the noise is not known, we provide an extension of our architecture, which we call RCGAN-U, that learns the noise model simultaneously while training the generator. We show experimentally on the MNIST and CIFAR-10 datasets that both approaches consistently improve upon baseline approaches, and that RCGAN-U closely matches the performance of RCGAN.

## 1 Introduction

Conditional generative adversarial networks (GANs) have been widely successful in several applications including improving image quality, semi-supervised learning, reinforcement learning, category transformation, style transfer, image de-noising, compression, in-painting, and super-resolution [30, 13, 48, 36, 26, 58]. The goal of training a conditional GAN is to generate samples from distributions satisfying certain conditioning on some correlated features. Concretely, given samples from the joint distribution of a data point $x$ and a label $y$, we want to learn to generate samples from the true conditional distribution of the real data, $P_{X|Y}$. A canonical conditional GAN studied in the literature is the case of a discrete label $y$ [30, 36, 35, 32]. Significant progress has been made in this setting, which is typically evaluated on the quality of the conditional samples. These evaluations include measuring inception scores and intra Fréchet inception distances, visual inspection on downstream tasks such as category morphing and super resolution [32], and faithfulness of the samples as measured by how accurately we can infer the class that generated the sample [36].

We study the problem of training conditional GANs with noisy discrete labels. By noisy labels, we refer to a setting where the label for each example in the training set is randomly corrupted. Such noise can result from an adversary deliberately corrupting the data [7] or from human errors in crowdsourced label collection [12, 18]. This can be modeled as a random process, where a clean data point $x \in \mathcal{X}$ and its label $y \in [m]$ are drawn from a joint distribution $P_{X,Y}$ with $m$ classes. For each data point, the label is corrupted by passing through a noisy channel represented by a row-stochastic confusion matrix $C \in \mathbb{R}^{m \times m}$ defined as $C_{y\tilde{y}} \triangleq \mathbb{P}(\tilde{Y} = \tilde{y} \mid Y = y)$. This defines a joint distribution for the data point $x$ and a noisy label $\tilde{y}$: $\tilde{P}_{X,\tilde{Y}}(x, \tilde{y}) = \sum_{y \in [m]} P_{X,Y}(x, y)\, C_{y\tilde{y}}$. If we train a standard conditional GAN on noisy samples, then it solves the following optimization:

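$$\min_{G \in \mathcal{G}} \max_{D \in \mathcal{F}} V(G, D) = \mathbb{E}_{(x,\tilde{y}) \sim \tilde{P}_{X,\tilde{Y}}}\big[\phi(D(x,\tilde{y}))\big] + \mathbb{E}_{z \sim N,\, y \sim \tilde{P}_{\tilde{Y}}}\big[\phi(1 - D(G(z;y), y))\big] \tag{1}$$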

where $\phi$ is a function of choice, $D$ and $G$ are the discriminator and the generator respectively, optimized over function classes $\mathcal{F}$ and $\mathcal{G}$ of our choice, and $N$ is the distribution of the latent random vector. For typical choices of $\phi$, for example $\phi(a) = \log(a)$, and large enough function classes $\mathcal{F}$ and $\mathcal{G}$, the optimal conditional generator learns to generate samples from $\tilde{P}_{X|\tilde{Y}}$, the corrupted conditional distribution. In other words, it generates samples from classes other than the one it is conditioned on. As the learned distribution exhibits such a bias, we call this naive approach the Biased GAN. Under this setting, there is a fundamental question of interest: can we design a novel conditional GAN that can generate samples from the true conditional distribution $P_{X|Y}$, even when trained on noisy samples?
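The label-corruption process described above can be sketched in a few lines. This is only an illustration, not the authors' code: the 3-class matrix `C`, the helper name `corrupt_labels`, and the sample size are hypothetical choices made for the example.

```python
import random

def corrupt_labels(labels, C, rng):
    """Pass each clean label y through the channel: draw the noisy label from row C[y]."""
    m = len(C)
    return [rng.choices(range(m), weights=C[y], k=1)[0] for y in labels]

# Hypothetical 3-class channel; each row sums to 1 (row-stochastic).
C = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
]

rng = random.Random(0)
clean = [rng.randrange(3) for _ in range(20000)]
noisy = corrupt_labels(clean, C, rng)

# Empirical estimate of C[0][0]: fraction of class-0 labels left unchanged.
kept = sum(1 for y, t in zip(clean, noisy) if y == 0 and t == 0)
total = sum(1 for y in clean if y == 0)
empirical_c00 = kept / total
```

A conditional GAN trained naively on pairs `(x, noisy[i])` only ever sees the corrupted labels, which is why it ends up modelling the corrupted conditional distribution.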

Several aspects of this problem make it challenging and interesting. First, the performance of such a robust GAN should depend on how noisy the channel $C$ is. If $C$ is rank-deficient, for instance, then there are multiple distributions that result in the same distribution after the corruption, and hence no reliable learning of the true distribution is possible. We would ideally want a theoretical guarantee that shows such a trade-off between $C$ and the robustness of GANs. Next, when the noise is from errors in crowdsourced labels, we might have some access to the confusion matrix $C$ from historical data. In other cases of adversarial corruption, we might not have any information about $C$. We want to provide robust solutions to both. Finally, an important practical challenge in this setting is to correct the noisy labels in the training data. We address all such variations in our approaches and make the following contributions.
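The identifiability issue with a rank-deficient channel can be seen on label marginals alone. The sketch below uses a hypothetical 3-class matrix `C_bad` with two identical rows and two hypothetical clean label distributions `p1` and `p2`; none of these values come from the paper.

```python
def push_through_channel(p, C):
    """Noisy label marginal: p_tilde[j] = sum_i p[i] * C[i][j]."""
    m = len(C)
    return [sum(p[i] * C[i][j] for i in range(m)) for j in range(m)]

# Hypothetical rank-deficient channel: rows 0 and 1 are identical,
# so classes 0 and 1 become indistinguishable after corruption.
C_bad = [
    [0.5, 0.5, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]

p1 = [0.6, 0.1, 0.3]
p2 = [0.1, 0.6, 0.3]

noisy1 = push_through_channel(p1, C_bad)
noisy2 = push_through_channel(p2, C_bad)
same_after_noise = all(abs(a - b) < 1e-12 for a, b in zip(noisy1, noisy2))
```

Two different clean distributions produce the same noisy distribution here, so no method that only observes noisy labels can tell them apart.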

Our contributions. We introduce two architectures to train conditional GANs with noisy samples.

First, when we have knowledge of the confusion matrix $C$, we propose RCGAN (Robust Conditional GAN) in Section 2. We first prove that minimizing the RCGAN loss recovers the clean distribution (Theorem 2), under certain conditions on the class of discriminators $\mathcal{F}$ we optimize over (Assumption 1). We show that such a condition on $\mathcal{F}$ is also necessary, as without it, the training loss can be arbitrarily small while the generated distribution can be far from the real one (Theorem 3). The assumption leads to our particular choice of the discriminator in RCGAN, called the projection discriminator [32], which satisfies all the conditions (Remark 2). Finally, we provide a finite sample generalization bound showing that the loss minimized in training RCGAN does generalize, and results in the learned distribution being close to the clean conditional distribution (Theorem 4). Experimental results on benchmark datasets confirm that RCGAN is robust against noisy samples, and improves significantly over the naive Biased GAN.

Secondly, when we do not have access to $C$, we propose RCGAN-U (RCGAN with Unknown noise distribution) in Section 4. We provide experimental results showing that performance gains similar to those of RCGAN can be achieved. Finally, we showcase the practical use of the conditional GANs learned this way, by using them to fix the noisy labels in the training data. Numerical experiments confirm that the RCGAN framework provides a more robust approach to correcting the noisy labels, compared to the state-of-the-art methods that rely only on discriminators.

Related work. Two popular training methods for generative models are variational auto-encoders [22] and adversarial training [14]. The adversarial training approach has made significant advances in several applications of practical interest. [37, 2, 5] propose new architectures that significantly improve the training on practical image datasets. [58, 16] propose new architectures to transfer the style of one image to another domain. [26, 43] show how to enhance a given image with a learned generator, by enhancing the resolution or making it more realistic. [27, 50] show how to generate videos and [51, 1] demonstrate that 3-dimensional models can be generated from adversarial training. [23] proposes a new architecture encoding causal structures in conditional GANs. [42] introduces the state-of-the-art conditional independence tester. In a different direction, several recent approaches showcase how the manifold learned by adversarial training can be used to solve inverse problems [9, 57, 53, 49].

Conditional GANs have been proposed as a successful tool for various applications, including class conditional image generation [36], image to image translation [21], and image generation from text [38, 55]. Most conditional GANs incorporate the class information by naively concatenating it to the input or to the feature vector at some middle layer [30, 13, 38, 55]. AC-GAN [36] creates an auxiliary classifier to incorporate class information. The projection discriminator GAN [32] takes an inner product between the embedded class vector and the feature vector. A recent work [31], which proposes spectral normalization, shows that high quality image generation on the 1000-class ILSVRC2012 dataset [39] can be achieved using a projection conditional discriminator.

Robustness of (unconditional) GANs against adversarial or random noise has recently been studied in [10, 52]. [52] studies an adversary attacking the output of the discriminator, perturbing the discriminator output with random noise. The proposed architecture of RCGAN is inspired by the closely related work on AmbientGAN in [10]. AmbientGAN is a general framework addressing any corruption on the data itself (not necessarily just the labels). Given corrupted samples with known corruption, AmbientGAN applies that corruption to the output of the generator before feeding it to the discriminator. This has been shown to successfully de-noise images in several practical scenarios.

Motivated by the success of AmbientGAN in de-noising, we propose RCGAN. An important distinction is that we make specific architectural choices guided by our theoretical analysis that gives a significant gain in practice as shown in Section 6. Under the scenario of interest with noisy labels, we provide sharp analyses for both the population loss and the finite sample loss. Such sharp characterizations do not exist for the more general AmbientGAN scenarios. Further, our RCGAN-U does not require the knowledge of the confusion matrix, departing from the AmbientGAN approach. Training classifiers from noisy labels is a closely related problem. Recently, [34, 20] proposed a theoretically motivated classifier which minimizes the modified loss in presence of noisy labels and showed improvement over the robust classifiers [29, 45, 46].

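Notation. For a vector $x$, $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$ is the standard $\ell_p$-norm. For a matrix $A$, let $|||A|||_p = \max_{\|x\|_p = 1} \|Ax\|_p$ denote the operator norm. Then $|||A|||_\infty = \max_i \sum_j |A_{ij}|$, and $|||A|||_2 = \sigma_{\max}(A)$, the maximum singular value. $\mathbf{1}$ is the all ones vector with appropriate dimensions and $\mathbf{I}$ is the identity matrix with appropriate dimensions. $[m] = \{1, \ldots, m\}$. For a vector $x \in \mathbb{R}^m$, $x_i$ ($i \in [m]$) is its $i$-th coordinate.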

## 2 Our first architecture: RCGAN

Training a conditional GAN with noisy samples results in a biased generator. We propose the Robust Conditional GAN (RCGAN) architecture, which has the following pre-processing, discriminator update, and generator update steps. We assume in this section that the confusion matrix $C$ is known (and the marginal $P_Y$ can easily be inferred), and address the case of unknown $C$ in Section 4.

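Pre-processing: We train a classifier $h^*$ to predict the noisy label $\tilde{y}$ given $x$ under a loss $\ell$, trained on $h^* \in \arg\min_{h \in \mathcal{H}} \mathbb{E}_{(x,\tilde{y}) \sim \tilde{P}_{X,\tilde{Y}}}[\ell(h(x), \tilde{y})]$, where $\mathcal{H}$ is a parametric family of classifiers (typically neural networks) and $\tilde{P}_{X,\tilde{Y}}$ is the joint distribution of real $x$ and the corresponding real noisy $\tilde{y}$.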

D-step: We train $D$ on the following adversarial loss. In the second term below, $y$ is generated according to $P_Y$ and the corresponding noisy labels are generated by corrupting $y$ according to the conditional distribution $C_y$, which is the $y$-th row of the confusion matrix (assumed to be known):

$$\max_{D \in \mathcal{F}} \mathbb{E}_{(x,\tilde{y}) \sim \tilde{P}_{X,\tilde{Y}}}\big[\phi(D(x,\tilde{y}))\big] + \mathbb{E}_{z \sim N,\; y \sim P_Y,\; \tilde{y}|y \sim C_y}\big[\phi(1 - D(G(z;y), \tilde{y}))\big],$$

where $P_Y$ is the true marginal distribution of the labels, $N$ is the distribution of the latent random vector, and $\mathcal{F}$ is a family of discriminators.

G-step: We train $G$ on the following loss with some regularization weight $\lambda$:

$$\min_{G \in \mathcal{G}} \mathbb{E}_{z \sim N,\; y \sim P_Y,\; \tilde{y}|y \sim C_y}\big[\phi(1 - D(G(z;y), \tilde{y})) + \lambda\, \ell(h^*(G(z;y)), y)\big], \tag{2}$$

where $\mathcal{G}$ is a family of generators. Auxiliary classifiers have been used to improve the quality of the images and the stability of the training, for example in the auxiliary classifier GAN (AC-GAN) [36], and to improve the quality of clustering in the latent space [33]. We propose an auxiliary classifier $h^*$ to mitigate a permutation error, which we empirically identified in a naive implementation of our idea with no regularizers.

Permutation regularizer (controlled by $\lambda$). Permutation error occurs if, when asked to produce samples from a target class, the trained generator produces samples dominantly from a single class but different from the target class. We propose a regularizer based on $h^*$, which predicts the noisy label $\tilde{y}$. As long as the confusion matrix is diagonally dominant, which is a necessary condition for identifiability, this regularizer encourages the correct permutation of the labels.
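The adversarial part of the D-step and G-step can be sketched as a Monte Carlo estimate: the generator is conditioned on the clean label, and only the discriminator sees the corrupted one. This is a toy sketch under stated assumptions, not the authors' implementation: `G` and `D` are hypothetical one-dimensional stand-ins for neural networks, $\phi = \log$, the latent distribution is a standard Gaussian, and the regularizer term with $\lambda$ is omitted.

```python
import math
import random

def rcgan_fake_term(G, D, C, label_prior, phi, n, rng):
    """Monte Carlo estimate of E[phi(1 - D(G(z; y), y_noisy))].

    y ~ label_prior, z ~ N(0, 1), y_noisy ~ C[y] (the y-th row of C).
    """
    m = len(C)
    total = 0.0
    for _ in range(n):
        y = rng.choices(range(m), weights=label_prior, k=1)[0]
        z = rng.gauss(0.0, 1.0)
        x = G(z, y)  # generated with the CLEAN label
        y_noisy = rng.choices(range(m), weights=C[y], k=1)[0]
        total += phi(1.0 - D(x, y_noisy))  # discriminator sees the NOISY label
    return total / n

# Toy stand-ins (hypothetical): 1-D "images" clustered around the class index.
def G(z, y):
    return y + 0.1 * z

def D(x, y):
    # Scores a pair as "real" when x is close to y; output stays inside (0, 1).
    return 0.05 + 0.9 * math.exp(-(x - y) ** 2)

C = [[0.9, 0.1], [0.1, 0.9]]
rng = random.Random(0)
loss = rcgan_fake_term(G, D, C, [0.5, 0.5], math.log, 5000, rng)
```

In a real training loop the same sampling pattern would be applied per mini-batch, with the discriminator maximizing and the generator minimizing this term.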

Theoretical motivation for RCGAN. When $\lambda = 0$, we get the standard conditional GAN update steps, albeit one which tries to minimize the discriminator loss between the noisy real distribution $\tilde{P}$ and the distribution $\tilde{Q}$ of the generator when the label is passed through the same noisy channel parameterized by $C$. The main idea of RCGAN is to minimize a certain divergence between noisy real data and noisy generated data. For example, the choice of bounded functions for $\mathcal{F}$ and the identity map $\phi(a) = a$ leads to a total variation minimization: the loss minimized in the G-step is the total variation $d_{\rm TV}(\tilde{P}, \tilde{Q})$ between the two distributions with corrupted labels, up to some scaling and some shift. If we choose $\phi(a) = \log(a)$ and let $\mathcal{F}$ contain all functions into $[0,1]$, then we are minimizing the Jensen-Shannon divergence $d_{\rm JS}(\tilde{P} \,\|\, \tilde{Q})$, which is defined in terms of the Kullback-Leibler divergence $d_{\rm KL}$. The following theorem provides approximation guarantees for some common divergence measures over a noisy channel, justifying our proposed practical approach. We refer to Appendix B for a proof.

###### Theorem 1.

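Let $P_{X,Y}$ and $Q_{X,Y}$ be two distributions on $\mathcal{X} \times [m]$. Let $\tilde{P}_{X,\tilde{Y}}$ and $\tilde{Q}_{X,\tilde{Y}}$ be the corresponding distributions when samples from $P$ and $Q$ are passed through the noisy channel given by the confusion matrix $C$ (as defined in Section 1). If $C$ is full-rank, we get,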

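$$d_{\rm TV}(\tilde{P}, \tilde{Q}) \;\le\; d_{\rm TV}(P, Q) \;\le\; |||C^{-1}|||_\infty \, d_{\rm TV}(\tilde{P}, \tilde{Q}), \tag{3}$$

$$\tfrac{1}{8}\, d_{\rm JS}(\tilde{P} \,\|\, \tilde{Q})^2 \;\le\; \cdots \tag{4}$$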

To interpret this theorem, let $Q_{X,Y}$ denote the distribution of the generator. The theorem implies that when the noisy generator distribution becomes close to the noisy real distribution in total variation or in Jensen-Shannon divergence, then the generator distribution must be close to the distribution of real data in the same metric. This justifies the use of the proposed architecture RCGAN. In practice, we minimize the sample divergence of the two distributions, instead of the population divergence as analyzed in the above theorem. However, these standard divergences are known to not generalize in training GANs [3]. To this end, we provide in Section 3 analyses on neural network distances, which are known to generalize, and provide finite sample bounds.
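A two-sided bound of this kind can be checked numerically on label distributions alone. The sketch below assumes the total variation bound has the form $d_{\rm TV}(\tilde P, \tilde Q) \le d_{\rm TV}(P, Q) \le |||C^{-1}|||_\infty\, d_{\rm TV}(\tilde P, \tilde Q)$; the matrix `C` and the distributions `P` and `Q` are hypothetical, and the data point $x$ is ignored so that only the label marginals matter.

```python
def invert(A):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(A)
    M = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def inf_norm(A):
    """Induced infinity norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def corrupt(p, C):
    """Distribution of the noisy label when the clean label follows p."""
    m = len(C)
    return [sum(p[i] * C[i][j] for i in range(m)) for j in range(m)]

C = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
P = [0.7, 0.2, 0.1]
Q = [0.2, 0.3, 0.5]

clean_tv = tv(P, Q)
noisy_tv = tv(corrupt(P, C), corrupt(Q, C))
approx_factor = inf_norm(invert(C))
```

For this example the noisy distance is smaller than the clean one, and multiplying it by `approx_factor` gives an upper bound on the clean distance.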

## 3 Theoretical Analysis of RCGAN

It was shown in [3] that standard GAN losses of Jensen-Shannon divergence and Wasserstein distance both fail to generalize with a finite number of samples. On the other hand, more recent advances in analyzing GANs in [56, 6, 4] show promising generalization bounds by either assuming Lipschitz conditions on the generator model or by restricting the analysis to certain classes of distributions. Under those assumptions, where JS divergence generalizes, Theorem 1 justifies the use of the proposed RCGAN. However, those require the distribution to be Gaussian, mixture of Gaussians, or output of a neural network generator, for example in [4].

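In this section, we provide analyses of RCGAN on a distance that generalizes without any assumptions on the distribution of the real data, as proven in [3]: the neural network distance. Formally, consider a class of real-valued functions $\mathcal{F}$ and a function $\phi$ which is either convex or concave. The neural network distance is defined as

$$d_{\mathcal{F},\phi}(P, Q) \triangleq \sup_{D \in \mathcal{F}} \; \mathbb{E}_{(x,y) \sim P}\big[\phi(D(x,y))\big] + \mathbb{E}_{(x,y) \sim Q}\big[\phi(1 - D(x,y))\big] - 2\phi(1/2), \tag{5}$$

where $P$ is the distribution of the real data, $Q$ is that of the generated data, and $2\phi(1/2)$ is the constant correction term to ensure that $d_{\mathcal{F},\phi}(P, P) = 0$. We further assume that $\mathcal{F}$ includes three constant functions $D(x,y) = 0$, $D(x,y) = 1/2$, and $D(x,y) = 1$, in order to ensure that $d_{\mathcal{F},\phi}(P, Q) \ge 0$ and $d_{\mathcal{F},\phi}(P, P) = 0$, as shown in Lemma 1 in the Appendix.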

The proposed RCGAN with $\lambda = 0$ approximately minimizes the neural network distance between the two corrupted distributions. In practice, $\mathcal{F}$ is a parametric family of functions from a specific neural network architecture that the designer has chosen. In theory, we aim to identify how the choice of the class $\mathcal{F}$ provides the desired approximation bounds similar to those in Theorem 1, but for neural network distances. This analysis leads to the choice of the projection discriminator [32] to be used in RCGAN (Remark 2). On the other hand, we show in Theorem 3 that an inappropriate choice of the discriminator architecture can cause non-approximation. Further, we provide the sample complexity of the approximation bounds in Theorem 4.

We refer to the un-regularized version with $\lambda = 0$ as simply RCGAN. In this section, we focus on a class of loss functions called Integral Probability Metrics (IPM), where $\phi(a) = a$ [44]. This is a popular choice of loss in GANs in practice [47, 2, 8] and in analyses [4]. We write the induced neural network distance as $d_{\mathcal{F}}(P, Q)$, dropping the $\phi$ in the notation.

### 3.1 Approximation bounds for neural network distances

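We define an operation $\circ$ over a matrix $T \in \mathbb{R}^{m \times m}$ and a class $\mathcal{F}$ of functions on $\mathcal{X} \times [m]$ as

$$T \circ \mathcal{F} \triangleq \Big\{ g \;\Big|\; g(x, y) = \sum_{\tilde{y} \in [m]} T_{y\tilde{y}}\, f(x, \tilde{y}),\; f \in \mathcal{F} \Big\}. \tag{6}$$

This makes it convenient to represent the neural network distance corrupted by noise with a confusion matrix $C$, where $C_{y\tilde{y}}$ is the probability that a label $y$ is corrupted as $\tilde{y}$. Formally, it follows from (5) and (6) that $d_{\mathcal{F}}(\tilde{P}, \tilde{Q}) = d_{C \circ \mathcal{F}}(P, Q)$. We refer to Appendix E for a proof. For $d_{\mathcal{F}}(\tilde{P}, \tilde{Q})$ to be a good approximation of $d_{\mathcal{F}}(P, Q)$, we show that the following condition is sufficient.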

###### Assumption 1.

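We assume that the class of discriminator functions $\mathcal{F}$ can be decomposed into three parts, $\mathcal{F} = \{f_1 + f_2 + c \mid f_1 \in \mathcal{F}_1, f_2 \in \mathcal{F}_2\}$, such that $c$ is any constant and

• $\mathcal{F}_1$ satisfies the inclusion condition:

$$T \circ \mathcal{F}_1 \subseteq \mathcal{F}_1, \tag{7}$$

for all max-norm bounded matrices $T$; and

• $\mathcal{F}_2$ satisfies the label invariance condition: there exists a class $\mathcal{F}^{(x)}$ of functions of $x$ alone such that

$$\mathcal{F}_2 = \big\{ \alpha f(x, y) \;\big|\; f(x, y) = f(x),\; f(x) \in \mathcal{F}^{(x)},\; \alpha \in [0, 1] \big\}. \tag{8}$$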

We discuss the necessity and practical implications of this assumption in Section 3.2, and give examples satisfying these assumptions in Remarks 2 and 3. Notice that a trivial class with a single constant zero function satisfies both the inclusion and the label invariance conditions. For example, we can choose $c = 0$ and also choose to set either $\mathcal{F}_1 = \{0\}$ or $\mathcal{F}_2 = \{0\}$, in which case $\mathcal{F}$ only needs to satisfy either one of the conditions in Assumption 1. The flexibility that we gain by allowing the set addition $\mathcal{F}_1 + \mathcal{F}_2$ is critical in applying these conditions to practical discriminators, especially in proving Remark 2. Note that in the inclusion condition in Eq. 7, we require the condition to hold for the set of all max-norm bounded matrices. The reason a weaker condition over all row-stochastic matrices does not suffice is that, in order to prove the upper bound in Eq. 9, we need to apply the inclusion condition to $C^{-1} / |||C^{-1}|||_\infty$. This matrix is not row-stochastic, but still max-norm bounded.

We first show that Assumption 1 is sufficient for approximability of the neural network distance from corrupted samples. For two distributions $P$ and $Q$ on $\mathcal{X} \times [m]$, let $\tilde{P}$ and $\tilde{Q}$ be the corresponding corrupted distributions respectively, where the label is passed through the noisy channel defined by the confusion matrix $C$, i.e. $\tilde{P}(x, \tilde{y}) = \sum_{y \in [m]} P(x, y)\, C_{y\tilde{y}}$.

###### Theorem 2.

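If a class of functions $\mathcal{F}$ satisfies Assumption 1, then

$$d_{\mathcal{F}}(\tilde{P}, \tilde{Q}) \;\le\; d_{\mathcal{F}}(P, Q) \;\le\; |||C^{-1}|||_\infty \, d_{\mathcal{F}}(\tilde{P}, \tilde{Q}), \tag{9}$$

where we follow the convention that $|||C^{-1}|||_\infty = \infty$ if $C$ is not full rank.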

We refer to Appendix E for a proof. This gives a sharp characterization of how the two distances are related: the one we can minimize in training RCGAN (i.e. $d_{\mathcal{F}}(\tilde{P}, \tilde{Q})$) and the true measure of closeness (i.e. $d_{\mathcal{F}}(P, Q)$). Although the latter cannot be directly evaluated or minimized, RCGAN is approximately minimizing the true neural network distance as desired.

The lower bound proves a special case of the data-processing inequality. Two random variables from $P$ and $Q$ get closer in neural network distance when passed through a stochastic transformation. The upper bound puts a limit on how much closer $\tilde{P}$ and $\tilde{Q}$ can get, depending on the noise level. This fundamental trade-off is captured by $|||C^{-1}|||_\infty$. In the noiseless case where $C$ is the identity matrix, we have $|||C^{-1}|||_\infty = 1$ and we recover the trivial fact that the two distances are equal. At the other extreme, if $C$ is rank deficient, we use the convention that $|||C^{-1}|||_\infty = \infty$ and the two distances can be arbitrarily different. The approximation factor $|||C^{-1}|||_\infty$ captures how much the space can shrink under the noise $C$. This coincides with Theorem 1, where a similar trade-off was identified for the TV distance. The next remark shows that these bounds cannot be tightened for general $P$, $Q$, and $\mathcal{F}$. A proof is provided in Appendix D.

###### Remark 1.

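For any full-rank confusion matrix $C$, there exist pairs of distributions $P$ and $Q$, and a function class $\mathcal{F}$ satisfying Assumption 1, such that

• $d_{\mathcal{F}}(P, Q) = d_{\mathcal{F}}(\tilde{P}, \tilde{Q})$, and

• $d_{\mathcal{F}}(P, Q) = |||C^{-1}|||_\infty \, d_{\mathcal{F}}(\tilde{P}, \tilde{Q})$.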

Theorem 2 shows that RCGAN can learn the true conditional distribution, justifying its use, and that the performance of RCGAN is determined by how noisy the samples are via $|||C^{-1}|||_\infty$. There are still two loose ends. First, does a practical implementation of the RCGAN architecture satisfy the inclusion and/or label invariance assumptions? Secondly, in practice we cannot minimize $d_{\mathcal{F}}(\tilde{P}, \tilde{Q})$ as we only have a finite number of samples. How much do we lose in this finite sample regime? We give precise answers to each question in the following two sections.

### 3.2 Inclusion and label invariance assumptions

Several classes of functions satisfy Assumption 1 (cf. Remark 3). For RCGAN, we propose a popular state-of-the-art discriminator for conditional GANs known as the projection discriminator [32], parametrized by $V$, $v$, and $\theta$:

$$D_{V,v,\theta}(x, y) = \mathrm{vec}(y)^T V \psi(x; \theta) + v^T \psi'(x; \theta), \tag{10}$$

where $\psi(x;\theta)$ and $\psi'(x;\theta)$ are vector valued parametric functions of suitable dimensions. The first term satisfies the inclusion condition, as any operation with $T$ can be absorbed into $V$. The second term is label invariant as it does not depend on $y$. This is made precise in the following remark, whose proof is provided in Appendix F. Together with this remark, the approximability result in Theorem 2 justifies the use of projection discriminators in RCGAN, which we use in all our experiments.
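The claim that an operation with $T$ can be absorbed into $V$ can be checked directly on the first term of Eq. 10. The sketch below assumes $\mathrm{vec}(y)$ is the one-hot encoding of $y$, so that $\mathrm{vec}(y)^T V \psi(x)$ is row $y$ of $V$ dotted with $\psi(x)$; the sizes, the matrices `V` and `T`, and the feature vector `psi_x` are hypothetical.

```python
def matmul(A, B):
    """Plain matrix product for small lists of lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def projection_term(V, psi_x, y):
    """vec(y)^T V psi(x): with a one-hot vec(y) this is row y of V dotted with psi(x)."""
    return sum(a * b for a, b in zip(V[y], psi_x))

# Hypothetical sizes: m = 3 classes, 2-dimensional feature psi(x).
V = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.3]]
# Entries bounded by 1 in absolute value, but the rows are not all probability vectors.
T = [[0.7, 0.2, 0.1], [0.0, 1.0, 0.0], [-0.5, 0.25, 0.25]]
psi_x = [1.5, -2.0]

# Left: apply T to the function, (T o f)(x, y) = sum_j T[y][j] * f(x, j).
lhs = [sum(T[y][j] * projection_term(V, psi_x, j) for j in range(3)) for y in range(3)]

# Right: absorb T into the weights, f'(x, y) = vec(y)^T (T V) psi(x).
TV_weights = matmul(T, V)
rhs = [projection_term(TV_weights, psi_x, y) for y in range(3)]
```

The two lists agree, so applying $T$ yields another projection term with weights $TV$. Whether $TV$ stays inside the allowed weight set depends on the norm constraint placed on $V$.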

###### Remark 2.

The class of projection discriminators defined in Eq. 10 satisfies Assumption 1 for any , , and , if

Other choices of and are also possible. For example, or are also sufficient. We find the proposed choice of easy to implement, as a column-wise -norm normalization via projected gradient descent. We describe implementation details in Appendix I.

Next, we ask whether Assumption 1 is also necessary. We show that for all pairs of distributions $(P, Q)$ satisfying the following technical conditions, and all confusion matrices $C$, there exists a class $\mathcal{F}$ for which the approximation bounds in (9) fail.

###### Assumption 2.

We consider a pair of distributions $P$ and $Q$ and a confusion matrix $C$ satisfying the following conditions:

• The random variable $x$ conditioned on the label is a continuous random variable, with a density function under each of $P$ and $Q$.

• There exists , and

is not a right eigenvector of

, for all , where .

A pair violating the above assumptions either has that is a mixture of continuous and discrete distribution, or all ’s are aligned with the right eigenvectors of .

###### Theorem 3.

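For all sufficiently small $\epsilon$, all distributions $P$ and $Q$ satisfying Assumption 2, and all full-rank $C$, there exists $\mathcal{F}_3$ not satisfying Assumption 1, such that

$$d_{\mathcal{F}_3}(\tilde{P}, \tilde{Q}) \le O_\epsilon(\epsilon) \quad \text{and} \quad d_{\mathcal{F}_3}(P, Q) \ge O_\epsilon(1), \tag{11}$$

and $\mathcal{F}_4$ not satisfying Assumption 1, such that

$$d_{\mathcal{F}_4}(\tilde{P}, \tilde{Q}) \ge O_\epsilon(1) \quad \text{and} \quad d_{\mathcal{F}_4}(P, Q) \le O_\epsilon(\epsilon). \tag{12}$$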

We refer to Appendix G for a proof. This implies that some assumptions on the function class are necessary, such as those in Assumption 1. Without any restrictions, we can find bad examples where the two distances $d_{\mathcal{F}}(P, Q)$ and $d_{\mathcal{F}}(\tilde{P}, \tilde{Q})$ are arbitrarily different for any $P$, $Q$, and $C$.

### 3.3 Finite sample analysis

In practice, we do not have access to the probability distributions $\tilde{P}$ and $\tilde{Q}$. Instead, we observe a set of samples of a finite size $n$ from each of them. In training a GAN, we minimize the empirical neural network distance, $d_{\mathcal{F}}(\tilde{P}_n, \tilde{Q}_n)$, where $\tilde{P}_n$ and $\tilde{Q}_n$ denote the empirical distributions of the samples. Inspired by the recent generalization results in [3], we show that this empirical distance minimization leads to small $d_{\mathcal{F}}(P, Q)$ up to an additive error that vanishes with an increasing sample size. As shown in [3], Lipschitz and bounded function classes are critical in achieving sample efficiency for GANs. We follow the same approach over a similar function class. Let

$$\mathcal{F}_{p,L} = \big\{ D_u(x, y) \in [0, 1] \;\big|\; D_u(x, y) \text{ is } L\text{-Lipschitz in } u \text{ and } u \in \mathcal{U} \subseteq \mathbb{R}^p \big\}, \tag{13}$$

be a class of bounded functions with parameter $u$. We say that $D_u$ is $L$-Lipschitz in $u$ if

$$\big| D_{u_1}(x, y) - D_{u_2}(x, y) \big| \le L \| u_1 - u_2 \|, \quad \forall u_1, u_2 \in \mathcal{U},\; x \in \mathcal{X},\; y \in [m]. \tag{14}$$

###### Theorem 4.

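For any class $\mathcal{F}_{p,L}$ of bounded Lipschitz functions satisfying Assumption 1, there exists a universal constant such that

$$d_{\mathcal{F}_{p,L}}(\tilde{P}_n, \tilde{Q}_n) - \epsilon \;\le\; d_{\mathcal{F}_{p,L}}(P, Q) \;\le\; |||C^{-1}|||_\infty \big( d_{\mathcal{F}_{p,L}}(\tilde{P}_n, \tilde{Q}_n) + \epsilon \big), \tag{15}$$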

with probability at least for any and large enough,

We refer to Appendix H for a proof. This justifies the use of the proposed RCGAN, which minimizes $d_{\mathcal{F}_{p,L}}(\tilde{P}_n, \tilde{Q}_n)$, as it leads to the generator distribution $Q$ being close to the real distribution $P$ in neural network distance, $d_{\mathcal{F}_{p,L}}(P, Q)$. These bounds inherit the approximability of the population version in Theorem 2.

## 4 Our second architecture: RCGAN-U

In many real world scenarios the confusion matrix $C$ is unknown. We propose the RCGAN-Unknown (RCGAN-U) algorithm, which jointly estimates the real distribution $P$ and the noise model $C$. The pre-processing and D-steps of RCGAN-U are the same as those of RCGAN, assuming the current guess $M$ of the confusion matrix. As the G-step in (2) is not differentiable in $C$, we use the following reparameterized estimator of the loss, motivated by a similar technique in training classifiers from noisy labels:

$$\min_{G \in \mathcal{G},\, M \in \mathcal{C}} \mathbb{E}_{z \sim N,\; y \sim P_Y}\big[\phi_M(G(z;y), y, D) + \lambda\, \ell(h^*(G(z;y)), y)\big]$$

where $\mathcal{C}$ is the set of all transition matrices and $\phi_M(x, y, D) = \sum_{\tilde{y} \in [m]} M_{y\tilde{y}}\, \phi(1 - D(x, \tilde{y}))$.
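The point of the reparameterization is that an expectation over the noisy label, unlike a sampled noisy label, depends smoothly on the matrix entries. The sketch below assumes $\phi_M$ is the expectation of $\phi(1 - D(x, \tilde y))$ over $\tilde y$ drawn from row $y$ of $M$, with $\phi = \log$; the toy discriminator `D` and the matrix `M` are hypothetical.

```python
import math

def phi_M(M, x, y, D, phi=math.log):
    """Expected fake-sample loss over noisy labels drawn from row y of M."""
    return sum(M[y][j] * phi(1.0 - D(x, j)) for j in range(len(M)))

def D(x, y):
    # Toy discriminator with output inside (0, 1).
    return 0.05 + 0.9 * math.exp(-(x - y) ** 2)

M = [[0.7, 0.3], [0.4, 0.6]]
x, y = 0.2, 0
base = phi_M(M, x, y, D)

# Finite-difference derivative with respect to the entry M[0][1].
eps = 1e-6
M_pert = [row[:] for row in M]
M_pert[0][1] += eps
numeric_grad = (phi_M(M_pert, x, y, D) - base) / eps

# Because phi_M is linear in M, the exact derivative is phi(1 - D(x, 1)).
analytic_grad = math.log(1.0 - D(x, 1))
```

The finite-difference and exact derivatives agree, which is what allows gradient-based updates of $M$. The perturbation here ignores the constraint that rows of $M$ sum to one; a practical implementation would need some parameterization that keeps $M$ a valid transition matrix.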

## 5 Experiments

Implementation details are explained in Appendix I. We consider one-coin based models, which are parameterized by their label accuracy probability $\alpha$. In this model a sample with true label $y$ is flipped uniformly at random to a label $\tilde{y}$ in $[m] \setminus \{y\}$ with probability $1 - \alpha$. The entries of its confusion matrix $C$ will then be $C_{ii} = \alpha$ and $C_{ij} = (1 - \alpha)/(m - 1)$ for $i \ne j$, where $m$ is the number of classes. We call this model the uniform flipping model. We train the proposed GANs on the MNIST and CIFAR-10 datasets [25, 24] and compare them to two baselines. Code to reproduce our experiments is available at https://github.com/POLane16/Robust-Conditional-GAN.
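A confusion matrix for a uniform flipping model can be built as follows. The sketch assumes the label accuracy sits on the diagonal and the remaining probability mass is split evenly over the other $m - 1$ classes; the function name `uniform_flipping_matrix` and the example values are hypothetical.

```python
def uniform_flipping_matrix(m, alpha):
    """Confusion matrix with alpha on the diagonal and (1 - alpha) / (m - 1) elsewhere."""
    off = (1.0 - alpha) / (m - 1)
    return [[alpha if i == j else off for j in range(m)] for i in range(m)]

# Ten classes with 70% label accuracy.
C = uniform_flipping_matrix(10, 0.7)
row_sums = [sum(row) for row in C]

# At chance-level accuracy alpha = 1/m every entry is equal, so all rows
# coincide and the noisy label carries no information about the clean one.
chance = uniform_flipping_matrix(4, 0.25)
```

The second matrix is the rank-deficient extreme of this model, where clean labels cannot be recovered from noisy ones.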

Baselines. The first is the biased GAN, which is a conditional GAN applied directly to the noisy data. The loss is hence biased, and the true conditional distribution is not the optimal solution of this biased loss. The next natural baseline uses a de-biased classifier as the discriminator, motivated by the approach of [34] on learning classifiers from noisy labels. The main insight is to modify the loss function according to $C$, such that in expectation the loss matches that of the clean data. We refer to this approach as the unbiased GAN. Concretely, when training the discriminator, we propose the following (modified) de-biased loss:

$$\max_{D \in \mathcal{F}} \mathbb{E}_{(x,\tilde{y}) \sim \tilde{P}_{X,\tilde{Y}}}\Big[\sum_{y \in [m]} (C^{-1})_{\tilde{y}y}\, \phi(D(x, y))\Big] + \mathbb{E}_{z \sim N,\; y \sim P_Y}\big[\phi(1 - D(G(z;y), y))\big]. \tag{16}$$

This is unbiased, as the first term is equivalent to $\mathbb{E}_{(x,y) \sim P_{X,Y}}[\phi(D(x, y))]$, which is the standard GAN loss with clean samples. However, such de-biasing is sensitive to the condition number of $C$, and can become numerically unstable for noisy channels as $C^{-1}$ has large entries [20]. For both datasets, we use linear classifiers for the permutation regularizer of the RCGAN-U architecture.
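Both the unbiasedness of the first term and its sensitivity to noise can be checked numerically. The sketch below assumes a uniform flipping channel with hypothetical sizes and accuracies; the vector `g` stands in for the values $\phi(D(x, y))$ at a fixed $x$, and `invert` is a small helper written for this example.

```python
def invert(A):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(A)
    M = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def flip_matrix(m, alpha):
    off = (1.0 - alpha) / (m - 1)
    return [[alpha if i == j else off for j in range(m)] for i in range(m)]

def debiased_values(g, C_inv):
    """g_hat[noisy] = sum_y C_inv[noisy][y] * g[y], the inner sum of the de-biased loss."""
    m = len(g)
    return [sum(C_inv[t][y] * g[y] for y in range(m)) for t in range(m)]

m = 4
g = [0.3, -1.2, 0.8, 2.0]  # stand-in for phi(D(x, y)) at a fixed x

C = flip_matrix(m, 0.6)
g_hat = debiased_values(g, invert(C))
# Averaging g_hat over the noisy label drawn from row y recovers g[y].
recovered = [sum(C[y][t] * g_hat[t] for t in range(m)) for y in range(m)]

# Largest entry of C^{-1} as the label accuracy approaches chance level 1/m = 0.25.
max_entry = {a: max(abs(v) for row in invert(flip_matrix(m, a)) for v in row)
             for a in (0.9, 0.6, 0.3, 0.26)}
```

The recovered values match `g`, while the largest entry of the inverse grows rapidly as the accuracy approaches chance, so the de-biased loss multiplies discriminator outputs by increasingly large weights.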

### 5.1 MNIST

We train five architectures on MNIST dataset corrupted by the uniform flipping noise: RCGAN+y, RCGAN, RCGAN-U, unbiased GAN, and biased GAN. RCGAN+y architecture has the same architecture as RCGAN but the input to the first layer of its discriminator is concatenated with a one-hot representation of the label. We discuss our techniques to overcome the challenges involved in training RCGAN+y in Appendix I.

Conditional generators can be used to generate samples from a particular class $y$ among the classes they learned. We can then use a pre-trained classifier to compare $y$ to the class of the generated sample as perceived by that classifier. We compare the generator label accuracy, defined as the fraction of generated samples for which the two agree, in Figure 2, left panel. We generated labels chosen uniformly at random and corresponding conditional samples from the generators, and calculated the generator label accuracy using a CNN classifier pre-trained on the clean MNIST data to an accuracy of 99.2%. The proposed RCGAN significantly improves upon the competing baselines, and achieves almost perfect label accuracy up to a high noise level. RCGAN+y further improves upon RCGAN and attains very high accuracy even under heavier noise. The high accuracy of RCGAN-U suggests that robust training is possible without prior knowledge of the confusion matrix $C$. As expected, the accuracy of the biased GAN tracks the label accuracy of the noisy training data.
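The generator label accuracy can be computed with a short evaluation loop. The sketch below is a toy illustration: `permuted_G` is a hypothetical generator that swaps classes 0 and 1 (mimicking the permutation error discussed in Section 2), and `f` is a hypothetical nearest-integer classifier standing in for the pre-trained CNN.

```python
import random

def generator_label_accuracy(G, f, m, n, rng):
    """Fraction of generated samples whose predicted class f(G(z; y)) equals y."""
    hits = 0
    for _ in range(n):
        y = rng.randrange(m)  # label chosen uniformly at random
        z = rng.gauss(0.0, 1.0)
        hits += int(f(G(z, y)) == y)
    return hits / n

# Toy generator (hypothetical): 1-D samples near the class index, with classes 0 and 1 swapped.
def permuted_G(z, y):
    swapped = {0: 1, 1: 0}.get(y, y)
    return swapped + 0.05 * z

# Toy classifier (hypothetical): round to the nearest class index.
def f(x):
    return round(x)

acc = generator_label_accuracy(permuted_G, f, 4, 4000, random.Random(0))
```

With four classes and two of them swapped, about half of the generated samples carry the wrong label, even though every individual sample looks like a clean member of some class.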

An immediate application of robust GANs is recovering the true labels of the noisy training data, which is an important and challenging problem in crowdsourcing. We propose a new meta-algorithm, which we call cGAN-label-recovery, which uses any conditional generator trained on the noisy samples to estimate the true label of a sample $x$ using the following optimization.

 (17)

In the right panel of Figure 2 we compare the label recovery accuracy of the meta-algorithm using the five conditional GANs, on randomly chosen noisy training samples. This is also compared to a state-of-the-art method [34] for label recovery, which proposed minimizing an unbiased loss function given the noisy labels and the confusion matrix. This unbiased classifier was shown to outperform the robust classifiers [29, 45, 46] and can be used to predict the true label of the training examples. In Figure 4 of Appendix J, we show example images from all the generators.

### 5.2 CIFAR-10

In Figure 3, we show the inception score [40] and the label accuracy of the conditional generator for the four approaches: our proposed RCGAN and RCGAN-U, against the baselines Unbiased (Section 5) and Biased (Section 1) GANs trained using CIFAR-10 images [24], while varying the label accuracy of the real data under uniform flipping. In RCGAN-U, even with the regularizer, the learned confusion matrix was a permuted version of the true $C$, possibly because a linear classifier might be too simple to classify CIFAR images. To combat this, we initialized the confusion matrix to be diagonally dominant (Appendix I).

In the left panel of Figure 3, our RCGAN and RCGAN-U consistently achieve higher inception scores than the other two approaches. The Unbiased GAN is highly unstable and hence produces garbage images for large noise (Fig. 5), possibly due to numerical instability of $C^{-1}$, as noted in [20]. This confirms that robust GANs not only produce images from the correct class, but also produce better quality images. In the right panel of Figure 3, we report the generator label accuracy (Section 5.1) on samples generated by each GAN. We classify the generated images using a ResNet model trained on the noiseless CIFAR-10 dataset (https://github.com/wenxinxu/resnet-in-tensorflow). The Biased GAN has significantly lower label accuracy, whereas the Unbiased GAN has a low inception score. In Figure 5 in Appendix J, we show example images from the three generators for the different flipping probabilities. We believe that the gain from using the proposed robust GANs will be larger when we train to higher accuracy with larger networks and extensive hyperparameter tuning, with the latest innovations in GAN architectures, for example [54, 28, 17, 19, 41].

## 6 Numerical comparisons with AmbientGAN [10]

In Table 1, we report the generated label accuracy (as defined in Section 5.1) of RCGAN (which uses the proposed projection discriminator) and AmbientGAN (which uses the DCGAN with no projection discriminator) for multiple values of noise levels (). One of the main reasons for the performance drop of AmbientGAN is that, without the projection discriminator, training of AmbientGAN is sensitive to how the mini-batches are chosen. For example, if the distribution of the labels in the mini-batch of the real data is different from that of the mini-batch of the generated data, then the performance of (conditional) AmbientGAN drops significantly. This is critical as we have noisy labels, and matching the labels in the mini-batch is challenging. Our proposed RCGAN provides an architecture and training methods for applying AmbientGAN to noisy labeled data, to overcome these challenges. When a projection discriminator is used, as in all our RCGAN and RCGAN-U implementations, the performance is not sensitive to how the mini-batches are sampled. When a discriminator that is not necessarily a projection discriminator is used, as in our RCGAN+ architecture, we propose a novel scheduling of the training, which avoids local minima resulting from mismatched mini-batches (explained in Appendix I). The results are averaged over 10 instances.

## 7 Conclusion

Standard conditional GANs can be sensitive to noise in the labels of the training data. We propose two new architectures to make them robust, one requiring knowledge of the distribution of the noise and another which does not, and demonstrate the robustness on the benchmark datasets CIFAR-10 and MNIST. We further showcase how the learned generator can be used to recover the corrupted labels in the training data, which can potentially be used in practical applications. The proposed architecture combines the noise-adding idea of AmbientGAN [10], the projection discriminator of [32], and regularizers similar to those in InfoGAN [11]. Inspired by AmbientGAN [10], the main idea is to pair the generator output image with a label that is passed through a noisy channel before feeding it to the discriminator. We justify this idea of adding noise by identifying a certain class of discriminators that have good generalization properties. In particular, we prove that the projection discriminator, introduced in [32], has a good generalization property. We showcase that the proposed architecture, when trained with a regularizer, has superior robustness on benchmark datasets.

One weakness of our theoretical result in Theorem 4 is that depending on the choice of (i.e. the representation power of the parametric class ), closeness in the neural network distance does not always imply closeness of the distributions. It is generally a challenging problem to address generalization for a specific function class and a pair of distributions and . However, a recent breakthrough on the generalization properties of GANs in [4] makes the connection between and precise, under some assumptions on the and . This leads to the following research question: under which class of distributions and does the neural network distance of the proposed conditional GAN with projection discriminator generalize? The emphasis is on studying the class of functions satisfying Assumption 1 and identifying the corresponding family of distributions that generalize under this function class.

## Acknowledgement

This work is supported by NSF awards CNS-1527754, CCF-1553452, CCF-1705007, RI-1815535 and Google Faculty Research Award. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). This work is partially supported by the generous research credits on AWS cloud computing resources from Amazon.

## Appendix A Notations and Lemmas

### A.1 Additional Notation

Here we define some additional notation required for the proofs. If is a function of two variables of , where , then is the vector . If is a probability distribution of , then is the conditional distribution of given .

For a matrix , let . Then , and , the maximum singular value. is all ones vector with appropriate dimensions and is identity matrix with appropriate dimensions. . For a vector , () is its -th coordinate.

For the sake of proof we will assume that is class of vector functions of the form . In terms of the notation in the main material original is here. For a class of vector valued functions . Therefore we re-define the operation between a matrix and as,

$$
T \circ \mathcal{F} = \{\, T D(\cdot) \mid D(\cdot) \in \mathcal{F} \,\}.
$$

If is probability distribution of , then is the conditional discrete distribution of given , is the marginal density of , and

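$$
\begin{aligned}
\bar{P}_{Y|X=x} &= \big[ P_{Y|X=x}(Y=1),\, P_{Y|X=x}(Y=2),\, \ldots,\, P_{Y|X=x}(Y=m) \big]^T, \text{ and} && (18) \\
\bar{p}_X(x) &= p_X(x)\, \bar{P}_{Y|X=x}. && (19)
\end{aligned}
$$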

### A.2 Supporting Lemmas

###### Lemma 1 (Characterization of neural network distance).

for all . And if is a convex or concave function, then the Neural network distance is when the distributions are same, i.e. .

###### Proof.

For concave we define . Since, by definition is feasible solution to the optimization problem in (5), thus .

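$$
\begin{aligned}
d_{\mathcal{F},\phi}(P,P) &= \sup_{D\in\mathcal{F}} \mathbb{E}_{(x,y)\sim P}\big[\phi(D(x)_y) + \phi(1 - D(x)_y)\big] - 2\phi(1/2) \\
&\le \sup_{D\in\mathcal{F}} 2\,\phi\Big(\mathbb{E}_{(x,y)\sim P}\Big[\tfrac{1}{2}\big(D(x)_y + 1 - D(x)_y\big)\Big]\Big) - 2\phi(1/2) \\
&= \sup_{D\in\mathcal{F}} 2\phi(1/2) - 2\phi(1/2) = 0.
\end{aligned}
$$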

The inequality in the second line follows from Jensen’s inequality for concave $\phi$.

For convex we define . Since, by definition is feasible solution to the optimization problem in (5), thus .

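$$
\begin{aligned}
d_{\mathcal{F},\phi}(P,P) &= \sup_{D\in\mathcal{F}} \mathbb{E}_{(x,y)\sim P}\big[\phi(D(x)_y) + \phi(1 - D(x)_y)\big] - \big(\phi(0) + \phi(1)\big) \\
&\le \sup_{D\in\mathcal{F}} \mathbb{E}_{(x,y)\sim P}\big[\phi(0) + \phi(1)\big] - \big(\phi(0) + \phi(1)\big) = 0.
\end{aligned}
$$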

The last inequality follows from Jensen’s inequality for convex $\phi$.

Lemma 1 ensures that all the multiplicative lower bounds and upper bounds in Theorem 3 and its corollaries imply recoverability.
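The two Jensen steps in the proof of Lemma 1 can be spelled out pointwise. The following is a sketch under the assumption that the discriminator output satisfies $t = D(x)_y \in [0,1]$, so that $t$ is a convex combination of the endpoints $0$ and $1$. For convex $\phi$:

$$
\begin{aligned}
\phi(t) &= \phi\big((1-t)\cdot 0 + t \cdot 1\big) \le (1-t)\,\phi(0) + t\,\phi(1), \\
\phi(1-t) &= \phi\big(t \cdot 0 + (1-t)\cdot 1\big) \le t\,\phi(0) + (1-t)\,\phi(1), \\
\phi(t) + \phi(1-t) &\le \phi(0) + \phi(1).
\end{aligned}
$$

For concave $\phi$, applying the definition of concavity to the midpoint of $t$ and $1-t$ gives

$$
\tfrac{1}{2}\phi(t) + \tfrac{1}{2}\phi(1-t) \le \phi\big(\tfrac{1}{2}t + \tfrac{1}{2}(1-t)\big) = \phi(1/2),
\qquad\text{so}\qquad
\phi(t) + \phi(1-t) \le 2\phi(1/2).
$$

Taking expectations over $(x,y)\sim P$ preserves both pointwise bounds, which yields the upper bound of $0$ in each case.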

###### Lemma 2.

If $P$ is a distribution on $\mathcal{X}\times[m]$ and $\tilde{P}$ is the distribution of samples of $P$ when passed through the noisy channel given by the confusion matrix $C$ (as defined in Section 1), then

$$
\bar{\tilde{P}}_{\tilde{Y}|X=x} = C^T \bar{P}_{Y|X=x}, \qquad (20)
$$

where .

###### Proof.
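$$
\begin{aligned}
\tilde{P}_{\tilde{Y}|X=x}(\tilde{Y}=j) &= \sum_{i\in[m]} P(\tilde{Y}=j \mid Y=i)\, P_{Y|X=x}(Y=i), \quad \forall j\in[m] \\
\tilde{P}_{\tilde{Y}|X=x}(\tilde{Y}=j) &= \sum_{i\in[m]} C_{ij}\, P_{Y|X=x}(Y=i), \quad \forall j\in[m] \\
\bar{\tilde{P}}_{\tilde{Y}|X=x} &= C^T \bar{P}_{Y|X=x}
\end{aligned}
$$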
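The identity in Lemma 2 is easy to check numerically for a fixed $x$. The sketch below is an illustration only: it assumes the convention that row $i$ of $C$ holds $P(\tilde{Y}=j \mid Y=i)$, draws a random row-stochastic `C` and clean conditional `p_clean` (both hypothetical), and compares the matrix form $C^T \bar{P}$ with the explicit sum and with labels actually sampled through the channel.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4

# Hypothetical confusion matrix: row i is the distribution of the noisy
# label given clean label i, so every row sums to one.
C = rng.dirichlet(np.ones(m), size=m)
# Hypothetical clean conditional distribution P(Y = i | X = x) for one x.
p_clean = rng.dirichlet(np.ones(m))

# Matrix form from Lemma 2.
p_noisy = C.T @ p_clean

# Explicit sum over clean labels, term by term.
p_noisy_sum = np.array([sum(C[i, j] * p_clean[i] for i in range(m))
                        for j in range(m)])

# Monte Carlo: draw clean labels, then pass each through the channel
# by inverting the cumulative distribution of the matching row of C.
n = 200_000
clean = rng.choice(m, size=n, p=p_clean)
cdf = np.cumsum(C, axis=1)
u = rng.random(n)
noisy = (u[:, None] > cdf[clean]).sum(axis=1)
noisy = np.minimum(noisy, m - 1)   # guard against floating-point round-off
empirical = np.bincount(noisy, minlength=m) / n
```

The same sampling step is what the noisy channel applies to the generator's label in the robust architecture: the label is replaced by a draw from the corresponding row of the confusion matrix.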

## Appendix B Proof of Theorem 1

We first prove the approximation bounds for the total variation distance in Eq. (3), and then use them to prove similar bounds for the Jensen-Shannon divergence in Eq. (4). Recall that the total variation distance between $P$ and $Q$ can be written in several ways:

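$$
\begin{aligned}
d_{\rm TV}(P,Q) &= \max_{S_1,\ldots,S_m} \sum_{y\in[m]} P(S_y,y) - Q(S_y,y) \\
&= \max_{S_1,\ldots,S_m} \sum_{y\in[m]} \big| P(S_y,y) - Q(S_y,y) \big| \\
&= \max_{S_1,\ldots,S_m} \big\| P(\{S_y\}_{y\in[m]},\cdot) - Q(\{S_y\}_{y\in[m]},\cdot) \big\|_1,
\end{aligned}
$$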

where we used the notation of a row-vector $P(\{S_y\}_{y\in[m]},\cdot) = [P(S_1,1), \ldots, P(S_m,m)]$. The lower bound on $d_{\rm TV}(\tilde{P},\tilde{Q})$ follows from

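$$
\begin{aligned}
d_{\rm TV}(P,Q) &= \max_{S_1,\ldots,S_m\subseteq\mathcal{X}} \big\langle \mathbf{1},\, P(\{S_y\}_{y\in[m]},\cdot) - Q(\{S_y\}_{y\in[m]},\cdot) \big\rangle \\
&\overset{(a)}{=} \max_{S_1,\ldots,S_m\subseteq\mathcal{X}} \big\langle \mathbf{1},\, \big(\tilde{P}(\{S_y\}_{y\in[m]},\cdot) - \tilde{Q}(\{S_y\}_{y\in[m]},\cdot)\big)\, C^{-1} \big\rangle \\
&\overset{(b)}{\le} |||C^{-T}|||_1 \max_{S_1,\ldots,S_m\subseteq\mathcal{X}} \big\| \tilde{P}(\{S_y\}_{y\in[m]},\cdot) - \tilde{Q}(\{S_y\}_{y\in[m]},\cdot) \big\|_1 \\
&\overset{(c)}{=} |||C^{-1}|||_\infty\, d_{\rm TV}(\tilde{P},\tilde{Q}),
\end{aligned}
$$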

where (a) follows from the fact that , (b) follows from the fact that , and (c) follows from . The upper bound follows from similar arguments:

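$$
\begin{aligned}
d_{\rm TV}(\tilde{P},\tilde{Q}) &\le |||C^{T}|||_1 \max_{S_1,\ldots,S_m\subseteq\mathcal{X}} \big\| P(\{S_y\}_{y\in[m]},\cdot) - Q(\{S_y\}_{y\in[m]},\cdot) \big\|_1 \\
&= d_{\rm TV}(P,Q)
\end{aligned}
$$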

where last equality uses the fact that