1 Introduction
Conditional generative adversarial networks (GANs) have been widely successful in several applications including improving image quality, semi-supervised learning, reinforcement learning, category transformation, style transfer, image denoising, compression, inpainting, and super-resolution [30, 13, 48, 36, 26, 58]. The goal of training a conditional GAN is to generate samples from distributions satisfying certain conditioning on some correlated features. Concretely, given samples from the joint distribution of a data point $x$ and a label $y$, we want to learn to generate samples from the true conditional distribution of the real data $P_{X|Y}$. A canonical conditional GAN studied in the literature is the case of a discrete label $y$ [30, 36, 35, 32]. Significant progress has been made in this setting, which is typically evaluated on the quality of the conditional samples. These evaluations include measuring inception scores and intra Fréchet inception distances, visual inspection on downstream tasks such as category morphing and super-resolution [32], and faithfulness of the samples as measured by how accurately we can infer the class that generated the sample [36].
We study the problem of training conditional GANs with noisy discrete labels. By noisy labels, we refer to a setting where the label for each example in the training set is randomly corrupted. Such noise can result from an adversary deliberately corrupting the data [7] or from human errors in crowdsourced label collection [12, 18]. This can be modeled as a random process, where a clean data point $x \in \mathcal{X}$ and its label $y \in [m]$ are drawn from a joint distribution $P_{X,Y}$ with $m$ classes. For each data point, the label is corrupted by passing through a noisy channel represented by a row-stochastic confusion matrix $C \in \mathbb{R}^{m \times m}$ defined as $C_{ij} := \mathbb{P}(\tilde{y} = j \mid y = i)$. This defines a joint distribution for the data point $x$ and a noisy label $\tilde{y}$: $\tilde{P}_{X,\tilde{Y}}$. If we train a standard conditional GAN on noisy samples, then it solves the following optimization:
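$$\min_{G \in \mathcal{G}} \max_{D \in \mathcal{F}} \; V(G, D) \;=\; \mathbb{E}_{(x,\tilde{y}) \sim \tilde{P}_{X,\tilde{Y}}}\big[\phi(D(x,\tilde{y}))\big] + \mathbb{E}_{z \sim N,\, y \sim \tilde{P}_{\tilde{Y}}}\big[\phi(1 - D(G(z;y), y))\big] \tag{1}$$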
where $\phi$ is a function of choice, $D$ and $G$ are the discriminator and the generator respectively, optimized over function classes $\mathcal{F}$ and $\mathcal{G}$ of our choice, and $N$ is the distribution of the latent random vector. For typical choices of $\phi$, for example $\log(\cdot)$, and large enough function classes $\mathcal{F}$ and $\mathcal{G}$, the optimal conditional generator learns to generate samples from $\tilde{P}_{X|\tilde{Y}}$, the corrupted conditional distribution. In other words, it generates samples from classes other than the one it is conditioned on. As the learned distribution exhibits such a bias, we call this naive approach the Biased GAN. Under this setting, there is a fundamental question of interest: can we design a novel conditional GAN that can generate samples from the true conditional distribution $P_{X|Y}$, even when trained on noisy samples?
Several aspects of this problem make it challenging and interesting. First, the performance of such a robust GAN should depend on how noisy the channel is. If the confusion matrix $C$ is rank-deficient, for instance, then there are multiple distributions that result in the same distribution after the corruption, and hence no reliable learning of the true distribution is possible. We would ideally want a theoretical guarantee that shows such a tradeoff between $C$ and the robustness of GANs. Next, when the noise is from errors in crowdsourced labels, we might have some access to the confusion matrix $C$ from historical data. In other cases of adversarial corruption, we might not have any information about $C$. We want to provide robust solutions to both. Finally, an important practical challenge in this setting is to correct the noisy labels in the training data. We address all such variations in our approaches and make the following contributions.
Our contributions. We introduce two architectures to train conditional GANs with noisy samples.
First, when we have knowledge of the confusion matrix $C$, we propose RCGAN (Robust Conditional GAN) in Section 2. We first prove that minimizing the RCGAN loss recovers the clean distribution (Theorem 2), under certain conditions on the class of discriminators $\mathcal{F}$ we optimize over (Assumption 1). We show that such a condition on $\mathcal{F}$ is also necessary, as without it, the training loss can be arbitrarily small while the generated distribution can be far from the real one (Theorem 3). The assumption leads to our particular choice of the discriminator in RCGAN, called the projection discriminator [32], which satisfies all the conditions (Remark 2). Finally, we provide a finite sample generalization bound showing that the loss minimized in training RCGAN does generalize, and results in the learned distribution being close to the clean conditional distribution (Theorem 4). Experimental results on benchmark datasets confirm that RCGAN is robust against noisy samples, and improves significantly over the naive Biased GAN.
Secondly, when we do not have access to $C$, we propose RCGAN-U (RCGAN with Unknown noise distribution) in Section 4. We provide experimental results showing that performance gains similar to those of RCGAN can be achieved. Finally, we showcase the practical use of the conditional GANs learned this way, by using them to fix the noisy labels in the training data. Numerical experiments confirm that the RCGAN framework provides a more robust approach to correcting the noisy labels, compared to the state-of-the-art methods that rely only on discriminators.
Related work. Two popular training methods for generative models are variational autoencoders [22] and adversarial training [14]. The adversarial training approach has made significant advances in several applications of practical interest. [37, 2, 5] propose new architectures that significantly improve the training on practical image datasets. [58, 16] propose new architectures to transfer the style of one image to the other domain. [26, 43] show how to enhance a given image with a learned generator, by enhancing the resolution or making it more realistic. [27, 50] show how to generate videos and [51, 1] demonstrate that 3-dimensional models can be generated from adversarial training. [23] proposes a new architecture encoding causal structures in conditional GANs. [42] introduces a state-of-the-art conditional independence tester. In a different direction, several recent approaches showcase how the manifold learned by adversarial training can be used to solve inverse problems [9, 57, 53, 49].
Conditional GANs have been proposed as a successful tool for various applications, including class conditional image generation [36], image-to-image translation [21], and image generation from text [38, 55]. Most conditional GANs incorporate the class information by naively concatenating it to the input or to the feature vector at some middle layer [30, 13, 38, 55]. AC-GAN [36] creates an auxiliary classifier to incorporate class information. The projection discriminator GAN [32] takes an inner product between the embedded class vector and the feature vector. A recent work [31], which proposes spectral normalization, shows that high quality image generation on the 1000-class ILSVRC2012 dataset [39] can be achieved using a projection conditional discriminator.
Robustness of (unconditional) GANs against adversarial or random noise has recently been studied in [10, 52]. [52] studies an adversary attacking the output of the discriminator, perturbing the discriminator output with random noise. The proposed architecture of RCGAN is inspired by the closely related work on AmbientGAN in [10]. AmbientGAN is a general framework addressing any corruption on the data itself (not necessarily just the labels). Given corrupted samples with known corruption, AmbientGAN applies that corruption to the output of the generator before feeding it to the discriminator. This has been shown to successfully denoise images in several practical scenarios.
Motivated by the success of AmbientGAN in denoising, we propose RCGAN. An important distinction is that we make specific architectural choices guided by our theoretical analysis, which give a significant gain in practice as shown in Section 6. Under the scenario of interest with noisy labels, we provide sharp analyses for both the population loss and the finite sample loss. Such sharp characterizations do not exist for the more general AmbientGAN scenarios. Further, our RCGAN-U does not require knowledge of the confusion matrix, departing from the AmbientGAN approach.
Training classifiers from noisy labels is a closely related problem. Recently, [34, 20] proposed a theoretically motivated classifier which minimizes a modified loss in the presence of noisy labels and showed improvement over the robust classifiers [29, 45, 46].
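Notation. For a vector $x$, $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$ is the standard $\ell_p$-norm. For a matrix $A$, let $|\!|\!|A|\!|\!|_p = \max_{\|x\|_p = 1} \|Ax\|_p$ denote the operator norm. Then $|\!|\!|A|\!|\!|_\infty = \max_i \sum_j |A_{ij}|$, and $|\!|\!|A|\!|\!|_2 = \sigma_{\max}(A)$, the maximum singular value. $\mathbf{1}$ is the all-ones vector with appropriate dimensions and $\mathbf{I}$ is the identity matrix with appropriate dimensions. $[n] = \{1, \ldots, n\}$. For a vector $x \in \mathbb{R}^n$, $x_i$ ($i \in [n]$) is its $i$-th coordinate.
2 Our first architecture: RCGAN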
Training a conditional GAN with noisy samples results in a biased generator. We propose the Robust Conditional GAN (RCGAN) architecture, which has the following preprocessing, discriminator update, and generator update steps. We assume in this section that the confusion matrix $C$ is known (and the marginal $P_Y$ can easily be inferred), and address the case of unknown $C$ in Section 4.
Preprocessing: We train a classifier $h^*$ to predict the noisy label $\tilde{y}$ given $x$ under a loss $\ell$, trained on $h^* \in \arg\min_{h \in \mathcal{H}} \mathbb{E}_{(x,\tilde{y}) \sim \tilde{P}_{X,\tilde{Y}}}[\ell(h(x), \tilde{y})]$, where $\mathcal{H}$ is a parametric family of classifiers (typically neural networks) and $\tilde{P}_{X,\tilde{Y}}$ is the joint distribution of real $x$ and the corresponding real noisy $\tilde{y}$.
D-step: We train on the following adversarial loss. In the second term below, $x$ is generated according to $G(z; y)$ and the corresponding noisy labels are generated by corrupting $y$ according to the conditional distribution $C_y$, which is the $y$-th row of the confusion matrix (assumed to be known):
$$\max_{D \in \mathcal{F}} \; \mathbb{E}_{(x,\tilde{y}) \sim \tilde{P}_{X,\tilde{Y}}}\big[\phi(D(x,\tilde{y}))\big] + \mathbb{E}_{z \sim N,\, y \sim P_Y,\, \tilde{y}|y \sim C_y}\big[\phi(1 - D(G(z;y), \tilde{y}))\big],$$
where $P_Y$ is the true marginal distribution of the labels, $N$ is the distribution of the latent random vector, and $\mathcal{F}$ is a family of discriminators.
G-step: We train on the following loss with some $\lambda > 0$:
$$\min_{G \in \mathcal{G}} \; \mathbb{E}_{z \sim N,\, y \sim P_Y,\, \tilde{y}|y \sim C_y}\big[\phi(1 - D(G(z;y), \tilde{y})) + \lambda\, \ell(h^*(G(z;y)), \tilde{y})\big], \tag{2}$$
where $\mathcal{G}$ is a family of generators. Auxiliary classifiers have been used to improve the quality of the images and the stability of the training, for example in the auxiliary classifier GAN (AC-GAN) [36], and to improve the quality of clustering in the latent space [33]. We propose an auxiliary classifier $h^*$ that mitigates a permutation error, which we empirically identified in a naive implementation of our idea with no regularizers.
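The label pairing used in the D-step and G-step above can be sketched in NumPy as follows: draw a clean label from the label marginal, generate a sample conditioned on it, then draw the noisy label from the corresponding row of the confusion matrix. This is a minimal sketch under stated assumptions, not the authors' implementation; `corrupt_labels`, `toy_generator`, and `fake_batch` are hypothetical names, and the toy generator is only a placeholder for a trained network.

```python
import numpy as np

def corrupt_labels(y, C, rng):
    """Pass clean labels y through the noisy channel: y_tilde ~ C[y, :]."""
    C = np.asarray(C, dtype=float)
    assert np.allclose(C.sum(axis=1), 1.0), "C must be row-stochastic"
    # Inverse-CDF sampling, vectorized over the batch.
    cdf = np.cumsum(C[y], axis=1)
    u = rng.random(len(y))[:, None]
    # Number of CDF entries <= u is the sampled class index; the clip
    # guards against a final CDF entry that rounds slightly below 1.
    return np.minimum((u >= cdf).sum(axis=1), C.shape[0] - 1)

def toy_generator(z, y, class_means):
    """Hypothetical stand-in for G(z; y): class mean plus scaled latent noise."""
    return class_means[y] + 0.1 * z

def fake_batch(batch_size, C, label_marginal, class_means, rng):
    """Build (x, y, y_tilde) for the generated half of a training batch."""
    y = rng.choice(len(label_marginal), size=batch_size, p=label_marginal)
    z = rng.standard_normal((batch_size, class_means.shape[1]))
    x = toy_generator(z, y, class_means)
    y_tilde = corrupt_labels(y, C, rng)  # the discriminator sees (x, y_tilde)
    return x, y, y_tilde
```

The discriminator would then score the pair `(x, y_tilde)` rather than `(x, y)`, so that generated and real pairs pass through the same channel.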
Permutation regularizer (controlled by $\lambda$). Permutation error occurs if, when asked to produce samples from a target class, the trained generator produces samples predominantly from a single class that is different from the target class. We propose a regularizer based on the classifier $h^*$, which predicts the noisy label $\tilde{y}$. As long as the confusion matrix is diagonally dominant, which is a necessary condition for identifiability, this regularizer encourages the correct permutation of the labels.
Theoretical motivation for RCGAN. When $\lambda = 0$, we get the standard conditional GAN update steps, albeit one which tries to minimize the discriminator loss between the noisy real distribution and the distribution of the generator when the label is passed through the same noisy channel parameterized by $C$. The main idea of RCGAN is to minimize a certain divergence between noisy real data and noisy generated data. For example, the choice of bounded functions and identity map leads to a total variation minimization: the loss minimized in the G-step is the total variation between the two distributions with corrupted labels, up to some scaling and some shift. If we choose and , then we are minimizing the Jensen-Shannon divergence , where denotes the Kullback-Leibler divergence. The following theorem provides approximation guarantees for some common divergence measures over the noisy channel, justifying our proposed practical approach. We refer to Appendix B for a proof.
Theorem 1.
Let $P$ and $Q$ be two distributions on $\mathcal{X} \times [m]$. Let $\tilde{P}$ and $\tilde{Q}$ be the corresponding distributions when samples from $P$ and $Q$ are passed through the noisy channel given by the confusion matrix $C$ (as defined in Section 1). If $C$ is full-rank, we get,
(3)  
(4) 
To interpret this theorem, let $Q$ denote the distribution of the generator. The theorem implies that when the noisy generator distribution becomes close to the noisy real distribution in total variation or in Jensen-Shannon divergence, then the generator distribution must be close to the distribution of real data in the same metric. This justifies the use of the proposed architecture RCGAN. In practice, we minimize the sample divergence of the two distributions, instead of the population divergence as analyzed in the above theorem. However, these standard divergences are known to not generalize in training GANs [3]. To this end, we provide in Section 3 analyses on neural network distances, which are known to generalize, and provide finite sample bounds.
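The total variation part of this interpretation can be checked numerically on a small discrete example. The sketch below assumes the bound takes the form $d_{TV}(\tilde{P}, \tilde{Q}) \le d_{TV}(P, Q) \le |\!|\!|C^{-1}|\!|\!|_\infty \, d_{TV}(\tilde{P}, \tilde{Q})$, which follows from $\tilde{P} = P C$ for joint distributions written as matrices indexed by (data point, label). The finite data space, the random distributions, and the helper names `tv` and `inf_norm` are illustrative assumptions.

```python
import numpy as np

def tv(P, Q):
    """Total variation distance between two joint distributions."""
    return 0.5 * np.abs(P - Q).sum()

def inf_norm(A):
    """Induced infinity norm: maximum absolute row sum."""
    return np.abs(A).sum(axis=1).max()

rng = np.random.default_rng(0)
n_x, m = 6, 3  # toy sizes: 6 data points, 3 classes

# Random joint distributions over (x, y), stored as n_x-by-m matrices.
P = rng.random((n_x, m)); P /= P.sum()
Q = rng.random((n_x, m)); Q /= Q.sum()

# A full-rank, row-stochastic confusion matrix (uniform flipping, accuracy 0.7).
alpha = 0.7
C = np.full((m, m), (1 - alpha) / (m - 1))
np.fill_diagonal(C, alpha)

# Corrupting the label multiplies the joint distribution by C on the right.
P_t, Q_t = P @ C, Q @ C

lower = tv(P_t, Q_t)
mid = tv(P, Q)
upper = inf_norm(np.linalg.inv(C)) * tv(P_t, Q_t)
print(f"{lower:.4f} <= {mid:.4f} <= {upper:.4f}")
```

As $C$ approaches a rank-deficient matrix the factor `inf_norm(np.linalg.inv(C))` grows without bound, which is the tradeoff the theorem describes.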
3 Theoretical Analysis of RCGAN
It was shown in [3] that the standard GAN losses of Jensen-Shannon divergence and Wasserstein distance both fail to generalize with a finite number of samples. On the other hand, more recent advances in analyzing GANs in [56, 6, 4] show promising generalization bounds by either assuming Lipschitz conditions on the generator model or by restricting the analysis to certain classes of distributions. Under those assumptions, where JS divergence generalizes, Theorem 1 justifies the use of the proposed RCGAN. However, those require the distribution to be Gaussian, a mixture of Gaussians, or the output of a neural network generator, for example in [4].
In this section, we provide analyses of RCGAN on a distance that generalizes without any assumptions on the distribution of the real data as proven in [3]: neural network distance. Formally, consider a class of real-valued functions $\mathcal{F}$ and a function $\phi$ which is either convex or concave. The neural network distance is defined as
(5) 
where is the distribution of the real data, is that of the generated data, and is the constant correction term to ensure that . We further assume that includes three constant functions , , and , in order to ensure that and , as shown in Lemma 1 in the Appendix.
The proposed RCGAN with $\lambda = 0$ approximately minimizes the neural network distance between the two corrupted distributions. In practice, $\mathcal{F}$ is a parametric family of functions from a specific neural network architecture that the designer has chosen. In theory, we aim to identify how the choice of the class $\mathcal{F}$ provides the desired approximation bounds similar to those in Theorem 1, but for neural network distances. This analysis leads to the choice of the projection discriminator [32] to be used in RCGAN (Remark 2). On the other hand, we show in Theorem 3 that an inappropriate choice of the discriminator architecture can cause non-approximation. Further, we provide the sample complexity of the approximation bounds in Theorem 4.
We refer to the unregularized version with $\lambda = 0$ as simply RCGAN. In this section, we focus on a class of loss functions called Integral Probability Metrics (IPM) where $\phi(a) = a$ [44]. This is a popular choice of loss in GANs in practice [47, 2, 8] and in analyses [4]. We write the induced neural network distance as $d_{\mathcal{F}}(P, Q)$, dropping the $\phi$ in the notation.
3.1 Approximation bounds for neural network distances
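We define an operation $\circ$ over a matrix $T \in \mathbb{R}^{m \times m}$ and a class of functions $\mathcal{F}$ on $\mathcal{X} \times [m]$ as
(6) 
This makes it convenient to represent the neural network distance corrupted by noise with a confusion matrix $C$, where $C_{ij}$ is the probability that a label $i$ is corrupted as $j$. Formally, it follows from (5) and (6) that $d_{\mathcal{F}}(\tilde{P}, \tilde{Q}) = d_{C \circ \mathcal{F}}(P, Q)$. We refer to Appendix E for a proof. For $d_{\mathcal{F}}(\tilde{P}, \tilde{Q})$ to be a good approximation of $d_{\mathcal{F}}(P, Q)$, we show that the following condition is sufficient.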
Assumption 1.
We assume that the class of discriminator functions can be decomposed into three parts such that is any constant and

satisfies the inclusion condition:
(7) for all ; and

satisfies the label invariance condition: there exists a class of sets of functions, parametrized by , such that
(8)
We discuss the necessity and practical implications of this assumption in Section 3.2, and give examples satisfying these assumptions in Remarks 2 and 3. Notice that a trivial class with a single constant zero function satisfies both the inclusion and label invariance conditions. For example, we can choose and also choose to set either or , in which case only needs to satisfy either one of the conditions in Assumption 1. The flexibility that we gain by allowing the set addition is critical in applying these conditions to practical discriminators, especially in proving Remark 2. Note that in the inclusion condition in Eq. 7, we require the condition to hold for the whole max-norm bounded set: . The reason a weaker condition over all row-stochastic matrices, , does not suffice is that in order to prove the upper bound in Eq. 9, we need to apply the invariance condition to . This matrix is not row-stochastic, but still max-norm bounded.
We first show that Assumption 1 is sufficient for approximability of the neural network distance from corrupted samples. For two distributions $P$ and $Q$ on $\mathcal{X} \times [m]$, let $\tilde{P}$ and $\tilde{Q}$ be the corresponding corrupted distributions respectively, where the label is passed through the noisy channel defined by the confusion matrix $C$, i.e. $\tilde{P}(x, \tilde{y}) = \sum_{y \in [m]} P(x, y)\, C_{y\tilde{y}}$.
Theorem 2.
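If a class of functions $\mathcal{F}$ satisfies Assumption 1, then
$$d_{\mathcal{F}}(\tilde{P}, \tilde{Q}) \;\le\; d_{\mathcal{F}}(P, Q) \;\le\; |\!|\!|C^{-1}|\!|\!|_\infty \, d_{\mathcal{F}}(\tilde{P}, \tilde{Q}), \tag{9}$$
where we follow the convention that $|\!|\!|C^{-1}|\!|\!|_\infty = \infty$ if $C$ is not full rank.
We refer to Appendix E for a proof. This gives a sharp characterization of how two distances are related: the one we can minimize in training RCGAN (i.e. $d_{\mathcal{F}}(\tilde{P}, \tilde{Q})$) and the true measure of closeness (i.e. $d_{\mathcal{F}}(P, Q)$). Although the latter cannot be directly evaluated or minimized, RCGAN is approximately minimizing the true neural network distance as desired.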
The lower bound proves a special case of the data-processing inequality. Two random variables from $P$ and $Q$ get closer in neural network distance when passed through a stochastic transformation. The upper bound puts a limit on how much closer $\tilde{P}$ and $\tilde{Q}$ can get, depending on the noise level. This fundamental tradeoff is captured by $|\!|\!|C^{-1}|\!|\!|_\infty$. In the noiseless case where $C$ is the identity matrix, we have $|\!|\!|C^{-1}|\!|\!|_\infty = 1$ and we recover the trivial fact that the two distances are equal. At the other extreme, if $C$ is rank deficient, we use the convention that $|\!|\!|C^{-1}|\!|\!|_\infty = \infty$ and the two distances can be arbitrarily different. The approximation factor of $|\!|\!|C^{-1}|\!|\!|_\infty$ captures how much the space can shrink by the noise $C$. This coincides with Theorem 1, where a similar tradeoff was identified for the TV distance. The next remark shows that these bounds cannot be tightened for general $P$, $Q$, and $\mathcal{F}$. A proof is provided in Appendix D.
Remark 1.
For any fullrank confusion matrix , there exist pairs of distributions and , and a function class satisfying Assumption 1, such that

, and

.
Theorem 2 shows that RCGAN can learn the true conditional distribution, justifying its use, and that the performance of RCGAN is determined by how noisy the samples are via $|\!|\!|C^{-1}|\!|\!|_\infty$. There are still two loose ends. First, does a practical implementation of the RCGAN architecture satisfy the inclusion and/or label invariance assumptions? Secondly, in practice we cannot minimize $d_{\mathcal{F}}(\tilde{P}, \tilde{Q})$ as we only have a finite number of samples. How much do we lose in this finite sample regime? We give precise answers to each question in the following two sections.
3.2 Inclusion and label invariance assumptions
Several classes of functions satisfy Assumption 1 (cf. Remark 3). For RCGAN, we propose a popular state-of-the-art discriminator for conditional GANs known as the projection discriminator [32], parametrized by , , and :
(10) 
where and are vector-valued parametric functions for some integers , and . The first term satisfies the inclusion condition, as any operation with can be absorbed into . The second term is label invariant as it does not depend on the label $y$. This is made precise in the following remark, whose proof is provided in Appendix F. Together with this remark, the approximability result in Theorem 2 justifies the use of projection discriminators in RCGAN, which we use in all our experiments.
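The absorption argument for the first term can be illustrated numerically. Eq. (10) is not reproduced in this text, so the sketch below assumes the form described in [32]: an inner product between the one-hot label and a linearly embedded feature vector, plus a term that depends on the data point only. It also assumes that the operation in Eq. (6) mixes the function values across labels with weights from a row of a matrix `T`. All names (`V`, `v`, `psi`, `T`) are placeholders, and the norm constraint on the embedding matrix is not checked here.

```python
import numpy as np

def projection_discriminator(features, y_onehot, V, v):
    """Assumed form: vec(y)^T V psi(x) + v^T psi(x), with psi(x) = `features`."""
    label_term = np.einsum("bm,md,bd->b", y_onehot, V, features)
    label_free_term = features @ v  # does not depend on the label
    return label_term + label_free_term

rng = np.random.default_rng(0)
m, d, batch = 4, 5, 8
V = rng.standard_normal((m, d))             # class embedding matrix
v = rng.standard_normal(d)                  # label-independent head
psi = rng.standard_normal((batch, d))       # stand-in for features psi(x; theta)
T = rng.standard_normal((m, m))
T /= np.abs(T).sum(axis=1, keepdims=True)   # every row has unit l1 norm

# first_term[b, y] = vec(y)^T V psi(x_b) for every candidate label y.
first_term = psi @ V.T
# Mixing over labels: mixed[b, y] = sum_j T[y, j] * first_term[b, j].
mixed = first_term @ T.T
# The same values are produced by a projection term with V replaced by T @ V.
absorbed = psi @ (T @ V).T
print(np.allclose(mixed, absorbed))
```

The label-independent term is untouched by a row-stochastic mixing of labels, since the weights in each row sum to one.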
Remark 2.
Other choices of and are also possible. For example, or are also sufficient. We find the proposed choice of easy to implement, as a column-wise norm normalization via projected gradient descent. We describe implementation details in Appendix I.
Next, we ask whether Assumption 1 is also necessary. We show that for all pairs of distributions satisfying the following technical conditions, and every confusion matrix , there exists a class where the approximation bounds in (9) fail.
Assumption 2.
We consider a pair of distributions and and a confusion matrix satisfying the following conditions:

The random variable conditioned on
is a continuous random variable with density functions
and , respectively.
A pair violating the above assumptions either has that is a mixture of continuous and discrete distribution, or all ’s are aligned with the right eigenvectors of .
Theorem 3.
3.3 Finite sample analysis
In practice, we do not have access to the probability distributions $\tilde{P}$ and $\tilde{Q}$. Instead, we observe a set of samples of a finite size $n$ from each of them. In training a GAN, we minimize the empirical neural network distance, $d_{\mathcal{F}}(\tilde{P}_n, \tilde{Q}_n)$, where $\tilde{P}_n$ and $\tilde{Q}_n$ denote the empirical distributions of the $n$ samples. Inspired by the recent generalization results in [3], we show that this empirical distance minimization leads to a small $d_{\mathcal{F}}(P, Q)$ up to an additive error that vanishes with an increasing sample size. As shown in [3], Lipschitz and bounded function classes are critical in achieving sample efficiency for GANs. We follow the same approach over a similar function class. Let
(13) 
be a class of bounded functions with parameter . We say that is Lipschitz in if
(14) 
Theorem 4.
For any class of bounded Lipschitz functions satisfying Assumption 1, there exists a universal constant such that
(15) 
with probability at least for any and large enough,
4 Our second architecture: RCGANU
In many real-world scenarios the confusion matrix $C$ is unknown. We propose the RCGAN-Unknown (RCGAN-U) algorithm, which jointly estimates the real distribution and the noise model $C$. The preprocessing and D-steps of RCGAN-U are the same as those of RCGAN, using the current guess of the confusion matrix. As the G-step in (2) is not differentiable in $C$, we use the following reparameterized estimator of the loss, motivated by a similar technique in training classifiers from noisy labels:
where is the set of all transition matrices and .
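The idea behind a reparameterized estimator can be sketched as follows. Sampling a noisy label from a row of the current confusion-matrix estimate is not differentiable in that matrix, whereas averaging the per-label loss with the row as weights is. The estimator's equation is not reproduced in this text, so the sketch below is an assumption about its general shape rather than the authors' formula; in particular, the row-wise softmax parameterization of the transition matrix and the names `row_softmax` and `expected_fake_loss` are hypothetical.

```python
import numpy as np

def row_softmax(logits):
    """Map unconstrained logits to a row-stochastic transition matrix."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def expected_fake_loss(per_label_loss, y, M):
    """Average over y_tilde ~ M[y] in closed form instead of sampling.

    per_label_loss[b, j] holds the generator-side loss of sample b when it
    is paired with noisy label j; y[b] is the clean label the sample was
    conditioned on; M is the current transition-matrix estimate.
    """
    return (M[y] * per_label_loss).sum(axis=1).mean()

rng = np.random.default_rng(0)
m, batch = 3, 5
logits = rng.standard_normal((m, m)) + 3.0 * np.eye(m)  # diagonally dominant start
M = row_softmax(logits)
per_label_loss = rng.random((batch, m))
y = rng.integers(0, m, size=batch)
loss = expected_fake_loss(per_label_loss, y, M)
```

Because `loss` is a smooth function of `logits`, it can be minimized jointly with the generator parameters by gradient descent in an automatic-differentiation framework.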
5 Experiments
Implementation details are explained in Appendix I. We consider one-coin based models, which are parameterized by their label accuracy probability $\alpha$. In this model a sample with true label $y$ is flipped uniformly at random to a label $\tilde{y}$ in $[m] \setminus \{y\}$ with probability $1 - \alpha$. The entries of its confusion matrix $C$ will then be $C_{ii} = \alpha$ and $C_{ij} = (1 - \alpha)/(m - 1)$ for $i \neq j$, where $m$ is the number of classes. We call this model the uniform flipping model. We train the proposed GANs on the MNIST and CIFAR datasets [25, 24] and compare them to two baselines. Code to reproduce our experiments is available at https://github.com/POLane16/RobustConditionalGAN.
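A short sketch of the uniform flipping model, assuming the entries are the label accuracy on the diagonal and the remaining mass spread evenly over the other classes. The function names are hypothetical. The second helper evaluates the induced infinity norm of the inverse, the kind of noise-dependent factor discussed in Section 3, to show how it grows as the channel becomes noisier.

```python
import numpy as np

def uniform_flip_matrix(m, alpha):
    """Confusion matrix with accuracy alpha and uniform off-diagonal mass."""
    C = np.full((m, m), (1.0 - alpha) / (m - 1))
    np.fill_diagonal(C, alpha)
    return C

def inverse_inf_norm(C):
    """Maximum absolute row sum of the inverse of C."""
    return np.abs(np.linalg.inv(C)).sum(axis=1).max()

m = 10
for alpha in (0.9, 0.7, 0.5, 0.3):
    C = uniform_flip_matrix(m, alpha)
    print(alpha, round(inverse_inf_norm(C), 3))
```

When the accuracy equals $1/m$ every entry of the matrix is $1/m$, the matrix has rank one, and the clean distribution is no longer identifiable from noisy samples.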
Baselines. The first is the biased GAN, which is a conditional GAN applied directly to the noisy data. The loss is hence biased, and the true conditional distribution is not the optimal solution of this biased loss. The next natural baseline uses a de-biased classifier as the discriminator, motivated by the approach of [34] on learning classifiers from noisy labels. The main insight is to modify the loss function according to the confusion matrix $C$, such that in expectation the loss matches that of the clean data. We refer to this approach as the unbiased GAN. Concretely, when training the discriminator, we propose the following (modified) de-biased loss:
(16) 
This is unbiased, as the first term is equivalent to , which is the standard GAN loss with clean samples. However, such de-biasing is sensitive to the condition number of $C$, and can become numerically unstable for noisy channels as $C^{-1}$ has large entries [20]. For both datasets, we use linear classifiers for the permutation regularizer of the RCGAN-U architecture.
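The de-biasing idea can be checked on a toy example. The sketch assumes the standard construction from the noisy-label classification literature: the vector of per-label losses is multiplied by the inverse confusion matrix, so that its expectation over the noisy label given the clean label equals the clean loss. Eq. (16) is not reproduced in this text, so this illustrates the principle only, with hypothetical variable names.

```python
import numpy as np

def uniform_flip_matrix(m, alpha):
    """Uniform flipping confusion matrix with label accuracy alpha."""
    C = np.full((m, m), (1.0 - alpha) / (m - 1))
    np.fill_diagonal(C, alpha)
    return C

rng = np.random.default_rng(0)
m = 4
clean_loss = rng.random(m)  # loss of one sample under each candidate label

for alpha in (0.9, 0.6, 0.3):
    C = uniform_flip_matrix(m, alpha)
    # De-biased loss vector: entry j is used when the observed noisy label is j.
    debiased = np.linalg.solve(C, clean_loss)
    # Expectation over y_tilde ~ C[y, :] for every clean label y.
    expected = C @ debiased
    print(alpha, np.allclose(expected, clean_loss), np.abs(debiased).max())
```

The expectation matches the clean loss for every accuracy level, but the magnitude of the de-biased entries grows as the accuracy approaches $1/m$, which is one way to see the numerical instability mentioned above.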
5.1 MNIST
We train five architectures on the MNIST dataset corrupted by the uniform flipping noise: RCGAN+y, RCGAN, RCGAN-U, unbiased GAN, and biased GAN. The RCGAN+y architecture is the same as RCGAN, except that the input to the first layer of its discriminator is concatenated with a one-hot representation of the label. We discuss our techniques to overcome the challenges involved in training RCGAN+y in Appendix I.
Conditional generators can be used to generate samples from a particular class , in the classes it learned. We then can use a pre-trained classifier to compare to the true class of the sample, (as perceived by the classifier ). We compare the generator label accuracy defined as , in Figure 2, left panel. We generated k labels chosen uniformly at random and corresponding conditional samples from the generators, and calculated the generator label accuracy using a CNN classifier pre-trained on the clean MNIST data to an accuracy of 99.2%. The proposed RCGAN significantly improves upon the competing baselines, and achieves almost perfect label accuracy until a high noise of . RCGAN+y further improves upon RCGAN and attains very high accuracy even at . The high accuracy of RCGAN-U suggests that robust training is possible without prior knowledge of the confusion matrix $C$. As expected, biased GAN has an accuracy of approximately .
An immediate application of robust GANs is recovering the true labels of the noisy training data, which is an important and challenging problem in crowdsourcing. We propose a new meta-algorithm, which we call cGAN-label-recovery, which uses any conditional generator trained on the noisy samples to estimate the true label, as $\hat{y}$, of a sample $x$ using the following optimization.
(17) 
In the right panel of Figure 2 we compare the label recovery accuracy of the meta-algorithm using the five conditional GANs, on randomly chosen noisy training samples. This is also compared to a state-of-the-art method [34] for label recovery, which proposed minimizing an unbiased loss function given the noisy labels and the confusion matrix. This unbiased classifier was shown to outperform the robust classifiers [29, 45, 46] and can be used to predict the true label of the training examples. In Figure 4 of Appendix J, we show example images from all the generators.
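One way such a label recovery step could look is sketched below. Eq. (17) is not reproduced in this text, so the sketch assumes a reconstruction-error criterion: for each candidate label, find the latent vector whose generated sample is closest to the observed one, and return the label with the smallest error. The linear toy generator (class mean plus a linear map of the latent vector) is purely illustrative and lets the inner minimization be solved exactly by least squares; with a neural generator the inner problem would typically be approached with gradient descent over the latent vector. `recover_label`, `class_means`, and `B` are hypothetical names.

```python
import numpy as np

def recover_label(x, class_means, B):
    """Return argmin over y of min over z of ||class_means[y] + B z - x||^2."""
    errors = []
    for mu in class_means:
        # Inner minimization over the latent vector z, solved by least squares.
        z_hat, *_ = np.linalg.lstsq(B, x - mu, rcond=None)
        errors.append(np.sum((mu + B @ z_hat - x) ** 2))
    return int(np.argmin(errors))

rng = np.random.default_rng(0)
d, k, m = 8, 2, 3                            # data dim, latent dim, classes
class_means = rng.standard_normal((m, d))    # toy stand-in for a trained G
B = rng.standard_normal((d, k))

true_label = 2
x = class_means[true_label] + B @ rng.standard_normal(k)
print(recover_label(x, class_means, B))
```

The noisy label attached to the sample is never used in this sketch; only the generator and the sample itself enter the estimate.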
5.2 CIFAR-10
In Figure 3, we show the inception score [40] and the label accuracy of the conditional generator for the four approaches: our proposed RCGAN and RCGAN-U, against the baselines Unbiased (Section 5) and Biased (Section 1) GANs trained using CIFAR images [24], while varying the label accuracy of the real data under uniform flipping. In RCGAN-U, even with the regularizer, the learned confusion matrix was a permuted version of the true $C$, possibly because a linear classifier might be too simple to classify CIFAR images. To combat this, we initialized the confusion matrix to be diagonally dominant (Appendix I).
In the left panel of Figure 3, our RCGAN and RCGAN-U consistently achieve higher inception scores than the other two approaches. The Unbiased GAN is highly unstable and hence produces garbage images for large noise (Fig. 5), possibly due to numerical instability of $C^{-1}$, as noted in [20]. This confirms that robust GANs not only produce images from the correct class, but also produce better quality images. In the right panel of Figure 3, we report the generator label accuracy (Section 5.1) on k samples generated by each GAN. We classify the generator images using a ResNet model (https://github.com/wenxinxu/resnetintensorflow) trained to an accuracy of on the noiseless CIFAR dataset. Biased GAN has significantly lower label accuracy whereas the Unbiased GAN has a low inception score. In Figure 5 in Appendix J, we show example images from the three generators for the different flipping probabilities. We believe that the gain from using the proposed robust GANs will be larger when we train to higher accuracy with larger networks and extensive hyperparameter tuning, with the latest innovations in GAN architectures, for example [54, 28, 17, 19, 41].
6 Numerical comparisons with AmbientGAN [10]
In Table 1, we report the generator label accuracy (as defined in Section 5.1) of RCGAN (which uses the proposed projection discriminator) and AmbientGAN (which uses the DCGAN with no projection discriminator) for multiple noise levels ($1 - \alpha$). One of the main reasons for the performance drop of AmbientGAN is that without the projection discriminator, training of AmbientGAN is sensitive to how the minibatches are chosen. For example, if the distribution of the labels in the minibatch of the real data is different from that of the minibatch of the generated data, then the performance of (conditional) AmbientGAN drops significantly. This is critical as we have noisy labels, and matching the labels in the minibatch is challenging. Our proposed RCGAN provides an architecture and training methods for applying AmbientGAN to noisy labeled data, to overcome these challenges. When a projection discriminator is used, as in all our RCGAN and RCGAN-U implementations, the performance is not sensitive to how the minibatches are sampled. When a discriminator that is not necessarily a projection discriminator is used, as in our RCGAN+y architecture, we propose a novel scheduling of the training, which avoids local minima resulting from mismatched minibatches (explained in Appendix I). The results are averaged over 10 instances.
Noise level ($1 - \alpha$)  |  0.2  |  0.3  |  0.5
RCGAN  |  0.994  |  0.994  |  0.994
AmbientGAN  |  0.940  |  0.902  |  0.857
7 Conclusion
Standard conditional GANs can be sensitive to noise in the labels of the training data. We propose two new architectures to make them robust, one requiring knowledge of the distribution of the noise and another which does not, and demonstrate the robustness on the benchmark datasets CIFAR-10 and MNIST. We further showcase how the learned generator can be used to recover the corrupted labels in the training data, which can potentially be used in practical applications. The proposed architecture combines the noise adding idea of AmbientGAN [10], the projection discriminator of [32], and regularizers similar to those in InfoGAN [11]. Inspired by AmbientGAN [10], the main idea is to pair the generator output image with a label that is passed through a noisy channel, before feeding it to the discriminator. We justify this idea of noise adding by identifying a certain class of discriminators that have good generalization properties. In particular, we prove that the projection discriminator, introduced in [32], has a good generalization property. We showcase that the proposed architecture, when trained with a regularizer, has superior robustness on benchmark datasets.
One weakness of our theoretical result in Theorem 4 is that depending on the choice of $\mathcal{F}$ (i.e. the representation power of the parametric class $\mathcal{F}$), closeness in the neural network distance does not always imply closeness of the distributions. It is generally a challenging problem to address generalization for a specific function class and a pair of distributions and . However, a recent breakthrough in generalization properties of GANs in [4] makes the connection between and precise, under some assumptions on the and . This leads to the following research question: under which class of distributions and does the neural network distance of the proposed conditional GAN with projection discriminator generalize? The emphasis is on studying the class of functions satisfying Assumption 1 and identifying the corresponding family of distributions that generalize under this function class.
Acknowledgement
This work is supported by NSF awards CNS1527754, CCF1553452, CCF1705007, RI1815535 and Google Faculty Research Award. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI1053575. Specifically, it used the Bridges system, which is supported by NSF award number ACI1445606, at the Pittsburgh Supercomputing Center (PSC). This work is partially supported by the generous research credits on AWS cloud computing resources from Amazon.
References
 [1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Representation learning and adversarial generation of 3d point clouds. arXiv preprint arXiv:1707.02392, 2017.
 [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 [3] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (gans). arXiv preprint arXiv:1703.00573, 2017.
 [4] Yu Bai, Tengyu Ma, and Andrej Risteski. Approximability of discriminators implies diversity in GANs. arXiv preprint arXiv:1806.10586, 2018.
 [5] David Berthelot, Tom Schumm, and Luke Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
 [6] G Biau, B Cadre, M Sangnier, and U Tanielian. Some theoretical properties of GANs. arXiv preprint arXiv:1803.07819, 2018.
 [7] Battista Biggio, Blaine Nelson, and Pavel Laskov. Support vector machines under adversarial label noise. In Asian Conference on Machine Learning, pages 97–112, 2011.
 [8] Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.
 [9] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.
 [10] Ashish Bora, Eric Price, and Alexandros G Dimakis. Ambientgan: Generative models from lossy measurements. In International Conference on Learning Representations (ICLR), 2018.
 [11] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
 [12] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied statistics, pages 20–28, 1979.
 [13] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
 [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [15] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
 [16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
 [17] Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. arXiv preprint arXiv:1807.00734, 2018.
 [18] David R Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In Advances in neural information processing systems, pages 1953–1961, 2011.
 [19] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
 [20] Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577, 2017.
 [21] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
 [22] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [23] Murat Kocaoglu, Christopher Snyder, Alexandros G Dimakis, and Sriram Vishwanath. Causalgan: Learning causal implicit generative models with adversarial training. arXiv preprint arXiv:1709.02023, 2017.
 [24] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [25] Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 [26] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint, 2016.
 [27] Xiaodan Liang, Lisa Lee, Wei Dai, and Eric P Xing. Dual motion gan for future-flow embedded video prediction. arXiv preprint, 2017.
 [28] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. PacGAN: The power of two samples in generative adversarial networks. arXiv preprint arXiv:1712.04086, 2017.
 [29] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S Yu. Building text classifiers using positive and unlabeled examples. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pages 179–186. IEEE, 2003.
 [30] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [31] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
 [32] Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.
 [33] Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and Sreeram Kannan. Clustergan: Latent space clustering in generative adversarial networks. arXiv preprint arXiv:1809.03627, 2018.
 [34] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013.
 [35] Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016.
 [36] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585, 2016.
 [37] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [38] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
 [39] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [40] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
 [41] Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. Solving approximate Wasserstein GANs to stationarity. arXiv preprint arXiv:1802.08249, 2018.
 [42] Rajat Sen, Karthikeyan Shanmugam, Himanshu Asnani, Arman Rahimzamani, and Sreeram Kannan. Mimic and classify: A meta-algorithm for conditional independence testing. arXiv preprint arXiv:1806.09708, 2018.
 [43] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb. Learning from simulated and unsupervised images through adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, page 6, 2017.
 [44] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. On integral probability metrics, divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.
 [45] Guillaume Stempfel and Liva Ralaivola. Learning kernel perceptrons on noisy data using random projections. In International Conference on Algorithmic Learning Theory, pages 328–342. Springer, 2007.
 [46] Guillaume Stempfel and Liva Ralaivola. Learning svms from sloppily labeled data. In International Conference on Artificial Neural Networks, pages 884–893. Springer, 2009.
 [47] Dougal J Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean discrepancy. arXiv preprint arXiv:1611.04488, 2016.
 [48] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
 [49] David Van Veen, Ajil Jalal, Eric Price, Sriram Vishwanath, and Alexandros G Dimakis. Compressed sensing with deep image prior and learned regularization. arXiv preprint arXiv:1806.06438, 2018.
 [50] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, pages 613–621, 2016.
 [51] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
 [52] Zhi Xu, Chengtao Li, and Stefanie Jegelka. Robust gans against dishonest adversaries. arXiv preprint arXiv:1802.09700, 2018.
 [53] Raymond Yeh, Chen Chen, Teck Yian Lim, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with perceptual and contextual losses. arXiv preprint arXiv:1607.07539, 2016.
 [54] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
 [55] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE Int. Conf. Comput. Vision (ICCV), pages 5907–5915, 2017.
 [56] Pengchuan Zhang, Qiang Liu, Dengyong Zhou, Tao Xu, and Xiaodong He. On the discrimination-generalization tradeoff in GANs. arXiv preprint arXiv:1711.02771, 2017.
 [57] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
 [58] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
Appendix
Appendix A Notations and Lemmas
A.1 Additional Notation
Here we define some additional notation required for the proofs. If is a function of two variables of , where , then is the vector . If is a probability distribution of , then is the conditional distribution of given .
For a matrix , let . Then , and , the maximum singular value. is the all-ones vector with appropriate dimensions and is the identity matrix with appropriate dimensions. . For a vector , () is its th coordinate.
For the sake of the proof, we will assume that is a class of vector functions of the form . In terms of the notation in the main material original is here. For a class of vector-valued functions . Therefore we redefine the operation between a matrix and as,
If is a probability distribution of , then is the conditional discrete distribution of given , is the marginal density of , and
(18)  
(19) 
A.2 Supporting Lemmas
Lemma 1 (Characterization of neural network distance).
for all . And if is a convex or concave function, then the neural network distance is when the distributions are the same, i.e. .
Proof.
For concave we define . By definition, is a feasible solution to the optimization problem in (5), and thus .
The inequality in the second line follows from Jensen’s inequality for concave .
For convex we define . By definition, is a feasible solution to the optimization problem in (5), and thus .
The last inequality follows from Jensen’s inequality for convex ∎
Lemma 1 ensures that all the multiplicative lower and upper bounds in Theorem 3 and its corollaries imply recoverability.
Lemma 2.
If is a distribution on and is the distribution of a sample of when passed through the noisy channel given by the confusion matrix (as defined in Section 1), then
(20) 
where .
Proof.
∎
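To make the noisy channel of Lemma 2 concrete, the following sketch pushes two label marginals through a row-stochastic confusion matrix and compares their total variation distances before and after. It is a numerical illustration only, not a restatement of Eq. (20): the matrix, the two marginals, and the function names are made-up values chosen for this example, and the code checks only the standard fact that total variation cannot increase under a channel.

```python
def push_through_channel(p, confusion):
    """Noisy label marginal: p_noisy[j] = sum_i p[i] * confusion[i][j],
    where row i of the confusion matrix is the noisy-label distribution
    given clean label i."""
    num_classes = len(confusion)
    return [
        sum(p[i] * confusion[i][j] for i in range(num_classes))
        for j in range(num_classes)
    ]


def total_variation(p, q):
    """Total variation distance between two discrete distributions,
    computed as half of the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Illustrative (assumed) confusion matrix: each label is kept with
# probability 0.8 and flipped to each other class with probability 0.1.
confusion = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]

# Two illustrative clean label marginals.
p_clean = [0.5, 0.3, 0.2]
q_clean = [0.2, 0.3, 0.5]

p_noisy = push_through_channel(p_clean, confusion)
q_noisy = push_through_channel(q_clean, confusion)

tv_clean = total_variation(p_clean, q_clean)  # about 0.30
tv_noisy = total_variation(p_noisy, q_noisy)  # about 0.21
```

In this example the noisy distance is smaller than the clean one, which is why a lower bound on the clean distance in terms of the noisy one needs an additional factor depending on the confusion matrix.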
Appendix B Proof of Theorem 1
We first prove the approximation bounds for the total variation distance in Eq. (3), and then use them to prove similar bounds for the Jensen-Shannon divergence in Eq. (4). Recall that the total variation distance between and can be written in several ways:
where we used the notation of a row vector . The lower bound on follows from
where follows from the fact that , follows from the fact that , and follows from . The upper bound follows from similar arguments:
where the last equality uses the fact that