1 Introduction
Learning a probability model from data samples is a fundamental task in unsupervised learning. The recently developed generative adversarial network (GAN) [1] leverages the power of deep neural networks to successfully address this task across various domains [2]. In contrast to traditional parameter-fitting methods such as maximum likelihood estimation, the GAN approach views the problem as a game between a generator, whose goal is to generate fake samples close to the real training samples, and a discriminator, whose goal is to distinguish between the real and fake samples. The generator creates the fake samples by mapping from a random noise input. The following minimax problem is the original GAN problem, also called vanilla GAN, introduced in [1]:
$$\min_{G\in\mathcal{G}}\; \max_{D\in\mathcal{F}}\;\; \mathbb{E}\big[\log D(\mathbf{X})\big] + \mathbb{E}\big[\log\big(1-D(G(\mathbf{Z}))\big)\big]. \qquad (1)$$
Here $\mathbf{Z}$ denotes the generator's noise input, $\mathbf{X}$ represents the random vector for the real data distributed as $P_{\mathbf{X}}$, and $\mathcal{G}$ and $\mathcal{F}$ respectively represent the generator and discriminator function sets. Implementing this minimax game over deep neural network classes $\mathcal{G}$ and $\mathcal{F}$ has led to state-of-the-art generative models for many different tasks. To shed light on the probabilistic meaning of vanilla GAN, [1] shows that given an unconstrained discriminator, i.e. if $\mathcal{F}$ contains all possible functions, the minimax problem (1) reduces to
$$\min_{G\in\mathcal{G}}\; \mathrm{JSD}\big(P_{\mathbf{X}}, P_{G(\mathbf{Z})}\big), \qquad (2)$$
where $\mathrm{JSD}$ denotes the Jensen-Shannon (JS) divergence. The optimization problem (2) can be interpreted as finding the generative model closest to the data distribution (Figure 1a), where distance is measured using the JS divergence. Various GAN formulations were later proposed by changing the divergence measure in (2): f-GAN [3] generalizes vanilla GAN by minimizing a general f-divergence; Wasserstein GAN (WGAN) [4] considers the first-order Wasserstein (earth-mover's) distance; MMD-GAN [5, 6, 7] considers the maximum mean discrepancy; energy-based GAN [8] minimizes the total variation distance, as discussed in [4]; quadratic GAN [9] finds the distribution minimizing the second-order Wasserstein distance.
However, GANs trained in practice differ from this minimum-divergence formulation, since the discriminator is not optimized over an unconstrained set; it is constrained to smaller classes such as neural nets. As shown in [10], constraining the discriminator is in fact necessary to guarantee good generalization properties for GAN's learned model. How, then, does the minimum-divergence interpretation (2) change as we constrain the discriminator set $\mathcal{F}$? A standard approach used in [10, 11] is to view the maximum discriminator objective as an $\mathcal{F}$-based distance between distributions. For an unconstrained $\mathcal{F}$, the $\mathcal{F}$-based distance reduces to the original divergence measure, e.g. the JS divergence in vanilla GAN.
While $\mathcal{F}$-based distances have been shown to be useful for analyzing GAN's generalization properties [10], their connection to the original divergence measure remains unclear for a constrained $\mathcal{F}$. What, then, is the interpretation of the GAN minimax game with a constrained discriminator? In this work, we address this question by interpreting the dual problem to the discriminator optimization. To analyze the dual problem, we develop a convex duality framework for general divergence minimization problems. We apply the duality framework to the f-divergence and optimal transport cost families, providing an interpretation for f-GAN, including vanilla GAN minimizing the JS divergence, and Wasserstein GAN.
Specifically, we generalize the interpretation (2), derived for an unconstrained discriminator, to any linear space discriminator set $\mathcal{F}$. For this class of discriminator sets, we interpret vanilla GAN as the following JS divergence minimization between two sets of probability distributions: the set of generative models $\{P_{G(\mathbf{Z})}: G\in\mathcal{G}\}$ and the set $\mathcal{P}_{\mathcal{F}}$ of discriminator moment-matching distributions (Figure 1b),
$$\min_{G\in\mathcal{G}}\; \min_{Q\in\mathcal{P}_{\mathcal{F}}}\; \mathrm{JSD}\big(Q, P_{G(\mathbf{Z})}\big). \qquad (3)$$
Here $\mathcal{P}_{\mathcal{F}}$ contains every distribution $Q$ satisfying the moment-matching constraint $\mathbb{E}_{Q}[D(\mathbf{X})] = \mathbb{E}_{P_{\mathbf{X}}}[D(\mathbf{X})]$ for all discriminators $D\in\mathcal{F}$. More generally, we show that a similar interpretation applies to GANs trained over any convex discriminator set $\mathcal{F}$. We further discuss the application of our duality framework to neural net discriminators with a bounded Lipschitz constant. While a set of neural network functions is not necessarily convex, we prove that any convex combination of Lipschitz-bounded neural nets can be approximated by uniformly combining boundedly-many neural nets. Applied to our duality framework, this result suggests using a uniform mixture of multiple neural nets as the discriminator.
As a byproduct, we apply the duality framework to the minimum-sum hybrid of an f-divergence and the first-order Wasserstein ($W_1$) distance, e.g. the following hybrid of the JS divergence and the $W_1$ distance:
$$d_{\mathrm{JS},W_1}(P_1,P_2) \,:=\, \min_{Q}\; \mathrm{JSD}(P_1,Q) + W_1(Q,P_2). \qquad (4)$$
We prove that this hybrid divergence behaves continuously in the distribution $P_{G(\mathbf{Z})}$. The hybrid divergence therefore provides a remedy for the discontinuous behavior of the JS divergence when optimizing the generator parameters in vanilla GAN. [4] observes this issue with the JS divergence in vanilla GAN and proposes to instead minimize the continuously-changing $W_1$ distance in WGAN. However, as empirically demonstrated in [12], vanilla GAN with a Lipschitz-bounded discriminator remains the state-of-the-art method for training deep generative models in several benchmark tasks. Here, we leverage our duality framework to prove that the hybrid $d_{\mathrm{JS},W_1}$, which possesses the same continuity property as the $W_1$ distance, is in fact the divergence measure minimized in vanilla GAN with a Lipschitz discriminator. Our analysis hence provides an explanation for why regularizing the discriminator's Lipschitz constant via the gradient penalty [13] or spectral normalization [12] improves training performance in vanilla GAN. We then extend our focus to the hybrid of an f-divergence and the second-order Wasserstein ($W_2$) distance. In this case, we derive the f-GAN (e.g. vanilla GAN) problem with its discriminator adversarially trained via Wasserstein risk minimization [14]. We numerically evaluate the power of these families of hybrid divergences in training vanilla GAN.
2 Divergence Measures
2.1 Jensen-Shannon divergence
The Jensen-Shannon divergence is defined in terms of the KL divergence (denoted by $\mathrm{KL}$) as
$$\mathrm{JSD}(P,Q) \,=\, \frac{1}{2}\,\mathrm{KL}\Big(P \,\Big\|\, \frac{P+Q}{2}\Big) + \frac{1}{2}\,\mathrm{KL}\Big(Q \,\Big\|\, \frac{P+Q}{2}\Big),$$
where $\frac{P+Q}{2}$ is the mid-distribution between $P$ and $Q$. Unlike the KL divergence, the JS divergence is symmetric and bounded: $0 \le \mathrm{JSD}(P,Q) \le \log 2$.
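Both properties are easy to check numerically. The sketch below (plain Python; `kl` and `jsd` are our helper names, not notation from the text) computes the JS divergence through the mid-distribution and asserts symmetry and the log 2 bound:

```python
import math

def kl(p, q):
    """KL divergence between discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence via the mid-distribution M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
assert abs(jsd(p, q) - jsd(q, p)) < 1e-12       # symmetric
assert 0 <= jsd(p, q) <= math.log(2) + 1e-12    # bounded by log 2
```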
2.2 f-divergence
The f-divergence family [15] generalizes the KL and JS divergence measures. Given a convex lower semicontinuous function $f$ with $f(1)=0$, the f-divergence $d_f$ is defined as
$$d_f(P,Q) \,=\, \mathbb{E}_{Q}\Big[ f\Big(\frac{p(\mathbf{X})}{q(\mathbf{X})}\Big) \Big]. \qquad (5)$$
Here $\mathbb{E}_{Q}$ denotes expectation over distribution $Q$, and $p, q$ denote the density functions of distributions $P, Q$, respectively. The KL divergence and the JS divergence are members of the f-divergence family, corresponding respectively to $f(u)=u\log u$ and (up to additive and multiplicative constants) $f(u)=u\log u-(u+1)\log\frac{u+1}{2}$.
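The definition (5) can be exercised directly on discrete distributions. In the sketch below (our illustration; `f_divergence`, `f_kl`, and `f_js` are assumed helper names), the choice $f(u)=u\log u$ recovers the KL divergence, and the JS-type $f$ above recovers twice the JS divergence:

```python
import math

def f_divergence(f, p, q):
    """d_f(P, Q) = E_Q[f(p(X)/q(X))] for discrete distributions with q > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

f_kl = lambda u: u * math.log(u) if u > 0 else 0.0
f_js = lambda u: (u * math.log(u) if u > 0 else 0.0) - (u + 1) * math.log((u + 1) / 2)

p, q = [0.6, 0.3, 0.1], [0.2, 0.3, 0.5]
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
assert abs(f_divergence(f_kl, p, q) - kl_direct) < 1e-12

m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
jsd = 0.5 * sum(pi * math.log(pi / mi) for pi, mi in zip(p, m)) \
    + 0.5 * sum(qi * math.log(qi / mi) for qi, mi in zip(q, m))
assert abs(f_divergence(f_js, p, q) - 2 * jsd) < 1e-12   # f_js gives 2 * JSD
```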
2.3 Optimal transport cost, Wasserstein distance
The optimal transport cost for a cost function $c(\mathbf{x},\mathbf{x}')$, which we denote by $C(P,Q)$, is defined as
$$C(P,Q) \,=\, \inf_{M\in\Pi(P,Q)}\; \mathbb{E}_M\big[c(\mathbf{X},\mathbf{X}')\big], \qquad (6)$$
where $\Pi(P,Q)$ contains all couplings $M$ of $(\mathbf{X},\mathbf{X}')$ with marginals $P, Q$. The Kantorovich duality [16] shows that for a non-negative lower semicontinuous cost $c$,
$$C(P,Q) \,=\, \sup_{D\; c\text{-concave}}\; \mathbb{E}_P\big[D(\mathbf{X})\big] - \mathbb{E}_Q\big[D^c(\mathbf{X}')\big], \qquad (7)$$
where we use $D^c$ to denote $D$'s c-transform, defined as $D^c(\mathbf{x}') := \sup_{\mathbf{x}}\; D(\mathbf{x}) - c(\mathbf{x},\mathbf{x}')$, and call $D$ c-concave if $D$ is the c-transform of some function. Considering the norm-based cost $c(\mathbf{x},\mathbf{x}') = \|\mathbf{x}-\mathbf{x}'\|^p$ with $p\ge 1$, the $p$th-order Wasserstein distance $W_p$ is defined through the optimal transport cost as
$$W_p(P,Q) \,=\, \Big(\inf_{M\in\Pi(P,Q)}\; \mathbb{E}_M\big[\|\mathbf{X}-\mathbf{X}'\|^p\big]\Big)^{1/p}. \qquad (8)$$
An important special case is the first-order Wasserstein ($W_1$) distance, corresponding to the norm-difference cost $c(\mathbf{x},\mathbf{x}') = \|\mathbf{x}-\mathbf{x}'\|$. Given this cost function, a function $D$ is c-concave if and only if $D$ is 1-Lipschitz, and $D^c = D$ for any 1-Lipschitz $D$. Therefore, the Kantorovich duality (7) implies that
$$W_1(P,Q) \,=\, \sup_{\|D\|_{\mathrm{Lip}}\le 1}\; \mathbb{E}_P\big[D(\mathbf{X})\big] - \mathbb{E}_Q\big[D(\mathbf{X})\big]. \qquad (9)$$
Another notable special case is the second-order Wasserstein ($W_2$) distance, corresponding to the squared-norm cost $c(\mathbf{x},\mathbf{x}') = \|\mathbf{x}-\mathbf{x}'\|^2$.
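The duality (9) can be checked numerically on a small support. In the sketch below (our construction), the primal $W_1$ is computed as the area between the two CDFs, and the dual is a grid search over 1-Lipschitz discriminators parameterized by their increments; the first value is fixed to 0 since the objective is invariant to constant shifts:

```python
xs = [0.0, 1.0, 2.0]          # support points with unit spacing
p = [0.6, 0.3, 0.1]
q = [0.1, 0.3, 0.6]

# primal: in 1-D, W1 equals the area between the two CDFs
Fp, Fq = [0.6, 0.9], [0.1, 0.4]               # CDF values on [0,1) and [1,2)
w1 = sum(abs(a - b) for a, b in zip(Fp, Fq))

# dual (9): maximize E_P[D] - E_Q[D] over 1-Lipschitz D; grid D's increments,
# fixing D(x0) = 0 since the objective is invariant to constant shifts
n = 20
best = max(
    sum((pi - qi) * di for pi, qi, di in zip(p, q, (0.0, s1 / n, (s1 + s2) / n)))
    for s1 in range(-n, n + 1)
    for s2 in range(-n, n + 1)
)
assert abs(best - w1) < 1e-9
```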
3 Divergence minimization in GANs: a convex duality framework
In this section, we develop a convex duality framework for analyzing divergence minimization problems subject to moment-matching constraints. Our framework generalizes the duality framework developed in [17] for the f-divergence family.
For a general divergence measure $d(Q,P)$, we define $d$'s conjugate over distribution $P$, which we denote by $d^*_{P}$, as the following mapping from real-valued functions of $\mathbf{x}$ to real numbers:
$$d^*_{P}(D) \,:=\, \sup_{Q}\; \mathbb{E}_{Q}\big[D(\mathbf{X})\big] - d(Q, P). \qquad (10)$$
Here the supremum is over all distributions $Q$ supported on the set $\mathcal{X}$. We later show that the following theorem, which is based on the above definition, recovers various well-known GAN formulations when applied to the divergence measures discussed in Section 2.
Theorem 1.
Suppose the divergence $d(Q,P)$ is non-negative, lower semicontinuous, and convex in its first argument $Q$. Consider a convex set $\mathcal{F}$ of continuous functions, and assume the support set $\mathcal{X}$ is compact. Then,
$$\min_{G\in\mathcal{G}}\; \max_{D\in\mathcal{F}}\; \mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] - d^*_{P_{G(\mathbf{Z})}}(D) \;=\; \min_{G\in\mathcal{G}}\; \min_{Q}\; \Big\{ d\big(Q, P_{G(\mathbf{Z})}\big) + \max_{D\in\mathcal{F}}\Big(\mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] - \mathbb{E}_{Q}\big[D(\mathbf{X})\big]\Big) \Big\}. \qquad (11)$$
Proof.
We defer the proof to the Appendix. ∎
Theorem 1 interprets the minimax problem on the left-hand side of (11) as searching for the generative model closest to the distributions penalized for not sharing the moments specified by $\mathcal{F}$ with $P_{\mathbf{X}}$. The following corollary of Theorem 1 shows that if we further assume $\mathcal{F}$ is a linear space, the penalty term for moment mismatches can be moved into the constraints. This reduction reveals a divergence minimization problem between generative models and the following set $\mathcal{P}_{\mathcal{F}}$, which we call the set of discriminator moment-matching distributions,
$$\mathcal{P}_{\mathcal{F}} \,:=\, \Big\{ Q: \; \mathbb{E}_{Q}\big[D(\mathbf{X})\big] = \mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] \;\;\text{for all } D\in\mathcal{F} \Big\}. \qquad (12)$$
Corollary 1.
In Theorem 1, suppose $\mathcal{F}$ is further a linear space, i.e. for any $D_1, D_2\in\mathcal{F}$ and $\lambda_1,\lambda_2\in\mathbb{R}$ we have $\lambda_1 D_1 + \lambda_2 D_2 \in\mathcal{F}$. Then,
$$\min_{G\in\mathcal{G}}\; \max_{D\in\mathcal{F}}\; \mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] - d^*_{P_{G(\mathbf{Z})}}(D) \;=\; \min_{G\in\mathcal{G}}\; \min_{Q\in\mathcal{P}_{\mathcal{F}}}\; d\big(Q, P_{G(\mathbf{Z})}\big). \qquad (13)$$
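To make the moment-matching set concrete, consider the toy case $d = \mathrm{KL}$ with $\mathcal{F}$ spanned by the identity feature $D(x)=x$, so that the constraint set consists of all distributions with the same mean as the data. The sketch below (our illustration; `i_projection` and `tilt` are hypothetical helper names) finds the KL-closest member of that set to a model $Q$, which is an exponential tilt of $Q$ found by bisection:

```python
import math

def tilt(q, xs, theta):
    """Exponential tilt of Q: proportional to q(x) * exp(theta * x)."""
    w = [qi * math.exp(theta * x) for qi, x in zip(q, xs)]
    z = sum(w)
    return [wi / z for wi in w]

def i_projection(q, xs, target_mean, lo=-50.0, hi=50.0):
    """min KL(Q' || Q) over Q' with E_{Q'}[X] = target_mean, via bisection on theta."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if sum(p * x for p, x in zip(tilt(q, xs, mid), xs)) < target_mean:
            lo = mid
        else:
            hi = mid
    return tilt(q, xs, (lo + hi) / 2)

kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

xs = [0.0, 1.0, 2.0]
p1 = [0.2, 0.2, 0.6]       # "data" distribution, mean 1.4
q = [0.5, 0.3, 0.2]        # "model" distribution
proj = i_projection(q, xs, 1.4)
assert abs(sum(p * x for p, x in zip(proj, xs)) - 1.4) < 1e-6  # moment matched
assert kl(proj, q) <= kl(p1, q) + 1e-9  # projection is KL-closer than the data itself
```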
In the next section, we apply this duality framework to the divergence measures discussed in Section 2 and show how various GAN problems can be derived through the developed framework.
4 Duality framework applied to different divergence measures
4.1 f-divergence: f-GAN and vanilla GAN
Theorem 2 shows the application of Theorem 1 to f-divergences. We use $f^*$ to denote $f$'s convex conjugate [18], defined as $f^*(y) := \sup_{u}\; uy - f(u)$. Note that Theorem 2 applies to any f-divergence with a non-decreasing convex conjugate $f^*$, a condition that holds for all the f-divergence examples discussed in [3], with the only exception of the Pearson divergence.
Theorem 2.
Consider the f-divergence $d_f$ where $f$ has a non-decreasing conjugate $f^*$, and suppose $\mathcal{F}$ is a linear space of continuous functions including the constant function. Then,
$$\min_{G\in\mathcal{G}}\; \max_{D\in\mathcal{F}}\; \mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] - \mathbb{E}\big[f^*\big(D(G(\mathbf{Z}))\big)\big] \;=\; \min_{G\in\mathcal{G}}\; \min_{Q\in\mathcal{P}_{\mathcal{F}}}\; d_f\big(Q, P_{G(\mathbf{Z})}\big). \qquad (14)$$
Proof.
We defer the proof to the Appendix. ∎
The minimax problem (14) is in fact the f-GAN problem [3]. Theorem 2 hence reveals that f-GAN searches for the generative model minimizing the f-divergence to the distributions whose moments over $\mathcal{F}$ match the moments of the true distribution.
Example 1.
Consider the JS divergence, i.e. the f-divergence corresponding to $f(u) = u\log u - (u+1)\log\frac{u+1}{2}$. Then, (14), up to additive and multiplicative constants, reduces to
$$\min_{G\in\mathcal{G}}\; \min_{Q\in\mathcal{P}_{\mathcal{F}}}\; \mathrm{JSD}\big(Q, P_{G(\mathbf{Z})}\big). \qquad (15)$$
Moreover, if the discriminator set corresponding to $\mathcal{F}$ is convex, then (15) reduces to the following minimax game, which is the vanilla GAN problem (1) with a sigmoid activation $\sigma$ applied to the discriminator output,
$$\min_{G\in\mathcal{G}}\; \max_{D\in\mathcal{F}}\; \mathbb{E}\big[\log \sigma\big(D(\mathbf{X})\big)\big] + \mathbb{E}\big[\log\big(1-\sigma\big(D(G(\mathbf{Z}))\big)\big)\big]. \qquad (16)$$
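A per-point sanity check of this formulation (our illustration, using a crude grid search as a stand-in for the inner maximization): for fixed densities $p$ (data) and $q$ (generator) at a point, the inner problem $\max_t\; p\log\sigma(t) + q\log(1-\sigma(t))$ is solved by $\sigma(t^*) = p/(p+q)$, the classical optimal-discriminator formula:

```python
import math

sigmoid = lambda t: 1 / (1 + math.exp(-t))

def inner_value(p, q, t):
    """Per-point vanilla GAN objective: p*log(sigmoid(t)) + q*log(1 - sigmoid(t))."""
    return p * math.log(sigmoid(t)) + q * math.log(1 - sigmoid(t))

p, q = 0.3, 0.7   # data and generator densities at one point
best_t = max((inner_value(p, q, t / 100), t / 100) for t in range(-500, 501))[1]
assert abs(sigmoid(best_t) - p / (p + q)) < 1e-2   # optimum is p / (p + q)
```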
4.2 Optimal Transport Cost: Wasserstein GAN
Theorem 3.
Consider the optimal transport cost $C$ for a non-negative lower semicontinuous cost function $c$, and let $\mathcal{F}$ be a convex set of continuous functions. Then,
$$\min_{G\in\mathcal{G}}\; \max_{D\in\mathcal{F}}\; \mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] - \mathbb{E}\big[D^c\big(G(\mathbf{Z})\big)\big] \;=\; \min_{G\in\mathcal{G}}\; \min_{Q}\; \Big\{ C\big(Q, P_{G(\mathbf{Z})}\big) + \max_{D\in\mathcal{F}}\Big(\mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] - \mathbb{E}_{Q}\big[D(\mathbf{X})\big]\Big) \Big\}. \qquad (17)$$
Proof.
We defer the proof to the Appendix. ∎
Therefore, the minimax game between $G$ and $D$ in (17) can be viewed as minimizing the optimal transport cost between the generative models and the distributions penalized for not matching the moments over $\mathcal{F}$ with $P_{\mathbf{X}}$'s moments. The following example applies this result to the first-order Wasserstein distance and recovers the WGAN problem [4] with a constrained Lipschitz discriminator.
Example 2.
Consider the first-order Wasserstein distance, i.e. the optimal transport cost for $c(\mathbf{x},\mathbf{x}') = \|\mathbf{x}-\mathbf{x}'\|$, and let $\mathcal{F}$ be a convex set of 1-Lipschitz functions. Since $D^c = D$ for any 1-Lipschitz $D$, (17) reduces to
$$\min_{G\in\mathcal{G}}\; \max_{D\in\mathcal{F}}\; \mathbb{E}\big[D(\mathbf{X})\big] - \mathbb{E}\big[D\big(G(\mathbf{Z})\big)\big] \;=\; \min_{G\in\mathcal{G}}\; \min_{Q}\; \Big\{ W_1\big(Q,P_{G(\mathbf{Z})}\big) + \max_{D\in\mathcal{F}}\Big(\mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big]-\mathbb{E}_{Q}\big[D(\mathbf{X})\big]\Big) \Big\}, \qquad (18)$$
whose left-hand side is the WGAN minimax problem.
Therefore, the moment-matching interpretation also holds for WGAN: for a convex set $\mathcal{F}$ of 1-Lipschitz functions, WGAN finds the generative model with minimum $W_1$ distance to the distributions penalized to share the same moments over $\mathcal{F}$ with the data distribution. We discuss two more examples in the Appendix: 1) for the indicator cost corresponding to the total variation distance, we draw the connection to the energy-based GAN [8]; 2) for the second-order cost, we recover [9]'s quadratic GAN formulation under the LQG setting assumptions, i.e. linear generator, quadratic discriminator, and Gaussian input data.
5 Duality framework applied to neural net discriminators
We applied the duality framework to analyze GAN problems with convex discriminator sets. However, a neural net set $\mathcal{F}_{\mathrm{nn}} = \{f_{\mathbf{w}}: \mathbf{w}\in\mathcal{W}\}$, where $f_{\mathbf{w}}$ denotes a neural net function with a fixed architecture and weights $\mathbf{w}$ in a feasible set $\mathcal{W}$, does not generally satisfy this convexity assumption. Note that a linear combination of several neural net functions in $\mathcal{F}_{\mathrm{nn}}$ may not remain in $\mathcal{F}_{\mathrm{nn}}$.
Therefore, we apply the duality framework to $\mathcal{F}_{\mathrm{nn}}$'s convex hull, which we denote by $\mathrm{conv}(\mathcal{F}_{\mathrm{nn}})$, containing every convex combination of neural net functions in $\mathcal{F}_{\mathrm{nn}}$. However, a convex combination of infinitely-many neural nets from $\mathcal{F}_{\mathrm{nn}}$ is characterized by infinitely-many parameters, which makes optimizing the discriminator over $\mathrm{conv}(\mathcal{F}_{\mathrm{nn}})$ computationally intractable. In the following theorem, we show that although a function in $\mathrm{conv}(\mathcal{F}_{\mathrm{nn}})$ is a combination of infinitely-many neural nets, it can be approximated by uniformly combining boundedly-many neural nets from $\mathcal{F}_{\mathrm{nn}}$.
Theorem 4.
Suppose every function $f_{\mathbf{w}}\in\mathcal{F}_{\mathrm{nn}}$ is Lipschitz and bounded as $|f_{\mathbf{w}}(\mathbf{x})| \le M$. Also, assume that the $d$-dimensional random input $\mathbf{X}$ is norm-bounded as $\|\mathbf{X}\|_2 \le R$. Then, any function in $\mathrm{conv}(\mathcal{F}_{\mathrm{nn}})$ can be uniformly approximated over the ball $\{\mathbf{x}: \|\mathbf{x}\|_2 \le R\}$ within $\epsilon$-error by a uniform combination $\frac{1}{m}\sum_{i=1}^m f_{\mathbf{w}_i}$ of boundedly-many functions $f_{\mathbf{w}_i}\in\mathcal{F}_{\mathrm{nn}}$.
Proof.
We defer the proof to the Appendix. ∎
The above theorem suggests using a uniform combination of multiple discriminator nets to better approximate the solution to the divergence minimization problem in Theorem 1 solved over $\mathrm{conv}(\mathcal{F}_{\mathrm{nn}})$. Note that this approach is different from MIX+GAN [10], proposed for achieving equilibrium in the GAN minimax game. While our approach considers a uniform combination of multiple neural nets as the discriminator, MIX+GAN considers a randomized combination of the minimax game over multiple neural net discriminators and generators.
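A toy Monte Carlo illustration of Theorem 4 (our construction; the `tanh` functions stand in for bounded Lipschitz neural nets): a uniform combination of a few thousand sampled nets closely tracks a dense convex combination of 20,000 nets at a fixed input:

```python
import math, random

random.seed(0)

# bounded, Lipschitz stand-ins for neural nets: f_w(x) = tanh(w * x)
f = lambda w, x: math.tanh(w * x)

ws = [random.gauss(0, 1) for _ in range(20000)]   # weights of a dense convex combination
x = 0.7
dense = sum(f(w, x) for w in ws) / len(ws)        # the full convex combination at x

def mixture_error(m):
    """Error of a uniform combination of m nets sampled from the combination."""
    sub = random.sample(ws, m)
    return abs(sum(f(w, x) for w in sub) / m - dense)

assert mixture_error(5000) < 0.1                  # boundedly-many nets suffice
```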
6 Minimum-sum hybrid of f-divergence and Wasserstein distance: GAN with a Lipschitz or adversarially-trained discriminator
Here we apply the convex duality framework to a novel class of divergence measures. For each f-divergence $d_f$ we define the divergence $d_{f,W_1}$, the minimum-sum hybrid of $d_f$ and $W_1$, as
$$d_{f,W_1}(P_1,P_2) \,:=\, \min_{Q}\; d_f(P_1,Q) + W_1(Q,P_2). \qquad (19)$$
The above infimum is taken over all distributions $Q$ of the random vector $\mathbf{X}$, searching for the distribution $Q$ minimizing the sum of the Wasserstein distance between $Q$ and $P_2$ and the f-divergence from $P_1$ to $Q$. Note that the hybrid of the JS divergence and the $W_1$ distance defined earlier in (4) is a special case of this definition. While the f-divergence in f-GAN does not change continuously with the generator parameters, the following theorem shows that, similar to the continuous behavior of the $W_1$ distance shown in [19, 4], the proposed hybrid divergence changes continuously with the generative model. We defer the proofs of this section's results to the Appendix.
Theorem 5.
Suppose $G_{\boldsymbol{\theta}}(\mathbf{Z})$ changes continuously with the parameter vector $\boldsymbol{\theta}$. Then, for any distribution $P$ and f-divergence $d_f$, the hybrid $d_{f,W_1}\big(P, P_{G_{\boldsymbol{\theta}}(\mathbf{Z})}\big)$ behaves continuously as a function of $\boldsymbol{\theta}$. Moreover, if $G_{\boldsymbol{\theta}}$ is assumed to be locally Lipschitz, then $d_{f,W_1}\big(P, P_{G_{\boldsymbol{\theta}}(\mathbf{Z})}\big)$ is differentiable w.r.t. $\boldsymbol{\theta}$ almost everywhere.
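The continuity claim and the structure of (19) can be probed on the two-point space $\{0,1\}$, where $W_1$ between Bernoulli distributions is just the difference of their parameters. The sketch below (our construction; `hybrid` performs a brute-force grid search over the intermediate distribution $Q$) also confirms that the hybrid never exceeds either of its two components, obtained by setting $Q = P_1$ or $Q = P_2$:

```python
import math

def jsd_bern(p, q):
    """JS divergence between Bernoulli(p) and Bernoulli(q)."""
    def kl(a, b):
        out = 0.0
        for x, y in ((a, b), (1 - a, 1 - b)):
            if x > 0:
                out += x * math.log(x / y)
        return out
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hybrid(p1, p2, grid=1000):
    """d_{JS,W1}(P1, P2) on {0,1}: min over Q of JSD(P1,Q) + W1(Q,P2), W1 = |q - p2|."""
    return min(jsd_bern(p1, k / grid) + abs(k / grid - p2) for k in range(grid + 1))

# choosing Q = P1 or Q = P2 shows the hybrid is below both components
assert hybrid(0.2, 0.9) <= jsd_bern(0.2, 0.9) + 1e-9
assert hybrid(0.2, 0.9) <= abs(0.2 - 0.9) + 1e-9
```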
Our next result reveals the minimax problem dual to minimizing this hybrid divergence when its f-divergence component is symmetric. We note that this symmetry condition is met by the JS divergence and the squared Hellinger divergence among the f-divergence examples discussed in [3].
Theorem 6.
The above theorem reveals that when the Lipschitz constant of the discriminator in f-GAN is properly regularized, solving the f-GAN problem over the regularized discriminator also minimizes the continuous divergence measure $d_{f,W_1}$. As a special case, in the vanilla GAN problem (16) we only need to constrain the discriminator to be 1-Lipschitz, which can be done via the gradient penalty [13] or spectral normalization of the discriminator's weight matrices [12]; we then minimize the continuously-behaving $d_{\mathrm{JS},W_1}$. This result is also consistent with [12]'s empirical observation that regularizing the Lipschitz constant of the discriminator improves training performance in vanilla GAN.
Our discussion has so far focused on the hybrid of an f-divergence and the first-order Wasserstein distance, which suggests training f-GAN over Lipschitz-bounded discriminators. As a second solution, we prove that the desired continuity property can also be achieved through the following hybrid using the second-order Wasserstein ($W_2$) distance squared:
$$d_{f,W_2}(P_1,P_2) \,:=\, \min_{Q}\; d_f(P_1,Q) + W_2^2(Q,P_2). \qquad (21)$$
Theorem 7.
Suppose $G_{\boldsymbol{\theta}}(\mathbf{Z})$ changes continuously with the parameter vector $\boldsymbol{\theta}$. Then, for any distribution $P$ and random vector $\mathbf{Z}$, $d_{f,W_2}\big(P, P_{G_{\boldsymbol{\theta}}(\mathbf{Z})}\big)$ is continuous in $\boldsymbol{\theta}$. Also, if we further assume $G_{\boldsymbol{\theta}}$ is bounded and locally Lipschitz w.r.t. $\boldsymbol{\theta}$, then the hybrid divergence is almost everywhere differentiable w.r.t. $\boldsymbol{\theta}$.
The following result shows that minimizing $d_{f,W_2}$ reduces to an f-GAN problem in which the discriminator is adversarially trained.
Theorem 8.
The above result reduces minimizing the hybrid divergence $d_{f,W_2}$ to an f-GAN minimax game with a new third player. Here the third player assists the generator by perturbing the generated fake samples to make them harder for the discriminator to distinguish from the real samples. The cost of perturbing a fake sample $\mathbf{x}$ to $\tilde{\mathbf{x}}$ is proportional to the squared distance $\|\tilde{\mathbf{x}}-\mathbf{x}\|^2$, which constrains the power of the third player, who can be interpreted as an adversary to the discriminator. To implement the game between these three players, we can adversarially learn the discriminator while training the GAN, using the Wasserstein risk minimization (WRM) adversarial learning scheme discussed in [14].
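A minimal sketch of the inner adversarial step (our toy construction, not the paper's implementation): with a hypothetical linear discriminator $D(x)=ax$ and a quadratic perturbation cost, the perturbed sample $\arg\max_{x'} D(x') - \gamma\|x'-x\|^2$ has the closed form $x + a/(2\gamma)$, which plain gradient ascent recovers:

```python
# inner WRM-style maximization: x_tilde = argmax_x' D(x') - gamma * ||x' - x||^2
# toy case (our choice): linear discriminator D(x) = a*x, optimum x + a / (2*gamma)
a, gamma, x = 3.0, 2.0, 1.0

xt, lr = x, 0.1
for _ in range(200):                    # gradient ascent on the penalized objective
    grad = a - 2 * gamma * (xt - x)     # derivative of a*x' - gamma*(x' - x)^2
    xt += lr * grad

assert abs(xt - (x + a / (2 * gamma))) < 1e-6
```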
7 Numerical Experiments
To evaluate our theoretical results, we used the CelebA [20] and LSUN-bedroom [21] datasets. Furthermore, in the Appendix we include the results of our experiments on the MNIST [22] dataset. We considered vanilla GAN [1] with the minimax formulation in (16) and the DCGAN [23] convolutional architecture for the discriminator and generator. We used the code provided by [13] and trained DCGAN via the Adam optimizer [24] for 200,000 generator iterations, applying 5 discriminator updates per generator update.
Figure 2 shows how the discriminator loss evaluated over 2000 validation samples, which is an estimate of the divergence measure, changes as we train the DCGAN on LSUN samples. Using a standard DCGAN regularized only by batch normalization (BN) [25], we observed (Figure 2, left) that the JS divergence estimate always remains close to its maximum value and correlates poorly with the visual quality of the generated samples. In this experiment, GAN training failed and led to mode collapse starting at about the 110,000th iteration. On the other hand, after replacing BN with spectral normalization (SN) [12] to ensure the discriminator's Lipschitzness, the discriminator loss decreased in the desired monotonic fashion (Figure 2, right). This observation is consistent with Theorems 5 and 6, which show that the discriminator loss becomes an estimate of a hybrid divergence changing continuously with the generator parameters. Also, the samples generated by the Lipschitz-regularized DCGAN looked qualitatively better and correlated well with the divergence estimate.
Figure 3 shows the results of similar experiments on the CelebA dataset. Again, we observed (Figure 3, top left) that the JS divergence estimate remains close to its maximum while training DCGAN with BN. However, after applying two different Lipschitz regularization methods, SN and the gradient penalty (GP) [13] (Figure 3, top right and bottom left), we observed that the hybrid divergence estimate changed monotonically and correlated properly with the sharpness of the generated samples. Figure 3, bottom right, shows that a similar desired behavior can be achieved using the second-order hybrid divergence. In this case, we trained the DCGAN discriminator via the WRM adversarial learning scheme [14].
8 Related Work
Theoretical studies of GAN have focused on three different aspects: approximation, generalization, and optimization. On the approximation properties of GAN, [11] studies GAN's approximation power using a moment-matching approach. The authors view the maximized discriminator objective as an $\mathcal{F}$-based adversarial divergence, showing that the adversarial divergence between two distributions takes its minimum value if and only if the two distributions share the same moments over $\mathcal{F}$. Our convex duality framework interprets their result and further draws the connection to the original divergence measure. [26] studies the f-GAN problem through an information-geometric approach based on the Bregman divergence and its connection to f-divergence.
Analyzing GAN's generalization performance is another problem of interest in several recent works. [10] proves generalization guarantees for GANs in terms of $\mathcal{F}$-based distance measures. [27] uses an elegant approach based on the Birthday Paradox to empirically study the generalizability of GAN's learned models. [28] develops a quantitative approach for examining diversity and generalization in GAN's learned distribution. [29] studies approximation-generalization tradeoffs in GAN by analyzing the discriminative power of $\mathcal{F}$-based distances. Regarding optimization properties of GAN, [30, 31] propose duality-based methods for improving the optimization performance in training deep generative models. [32] suggests convolving the input data with noise to boost the training performance in f-GAN. Moreover, several other works, including [33, 34, 35, 9, 36], explore the optimization and stability properties of training GANs. Finally, we note that the same convex analysis approach used in this paper has provided a powerful theoretical framework for analyzing various supervised and unsupervised learning problems [37, 38, 39, 40, 41].
Acknowledgments: We are grateful for support under a Stanford Graduate Fellowship, the National Science Foundation grant CCF-1563098, and the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370.
References
 [1] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [2] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
 [3] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

 [4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 2017.
 [5] Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.
 [6] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727, 2015.
 [7] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2200–2210, 2017.
 [8] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
 [9] Soheil Feizi, Farzan Farnia, Tony Ginart, and David Tse. Understanding GANs: the LQG setting. arXiv preprint arXiv:1710.10793, 2017.
 [10] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017.
 [11] Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pages 5551–5559, 2017.
 [12] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. International Conference on Learning Representations, 2018.
 [13] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
 [14] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
 [15] Imre Csiszár, Paul C Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.
 [16] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

 [17] Yasemin Altun and Alex Smola. Unifying divergence minimization and statistical inference via convex duality. In International Conference on Computational Learning Theory, pages 139–153, 2006.
 [18] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
 [19] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.

 [20] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 [21] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

 [22] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 [23] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [26] Richard Nock, Zac Cranko, Aditya K Menon, Lizhen Qu, and Robert C Williamson. f-GANs in an information geometric nutshell. In Advances in Neural Information Processing Systems, pages 456–464, 2017.
 [27] Sanjeev Arora and Yi Zhang. Do gans actually learn the distribution? an empirical study. arXiv preprint arXiv:1706.08224, 2017.
 [28] Shibani Santurkar, Ludwig Schmidt, and Aleksander Madry. A classificationbased perspective on gan distributions. arXiv preprint arXiv:1711.00970, 2017.
 [29] Pengchuan Zhang, Qiang Liu, Dengyong Zhou, Tao Xu, and Xiaodong He. On the discriminationgeneralization tradeoff in GANs. International Conference on Learning Representations, 2018.
 [30] Xu Chen, Jiang Wang, and Hao Ge. Training generative adversarial networks via primaldual subgradient methods: a Lagrangian perspective on GAN. In International Conference on Learning Representations, 2018.
 [31] Shengjia Zhao, Jiaming Song, and Stefano Ermon. The information autoencoding family: A lagrangian perspective on latent variable generative models. arXiv preprint arXiv:1806.06514, 2018.
 [32] Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pages 2015–2025, 2017.
 [33] Vaishnavh Nagarajan and J Zico Kolter. Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5591–5600, 2017.
 [34] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. In Advances in Neural Information Processing Systems, pages 1823–1833, 2017.
 [35] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism. arXiv preprint arXiv:1711.00141, 2017.
 [36] Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. Solving approximate wasserstein gans to stationarity. arXiv preprint arXiv:1802.08249, 2018.
 [37] Miroslav Dudík, Steven J Phillips, and Robert E Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8(Jun):1217–1260, 2007.

 [38] Meisam Razaviyayn, Farzan Farnia, and David Tse. Discrete Rényi classifiers. In Advances in Neural Information Processing Systems, pages 3276–3284, 2015.
 [39] Farzan Farnia and David Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, pages 4240–4248, 2016.
 [40] Rizal Fathony, Anqi Liu, Kaiser Asif, and Brian Ziebart. Adversarial multiclass classification: A risk minimization perspective. In Advances in Neural Information Processing Systems, pages 559–567, 2016.
 [41] Rizal Fathony, Mohammad Ali Bashiri, and Brian Ziebart. Adversarial surrogate losses for ordinal regression. In Advances in Neural Information Processing Systems, pages 563–573, 2017.
 [42] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. arXiv preprint arXiv:1802.04034, 2018.
 [43] Jonathan M Borwein. A very complicated proof of the minimax theorem. Minimax Theory and its Applications, 1(1):21–27, 2016.
 [44] Patrick Billingsley. Convergence of probability measures. John Wiley & Sons, 2013.
9 Appendix
9.1 Additional numerical results
9.1.1 LSUN divergence estimates for different training schemes
Figure 4 shows the complete divergence estimates over the LSUN dataset for the GAN training schemes described in the main text. While the hybrid divergence measures $d_{\mathrm{JS},W_1}$ and $d_{\mathrm{JS},W_2}$ decreased smoothly as the DCGAN was being trained, the JS divergence always remained close to its maximum value, which led to lower-quality generated samples.
9.1.2 CelebA, LSUN, MNIST images generated by different trainings of DCGAN
Figures 5, 6, and 7 show the CelebA, LSUN, and MNIST samples generated by vanilla DCGAN trained via the different methods described in the main text. Observe that applying Lipschitz regularization and adversarial training to the discriminator consistently results in the highest-quality generator output samples. We note that tight SN in these figures refers to [42]'s spectral normalization method for convolutional layers, which precisely normalizes a conv layer's spectral norm and hence guarantees the Lipschitzness of the discriminator neural net. For non-tight SN, we use the original heuristic for normalizing convolutional layers' operator norms introduced in [12].
9.2 Proof of Theorem 1
Theorem 1 and Corollary 1 directly result from the following two lemmas.
Lemma 1.
Suppose the divergence $d(Q,P)$ is non-negative, lower semicontinuous, and convex in its first argument $Q$. Consider a convex subset $\mathcal{F}$ of continuous functions, and assume the support set $\mathcal{X}$ is compact. Then, the following duality holds for any pair of distributions $P_1, P_2$:
$$\max_{D\in\mathcal{F}}\; \mathbb{E}_{P_1}\big[D(\mathbf{X})\big] - d^*_{P_2}(D) \;=\; \min_{Q}\; \Big\{ d\big(Q, P_2\big) + \max_{D\in\mathcal{F}}\Big(\mathbb{E}_{P_1}\big[D(\mathbf{X})\big] - \mathbb{E}_{Q}\big[D(\mathbf{X})\big]\Big) \Big\}. \qquad (23)$$
Proof.
Note that
$$\sup_{D\in\mathcal{F}}\; \Big\{ \mathbb{E}_{P_1}\big[D(\mathbf{X})\big] - d^*_{P_2}(D) \Big\} \;\overset{(b)}{=}\; \sup_{D\in\mathcal{F}}\; \inf_{Q}\; \Big\{ d(Q,P_2) + \mathbb{E}_{P_1}\big[D(\mathbf{X})\big] - \mathbb{E}_{Q}\big[D(\mathbf{X})\big] \Big\} \;\overset{(a)}{=}\; \inf_{Q}\; \sup_{D\in\mathcal{F}}\; \Big\{ d(Q,P_2) + \mathbb{E}_{P_1}\big[D(\mathbf{X})\big] - \mathbb{E}_{Q}\big[D(\mathbf{X})\big] \Big\}. \qquad (24)$$
Here (a) is a consequence of the generalized Sion minimax theorem [43]: the space of probability measures on the compact set $\mathcal{X}$ is convex and weakly compact [44], $\mathcal{F}$ is assumed to be convex, and the minimax objective is lower semicontinuous and convex in $Q$ and linear in $D$. (b) holds according to the definition (10) of the conjugate $d^*_{P_2}$. ∎
Lemma 2.
Assume the divergence $d(Q,P)$ is non-negative, lower semicontinuous, and convex in its first argument $Q$ over the compact set $\mathcal{X}$. Consider a linear subspace $\mathcal{F}$ of continuous functions. Then, the following duality holds for any pair of distributions $P_1, P_2$:
$$\max_{D\in\mathcal{F}}\; \mathbb{E}_{P_1}\big[D(\mathbf{X})\big] - d^*_{P_2}(D) \;=\; \min_{Q:\; \mathbb{E}_{Q}[D(\mathbf{X})]=\mathbb{E}_{P_1}[D(\mathbf{X})]\;\forall D\in\mathcal{F}}\; d\big(Q, P_2\big). \qquad (25)$$
Proof.
This lemma is a consequence of Lemma 1. Note that a linear space is a convex set, so Lemma 1 applies to $\mathcal{F}$. However, since $\mathcal{F}$ is a linear space, i.e. for any $D\in\mathcal{F}$ and $\lambda\in\mathbb{R}$ it includes $\lambda D$, we have
$$\max_{D\in\mathcal{F}}\; \mathbb{E}_{P_1}\big[D(\mathbf{X})\big] - \mathbb{E}_{Q}\big[D(\mathbf{X})\big] \;=\; \begin{cases} 0 & \text{if } \mathbb{E}_{Q}[D(\mathbf{X})] = \mathbb{E}_{P_1}[D(\mathbf{X})] \text{ for all } D\in\mathcal{F}, \\ +\infty & \text{otherwise.} \end{cases} \qquad (26)$$
As a result, the minimizing $Q$ in Lemma 1 precisely matches the moments over $\mathcal{F}$ to $P_1$'s moments, which completes the proof. ∎
9.3 Proof of Theorem 2
We first prove the following lemma.
Lemma 3.
Consider the f-divergence $d_f$ corresponding to a function $f$ with a non-decreasing convex conjugate $f^*$. Then, for any continuous $D$,
$$d^*_{f,Q}(D) \,=\, \lambda^* + \mathbb{E}_{Q}\Big[f^*\big(D(\mathbf{X})-\lambda^*\big)\Big], \qquad (27)$$
where $\lambda^*$ satisfies $\mathbb{E}_{Q}\big[f^{*\prime}\big(D(\mathbf{X})-\lambda^*\big)\big] = 1$. Here $f^{*\prime}$ stands for the derivative of the conjugate function $f^*$, which is supposed to be non-negative everywhere.
Proof.
Note that
$$\begin{aligned}
d^*_{f,Q}(D) \;&\overset{(a)}{=}\; \sup_{P}\; \mathbb{E}_P\big[D(\mathbf{X})\big] - d_f(P,Q) \\
&\overset{(b)}{=}\; \sup_{P}\; \mathbb{E}_P\big[D(\mathbf{X})\big] - \mathbb{E}_{Q}\Big[f\Big(\frac{p(\mathbf{X})}{q(\mathbf{X})}\Big)\Big] \\
&\overset{(c)}{=}\; \sup_{p\ge 0,\,\int p=1}\; \int p(\mathbf{x})\,D(\mathbf{x})\,d\mathbf{x} - \int q(\mathbf{x})\, f\Big(\frac{p(\mathbf{x})}{q(\mathbf{x})}\Big)\,d\mathbf{x} \qquad (28) \\
&\overset{(d)}{=}\; \inf_{\lambda\in\mathbb{R}}\; \sup_{p\ge 0}\; \lambda + \int p(\mathbf{x})\big(D(\mathbf{x})-\lambda\big)\,d\mathbf{x} - \int q(\mathbf{x})\, f\Big(\frac{p(\mathbf{x})}{q(\mathbf{x})}\Big)\,d\mathbf{x} \\
&\overset{(e)}{=}\; \inf_{\lambda\in\mathbb{R}}\; \lambda + \sup_{u(\cdot)\ge 0}\; \int q(\mathbf{x})\Big[u(\mathbf{x})\big(D(\mathbf{x})-\lambda\big) - f\big(u(\mathbf{x})\big)\Big]\,d\mathbf{x} \\
&\overset{(f)}{=}\; \inf_{\lambda\in\mathbb{R}}\; \lambda + \int q(\mathbf{x})\,\sup_{u\ge 0}\Big[u\big(D(\mathbf{x})-\lambda\big) - f(u)\Big]\,d\mathbf{x} \\
&\overset{(g)}{=}\; \inf_{\lambda\in\mathbb{R}}\; \lambda + \mathbb{E}_{Q}\Big[f^*\big(D(\mathbf{X})-\lambda\big)\Big]. \qquad (29)
\end{aligned}$$
Here (a) and (b) follow from the conjugate and f-divergence definitions. (c) rewrites the optimization problem in terms of the density function $p$ corresponding to distribution $P$. (d) uses strong convex duality to move the density constraint $\int p = 1$ to the objective; strong duality holds since we have a convex optimization problem with an affine constraint. (e) rewrites the problem after the change of variable $u(\mathbf{x}) = p(\mathbf{x})/q(\mathbf{x})$. (f) holds since the functions involved are continuous, allowing the supremum to be taken pointwise. (g) follows from the assumption that the derivative of $f^*$ takes non-negative values, and hence the maximizing $u\ge 0$ also solves the unconstrained optimization defining the convex conjugate $f^*$.
Finally, taking the derivative of the convex objective in (29) w.r.t. $\lambda$, the minimizing value solves the equation $\mathbb{E}_{Q}\big[f^{*\prime}\big(D(\mathbf{X})-\lambda\big)\big] = 1$, whose solution is $\lambda^*$. Substituting $\lambda^*$ into (29) gives (27), and the proof is complete. ∎
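For $f(u) = u\log u$ (the KL divergence) we have $f^*(t) = e^{t-1}$, and the identity (29) can be checked numerically: both the supremum over distributions $P$ and the infimum over $\lambda$ equal the Gibbs variational value $\log \mathbb{E}_{Q}[e^{D}]$, a standard closed form for the KL case (the grid searches below are our illustration):

```python
import math

# f(u) = u*log(u) gives KL; its convex conjugate is f*(t) = exp(t - 1)
f_star = lambda t: math.exp(t - 1)

q = [0.5, 0.3, 0.2]            # base distribution Q
d = [0.4, -0.1, 1.3]           # an arbitrary discriminator D on a 3-point space

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# sup over P of E_P[D] - KL(P || Q), via a grid over the probability simplex
n = 200
lhs = max(
    sum(pi * di for pi, di in zip(p, d)) - kl(p, q)
    for i in range(n + 1)
    for j in range(n + 1 - i)
    for p in [(i / n, j / n, (n - i - j) / n)]
)

# inf over lambda of lambda + E_Q[f*(D - lambda)], via a 1-D grid
rhs = min(
    lam + sum(qi * f_star(di - lam) for qi, di in zip(q, d))
    for lam in [k / 100 for k in range(-300, 300)]
)

closed = math.log(sum(qi * math.exp(di) for qi, di in zip(q, d)))
assert abs(lhs - closed) < 1e-2 and abs(rhs - closed) < 1e-3
```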
We now prove Theorem 2, which can be broken into the following two parts.
Theorem (Theorem 2).
Consider the f-divergence $d_f$ where $f$ has a non-decreasing conjugate $f^*$.
(a) Suppose $\mathcal{F}$ is a convex set closed under constant addition, i.e. for any $D\in\mathcal{F}$ and $\lambda\in\mathbb{R}$ we have $D+\lambda\in\mathcal{F}$. Then,
$$\min_{G\in\mathcal{G}}\; \max_{D\in\mathcal{F}}\; \mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] - \mathbb{E}\big[f^*\big(D(G(\mathbf{Z}))\big)\big] \;=\; \min_{G\in\mathcal{G}}\; \min_{Q}\; \Big\{ d_f\big(Q,P_{G(\mathbf{Z})}\big) + \max_{D\in\mathcal{F}}\Big(\mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big]-\mathbb{E}_{Q}\big[D(\mathbf{X})\big]\Big) \Big\}. \qquad (30)$$
(b) Suppose $\mathcal{F}$ is a linear space including the constant function. Then,
$$\min_{G\in\mathcal{G}}\; \max_{D\in\mathcal{F}}\; \mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] - \mathbb{E}\big[f^*\big(D(G(\mathbf{Z}))\big)\big] \;=\; \min_{G\in\mathcal{G}}\; \min_{Q\in\mathcal{P}_{\mathcal{F}}}\; d_f\big(Q, P_{G(\mathbf{Z})}\big). \qquad (31)$$
Proof.
This theorem is an application of Theorem 1 and Corollary 1. For part (a) we have
$$\max_{D\in\mathcal{F}}\; \mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] - \mathbb{E}\big[f^*\big(D(G(\mathbf{Z}))\big)\big] \;\overset{(e)}{=}\; \max_{D\in\mathcal{F}}\; \mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] - \lambda^* - \mathbb{E}\big[f^*\big(D(G(\mathbf{Z}))-\lambda^*\big)\big] \;\overset{(d)}{=}\; \max_{D\in\mathcal{F}}\; \mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] - d^*_{f,P_{G(\mathbf{Z})}}(D) \;\overset{(c)}{=}\; \min_{Q}\; \Big\{ d_f\big(Q,P_{G(\mathbf{Z})}\big) + \max_{D\in\mathcal{F}}\Big(\mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big]-\mathbb{E}_{Q}\big[D(\mathbf{X})\big]\Big) \Big\}.$$
Here (c) is a direct result of Theorem 1. (d) uses the simplified expression (27) for the conjugate $d^*_{f,P_{G(\mathbf{Z})}}$. (e) follows from the assumption that $\mathcal{F}$ is closed under constant addition, which lets the shift $\lambda^*$ be absorbed into the discriminator $D$.
For part (b), note that since $\mathcal{F}$ is a linear space including the constant function, it is closed under constant addition. Hence, an application of Corollary 1 reveals
$$\min_{G\in\mathcal{G}}\; \max_{D\in\mathcal{F}}\; \mathbb{E}_{P_{\mathbf{X}}}\big[D(\mathbf{X})\big] - \mathbb{E}\big[f^*\big(D(G(\mathbf{Z}))\big)\big] \;=\; \min_{G\in\mathcal{G}}\; \min_{Q\in\mathcal{P}_{\mathcal{F}}}\; d_f\big(Q, P_{G(\mathbf{Z})}\big),$$
which completes the proof. ∎
9.4 Proof of Theorem 3
Theorem 3 is a direct application of the following lemma to Theorem 1 and Corollary 1.
Lemma 4.
Let $c$ be a lower semicontinuous non-negative cost function. Considering the c-transform operation defined in the text, the following holds for any continuous $D$:
$$C^*_{P_2}(D) \,:=\, \sup_{P_1}\; \mathbb{E}_{P_1}\big[D(\mathbf{X})\big] - C(P_1,P_2) \;=\; \mathbb{E}_{P_2}\big[D^c(\mathbf{X}')\big]. \qquad (32)$$
Proof.
We have
$$\begin{aligned}
\sup_{P_1}\; \mathbb{E}_{P_1}\big[D(\mathbf{X})\big] - C(P_1,P_2)
\;&\overset{(a)}{=}\; \sup_{P_1}\; \mathbb{E}_{P_1}\big[D(\mathbf{X})\big] - \inf_{M\in\Pi(P_1,P_2)}\; \mathbb{E}_M\big[c(\mathbf{X},\mathbf{X}')\big] \\
&\overset{(b)}{=}\; \sup_{P_1}\; \sup_{M\in\Pi(P_1,P_2)}\; \mathbb{E}_M\big[D(\mathbf{X}) - c(\mathbf{X},\mathbf{X}')\big] \\
&\overset{(c)}{\le}\; \mathbb{E}_{P_2}\Big[\sup_{\mathbf{x}}\; D(\mathbf{x}) - c(\mathbf{x},\mathbf{X}')\Big] \;\overset{(d)}{=}\; \mathbb{E}_{P_2}\big[D^c(\mathbf{X}')\big].
\end{aligned}$$
Here (a), (b), (d) hold according to the definitions. Moreover, (c) holds with equality under the lemma's assumptions: since $c$ is lower semicontinuous and $D$ is continuous, for every $\epsilon>0$ there exists a measurable selection $\mathbf{x}(\mathbf{x}')$ such that, for the coupling transporting the mass of $P_2$ at $\mathbf{x}'$ to $\mathbf{x}(\mathbf{x}')$, the objective is within $\epsilon$ of $\mathbb{E}_{P_2}\big[D^c(\mathbf{X}')\big]$. Therefore, (c) holds with equality and the proof is complete. ∎
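The pointwise bound behind step (c) is easy to probe numerically (our toy construction on a three-point support): every coupling with second marginal $P_2$ yields an objective at most $\mathbb{E}_{P_2}[D^c]$, and transporting each $P_2$-atom to an argmax of $D(\mathbf{x}) - c(\mathbf{x},\mathbf{x}')$ attains it:

```python
import random

random.seed(1)

xs = [0.0, 1.0, 2.0]
c = lambda x, y: (x - y) ** 2          # cost function
D = {0.0: 0.5, 1.0: -0.2, 2.0: 1.1}   # an arbitrary discriminator
p2 = [0.3, 0.4, 0.3]                   # distribution P2 over xs

Dc = {y: max(D[x] - c(x, y) for x in xs) for y in xs}   # c-transform D^c
rhs = sum(py * Dc[y] for py, y in zip(p2, xs))          # E_{P2}[D^c]

# weak duality: any coupling with second marginal P2 gives E[D(X) - c(X, X')] <= rhs
for _ in range(100):
    val = 0.0
    for py, y in zip(p2, xs):
        w = [random.random() for _ in xs]           # random split of the mass at y
        s = sum(w)
        val += sum(py * wi / s * (D[x] - c(x, y)) for wi, x in zip(w, xs))
    assert val <= rhs + 1e-12

# equality: send each P2-atom to an argmax of D(x) - c(x, y)
opt = sum(py * max(D[x] - c(x, y) for x in xs) for py, y in zip(p2, xs))
assert abs(opt - rhs) < 1e-12
```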
9.5 Proof of Theorem 4
Consider a convex combination of functions from $\mathcal{F}_{\mathrm{nn}}$, written as $f = \mathbb{E}_{\mathbf{w}\sim p}\big[f_{\mathbf{w}}\big]$, where $p$ can be considered a probability density function over the feasible set $\mathcal{W}$. Consider $m$ samples $\mathbf{w}_1,\dots,\mathbf{w}_m$ drawn i.i.d. from $p$. Since any $f_{\mathbf{w}}$ is bounded, $|f_{\mathbf{w}}(\mathbf{x})|\le M$, according to Hoeffding's inequality for a fixed $\mathbf{x}$ we have
$$\Pr\Big( \Big| \frac{1}{m}\sum_{i=1}^m f_{\mathbf{w}_i}(\mathbf{x}) - f(\mathbf{x}) \Big| \ge \frac{\epsilon}{2} \Big) \,\le\, 2\exp\Big(\!-\frac{m\epsilon^2}{8M^2}\Big). \qquad (33)$$
Next, we consider a covering of the ball $\{\mathbf{x}: \|\mathbf{x}\|_2 \le R\}$, where the covering radius is chosen appropriately small. We know a covering of bounded size exists [15]. Then, an application of the union bound implies
Hence if we have
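The concentration step (33) can be simulated directly (our toy setup, with `tanh` of a Gaussian weight standing in for a bounded net output at a fixed input); the empirical deviation frequency stays below the Hoeffding bound:

```python
import math, random

random.seed(0)

M, m, eps, trials = 1.0, 200, 0.8, 500
bound = 2 * math.exp(-m * eps ** 2 / (8 * M ** 2))   # Hoeffding bound, values in [-M, M]

f = lambda w: math.tanh(w)        # bounded "net output" at a fixed input, |f| <= 1
true_mean = 0.0                   # E[tanh(W)] = 0 for W symmetric around 0

deviations = 0
for _ in range(trials):
    avg = sum(f(random.gauss(0, 1)) for _ in range(m)) / m
    if abs(avg - true_mean) >= eps / 2:
        deviations += 1

assert deviations / trials <= bound + 0.01
```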