# A Convex Duality Framework for GANs

Generative adversarial network (GAN) is a minimax game between a generator mimicking the true model and a discriminator distinguishing the samples produced by the generator from the real training samples. Given an unconstrained discriminator able to approximate any function, this game reduces to finding the generative model minimizing a divergence measure, e.g. the Jensen-Shannon (JS) divergence, to the data distribution. However, in practice the discriminator is constrained to be in a smaller class F such as neural nets. Then, a natural question is how the divergence minimization interpretation changes as we constrain F. In this work, we address this question by developing a convex duality framework for analyzing GANs. For a convex set F, this duality framework interprets the original GAN formulation as finding the generative model with minimum JS-divergence to the distributions penalized to match the moments of the data distribution, with the moments specified by the discriminators in F. We show that this interpretation more generally holds for f-GAN and Wasserstein GAN. As a byproduct, we apply the duality framework to a hybrid of f-divergence and Wasserstein distance. Unlike the f-divergence, we prove that the proposed hybrid divergence changes continuously with the generative model, which suggests regularizing the discriminator's Lipschitz constant in f-GAN and vanilla GAN. We numerically evaluate the power of the suggested regularization schemes for improving GAN's training performance.


## 1 Introduction

Learning a probability model from data samples is a fundamental task in unsupervised learning. The recently developed generative adversarial network (GAN) goodfellow2014generative leverages the power of deep neural networks to successfully address this task across various domains goodfellow2016nips. In contrast to traditional methods of parameter fitting like maximum likelihood estimation, the GAN approach views the problem as a game between a generator whose goal is to generate fake samples that are close to the real data training samples and a discriminator whose goal is to distinguish between the real and fake samples. The generator creates the fake samples by mapping from random noise input.

The following minimax problem is the original GAN problem, also called vanilla GAN, introduced in goodfellow2014generative:

$$\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}[\log D(X)] + \mathbb{E}[\log(1-D(G(Z)))]. \tag{1}$$

Here $Z$ denotes the generator’s noise input, $X$ represents the random vector for the real data distributed as $P_X$, and $\mathcal{G}$ and $\mathcal{F}$ respectively represent the generator and discriminator function sets. Implementing this minimax game using deep neural network classes $\mathcal{G}$ and $\mathcal{F}$ has led to state-of-the-art generative models for many different tasks.

To shed light on the probabilistic meaning of vanilla GAN, goodfellow2014generative shows that given an unconstrained discriminator $D$, i.e. if $\mathcal{F}$ contains all possible functions, the minimax problem (1) will reduce to

$$\min_{G\in\mathcal{G}}\ \mathrm{JSD}(P_X, P_{G(Z)}), \tag{2}$$

where $\mathrm{JSD}$ denotes the Jensen-Shannon (JS) divergence. The optimization problem (2) can be interpreted as finding the closest generative model to the data distribution (Figure 1a), where distance is measured using the JS-divergence. Various GAN formulations were later proposed by changing the divergence measure in (2): f-GAN nowozin2016f generalizes vanilla GAN by minimizing a general f-divergence; Wasserstein GAN (WGAN) arjovsky2017wasserstein considers the first-order Wasserstein (the earth-mover’s) distance; MMD-GAN dziugaite2015training ; li2015generative ; li2017mmd considers the maximum mean discrepancy; Energy-based GAN zhao2016energy minimizes the total variation distance as discussed in arjovsky2017wasserstein ; Quadratic GAN feizi2017understanding finds the distribution minimizing the second-order Wasserstein distance.
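The reduction from (1) to (2) can be checked numerically on a finite support. The sketch below is only an illustration: the two distributions are made-up values, and it uses the standard fact that the unconstrained maximizer is $D(x) = p(x)/(p(x)+q(x))$, for which the objective in (1) equals $2\,\mathrm{JSD} - \log 4$ with natural logarithms.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_objective(p, q, d):
    """E_p[log D(X)] + E_q[log(1 - D(X))] for discriminator values d[x] in (0, 1)."""
    return sum(pi * math.log(di) + qi * math.log(1 - di)
               for pi, qi, di in zip(p, q, d))

p_data = [0.5, 0.3, 0.2]   # hypothetical data distribution P_X
p_gen = [0.2, 0.3, 0.5]    # hypothetical generator distribution P_G(Z)

# Unconstrained optimal discriminator and the objective value it attains.
d_opt = [pi / (pi + qi) for pi, qi in zip(p_data, p_gen)]
best = gan_objective(p_data, p_gen, d_opt)
```

Any other discriminator, for instance the constant `0.5`, gives a value no larger than `best`, which is why minimizing over the generator amounts to minimizing the JS-divergence in this unconstrained setting.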

However, GANs trained in practice differ from this minimum divergence formulation, since their discriminator is not optimized over an unconstrained set and is constrained to smaller classes such as neural nets. As shown in arora2017generalization , constraining the discriminator is in fact necessary to guarantee good generalization properties for GAN’s learned model. Then, how does the minimum divergence interpretation (2) change as we constrain $\mathcal{F}$? A standard approach used in arora2017generalization ; liu2017approximation is to view the maximum discriminator objective as an $\mathcal{F}$-based distance between distributions. For unconstrained $\mathcal{F}$, the $\mathcal{F}$-based distance reduces to the original divergence measure, e.g. the JS-divergence in vanilla GAN.

While $\mathcal{F}$-based distances have been shown to be useful for analyzing GAN’s generalization properties arora2017generalization , their connection to the original divergence measure remains unclear for a constrained $\mathcal{F}$. Then, what is the interpretation of the GAN minimax game with a constrained discriminator? In this work, we address this question by interpreting the dual problem to the discriminator optimization. To analyze the dual problem, we develop a convex duality framework for general divergence minimization problems. We apply the duality framework to the f-divergence and optimal transport cost families, providing interpretation for f-GAN, including vanilla GAN minimizing JS-divergence, and Wasserstein GAN.

Specifically, we generalize the interpretation for unconstrained $\mathcal{F}$ in (2) to any linear space discriminator set $\mathcal{F}$. For this class of discriminator sets, we interpret vanilla GAN as the following JS-divergence minimization between two sets of probability distributions, the set of generative models and the set of discriminator moment-matching distributions (Figure 1b),

$$\min_{G\in\mathcal{G}}\ \min_{Q\in\mathcal{P}_{\mathcal{F}}(P_X)}\ \mathrm{JSD}(P_{G(Z)}, Q). \tag{3}$$

Here $\mathcal{P}_{\mathcal{F}}(P_X)$ contains any distribution $Q$ satisfying the moment matching constraint $\mathbb{E}_Q[D(X)] = \mathbb{E}_{P_X}[D(X)]$ for all discriminators $D$ in $\mathcal{F}$. More generally, we show that a similar interpretation applies to GANs trained over any convex discriminator set $\mathcal{F}$. We further discuss the application of our duality framework to neural net discriminators with bounded Lipschitz constant. While a set of neural network functions is not necessarily convex, we prove any convex combination of Lipschitz-bounded neural nets can be approximated by uniformly combining boundedly-many neural nets. This result applied to our duality framework suggests considering a uniform mixture of multiple neural nets as the discriminator.

As a byproduct, we apply the duality framework to the minimum sum hybrid of f-divergence and the first-order Wasserstein ($W_1$) distance, e.g. the following hybrid of JS-divergence and $W_1$ distance:

$$d_{\mathrm{JSD},W_1}(P_1,P_2) := \min_{Q}\ W_1(P_1,Q) + \mathrm{JSD}(Q,P_2). \tag{4}$$

We prove that this hybrid divergence enjoys a continuous behavior in distribution $P_1$. Therefore, the hybrid divergence provides a remedy for the discontinuous behavior of the JS-divergence when optimizing the generator parameters in vanilla GAN. arjovsky2017wasserstein observes this issue with the JS-divergence in vanilla GAN and proposes to instead minimize the continuously-changing $W_1$ distance in WGAN. However, as empirically demonstrated in miyato2018spectral , vanilla GAN with Lipschitz-bounded discriminator remains the state-of-the-art method for training deep generative models in several benchmark tasks. Here, we leverage our duality framework to prove that the hybrid $d_{\mathrm{JSD},W_1}$, which possesses the same continuity property as the $W_1$ distance, is in fact the divergence measure minimized in vanilla GAN with $1$-Lipschitz discriminator. Our analysis hence provides an explanation for why regularizing the discriminator’s Lipschitz constant via gradient penalty gulrajani2017improved or spectral normalization miyato2018spectral improves the training performance in vanilla GAN. We then extend our focus to the hybrid of f-divergence and the second-order Wasserstein ($W_2$) distance. In this case, we derive the f-GAN (e.g. vanilla GAN) problem with its discriminator being adversarially trained using Wasserstein risk minimization sinha2018certifiable . We numerically evaluate the power of these families of hybrid divergences in training vanilla GAN.

## 2 Divergence Measures

### 2.1 Jensen-Shannon divergence

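The Jensen-Shannon divergence is defined in terms of the KL-divergence (denoted by $\mathrm{KL}$) as

$$\mathrm{JSD}(P,Q) := \frac{1}{2}\,\mathrm{KL}(P\,\|\,M) + \frac{1}{2}\,\mathrm{KL}(Q\,\|\,M)$$

where $M = \frac{P+Q}{2}$ is the mid-distribution between $P$ and $Q$. Unlike the KL-divergence, the JS-divergence is symmetric, $\mathrm{JSD}(P,Q) = \mathrm{JSD}(Q,P)$, and bounded, $0 \le \mathrm{JSD}(P,Q) \le \log 2$ (with natural logarithms; the upper bound is $1$ with base-2 logarithms).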

### 2.2 f-divergence

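The f-divergence family csiszar2004information generalizes the KL and JS divergence measures. Given a convex lower semicontinuous function $f$ with $f(1)=0$, the f-divergence $d_f$ is defined as

$$d_f(P,Q) := \mathbb{E}_P\Big[f\Big(\frac{q(X)}{p(X)}\Big)\Big] = \int p(x)\, f\Big(\frac{q(x)}{p(x)}\Big)\,dx. \tag{5}$$

Here $\mathbb{E}_P$ denotes expectation over distribution $P$ and $p, q$ denote the density functions for distributions $P, Q$, respectively. The KL-divergence and the JS-divergence are members of the f-divergence family, corresponding, up to the order of arguments and a constant factor, to $f_{\mathrm{KL}}(t) = t\log t$ and $f_{\mathrm{JSD}}(t) = t\log t - (t+1)\log\frac{t+1}{2}$, respectively.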
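As a quick sanity check of definition (5) on a finite support, the following sketch evaluates $d_f$ for two choices of $f$. The generator functions $t\log t$ and $t\log t-(t+1)\log\frac{t+1}{2}$ are the standard ones for the KL and JS families and are assumed here; the distributions are made-up values. Under the convention of (5), the first choice yields $\mathrm{KL}(Q\,\|\,P)$ and the second yields twice the JS-divergence.

```python
import math

def f_divergence(f, p, q):
    """d_f(P, Q) = sum_x p(x) f(q(x) / p(x)), as in (5); assumes p(x) > 0."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

def f_kl(t):
    return t * math.log(t) if t > 0 else 0.0

def f_jsd(t):
    return f_kl(t) - (t + 1) * math.log((t + 1) / 2)

def kl(a, b):
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.3, 0.2]   # hypothetical distribution P
q = [0.1, 0.6, 0.3]   # hypothetical distribution Q

kl_via_f = f_divergence(f_kl, p, q)     # matches kl(q, p)
jsd_via_f = f_divergence(f_jsd, p, q)   # matches 2 * jsd(p, q)
```

Both generator functions vanish at $t=1$, so the divergence of a distribution from itself is zero, as required.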

### 2.3 Optimal transport cost, Wasserstein distance

The optimal transport cost for cost function $c(x,x')$, which we denote by $\mathrm{OT}_c$, is defined as

$$\mathrm{OT}_c(P,Q) := \inf_{M\in\Pi(P,Q)}\ \mathbb{E}[c(X,X')], \tag{6}$$

where $\Pi(P,Q)$ contains all couplings with marginals $P, Q$. The Kantorovich duality villani2008optimal shows that for a non-negative lower semi-continuous cost $c$,

$$\mathrm{OT}_c(P,Q) = \max_{D\ c\text{-concave}}\ \mathbb{E}_P[D(X)] - \mathbb{E}_Q[D^c(X)], \tag{7}$$

where we use $D^c$ to denote $D$’s c-transform defined as $D^c(x) := \sup_{x'} D(x') - c(x,x')$ and call $D$ c-concave if $D$ is the c-transform of a valid function. Considering the norm-based cost $c_q(x,x') = \|x-x'\|^q$ with $q\ge 1$, the $q$th order Wasserstein distance $W_q$ is defined based on the optimal transport cost as

$$W_q(P,Q) := \mathrm{OT}_{c_q}(P,Q)^{1/q} = \inf_{M\in\Pi(P,Q)}\ \mathbb{E}\big[\|X-X'\|^q\big]^{1/q}. \tag{8}$$

An important special case is the first-order Wasserstein ($W_1$) distance corresponding to the difference norm cost $c_1(x,x') = \|x-x'\|$. Given cost function $c_1$, a function $D$ is c-concave if and only if $D$ is $1$-Lipschitz, and the c-transform $D^c = D$ for any $1$-Lipschitz $D$. Therefore, the Kantorovich duality (7) implies that

$$W_1(P,Q) = \max_{D\ 1\text{-Lipschitz}}\ \mathbb{E}_P[D(X)] - \mathbb{E}_Q[D(X)]. \tag{9}$$

Another notable special case is the second-order Wasserstein ($W_2$) distance, corresponding to the difference norm-squared cost $c_2(x,x') = \|x-x'\|^2$.
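The two facts used above for the $W_1$ case, that a $1$-Lipschitz function is its own c-transform under the absolute-difference cost and that the dual value matches the transport cost, can be illustrated on a small one-dimensional grid. This is a sketch under stated assumptions: the grid and the two distributions are made up, the primal value uses the standard one-dimensional identity $W_1 = \int |F_P - F_Q|$, and $D(x) = -x$ is picked by hand as a dual solution for this particular pair.

```python
def c_transform(d_vals, grid, cost):
    """D^c(x) = max over x' of D(x') - c(x, x'), evaluated on a finite grid."""
    return [max(dv - cost(x, xp) for dv, xp in zip(d_vals, grid)) for x in grid]

def w1_on_grid(p, q, grid):
    """W_1 between two distributions on a sorted 1-D grid via the CDF formula."""
    total, cdf_gap = 0.0, 0.0
    for i in range(len(grid) - 1):
        cdf_gap += p[i] - q[i]
        total += abs(cdf_gap) * (grid[i + 1] - grid[i])
    return total

def abs_cost(x, xp):
    return abs(x - xp)

grid = [0.0, 1.0, 2.0, 3.0]
p = [0.4, 0.3, 0.2, 0.1]    # hypothetical distribution P
q = [0.1, 0.2, 0.3, 0.4]    # hypothetical distribution Q

d_lip = [-x for x in grid]                  # D(x) = -x is 1-Lipschitz
d_c = c_transform(d_lip, grid, abs_cost)    # coincides with d_lip

# Dual objective of (7) at D, compared with the primal transport cost.
dual_value = (sum(pi * di for pi, di in zip(p, d_lip))
              - sum(qi * di for qi, di in zip(q, d_c)))
primal_value = w1_on_grid(p, q, grid)
```

For a general pair of distributions the maximizing $1$-Lipschitz function has to be searched for; here `q` places more mass to the right than `p` at every threshold, so the linear function above already closes the gap.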

## 3 Divergence minimization in GANs: a convex duality framework

In this section, we develop a convex duality framework for analyzing divergence minimization problems subject to moment-matching constraints. Our framework generalizes the duality framework developed in altun2006unifying for the f-divergence family.

For a general divergence measure $d(P,Q)$, we define $d$’s conjugate over distribution $P$, which we denote by $d^*_P$, as the following mapping from real-valued functions of $X$ to real numbers

$$d^*_P(D) := \sup_{Q}\ \mathbb{E}_Q[D(X)] - d(P,Q). \tag{10}$$

Here the supremum is over all distributions on $X$ with support set $\mathcal{X}$. We later show that the following theorem, which is based on the above definition, recovers various well-known GAN formulations when applied to the divergence measures discussed in Section 2.

###### Theorem 1.

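Suppose divergence $d(P,Q)$ is non-negative, lower semicontinuous and convex in distribution $Q$. Consider a convex set of continuous functions $\mathcal{F}$ and assume support set $\mathcal{X}$ is compact. Then,

$$\begin{aligned} &\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}_{P_X}[D(X)] - d^*_{P_{G(Z)}}(D) \\ =\ &\min_{G\in\mathcal{G}}\ \min_{Q}\ \Big\{ d(P_{G(Z)},Q) + \max_{D\in\mathcal{F}} \big\{ \mathbb{E}_{P_X}[D(X)] - \mathbb{E}_Q[D(X)] \big\} \Big\}. \end{aligned} \tag{11}$$

###### Proof.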

We defer the proof to the Appendix. ∎

Theorem 1 interprets (11)’s LHS minimax problem as searching for the closest generative model to the distributions penalized to share the same moments specified by $\mathcal{F}$ with $P_X$. The following corollary of Theorem 1 shows that if we further assume $\mathcal{F}$ is a linear space, then the penalty term penalizing moment mismatches can be moved to the constraints. This reduction reveals a divergence minimization problem between generative models and the following set, which we call the set of discriminator moment-matching distributions,

$$\mathcal{P}_{\mathcal{F}}(P) := \big\{ Q:\ \forall D\in\mathcal{F},\ \mathbb{E}_Q[D(X)] = \mathbb{E}_P[D(X)] \big\}. \tag{12}$$

###### Corollary 1.

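In Theorem 1 suppose $\mathcal{F}$ is further a linear space, i.e. for any $D_1, D_2\in\mathcal{F}$ and $\lambda\in\mathbb{R}$ we have $D_1+\lambda D_2\in\mathcal{F}$. Then,

$$\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}_{P_X}[D(X)] - d^*_{P_{G(Z)}}(D) = \min_{G\in\mathcal{G}}\ \min_{Q\in\mathcal{P}_{\mathcal{F}}(P_X)}\ d(P_{G(Z)},Q). \tag{13}$$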
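To make the moment-matching set in (12) concrete, the sketch below tests membership on a three-point support when $\mathcal{F}$ is the linear span of a few feature functions, and shows why the mismatch penalty turns into a hard constraint for a linear space: scaling a discriminator scales the penalty, so any nonzero moment gap can be made arbitrarily large. The feature functions and distributions are hypothetical choices made for illustration.

```python
def moments(dist, features):
    """E_dist[D(X)] for each feature function D on the support {0, ..., n-1}."""
    return [sum(prob * d(x) for x, prob in enumerate(dist)) for d in features]

def in_moment_matching_set(q, p, features, tol=1e-9):
    """Membership test for the set in (12) when F is the span of `features`."""
    return all(abs(a - b) <= tol
               for a, b in zip(moments(q, features), moments(p, features)))

def penalty(q, p, features, scale):
    """Max of E_P[D] - E_Q[D] over D in {+scale * feature, -scale * feature}."""
    gaps = [a - b for a, b in zip(moments(p, features), moments(q, features))]
    return scale * max(abs(g) for g in gaps)

features = [lambda x: 1.0, lambda x: float(x)]   # hypothetical F: affine functions
p_x = [0.25, 0.50, 0.25]      # mean 1
q_match = [0.40, 0.20, 0.40]  # same mean as p_x, different shape
q_miss = [0.10, 0.30, 0.60]   # mean 1.5

matches = in_moment_matching_set(q_match, p_x, features)
misses = in_moment_matching_set(q_miss, p_x, features)
```

With only affine features, `q_match` is indistinguishable from `p_x` even though the two distributions differ, which is exactly the sense in which a restricted $\mathcal{F}$ enlarges the target set of the divergence minimization.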

In the next section, we apply this duality framework to the divergence measures discussed in Section 2 and show how to derive various GAN problems through the developed framework.

## 4 Duality framework applied to different divergence measures

### 4.1 f-divergence: f-GAN and vanilla GAN

Theorem 2 shows the application of Theorem 1 to f-divergences. We use $f^*$ to denote $f$’s convex-conjugate boyd2004convex , defined as $f^*(u) := \sup_{t}\ ut - f(t)$. Note that Theorem 2 applies to any f-divergence with non-decreasing convex-conjugate $f^*$, which holds for all f-divergence examples discussed in nowozin2016f with the only exception of the Pearson $\chi^2$-divergence.

###### Theorem 2.

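Consider f-divergence $d_f$ where the corresponding $f$ has a non-decreasing convex-conjugate $f^*$. In addition to Theorem 1’s assumptions, suppose $\mathcal{F}$ is closed to adding constant functions, i.e. $D+\lambda\in\mathcal{F}$ if $D\in\mathcal{F},\ \lambda\in\mathbb{R}$. Then, the minimax problem in the LHS of (11) and (13) will reduce to

$$\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}[D(X)] - \mathbb{E}\big[f^*(D(G(Z)))\big]. \tag{14}$$

###### Proof.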

We defer the proof to the Appendix. ∎

The minimax problem (14) is in fact the f-GAN problem nowozin2016f . Theorem 2 hence reveals that f-GAN searches for the generative model minimizing f-divergence to the distributions matching moments specified by $\mathcal{F}$ to the moments of the true distribution.

###### Example 1.

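Consider the JS-divergence, i.e. f-divergence corresponding to $f_{\mathrm{JSD}}(t) = t\log t - (t+1)\log\frac{t+1}{2}$. Then, (14) up to additive and multiplicative constants reduces to

$$\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}[D(X)] + \mathbb{E}\big[\log\big(1-\exp(D(G(Z)))\big)\big]. \tag{15}$$

Moreover, if for function set $\tilde{\mathcal{F}}$ the corresponding $\mathcal{F} = \{-\log(1+\exp(\tilde{D})) :\ \tilde{D}\in\tilde{\mathcal{F}}\}$ is a convex set, then (15) will reduce to the following minimax game, which is the vanilla GAN problem (1) with sigmoid activation applied to the discriminator output,

$$\min_{G\in\mathcal{G}}\ \max_{\tilde{D}\in\tilde{\mathcal{F}}}\ \mathbb{E}\Big[\log\frac{1}{1+\exp(\tilde{D}(X))}\Big] + \mathbb{E}\Big[\log\frac{\exp(\tilde{D}(G(Z)))}{1+\exp(\tilde{D}(G(Z)))}\Big]. \tag{16}$$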
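The step from (15) to (16) is a change of variable. Assuming the substitution $D = -\log(1+\exp(\tilde{D}))$, which is our reading of how the sigmoid output enters, the real-sample term $D(X)$ becomes $\log\frac{1}{1+\exp(\tilde{D}(X))}$ by definition, and the generated-sample term $\log(1-\exp(D))$ becomes $\log\frac{\exp(\tilde{D})}{1+\exp(\tilde{D})}$. The sketch below checks the second identity numerically at a few arbitrary logit values.

```python
import math

def vanilla_real_term(d_tilde):
    """log(1 / (1 + exp(D~))), the real-sample term of the sigmoid form."""
    return -math.log1p(math.exp(d_tilde))

def vanilla_fake_term(d_tilde):
    """log(exp(D~) / (1 + exp(D~))), the generated-sample term of the sigmoid form."""
    return d_tilde - math.log1p(math.exp(d_tilde))

def fgan_fake_term(d):
    """log(1 - exp(D)), the generated-sample term in (15); requires D < 0."""
    return math.log(1.0 - math.exp(d))

for d_tilde in (-3.0, -0.5, 0.0, 2.0):
    d = vanilla_real_term(d_tilde)   # assumed substitution D = -log(1 + exp(D~))
    assert abs(fgan_fake_term(d) - vanilla_fake_term(d_tilde)) < 1e-12
```

Since $-\log(1+\exp(\tilde{D}))$ is always negative, the substituted $D$ stays inside the domain where $\log(1-\exp(D))$ is defined.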

### 4.2 Optimal Transport Cost: Wasserstein GAN

###### Theorem 3.

Let divergence $d$ be the optimal transport cost $\mathrm{OT}_c$ where $c$ is a non-negative lower semicontinuous cost function. Then, the minimax problem in the LHS of (11) and (13) reduces to

$$\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}[D(X)] - \mathbb{E}\big[D^c(G(Z))\big]. \tag{17}$$

###### Proof.

We defer the proof to the Appendix. ∎

Therefore the minimax game between $G$ and $D$ in (17) can be viewed as minimizing the optimal transport cost between generative models and the distributions matching moments over $\mathcal{F}$ with $P_X$’s moments. The following example applies this result to the first-order Wasserstein distance and recovers the WGAN problem arjovsky2017wasserstein with a constrained $1$-Lipschitz discriminator.

###### Example 2.

Let the optimal transport cost in (17) be the $W_1$ distance, and suppose $\mathcal{F}$ is a convex subset of 1-Lipschitz functions. Then, the minimax problem (17) will reduce to

$$\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}[D(X)] - \mathbb{E}[D(G(Z))]. \tag{18}$$

Therefore, the moment-matching interpretation also holds for WGAN: for a convex set $\mathcal{F}$ of $1$-Lipschitz functions, WGAN finds the generative model with minimum $W_1$ distance to the distributions penalized to share the same moments over $\mathcal{F}$ with the data distribution. We discuss two more examples in the Appendix: 1) for the indicator cost $c_I(x,x') = \mathbb{I}(x\neq x')$ corresponding to the total variation distance we draw the connection to the energy-based GAN zhao2016energy , 2) for the second-order cost $c_2(x,x') = \|x-x'\|^2$ we recover feizi2017understanding ’s quadratic GAN formulation under the LQG setting assumptions, i.e. linear generator, quadratic discriminator and Gaussian input data.

## 5 Duality framework applied to neural net discriminators

We applied the duality framework to analyze GAN problems with convex discriminator sets. However, a neural net set $\mathcal{F}_{nn} = \{f_{\mathbf{w}}:\ \mathbf{w}\in\mathcal{W}\}$, where $f_{\mathbf{w}}$ denotes a neural net function with fixed architecture and weights $\mathbf{w}$ in feasible set $\mathcal{W}$, does not generally satisfy this convexity assumption. Note that a linear combination of several neural net functions in $\mathcal{F}_{nn}$ may not remain in $\mathcal{F}_{nn}$.

Therefore, we apply the duality framework to $\mathcal{F}_{nn}$’s convex hull, which we denote by $\mathrm{conv}(\mathcal{F}_{nn})$, containing any convex combination of neural net functions in $\mathcal{F}_{nn}$. However, a convex combination of infinitely-many neural nets from $\mathcal{F}_{nn}$ is characterized by infinitely-many parameters, which makes optimizing the discriminator over $\mathrm{conv}(\mathcal{F}_{nn})$ computationally intractable. In the following theorem, we show that although a function in $\mathrm{conv}(\mathcal{F}_{nn})$ is a combination of infinitely-many neural nets, that function can be approximated by uniformly combining boundedly-many neural nets in $\mathcal{F}_{nn}$.

###### Theorem 4.

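Suppose any function $f_{\mathbf{w}}\in\mathcal{F}_{nn}$ is $L$-Lipschitz and bounded as $|f_{\mathbf{w}}(x)|\le M$. Also, assume that the $k$-dimensional random input $X$ is norm-bounded as $\|X\|_2\le R$. Then, any function in $\mathrm{conv}(\mathcal{F}_{nn})$ can be uniformly approximated over the ball $\|x\|_2\le R$ within $\epsilon$-error by a uniform combination $\hat{f}(x) = \frac{1}{m}\sum_{i=1}^{m} f_{\mathbf{w}_i}(x)$ of $m = O\big(\frac{M^2 k\log(LR/\epsilon)}{\epsilon^2}\big)$ functions $f_{\mathbf{w}_1},\ldots,f_{\mathbf{w}_m}\in\mathcal{F}_{nn}$.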

###### Proof.

We defer the proof to the Appendix. ∎

The above theorem suggests using a uniform combination of multiple discriminator nets to find a better approximation of the solution to the divergence minimization problem in Theorem 1 solved over $\mathrm{conv}(\mathcal{F}_{nn})$. Note that this approach is different from MIX-GAN arora2017generalization , proposed for achieving equilibrium in the GAN minimax game. While our approach considers a uniform combination of multiple neural nets as the discriminator, MIX-GAN considers a randomized combination of the minimax game over multiple neural net discriminators and generators.
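A rough sense of how many discriminators such a uniform mixture needs can be read off the union bound in the appendix proof, $\exp\big(-\frac{m\epsilon^2}{8M^2} + k\log\frac{12LR}{\epsilon} + \log 2\big)$, which must drop below one. The sketch below computes the smallest such $m$ and builds an equal-weight mixture; the function names, the scalar stand-in "nets" and the parameter values are all hypothetical and chosen only for illustration.

```python
import math

def mixture_size(M, L, R, k, eps):
    """Smallest integer m with exp(-m eps^2/(8 M^2) + k log(12 L R/eps) + log 2) < 1."""
    threshold = (8.0 * M ** 2 / eps ** 2) * (
        k * math.log(12.0 * L * R / eps) + math.log(2.0))
    return math.floor(threshold) + 1

def uniform_mixture(discriminators):
    """Discriminator that averages the given functions with equal weights."""
    def mixed(x):
        return sum(d(x) for d in discriminators) / len(discriminators)
    return mixed

# Hypothetical bounded, 1-Lipschitz scalar functions standing in for neural nets.
nets = [lambda x, s=s: math.tanh(x - s) for s in (-1.0, 0.0, 1.0)]
d_mix = uniform_mixture(nets)

m_needed = mixture_size(M=1.0, L=1.0, R=1.0, k=2, eps=0.5)
```

The bound grows linearly in the input dimension `k` and quadratically in `M / eps`, so it is best read as a worst-case guide rather than a practical recipe for choosing the number of discriminators.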

## 6 Minimum-sum hybrid of f-divergence and Wasserstein distance: GAN with Lipschitz or adversarially-trained discriminator

Here we apply the convex duality framework to a novel class of divergence measures. For each f-divergence $d_f$ we define divergence $d_{f,W_1}$, which is the minimum sum hybrid of the $d_f$ and $W_1$ divergences, as follows

$$d_{f,W_1}(P_1,P_2) := \inf_{Q}\ W_1(P_1,Q) + d_f(Q,P_2). \tag{19}$$

The above infimum is taken over all distributions on random $X$, searching for the distribution $Q$ minimizing the sum of the Wasserstein distance between $P_1$ and $Q$ and the f-divergence from $Q$ to $P_2$. Note that the hybrid of JS-divergence and $W_1$-distance defined earlier in (4) is a special case of the above definition. While the f-divergence in f-GAN does not change continuously with the generator parameters, the following theorem shows that, similar to the continuous behavior of the $W_1$-distance shown in arjovsky2017towards ; arjovsky2017wasserstein , the proposed hybrid divergence changes continuously with the generative model. We defer the proofs of this section’s results to the Appendix.

###### Theorem 5.

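Suppose $G_\theta\in\mathcal{G}$ is continuously changing with parameters $\theta$. Then, for any $Q$ and $Z$, $d_{f,W_1}(P_{G_\theta(Z)},Q)$ will behave continuously as a function of $\theta$. Moreover, if $G_\theta$ is assumed to be locally Lipschitz, then $d_{f,W_1}(P_{G_\theta(Z)},Q)$ will be differentiable w.r.t. $\theta$ almost everywhere.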
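The contrast with the plain JS-divergence can be seen in the familiar point-mass example, used here as an illustrative assumption rather than an experiment from this work. Take $P_1 = \delta_0$ and $P_2 = \delta_\theta$: the JS-divergence equals $\log 2$ for every $\theta\neq 0$ and jumps to $0$ at $\theta = 0$. For the hybrid, restricting the infimum to $Q = (1-a)\delta_0 + a\,\delta_\theta$ gives an upper bound $a|\theta| + \mathrm{JSD}(Q,\delta_\theta)$, which the sketch below minimizes over a grid of $a$ values.

```python
import math

def jsd_two_point(a):
    """JSD between Q = (1 - a) delta_0 + a delta_theta and delta_theta, theta != 0."""
    def kl(u, v):
        return sum(ui * math.log(ui / vi) for ui, vi in zip(u, v) if ui > 0)
    q, p2 = [1.0 - a, a], [0.0, 1.0]
    m = [(qi + pi) / 2 for qi, pi in zip(q, p2)]
    return 0.5 * kl(q, m) + 0.5 * kl(p2, m)

def hybrid_upper_bound(theta, steps=1000):
    """Upper bound on the hybrid divergence between delta_0 and delta_theta.

    Only mixtures Q of the two point masses are searched, so the true
    infimum over all Q can be smaller.
    """
    if theta == 0:
        return 0.0
    candidates = (i / steps for i in range(steps + 1))
    return min(a * abs(theta) + jsd_two_point(a) for a in candidates)

bounds = {theta: hybrid_upper_bound(theta) for theta in (0.001, 0.01, 0.1, 1.0, 10.0)}
```

Because the hybrid is non-negative and bounded above by $\min(|\theta|, \log 2)$, it tends to zero as $\theta\to 0$, whereas the JS-divergence stays at $\log 2$ until the two point masses coincide.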

Our next result reveals the minimax problem dual to minimizing this hybrid divergence with symmetric f-divergence component. We note that this symmetricity condition is met by the JS-divergence and the squared Hellinger divergence among the f-divergence examples discussed in nowozin2016f .

###### Theorem 6.

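Consider $d_{f,W_1}$ with a symmetric f-divergence $d_f$, i.e. $d_f(P,Q) = d_f(Q,P)$, satisfying the assumptions in Theorem 2. If the composition $f^*\circ D$ is 1-Lipschitz for all $D\in\mathcal{F}$, the minimax problem in Theorem 1 for the hybrid $d_{f,W_1}$ reduces to the f-GAN problem, i.e.

$$\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}[D(X)] - \mathbb{E}\big[f^*(D(G(Z)))\big]. \tag{20}$$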

The above theorem reveals that when the Lipschitz constant of the discriminator in f-GAN is properly regularized, solving the f-GAN problem over the regularized discriminator also minimizes the continuous divergence measure $d_{f,W_1}$. As a special case, in the vanilla GAN problem (16) we only need to constrain discriminator $\tilde{D}$ to be 1-Lipschitz, which can be done via the gradient penalty gulrajani2017improved or spectral normalization of $\tilde{D}$’s weight matrices miyato2018spectral , and then we minimize the continuously-behaving $d_{\mathrm{JSD},W_1}$. This result is also consistent with miyato2018spectral ’s empirical observations that regularizing the Lipschitz constant of the discriminator improves the training performance in vanilla GAN.
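For intuition on how weight-matrix normalization bounds a Lipschitz constant, the following sketch estimates a matrix’s spectral norm by power iteration and rescales the matrix to unit spectral norm. It is a generic textbook routine written for illustration, not the implementation of the cited spectral normalization method; the function names and the toy weight matrix are hypothetical, and a nonzero matrix is assumed.

```python
import math

def spectral_norm(w, iters=200):
    """Estimate the largest singular value of `w` (a list of rows) by power iteration."""
    n_rows, n_cols = len(w), len(w[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        u = [sum(wij * vj for wij, vj in zip(row, v)) for row in w]          # u = W v
        v = [sum(w[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(wij * vj for wij, vj in zip(row, v)) for row in w]
    return math.sqrt(sum(x * x for x in u))

def spectrally_normalize(w):
    """Rescale `w` so that its spectral norm is approximately one."""
    sigma = spectral_norm(w)
    return [[wij / sigma for wij in row] for row in w]

w = [[3.0, 0.0], [0.0, 1.0]]    # hypothetical layer weights with spectral norm 3
w_sn = spectrally_normalize(w)
```

A linear layer $x\mapsto Wx$ with spectral norm at most one is $1$-Lipschitz in the Euclidean norm, and composing such layers with $1$-Lipschitz activations keeps the whole network $1$-Lipschitz, which is the property the theorem asks of the discriminator.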

Our discussion has so far focused on the mixture of f-divergence and the first-order Wasserstein distance, which suggests training f-GAN over Lipschitz-bounded discriminators. As a second solution, we prove that the desired continuity property can also be achieved through the following hybrid using the second-order Wasserstein ($W_2$) distance-squared:

$$d_{f,W_2}(P_1,P_2) := \inf_{Q}\ W_2^2(P_1,Q) + d_f(Q,P_2). \tag{21}$$

###### Theorem 7.

Suppose $G_\theta\in\mathcal{G}$ continuously changes with parameters $\theta$. Then, for any distribution $Q$ and random vector $Z$, $d_{f,W_2}(P_{G_\theta(Z)},Q)$ will be continuous in $\theta$. Also, if we further assume $G_\theta$ is bounded and locally-Lipschitz w.r.t. $\theta$, then the hybrid divergence $d_{f,W_2}(P_{G_\theta(Z)},Q)$ is almost everywhere differentiable w.r.t. $\theta$.

The following result shows that minimizing $d_{f,W_2}$ reduces to the f-GAN problem where the discriminator is being adversarially trained.

###### Theorem 8.

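Assume $d_f$ and $\mathcal{F}$ satisfy the assumptions in Theorem 6. Then, the minimax problem in Theorem 1 corresponding to the hybrid $d_{f,W_2}$ divergence reduces to

$$\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}[D(X)] + \mathbb{E}\Big[\min_{u}\ -f^*\big(D(G(Z)+u)\big) + \|u\|^2\Big]. \tag{22}$$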

The above result reduces minimizing the hybrid $d_{f,W_2}$ divergence to an f-GAN minimax game with a new third player. Here the third player assists the generator by perturbing the generated fake samples in order to make them harder for the discriminator to distinguish from the real samples. The cost for perturbing a fake sample $G(Z)$ to $G(Z)+u$ will be $\|u\|^2$, which constrains the power of the third player, who can be interpreted as an adversary to the discriminator. To implement the game between these three players, we can adversarially learn the discriminator while we are training GAN, using the Wasserstein risk minimization (WRM) adversarial learning scheme discussed in sinha2018certifiable .
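The third player’s move in (22) is an inner minimization over the perturbation $u$ for each fake sample. A minimal sketch of that inner step, assuming a scalar sample and plain gradient descent, is given below; `h` stands in for $-f^*(D(\cdot))$ and `h_grad` for its derivative, both hypothetical callables, and the routine is not the WRM procedure of the cited work.

```python
def perturb_fake_sample(x, h, h_grad, lr=0.1, steps=200):
    """Gradient descent on u -> h(x + u) + u^2 for a scalar fake sample x.

    `h` plays the role of -f*(D(.)) and `h_grad` is its derivative; both are
    placeholders supplied by the caller.
    """
    u = 0.0
    for _ in range(steps):
        u -= lr * (h_grad(x + u) + 2.0 * u)
    return u, h(x + u) + u * u

# Toy stand-in with a closed-form answer: h(y) = y gives u* = -1/2 and value x - 1/4.
u_star, value = perturb_fake_sample(x=2.0, h=lambda y: y, h_grad=lambda y: 1.0)
```

The quadratic term keeps the perturbation small: the steeper `h` is around the sample, the further the adversary is willing to move it, but always at a cost that grows with the squared distance.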

## 7 Numerical Experiments

To evaluate our theoretical results, we used the CelebA liu2015faceattributes and LSUN-bedroom yu2015lsun datasets. Furthermore, in the Appendix we include the results of our experiments over the MNIST lecun1998mnist dataset. We considered vanilla GAN goodfellow2014generative with the minimax formulation in (16) and DCGAN radford2015unsupervised convolutional architecture for discriminator and generator. We used the code provided by gulrajani2017improved and trained DCGAN via Adam optimizer kingma2014adam for 200,000 generator iterations. We applied 5 discriminator updates for each generator update.

Figure 2 shows how the discriminator loss evaluated over 2000 validation samples, which is an estimate of the divergence measure, changes as we train the DCGAN over LSUN samples. Using standard DCGAN regularized by only batch normalization (BN) ioffe2015batch , we observed (Figure 2-left) that the JS-divergence estimate always remains close to its maximum value and also poorly correlates with the visual quality of generated samples. In this experiment, the GAN training failed and led to mode collapse starting at about the 110,000th iteration. On the other hand, after replacing BN with spectral normalization (SN) miyato2018spectral to ensure the discriminator’s Lipschitzness, the discriminator loss decreased in a desired monotonic fashion (Figure 2-right). This observation is consistent with Theorems 5 and 6 showing that the discriminator loss becomes an estimate for the hybrid divergence changing continuously with the generator parameters. Also, the samples generated by the Lipschitz-regularized DCGAN looked qualitatively better and correlated well with the estimate of divergence.

Figure 3 shows the results of similar experiments over the CelebA dataset. Again, we observed (Figure 3-top left) that the JS-divergence estimate remains close to its maximum value while training DCGAN with BN. However, after applying two different Lipschitz regularization methods, SN and the gradient penalty (GP) gulrajani2017improved , in Figures 3-top right and bottom left, we observed that the hybrid $d_{\mathrm{JSD},W_1}$ decreased monotonically and correlated properly with the sharpness of the generated samples. Figure 3-bottom right shows that a similar desired behavior can also be achieved using the second-order hybrid $d_{\mathrm{JSD},W_2}$ divergence. In this case, we trained the DCGAN discriminator via the WRM adversarial learning scheme sinha2018certifiable .

## 8 Related Work

Theoretical studies of GAN have focused on three different aspects: approximation, generalization, and optimization. On the approximation properties of GAN, liu2017approximation studies GAN’s approximation power using a moment-matching approach. The authors view the maximized discriminator objective as an $\mathcal{F}$-based adversarial divergence, showing that the adversarial divergence between two distributions takes its minimum value if and only if the two distributions share the same moments over $\mathcal{F}$. Our convex duality framework interprets their result and further draws the connection to the original divergence measure. nock2017f studies the f-GAN problem through an information geometric approach based on the Bregman divergence and its connection to f-divergence.

Analyzing GAN’s generalization performance is another problem of interest in several recent works. arora2017generalization proves generalization guarantees for GANs in terms of $\mathcal{F}$-based distance measures. arora2017gans uses an elegant approach based on the Birthday Paradox to empirically study the generalizability of GAN’s learned models. santurkar2017classification develops a quantitative approach for examining diversity and generalization in GAN’s learned distribution. zhang2018on studies approximation-generalization trade-offs in GAN by analyzing the discriminative power of $\mathcal{F}$-based distances. Regarding optimization properties of GAN, chen2018training ; zhao2018information propose duality-based methods for improving the optimization performance in training deep generative models. roth2017stabilizing suggests applying noise convolution with input data for boosting the training performance in f-GAN. Moreover, several other works including nagarajan2017gradient ; mescheder2017numerics ; daskalakis2017training ; feizi2017understanding ; sanjabi2018solving explore the optimization and stability properties of training GANs. Finally, we note that the same convex analysis approach used in this paper has further provided a powerful theoretical framework to analyze various supervised and unsupervised learning problems dudik2007maximum ; razaviyayn2015discrete ; farnia2016minimax ; fathony2016adversarial ; fathony2017adversarial .

Acknowledgments: We are grateful for support under a Stanford Graduate Fellowship, the National Science Foundation grant under CCF-1563098, and the Center for Science of Information (CSoI), an NSF Science and Technology Center under grant agreement CCF-0939370.

## 9 Appendix

#### 9.1.1 LSUN divergence estimates for different training schemes

Figure 4 shows the complete divergence estimates over the LSUN dataset for the GAN training schemes described in the main text. While the hybrid divergence measures $d_{\mathrm{JSD},W_1}$ and $d_{\mathrm{JSD},W_2}$ decreased smoothly as the DCGAN was being trained, the JS-divergence always remained close to its maximum value, which led to lower-quality produced samples.

#### 9.1.2 CelebA, LSUN, MNIST images generated by different trainings of DCGAN

Figures 5, 6, and 7 show the CelebA, LSUN, and MNIST samples generated by vanilla DCGAN trained via the different methods described in the main text. Observe that applying Lipschitz regularization and adversarial training to the discriminator consistently results in the highest quality generator output samples. We note that tight SN in these figures refers to [42]’s spectral normalization method for convolutional layers, which precisely normalizes a conv layer’s spectral norm and hence guarantees the $1$-Lipschitzness of the discriminator neural net. Note that for non-tight SN we use the original heuristic for normalizing convolutional layers’ operator norm introduced in [12].

### 9.2 Proof of Theorem 1

Theorem 1 and Corollary 1 directly result from the following two lemmas.

###### Lemma 1.

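Suppose divergence $d(P,Q)$ is non-negative, lower semicontinuous and convex in distribution $Q$. Consider a convex subset of continuous functions $\mathcal{F}$ and assume support set $\mathcal{X}$ is compact. Then, the following duality holds for any pair of distributions $P_1, P_2$:

$$\max_{D\in\mathcal{F}}\ \mathbb{E}_{P_2}[D(X)] - d^*_{P_1}(D) = \min_{Q}\ \Big\{ d(P_1,Q) + \max_{D\in\mathcal{F}} \big\{ \mathbb{E}_{P_2}[D(X)] - \mathbb{E}_Q[D(X)] \big\} \Big\}. \tag{23}$$

###### Proof.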

Note that

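$$\begin{aligned} &\min_{Q}\ \Big\{ d(P_1,Q) + \max_{D\in\mathcal{F}} \big\{ \mathbb{E}_{P_2}[D(X)] - \mathbb{E}_Q[D(X)] \big\} \Big\} \\ &\quad = \min_{Q}\ \max_{D\in\mathcal{F}}\ \big\{ d(P_1,Q) + \mathbb{E}_{P_2}[D(X)] - \mathbb{E}_Q[D(X)] \big\} \\ &\quad \overset{(a)}{=} \max_{D\in\mathcal{F}}\ \min_{Q}\ \big\{ d(P_1,Q) + \mathbb{E}_{P_2}[D(X)] - \mathbb{E}_Q[D(X)] \big\} \\ &\quad = \max_{D\in\mathcal{F}}\ \Big\{ \mathbb{E}_{P_2}[D(X)] + \min_{Q} \big\{ d(P_1,Q) - \mathbb{E}_Q[D(X)] \big\} \Big\} \\ &\quad = \max_{D\in\mathcal{F}}\ \Big\{ \mathbb{E}_{P_2}[D(X)] - \max_{Q} \big\{ \mathbb{E}_Q[D(X)] - d(P_1,Q) \big\} \Big\} \\ &\quad \overset{(b)}{=} \max_{D\in\mathcal{F}}\ \mathbb{E}_{P_2}[D(X)] - d^*_{P_1}(D). \end{aligned} \tag{24}$$

Here (a) is a consequence of the generalized Sion’s minimax theorem [43], because the space of probability measures on compact $\mathcal{X}$ is convex and weakly compact [44], $\mathcal{F}$ is assumed to be convex, and the minimax objective is lower semicontinuous and convex in $Q$ and linear in $D$. (b) holds according to the conjugate $d^*_{P_1}$’s definition. ∎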

###### Lemma 2.

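Assume divergence $d(P,Q)$ is non-negative, lower semicontinuous and convex in distribution $Q$ over compact $\mathcal{X}$. Consider a linear space subset of continuous functions $\mathcal{F}$. Then, the following duality holds for any pair of distributions $P_1, P_2$:

$$\max_{D\in\mathcal{F}}\ \mathbb{E}_{P_2}[D(X)] - d^*_{P_1}(D) = \min_{Q\in\mathcal{P}_{\mathcal{F}}(P_2)}\ d(P_1,Q). \tag{25}$$

###### Proof.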

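This lemma is a consequence of Lemma 1. Note that a linear space is a convex set. Therefore, Lemma 1 applies to $\mathcal{F}$. However, since $\mathcal{F}$ is a linear space, i.e. for any $D_1, D_2\in\mathcal{F}$ and $\lambda\in\mathbb{R}$ it includes $D_1+\lambda D_2$, we have

$$\max_{D\in\mathcal{F}}\ \big\{ \mathbb{E}_{P_2}[D(X)] - \mathbb{E}_Q[D(X)] \big\} = \begin{cases} 0 & \text{if } Q\in\mathcal{P}_{\mathcal{F}}(P_2) \\ +\infty & \text{otherwise.} \end{cases} \tag{26}$$

As a result, the minimizing $Q$ precisely matches the moments over $\mathcal{F}$ to $P_2$’s moments, which completes the proof. ∎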
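The dichotomy between $0$ and $+\infty$ in the proof follows from a scaling argument, spelled out here as a short worked step. Suppose some $D_0\in\mathcal{F}$ has a nonzero moment gap, $\mathbb{E}_{P_2}[D_0(X)] \neq \mathbb{E}_Q[D_0(X)]$. Since a linear space contains every multiple $\lambda D_0$,

$$\max_{D\in\mathcal{F}}\ \big\{ \mathbb{E}_{P_2}[D(X)] - \mathbb{E}_Q[D(X)] \big\} \ \ge\ \sup_{\lambda\in\mathbb{R}}\ \lambda\,\big( \mathbb{E}_{P_2}[D_0(X)] - \mathbb{E}_Q[D_0(X)] \big) = +\infty .$$

Conversely, if every $D\in\mathcal{F}$ has zero gap, each term in the maximization is $0$, and the value $0$ is attained, for instance by the zero function, which belongs to any linear space.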

### 9.3 Proof of Theorem 2

We first prove the following lemma.

###### Lemma 3.

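Consider f-divergence $d_f$ corresponding to function $f$ which has a non-decreasing convex-conjugate $f^*$. Then, for any continuous $D$

$$d^*_{f,P}(D) = \mathbb{E}_P\big[f^*(D(X)+\lambda_0)\big] - \lambda_0 \tag{27}$$

where $\lambda_0\in\mathbb{R}$ satisfies $\mathbb{E}_P\big[f^{*\prime}(D(X)+\lambda_0)\big] = 1$. Here $f^{*\prime}$ stands for the derivative of the conjugate function $f^*$, which is supposed to be non-negative everywhere.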

###### Proof.

Note that

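$$\begin{aligned} d^*_{f,P}(D) &\overset{(a)}{=} \sup_{Q}\ \mathbb{E}_Q[D(X)] - d_f(P,Q) \\ &\overset{(b)}{=} \sup_{Q}\ \mathbb{E}_Q[D(X)] - \mathbb{E}_P\Big[f\Big(\frac{q(X)}{p(X)}\Big)\Big] \\ &\overset{(c)}{=} \max_{q(x)\ge 0,\ \int q(x)dx=1}\ \int q(x)D(x)\,dx - \mathbb{E}_P\Big[f\Big(\frac{q(X)}{p(X)}\Big)\Big] \\ &\overset{(d)}{=} \min_{\lambda\in\mathbb{R}}\ -\lambda + \max_{q(x)\ge 0}\ \int q(x)\big(D(x)+\lambda\big)\,dx - \mathbb{E}_P\Big[f\Big(\frac{q(X)}{p(X)}\Big)\Big] \\ &\overset{(e)}{=} \min_{\lambda\in\mathbb{R}}\ -\lambda + \max_{r(x)\ge 0}\ \mathbb{E}_P\big[r(X)\big(D(X)+\lambda\big) - f(r(X))\big] \\ &\overset{(f)}{=} \min_{\lambda\in\mathbb{R}}\ -\lambda + \mathbb{E}_P\Big[\max_{r\ge 0}\ r\big(D(X)+\lambda\big) - f(r)\Big] \\ &\overset{(g)}{=} \min_{\lambda\in\mathbb{R}}\ -\lambda + \mathbb{E}_P\big[f^*(D(X)+\lambda)\big] \\ &= -\max_{\lambda\in\mathbb{R}}\ \lambda - \mathbb{E}_P\big[f^*(D(X)+\lambda)\big] && (28) \\ &\overset{(h)}{=} \mathbb{E}_P\big[f^*(D(X)+\lambda_0)\big] - \lambda_0. && (29) \end{aligned}$$

Here (a) and (b) follow from the conjugate and f-divergence definitions. (c) rewrites the optimization problem in terms of the density function $q$ corresponding to distribution $Q$. (d) uses strong convex duality to move the density constraint to the objective. Note that strong duality holds, since we have a convex optimization problem with affine constraints. (e) rewrites the problem after the change of variable $r(x) = q(x)/p(x)$. (f) holds since $f$ and $D$ are assumed to be continuous. (g) follows from the assumption that the derivative of $f^*$ takes non-negative values, and hence the maximizer over $r\ge 0$ also solves the unconstrained maximization defining the convex conjugate

$$f^*\big(D(X)+\lambda\big) := \max_{r(X)}\ r(X)\big(D(X)+\lambda\big) - f(r(X)).$$

Taking the derivative of the concave objective in (28), the value of $\lambda$ maximizing the objective solves the equation $\mathbb{E}_P\big[f^{*\prime}(D(X)+\lambda)\big] = 1$, whose solution is assumed to be $\lambda_0$. Therefore, (h) holds and the proof is complete. ∎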
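The formula in Lemma 3 can be checked in a case where everything is available in closed form. Assume $f(t) = t\log t$, so that $f^*(u) = e^{u-1}$ is non-decreasing and $d_f(P,Q) = \mathrm{KL}(Q\,\|\,P)$ under the convention of (5). The sketch below evaluates the lemma’s expression on a made-up three-point distribution and compares it with the supremum in (10), using the classical fact that the maximizing $Q$ has density proportional to $p(x)e^{D(x)}$; both sides then equal $\log\mathbb{E}_P[e^{D(X)}]$.

```python
import math

def conjugate_via_lemma(p, d_vals):
    """E_P[f*(D + lambda_0)] - lambda_0 for f(t) = t log t, where f*(u) = exp(u - 1)."""
    # lambda_0 solves E_P[f*'(D(X) + lambda_0)] = 1 with f*'(u) = exp(u - 1).
    lam0 = 1.0 - math.log(sum(pi * math.exp(di) for pi, di in zip(p, d_vals)))
    return sum(pi * math.exp(di + lam0 - 1.0) for pi, di in zip(p, d_vals)) - lam0

def objective(p, q, d_vals):
    """E_Q[D] - d_f(P, Q) with d_f(P, Q) = sum_x p(x) f(q(x)/p(x)), f(t) = t log t."""
    d_f = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q) if qi > 0)
    return sum(qi * di for qi, di in zip(q, d_vals)) - d_f

def conjugate_direct(p, d_vals):
    """Objective evaluated at the maximizer q(x) proportional to p(x) exp(D(x))."""
    z = sum(pi * math.exp(di) for pi, di in zip(p, d_vals))
    q = [pi * math.exp(di) / z for pi, di in zip(p, d_vals)]
    return objective(p, q, d_vals)

p = [0.2, 0.5, 0.3]          # hypothetical distribution P
d_vals = [0.3, -1.0, 2.0]    # hypothetical discriminator values D(x)

via_lemma = conjugate_via_lemma(p, d_vals)
direct = conjugate_direct(p, d_vals)
```

Any other choice of $Q$, for example the uniform distribution, gives a smaller objective value, consistent with `direct` being the supremum.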

Now we prove Theorem 2 which can be broken into two parts as follows.

###### Theorem (Theorem 2).

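Consider f-divergence $d_f$ where $f$ has a non-decreasing conjugate $f^*$.
(a) Suppose $\mathcal{F}$ is a convex set closed to a constant addition, i.e. for any $D\in\mathcal{F},\ \lambda\in\mathbb{R}$ we have $D+\lambda\in\mathcal{F}$. Then,

$$\begin{aligned} &\min_{P_{G(Z)}\in\mathcal{P}_{\mathcal{G}}}\ \min_{Q}\ d_f(P_{G(Z)},Q) + \max_{D\in\mathcal{F}} \big\{ \mathbb{E}_{P_X}[D(X)] - \mathbb{E}_Q[D(X)] \big\} \\ =\ &\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}_{P_X}[D(X)] - \mathbb{E}\big[f^*(D(G(Z)))\big]. \end{aligned} \tag{30}$$

(b) Suppose $\mathcal{F}$ is a linear space including the constant function $D_0\equiv 1$. Then,

$$\min_{P_{G(Z)}\in\mathcal{P}_{\mathcal{G}}}\ \min_{Q\in\mathcal{P}_{\mathcal{F}}(P_X)}\ d_f(P_{G(Z)},Q) = \min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}_{P_X}[D(X)] - \mathbb{E}\big[f^*(D(G(Z)))\big]. \tag{31}$$

###### Proof.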

This theorem is an application of Theorem 1 and Corollary 1. For part (a) we have

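$$\begin{aligned} &\min_{P_{G(Z)}\in\mathcal{P}_{\mathcal{G}}}\ \min_{Q}\ d_f(P_{G(Z)},Q) + \max_{D\in\mathcal{F}} \big\{ \mathbb{E}_{P_X}[D(X)] - \mathbb{E}_Q[D(X)] \big\} \\ \overset{(c)}{=}\ &\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}_{P_X}[D(X)] - d^*_{f,P_{G(Z)}}(D) \\ \overset{(d)}{=}\ &\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}_{P_X}[D(X)] + \max_{\lambda\in\mathbb{R}}\ \lambda - \mathbb{E}\big[f^*(D(G(Z))+\lambda)\big] \\ =\ &\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F},\,\lambda\in\mathbb{R}}\ \mathbb{E}_{P_X}[D(X)+\lambda] - \mathbb{E}\big[f^*(D(G(Z))+\lambda)\big] \\ \overset{(e)}{=}\ &\min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}_{P_X}[D(X)] - \mathbb{E}\big[f^*(D(G(Z)))\big]. \end{aligned}$$

Here (c) is a direct result of Theorem 1. (d) uses the simplified version (28) for $d^*_{f,P_{G(Z)}}$. (e) follows from the assumption that $\mathcal{F}$ is closed to constant additions.

For part (b) note that since $\mathcal{F}$ is a linear space and includes $D_0\equiv 1$, it is closed to constant additions. Hence, an application of Corollary 1 reveals

$$\begin{aligned} \min_{P_{G(Z)}\in\mathcal{P}_{\mathcal{G}}}\ \min_{Q\in\mathcal{P}_{\mathcal{F}}(P_X)}\ d_f(P_{G(Z)},Q) &= \min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}_{P_X}[D(X)] - d^*_{f,P_{G(Z)}}(D) \\ &= \min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}_{P_X}[D(X)] + \max_{\lambda\in\mathbb{R}}\ \lambda - \mathbb{E}\big[f^*(D(G(Z))+\lambda)\big] \\ &= \min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F},\,\lambda\in\mathbb{R}}\ \mathbb{E}_{P_X}[D(X)+\lambda] - \mathbb{E}\big[f^*(D(G(Z))+\lambda)\big] \\ &= \min_{G\in\mathcal{G}}\ \max_{D\in\mathcal{F}}\ \mathbb{E}_{P_X}[D(X)] - \mathbb{E}\big[f^*(D(G(Z)))\big], \end{aligned}$$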

which makes the proof complete. ∎

### 9.4 Proof of Theorem 3

Theorem 3 is a direct application of the following lemma to Theorem 1 and Corollary 1.

###### Lemma 4.

Let $c$ be a lower semicontinuous non-negative cost function. Considering the c-transform operation defined in the text, the following holds for any continuous $D$

$$\mathrm{OT}^*_{c,P}(D) = \mathbb{E}_P[D^c(X)]. \tag{32}$$

###### Proof.

We have

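$$\begin{aligned} \mathrm{OT}^*_{c,P}(D) &\overset{(a)}{=} \sup_{Q}\ \mathbb{E}_Q[D(X')] - \mathrm{OT}_c(P,Q) \\ &\overset{(b)}{=} -\inf_{Q}\ \inf_{M\in\Pi(P,Q)}\ \mathbb{E}_M\big[c(X,X') - D(X')\big] \\ &= -\inf_{Q,\ M\in\Pi(P,Q)}\ \mathbb{E}_M\big[c(X,X') - D(X')\big] \\ &\overset{(c)}{\le} -\mathbb{E}_P\Big[\inf_{x'}\ c(X,x') - D(x')\Big] \\ &= \mathbb{E}_P\Big[\sup_{x'}\ D(x') - c(X,x')\Big] \\ &\overset{(d)}{=} \mathbb{E}_P[D^c(X)]. \end{aligned}$$

Here (a), (b), (d) hold according to the definitions. The inequality in (c) holds because $c(X,X') - D(X') \ge \inf_{x'} c(X,x') - D(x')$ under any coupling. Moreover, we show (c) will hold with equality under the lemma’s assumptions. The function $c(x,x') - D(x')$ is lower semicontinuous, and hence for every $\epsilon>0$ there exists a measurable function $v$ such that for the coupling $M$ of $(X, v(X))$ with $X\sim P$ the absolute difference $\big|\mathbb{E}_M[c(X,X') - D(X')] - \mathbb{E}_P[\inf_{x'} c(X,x') - D(x')]\big|$ is $\epsilon$-bounded. Therefore, (c) holds with equality and the proof is complete. ∎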
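On a finite support the statement of the lemma can be probed directly: every coupling with first marginal $P$ is described by a set of conditional distributions over $x'$, its value $\mathbb{E}_M[D(X') - c(X,X')]$ never exceeds $\mathbb{E}_P[D^c(X)]$, and the coupling that sends each $x$ to a maximizer of $D(x') - c(x,x')$ attains it. The sketch below checks this with randomly drawn couplings; the support, cost, and discriminator values are made up for illustration.

```python
import random

def ot_conjugate(p, d_vals, cost):
    """E_P[D^c(X)] with D^c(x) = max over x' of D(x') - c(x, x') on {0, ..., n-1}."""
    n = len(p)
    return sum(p[x] * max(d_vals[xp] - cost(x, xp) for xp in range(n))
               for x in range(n))

def coupling_value(p, conditionals, d_vals, cost):
    """E_M[D(X') - c(X, X')] for the coupling M(x, x') = p(x) * conditionals[x][x']."""
    n = len(p)
    return sum(p[x] * conditionals[x][xp] * (d_vals[xp] - cost(x, xp))
               for x in range(n) for xp in range(n))

def random_conditionals(n, rng):
    rows = [[rng.random() for _ in range(n)] for _ in range(n)]
    return [[v / sum(row) for v in row] for row in rows]

def greedy_conditionals(d_vals, cost):
    """Deterministic map sending each x to a maximizer of D(x') - c(x, x')."""
    n = len(d_vals)
    best = [max(range(n), key=lambda y, x=x: d_vals[y] - cost(x, y)) for x in range(n)]
    return [[1.0 if xp == best[x] else 0.0 for xp in range(n)] for x in range(n)]

def sq_cost(x, xp):
    return float((x - xp) ** 2)

p = [0.3, 0.3, 0.4]
d_vals = [0.0, 2.0, -1.0]    # hypothetical discriminator values
rng = random.Random(0)

bound = ot_conjugate(p, d_vals, sq_cost)
attained = coupling_value(p, greedy_conditionals(d_vals, sq_cost), d_vals, sq_cost)
assert all(coupling_value(p, random_conditionals(3, rng), d_vals, sq_cost) <= bound + 1e-12
           for _ in range(100))
```

In the continuous setting of the lemma the maximizer may not exist, which is why the proof works with an $\epsilon$-optimal measurable map instead of the exact `greedy_conditionals` used here.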

### 9.5 Proof of Theorem 4

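Consider a convex combination of functions from $\mathcal{F}_{nn}$ as $f_\alpha(x) := \int \alpha(\mathbf{w})\, f_{\mathbf{w}}(x)\,d\mathbf{w} = \mathbb{E}_{W\sim\alpha}[f_W(x)]$, where $\alpha$ can be considered as a probability density function over feasible set $\mathcal{W}$. Consider $m$ samples $(W_i)_{i=1}^m$ taken i.i.d. from $\alpha$. Since any $f_W$ is $M$-bounded, according to Hoeffding’s inequality for a fixed $x$ we have

$$\Pr\Big( \Big| \frac{1}{m}\sum_{i=1}^{m} f_{W_i}(x) - \mathbb{E}_{W\sim\alpha}[f_W(x)] \Big| \ge \frac{\epsilon}{2} \Big) \le 2\exp\Big( -\frac{m\epsilon^2}{8M^2} \Big). \tag{33}$$

Next we consider a $\delta$-covering for the ball $\{x:\ \|x\|_2\le R\}$, where we choose $\delta = \frac{\epsilon}{4L}$. We know a $\delta$-covering $\{x_j:\ 1\le j\le N\}$ exists with a bounded size $N\le (12LR/\epsilon)^k$ [15]. Then, an application of the union bound implies

$$\begin{aligned} \Pr\Big( \max_{1\le j\le N} \Big| \frac{1}{m}\sum_{i=1}^{m} f_{W_i}(x_j) - \mathbb{E}_{W\sim\alpha}[f_W(x_j)] \Big| \ge \frac{\epsilon}{2} \Big) &\le 2N\exp\Big( -\frac{m\epsilon^2}{8M^2} \Big) \\ &\le \exp\Big( -\frac{m\epsilon^2}{8M^2} + k\log\Big(\frac{12LR}{\epsilon}\Big) + \log 2 \Big) \end{aligned}$$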

Hence if we have