# Approximation and Convergence Properties of Generative Adversarial Learning

Generative adversarial networks (GAN) approximate a target data distribution by jointly optimizing an objective function through a "two-player game" between a generator and a discriminator. Despite their empirical success, however, two very basic questions on how well they can approximate the target distribution remain unanswered. First, it is not known how restricting the discriminator family affects the approximation quality. Second, while a number of different objective functions have been proposed, we do not understand when convergence to the global minima of the objective function leads to convergence to the target distribution under various notions of distributional convergence. In this paper, we address these questions in a broad and unified setting by defining a notion of adversarial divergences that includes a number of recently proposed objective functions. We show that if the objective function is an adversarial divergence with some additional conditions, then using a restricted discriminator family has a moment-matching effect. Additionally, we show that for objective functions that are strict adversarial divergences, convergence in the objective function implies weak convergence, thus generalizing previous results.

## 1 Introduction

Generative adversarial networks (GANs) have attracted an enormous amount of recent attention in machine learning. In a generative adversarial network, the goal is to produce an approximation to a target data distribution $\mu$ from which only samples are available. This is done iteratively via two components, a generator and a discriminator, which are usually implemented by neural networks. The generator takes in random (usually Gaussian or uniform) noise as input and attempts to transform it to match the target distribution $\mu$; the discriminator aims to accurately discriminate between samples from the target distribution and those produced by the generator. Estimation proceeds by iteratively refining the generator and the discriminator to optimize an objective function until the target distribution is indistinguishable from the distribution induced by the generator. The practical success of GANs has led to a large volume of recent literature on variants which have many desirable properties; examples are the f-GAN [10], the MMD-GAN [9, 5], the Wasserstein-GAN [2], among many others.

In spite of their enormous practical success, unlike more traditional methods such as maximum likelihood inference, GANs are theoretically rather poorly understood. In particular, two very basic questions on how well they can approximate the target distribution $\mu$, even in the presence of a very large number of samples and perfect optimization, remain largely unanswered. The first relates to the role of the discriminator in the quality of the approximation. In practice, the discriminator is usually restricted to belong to some family, and it is not understood in what sense this restriction affects the distribution output by the generator. The second question relates to convergence; different variants of GANs have been proposed that involve different objective functions (to be optimized by the generator and the discriminator). However, it is not understood under what conditions minimizing the objective function leads to a good approximation of the target distribution. More precisely, does a sequence of distributions output by the generator that converges to the global minimum under the objective function always converge to the target distribution $\mu$ under some standard notion of distributional convergence?

In this work, we consider these two questions in a broad setting. We first characterize a very general class of objective functions that we call adversarial divergences, and we show that they capture the objective functions used by a variety of existing procedures that include the original GAN [7], f-GAN [10], MMD-GAN [5, 9], WGAN [2], improved WGAN [8], as well as a class of entropic regularized optimal transport problems [6]. We then define the class of strict adversarial divergences – a subclass of adversarial divergences where the minimizer of the objective function is uniquely the target distribution. This characterization allows us to address the two questions above in a unified setting, and translate the results to an entire class of GANs with little effort.

First, we address the role of the discriminator in the approximation in Section 4. We show that if the objective function is an adversarial divergence that obeys certain conditions, then using a restricted class of discriminators has the effect of matching generalized moments. A concrete consequence of this result is that in linear f-GANs, where the discriminator family is the set of all affine functions over a vector $\psi$ of feature maps, and the objective function is an f-GAN, the optimal distribution $\nu$ output by the GAN will satisfy $\mathbb{E}_{x \sim \mu}[\psi(x)] = \mathbb{E}_{x \sim \nu}[\psi(x)]$ regardless of the specific $f$-divergence chosen in the objective function. Furthermore, we show that a neural network GAN is just a supremum of linear GANs, and therefore has the same moment-matching effect.

We next address convergence in Section 5. We show that convergence in an adversarial divergence implies some standard notion of topological convergence. In particular, we show that provided an objective function is a strict adversarial divergence, convergence to $\mu$ in the objective function implies weak convergence of the output distribution to $\mu$. While convergence properties of some isolated objective functions were known before [2], this result extends them to a broad class of GANs. An additional consequence of this result is the observation that, as the Wasserstein distance metrizes weak convergence of probability distributions (see e.g. [14]), Wasserstein-GANs have the weakest objective functions in the class of strict adversarial divergences. (Weakness is actually a desirable property since it prevents the divergence from being too discriminative, i.e. from saturating, thus providing more information about how to modify the model to approximate the true distribution.)

## 2 Notations

We use bold constants (e.g., $\mathbf{0}$, $\mathbf{1}$) to denote constant functions. We denote by $f \circ g$ the function composition of $f$ and $g$. We denote by $Y^X$ the set of functions from the set $X$ to the set $Y$. We denote by $\mu \otimes \nu$ the product measure of $\mu$ and $\nu$. We denote by $\operatorname{int}(A)$ the interior of the set $A$. We denote by $\mathbb{E}_{\mu}[f]$ the integral of $f$ with respect to the measure $\mu$.

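Let $f: \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ be a convex function. We denote by $\operatorname{dom} f$ the effective domain of $f$, that is, $\operatorname{dom} f = \{x \in \mathbb{R} : f(x) < +\infty\}$; and we denote by $f^*$ the convex conjugate of $f$, that is, $f^*(x^*) = \sup_{x \in \mathbb{R}} \{x^* x - f(x)\}$.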
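To make the convex conjugate concrete, the following sketch approximates $f^*$ numerically for the KL generator $f(u) = u \log u$ and compares it with the closed form $f^*(t) = \exp(t - 1)$ that appears later in the linear KL-GAN objective. The grid bounds and the helper names (`conjugate_on_grid`, `kl_generator`) are assumptions made for this illustration only.

```python
import numpy as np

def conjugate_on_grid(f, t, u_grid):
    """Approximate f*(t) = sup_u (u * t - f(u)) by a maximum over a finite grid."""
    return float(np.max(u_grid * t - f(u_grid)))

def kl_generator(u):
    """Generator of the KL divergence, f(u) = u log u, for u > 0."""
    return u * np.log(u)

# Grid over (0, 20]; the maximizer u = exp(t - 1) lies inside it for the t below.
u_grid = np.linspace(1e-6, 20.0, 2_000_001)

results = {}
for t in (-1.0, 0.0, 1.0, 2.0):
    results[t] = conjugate_on_grid(kl_generator, t, u_grid)
    print(t, results[t], np.exp(t - 1.0))
```

A grid search is used here only because it is transparent; it is not how one would compute conjugates in practice.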

For a topological space $X$, we denote by $C(X)$ the set of continuous functions on $X$, $C_b(X)$ the set of bounded continuous functions on $X$, $\mathcal{M}(X)$ the set of finite signed regular Borel measures on $X$, and $\mathcal{P}(X)$ the set of probability measures on $X$.

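Given a non-empty subspace $A$ of a topological space $X$, denote by $X/A$ the quotient space equipped with the quotient topology induced by the equivalence relation $\sim_A$, where for any $a, b \in X$, $a \sim_A b$ if and only if $a = b$ or $a, b$ both belong to $A$. The equivalence class of each element $a \in X$ is denoted as $[a]$.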

## 3 General Framework

Let $\mu$ be the target data distribution from which we can draw samples. Our goal is to find a generative model $\nu$ to approximate $\mu$. Informally, most GAN-style algorithms model this approximation as solving the following problem

$$\inf_{\nu} \sup_{f \in \mathcal{F}} \mathbb{E}_{x \sim \mu,\, y \sim \nu}[f(x, y)],$$

where $\mathcal{F}$ is a class of functions. The process is usually considered adversarial in the sense that it can be thought of as a two-player minimax game, where a generator is trying to mimic the true distribution $\mu$, and an adversary is trying to distinguish between the true and generated distributions. However, another way to look at it is as the minimization of the following objective function

$$\nu \longmapsto \sup_{f \in \mathcal{F}} \mathbb{E}_{x \sim \mu,\, y \sim \nu}[f(x, y)]. \tag{1}$$

This objective function measures how far the target distribution $\mu$ is from the current estimate $\nu$. Hence, minimizing this function can lead to a good approximation of the target distribution $\mu$.
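As a rough illustration of objective (1), the sketch below estimates the supremum over a small finite family of functions from samples. The family (linear critics with slope in $[-1, 1]$), the sample sizes, and all names are assumptions chosen for the example, not part of the paper's setup.

```python
import numpy as np

def empirical_objective(critics, xs, ys):
    """Max over a finite family of the sample mean of f(x, y) over all (x, y) pairs."""
    best = -np.inf
    for f in critics:
        # Averaging over every pair mimics the expectation under the product measure.
        value = float(np.mean(f(xs[:, None], ys[None, :])))
        best = max(best, value)
    return best

# Hypothetical family: f(x, y) = a * x - a * y with slope a in [-1, 1].
slopes = np.linspace(-1.0, 1.0, 21)
critics = [lambda x, y, a=a: a * x - a * y for a in slopes]

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=500)          # samples standing in for the target
ys_shifted = rng.normal(2.0, 1.0, size=500)  # samples from a shifted model

objective_same = empirical_objective(critics, xs, xs)
objective_shifted = empirical_objective(critics, xs, ys_shifted)
print(objective_same, objective_shifted)
```

With this family the objective is close to zero when both sample sets coincide and grows with the gap between the sample means, which matches the reading of (1) as a measure of distance to the target.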

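###### Definition 1 (Adversarial divergence).

Let $X$ be a topological space, $\mathcal{F} \subseteq C_b(X^2)$, $\mathcal{F} \neq \emptyset$. An adversarial divergence $\tau$ over $X$ is a function

$$\begin{aligned} \mathcal{P}(X) \times \mathcal{P}(X) &\longrightarrow \mathbb{R} \cup \{+\infty\} \\ (\mu, \nu) &\longmapsto \tau(\mu \| \nu) = \sup_{f \in \mathcal{F}} \mathbb{E}_{\mu \otimes \nu}[f]. \end{aligned} \tag{2}$$

Observe that in Definition 1, if we have a fixed target distribution $\mu$, then (2) reduces to the objective function (1). Also notice that because $\tau$ is the supremum of a family of linear functions (in each of the variables $\mu$ and $\nu$ separately), it is convex in each of its variables.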
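The convexity claim can be spelled out in one line. The sketch below treats the second variable (the first is symmetric); the mixing weight $\lambda \in [0, 1]$ and the measures $\nu_1, \nu_2 \in \mathcal{P}(X)$ are names introduced only for this illustration.

$$\begin{aligned}
\tau(\mu \,\|\, \lambda \nu_1 + (1 - \lambda) \nu_2)
&= \sup_{f \in \mathcal{F}} \Big( \lambda\, \mathbb{E}_{\mu \otimes \nu_1}[f] + (1 - \lambda)\, \mathbb{E}_{\mu \otimes \nu_2}[f] \Big) \\
&\le \lambda \sup_{f \in \mathcal{F}} \mathbb{E}_{\mu \otimes \nu_1}[f] + (1 - \lambda) \sup_{f \in \mathcal{F}} \mathbb{E}_{\mu \otimes \nu_2}[f] \\
&= \lambda\, \tau(\mu \| \nu_1) + (1 - \lambda)\, \tau(\mu \| \nu_2).
\end{aligned}$$

The first equality uses linearity of the expectation in the measure, and the inequality is the usual fact that a supremum of a sum is at most the sum of the suprema.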

Definition 1 captures the objective functions used by a variety of existing GAN-style procedures. In practice, although the function class $\mathcal{F}$ can be complicated, it is usually a transformation of a simple function class $\mathcal{V}$, which is the set of discriminators or critics, as they have been called in the GAN literature. We give some examples by specifying $\mathcal{F}$ and $\mathcal{V}$ for each objective function.

1. GAN [7].

   $$\mathcal{F} = \{x, y \mapsto \log(u(x)) + \log(1 - u(y)) : u \in \mathcal{V}\}, \qquad \mathcal{V} = (0, 1)^X \cap C_b(X).$$
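2. $f$-GAN [10].  Let $f: \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ be a convex lower semi-continuous function. Assume $f^*(x) \ge x$ for any $x \in \mathbb{R}$, $f^*$ is continuously differentiable on $\operatorname{int}(\operatorname{dom} f^*)$, and there exists $x_0 \in \operatorname{int}(\operatorname{dom} f^*)$ such that $f^*(x_0) = x_0$.

   $$\mathcal{F} = \{x, y \mapsto v(x) - f^*(v(y)) : v \in \mathcal{V}\}, \qquad \mathcal{V} = (\operatorname{dom} f^*)^X \cap C_b(X).$$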
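3. MMD-GAN [9, 5].  Let $k: X \times X \to \mathbb{R}$ be a universal reproducing kernel. Let $\mathcal{M}$ be the set of signed measures on $X$.

   $$\mathcal{F} = \{x, y \mapsto v(x) - v(y) : v \in \mathcal{V}\}, \qquad \mathcal{V} = \{x \mapsto \mathbb{E}_{\mu}[k(x, \cdot)] : \mu \in \mathcal{M},\ \mathbb{E}_{\mu^2}[k] \le 1\}.$$
4. Wasserstein-GAN (WGAN) [2].  Assume $X$ is a metric space.

   $$\mathcal{F} = \{x, y \mapsto v(x) - v(y) : v \in \mathcal{V}\}, \qquad \mathcal{V} = \{v \in C_b(X) : \|v\|_{\mathrm{Lip}} \le K\},$$

   where $K$ is a positive constant and $\|\cdot\|_{\mathrm{Lip}}$ denotes the Lipschitz constant.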

5. WGAN-GP (Improved WGAN) [8].  Assume $X$ is a convex subset of a Euclidean space.

 F V =C1(X),

where

is the uniform distribution on

, is a positive constant, .

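6. (Regularized) Optimal Transport [6].  Let $c: X \times X \to \mathbb{R}$ be some transportation cost function and $\epsilon \ge 0$ be the strength of regularization. (To the best of our knowledge, neither (3) nor (4) was used in any GAN algorithm. However, since our focus in this paper is not implementing new algorithms, we leave experiments with this formulation for future work.) If $\epsilon = 0$ (no regularization), then

   $$\mathcal{F} = \{x, y \mapsto u(x) + v(y) : (u, v) \in \mathcal{V}\}, \qquad \mathcal{V} = \{(u, v) \in C_b(X) \times C_b(X) : u(x) + v(y) \le c(x, y) \text{ for any } x, y \in X\}; \tag{3}$$

   if $\epsilon > 0$, then

   $$\mathcal{F} = \Big\{x, y \mapsto u(x) + v(y) - \epsilon \exp\Big(\frac{u(x) + v(y) - c(x, y)}{\epsilon}\Big) : u, v \in \mathcal{V}\Big\}, \qquad \mathcal{V} = C_b(X). \tag{4}$$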
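The pattern shared by these examples, where a simple discriminator class is transformed into the class of two-argument functions, can be sketched in a few lines. The helper names below (`gan_f`, `f_gan_f`, `ipm_f`, `kl_conjugate`) are hypothetical, and only the GAN, f-GAN, and MMD-GAN/WGAN transformations are covered.

```python
import numpy as np

def gan_f(u):
    """GAN: u maps X to (0, 1); returns (x, y) -> log u(x) + log(1 - u(y))."""
    return lambda x, y: np.log(u(x)) + np.log(1.0 - u(y))

def f_gan_f(v, f_star):
    """f-GAN: v maps X into dom f*; returns (x, y) -> v(x) - f*(v(y))."""
    return lambda x, y: v(x) - f_star(v(y))

def ipm_f(v):
    """MMD-GAN / WGAN: returns (x, y) -> v(x) - v(y)."""
    return lambda x, y: v(x) - v(y)

def kl_conjugate(t):
    """Conjugate of the KL generator u log u."""
    return np.exp(t - 1.0)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

x, y = 0.3, -1.2
print(gan_f(sigmoid)(x, y))
print(f_gan_f(np.tanh, kl_conjugate)(x, y))
print(ipm_f(np.tanh)(x, y))
```

Writing the transformation as a higher-order function keeps the discriminator class and the objective separate, which is the split that the list above makes for each objective.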

In order to study an adversarial divergence $\tau$, it is critical to first understand at which points the divergence is minimized. More precisely, let $\tau$ be an adversarial divergence and $\mu^*$ be the target probability measure. We are interested in the set of probability measures that minimize the divergence when the first argument of $\tau$ is set to $\mu^*$, i.e., the set $\arg\min_{\mu} \tau(\mu^* \| \mu)$. Formally, we define the set $\mathrm{OPT}_{\tau,\mu^*}$ as follows.

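###### Definition 2 ($\mathrm{OPT}_{\tau,\mu^*}$).

Let $\tau$ be an adversarial divergence over a topological space $X$, $\mu^* \in \mathcal{P}(X)$. Define $\mathrm{OPT}_{\tau,\mu^*}$ to be the set of probability measures that minimize the function $\tau(\mu^* \| \cdot)$. That is,

$$\mathrm{OPT}_{\tau,\mu^*} \triangleq \Big\{\mu \in \mathcal{P}(X) : \tau(\mu^* \| \mu) = \inf_{\mu' \in \mathcal{P}(X)} \tau(\mu^* \| \mu')\Big\}.$$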

Ideally, the target probability measure should be the one and only minimizer of the objective function. The notion of strict adversarial divergence captures this property.

###### Definition 3 (Strict adversarial divergence).

Let $\tau$ be an adversarial divergence over a topological space $X$. $\tau$ is called a strict adversarial divergence if for any $\mu^* \in \mathcal{P}(X)$, $\mathrm{OPT}_{\tau,\mu^*} = \{\mu^*\}$.

For example, if the underlying space $X$ is a compact metric space, then examples (c) and (d) (MMD-GAN and Wasserstein-GAN) induce metrics on $\mathcal{P}(X)$ (see, e.g., [12]), and are therefore strict adversarial divergences.
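For intuition on why the MMD objective separates distinct distributions, the sketch below evaluates it on a finite support, using the standard identity that the MMD is the RKHS norm of the difference of mean embeddings, which on a finite support is $\sqrt{(p - q)^T K (p - q)}$ for the Gram matrix $K$. The Gaussian kernel, the bandwidth, and the support points are assumptions of this example.

```python
import numpy as np

def gaussian_gram(points, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-(x_i - x_j)^2 / (2 * bandwidth^2))."""
    diff = points[:, None] - points[None, :]
    return np.exp(-diff ** 2 / (2.0 * bandwidth ** 2))

def mmd(p, q, gram):
    """MMD between two discrete distributions on a shared finite support."""
    d = p - q
    # Clip tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(d @ gram @ d, 0.0)))

support = np.array([-1.0, 0.0, 1.0])
gram = gaussian_gram(support)

p = np.array([0.25, 0.50, 0.25])
q = np.array([0.50, 0.00, 0.50])  # same mean as p, different shape

print(mmd(p, p, gram), mmd(p, q, gram))
```

Here `p` and `q` share their first moment yet the MMD is strictly positive, consistent with the objective being a metric on a finite support with a strictly positive definite kernel.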

In the next two sections, we will answer two questions regarding the set $\mathrm{OPT}_{\tau,\mu^*}$: how well do the elements in $\mathrm{OPT}_{\tau,\mu^*}$ approximate the target distribution $\mu^*$ when restricting the class of discriminators (Section 4)? And does a sequence of distributions that converges in an adversarial divergence also converge to $\mathrm{OPT}_{\tau,\mu^*}$ under some standard notion of distributional convergence (Section 5)?

## 4 Generalized Moment Matching

To motivate the discussion in this section, recall example (b), the $f$-GAN, in Section 3. It can be shown that under some mild conditions, $\tau$, the objective function of $f$-GAN, is actually the $f$-divergence, and the minimizer of $\tau(\mu^* \| \cdot)$ is only $\mu^*$ [10]. However, in practice, the discriminator class $\mathcal{V}$ is usually implemented by a feedforward neural network, and it is known that a fixed neural network has limited capacity (e.g., it cannot implement the set of all bounded continuous functions). Therefore, one could ask what happens if we restrict $\mathcal{V}$ to a sub-class $\mathcal{V}'$. One would expect $\mu^*$ to no longer be the unique minimizer of $\tau(\mu^* \| \cdot)$, that is, $\mathrm{OPT}_{\tau,\mu^*}$ contains elements other than $\mu^*$. What can we say about the elements in $\mathrm{OPT}_{\tau,\mu^*}$ now? Are all of them close to $\mu^*$ in a certain sense? In this section we will answer these questions.

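More formally, we consider $\{m_\theta : \theta \in \Theta\}$ to be a function class indexed by a set $\Theta$. We can think of $\Theta$ as the parameter set of a feedforward neural network. Each $m_\theta$ is thought of as a matching between two distributions, in the sense that $\mu$ and $\nu$ are matched under $m_\theta$ if and only if $\mathbb{E}_{\mu \otimes \nu}[m_\theta] = 0$. In particular, if each $m_\theta$ corresponds to some function $v_\theta$ such that $m_\theta(x, y) = v_\theta(x) - v_\theta(y)$, then $\mu$ and $\nu$ are matched under $m_\theta$ if and only if some generalized moments of $\mu$ and $\nu$ are equal: $\mathbb{E}_{\mu}[v_\theta] = \mathbb{E}_{\nu}[v_\theta]$. Each $r_\theta$ can be thought of as a residual.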

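We will now relate the matching condition to the optimality of the divergence. In particular, define

$$M_{\mu^*} \triangleq \{\mu : \forall \theta \in \Theta,\ \mathbb{E}_{\mu^*}[v_\theta] = \mathbb{E}_{\mu}[v_\theta]\}.$$

We will give sufficient conditions for members of $M_{\mu^*}$ to be in $\mathrm{OPT}_{\tau,\mu^*}$.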

###### Theorem 4.

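Let $X$ be a topological space, $\Theta \subseteq \mathbb{R}^n$, $\mathcal{V} = \{v_\theta \in C_b(X) : \theta \in \Theta\}$, $\mathcal{R} = \{r_\theta \in C_b(X^2) : \theta \in \Theta\}$. Let $m_\theta(x, y) = v_\theta(x) - v_\theta(y)$. If there exists $c \in \mathbb{R}$ such that for any $\mu, \nu \in \mathcal{P}(X)$, $\inf_{\theta \in \Theta} \mathbb{E}_{\mu \otimes \nu}[r_\theta] = c$ and there exists some $\theta_{\mu\nu} \in \Theta$ such that $\mathbb{E}_{\mu \otimes \nu}[r_{\theta_{\mu\nu}}] = c$ and $\mathbb{E}_{\mu \otimes \nu}[m_{\theta_{\mu\nu}}] \ge 0$, then $\tau(\mu \| \nu) = \sup_{\theta \in \Theta} \mathbb{E}_{\mu \otimes \nu}[m_\theta - r_\theta]$ is an adversarial divergence over $X$ and for any $\mu^* \in \mathcal{P}(X)$,

$$\mathrm{OPT}_{\tau,\mu^*} \supset M_{\mu^*}.$$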

We now review the examples (a)-(e) in Section 3, show how to write each $f_\theta \in \mathcal{F}$ as $m_\theta - r_\theta$, and specify $\theta_{\mu\nu}$ in each case such that the conditions of Theorem 4 can be satisfied.

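1. GAN.  Note that for any $x \in (0, 1)$, $\log\frac{1}{x(1 - x)} \ge \log(4)$. Let $u_{\theta_{\mu\nu}} = \mathbf{\tfrac{1}{2}}$,

   $$\begin{aligned} f_\theta(x, y) &= \log(u_\theta(x)) + \log(1 - u_\theta(y)) \\ &= \underbrace{\log(u_\theta(x)) - \log(u_\theta(y))}_{m_\theta(x, y)\ (\text{note } \mathbb{E}_{\mu \otimes \nu}[m_{\theta_{\mu\nu}}] = 0)} - \underbrace{\log\frac{1}{u_\theta(y)(1 - u_\theta(y))}}_{r_\theta(x, y)\ (\text{note } r_\theta(x, y) \ge r_{\theta_{\mu\nu}}(x, y) = \log(4))}. \end{aligned}$$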
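2. $f$-GAN.  Recall that $f^*(x) - x \ge 0$ for any $x \in \mathbb{R}$ and $f^*(x_0) = x_0$. Let $v_{\theta_{\mu\nu}} = \mathbf{x_0}$,

   $$\begin{aligned} f_\theta(x, y) &= v_\theta(x) - f^*(v_\theta(y)) \\ &= \underbrace{v_\theta(x) - v_\theta(y)}_{m_\theta(x, y)\ (\text{note } \mathbb{E}_{\mu \otimes \nu}[m_{\theta_{\mu\nu}}] = 0)} - \underbrace{\big(f^*(v_\theta(y)) - v_\theta(y)\big)}_{r_\theta(x, y)\ (\text{note } r_\theta(x, y) \ge r_{\theta_{\mu\nu}}(x, y) = 0)}. \end{aligned} \tag{5}$$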
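3. MMD-GAN or Wasserstein-GAN.  Let $v_{\theta_{\mu\nu}} = \mathbf{0}$,

   $$f_\theta(x, y) = \underbrace{v_\theta(x) - v_\theta(y)}_{m_\theta(x, y)\ (\text{note } \mathbb{E}_{\mu \otimes \nu}[m_{\theta_{\mu\nu}}] = 0)} - \underbrace{0}_{r_\theta(x, y)\ (\text{note } r_\theta(x, y) = r_{\theta_{\mu\nu}}(x, y) = 0)}.$$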
4. WGAN-GP.  Note that the function is nonnegative on . Let

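 $$v_{\theta_{\mu\nu}} = \begin{cases} (x_1, x_2, \cdots, x_n) \mapsto \dfrac{\sum_{i=1}^{n} x_i}{\sqrt{n}}, & \text{if } \mathbb{E}_{\mu}\big[\sum_{i=1}^{n} x_i\big] \ge \mathbb{E}_{\nu}\big[\sum_{i=1}^{n} x_i\big], \\[2ex] (x_1, x_2, \cdots, x_n) \mapsto -\dfrac{\sum_{i=1}^{n} x_i}{\sqrt{n}}, & \text{otherwise,} \end{cases}$$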
 fθ(x,y)
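The matching/residual decompositions for the GAN and $f$-GAN cases can be checked numerically on random discriminator outputs. The sketch below assumes the KL conjugate $f^*(t) = \exp(t - 1)$ for the $f$-GAN case; the array names and sampling ranges are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# GAN: random discriminator outputs u(x), u(y) in (0, 1).
u_x = rng.uniform(0.01, 0.99, size=1000)
u_y = rng.uniform(0.01, 0.99, size=1000)
f_gan = np.log(u_x) + np.log(1.0 - u_y)
m_gan = np.log(u_x) - np.log(u_y)            # matching term
r_gan = np.log(1.0 / (u_y * (1.0 - u_y)))    # residual term, at least log(4)

# f-GAN with the KL conjugate f*(t) = exp(t - 1): random outputs v(x), v(y).
v_x = rng.uniform(-3.0, 3.0, size=1000)
v_y = rng.uniform(-3.0, 3.0, size=1000)
f_kl = v_x - np.exp(v_y - 1.0)
m_kl = v_x - v_y                             # matching term
r_kl = np.exp(v_y - 1.0) - v_y               # residual term, at least 0

print(np.allclose(f_gan, m_gan - r_gan), r_gan.min() >= np.log(4.0) - 1e-12)
print(np.allclose(f_kl, m_kl - r_kl), r_kl.min() >= -1e-12)
```

In both cases the residual attains its lower bound exactly when the discriminator is the constant suggested in the list above ($\tfrac{1}{2}$ for the GAN, $1$ for the KL case).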

We now refine the previous result and show that, under some additional conditions, the optimal elements of $\tau$ are fully characterized by the matching condition, i.e. $\mathrm{OPT}_{\tau,\mu^*} = M_{\mu^*}$.

###### Theorem 5.

Under the assumptions of Theorem 4, if and both and have gradients at , and

 (6)

Then for any ,

$$\mathrm{OPT}_{\tau,\mu^*} = M_{\mu^*}. \tag{7}$$

We remark that Theorem 4 is relatively intuitive, while Theorem 5 requires extra conditions and is quite counter-intuitive, especially for algorithms like $f$-GANs.

### 4.1 Example: Linear f-GAN

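We first consider a simple algorithm called linear $f$-GAN. Suppose we are provided with a feature map $\psi$ that maps each point $x$ in the sample space $X$ to a feature vector $(\psi_1(x), \psi_2(x), \cdots, \psi_n(x))$ where each $\psi_i \in C_b(X)$. We are satisfied that any distribution $\mu$ is a good approximation of the target distribution $\mu^*$ as long as $\mathbb{E}_{\mu^*}[\psi] = \mathbb{E}_{\mu}[\psi]$. For example, if $X \subseteq \mathbb{R}$ and $\psi_k(x) = x^k$, to say $\mathbb{E}_{\mu^*}[\psi] = \mathbb{E}_{\mu}[\psi]$ is equivalent to saying that the first $n$ moments of $\mu^*$ and $\mu$ are matched. Recall that in the standard $f$-GAN (example (b) in Section 3), $\mathcal{V} = (\operatorname{dom} f^*)^X \cap C_b(X)$. Now instead of using the discriminator class $\mathcal{V}$, we use a restricted discriminator class $\mathcal{V}'$, containing the linear (or more precisely, affine) transformations of $\psi$

$$\mathcal{V}' = \{\theta^T(\psi, 1) : \theta \in \Theta\} \subseteq \mathcal{V},$$

where $\Theta = \{\theta \in \mathbb{R}^{n+1} : \forall x \in X,\ \theta^T(\psi(x), 1) \in \operatorname{dom} f^*\}$. We will show that $\mathrm{OPT}_{\tau,\mu^*}$ now contains exactly those $\mu$ such that $\mathbb{E}_{\mu^*}[\psi] = \mathbb{E}_{\mu}[\psi]$, regardless of the specific $f$ chosen. Formally,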

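###### Corollary 6 (linear f-GAN).

Let $X$ be a compact topological space. Let $f$ be a function as defined in example (b) of Section 3. Let $\psi = (\psi_1, \cdots, \psi_n)$ be a vector of continuously differentiable functions on $X$. Let $\Theta = \{\theta \in \mathbb{R}^{n+1} : \forall x \in X,\ \theta^T(\psi(x), 1) \in \operatorname{dom} f^*\}$. Let $\tau$ be the objective function of the linear $f$-GAN

$$\tau(\mu \| \nu) = \sup_{\theta \in \Theta} \Big( \mathbb{E}_{\mu}[\theta^T(\psi, 1)] - \mathbb{E}_{\nu}[f^* \circ (\theta^T(\psi, 1))] \Big).$$

Then for any $\mu^* \in \mathcal{P}(X)$,

$$\mathrm{OPT}_{\tau,\mu^*} = \{\mu : \mathbb{E}_{\mu^*}[\psi] = \mathbb{E}_{\mu}[\psi]\}.$$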

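A very concrete example of Corollary 6 is the linear KL-GAN, where $f(u) = u \log u$, $f^*(t) = \exp(t - 1)$, and $\Theta = \mathbb{R}^{n+1}$. The objective function is

$$\tau(\mu \| \nu) = \sup_{\theta \in \mathbb{R}^{n+1}} \Big( \mathbb{E}_{\mu}[\theta^T(\psi, 1)] - \mathbb{E}_{\nu}[\exp(\theta^T(\psi, 1) - 1)] \Big).$$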
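The moment-matching effect of the linear KL-GAN can be observed on a toy problem. The sketch below maximizes the linear KL-GAN objective over $\theta$ for discrete distributions on a three-point support using a damped Newton iteration; the support, the single feature $\psi(x) = x$, the distributions, and the solver are all assumptions made for the illustration and are not taken from the paper.

```python
import numpy as np

def linear_kl_gan_divergence(xs, p_mu, p_nu, psi, n_iter=100):
    """Approximate sup_theta E_mu[theta^T (psi,1)] - E_nu[exp(theta^T (psi,1) - 1)]."""
    phi = np.column_stack([psi(xs), np.ones(len(xs))])  # rows are (psi(x), 1)

    def objective(theta):
        s = phi @ theta
        return float(p_mu @ s - p_nu @ np.exp(s - 1.0))

    theta = np.zeros(phi.shape[1])
    theta[-1] = 1.0  # start at the constant discriminator 1, where f*(1) = 1
    for _ in range(n_iter):
        w = p_nu * np.exp(phi @ theta - 1.0)
        grad = phi.T @ (p_mu - w)
        if np.linalg.norm(grad) < 1e-12:
            break
        hess = -(phi * w[:, None]).T @ phi      # negative definite Hessian
        step = np.linalg.solve(hess, -grad)     # Newton ascent direction
        t = 1.0
        # Backtrack until the concave objective does not decrease.
        while objective(theta + t * step) < objective(theta) and t > 1e-10:
            t *= 0.5
        theta = theta + t * step
    return objective(theta), theta

xs = np.array([-1.0, 0.0, 1.0])
psi = lambda x: x                            # one feature: the first moment
p_target = np.array([0.25, 0.50, 0.25])      # mean 0
p_matched = np.array([0.50, 0.00, 0.50])     # mean 0, but a different distribution
p_mismatched = np.array([0.10, 0.20, 0.70])  # mean 0.6

tau_matched, _ = linear_kl_gan_divergence(xs, p_target, p_matched, psi)
tau_mismatched, _ = linear_kl_gan_divergence(xs, p_target, p_mismatched, psi)
print(tau_matched, tau_mismatched)
```

The distribution `p_matched` differs from the target yet attains the same objective value as the target itself, because only the first moment is visible to the restricted discriminator; `p_mismatched` is penalized.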

### 4.2 Example: Neural Network f-GAN

Next we consider a more general and practical example: an $f$-GAN where the discriminator class is implemented through a feedforward neural network with weight parameter set $\Theta$. We assume that all the activation functions are continuously differentiable (e.g., sigmoid, tanh), and the last layer of the network is a linear transformation plus a bias. We also assume $\operatorname{dom} f^* = \mathbb{R}$ (e.g., the KL-GAN where $f^*(t) = \exp(t - 1)$).

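Now observe that when all the weights before the last layer are fixed, the last layer acts as a discriminator in a linear $f$-GAN. More precisely, let $\Theta_{\mathrm{pre}}$ be the index set for the weights before the last layer. Then each $\theta_{\mathrm{pre}} \in \Theta_{\mathrm{pre}}$ corresponds to a feature map $\psi_{\theta_{\mathrm{pre}}}$. Let the linear $f$-GAN that corresponds to $\psi_{\theta_{\mathrm{pre}}}$ be $\tau_{\theta_{\mathrm{pre}}}$; the adversarial divergence induced by the neural network $f$-GAN is

$$\tau(\mu^* \| \mu) = \sup_{\theta_{\mathrm{pre}} \in \Theta_{\mathrm{pre}}} \tau_{\theta_{\mathrm{pre}}}(\mu^* \| \mu).$$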

Clearly . For the other direction, note that by Corollary 6, for any , and . Therefore and . If , then . As a consequence, for any . Therefore . Therefore, by Corollary 6,

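$$\mathrm{OPT}_{\tau,\mu^*} = \bigcap_{\theta_{\mathrm{pre}} \in \Theta_{\mathrm{pre}}} \mathrm{OPT}_{\tau_{\theta_{\mathrm{pre}}},\mu^*} = \{\mu : \forall \theta \in \Theta,\ \mathbb{E}_{\mu^*}[v_\theta] = \mathbb{E}_{\mu}[v_\theta]\}.$$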

That is, the minimizers of the neural network $f$-GAN are exactly those distributions that are indistinguishable under the expectation of any discriminator network $v_\theta$.

## 5 Convergence

To motivate the discussion in this section, consider the following question. Let $\delta_{x_0}$ be the delta distribution at $x_0 \in \mathbb{R}$, that is, $x = x_0$ with probability $1$. Now, does the sequence of delta distributions $\delta_{1/n}$ converge to $\delta_1$? Almost everyone would answer no. However, does the sequence of delta distributions $\delta_{1/n}$ converge to $\delta_0$? Most people would answer yes based on the intuition that $1/n \to 0$ and so does the sequence of corresponding delta distributions, even though the support of $\delta_{1/n}$ never has any intersection with the support of $\delta_0$. Therefore, convergence can be defined for distributions not only in a point-wise way, but in a way that takes the underlying structure of the sample space into consideration.

Now we return to our adversarial divergence framework. Given an adversarial divergence $\tau$, is it possible that $\tau(\delta_1 \| \delta_{1/n})$ converges to the global minimum of $\tau(\delta_1 \| \cdot)$? How do we define convergence to a set of points instead of only one point, in order to explain the convergence behaviour of any adversarial divergence? In this section we will answer these questions.

We start from two standard notions from functional analysis.

###### Definition 7 (Weak-* topology on P(X) (see e.g. [11])).

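Let $X$ be a compact metric space. By associating with each $\mu \in \mathcal{M}(X)$ a linear function $f \mapsto \mathbb{E}_{\mu}[f]$ on $C(X)$, we have that $\mathcal{M}(X)$ is the continuous dual of $C(X)$ with respect to the uniform norm on $C(X)$ (see e.g. [4]). Therefore we can equip $\mathcal{M}(X)$ (and therefore $\mathcal{P}(X)$) with a weak-* topology, which is the coarsest topology on $\mathcal{M}(X)$ such that $\{\mu \mapsto \mathbb{E}_{\mu}[f] : f \in C(X)\}$ is a set of continuous linear functions on $\mathcal{M}(X)$.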

###### Definition 8 (Weak convergence of probability measures (see e.g. [11])).

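Let $X$ be a compact metric space. A sequence of probability measures $(\mu_n)$ in $\mathcal{P}(X)$ is said to weakly converge to a measure $\mu^* \in \mathcal{P}(X)$ if $\forall f \in C(X)$, $\mathbb{E}_{\mu_n}[f] \to \mathbb{E}_{\mu^*}[f]$, or equivalently, if $(\mu_n)$ is weak-* convergent to $\mu^*$.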

The definitions of weak-* topology and weak convergence respect the topological structure of the sample space. For example, it is easy to check that the sequence of delta distributions $\delta_{1/n}$ weakly converges to $\delta_0$, but not to $\delta_1$.
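The behaviour of delta distributions can be checked with a few lines of arithmetic. The sketch below takes the sequence of point masses at $1/n$ as an assumed concrete instance, and uses two facts about point masses: the Wasserstein-1 distance between them is the distance between the points, and the total variation distance is 1 whenever the points differ. The test function `math.cos` is an arbitrary bounded continuous choice.

```python
import math

def expect_delta(f, point):
    """Expectation of f under the point mass at `point`."""
    return f(point)

def wasserstein1_between_deltas(a, b):
    """Wasserstein-1 distance between point masses at a and b on the real line."""
    return abs(a - b)

def total_variation_between_deltas(a, b):
    """Total variation distance between point masses at a and b."""
    return 0.0 if a == b else 1.0

for n in (1, 10, 100, 1000):
    x_n = 1.0 / n
    gap = abs(expect_delta(math.cos, x_n) - expect_delta(math.cos, 0.0))
    print(n, gap,
          wasserstein1_between_deltas(x_n, 0.0),
          total_variation_between_deltas(x_n, 0.0))
```

The expectation gap and the Wasserstein-1 distance shrink with $n$, while the total variation distance stays at 1, which is the distinction between weak convergence and stronger notions discussed in this section.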

Now note that Definition 8 only defines weak convergence of a sequence of probability measures to a single target measure. Here we generalize the definition for the single target measure to a set of target measures through quotient topology as follows.

###### Definition 9 (Weak convergence of probability measures to a set).

Let $X$ be a compact metric space, equip $\mathcal{P}(X)$ with the weak-* topology and let $A$ be a non-empty subspace of $\mathcal{P}(X)$. A sequence of probability measures $(\mu_n)$ in $\mathcal{P}(X)$ is said to weakly converge to the set $A$ if $([\mu_n])$ converges to $A$ in the quotient space $\mathcal{P}(X)/A$.

With everything properly defined, we are now ready to state our convergence result. Note that an adversarial divergence is not necessarily a metric, and therefore does not necessarily induce a topology. However, convergence in an adversarial divergence can still imply some type of topological convergence. More precisely, we show a convergence result that holds for any adversarial divergence $\tau$ as long as the sample space is a compact metric space. Informally, we show that for any target probability measure $\mu^*$, if $\tau(\mu^* \| \mu_n)$ converges to the global minimum of $\tau(\mu^* \| \cdot)$, then $\mu_n$ weakly converges to the set of measures that achieve the global minimum. Formally,

###### Theorem 10.

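Let $X$ be a compact metric space, $\tau$ be an adversarial divergence over $X$, $\mu^* \in \mathcal{P}(X)$, then $\mathrm{OPT}_{\tau,\mu^*} \neq \emptyset$. Let $(\mu_n)$ be a sequence of probability measures in $\mathcal{P}(X)$. If $\tau(\mu^* \| \mu_n) \to \inf_{\mu'} \tau(\mu^* \| \mu')$, then $(\mu_n)$ weakly converges to the set $\mathrm{OPT}_{\tau,\mu^*}$.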

As a special case of Theorem 10, if $\tau$ is a strict adversarial divergence, i.e., $\mathrm{OPT}_{\tau,\mu^*} = \{\mu^*\}$, then converging to the minimizer of the objective function implies the usual weak convergence to the target probability measure. For example, it can be checked that the objective function of $f$-GAN is a strict adversarial divergence, therefore converging in the objective function of an $f$-GAN implies the usual weak convergence to the target probability measure.

To compare this result with our intuition, we return to the example of a sequence of delta distributions and show that as long as $\tau$ is a strict adversarial divergence, $\tau(\delta_1 \| \delta_{1/n})$ does not converge to the global minimum of $\tau(\delta_1 \| \cdot)$. Observe that if $\tau(\delta_1 \| \delta_{1/n})$ converged to the global minimum of $\tau(\delta_1 \| \cdot)$, then according to Theorem 10, $\delta_{1/n}$ would weakly converge to $\delta_1$, which leads to a contradiction.

However Theorem 10 does more than excluding undesired possibilities. It also enables us to give general statements about the structure of the class of adversarial divergences. The structural result can be easily stated under the notion of relative strength between adversarial divergences, which is defined as follows.

###### Definition 11 (Relative strength between adversarial divergences).

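Let $\tau_1$ and $\tau_2$ be two adversarial divergences. If for any sequence of probability measures $(\mu_n)$ and any target probability measure $\mu^*$, $\tau_1(\mu^* \| \mu_n) \to \inf_{\mu} \tau_1(\mu^* \| \mu)$ implies $\tau_2(\mu^* \| \mu_n) \to \inf_{\mu} \tau_2(\mu^* \| \mu)$, then we say $\tau_1$ is stronger than $\tau_2$ and $\tau_2$ is weaker than $\tau_1$. We say $\tau_1$ is equivalent to $\tau_2$ if $\tau_1$ is both stronger and weaker than $\tau_2$. We say $\tau_1$ is strictly stronger (strictly weaker) than $\tau_2$ if $\tau_1$ is stronger (weaker) than $\tau_2$ but not equivalent. We say $\tau_1$ and $\tau_2$ are not comparable if $\tau_1$ is neither stronger nor weaker than $\tau_2$.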

Not much is known about the relative strength between different adversarial divergences. If the underlying sample space is nice (e.g., a subset of Euclidean space), then the variational (GAN-style) formulation of $f$-divergences using bounded continuous functions coincides with the original definition [15], and therefore $f$-divergences are adversarial divergences. [2] showed that the KL-divergence is stronger than the JS-divergence, which is equivalent to the total variation distance, which is strictly stronger than the Wasserstein-1 distance.

However, the novel fact is that we can reach the weakest strict adversarial divergence. Indeed, one implication of Theorem 10 is that if $X$ is a compact metric space and $\tau$ is a strict adversarial divergence over $X$, then $\tau$-convergence implies the usual weak convergence on probability measures. In particular, since the Wasserstein distance metrizes weak convergence of probability distributions (see e.g. [14]), as a direct consequence of Theorem 10, the Wasserstein distance is in the equivalence class of the weakest strict adversarial divergences. In the other direction, there exists a trivial strict adversarial divergence

$$\tau_{\mathrm{Trivial}}(\mu \| \nu) \triangleq \begin{cases} 0, & \text{if } \mu = \nu, \\ +\infty, & \text{otherwise,} \end{cases} \tag{8}$$

that is stronger than any other strict adversarial divergence. We now incorporate our convergence results with some previous results and get the following structural result.

###### Corollary 12.

The class of strict adversarial divergences over a bounded and closed subset of a Euclidean space has the structure as shown in Figure 1, where is defined as in (8), is corresponding to example (c) in Section 3, is corresponding to example (d) in Section 3, and , , , , are corresponding to example (b) in Section 3 with being , , , , , respectively. Each rectangle in Figure 1 represents an equivalence class, inside of which are some examples. In particular, is in the equivalence class of the strongest strict adversarial divergences, while and are in the equivalence class of the weakest strict adversarial divergences.
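The claim that the trivial divergence is stronger than any other strict adversarial divergence follows from a short argument, sketched here with $\tau$ denoting an arbitrary strict adversarial divergence and $(\mu_n)$ an arbitrary sequence (both names are used only for this sketch).

$$\begin{aligned}
\tau_{\mathrm{Trivial}}(\mu^* \| \mu_n) \to \inf_{\mu} \tau_{\mathrm{Trivial}}(\mu^* \| \mu) = 0
&\;\Longrightarrow\; \mu_n = \mu^* \text{ for all sufficiently large } n \\
&\;\Longrightarrow\; \tau(\mu^* \| \mu_n) = \tau(\mu^* \| \mu^*) = \inf_{\mu} \tau(\mu^* \| \mu) \text{ for all sufficiently large } n.
\end{aligned}$$

The first implication holds because $\tau_{\mathrm{Trivial}}$ only takes the values $0$ and $+\infty$; the last equality uses strictness of $\tau$, i.e. that $\mu^*$ is the unique minimizer of $\tau(\mu^* \| \cdot)$.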

## 6 Related Work

There has been an explosion of work on GANs over the past couple of years; however, most of the work has been empirical in nature. A body of literature has looked at designing variants of GANs which use different objective functions. Examples include [10], which proposes using the f-divergence between the target $\mu$ and the generated distribution $\nu$, and [9, 5], which propose the MMD distance. Inspired by previous work, we identify a family of GAN-style objective functions in full generality and show general properties of the objective functions in this family.

There has also been some work on comparing different GAN-style objective functions in terms of their convergence properties, either in a GAN-related setting [2], or in a general IPM setting [12]. Unlike these results, which look at the relationship between several specific strict adversarial divergences, our results apply to an entire class of GAN-style objective functions and establish their convergence properties. For example, [2] shows that the KL-divergence, the JS-divergence, and the total-variation distance are all stronger than the Wasserstein distance, while our results generalize this part of their result and say that any strict adversarial divergence is stronger than the Wasserstein distance and its equivalents. Furthermore, our results also apply to non-strict adversarial divergences.

That being said, it does not mean our results are a complete generalization of the previous convergence results such as [2, 12]. Our results do not provide any methods to compare two strict adversarial divergences if none of them is equivalent to the Wasserstein distance or the trivial divergence. In contrast, [2] show that the KL-divergence is stronger than the JS-divergence, which is equivalent to the total variation distance, which is strictly stronger than the Wasserstein-1 distance.

Finally, there has been some additional theoretical literature on understanding GANs, which consider orthogonal aspects of the problem. [3] address the question of whether we can achieve generalization bounds when training GANs. [13] focus on optimizing the estimating power of kernel distances. [5] study generalization bounds for MMD-GAN in terms of fat-shattering dimension.

## 7 Acknowledgments

We thank Iliya Tolstikhin, Sylvain Gelly, and Robert Williamson for helpful discussions. The work of KC and SL was partially supported by NSF under IIS 1617157.

## References

• [1] C. D. Aliprantis and O. Burkinshaw. Principles of real analysis. Academic Press, 1998.
• [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.
• [3] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). CoRR, abs/1703.00573, 2017.
• [4] H. G. Dales, J. F.K. Dashiell, A.-M. Lau, and D. Strauss. Banach Spaces of Continuous Functions as Dual Spaces. CMS Books in Mathematics. Springer International Publishing, 2016.
• [5] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI 2015.
• [6] A. Genevay, M. Cuturi, G. Peyré, and F. R. Bach. Stochastic optimization for large-scale optimal transport. In NIPS 2016.
• [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS 2014.
• [8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017.
• [9] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In ICML 2015.
• [10] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS 2016.
• [11] W. Rudin. Functional Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, Inc, 1991.
• [12] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
• [13] D. J. Sutherland, H. F. Tung, H. Strathmann, S. De, A. Ramdas, A. J. Smola, and A. Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In ICLR 2017.
• [14] C. Villani. Optimal transport, old and new. Grundlehren der mathematischen Wissenschaften. Springer-Verlag Berlin Heidelberg, 2009.
• [15] Y. Wu. Lecture notes: Information-theoretic methods for high-dimensional statistics. 2017.

## Appendix A Proof of Theorem 4

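We observe that the assumptions of the theorem imply that for any $\mu \in \mathcal{P}(X)$,

$$\tau(\mu \| \mu) = \sup_{\theta \in \Theta} \mathbb{E}_{\mu \otimes \mu}[-r_\theta] = \mathbb{E}_{\mu \otimes \mu}[-r_{\theta_{\mu\mu}}] = -c. \tag{9}$$

The assumptions also imply that for any $\mu, \nu \in \mathcal{P}(X)$,

$$\tau(\mu \| \nu) \ge \mathbb{E}_{\mu \otimes \nu}[m_{\theta_{\mu\nu}} - r_{\theta_{\mu\nu}}] \ge \mathbb{E}_{\mu \otimes \nu}[-r_{\theta_{\mu\nu}}] = -c.$$

Fix $\mu^* \in \mathcal{P}(X)$ and assume $\mu \in M_{\mu^*}$, i.e. $\mathbb{E}_{\mu^*}[v_\theta] = \mathbb{E}_{\mu}[v_\theta]$ for any $\theta \in \Theta$. Then

$$\tau(\mu^* \| \mu) = \sup_{\theta \in \Theta} \mathbb{E}_{\mu^* \otimes \mu}[m_\theta - r_\theta] = \sup_{\theta \in \Theta} \mathbb{E}_{\mu^* \otimes \mu}[-r_\theta] = \mathbb{E}_{\mu^* \otimes \mu}[-r_{\theta_{\mu^*\mu}}] \tag{10}$$

$$= -c. \tag{11}$$

Since $\tau(\mu^* \| \mu') \ge -c$ for every $\mu' \in \mathcal{P}(X)$ by the inequality above, $\mu$ attains the minimum. Therefore $\mu \in \mathrm{OPT}_{\tau,\mu^*}$.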

## Appendix B Proof of Theorem 5

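Since by Theorem 4 we already have $\mathrm{OPT}_{\tau,\mu^*} \supset M_{\mu^*}$, we only need to prove that for any $\mu^* \in \mathcal{P}(X)$,

$$\mathcal{P}(X) \setminus M_{\mu^*} \subseteq \mathcal{P}(X) \setminus \mathrm{OPT}_{\tau,\mu^*}.$$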

Fix . Assume there exists such that . If , then we have . Now note that

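$$\begin{aligned} \tau(\mu^* \| \mu) &= \sup_{\theta \in \Theta} \mathbb{E}_{\mu^* \otimes \mu}[m_\theta - r_\theta] \\ &\ge \mathbb{E}_{\mu^* \otimes \mu}[m_{\theta_{\mu^*\mu}} - r_{\theta_{\mu^*\mu}}] \\ &= \mathbb{E}_{\mu^* \otimes \mu}[m_{\theta_{\mu^*\mu}}] - \mathbb{E}_{\mu^* \otimes \mu}[r_{\theta_{\mu^*\mu}}] \\ &> -\mathbb{E}_{\mu^* \otimes \mu}[r_{\theta_{\mu^*\mu}}] \\ &= -c \\ &= \tau(\mu^* \| \mu^*), \end{aligned}$$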

where the last equality is due to (9). Thus . For the rest of the proof we assume . Then by (6) we have . Also because is an interior point of and is a minimizer of , by Fermat’s stationary points theorem, we have . Therefore . Thus again by Fermat’s stationary points theorem there exists a such that

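$$\begin{aligned} \mathbb{E}_{\mu^* \otimes \mu}[m_{\theta'} - r_{\theta'}] &> \mathbb{E}_{\mu^* \otimes \mu}[m_{\theta_{\mu^*\mu}} - r_{\theta_{\mu^*\mu}}] \\ &\ge \mathbb{E}_{\mu^* \otimes \mu}[-r_{\theta_{\mu^*\mu}}] \\ &= -c \\ &= \tau(\mu^* \| \mu^*), \end{aligned}$$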

where the last equality is due to (9). Finally note that

$$\tau(\mu^* \| \mu) \ge \mathbb{E}_{\mu^* \otimes \mu}[m_{\theta'} - r_{\theta'}] > \tau(\mu^* \| \mu^*).$$

Therefore . This concludes the proof.

## Appendix C Proof of Corollary 6

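Recall that by assumption $f^*(x) \ge x$ for any $x \in \mathbb{R}$ and $f^*(x_0) = x_0$ for some $x_0 \in \operatorname{int}(\operatorname{dom} f^*)$. Since $f^*$ is continuously differentiable on $\operatorname{int}(\operatorname{dom} f^*)$ and $x_0$ minimizes $x \mapsto f^*(x) - x$, necessarily we have $(f^*)'(x_0) = 1$.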

For each , let , , , then . and are bounded continuous functions since both and are continuous functions, for any , and is a compact set. Let , that is, a vector whose last coordinate is and elsewhere. We have that for any

 τ(μ∗||μ)≥Eμ∗⊗μ[mθμ∗μ−rθμ∗μ]=Eμ∗⊗μ[f(x0)−x