Generative adversarial networks (GANs) [goodfellow2014generative] are a novel method for statistical inference that has received a great deal of recent attention. Given input samples from a data distribution, inference is carried out as a two-player game between a generator and a discriminator, which are usually neural networks with pre-specified architectures. The generator attempts to generate samples that progressively mimic the input data; the discriminator attempts to distinguish the input samples from those produced by the generator. The game continues until the discriminator can no longer tell whether an instance comes from the input or was produced by the generator, at which point the generator is said to have learned the data distribution.
While generative adversarial networks have achieved much empirical success, the factors contributing to their success remain poorly understood. For example, even if we ignore finite sample and optimization issues, it is still unknown what the GAN solution looks like, and what its relationship is to classical statistical solutions such as maximum likelihood and the method of moments. Properties of the solution are partially understood when the generator is unrestricted [Nowozin2016f, goodfellow2014generative, liu2017approximation] and can produce samples from any distribution. In practice, we always have model mismatch – the class of distributions that the generator can produce samples from is restricted, and the input data distribution usually does not lie in this class. In this case, the relationship between the generator class, the discriminator class and the output distribution remains poorly understood.
In this paper, we consider this problem in the context of restricted f-GANs – which are f-GANs [Nowozin2016f] where the discriminator belongs to a restricted class of functions. We provide a theoretical characterization of the solutions obtained in these cases under model mismatch. Our analysis relies on the Fenchel-Moreau theorem and Ky Fan’s minimax theorem, with subroutines heavily inspired by [rockafellar1968integrals, rockafellar2015measures, rockafellar2018risk].
An important consequence of our result can be seen when we specialize it to linear KL-GANs – f-GANs whose objective function corresponds to the variational form of the KL-divergence, and whose discriminator class is the set of all functions linear over a pre-specified feature set. In this case, we show that the distribution induced by the optimal generator is neither the maximum likelihood nor the method of moments solution, but an interesting combination of both.
The basic problem of statistical inference is as follows. We are given samples from an unknown underlying distribution. Let $P$ denote the empirical distribution of the input samples; our goal is to find a distribution $Q$ in a distribution class $\mathcal{Q}$ to approximate $P$.
The problem is typically solved by using an objective function $D(P, Q)$ that measures how well $Q$ fits the data, and then finding a $Q$ as follows:
$$\min_{Q \in \mathcal{Q}} D(P, Q) \qquad (1)$$
Here, a large $D(P, Q)$ means that $Q$ fits the data poorly, and different choices of $D$ lead to different inference solutions.
2.1 Background: Maximum Likelihood and Method of Moments
Most classical statistical literature has looked at two major categories of inference methods – maximum likelihood estimation and the method of moments.
Maximum Likelihood Estimation.
In maximum likelihood estimation (MLE), the goal is to select the distribution in $\mathcal{Q}$ that maximizes the likelihood of generating the data. For ease of discussion, let us assume that there is a base measure on the instance space, and that $p(x)$ and $q(x)$ are the density functions of $P$ and $Q$ respectively at $x$ with respect to this base measure. The goal of maximum likelihood estimation is to find:
$$\arg\max_{Q \in \mathcal{Q}} \mathbb{E}_{x \sim P}\left[\log q(x)\right]$$
Since $P$ is fixed, this is equivalent to finding the minimizer of:
$$\mathrm{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\left[\log \frac{p(x)}{q(x)}\right]$$
Thus the objective function in (1) for MLE is the KL-divergence.
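As a quick numerical check of this equivalence, the following minimal sketch works on a three-point instance space. The distribution `p_hat` and the candidate models are made-up illustrative values, not taken from the paper. The sketch verifies that the expected log-likelihood and the KL-divergence differ only by a term that does not depend on the model, so the maximizer of one is the minimizer of the other.

```python
import math

def expected_log_likelihood(p, q):
    """E_{x~P}[log q(x)] for distributions given as lists of probabilities."""
    return sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) = E_{x~P}[log(p(x)/q(x))]; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical empirical distribution and candidate models (illustrative values).
p_hat = [0.5, 0.3, 0.2]
candidates = {
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "skewed": [0.6, 0.3, 0.1],
    "reversed": [0.2, 0.3, 0.5],
}

# E_P[log p] is fixed once the data are fixed; it does not depend on Q.
data_term = sum(pi * math.log(pi) for pi in p_hat)

for q in candidates.values():
    ll = expected_log_likelihood(p_hat, q)
    kl = kl_divergence(p_hat, q)
    # KL(P || Q) = E_P[log p] - E_P[log q]
    assert abs(kl - (data_term - ll)) < 1e-12

mle_choice = max(candidates, key=lambda k: expected_log_likelihood(p_hat, candidates[k]))
kl_choice = min(candidates, key=lambda k: kl_divergence(p_hat, candidates[k]))
print(mle_choice, kl_choice)
```

Both selections pick the same candidate, as the identity in the loop implies.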
Method of Moments.
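An alternative method for statistical inference, which dates back to Chebyshev and has recently seen renewed interest, is the method of moments. In the generalized method of moments (GMM) [hansen1982large], in addition to the data and the distribution class $\mathcal{Q}$, we are given a set of relevant feature functions $\psi$ over the instance space. The goal is to find the minimizer:
$$\arg\min_{Q \in \mathcal{Q}} \left\| \mathbb{E}_{x \sim P}[\psi(x)] - \mathbb{E}_{x \sim Q}[\psi(x)] \right\|_2$$
Thus, for GMM, the objective function in (1) is $\left\| \mathbb{E}_{x \sim P}[\psi(x)] - \mathbb{E}_{x \sim Q}[\psi(x)] \right\|_2$.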
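The two classical objectives can select different models from the same class. The sketch below is an illustration under stated assumptions: a three-point instance space {0, 1, 2}, a single hypothetical feature function (the identity, so GMM matches means), the 2-norm as the moment discrepancy, and two made-up candidate distributions. None of these values come from the paper.

```python
import math

def feature_mean(dist, features):
    """E_{x~dist}[psi(x)] for each feature function psi, on the space {0, 1, ...}."""
    return [sum(prob * psi(x) for x, prob in enumerate(dist)) for psi in features]

def gmm_objective(p, q, features):
    """2-norm of the gap between feature expectations under P and Q (assumed norm)."""
    mean_p = feature_mean(p, features)
    mean_q = feature_mean(q, features)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mean_p, mean_q)))

def expected_log_likelihood(p, q):
    """E_{x~P}[log q(x)], the quantity MLE maximizes."""
    return sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical feature set: psi(x) = x, so GMM compares means only.
features = [lambda x: float(x)]

# Illustrative data distribution and a two-element model class.
p_hat = [0.5, 0.3, 0.2]
candidates = {
    "a": [0.6, 0.3, 0.1],     # close to p_hat pointwise, but the mean is off
    "b": [0.64, 0.02, 0.34],  # matches the mean of p_hat, but fits poorly pointwise
}

gmm_choice = min(candidates, key=lambda k: gmm_objective(p_hat, candidates[k], features))
mle_choice = max(candidates, key=lambda k: expected_log_likelihood(p_hat, candidates[k]))
print("GMM picks", gmm_choice, "; MLE picks", mle_choice)
```

Candidate `b` reproduces the mean of the data exactly and so wins under the moment objective, while candidate `a` has the higher likelihood.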
Our goal is to understand how the solutions provided by GANs relate to these two standard ways of doing inference.
2.2 f-Divergences and f-GANs
For the rest of the paper, we assume that we have an underlying probability space; all distributions we consider below are measures over this space.
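Definition 1 (f-divergence, [ali1966general, csiszar1967information]).
Suppose $f: \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ is a lower semi-continuous convex function such that $f(1) = 0$, $f$ is finite in some neighbourhood of $1$, and $f(x) = +\infty$ for any $x < 0$. Let $P$ and $Q$ be probability measures over the underlying space where $P$ is absolutely continuous with respect to $Q$. Then, the f-divergence of $P$ from $Q$ is defined as:
$$D_f(P \,\|\, Q) = \mathbb{E}_{Q}\left[f\left(\frac{dP}{dQ}\right)\right]$$
Let $f^*$ be the convex conjugate function of $f$, given by $f^*(y) = \sup_{x \in \mathbb{R}} \{xy - f(x)\}$; it is well-known that the f-divergences also have a variational formulation [keziou2003dual, nguyen2010estimating] under certain conditions:
$$D_f(P \,\|\, Q) = \sup_{h} \left\{ \mathbb{E}_{P}[h] - \mathbb{E}_{Q}[f^*(h)] \right\}$$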
where the supremum is taken over, informally speaking, all possible functions. More details will be discussed in Section 3.4.
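The variational formulation can be checked numerically on a finite space. The sketch below assumes the standard form of the variational objective, E_P[h] - E_Q[f*(h)], and takes f(x) = x log x (which generates the KL-divergence), whose conjugate is f*(y) = exp(y - 1). The distributions are made-up illustrative values. The candidate maximizer h(x) = 1 + log(p(x)/q(x)) follows from setting the derivative of the pointwise objective to zero.

```python
import math
import random

def f(x):
    """Generator of the KL-divergence, with the convention 0 log 0 = 0."""
    return x * math.log(x) if x > 0 else 0.0

def f_conj(y):
    """Convex conjugate of x log x: sup_x {xy - x log x} = exp(y - 1)."""
    return math.exp(y - 1.0)

def f_divergence(p, q):
    """D_f(P || Q) = E_Q[f(dP/dQ)] on a finite space with q(x) > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def variational_objective(h, p, q):
    """E_P[h] - E_Q[f*(h)] for a discriminator h given as a list of values."""
    return sum(pi * hi for pi, hi in zip(p, h)) - sum(qi * f_conj(hi) for qi, hi in zip(q, h))

# Illustrative distributions on a three-point space.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

kl = f_divergence(p, q)

# Pointwise maximizer of the variational objective for f(x) = x log x.
h_star = [1.0 + math.log(pi / qi) for pi, qi in zip(p, q)]
best = variational_objective(h_star, p, q)

# Any other discriminator gives a lower bound on the divergence.
rng = random.Random(0)
others = [
    variational_objective([rng.uniform(-3.0, 3.0) for _ in p], p, q)
    for _ in range(1000)
]
print(kl, best, max(others))
```

The objective at `h_star` equals the divergence, and every randomly drawn discriminator stays below it, which is the sense in which a discriminator certifies a lower bound.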
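Inspired by this variational formulation, [Nowozin2016f] introduces a family of GANs, called f-GANs, that use an f-divergence as the objective function in (1). Inference is then formulated as solving the following minimax problem:
$$\min_{Q \in \mathcal{Q}} \sup_{h \in \mathcal{H}} \left\{ \mathbb{E}_{P}[h] - \mathbb{E}_{Q}[f^*(h)] \right\} \qquad (4)$$
where $\mathcal{H}$ is a sufficiently large function class.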
The standard GAN [goodfellow2014generative] is a special case of (4), obtained for a particular choice of f that corresponds to the Jensen-Shannon divergence.
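The link to the Jensen-Shannon divergence can be verified numerically. The generator function used below, f(u) = u log u - (u + 1) log(u + 1), is the one commonly associated with the original GAN objective in the f-GAN literature; whether the paper uses exactly this form (or a shifted version with f(1) = 0) is an assumption here. The distributions are illustrative.

```python
import math

def f_gan(u):
    """Assumed generator function for the standard GAN objective."""
    return u * math.log(u) - (u + 1.0) * math.log(u + 1.0)

def f_gan_shifted(u):
    """Same function shifted by a constant so that f(1) = 0."""
    return f_gan(u) + 2.0 * math.log(2.0)

def f_divergence(p, q, f):
    """E_Q[f(dP/dQ)] on a finite space where p(x), q(x) > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """JS(P, Q) = KL(P || M)/2 + KL(Q || M)/2 with M the midpoint mixture."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

raw = f_divergence(p, q, f_gan)              # equals 2*JS - 2*log(2)
shifted = f_divergence(p, q, f_gan_shifted)  # equals 2*JS
js = jensen_shannon(p, q)
print(raw, shifted, 2.0 * js)
```

Adding the constant 2 log 2 makes f(1) = 0, as Definition 1 requires, and turns the divergence into exactly twice the Jensen-Shannon divergence.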
2.3 Restricted f-divergences and Restricted f-GANs
To reduce the sample requirement [arora2017generalization], one might want to restrict the discriminator class in (4) to be a relatively small function class $\mathcal{H}$. To this end, we define the restricted f-divergence as
$$D_{f,\mathcal{H}}(P \,\|\, Q) = \sup_{h \in \mathcal{H}} \left\{ \mathbb{E}_{P}[h] - \mathbb{E}_{Q}[f^*(h)] \right\} \qquad (5)$$
Footnote 1: Note that [ruderman2012tighter] also uses the term “restricted f-divergence”, but for a very different purpose.
In practice, the discriminator class is often implemented by a neural network [Nowozin2016f]; therefore f-GANs are in fact restricted f-GANs that solve the following minimax problem:
$$\min_{Q \in \mathcal{Q}} D_{f,\mathcal{H}}(P \,\|\, Q)$$
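To see what restricting the discriminator does, the sketch below evaluates the supremum of E_P[h] - E_Q[f*(h)] over a small linear class on a three-point space and compares it with the unrestricted KL-divergence. Everything specific here is an assumption made for illustration: f(x) = x log x with conjugate exp(y - 1), discriminators of the form h(x) = theta0 + theta1 * psi(x) with the hypothetical feature psi(x) = x, plain gradient ascent with a fixed step size, and made-up distributions.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(theta0, theta1, p, q, psi):
    """E_P[h] - E_Q[exp(h - 1)] for the linear discriminator h = theta0 + theta1*psi."""
    total = 0.0
    for x in range(len(p)):
        h = theta0 + theta1 * psi(x)
        total += p[x] * h - q[x] * math.exp(h - 1.0)
    return total

def restricted_kl(p, q, psi, steps=5000, lr=0.1):
    """Gradient ascent on the (concave) objective over the two linear parameters."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        grad0 = grad1 = 0.0
        for x in range(len(p)):
            h = theta0 + theta1 * psi(x)
            residual = p[x] - q[x] * math.exp(h - 1.0)
            grad0 += residual
            grad1 += residual * psi(x)
        theta0 += lr * grad0
        theta1 += lr * grad1
    return objective(theta0, theta1, p, q, psi), (theta0, theta1)

def psi(x):
    """Hypothetical single feature on the space {0, 1, 2}."""
    return float(x)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

full = kl(p, q)
restricted, theta = restricted_kl(p, q, psi)
print("unrestricted:", full, "restricted:", restricted)
```

Because the linear class cannot represent the unrestricted maximizer 1 + log(p/q) for these distributions, the restricted value is strictly smaller than the KL-divergence, yet still positive because the feature means of the two distributions differ.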
3 Main Result
We begin by stating our main result in its most general form.
3.1 Additional Notations
We start out by introducing some notation. Recall that we have an underlying probability space . Let be the set of all real-valued bounded and measurable functions on equipped with the topology induced by the uniform norm.
We use to denote the set of all bounded and finitely additive signed measures over , to denote the set of all finitely additive probability measures over , and to denote the set of all countably additive probability measures over . Note that .
For any and , we write to denote that is absolutely continuous w.r.t. ; that is, for any , . Furthermore, if both and are countably additive, we use to denote the Radon-Nikodym derivative.
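On a finite space with countably additive measures, both notions reduce to simple pointwise checks: one measure is absolutely continuous with respect to another exactly when it puts no mass on points the other ignores, and the Radon-Nikodym derivative is the ratio of point masses. The sketch below illustrates this special case only; the helper names and the distributions are hypothetical.

```python
def is_absolutely_continuous(p, q):
    """True if every point with q-mass zero also has p-mass zero (finite space)."""
    return all(pi == 0 for pi, qi in zip(p, q) if qi == 0)

def radon_nikodym(p, q):
    """Pointwise density dP/dQ on a finite space; set to 0 where q has no mass."""
    if not is_absolutely_continuous(p, q):
        raise ValueError("P is not absolutely continuous with respect to Q")
    return [pi / qi if qi > 0 else 0.0 for pi, qi in zip(p, q)]

# Illustrative point masses on a four-point space.
p = [0.5, 0.5, 0.0, 0.0]
q = [0.25, 0.25, 0.5, 0.0]

density = radon_nikodym(p, q)

# Integrating the density against Q recovers P pointwise, hence on every event.
recovered = [d * qi for d, qi in zip(density, q)]
print(density, recovered)
```

Here P is absolutely continuous with respect to Q but not the other way around, since Q puts mass on a point that P ignores.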
3.2 General Result
We begin with a slight generalization of the definition of restricted f-divergences in (5). Let the functional be a regularizer; we can define
then we have .
An important property of the functional is shift-invariance.
Definition 2 (shift invariant).
is said to be shift invariant if for any and , .
We are also interested in the convex conjugate of , denoted by . According to Theorem 9 in the appendix, the functional , although defined on by definition, can be equivalently defined on such that
We are now ready for our main result.
Theorem 3.
If is convex and shift invariant, , and , then
We remark here that when has finite support and takes value outside an RKHS, Theorem 3 basically reduces to Theorem 2 in [ruderman2012tighter]; in this special case the proof can be greatly simplified.
Returning to the special case of , recall that in this case
and note that
we have the following corollary.
Corollary 4.
If is a convex subset of and for any , , we have , then for any and ,
3.3 Implication for Linear f-GANs
Finally, because of its importance, it is worth emphasizing the special case of linear f-GANs. Recall that linear f-GANs minimize the objective where and is a convex subset of . In this case, take in Corollary 4 to be , then
In particular, if , then using the Cauchy-Schwarz inequality in (10) we have
We would like to point out that more generally, for any such that , if , using Hölder’s inequality, we have
But to keep the discussion concise, we state the results with .
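The generic fact behind both steps is that the supremum of a linear function over a norm ball equals the radius times the dual norm: Cauchy-Schwarz gives the 2-norm case, and Hölder's inequality gives the case of conjugate exponents. The sketch below checks this numerically; the vector `v`, the radius `lam`, and the helper names are hypothetical, and the sketch does not reproduce the specific sets or equations of the paper.

```python
import math
import random

def norm(v, p):
    """The p-norm of a vector, for 1 < p < infinity."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maximizer(v, lam, p):
    """Point of the p-norm ball of radius lam where Hölder's inequality is tight."""
    q = p / (p - 1.0)  # conjugate exponent, 1/p + 1/q = 1
    scale = norm(v, q) ** (q - 1.0)
    return [lam * math.copysign(abs(x) ** (q - 1.0), x) / scale for x in v]

v = [0.3, -1.2, 0.5]  # stands in for a gap between feature expectations
lam = 2.0             # radius of the parameter ball
rng = random.Random(0)

results = {}
for p in (2.0, 3.0, 1.5):
    q = p / (p - 1.0)
    theta_star = maximizer(v, lam, p)
    attained = dot(theta_star, v)
    bound = lam * norm(v, q)
    # Random points of the ball never exceed the bound.
    worst = -math.inf
    for _ in range(500):
        u = [rng.uniform(-1.0, 1.0) for _ in v]
        radius = lam * rng.random()
        theta = [radius * x / norm(u, p) for x in u]
        worst = max(worst, dot(theta, v))
    results[p] = (attained, bound, worst, norm(theta_star, p))
print(results)
```

For p = 2 the maximizer is simply `v` rescaled to length `lam`, which is the Cauchy-Schwarz case.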
As a consequence, we have the following corollary.
If where is a positive real number, then for any and ,
Observe that when and , is the extended -divergence, which we denote as . Specifically,
Therefore, we have that when and ,
In contrast with the maximum likelihood and method of moments estimators, the linear KL-GAN solution is an interesting combination of both when there is model mismatch. Table 1 provides a summary of the differences.
It is also possible to consider the case where , which means . In this case
This will result in the following corollary.
If , then for any and ,
3.4 Variational Representation of f-divergences
In this section, we will explain why equality (a) in (8) holds. The following theorem, which is complementary to Theorem 2.1 in [keziou2003dual] and Lemma 1 in [nguyen2010estimating] (see Footnote 2), gives a rigorous variational representation of the f-divergence.
Footnote 2: We would like to note two things here. First, the “only if” part of Lemma 1 in [nguyen2010estimating] is unproved and does not hold; therefore our result does not contradict theirs. Second, while both [keziou2003dual] and [nguyen2010estimating] mention that the supremum can be attained at , this sub-differential may not exist (especially when can take value ), and even if is well-defined everywhere needed, it is possible that the sub-differential is not bounded, hence not in ; therefore, their results do not imply ours.
Theorem 8.
For any probability measures and over such that is absolutely continuous with respect to ,
4 Related Work
As a novel method for statistical inference, generative adversarial networks [goodfellow2014generative] have sparked a great deal of follow-up work on both theoretical and empirical sides.
The works most relevant to ours are [goodfellow2014generative, Nowozin2016f] and [liu2017approximation]. [goodfellow2014generative] shows that when both generators and discriminators are unrestricted, the optimal GAN solution converges to the input data distribution. [Nowozin2016f] introduces f-GANs – given samples from an unknown data distribution $P$, the objective is to find a distribution $Q$ that minimizes $D_f(P \,\|\, Q)$, where $D_f$ is an f-divergence. They show that minimizing this objective is equivalent to a GAN where the discriminators are unrestricted, and the objective corresponds to the variational form of the relevant f-divergence.
[liu2017approximation] considers approximation properties of GANs when the discriminators are restricted, but the input distribution lies in the interior of the class of distributions that can be produced by the generators – in short, there is no model mismatch. They show that in this case, the solution produced by linear f-GANs – that is, f-GANs whose discriminators are linear over a pre-specified feature space – has the property that $\mathbb{E}_{x \sim P}[\psi(x)] = \mathbb{E}_{x \sim Q}[\psi(x)]$. In other words, the optimal solution agrees with the generalized method of moments solution. Our work can be thought of as an extension of this work to the model mismatch case. [nock2017f] provides an information-geometric characterization of f-GANs when the input and the generator belong to a class of distributions called the deformed exponential family.
On the theoretical side, [arora2017generalization, singh2018nonparametric, liang2017well, bai2018approximability, feizi2017understanding] consider finite sample issues in GANs under different objective functions in various parametric and non-parametric settings, and provide bounds on their sample requirement. [biau2018some] provides asymptotic convergence bounds on GAN solutions when both generators and discriminators are unrestricted. [bottou2018geometrical] provides an analysis of the geometry of different GAN objective functions, with a view towards explaining their relative performance.
Finally, there has also been much recent work on the theoretical analysis of the optimization challenges that arise in the inference process of GANs; some examples include [heusel2017gans, nagarajan2017gradient, li2017towards, mescheder2017numerics, barnett2018convergence].
In conclusion, we provide a theoretical characterization of the distribution induced by the optimal generator in generative adversarial learning. Unlike prior work [goodfellow2014generative, liu2017approximation], our result applies when both the generator and the discriminator are restricted. When applied to linear f-GANs, our characterization shows that the optimal linear KL-GAN solution offers an interesting mix of maximum likelihood and the method of moments.
Our work assumes that a sufficient number of samples is always available and that the optimal solution is always attainable. We believe removing these assumptions is an important avenue for future work.
We thank NSF under IIS 1617157 and ONR under N00014-16-1-261 for research support.
Appendix A Preliminaries for the Proofs
For any , denote by the indicator function that takes value over and everywhere else. We will sometimes use constants to represent constant functions. For any two real-valued functions and defined over the same domain , we write if for any , . For any topological vector space , we denote by the topological dual of , which is the set of all continuous linear functions over .
Theorem 9 (dual of [hildebrandt1934bounded]).
can be identified with by defining for any and any
Definition 10 (general convex conjugacy [rockafellar1968integrals]).
Let be a pair of real vector spaces, be a real bilinear function of and , and be a proper convex function, then we can define on the conjugate of , denoted by , as
and define on the conjugate of , denoted by , as
if only is specified, then it is assumed that is the domain of and is , and the bilinear function is given by for and .
Theorem 11 (Fenchel-Moreau, [zalinescu2002convex] Theorem 2.3.3).
If is a Hausdorff locally convex space, and is a proper lower semi-continuous convex function on , then .
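A one-dimensional numerical illustration may help here. Assuming the conclusion of the theorem is the usual biconjugate identity (the conjugate of the conjugate recovers the original function), the sketch below checks it for f(x) = x log x, whose conjugate is exp(y - 1) in closed form. The grid range and resolution are arbitrary choices for illustration.

```python
import math

def f(x):
    """f(x) = x log x on x > 0, with f(0) = 0 and f(x) = +inf for x < 0."""
    if x > 0:
        return x * math.log(x)
    return 0.0 if x == 0 else math.inf

def f_conj(y):
    """Closed-form conjugate: sup_x {xy - x log x} = exp(y - 1)."""
    return math.exp(y - 1.0)

def biconjugate(x, y_grid):
    """Grid approximation of f**(x) = sup_y {xy - f*(y)}."""
    return max(x * y - f_conj(y) for y in y_grid)

# Grid over y in [-10, 5] with step 0.001 (illustrative resolution).
y_grid = [-10.0 + 0.001 * i for i in range(15001)]

points = [0.1, 0.5, 1.0, 2.0, 3.0]
gaps = [f(x) - biconjugate(x, y_grid) for x in points]
print(gaps)
```

Each gap is nonnegative, because the grid supremum can only under-estimate the true supremum, and is tiny because the exact maximizer y = 1 + log x lies inside the grid range for the chosen points.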
with the usual topology is a Hausdorff locally convex space.
The usual topology on can be induced by the usual norm on and a normed space is a Hausdorff locally convex space. ∎
is a Hausdorff locally convex space.
The topology on is induced from the uniform norm and a normed space is a Hausdorff locally convex space. ∎
is a proper lower semi-continuous convex function.
Recall that by assumption is a lower semi-continuous convex function. To see is also proper, note that by assumption , therefore is a lower semi-continuous convex function that takes finite value at some point, hence also a proper function. ∎
is a proper lower semi-continuous convex function.
Note that , and the weak* topology on is the same as the usual topology. Therefore is a lower semi-continuous convex function ([zalinescu2002convex] Theorem 2.3.1). is proper because is proper. ∎
This is because by assumption for any , and then by Fact 16, for any . This means is non-decreasing. ∎
This is because by assumption , and then by Fact 16, . Therefore by the definition of , we have ∎
We will need the following definition and result from [rockafellar1968integrals], which we state in simplified form because in our case the underlying measure is a probability measure (instead of a σ-finite measure in their case) and we only consider real-valued functions (instead of vector-valued functions in their case).
Definition 19 (decomposable , [rockafellar1968integrals] simplified).
We say a set of real-valued measurable functions over is decomposable if
for any and , .
Theorem 20 ([rockafellar1968integrals], corollary of Theorem 2, simplified).
Let . Suppose and are decomposable and for any and the function is integrable w.r.t. , is a lower semi-continuous proper convex function, then for any
Appendix B Proof of Theorem 8
Note that by the Radon-Nikodym theorem ([folland1999real] Theorem 3.8), for each bounded and countably additive signed measure on that is absolutely continuous w.r.t. , there is an element in , denoted by (the Radon-Nikodym derivative), such that
Here (a) is from (12). To see why (b) holds, note that and are decomposable spaces (as defined in Definition 19) such that for any and , the function is integrable w.r.t. , is a lower semi-continuous proper convex function by Fact 15, and by Fact 16; therefore we can apply Theorem 20 and get the equality.
Appendix C Proof of Theorem 3
Define the functional to be
We first show some properties related to .
If and , then for any , .
where (a) and (b) are because . ∎
Lemma 22.
The function defined on is lower semi-continuous.
Note that is a decomposable space (as defined in Definition 19), and for any , the function is integrable w.r.t. , and is a lower semi-continuous proper convex function by Fact 14; then according to Theorem 20,
Note that for each , the function defined on is a continuous linear function, therefore the r.h.s. of (14) is the supremum of linear continuous functions, hence a lower semi-continuous function. ∎
Lemma 23.
For any , , sequence in , sequence in , sequence in , if , , and for every , then has a convergent subsequence whose limit point satisfies
We first prove that the sequence is bounded. We prove this by contradiction. Suppose is not bounded, since and , we have that for any , there exists such that
However, by assumption is finite in a neighbourhood of ; along with Fact 16, this implies that as . Therefore we have
Because and , we have that is bounded for , therefore
which contradicts the assumption that for every because .
Now by the Bolzano-Weierstrass theorem, the bounded sequence has a convergent subsequence , whose limit point we denote by . Let be any positive real number; we will show that
Because by assumption and and for every , we have that for large enough
Lemma 22 says that the function is lower semi-continuous, since , this implies that
Because we can choose to be arbitrarily small, we can conclude that
The infimum in the definition of as in (13) can be attained.
Let . We need to show that there exists such that and
By the definition of infimum, there exists a sequence in and a sequence in such that
Applying Lemma 23, where , , and , we have that there exists a subsequence of whose limit point satisfies
Therefore the infimum in the definition of is attained at .
The functional has the following properties
is lower semi-continuous,
For any , if , then .
If is a constant function, then .
We will prove ()-() separately:
Proof of ().
Proof of ().
For any , , , we need to show that
If either (a) or (b) is infinite, then (17) is trivially true; therefore we assume both of them to be finite. In this case, for any , there exists and such that
We can see that