The Inductive Bias of Restricted f-GANs

09/12/2018 ∙ by Shuang Liu, et al. ∙ University of California, San Diego

Generative adversarial networks are a novel method for statistical inference that have achieved much empirical success; however, the factors contributing to this success remain ill-understood. In this work, we attempt to analyze generative adversarial learning -- that is, statistical inference as the result of a game between a generator and a discriminator -- with the view of understanding how it differs from classical statistical inference solutions such as maximum likelihood inference and the method of moments. Specifically, we provide a theoretical characterization of the distribution inferred by a simple form of generative adversarial learning called restricted f-GANs -- where the discriminator is a function in a given function class, the distribution induced by the generator is restricted to lie in a pre-specified distribution class and the objective is similar to a variational form of the f-divergence. A consequence of our result is that for linear KL-GANs -- that is, when the discriminator is a linear function over some feature space and f corresponds to the KL-divergence -- the distribution induced by the optimal generator is neither the maximum likelihood nor the method of moments solution, but an interesting combination of both.

1 Introduction

Generative adversarial networks (GANs) [goodfellow2014generative] are a novel method for statistical inference that have received a great deal of recent attention. Given input samples from a data distribution, inference is carried out in the form of a two-player game between a generator and a discriminator, which are usually neural networks with pre-specified architectures. The generator attempts to generate samples that progressively mimic the input data; the discriminator attempts to accurately discriminate between the input and samples produced by the generator. The game continues until the discriminator fails to detect if an instance comes from the input or is produced by the generator, at which point the generator is said to have learned the data distribution.
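
To make the game concrete, below is a minimal numerical sketch of the two-player dynamic on one-dimensional data. The location-shift generator, the logistic discriminator, the learning rates, and all variable names are illustrative assumptions for this sketch only; they are not the neural-network architectures used in practice or anywhere in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=2000)   # samples from the "unknown" data distribution

theta = 0.0                  # generator parameter: G(z) = z + theta, with z ~ N(0, 1)
w = np.zeros(2)              # discriminator parameters: D(x) = sigmoid(w[0] + w[1] * x)
lr_d, lr_g, batch = 0.05, 0.05, 128

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for step in range(2000):
    # Discriminator ascent step: maximize E_data[log D(x)] + E_gen[log(1 - D(x))].
    x_real = rng.choice(data, size=batch)
    x_fake = rng.normal(size=batch) + theta
    a_real, a_fake = w[0] + w[1] * x_real, w[0] + w[1] * x_fake
    # d/dw log sigmoid(a) = (1 - sigmoid(a)) * da/dw,  d/dw log(1 - sigmoid(a)) = -sigmoid(a) * da/dw
    grad_w = np.array([
        np.mean(1.0 - sigmoid(a_real)) - np.mean(sigmoid(a_fake)),
        np.mean((1.0 - sigmoid(a_real)) * x_real) - np.mean(sigmoid(a_fake) * x_fake),
    ])
    w += lr_d * grad_w

    # Generator ascent step (non-saturating): maximize E_z[log D(z + theta)].
    x_fake = rng.normal(size=batch) + theta
    a_fake = w[0] + w[1] * x_fake
    theta += lr_g * np.mean((1.0 - sigmoid(a_fake)) * w[1])

print("learned generator shift:", round(theta, 2))   # should drift toward the data mean, 2.0
```

Alternating the two gradient steps is the usual training heuristic; this sketch uses the non-saturating generator update so that the generator still receives informative gradients when the discriminator is confident.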

While generative adversarial networks have achieved much empirical success, the factors contributing to their success remain a mystery. For example, even if we ignore finite sample and optimization issues, it is still unknown what the GAN solution looks like, and what its relationship is to classical statistical solutions such as maximum likelihood and the method of moments. Properties of the solution are partially understood when the generator is unrestricted [Nowozin2016f, goodfellow2014generative, liu2017approximation] and can produce samples from any distribution. In practice, we always have model mismatch – the class of distributions that the generator produces samples from is restricted, and the input data distribution usually does not lie in this class. In this case, the relationship between the generator class, the discriminator class and the output distribution remains ill-understood.

In this paper, we consider this problem in the context of restricted f-GANs – f-GANs [Nowozin2016f] in which the discriminator belongs to a restricted class of functions. We provide a theoretical characterization of the solutions obtained in this setting under model mismatch. Our analysis relies on the Fenchel-Moreau theorem and Ky Fan's minimax theorem, with subroutines heavily inspired by [rockafellar1968integrals, rockafellar2015measures, rockafellar2018risk].

An important consequence of our result can be seen when we specialize it to linear KL-GANs – f-GANs whose objective function corresponds to the variational form of the KL-divergence, and whose discriminator class is the set of all functions that are linear over a pre-specified feature set. In this case, we show that the distribution induced by the optimal generator is neither the maximum likelihood nor the method of moments solution, but an interesting combination of both.

2 Preliminaries

The basic problem of statistical inference is as follows. We are given samples from an unknown underlying distribution $q$. Let $\hat{q}$ denote the empirical distribution of the input samples; our goal is to find a distribution $p$ in a distribution class $\mathcal{P}$ to approximate $\hat{q}$.

The problem is typically solved by choosing an objective function $\Delta(\hat{q}, p)$ that measures how well $p$ fits the data, and then finding a $\hat{p}$ as follows:

$$\hat{p} \in \operatorname*{argmin}_{p \in \mathcal{P}} \Delta(\hat{q}, p). \qquad (1)$$

Here, a large $\Delta(\hat{q}, p)$ means that $p$ fits the data poorly, and different choices of $\Delta$ lead to different inference solutions.

2.1 Background: Maximum Likelihood and Method of Moments

Most classical statistical literature has looked at two major categories of inference methods – maximum likelihood estimation and the method of moments.

Maximum Likelihood Estimation.

In maximum likelihood estimation (MLE), the goal is to select the distribution in $\mathcal{P}$ that maximizes the likelihood of generating the data. For ease of discussion, let us assume that there is a base measure on the instance space, and that $\hat{q}(x)$ and $p(x)$ are the density functions of $\hat{q}$ and $p$ respectively at $x$ with respect to this base measure. The goal of maximum likelihood estimation is to find:

$$\hat{p} \in \operatorname*{argmax}_{p \in \mathcal{P}} \mathbb{E}_{X \sim \hat{q}}[\log p(X)].$$

Since $\hat{q}$ is fixed, this is equivalent to finding the minimizer of:

$$\mathbb{E}_{X \sim \hat{q}}\!\left[\log \frac{\hat{q}(X)}{p(X)}\right] = \mathrm{KL}(\hat{q} \,\|\, p).$$

Thus the objective function in (1) for MLE is the KL-divergence, $\Delta(\hat{q}, p) = \mathrm{KL}(\hat{q} \,\|\, p)$.
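
As a quick numerical sanity check of this equivalence (our illustration, not part of the paper), the following sketch fits a Bernoulli model by grid search and confirms that the maximizer of the average log-likelihood and the minimizer of $\mathrm{KL}(\hat{q} \,\|\, p)$ coincide; the model class, grid, and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=500)           # input samples
q_hat = np.array([1 - x.mean(), x.mean()])   # empirical distribution over {0, 1}

grid = np.linspace(0.01, 0.99, 981)          # candidate Bernoulli parameters
log_lik = q_hat[1] * np.log(grid) + q_hat[0] * np.log(1 - grid)
kl = q_hat[1] * np.log(q_hat[1] / grid) + q_hat[0] * np.log(q_hat[0] / (1 - grid))

# KL(q_hat || p) equals a constant minus the average log-likelihood, so the optimizers agree.
assert np.argmax(log_lik) == np.argmin(kl)
print("MLE / KL minimizer:", grid[np.argmax(log_lik)], " empirical frequency:", x.mean())
```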

Method of Moments.

An alternative method for statistical inference, which dates back to Chebyshev, and has recently seen renewed interest, is the method of moments. In the generalized method of moments (GMM) [hansen1982large], in addition to the data and the distribution class $\mathcal{P}$, we are given a set $\phi = (\phi_1, \ldots, \phi_m)$ of relevant feature functions over the instance space. The goal is to find the minimizer:

$$\hat{p} \in \operatorname*{argmin}_{p \in \mathcal{P}} \left\| \mathbb{E}_{X \sim \hat{q}}[\phi(X)] - \mathbb{E}_{X \sim p}[\phi(X)] \right\|.$$

Thus, for GMM, the objective function in (1) is $\Delta(\hat{q}, p) = \left\| \mathbb{E}_{X \sim \hat{q}}[\phi(X)] - \mathbb{E}_{X \sim p}[\phi(X)] \right\|$.
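
For concreteness, here is a minimal method-of-moments sketch under assumed choices (a Gamma model class and the first two raw moments as features); none of this is taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=2.0, size=5000)   # input samples

# Match E[X] = k * theta and Var[X] = k * theta^2 to the empirical moments.
m1, v = x.mean(), x.var()
k_hat = m1 ** 2 / v       # shape estimate
theta_hat = v / m1        # scale estimate

print(f"moment estimates: k = {k_hat:.2f}, theta = {theta_hat:.2f}")   # roughly (3, 2)
```

In this toy case the two moment equations can be solved in closed form, so no explicit minimization of the GMM objective is needed.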

Our goal is to understand how the solutions provided by GANs relate to these two standard ways of doing inference.

2.2 f-Divergences and f-GANs

For the rest of the paper, we assume that we have an underlying probability space $\Omega$; all distributions we consider below are measures over this space.

Definition 1 (f-divergence, [ali1966general, csiszar1967information]).

Suppose $f : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ is a lower semi-continuous convex function such that $f(1) = 0$, $f$ is finite in some neighbourhood of $1$, and $f(u) = +\infty$ for any $u < 0$. Let $Q$ and $P$ be probability measures over $\Omega$, where $Q$ is absolutely continuous with respect to $P$. Then, the f-divergence of $Q$ from $P$ is defined as:

$$D_f(Q \,\|\, P) = \int_{\Omega} f\!\left(\frac{dQ}{dP}\right) dP. \qquad (2)$$

Let $f^*$ be the convex conjugate function of $f$, given by $f^*(t) = \sup_{u \in \mathbb{R}} \{ut - f(u)\}$; it is well-known that the f-divergences also have a variational formulation [keziou2003dual, nguyen2010estimating] under certain conditions:

$$D_f(Q \,\|\, P) = \sup_{T} \left\{ \mathbb{E}_{X \sim Q}[T(X)] - \mathbb{E}_{X \sim P}[f^*(T(X))] \right\}, \qquad (3)$$

where the supremum is taken over, informally speaking, all possible functions. More details will be discussed in Section 3.4.
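
The following Monte Carlo sketch (our illustration, with an assumed Gaussian example) makes the variational formulation concrete for the KL case, where $f(u) = u \log u$ and $f^*(t) = e^{t-1}$: every choice of discriminator $T$ yields a lower bound on $\mathrm{KL}(Q \,\|\, P)$, and the bound is tight at $T^*(x) = 1 + \log \frac{dQ}{dP}(x)$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_q, mu_p = 1.0, 0.0                          # Q = N(1, 1), P = N(0, 1)
kl_closed_form = 0.5 * (mu_q - mu_p) ** 2      # KL between unit-variance Gaussians

x_q = rng.normal(mu_q, 1.0, size=200_000)
x_p = rng.normal(mu_p, 1.0, size=200_000)

def log_ratio(x):                              # log dQ/dP for the two Gaussians above
    return 0.5 * ((x - mu_p) ** 2 - (x - mu_q) ** 2)

def lower_bound(T):                            # E_Q[T(X)] - E_P[exp(T(X) - 1)]
    return np.mean(T(x_q)) - np.mean(np.exp(T(x_p) - 1.0))

T_opt = lambda x: 1.0 + log_ratio(x)           # optimal discriminator
T_sub = lambda x: 0.5 * x                      # an arbitrary suboptimal discriminator

print("closed-form KL:       ", kl_closed_form)                   # 0.5
print("bound at optimal T:   ", round(lower_bound(T_opt), 3))     # close to 0.5
print("bound at suboptimal T:", round(lower_bound(T_sub), 3))     # strictly smaller
```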

Inspired by this variational formulation, [Nowozin2016f] introduces a family of GANs, called f-GANs, that use an f-divergence as the objective function in (1). Inference is then formulated as solving the following minimax problem:

$$\inf_{p \in \mathcal{P}} \sup_{T \in \mathcal{T}} \left\{ \mathbb{E}_{X \sim \hat{q}}[T(X)] - \mathbb{E}_{X \sim p}[f^*(T(X))] \right\}, \qquad (4)$$

where $\mathcal{T}$ is a sufficiently large function class.

The standard GAN [goodfellow2014generative] is a special case of (4), obtained with a particular choice of $f$ that corresponds to the Jensen-Shannon divergence.
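
For completeness, the display below restates the standard GAN objective and its connection to the Jensen-Shannon divergence; this is a well-known fact from [goodfellow2014generative] rather than anything specific to this paper, and the symbols $q$ and $p_G$ are our notation for the data and generator distributions.

```latex
% Original GAN minimax objective [goodfellow2014generative]:
\[
  \min_{G}\,\max_{D}\;\; \mathbb{E}_{x \sim q}\bigl[\log D(x)\bigr]
    + \mathbb{E}_{x \sim p_G}\bigl[\log\bigl(1 - D(x)\bigr)\bigr],
\]
% where q is the data distribution and p_G is the distribution induced by the generator G.
% For a fixed generator, the optimal discriminator is D^*(x) = q(x) / (q(x) + p_G(x)),
% and substituting it back gives
\[
  2\,\mathrm{JSD}(q \,\|\, p_G) - \log 4,
\]
% so the outer minimization over G drives down the Jensen-Shannon divergence.
```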

2.3 Restricted f-divergences and Restricted f-GANs

To reduce the sample requirement [arora2017generalization], one might want to restrict the discriminator class in (4) to be a relatively small function class $\mathcal{T}$. To this end, we define the restricted f-divergence (we note that [ruderman2012tighter] also uses the term "restricted f-divergence", but for a very different purpose):

$$D_{f,\mathcal{T}}(Q \,\|\, P) = \sup_{T \in \mathcal{T}} \left\{ \mathbb{E}_{X \sim Q}[T(X)] - \mathbb{E}_{X \sim P}[f^*(T(X))] \right\}. \qquad (5)$$

In practice, the discriminator class is often implemented by a neural network [Nowozin2016f]; therefore f-GANs are in fact restricted f-GANs that solve the following minimax problem:

$$\inf_{p \in \mathcal{P}} D_{f,\mathcal{T}}(\hat{q} \,\|\, p) = \inf_{p \in \mathcal{P}} \sup_{T \in \mathcal{T}} \left\{ \mathbb{E}_{X \sim \hat{q}}[T(X)] - \mathbb{E}_{X \sim p}[f^*(T(X))] \right\}. \qquad (6)$$

A special case of (5) is the linear f-divergence, introduced in [liu2017approximation]. Specifically, given a vector of feature functions $\phi = (\phi_1, \ldots, \phi_m)$ over the data domain, let $\Theta$ be a convex subset of $\mathbb{R}^m$, and define

$$D_{f,\Theta}(Q \,\|\, P) = \sup_{\theta \in \Theta} \left\{ \mathbb{E}_{X \sim Q}[\langle \theta, \phi(X) \rangle] - \mathbb{E}_{X \sim P}[f^*(\langle \theta, \phi(X) \rangle)] \right\}.$$

f-GANs that solve $\inf_{p \in \mathcal{P}} D_{f,\Theta}(\hat{q} \,\|\, p)$ are called linear f-GANs.
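
As an illustration of the inner (discriminator) problem of a linear KL-GAN, the sketch below estimates the restricted divergence between two fixed sample sets by gradient ascent over the linear coefficients theta. The feature map phi(x) = (x, x^2), the Gaussian samples, the step size, and the iteration count are assumptions made purely for this example, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(4)
x_q = rng.normal(1.0, 1.0, size=5000)    # samples from the data distribution q_hat
x_p = rng.normal(0.0, 1.0, size=5000)    # samples from one candidate generator distribution p

def phi(x):                              # assumed feature map: first two raw moments
    return np.stack([x, x ** 2], axis=1)

f_q, f_p = phi(x_q), phi(x_p)
theta = np.zeros(2)
for _ in range(5000):
    # Ascend  E_q[<theta, phi(X)>] - E_p[exp(<theta, phi(X)> - 1)],  which is concave in theta.
    weights = np.exp(f_p @ theta - 1.0)[:, None]
    grad = f_q.mean(axis=0) - (weights * f_p).mean(axis=0)
    theta += 0.01 * grad

restricted_kl = (f_q @ theta).mean() - np.exp(f_p @ theta - 1.0).mean()
print("restricted (linear) KL estimate:", round(restricted_kl, 3))
# For comparison, the unrestricted KL(N(1,1) || N(0,1)) is 0.5; the restricted value cannot exceed it.
```

A full linear KL-GAN would wrap this inner maximization inside an outer minimization over the generator class; here we only evaluate the restricted divergence for a single fixed candidate p.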

3 Main Result

We begin by stating our main result in its most general form.

3.1 Additional Notations

We start out by introducing some notation. Recall that we have an underlying probability space $\Omega$. Let $B(\Omega)$ denote the set of all real-valued bounded and measurable functions on $\Omega$, equipped with the topology induced by the uniform norm.

We use $\mathrm{ba}(\Omega)$ to denote the set of all bounded and finitely additive signed measures over $\Omega$, $\mathrm{ba}_1(\Omega)$ to denote the set of all finitely additive probability measures over $\Omega$, and $\mathrm{ca}_1(\Omega)$ to denote the set of all countably additive probability measures over $\Omega$. Note that $\mathrm{ca}_1(\Omega) \subseteq \mathrm{ba}_1(\Omega) \subseteq \mathrm{ba}(\Omega)$.

For any $\mu, \nu \in \mathrm{ba}(\Omega)$, we write $\mu \ll \nu$ to denote that $\mu$ is absolutely continuous w.r.t. $\nu$; that is, for any measurable $A$ with $\nu(A) = 0$, $\mu(A) = 0$. Furthermore, if both $\mu$ and $\nu$ are countably additive, we use $\frac{d\mu}{d\nu}$ to denote the Radon-Nikodym derivative.

We extend definition (2) to allow one of its arguments to be a finitely additive probability measure that is not necessarily absolutely continuous w.r.t. the other. Formally, for any such pair of measures, define

(7)
(8)

where the equality (a) is justified by Theorem 8 in Section 3.4, which is a rigorous version of (3).

3.2 General Result

We begin with a slight generalization of the definition of restricted f-divergences in (5). Letting a functional $R : B(\Omega) \to \mathbb{R} \cup \{+\infty\}$ play the role of a regularizer, we can define

$$D_{f,R}(Q \,\|\, P) = \sup_{T \in B(\Omega)} \left\{ \mathbb{E}_{X \sim Q}[T(X)] - \mathbb{E}_{X \sim P}[f^*(T(X))] - R(T) \right\}. \qquad (9)$$

To see why (9) is a more general definition than (5), take any discriminator class $\mathcal{T} \subseteq B(\Omega)$ and define $R_{\mathcal{T}}$ to be the indicator of $\mathcal{T}$, that is, $R_{\mathcal{T}}(T) = 0$ if $T \in \mathcal{T}$ and $R_{\mathcal{T}}(T) = +\infty$ otherwise; then we have $D_{f,R_{\mathcal{T}}} = D_{f,\mathcal{T}}$.

An important property of the regularizer $R$ is shift-invariance.

Definition 2 (shift invariant).

A functional $R : B(\Omega) \to \mathbb{R} \cup \{+\infty\}$ is said to be shift invariant if for any $T \in B(\Omega)$ and any constant $c \in \mathbb{R}$, $R(T + c) = R(T)$.
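
Two illustrative examples of shift-invariant regularizers (ours, not taken from the paper): if a function class $\mathcal{T} \subseteq B(\Omega)$ is closed under adding constants, then the indicator regularizer $R_{\mathcal{T}}$ defined above is shift invariant, since $T + c \in \mathcal{T}$ exactly when $T \in \mathcal{T}$; likewise, the oscillation $R(T) = \sup_{\omega} T(\omega) - \inf_{\omega} T(\omega)$ is shift invariant, because adding a constant shifts the supremum and the infimum by the same amount.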

We are also interested in the convex conjugate of $R$, denoted by $R^*$. According to Theorem 9 in the appendix, the functional $R^*$, although by definition defined on the topological dual $B(\Omega)^*$, can be equivalently defined on $\mathrm{ba}(\Omega)$, so that for every $\mu \in \mathrm{ba}(\Omega)$,

$$R^*(\mu) = \sup_{T \in B(\Omega)} \left\{ \int_{\Omega} T \, d\mu - R(T) \right\}.$$

We are now ready for our main result.

Theorem 3.

If is convex and shift invariant, , and , then

We remark here that when has finite support and takes value outside an RKHS, Theorem 3 essentially reduces to Theorem 2 in [ruderman2012tighter]; in this special case, the proof can be greatly simplified.

Returning to the special case of , recall that in this case

and note that

we have the following corollary.

Corollary 4.

If is a convex subset of and for any , , we have , then for any and ,

We would like to point out that if we take to be in Corollary 4, then

and we recover Theorem 8.

3.3 Implication for Linear f-GANs

Finally, because of its importance, it is worth emphasizing the special case of linear f-GANs. Recall that linear f-GANs minimize the objective $\inf_{p \in \mathcal{P}} D_{f,\Theta}(\hat{q} \,\|\, p)$, where the discriminators are the linear functions $T_\theta(x) = \langle \theta, \phi(x) \rangle$ and $\Theta$ is a convex subset of $\mathbb{R}^m$. In this case, take the discriminator class in Corollary 4 to be $\{T_\theta : \theta \in \Theta\}$; then

(10)

In particular, if , then in (10) using the Cauchy-Schwarz inequality we have

Remark 5.

We would like to point out that more generally, for any such that , if , using Hölder’s inequality, we have

But to keep the discussion concise, we state the results with .

As a consequence, we have the following corollary.

Corollary 6.

If where is a positive real number, then for any and ,

Observe that when and , is the extended KL-divergence, which we denote as . Specifically,

Therefore, we have that when and ,

Contrasting with the maximum likelihood and method of moments estimators, the linear KL-GAN is an interesting combination of both when there is model mismatch. Table 1 provides a summary of the differences.

MLE GMM Linear KL-GAN
Table 1: Linear KL-GAN combines MLE and GMM

It is also possible to consider the case where , which means . In this case

This will result in the following corollary.

Corollary 7.

If , then for any and ,

3.4 Variational Representation of f-divergences

In this section, we will explain why equality (a) in (8) holds. The following theorem, which is complementary to Theorem 2.1 in [keziou2003dual] and Lemma 1 in [nguyen2010estimating], gives a rigorous variational representation of the f-divergence. (We would like to note two things here. First, the "only if" part of Lemma 1 in [nguyen2010estimating] is unproved, and does not hold; therefore our result does not contradict theirs. Second, while both [keziou2003dual] and [nguyen2010estimating] mention that the supremum can be attained at an element of the sub-differential of $f$ at the density ratio, this sub-differential may not exist (especially when can take value ), and even if it is well-defined everywhere needed, it is possible that the sub-differential is not bounded, hence not in $B(\Omega)$; therefore, their results do not imply ours.)

Theorem 8.

For any probability measures $Q$ and $P$ over $\Omega$ such that $Q$ is absolutely continuous with respect to $P$,

$$D_f(Q \,\|\, P) = \sup_{T \in B(\Omega)} \left\{ \mathbb{E}_{X \sim Q}[T(X)] - \mathbb{E}_{X \sim P}[f^*(T(X))] \right\}. \qquad (11)$$

4 Related Work

As a novel method for statistical inference, generative adversarial networks [goodfellow2014generative] have sparked a great deal of follow-up work on both theoretical and empirical sides.

The works most relevant to ours are [goodfellow2014generative, Nowozin2016f] and [liu2017approximation]. [goodfellow2014generative] shows that when both generators and discriminators are unrestricted, the optimal GAN solution converges to the input data distribution. [Nowozin2016f] introduces f-GANs – given samples from an unknown data distribution, the objective is to find a distribution $p \in \mathcal{P}$ that minimizes $D_f(\hat{q} \,\|\, p)$, where $D_f$ is an f-divergence. They show that minimizing this objective is equivalent to a GAN in which the discriminators are unrestricted and the objective corresponds to the variational form of the relevant f-divergence.

[liu2017approximation] considers approximation properties of GANs when the discriminators are restricted, but the input distribution lies in the interior of the class of distributions that can be produced by the generators – in short, there is no model mismatch. They show that in this case, the solution $\hat{p}$ produced by linear f-GANs – that is, f-GANs whose discriminators are linear over a pre-specified feature space – has the property that $\mathbb{E}_{X \sim \hat{p}}[\phi(X)] = \mathbb{E}_{X \sim \hat{q}}[\phi(X)]$. In other words, the optimal solution agrees with the generalized method of moments solution. Our work can be thought of as an extension of this work to the model mismatch case. [nock2017f] provides an information-geometric characterization of f-GANs when the input and the generator belong to a class of distributions called the deformed exponential family.

On the theoretical side, [arora2017generalization, singh2018nonparametric, liang2017well, bai2018approximability, feizi2017understanding] consider finite sample issues in GANs under different objective functions in various parametric and non-parametric settings, and provide bounds on their sample requirements. [biau2018some] provides asymptotic convergence bounds on GAN solutions when both generators and discriminators are unrestricted. [bottou2018geometrical] provides an analysis of the geometry of different GAN objective functions, with a view towards explaining their relative performance.

Finally, there has also been much recent work on the theoretical analysis of the optimization challenges that arise in the inference process of GANs; some examples include [heusel2017gans, nagarajan2017gradient, li2017towards, mescheder2017numerics, barnett2018convergence].

5 Conclusion

In conclusion, we provide a theoretical characterization of the distribution induced by the optimal generator in generative adversarial learning. Unlike prior work [goodfellow2014generative, liu2017approximation], our result applies when both the generator and the discriminator are restricted. When applied to linear -GANs, our characterization shows that the optimal linear KL-GAN solution offers an interesting mix of maximum likelihood and the method of moments.

Our work assumes that a sufficient number of samples is always available and that the optimal solution is always attainable. We believe removing these assumptions is an important avenue for future work.

Acknowledgments.

We thank the NSF (IIS 1617157) and the ONR (N00014-16-1-261) for research support.

Appendix A Preliminaries for the Proofs

For any $A \subseteq \Omega$, denote by $\mathbf{1}_A$ the indicator function that takes value $1$ over $A$ and $0$ everywhere else. We will sometimes use constants to represent constant functions. For any two real-valued functions $g$ and $h$ defined over the same domain, we write $g \le h$ if $g(x) \le h(x)$ for any $x$ in the domain. For any topological vector space $V$, we denote by $V^*$ the topological dual of $V$, which is the set of all continuous linear functions over $V$.

Theorem 9 (dual of $B(\Omega)$, [hildebrandt1934bounded]).

$B(\Omega)^*$ can be identified with $\mathrm{ba}(\Omega)$ by defining, for any $\mu \in \mathrm{ba}(\Omega)$ and any $T \in B(\Omega)$, $\mu(T) = \int_{\Omega} T \, d\mu$.

Definition 10 (general convex conjugacy [rockafellar1968integrals]).

Let $(V, W)$ be a pair of real vector spaces, $\langle \cdot, \cdot \rangle$ be a real bilinear function of $V$ and $W$, and $g : V \to \mathbb{R} \cup \{+\infty\}$ be a proper convex function; then we can define on $W$ the conjugate of $g$, denoted by $g^*$, as

$$g^*(w) = \sup_{v \in V} \{ \langle v, w \rangle - g(v) \},$$

and define on $V$ the conjugate of $g^*$, denoted by $g^{**}$, as

$$g^{**}(v) = \sup_{w \in W} \{ \langle v, w \rangle - g^*(w) \}.$$

If only $g$ is specified, then it is assumed that $V$ is the domain of $g$ and $W$ is $V^*$, and the bilinear function is given by $\langle v, w \rangle = w(v)$ for $v \in V$ and $w \in W$.
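
As a concrete instance of this definition (an illustrative computation, not taken from the paper), the scalar conjugate pair that appears in the KL case can be worked out directly:

```latex
% For f(u) = u log u (with f(u) = +infty for u < 0 and f(0) = 0), pair u and t by <u, t> = ut.
% Then
\[
  f^*(t) = \sup_{u \ge 0} \{\, u t - u \log u \,\}.
\]
% Setting the derivative t - \log u - 1 to zero gives u = e^{t-1}, hence
\[
  f^*(t) = t e^{t-1} - e^{t-1}(t - 1) = e^{t-1},
\]
% and conjugating once more recovers f, as Theorem 11 below guarantees:
\[
  f^{**}(u) = \sup_{t \in \mathbb{R}} \{\, u t - e^{t-1} \,\} = u \log u \quad (u > 0).
\]
```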

Theorem 11 (Fenchel-Moreau, [zalinescu2002convex] Theorem 2.3.3).

If $V$ is a Hausdorff locally convex space, and $g$ is a proper lower semi-continuous convex function on $V$, then $g^{**} = g$.

Fact 12.

$\mathbb{R}$ with the usual topology is a Hausdorff locally convex space.

Proof.

The usual topology on $\mathbb{R}$ can be induced by the usual norm on $\mathbb{R}$, and a normed space is a Hausdorff locally convex space. ∎

Fact 13.

$B(\Omega)$ is a Hausdorff locally convex space.

Proof.

The topology on $B(\Omega)$ is induced from the uniform norm, and a normed space is a Hausdorff locally convex space. ∎

Fact 14.

$f$ is a proper lower semi-continuous convex function.

Proof.

Recall that by assumption $f$ is a lower semi-continuous convex function. To see that $f$ is also proper, note that by assumption $f$ is finite in some neighbourhood of $1$; therefore $f$ is a lower semi-continuous convex function that takes a finite value at some point, hence also a proper function. ∎

Fact 15.

$f^*$ is a proper lower semi-continuous convex function.

Proof.

Note that $f^*$ is the convex conjugate of $f$, and the weak* topology on $\mathbb{R}$ is the same as the usual topology. Therefore $f^*$ is a lower semi-continuous convex function ([zalinescu2002convex] Theorem 2.3.1); $f^*$ is proper because $f$ is proper. ∎

Fact 16.

$f^{**} = f$.

Proof.

According to Fact 12 and Fact 14, $f$ is a proper lower semi-continuous convex function on a Hausdorff locally convex space; therefore by Theorem 11, we have $f^{**} = f$. ∎

Fact 17.

$f^*$ is non-decreasing.

Proof.

This is because by assumption for any , and then by Fact 16, for any . This means is non-decreasing. ∎

Fact 18.

.

Proof.

This is because by assumption , and then by Fact 16, . Therefore by the definition of , we have

We will need the following definition and result from [rockafellar1968integrals], which we note to be simplified because in our case the underlying measure is a probability measure (instead of a σ-finite measure in their case) and we only consider real-valued functions (instead of vector-valued functions in their case).

Definition 19 (decomposable , [rockafellar1968integrals] simplified).

We say a set of real-valued measurable functions over is decomposable if

  • ;

  • for any and , .

Theorem 20 ([rockafellar1968integrals], corollary of Theorem 2, simplified).

Let . Suppose and are decomposable and for any and the function is integrable w.r.t. , is a lower semi-continuous proper convex function, then for any

Appendix B Proof of Theorem 8

Note that by the Radon-Nikodym theorem ([folland1999real] Theorem 3.8), for each bounded and countably additive signed measure on that is absolutely continuous w.r.t. , there is an element in , denoted by (the Radon-Nikodym derivative), such that

(12)

Therefore,

Here (a) is from (12). To see why (b) holds, note that and are decomposable spaces (as defined in Definition 19) such that for any and , the function is integrable w.r.t. , is a lower semi-continuous proper convex function by Fact 15, and by Fact 16; therefore we can apply Theorem 20 and get the equality.

Appendix C Proof of Theorem 3

Define the functional to be

(13)

We first show some properties related to .

Lemma 21.

If and , then for any , .

Proof.

Observe that

where (a) and (b) are because . ∎

Lemma 22.

The function defined on is lower semi-continuous.

Proof.

Note that is a decomposable space (as defined in Definition 19), and for any , the function is integrable w.r.t. , and is a lower semi-continuous proper convex function by Fact 14; then according to Theorem 20,

(14)

Note that for each , the function defined on is a continuous linear function, therefore the r.h.s. of (14) is the supremum of linear continuous functions, hence a lower semi-continuous function. ∎

Lemma 23.

For any , , sequence in , sequence in , sequence in , if , , and for every , then has a convergent subsequence whose limit point satisfies

Proof.

We first prove that the sequence is bounded. We prove this by contradiction. Suppose it is not bounded; since and , we have that for any , there exists such that

(15)

However, by assumption is finite in a neighbourhood of ; along with Fact 16, this implies that as . Therefore we have

Because and , we have that is bounded for , therefore

which contradicts the assumption that for every because .

Now by the Bolzano-Weierstrass theorem, the bounded sequence has a convergent subsequence , whose limit point we denote by . Let be any positive real number; we will show that

Because by assumption and and for every , we have that for large enough

Lemma 22 says that the function is lower semi-continuous; since , this implies that

Because we can choose to be arbitrarily small, we can conclude that

Lemma 24.

The infimum in the definition of as in (13) can be attained.

Proof.

Let . We need to show that there exists such that and

By the definition of infimum, there exists a sequence in and a sequence in such that

Applying Lemma 23, where , , and , we have that there exists a subsequence of whose limit point satisfies

Therefore the infimum in the definition of is attained at .

Lemma 25.

The functional has the following properties

  1. is lower semi-continuous,

  2. is convex,

  3. For any , if , then .

  4. If is a constant function, then .

Proof.

We will prove (1)-(4) separately:

Proof of (1).

We need to show that for any , , and any sequence in such that and for any , we have that

(16)

Lemma 24 guarantees that there exists a sequence in and a sequence in such that for any ,

Applying Lemma 23, where , , we have that there exists such that

which will imply (16).

Proof of (2).

For any , , , we need to show that

(17)

If either (a) or (b) is infinite, then (17) is trivially true; therefore we assume both of them to be finite. In this case, for any , there exists and such that

(18)

and

(19)

We can see that