Geometrical Insights for Implicit Generative Modeling

12/21/2017 · Leon Bottou et al.

Learning algorithms for implicit generative models can optimize a variety of criteria that measure how the data distribution differs from the implicit model distribution, including the Wasserstein distance, the Energy distance, and the Maximum Mean Discrepancy criterion. A careful look at the geometries induced by these distances on the space of probability measures reveals interesting differences. In particular, we can establish surprising approximate global convergence guarantees for the 1-Wasserstein distance, even when the parametric generator has a nonconvex parametrization.

1 Introduction

Instead of representing the model distribution with a parametric density function, implicit generative models directly describe how to draw samples of the model distribution by first drawing a sample $z$ from a fixed random source and mapping it into the data space $\mathcal{X}$ with a parametrized generator function $G_\theta(z)$. The reparametrization trick [13, 42], Variational Auto-Encoders (VAEs) [27], and Generative Adversarial Networks (GANs) [20] are recent instances of this approach.
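To make this concrete, here is a minimal numerical sketch of an implicit generator (the two-layer network and all names are illustrative assumptions, not constructs from the paper): a fixed Gaussian source is pushed through a parametrized map $G_\theta$ to produce samples in the data space.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, theta):
    """An illustrative G_theta: the approach only requires a measurable,
    parametrized map from the latent space into the data space."""
    W1, b1, W2, b2 = theta
    h = np.tanh(z @ W1 + b1)      # hidden layer
    return h @ W2 + b2            # points in the sample space X

latent_dim, hidden_dim, data_dim = 2, 16, 3
theta = (rng.normal(size=(latent_dim, hidden_dim)), np.zeros(hidden_dim),
         rng.normal(size=(hidden_dim, data_dim)), np.zeros(data_dim))

z = rng.normal(size=(1000, latent_dim))   # samples from the fixed source
x = generator(z, theta)                   # samples from the push-forward distribution
```

Note that the generated points lie on a two-dimensional surface embedded in $\mathbb{R}^3$: the model distribution has no density in the ambient space, which is precisely the situation discussed below.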

Many of these authors motivate implicit modeling with the computational advantage of using the efficient back-propagation algorithm to update the generator parameters. In contrast, our work targets another, more fundamental, advantage of implicit modeling.

Although unsupervised learning is often formalized as estimating the data distribution $Q$ [24, §14.1], the practical goal of the learning process rarely consists in recovering actual probabilities. Instead, the probability models are often structured in a manner that is interpretable as a physical or causal model of the data. This is often achieved by defining an interpretable density for well chosen latent variables and letting the appearance model take the slack. This approach is well illustrated by the inverse graphics approach to computer vision [31, 30, 43]. Implicit modeling makes this much simpler:

  • The structure of the generator function $G_\theta$ can be directly interpreted as a set of equations describing a physical or causal model of the data [28].

  • There is no need to deal with latent variables, since all the variables of interest are explicitly computed by the generator function.

  • Implicit modeling can easily represent simple phenomena involving a small set of observed or inferred variables. The corresponding model distribution cannot be represented with a density function because it is supported by a low-dimensional manifold. But nothing prevents an implicit model from generating such samples.

Unfortunately, we cannot fully realize these benefits using the popular Maximum Likelihood Estimation (MLE) approach, which asymptotically amounts to minimizing the Kullback-Leibler (KL) divergence between the data distribution $Q$ and the model distribution $P_\theta$,

$$ D(Q\,\|\,P_\theta) \;=\; \int \log\!\left(\frac{q(x)}{p_\theta(x)}\right) q(x)\, d\mu(x)\,, \tag{1}$$

where $q$ and $p_\theta$ are the density functions of $Q$ and $P_\theta$ with respect to a common measure $\mu$. This criterion is particularly convenient because it enjoys favorable statistical properties [14] and because its optimization can be written as an expectation with respect to the data distribution,

$$ \min_\theta\, D(Q\,\|\,P_\theta) \;=\; \min_\theta\, \mathbb{E}_{x\sim Q}\big[ -\log p_\theta(x) \big] + \mathrm{const}\,, $$

which is readily amenable to computationally attractive stochastic optimization procedures [10]. This criterion nevertheless breaks down in the implicit modeling setting. First, the expression (1) is ill-defined when the model distribution cannot be represented by a density. Second, if the likelihood of a single example $x_i$ is zero, the dataset likelihood is also zero, and there is nothing to maximize. The typical remedy is to add a noise term to the model distribution. Virtually all generative models described in the classical machine learning literature include such a noise component whose purpose is not to model anything useful, but merely to make MLE work.

Instead of using ad hoc noise terms to coerce MLE into optimizing a different similarity criterion between the data distribution and the model distribution, we might as well explicitly optimize a different criterion. It is therefore crucial to understand how the selection of a particular criterion will influence the learning process and its final result.

Section 2 reviews known results establishing how many interesting distribution comparison criteria can be expressed in adversarial form, and are amenable to tractable optimization algorithms. Section 3 reviews the statistical properties of two interesting families of distribution distances, namely the family of the Wasserstein distances and the family containing the Energy Distances and the Maximum Mean Discrepancies. Although the Wasserstein distances have far worse statistical properties, experimental evidence shows that they can deliver better performance in meaningful application setups. Section 4 reviews essential concepts about geodesic geometry in metric spaces. Section 5 shows how different probability distances induce different geodesic geometries in the space of probability measures. Section 6 leverages these geodesic structures to define various flavors of convexity for parametric families of generative models, which can be used to prove that a simple gradient descent algorithm will either reach or approach the global minimum regardless of the traditional nonconvexity of the parametrization of the model family. In particular, when one uses implicit generative models, minimizing the Wasserstein distance with a gradient descent algorithm offers much better guarantees than minimizing the Energy distance.

2 The adversarial formulation

The adversarial training framework popularized by the Generative Adversarial Networks (GANs) [20] can be used to minimize a great variety of probability comparison criteria. Although some of these criteria can also be optimized using simpler algorithms, adversarial training provides a common template that we can use to compare the criteria themselves.

This section presents the adversarial training framework and reviews the main categories of probability comparison criteria it supports, namely Integral Probability Metrics (IPM) (Section 2.4), f-divergences (Section 2.5), Wasserstein distances (WD) (Section 2.6), and Energy Distances (ED) or Maximum Mean Discrepancy distances (MMD) (Section 2.7).

2.1 Setup

Although it is intuitively useful to consider that the sample space $\mathcal{X}$ is some convex subset of $\mathbb{R}^d$, it is also useful to spell out more precisely which properties are essential to the development. In the following, we assume that $\mathcal{X}$ is a Polish metric space, that is, a complete and separable space whose topology is defined by a distance function

$$ d:\ (x,y)\in\mathcal{X}\times\mathcal{X}\ \mapsto\ d(x,y)\in\mathbb{R}^+\cup\{\infty\} $$

satisfying the properties of a metric distance:

$$ \forall x,y,z\in\mathcal{X}\qquad\left\{\begin{array}{ll}
d(x,x)=0 & \text{(zero)}\\
x\neq y \ \Rightarrow\ d(x,y)>0 & \text{(separation)}\\
d(x,y)=d(y,x) & \text{(symmetry)}\\
d(x,z)\le d(x,y)+d(y,z) & \text{(triangular inequality)}
\end{array}\right. \tag{2}$$

Let $\mathfrak{U}$ be the Borel $\sigma$-algebra generated by all the open sets of $\mathcal{X}$. We use the notation $\mathcal{P}_{\mathcal{X}}$ for the set of probability measures defined on $(\mathcal{X},\mathfrak{U})$, and the notation $\mathcal{P}_{\mathcal{X}}^p \subset \mathcal{P}_{\mathcal{X}}$ for those satisfying $\mathbb{E}_{x,y\sim Q}\,[\,d(x,y)^p\,]<\infty$. This condition is equivalent to $\mathbb{E}_{x\sim Q}\,[\,d(x_0,x)^p\,]<\infty$ for an arbitrary origin $x_0$ when $d$ is finite, symmetric, and satisfies the triangular inequality.

We are interested in criteria $D(Q\,\|\,P)$ to compare elements $Q, P$ of $\mathcal{P}_{\mathcal{X}}$.

Although it is desirable that $D$ also satisfies the properties of a distance (2), this is not always possible. In this contribution, we strive to only reserve the word distance for criteria that satisfy the properties (2) of a metric distance. We use the word pseudodistance when a nonnegative criterion fails to satisfy the separation property (2.separation). (Although failing to satisfy the separation property can have serious practical consequences, recall that a pseudodistance always becomes a full-fledged distance on the quotient space $\mathcal{X}/\mathcal{R}$, where $\mathcal{R}$ denotes the equivalence relation $d(x,y)=0$. All the theory applies as long as one never distinguishes two points separated by a zero distance.) We use the word divergence for criteria that are not symmetric (2.symmetry) or fail to satisfy the triangular inequality (2.triangular).

We generally assume in this contribution that the distance $d$ defined on $\mathcal{X}$ is finite. However we allow probability comparison criteria to be infinite. When the distributions $Q, P$ do not belong to the domain for which a particular criterion $D$ is defined, we take $D(Q\,\|\,P)=+\infty$ if $Q\neq P$ and $D(Q\,\|\,P)=0$ otherwise.

2.2 Implicit modeling

We are particularly interested in model distributions that are supported by a low-dimensional manifold in a large ambient sample space (recall Section 1). Since such distributions do not typically have a density function, we cannot represent the model family using a parametric density function. Following the example of Variational Auto-Encoders (VAE) [27] and Generative Adversarial Networks (GAN) [20], we represent the model distributions by defining how to produce samples.

Let $z$ be a random variable with known distribution $\mu_z$ defined on a suitable probability space $\mathcal{Z}$ and let $G_\theta : z\in\mathcal{Z} \mapsto G_\theta(z)\in\mathcal{X}$ be a measurable function, called the generator, parametrized by $\theta$. The random variable $G_\theta(z)$ follows the push-forward distribution $G_\theta\#\mu_z$. (We use the notation $f\#\mu$, or $f(x)\#\mu(x)$ when the function is given as an expression, to denote the probability distribution obtained by applying the function $f$ or expression $f(x)$ to samples $x$ of the distribution $\mu$.) By varying the parameter $\theta$ of the generator $G_\theta$, we can change this push-forward distribution $P_\theta = G_\theta\#\mu_z$ and hopefully make it close to the data distribution $Q$ according to the criterion of interest.

This implicit modeling approach is useful in two ways. First, unlike densities, it can represent distributions confined to a low-dimensional manifold. Second, the ability to easily generate samples is frequently more useful than knowing the numerical value of the density function (for example in image superresolution or semantic segmentation when considering the conditional distribution of the output image given the input image). In general, it is computationally difficult to generate samples given an arbitrary high-dimensional density [37].

Learning algorithms for implicit models must therefore be formulated in terms of two sampling oracles. The first oracle returns training examples, that is, samples from the data distribution $Q$. The second oracle returns generated examples, that is, samples from the model distribution $P_\theta = G_\theta\#\mu_z$. This is particularly easy when the comparison criterion can be expressed in terms of expectations with respect to the distributions $Q$ or $P_\theta$.
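As a sketch of this oracle-based view (the two oracle functions and the toy location family are hypothetical stand-ins, not from the paper), any criterion written with expectations can be estimated by averaging over samples returned by the two oracles:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_data(n):
    """Oracle 1: training examples, i.e. samples from Q (here N(2,1))."""
    return 2.0 + rng.normal(size=(n, 1))

def sample_model(theta, n):
    """Oracle 2: generated examples, i.e. samples from the push-forward
    distribution of a toy location generator G_theta(z) = theta + z."""
    return theta + rng.normal(size=(n, 1))

def critic_gap(f, theta, n=10_000):
    """Monte Carlo estimate of E_Q[f(x)] - E_{P_theta}[f(x)]."""
    return f(sample_data(n)).mean() - f(sample_model(theta, n)).mean()

print(critic_gap(lambda x: x, theta=0.0))   # roughly 2.0, the gap between means
```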

2.3 Adversarial training

We are more specifically interested in distribution comparison criteria that can be expressed in the form

$$ \mathcal{D}(Q\,\|\,P) \;=\; \sup_{(f_Q,f_P)\in\mathcal{Q}}\ \mathbb{E}_{x\sim Q}\big[f_Q(x)\big] \;-\; \mathbb{E}_{x\sim P}\big[f_P(x)\big]\,. \tag{3}$$

The set $\mathcal{Q}$ defines which pairs $(f_Q, f_P)$ of real-valued critic functions defined on $\mathcal{X}$ are considered in this maximization. As discussed in the following subsections, different choices of $\mathcal{Q}$ lead to a broad variety of criteria. This formulation is a mild generalization of the Integral Probability Metrics (IPMs) [36] for which both functions $f_Q$ and $f_P$ are constrained to be equal (Section 2.4).

Finding the optimal generator parameter $\theta^*$ then amounts to minimizing a cost function $C(\theta)$ which itself is a supremum,

$$ \min_{\theta}\ C(\theta) \;=\; \mathcal{D}\big(Q \,\big\|\, G_\theta\#\mu_z\big) \;=\; \sup_{(f_Q,f_P)\in\mathcal{Q}}\ \mathbb{E}_{x\sim Q}\big[f_Q(x)\big] \;-\; \mathbb{E}_{z\sim\mu_z}\big[f_P(G_\theta(z))\big]\,. \tag{4}$$

Although it is sometimes possible to reformulate this cost function in a manner that does not involve a supremum (Section 2.7), many algorithms can be derived from the following variant of the envelope theorem [35].

Theorem 2.1.

Let $C(\theta)$ be the cost function defined in (4) and let $\theta_0$ be a specific value of the generator parameter. Under the following assumptions,

  • (a) there is $(f_Q^*, f_P^*)\in\mathcal{Q}$ such that $C(\theta_0) = \mathbb{E}_{x\sim Q}\big[f_Q^*(x)\big] - \mathbb{E}_{z\sim\mu_z}\big[f_P^*(G_{\theta_0}(z))\big]$,

  • (b) the function $C$ is differentiable in $\theta_0$,

  • (c) the functions $h_z : \theta\mapsto f_P^*(G_\theta(z))$ are $\mu_z$-almost surely differentiable in $\theta_0$,

  • (d) and there exists an open neighborhood $\mathcal{V}$ of $\theta_0$ and a $\mu_z$-integrable function $D(z)$ such that $\forall\theta\in\mathcal{V}$, $|h_z(\theta) - h_z(\theta_0)| \le D(z)\,\|\theta - \theta_0\|$,

we have the equality

$$ \nabla_\theta\, C(\theta_0) \;=\; -\,\mathbb{E}_{z\sim\mu_z}\big[\, \nabla_\theta\, h_z(\theta_0) \,\big]\,. $$

This result means that we can compute the gradient of $C(\theta_0)$ without taking into account the way $(f_Q^*, f_P^*)$ changes with $\theta$. The most important assumption here is the differentiability of the cost $C$. Without this assumption, we can only assert that $-\,\mathbb{E}_{z\sim\mu_z}[\nabla_\theta h_z(\theta_0)]$ belongs to the “local” subgradient

$$ \partial^{\mathrm{loc}}\, C(\theta_0) \;\triangleq\; \Big\{\, g \;:\; \forall u\,,\ \ u^\top g \;\le\; \liminf_{t\to 0^+}\, \tfrac{1}{t}\big(C(\theta_0+tu) - C(\theta_0)\big) \Big\}\,. $$

Proof   Let $t>0$ and $u$ be an arbitrary unit vector. From (3),

$$ C(\theta_0 + t u) \;\ge\; \mathbb{E}_{x\sim Q}\big[f_Q^*(x)\big] \,-\, \mathbb{E}_{z\sim\mu_z}\big[h_z(\theta_0 + t u)\big]\,, $$
$$ C(\theta_0 + t u) - C(\theta_0) \;\ge\; -\,\mathbb{E}_{z\sim\mu_z}\big[\, h_z(\theta_0 + t u) - h_z(\theta_0) \,\big]\,. $$

Dividing this last inequality by $t$, taking its limit when $t\to 0$, recalling that the dominated convergence theorem and assumption (d) allow us to take the limit inside the expectation operator, and rearranging the result gives

$$ u^\top\, \nabla_\theta C(\theta_0) \;\ge\; -\,u^\top\, \mathbb{E}_{z\sim\mu_z}\big[ \nabla_\theta h_z(\theta_0) \big]\,. $$

Writing the same for unit vector $-u$ yields the opposite inequality. Therefore $\nabla_\theta C(\theta_0) = -\,\mathbb{E}_{z\sim\mu_z}\big[\nabla_\theta h_z(\theta_0)\big]$.  ∎

Thanks to this result, we can compute an unbiased stochastic estimate $\hat g$ of the gradient $\nabla_\theta C(\theta_0)$ by first solving the maximization problem in (4), and then using the back-propagation algorithm to compute the average gradient on a minibatch $z_1\ldots z_k$ sampled from $\mu_z$,

$$ \hat g \;=\; -\,\frac{1}{k}\,\sum_{i=1}^{k}\, \nabla_\theta\, f_P^*\big(G_\theta(z_i)\big)\,. $$

(Stochastic gradient descent often relies on unbiased gradient estimates; for a more general condition, see [10, Assumption 4.3]. This is not a given: estimating the Wasserstein distance (14) and its gradients on small minibatches gives severely biased estimates [7]. This is in fact very obvious for minibatches of size one. Theorem 2.1 therefore provides an imperfect but useful alternative.) Such an unbiased estimate can then be used to perform a stochastic gradient descent update iteration on the generator parameter

$$ \theta \;\leftarrow\; \theta \,-\, \eta\,\hat g\,. $$
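The following sketch illustrates one such update, with PyTorch autograd standing in for back-propagation; the linear generator and fixed critic are illustrative placeholders, and we simply assume the critic already solves the inner maximization of (4) at the current parameter.

```python
import torch

torch.manual_seed(0)

theta = torch.randn(2, 3, requires_grad=True)   # generator parameters
critic = lambda x: x.sum(dim=1)                 # stand-in for the maximizer f_P*

def G(z, theta):
    return z @ theta                            # toy linear generator

z = torch.randn(64, 2)                          # minibatch z_1 ... z_k from mu_z
loss = -critic(G(z, theta)).mean()              # minus the generator term of (4)
loss.backward()                                 # back-propagation: theta.grad is hat{g}

with torch.no_grad():                           # one stochastic gradient descent step
    theta -= 0.01 * theta.grad
    theta.grad.zero_()
```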

Although this algorithmic idea can be made to work relatively reliably [3, 22], serious conceptual and practical issues remain:

Remark 2.2.

In order to obtain an unbiased gradient estimate $\hat g$, we need to solve the maximization problem in (4) for the true distributions rather than for a particular subset of examples. On the one hand, we can use the standard machine learning toolbox to avoid overfitting the maximization problem. On the other hand, this toolbox essentially works by restricting the family $\mathcal{Q}$ in ways that can change the meaning of the comparison criterion itself [5, 34].

Remark 2.3.

In practice, solving the maximization problem (4) during each iteration of the stochastic gradient algorithm is computationally too costly. Instead, practical algorithms interleave two kinds of stochastic iterations: gradient ascent steps on the critic pair $(f_Q, f_P)$, and gradient descent steps on the generator parameter $\theta$, with a much smaller effective stepsize. Such algorithms belong to the general class of stochastic algorithms with two time scales [9, 29]. Their convergence properties form a delicate topic, clearly beyond the scope of this contribution.
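The sketch below shows such an interleaved loop on a one-dimensional toy problem (data $Q=\mathcal{N}(2,1)$, a location generator $G_\theta(z)=\theta+z$, and a small parametric critic; all of these are illustrative assumptions). The critic parameters are updated with a larger stepsize than the generator parameter, mimicking the two time scales.

```python
import torch

torch.manual_seed(0)

theta = torch.tensor([0.0], requires_grad=True)      # generator parameter
phi = torch.randn(3, requires_grad=True)             # critic parameters

def critic(x, phi):
    return phi[0] * torch.tanh(phi[1] * x + phi[2])

opt_c = torch.optim.SGD([phi], lr=1e-2)              # fast time scale (ascent)
opt_g = torch.optim.SGD([theta], lr=1e-3)            # slow time scale (descent)

for step in range(2000):
    # One gradient ascent step on the critic.
    x, z = 2.0 + torch.randn(64, 1), torch.randn(64, 1)
    gap = critic(x, phi).mean() - critic(theta + z, phi).mean()
    opt_c.zero_grad()
    (-gap).backward()
    opt_c.step()

    # One gradient descent step on the generator, with a smaller stepsize.
    x, z = 2.0 + torch.randn(64, 1), torch.randn(64, 1)
    gap = critic(x, phi).mean() - critic(theta + z, phi).mean()
    opt_g.zero_grad()
    gap.backward()
    opt_g.step()

print(float(theta))   # theta has drifted toward the data mean 2.0
```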

2.4 Integral probability metrics

Integral probability metrics (IPMs) [36] have the form

$$ \mathcal{D}(Q\,\|\,P) \;=\; \sup_{f\in\mathcal{Q}}\ \Big|\ \mathbb{E}_{x\sim Q}\big[f(x)\big] \,-\, \mathbb{E}_{x\sim P}\big[f(x)\big]\ \Big|\,. $$

Note that the surrounding absolute value can be eliminated by requiring that $\mathcal{Q}$ also contains the opposite of every one of its functions:

$$ \mathcal{D}(Q\,\|\,P) \;=\; \sup_{f\in\mathcal{Q}}\ \mathbb{E}_{x\sim Q}\big[f(x)\big] \,-\, \mathbb{E}_{x\sim P}\big[f(x)\big] \qquad\text{with}\quad \mathcal{Q}=-\mathcal{Q}\,. \tag{5}$$

Therefore an IPM is a special case of (3) where the critic functions $f_Q$ and $f_P$ are constrained to be identical, and where $\mathcal{Q}$ is again constrained to contain the opposite of every critic function. Whereas expression (3) does not guarantee that $\mathcal{D}(Q\,\|\,P)$ is finite or satisfies the properties of a distance, an IPM is always a pseudodistance.

Proposition 2.4.

Any integral probability metric $\mathcal{D}$ (5) is a pseudodistance.

Proof   To establish the triangular inequality (2.triangular), we can write, for all $Q, P, R \in \mathcal{P}_{\mathcal{X}}$,

$$ \mathcal{D}(Q\,\|\,P) \;=\; \sup_{f\in\mathcal{Q}}\ \mathbb{E}_{x\sim Q}\big[f(x)\big] - \mathbb{E}_{x\sim R}\big[f(x)\big] + \mathbb{E}_{x\sim R}\big[f(x)\big] - \mathbb{E}_{x\sim P}\big[f(x)\big] \;\le\; \mathcal{D}(Q\,\|\,R) + \mathcal{D}(R\,\|\,P)\,. $$

The other properties of a pseudodistance are trivial consequences of (5).  ∎

The most fundamental IPM is the Total Variation (TV) distance,

$$ \mathrm{TV}(Q,P) \;=\; \sup_{A\in\mathfrak{U}}\ \big|\,Q(A) - P(A)\,\big| \;=\; \sup_{f\in C(\mathcal{X};[0,1])}\ \mathbb{E}_{x\sim Q}\big[f(x)\big] - \mathbb{E}_{x\sim P}\big[f(x)\big]\,, \tag{6}$$

where $C(\mathcal{X};[0,1])$ is the space of continuous functions from $\mathcal{X}$ to $[0,1]$.

2.5 f-Divergences

Many classical criteria belong to the family of f-divergences

$$ D_f(Q\,\|\,P) \;=\; \int\, p(x)\ f\!\left(\frac{q(x)}{p(x)}\right) d\mu(x)\,, \tag{7}$$

where $q$ and $p$ are respectively the densities of $Q$ and $P$ relative to measure $\mu$ and where $f$ is a continuous convex function defined on $\mathbb{R}^+$ such that $f(1)=0$.

Expression (7) trivially satisfies (2.zero). It is always nonnegative because we can pick a subderivative $u\in\partial f(1)$ and use the inequality $f(t)\ge u\,(t-1)$. This also shows that the separation property (2.separation) is satisfied when this inequality is strict for all $t\neq 1$.

Proposition 2.5 ([38, 39] (informal)).

Usually,

$$ D_f(Q\,\|\,P) \;=\; \sup_{g\ \mathrm{bounded,\ measurable}}\ \mathbb{E}_{x\sim Q}\big[g(x)\big] \;-\; \mathbb{E}_{x\sim P}\big[f^*(g(x))\big]\,, $$

where $f^*$ denotes the convex conjugate of f. (The equality holds under technical conditions on the supports of $Q$ and $P$; in practice, the result can be verified by elementary calculus for the usual choices of f, such as those shown in Table 1.)

Table 1 lists several f-divergences together with the function f and the corresponding conjugate function $f^*$ that appears in the variational formulation. In particular, as argued in [39], this analysis clarifies the probability comparison criteria associated with the early GAN variants [20].

                             $f(t)$                              $f^*(u)$              $\mathrm{dom}\,f^*$
Total variation (6)          $\frac{1}{2}|t-1|$                  $u$                   $[-\frac{1}{2},\frac{1}{2}]$
Kullback-Leibler (1)         $t\log t$                           $\exp(u-1)$           $\mathbb{R}$
Reverse Kullback-Leibler     $-\log t$                           $-1-\log(-u)$         $u<0$
GAN's Jensen Shannon [20]    $t\log t-(t{+}1)\log\frac{t+1}{2}$  $-\log(2-\exp(u))$    $u<\log 2$

Table 1: Various f-divergences and the corresponding $f$ and $f^*$.
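As a sanity check on definition (7) and Table 1, the discrete case reduces to a finite sum; the sketch below evaluates a few f-divergences between two distributions on three atoms (the function definitions follow the standard conventions of [39] and assume strictly positive probabilities).

```python
import numpy as np

def f_divergence(q, p, f):
    """D_f(Q||P) = sum_x p(x) f(q(x)/p(x)), the discrete case of (7)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(p * f(q / p)))

kl  = lambda t: t * np.log(t)            # Kullback-Leibler: f(t) = t log t
rkl = lambda t: -np.log(t)               # reverse Kullback-Leibler: f(t) = -log t
tv  = lambda t: 0.5 * np.abs(t - 1.0)    # total variation: f(t) = |t-1|/2

q, p = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
print(f_divergence(q, p, kl), f_divergence(q, p, rkl), f_divergence(q, p, tv))
```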

Despite the elegance of this framework, these comparison criteria are not very attractive when the distributions are supported by low-dimensional manifolds that may not overlap. The following simple example shows how this can be a problem [3].

Figure 1: Let distribution $U_\theta$ be supported by the segment $\{\theta\}\times[0,1]$ in $\mathbb{R}^2$. According to both the TV distance (6) and the f-divergences (7), the sequence of distributions $(U_{1/i})$ does not converge to $U_0$. However this sequence converges to $U_0$ according to either the Wasserstein distances (8) or the Energy distance (15).
Example 2.6.

Let $U$ be the uniform distribution on the real segment $[0,1]$ and consider the distributions $U_\theta \triangleq (\theta,x)\#U(x)$ defined on $\mathbb{R}^2$. Because $U_0$ and $U_\theta$ have disjoint support for $\theta\neq 0$, neither the total variation distance $\mathrm{TV}(U_0,U_\theta)$ (6) nor the f-divergence $D_f(U_0\,\|\,U_\theta)$ (7) depend on the exact value of $\theta$. Therefore, according to the topologies induced by these criteria on $\mathcal{P}_{\mathcal{X}}$, the sequence of distributions $(U_{1/i})$ does not converge to $U_0$ (Figure 1).

The fundamental problem here is that neither the total variation distance (6) nor the f-divergences (7) depend on the distance $d(x,y)$ defined on the sample space $\mathcal{X}$. The minimization of such a criterion appears more effective for adjusting the probability values than for matching the distribution supports.

2.6 Wasserstein distance

For any $p\ge 1$, the $p$-Wasserstein distance (WD) is the $p$-th root of

$$ W_p(Q,P)^p \;=\; \inf_{\pi\in\Pi(Q,P)}\ \mathbb{E}_{(x,y)\sim\pi}\big[\, d(x,y)^p \,\big]\,, \tag{8}$$

where $\Pi(Q,P)$ represents the set of all measures $\pi$ defined on $\mathcal{X}\times\mathcal{X}$ with marginal distributions respectively equal to $Q$ and $P$. Intuitively, $d(x,y)^p$ represents the cost of transporting a grain of probability from point $x$ to point $y$, and the joint distributions $\pi\in\Pi(Q,P)$ represent transport plans.

Since $\mathbb{E}[\,d(x,y)\,] \le \mathbb{E}[\,d(x,y)^p\,]^{1/p}$ by Jensen's inequality,

$$ \forall\, Q, P\in\mathcal{P}_{\mathcal{X}}^p \qquad W_1(Q,P) \;\le\; W_p(Q,P)\,. \tag{9}$$
Example 2.7.

Let $U_\theta$ be defined as in Example 2.6. Since it is easy to see that the optimal transport plan from $U_0$ to $U_\theta$ is $(0,x)\mapsto(\theta,x)$, the Wasserstein distance $W_p(U_0,U_\theta)=|\theta|$ converges to zero when $\theta$ tends to zero. Therefore, according to the topology induced by the Wasserstein distance on $\mathcal{P}_{\mathcal{X}}$, the sequence of distributions $(U_{1/i})$ converges to $U_0$ (Figure 1).
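In Example 2.7 all the transport happens along the first coordinate, so a one-dimensional analogue, a uniform distribution and its translate by $\theta$, exhibits the same behavior with $W_1 = \theta$. For one-dimensional samples of equal size, sorting both samples realizes the optimal (monotone) coupling; the sketch below, with hypothetical helper names, confirms the estimate tracks $\theta$ up to the sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)

def wasserstein_1d(x, y, p=1):
    """Empirical p-Wasserstein distance between 1-d samples of equal size:
    sorting gives the optimal transport plan (quantile coupling)."""
    x, y = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

n = 10_000
u0 = rng.uniform(0.0, 1.0, size=n)              # U[0,1]
for theta in (1.0, 0.3, 0.1):
    u_theta = theta + rng.uniform(0.0, 1.0, size=n)
    print(theta, wasserstein_1d(u0, u_theta))   # estimates close to theta
```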

Thanks to the Kantorovich duality theory, the Wasserstein distance is easily expressed in the variational form (3). We summarize below the essential results useful for this work and we direct the reader to [55, Chapters 4 and 5] for a full exposition.

Theorem 2.8 ([55, Theorem 4.1]).

Let $(\mathcal{X},d_{\mathcal{X}})$ and $(\mathcal{Y},d_{\mathcal{Y}})$ be two Polish metric spaces and $c:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}^+\cup\{\infty\}$ be a nonnegative continuous cost function. Let $\Pi(Q,P)$ be the set of probability measures on $\mathcal{X}\times\mathcal{Y}$ with marginals $Q\in\mathcal{P}_{\mathcal{X}}$ and $P\in\mathcal{P}_{\mathcal{Y}}$. There is a $\pi^*\in\Pi(Q,P)$ that minimizes $\mathbb{E}_{(x,y)\sim\pi}\big[c(x,y)\big]$ over all $\pi\in\Pi(Q,P)$.

Definition 2.9.

Let $(\mathcal{X},d_{\mathcal{X}})$ and $(\mathcal{Y},d_{\mathcal{Y}})$ be two Polish metric spaces and $c:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}^+\cup\{\infty\}$ be a nonnegative continuous cost function. The pair of functions $f:\mathcal{X}\to\mathbb{R}$ and $g:\mathcal{Y}\to\mathbb{R}$ is $c$-conjugate when

$$ \forall x\in\mathcal{X}\ \ f(x) = \inf_{y\in\mathcal{Y}}\ g(y) + c(x,y) \qquad\text{and}\qquad \forall y\in\mathcal{Y}\ \ g(y) = \sup_{x\in\mathcal{X}}\ f(x) - c(x,y)\,. \tag{10}$$
Theorem 2.10 (Kantorovich duality [55, Theorem 5.10]).

Let $(\mathcal{X},d_{\mathcal{X}})$ and $(\mathcal{Y},d_{\mathcal{Y}})$ be two Polish metric spaces and $c:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}^+\cup\{\infty\}$ be a nonnegative continuous cost function. For all $Q\in\mathcal{P}_{\mathcal{X}}$ and $P\in\mathcal{P}_{\mathcal{Y}}$, let $\Pi(Q,P)$ be the set of probability distributions defined on $\mathcal{X}\times\mathcal{Y}$ with marginal distributions $Q$ and $P$. Let $\mathcal{Q}_c$ be the set of all pairs $(f_Q,f_P)$ of respectively $Q$ and $P$-integrable functions satisfying the property $\forall (x,y)\in\mathcal{X}\times\mathcal{Y}$, $f_Q(x)-f_P(y)\le c(x,y)$.

  • (i) We have the duality

    $$ \min_{\pi\in\Pi(Q,P)}\ \mathbb{E}_{(x,y)\sim\pi}\big[\,c(x,y)\,\big] \tag{11}$$
    $$ \;=\; \sup_{(f_Q,f_P)\in\mathcal{Q}_c}\ \mathbb{E}_{x\sim Q}\big[f_Q(x)\big] - \mathbb{E}_{y\sim P}\big[f_P(y)\big]\,. \tag{12}$$

  • (ii) Further assuming that $\mathbb{E}_{(x,y)\sim Q\otimes P}\big[\,c(x,y)\,\big]<\infty$,

    • (a) Both (11) and (12) have solutions with finite cost.

    • (b) The solution $(f_Q,f_P)$ of (12) is a $c$-conjugate pair.

Corollary 2.11 ([55, Particular case 5.16]).

Under the same conditions as Theorem 2.10, when $\mathcal{X}=\mathcal{Y}$ and when the cost function $c$ is a distance, that is, satisfies (2), the dual optimization problem (12) can be rewritten as

$$ \sup_{f\in\mathrm{Lip}_1}\ \mathbb{E}_{x\sim Q}\big[f(x)\big] - \mathbb{E}_{x\sim P}\big[f(x)\big]\,, $$

where $\mathrm{Lip}_1$ is the set of real-valued 1-Lipschitz continuous functions on $\mathcal{X}$.

Thanks to Theorem 2.10, we can write the $p$-th power of the $p$-Wasserstein distance in variational form

$$ W_p(Q,P)^p \;=\; \sup_{(f_Q,f_P)\in\mathcal{Q}_c}\ \mathbb{E}_{x\sim Q}\big[f_Q(x)\big] - \mathbb{E}_{y\sim P}\big[f_P(y)\big]\,, \tag{13}$$

where $\mathcal{Q}_c$ is defined as in Theorem 2.10 for the cost $c(x,y)=d(x,y)^p$. Thanks to Corollary 2.11, we can also obtain a simplified expression in IPM form for the $1$-Wasserstein distance:

$$ W_1(Q,P) \;=\; \sup_{f\in\mathrm{Lip}_1}\ \mathbb{E}_{x\sim Q}\big[f(x)\big] - \mathbb{E}_{x\sim P}\big[f(x)\big]\,. \tag{14}$$

Let us conclude this presentation of the Wasserstein distance by mentioning that the definition (8) immediately implies several distance properties: zero when both distributions are equal (2.zero), strictly positive when they are different (2.separation), and symmetric (2.symmetry). Proposition 2.4 gives the triangular inequality (2.triangular) for the case $p=1$. In the general case, the triangular inequality can also be established using the Minkowski inequality [55, Chapter 6].
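The IPM form (14) suggests a direct way to estimate $W_1$ from samples: maximize the critic gap over a family of approximately 1-Lipschitz functions. The sketch below follows the weight-clipping recipe popularized by [3]; the architecture, clipping range, and optimizer settings are illustrative assumptions, and clipping only loosely controls the Lipschitz constant, so the printed value is the dual objective over this restricted critic family rather than $W_1$ itself.

```python
import torch

torch.manual_seed(0)

critic = torch.nn.Sequential(                    # a small critic network f
    torch.nn.Linear(1, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
opt = torch.optim.RMSprop(critic.parameters(), lr=5e-3)

for _ in range(500):
    xq = 2.0 + torch.randn(256, 1)               # samples from Q = N(2,1)
    xp = torch.randn(256, 1)                     # samples from P = N(0,1)
    gap = critic(xq).mean() - critic(xp).mean()
    opt.zero_grad()
    (-gap).backward()                            # gradient ascent on the gap
    opt.step()
    with torch.no_grad():
        for w in critic.parameters():            # clip weights to bound the
            w.clamp_(-0.1, 0.1)                  # critic's Lipschitz constant

with torch.no_grad():
    xq, xp = 2.0 + torch.randn(10_000, 1), torch.randn(10_000, 1)
    print(float(critic(xq).mean() - critic(xp).mean()))
```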

2.7 Energy Distance and Maximum Mean Discrepancy

The Energy Distance (ED) [53] between the probability distributions $Q$ and $P$ defined on the Euclidean space $\mathbb{R}^d$ is the square root (we take the square root because this is the quantity that behaves like a distance) of

$$ \mathcal{E}(Q,P)^2 \;\triangleq\; 2\,\mathbb{E}_{\substack{x\sim Q\\ y\sim P}}\big[\|x-y\|\big] \;-\; \mathbb{E}_{\substack{x\sim Q\\ x'\sim Q}}\big[\|x-x'\|\big] \;-\; \mathbb{E}_{\substack{y\sim P\\ y'\sim P}}\big[\|y-y'\|\big]\,, \tag{15}$$

where, as usual, $\|x-y\|$ denotes the Euclidean distance.

Let $\hat q$ and $\hat p$ represent the characteristic functions of the distributions $Q$ and $P$ respectively. Thanks to a neat Fourier transform argument [53, 52],

$$ \mathcal{E}(Q,P)^2 \;=\; \frac{1}{c_d}\int_{\mathbb{R}^d}\ \frac{\big|\hat q(t)-\hat p(t)\big|^2}{\|t\|^{d+1}}\ dt \qquad\text{with}\quad c_d = \frac{\pi^{\frac{d+1}{2}}}{\Gamma\big(\frac{d+1}{2}\big)}\,. \tag{16}$$

Since there is a one-to-one mapping between distributions and characteristic functions, this relation establishes an isomorphism between the space of probability distributions equipped with the ED distance and the space of the characteristic functions equipped with the weighted $L^2$ norm given in the right-hand side of (16). As a consequence, $\mathcal{E}(Q,P)$ satisfies the properties (2) of a distance.

Since the squared ED is expressed with a simple combination of expectations, it is easy to design a stochastic minimization algorithm that relies only on two oracles producing samples from each distribution [11, 7]. This makes the energy distance a computationally attractive criterion for training the implicit models discussed in Section 2.2.
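For instance, a plug-in estimate of the squared ED (15) needs nothing but two batches of samples. Here is a minimal numpy sketch (for an unbiased U-statistic one would exclude the diagonal terms of the within-sample averages):

```python
import numpy as np

rng = np.random.default_rng(0)

def energy_distance_sq(x, y):
    """Plug-in estimate of (15): 2 E||x-y|| - E||x-x'|| - E||y-y'||."""
    def mean_dist(a, b):
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).mean()
    return 2.0 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

x = rng.normal(size=(500, 2))           # batch from the first distribution
y = 0.5 + rng.normal(size=(500, 2))     # batch from the second distribution
print(energy_distance_sq(x, y))         # strictly positive since Q != P
```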

Generalized ED 

It is therefore natural to ask whether we can meaningfully generalize (15) by replacing the Euclidean distance with a symmetric function $d(x,y)$,

$$ \mathcal{E}_d(Q,P)^2 \;\triangleq\; 2\,\mathbb{E}_{\substack{x\sim Q\\ y\sim P}}\big[d(x,y)\big] \;-\; \mathbb{E}_{\substack{x\sim Q\\ x'\sim Q}}\big[d(x,x')\big] \;-\; \mathbb{E}_{\substack{y\sim P\\ y'\sim P}}\big[d(y,y')\big]\,. \tag{17}$$

The right-hand side of this expression is well defined when $Q,P\in\mathcal{P}_{\mathcal{X}}^1$. It is obviously symmetric (2.symmetry) and trivially zero (2.zero) when both distributions are equal. The first part of the following theorem gives the necessary and sufficient condition on $d$ to ensure that the right-hand side of (17) is nonnegative and therefore can be the square of $\mathcal{E}_d(Q,P)$. We shall see later that the triangular inequality (2.triangular) comes for free with this condition (Corollary 2.19). The second part of the theorem gives the necessary and sufficient condition for satisfying the separation property (2.separation).

Theorem 2.12 ([57]).

The right-hand side of definition (17) is:

  • (i) nonnegative for all $P,Q$ in $\mathcal{P}_{\mathcal{X}}^1$ if and only if the symmetric function $d$ is a negative definite kernel, that is,

    $$ \forall n\in\mathbb{N}\quad \forall x_1\ldots x_n\in\mathcal{X}\quad \forall c_1\ldots c_n\in\mathbb{R}\qquad \sum_{i=1}^{n} c_i = 0 \ \Longrightarrow\ \sum_{i=1}^{n}\sum_{j=1}^{n} c_i\, c_j\, d(x_i,x_j) \;\le\; 0\,, \tag{18}$$

  • (ii) strictly positive for all $P\neq Q$ in $\mathcal{P}_{\mathcal{X}}^1$ if and only if the function $d$ is a strongly negative definite kernel, that is, a negative definite kernel such that, for any probability measure $\mu\in\mathcal{P}_{\mathcal{X}}^1$ and any $\mu$-integrable real-valued function $h$ such that $\mathbb{E}_{x\sim\mu}[h(x)]=0$,

    $$ \mathbb{E}_{\substack{x\sim\mu\\ y\sim\mu}}\big[\, d(x,y)\, h(x)\, h(y) \,\big] = 0 \ \ \Longrightarrow\ \ h(x)=0\ \ \text{$\mu$-almost everywhere.} $$

Remark 2.13.

The definition of a strongly negative definite kernel is best explained by considering how its meaning would change if we were only considering probability measures with finite support $\{x_1\ldots x_n\}$. This amounts to requiring that (18) is an equality only if all the $c_i$'s are zero. However, this weaker property is not sufficient to ensure that the separation property (2.separation) holds.

Remark 2.14.

The relation (16) therefore means that the Euclidean distance on $\mathbb{R}^d$ is a strongly negative definite kernel. In fact, it can be shown that $d_\beta(x,y)=\|x-y\|^\beta$ is a strongly negative definite kernel for $0<\beta<2$ [52]. When $\beta=2$, it is easy to see that $\mathcal{E}_{d_2}(Q,P)$ is simply the distance between the distribution means and therefore cannot satisfy the separation property (2.separation).

Proof of Theorem 2.12   Let $\Delta(Q,P)$ be the right-hand side of (17) and let $T(\mu,h) \triangleq \mathbb{E}_{x,y\sim\mu}\big[\,d(x,y)\,h(x)\,h(y)\,\big]$ be the quantity that appears in clause (ii). Observe:

  • (a) Let $Q,P\in\mathcal{P}_{\mathcal{X}}^1$ have respective density functions $q$ and $p$ with respect to measure $\mu=(Q+P)/2$. Function $h=q-p$ then satisfies $\mathbb{E}_{x\sim\mu}[h(x)]=0$, and

    $$ \Delta(Q,P) \;=\; -\,T(\mu,\, q-p)\,. $$

  • (b) With $\mu\in\mathcal{P}_{\mathcal{X}}^1$, any $h$ such that $h\neq 0$ (i.e., not $\mu$-almost-surely zero) and $\mathbb{E}_{x\sim\mu}[h(x)]=0$ can be written as a difference $h=\lambda\,(q-p)$ of two nonnegative functions such that $\mathbb{E}_{x\sim\mu}[q(x)]=\mathbb{E}_{x\sim\mu}[p(x)]=1$. Then, $Q=q\,\mu$ and $P=p\,\mu$ belong to $\mathcal{P}_{\mathcal{X}}^1$, and

    $$ T(\mu,h) \;=\; -\,\lambda^2\,\Delta(Q,P)\,. $$

We can then prove the theorem:

  • (i) From these observations, if $\Delta(Q,P)\ge 0$ for all $P,Q$, then $T(\mu,h)\le 0$ for all $\mu$ and $h$ such that $\mathbb{E}_{x\sim\mu}[h(x)]=0$, implying (18). Conversely, assume there are $Q,P\in\mathcal{P}_{\mathcal{X}}^1$ such that $\Delta(Q,P)<0$. Using the weak law of large numbers [26] (see also Theorem 3.3 later in this document,) we can find finite support distributions $Q_n, P_n$ such that $\Delta(Q_n,P_n)<0$. Proceeding as in observation (a) then contradicts (18) because $\mu_n=(Q_n+P_n)/2$ has also finite support.

  • (ii) By contraposition, suppose there is $\mu$ and $h$ such that $h\neq 0$, $\mathbb{E}_{x\sim\mu}[h(x)]=0$, and $T(\mu,h)=0$. Observation (b) gives $P\neq Q$ such that $\Delta(Q,P)=0$. Conversely, suppose $\Delta(Q,P)=0$ with $Q\neq P$. Observation (a) gives $\mu$ and $h=q-p\neq 0$ such that $T(\mu,h)=0$. Since $\mathbb{E}_{x\sim\mu}[h(x)]$ must be zero, $d$ is not a strongly negative definite kernel.  ∎

Requiring that $d$ be a negative definite kernel is a quite strong assumption. For instance, a classical result by Schoenberg [45] establishes that a squared distance is a negative definite kernel if and only if the whole metric space induced by this distance is isometric to a subset of a Hilbert space and therefore has a Euclidean geometry:

Theorem 2.15 (Schoenberg, [45]).

The metric space $(\mathcal{X},d)$ is isometric to a subset of a Hilbert space if and only if $d^2$ is a negative definite kernel.

Requiring $d$ to be negative definite (not necessarily a squared distance anymore) has a similar impact on the geometry of the space $\mathcal{P}_{\mathcal{X}}^1$ equipped with the Energy Distance (Theorem 2.17). Let $x_0\in\mathcal{X}$ be an arbitrary origin point and define the symmetric triangular gap kernel $K_d$ as

$$ K_d(x,y) \;\triangleq\; \tfrac{1}{2}\,\big(\, d(x,x_0) + d(y,x_0) - d(x,y) \,\big)\,. \tag{19}$$
Proposition 2.16.

The function $d$ is a negative definite kernel if and only if $K_d$ is a positive definite kernel, that is,

$$ \forall n\in\mathbb{N}\quad \forall x_1\ldots x_n\in\mathcal{X}\quad \forall c_1\ldots c_n\in\mathbb{R}\qquad \sum_{i=1}^{n}\sum_{j=1}^{n} c_i\, c_j\, K_d(x_i,x_j) \;\ge\; 0\,. $$

Proof   The proposition directly results from the identity

$$ \sum_{i=1}^{n}\sum_{j=1}^{n} c_i\, c_j\, K_d(x_i,x_j) \;=\; -\,\frac{1}{2}\,\sum_{i=0}^{n}\sum_{j=0}^{n} c_i\, c_j\, d(x_i,x_j)\,, $$

where $x_0$ is the chosen origin point and $c_0 = -\sum_{i=1}^{n} c_i$.  ∎
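Proposition 2.16 is easy to probe numerically. A small sketch, under the assumption that $d$ is the Euclidean distance (which is negative definite by Remark 2.14): the Gram matrix of the triangular gap kernel on random points should then be positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)

def k_gap(x, y, x0):
    """Triangular gap kernel (19) for the Euclidean distance."""
    return 0.5 * (np.linalg.norm(x - x0) + np.linalg.norm(y - x0)
                  - np.linalg.norm(x - y))

pts = rng.normal(size=(50, 3))
x0 = np.zeros(3)                                   # arbitrary origin point
gram = np.array([[k_gap(a, b, x0) for b in pts] for a in pts])
print(np.linalg.eigvalsh(gram).min())              # >= 0 up to round-off
```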

Positive definite kernels in the machine learning literature have been extensively studied in the context of the so-called kernel trick [46]. In particular, it is well known that the theory of the Reproducing Kernel Hilbert Spaces (RKHS) [4, 1] establishes that there is a unique Hilbert space $\mathcal{H}$, called the RKHS, that contains all the functions

$$ \Phi_x:\ y\in\mathcal{X}\ \mapsto\ K_d(x,y) $$

and satisfies the reproducing property

$$ \forall f\in\mathcal{H}\quad \forall x\in\mathcal{X}\qquad \langle f,\,\Phi_x \rangle \;=\; f(x)\,. \tag{20}$$

We can then relate $\mathcal{E}_d(Q,P)$ to the RKHS norm.

Theorem 2.17 ([47] [40, Chapter 21]).

Let $d$ be a negative definite kernel and let $\mathcal{H}$ be the RKHS associated with the corresponding positive definite triangular gap kernel (19). We have then

$$ \mathcal{E}_d(Q,P)^2 \;=\; 2\,\big\|\,\Phi_Q-\Phi_P\,\big\|^2_{\mathcal{H}} \qquad\text{with}\quad \Phi_Q \triangleq \mathbb{E}_{x\sim Q}\big[\Phi_x\big]\,,\ \ \Phi_P \triangleq \mathbb{E}_{y\sim P}\big[\Phi_y\big]\,. $$

Proof   We can write directly

$$ \big\|\Phi_Q-\Phi_P\big\|^2_{\mathcal{H}} \;=\; \mathbb{E}_{\substack{x\sim Q\\ x'\sim Q}}\big[K_d(x,x')\big] + \mathbb{E}_{\substack{y\sim P\\ y'\sim P}}\big[K_d(y,y')\big] - 2\,\mathbb{E}_{\substack{x\sim Q\\ y\sim P}}\big[K_d(x,y)\big] \;=\; \frac{1}{2}\,\mathcal{E}_d(Q,P)^2\,, $$

where the first equality results from the reproducing property (20) and the identity $\langle\Phi_x,\Phi_y\rangle = K_d(x,y)$, and where the second equality results from expanding $K_d$ with (19): the $d(\cdot,x_0)$ terms cancel, leaving the combination (17).  ∎

Remark 2.18.

In the context of this theorem, the relation (16) is simply an analytic expression of the RKHS norm associated with the triangular gap kernel of the Euclidean distance.

Corollary 2.19.

If $d$ is a negative definite kernel, then $\mathcal{E}_d$ is a pseudodistance, that is, it satisfies all the properties (2) of a distance except maybe the separation property (2.separation).

Corollary 2.20.

The following three conditions are then equivalent:

  • (i) $\mathcal{E}_d$ satisfies all the properties (2) of a distance.

  • (ii) $d$ is a strongly negative definite kernel.

  • (iii) the map $Q\in\mathcal{P}_{\mathcal{X}}^1 \mapsto \Phi_Q\in\mathcal{H}$ is injective (characteristic kernel [21].)

Maximum Mean Discrepancy 

Following [21], we can then write $\mathcal{E}_d$ as an IPM:

$$ \mathcal{E}_d(Q,P) \;=\; \sqrt{2}\ \sup_{\substack{f\in\mathcal{H}\\ \|f\|_{\mathcal{H}}\le 1}}\ \mathbb{E}_{x\sim Q}\big[f(x)\big] \;-\; \mathbb{E}_{y\sim P}\big[f(y)\big]\,. \tag{21}$$

Indeed, by the reproducing property (20), $\mathbb{E}_Q[f]-\mathbb{E}_P[f] = \langle f,\,\Phi_Q-\Phi_P\rangle$, whose supremum over the unit ball of $\mathcal{H}$ is $\|\Phi_Q-\Phi_P\|_{\mathcal{H}} = \mathcal{E}_d(Q,P)/\sqrt{2}$ by Theorem 2.17.
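With the conventions used above (the factor 2 in (15) and (17), and the factor 1/2 in the kernel (19)), the squared Energy Distance equals twice the squared kernel Maximum Mean Discrepancy. A short numerical check of this identity on empirical distributions, assuming the Euclidean distance and an arbitrary origin:

```python
import numpy as np

rng = np.random.default_rng(0)

def pdist(a, b):
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

x = rng.normal(size=(400, 2))                    # sample from Q
y = 1.0 + rng.normal(size=(400, 2))              # sample from P
x0 = np.zeros((1, 2))                            # arbitrary origin point

def k_gap(a, b):
    """Triangular gap kernel (19) for the Euclidean distance."""
    return 0.5 * (pdist(a, x0) + pdist(b, x0).T - pdist(a, b))

mmd_sq = k_gap(x, x).mean() + k_gap(y, y).mean() - 2.0 * k_gap(x, y).mean()
ed_sq = 2.0 * pdist(x, y).mean() - pdist(x, x).mean() - pdist(y, y).mean()
print(ed_sq, 2.0 * mmd_sq)                       # identical up to round-off
```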