Instead of representing the model distribution with a parametric density function, implicit generative models directly describe how to draw samples of the model distribution by first drawing a sample $z$ from a fixed random generator and mapping it into the data space with a parametrized generator function $G_\theta(z)$. The reparametrization trick [13, 42], Variational Auto-Encoders (VAEs), and Generative Adversarial Networks (GANs) are recent instances of this approach.
Many of these authors motivate implicit modeling with the computational advantage that results from the ability to use the efficient back-propagation algorithm to update the generator parameters. In contrast, our work targets another, more fundamental, advantage of implicit modeling.
approach to computer vision [31, 30, 43]. Implicit modeling makes this much simpler:
The structure of the generator function could be directly interpreted as a set of equations describing a physical or causal model of the data .
There is no need to deal with latent variables, since all the variables of interest are explicitly computed by the generator function.
Implicit modeling can easily represent simple phenomena involving a small set of observed or inferred variables. The corresponding model distribution cannot be represented with a density function because it is supported by a low-dimensional manifold. But nothing prevents an implicit model from generating such samples.
Unfortunately, we cannot fully realize these benefits using the popular Maximum Likelihood Estimation (MLE) approach, which asymptotically amounts to minimizing the Kullback-Leibler (KL) divergence $D_{KL}(Q,P_\theta)$ between the data distribution $Q$ and the model distribution $P_\theta$,
$$ D_{KL}(Q,P_\theta) = \int \log\left(\frac{q(x)}{p_\theta(x)}\right) q(x)\, d\mu(x), $$
where $p_\theta$ and $q$ are the density functions of $P_\theta$ and $Q$ with respect to a common measure $\mu$. This criterion is particularly convenient because it enjoys favorable statistical properties and because its optimization can be written as an expectation with respect to the data distribution,
$$ \operatorname*{argmin}_\theta D_{KL}(Q,P_\theta) = \operatorname*{argmin}_\theta \mathbb{E}_{x\sim Q}\left[-\log p_\theta(x)\right] \approx \operatorname*{argmax}_\theta \prod_{i=1}^{n} p_\theta(x_i), $$
which is readily amenable to computationally attractive stochastic optimization procedures. First, this expression is ill-defined when the model distribution cannot be represented by a density. Second, if the likelihood $p_\theta(x_i)$ of a single example $x_i$ is zero, the dataset likelihood is also zero, and there is nothing to maximize. The typical remedy is to add a noise term to the model distribution. Virtually all generative models described in the classical machine learning literature include such a noise component whose purpose is not to model anything useful, but merely to make MLE work.
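The second failure mode can be illustrated with a minimal sketch. The model below is a hypothetical one chosen only for illustration (a uniform density on $[\theta, \theta+1]$, not a model used in this paper): as soon as the data are spread over more than the width of the support, every value of $\theta$ gives at least one example a zero likelihood, so the dataset likelihood is identically zero.

```python
def model_density(x, theta):
    """Hypothetical model: uniform density on the segment [theta, theta + 1]."""
    return 1.0 if theta <= x <= theta + 1.0 else 0.0

def dataset_likelihood(data, theta):
    """Product of the per-example likelihoods p_theta(x_i)."""
    likelihood = 1.0
    for x in data:
        likelihood *= model_density(x, theta)
    return likelihood

# The spread of these examples (1.5) exceeds the width of the model support (1.0).
data = [0.0, 0.4, 1.5]

# Evaluate the dataset likelihood on a grid of parameters from -2.0 to 2.0.
thetas = [i / 10.0 - 2.0 for i in range(41)]
likelihoods = [dataset_likelihood(data, theta) for theta in thetas]

# Without the last example, some parameters do explain the data.
compact_likelihood = dataset_likelihood(data[:2], -0.5)
```

In this sketch `likelihoods` contains only zeros, so MLE has no signal to follow, whereas the two-example subset has likelihood one at $\theta=-0.5$.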
Instead of using ad-hoc noise terms to coerce MLE into optimizing a different similarity criterion between the data distribution and the model distribution, we could as well explicitly optimize a different criterion. Therefore it is crucial to understand how the selection of a particular criterion will influence the learning process and its final result.
Section 2 reviews known results establishing how many interesting distribution comparison criteria can be expressed in adversarial form, and are amenable to tractable optimization algorithms. Section 3 reviews the statistical properties of two interesting families of distribution distances, namely the family of the Wasserstein distances and the family containing the Energy Distances and the Maximum Mean Discrepancies. Although the Wasserstein distances have far worse statistical properties, experimental evidence shows that they can deliver better performance in meaningful applicative setups. Section 4 reviews essential concepts about geodesic geometry in metric spaces. Section 5 shows how different probability distances induce different geodesic geometries in the space of probability measures. Section 6 leverages these geodesic structures to define various flavors of convexity for parametric families of generative models, which can be used to prove that a simple gradient descent algorithm will either reach or approach the global minimum regardless of the traditional nonconvexity of the parametrization of the model family. In particular, when one uses implicit generative models, minimizing the Wasserstein distance with a gradient descent algorithm offers much better guarantees than minimizing the Energy Distance.
2 The adversarial formulation
The adversarial training framework popularized by the Generative Adversarial Networks (GANs)  can be used to minimize a great variety of probability comparison criteria. Although some of these criteria can also be optimized using simpler algorithms, adversarial training provides a common template that we can use to compare the criteria themselves.
This section presents the adversarial training framework and reviews the main categories of probability comparison criteria it supports, namely Integral Probability Metrics (IPM) (Section 2.4), f-divergences (Section 2.5), Wasserstein distances (WD) (Section 2.6), and Energy Distances (ED) or Maximum Mean Discrepancy distances (MMD) (Section 2.7).
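Although it is intuitively useful to consider that the sample space $\mathcal{X}$ is some convex subset of $\mathbb{R}^d$, it is also useful to spell out more precisely which properties are essential to the development. In the following, we assume that $\mathcal{X}$ is a Polish metric space, that is, a complete and separable space whose topology is defined by a distance function
$$ d : \mathcal{X}\times\mathcal{X} \to \mathbb{R}_+\cup\{+\infty\}, \quad (x,y)\mapsto d(x,y) $$
satisfying the properties of a metric distance:
$$ \forall x,y,z\in\mathcal{X}\quad \begin{cases} d(x,x)=0 & \text{(zero)}\\ x\neq y \Rightarrow d(x,y)>0 & \text{(separation)}\\ d(x,y)=d(y,x) & \text{(symmetry)}\\ d(x,y)\le d(x,z)+d(z,y) & \text{(triangular inequality)} \end{cases} \tag{2} $$
Let $\mathcal{U}$ be the Borel $\sigma$-algebra generated by all the open sets of $\mathcal{X}$. We use the notation $\mathcal{P}_\mathcal{X}$ for the set of probability measures $\mu$ defined on $(\mathcal{X},\mathcal{U})$, and the notation $\mathcal{P}^p_\mathcal{X}\subset\mathcal{P}_\mathcal{X}$ for those satisfying $\mathbb{E}_{x,y\sim\mu}[d(x,y)^p]<\infty$. This condition is equivalent to $\mathbb{E}_{x\sim\mu}[d(x,x_0)^p]<\infty$ for an arbitrary origin $x_0$ when $d$ is finite, symmetric, and satisfies the triangular inequality.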
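We are interested in criteria to compare elements of $\mathcal{P}_\mathcal{X}$,
$$ D : \mathcal{P}_\mathcal{X}\times\mathcal{P}_\mathcal{X} \to \mathbb{R}_+\cup\{+\infty\}, \quad (Q,P)\mapsto D(Q,P). $$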
Although it is desirable that $D$ also satisfies the properties of a distance (2), this is not always possible. In this contribution, we strive to only reserve the word distance for criteria that satisfy the properties (2) of a metric distance. We use the word pseudodistance (see footnote 1) when a nonnegative criterion fails to satisfy the separation property in (2). We use the word divergence for criteria that are not symmetric or fail to satisfy the triangular inequality in (2).
Footnote 1: Although failing to satisfy the separation property in (2) can have serious practical consequences, recall that a pseudodistance always becomes a full fledged distance on the quotient space $\mathcal{X}/\mathcal{R}$ where $\mathcal{R}$ denotes the equivalence relation $x\,\mathcal{R}\,y \Leftrightarrow d(x,y)=0$. All the theory applies as long as one never distinguishes two points separated by a zero distance.
We generally assume in this contribution that the distance $d$ defined on $\mathcal{X}$ is finite. However we allow probability comparison criteria to be infinite. When the distributions $Q,P$ do not belong to the domain for which a particular criterion $D$ is defined, we take that $D(Q,P)=0$ if $Q=P$ and $D(Q,P)=+\infty$ otherwise.
2.2 Implicit modeling
We are particularly interested in model distributions that are supported by a low-dimensional manifold in a large ambient sample space (recall Section 1). Since such distributions do not typically have a density function, we cannot represent the model family using a parametric density function. Following the example of Variational Auto-Encoders (VAE)  and Generative Adversarial Networks (GAN) , we represent the model distributions by defining how to produce samples.
Let $z$ be a random variable with known distribution $\mu_z$ defined on a suitable probability space $\mathcal{Z}$ and let $G_\theta$ be a measurable function, called the generator, parametrized by $\theta\in\mathbb{R}^d$,
$$ G_\theta : z\in\mathcal{Z} \mapsto G_\theta(z)\in\mathcal{X}. $$
The random variable $G_\theta(z)\in\mathcal{X}$ follows the push-forward distribution (see footnote 2)
$$ G_\theta(z)\#\mu_z(z) \in \mathcal{P}_\mathcal{X}. $$
Footnote 2: We use the notation $f\#\mu$ or $f(x)\#\mu(x)$ to denote the probability distribution obtained by applying function $f$ or expression $f(x)$ to samples of the distribution $\mu$.
By varying the parameter $\theta$ of the generator $G_\theta$, we can change this push-forward distribution and hopefully make it close to the data distribution $Q$ according to the criterion of interest.
This implicit modeling approach is useful in two ways. First, unlike densities, it can represent distributions confined to a low-dimensional manifold. Second, the ability to easily generate samples is frequently more useful than knowing the numerical value of the density function (for example in image superresolution or semantic segmentation when considering the conditional distribution of the output image given the input image). In general, it is computationally difficult to generate samples given an arbitrary high-dimensional density .
Learning algorithms for implicit models must therefore be formulated in terms of two sampling oracles. The first oracle returns training examples, that is, samples from the data distribution $Q$. The second oracle returns generated examples, that is, samples from the model distribution $P_\theta = G_\theta\#\mu_z$. This is particularly easy when the comparison criterion $D(Q,P_\theta)$ can be expressed in terms of expectations with respect to the distributions $Q$ or $P_\theta$.
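The two oracles can be sketched in a few lines. Everything below is an illustrative assumption rather than a construction from this paper: the generator maps a scalar $z$ onto a circle of radius $\theta$ in $\mathbb{R}^2$ (a one-dimensional manifold with no density in the plane), and the data oracle is a stand-in that samples a circle of radius 2.

```python
import math
import random

def generator(z, theta):
    """Hypothetical generator G_theta: maps a scalar z to a point of R^2
    lying on the circle of radius theta."""
    return (theta * math.cos(z), theta * math.sin(z))

def model_oracle(theta, rng):
    """Returns one generated example, i.e. a sample of G_theta # mu_z."""
    z = rng.uniform(0.0, 2.0 * math.pi)  # z ~ mu_z (uniform, by assumption)
    return generator(z, theta)

def data_oracle(rng):
    """Stand-in for the training set: samples from a circle of radius 2."""
    return generator(rng.uniform(0.0, 2.0 * math.pi), 2.0)

def expectation(oracle, f, n):
    """Monte Carlo estimate of E[f(x)] using only calls to a sampling oracle."""
    return sum(f(oracle()) for _ in range(n)) / n

rng = random.Random(0)
generated = [model_oracle(1.5, rng) for _ in range(1000)]
radii = [math.hypot(x, y) for x, y in generated]

# Expectations of a test function under Q and P_theta, estimated from samples.
norm = lambda point: math.hypot(point[0], point[1])
mean_norm_data = expectation(lambda: data_oracle(rng), norm, 500)
mean_norm_model = expectation(lambda: model_oracle(1.5, rng), norm, 500)
```

All generated points lie on the circle of radius 1.5, so the model distribution has no density on $\mathbb{R}^2$, yet expectations under it are easy to estimate from samples.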
2.3 Adversarial training
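We are more specifically interested in distribution comparison criteria that can be expressed in the form
$$ D(Q,P) = \sup_{(f_Q,f_P)\in\mathcal{Q}} \mathbb{E}_Q[f_Q(x)] - \mathbb{E}_P[f_P(x)]. \tag{3} $$
The set $\mathcal{Q}$ defines which pairs $(f_Q,f_P)$ of real-valued critic functions defined on $\mathcal{X}$ are considered in this maximization. As discussed in the following subsections, different choices of $\mathcal{Q}$ lead to a broad variety of criteria. This formulation is a mild generalization of the Integral Probability Metrics (IPMs) for which both functions $f_Q$ and $f_P$ are constrained to be equal (Section 2.4).
Finding the optimal generator parameter $\theta^*$ then amounts to minimizing a cost function $C(\theta)$ which itself is a supremum,
$$ \min_\theta C(\theta) \quad\text{with}\quad C(\theta) = \sup_{(f_Q,f_P)\in\mathcal{Q}} \mathbb{E}_Q[f_Q(x)] - \mathbb{E}_{z\sim\mu_z}[f_P(G_\theta(z))]. \tag{4} $$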
Although it is sometimes possible to reformulate this cost function in a manner that does not involve a supremum (Section 2.7), many algorithms can be derived from the following variant of the envelope theorem .
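Theorem 2.1. Let $C$ be the cost function defined in (4) and let $\theta_0$ be a specific value of the generator parameter. Under the following assumptions,
(a) there is $(f_Q^*,f_P^*)\in\mathcal{Q}$ such that $C(\theta_0) = \mathbb{E}_Q[f_Q^*(x)] - \mathbb{E}_{z\sim\mu_z}[f_P^*(G_{\theta_0}(z))]$,
(b) the function $C$ is differentiable in $\theta_0$,
(c) the functions $h_z = \theta\mapsto f_P^*(G_\theta(z))$ are $\mu_z$-almost surely differentiable in $\theta_0$,
(d) and there exists an open neighborhood $\mathcal{V}$ of $\theta_0$ and a $\mu_z$-integrable function $D(z)$ such that $\forall\theta\in\mathcal{V}$, $|h_z(\theta)-h_z(\theta_0)| \le D(z)\,\|\theta-\theta_0\|$,
we have the equality
$$ \nabla_\theta C(\theta_0) = -\,\mathbb{E}_{z\sim\mu_z}\left[\nabla_\theta h_z(\theta_0)\right]. $$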
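This result means that we can compute the gradient of $C(\theta_0)$ without taking into account the way $f_P^*$ changes with $\theta_0$. The most important assumption here is the differentiability of the cost $C$. Without this assumption, we can only assert that $-\mathbb{E}_{z\sim\mu_z}[\nabla_\theta h_z(\theta_0)]$ belongs to the “local” subgradient
$$ \partial^{\mathrm{loc}} C(\theta_0) \triangleq \left\{\, g\in\mathbb{R}^d : \forall\theta\in\mathbb{R}^d,\ C(\theta) \ge C(\theta_0) + \langle g,\theta-\theta_0\rangle + o(\|\theta-\theta_0\|) \,\right\}. $$
Proof Let $\lambda>0$ and let $u\in\mathbb{R}^d$ be an arbitrary unit vector. From (4) and assumption (a),
$$ C(\theta_0+\lambda u) \ge \mathbb{E}_Q[f_Q^*(x)] - \mathbb{E}_{z\sim\mu_z}[f_P^*(G_{\theta_0+\lambda u}(z))], $$
$$ C(\theta_0+\lambda u) - C(\theta_0) \ge -\,\mathbb{E}_{z\sim\mu_z}\left[h_z(\theta_0+\lambda u) - h_z(\theta_0)\right]. $$
Dividing this last inequality by $\lambda$, taking its limit when $\lambda\to0$, recalling that the dominated convergence theorem and assumption (d) allow us to take the limit inside the expectation operator, and rearranging the result gives
$$ Au \ge 0 \quad\text{with}\quad A : u\in\mathbb{R}^d \mapsto \left\langle u,\ \nabla_\theta C(\theta_0) + \mathbb{E}_{z\sim\mu_z}[\nabla_\theta h_z(\theta_0)] \right\rangle. $$
Writing the same for unit vector $-u$ yields inequality $-Au\ge0$. Therefore $Au=0$.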
Thanks to this result, we can compute an unbiased (see footnote 3) stochastic estimate $\hat g(\theta_t)$ of the gradient $\nabla_\theta C(\theta_t)$ by first solving the maximization problem in (4), and then using the back-propagation algorithm to compute the average gradient on a minibatch $z_1\dots z_k$ sampled from $\mu_z$,
$$ \hat g(\theta_t) = -\frac{1}{k}\sum_{i=1}^{k} \nabla_\theta f_P^*(G_\theta(z_i)). $$
Such an unbiased estimate can then be used to perform a stochastic gradient descent update iteration on the generator parameter
$$ \theta_{t+1} = \theta_t - \eta_t\, \hat g(\theta_t). $$
Footnote 3: Stochastic gradient descent often relies on unbiased gradient estimates (for a more general condition, see [10, Assumption 4.3]). This is not a given: estimating the Wasserstein distance (14) and its gradients on small minibatches gives severely biased estimates. This is in fact very obvious for minibatches of size one. Theorem 2.1 therefore provides an imperfect but useful alternative.
In order to obtain an unbiased gradient estimate $\hat g(\theta_t)$, we need to solve the maximization problem in (4) for the true distributions rather than for a particular subset of examples. On the one hand, we can use the standard machine learning toolbox to avoid overfitting the maximization problem. On the other hand, this toolbox essentially works by restricting the family $\mathcal{Q}$ in ways that can change the meaning of the comparison criterion itself [5, 34].
In practice, solving the maximization problem (4) during each iteration of the stochastic gradient algorithm is computationally too costly. Instead, practical algorithms interleave two kinds of stochastic iterations: gradient ascent steps on $(f_Q,f_P)$, and gradient descent steps on $\theta$, with a much smaller effective stepsize. Such algorithms belong to the general class of stochastic algorithms with two time scales [9, 29]. Their convergence properties form a delicate topic, clearly beyond the purpose of this contribution.
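The interleaving of critic ascent and generator descent can be sketched on a toy problem. All modeling choices here are assumptions made for illustration only: the data distribution is a Gaussian with mean 3, the generator is $G_\theta(z)=\theta+z$, and the critic family is the set of linear functions $f_w(x)=wx$ with $|w|\le1$, so that both critics are identical and the criterion reduces to the absolute difference of the means. The generator step uses the minibatch gradient estimate $-\frac1k\sum_i \nabla_\theta f_w(G_\theta(z_i))$, which equals $-w$ in this setting.

```python
import random

def train(steps=3000, critic_lr=0.5, generator_lr=0.005, batch=128, seed=0):
    """Two-time-scale adversarial training on a one-dimensional toy problem."""
    rng = random.Random(seed)
    w = 0.0      # critic parameter: f_w(x) = w * x, constrained to |w| <= 1
    theta = 0.0  # generator parameter: G_theta(z) = theta + z
    for _ in range(steps):
        data = [rng.gauss(3.0, 1.0) for _ in range(batch)]          # x ~ Q
        fake = [theta + rng.gauss(0.0, 1.0) for _ in range(batch)]  # G_theta(z)

        # Critic: one stochastic gradient ascent step on
        # E_Q[f_w(x)] - E[f_w(G_theta(z))], followed by a projection on |w| <= 1.
        grad_w = sum(data) / batch - sum(fake) / batch
        w = max(-1.0, min(1.0, w + critic_lr * grad_w))

        # Generator: one descent step with the estimate
        # g_hat = -(1/k) * sum_i d/dtheta f_w(G_theta(z_i)) = -w,
        # using a much smaller stepsize than the critic.
        g_hat = -w
        theta -= generator_lr * g_hat
    return theta, w

theta, w = train()
```

With these stepsizes the generator parameter ends up oscillating in a small neighborhood of the data mean rather than converging exactly, which is one reason the convergence of such schemes is delicate.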
2.4 Integral probability metrics
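Integral probability metrics (IPMs) have the form
$$ D(Q,P) = \left|\, \sup_{f\in\mathcal{Q}} \mathbb{E}_Q[f(x)] - \mathbb{E}_P[f(x)] \,\right|. $$
Note that the surrounding absolute value can be eliminated by requiring that $\mathcal{Q}$ also contains the opposite of every one of its functions.
$$ D(Q,P) = \sup_{f\in\mathcal{Q}} \mathbb{E}_Q[f(x)] - \mathbb{E}_P[f(x)] \quad\text{where } \mathcal{Q} \text{ satisfies } \forall f\in\mathcal{Q},\ -f\in\mathcal{Q}. \tag{5} $$
Therefore an IPM is a special case of (3) where the critic functions $f_Q$ and $f_P$ are constrained to be identical, and where $\mathcal{Q}$ is again constrained to contain the opposite of every critic function. Whereas expression (3) does not guarantee that $D(Q,P)$ is finite and is a distance, an IPM is always a pseudodistance.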
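Any integral probability metric $D$ of the form (5) is a pseudodistance.
Proof To establish the triangular inequality in (2), we can write, for all $Q,P,R\in\mathcal{P}_\mathcal{X}$,
$$ \begin{aligned} D(Q,P)+D(P,R) &= \sup_{f_1,f_2\in\mathcal{Q}} \mathbb{E}_Q[f_1(x)]-\mathbb{E}_P[f_1(x)]+\mathbb{E}_P[f_2(x)]-\mathbb{E}_R[f_2(x)] \\ &\ge \sup_{f_1=f_2\in\mathcal{Q}} \mathbb{E}_Q[f_1(x)]-\mathbb{E}_P[f_1(x)]+\mathbb{E}_P[f_2(x)]-\mathbb{E}_R[f_2(x)] \\ &= \sup_{f\in\mathcal{Q}} \mathbb{E}_Q[f(x)]-\mathbb{E}_R[f(x)] = D(Q,R). \end{aligned} $$
The other properties of a pseudodistance are trivial consequences of (5).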
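The most fundamental IPM is the Total Variation (TV) distance,
$$ D_{TV}(Q,P) \triangleq \sup_{A\in\mathcal{U}} |P(A)-Q(A)| = \sup_{f\in C(\mathcal{X},[0,1])} \mathbb{E}_Q[f(x)] - \mathbb{E}_P[f(x)], \tag{6} $$
where $C(\mathcal{X},[0,1])$ is the space of continuous functions from $\mathcal{X}$ to $[0,1]$.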
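On a finite sample space the supremum in the critic form of the total variation distance can be computed by hand, which makes for a quick sanity check. The sketch below assumes two hypothetical discrete distributions given as probability vectors; the optimal critic with values in $[0,1]$ is then the indicator of the set where $q>p$, and the result coincides with half the $L_1$ distance between the vectors.

```python
def tv_by_critic(q, p):
    """E_Q[f] - E_P[f] for the critic f = indicator of {q > p}, which attains
    the supremum over all critics with values in [0, 1] on a finite space."""
    critic = [1.0 if qi > pi else 0.0 for qi, pi in zip(q, p)]
    return sum(f * (qi - pi) for f, qi, pi in zip(critic, q, p))

def tv_half_l1(q, p):
    """Closed form on a finite space: half the L1 distance."""
    return 0.5 * sum(abs(qi - pi) for qi, pi in zip(q, p))

# Two hypothetical distributions on a four-point sample space.
q = [0.5, 0.25, 0.25, 0.0]
p = [0.25, 0.25, 0.0, 0.5]

tv_critic = tv_by_critic(q, p)
tv_closed_form = tv_half_l1(q, p)
```

Both computations return 0.5 for these vectors.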
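2.5 f-divergences
Many classical criteria belong to the family of f-divergences
$$ D_f(Q,P) \triangleq \int f\!\left(\frac{q(x)}{p(x)}\right) p(x)\, d\mu(x) \tag{7} $$
where $p$ and $q$ are respectively the densities of $P$ and $Q$ relative to measure $\mu$ and where $f$ is a continuous convex function defined on $\mathbb{R}_+^*$ such that $f(1)=0$.
Expression (7) trivially satisfies the zero property in (2). It is always nonnegative because we can pick a subderivative $u\in\partial f(1)$ and use the inequality $f(t)\ge u(t-1)$. This also shows that the separation property in (2) is satisfied when this inequality is strict for all $t\neq1$.
Usually (see footnote 4),
$$ D_f(Q,P) = \sup_{\substack{g \text{ bounded, measurable}\\ g(\mathcal{X})\subset \mathrm{dom}(f^*)}} \mathbb{E}_Q[g(x)] - \mathbb{E}_P[f^*(g(x))], $$
where $f^*$ denotes the convex conjugate of $f$.
Footnote 4: The statement holds when there is an $M>0$ such that $\mu\{x : |f(q(x)/p(x))|>M\}=0$. Restricting $\mu$ to exclude such subsets and taking the limit $M\to\infty$ may not work because $\lim\sup\neq\sup\lim$ in general. Yet, in practice, the result can be verified by elementary calculus for the usual choices of f, such as those shown in Table 1.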
Table 1 provides examples of f-divergences and provides both the function f and the corresponding conjugate function that appears in the variational formulation. In particular, as argued in , this analysis clarifies the probability comparison criteria associated with the early GAN variants .
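| | $f(t)$ | $\mathrm{dom}(f^*)$ | $f^*(u)$ |
|Total variation (6)| $\frac12\lvert t-1\rvert$ | $[-\frac12,\frac12]$ | $u$ |
|GAN’s Jensen Shannon | $t\log(t)-(t+1)\log(t+1)$ | $\mathbb{R}_-$ | $-\log(1-\exp(u))$ |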
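For distributions on a finite sample space, both the integral definition of an f-divergence and its variational form reduce to finite sums. The sketch below assumes two hypothetical probability vectors with strictly positive $p$, uses $f(t)=\frac12|t-1|$ for the total variation, and uses the usual choice $f(t)=t\log t$ for the KL divergence. For the total variation, the conjugate is $f^*(u)=u$ on $[-\frac12,\frac12]$, so the critic $g=\pm\frac12$ with the sign of $q-p$ attains the supremum.

```python
import math

def f_divergence(q, p, f):
    """D_f(Q, P) = sum_x f(q(x) / p(x)) * p(x), assuming p(x) > 0 everywhere."""
    return sum(f(qi / pi) * pi for qi, pi in zip(q, p))

def f_total_variation(t):
    return 0.5 * abs(t - 1.0)

def f_kullback_leibler(t):
    # t * log(t), extended by continuity at t = 0.
    return t * math.log(t) if t > 0.0 else 0.0

# Two hypothetical distributions on a three-point sample space.
q = [0.5, 0.25, 0.25]
p = [0.25, 0.25, 0.5]

tv = f_divergence(q, p, f_total_variation)
kl = f_divergence(q, p, f_kullback_leibler)

# Variational form for the total variation: E_Q[g] - E_P[f*(g)] with f*(u) = u
# and a critic g taking values in [-1/2, 1/2].
g = [0.5 if qi > pi else -0.5 for qi, pi in zip(q, p)]
tv_variational = (sum(gi * qi for gi, qi in zip(g, q))
                  - sum(gi * pi for gi, pi in zip(g, p)))
```

Here `tv` and `tv_variational` agree, and `kl` equals $\frac14\log 2$, the value of $\sum_x q(x)\log(q(x)/p(x))$ for these vectors.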
Despite the elegance of this framework, these comparison criteria are not very attractive when the distributions are supported by low-dimensional manifolds that may not overlap. The following simple example shows how this can be a problem .
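Let $U$ be the uniform distribution on the real segment $[0,1]$ and consider the distributions $P_\theta = (\theta,x)\#U(x)$ defined on $\mathbb{R}^2$. Because $P_0$ and $P_\theta$ have disjoint support for $\theta\neq0$, neither the total variation distance $D_{TV}(P_0,P_\theta)$ nor the f-divergence $D_f(P_0,P_\theta)$ depends on the exact value of $\theta$. Therefore, according to the topologies induced by these criteria on $\mathcal{P}_\mathcal{X}$, the sequence of distributions $(P_{1/i})$ does not converge to $P_0$ (Figure 1).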
2.6 Wasserstein distance
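For any $p\ge1$, the $p$-Wasserstein distance (WD) is the $p$-th root of
$$ \forall Q,P\in\mathcal{P}^p_\mathcal{X}\qquad W_p(Q,P)^p \triangleq \inf_{\pi\in\Pi(Q,P)} \mathbb{E}_{(x,y)\sim\pi}\left[d(x,y)^p\right], \tag{8} $$
where $\Pi(Q,P)$ represents the set of all measures $\pi$ defined on $\mathcal{X}\times\mathcal{X}$ with marginals $x\#\pi(x,y)$ and $y\#\pi(x,y)$ respectively equal to $Q$ and $P$. Intuitively, $d(x,y)^p$ represents the cost of transporting a grain of probability from point $x$ to point $y$, and the joint distributions $\pi\in\Pi(Q,P)$ represent transport plans.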
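The contrast with the total variation distance can be checked numerically on the earlier example of uniform distributions supported by parallel segments. The sketch below is a simplification made for illustration: since $P_0$ and $P_\theta$ share the same second coordinate, only the first coordinates are compared, and the 1-Wasserstein distance between two equal-size samples on the real line is computed with the standard sorting argument (the optimal plan in one dimension matches order statistics).

```python
def wasserstein1_1d(xs, ys):
    """1-Wasserstein distance between two empirical distributions on the real
    line with the same number of points: match the sorted samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def total_variation_discrete(xs, ys):
    """Total variation between two empirical distributions: half the L1
    distance between their frequency tables."""
    support = set(xs) | set(ys)
    return 0.5 * sum(abs(xs.count(v) / len(xs) - ys.count(v) / len(ys))
                     for v in support)

results = {}
for theta in (1.0, 0.5, 0.25):
    first_coords_p0 = [0.0] * 4        # first coordinate of samples of P_0
    first_coords_ptheta = [theta] * 4  # first coordinate of samples of P_theta
    results[theta] = (
        wasserstein1_1d(first_coords_p0, first_coords_ptheta),
        total_variation_discrete(first_coords_p0, first_coords_ptheta),
    )
```

In this sketch the Wasserstein value shrinks with $\theta$ while the total variation stays at 1 for every $\theta\neq0$.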
Thanks to the Kantorovich duality theory, the Wasserstein distance is easily expressed in the variational form (3). We summarize below the essential results useful for this work and we direct the reader to [55, Chapters 4 and 5] for a full exposition.
Theorem 2.8 ([55, Theorem 4.1]).
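Let $\mathcal{X},\mathcal{Y}$ be two Polish metric spaces and $c:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}_+\cup\{+\infty\}$ be a nonnegative continuous cost function. Let $\Pi(Q,P)$ be the set of probability measures on $\mathcal{X}\times\mathcal{Y}$ with marginals $Q\in\mathcal{P}_\mathcal{X}$ and $P\in\mathcal{P}_\mathcal{Y}$. There is a $\pi^*\in\Pi(Q,P)$ that minimizes $\mathbb{E}_{(x,y)\sim\pi}[c(x,y)]$ over all $\pi\in\Pi(Q,P)$.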
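Let $\mathcal{X},\mathcal{Y}$ be two Polish metric spaces and $c:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}_+\cup\{+\infty\}$ be a nonnegative continuous cost function. The pair of functions $f:\mathcal{X}\to\mathbb{R}$ and $g:\mathcal{Y}\to\mathbb{R}$ is $c$-conjugate when
$$ \forall x\in\mathcal{X}\quad f(x) = \inf_{y\in\mathcal{Y}} g(y)+c(x,y) \qquad\text{and}\qquad \forall y\in\mathcal{Y}\quad g(y) = \sup_{x\in\mathcal{X}} f(x)-c(x,y). $$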
Theorem 2.10 (Kantorovich duality [55, Theorem 5.10]).
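Let $\mathcal{X}$ and $\mathcal{Y}$ be two Polish metric spaces and $c:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}_+\cup\{+\infty\}$ be a nonnegative continuous cost function. For all $Q\in\mathcal{P}_\mathcal{X}$ and $P\in\mathcal{P}_\mathcal{Y}$, let $\Pi(Q,P)$ be the set of probability distributions defined on $\mathcal{X}\times\mathcal{Y}$ with marginal distributions $Q$ and $P$. Let $\mathcal{Q}_c$ be the set of all pairs $(f_Q,f_P)$ of respectively $Q$ and $P$-integrable functions satisfying the property $\forall x\in\mathcal{X},\ y\in\mathcal{Y}$, $f_Q(x)-f_P(y)\le c(x,y)$. We then have the duality
$$ \min_{\pi\in\Pi(Q,P)} \mathbb{E}_{(x,y)\sim\pi}[c(x,y)] = \sup_{(f_Q,f_P)\in\mathcal{Q}_c} \mathbb{E}_{x\sim Q}[f_Q(x)] - \mathbb{E}_{y\sim P}[f_P(y)]. $$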
Corollary 2.11 ([55, Particular case 5.16]).
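Thanks to Theorem 2.10, we can write the $p$-th power of the $p$-Wasserstein distance in variational form
$$ \forall Q,P\in\mathcal{P}^p_\mathcal{X}\qquad W_p(Q,P)^p = \sup_{(f_Q,f_P)\in\mathcal{Q}_c} \mathbb{E}_Q[f_Q(x)] - \mathbb{E}_P[f_P(x)], $$
where $\mathcal{Q}_c$ is defined as in Theorem 2.10 for the cost $c(x,y)=d(x,y)^p$.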
Let us conclude this presentation of the Wasserstein distance by mentioning that the definition (8) immediately implies several of the distance properties (2): zero when both distributions are equal, strictly positive when they are different, and symmetric. Property 2.4 gives the triangular inequality for the case $p=1$. In the general case, the triangular inequality can also be established using the Minkowski inequality [55, Chapter 6].
2.7 Energy Distance and Maximum Mean Discrepancy
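The Energy Distance (ED) between the probability distributions $Q$ and $P$ defined on the Euclidean space $\mathbb{R}^d$ is the square root (see footnote 5) of
$$ \mathcal{E}(Q,P)^2 \triangleq 2\,\mathbb{E}_{x\sim Q,\,y\sim P}\left[\|x-y\|\right] - \mathbb{E}_{x,x'\sim Q}\left[\|x-x'\|\right] - \mathbb{E}_{y,y'\sim P}\left[\|y-y'\|\right], \tag{15} $$
where, as usual, $\|\cdot\|$ denotes the Euclidean distance.
Footnote 5: We take the square root because this is the quantity that behaves like a distance.
Let $\hat q$ and $\hat p$ represent the characteristic functions of the distributions $Q$ and $P$ respectively. Thanks to a neat Fourier transform argument [53, 52],
$$ \mathcal{E}(Q,P)^2 = \frac{1}{c_d}\int_{\mathbb{R}^d} \frac{|\hat q(t)-\hat p(t)|^2}{\|t\|^{d+1}}\, dt \quad\text{with}\quad c_d = \frac{\pi^{\frac{d+1}{2}}}{\Gamma\!\left(\frac{d+1}{2}\right)}. \tag{16} $$
Since there is a one-to-one mapping between distributions and characteristic functions, this relation establishes an isomorphism between the space of probability distributions equipped with the ED distance and the space of the characteristic functions equipped with the weighted $L_2$ norm given in the right-hand side of (16). As a consequence, $\mathcal{E}(Q,P)$ satisfies the properties (2) of a distance.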
Since the squared ED is expressed with a simple combination of expectations, it is easy to design a stochastic minimization algorithm that relies only on two oracles producing samples from each distribution [11, 7]. This makes the energy distance a computationally attractive criterion for training the implicit models discussed in Section 2.2.
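A plug-in estimate of the squared energy distance needs nothing more than two lists of samples. The sketch below is one simple choice among several: it averages over all pairs, including the pairs formed by a point with itself, so it computes exactly the squared ED between the two empirical distributions (and is therefore a slightly biased estimate of the population quantity). The sample values are hypothetical.

```python
import math
from itertools import product

def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y) with x in xs, y in ys."""
    return sum(math.dist(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))

def energy_distance_squared(xs, ys):
    """Squared energy distance between the empirical distributions of xs and ys:
    2 E||x - y|| - E||x - x'|| - E||y - y'||."""
    return (2.0 * mean_pairwise_distance(xs, ys)
            - mean_pairwise_distance(xs, xs)
            - mean_pairwise_distance(ys, ys))

# Hypothetical one-dimensional samples, stored as 1-tuples so that the same
# code works unchanged for points of any dimension.
xs = [(0.0,), (1.0,)]
ys = [(2.0,), (3.0,)]

ed2_different = energy_distance_squared(xs, ys)
ed2_same = energy_distance_squared(xs, xs)
```

For these samples the cross term averages to 2 and each within-sample term averages to 0.5, so `ed2_different` is 3.0 while `ed2_same` is 0.0.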
It is therefore natural to ask whether we can meaningfully generalize (15) by replacing the Euclidean distance $\|x-y\|$ with a symmetric function $d(x,y)$,
$$ \mathcal{E}_d(Q,P)^2 = 2\,\mathbb{E}_{x\sim Q,\,y\sim P}\left[d(x,y)\right] - \mathbb{E}_{x,x'\sim Q}\left[d(x,x')\right] - \mathbb{E}_{y,y'\sim P}\left[d(y,y')\right]. \tag{17} $$
The right-hand side of this expression is well defined when $Q,P\in\mathcal{P}^1_\mathcal{X}$. It is obviously symmetric and trivially zero when both distributions are equal. The first part of the following theorem gives the necessary and sufficient conditions on $d(x,y)$ to ensure that the right-hand side of (17) is nonnegative and therefore can be the square of $\mathcal{E}_d(Q,P)\in\mathbb{R}_+$. We shall see later that the triangular inequality in (2) comes for free with this condition (Corollary 2.19). The second part of the theorem gives the necessary and sufficient condition for satisfying the separation property in (2).
Theorem 2.12 ().
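The right-hand side of definition (17) is:
(i) nonnegative for all $P,Q$ in $\mathcal{P}^1_\mathcal{X}$ if and only if the symmetric function $d$ is a negative definite kernel, that is,
$$ \forall n\in\mathbb{N}\ \ \forall x_1\dots x_n\in\mathcal{X}\ \ \forall c_1\dots c_n\in\mathbb{R}\qquad \sum_{i=1}^n c_i=0 \ \Longrightarrow\ \sum_{i=1}^n\sum_{j=1}^n d(x_i,x_j)\,c_i c_j \le 0. \tag{18} $$
(ii) strictly positive for all $P\neq Q$ in $\mathcal{P}^1_\mathcal{X}$ if and only if the function $d$ is a strongly negative definite kernel, that is, a negative definite kernel such that, for any probability measure $\mu\in\mathcal{P}^1_\mathcal{X}$ and any $\mu$-integrable real-valued function $h$ such that $\mathbb{E}_\mu[h(x)]=0$,
$$ \mathbb{E}_{x,y\sim\mu}\left[d(x,y)\,h(x)\,h(y)\right] = 0 \ \Longrightarrow\ h(x)=0 \quad \mu\text{-almost everywhere}. $$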
The definition of a strongly negative kernel is best explained by considering how its meaning would change if we were only considering probability measures $\mu$ with finite support $\{x_1\dots x_n\}$. This amounts to requiring that (18) is an equality only if all the $c_i$s are zero. However, this weaker property is not sufficient to ensure that the separation property in (2) holds.
The relation (16) therefore means that the Euclidean distance on $\mathbb{R}^d$ is a strongly negative definite kernel. In fact, it can be shown that $d(x,y)=\|x-y\|^\beta$ is a strongly negative definite kernel for $0<\beta<2$. When $\beta=2$, it is easy to see that $\mathcal{E}_d(Q,P)$ is simply, up to a constant factor, the distance between the distribution means and therefore cannot satisfy the separation property in (2).
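The role of the exponent $\beta$ can be probed by evaluating the double sum of condition (18) on a small configuration. The sketch below uses a hypothetical test case, three equally spaced points on the real line with coefficients $(1,-2,1)$ that sum to zero; a single configuration can refute negative definiteness but of course cannot prove it.

```python
def negative_definiteness_sum(points, coeffs, beta):
    """Double sum of condition (18) for d(x, y) = |x - y| ** beta on the real
    line; the coefficients are required to sum to zero."""
    assert abs(sum(coeffs)) < 1e-12
    return sum(ci * cj * abs(xi - xj) ** beta
               for xi, ci in zip(points, coeffs)
               for xj, cj in zip(points, coeffs))

points = [0.0, 1.0, 2.0]
coeffs = [1.0, -2.0, 1.0]

sums = {beta: negative_definiteness_sum(points, coeffs, beta)
        for beta in (1.0, 2.0, 3.0)}
```

The sum equals $2(2^\beta-4)$ for this configuration: it is negative for $\beta=1$, vanishes for $\beta=2$ even though the coefficients are nonzero, and becomes positive for $\beta=3$, which violates (18).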
Let have respective density functions and with respect to measure . Function then satisfies , and
With , any such that (ie., non--almost-surely-zero) and can be written as a difference of two nonnegative functions such that . Then, and belong to , and
We can then prove the theorem:
From these observations, if for all , then for all and such that , implying (18). Conversely, assume there are such that
. Using the weak law of large numbers (see also Theorem 3.3 later in this document,) we can find finite support distributions such that . Proceeding as in observation (a) then contradicts (18) because has also finite support.
By contraposition, suppose there is and such that , , and . Observation (b) gives such that . Conversely, suppose . Observation (a) gives and such that . Since must be zero, .
Requiring that $d$ be a negative definite kernel is a quite strong assumption. For instance, a classical result by Schoenberg establishes that a squared distance is a negative definite kernel if and only if the whole metric space induced by this distance is isometric to a subset of a Hilbert space and therefore has a Euclidean geometry:
Theorem 2.15 (Schoenberg, ).
The metric space $(\mathcal{X},d)$ is isometric to a subset of a Hilbert space if and only if $d^2$ is a negative definite kernel.
Requiring $d$ to be negative definite (not necessarily a squared distance anymore) has a similar impact on the geometry of the space $\mathcal{P}^1_\mathcal{X}$ equipped with the Energy Distance (Theorem 2.17). Let $x_0$ be an arbitrary origin point and define the symmetric triangular gap kernel $K_d$ as
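The function $d$ is a negative definite kernel if and only if $K_d$ is a positive definite kernel, that is,
$$ \forall n\in\mathbb{N}\ \ \forall x_1\dots x_n\in\mathcal{X}\ \ \forall c_1\dots c_n\in\mathbb{R}\qquad \sum_{i=1}^n\sum_{j=1}^n c_i c_j K_d(x_i,x_j) \ge 0. $$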
Proof The proposition directly results from the identity
where is the chosen origin point and .
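The positive definiteness of the triangular gap kernel can be spot-checked numerically. The sketch below rests on an assumption about normalization: it takes $K_d(x,y)=\frac12\left(d(x,x_0)+d(y,x_0)-d(x,y)\right)$, the usual convention for distance-induced kernels, and the constant factor has no effect on definiteness. With $d$ the Euclidean distance on $\mathbb{R}^2$, which is a negative definite kernel, the quadratic form should never be negative beyond rounding error.

```python
import math
import random

def triangular_gap_kernel(x, y, origin):
    """Assumed form of K_d for the Euclidean distance d:
    0.5 * (d(x, x0) + d(y, x0) - d(x, y))."""
    return 0.5 * (math.dist(x, origin) + math.dist(y, origin) - math.dist(x, y))

def quadratic_form(points, coeffs, origin):
    """sum_i sum_j c_i c_j K_d(x_i, x_j); the coefficients are unconstrained."""
    return sum(ci * cj * triangular_gap_kernel(xi, xj, origin)
               for xi, ci in zip(points, coeffs)
               for xj, cj in zip(points, coeffs))

rng = random.Random(0)
origin = (0.0, 0.0)
values = []
for _ in range(200):
    # Random configurations of six points with random real coefficients.
    points = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(6)]
    coeffs = [rng.uniform(-1.0, 1.0) for _ in range(6)]
    values.append(quadratic_form(points, coeffs, origin))
```

Note that the coefficients here need not sum to zero: anchoring the kernel at the origin point is what turns the constrained condition (18) into an unconstrained one.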
Positive definite kernels in the machine learning literature have been extensively studied in the context of the so-called kernel trick. In particular, it is well known that the theory of the Reproducing Kernel Hilbert Spaces (RKHS) [4, 1] establishes that there is a unique Hilbert space $\mathcal{H}$, called the RKHS, that contains all the functions
$$ \Phi_x : y\in\mathcal{X} \mapsto K_d(x,y) $$
and satisfies the reproducing property
$$ \forall x\in\mathcal{X}\ \ \forall f\in\mathcal{H}\qquad \langle f,\Phi_x\rangle = f(x). $$
We can then relate $\mathcal{E}_d(Q,P)$ to the RKHS norm.
Let $d$ be a negative definite kernel and let $\mathcal{H}$ be the RKHS associated with the corresponding positive definite triangular gap kernel (19). We have then
Proof We can write directly
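where the first equality results from (19) and where the second equality results from the identities $\langle\Phi_x,\Phi_y\rangle = K_d(x,y)$ and $\mathbb{E}_{x,y}[\langle\Phi_x,\Phi_y\rangle] = \langle\mathbb{E}_x[\Phi_x],\mathbb{E}_y[\Phi_y]\rangle$.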
In the context of this theorem, the relation (16) is simply an analytic expression of the RKHS norm associated with the triangular gap kernel of the Euclidean distance.
Maximum Mean Discrepancy
Following the literature on Maximum Mean Discrepancies, we can then write $\mathcal{E}_d$ as an IPM: