Given two probability measures $\mu_1$ and $\mu_2$ on $\mathbb{R}^n$, with finite second moments, consider the set $\mathcal{P}(\mu_1,\mu_2)$ of probability measures on the product sample space $\mathbb{R}^n \times \mathbb{R}^n$, such that the two $n$-dimensional margins have the prescribed distributions $\mu_1$ and $\mu_2$. The index
\[
W(\mu_1,\mu_2)^2 = \inf_{\gamma \in \mathcal{P}(\mu_1,\mu_2)} \int |x - y|^2 \, d\gamma(x,y)
\]
as a measure of dissimilarity between distributions has been considered by many classical authors, e.g., C. Gini, P. Lévy, and M.R. Fréchet. There is a considerable contemporary literature discussing the index $W$, which is usually called the Wasserstein distance; see, e.g., the monograph by C. Villani villani:2008optimal . We also mention Y. Brenier brenier:1991 and R.J. McCann mccann:2001 .
There is an important particular case where the above problem reduces to the Monge transport problem. Borrowing the argument from M. Knott and C.S. Smith knott|smith:1984 , assume $\phi$ is a smooth strictly convex function and $Y = \nabla\phi(X)$. Clearly, the stated condition
turns out to be equivalent to an inequality, and the latter shows that the minimum quadratic distance is attained. In view of the new formulation, let us provide a proof.
If $\phi^*$ denotes the convex conjugate of $\phi$, the Fenchel--Young inequality gives
\[
\phi(x) + \phi^*(y) \ge x \cdot y ,
\]
and the equality case is $y = \nabla\phi(x)$. By assumption $Y = \nabla\phi(X)$, so that the equality case holds almost surely.
This argument, including an existence proof, is in Y. Brenier brenier:1991 . In the present paper we shall study the same problem where all the involved distributions are Gaussian. It would be feasible to reduce the Gaussian case to the general one. However, we resort to methods specially suited for this case.
1.1 The Gaussian case
Given two Gaussian distributions $\mu_i = \mathrm{N}(b_i, \Sigma_i)$, $i = 1, 2$, consider the set of Gaussian distributions on $\mathbb{R}^n \times \mathbb{R}^n$ such that the two $n$-dimensional margins have the prescribed distributions $\mu_1$ and $\mu_2$. The corresponding index is
Observe that if $b_1 = b_2 = 0$ and $T$ is a symmetric positive-definite matrix such that $T\Sigma_1 T = \Sigma_2$, then the previous argument applies by means of the convex function $\phi(x) = \frac12\, x \cdot Tx$.
The value of $W^2$ in Eq. (1) as a function of the mean vectors and the dispersion matrices has been computed by several authors, in particular: I. Olkin and F. Pukelsheim olkin|pukelsheim:1982 , D.C. Dowson and B.V. Landau dowson|landau:1982 , C.R. Givens and R.M. Shortt givens|shortt:1984 , M. Gelbrich gelbrich:1990 . They found the (equivalent) forms
\[
W^2 = |b_1 - b_2|^2 + \operatorname{tr}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\right)^{1/2}\right) = |b_1 - b_2|^2 + \operatorname{tr}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_2\Sigma_1\right)^{1/2}\right).
\]
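As a numerical illustration of the closed forms just mentioned, here is a minimal Python sketch with NumPy and SciPy; the helper name `gaussian_w2` is ours.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(b1, S1, b2, S2):
    """Squared 2-Wasserstein distance between N(b1, S1) and N(b2, S2)."""
    r = np.real(sqrtm(S1))                 # real part guards against round-off
    cross = np.real(sqrtm(r @ S2 @ r))
    return float(np.sum((b1 - b2) ** 2) + np.trace(S1 + S2 - 2.0 * cross))

b1, b2 = np.zeros(2), np.array([3.0, 4.0])
S = np.array([[2.0, 0.5], [0.5, 1.0]])
# With equal dispersion matrices the index reduces to |b1 - b2|^2 = 25.
print(round(gaussian_w2(b1, S, b2, S), 6))
```

The trace term vanishes when the two dispersion matrices coincide, so only the squared distance between the means survives.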
Further interpretations of $W$ are available. R. Bhatia et al. bhatia|jain|lim:2018 showed that $W$ is also the value of constrained minimization problems for the Frobenius matrix norm. In particular,
Notice that a matrix $A$ with $AA^T = \Sigma$ is the generic transformation of the standard Gaussian to the Gaussian with dispersion matrix $\Sigma$.
Because of the exponent 2 in Eq. (1), the distance is more precisely called the $2$-Wasserstein distance. Other exponents or other distances could be used in the definition. The quadratic case is particularly relevant, as $W$ is then a Riemannian distance. More references will be given later.
In an Information Geometry perspective, we can mimic the argument of the seminal paper by Amari amari:1998natural , who derived both the notion of Fisher metric and that of natural gradient from the second-order approximation of the Kullback--Leibler divergence.
It will be shown (see Sec. 2) that the value of Eq. (2) has, for small $t$, the second-order expansion
\[
W^2(\Sigma, \Sigma + tU) = t^2 \operatorname{tr}\left(L_\Sigma[U]\, \Sigma\, L_\Sigma[U]\right) + o(t^2),
\]
where $L_\Sigma[U]$ is the solution $X$ to the Lyapunov equation $\Sigma X + X\Sigma = U$.
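The second-order behavior can be checked numerically. The Python sketch below compares the normalized increment of the closed-form squared distance with the quadratic form $\operatorname{tr}(L_\Sigma[U]\,\Sigma\,L_\Sigma[U])$; the normalization of the quadratic form is our reading of the expansion, and the helper names are ours.

```python
import numpy as np
from scipy.linalg import sqrtm, solve_continuous_lyapunov as lyap

def w2(S1, S2):
    # Squared 2-Wasserstein distance between centered Gaussians.
    r = np.real(sqrtm(S1))
    return float(np.trace(S1 + S2 - 2.0 * np.real(sqrtm(r @ S2 @ r))))

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
S = M @ M.T + 3.0 * np.eye(3)            # base dispersion matrix Sigma
B = rng.standard_normal((3, 3))
U = B + B.T                              # symmetric perturbation

L = lyap(S, U)                           # L_Sigma[U]: solves S X + X S = U
quad = float(np.trace(L @ S @ L))        # candidate quadratic form
t = 1e-3
print(w2(S, S + t * U) / t**2, quad)     # the two values nearly coincide
```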
The quadratic form in the RHS of Eq. (3) provides a candidate for the Riemannian inner product associated with the distance $W$. In addition, if $f$ is a smooth real function defined on a small sphere around $\Sigma$, then its increment is maximized along the direction
where $\nabla$ denotes the Euclidean gradient. The resulting operator is Amari's natural gradient, i.e., the Riemannian gradient.
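As a sanity check of the natural-gradient construction, one can verify the defining property of a Riemannian gradient, $\langle \operatorname{grad} f, V\rangle_\Sigma = \langle \nabla f, V\rangle$ for all symmetric $V$. In the sketch below we take the inner product to be $\operatorname{tr}(L_\Sigma[U]\,\Sigma\,L_\Sigma[V])$; under this normalization the candidate gradient is $2(\Sigma\nabla f + \nabla f\,\Sigma)$, a convention-dependent assumption of ours.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov as lyap

def sym(M):
    return (M + M.T) / 2.0

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
S = A @ A.T + 4.0 * np.eye(4)            # base point Sigma
D = sym(rng.standard_normal((4, 4)))     # Euclidean gradient of f at Sigma
G = 2.0 * (S @ D + D @ S)                # candidate natural (Riemannian) gradient

# Defining property: g_Sigma(G, V) = <D, V> for every symmetric direction V.
V = sym(rng.standard_normal((4, 4)))
lhs = float(np.trace(lyap(S, G) @ S @ lyap(S, V)))
rhs = float(np.trace(D @ V))
print(np.isclose(lhs, rhs))              # True
```

Note that $L_\Sigma[G] = 2\nabla f$ by construction, so the check reduces to a trace identity for symmetric matrices.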
It is remarkable that all the geometric objects in the previous equations may be expressed through matrix operations. In this paper, we proceed to develop systematically the Wasserstein geometry of Gaussian models according to this formalism.
1.2 Relations with the literature on the general transport theory
The Wasserstein distance and the relevant geometry can also be studied non-parametrically, for general distributions. We do not pursue this direction and refer to the monograph by C. Villani villani:2008optimal . The $2$-Wasserstein metric geometry has been shown to be Riemannian by F. Otto (otto:2001, §4) and J. Lott lott:2008calculations . Cf. the earlier account by J.D. Lafferty lafferty:1988 .
Let us briefly discuss Otto's approach in the language of Information Geometry, i.e., with reference to S. Amari and H. Nagaoka amari|nagaoka:2000 . In view of the non-parametric approach first introduced in pistone|sempi:95 , denote by $\mathcal{G}$ the set of $n$-dimensional Gaussian densities with zero mean; the vector bundle whose fiber at each density $p$ is a space of square-integrable random variables centered at $p$ is the Amari Hilbert bundle on $\mathcal{G}$. The Hilbert bundle contains the statistical bundle, whose fibers consist of the scores $\left.\frac{d}{dt}\log p(t)\right|_{t=0}$ for all smooth curves $t \mapsto p(t)$ with $p(0) = p$. In turn, the statistical bundle is the tangent space of $\mathcal{G}$ considered as an exponential manifold, see pistone|sempi:95 ; pistone:2013GSI .
In our present case, since the model is an exponential family, the natural parameter is the concentration matrix $C = \Sigma^{-1}$. The log-likelihood is
\[
\log p(x; C) = \frac12 \log\det C - \frac12\, x \cdot Cx - \frac{n}{2}\log(2\pi).
\]
If $V$ is a symmetric matrix, the derivative of the log-likelihood in the direction $V$ is
\[
\left.\frac{d}{dt}\right|_{t=0} \log p(x; C + tV) = \frac12 \operatorname{tr}\left(C^{-1}V\right) - \frac12\, x \cdot Vx ,
\]
where each symmetric matrix is identified with a linear operator on the space of symmetric matrices equipped with the Frobenius inner product. The fiber at $C$ consists of the vector space of such functions of $x$, one for each direction $V$. The inner product in the Hilbert bundle, restricted to the parameterized statistical bundle, is the Fisher metric
The study of the Fisher metric in the Gaussian case was first carried out by L.T. Skovgaard skovgaard:1984 .
F. Otto (otto:2001, §1.3), who was motivated by the study of a class of partial differential equations, considered an inner product defined on smooth functions of the fiber of the Hilbert bundle, as
If we apply this definition to our score and parameterization, the gradient takes an explicit form and the metric becomes
The equivalence between the metric in Eq. (6) and the one in Eq. (4) can be seen by a change of parameterization, both in the base and in each fiber. First, one defines the inner product at $\Sigma$ to be the inner product computed in the bijection $\Sigma \leftrightarrow C = \Sigma^{-1}$, which is the form of the metric provided by A. Takatsu (takatsu:2011osaka, Prop. A). Second, one changes the parameterization on each fiber of the statistical bundle. The resulting change of parameterization in the statistical bundle produces the desired inner product.
We mention also that the Machine Learning literature discusses a divergence introduced by A. Hyvärinen MR2249836 , which is related to Otto's metric. Precisely, in the concentration parameterization the Hyvärinen divergence is
and its second derivative at the diagonal is
In Statistics, Hyvärinen divergence is related to local proper scoring rules, see M. Parry et al. parry|dawid|lauritzen:2012 .
The first two sections of the paper are mostly a review of known material. In Sec. 2 we recall some properties of the space of symmetric matrices, in particular the Riccati equation, the Lyapunov equation, and the calculus of the square and square-root mappings. The mapping $A \mapsto AA^T$, where $A$ is a non-singular square matrix, is shown to be a submersion, and the horizontal vectors at each point are computed. Although our manifold is finite-dimensional, there is no need to choose a basis, as all operations of interest are matrix operations. For that reason, we rely on the language of non-parametric differential geometry of W. Klingenberg klingenberg:1995 and S. Lang lang:1995 .
In Sec. 3 we discuss known results about the metric geometry induced by the Wasserstein distance. These results are re-stated in Prop. 3 and, for the sake of completeness, we provide a further proof inspired by dowson|landau:1982 . It is possible to write down an explicit metric geodesic, as done by R.J. McCann (mccann:2001, Example 1.7), see Prop. 4. The space of non-degenerate Gaussian measures (or, equivalently, the space of positive-definite matrices) can be endowed with a Riemannian structure that induces the Wasserstein distance. This is elaborated in Sec. 4, where we use the presentation given by takatsu:2011osaka , cf. also bhatia|jain|lim:2018 , which in turn adapts the original work (otto:2001, §4) to the Gaussian case.
The remaining part of the paper is offered as a new contribution to this topic. The Wasserstein Riemannian metric turns out to be
\[
W_\Sigma(U, V) = \operatorname{tr}\left(L_\Sigma[U]\, \Sigma\, L_\Sigma[V]\right)
\]
at each matrix $\Sigma$, where $U$ and $V$ are symmetric matrices. By submersion methods we study the more general problem of horizontal surfaces, characterized in Prop. 8. As a special case we obtain the Riemannian geodesics, which agree with the metric geodesics of Section 3.
The explicit form of Riemannian exponential is obtained in Sec. 5. The natural (Riemannian) gradient is discussed in Sec. 6 and some applications to optimization are provided in Sec. 6.1. The analysis of the second-order geometry is treated in Sec. 7, where we compute the Levi-Civita covariant derivative, the Riemannian Hessian, and discuss other related topics. However, the curvature tensor will not be taken into consideration in the present paper.
In the final Sec. 8, we discuss the results in view of applications and in Information Geometry of statistical sub-models of the Gaussian manifold.
2 Symmetric matrices
The set of Gaussian distributions on $\mathbb{R}^n$ is in one-to-one correspondence with the space of its parameters, the mean vector and the dispersion matrix. Moreover, this set is closed under weak convergence, and the identification is continuous in both directions. A reference for Gaussian distributions is the monograph by T.W. Anderson anderson:2003 .
For ease of later reference, we recall a few results on spaces of matrices. General references are the monographs by P. R. Halmos halmos:1958 , J. R. Magnus and H. Neudecker magnus|neudecker:1999 , and R. Bhatia bhatia:2007PDM .
The vector space of real matrices with $m$ rows and $n$ columns is denoted by $M(m \times n)$, while square matrices are denoted by $M(n)$. It is a Euclidean space of dimension $mn$, and the vectorization mapping $A \mapsto \operatorname{vec}(A)$ is an isometry for the Frobenius inner product $\langle A, B \rangle = \operatorname{tr}\left(A^T B\right)$.
Symmetric matrices form a vector subspace $\operatorname{Sym}(n)$ of $M(n)$, whose orthogonal complement is the space of anti-symmetric matrices $\operatorname{Asym}(n)$. With regard to symmetric matrices, we will find it convenient to use an equivalent rescaled inner product, see e.g. Eq. (18) below. The closed pointed cone of non-negative-definite symmetric matrices is denoted by $\operatorname{Sym}^+(n)$, and its interior, the open cone of positive-definite symmetric matrices, by $\operatorname{Sym}^{++}(n)$.
Given $A, B \in \operatorname{Sym}^{++}(n)$, the equation $TAT = B$ is called a Riccati equation; it has a unique solution $T \in \operatorname{Sym}^{++}(n)$. In fact, from $TAT = B$ it follows $\left(A^{1/2}TA^{1/2}\right)^2 = A^{1/2}BA^{1/2}$ and, in turn, $A^{1/2}TA^{1/2} = \left(A^{1/2}BA^{1/2}\right)^{1/2}$, because both sides are positive definite. Hence, the solution to the Riccati equation is
\[
T = A^{-1/2}\left(A^{1/2}BA^{1/2}\right)^{1/2}A^{-1/2}.
\]
In terms of random variables, if $X \sim \mathrm{N}(0, A)$ and $Y \sim \mathrm{N}(0, B)$, then $T$ is the unique matrix of $\operatorname{Sym}^{++}(n)$ such that $TX \sim Y$.
A more compact closed-form solution of the Riccati equation is available. Given $A, B \in \operatorname{Sym}^{++}(n)$, observe that $AB = A^{1/2}\left(A^{1/2}BA^{1/2}\right)A^{-1/2}$. By similarity, the eigenvalues of $AB$ are non-negative, hence the square root $(AB)^{1/2} = A^{1/2}\left(A^{1/2}BA^{1/2}\right)^{1/2}A^{-1/2}$ is well defined, and the solution can be written $T = A^{-1}(AB)^{1/2}$. Since the eigenvalues of $AB$ and $BA$ are identical, the same argument used before yields also $T = (BA)^{1/2}A^{-1}$.
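The closed forms can be verified numerically; a minimal Python sketch, with randomly generated positive-definite $A$ and $B$:

```python
import numpy as np
from scipy.linalg import sqrtm, inv

rng = np.random.default_rng(2)

def spd(n):
    # Random symmetric positive-definite matrix.
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

A, B = spd(3), spd(3)
rA = np.real(sqrtm(A))
T = inv(rA) @ np.real(sqrtm(rA @ B @ rA)) @ inv(rA)    # Riccati solution

print(np.allclose(T @ A @ T, B))                       # T solves T A T = B
print(np.allclose(T, inv(A) @ np.real(sqrtm(A @ B))))  # compact form A^{-1}(AB)^{1/2}
print(np.allclose(T, np.real(sqrtm(B @ A)) @ inv(A)))  # compact form (BA)^{1/2}A^{-1}
```

Note that `sqrtm` is applied here to the non-symmetric products $AB$ and $BA$; this is legitimate because their eigenvalues are positive, so the principal square root is well defined.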
The square mapping $\sigma \mapsto \sigma^2$ is an injection of $\operatorname{Sym}^{++}(n)$ onto itself, with derivative $d\sigma^2[U] = \sigma U + U\sigma$ at $\sigma$. Hence, the derivative operator is invertible. We shall occasionally use an alternative notation for the derivative.
For each assigned matrix $V \in \operatorname{Sym}(n)$ and each $\Sigma \in \operatorname{Sym}^{++}(n)$, there is a unique solution $X \in \operatorname{Sym}(n)$ to the Lyapunov equation
\[
\Sigma X + X\Sigma = V.
\]
Its solution will be written $X = L_\Sigma[V]$. Clearly, we have also
The Lyapunov operator itself can be seen as a derivative. In fact, the inverse of the square mapping is the square root mapping $\Sigma \mapsto \Sigma^{1/2}$. By the derivative-of-the-inverse rule,
\[
d\,\Sigma^{1/2}[V] = L_{\Sigma^{1/2}}[V].
\]
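The derivative-of-the-inverse rule can be checked against a finite difference; a short sketch under our reading that the derivative of the square root at $\Sigma$ in the direction $V$ is $L_{\Sigma^{1/2}}[V]$:

```python
import numpy as np
from scipy.linalg import sqrtm, solve_continuous_lyapunov as lyap

rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))
S = M @ M.T + 3.0 * np.eye(3)            # a positive-definite Sigma
B = rng.standard_normal((3, 3))
V = B + B.T                              # symmetric direction

rS = np.real(sqrtm(S))
deriv = lyap(rS, V)                      # claimed derivative of the square root
eps = 1e-6
fd = (np.real(sqrtm(S + eps * V)) - rS) / eps
print(np.max(np.abs(fd - deriv)))        # small (finite-difference error)
```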
If $\Sigma$ is the dispersion matrix of a non-singular Gaussian distribution, then $C = \Sigma^{-1}$ is the concentration matrix and represents an alternative and useful parameterization. From the Lyapunov equation we obtain $\Sigma^{-1}X + X\Sigma^{-1} = \Sigma^{-1}V\Sigma^{-1}$, hence
\[
L_\Sigma[V] = L_{\Sigma^{-1}}\left[\Sigma^{-1}V\Sigma^{-1}\right].
\]
Likewise, another useful formula is
There is also a relation between the Lyapunov equation and the trace. From $\Sigma X + X\Sigma = V$, it follows $\operatorname{tr}(V) = 2\operatorname{tr}(\Sigma X)$. Then
\[
\operatorname{tr}\left(\Sigma\, L_\Sigma[V]\right) = \frac12 \operatorname{tr}(V).
\]
We will later need the derivative of the mapping $\Sigma \mapsto L_\Sigma[V]$, for a fixed $V$. Differentiating the first identity in Eq. (13) in the direction $U$, we have
\[
U L_\Sigma[V] + \Sigma\, dL_\Sigma[V][U] + dL_\Sigma[V][U]\,\Sigma + L_\Sigma[V]\, U = 0.
\]
Hence $dL_\Sigma[V][U]$ is the solution to the Lyapunov equation
\[
\Sigma X + X\Sigma = -\left(U L_\Sigma[V] + L_\Sigma[V]\, U\right),
\]
so that we get
\[
dL_\Sigma[V][U] = -L_\Sigma\left[U L_\Sigma[V] + L_\Sigma[V]\, U\right].
\]
The Lyapunov equation plays a crucial role, as the linear operator $L_\Sigma$ enters the expression of the Riemannian metric with respect to the standard inner product, see Eq. (7). As a consequence, the numerical implementation of the inner product requires the computation of the matrix $L_\Sigma[V]$. There are many ways to write down the closed-form solution to Eq. (12); they are discussed in bhatia:2007PDM . However, efficient numerical solutions are not based on the closed forms, but rely on specialized numerical algorithms, as discussed by E.L. Wachspress wachspress:2008 and by V. Simoncini simoncini:2016 .
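In practice, SciPy exposes a Bartels--Stewart-type solver for this equation; a minimal sketch, which also checks the relation between the Lyapunov equation and the trace mentioned earlier:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5.0 * np.eye(5)        # positive-definite Sigma
B = rng.standard_normal((5, 5))
V = B + B.T                              # symmetric right-hand side

X = solve_continuous_lyapunov(Sigma, V)  # solves Sigma X + X Sigma = V
print(np.allclose(Sigma @ X + X @ Sigma, V))                # residual check
print(np.allclose(X, X.T))                                  # solution is symmetric
print(np.isclose(np.trace(Sigma @ X), np.trace(V) / 2.0))   # trace relation
```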
We now turn to the computation of the second-order approximation of in Eq. (2).
Fix and let so that . Hence, for all . Consider the expression of with , , , namely
Observe that .
The second derivative is
where Eq. (15) has been used. Finally, observe that
We can conclude that
Therefore, the bi-linear form in the RHS suggests the form of the Riemannian metric to be derived.
2.1 The mapping $A \mapsto AA^T$
We now study the extension of the square operation to general invertible matrices, namely the mapping $\phi$ defined by $\phi(A) = AA^T$. The next proposition shows that this operation is a submersion. We first recall the definition, see (docarmo:1992, Ch. 8, Ex. 8–10) or (lang:1995, §II.2).
Let $O$ be an open set of a Hilbert space $H$, and let $\phi$ be a smooth surjection from $O$ onto a manifold $M$, i.e., assume that for each $x \in O$ the derivative $d\phi_x$ at $x$ is surjective. In such a case, for each $p \in M$, the fiber $\phi^{-1}(p)$ is a sub-manifold. Given a point $x$ in the fiber, a vector is called vertical if it is tangent to the manifold $\phi^{-1}(p)$. Each such tangent vector is the velocity at $x$ of some smooth curve $t \mapsto x(t)$ in the fiber with $x(0) = x$. Precisely, from $\phi(x(t)) = p$ for all $t$ we derive the characterization of vertical vectors: the tangent space at $x$ is $\ker d\phi_x$. The orthogonal complement of the tangent space is called the space of horizontal vectors at $x$.
Let us apply this argument to our specific case. Let $\operatorname{GL}(n)$ be the open set of invertible matrices, $\operatorname{O}(n)$ the subgroup of $\operatorname{GL}(n)$ of orthogonal matrices, and $\operatorname{Asym}(n)$ the subspace of $M(n)$ of anti-symmetric matrices.
For each given $A \in \operatorname{GL}(n)$ we have the orthogonal splitting
\[
M(n) = \operatorname{Sym}(n)A \oplus \operatorname{Asym}(n)A^{-T}.
\]
The mapping $\phi \colon A \mapsto AA^T$ has derivative at $A$ given by $d\phi_A[X] = XA^T + AX^T$. It is a submersion with fibers $\phi^{-1}(\Sigma) = \left\{\Sigma^{1/2}R : R \in \operatorname{O}(n)\right\}$, $\Sigma \in \operatorname{Sym}^{++}(n)$.
The kernel of the differential is
\[
\ker d\phi_A = \operatorname{Asym}(n)A^{-T},
\]
and its orthogonal complement is
\[
\left(\ker d\phi_A\right)^\perp = \operatorname{Sym}(n)A.
\]
The orthogonal projection of $X \in M(n)$ onto $\operatorname{Sym}(n)A$ is $L_{AA^T}\left[XA^T + AX^T\right]A$.
If $d\phi_A[X] = 0$, i.e., $XA^T + AX^T = 0$, then $XA^T = -\left(XA^T\right)^T$, so that $XA^T = W \in \operatorname{Asym}(n)$, that is, $X = WA^{-T}$.
Let the matrix $A$ be an element in the fiber manifold $\phi^{-1}(\Sigma)$. The derivative of $\phi$ at $A$, namely $d\phi_A[X] = XA^T + AX^T$, is surjective, because for each $V \in \operatorname{Sym}(n)$ we have $d\phi_A\left[\frac12 VA^{-T}\right] = V$. Hence $\phi$ is a submersion, and the fiber $\phi^{-1}(\Sigma)$ is a sub-manifold of $\operatorname{GL}(n)$.
Let us compute the splitting of $M(n)$ into the kernel of $d\phi_A$ and its orthogonal complement. The vector space tangent to the fiber $\phi^{-1}(\Sigma)$ at $A$ is the kernel of the derivative at $A$:
\[
\ker d\phi_A = \left\{X \in M(n) : XA^T + AX^T = 0\right\}.
\]
Therefore, $X \in \ker d\phi_A$ if, and only if, $XA^T$ is anti-symmetric, i.e., $X = WA^{-T}$ with $W \in \operatorname{Asym}(n)$. We have just proved that the kernel is $\operatorname{Asym}(n)A^{-T}$.
Consider the decomposition of $X \in M(n)$ into the horizontal and the vertical part: $X = SA + WA^{-T}$, with $S \in \operatorname{Sym}(n)$ and $W \in \operatorname{Asym}(n)$. Multiplying on the right by $A^T$, we get $XA^T = S\left(AA^T\right) + W$; by transposition, we get $AX^T = \left(AA^T\right)S - W$. The sum of the two previous equations is $XA^T + AX^T = S\left(AA^T\right) + \left(AA^T\right)S$, which is a Lyapunov equation having solution $S = L_{AA^T}\left[XA^T + AX^T\right]$. It follows that the projection onto the horizontal space is
\[
X \mapsto L_{AA^T}\left[XA^T + AX^T\right]A.
\]
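The projection just derived is straightforward to implement; a numerical sketch checking that the residual is vertical and orthogonal to the horizontal part:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov as lyap

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n))          # generic, hence invertible
X = rng.standard_normal((n, n))          # arbitrary tangent vector at A

Sigma = A @ A.T
S = lyap(Sigma, X @ A.T + A @ X.T)       # Sigma S + S Sigma = X A^T + A X^T
H = S @ A                                # horizontal part
W = X - H                                # vertical part

print(np.allclose(W @ A.T + A @ W.T, 0))  # vertical: in the kernel of d(phi)_A
print(abs(np.trace(H.T @ W)) < 1e-8)      # the two parts are orthogonal
```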
3 Wasserstein distance
The aim of this section is to discuss the Wasserstein distance for the Gaussian case as well as the equation for the associated metric geodesic. Most of its content is an exposition of known results.
Let us suppose that the dispersion matrix $\Sigma$ is partitioned into blocks,
\[
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},
\]
and consider random variables $X$ and $Y$ such that $(X, Y) \sim \mathrm{N}(0, \Sigma)$, so that $X \sim \mathrm{N}(0, \Sigma_{11})$ and $Y \sim \mathrm{N}(0, \Sigma_{22})$. It follows that $\operatorname{E}\left(|X - Y|^2\right) = \operatorname{tr}\left(\Sigma_{11} + \Sigma_{22} - 2\Sigma_{12}\right)$, which in turn implies the bounds
For mean vectors $b_1, b_2$ and dispersion matrices $\Sigma_1, \Sigma_2$, define the set of jointly Gaussian distributions with given marginals to be
and the Gini dissimilarity index
Actually, in view of either of the bounds in Eq. (19), the set is compact and the infimum is attained.
It is easy to verify that
defines a distance on the space of Gaussian distributions. The symmetry of $W$ is clear, as is the triangle inequality, which follows by considering Gaussian distributions on $\mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n$ with given marginals. To conclude, assume that the infimum is reached at some distribution. Then
A further observation is that the distance is homogeneous, i.e.,
3.2 Computing the quadratic dissimilarity index
We will present a proof as given by Dowson and Landau dowson|landau:1982 , but with some corrections.
The value of the similar problem with replaced by will be denoted by
Let . Then
If moreover , then
Proof (point )
A symmetric matrix is non-negative definite if, and only if, it is of the form $TT^T$. Given the block structure of $\Sigma$ in (21), we can write
where and are two matrices in
Therefore, problem (21) becomes
We have already observed that the optimum exists, so the necessary conditions of the Lagrange theorem allow us to characterize it. However, the two constraints are not necessarily regular at every point (i.e., the Jacobian of the transformation may fail to be of full rank at some point), so we must take into account that the optimum could be an irregular point. To this purpose, as is customary, we shall adopt the Fritz John first-order formulation for the Lagrangian (see mangasarian|fromovitz:1967 ).
We shall initially assume that both and are non-singular.
Let then , , where the symmetric matrices and are the Lagrange multipliers. The Lagrangian function will be
The first-order conditions of the Lagrangian lead to
In the case of stationary regular points, Eq. (22) becomes
which in turn implies
Of course, Eqs. (24) could be more general than Eqs. (23) and thus possibly contain undesirable solutions. In this light, we establish the following facts, in which both matrices and must be nonsingular. Notice that in this case Eqs. (24) imply that both and are nonsingular as well.
Actually, let , be any representation of the matrix . Define so that . Moreover
and so are multipliers associated with the feasible point .
Claim 2: The set of solutions to (24), such that , is not empty. In particular, there is a unique pair where both and are positive definite.
We have already observed that Eqs. (24) imply that both multipliers are nonsingular. Recalling that the Riccati equation has one and only one solution in the class of positive-definite matrices, the claim follows.