1 Introduction
Given two probability measures \(\mu_1\) and \(\mu_2\) on \(\mathbb{R}^n\), with finite second moments, consider the set \(\Pi(\mu_1,\mu_2)\) of probability measures on the product sample space \(\mathbb{R}^n \times \mathbb{R}^n\) whose two \(n\)-dimensional margins are the prescribed distributions \(\mu_1\) and \(\mu_2\), and the index
\[ W(\mu_1,\mu_2)^2 = \inf_{\gamma \in \Pi(\mu_1,\mu_2)} \int |x-y|^2 \,\gamma(dx,dy). \]
The index, as a measure of dissimilarity between distributions, has been considered by many classical authors, e.g., C. Gini, P. Lévy, and M.R. Fréchet. There is a considerable contemporary literature discussing the index \(W\), which is usually called the Wasserstein distance; see, e.g., the monograph by C. Villani villani:2008optimal . We also mention Y. Brenier brenier:1991 and R.J. McCann mccann:2001 .
There is an important particular case where the above problem reduces to the Monge transport problem. Borrowing the argument from M. Knott and C.S. Smith knottsmith:1984 , assume \(\phi\) is a smooth strictly convex function and . Clearly, the condition
turns out to be equivalent to . The latter inequality shows that the minimum quadratic distance is attained. In view of this reformulation, let us provide a proof.
If \(\phi^*\) denotes the convex conjugate of \(\phi\), we have
and the equality case is
By assumption so that
This argument, including an existence proof, is in Y. Brenier brenier:1991 . In the present paper we shall study the same problem where all the involved distributions are Gaussian. It would be feasible to reduce the Gaussian case to the general one. However, we resort to methods specially suited for this case.
1.1 The Gaussian case
Given two Gaussian distributions , , consider the set of Gaussian distributions on such that the two \(n\)-dimensional margins have the prescribed distributions, . The corresponding index is
(1) 
Observe that if and is a symmetric matrix such that , then the previous argument applies by means of the convex function .
The value of in Eq. (1) as a function of the mean and the dispersion matrix has been computed by some authors, in particular: I. Olkin and F. Pukelsheim olkinpukelsheim:1982 , D. C. Dowson and B. V. Landau dowsonlandau:1982 , C. R. Givens and R. M. Shortt givensshortt:1984 , M. Gelbrich gelbrich:1990 . They found the (equivalent) forms
(2) \( W^2\!\left((m_1,\Sigma_1),(m_2,\Sigma_2)\right) = |m_1-m_2|^2 + \operatorname{tr}\!\left(\Sigma_1+\Sigma_2-2\left(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\right)^{1/2}\right) = |m_1-m_2|^2 + \operatorname{tr}\!\left(\Sigma_1+\Sigma_2-2\left(\Sigma_2\Sigma_1\right)^{1/2}\right) \)
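As a numerical illustration, the closed form of Eq. (2) is easy to evaluate with standard linear-algebra routines. The following Python sketch is ours, not part of the original development; the function name `gaussian_w2` and the argument names for the means and dispersion matrices are our assumptions.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(m1, S1, m2, S2):
    """Squared Wasserstein distance between N(m1, S1) and N(m2, S2),
    following the closed form of Eq. (2)."""
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    r1 = np.real(sqrtm(S1))                    # S1^{1/2}
    cross = np.real(sqrtm(r1 @ S2 @ r1))       # (S1^{1/2} S2 S1^{1/2})^{1/2}
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross))
```

Here `np.real` strips the negligible imaginary residue that `scipy.linalg.sqrtm` may return; by the second form in Eq. (2), `sqrtm(S2 @ S1)` could equivalently be used inside the trace.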
Further interpretations of are available. R. Bhatia et al. bhatiajainlim:2018 showed that is also the solution of constrained minimization problems for the Frobenius matrix norm , when . Especially,
Notice that is the generic transformation of the standard Gaussian to the Gaussian with dispersion matrix .
Because of the exponent 2 in Eq. (1), the distance is more precisely called the quadratic Wasserstein distance. Other exponents or other distances could be used in the definition. The quadratic case is particularly relevant, as is a Riemannian distance. More references will be given later.
In an Information Geometry perspective, we can mimic the argument of the seminal paper by Amari amari:1998natural , who derived the notions of both the Fisher metric and the natural gradient from the second-order approximation of the Kullback–Leibler divergence.
It will be shown (see Sec. 2) that the value of Eq. (2) has the differential second-order expansion, for small \(X\):
(3) \( W^2(\Sigma, \Sigma + X) = \operatorname{tr}\!\left( L_\Sigma[X]\,\Sigma\,L_\Sigma[X] \right) + o(\|X\|^2), \)
where \(L_\Sigma[X]\) is the solution to the Lyapunov equation \(\Sigma L + L\Sigma = X\).
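The expansion can be probed numerically. The sketch below is ours and assumes the second-order term reads \(\operatorname{tr}(L_\Sigma[X]\,\Sigma\,L_\Sigma[X])\), with \(L_\Sigma[X]\) the Lyapunov solution; all variable names are our assumptions.

```python
import numpy as np
from scipy.linalg import sqrtm, solve_continuous_lyapunov

def w2_sq(S1, S2):
    # Squared Wasserstein distance between centered Gaussians, Eq. (2).
    r = np.real(sqrtm(S1))
    return float(np.trace(S1 + S2 - 2.0 * np.real(sqrtm(r @ S2 @ r))))

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
S = A @ A.T + n * np.eye(n)                    # base dispersion matrix (positive definite)
V = rng.standard_normal((n, n)); V = V + V.T   # symmetric perturbation

L = solve_continuous_lyapunov(S, V)            # solves S L + L S = V
quad = float(np.trace(L @ S @ L))              # candidate quadratic form in Eq. (3)

t = 1e-3
approx = w2_sq(S, S + t * V) / t**2            # rescaled squared distance, should approach quad
```

For small `t` the rescaled squared distance `approx` agrees with the quadratic form `quad` up to terms of order `t`.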
The quadratic form in the RHS of Eq. (3) provides a candidate for the Riemannian inner product associated with the distance . In addition, if is a smooth real function defined on a small sphere, i.e., for small , then the increment is maximized along the direction
where denotes the Euclidean gradient. The operator is Amari’s natural gradient, i.e., the Riemannian gradient.
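Under the metric convention \(W_\Sigma(U,V)=\operatorname{tr}(L_\Sigma[U]\,\Sigma\,L_\Sigma[V])\), a direct computation shows that the natural gradient is proportional to \(\Sigma\nabla f + \nabla f\,\Sigma\) (the factor 2 below depends on the normalization of the metric, which is our assumption). The following sketch, with names of our choosing, checks the defining property of the Riemannian gradient numerically.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
S = A @ A.T + n * np.eye(n)                     # positive-definite dispersion matrix
G = rng.standard_normal((n, n)); G = G + G.T    # Euclidean gradient of f at S (symmetric)

# Candidate natural gradient under the convention W_S(U, V) = tr(L_S[U] S L_S[V]):
nat_grad = 2.0 * (S @ G + G @ S)

def metric(S, U, V):
    LU = solve_continuous_lyapunov(S, U)        # solves S X + X S = U
    LV = solve_continuous_lyapunov(S, V)
    return float(np.trace(LU @ S @ LV))

# Defining property of the Riemannian gradient:
#   W_S(nat_grad, V) = <G, V>_Frobenius for every symmetric direction V.
V = rng.standard_normal((n, n)); V = V + V.T
lhs = metric(S, nat_grad, V)
rhs = float(np.trace(G @ V))
```

The key point, independent of the normalization, is that the natural gradient is obtained from the Euclidean one by matrix operations only, with no basis or coordinate choice.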
It is remarkable that all the geometric objects in the equations above may be expressed as matrix operations. In this paper, we proceed to develop systematically the Wasserstein geometry of Gaussian models according to this formalism.
1.2 Relations with the literature on the general transport theory
The Wasserstein distance and its relevant geometry can also be studied nonparametrically, for general distributions. We do not pursue this direction and refer to the monograph by C. Villani villani:2008optimal . The Wasserstein metric geometry has been shown to be Riemannian by F. Otto (otto:2001, , §4) and J. Lott lott:2008calculations . Cf. the earlier account by J.D. Lafferty lafferty:1988 .
Let us briefly discuss Otto’s approach in the language of Information Geometry, i.e., with reference to S. Amari and H. Nagaoka amarinagaoka:2000 . In view of the nonparametric approach first introduced in pistonesempi:95 , and denoting by the set of -dimensional Gaussian densities with zero mean, the vector bundle is the Amari Hilbert bundle on . The Hilbert bundle contains the statistical bundle, whose fibers consist of the scores for all smooth curves with . In turn, the statistical bundle is the tangent space of considered as an exponential manifold, see pistonesempi:95 ; pistone:2013GSI .
In our present case, since the model is an exponential family, the natural parameter is the concentration matrix . The log-likelihood is
If is a symmetric matrix, the derivative of in the direction is
where is a symmetric matrix identified with a linear operator on symmetric matrices , equipped with the Frobenius inner product. The fiber at consists of the vector space of functions , . The inner product in the Hilbert bundle, restricted to the parameterized statistical bundle, is the Fisher metric
(4) 
The Fisher metric in the Gaussian case was first studied by L.T. Skovgaard skovgaard:1984 .
F. Otto (otto:2001, , §1.3), who was motivated by the study of a class of partial differential equations, considered an inner product defined on smooth functions of the fiber of the Hilbert bundle, as
(5)
In the nonparametric case, Otto’s metric of Eq. (5) is related to the Wasserstein distance; for a detailed study of this metric see J. Lott lott:2008calculations .
If we apply this definition to our score and , the gradient is and the metric becomes
(6) 
The equivalence between the metric in Eq. (6) and the one in Eq. (4) can be seen by a change of parameterization both in and in each fiber. First, one defines the inner product at to be the inner product computed in the bijection , to get , which is the form of the metric provided by A. Takatsu (takatsu:2011osaka, , Prop. A). Second, one changes the parameterization on each fiber of the statistical bundle by . The resulting change of parameterization in the statistical bundle, whose inverse is , produces the desired inner product.
We also mention that the Machine Learning literature discusses a divergence introduced by A. Hyvärinen MR2249836 , which is related to Otto’s metric. Precisely, in the concentration parameterization the Hyvärinen divergence is
and the second derivative of at is
In Statistics, Hyvärinen divergence is related to local proper scoring rules, see M. Parry et al. parrydawidlauritzen:2012 .
1.3 Overview
The first two sections of the paper are mostly a review of known material. In Sec. 2 we recall some properties of the space of symmetric matrices, in particular the Riccati equation, the Lyapunov equation, and the calculus regarding the two mappings and . The mapping , where is a nonsingular square matrix, is shown to be a submersion, and the space of horizontal vectors at each point is computed. Although our manifold is finite-dimensional, there is no need to choose a basis, as all operations of interest are matrix operations. For that reason, we rely on the language of nonparametric differential geometry of W. Klingenberg klingenberg:1995 and S. Lang lang:1995 .
In Sec. 3 we discuss known results about the metric geometry induced by the Wasserstein distance. These results are restated in Prop. 3 and, for the sake of completeness, we provide a further proof inspired by dowsonlandau:1982 . It is possible to write down an explicit metric geodesic as done by R.J. McCann (mccann:2001, , Example 1.7), see Prop. 4. The space of nondegenerate Gaussian measures (or, equivalently, the space of positive definite matrices) can be endowed with a Riemannian structure that induces the Wasserstein distance. This is elaborated in Sec. 4, where we use the presentation given by takatsu:2011osaka , cf. also bhatiajainlim:2018 , which in turn adapts the original work (otto:2001, , §4) to the Gaussian case.
The remaining part of the paper is offered as a new contribution to this topic. The Wasserstein Riemannian metric turns out to be
(7) \( W_\Sigma(U,V) = \operatorname{tr}\!\left( L_\Sigma[U]\,\Sigma\,L_\Sigma[V] \right) \)
at each matrix \(\Sigma\), where \(U,V\) are symmetric matrices and \(L_\Sigma[\cdot]\) denotes the solution of the Lyapunov equation (12). By submersion methods we study the more general problem of the horizontal surfaces in , characterized in Prop. 8. As a special case we obtain the Riemannian geodesics, which agree with the metric geodesics of Section 3.
The explicit form of the Riemannian exponential is obtained in Sec. 5. The natural (Riemannian) gradient is discussed in Sec. 6, and some applications to optimization are provided in Sec. 6.1. The analysis of the second-order geometry is treated in Sec. 7, where we compute the Levi-Civita covariant derivative and the Riemannian Hessian, and discuss other related topics. However, the curvature tensor will not be taken into consideration in the present paper.
In the final Sec. 8, we discuss the results in view of applications and of the Information Geometry of statistical submodels of the Gaussian manifold.
2 Symmetric matrices
The set of Gaussian distributions on is in one-to-one correspondence with the space of its parameters . Moreover, is closed under weak convergence, and the identification is continuous in both directions. A reference for Gaussian distributions is the monograph by T.W. Anderson anderson:2003 .
For ease of later reference, we recall a few results on spaces of matrices. General references are the monographs by P. R. Halmos halmos:1958 , J. R. Magnus and H. Neudecker magnusneudecker:1999 , and R. Bhatia bhatia:2007PDM .
The vector space of real matrices is denoted by , while square matrices are denoted by . It is a Euclidean space of dimension , and the vectorization mapping is an isometry for the Frobenius inner product .
Symmetric matrices form a vector subspace of whose orthogonal complement is the space of antisymmetric matrices . With regard to symmetric matrices, we will find it convenient to use the equivalent inner product , see e.g. Eq. (18) below. The closed pointed cone of nonnegative-definite symmetric matrices is denoted by and its interior, the open cone of positive-definite symmetric matrices, by .
Given , the equation is called the Riccati equation. If and , then the equation has a unique solution . In fact, from it follows and, in turn, because . Hence, the solution to the Riccati equation is
(8) 
Notice that , consequently if . In terms of random variables, if and , then is the unique matrix of such that . A more compact closed-form solution of the Riccati equation is available. Given and , observe that . By similarity, the eigenvalues of are nonnegative, hence the square root
(9)
is well defined, see (bhatia:2007PDM, , Ex. 4.5.2). Therefore, an equivalent formulation of Eq. (8) is
(10) 
Since , the eigenvalues of and are identical, so that the same argument used before also yields
(11) 
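The equivalence of the three closed forms can be checked numerically. In the sketch below we write the Riccati equation as \(XAX=B\), with \(A\) positive definite and \(B\) nonnegative definite; these symbol names, and the identification of the forms with Eqs. (8), (10) and (11), are our assumptions.

```python
import numpy as np
from scipy.linalg import sqrtm, inv

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)  # positive definite
B = rng.standard_normal((n, n)); B = B @ B.T + n * np.eye(n)  # positive definite

ra = np.real(sqrtm(A))                                  # A^{1/2}
X8  = inv(ra) @ np.real(sqrtm(ra @ B @ ra)) @ inv(ra)   # Eq. (8) form
X10 = inv(A) @ np.real(sqrtm(A @ B))                    # Eq. (10) form: A^{-1}(AB)^{1/2}
X11 = np.real(sqrtm(B @ A)) @ inv(A)                    # Eq. (11) form: (BA)^{1/2}A^{-1}
```

Note that `A @ B` is not symmetric, but its eigenvalues are positive by the similarity argument above, so the principal matrix square root is well defined.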
The square mapping is an injection of onto itself with derivative . Hence, the derivative operator is invertible. An alternative notation for the derivative, which we find convenient to use now and then, is .
For each assigned matrix , the matrix is the unique solution in the space to the Lyapunov equation
(12) 
Its solution will be written . Clearly we have also
(13) 
The Lyapunov operator itself can be seen as a derivative. In fact, the inverse of the square mapping is the square root mapping . By the derivativeoftheinverse rule,
(14) 
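Both the Lyapunov equation and the derivative rule for the square root are directly computable. SciPy's `solve_continuous_lyapunov(a, q)` solves equations of the form \(aX + Xa^{H} = q\), which for a symmetric coefficient is exactly Eq. (12). The sketch below (all names ours) also checks Eq. (14), i.e., that the directional derivative of the square-root mapping is the Lyapunov solution at the square root, against a finite difference.

```python
import numpy as np
from scipy.linalg import sqrtm, solve_continuous_lyapunov

rng = np.random.default_rng(3)
n = 3
A = rng.standard_normal((n, n))
S = A @ A.T + n * np.eye(n)                    # positive-definite matrix
V = rng.standard_normal((n, n)); V = V + V.T   # symmetric direction

R = np.real(sqrtm(S))                          # S^{1/2}

# Unique symmetric solution of R X + X R = V, i.e., the Lyapunov equation (12) at R.
X = solve_continuous_lyapunov(R, V)

# Eq. (14): the derivative of the square-root mapping at S in direction V
# equals X above; compare with a central finite difference.
t = 1e-5
fd = (np.real(sqrtm(S + t * V)) - np.real(sqrtm(S - t * V))) / (2.0 * t)
```

As remarked below, such direct solvers are fine for illustration, while large-scale applications call for the specialized algorithms of Wachspress and Simoncini.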
If is the dispersion of a nonsingular Gaussian distribution, then is the concentration matrix and represents an alternative and useful parameterization. From the Lyapunov equation we obtain , hence
Likewise, another useful formula is
(15) 
There is also a relation between the Lyapunov equation and the trace. From , it follows . Then
(16) 
We will later need the derivative of the mapping , for a fixed . Differentiating the first identity in Eq. (13) in the direction , we have
Hence is the solution to the Lyapunov equation
so that we get
(17) 
It will be useful in the following to evaluate the second derivative of the mapping . From Eqs. (14) and (17) it follows
The Lyapunov equation plays a crucial role, as the linear operator enters the expression of the Riemannian metric with respect to the standard inner product, see Eq. (7). As a consequence, the numerical implementation of the inner product requires the computation of the matrix . There are many ways to write down the closed-form solution to Eq. (12); they are discussed in bhatia:2007PDM . However, efficient numerical solutions are not based on the closed forms, but rely on specialized algorithms, as discussed by E. L. Wachspress wachspress:2008 and by V. Simoncini simoncini:2016 .
We now turn to the computation of the second-order approximation of in Eq. (2).
Fix and let so that . Hence, for all . Consider the expression of with , , , namely
Observe that .
The second derivative is
with
so that
where Eq. (15) has been used. Finally, observe that
(18) 
We can conclude that
Therefore, the bilinear form in the RHS suggests the form of the Riemannian metric to be derived.
2.1 The mapping
We now study the extension of the square operation to general invertible matrices, namely the mapping , defined by . The next proposition shows that this operation is a submersion. We first recall the definition, see (docarmo:1992, , Ch. 8, Ex. 8–10) or (lang:1995, , §II.2).
Let be an open set of the Hilbert space , and a smooth surjection from the Hilbert space onto a manifold , i.e., assume that for each the derivative at , , is surjective. In such a case, for each , the fiber is a submanifold. Assigned a point , a vector is called vertical if it is tangent to the manifold . Each such tangent vector is the velocity at of some smooth curve with and . Precisely, from for all we derive the characterization of vertical vectors: the tangent space at is . The orthogonal complement of the tangent space is called the space of horizontal vectors at ,
Let us apply this argument to our specific case. Let be the open set of invertible matrices; the subgroup of of orthogonal matrices; the subspace of of antisymmetric matrices.
Proposition 1

For each given we have the orthogonal splitting

The mapping
has derivative at given by . It is a submersion with fibers

The kernel of the differential is
and its orthogonal complement, is

The orthogonal projection of onto is .
Proof
We provide the proof here for the sake of completeness. See also takatsu:2011osaka and bhatiajainlim:2018 .

If , for all , i.e., , then , so that , that is, .

Let the matrix be an element in the fiber manifold . The derivative of at , , is surjective, because for each we have . Hence is a submersion and the fiber is a submanifold of .

Let us compute the splitting of into the kernel of and its orthogonal: . The vector space tangent to at is the kernel of the derivative at :
Therefore, if, and only if, , i.e., . We have just proved that this implies .

Consider the decomposition of into its horizontal and vertical parts: with and . By transposition, we get . From the previous two equations we obtain and . Summing them gives , which is a Lyapunov equation with solution . It follows that the projection is
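The projection just derived can be checked numerically. For the submersion \(A \mapsto AA^{T}\), the symmetric matrix \(S\) solving the Lyapunov equation \((AA^{T})S + S(AA^{T}) = XA^{T} + AX^{T}\) yields the horizontal part \(SA\) of a vector \(X\); the remainder is vertical, i.e., \((X - SA)A^{T}\) is antisymmetric. This sketch (variable names ours) verifies both properties.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n)) + 3.0 * np.eye(n)   # an invertible matrix
X = rng.standard_normal((n, n))                     # arbitrary vector at A

Sigma = A @ A.T
# Lyapunov equation for the symmetric multiplier S:
S = solve_continuous_lyapunov(Sigma, X @ A.T + A @ X.T)
H = S @ A                                           # horizontal component
W = X - H                                           # vertical component

# W is vertical iff W A^T is antisymmetric; H and W are Frobenius-orthogonal.
VA = W @ A.T
```

The orthogonality of `H` and `W` reflects the splitting of Proposition 1 into the kernel of the differential and its orthogonal complement.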
3 Wasserstein distance
The aim of this section is to discuss the Wasserstein distance for the Gaussian case as well as the equation for the associated metric geodesic. Most of its content is an exposition of known results.
3.1 BlockGaussian
Let us suppose that the dispersion matrix is partitioned into blocks, and consider random variables and such that
so that if and . It follows that , which in turn imply the bounds
(19) 
For mean vectors and dispersion matrices , define the set of jointly Gaussian distributions with given marginals to be
and the Gini dissimilarity index
(20) 
Actually, in view of either of the bounds in Eq. (19), the set is compact and the infimum is attained.
It is easy to verify that
defines a distance on the space . The symmetry of is clear, as is the triangle inequality, obtained by considering Gaussian distributions on with given marginals. To conclude, assume that the infimum is attained at some . Then
A further observation is that the distance is homogeneous, i.e.,
3.2 Computing the quadratic dissimilarity index
We will present a proof as given by Dowson and Landau dowsonlandau:1982 , but with some corrections.
Given , each admissible in (20) belongs to a compact set of , thanks to the bound (19), so the maximum of the function is attained. Therefore, we are led to study the problem
(21) 
The value of the similar problem with replaced by will be denoted by
Proposition 2

Let . Then

If moreover , then
Proof (point )
A symmetric matrix is nonnegative definite if, and only if, it is of the form , with . Given the block structure of in (21), we can write
where and are two matrices in
Therefore, problem (21) becomes
We have already observed that the optimum exists, so the necessary conditions of the Lagrange theorem allow us to characterize it. However, the two constraints and are not necessarily regular at every point (i.e., the Jacobian of the transformation may fail to be of full rank at some point), so we must take into account that the optimum could be an irregular point. To this purpose, as is customary, we shall adopt the Fritz John first-order formulation for the Lagrangian (see mangasarianfromovitz:1967 ).
We shall initially assume that both and are nonsingular.
Let then , , where the symmetric matrices and are the Lagrange multipliers. The Lagrangian function will be
The first-order conditions of lead to
(22) 
In the case , i.e., the case of stationary regular points, Eq. (22) becomes
(23) 
which in turn implies
(24) 
and further
Of course, Eqs. (24) could be more general than Eqs. (23) and thus possibly contain undesirable solutions. In this light, we establish the following facts, in which both matrices and must be nonsingular. Notice that in this case Eqs. (24) imply that both and are nonsingular as well.
Actually, let , be any representation of the matrix . Define so that . Moreover
and so are multipliers associated with the feasible point .
Claim 2: The set of solutions to (24), such that , is not empty. In particular, there is a unique pair where both and are positive definite.
We have already observed that Eqs. (24) imply that and are nonsingular. Moreover, we have . Recalling that the Riccati equation has one and only one solution in the class of positive definite matrices, we conclude .
Now we proceed to study the solutions to and we shall show that Eq (