Wasserstein-Riemannian Geometry of Positive-definite Matrices

01/28/2018 ∙ by Luigi Malagò, et al. ∙ Collegio Carlo Alberto ∙ Università di Torino ∙ Romanian Institute of Science and Technology

The Wasserstein distance on multivariate non-degenerate Gaussian densities is a Riemannian distance. After reviewing the properties of the distance and of the metric geodesic, we derive an explicit form of the Riemannian metric on positive-definite matrices and compute its tensor form with respect to the trace scalar product. The tensor is a matrix which is the solution of a Lyapunov equation. We compute explicit forms of the Riemannian exponential, the normal coordinate charts, and the Riemannian gradient, and we discuss the gradient flow of the entropy. Finally, the Levi-Civita covariant derivative is computed in matrix form, together with the differential equation for the parallel transport. While all computations are given in matrix form, we nonetheless discuss the use of a special moving frame. Applications are briefly discussed.


1 Introduction

Given two probability measures and on , with finite second moments, consider the set of probability measures on the product sample space , such that the two -dimensional margins have the prescribed distributions, and . The index as a measure of dissimilarity between distributions has been considered by many classical authors, e.g., C. Gini, P. Lévy, and M.R. Fréchet. There is a considerable contemporary literature discussing the index , which is usually called the Wasserstein distance; see, e.g., the monograph by C. Villani villani:2008optimal . We also mention Y. Brenier brenier:1991 and R.J. McCann mccann:2001 .

There is an important particular case where the above problem reduces to the Monge transport problem. Borrowing the argument from M. Knott and C.S. Smith knott|smith:1984 , assume is a smooth strictly convex function and . Clearly, the condition

turns out to be equivalent to . The latter inequality shows that the minimum quadratic distance is attained. In view of the new formulation, let us provide a proof.

If denotes the convex conjugate of , we have

and the equality case is

By assumption so that

This argument, including an existence proof, is in Y. Brenier brenier:1991 . In the present paper we study the same problem in the case where all the involved distributions are Gaussian. It would be feasible to reduce the Gaussian case to the general one; however, we resort to methods specially suited for this case.

1.1 The Gaussian case

Given two Gaussian distributions , , consider the set of Gaussian distributions on such that the two -dimensional margins have the prescribed distributions, . The corresponding index is

(1)

Observe that if and is a symmetric matrix such that , then the previous argument applies by means of the convex function .

The value of in Eq. (1) as a function of the mean and the dispersion matrix has been computed by several authors, in particular: I. Olkin and F. Pukelsheim olkin|pukelsheim:1982 , D. C. Dowson and B. V. Landau dowson|landau:1982 , C. R. Givens and R. M. Shortt givens|shortt:1984 , and M. Gelbrich gelbrich:1990 . They found the (equivalent) forms

(2)

Further interpretations of are available. R. Bhatia et al. bhatia|jain|lim:2018 showed that is also the solution of constrained minimization problems for the Frobenius matrix norm , when . In particular,

Notice that is the generic transformation of the standard Gaussian to the Gaussian with dispersion matrix .
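
As a concrete numerical illustration, the value in Eq. (2) can be evaluated with generic linear-algebra routines. The sketch below assumes the standard closed form W²(μ₁, μ₂) = |b₁ − b₂|² + tr(Σ₁ + Σ₂ − 2(Σ₁^{1/2} Σ₂ Σ₁^{1/2})^{1/2}); the function name gaussian_w2 and the choice of SciPy routines are ours.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(b1, S1, b2, S2):
    """Squared 2-Wasserstein distance between N(b1, S1) and N(b2, S2),
    using the closed form recalled around Eq. (2)."""
    S1_half = np.real(sqrtm(S1))
    cross = np.real(sqrtm(S1_half @ S2 @ S1_half))
    return float(np.sum((b1 - b2) ** 2) + np.trace(S1 + S2 - 2.0 * cross))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))
    B = rng.standard_normal((3, 3))
    S1, S2 = A @ A.T + np.eye(3), B @ B.T + np.eye(3)
    b1, b2 = rng.standard_normal(3), rng.standard_normal(3)
    print(gaussian_w2(b1, S1, b2, S2))   # a generic positive value
    print(gaussian_w2(b1, S1, b1, S1))   # ~0 when the two Gaussians coincide
```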

Because of the exponent 2 in Eq. (1), the distance is more precisely called the -Wasserstein distance. Other exponents or other distances could be used in the definition. The quadratic case is particularly relevant, as is a Riemannian distance. More references will be given later.

In an Information Geometry perspective, we can mimic the argument of the seminal paper by Amari amari:1998natural , who derived the notions of both the Fisher metric and the natural gradient from the second-order approximation of the Kullback-Leibler divergence.

It will be shown (see Sec. 2) that the value in Eq. (2) admits, for small , the second-order expansion:

(3)

where is the solution to the Lyapunov equation .

The quadratic form in the RHS of Eq. (3) provides a candidate for the Riemannian inner product associated with the distance . In addition, if is a smooth real function defined on a small -sphere, i.e., for small , then the increment is maximized along the direction

where denotes the Euclidean gradient. The operator is Amari’s natural gradient, i.e., the Riemannian gradient.
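
The following minimal sketch makes the natural-gradient computation concrete. It assumes the metric normalization ⟨U, V⟩_Σ = tr(L_Σ[U] Σ L_Σ[V]) anticipated in Eq. (7); under that normalization the Riemannian gradient of a smooth f with Euclidean gradient G is 2(ΣG + GΣ), and the code checks the defining property ⟨grad f, V⟩_Σ = d/dt f(Σ + tV)|_{t=0} on a test function. All function names are ours.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lyap(S, A):
    """L_S[A]: the symmetric solution X of the Lyapunov equation S X + X S = A."""
    return solve_continuous_lyapunov(S, A)

def wasserstein_metric(S, U, V):
    """Candidate inner product <U, V>_S = tr(L_S[U] S L_S[V])."""
    return float(np.trace(lyap(S, U) @ S @ lyap(S, V)))

def natural_gradient(S, G):
    """Riemannian gradient implied by that metric when G is the Euclidean gradient."""
    return 2.0 * (S @ G + G @ S)

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n)); S = M @ M.T + n * np.eye(n)   # a point of Sym^+(n)
A = rng.standard_normal((n, n)); A = (A + A.T) / 2             # fixed symmetric matrix
V = rng.standard_normal((n, n)); V = (V + V.T) / 2             # tangent direction

f = lambda X: np.trace(A @ X @ A @ X)     # test function, Euclidean gradient 2 A X A
grad = natural_gradient(S, 2.0 * A @ S @ A)

eps = 1e-6
directional = (f(S + eps * V) - f(S - eps * V)) / (2 * eps)
print(directional, wasserstein_metric(S, grad, V))   # the two numbers agree
```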

It is remarkable that all the geometric objects appearing in the equations above may be expressed through matrix operations. In this paper, we systematically develop the Wasserstein geometry of Gaussian models according to this formalism.

1.2 Relations with the literature on the general transport theory

The Wasserstein distance and its relevant geometry can also be studied non-parametrically for general distributions. We do not pursue this direction and refer to the monograph by C. Villani villani:2008optimal . The -Wasserstein metric geometry has been shown to be Riemannian by F. Otto (otto:2001, , §4) and J. Lott lott:2008calculations . Cf. the earlier account by J.D. Lafferty lafferty:1988 .

Let us briefly discuss Otto’s approach in the language of Information Geometry, i.e., with reference to S. Amari and H. Nagaoka amari|nagaoka:2000 . In view of the non-parametric approach first introduced in pistone|sempi:95 , and denoting by the set of -dimensional Gaussian densities with zero mean, the vector bundle is the Amari Hilbert bundle on . The Hilbert bundle contains the statistical bundle whose fibers consist of the scores for all smooth curves with . In turn, the statistical bundle is the tangent space of considered as an exponential manifold, see pistone|sempi:95 ; pistone:2013GSI .

In our present case, since the model is an exponential family, the natural parameter is the concentration matrix . The log-likelihood is

If is a symmetric matrix, the derivative of in the direction is

where is a symmetric matrix identified with a linear operator on symmetric matrices , equipped with the Frobenius inner product. The fiber at consists of the vector space of functions , . The inner product in the Hilbert bundle, restricted to the parameterized statistical bundle, is the Fisher metric

(4)

The Fisher metric in the Gaussian case was first studied by L.T. Skovgaard skovgaard:1984 .
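
For reference, the Fisher metric for zero-mean Gaussians in the covariance parameterization has the familiar closed form g_Σ(U, V) = ½ tr(Σ⁻¹ U Σ⁻¹ V); the sketch below (names ours) checks it against a Monte Carlo estimate of the second moment of the scores.

```python
import numpy as np

def fisher_metric(S, U, V):
    """g_S(U, V) = 1/2 tr(S^{-1} U S^{-1} V) for the zero-mean Gaussian model N(0, S)."""
    Si = np.linalg.inv(S)
    return 0.5 * float(np.trace(Si @ U @ Si @ V))

rng = np.random.default_rng(2)
n = 3
M = rng.standard_normal((n, n)); S = M @ M.T + np.eye(n)
U = rng.standard_normal((n, n)); U = (U + U.T) / 2
V = rng.standard_normal((n, n)); V = (V + V.T) / 2

# Scores: d/dt log p_{S+tW}(x) at t=0 equals -1/2 tr(S^{-1}W) + 1/2 x' S^{-1} W S^{-1} x.
Si = np.linalg.inv(S)
X = rng.multivariate_normal(np.zeros(n), S, size=200_000)
Y = X @ Si

def scores(W):
    return -0.5 * np.trace(Si @ W) + 0.5 * np.einsum("ni,ij,nj->n", Y, W, Y)

mc = np.mean(scores(U) * scores(V))     # Monte Carlo estimate of E[score_U * score_V]
print(mc, fisher_metric(S, U, V))       # should agree up to sampling error
```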

F. Otto (otto:2001, , §1.3), who was motivated by the study of a class of partial differential equations, considered an inner product defined on smooth functions of the -fiber of the Hilbert bundle, as

(5)

In the non-parametric case, Otto’s metric of Eq. (5) is related to the Wasserstein distance; for a detailed study of such a metric see J. Lott lott:2008calculations .

If we apply this definition to our score and , the gradient is and the metric becomes

(6)

The equivalence between the metric in Eq. (6) and the one in Eq. (4) can be seen by a change of parameterization both in and in each fiber. First, one must define the inner product at to be the inner product computed in the bijection , to get , which is the form of the metric provided by A. Takatsu (takatsu:2011osaka, , Prop. A). Second, one has to change the parameterization on each fiber of the statistical bundle by . The involved change of parameterization in the statistical bundle whose inverse is produces the desired inner product.

We also mention that the Machine Learning literature discusses a divergence introduced by A. Hyvärinen MR2249836 , which is related to Otto’s metric. Precisely, in the concentration parameterization the Hyvärinen divergence is

and the second derivative of at is

In Statistics, Hyvärinen divergence is related to local proper scoring rules, see M. Parry et al. parry|dawid|lauritzen:2012 .
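
Independently of the normalization adopted in the text, the score-matching form of the divergence, D(p‖q) = ½ E_p[|∇_x log p − ∇_x log q|²], reduces for zero-mean Gaussians with concentration matrices K_p and K_q to ½ tr((K_p − K_q) K_p⁻¹ (K_p − K_q)). The sketch below (names ours) checks this reduction by Monte Carlo.

```python
import numpy as np

def hyvarinen_gauss(Kp, Kq):
    """Score-matching divergence between N(0, Kp^{-1}) and N(0, Kq^{-1}) in the
    concentration parameterization: 1/2 tr((Kp - Kq) Kp^{-1} (Kp - Kq))."""
    D = Kp - Kq
    return 0.5 * float(np.trace(D @ np.linalg.inv(Kp) @ D))

rng = np.random.default_rng(3)
n = 3
A = rng.standard_normal((n, n)); Kp = A @ A.T + np.eye(n)
B = rng.standard_normal((n, n)); Kq = B @ B.T + np.eye(n)

# Monte Carlo check, using grad_x log N(0, K^{-1})(x) = -K x.
X = rng.multivariate_normal(np.zeros(n), np.linalg.inv(Kp), size=200_000)
mc = 0.5 * np.mean(np.sum((X @ (Kq - Kp)) ** 2, axis=1))
print(mc, hyvarinen_gauss(Kp, Kq))   # should agree up to sampling error
```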

1.3 Overview

The first two sections of the paper are mostly a review of known material. In Sec. 2 we recall some properties of the space of symmetric matrices, in particular the Riccati equation, the Lyapunov equation, and the calculus regarding the two mappings and . The mapping , where is a non-singular square matrix, is shown to be a submersion, and the horizontal vectors at each point are computed. Although our manifold is finite-dimensional, there is no need to choose a basis, as all operations of interest are matrix operations. For that reason, we rely on the language of non-parametric differential geometry of W. Klingenberg klingenberg:1995 and S. Lang lang:1995 .

In Sec. 3 we discuss known results about the metric geometry induced by the Wasserstein distance. These results are re-stated in Prop. 3 and, for the sake of completeness, we provide a further proof inspired by dowson|landau:1982 . It is possible to write down an explicit metric geodesic, as done by R.J. McCann (mccann:2001, , Example 1.7), see Prop. 4. The space of non-degenerate Gaussian measures (or, equivalently, the space of positive-definite matrices) can be endowed with a Riemannian structure that induces the Wasserstein distance. This is elaborated in Sec. 4, where we use the presentation given by takatsu:2011osaka , cf. also bhatia|jain|lim:2018 , which in turn adapts the original work (otto:2001, , §4) to the Gaussian case.
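
McCann's explicit geodesic can be sketched numerically. Assuming the standard formulas, namely that the optimal map T between N(0, Σ₀) and N(0, Σ₁) is the positive-definite solution of the Riccati equation T Σ₀ T = Σ₁ and that the metric geodesic is Σ(t) = ((1 − t)I + tT) Σ₀ ((1 − t)I + tT), the code below (names ours) verifies the constant-speed property W(Σ₀, Σ(t)) = t W(Σ₀, Σ₁).

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_w(S0, S1):
    """2-Wasserstein distance between N(0, S0) and N(0, S1)."""
    r = np.real(sqrtm(S0))
    return float(np.sqrt(max(np.trace(S0 + S1 - 2.0 * np.real(sqrtm(r @ S1 @ r))), 0.0)))

def optimal_map(S0, S1):
    """Positive-definite T with T S0 T = S1 (optimal transport map between the Gaussians)."""
    r = np.real(sqrtm(S0))
    ri = np.linalg.inv(r)
    return np.real(ri @ sqrtm(r @ S1 @ r) @ ri)

def mccann_geodesic(S0, S1, t):
    """Displacement interpolation Sigma(t) = ((1-t) I + t T) S0 ((1-t) I + t T)."""
    T = optimal_map(S0, S1)
    M = (1 - t) * np.eye(len(S0)) + t * T
    return M @ S0 @ M

rng = np.random.default_rng(4)
n = 3
A = rng.standard_normal((n, n)); S0 = A @ A.T + np.eye(n)
B = rng.standard_normal((n, n)); S1 = B @ B.T + np.eye(n)

# Constant-speed property of a metric geodesic: W(S0, Sigma(t)) = t * W(S0, S1).
for t in (0.25, 0.5, 0.75):
    print(bures_w(S0, mccann_geodesic(S0, S1, t)), t * bures_w(S0, S1))
```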

The remaining part of the paper is offered as a new contribution to this topic. The Wasserstein Riemannian metric turns out to be

(7)

at each matrix , where are symmetric matrices. By submersion methods we study the more general problem of the horizontal surfaces in , characterized in Prop. 8. As a special case we obtain the Riemannian geodesics, which agree with the metric geodesics of Sec. 3.

The explicit form of the Riemannian exponential is obtained in Sec. 5. The natural (Riemannian) gradient is discussed in Sec. 6, and some applications to optimization are provided in Sec. 6.1. Second-order geometry is treated in Sec. 7, where we compute the Levi-Civita covariant derivative and the Riemannian Hessian and discuss other related topics. However, the curvature tensor is not considered in the present paper.

In the final Sec. 8, we discuss the results in view of applications and of the Information Geometry of statistical sub-models of the Gaussian manifold.

2 Symmetric matrices

The set of Gaussian distributions on is in 1-to-1 correspondence with the space of its parameters . Moreover, is closed under weak convergence, and the identification is continuous in both directions. A reference for Gaussian distributions is the monograph by T.W. Anderson anderson:2003 .

For ease of later reference, we recall a few results on spaces of matrices. General references are the monographs by P. R. Halmos halmos:1958 , J. R. Magnus and H. Neudecker magnus|neudecker:1999 , and R. Bhatia bhatia:2007PDM .

The vector space of real matrices is denoted by , while square matrices are denoted . It is a Euclidean space of dimension , and the vectorization mapping is an isometry for the Frobenius inner product .

Symmetric matrices form a vector subspace of whose orthogonal complement is the space of anti-symmetric matrices . With regard to symmetric matrices, we will find it convenient to use the equivalent inner product , see e.g. Eq. (18) below. The closed pointed cone of non-negative-definite symmetric matrices is denoted by , and its interior, the open cone of positive-definite symmetric matrices, by .

Given , the equation is called the Riccati equation. If and , then the equation has a unique solution . In fact, from it follows and, in turn, because . Hence, the solution to the Riccati equation is

(8)

Notice that , consequently if . In terms of random variables, if and , then is the unique matrix in such that .

A more compact closed-form solution of the Riccati equation is available. Given and , observe that . By similarity, the eigenvalues of are non-negative, hence the square root

(9)

is well defined, see (bhatia:2007PDM, , Ex. 4.5.2). Therefore, an equivalent formulation of Eq. (8) is

(10)

Since , the eigenvalues of and are identical, so that the same argument used before also yields

(11)
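
Assuming the Riccati equation considered here is X A X = B with A, B ∈ Sym⁺(n), the two equivalent closed forms X = A^{-1/2}(A^{1/2} B A^{1/2})^{1/2} A^{-1/2} and X = B^{1/2}(B^{1/2} A B^{1/2})^{-1/2} B^{1/2} can be checked numerically as follows; the helper names are ours.

```python
import numpy as np
from scipy.linalg import sqrtm, fractional_matrix_power

def riccati(A, B):
    """Positive-definite solution X of X A X = B, via A^{-1/2}(A^{1/2} B A^{1/2})^{1/2} A^{-1/2}."""
    Ah = np.real(sqrtm(A))
    Ahi = np.linalg.inv(Ah)
    return np.real(Ahi @ sqrtm(Ah @ B @ Ah) @ Ahi)

def riccati_alt(A, B):
    """Equivalent closed form X = B^{1/2}(B^{1/2} A B^{1/2})^{-1/2} B^{1/2}."""
    Bh = np.real(sqrtm(B))
    return np.real(Bh @ fractional_matrix_power(Bh @ A @ Bh, -0.5) @ Bh)

rng = np.random.default_rng(5)
n = 4
P = rng.standard_normal((n, n)); A = P @ P.T + np.eye(n)
Q = rng.standard_normal((n, n)); B = Q @ Q.T + np.eye(n)

X = riccati(A, B)
print(np.max(np.abs(X @ A @ X - B)))           # ~0: X solves the Riccati equation
print(np.max(np.abs(X - riccati_alt(A, B))))   # ~0: the two closed forms agree
```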

The square mapping is a bijection of onto itself with derivative . Hence, the derivative operator is invertible. An alternative notation for the derivative, which we find convenient to use now and then, is .

For each assigned matrix , the matrix is the unique solution in the space to the Lyapunov equation

(12)

Its solution will be written . Clearly, we also have

(13)

The Lyapunov operator itself can be seen as a derivative. In fact, the inverse of the square mapping is the square root mapping . By the derivative-of-the-inverse rule,

(14)

If is the dispersion matrix of a non-singular Gaussian distribution, then is the concentration matrix and represents an alternative and useful parameterization. From the Lyapunov equation we obtain , hence

Likewise, another useful formula is

(15)

There is also a relation between the Lyapunov equation and the trace. From , it follows . Then

(16)

We will later need the derivative of the mapping , for a fixed . Differentiating the first identity in Eq. (13) in the direction , we have

Hence is the solution to the Lyapunov equation

so that we get

(17)

It will be useful in the following to evaluate the second derivative of the mapping . From Eqs. (14) and (17) it follows

The Lyapunov equation plays a crucial role, as the linear operator enters the expression of the Riemannian metric with respect to the standard inner product, see Eq. (7). As a consequence, the numerical implementation of the inner product requires the computation of the matrix . There are many ways to write down the closed-form solution to Eq. (12); they are discussed in bhatia:2007PDM . However, efficient numerical solutions are not based on the closed forms, but rely on specialized numerical algorithms, as discussed by E. L. Wachspress wachspress:2008 and by V. Simoncini simoncini:2016 .
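
A minimal sketch of such a numerical solution, assuming the Lyapunov equation of Eq. (12) has the standard form Σ X + X Σ = A with Σ ∈ Sym⁺(n) and A ∈ Sym(n), is the following; it also checks a trace identity of the kind used around Eq. (16), namely tr L_Σ[A] = ½ tr(Σ⁻¹ A). The function name lyap is ours.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lyap(S, A):
    """L_S[A]: the unique symmetric solution X of the Lyapunov equation S X + X S = A."""
    return solve_continuous_lyapunov(S, A)

rng = np.random.default_rng(6)
n = 5
M = rng.standard_normal((n, n)); S = M @ M.T + np.eye(n)   # S in Sym^+(n)
A = rng.standard_normal((n, n)); A = (A + A.T) / 2         # symmetric right-hand side

X = lyap(S, A)
print(np.max(np.abs(S @ X + X @ S - A)))   # ~0: X solves the equation
print(np.max(np.abs(X - X.T)))             # ~0: the solution is symmetric
# Taking the trace of S^{-1}(S X + X S) = S^{-1} A gives tr X = 1/2 tr(S^{-1} A).
print(np.trace(X), 0.5 * np.trace(np.linalg.inv(S) @ A))
```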

We now turn to the computation of the second-order approximation of in Eq. (2).

Fix and let so that . Hence, for all . Consider the expression of with , , , namely

By Eq. (14) and Eq. (16), the first-order derivative is

Observe that .

The second derivative is

with

so that

where Eq. (15) has been used. Finally, observe that

(18)

We can conclude that

Therefore, the bilinear form in the RHS suggests the form of the Riemannian metric to be derived.
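
Under our reading of this quadratic form, namely tr(L_Σ[H] Σ L_Σ[H]) in accordance with the metric anticipated in Eq. (7), the expansion can be checked numerically: W²(Σ, Σ + tH)/t² should approach that value as t → 0. The sketch below uses names of our choosing.

```python
import numpy as np
from scipy.linalg import sqrtm, solve_continuous_lyapunov

def w2_squared(S1, S2):
    """Squared Wasserstein distance between N(0, S1) and N(0, S2), as in Eq. (2)."""
    r = np.real(sqrtm(S1))
    return float(np.trace(S1 + S2 - 2.0 * np.real(sqrtm(r @ S2 @ r))))

def metric_form(S, H):
    """Candidate quadratic form tr(L_S[H] S L_S[H])."""
    L = solve_continuous_lyapunov(S, H)    # L_S[H]: solves S X + X S = H
    return float(np.trace(L @ S @ L))

rng = np.random.default_rng(7)
n = 4
M = rng.standard_normal((n, n)); S = M @ M.T + np.eye(n)
H = rng.standard_normal((n, n)); H = (H + H.T) / 2

# The ratio W^2(S, S + tH) / t^2 should approach the quadratic form as t -> 0.
for t in (1e-1, 1e-2, 1e-3):
    print(w2_squared(S, S + t * H) / t**2, metric_form(S, H))
```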

2.1 The mapping

We now study the extension of the square operation to general invertible matrices, namely the mapping , defined by . The next proposition shows that this operation is a submersion. We first recall its definition, see (docarmo:1992, , Ch. 8, Ex. 8–10) or (lang:1995, , §II.2 ).

Let be an open set of the Hilbert space , and a smooth surjection from the Hilbert space onto a manifold , i.e., assume that for each the derivative at , is surjective. In such a case, for each , the fiber is a sub-manifold. Given a point , a vector is called vertical if it is tangent to the manifold . Each such tangent vector is the velocity at of some smooth curve with and . Precisely, from for all we derive the characterization of vertical vectors. We have , i.e., the tangent space at is . The orthogonal space to the tangent space is called the space of horizontal vectors at ,

Let us apply this argument to our specific case. Let be the open set of invertible matrices; the subgroup of of orthogonal matrices; the subspace of of anti-symmetric matrices.

Proposition 1
  1. For each given we have the orthogonal splitting

  2. The mapping

    has derivative at given by . It is a submersion with fibers

  3. The kernel of the differential is

    and its orthogonal complement is

  4. The orthogonal projection of onto is .

Proof

We provide the proof here for the sake of completeness. See also takatsu:2011osaka and bhatia|jain|lim:2018 .

  1. If , for all , i.e., , then , so that , that is, .

  2. Let the matrix be an element in the fiber manifold . The derivative of at , , is surjective, because for each we have . Hence is a submersion and the fiber is a sub-manifold of .

  3. Let us compute the splitting of into the kernel of and its orthogonal: . The vector space tangent to at is the kernel of the derivative at :

    Therefore, if, and only if, , i.e., . We have just proved that this implies .

  4. Consider the decomposition of into the horizontal and the vertical part: with and . By transposition, we get . From the previous two equations, we obtain and . The sum of these two equations is , which is a Lyapunov equation with solution . It follows that the projection is
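
The projection in item 4 can be made concrete. In our reconstruction of the statement, for the submersion X ↦ X Xᵀ on invertible matrices the vertical vectors at X have the form X A with A anti-symmetric, the horizontal ones have the form S X with S symmetric, and the horizontal projection of an arbitrary W is S X, where S solves the Lyapunov equation S(XXᵀ) + (XXᵀ)S = WXᵀ + XWᵀ. The sketch below (names ours) checks the two defining properties.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def horizontal_projection(X, W):
    """Horizontal part of W at X for the submersion X -> X X^T:
    S X, where S solves S (X X^T) + (X X^T) S = W X^T + X W^T."""
    Sigma = X @ X.T
    S = solve_continuous_lyapunov(Sigma, W @ X.T + X @ W.T)
    return S @ X

rng = np.random.default_rng(8)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # an invertible matrix
W = rng.standard_normal((n, n))                   # an arbitrary tangent vector at X

H = horizontal_projection(X, W)
V = W - H                                         # vertical part
A = np.linalg.inv(X) @ V                          # vertical vectors have the form X A
print(np.max(np.abs(A + A.T)))                    # ~0: A is anti-symmetric
print(abs(np.sum(H * V)))                         # ~0: H and V are Frobenius-orthogonal
```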

3 Wasserstein distance

The aim of this section is to discuss the Wasserstein distance for the Gaussian case as well as the equation for the associated metric geodesic. Most of its content is an exposition of known results.

3.1 Block-Gaussian

Let us suppose that the dispersion matrix is partitioned into blocks, and consider random variables and such that

so that if and . It follows that , which in turn imply the bounds

(19)

For mean vectors and dispersion matrices , define the set of jointly Gaussian distributions with given marginals to be

and the Gini dissimilarity index

(20)

Actually, in view of either of the bounds in Eq. (19), the set is compact and the is attained.

It is easy to verify that

defines a distance on the space . The symmetry of is clear, and the triangle inequality follows by considering Gaussian distributions on with the given marginals. To conclude, assume that the is attained at some . Then

A further observation is that the distance is homogeneous, i.e.,

3.2 Computing the quadratic dissimilarity index

We present a proof along the lines of Dowson and Landau dowson|landau:1982 , but with some corrections.

Given , each admissible in (20) belongs to a compact set of thanks to the bound (19), so the maximum of the function is attained. Therefore, we are led to study the problem

(21)

The value of the similar problem with replaced by will be denoted by

Proposition 2
  1. Let . Then

  2. If moreover , then

Proof (point )

A symmetric matrix is non-negative definite if, and only if, it is of the form , with . Given the block structure of in (21), we can write

where and are two matrices in

Therefore, problem (21) becomes

We have already observed that the optimum exists, so the necessary conditions of the Lagrange theorem allow us to characterize this optimum. However, the two constraints and are not necessarily regular at every point (i.e., the Jacobian of the transformation may fail to be of full rank at some point), so we must take into account that the optimum could be an irregular point. To this purpose, as is customary, we shall adopt the Fritz John first-order formulation of the Lagrangian (see mangasarian|fromovitz:1967 ).

We shall initially assume that both and are non-singular.

Let then , , where the symmetric matrices and are the Lagrange multipliers. The Lagrangian function will be

The first-order conditions of lead to

(22)

In the case , i.e., the case of regular stationary points, Eq. (22) becomes

(23)

which in turn implies

(24)

and further

Of course, Eqs. (24) could be more general than Eqs. (23) and thus possibly contain undesirable solutions. In this light, we establish the following facts, in which both matrices and must be nonsingular. Notice that in this case Eqs. (24) imply that both and are nonsingular as well.


Claim 1: If is a solution to (24) and , then the pair are Lagrange multipliers of Problem (21).

Actually, let , be any representation of the matrix . Define so that . Moreover

and so are multipliers associated with the feasible point .


Claim 2: The set of solutions to (24), such that , is not empty. In particular, there is a unique pair where both and are positive definite.

We have already observed that Eqs. (24) imply that and are non-singular. Moreover, we have . Recalling that the Riccati equation has one and only one solution in the class of positive-definite matrices, we conclude that .

Now we proceed to study the solutions to and we shall show that Eq (