I Introduction
The present paper aims to bring together the tools of Riemannian statistics and random matrix theory, in order to resolve the computational issues which have stood in the way of applying Riemannian learning models to datasets of high-dimensional covariance matrices.
Over the past few years, Riemannian Gaussian distributions, and mixtures of these distributions, were introduced as a means of modeling the intrinsic structure of statistical populations of covariance matrices [Cheng2013, Sa16, Sa17]. They have found successful applications in fields such as brain-computer interface analysis and artificial intelligence [zanini2018][mathieu2019]. However, such applications could not be pursued for covariance matrices of relatively large dimension. Indeed, there seemed to exist no practical method of computing the normalising factors associated with Riemannian Gaussian distributions on spaces of high-dimensional covariance matrices.

The theoretical contribution of this paper is to show that this missing method arises quite naturally, as soon as one realises that a strong connection exists between Riemannian Gaussian distributions and random matrix theory. Roughly speaking, Riemannian Gaussian distributions on spaces of real, complex, or quaternion covariance matrices correspond to log-normal orthogonal, unitary, or symplectic matrix ensembles. These are similar to the classical, widely-known Gaussian orthogonal, unitary, and symplectic ensembles, but with the normal weight function replaced by a log-normal weight function. Thanks to this new connection, the powerful tools of random matrix theory can be employed to uncover several original properties of Riemannian Gaussian distributions, especially for higher-dimensional covariance matrices.
The present paper also extends the equivalence between Riemannian Gaussian distributions and random matrix ensembles to the case of block-Toeplitz covariance matrices. Instead of a log-normal weight function, the corresponding weight function is “acosh-normal", where acosh denotes the inverse hyperbolic cosine. No attempt is made to develop these additional “acosh-normal" ensembles, at least not for now. In fact, there are already plenty of results to derive and apply with the log-normal ensembles, which cover the important cases of real and complex covariance matrices. This is in part because log-normal ensembles (also called Stieltjes-Wigert ensembles) are amenable to a direct analytic treatment, as will be seen below.
It is also because of a rather surprising connection, also noted in [SH21]. The log-normal unitary ensemble had appeared in the theoretical physics literature about twenty years ago, as a random matrix model for the Chern-Simons quantum field theory [Ma05]. As seen in the present paper, this ensemble corresponds to Riemannian Gaussian distributions on the space of complex covariance matrices. Of course, such Riemannian Gaussian distributions have nothing to do with quantum field theory, but some of their valuable properties can be obtained by carefully re-adapting already existing results, found within this theory.
The present paper was developed independently of a theoretical physics paper published only very recently [Ti20]. Both papers focus on the connection between Riemannian Gaussian distributions and log-normal matrix ensembles, but differ significantly in terms of their contribution and focus, as will be discussed below. The reader is also referred to [Forrester1989] for previous results on asymptotic computations of log-partition functions from the random matrix theory literature, as well as the recent paper [Forrester2021], which considers global and local scaling limits for the Stieltjes-Wigert random matrix ensemble and associated physical interpretations.
The paper is organised as follows. Section II recalls basic background material on Riemannian Gaussian distributions. Section III presents the main original results, obtained by applying random matrix theory to the study of Riemannian Gaussian distributions. Section IV illustrates the importance of these original results to learning from datasets of high-dimensional covariance matrices. Finally, Section V considers Riemannian Gaussian distributions on the space of block-Toeplitz covariance matrices, and introduces the corresponding “acosh-normal" ensembles. Proofs of the propositions stated in Sections III and V are provided in Appendix A.
II Notation and background
Let $\mathcal{P}^{\beta}_{N}$ denote the space of $N \times N$ covariance matrices which are either real ($\beta = 1$), complex ($\beta = 2$), or quaternion ($\beta = 4$). Precisely, the elements of $\mathcal{P}^{1}_{N}$ are real, symmetric positive-definite matrices, while those of $\mathcal{P}^{2}_{N}$ and $\mathcal{P}^{4}_{N}$ are respectively complex and quaternion, Hermitian positive-definite matrices.

Of course, $\mathcal{P}^{\beta}_{N}$ is an open convex cone, sitting inside the real vector space $\mathcal{H}^{\beta}_{N}$ of real, complex, or quaternion self-adjoint matrices (the adjoint of a matrix being its conjugate transpose). Therefore, $\mathcal{P}^{\beta}_{N}$ is a real differentiable manifold, whose tangent space at any point is isomorphic to $\mathcal{H}^{\beta}_{N}$. In particular, the dimension of $\mathcal{P}^{\beta}_{N}$ is equal to $N + \beta\,N(N-1)/2$.

The elements of $\mathcal{P}^{\beta}_{N}$ are in one-to-one correspondence with the centred (zero-mean) $N$-variate normal distributions (real, complex circular, or quaternion circular, according to the value of $\beta$). Thus, $\mathcal{P}^{\beta}_{N}$ can be equipped with the Rao-Fisher information metric of the centred $N$-variate normal model [amaribook]. This is identical to the so-called affine-invariant metric introduced in [Pennec2006],

$$ds^{2}(Y) \,=\, \mathrm{Re}\,\mathrm{tr}\!\left[\left(Y^{-1}\,dY\right)^{2}\right] \qquad (1)$$
where $\mathrm{Re}$ denotes the real part and $\mathrm{tr}$ denotes the trace. The Riemannian geometry of the metric (1) is quite well known in the geometric science of information community (see the recent book [Pennec2019]). Recall here the associated geodesic distance,

$$d^{\,2}(Y,Z) \,=\, \mathrm{tr}\!\left[\log\!\left(Y^{-1/2}\,Z\,Y^{-1/2}\right)\right]^{2} \qquad (2)$$

where matrix logarithms and powers are understood as self-adjoint matrix functions, obtained by taking logarithms and powers of eigenvalues. The main advantage of the metric
(1) is its invariance under the action of the group of invertible matrices with real, complex, or quaternion entries (according to the value of $\beta$). For example, if $A$ is such an invertible matrix and $Y$ is replaced by $A\,Y\,A^{\dagger}$ while $Z$ is replaced by $A\,Z\,A^{\dagger}$, then the distance $d(Y,Z)$ remains unchanged (note that $\dagger$ denotes the adjoint, or conjugate transpose).

In terms of the distance (2), we define Riemannian Gaussian distributions on the space $\mathcal{P}^{\beta}_{N}$ to be given by the parameterised probability density function, as in [Sa16][Sa17],

$$p(Y\,|\,\bar{Y},\sigma) \,=\, \frac{1}{Z(\sigma)}\,\exp\!\left[-\,\frac{d^{\,2}(Y,\bar{Y})}{2\sigma^{2}}\right] \qquad (3)$$

where $\bar{Y} \in \mathcal{P}^{\beta}_{N}$ is the centre-of-mass parameter, and $\sigma > 0$ the dispersion parameter. The normalising factor $Z(\sigma)$ is given by the integral

$$Z(\sigma) \,=\, \int_{\mathcal{P}^{\beta}_{N}} \exp\!\left[-\,\frac{d^{\,2}(Y,\bar{Y})}{2\sigma^{2}}\right] dv(Y) \qquad (4)$$

with respect to the Riemannian volume $dv(Y)$ of the metric (1). This volume element is expressed in terms of the real and imaginary parts of the entries of $Y$ (one imaginary part per entry when $Y$ is complex, and three when $Y$ is quaternion).
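As a purely illustrative complement (not part of the original papers), the following minimal Python sketch evaluates the geodesic distance (2) and the unnormalised exponential factor appearing in the density (3); the function names are hypothetical.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def affine_invariant_distance(Y, Z):
    """Geodesic distance (2) between two symmetric/Hermitian positive-definite matrices."""
    Y_inv_sqrt = fractional_matrix_power(Y, -0.5)
    M = Y_inv_sqrt @ Z @ Y_inv_sqrt                 # self-adjoint, positive definite
    log_eigs = np.log(np.linalg.eigvalsh(M))        # eigenvalues of log(Y^{-1/2} Z Y^{-1/2})
    return np.sqrt(np.sum(log_eigs ** 2))

def unnormalised_gaussian_density(Y, Y_bar, sigma):
    """exp(-d^2(Y, Y_bar) / (2 sigma^2)); the normalising factor 1/Z(sigma) is omitted."""
    d2 = affine_invariant_distance(Y, Y_bar) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# quick check: the distance to the identity is the Euclidean norm of the log-eigenvalues
Y = np.diag([1.0, 4.0, 9.0])
print(affine_invariant_distance(Y, np.eye(3)))      # sqrt(log(4)^2 + log(9)^2)
```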
Both formulae (3) and (4) are greatly simplified by introducing “polar coordinates". Each $Y \in \mathcal{P}^{\beta}_{N}$ can be diagonalised as $Y = U^{\dagger}\,\mathrm{diag}\!\left(e^{r_1},\ldots,e^{r_N}\right) U$, where the middle factor is a diagonal matrix, with diagonal elements $e^{r_i}$ for $i = 1,\ldots,N$, and $U \in G_{\beta}$, which means $U^{\dagger}U = I$ (the identity matrix). Note that $G_{1}$ is the orthogonal group $O(N)$, $G_{2}$ is the unitary group $U(N)$, and $G_{4}$ is the symplectic (quaternion unitary) group $Sp(N)$ (for deeper insight, see the recent review [edelman]).

Now, as in [Sa17], let $Y$ follow the Riemannian Gaussian density (3), and write $Y = U^{\dagger}\,\mathrm{diag}\!\left(e^{r_1},\ldots,e^{r_N}\right) U$. If $\bar{Y} = I$, then $U$ is uniformly distributed on $G_{\beta}$ (for a rigorous definition of this “uniform distribution", see [meckes]), while $(r_1,\ldots,r_N)$ have joint probability density function

$$p(r_1,\ldots,r_N) \,\propto\, \prod_{i<j} \sinh^{\beta}\!\left(\frac{|r_i - r_j|}{2}\right)\, \prod_{i=1}^{N} e^{-r_i^{2}/2\sigma^{2}} \qquad (5)$$

where $\propto$ indicates proportionality and $\sinh$ is the hyperbolic sine. In addition, the normalising factor in (4) reduces to a certain multiple integral $Z^{\beta}_{N}(\sigma)$. Specifically,

$$Z(\sigma) \,=\, C^{\beta}_{N} \times Z^{\beta}_{N}(\sigma) \qquad (6)$$

where $C^{\beta}_{N}$ is a numerical constant, which appears after the uniformly distributed matrix $U$ is integrated out of (4), and where

$$Z^{\beta}_{N}(\sigma) \,=\, \int_{\mathbb{R}^{N}} \prod_{i<j} \sinh^{\beta}\!\left(\frac{|r_i - r_j|}{2}\right)\, \prod_{i=1}^{N} e^{-r_i^{2}/2\sigma^{2}}\; dr_1 \cdots dr_N \qquad (7)$$
One of the main issues addressed in the present paper is the efficient approximation of the multiple integral $Z^{\beta}_{N}(\sigma)$. Until now (in [Sa16][Sa17]), this was done using a Monte Carlo technique which involved a smoothing method (containing certain arbitrarily fixed parameters), and which failed to produce coherent results when the dimension $N$ increased beyond relatively small values.
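To make the role of the multiple integral concrete, here is a minimal Monte Carlo sketch (not the smoothing-based technique of [Sa16][Sa17]), which estimates (7) — assuming the form displayed above — by drawing the $r_i$ from independent normal distributions; it is only practical for small $N$ and $\sigma$.

```python
import numpy as np

def mc_estimate_Z(N, sigma, beta, num_samples=200_000, seed=0):
    """Importance-sampling estimate of the multiple integral (7):
    Z = (2 pi sigma^2)^(N/2) * E[ prod_{i<j} sinh^beta(|r_i - r_j| / 2) ],
    where the r_i are i.i.d. N(0, sigma^2). Illustrative only: the variance of
    this estimator grows quickly with N and sigma."""
    rng = np.random.default_rng(seed)
    r = rng.normal(scale=sigma, size=(num_samples, N))
    iu = np.triu_indices(N, k=1)
    diffs = np.abs(r[:, iu[0]] - r[:, iu[1]])       # all pairwise gaps |r_i - r_j|
    prod_sinh = np.prod(np.sinh(diffs / 2.0) ** beta, axis=1)
    return (2.0 * np.pi * sigma ** 2) ** (N / 2.0) * prod_sinh.mean()

print(mc_estimate_Z(N=4, sigma=0.3, beta=2))
```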
Before proceeding to present our main results, we should briefly note the existence of several alternative proposals for the extension of Gaussian distributions to Riemannian manifolds, including constructions based on heat flows and diffusion processes [Sommer2015]. Further discussion of these alternative formulations is beyond the scope of this paper and the interested reader is referred to the literature on the topic for further information. See [Pennec2019] (Section 3.4.3 and Chapter 10) and the references therein for a recent and comprehensive account.
III Main results
The main results of the present paper stem from the equivalence between Riemannian Gaussian distributions on the spaces of real, complex, or quaternion covariance matrices and log-normal orthogonal, unitary, and symplectic matrix ensembles.
Let $Y$ be a random matrix in $\mathcal{P}^{\beta}_{N}$ which follows the Riemannian Gaussian density of (3). It is always possible to assume that $\bar{Y} = I$, since this holds after replacing $Y$ by $\bar{Y}^{-1/2}\, Y\, \bar{Y}^{-1/2}$. Then, the probability distribution of $Y$ is described by the following proposition.

Proposition 1.
In plain words, this proposition states that $Y$ follows a log-normal matrix ensemble. If $Y$ is diagonalised as $Y = U^{\dagger}\,\mathrm{diag}(x_1,\ldots,x_N)\,U$, then $U$ is uniformly distributed on $G_{\beta}$, and the eigenvalues $(x_1,\ldots,x_N)$, which are all positive, have joint probability density function

$$p(x_1,\ldots,x_N) \,\propto\, |\Delta(x)|^{\beta}\, \prod_{i=1}^{N} w(x_i) \qquad (9)$$

where $\Delta(x) = \prod_{i<j}(x_j - x_i)$ is the Vandermonde determinant, and $w(x) = x^{-\beta(N-1)/2\,-\,1}\, e^{-\log^{2}(x)/2\sigma^{2}}$ is the log-normal weight function.
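The log-normal ensemble of Proposition 1 is also easy to simulate. The sketch below samples the eigenvalue law by a random-walk Metropolis algorithm applied to the log-eigenvalue density (5), as reconstructed above; this is only an illustration, not the sampling algorithm used in the experiments of Section IV, and the function names are hypothetical.

```python
import numpy as np

def log_density(r, sigma, beta):
    """Unnormalised log of the joint density (5), in log-eigenvalue coordinates r."""
    iu = np.triu_indices(len(r), k=1)
    gaps = np.abs(r[iu[0]] - r[iu[1]])
    return -np.sum(r ** 2) / (2 * sigma ** 2) + beta * np.sum(np.log(np.sinh(gaps / 2.0)))

def sample_eigenvalues(N, sigma, beta, num_steps=20_000, step=0.05, seed=0):
    """Random-walk Metropolis sampler for the eigenvalues of the log-normal ensemble."""
    rng = np.random.default_rng(seed)
    r = rng.normal(scale=sigma, size=N)
    lp = log_density(r, sigma, beta)
    for _ in range(num_steps):
        proposal = r + step * rng.standard_normal(N)
        lp_new = log_density(proposal, sigma, beta)
        if np.log(rng.uniform()) < lp_new - lp:      # Metropolis acceptance rule
            r, lp = proposal, lp_new
    return np.exp(r)                                  # eigenvalues x_i = exp(r_i)

print(sample_eigenvalues(N=5, sigma=0.5, beta=2))
```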
In essence, Proposition 1 is already contained in [Ti20]. As a consequence of this proposition, the multiple integral of (7) may be expressed as follows,

$$Z^{\beta}_{N}(\sigma) \,=\, \frac{1}{2^{\beta N(N-1)/2}}\, \int_{(0,\infty)^{N}} |\Delta(x)|^{\beta}\, \prod_{i=1}^{N} w(x_i)\; dx_1 \cdots dx_N \qquad (10)$$

where $\Delta(x)$ is the Vandermonde determinant and $w(x)$ the log-normal weight function, as in (9). In [Ti20], integrals such as the one in (10) are expressed using the Andreev or De Bruijn identities, often employed in random matrix theory. These yield somewhat cumbersome formulae, involving determinants whose size grows with $N$. On the other hand, the present paper focuses on an alternative approach, which turns out to be more suitable from a practical point of view. Instead of expressing $Z^{\beta}_{N}(\sigma)$ exactly, by means of complicated formulae, the aim is to use a highly efficient approximation, which involves a single analytic expression. This is indicated by the following proposition.
Proposition 2.
In the limit where $N \rightarrow \infty$ and $\sigma \rightarrow 0$, while the product $t = N\sigma^{2}$ remains constant,

(11)

where $\mathrm{Li}_{3}$ is the trilogarithm function, $\mathrm{Li}_{3}(z) = \sum_{k \geq 1} z^{k}/k^{3}$.
In [Ti20], the limit in (11) is only mentioned in passing, and not stated under the same form. Here, this limit will be given centre stage. Proposition 2 states that (11) is valid in the “double-scaling regime" ($N \rightarrow \infty$ and $\sigma \rightarrow 0$), but numerical experiments have shown that $\log Z^{\beta}_{N}(\sigma)$ can be replaced by the expression afforded by (11) without notable loss of accuracy, whenever $\sigma$ is small in comparison with $N$ (this is further illustrated below).
At least informally, this can be justified by appealing to arguments originating in theoretical physics [Ma05]. Considered as a function of $t$, $\log Z^{\beta}_{N}$ is called the free energy (log of the partition function). This free energy can be expanded in an asymptotic series (see Section 1.3 of [Ma05]),

$$\log Z^{\beta}_{N} \,=\, \sum_{g \geq 0} F_{g}(t)\, N^{2-2g} \qquad (12)$$

which is obtained by summing over Feynman diagrams, or so-called fatgraphs. Each coefficient $F_{g}(t)$ is itself a series in powers of $t$, whose coefficient of $t^{h}$ counts fatgraphs which are said to have $h$ holes and genus $g$ (this means that one thinks of a fatgraph as a graph with loops, drawn on a surface of genus $g$, such as a sphere or torus, etc.). Now, accepting (12), it follows that for each fixed value of $t$,

$$\lim_{N \rightarrow \infty}\; \frac{1}{N^{2}}\, \log Z^{\beta}_{N} \,=\, F_{0}(t) \qquad (13)$$

so that $N^{2} F_{0}(t)$ approximates the left-hand side of (12) up to an error of lower order in $N$. Finally, recalling that $t = N\sigma^{2}$, it is clear that $F_{0}(t)$ is the right-hand side of (11) — this follows by uniqueness of asymptotic expansions.
In addition to the asymptotic form of $\log Z^{\beta}_{N}(\sigma)$, another quantity of interest is the asymptotic empirical distribution of the eigenvalues of a random matrix $Y$ which follows the Riemannian Gaussian density (3) with $\bar{Y} = I$. Let $(x_1,\ldots,x_N)$ denote the eigenvalues of $Y$, and consider their empirical distribution

$$\mu_{N}(J) \,=\, \frac{1}{N}\, \mathbb{E}\left[\,\#\{\, i : x_i \in J \,\}\,\right] \qquad (14)$$

for any open interval $J$, where $\mathbb{E}$ denotes expectation, and $\#\{\, i : x_i \in J \,\}$ the number of eigenvalues $x_i$ which belong to $J$. In [Ti20], the probability density function of $\mu_{N}$ was expressed as a weighted sum of Gaussian distributions, by a direct application of the Christoffel-Darboux formula, as in [deift]. It is possible to do so for any value of $N$, only because the case $\beta = 2$ can be studied using a well-known family of orthogonal polynomials, called Stieltjes-Wigert polynomials. No such analytic tool is available when $\beta = 1$ or $\beta = 4$.
To make up for this issue, the following proposition provides an asymptotic expression of the distribution $\mu_{N}$, valid for all values $\beta = 1, 2, 4$.
Proposition 3.
In the limit where $N \rightarrow \infty$ and $\sigma \rightarrow 0$, while the product $t = N\sigma^{2}$ remains constant, the empirical distribution $\mu_{N}$ converges weakly to a distribution with probability density function given by

(15)

supported on an interval whose endpoints are explicit functions of $t$.
One hopes that, similar to the situation discussed after Proposition 2, the asymptotic density (15) approximates the finite-$N$ empirical distribution to such good accuracy that one can replace $\mu_{N}$ by this asymptotic density, for many practical purposes. This possibility will not be further investigated in the present paper.
IV Towards learning applications
Riemannian Gaussian distributions were initially proposed as basic building blocks for learning models which aim to capture the intrinsic structure of statistical populations of covariance matrices. These include the mixture models and hidden Markov models, introduced in
[Sa16][mtnshmm] and further developed in [QT21]. In order to make use of these models in real-world applications, it is indispensable to know how to effectively compute the logarithm of the multiple integral $Z^{\beta}_{N}(\sigma)$ of (7), for a dimension $N$ which may be in the tens or hundreds.

Knowledge of $\log Z^{\beta}_{N}(\sigma)$ is already crucial in the simplest situation, where one tries to fit a single Riemannian Gaussian density (rather than a whole mixture) to data $Y_1,\ldots,Y_K$ in $\mathcal{P}^{\beta}_{N}$. Indeed, this will require setting $\bar{Y} = \hat{Y}$ and $\sigma = \hat{\sigma}$, where $\hat{Y}$ and $\hat{\sigma}$ are the maximum-likelihood estimates, given in [Sa16][Sa17]. Specifically,

$$\hat{Y} \,=\, \operatorname*{argmin}_{Y \,\in\, \mathcal{P}^{\beta}_{N}}\; \frac{1}{K}\, \sum_{k=1}^{K} d^{\,2}(Y_k, Y) \qquad (16)$$

is the Fréchet mean of the data $Y_1,\ldots,Y_K$, with respect to the distance (2), and $\hat{\sigma}$ is the solution of the nonlinear equation

$$\frac{1}{K}\, \sum_{k=1}^{K} d^{\,2}(Y_k, \hat{Y}) \,=\, \sigma^{3}\, \frac{d}{d\sigma}\, \log Z(\sigma) \qquad (17)$$
Therefore, it is already impossible to solve a toy problem, with a single Riemannian Gaussian density, without having some kind of hold on $\log Z^{\beta}_{N}(\sigma)$.
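As an illustration of how (16) and (17) are used in practice, here is a minimal Python sketch: the Fréchet mean is computed by the standard Riemannian fixed-point (Karcher mean) iteration, and (17) is solved by a Newton iteration in which the derivative of the log of the normalising factor is supplied by the user — for instance, by differentiating the approximation (11). The helper `dlogZ_dsigma` is hypothetical and must be provided; nothing here reproduces the exact algorithms of [Sa16][Sa17].

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, logm, expm

def frechet_mean(samples, num_iters=50):
    """Karcher-mean iteration for the Fréchet mean (16) of SPD/HPD matrices."""
    Y = np.mean(samples, axis=0)                     # initial guess: arithmetic mean
    for _ in range(num_iters):
        Y_sqrt = fractional_matrix_power(Y, 0.5)
        Y_inv_sqrt = fractional_matrix_power(Y, -0.5)
        # Riemannian gradient: average of the log-maps of the samples at Y
        G = np.mean([logm(Y_inv_sqrt @ S @ Y_inv_sqrt) for S in samples], axis=0)
        Y = Y_sqrt @ expm(G) @ Y_sqrt
        Y = (Y + Y.conj().T) / 2                     # re-symmetrise against round-off
    return Y

def solve_sigma(mean_sq_dist, dlogZ_dsigma, sigma0=1.0, num_iters=30, h=1e-5):
    """Newton iteration for (17): find sigma such that
    sigma^3 * dlogZ_dsigma(sigma) = mean_sq_dist.
    dlogZ_dsigma is a user-supplied callable (e.g. the derivative of (11))."""
    sigma = sigma0
    for _ in range(num_iters):
        f = sigma ** 3 * dlogZ_dsigma(sigma) - mean_sq_dist
        df = ((sigma + h) ** 3 * dlogZ_dsigma(sigma + h)
              - (sigma - h) ** 3 * dlogZ_dsigma(sigma - h)) / (2 * h)
        sigma -= f / df                              # finite-difference Newton step
    return sigma
```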
Until now (in [Sa16][Sa17]), $\log Z^{\beta}_{N}(\sigma)$ was approximated using an ad hoc Monte Carlo technique, which failed to produce coherent results beyond quite small dimensions $N$. In the present section, the aim will be to show that significantly improved results can be obtained by solving Equation (17) after approximating $\log Z^{\beta}_{N}(\sigma)$ with the expression afforded by (11), according to Proposition 2.
First, numerical experiments were conducted to verify the validity of this new approximation. According to (13), the right-hand side of (11) should approximate $\frac{1}{N^{2}}\log Z^{\beta}_{N}(\sigma)$ up to an error which becomes negligible as $N$ grows. To see that this is correct, the right-hand side of (11) was compared to certain exact expressions of $\log Z^{\beta}_{N}(\sigma)$. Namely, for the $\beta = 2$ case, one has the following expression, obtained in [habilitation],

(18)

and for the $\beta = 1$ case, when the dimension $N$ is even, the following expression, based on [Ti20],

(19)

where $\mathrm{Pf}$ denotes the Pfaffian, equal to the square root of the determinant, and the matrix in (19) has entries

(20)

with $\mathrm{erf}$ the error function.
Numerical evaluation of (18) for large values of $N$ or $\sigma$ is quite straightforward. Moreover, it immediately shows that (11) and (18) agree very closely when $\sigma$ is small compared to $N$, and then gradually diverge away from one another as $\sigma$ increases. Figure 1 provides graphical illustration for two values of $N$. Still larger values of $N$ yield an even stronger match between (11) and (18).
[Figure 1: comparison of the approximation (11) with the exact expression (18)]
It is not equally straightforward to numerically evaluate (19). Even at moderate dimensions $N$, we begin encountering overflow problems for moderate values of $\sigma$ when performing the computations in MATLAB. Still, as long as these overflow problems do not appear, it is possible to observe a close agreement between (11) and (19), as in the $\beta = 2$ case. This is shown in Figure 2 for two values of $N$.
[Figure 2: comparison of the approximation (11) with the exact expression (19)]
Based on the numerical results summarised in Figures 1 and 2, it seems possible to use the right-hand side of (11), multiplied by $N^{2}$, as a substitute for $\log Z^{\beta}_{N}(\sigma)$. While this is only an approximation, it is a highly efficient one, and has the advantage of being given by a rather simple analytic expression. In the $\beta = 1$ case, direct numerical evaluation of (19) is unstable for larger values of $N$ and $\sigma$, and (11) offers a practical way out of this problem.
This can be verified by using symbolic computation (Mathematica), which allows us to extend the evaluation of (19) to larger values of $\sigma$ and $N$. It is nonetheless associated with the drawback that the resulting curves tend to be artificially non-smooth, as in Figure 2(a), due to numerical artifacts. These non-smooth features also exist in the curve depicted in Figure 2(b), but are not visible at the given resolution. This behaviour is particularly problematic in the context of Equation (17), due to the presence of the derivative of $\log Z^{\beta}_{N}(\sigma)$. Furthermore, symbolic computation of (19) becomes exceedingly slow for sufficiently large $N$, and this can only be overcome by relying on approximations such as (11).
The second set of experiments directly addressed Equation (17), for a range of dimensions $N$. This equation was solved using the Newton method, after its right-hand side was approximated according to (11). Here, the solution obtained in this way will be denoted $\hat{\sigma}_{\mathrm{a}}$. This is an approximation of the maximum-likelihood estimate $\hat{\sigma}$ (the exact solution of (17)). If this approximation is any good, $\hat{\sigma}_{\mathrm{a}}$ should approach the true value of $\sigma$ for a sufficiently large number $K$ of data points. This can already be observed for moderate values of $N$ and $K$, as reported in Table I. Each entry in this table gives the average value and standard deviation of $\hat{\sigma}_{\mathrm{a}}$, calculated over independent trials (average ± standard deviation). It is clear that increasing $K$ only reduces the standard deviation of $\hat{\sigma}_{\mathrm{a}}$, without really affecting its average.

When $N$ is small, the right-hand side of (17) can still be approximated using the Monte Carlo technique mentioned in [Sa16][Sa17]. The solution obtained with this approximation will be denoted $\hat{\sigma}_{\mathrm{mc}}$. Table II shows that the estimation error from $\hat{\sigma}_{\mathrm{a}}$ is quite improved, in comparison with the estimation error from $\hat{\sigma}_{\mathrm{mc}}$. Moreover, $\hat{\sigma}_{\mathrm{mc}}$ seems to systematically overestimate the true value of $\sigma$.
When $N$ is larger, Monte Carlo approximation of the right-hand side of (17) is not feasible anymore, and one is left only with the approximation using (11). The solution $\hat{\sigma}_{\mathrm{a}}$ obtained from this approximation is shown in Table III, for two larger values of $N$. Here (contrary to Table II), $\hat{\sigma}_{\mathrm{a}}$ systematically underestimates the true value of $\sigma$. In fact, identical behaviour was observed for intermediate values of $N$, along with similar values of $\hat{\sigma}_{\mathrm{a}}$.
A practical means of overcoming this issue would be to include a penalty term into Equation (17), in order to enforce greater values of its solution. Then, the approximation using (11) can be successfully implemented for larger dimensions $N$, where Monte Carlo approximation becomes useless (the present discussion stopped at a moderate dimension only because the sampling algorithms used to generate the data points could not be taken any further).
In conclusion, the present section has demonstrated that the new approximation of $\log Z^{\beta}_{N}(\sigma)$, based on (11), significantly improves upon the existing Monte Carlo approximation, and also extends it to higher dimensions. In upcoming work, the experiments conducted in this section will be generalised to more realistic learning models which may be applied to real-world data (for example, the mixture models considered in [zanini2018]).
V Block-Toeplitz covariance matrices
Similar to the spaces introduced in Section II, consider the following two Siegel domains.

- The Hermitian Siegel domain: the space of $N \times N$ matrices with complex entries, whose operator norm is $< 1$.
- The symmetric Siegel domain: the space of symmetric $N \times N$ matrices with complex entries, whose operator norm is $< 1$.

Here, operator norm means the largest singular value. The focus will be restricted to these two cases, since they are closely related to time-series analysis and signal processing (a third case is similar, but involves quaternion matrices).

Specifically, assume a wide-sense stationary multivariate time series is described by its autocovariance matrix. Wide-sense stationarity implies this autocovariance matrix has a block-Toeplitz structure, with blocks of size $N \times N$. When solving an optimal prediction problem, one may apply a multidimensional Szegö-Levinson algorithm to the autocovariance [jeuris1][yannthesis], and obtain a family of matrices: the zero-lag autocovariance of the original time series (a complex covariance matrix, that is, an element of $\mathcal{P}^{2}_{N}$), together with a sequence of so-called matrix reflection coefficients, each of which belongs to the Hermitian Siegel domain. If the autocovariance has Toeplitz blocks, then each reflection coefficient moreover belongs to the symmetric Siegel domain (which is a subspace of the Hermitian Siegel domain).
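The matrix version of the Szegö-Levinson algorithm is described in [jeuris1][yannthesis]. As a purely illustrative special case, the sketch below is the classical scalar Levinson-Durbin recursion (a univariate time series, so the blocks are 1×1): it maps an autocovariance sequence to reflection coefficients of modulus less than one, the scalar counterparts of the matrix reflection coefficients just discussed.

```python
import numpy as np

def reflection_coefficients(c):
    """Scalar Levinson-Durbin recursion.
    Input: real autocovariances c[0], ..., c[M] of a stationary time series.
    Output: reflection (partial autocorrelation) coefficients k_1, ..., k_M,
    each of modulus < 1."""
    a = np.array([1.0])                  # prediction-error filter of order 0
    E = c[0]                             # prediction-error power
    ks = []
    for m in range(1, len(c)):
        acc = np.dot(a, c[m:0:-1])       # sum_j a[j] * c[m - j], for j = 0, ..., m-1
        k = -acc / E
        ks.append(k)
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]              # order-update of the filter coefficients
        E = E * (1.0 - k * k)            # updated prediction-error power
    return np.array(ks)

# example: an AR(1) process x_t = 0.6 x_{t-1} + noise has autocovariances c[m] ~ 0.6^m,
# so only the first reflection coefficient is non-zero
print(reflection_coefficients(0.6 ** np.arange(6)))
```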
Riemannian geometry enters this picture in the following way [jeuris1][yannthesis]. Consider an information metric on the space of block-Toeplitz autocovariance matrices, equal to the Hessian of the entropy function. In terms of the new coordinates (the zero-lag autocovariance and the reflection coefficients), this information metric is a direct product of the affine-invariant metric (1) on the first coordinate, and of a scaled copy of the Siegel domain metric on each of the remaining coordinates (see formula (3.11) in [jeuris1]). Dropping the subscript, this Siegel domain metric is given by,

(21)

where the tangent space is isomorphic to the space of complex $N \times N$ matrices in the Hermitian case, and to the space of symmetric complex $N \times N$ matrices in the symmetric case (this isomorphism shows that the dimension of the Hermitian Siegel domain equals $2N^{2}$, and that of the symmetric Siegel domain equals $N(N+1)$). The geodesic distance associated with the Siegel metric (21) has the following expression [jeuris1][yannthesis],

(22)

where the quantity appearing in this expression is called the matrix cross-ratio.
Riemannian Gaussian distributions on the Siegel domain are given by a probability density function which is of the same form as in (3), but with the distance determined by (22), and with the normalising factor

(23)

computed with respect to the Riemannian volume of the Siegel metric (21). This volume element is expressed in terms of the real and imaginary parts of the matrix entries (the product runs over the upper-triangular entries in the symmetric case, and over all entries in the Hermitian case).
As shown in [Sa17], a Riemannian Gaussian distribution on the space of block-Toeplitz covariance matrices is just a product of independent Riemannian Gaussian distributions, one for each of the coordinates introduced above. For the first coordinate (the zero-lag autocovariance), this is a Riemannian Gaussian distribution on $\mathcal{P}^{2}_{N}$, already considered in Section II. Thus, to understand Riemannian Gaussian distributions of block-Toeplitz covariance matrices, it only remains to study Riemannian Gaussian distributions on the Siegel domain.
In Section III, it was seen that Riemannian Gaussian distributions on $\mathcal{P}^{\beta}_{N}$ are equivalent to log-normal matrix ensembles. On the other hand, Riemannian Gaussian distributions on the Siegel domains are equivalent to “acosh-normal" ensembles, which will be described in Proposition 4 below.
Note first that each matrix $\Omega$ in the Siegel domain can be factorised in the following way,

(24)

where the two outer factors are unitary and the middle factor is diagonal, with non-negative diagonal elements. For the Hermitian Siegel domain, this factorisation follows from the singular value decomposition of $\Omega$; for the symmetric Siegel domain, it follows from the Takagi decomposition of $\Omega$ [edelman], in which case the second unitary factor is the transpose of the first. In either case, the diagonal elements of the middle factor are the singular values of $\Omega$, which are all $< 1$.
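For the symmetric Siegel domain, the Takagi factorisation can be obtained from an ordinary singular value decomposition. The following sketch assumes distinct singular values (so that the intermediate matrix is diagonal); the function name is illustrative.

```python
import numpy as np

def takagi(Omega):
    """Takagi factorisation Omega = U @ diag(s) @ U.T of a complex symmetric matrix,
    computed from its SVD. Assumes the singular values of Omega are distinct."""
    W, s, Vh = np.linalg.svd(Omega)
    V = Vh.conj().T
    phases = np.diag(W.conj().T @ V.conj())      # diagonal unitary linking W and conj(V)
    U = W * np.sqrt(phases)[None, :]             # U = W @ diag(sqrt(phases))
    return U, s

# a random point of the symmetric Siegel domain: complex symmetric, operator norm < 1
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
S = A + A.T
Omega = 0.5 * S / np.linalg.norm(S, 2)           # rescale so the largest singular value is 0.5
U, s = takagi(Omega)
print(np.allclose(U @ np.diag(s) @ U.T, Omega))  # the factorisation holds
print(s.max() < 1.0)                             # singular values are all < 1
```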
Proposition 4.

Let $\Omega$ follow a Riemannian Gaussian distribution on the Siegel domain, with centre-of-mass parameter equal to the zero matrix. If $\Omega$ is factorised as in (24), then the unitary factors are uniformly distributed on the unitary group $U(N)$. Moreover, suitably transformed singular values of $\Omega$ have joint probability density function

(25)

where the weight function appearing in (25) is the “acosh-normal" weight function.
Note that the assumption that the centre-of-mass parameter is equal to the zero matrix does not entail any loss of generality. The transformation [yannthesis]

(26)

maps the centre of mass $\bar{\Omega}$ to the zero matrix, while preserving the Siegel domain metric (21) and the associated distance and Riemannian volume. Thus, if $\Omega$ follows a Riemannian Gaussian density with centre of mass $\bar{\Omega}$, it is enough to replace $\Omega$ by its image under this transformation, which then follows a Riemannian Gaussian density with centre of mass equal to the zero matrix and with the same $\sigma$.
Proposition 4 implies that the transformed singular values follow the classical eigenvalue distribution of an orthogonal (symmetric Siegel domain) or unitary (Hermitian Siegel domain) matrix ensemble with “acosh-normal" weight function. Therefore, in particular, the normalising factor of (23) reduces to a multiple integral (compare to (6) and (10)),

(27)

involving the Vandermonde determinant and the “acosh-normal" weight function.
Thus, Riemannian Gaussian distributions on the Siegel domain can be studied by applying the tools of random matrix theory to new “acosh-normal" matrix ensembles. Hopefully, in the near future, this will lead to results similar to the ones obtained with log-normal ensembles in Section III, paving the way to learning from datasets of high-dimensional block-Toeplitz covariance matrices.
Appendix A Proofs of Propositions 1 to 4
A-A Proof of Proposition 1
If $Y$ follows the Riemannian Gaussian density (3) with $\bar{Y} = I$, then

(28)

Let $x_1,\ldots,x_N$ denote the eigenvalues of $Y$. Using (2) and the fact that $\bar{Y} = I$, (28) becomes

(29)

Recall that $x_i = e^{r_i}$. Accordingly, if $x_1,\ldots,x_N$ are the eigenvalues of $Y$, an elementary calculation yields

$$\sinh\!\left(\frac{|r_i - r_j|}{2}\right) \,=\, \frac{|x_i - x_j|}{2\sqrt{x_i\,x_j}}$$

Therefore, (29) can be written

(30)

To conclude, it is enough to use once more the definition $x_i = e^{r_i}$. This implies

$$dr_i \,=\, \frac{dx_i}{x_i}$$

Thus, using this fact, and changing the variable of integration from $(r_1,\ldots,r_N)$ to $(x_1,\ldots,x_N)$ in (30), it follows that the eigenvalues have the joint density (9), as required in (8).
A-B Proof of Proposition 2
The case $\beta = 2$ has already been proved in [habilitation]. To deal with the general case, write (10) under the form of a constant prefactor multiplied by a multiple integral over the eigenvalues. Now, if $\hat{\mu}_N$ denotes the empirical distribution of the integration variables (the average of the Dirac measures sitting at each of them), then
(31) |
where is the so-called energy functional
(32) |
defined for any probability distribution on . This energy functional satisfies the following scaling equation
(33) |
which is easily obtained after dividing (32) by . The proof can now be completed by applying to (31) and (33) the arguments in [deift]. First [deift] (Corollary 6.90, Page 155),
where is the so-called equilibrium distribution, the unique minimiser of the energy (32) among probability distributions on . From the scaling equation (33), it is now clear
(34) |
Therefore, by adding the limit of ,
(35) |
However, when $\beta = 2$, it is already known that this limit coincides with the right-hand side of (11). This identifies the remaining unknown term, which can be replaced back into (35), yielding the general case of Proposition 2.
A-C Proof of Proposition 3

To prove Proposition 3, it is possible to use the scaling equation (33) once more. From [deift] (Section 6.4), if the empirical distribution is defined as in (14), but with the rescaled variables instead of the original ones, then it converges weakly to the equilibrium distribution. On the other hand, the scaling equation (33) relates the two equilibrium distributions, and the rescaled one is already known to be the image of the distribution with density defined as in (15), under the corresponding change of variables. This holds for any value of $\beta$, and the Proposition then follows by changing the variables back.
A-D Proof of Proposition 4
The proof relies on the general theory of Gaussian distributions on Riemannian symmetric spaces, as outlined in [Sa17][habilitation]. Here, to keep the proof self-contained, it will be helpful to briefly recall certain aspects of this theory (for a more detailed, in-depth discussion, the reader is referred to [habilitation], Sections 1.9 and 3.3).
A Gaussian distribution on a Riemannian symmetric space $M$ is defined by its probability density function

$$p(x\,|\,\bar{x},\sigma) \,=\, \frac{1}{Z(\sigma)}\, \exp\!\left[-\,\frac{d^{\,2}(x,\bar{x})}{2\sigma^{2}}\right] \qquad (36)$$

with respect to the Riemannian volume on $M$, with $\bar{x} \in M$ and $\sigma > 0$.
It is always assumed that $M$ is a Riemannian symmetric space of non-positive curvature, associated to a symmetric pair $(G,K)$. Precisely, $G$ is a connected Lie group which acts transitively and isometrically on $M$ (this action is denoted $g \cdot x$, for $g \in G$ and $x \in M$), and $K$ is a compact subgroup of $G$, made up of those elements which fix a certain point $o \in M$ (that is, $k \in K$ if and only if $k \cdot o = o$).
For a concrete understanding of Gaussian distributions on $M$, it is necessary to take a closer look at the Lie algebras of $G$ and $K$, denoted $\mathfrak{g}$ and $\mathfrak{k}$, respectively. These are related together by the so-called Cartan decomposition, $\mathfrak{g} = \mathfrak{k} \oplus \mathfrak{p}$ (direct sum), where the subspace $\mathfrak{p}$ of $\mathfrak{g}$ can be identified with the tangent space $T_{o}M$ (tangent space to $M$ at the point $o$). In terms of this decomposition, the main construction needed for the proof can be described as follows.
Without any loss of generality, $G$ and $K$ are taken to be matrix Lie groups, and $\mathfrak{g}$ and $\mathfrak{k}$ their matrix Lie algebras. Let $\mathfrak{a}$ be a maximal abelian subspace of $\mathfrak{p}$ (that is, all the matrices in $\mathfrak{a}$ commute with each other). Then, each matrix $v \in \mathfrak{p}$ can be written under the form $v = k\,a\,k^{-1}$, where $k \in K$ and $a \in \mathfrak{a}$ (this factorisation is the general template for the fifty-three matrix factorisations outlined in [edelman]). Moreover, any $x \in M$ admits a representation

$$x \,=\, k\,\exp(a)\cdot o \qquad (37)$$

where $\exp$ denotes the matrix exponential. Incidentally, this representation is not unique, but almost every $x$ (every $x$, except a subset of zero volume) has exactly $|W|$ couples $(k,a)$ which satisfy (37), where $|W|$ is the number of elements of the Weyl group of the symmetric pair $(G,K)$.
It is now possible to state the following Lemma 1 [Sa17][habilitation], which will yield the entire proof, by direct application.
Lemma 1.
Let $x$ follow the Gaussian density (36) on $M$, with $\bar{x} = o$. If $x$ is represented as in (37), then $k$ is uniformly distributed on the compact group $K$. Moreover, $a$ has the following probability density function on $\mathfrak{a}$ (recall that $\mathfrak{a}$ is a real vector space, so the density is with respect to the usual Lebesgue measure on $\mathfrak{a}$)

(38)

where $\|a\|$ is the Riemannian norm of $a$ (since $a \in \mathfrak{a} \subset \mathfrak{p}$, it can be identified with a vector in $T_{o}M$), and $\{\lambda\}$ is a set of positive roots on $\mathfrak{a}$, with respective multiplicities $m_{\lambda}$ (each $\lambda$ is a certain linear function on $\mathfrak{a}$).
Recall that positive roots are any set of linear functions $\lambda$ on $\mathfrak{a}$ such that, for any $a \in \mathfrak{a}$, the eigenvalues of the linear operator $\mathrm{ad}(a)$, given by $\mathrm{ad}(a)\,v = [a,v]$ (this is the matrix commutator $av - va$), are equal to $0$ and $\pm\lambda(a)$, with respective multiplicities $m_{\lambda}$.
The proof may now begin in earnest. It merely consists in identifying $M$, $G$, $K$, $\mathfrak{g}$, $\mathfrak{k}$, $\mathfrak{p}$, $\mathfrak{a}$, and the positive roots, for each Siegel domain, and then writing down the corresponding version of Lemma 1, which directly yields Proposition 4. Fortunately, all the necessary information can be found in [piatetski][helgason].
The symmetric pair: each Siegel domain is associated to a symmetric pair $(G,K)$, where $G$ and $K$ are groups of $2N \times 2N$ complex matrices, defined in the following way (here, an overline denotes the complex conjugate),

where the two matrices appearing in these definitions are

These groups $G$ and $K$ act on the Siegel domain by matrix fractional transformations