Matrix Means and a Novel High-Dimensional Shrinkage Phenomenon

October 16, 2019 · Asad Lodhia, et al.

Many statistical settings call for estimating a population parameter, most typically the population mean, from a sample of matrices. The most natural estimate of the population mean is the arithmetic mean, but there are many other matrix means that may behave differently, especially in high dimensions. Here we consider the matrix harmonic mean as an alternative to the arithmetic matrix mean. We show that in certain high-dimensional regimes, the harmonic mean yields an improvement over the arithmetic mean in estimation error as measured by the operator norm. Counter-intuitively, studying the asymptotic behavior of these two matrix means in a spiked covariance estimation problem, we find that this improvement in operator norm error does not imply better recovery of the leading eigenvector. We also show that a Rao-Blackwellized version of the harmonic mean is equivalent to a linear shrinkage estimator that has been studied previously in the high-dimensional covariance estimation literature. Simulations complement the theoretical results, illustrating the conditions under which the harmonic matrix mean yields an empirically better estimate.


1 Introduction

Matrix estimation problems arise in statistics in a number of areas, most prominently in covariance estimation, but also in network analysis, low-rank recovery and time series analysis. Typically, the focus is on estimating a matrix based on a single noisy realization of that matrix. For example, the problem of covariance estimation [24] focuses on estimating a population covariance matrix based on a sample of vectors, which are usually combined to form a sample covariance matrix or another estimate of the population covariance. In network analysis and matrix completion problems, the goal is typically to estimate the expectation of a matrix-valued random variable based on a single observation under suitable structural assumptions (see, e.g., [18] and citations therein). A related setting that has received less attention is when we observe a sample of matrices and aim to estimate an underlying population mean or other parameter. This arises frequently in neuroimaging data analysis, where each matrix represents connectivity within a particular subject's brain and the goal is to estimate a population brain connectivity pattern [10, 47].

The most direct approach to estimating the underlying population mean from a sample of matrices is to take the arithmetic (sample) mean, perhaps with some regularization to ensure the stability of the resulting estimate. The arithmetic matrix mean is the mean with respect to Euclidean geometry on the space of matrices, which is often not the most suitable geometry for a matrix model. A simple example can be found in our recent work [35], where we showed that in the problem of estimating a low-rank expectation of a collection of independent random matrices with different variances, a weighted average improves upon the naïve arithmetic matrix mean. Somewhat surprisingly in light of the rich geometry of matrices, fairly little attention has been paid in the literature to the use of other matrix geometries and their associated means. An exception is work by Schwartzman [45], who argued for using the intrinsic geometry of the positive definite cone [7] in the problem of covariance estimation, since the covariance matrix has to be positive (semi-)definite, and showed that a mean with respect to this different matrix geometry can, under certain models, yield an appreciably better estimate. Recent work has considered Fréchet means in the context of multiple-network analysis [37]. Continuing in this vein, the goal of the current work is to better understand the sampling distribution of matrix means other than the arithmetic mean under different matrix geometries.

The subject of computing means with respect to different geometries has been studied at some length in the signal processing and computer vision communities, mostly in the context of the Grassmann and Stiefel manifolds [1, 22, 39]. See [46] for a good discussion of how taking intrinsic geometry into account leads to estimators other than the arithmetic mean. Recent work has considered similar geometric concerns in the context of network data [30, 37]. Kolaczyk and coauthors [30] considered the problem of averaging multiple networks on the same number of vertices, developed a novel geometry for this setting and derived a Fréchet mean for that geometry. Recent work by Lunagómez, Olhede and Wolfe [37] considered a similar network averaging problem, and presented a framework both for specifying distributions of networks and for deriving corresponding sample means.

Unfortunately, most of these matrix means are not amenable to the currently available tools from random matrix theory. Instead, in this paper we consider the behavior of the harmonic mean of a collection of random matrices, as an example of a matrix mean that arises from a markedly different notion of matrix geometry compared to the arithmetic matrix mean. The harmonic mean turns out to be well-suited to analysis using techniques from random matrix theory, and it is our hope that the results established here may be extended to other related matrix means in the future. Building on random matrix results by the first author [36], we show how the harmonic matrix mean can, in certain regimes, yield a better estimate of the population mean matrix in spectral norm compared to the arithmetic mean, but not necessarily a better estimate of the top eigenvector in a spiked covariance model, drawing an important distinction between two measures of performance that are often assumed to behave similarly. We characterize the settings in which the harmonic matrix mean improves upon the more naïve arithmetic matrix mean as well as the settings in which it does not, and show the implications of these results for covariance estimation.

In the rest of this section, we briefly review the literature on covariance matrix estimation, since we focus on estimating the mean of a collection of Wishart matrices, which can be thought of as sample covariance matrices. There is an extensive literature on estimating a population covariance matrix on $p$ variables from $n$ observations. The sample covariance matrix is the maximum likelihood estimator for Gaussian data, and when the dimension $p$ is fixed, there are classical results that fully describe its behavior [40, 2]. When $p$ is allowed to grow with $n$, the sample covariance matrix is not well-behaved, and in particular becomes singular as soon as $p > n$. There has been extensive work on understanding this phenomenon in random matrix theory, starting from the pioneering work of [38], with many more recent results focusing especially on estimating the spectrum [50, 23]. Much work in random matrix theory has focused on the spiked model, in which the population covariance is the sum of the identity matrix and a low-rank signal matrix [28, 6, 12, 25].

The recent literature on covariance estimation in high dimensions (see [24] for a review) has focused on addressing the shortcomings of the sample covariance, mainly by applying regularization. James–Stein-type shrinkage was considered in early Bayesian approaches [20] and in the Ledoit–Wolf estimator [33], which shrinks the sample covariance matrix towards the identity matrix using coefficients optimized with respect to a Frobenius norm loss. Subsequent papers presented variants with different estimates of the optimal coefficients and different choices of the shrinkage target matrix for normal data [19, 26], as well as for the more general case of finite fourth moments [48]. Follow-up work presented a nonlinear regularization scheme [34], motivated by the fact that linear shrinkage of the sample covariance fails to account for the nonlinear dispersion of the sample eigenvalues.

An alternative approach to regularization of the sample covariance matrix is to impose structural constraints. This class of methods includes banding or tapering of the covariance matrix, suitable when the variables have a natural ordering [49, 8, 44], and thresholding when the variables are not ordered [9, 44, 11]. Minimax rates for many structured covariance estimation problems are now known [14, 16, 15]; see [13] for a good overview of these results.
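To make the structural approaches concrete, here is a minimal sketch of banding and entrywise soft-thresholding applied to a sample covariance matrix, in the spirit of [8] and [9]; the function names and the choice to leave the diagonal unthresholded are our own illustrative conventions, not a prescription from those papers.

```python
import numpy as np

def band(S, k):
    """Keep entries within k diagonals of the main diagonal; zero the rest."""
    p = S.shape[0]
    mask = np.abs(np.subtract.outer(np.arange(p), np.arange(p))) <= k
    return S * mask

def soft_threshold(S, t):
    """Entrywise soft-thresholding of off-diagonal entries at level t."""
    T = np.sign(S) * np.maximum(np.abs(S) - t, 0.0)
    np.fill_diagonal(T, np.diag(S))  # leave the diagonal untouched
    return T
```

Banding is appropriate when the variables have a natural ordering (so that far-apart variables are nearly uncorrelated), while thresholding makes no use of variable ordering.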

The remainder of this paper is laid out as follows. In Section 2 we establish notation and introduce the random matrix models under study in the present paper. In Section 3, we establish the asymptotic behavior of the harmonic matrix mean under these models. In Section 4 we compute a Rao-Blackwellized version of the harmonic mean of two random covariance matrices. In Section 5, we analyze a spiked covariance model and compute the limit of the inner product of the top eigenvector of the harmonic mean with the top eigenvector of the population covariance. Finally, Section 6 briefly presents numerical simulations highlighting the settings in which the harmonic matrix mean succeeds and fails in covariance estimation. Section 7 concludes with discussion.

2 Problem Setup

We begin by establishing notation. We denote the identity matrix by $I$, with its dimension clear from context. For a matrix $M$, $\|M\|$ denotes its operator norm and $\|M\|_F$ denotes its Frobenius norm. For a set $A$, let $\mathbb{1}_A(x) = 1$ if $x \in A$ and $\mathbb{1}_A(x) = 0$ otherwise. We denote by $\mathcal{S}_p^{+}$ and $\mathcal{H}_p^{+}$ the spaces of $p \times p$ symmetric and Hermitian positive definite matrices, respectively. For a symmetric or Hermitian matrix $M$, the eigenvalues of $M$ are denoted $\lambda_1(M) \geq \lambda_2(M) \geq \cdots \geq \lambda_p(M)$, and their corresponding eigenvectors are denoted $v_1(M), v_2(M), \ldots, v_p(M)$. We use $\preceq$ for the positive semidefinite ordering, so that $A \preceq B$ if and only if $B - A$ is positive semidefinite.

Suppose that we wish to estimate the population mean of a collection of independent identically distributed self-adjoint positive definite $p$-by-$p$ random matrices. The most commonly used model for positive (semi)definite random matrices is the Wishart distribution, which arises in covariance estimation and is well-studied in the random matrix theory literature.

Definition 1 (Wishart Random Matrix: Real Observation Model).

Let $X$ be a $p \times n$ random matrix with columns drawn i.i.d. from a centered normal with covariance $\Sigma$. Then
$$W = \frac{1}{n} X X^{\top}$$
is a real-valued random Wishart matrix with parameters $n$ and $\Sigma$.

Some of our results are more straightforwardly stated and proved for the complex-valued version of the Wishart distribution, which we define here for the special case of identity covariance.

Definition 2 (Wishart Random Matrix: Complex Observation Model).

Let $X$ be a $p \times n$ random matrix with i.i.d. complex standard Gaussian random entries, i.e., entries of the form
$$x = \frac{g_1 + i g_2}{\sqrt{2}},$$
where $g_1$ and $g_2$ are independent standard real Gaussian random variables. Then $W = \frac{1}{n} X X^{*}$ is a random matrix following the complex Wishart distribution with parameters $n$ and $I$.
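For concreteness, the following sketch samples normalized Wishart matrices under both observation models, following the conventions of Definitions 1 and 2 (normalization by $n$, unit-variance complex entries); the function names are ours.

```python
import numpy as np

def real_wishart(p, n, Sigma_sqrt, rng):
    """Normalized real Wishart: sample covariance of n i.i.d. N(0, Sigma) columns."""
    X = Sigma_sqrt @ rng.standard_normal((p, n))
    return X @ X.T / n

def complex_wishart_identity(p, n, rng):
    """Normalized complex Wishart with identity covariance (Definition 2)."""
    X = (rng.standard_normal((p, n)) + 1j * rng.standard_normal((p, n))) / np.sqrt(2)
    return X @ X.conj().T / n

rng = np.random.default_rng(0)
W = real_wishart(5, 50, np.eye(5), rng)   # p = 5 variables, n = 50 observations
```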

Let $X_1, X_2, \ldots, X_N$ be a sequence of independent identically distributed $p \times n$ matrices with columns drawn i.i.d. from a centered normal with covariance $\Sigma$. Then for each $k = 1, 2, \ldots, N$,
$$\widehat{\Sigma}_k = \frac{1}{n} X_k X_k^{\top} \tag{1}$$
is the sample covariance matrix, which follows the real-valued Wishart distribution with parameters $n$ and $\Sigma$. Most previous approaches to covariance estimation and other related matrix estimation problems have focused on bounding the error between an estimator $\widehat{\Sigma}$ and $\Sigma$. This error is most commonly measured in Frobenius norm or operator norm, the latter of which is more relevant in some applications since, by the Davis–Kahan theorem [21], small operator norm error implies that one can recover the leading eigenvectors of $\Sigma$. This is of particular interest in covariance estimation when the task at hand is principal component analysis (see Section 5), but is also relevant in other problems when $\Sigma$ is low-rank. For example, in the case of network analysis [35], the eigenvectors of the matrix being estimated encode community structure.

Even in the modestly high-dimensional regime of $p \asymp n$, estimating $\Sigma$ is more challenging. When $\Sigma = I$, the spectral measure of each $\widehat{\Sigma}_k$ satisfies the Marčenko–Pastur law with parameter $p/n$ in the large-$p$ limit. In fact, we have the stronger result (see Proposition 1 below) that
$$\left\| \widehat{\Sigma}_k - I \right\| \to 2\sqrt{y} + y \quad \text{almost surely, where } y = \lim_{n \to \infty} p/n.$$

A straightforward estimator of $\Sigma$ in this setting is the arithmetic mean of the matrices,
$$\bar{\Sigma} = \frac{1}{N} \sum_{k=1}^{N} \widehat{\Sigma}_k,$$
which can be equivalently expressed as
$$\bar{\Sigma} = \frac{1}{Nn} \sum_{k=1}^{N} X_k X_k^{\top}.$$
The arithmetic mean is a sample covariance based on a total of $Nn$ observations in this case, since we center by the known rather than estimated observation mean, and every covariance matrix is based on the same number of observations. Note that in the present work we assume that the observed data are mean-zero, and thus there is no need to center the observations about a sample mean. This assumption comes with minimal loss of generality, since the centered sample covariance matrix of a collection of $n$ normal random variables is Wishart distributed with parameter $n - 1$ in place of $n$, which has no effect on the asymptotic analyses below.
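As a quick numerical illustration of the equivalence just described (with arbitrary illustrative dimensions), the following snippet checks that the arithmetic matrix mean coincides with the pooled sample covariance when every matrix is based on the same number of mean-zero observations:

```python
import numpy as np

rng = np.random.default_rng(1)
p, N, n_per = 50, 4, 200                                # N matrices, n_per observations each
Xs = [rng.standard_normal((p, n_per)) for _ in range(N)]
covs = [X @ X.T / n_per for X in Xs]

arith = sum(covs) / N                                   # arithmetic matrix mean
pooled = np.hstack(Xs) @ np.hstack(Xs).T / (N * n_per)  # pooled sample covariance

print(np.allclose(arith, pooled))                       # True: the two coincide
```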

Remark 1.

In practical applications, there are situations where pooling observations is not appropriate, and the arithmetic mean may be ill-suited to estimating $\Sigma$ as a result. For example, in resting-state fMRI data, pooling observations from different subjects at a given brain region is infeasible, as the response signals at a particular brain location are not time-aligned across subjects. Nonetheless, combining sample covariance or correlation matrices across subjects via some other procedure may still be valid for estimating the population covariance or correlation matrix.

Keeping the number of observed matrices $N$ fixed and letting $p$ and $n$ grow, for the case $\Sigma = I$ and defining $y = \lim_{n \to \infty} p/(Nn)$, we have (see Proposition 1)
$$\left\| \bar{\Sigma} - I \right\| \to 2\sqrt{y} + y \quad \text{almost surely}.$$

Throughout this paper, $n$ will denote the total number of observations of points in $p$-dimensional space. The regime of interest is that in which $p/n \to y \in (0, 1)$, and we will consider $N$ matrices, where $N$ is a fixed number of matrices, with $p$ tending to infinity with $n$. Since each matrix is then based on $n/N$ observations, it will be convenient to define the per-matrix aspect ratio $c = Np/n$, which has the relationship $c \to Ny$.

In past work [36], the first author analyzed the behavior of the harmonic mean of a collection of independent Wishart random matrices in the regime $p/n \to y \in (0,1)$. The matrix harmonic mean of $W_1, W_2, \ldots, W_N$ is defined as
$$H = \left( \frac{1}{N} \sum_{i=1}^{N} W_i^{-1} \right)^{-1}, \tag{2}$$
provided that $p < n/N$ (so that the $W_i$ are invertible almost surely). While the harmonic mean is also inconsistent as an estimator for $\Sigma$, we will show that its operator-norm bias is, under certain conditions, smaller than that of the arithmetic mean.
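A direct implementation of Equation (2) is short; this sketch assumes each $W_i$ is invertible (more observations than dimensions in each cell), and the function name is ours.

```python
import numpy as np

def harmonic_mean(mats):
    """Matrix harmonic mean: the inverse of the average of the inverses (Equation (2))."""
    inv_avg = sum(np.linalg.inv(W) for W in mats) / len(mats)
    return np.linalg.inv(inv_avg)
```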

3 Improved Operator Norm Error of the Harmonic Mean

When the $W_i$ are drawn from the same underlying population, the harmonic mean can be a better estimate of the population mean in operator norm than the arithmetic mean [36]. This improvement is best understood as a data-splitting result, in which we partition a sample of $n$ $p$-dimensional observations and compute the harmonic mean of the covariance estimates computed from each part. This is certainly counter-intuitive, but we remind the reader that our intuitions are often wrong in the high-dimensional regime.

Let $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ be a set of points in $\mathbb{C}^p$, and let $\Pi = \{\pi_1, \pi_2, \ldots, \pi_N\}$ be a partition of $[n] = \{1, 2, \ldots, n\}$ into disjoint subsets,
$$[n] = \pi_1 \cup \pi_2 \cup \cdots \cup \pi_N.$$
Define the Wishart random matrix associated with each $\pi_i$ as
$$W_i = \frac{1}{|\pi_i|} \sum_{j \in \pi_i} x_j x_j^{*},$$
and the arithmetic and harmonic means associated with $\Pi$ as, respectively,
$$A_{\Pi} = \frac{1}{N} \sum_{i=1}^{N} W_i \qquad \text{and} \qquad H_{\Pi} = \left( \frac{1}{N} \sum_{i=1}^{N} W_i^{-1} \right)^{-1},$$
provided that $W_i$ is invertible for all $i = 1, 2, \ldots, N$.

If the sets making up the partition $\Pi$ are all of the same size, then $A_{\Pi}$ is in fact simply the sample covariance of the vectors in $\mathcal{X}$ and does not depend on $\Pi$. The convergence of the spectrum of $A_{\Pi}$ is classical [38]. The statement given here can be found in the modern reference [3, Theorems 3.6–7 and Theorems 5.9–10].

Proposition 1 (Marčenko-Pastur law).

Suppose $x_1, x_2, \ldots, x_n$ is a collection of i.i.d. $p$-dimensional complex Gaussians with zero mean and covariance $I$, with $p$ and $n$ tending to infinity in such a way that $p/n \to y \in (0,1)$, and let $\Pi_n$ be a deterministic sequence of partitions of $[n]$ such that the cell sizes $|\pi|$, $\pi \in \Pi_n$, are equal. The spectral measure of $A_{\Pi_n}$ converges weakly almost surely to the measure with density
$$\rho_y(x) = \frac{\sqrt{(\lambda_+ - x)(x - \lambda_-)}}{2 \pi y x} \, \mathbb{1}_{[\lambda_-, \lambda_+]}(x),$$
where
$$\lambda_{\pm} = \left( 1 \pm \sqrt{y} \right)^2. \tag{3}$$
Further, we have the convergence
$$\left\| A_{\Pi_n} - I \right\| \to 2\sqrt{y} + y \quad \text{almost surely}.$$
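The density in Proposition 1 is easy to evaluate numerically. The following sketch compares the support edges $\lambda_{\pm} = (1 \pm \sqrt{y})^2$ against the extreme eigenvalues of a simulated sample covariance; we use real rather than complex Gaussians, which share the same Marčenko–Pastur limit.

```python
import numpy as np

def mp_density(x, y):
    """Marčenko–Pastur density with aspect ratio y = p/n (identity covariance)."""
    lam_m, lam_p = (1 - np.sqrt(y)) ** 2, (1 + np.sqrt(y)) ** 2
    rho = np.zeros_like(x, dtype=float)
    on = (x > lam_m) & (x < lam_p)
    rho[on] = np.sqrt((lam_p - x[on]) * (x[on] - lam_m)) / (2 * np.pi * y * x[on])
    return rho

rng = np.random.default_rng(2)
p, n = 500, 2000                       # y = p/n = 0.25
X = rng.standard_normal((p, n))
eigs = np.linalg.eigvalsh(X @ X.T / n)
print(eigs.min(), (1 - 0.5) ** 2)      # both near (1 - sqrt(y))^2 = 0.25
print(eigs.max(), (1 + 0.5) ** 2)      # both near (1 + sqrt(y))^2 = 2.25
```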

The following result describes the limiting behavior of $H_{\Pi_n}$ under similar conditions.

Proposition 2.

Let $x_1, x_2, \ldots, x_n$ be a collection of i.i.d. $p$-dimensional complex Gaussians with zero mean and covariance $I$, with $p$ and $n$ tending to infinity in such a way that
$$\frac{p}{n} \to y,$$
where $y$ is a constant and $y \in (0, 1/N)$. Let $N$ be fixed with $n$ divisible by $N$, and let $\Pi_n$ be a deterministic sequence of partitions of $[n]$ of size $N$ such that the cell sizes $|\pi|$, $\pi \in \Pi_n$, are equal. The spectral measure of $H_{\Pi_n}$ converges weakly almost surely to a deterministic measure with an explicit density supported on an interval $[\mu_-, \mu_+]$, where the edges $\mu_{\pm}$ are the roots of a quadratic equation given in [36, Theorem 2.1] (Equation (4)). Further, we have the almost sure convergence of $\|H_{\Pi_n} - I\|$ to a deterministic limit depending only on $y$ and $N$.

The above result is a restatement of the results of [36, Theorem 2.1], since if the cells $\pi \in \Pi_n$ are disjoint and equal-sized, then $W_1, W_2, \ldots, W_N$ are a collection of growing Wishart matrices of the same dimension and shared aspect-ratio parameter $Np/n$, with $Np/n \to Ny$. The case where $|\pi|$ is permitted to vary over $\pi \in \Pi_n$, while still feasibly handled by the tools of this paper, is more complicated. The limiting spectrum of the harmonic mean in this setting depends on the roots of a high-degree polynomial, with the result that comparison of the harmonic and arithmetic means requires a more subtle analysis. In contrast, when the cells of the partition are of equal size, the limiting spectral measure of the harmonic mean is characterized by the roots of a quadratic and thus admits an explicit solution. For the sake of concreteness and simplicity, we will assume that $\Pi_n$ is a partition with cells of equal size for the remainder of the paper.

With the interpretation of $H_{\Pi_n}$ as a mean formed by splitting $\mathcal{X}$ into $N$ equal parts, we have the following Theorem.

Theorem 1.

Under the assumptions of Proposition 2, the limiting operator norm error $\lim_n \|H_{\Pi_n} - I\|$ is minimized for a partition of size $N = 2$. Further, for such a partition, the limiting error is strictly smaller than the limiting error $2\sqrt{y} + y$ of the arithmetic mean.

Proof.

The limiting error, viewed as a function $f(N)$ of the partition size with $y$ held fixed, has a derivative that is greater than 0 whenever $N$ lies in a region determined by $y$. For $y \in (0, 1/2)$, this region includes the point $N = 2$, so that the minimizer of $f$ on the interval $[2, \infty)$ is $N = 2$. ∎

The above result suggests that, given a collection $\mathcal{X}$ of observations, it is better asymptotically (as measured in operator norm error) to estimate the covariance by splitting $\mathcal{X}$ into two equal parts $\mathcal{X}_1$ and $\mathcal{X}_2$ and computing the harmonic mean of the two resulting sample covariances than it is to directly compute the sample covariance matrix of $\mathcal{X}$. We note that for $N = 2$, the limiting operator norm error of the harmonic mean is smaller than $2\sqrt{y} + y$, so that $H_{\Pi}$ is closer to $I$. This in turn implies that, in the case where the true covariance matrix is the identity, the harmonic mean is shrunk toward the true population covariance when compared with the arithmetic mean. As we will see below in Section 4, Rao-Blackwellizing the harmonic mean yields yet another shrinkage estimator, one that has appeared elsewhere in the literature.
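The data-splitting recipe suggested by Theorem 1 is easy to try empirically. The sketch below uses real Gaussians as a stand-in for the complex model (as discussed in the next paragraph, real data are expected to behave the same), and at this aspect ratio it typically reports a smaller operator norm error for the harmonic mean.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 400, 1600                              # aspect ratio y = p/n = 0.25
X = rng.standard_normal((p, n))
I = np.eye(p)

arith = X @ X.T / n                           # sample covariance of all n observations

X1, X2 = X[:, : n // 2], X[:, n // 2 :]       # split the sample into two equal halves
W1 = X1 @ X1.T / (n // 2)
W2 = X2 @ X2.T / (n // 2)
harm = np.linalg.inv((np.linalg.inv(W1) + np.linalg.inv(W2)) / 2)

print("arithmetic:", np.linalg.norm(arith - I, 2))
print("harmonic:  ", np.linalg.norm(harm - I, 2))
```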

There are two main drawbacks to using the complex Gaussian model from Definition 2. The first is the requirement that the random vectors be complex-valued Gaussians, which is not the case for most datasets. The second is the requirement that the true covariance be equal to the identity. The need for complex Gaussian matrices arises in the proof of Theorem 2.1 in [36], which relies on a result in free probability [17]; the remainder of the argument extends immediately to real-valued data with sub-Gaussian entries [36, Remark 3]. This is not unusual: many applications of random matrix theory to statistics have first been established for complex random matrices, for technical reasons. For example, the celebrated Baik, Ben Arous and Péché transition for spiked covariance models was first established for complex Gaussian random vectors [4], and the Tracy–Widom law for the largest eigenvalue of a general sample covariance matrix was first established for complex Gaussians [29]. We believe these results can be made to apply to real random variables, and empirical simulations with multivariate real-valued Gaussians support this claim.

The requirement that the vectors have identity covariance is partially addressed by [36, Corollary 2.1.1], which we restate here.

Proposition 3.

Under the same assumptions as Proposition 2, let $N = 2$ and suppose $\Sigma$ is a positive definite matrix satisfying the explicit eigenvalue condition of [36, Corollary 2.1.1]. Then the conclusion of Theorem 1 continues to hold for data with population covariance $\Sigma$; that is, the harmonic mean retains its asymptotic operator norm advantage.

Since multiplying by $\Sigma^{1/2}$ on both sides gives a complex Wishart model with population covariance $\Sigma$ (see Remark 3 below), the bound in Proposition 3 holds so long as the condition number of $\Sigma$ lies in a certain range. See [36, Remark 2] for further discussion.

The harmonic mean has an interesting additional property with respect to the Frobenius norm. The arithmetic matrix mean is usually motivated as the minimizer of squared Frobenius norm error. A similar objective motivates many existing shrinkage estimators for the covariance matrix [33, 26]. Under the setting considered above, the harmonic matrix mean, despite not being optimized for this loss, matches the Frobenius norm error of the arithmetic mean asymptotically.

Lemma 1.

Under the conditions of Proposition 2, when $N = 2$ we have
$$\lim_{n \to \infty} \frac{1}{p} \left\| H_{\Pi_n} - I \right\|_F^2 = \lim_{n \to \infty} \frac{1}{p} \left\| A_{\Pi_n} - I \right\|_F^2 \quad \text{almost surely}.$$

Proof.

Since $H_{\Pi_n} - I$ is symmetric, by the almost sure weak convergence of the spectral measure of $H_{\Pi_n}$, it suffices to compute
$$\int (x - 1)^2 \, d\nu(x),$$
where $\nu$ is the limiting measure with edges defined in Equation (4), and compare with
$$\int (x - 1)^2 \, d\rho_y(x),$$
where $\rho_y$ is the Marčenko–Pastur density with edges defined in Equation (3). Expanding the squares reduces both computations to the first two moments of the respective limiting measures. Using Lemma 3 in the Appendix, the two second moments agree, and the claimed equality follows. ∎

4 Rao-Blackwell Improvement of the Harmonic Mean

The results in Section 3 are somewhat unexpected, and raise the question of whether other matrix means have similar properties. Analyzing other means, such as the geometric mean or the more complicated Fréchet means [7, 45], in the high-dimensional regime poses a significant challenge, since these operations currently fall outside the scope of known free probability theory. Random matrix techniques do, however, allow us to extend our analysis of the harmonic mean by computing its expectation conditioned on the arithmetic mean. By the Rao–Blackwell theorem, using this conditional expectation as an estimator yields an expected spectral norm error no worse than that of the unconditioned harmonic mean.

In this section we restrict ourselves to the model of Definition 1, so as to ensure the availability of explicit integrals for our quantities of interest. We expect that the same results can be established for the complex model in Definition 2, but doing so would require reworking the results of [31] (restated in the Appendix for ease of reference) for the complex Wishart ensemble, which is outside the scope of the present article. Further, we restrict our attention to $N = 2$ to facilitate comparison with the case studied in the previous section. To this end, let $[n] = \pi_1 \cup \pi_2$, where $\pi_1$ and $\pi_2$ are disjoint and of equal size $n/2$, and
$$W_i = \frac{2}{n} \sum_{j \in \pi_i} x_j x_j^{\top}, \qquad i = 1, 2.$$

The matrices $W_1$ and $W_2$ have densities [2, Theorem 7.2.2]
$$f_i(W) \propto \det(W)^{(n/2 - p - 1)/2} \exp\!\left( -\frac{n}{4} \operatorname{tr}\!\left( \Sigma^{-1} W \right) \right), \qquad i = 1, 2,$$
where the implied normalizing constant involves the multivariate gamma function. These densities are supported on the space of all symmetric positive definite real matrices. As before, define
$$A = \frac{W_1 + W_2}{2} \tag{5}$$
$$H = \left( \frac{W_1^{-1} + W_2^{-1}}{2} \right)^{-1} \tag{6}$$
and note that
$$A = \frac{1}{n} \sum_{j=1}^{n} x_j x_j^{\top}$$
is a Wishart random matrix with parameters $n$ and $\Sigma$, and thus
$$f_A(A) \propto \det(A)^{(n - p - 1)/2} \exp\!\left( -\frac{n}{2} \operatorname{tr}\!\left( \Sigma^{-1} A \right) \right), \tag{7}$$
where $A$ takes values in all of $\mathcal{S}_p^{+}$.

Recall that the matrix $A$ is a sufficient statistic for the covariance matrix $\Sigma$, and note that the loss function
$$L\!\left( \widehat{\Sigma}, \Sigma \right) = \left\| \widehat{\Sigma} - \Sigma \right\|$$
is convex in the variable $\widehat{\Sigma}$. By the Rao–Blackwell Theorem [43, 5a.2 (ii)], we have
$$\mathbb{E} \left\| \, \mathbb{E}[H \mid A] - \Sigma \, \right\| \leq \mathbb{E} \left\| H - \Sigma \right\|,$$
which is to say that, as an estimator, $H$ is outperformed by the conditional expectation $\mathbb{E}[H \mid A]$, which we now compute.

Observe that the harmonic mean satisfies [7, Section 4.1]
$$H = 2 W_1 \left( W_1 + W_2 \right)^{-1} W_2 = 2 W_2 \left( W_1 + W_2 \right)^{-1} W_1.$$
Averaging these two equations gives
$$H = W_1 (2A)^{-1} W_2 + W_2 (2A)^{-1} W_1,$$
and taking the conditional expectation yields
$$\mathbb{E}[H \mid A] = \mathbb{E}\!\left[ W_1 (2A)^{-1} W_2 \,\middle|\, A \right] + \mathbb{E}\!\left[ W_2 (2A)^{-1} W_1 \,\middle|\, A \right].$$
To compute the matrix-valued integrals on the right-hand side, we proceed by directly computing the conditional density of $W_1$ given $A$.

We begin with the joint density of $W_1$ and $W_2$:
$$f(W_1, W_2) \propto \det(W_1)^{(n/2 - p - 1)/2} \det(W_2)^{(n/2 - p - 1)/2} \exp\!\left( -\frac{n}{4} \operatorname{tr}\!\left( \Sigma^{-1} (W_1 + W_2) \right) \right).$$
We will use this formula to obtain an expression for the joint density of $W_1$ and $A$. For a symmetric matrix $M$ with entries $M_{ij}$, let $dM_{ij}$ denote Lebesgue measure over that entry and define
$$dM = \prod_{i \leq j} dM_{ij},$$
that is, $dM$ is the volume form of the matrix $M$. The “shear” transformation
$$(W_1, W_2) \mapsto (W_1, W_1 + W_2) = (W_1, 2A)$$
maps the domain $\mathcal{S}_p^{+} \times \mathcal{S}_p^{+}$ to the region
$$\left\{ (W_1, V) : 0 \prec W_1 \prec V \right\},$$
where we remind the reader that $\prec$ denotes the positive semidefinite ordering. The Jacobian of this mapping is $1$, hence the joint density is
$$f(W_1, A) \propto \det(W_1)^{(n/2 - p - 1)/2} \det(2A - W_1)^{(n/2 - p - 1)/2} \exp\!\left( -\frac{n}{2} \operatorname{tr}\!\left( \Sigma^{-1} A \right) \right).$$
To obtain the conditional distribution, we divide by (7), yielding
$$f(W_1 \mid A) \propto \frac{\det(W_1)^{(n/2 - p - 1)/2} \det(2A - W_1)^{(n/2 - p - 1)/2}}{\det(A)^{(n - p - 1)/2}}, \tag{8}$$
where $f(\cdot \mid A)$ is supported on the region
$$\left\{ W_1 \in \mathcal{S}_p^{+} : 0 \prec W_1 \prec 2A \right\}.$$
Evaluating this density at $W_1 = (2A)^{1/2} B (2A)^{1/2}$, for $0 \prec B \prec I$, gives a multivariate Beta distribution for $B$ (see Definition 4 in the Appendix). With this notation, we have
$$\mathbb{E}\!\left[ W_1 (2A)^{-1} W_2 \,\middle|\, A \right] = (2A)^{1/2} \, \mathbb{E}\!\left[ B (I - B) \,\middle|\, A \right] (2A)^{1/2};$$
the integration over $B$ can be done using Theorem 5 in the Appendix, yielding the following Lemma.

Lemma 2.

For any $G$ that is a function of $A$ taking values in the space of $p \times p$ matrices, the conditional expectation $\mathbb{E}\!\left[ W_1 \, G(A) \, W_2 \,\middle|\, A \right]$ is given in closed form by Theorem 5 in the Appendix, with scalar coefficients depending only on $n$ and $p$.

Setting $G(A) = (2A)^{-1}$ yields
$$\mathbb{E}\!\left[ W_1 (2A)^{-1} W_2 \,\middle|\, A \right] = c_{n,p} \, A$$
for an explicit deterministic constant $c_{n,p}$. The same calculation can be carried out for $\mathbb{E}\!\left[ W_2 (2A)^{-1} W_1 \,\middle|\, A \right]$ to give
$$\mathbb{E}[H \mid A] = c_{n,p} \, A,$$
which is simply a rescaling of $A$ by a deterministic constant. We summarize this result as a theorem.

Theorem 2.

Let $W_1$ and $W_2$ be as in Definition 1. If $\Pi$ is a partition of size $N = 2$ with cells of equal size, then
$$\mathbb{E}[H \mid A] = c_{n,p} \, A$$
for a deterministic constant $c_{n,p}$ depending only on $n$ and $p$.

Note that as $n \to \infty$ with $p/n \to y$, the limiting spectral measure of the above conditional expectation converges to the law of a correspondingly rescaled random variable $\lambda$, where $\lambda$ is distributed according to the limiting spectral distribution of $A$.

The above calculations can be further extended by making a few adjustments to the matrices $W_1$, $W_2$. A number of matrix estimators take the form
$$\widehat{\Sigma} = \beta_1 A + \beta_2 T,$$
where $\beta_1, \beta_2$ are positive scalars and $T$ is a positive semidefinite matrix, all depending only on $A$. Estimators of this form have been extensively studied in the covariance estimation literature [33, 26, 32]. One could take the extra step of applying the same regularization procedure to the matrices $W_1$ and $W_2$ before computing their (Rao-Blackwellized) harmonic mean. Suppose we replace $W_1$ and $W_2$ with
$$\widetilde{W}_1 = \beta_1 W_1 + \beta_2 T \qquad \text{and} \qquad \widetilde{W}_2 = \beta_1 W_2 + \beta_2 T,$$

respectively. Letting $\widetilde{H}$ be the harmonic mean of $\widetilde{W}_1$ and $\widetilde{W}_2$, we can compute a Rao-Blackwell improvement of $\widetilde{H}$ in much the same way that we did for $H$ above. Indeed, we still have
$$\widetilde{H} = \widetilde{W}_1 \left( \widetilde{W}_1 + \widetilde{W}_2 \right)^{-1} \widetilde{W}_2 + \widetilde{W}_2 \left( \widetilde{W}_1 + \widetilde{W}_2 \right)^{-1} \widetilde{W}_1.$$
We can compute the conditional expectation with respect to $A$ as follows:
$$\mathbb{E}[\widetilde{H} \mid A] = \mathbb{E}\!\left[ \widetilde{W}_1 \left( \widetilde{W}_1 + \widetilde{W}_2 \right)^{-1} \widetilde{W}_2 \,\middle|\, A \right] + \mathbb{E}\!\left[ \widetilde{W}_2 \left( \widetilde{W}_1 + \widetilde{W}_2 \right)^{-1} \widetilde{W}_1 \,\middle|\, A \right]. \tag{9}$$
Using Lemma 2 and the fact that $\mathbb{E}[W_1 \mid A] = \mathbb{E}[W_2 \mid A] = A$, and writing $M = (\widetilde{W}_1 + \widetilde{W}_2)^{-1}$ (a function of $A$), we have
$$\mathbb{E}\!\left[ \widetilde{W}_1 M \widetilde{W}_2 \,\middle|\, A \right] = \beta_1^2 \, \mathbb{E}\!\left[ W_1 M W_2 \,\middle|\, A \right] + \beta_1 \beta_2 \left( A M T + T M A \right) + \beta_2^2 \, T M T. \tag{10}$$
Combining Equations (9) and (10), we obtain an expression for $\mathbb{E}[\widetilde{H} \mid A]$, which we can write solely in terms of $A$ and $T$ by substituting $2 \beta_1 A + 2 \beta_2 T$ for $\widetilde{W}_1 + \widetilde{W}_2$.

We summarize the above results in the following Theorem.

Theorem 3.

The Rao-Blackwell improvement of $\widetilde{H}$, the harmonic mean of the matrices $\widetilde{W}_1 = \beta_1 W_1 + \beta_2 T$ and $\widetilde{W}_2 = \beta_1 W_2 + \beta_2 T$, where $\beta_1$ and $\beta_2$ are positive constants that depend only on $A$ and $T$ is a positive definite matrix that depends only on $A$, is given in closed form by the expression obtained by combining Equations (9) and (10).

Remark 2.

For linear shrinkage estimators of the form
$$\widehat{\Sigma} = \beta_1 A + \beta_2 I,$$
as in [33], substituting $T = I$ into our formula gives the corresponding Rao-Blackwellized estimator, Equation (11).
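For reference, the linear shrinkage target of [33] discussed in Remark 2 is available in standard software. This snippet uses scikit-learn's Ledoit–Wolf implementation, which shrinks the sample covariance toward a scaled identity with a data-driven coefficient; the dimensions are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(4)
n, p = 200, 100
X = rng.standard_normal((n, p))          # rows are observations

lw = LedoitWolf(assume_centered=True).fit(X)
# lw.covariance_ = (1 - s) * sample_cov + s * mu * I for a data-driven s = lw.shrinkage_
print(lw.shrinkage_)
```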

The results outlined above are unexpected, and somewhat odd. The implication of Theorem 2 is that the Rao-Blackwellization of $H$ is a deterministic constant multiple of $A$. This suggests that the expense of computing $H$ is not warranted, since while $H$ may improve upon $A$ as an estimator, a scalar multiple of $A$ improves still further upon $H$. Figure 4 in Section 6 explores this point empirically in the finite-sample regime. On the other hand, the form of the Rao-Blackwellized estimator in Equation (11), obtained from Theorem 3, bears noting. In contrast to the Rao-Blackwellized version of $H$ considered in Theorem 2, this estimator involves a combination of $A$, $I$, and terms that are nonlinear in $A$. The form of this estimator is fundamentally different from the class of linear shrinkage estimators considered elsewhere in the literature [33, 26, 48], and warrants further study.

5 Eigenvector Recovery

A major motivation for working with the operator norm is to obtain guarantees on the convergence of eigenvectors, which are often the main object of interest in covariance estimation, as when the covariance is used for principal component analysis. This is done via the Davis–Kahan theorem, which bounds the distance between the leading eigenvectors $v_1(\widehat{\Sigma})$ and $v_1(\Sigma)$ in terms of $\|\widehat{\Sigma} - \Sigma\|$. For example, it can be shown [51, Corollary 1] that
$$\sin \theta\!\left( v_1(\widehat{\Sigma}), \, v_1(\Sigma) \right) \leq \frac{ 2 \left\| \widehat{\Sigma} - \Sigma \right\| }{ \lambda_1(\Sigma) - \lambda_2(\Sigma) }. \tag{12}$$
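A minimal sketch of the quantities in Equation (12), assuming the Yu–Wang–Samworth form stated above; the helper names are ours.

```python
import numpy as np

def sin_theta(u, v):
    """sin of the angle between unit vectors u and v (sign-invariant)."""
    return np.sqrt(max(0.0, 1.0 - (u @ v) ** 2))

def davis_kahan_bound(Sigma_hat, Sigma):
    """Right-hand side of Equation (12): 2 * operator norm error / eigengap of Sigma."""
    lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
    return 2 * np.linalg.norm(Sigma_hat - Sigma, 2) / (lam[0] - lam[1])
```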

In this section, we show that under a spiked covariance model, the leading eigenvector of $H_{\Pi}$ carries information about the leading eigenvector of the population covariance matrix in the regime $p/n \to y \in (0,1)$, and we compare it with the leading eigenvector of $A_{\Pi}$. We return to the complex case once more due to our reliance on Proposition 2, but as alluded to previously, we expect the resulting asymptotic formula to hold for the real case without any adjustment.

Definition 3.

Let $x_1, x_2, \ldots, x_n$ be a set of i.i.d. $p$-dimensional centered multivariate complex Gaussians with population covariance matrix
$$\Sigma = I + \ell \, v v^{*}, \tag{13}$$
where $\ell > 0$ and $v \in \mathbb{C}^p$ has norm one. As in Proposition 2, assume that
$$\frac{p}{n} \to y,$$
where $y$ is a constant that does not depend on $p$ or $n$ and $y \in (0, 1/N)$.
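The following simulation sketch generates data from the spiked model (13), again substituting real Gaussians for complex ones, and compares the overlap of the leading eigenvectors of $A_{\Pi}$ and $H_{\Pi}$ with the spike direction $v$; it can be used to explore the qualitative message of Theorem 4 below. The parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, ell = 300, 1200, 5.0                    # spike strength ell, direction v
v = np.zeros(p); v[0] = 1.0

# X = Sigma^{1/2} Z with Sigma = I + ell * v v^T, using
# Sigma^{1/2} = I + (sqrt(1 + ell) - 1) * v v^T.
Z = rng.standard_normal((p, n))
X = Z + (np.sqrt(1.0 + ell) - 1.0) * np.outer(v, v @ Z)

X1, X2 = X[:, : n // 2], X[:, n // 2 :]
W1, W2 = X1 @ X1.T / (n // 2), X2 @ X2.T / (n // 2)
arith = (W1 + W2) / 2
harm = np.linalg.inv((np.linalg.inv(W1) + np.linalg.inv(W2)) / 2)

for name, M in [("arithmetic", arith), ("harmonic", harm)]:
    top = np.linalg.eigh(M)[1][:, -1]         # eigenvector of the largest eigenvalue
    print(name, "overlap with v:", abs(top @ v))
```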

Remark 3.

Let $z_1, z_2, \ldots, z_n$ be a collection of multivariate complex Gaussians with zero mean and covariance $I$. If we define
$$x_i = \Sigma^{1/2} z_i, \qquad i = 1, 2, \ldots, n,$$
where $\Sigma$ is given in Definition 3, then $x_1, \ldots, x_n$ has the same distribution as the model in Equation (13). Moreover, by this same transformation, we may take a partition $\Pi$ of the $z_i$ and generate a partition of the $x_i$ by replacing each $z_i$ in $\Pi$ by $x_i$. With this definition we have the equality
$$H_{\Pi}(x) = \Sigma^{1/2} H_{\Pi}(z) \Sigma^{1/2},$$
where $H_{\Pi}(x)$ and $H_{\Pi}(z)$ denote the harmonic means built from the $x_i$ and the $z_i$, respectively.

The Theorem below follows from well-established results in the literature and can be generalized without any change to higher-rank perturbations of the identity. We focus on the simple case of one spike in order to get more explicit insight into the performance of $H_{\Pi}$.

Theorem 4.

Let $\Sigma = I + \ell v v^{*}$ be a spiked model as in Definition 3, and suppose $\Pi$ is a partition satisfying the conditions of Proposition 2, with $N = 2$. Then we have the almost sure convergence of the largest eigenvalue $\lambda_1(H_{\Pi})$ and of the squared overlap $|\langle v_1(H_{\Pi}), v \rangle|^2$ to deterministic limits, each an explicit function of $\ell$ and $y$ exhibiting a phase transition in $\ell$.

Proof.

To prove this result we will use the general framework of [5], which considers multiplicative spikes of the form
$$\left( I + \ell v v^{*} \right)^{1/2} M \left( I + \ell v v^{*} \right)^{1/2}.$$
Here $\ell$ and $v$ are as in Definition 3, and $M$ is a Hermitian matrix whose eigenvalue distribution converges weakly to a spectral measure $\mu$ almost surely, with $\mu$ supported on the interval $[a, b]$. Assume further that the largest (smallest) eigenvalue of $M$ converges to the right (left) edge of the support of $\mu$ and that the distribution of $M$ is invariant under unitary conjugation; recall this implies the matrix of eigenvectors of $M$ is Haar distributed on the unitary group. Define, for $z$ outside the support of $\mu$,
$$T_{\mu}(z) = \int \frac{t}{z - t} \, d\mu(t),$$
and let $T_{\mu}^{-1}$ be the functional inverse of $T_{\mu}$. By [5, Theorem 2.7] we have the almost sure convergence of the largest eigenvalue of the spiked matrix, with the limit determined by $T_{\mu}^{-1}$ when the spike is supercritical and by the edge $b$ otherwise. Here the critical threshold is governed by the limit of $T_{\mu}(z)$ as $z \downarrow b$, well defined when $\mu$ has a density with square root decay near $b$ [5, Proposition 2.10]. Furthermore, by [5, Remark 2.11] we have the almost sure convergence of the squared overlap between the leading eigenvector of the spiked matrix and $v$, with a limit expressed in terms of $T_{\mu}$ and its derivative.

Applying these results using Remark 3 and taking $M = H_{\Pi}(z)$, where $\Pi$ is the partition of a data set with population covariance $I$, we see that $M$ satisfies the required convergence properties by Proposition 2 and is unitarily invariant. Letting $\mu$ equal the limiting spectral measure of $H_{\Pi}(z)$, the proof now proceeds by calculation. From the results of [36, Equation (18)], the Stieltjes transform of $\mu$ satisfies a polynomial fixed point equation. Inserting the definition of $T_{\mu}$ and simplifying yields an expression for $T_{\mu}$ in terms of this fixed point. Taking the limit as $z$ goes to the right edge $\mu_{+}$ of the support and utilizing the square root decay of $\mu$ at $\mu_{+}$ yields the value of $T_{\mu}$ at the edge. Hence, a phase transition in the largest eigenvalue of $H_{\Pi}(x)$ occurs for $\ell$ above the corresponding threshold. We can solve for the inverse of $T_{\mu}$ by substituting into the polynomial fixed point equation for the Stieltjes transform of $\mu$.