Asymptotics of empirical eigenvalues for large separable covariance matrices

10/10/2019 ∙ by Tiebin Mi, et al. ∙ Shanghai Jiao Tong University 0

We investigate the asymptotics of eigenvalues of sample covariance matrices associated with a class of non-independent Gaussian processes (separable and temporally stationary) under the Kolmogorov asymptotic regime. The limiting spectral distribution (LSD) is shown to depend explicitly on the Kolmogorov constant (a fixed limiting ratio of the sample size to the dimensionality) and parameters representing the spatio- and temporal- correlations. The Cauchy, M- and N-transforms from free harmonic analysis play key roles to this LSD calculation problem. The free multiplication law of free random variables is employed to give a semi-closed-form expression (only the final step is numerical based) of the LSD for the spatio-covariance matrix being a diagonally dominant Wigner matrix and temporal-covariance matrix an exponential off-diagonal decay (Toeplitz) matrix. Furthermore, we also derive a nonlinear shrinkage estimator for the top eigenvalues associated with a low rank matrix (Hermitian) from its noisy measurements. Numerical studies about the effectiveness of the estimator are also presented.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Many applications in signal processing and other sciences involve modeling a dependency across a wide range of spatial and temporal scales, e.g., [1, 2, 3, 4, 5]. Traditionally, statistical theory has been concerned with studying the properties of a spatio-temporal process as the sample size tends to infinity, while the number of variables is fixed. Nevertheless, there are always a few snapshots available to us about the underlying process in practice. A key challenge in spatio-temporal data analysis is therefore to summarize or infer the process’ behavior with these finite samples. In this article, we will discuss the limiting behavior of the empirical spectral distribution associated with a class of Gaussian processes (large, non-independent, separable and temporally stationary).

For convenience, we denote by a spatio-temporal process. Particularly, the covariance function provides us a succinct but informative summary on the process. If no further structure exists, depends entirely on the space-time indexes and . However, stationary and separability are often imposed to ensure that the covariance function could be manipulated easily [6].

Specifically, a spatio-temporal random function possesses temporally stationary covariance if depends on the time index only through the temporal lag [4]. Similarly, it has spatially stationary covariance if depends on and only through the spatial separation . A spatio-temporal process has stationary covariance if it has both spatially and temporally stationary covariance. Under this assumption, there exists a function such that .

The separability, which offers significantly computational benefits, is a desirable property for the spatio-temporal models [6, 7, 8]. A process has a separable covariance function if for some spatio- and temporal- covariances and . Roughly speaking, with the separability, the spatio- and temporal-correlations are “decoupled” completely. Moreover, with both the separability and temporal stationary, the associated covariance function could be written as .

In this article we will employ a framework inspired by random matrix theory to characterize the asymptotic behaviour of large sample covariance matrices associated with a class of separable and temporally stationary Gaussian processes (noises). The setting in random matrix theory differs substantially from traditional situation where the number of variables is fixed, but the sample size for examination tends to infinity. The proposed asymptotic approach allows us to obtain the limiting of the eigenvalues (under the condition that the number of variables and samples size both tend to infinity) and predict the behavior of the empirical eigenvalues associated with large sample covariance matrices with reasonable accuracy. This is the primary motivation for this article.

We begin with the definition of separable Gaussian process. Specifically, for the samples of a separable Gaussian process, we consider a matrix with the form , where is a random matrix consisting of independent identically distributed (i.i.d.) Gaussian random variables, and are and non-negative definite matrices, representing spatio- and temporal- covariances, respectively. Clearly, with the impact of and , consists of non-independent Gaussian random variables. For the associated sample covariance matrix , we mean .

To examine the asymptotic statistics of large sample covariance matrices, a new approach based on free probability theory (see for instance [9, 10]) is developed under the Kolmogorov condition, i.e., and (Kolmogorov constant). We emphasize that this approach is different greatly from the methods in [11, 12]. The Cauchy, M- and N-transforms from free harmonic analysis play key roles in this spectral distribution estimation problem. We will investigate how to calculate with these transforms a semi-closed-form expression of the LSD of the sample covariance matrices associated with a class of separable Gaussian processes.

Before we proceed, let us first discuss briefly the computational and statistical challenges we are faced under the Kolmogorov condition.

I-a Challenges and opportunities from massive data

In recent years, modern signal processing techniques often amplify system’s pressure from dealing with massive data in pursuit of high performance. Massive data are always featured in high-dimensionality and large sample size, i.e., the dimensionality of variables is comparable to the size of available samples (the Kolmogorov condition falls into this regime also). That will introduce computational and statistical challenges, including spurious correlation, noise accumulation and scalability.

A good illustration for massive data processing is the development in wireless communication [13, 14]

. Basically, the more antennas the transmitter/receiver are equipped with, the more degrees of freedom could be achieved. The LTE standard allows for up to eight antennas at the base station. The massive MIMO goes even further, which consists of a very large number of antennas (hundreds or thousands), simultaneously serves a great number of user terminals (tens or hundreds). Therefore, it’s necessary to know what will happen for a large system limit, where the numbers of antennas and user terminals grow infinitely large with the same speed

[15, 14].

Unfortunately, the Kolmogorov condition limits the applicability of many well-known approaches [16]. We consider the covariance matrix estimation as an example. Given an data matrix , consisting of

samples drawn from multivariate Gaussian distribution

, it is well-known that the sample covariance matrix is a good approximate of the population covariance matrix when remains fixed and , i.e., .

By contrast, for the situation where the dimension is comparable to the sample size , many new phenomena arise. Perhaps the first high-dimensionality phenomenon is that, if , the empirical eigenvalues do not converge to their population counterparts even as the sample size grows. The famous Marc̆enko-Pastur law (illustrated in Theorem II.3) gives a celebrated function for the asymptotic behavior of eigenvalues of a large sample covariance matrix associated with multivariate Gaussian distributions. We can see from Fig. 1 that the eigenvalues of the sample covariance matrices are more spread out, or dispersed, than the population eigenvalues.

Such a paradigm shift to big/massive data in signal processing will lead to significant progresses on the development of new theory and method. Although there exist various distinct statistical problems, we primarily focus on the limiting spectral statistics associated with large sample covariance matrices. To this end, a framework inspired by random matrix theory and several transforms from free harmonic analysis will be adapted.

The organization of this article is as follows. A brief description of random matrix theory is given in Section II. The roles of Cauchy transform in the calculation of Marčenko-Pastur distribution and top eigenvalues of a low-rank spiked covariance model are also discussed. In Section III, we present some preliminaries for free probability theory. For those who are familiar with this may skip most of the section. Section IV focuses on the models of spatio- and temporal- covariance matrices. We restrict the spatio- and temporal-covariance matrices to a diagonally dominant Wigner matrix and a bi-infinite Toeplitz matrix, respectively. Section V is dedicated to the calculation of Cauchy, M- and N-transforms associated with the Marčenko-Pastur distribution, diagonally dominant Wigner matrix and bi-infinite Toeplitz matrix. All of the associated Cauchy, M- and N-transforms are analytical. Finally, the article turns to the calculation of LSD of large sample covariance matrices associated with autoregressive Gaussian processes. We also have a very short discussion on the nonlinear shrinkage estimator for the top eigenvalues of a low rank matrix (Hermitian) from its noisy measurements in Section VI.

Ii Random matrix theory and Cauchy transform

Random matrix theory, which studies how the eigenvalues behave asymptotically when the order of an underlying random matrix grows, is gaining increasing attention for analyzing complex high-dimensional data. Early interest in random matrices arose in the context of multivariate statistics, it was Wigner who introduced random matrix ensembles and derived the first asymptotic result

[17, 18]. Since then, there have been a large number of literatures studying the asymptotic behavior of various random matrices. In this section, we will introduce some necessary concepts in random matrix theory. For a thorough treatment on random matrix theory, we refer the reader to [19, 20, 21, 22, 23, 24, 25, 26, 27].

Definition II.1.

Given an Hermitian matrix , the empirical spectral distribution (or ESD for short) is defined as, for ,

where are the (necessarily real) eigenvalues of , including multiplicity.

Note that is actually a probability measure on . It is non-decreasing with and .

Random matrix theory asserts that there exist some classes of Hermitian random matrices whose (random) ESD converges, as , towards a non-random function . This function , if it exists, is called the limiting spectral distribution (LSD for short). Moreover, for many Hermitian random matrices, the corresponding LSD is supported on a bounded interval (bulk) and illustrating sharp edges on the boundary. The eigenvalue significantly outside of the interval (noise) will be a good indicator of the non-random behavior (signal).

We now introduce Cauchy transform, which uses complex-analytic methods and has turned out to be a powerful tool in the calculation of LSDs.

Definition II.2.

Let be the complex upper half-plane and a probability measure on . The Cauchy transform is an analytic function defined as, for ,

A probability measure could be retrieved from its Cauchy transform via the so-called Stieltjes inversion formula. Particularly, if is a continuous point of ,

(II.1)

With this Stieltjes inversion formula, the statistical information on a probability measure is completely encoded in the corresponding Cauchy transform function. However, it is worth noting that, for an arbitrary probability distribution, it might not be possible to obtain an analytical or even an easy numerical solution via the Stieltjes inversion formula.

Next we will consider Cauchy transform associated with a Hermitian random matrix. By Definition II.1, we know that is the summation of a series of Dirac delta functions that concentrate around the eigenvalues of . The associated Cauchy transform is then defined as, for ,

where is the normalized trace.

Ii-a Marčenko-Pastur law

The Cauchy transform was used by Marčenko and Pastur to derive the well-known Marčenko-Pastur law of large dimensional Gram matrices of random matrices with i.i.d. Gaussian entries [28].

Theorem II.3 ([28]).

Consider an Gaussian random matrix

(consists of i.i.d. Gaussian entries of mean 0 and variance 1). Under the Kolmogorov condition, i.e.,

and , the ESD of converges weakly and almost surely to a non-random distribution function (Marčenko-Pastur distribution) with density function

where , , and .

(a) Eigenvalue spreading phenomenon ().
(b) Three instances of the Marčenko-Pastur distribution for different values of the limiting ratio .
Fig. 1: High-dimensional phenomenon: spreading of eigenvalues. Population eigenvalues, LSDs and ESDs are shown in green, red and blue, respectively.

From Theorem II.3, we see that the Marčenko-Pastur distribution is supported on . Thus one of the high-dimensional phenomena relates to Kolmogorov condition is that the sample eigenvalues are more spread out than the population eigenvalues for large and . Fig. 1(a) illustrates this phenomenon for and . All population eigenvalues are equal to 1 (in green), but the 4000 sample eigenvalues have a histogram (in blue) that spreads over a range. Fig. 1(b) illustrates the Marčenko-Pastur law for different values of the Kolmogorov constant .

Ii-B A low-rank spiked sample covariance model

In recent years, some authors investigated in detail the low rank “spiked” covariance matrix model, e.g., [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]. A spiked sample covariance matrix consists of a known “base” covariance matrix (bulk, noise) and a low-rank perturbation (spikes, signals)

where the eigenvalues of , or the signal strengths are often taken as unknown. We assume that the rank

is relatively small and stays fixed in the asymptotic regime. In particular, there exist a notable high-dimensional phenomenon - the eigenvalue phase transition.

In [41], Benaych-Georges, F. and Nadakuditi, R.R. focused on the role of Cauchy transform in calculating the top eigenvalues of the low-rank spiked covariance model. They showed that the limiting non-random eigenvalues associated with the perturbed matrix depend explicitly on the levels of the spikes and the LSD of the bulk via the Cauchy transform.

Theorem II.4 (Eigenvalue phase transition for observation [41]).

Consider a sample covariance matrix . Without loss of generality, we assume that , and, as , , (compactly supported). Let be the supremum of the support of . Then the eigenvalues of exhibit the following behavior. As , for each ,

(II.2)
Corollary II.5 (Eigenvalue phase transition for estimation).

Let . Assume that and, as , , (compactly supported). Let be the supremum of the support of . Then for any , there exists a corresponding such that .

Remark II.6.

It is worth noting that the supremum of the support of (denoted by ) is actually a critical point for eigenvalue estimation. That means, as , for any , if , there is indeed no valuable information contained in about the -th eigenvalue of (denoted by ).

Corollary II.5 yields a method to estimate top eigenvalues associated with a low rank matrix from its noisy measurements. Specifically, for sufficiently large, if , is a (nonlinear shrinkage) estimator for the -th largest eigenvalue of .

Next, we employ an example in [41], where

is a Wigner matrix, to illustrate the eigenvalue phase transition phenomena. A Wigner random matrix

is an Hermitian matrix consisting of i.i.d. random variables with zero mean and unit variance. Benaych-Georges, F. and Nadakuditi, R.R. demonstrated that, for , as ,

(II.3)

On the other hand, the proposed estimator for the eigenvalues associated with a low rank matrix from its noisy measurements is

(II.4)

Fig. 2(a) and 2(b) show the mappings defined by Equations II.3 and II.4, respectively. We can see from Fig 2(a) that there exists significant upward bias if the original eigenvalue (low rank) is greater than (critical point). In Fig. 2(b), we demonstrate the nonlinear shrinkage for estimating the large eigenvalues associated with a low rank matrix, where the critical point is . That means, for any , if falls into the uncertain region, i.e., , there is indeed no chance to recover the -th eigenvalue of (denoted by ) from directly.

(a) Upward bias in the large eigenvalues (eigenvalue phase transition for observation).
(b) Nonlinear shrinkage estimator of the large eigenvalues (eigenvalue phase transition for estimation).
Fig. 2: Eigenvalue phase transitions for low-rank spiked sample covariance model. The noise covariance matrix is a Wigner matrix. The mappings defined by Equation II.3 and Equation II.4 are plotted in red in (a) and (b), respectively. The dashed blue is for easy comparison.

Iii Free probability in a nutshell

Free probability is a mathematical theory developed by Voiculescu around 1983. The freeness or free independence is an analogue of the classical notion of independence. A comprehensive study on free probability could be found in [26, 27].

The significance of free probability to random matrix theory lies in the fundamental observation that random matrices consisting of independent entries also tend to be asymptotically free. Roughly speaking, free probability replaces the vague notion of generic position between eigenspaces associated with random matrices by a mathematical precise concept of freeness and provides a framework to calculate the LSD of

based on LSDs of and . Consequently, many tedious calculations in random matrix theory, particularly those of an algebraic or combinatorial nature, can be performed efficiently and systematically with free probability theory [27].

In the following, we will present some preliminaries for free probability theory. For those who are familiar with these may skip most of this section.

We generally refer to a pair , consisting of a unital algebra and a unital linear functional with , as a non-commutative probability space. The elements of are always called non-commutative random variables. for random variables

are always called moments.

Definition III.1 ([26]).

Let be a unital algebra with a unital linear functional. Suppose are unital subalgebras. We say that are freely independent (or just free) with respect to if whenever we have and such that

  • for

  • with for

we must have .

Definition III.2 ([26]).

Let be a non-commutative probability space. The non-commutative random variables are said to be free or freely independent if the generated unital subalgebras are free in with respect to .

Remark III.3.

With freeness, it is possible to calculate the mixed moments of and from the moments of and . In this sense, freeness could be considered and treated as the non-commutative analogue to the concept of independence of classical random variables.

Remark III.4.

Suppose and are two compactly supported probability measures. If there exist two free random variables and having distributions and , respectively, then the distributions of the non-commutative random variables and depend only on and [42, 23].

There are many classes of random matrices which show asymptotic freeness. In fact, Gaussian random matrices and deterministic matrices, Haar distributed unitary random matrices and deterministic matrices, Wigner random matrices and deterministic matrices all become asymptotically free with respect to the normalized trace [26]. We only present one theorem to demonstrate that randomly rotated deterministic matrices become free in the asymptotic regime.

Theorem III.5 ([26]).

Let and be two sequences of deterministic matrices with and . Let be a sequence of Haar unitary random matrices. Then , , where and are free. This convergence holds also almost surely. So in particular, we have that and are almost surely asymptotically free.

Iii-a Free harmonic analysis

We begin with the definition of Cauchy transform associated with a self-adjoint random variable. Actually, the distribution of a self-adjoint random variable in a -probability space can be identified by a compactly supported probability measure on , i.e., for all ,

With the Cauchy transform associated with probability measure , we can then define the Cauchy transform of as

Next, we will define several transforms on the complex upper half-plane that will be used frequently in this article. The first is the M-transform, defined as

Lemma III.6.

For a probability measure , let be the associated M-transform. Then , for all , and . Furthermore, if and , .

Proof.

Since , we have that, for ,

and

By the assumption that , for all . Finally, with the dominated convergence theorem, we have and , which lead to . ∎

Theorem III.7 (Free multiplication law [10, 9]).

For a non-commutative random variable , let be the associated M-transform. We define the N-transform as a function such that , or equivalently, . For freely independent random variables and , the N-transform of obeys the ‘non-commutative multiplication law’:

Since the trace operator is invariant under cyclic permutations, there exist similar cyclic properties for the M- and N- transforms.

Theorem III.8 (Cyclic property).

Consider an matrix and a matrix and assume is Hermitian. Then

and

Proof.

On account of the definition of M-transform, we have . Since and have the same non-zero eigenvalues and is Hermitian, all of the eigenvalues of are real. Then

Applying the scaling property of inverse functions yields

Iv Separability between the spatio- and temporal- correlations

As mentioned in Section I, the separability brings us significant computational advances. For example, if we address a separable Gaussian process

as a vector

by stacking the columns of into a single vector, the population covariance matrix , where denotes the Kronecker product between matrices.

Lemma IV.1.

Let , where is a Gaussian random matrix, and are non-negative definite matrices. Then .

Proof.

Since , we first have . Then

Remark IV.2.

The separability between spatio- and temporal-correlations reduces the model complexity greatly. Consider the estimation of covariance matrix as an example, the number of parameters for a separable Gaussian random process is , in contrast to for a general Gaussian random process.

The objective of this study is to characterize the asymptotic behaviour of the sample covariance matrices . We place certain restrictions on and so that they are completely defined by simple parameters that could be estimated easily. In the remainder of this article, a diagonally dominant Wigner matrix and a shift invariant (Toeplitz) structure are imposed on the spatio-covariance matrix and the temporal-covariance matrix , respectively.

When is a diagonally dominant Wigner matrix and is a deterministic Toeplitz matrix, we could assume the asymptotic freeness between and , and , respectively. The assumption is relatively weak because the left-singular vectors of compose a Haar-like random matrix. The asymptotic freeness follows immediately from Theorem III.5.

The key idea for the calculation of LSD of is through the non-commutative multiplication law of the N-transform (Theorem III.7). Specifically, with the asymptotic freeness between and , and , we have

Theorem II.3 shows that, as and , the ESD of converges weakly and almost surely to Marčenko-Pastur distribution. By the assumption that , and , we finally have

(IV.1)

Iv-a Restrictions on the spatio- and temporal- covariance matrices

Next, we will give a brief exposition of the restrictions on the spatio- and temporal- covariance structures. It is well-known that the autoregressive (AR) model provides flexibility in terms of understanding the lag-dependence structures. Particularly, AR(1) often appears as a building block in more complex models. For an AR(1) Gaussian process, we actually restrict the spatio-covariance matrix to a diagonally dominant Wigner matrix and impose a ‘shift invariant’ structure to the temporal-covariance matrix , respectively.

Specifically, let be a jointly wide-sense stationary AR(1) Gaussian process, i.e., for ,

Here ,

is white noise with zero mean. We first calculate the spatio-covariance function

Applying the jointly wide-sense stationarity between and yields

In practice, there often exists non-trivial cross correlation term . Nevertheless, it’s not easy to calculate the cross correlation term accurately. To reduce the complexity of the model, we further assume that the random variables in the collection

are i.i.d. with zero mean and standard deviation

. Additionally, we assume that . Finally, the spatio-covariance matrix is modeled as , where is a Wigner random matrix.

We now turn to the structure of the temporal-covariance matrix associated with an autoregressive Gaussian process. For a wide-sense stationary process, it is natural to impose ‘shift invariance’ on the temporal-covariance matrix , i.e., the entries of depend only on the difference between the indices,

With the assumption that , becomes a bi-infinite Toeplitz matrix (the indices range over all integers)

Moreover, for a stationary AR(1) Gaussian process , the covariance between the observations at time and is

It follows immediately that . Finally, the bi-infinite Toeplitz matrix becomes a bi-infinite exponential off-diagonal decay matrix, i.e., .

Remark IV.3.

We emphasize that the AR process could be considered as a mean-field model to study the behavior of large and complex stochastic models [43, 44]. That means rather than considering individual signals, the spectrum of the homogeneous covariance matrix is a mean-field approximation of the spectrum of the original heterogenous covariance matrix.

We use the mean-field model on spectrum in [44] for a better illustration. With a slight abuse of notation, consider two sample matrices

where

is uniformly distributed on

, and , and

where , and . The ESDs of and are plotted in Fig. 3, from which we can see that there is no significant difference between the two ESDs.

Fig. 3: ESDs of the heterogenous (in red) and homogeneous (in blue) autoregressive process.

Finally, we have the explicit form of the sample covariance matrix

where is a Gaussian random matrix, and . The aim of this article is to calculate the LSD of , under the Kolmogorov condition (, and ).

V Cauchy, M- and N-transforms

We have shown in Section IV that, under the assumption of freeness (relatively weak),

This section is dedicated to the calculation of Cauchy, M- and N-transforms associated with the Marčenko-Pastur distribution, diagonally dominant Wigner matrix () and bi-infinite Toeplitz matrix (). It is worth noting that all of these three N-transforms are analytical, so is .

Before that, we give a lemma that will be used frequently in the following calculations. For the proof we refer the reader to [26]. Unless otherwise stated, we define that , where , and .

Lemma V.1 ([26]).

Let and be the roots of . Suppose that and . If , then and . Here is the unit circle in complex plane.

V-a Marčenko-Pastur distribution

We first calculate the Cauchy, M- and N-transforms associated with the Marčenko-Pastur distribution. When , the Marčenko-Pastur density function is reduced to

Consider the associated Cauchy transform

Applying