# Estimating Tail Probabilities of the Ratio of the Largest Eigenvalue to the Trace of a Wishart Matrix

This paper develops an efficient Monte Carlo method to estimate the tail probabilities of the ratio of the largest eigenvalue to the trace of the Wishart matrix, which plays an important role in multivariate data analysis. The estimator is constructed based on a change-of-measure technique and it is proved to be asymptotically efficient for both the real and complex Wishart matrices. Simulation studies further show the outperformance of the proposed method over existing approaches based on asymptotic approximations, especially when estimating probabilities of rare events.

## Authors

• 5 publications
• 23 publications
12/22/2021

### Small deviation estimates for the largest eigenvalue of Wigner matrices

We establish precise right-tail small deviation estimates for the larges...
04/27/2020

### Consistency of full-sample bootstrap for estimating high-quantile, tail probability, and tail index

We show that the full-sample bootstrap is asymptotically valid for const...
05/06/2019

### Estimating the inverse trace using random forests on graphs

Some data analysis problems require the computation of (regularised) inv...
12/10/2021

### Contour Integral-based Quantum Algorithm for Estimating Matrix Eigenvalue Density

The eigenvalue density of a matrix plays an important role in various ty...
08/21/2020

### Large Deviations of Extreme Eigenvalues of Generalized Sample Covariance Matrices

Very rare events in which the largest eigenvalue of a random matrix is a...
10/03/2017

### Keep It Real: Tail Probabilities of Compound Heavy-Tailed Distributions

We propose an analytical approach to the computation of tail probabiliti...
11/16/2021

### Quantification of fracture roughness by change probabilities and Hurst exponents

The objective of the current study is to utilize an innovative method ca...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider independent and identically distributed (iid) -dimensional observations

from a real or complex valued Gaussian distribution with mean zero and covariance matrix

. Here is an unknown scaling factor and is the identity matrix. Define the data matrix , and assume are the ordered real eigenvalues of the sample covariance matrix , where denotes the conjugate transpose. Note that if , the last of the s are zero. Let be the ratio of the largest eigenvalue to the trace, viz.

 Un,p=λ1(λ1+⋯+λp)/min(n,p). (1)

We are interested in estimating the rare-event tail probability where is some constant such that is small. Estimating rare-event tail probabilities is often of interest in multivariate data analysis. For instance, in multiple testing problems, it is often needed to evaluate very small

-values for individual test statistics to control the overall false-positive error rate.

plays an important role in multivariate statistics when testing the covariance structure. For instance, it has been used to test for equality of the population covariance to a scaled identity matrix, viz.

 H0:Σ=σ2Ipvs.H1:Σ≠σ2Ip

with unknown, i.e., the so-called sphericity test; see, e.g., muirhead2009aspects . The test statistic

does not depend on the unknown variance parameter

and has high detection power against alternative covariance matrices with a low-rank perturbation of the null . In particular, under the alternative of rank-1 perturbation with for some unknown and , the likelihood ratio test statistic can be written as a monotone function of and therefore corresponds to the -value; see, e.g., (muirhead2009aspects, ; bianchi2011performance, ). Please refer to krzanowski2000principles ; muirhead2009aspects ; paul2014random for more discussion and many other applications.

The exact distribution of is difficult to compute, especially when estimating rare-event tail probabilities. Note that follows a Wishart distribution , with for real Gaussian and for complex Gaussian. So the distribution of corresponds to that of the ratio of the largest eigenvalue to the trace of a . However, this distribution is nonstandard and exact formulas based on it typically involve high-dimensional integrals or inverses of Laplace transforms. Numerical evaluation has been studied in davis1972ratios ; schuurmann1973distributions ; kuriki2001tail ; kortun2012distribution ; wei2012exact ; chiani2014distribution

. But for high-dimensional data with large

, the computation becomes more challenging, which is notably the case when is small, due to the additional computational cost to control the relative estimation error of .

The asymptotic distribution of with and both going to infinity has also been studied in the literature. It is known that asymptotically behaves similarly to the largest eigenvalue , whose limiting distribution has been studied in johansson2000shape ; johnstone2001 , and also asymptotically follows the Tracy–Widom distribution; see, e.g., (bianchi2011performance, ; nadler2011distribution, ). That is,

 Pr(Un,p−μn,pσn,p>x)→1−TWβ(x), (2)

where denotes the Tracy–Widom distribution of order , with or for real and complex valued observations, respectively. In particular, for real-valued observations, the centering and scaling constants

 μn,p=1n(√n−12+√p−12)2 and σn,p=1n(√n−12+√p−12)(1√n−1/2+1√p−1/2)1/3 (3)

lead to a convergence rate of the order ; see (ma2012accuracy, ). For the complex case, similar expressions can be found in karouicomplexrate . Nadler nadler2011distribution studied the accuracy of the Tracy–Widom approximation for finite values of and . He found that the approximation may be inaccurate for small and even moderate values of when is large. Therefore, he proposed a correction term to improve the approximation result, which is derived using the Fredholm determinant representation, and he showed that the approximation rate is when follows a complex Gaussian distribution. In the real Gaussian case, which is of interest in many statistical applications, Nadler nadler2011distribution conjectured that the result also holds. The calculation of the correction term in nadler2011distribution depends on the second derivative of the non-standard Tracy–Widom distribution, which usually involves a numerical discretization scheme.

Another limitation of the existing methods is that they may become less efficient when estimating small tail probabilities of rare events. This paper aims to address this rare-event estimation problem. In particular, we propose an efficient Monte Carlo method to estimate the exact tail probability of by utilizing importance sampling. The latter is a commonly used tool to reduce Monte Carlo variance and it has been found helpful to estimate small tail probabilities, especially when the event is rare, in a wide variety of stochastic systems with both light-tailed and heavy-tailed distributions; see, e.g., (SIE76, ; AsmKro06, ; DupLedWang07, ; ASMGLY07, ; BlaGly07, ; LiuXu12, ; LiuXuTomacs, ; xu2014rare, ).

An importance sampling algorithm needs to construct an alternative sampling measure (a change of measure) under which the eigenvalues are sampled. Note that it is necessary to normalize the estimator with a Radon–Nikodym derivative to ensure an unbiased estimate. Ideally, one develops a sampling measure so that the event of interest is no longer rare under the sampling measure. The challenge is of course the construction of an appropriate sampling measure, and one common heuristic is to utilize a sampling measure that approximates the conditional distribution of

given the event . This paper proposes a change of measure that asymptotically approximates the conditional measure . We carry out a rigorous analysis of the proposed estimator for and show that it is asymptotically efficient. Simulation studies show that the proposed method outperforms existing approximation approaches, especially when estimating probabilities of rare events.

The remainder of the paper is organized as follows. In Section 2, we propose our importance sampling estimator and establish its asymptotic efficiency in Theorem 1. Numerical results are presented in Section 3 to illustrate its performance. We discuss the possibility of generalizing the result to the ratio of the sum of the largest eigenvalues to the trace of a Wishart matrix in Section 4. The proof of Theorem 1 is given in Section 5.

## 2 Importance sampling estimation

For ease of discussion, we consider the setting , and . When , the algorithm and theory are essentially the same up to switching labels of and , which is explained in Remark 4. We use the notation to denote the real Wishart Matrix () and complex Wishart matrix (). Since is invariant to , the analysis does not depend on the specific values of , and we take as follows in order to simplify the notation and unify the real and complex cases under the same representation, as specified in Eq. (4) below:

• When , we assume that . That is, the entries of are iid , and are the ordered eigenvalues of .

• When , we assume . We consider the circularly symmetric Gaussian random variable (tse2005fundamentals, ), and we write when and are iid . In the following, we assume that the entries of are iid , and that are the ordered eigenvalues of .

As mentioned, e.g., in (dumitriu2002matrix, ), the eigenvalues

are distributed with probability density function

 fn,p,β(λ)=Cn,p,βp∏i

when , where is a normalizing constant given by

 Cn,p,β=p!(n2)βnp/2p∏j=1Γ(1+β/2)Γ(1+βj/2)Γ{β(n−p+j)/2}.

Then the target probability can be written as

 αn,p(x)=∫λ1≥⋯≥λp≥01(Un,p>x)fn,p,β(λ1,…,λp)dλ1⋯dλp,

where is the indicator function. As discussed in the Introduction, direct evaluation of the above -dimensional integral is computationally challenging, especially when is relatively large.

This work aims to design an efficient Monte Carlo method to estimate . We first introduce some computational concepts from the rare-event analysis literature, which helps to evaluate the computation efficiency of a Monte Carlo estimator.

Consider an estimator of a rare-event probability , which goes to 0 as . We simulate iid  copies of , say , and obtain the average estimator . We want to control the relative error such that for some prescribed ,

 Pr{|¯Ln,p(x)−αn,p(x)|/αn,p(x)>ε}<δ.

Consider the direct Monte Carlo estimator as an example. The direct Monte Carlo directly generates samples from the density (4) and uses So in each simulation we have a Bernoulli variable with mean

. According to the Central Limit theorem, the direct Monte Carlo simulation requires

iid replicates to achieve the above accuracy, where the notation is defined as follows. For any and depending on , means that This implies that the direct Monte Carlo method becomes inefficient and even infeasible as .

A more efficient estimator is the asymptotically efficient estimator; see, e.g., (SIE76, ; AsmKro06, ). An unbiased estimator of is called asymptotically efficient if

 liminfn→∞ln[var{Ln,p(x)}]/ln{αn,p(x)2}≥1. (5)

Note that (5) is equivalent to

 limsupn→∞var{Ln,p(x)}/αn,p(x)2−η=0, (6)

for any . In addition, since and

 limsupn→∞ln{E(L2n,p)}/ln{αn,p(x)2}≤1

by Hölder’s inequality, (5) is also equivalent to

 limn→∞ln{E(L2n,p)}/ln{αn,p(x)2}=1.

When is asymptotically efficient, by Chebyshev’s inequality,

 Pr{|¯Ln,p(x)−αn,p(x)|/αn,p(x)>ε}≤var{Ln,p(x)}/{Nαn,p(x)2ϵ2},

and therefore (6) implies that we only need , for any , iid replicates of . Compared with the direct Monte Carlo simulation, efficient estimation substantially reduces the computational cost, especially when is small.

To construct an asymptotically efficient estimator, we use the importance sampling technique, which is an often used method for variance reduction of a Monte Carlo estimator. We use to denote the probability measure of the eigenvalues . The importance sampling estimator is constructed based on the identity

 Pr(Un,p>x)=E{1(Un,p>x)}=EQ{1(Un,p>x)dP/dQ},

where is a probability measure such that the Radon–Nikodym derivative is well defined on the set , and we use and to denote the expectations under the measures and , respectively. Let be the density function of the eigenvalues under the change of measure . Then, the random variable defined by

 Ln,p=1(Un,p>x)fn,p(λ1,…,λp)/fQn,p(λ1,…,λp)

is an unbiased estimator of under the measure . Therefore, to have asymptotically efficient, we only need to choose a change of measure such that

 liminfn→∞1|2lnαn,p(x)||lnEQ{1(Un,p>x)fn,p(λ1,…,λp)2/fQn,p(λ1,…,λp)2} |≥1. (7)

To gain insight into the requirement (7), we consider some examples. First consider the direct Monte Carlo with ; the right-hand side of (7) then equals which is smaller than 1. On the other hand, consider to be the conditional probability measure given , i.e., ; then the right-hand side of (7) is exactly 1. Note that this change of measure is of no practical use since depends on the unknown . But if we can find a measure that is a good approximation of the conditional probability measure given , we would expect (7) to hold and the corresponding estimator to be efficient. In other words, the asymptotic efficiency criterion requires the change of measure to be a good approximation of the conditional distribution of interest.

Following the above argument, we construct the change of measure as follows, which is motivated by a recent study of Jiang et al.  xu2016rare . These authors studied the tail probability of the largest eigenvalue, i.e., with and proposed a change of measure that approximates the conditional probability measure given in total variation when . It is known that the asymptotic behaviors of and are closely related. We therefore adapt the change of measure to the current problem of estimating . However, we would like to clarify that the problem of estimating is different from that in xu2016rare in terms of both theoretical justification and computational implementation, which is further discussed in Remark 3.

Specifically, we propose the following importance sampling estimator.

###### Algorithm 1.

Every iteration in the algorithm contains three steps, as follows:

• We use the matrix representation of the -Laguerre ensemble introduced in dumitriu2002matrix , and generate the matrix where is a bidiagonal matrix defined by

 Bn−1,p−1,β=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝χβ(n−1)χβ(p−2)χβ(n−2)⋱⋱χβχβ{n−(p−1)}⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠(p−1)×(p−1).

The notation

denotes the square root of the chi-square distribution with

degrees of freedom, and the diagonal and sub-diagonal elements of are generated independently. We then compute the corresponding ordered eigenvalues of , denoted by .

• Conditional on , we sample

from an exponential distribution with density

 f(λ1)=nre−nr(λ1−~x∨λ2)×1(λ1>~x∨λ2), (8)

where and is a rate function such that

 r=12−βγ∫1βx−ydσβ(y)−1−γ2x (9)

with and

denotes the probability distribution function of the Marchenko–Pastur law such that

 σβ(ds)=(β×2πγs)−1√(s−s∗)(s∗−s)1(s∈[s∗,s∗])ds (10)

with and , and is a constant depending on , , and such that

 ~x=xtr(Ln−1,p−1,β/n)/(p−x).
• Based on the collected values , a corresponding importance sampling estimate can be computed as in (12) below and the value of the estimate is saved.

The three steps above are repeated at every iteration. After the last iteration, the saved sampling estimates from all iterations are averaged to give an unbiased estimate of .

Now we detail how the importance sampling estimate (12) is computed at every iteration of the algorithm. Let be the measure induced by combining the above two-step sampling procedure. From dumitriu2002matrix , under the change of measure , the density of is

 fQn,p(λ∗2,…,λ∗p)=Cn−1,p−1,β∏2≤i

This implies that the density function of under is

 fQn,p(λ2,…,λp)=(nn−1)β(n−1)(p−1)/2Cn−1,p−1,β∏2≤i

Therefore takes the form

 fQn,p(λ2,…,λp)×nre−nr(λ1−~x∨λ2)×1(λ1>~x∨λ2)fn,p(λ1,…,λp)=(nn−1)β(n−1)(p−1)/2Cn−1,p−1,βnre−nr(λ1−~x∨λ2)×1(λ1>~x∨λ2)Cn,p,β∏pi=2(λ1−λi)×λβ(n−p+1)/2−11×e−nλ1/2.

The corresponding importance sampling estimate is given by

 Ln,p(x)=1(Un,p>x)dP/dQ, (12)

where is calculated with the sampled based on Eq. (1).

We claim that for the proposed Algorithm 1, with the choice of specified in (9), the importance sampling estimator is asymptotically efficient in estimating the target tail probability. This result is formally stated below and proved in Section 5.

###### Theorem 1.

When , the estimator in (12) is an asymptotically efficient estimator of for .

###### Remark 1.

Our discussion regarding asymptotic efficiency focuses on the case of estimating rare-event tail probability , i.e., when corresponds to a rare event. When , is not rare, and we can still apply the importance sampling algorithm with a reasonable positive value as the exponential distribution’s rate. However, the theoretical properties of the importance sampling estimator must then be studied under a different framework; this issue is not pursued here.

###### Remark 2.

We explain the Marchenko–Pastur form of (10). When the entries of have mean 0 and variance 1 ( and ), the Marchenko–Pastur law for the eigenvalues of takes the standard form

 f(d¯s)=(2πγ¯s)−1√(¯s+−¯s)(¯s−¯s−)1(¯s∈[¯s−,¯s+])d¯s (13)

with and ; see, e.g., Theorem 3.2 in (paul2014random, ). For the setting considered of this paper, the real case () has , so (10) and (13) are consistent. In contrast, the complex case () has and therefore (10) and (13) are different up to a factor of . Specifically, let and be eigenvalues of when has iid entries of and , respectively. Then we know that and (13) implies the empirical distribution in (10).

###### Remark 3.

We discuss the differences between the proposed method and the method in xu2016rare on the largest eigenvalue, which also employs an importance sampling technique. First, the two methods have different targets, i.e., in xu2016rare and here, and therefore use different changes of measure to construct efficient importance sampling estimators. As discussed in Section 2, in order to achieve asymptotic efficiency, the change of measures should approximate the target conditional distribution measures, i.e., in xu2016rare and in this paper. Due to the difference between the two conditional distributions, different changes of measure are constructed in the two methods. Specifically, Jiang et al. xu2016rare sample the largest eigenvalue from a truncated exponential distribution depending on the second largest eigenvalue , while the present work samples from an exponential distribution depending on eigenvalues . Second, the proof techniques of the main asymptotic results in the two papers are also different. In particular, to show the asymptotic efficiency of the importance sampling estimators as defined in (5), we need to derive asymptotic approximations for both the rare-event probability

and the second moments of the importance sampling estimator

. Even though the largest eigenvalue and the ratio statistic have similar large deviation approximation results for their tail probabilities, the asymptotic approximations for the second moments of the importance sampling estimators are different due to the differences between the considered changes of measure as well as the effect of the trace term in . Please refer to the proof for more details.

###### Remark 4.

The method and the theoretical results can be easily extended from the case to the case by switching the labels of and and changing to correspondingly. Note that when , the eigenvalues of and give the same test statistic as defined in (1), which is because and have the same set of nonzero eigenvalues and is scale invariant. By symmetry, when , the joint density function of the eigenvalues of have the same form as (4), except that the labels of and are switched. Therefore, the cases when and are equivalent up to the label switching. Note that after is changed to , becomes correspondingly.

## 3 Numerical study

We conducted numerical studies to evaluate the performance of our algorithm. We first took combinations , , , and and , respectively. Then we compared our algorithm with other methods and present the results in Table 1 and 2.

For the proposed importance sampling estimator, we repeated times and show the estimated probabilities (“

” column) along with the estimated standard deviations of

, i.e., (“” column). The ratios between estimated standard deviations and estimates (“” column) reflect the efficiency of the algorithms. Note that with

replications, the standard error of the estimate is

.

In addition, three alternative methods were considered, namely the direct Monte Carlo, the Tracy–Widom distribution approximation, and the corrected Tracy–Widom approximation (nadler2011distribution, ). We computed direct Monte Carlo estimates (“” column) with independent replications. We present the standard deviation of direct Monte Carlo estimates (“” column) and the ratios between estimated standard deviations and estimates (“”). In addition, we used the approximation of Tracy–Widom distribution (“” column) specified in Eq. (2). The is computed from the RMTstat package in R. Furthermore, following nadler2011distribution , we computed the Tracy–Widom approximation with correction term (“” column), viz.

 Pr(U−μn,pσn,p>x)≈1−TWβ(x)+12(2np)(μn,pσn,p)2TW′′β(x), (14)

where is computed numerically via a standard central differencing scheme with . When , and is chosen according to Eq. (3). When , and is chosen according to karouicomplexrate .

We can see from Tables 1 and 2 that the Tracy–Widom distribution (“” column) significantly overestimates the tail probabilities for all considered settings and the finding is consistent with that in nadler2011distribution . Furthermore, the corrected Tracy–Widom approximation (“” column) underestimates the tail probability and goes to a negative number as becomes small.

Since the proposed importance sampling and the direct Monte Carlo method are both unbiased estimators, next we compare their computational efficiency. As discussed in Section 2, for the average estimator , “” and “” can be used as a measure of the computational efficiency in terms of iteration numbers. From the results in Tables 1 and 2, as decreases, “” grows quickly and even becomes not available. In contrast, “” increases slowly and is generally smaller than “”, showing that the proposed importance sampling is more efficient than the direct Monte Carlo method.

As a further illustration, we compared the iteration numbers and that would be needed to achieve the same level of relative standard errors of the estimators. Specifically, in order to have the same ratios of the standard errors to the estimates, i.e., and , obtained under the importance sampling and direct direct Monte Carlo, respectively, we need

 NDMCNIS=(SDDMC/ESTDMC)2(SDIS/ESTIS)2. (15)

Based on the above equation, the simulation results show that to have a similar standard error obtained under the importance sampling, the direct Monte Carlo method needs more iterations as goes small. For example, from Table 1, when , and , we need to be approximately times larger than ; when , and , we need to be about times larger.

Besides the iteration numbers, we compared the average time cost of each iteration under the importance sampling and the direct Monte Carlo method, respectively. For the direct Monte Carlo, two methods were considered in computing the eigenvalues. The first method directly computes the test statistic using the eigen-decomposition of a randomly sampled Wishart matrix. The second method computes the eigenvalues from the tridiagonal representation form as in Step 1 of Algorithm 1. We ran iterations for all the methods and report the average time of one iteration in Table 3, where the first method of the direct Monte Carlo is denoted as , the second method is denoted as , and the importance sampling method is denoted as . The simulation results show that has the highest time cost per iteration, while and are similar.

We further explain the simulation results from the perspective of algorithm complexity. For each iteration, the first direct Monte Carlo method samples a Wishart matrix and performs its eigen-decomposition, whose cost is typically of the order of . The second direct Monte Carlo method and the importance sampling only need to sample number of chi-square random variables and then decompose a symmetric tridiagonal matrix, at a cost of per iteration Demmel:1997:ANL:264989 . Although the importance sampling also samples from an exponential distribution in Step 2, the distribution parameters can be calculated in advance and it does not affect the overall complexity much. Therefore, the time complexity of the algorithm is higher while and are similar per iteration. Together with the result in (15), we can see that the importance sampling is more efficient than the direct Monte Carlo method in terms of both the iteration number and the overall time cost.

To further check the influence of replication number of the importance sampling algorithm, we focus on the case and compare the performance of different s. In order to obtain accurate reference values of the tail probabilities, we used direct Monte Carlo with repeating time to estimate multiple tail probabilities s ranging from to under and , respectively. Then we estimated the corresponding s using our algorithm with , respectively.

The results are presented in Figure 1, where the -axis represents the reference values . The line “DMC with error bar” represents the (approximated) pointwise confidence intervals, viz.

 [log10(ESTDMC−2×SDDMC/√NDMC),log10(ESTDMC+2×SDDMC/√NDMC)].

Similarly, the line “Importance Sampling with error bar” represents the importance sampling estimates and pointwise confidence intervals, viz.

 [log10(ESTIS−2×SDIS/√NIS),log10(ESTIS+2×SDIS/√NIS)].

One can surmise from the figures that the proposed algorithm gives reliable estimates of probabilities as small as with , which is more efficient than directed Monte Carlo and more accurate than Tracy–Widom approximations. Furthermore, Figure 1 shows that the algorithm improves as the number of iterations increases. We also plot the Tracy–Widom approximations in (2) and (14) in Figure 1 for comparison.

Figure 1 shows that without correction, the Tracy–Widom distribution in (2) is not accurate and overestimates the probabilies. The correction term in (14) improves the approximation when the probability is larger than the scale of about , which is consistent with the result in nadler2011distribution . But when the probability gets smaller, the corrected approximation has larger deviation from true values (on the scale) and even becomes negative. Note that since we cannot plot the of negative numbers in the figures, the lines of the corrected Tracy–Widom approximations appear to be shorter. These results validate the results in Table 1 and 2.