 # Optimal Covariance Estimation for Condition Number Loss in the Spiked Model

We study estimation of the covariance matrix under relative condition number loss κ(Σ^-1/2Σ̂Σ^-1/2), where κ(Δ) is the condition number of matrix Δ, and Σ̂ and Σ are the estimated and theoretical covariance matrices. Optimality in κ-loss provides optimal guarantees in two stylized applications: Multi-User Covariance Estimation and Multi-Task Linear Discriminant Analysis. We assume the so-called spiked covariance model for Σ, and exploit recent advances in understanding that model, to derive a nonlinear shrinker which is asymptotically optimal among orthogonally-equivariant procedures. In our asymptotic study, the number of variables p is comparable to the number of observations n. The form of the optimal nonlinearity depends on the aspect ratio γ=p/n of the data matrix and on the top eigenvalue of Σ. For γ > 0.618..., even dependence on the top eigenvalue can be avoided. The optimal shrinker has two notable properties. First, when p/n →γ≫ 1 is large, it shrinks even very large eigenvalues substantially, by a factor 1/(1+γ). Second, even for moderate γ, certain highly statistically significant eigencomponents will be completely suppressed. We show that when γ≫ 1 is large, purely diagonal covariance matrices can be optimal, despite the top eigenvalues being large and the empirical eigenvalues being highly statistically significant. This aligns with practitioner experience. We identify intuitively reasonable procedures with small worst-case relative regret - the simplest being generalized soft thresholding having threshold at the bulk edge and slope (1+γ)^-1 above the bulk. For γ < 2 it has at most a few percent relative regret.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Sixty years ago, Charles Stein  made the surprising observation that, when estimating the covariance matrix underlying a dataset with variables and observations, if and are both large, and comparable in size, one should shrink the eigenvalues of the empirical covariance matrix away from their raw empirical values.

In the decades since, shrinkage estimates of covariance have been studied by many researchers, and Stein’s insight has emerged as pervasive and profound. Dozens of citations relevant through 2013 are given in ; more recent work includes [35, 49, 31, 7, 29]. Broadly speaking, the literature has found that the appropriate way to shrink the eigenvalues depends both on the use to be made of the estimated covariance matrix and on the properties of the underlying covariance matrix; for procedures optimal under various assumptions, see among others [36, 34, 13].

### 1.1 Our Focus

This paper studies eigenvalue shrinkage under five specific assumptions:

[-Loss]

We consider relative condition number loss:

 L(^Σ,Σ)=κ(Σ−1/2^ΣΣ−1/2),

where denotes the estimated and the true underlying theoretical covariance, and denotes the condition number of matrix .

[Equivariance]

We consider orthogonally equivariant procedures, defined as follows. Let denote the empirical covariance, and denote an estimator of interest. We say that is orthogonally-equivariant (OE) if for all . Such procedures are in a certain sense coordinate-free.

[Spike]

Spiked covariance models, where is the identity outside an -dimensional subspace. Here is fixed, and the top eigenvalues of , , say, exceed , with all later eigenvalues equalling .

[PGA]

We focus on proportional growth asymptotics, where the data matrix sizes grow proportionally large - with asymptotic aspect ratio .

[Distribution]

In this paper we assume the classical model, where the rows of are i.i.d 111We conducted simulation studies of -loss performance of this paper’s shrinkers across a broader collection of situations with the same bulk spectrum. Our observed performance was consistent with our theorems for the Normal case when we used a linear generative model with i.i.d.

having variance

and finite fourth moments. Accordingly, we believe that work of

[2, 5, 10] can extend our results unchanged across a range of non-normal assumptions. We are mainly interested here in the structure of optimal procedures and the phenomena they exhibit..

### 1.2 Optimality

Our assumptions allow us to evaluate the optimal asymptotic loss among orthogonally equivariant procedures.

###### Theorem 1.2.1.

(Optimal Asymptotic Loss) The following limit exists almost surely:

 limn→∞inf^Σ∈OEL(^Σ,Σ)=a.s.L∗(ℓ1,…,ℓr;γ);

say. Here the infimum is over orthogonally equivariant procedures. Definition 3.2.2 in (3.3) below specifies a function depending only on the aspect ratio and on the top spike eigenvalue , for which:

 L∗(ℓ1,…,ℓr;γ)=κ∗1(ℓ1;γ).

Let be the usual empirical second-moment matrix and let be its usual eigendecomposition, where is orthogonal and is diagonal with the ordered eigenvalues along the diagonal.

We derive in Section 4 below, in Theorem 4.2.1 et seq., a closed-form expression for an asymptotically optimal nonlinearity . That theorem defines a nonlinearity having two tuning parameters: , the aspect ratio; and , the limiting value of the top empirical eigenvalue under our asymptotic model. A fully data-driven nonlinearity is obtained by using for the (unknown) limit eigenvalue tuning parameter simply the top empirical eigenvalue , which is observable. This nonlinearity is applied separately to each of the empirical eigenvalues , producing the diagonal matrix which sits at the core of the orthogonally-equivariant covariance estimator .

###### Theorem 1.2.2.

(Asymptotically Optimal Nonlinearity) The estimator is asymptotically optimal among orthogonally equivariant procedures under relative condition number loss:

 limn→∞L(^Σe,Σ)=a.s.κ∗1(ℓ1;γ).

We also show – see Theorem 4.1.1 – that in case there is even an asymptotically optimal nonlinearity that does not depend on the top eigenvalue , but instead only on (which of course is known); the nonlinearity is denoted and derived in Theorem 3.1.1. Moreover even for , is asymptotically optimal in the single-spike situation, ; and is within several percent of optimal for the multi-spike case .

A different option also avoids tuning by the top eigenvalue, and offers theoretical performance guarantees (of a weaker sort) even for . This option intentionally mis-tunes the tuning parameter to infinity, producing a still well-defined nonlinearity . Section 5 shows that this tuning parameter-free nonlinearity minimaxes the -loss across all spike models of all orders and all configurations .

### 1.3 Insights

Figure 1.1 depicts the optimal nonlinearity for various choices of parameters. Several properties seem surprising: Figure 1.1: Optimal shrinkage nonlinearities. Different panels present different aspect ratios γ=p/n. Each panel’s black curve gives the identity line y=x. Each red curve depicts the optimal single-spike nonlinearity η∗1. Each aqua-shaded area depicts the range of the optimal multi-spike nonlinearity η∗m(λ;λ1,γ) as λ1 varies from λ to ∞. The minimax nonlinearity η\footnotesize\sc mm is shown in each panel as the blue curve. In each panel, all depicted shrinkers lie significantly below the identity line y=x, in the case γ=5, so much so that the identity line is no longer visible.
1. Asymptotic slope . A ‘natural’ psychological expectation for optimal shrinkage is that it ‘ought to’ minimally displace any very large empirical eigenvalues, producing ‘shrunken’ outputs that are still relatively close to inputs . Contrary to this belief, the optimal shrinker has asymptotic slope : as . This entails very substantial displacement of very large eigenvalues - in fact shrinkage by more than 50% when : as .

2. Dead-Zone 1. The nonlinearity has a dead zone, an interval throughout which it collapses all empirical eigenvalues to output value . The dead-zone threshold gives the upper edge of this interval for the largest spike:

 η∗(λ1)=1,λ1<λ+1(γ).

A ‘natural’ psychological expectation for the dead zone is that it ‘ought to’ agree with the so called bulk edge

. Indeed traditional statistical hypothesis tests for the number of eigencomponents in a spiked model consider a null hypothesis where

(i.e. all eigenvalues are one), and spiked alternatives where . Such tests declare to be statistically significant evidence against the null whenever it noticeably exceeds the bulk edge [27, 28, 45, 41, 43]. This leads to the ‘natural’ presumption that the ‘correct’ place to terminate the dead zone is at the bulk edge.

Contradicting this, the dead-zone threshold is noticeably larger than ; see Figure 1.2. This gap implies that we can have even though is large enough to provide conclusive statistical evidence for existence of a non-null spike. Optimal shrinkage does not at all agree with statistical significance in such cases.

For an example, imagine that there are times more variables than observations (not unusual in some genomics applications). Then while . Unless the top theoretical eigenvalue - i.e. more than 50 times larger than the average eigenvalue - the eigenvalues all fall in the dead zone. Figure 1.2: Thresholding behavior of optimal shrinkers; comparison to bulk edge. The dead zone thresholds λ+1(γ) (red), λ+m(∞,γ) (blue) and (shaded aqua) the full range of λ+m(λ1,γ). All empirical eigenvalues smaller than these thresholds are collapsed to 1, under the single-spike nonlinearity η∗1, the multi-spike nonlinearity η∗m, and under the minimax nonlinearity, respectively. For comparison, the bulk edge λ+(γ) is also shown, in black. Note the ordering λ+(γ)<λ+1(γ)≤λ+m(λ1;γ)<λ+m(∞;γ).
3. Dead-Zone 2. When the top eigenvalue escapes the dead zone, it can happen that some noticeably large secondary eigenvalues will still be collapsed to . For secondary eigenvalues, we denote by the upper edge of the set where :

 ∀i>1,η∗(λi)=1,λi<λ+m.

Figure 1.2 shows that for moderately large , can be substantially larger than . It were as if the existence of at least one included eigencomponent makes the optimal shrinker be even more demanding of subsidiary eigenvalues. Any eigencomponents between and are, for large , overwhelmingly statistically significant according to well-designed tests  - yet the optimal rule effectively views them as useless, and ignores them. Figure 1.2 shows that, for fixed, varies with .

The severe shrinkage induced by has precedent: the asymptotic slope of this paper’s optimal shrinker occurred previously in the eigenvalue shrinkage literature - most immediately,  showed222See Lemma 7.1, Table 3 and Table 4 in  that, under the spiked model, the optimal nonlinearities also have asymptotic slopes in three important cases: under Stein’s loss; under Frobenius norm loss on the precision discrepancy ; and under operator norm loss on the relative discrepancy .

### 1.4 Other Performance Comparisons

In Section 6 we compare and contrast performance of the optimal shrinker with some popular and well-known approaches. In the process, we learn surprising and revealing things about the well-known approaches.

We first consider worst case analysis across all spike models. We identify two explicit nonlinearities delivering the same worst-case guarantees as the optimal nonlinear shrinker:

• Minimax soft thresholding: The generalized soft thresholding nonlinearity with slope and soft threshold has the form

 η\footnotesize\sc gst(λ;b,λ0)=1+b⋅(λ−λ0)+.

Tuning this rule with slope , (i.e. the asymptotic slope of the optimal shrinker), and threshold at the bulk edge , produces a minimax estimator333i.e. this rule minimaxes the asymptotic loss – min across rules and max across spike configurations. For space reasons we do not prove the minimaxity property in this paper..

• Precision Nonlinearity: the nonlinearity derived in  for asy. optimal estimation of the so-called precision matrix under Frobenius norm loss; see (6.2) below. also turns out to be asy. minimax for relative condition number loss.

We next consider the relative regret, i.e. the percentage extra loss suffered by a given rule; and we show that both of these rules suffer minimally - from a few to several percent additional loss, as compared to the optimal rule.

### 1.5 Motivation for Relative Condition Number Loss

Our interest in relative condition number loss estimation has been driven both by the mathematical analysis and by the potential applications.

#### 1.5.1 Mathematical Structure

Optimal shrinkers under the spiked model were derived in 

for 26 different loss functions, including Frobenius loss, Stein loss, and Operator loss and Nuclear norm loss. In that paper the discrepancy between the estimated and underlying covariance was shown to have a certain asymptotic

block structure. This could be exploited when the loss measure exhibited a certain separability property. For such separable losses, the optimal shrinker could be obtained by a simple spike-by-spike analysis.

In this paper, the block structure persists, as we explain in Section 2 below; however separability is lost. Thus, the general perspective that was used successfully 26 times in  is not applicable. Instead, the optimal shrinker is no longer separable; as depicted in Figure 1.1 the shrinker’s output at a given sub-principal eigenvalue , , changes according to the value of the top eigenvalue . Also, study of properties using spike-by-spike analysis no longer works. Our proofs of optimality show that the non-separable optimal rule manages a delicate cross-spike compromise - typically choosing a shrinker to balance distortions at the top spike and the bottom spike. So, while we build on the earlier ideas, several new ideas are needed and developed in this paper.

#### 1.5.2 Multi-User Covariance Estimation

Covariance matrices play an essential role in modern portfolio allocation in empirical finance . An extensive statistical literature discusses the benefits of shrunken covariance estimates in that setting; we mention work by Bai and co-authors , El Karoui , Fan and co-authors [18, 19], Lai, Xing and Chen , Ledoit and Wolf , McKay and co-authors: [50, 9] and Onatski .

From the shrinkage literature across 6 decades and also individual papers like , we know that shrinkage is often beneficial, but optimal shrinkage is very task-dependent. Multi-user Covariance Estimation (MUCE) posits an interesting variation on the traditional portfolio allocation task, in which optimal -loss shrinkage plays a key role.

Suppose a central authority supplies a covariance matrix to many users, who each privately use the supplied matrix to perform mean-variance portfolio optimization. The

-th user is assumed to have a vector

of forecasted returns, and to allocate one unit of investor capital across a portfolio (vector) of holdings by solving the mean-variance portfolio allocation problem 

 (\sc mv(μu,^Σ))Minimize over% h:h′^Σhsubject to h′μu=1. (1.1)

The forecast is presumably private to user , and is not known to the suppliers of the covariance matrix (and presumably each user’s forecast is kept private from other users)444Problem (1.1) suppresses the leverage and short sale constraints which often complicate practical allocation; we prefer to focus on risk-return trade-offs in our discussion..

MUCE models a common situation in empirical finance, where risk measurement enterprises Axioma, MSCI-Barra, and RiskMetrics, do, in fact, disseminate risk models (i.e covariance matrices) to their customers on a periodic basis. Many of those customers then privately use the supplied covariance matrices in portfolio allocation.

MUCE is also rumored to arise in large trading organizations composed of decentralized forecasting teams working independently across the organization to each develop their own mean forecasts, where the organization has a central risk-measurement team that supplies a returns covariance matrix for all forecasting teams to use in decentralized allocation. Namely, each team in the organization mean-variance allocates its own individual capital endowment separately from other teams, using the organization’s common covariance matrix and that team’s private forecast.

In the MUCE setting, the producer of the covariance matrix estimate does not know the forecasts that the matrix’ users are applying, and in particular cannot make use of empirical cross-user compromises such as ‘produce the matrix that historically works best on average across all existing users’.

Relative condition number loss optimality, explored in this paper, offers instead mathematical guarantees that apply across all users.

Suppose that the returns , , and let denote the expected out-of-sample risk-adjusted return, where the estimated covariance matrix and (true) forecast are supplied to (1.1); and where, in computing the risk-adjusted return, we evaluate the risk of the portfolio using the correct risk model . Similarly, let denote the Sharpe ratio when the true underlying covariance matrix is used both in Markowitz allocation and in evaluation of the portfolio’s risk-adjusted return. We always have

 \sc sr(^Σ;μ,Σ)≤\sc sr(Σ;μ,Σ);

but in the high-dimensional setting where , the shortfall is typically substantial. In words, using rather than for portfolio allocation causes a substantial shortfall of realized Sharpe ratio [15, 35, 30, 33].

Performance shortfalls are pervasive in real-world empirical finance and one would like to limit them where possible. A Relative Sharpe Ratio Guarantee (RSRG) for is a statement of the form

 \sc sr(Σ;μ,Σ)/ρ≤\sc sr(^Σ;μ,Σ),∀μ.

It guarantees that the Sharpe ratio one experiences using never suffers more than the given factor of deterioration, no matter which underlying is in force. In this way the performance shortfall is curtailed.

Let denote the smallest positive which allows such a guarantee; this has an intimate connection with -loss.

###### Lemma 1.5.1.

(RSRG in terms of Condition Number.) Define the pivot .

 supμ\sc sr(Σ;μ,Σ)\sc sr(^Σ;μ,Σ)=12√κ(Δ)+1κ(Δ)+2 (1.2)

More specifically, let ; then

 \sc rsrg(^Σ(η))=12√κ(Δ(η))+1κ(Δ(η))+2.

The correspondence is one-to-one on . So this lemma sets up an isomorphism between questions concerning Sharpe ratio guarantees and questions concerning condition numbers. Since is increasing in , minimizing relative condition-number loss is equivalent to minimizing rsrg.

Ever since RA Fisher’s pioneering work in discriminant analysis 

, the covariance matrix has been an essential component in designing linear classifiers. Fisher of course showed how to use the underlying theoretical covariance in LDA, but that covariance matrix would not truly be available to us in applications. The dominant approach over the ensuing eight decades has certainly been to ‘plug-in’ the naive empirical covariance as if it were the best available approximation to the theoretical covariance. Relatively little attention has been paid to the idea that in high-dimensional cases the covariance might need to be estimated with application to LDA in mind.

Bickel and Levina  discuss some of the difficulties of traditional empirical covariance matrices as plug-in inputs to Fisher’s discriminant analysis. Methodological work aiming to surmount some of these difficulties includes Jerome Friedman’s regularized discriminant analysis (RDA)  and several methods developed by Jianqing Fan and co-authors [16, 17], including ROAD. Theoretical work documenting the difficulties of plug in rules under high-dimensional asymptotics includes pioneering work of Serdobolski . Recent work on RDA by Dobriban and Wager  studied regularized discriminant analysis under Serdobolski-flavored asymptotics.

This paper’s study of estimation under -loss is directly relevant to Multi-Task discriminant analysis (MTDA), which envisions the use of LDA across many different classification tasks, but always with the same underlying features driving the classification and always the same covariance estimate defined on those features.

The setting for MTDA is increasingly important: it applies to large shared databases which many researchers can mine for different purposes. There are conceptually two distinct databases: first, a database ‘X’ in which a sample of individuals has been measured on ‘predictive’, or easy-to-measure, feature values. Separately, other ’outcome’ or ‘hard-to-measure’ properties about those same individuals are recorded in a database ‘Y’. Researchers are interested in using the outcome ‘Y’ data to define interesting dichotomies of individuals, and then in developing rules to predict class membership in such dichotomies, based on the ‘X’ data.

In one stylized application of MTDA, ‘X’ would be gene expression data and ‘Y’ phenotypic data. Dichotomies based on ‘Y’ split individuals into classes with and without certain phenotypes. The goal of one individual researcher is to identify an interesting phenotypic split and then use gene expression data to predict those phenotypes.

Historically, researchers worked in an uncoordinated way on data of this kind, each one developing a bespoke classifier. Such a practice - developing a different workflow for every such project - is considered harmful by many serious scientists , as the resulting analysis variability and method instability adds uncertainty to the interpretation of researcher claims. For this reason, and also because it’s good research hygiene to have a good baseline procedure, one might consider developing an LDA-based procedure appropriate for use across many different dichotomies.

A common situation in such large shared databases involves the number of ‘X’ features being comparable to or larger than the number of individuals in the database. In that setting, high-dimensional asymptotics become relevant, and we should carefully pay attention to the inaccuracies in the covariance matrix.

Recall that in traditional single-task linear discriminant analysis (LDA)  the feature measurements in the two classes of a dichotomy are assumed to have distributions , where the denote the class-conditional means. For simplicity, we assume that there are a priori the same number of individuals in each class, and define the mean feature value , and the mean interclass contrast . Under traditional LDA, we attempt to classify a future observed into one of the two groups based on linear classifier scores:

 w′(X−ξ)<>0.

Fisher showed that, when is known, the ideal weights are the solution to

 (\sc lda(μ,Σ))Maximize over w:w′μ√w′Σw (1.3)

so that . The misclassification rate of this ideal classifier is , where denotes the standard CDF, and sep denotes Fisher’s measure of interclass separation:

 \sc sep(Σ;μ,Σ)≡√μ′Σ−1μ.

In practice, is not available, and represents an ideal performance achievable only with the aid of an oracle. Plugging into (1.3) gives the achievable interclass separation

 \sc sep(^Σ;μ,Σ)≡μ′^Σ−1μ√μ′^Σ−1Σ^Σ−1μ.

When the number of individuals in the database is comparable to the number of ‘X’ feature measurements per individual, the concerns of this paper become relevant. We always have , but when , the achievable classifier performance can fall well short of ideal oracle performance. A relative separation guarantee is an inequality of the form

 \sc sep(Σ;μ,Σ)/ρ<\sc sep(^Σ;μ,Σ),∀μ.

Specifically, we are guaranteed that the asy. misclassification rate of any dichotomy will not exceed where sep here denotes the ideal separation . The best performance guarantee relative to oracle performance involves

 \sc rsepg≡supμ\sc sep(Σ;μ,Σ)\sc sep(^Σ;μ,Σ).

There is a formal resemblance of Fisher separation sep and Sharpe Ratio sr which implies a formal isomorphism between the MTDA and MUCE settings. Indeed, Lemma 1.5.1 yields that

 ρ=12√κ(Δ)+1κ(Δ)+2,

where, as earlier, . When applied in the MTDA setting, the optimal shrinker of this paper offers a covariance estimate which optimally limits the maximal cross-dichotomy performance shortfall relative to an oracle.

#### 1.5.4 Implications

In these stylized applications, the optimal shrinker behaves quite differently than today’s common practices. Consider first the treatment of large eigenvalues.

• Today, many practical applications of mean-variance optimization and Fisher LDA undoubtedly involve no eigenvalue shrinkage whatever.

• Where shrinkage has been applied in practice - for example by Barra – the goal has been to ensure that the variance in top eigendirections is accurately estimated. Such debiasing rules shrink large eigenvalues relatively little. In contrast this paper’s shrinker discounts the top eigenvalues quite severely, by a factor .

• Shrinkage of the top eigenvalue by nonlinearity combats a marked deficiency of the and optimization problems. Those optimization procedures are not aware of the types of inaccuracy suffered by the covariance estimate

in the high-dimensional case. They ‘clamp down’ severely on exposures to the top empirical eigenvector. However, when the top eigenvector is not perfectly accurate, there is a limited risk benefit from such clamping down.

is aware of the inaccuracy of eigenvectors and guards against clamping down too severely.

Consider next the treatment of small or moderate eigenvalues.

• It is common in applied statistics to decide the rank of a factor model by null hypothesis significance testing. This leads to accepting eigencomponents as part of the covariance model whenever the corresponding empirical eigenvalue exceeds the bulk edge noticeably. The optimal shrinker derived here sets a noticeably higher standard for eigencomponent inclusion, i.e a threshold well outside the bulk, thereby demanding much more than mere ‘statistical significance’.

• In particular, there are spike configurations where the empirical eigenvalues are unquestionably non-null, yet the optimally shrunken covariance model is simply the identity matrix. In such situations, no-one would dispute the existence of correlations between variables, however, it could be understood that the corresponding eigencomponents are too inaccurate to be of value.

For an empirical finance scenario, suppose that we have stocks (as in the Russell 3000) and observations (1 year of daily returns). Then , and while . The threshold for inclusion in the model by the optimal shrinker is asymptotically 38% larger than the usual threshold based on null hypothesis significance testing.

Finally, consider the worst-case situation for MUCE and MTDA. Practitioners often suspect that theoretical guarantees are misleading, because they may guard against unrealistically pessimistic situations. And yet:

• In Section 7 below, we show that the least-favorable forecasts (in MUCE) / least-favorable dichotomies in (MTDA) involve a linear combination of the top empirical eigenvector and the top theoretical eigenvector.

In empirical finance, these two directions translate into the past market direction and the future market direction, respectively. Undoubtedly, many investors face heavy exposures of just this type. In genomic analysis, those two directions concern the directions of strongest underlying genetic relatedness in the population and in the sample, respectively. Undoubtedly, some of the most provocative dichotomies are well approximated using just these two directions.

We briefly mention implications specific to Multi-Task Discriminant Analysis. Because of R.A. Fisher’s prestige, it might today be considered de rigueur to use the standard empirical covariance matrix in developing an empirical linear classifier (pace Bickel and Levina ). Nevertheless, practitioners have long successfully used naive diagonal rules in linear classification and they have sometimes taken pains to document the fact that such naive rules outperform covariance-aware rules; see references in Hand and Yu , the empirical studies of Zhao and co-authors on the very simple Mas-O-Menos classifier , and of Donoho and Jin on the closely related Higher-Criticism feature selector/classifier .

The theory developed here aligns with earlier research – both empirical and theoretical – affirming the use of ‘naive’ diagonal covariance estimates in LDA, but going much further in showing that such naive rules can actually be optimal - in the sense discussed here. For well-cited theoretical work considering diagonal covariances in place of standard covariance estimates when is large, see for example [6, 14, 16]. Under our spiked model assumption, where is large and the top eigenvalue is not very large, the optimal equivariant shrinker literally reduces to the naive identity rule. In short, the identity covariance estimate offers optimal -loss guarantees in a range of cases where there are relatively large and statistically significant eigencomponents.

An example given earlier considered as being very plausible for genomic data analysis. In that setting, unless exceeds 100, the best achievable performance guarantee comes by using the identity covariance matrix.

The theory developed here also provides an additional setting – to add to the previously known ones - where singular values and eigenvalues can sometimes profitably be ignored even when they are noticeably outside the bulk; compare

[23, 13].

## 2 Basic Tools

### 2.1 Spiked Model Asymptotics

We remind the reader of two central assumptions:

[PGA]

Proportional-Growth Asymptotics. We consider a sequence of problem sizes while .

[SPIKE]

Spiked Covariance Model. We assume that the population covariance is spiked, with fixed spikes and base variance . In other words, the eigenvalues of are .

In this setting, the empirical eigenvalues of are random, but have a very simple asymptotic description.

Inside the “bulk”

All but at most lie inside a bulk distribution extending through the interval where

 λ±(γ)=(1±√γ)2.
Outside the “bulk”

Each theoretical eigenvalue among the first which exceeds , generates a corresponding empirical eigenvalue of outside the bulk , and converges to a limiting position (see (2.1) below) as . Asymptotically, at most eigenvalues lie outside the vicinity of the bulk.

Moreover, in the spiked setting, the following fundamental result tells us a great deal about the behavior of the eigenvalues outside the bulk and their eigenvectors. Variants and extensions have been developed by Baik, Ben Arous, and Péché , Baik and Silverstein , Paul , Nadler , Benaych-Georges and Rao-Nadakuditi .

###### Theorem 2.1.1.

(Spiked Covariance Asymptotics, [3, 4, 44, 39, 5]) In the spiked model [SPIKE] under the proportional growth asymptotic [PGA], we have eigenvalue displacement, such that for each spike eigenvalue , a corresponding empirical eigenvalue obeys:

 λi,n=λ(ℓi;γ)⋅(1+oP(1)),n→∞,

where

 λ(ℓ;γ)≡{ℓ⋅(1+γℓ−1)ℓ>ℓ+(γ)λ+(γ)ℓ≤ℓ+(γ). (2.1)

Suppose that the spike values are distinct. We also have eigenvector rotation, such that the theoretical eigenvector and the corresponding empirical eigenvector obey:

 |⟨Ui,Vi⟩|a.s−→c(ℓi;γ),n→∞,

where

 c(ℓ;γ)≡⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩ ⎷1−γ(ℓ−1)21+γℓ−1ℓ>ℓ+(γ)0ℓ≤ℓ+(γ).

In addition, for , we have

 |⟨Ui,Vj⟩|a.s−→0 as n→∞.

Earlier, we pointed out that at most the first eigenvalues emerge from the bulk. The following lemma formalizes this notion.

###### Lemma 2.1.2.

[4, 44] Let be the (decreasingly arranged - ) empirical eigenvalues of the empirical covariance matrix . Under the assumptions [SPIKE] and [PGA], suppose there are spikes exceeding . Then

 λk+1,na.s−→λ+(γ),n→∞.

All results of this paper can be viewed as consequences of Theorem 2.1.1 and Lemma 2.1.2. We believe the conclusions of Theorem 2.1.1 and Lemma 2.1.2 extend to a wider range of assumptions, where the bulk distribution of empirical eigenvalues still follows the Marčenko-Pastur law. See also the footnote in Section 1.

### 2.2 Two Parametrizations of Eigenvalues

Theorem 2.1.1 gives us two ways to describe the top eigenvalues. We can equivalently describe either:

• Underlying theoretical eigenvalues associated to ; these are not directly observable.

• Observable empirical eigenvalues associated to empirical covariance matrices , with large- limits . The correspond to, but are displaced from, their theoretical counterparts .

As long as we are speaking about a theoretical spike eigenvalue , the formula for the limit empirical eigenvalue applies and gives a one-one-correspondence between the spike and limit empirical . We can invert the relation to obtain:

 ℓ(λ;γ)=λ+1−γ+√(λ+1−γ)2−4λ2, (2.2)

provided lies above the bulk edge .

The one-one correspondence establishes a kind of interchangeability between writing out key expressions in terms of spike parameters or in terms of limiting empirical eigenvalues . It turns out to be very convenient to have understood conventions whereby we can without comment write expressions in terms of , in terms of , or even in terms of both. Accordingly, we adopt three conventions.

Convention 1. Provided - or equivalently - we are permitted to write expressions

 F(λ)=G(ℓ)

and the meaning will be unambiguous. If we are in a pedantic mood, we can rewrite such expressions entirely in terms of (hence as ) or entirely in terms of (hence as ). We adopt without comment the habit of writing expressions in terms of a mixture of ’s and ’s when it gives simpler or more memorable formulas.

Convention 2. Provided - or equivalently - there are defined quantities and that may be written in terms either of or and we may write expressions

 F(ℓ,λ,c,s)

and the meaning will be unambiguous. In a pedantic mood we could write everything out in terms of or in terms of .

Convention 3. In several cases below it will be convenient to consider some function as in certain passages a function of and other passages a function of . In such cases what we really mean is that we are assuming and that there is a function and a function , linked by and over the relevant ranges of both and , and that when we write , we really mean , while when we write , we mean . We believe there is little risk of confusion in these instances. 555Note: similar conventions are followed in object-oriented programming languages; different formulas apply depending on the ‘type’ of the argument. Here the type is ‘empirical eigenvalue (or its limit)’ or ‘spike value’.

### 2.3 The Empirical Pivot and Asymptotic Pivot

Let the empirical covariance matrix have spectral decomposition . Let be some nonnegative sequence ; then is a matrix where we retain the empirical eigenvectors but replace the eigenvalues by .

We consider a matrix-valued discrepancy between such an estimator and the underlying covariance which we call the empirical pivot. Specifically, .

We always represent the empirical pivot in the so-called -basis:

 Δn=W′Σ−1/2^Σn(η)Σ−1/2W.

As in , the columns of are basis vectors obtained, for , by applying Gram-Schmidt orthogonalization to the vectors

, producing a sequence of orthonormal vectors

, which we then complete out to an orthogonal basis of . From now on, we always assume that we have transformed coordinates to this -basis and we simply write .

###### Lemma 2.3.1.

 (Convergence to the Asymptotic Pivot). Suppose that has the special form

 (ηi)=(η1,η2,…,ηr,1,…,1),ηi≥1. (2.3)

where ,… are fixed independently of . Let denote the deterministic block-diagonal matrix

 Δa=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣A(ℓ1,η1)A(ℓ2,η2)A(ℓ3,η3)...A(ℓr,ηr)⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦⊕Ip−2r,

where the 2-by-2 matrices are defined via:

 A(ℓ,η):=⎡⎢ ⎢⎣ηc2+s2ℓ(η−1)cs√ℓ(η−1)cs√ℓc2+ηs2⎤⎥ ⎥⎦,

with and

. We have almost surely and in probability the convergence in Frobenius norm:

 ∥Δn−Δa((ℓi),(ηi))∥F→0, (2.4)

as .

Below, we will call the asymptotic pivot; it is actually a sequence of pivots, one for each pair visited along the way to our proportional growth limit. Convergence in Frobenius norm is not happening in one common space, but instead, the Frobenius norm of the pivot difference is tending to zero along this sequence.

The following trivial but still useful corollary follows from the continuity of and the triangle inequality for the Frobenius norm.

###### Corollary 2.3.2.

Consider a sequence of vectors such that , , . Then for

 Δn((ηi,n))=W′Σ−1/2^Σn((ηi,n))Σ−1/2W,

we have both

 ∥Δn−Δa((ℓi),(ηi))∥F→0, (2.5)

and

 ∥Δn−Δa((ℓi),(ηi,n))∥F→0, (2.6)

### 2.4 Pivot Optimization

Consider a spike configuration fixed independently of , . Let denote the collection of vectors obeying . The condition is a normalization condition which is also obeyed by the underlying theoretical eigenvalues . Denote the empirical pivot by

 Δe((ℓi);(ηi)pi=1;n,p)=Σ−1/2^Σ(η)Σ−1/2,

where . Consider the empirical pivot optimization problem

 (Ke((ℓi);n,p))minη∈ηr,pκ(Δe((ℓi);(ηi);n,p)).

The value of this finite- problem is random (because the are random).

Lemma 2.3.1 suggests that this seemingly difficult problem can be replaced by the seemingly easier problem of determining scalars that optimize the asymptotic pivot matrix. This section justifies such replacement.

###### Lemma 2.4.1.

(Asymptotic Pivot Optimization). Consider a fixed configuration and a formal variable with free to vary in for

for as in (2.3). Define

 Δa((ℓi);(ηi);n,p,γ)=⊕ri=1A(ℓi,ηi;γ)⊕Ip−2r.

Consider the asymptotic pivot optimization problem

 (Ka((ℓi);n,pn,γ))min(ηi)ri=1κ(Δa((ℓi);(ηi);n,pn,γ)). (2.7)
1. Setting fixed independently of , this problem has the same optimal value for every . There exists a configuration which, when padded by ones as in (2.3), can achieve this optimal value, simultaneously for every . Denote this configuration by .

2. The configuration can be taken to satisfy the constraint , . Thus we may equivalently write (2.7) as a constrained optimization problem:

 (Ka((ℓi);n,pn,γ))min(ηi)ri=1ηi≥1κ(Δa((ℓi);(ηi);n,pn,γ)). (2.8)
###### Theorem 2.4.2.

(Asymptotics of empirical pivot optimization). The value of the empirical pivot optimization problem tends both almost surely and in probability to the same value as the asymptotic pivot optimization problem:

 val(Ke((ℓi);n,pn))→val(Ka((ℓi);n,pn,γ)),n→∞.

Finally, let denote an optimizing configuration of the asymptotic pivot optimization problem . Padding this optimum configuration with ’s asymptotically almost surely achieves the optimum of the empirical pivot

 |κ(Δe((ℓi),(η∗1,…η∗r,1,1,…,1);n,pn))−val(Ke((ℓi);n,pn))|→0,n→∞.

Summarizing: solving the asymptotic pivot optimization problem gives us a shrinker which approximately solves each associated large, finite- empirical pivot optimization problem.

Of course, asymptotically small perturbations of are also asymptotically optimal. Arguing as for Corollary 2.3.2 gives

###### Corollary 2.4.3.

Consider a sequence of vectors such that , , . Then as ,

 |κ(Δe((ℓi),(η1,n,…ηr,n,1,1,…,1);n,pn))−val(Ke((ℓi);n,pn))|→0.

### 2.5 Optimality over all Orthogonally Invariant Procedures

The pivot optimization explicitly optimizes over diagonals in the spectral decomposition of the estimator. Actually this optimization covers all orthogonally-equivariant procedures.

###### Lemma 2.5.1.

Diagonal Representation of Orthogonal Equivariance: Let denote the collection of symmetric nonnegative semidefinite matrices. Any OE procedure is of the form

 ^Σ(S)=Vdiag(η1,…,ηp)V′ (2.9)

where is a spectral decomposition of and

1. for , is a nonnegative function of .

2. for some and all , .

The diagonal representation shows that if, for a given realization , we optimize across all possible diagonals in with variable , we have thereby optimized over the range of matrices possible from orthogonally-equivariant estimators.

Empirical pivot optimization performs such optimization, however under a normalization condition, which normalizes the estimated bulk eigenvalues to have the same average as the bulk theoretical eigenvalues. The value of the normalized optimization problem is actually the same as that of the un-normalized problem. Indeed, the objective function is scale invariant; namely, for

 κ(Δe((ℓi),(ηi)pni=1;n,p))=κ(Δe((ℓi),(b⋅ηi)pi=1;n,p)).

It follows that

 κ(Δe((ℓi),(ηi)))=κ(Δe((ℓi),(ηi/(Avei>rηi)))).

The constrained optimization problem is thus effectively unconstrained. We prefer the normalized problem because the asymptotic analysis identifying its limit behavior is then more transparent and better connected to the Asymptotic Pivot Optimization problem.

### 2.6 Condition Number of the Asymptotic Pivot

We turn our attention to evaluating and optimizing the asymptotic pivot, which we henceforth typically denote simply by - the value of being suppressed in notation.

Using the block diagonal representation of Lemma 2.3.1, the argument for Lemma 2.4.1 shows that if , the asymptotic pivot has condition number

 κ(Δ((ℓi),(ηi)))=maxiν+(A(ℓi,ηi))miniν−(A(ℓi,ηi)), (2.10)

where denote the maximum and minimum characteristic values of the two-by-two matrix , and the maximum and minimum range over , where we formally set and .

###### Lemma 2.6.1.

 For the 2-by-2 matrix we have

 D≡D(ℓ,η)=det(A(ℓ,η))=ηℓ,
 T≡T(ℓ,η)=tr(A(ℓ,η))=(1+˙ηc2ℓ+1+˙ηs2),

where . Consequently:

 ν±(A(ℓ,η))=T/2±√T2/4−D.

For later use, we note that and . Therefore,

 ν+(A(ℓ,1))=1ν−(A(ℓ,1))=ℓ−1.

## 3 The Optimal Shrinker in the Single-Spike Case

We initially consider single-spike configurations , and derive an optimal shrinkage nonlinearity, .

### 3.1 The Closed-Form Expression

The shrinker optimize the asymptotic pivot will equivalently optimize, for , this specialization of (2.10):

 K1(η)≡minηmax[1,ν+(A(ℓ,η))]min[1,ν−(A(ℓ,η))]. (3.1)
###### Theorem 3.1.1.

In the single-spike case, , the -optimal shrinker has the following form

 η∗1(λ)=⎧⎪⎨⎪⎩ℓ1+γ+2γℓ−1ℓ>ℓ+1(γ)1ℓ≤ℓ+1(γ), (3.2)

where . In fact, , and so is also expressible as a function of throughout the domain .

### 3.2 Properties of η∗1

Let denote the formula occurring in the upper branch of the case statement defining in (3.2). Then throughout the interval . Note that is precisely the value of at which and that gives the interval of those where . The branching by cases in (3.2) serves to keep , which is appropriate in the spiked model, as every model eigenvalue is at least , and it therefore makes no sense to shrink to a value less .

Corresponding to on the -scale, we have a threshold