# M-estimators of scatter with eigenvalue shrinkage

A popular regularized (shrinkage) covariance estimator is the shrinkage sample covariance matrix (SCM) which shares the same set of eigenvectors as the SCM but shrinks its eigenvalues toward its grand mean. In this paper, a more general approach is considered in which the SCM is replaced by an M-estimator of scatter matrix and a fully automatic data adaptive method to compute the optimal shrinkage parameter with minimum mean squared error is proposed. Our approach permits the use of any weight function such as Gaussian, Huber's, or t weight functions, all of which are commonly used in M-estimation framework. Our simulation examples illustrate that shrinkage M-estimators based on the proposed optimal tuning combined with robust weight function do not loose in performance to shrinkage SCM estimator when the data is Gaussian, but provide significantly improved performance when the data is sampled from a heavy-tailed distribution.

## Authors

• 17 publications
• 16 publications
• 9 publications
06/17/2020

### Shrinking the eigenvalues of M-estimators of covariance matrix

A highly popular regularized (shrinkage) covariance matrix estimator is ...
09/21/2018

### Shrinkage estimation of large covariance matrices using multiple shrinkage targets

Linear shrinkage estimators of a covariance matrix --- defined by a weig...
12/05/2014

### Multi-Target Shrinkage

Stein showed that the multivariate sample mean is outperformed by "shrin...
12/13/2018

### Split regression modeling

In this note we study the benefits of splitting variables variables for ...
09/28/2020

### Shrinkage Estimation of the Frechet Mean in Lie groups

Data in non-Euclidean spaces are commonly encountered in many fields of ...
07/22/2019

### Doubly robust off-policy evaluation with shrinkage

We design a new family of estimators for off-policy evaluation in contex...
04/03/2020

### Relaxing the Gaussian assumption in Shrinkage and SURE in high dimension

Shrinkage estimation is a fundamental tool of modern statistics, pioneer...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider a sample of

-dimensional vectors

sampled from a distribution of a random vector with

. One of the first tasks in the analysis of high-dimensional data is to estimate the covariance matrix. The most commonly used estimator is the sample covariance matrix (SCM),

, but its main drawbacks are its loss of efficiency when sampling from distributions which have longer tails than the multivariate normal (MVN) distribution and its sensitivity to outliers. Although being unbiased estimator of the covariance matrix

for any sample length , it is well-known that the eigenvalues are poorly estimated when is not orders of magnitude larger than

. In such cases, one commonly uses a regularized SCM (RSCM) with a linear shrinkage towards a scaled identity matrix,

 Sβ=βS+(1−β)tr(S)pI, (1)

where is the regularization parameter. The RSCM shares the same set of eigenvectors as the SCM , but its eigenvalues are shrinked towards the grand mean of the eigenvalues. That is, if denote the eigenvalues of , then are the eigenvalues of , where . Optimal computation of such that has minimum mean squared error (MMSE) has been developed for example in [1, 2].

The estimator in (1) remains sensitive to outliers and non-Gaussianity. M-estimators of scatter [3] are popular robust alternatives to SCM. We consider the situation where and hence a conventional M-estimator of scatter exists and can be used in place of the SCM in (1). We then propose a fully automatic data adaptive method to compute the optimal shrinkage parameter . First, we derive an approximation for parameter that attains the minimum MMSE and then propose a data adaptive method for its computation. The benefit of our approach is that it can be easily applied to any M-estimator using any weight function

. Our simulation examples illustrate that a shrinkage M-estimator using the proposed tuning and a robust loss function does not loose in performance to optimal shrinkage SCM estimator when the data is Gaussian, but is able to provide significantly improved performance in the case of heavy-tailed data.

Relations to prior work: Earlier work, [4, 5, 6, 7, 8, 9], proposed regularized M-estimators of scatter matrix either by adding a penalty function to M-estimation objective function or a diagonal loading term to the respective first-order solution (M-estimating equation). We consider a simpler approach that uses conventional M-estimator and shrinks its eigenvalues to grand mean of the eigenvalues. Our approach permits computation of the optimal shrinkage parameter for any M-estimation weight function.

The paper is structured as follows. Section 2 introduces the proposed shrinkage M-estimator framework. Section 3 discusses automatic computation of the optimal shrinkage parameter under the assumption of sampling from unspecified elliptical distribution. Section 4 contains simulation studies.

## 2 Shrinkage M-estimators of scatter

In this paper, we assume that and consider an M-estimator of scatter matrix [3] that solves an estimating equation

 ^Σ=1nn∑i=1u(x⊤i^Σ−1xi)xix⊤i, (2)

where is a non-increasing weight function. An M-estimator is a sort of adaptively weighted sample covariance matrix with weights determined by function . To guarantee existence of the solution, it is required that the data verifies the condition stated in [10]. An M-estimator of scatter which shrinks the eigenvalues towards the grand mean of the eigenvalues is then defined as:

 ^Σβ=β^Σ+(1−β)tr(^Σ)pI. (3)

For example, the RSCM is obtained when one uses the Gaussian weight function since then . Other popular choices are Huber’s weight function

 u\tiny H(t;c)=max(−c2,min(t,c2))/b, (4)

where is a tuning constant, chosen by the user, and is a scaling factor used to obtain Fisher consistency at the multivariate normal (MVN) distribution :

 b=Fχ2p+2(c2)+c2(1−Fχ2p(c2))/p.

We choose as

th upper quantile of

: . Another popular choice is -MLE weight function

 u\tiny T(t;ν)=p+νν+t (5)

in which case the corresponding M-estimator is also the maximum likelihood estimate (MLE) of the scatter matrix parameter of a -variate -distribution with degrees of freedom.

An M-estimator is a consistent estimator of the M-functional of scatter matrix, defined as

 Σ0=E[u(x⊤Σ−10x)xx⊤].−2pt (6)

If the population M-functional is known, then by defining a 1-step estimator

 C=1nn∑i=1u(x⊤iΣ−10xi)xix⊤i−5pt (7)

we can compute

 Cβ=βC+(1−β)[tr(C)/p]I−2pt (8)

as a proxy for . Naturally, such an estimator is fictional, as the initial value is unknown. The 1-step estimator is an unbiased estimator of , i.e., .

Ideally we would like to find the value of for which the corresponding estimator attains the minimum MSE, that is,

 βo=argminβ{MSE(^Σβ)=E[∥∥^Σβ−Σ0∥2F]}, (9)

where denotes the Frobenius matrix norm (). Since solving (9) is not doable due to the implicit form of M-estimators, we look for an approximation:

 βappo=argminβ {MSE(Cβ)=E[∥∥Cβ−Σ0∥∥2F]}. (10)

Such approach was also used in [11] in deriving an optimal parameter for shrinkage Tyler’s M-estimator of scatter.

Before stating the expression for we introduce a sphericity measure of scatter:

 γ=ptr(Σ20)tr(Σ0)2. (11)

Sphericity measures how close is to a scaled identity matrix: where if and only if and if has rank equal to 1.

###### Theorem 1.

Suppose is an i.i.d. random sample from any -variate distribution (not necessarily elliptical distribution), and is a weight function for which the expectation exists. The oracle parameter in (10) is

 βappo =∥Σ0−ηoI∥2FE[∥∥C−(tr(C)/p)I∥∥2F] (12) =p(γ−1)η2oE[tr(C2)]−p−1E[tr(C)2] (13)

where and is defined in (11). Note that and the value of the MSE at the optimum is

 MSE(Cβappo)=E[tr(C)2]−tr(Σ0)2p+(1−βappo)∥∥Σ0−ηo∥∥2F. (14)
###### Proof.

Write . Then note that

 L(β) =E[∥∥βC+(1−β)p−1tr(C)I−Σ0∥∥2F] =E[∥∥β(C−Σ0)+(1−β)(p−1tr(C)I−Σ0)∥∥2F] =β2a1+(1−β)2a2+2β(1−β)a3,

where , and

 a2 =E[∥∥p−1tr(C)I−Σ0∥∥2F] =a3+tr(Σ20)−pη2o=a3+p(γ−1)η2o a3 =p−1E[tr(C)tr(C−Σ0)]=p−1E[tr(C)2]−η2op.

Note that is a convex quadratic function in with a unique minimum given by

 βappo=a2−a3(a1−a3)+(a2−a3).

Substituting the expressions for constants and into yields the stated result. ∎

Next we derive a more explicit form of by assuming that the data is generated from unspecified elliptically symmetric distribution.

## 3 Shrinkage parameter computation

Maronna [3] developed M-estimators of scatter matrices originally within the framework of elliptically symmetric distributions [12, 13]

. The probability density function (p.d.f.) of centered (zero mean) elliptically distributed random vector

is

 f(x)=Cp,g|Σ|−1/2g(x⊤Σ−1x),

where is the positive definite symmetric matrix parameter, called the scatter matrix, is the density generator, which is a fixed function that is independent of and , and is a normalizing constant ensuring that integrates to 1. The density generator determines the elliptical distribution. For example, the MVN distribution is obtained when and the -distribution with d.o.f., denoted , is obtained when . Then the weight function for the MLE of scatter corresponds to the case that This yields (5) as the weight function for the MLE of scatter when

. If the second moments of

exists, then can be defined so that represents the covariance matrix of , i.e., ; see [13] for details.

When , then the M-functional in (6) is related to underlying scatter matrix parameter via the relationship

 Σ0=σΣ,

where is a solution to an equation

 E[ψ(x⊤Σ−1xσ)]=p, (15)

where . Often needs to be solved numerically from (15) but in some cases an analytic expression can be derived. If and the used weight function matches with the data distribution, so , then .

Next we derive expressions for and appearing in the denominator of in (13). They depend on a constant (which depend on weight function via ) as follows:

 ψ1=1p(p+2)E[ψ(x⊤Σ−1xσ)2], (16)

where the expectation is w.r.t. .

###### Lemma 1.

Suppose is an i.i.d. random sample from . Then

 E[tr(C2)] =(1+2ψ1−1n)tr(Σ20)+ψ1ntr(Σ0)2, E[tr(C)2] =2ψ1ntr(Σ20)+(1+ψ1−1n)tr(Σ0)2,

given that expectation (16) exists.

###### Proof.

Omitted due to lack of space. ∎

###### Theorem 2.

Let denote an i.i.d. random sample from an elliptical distribution . Then the oracle parameter that minimizes the MSE in Theorem 1 is

 βappo =γ−1(γ−1)(1−1/n)+ψ1(1−1/p)(2γ+p)/n (17)

where is the sphericity measure defined in (11).

###### Proof.

Follows from Theorem 1 after substituting the values for and derived in Lemma 1 in the denominator of in (13). ∎

If one uses Gaussian loss function , then one needs to assume that the 4th-order moments exists and one may assume w.l.o.g. that the scatter matrix parameter equals the covariance matrix [13], i.e., , so and . Furthermore, it holds that and and hence . Finally, we may relate

with an elliptical kurtosis

[14] parameter , defined as

 κ=E[∥Σ−1/2x∥4]p(p+2)−1.−5pt (18)

Elliptical kurtosis vanishes, i.e., , when .

###### Corollary 1.

Let denote an i.i.d. random sample from an elliptical distribution with finite 4th order moments and covariance matrix . Then the optimal tuning parameter of the shrinkage SCM estimator in (1) is

 βo=argminβE[∥∥Sβ−Σ∥2F]=γ−1γ−1+a, (19)

where

 a=κ(2γ(1−1/p)+p−1)n+γ(1−2/p)+pn.−2pt
###### Proof.

The result follows from Theorem 2 since and the M-functional for Gaussian loss is . Since for Gaussian loss, , we notice from (16) that . Plugging into (17) yields the stated expression. ∎

## 4 Simulation studies

We compute different shrinkage M-estimators detailed below. We use acronym Huber to refer to the shrinkage M-estimator that uses Huber’s weight with threshold corresponding to quantile. Shrinkage parameter is computed as . As an estimator of we use the same estimate as in [9, 2] and is an estimate of , computed as

 ^ψ1=1nn∑i=1[tiu(ti)]2p(p+2),−8pt (20)

where and is the corresponding Huber’s M-estimator. Huber’s weight function is standardized to be Fisher consistent for Gaussian samples, meaning that (15) holds with when . Since (20) ignores estimation of , some loss in accuracy of this estimate of is expected for non-Gaussian data.

Acronym t-MLE refers to the shrinkage M-estimator of scatter using weight function , where d.o.f. parameter is estimated from the data. This means that can be assumed since the scaling factor equals one for an MLE of the scatter matrix parameter. The shrinkage parameter is computed as , where is as earlier and is computed as in (20) but using and being the corresponding M-estimator.

Acronym Gauss refers to the shrinkage M-estimator of scatter using Gaussian weight function , i.e., . The shrinkage parameter is computed as with given by (19) and is an estimate of elliptical kurtosis proposed in [2]. Finally, acronym LW refers to estimator proposed by Ledoit and Wold [1]. LW estimator also uses RSCM , where parameter is computed in a different manner than for Gauss estimator.

We generated the data from an elliptical distribution , where the scatter matrix has an AR(1) structure, , where and scale parameter . When , then is close to an identity matrix scaled by , and when , tends to a singular matrix of rank 1. Parameter is set to . The dimension is and varies from 60 to 280.

In our first study, samples are drawn from a MVN distribution and the normalized MSE as a function of sample length is depicted in Figure 3. Results are averages over 2000 Monte-Carlo trials. All estimators are performing well; Gauss and t-MLE are performing slightly better than LW or Huber but differences are marginal.

Figure 3 shows the NMSE figures in the case that samples are from - and -distribution, respectively. In the latter case, the non-robust Gauss and LW estimator provided large NMSE and are therefore not shown in the plot. This was expected as -distribution is heavy-tailed with non-finite kurtosis. As can be seen, the robust Huber and t-MLE shrinkage estimators provide significantly improved performance when the data is sampled from a heavy-tailed or -distribution. We can also notice that t-MLE estimator that adaptively estimates the d.o.f. from the data is able to outperform the Huber’s M-estimator due to the data adaptivity.

Figure 3 depicts the (average) shrinkage parameter as a function of in the case that samples are from a -variate distribution. As can be seen the robust shrinkage estimators (Huber and t-MLE) use larger shrinkage parameter value than Gauss and LW.

## 5 Conclusions and perspectives

This work proposed an original and fully automatic approach to compute an optimal shrinkage parameter in the context of heavy-tailed distributions and/or in presence of outliers. It has been shown that the performance of the method is similar to the optimal one when the data is Gaussian while it outperforms shrinkage Gaussian-based methods when the data distribution turns out to be non-Gaussian. This paper opens several ways, notably considering the case when .

## References

• [1] O. Ledoit and M. Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,” J. Mult. Anal., vol. 88, no. 2, pp. 365–411, 2004.
• [2] E. Ollila and E. Raninen, “Optimal shrinkage covariance matrix estimation under random sampling from elliptical distributions,” IEEE Trans. Signal Process., vol. 67, no. 10, pp. 2707–2719, 2019.
• [3] R. A. Maronna, “Robust M-estimators of multivariate location and scatter,” Ann. Stat., vol. 5, no. 1, pp. 51–67, 1976.
• [4] E. Ollila and D. E. Tyler, “Regularized -estimators of scatter matrix,” IEEE Trans. Signal Process., vol. 62, no. 22, pp. 6059–6070, 2014.
• [5] F. Pascal, Y. Chitour, and Y. Quek, “Generalized robust shrinkage estimator and its application to STAP detection problem,” IEEE Trans. Signal Process., vol. 62, no. 21, pp. 5640–5651, 2014.
• [6] Y. Sun, P. Babu, and D. P. Palomar, “Regularized Tyler’s scatter estimator: Existence, uniqueness, and algorithms,” IEEE Trans. Signal Process., vol. 62, no. 19, pp. 5143–5156, 2014.
• [7] Y. Chen, A. Wiesel, Y. C. Eldar, and A. O. Hero, “Shrinkage algorithms for MMSE covariance estimation,” IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5016–5029, 2010.
• [8] R. Couillet and M. McKay, “Large dimensional analysis and optimization of robust shrinkage covariance matrix estimators,” J. Mult. Anal., vol. 131, pp. 99–120, 2014.
• [9] T. Zhang and A. Wiesel, “Automatic diagonal loading for Tyler’s robust covariance estimator,” in IEEE Statistical Signal Processing Workshop (SSP’16), 2016, pp. 1–5.
• [10] J. T. Kent and D. E. Tyler, “Redescending M-estimates of multivariate location and scatter,” Ann. Stat., vol. 19, no. 4, pp. 2102–2119, 1991.
• [11] Y. Chen, A. Wiesel, and A. O. Hero, “Robust shrinkage estimation of high-dimensional covariance matrices,” IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4097 – 4107, 2011.
• [12] K.-T. Fang, S. Kotz, and K.-W. Ng, Symmetric Multivariate and Related Distributions.   London: Chapman and hall, 1990.
• [13] E. Ollila, D. E. Tyler, V. Koivunen, and H. V. Poor, “Complex elliptically symmetric distributions: survey, new results and applications,” IEEE Trans. Signal Process., vol. 60, no. 11, pp. 5597–5625, 2012.
• [14] R. J. Muirhead, Aspects of Multivariate Statistical Theory.   New York: Wiley, 1982, 704 pages.