Consider a sample of
-dimensional vectorssampled from a distribution of a random vector with
. One of the first tasks in the analysis of high-dimensional data is to estimate the covariance matrix. The most commonly used estimator is the sample covariance matrix (SCM),
, but its main drawbacks are its loss of efficiency when sampling from distributions which have longer tails than the multivariate normal (MVN) distribution and its sensitivity to outliers. Although being unbiased estimator of the covariance matrixfor any sample length , it is well-known that the eigenvalues are poorly estimated when is not orders of magnitude larger than
. In such cases, one commonly uses a regularized SCM (RSCM) with a linear shrinkage towards a scaled identity matrix,
where is the regularization parameter. The RSCM shares the same set of eigenvectors as the SCM , but its eigenvalues are shrinked towards the grand mean of the eigenvalues. That is, if denote the eigenvalues of , then are the eigenvalues of , where . Optimal computation of such that has minimum mean squared error (MMSE) has been developed for example in [1, 2].
The estimator in (1) remains sensitive to outliers and non-Gaussianity. M-estimators of scatter  are popular robust alternatives to SCM. We consider the situation where and hence a conventional M-estimator of scatter exists and can be used in place of the SCM in (1). We then propose a fully automatic data adaptive method to compute the optimal shrinkage parameter . First, we derive an approximation for parameter that attains the minimum MMSE and then propose a data adaptive method for its computation. The benefit of our approach is that it can be easily applied to any M-estimator using any weight function
. Our simulation examples illustrate that a shrinkage M-estimator using the proposed tuning and a robust loss function does not loose in performance to optimal shrinkage SCM estimator when the data is Gaussian, but is able to provide significantly improved performance in the case of heavy-tailed data.
Relations to prior work: Earlier work, [4, 5, 6, 7, 8, 9], proposed regularized M-estimators of scatter matrix either by adding a penalty function to M-estimation objective function or a diagonal loading term to the respective first-order solution (M-estimating equation). We consider a simpler approach that uses conventional M-estimator and shrinks its eigenvalues to grand mean of the eigenvalues. Our approach permits computation of the optimal shrinkage parameter for any M-estimation weight function.
2 Shrinkage M-estimators of scatter
In this paper, we assume that and consider an M-estimator of scatter matrix  that solves an estimating equation
where is a non-increasing weight function. An M-estimator is a sort of adaptively weighted sample covariance matrix with weights determined by function . To guarantee existence of the solution, it is required that the data verifies the condition stated in . An M-estimator of scatter which shrinks the eigenvalues towards the grand mean of the eigenvalues is then defined as:
For example, the RSCM is obtained when one uses the Gaussian weight function since then . Other popular choices are Huber’s weight function
where is a tuning constant, chosen by the user, and is a scaling factor used to obtain Fisher consistency at the multivariate normal (MVN) distribution :
We choose as
th upper quantile of: . Another popular choice is -MLE weight function
in which case the corresponding M-estimator is also the maximum likelihood estimate (MLE) of the scatter matrix parameter of a -variate -distribution with degrees of freedom.
An M-estimator is a consistent estimator of the M-functional of scatter matrix, defined as
If the population M-functional is known, then by defining a 1-step estimator
we can compute
as a proxy for . Naturally, such an estimator is fictional, as the initial value is unknown. The 1-step estimator is an unbiased estimator of , i.e., .
Ideally we would like to find the value of for which the corresponding estimator attains the minimum MSE, that is,
where denotes the Frobenius matrix norm (). Since solving (9) is not doable due to the implicit form of M-estimators, we look for an approximation:
Such approach was also used in  in deriving an optimal parameter for shrinkage Tyler’s M-estimator of scatter.
Before stating the expression for we introduce a sphericity measure of scatter:
Sphericity measures how close is to a scaled identity matrix: where if and only if and if has rank equal to 1.
Write . Then note that
where , and
Note that is a convex quadratic function in with a unique minimum given by
Substituting the expressions for constants and into yields the stated result. ∎
Next we derive a more explicit form of by assuming that the data is generated from unspecified elliptically symmetric distribution.
3 Shrinkage parameter computation
. The probability density function (p.d.f.) of centered (zero mean) elliptically distributed random vectoris
where is the positive definite symmetric matrix parameter, called the scatter matrix, is the density generator, which is a fixed function that is independent of and , and is a normalizing constant ensuring that integrates to 1. The density generator determines the elliptical distribution. For example, the MVN distribution is obtained when and the -distribution with d.o.f., denoted , is obtained when . Then the weight function for the MLE of scatter corresponds to the case that This yields (5) as the weight function for the MLE of scatter when
. If the second moments ofexists, then can be defined so that represents the covariance matrix of , i.e., ; see  for details.
When , then the M-functional in (6) is related to underlying scatter matrix parameter via the relationship
where is a solution to an equation
where . Often needs to be solved numerically from (15) but in some cases an analytic expression can be derived. If and the used weight function matches with the data distribution, so , then .
Next we derive expressions for and appearing in the denominator of in (13). They depend on a constant (which depend on weight function via ) as follows:
where the expectation is w.r.t. .
Suppose is an i.i.d. random sample from . Then
given that expectation (16) exists.
Omitted due to lack of space. ∎
Let denote an i.i.d. random sample from an elliptical distribution . Then the oracle parameter that minimizes the MSE in Theorem 1 is
where is the sphericity measure defined in (11).
If one uses Gaussian loss function , then one needs to assume that the 4th-order moments exists and one may assume w.l.o.g. that the scatter matrix parameter equals the covariance matrix , i.e., , so and . Furthermore, it holds that and and hence . Finally, we may relate
with an elliptical kurtosis parameter , defined as
Elliptical kurtosis vanishes, i.e., , when .
Let denote an i.i.d. random sample from an elliptical distribution with finite 4th order moments and covariance matrix . Then the optimal tuning parameter of the shrinkage SCM estimator in (1) is
4 Simulation studies
We compute different shrinkage M-estimators detailed below. We use acronym Huber to refer to the shrinkage M-estimator that uses Huber’s weight with threshold corresponding to quantile. Shrinkage parameter is computed as . As an estimator of we use the same estimate as in [9, 2] and is an estimate of , computed as
where and is the corresponding Huber’s M-estimator. Huber’s weight function is standardized to be Fisher consistent for Gaussian samples, meaning that (15) holds with when . Since (20) ignores estimation of , some loss in accuracy of this estimate of is expected for non-Gaussian data.
Acronym t-MLE refers to the shrinkage M-estimator of scatter using weight function , where d.o.f. parameter is estimated from the data. This means that can be assumed since the scaling factor equals one for an MLE of the scatter matrix parameter. The shrinkage parameter is computed as , where is as earlier and is computed as in (20) but using and being the corresponding M-estimator.
Acronym Gauss refers to the shrinkage M-estimator of scatter using Gaussian weight function , i.e., . The shrinkage parameter is computed as with given by (19) and is an estimate of elliptical kurtosis proposed in . Finally, acronym LW refers to estimator proposed by Ledoit and Wold . LW estimator also uses RSCM , where parameter is computed in a different manner than for Gauss estimator.
We generated the data from an elliptical distribution , where the scatter matrix has an AR(1) structure, , where and scale parameter . When , then is close to an identity matrix scaled by , and when , tends to a singular matrix of rank 1. Parameter is set to . The dimension is and varies from 60 to 280.
In our first study, samples are drawn from a MVN distribution and the normalized MSE as a function of sample length is depicted in Figure 3. Results are averages over 2000 Monte-Carlo trials. All estimators are performing well; Gauss and t-MLE are performing slightly better than LW or Huber but differences are marginal.
Figure 3 shows the NMSE figures in the case that samples are from - and -distribution, respectively. In the latter case, the non-robust Gauss and LW estimator provided large NMSE and are therefore not shown in the plot. This was expected as -distribution is heavy-tailed with non-finite kurtosis. As can be seen, the robust Huber and t-MLE shrinkage estimators provide significantly improved performance when the data is sampled from a heavy-tailed or -distribution. We can also notice that t-MLE estimator that adaptively estimates the d.o.f. from the data is able to outperform the Huber’s M-estimator due to the data adaptivity.
Figure 3 depicts the (average) shrinkage parameter as a function of in the case that samples are from a -variate distribution. As can be seen the robust shrinkage estimators (Huber and t-MLE) use larger shrinkage parameter value than Gauss and LW.
5 Conclusions and perspectives
This work proposed an original and fully automatic approach to compute an optimal shrinkage parameter in the context of heavy-tailed distributions and/or in presence of outliers. It has been shown that the performance of the method is similar to the optimal one when the data is Gaussian while it outperforms shrinkage Gaussian-based methods when the data distribution turns out to be non-Gaussian. This paper opens several ways, notably considering the case when .
-  O. Ledoit and M. Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,” J. Mult. Anal., vol. 88, no. 2, pp. 365–411, 2004.
-  E. Ollila and E. Raninen, “Optimal shrinkage covariance matrix estimation under random sampling from elliptical distributions,” IEEE Trans. Signal Process., vol. 67, no. 10, pp. 2707–2719, 2019.
-  R. A. Maronna, “Robust M-estimators of multivariate location and scatter,” Ann. Stat., vol. 5, no. 1, pp. 51–67, 1976.
-  E. Ollila and D. E. Tyler, “Regularized -estimators of scatter matrix,” IEEE Trans. Signal Process., vol. 62, no. 22, pp. 6059–6070, 2014.
-  F. Pascal, Y. Chitour, and Y. Quek, “Generalized robust shrinkage estimator and its application to STAP detection problem,” IEEE Trans. Signal Process., vol. 62, no. 21, pp. 5640–5651, 2014.
-  Y. Sun, P. Babu, and D. P. Palomar, “Regularized Tyler’s scatter estimator: Existence, uniqueness, and algorithms,” IEEE Trans. Signal Process., vol. 62, no. 19, pp. 5143–5156, 2014.
-  Y. Chen, A. Wiesel, Y. C. Eldar, and A. O. Hero, “Shrinkage algorithms for MMSE covariance estimation,” IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5016–5029, 2010.
-  R. Couillet and M. McKay, “Large dimensional analysis and optimization of robust shrinkage covariance matrix estimators,” J. Mult. Anal., vol. 131, pp. 99–120, 2014.
-  T. Zhang and A. Wiesel, “Automatic diagonal loading for Tyler’s robust covariance estimator,” in IEEE Statistical Signal Processing Workshop (SSP’16), 2016, pp. 1–5.
-  J. T. Kent and D. E. Tyler, “Redescending M-estimates of multivariate location and scatter,” Ann. Stat., vol. 19, no. 4, pp. 2102–2119, 1991.
-  Y. Chen, A. Wiesel, and A. O. Hero, “Robust shrinkage estimation of high-dimensional covariance matrices,” IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4097 – 4107, 2011.
-  K.-T. Fang, S. Kotz, and K.-W. Ng, Symmetric Multivariate and Related Distributions. London: Chapman and hall, 1990.
-  E. Ollila, D. E. Tyler, V. Koivunen, and H. V. Poor, “Complex elliptically symmetric distributions: survey, new results and applications,” IEEE Trans. Signal Process., vol. 60, no. 11, pp. 5597–5625, 2012.
-  R. J. Muirhead, Aspects of Multivariate Statistical Theory. New York: Wiley, 1982, 704 pages.