Large-dimensional behavior of regularized Maronna's M-estimators of covariance matrices

Robust estimators of large covariance matrices are considered, comprising regularized (linear shrinkage) modifications of Maronna's classical M-estimators. These estimators provide robustness to outliers, while simultaneously being well-defined when the number of samples does not exceed the number of variables. By applying tools from random matrix theory, we characterize the asymptotic performance of such estimators when the numbers of samples and variables grow large together. In particular, our results show that, when outliers are absent, many estimators of the regularized-Maronna type share the same asymptotic performance, and for these estimators we present a data-driven method for choosing the asymptotically optimal regularization parameter with respect to a quadratic loss. Robustness in the presence of outliers is then studied: in the non-regularized case, a large-dimensional robustness metric is proposed, and explicitly computed for two particular types of estimators, exhibiting interesting differences depending on the underlying contamination model. The impact of outliers in regularized estimators is then studied, with interesting differences with respect to the non-regularized case, leading to new practical insights on the choice of particular estimators.


I Introduction

Covariance or scatter matrix estimation is a fundamental problem in statistical signal processing [1, 2], with applications ranging from wireless communications [3] to financial engineering [4] and biology [5]. Historically, the sample covariance matrix (SCM) $\frac{1}{n}\sum_{i=1}^{n} x_i x_i^*$, where $x_1, \dots, x_n$ are zero-mean data samples, has been a particularly appealing choice among possible estimators. The SCM is known to be the maximum likelihood estimator (MLE) of the covariance matrix when the $x_i$ are independent, identically distributed zero-mean Gaussian observations, and its simple structure makes it easy to implement. Nonetheless, the SCM is known to suffer from three major drawbacks: first, it is not resilient to outliers or samples of impulsive nature; second, it is a poor estimate of the true covariance matrix whenever the number of samples $n$ and the number of variables $N$ are of similar order; lastly, it is not invertible when $n < N$. The sensitivity to outliers is a particularly important issue in radar-related applications [6, 7], where the background noise usually follows a heavy-tailed distribution, often modeled as a complex elliptical distribution [8, 9]. In such cases, the MLE of the covariance matrix is no longer the SCM. On the other hand, data scarcity is a relevant issue in an ever-growing number of signal processing applications where $N$ and $n$ are generally of similar order, possibly with $n < N$ [4, 5, 10, 11]. New, improved covariance estimators are needed to account for both potential data anomalies and high-dimensional scenarios.
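As a quick numerical illustration of the last drawback (a minimal NumPy sketch; the dimensions below are illustrative choices of ours), the SCM built from fewer samples than variables is rank-deficient and hence not invertible:

import numpy as np

rng = np.random.default_rng(0)
N, n = 100, 50                       # more variables than samples (n < N)
X = rng.standard_normal((N, n))      # zero-mean samples stored as columns

scm = (X @ X.conj().T) / n           # sample covariance matrix (SCM)
print(np.linalg.matrix_rank(scm))    # at most n = 50 < N = 100: the SCM is singular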

In order to mitigate the effect of outliers and thus provide better inference of the true covariance matrix, robust estimators known as M-estimators have been designed [12, 13, 14, 9]. Their structure is non-trivial, involving matrix fixed-point equations, and their analysis is challenging. Nonetheless, significant progress towards understanding these estimators has been made in large-dimensional settings [15, 16, 17, 18, 19], motivated by the increasing number of applications where $N$ and $n$ are both large and comparable. The salient messages of these works are: (i) outliers or impulsive data can be handled by these estimators, if appropriately designed (the choice of the specific form of the estimator is important to handle different types of outliers) [19]; (ii) in the absence of outliers, robust M-estimators essentially behave as the SCM and, therefore, are still subject to the data scarcity issue [16].

To alleviate the issue of scarce data, regularized versions of the SCM were originally proposed [20, 21]. Such estimators consist of a linear combination of the SCM and a shrinkage target (often the identity matrix), which guarantees their invertibility and often provides a large improvement in accuracy over the SCM when $N$ and $n$ are of the same order. Nevertheless, the regularized SCM (RSCM) inherits the sensitivity of the SCM to outliers and heavy-tailed data. To alleviate both the data scarcity issue and the sensitivity to data anomalies, regularized M-estimators have been proposed [1, 15, 2, 22]. Such estimators are similar in spirit to the RSCM, in that they consist of a combination of a robust M-estimator and a shrinkage target. However, unlike the RSCM, but similar to the estimators studied in [16, 19], these estimators are only defined implicitly as the solution to a matrix fixed-point equation, which makes their analysis particularly challenging.

In this article, we propose to study these robust regularized estimators in the double-asymptotic regime where $N$ and $n$ grow large together. Building upon recent works [17, 23], we will make use of random matrix theory tools to understand the as-yet-unknown asymptotic behavior of these estimators and, subsequently, to establish design principles aimed at choosing appropriate estimators in different scenarios. To do so, we will first study the behavior of these regularized M-estimators in an outlier-free scenario. In this setting, we will show that, upon optimally choosing the regularization parameter, most M-estimators perform asymptotically the same, meaning that the form of the underlying M-estimator does not impact the performance of its regularized counterpart on clean data. Second, we will investigate the effect of introducing outliers in the data, under different contamination models. Initial insights were obtained in [19] for non-regularized estimators, focusing on the weights given by the M-estimator to outlying and legitimate data. The current study, by proposing an intuitive measure of robustness, takes a more formal approach to quantifying the robustness of these estimators. In particular, we will demonstrate which form of M-estimator is preferable under a given contamination model, first in the non-regularized setting, and then for regularized estimators.

Notation: $\|A\|$, $\|A\|_F$ and $\mathrm{tr}\,A$ denote the spectral norm, the Frobenius norm and the trace of the matrix $A$, respectively. The superscript $(\cdot)^*$ stands for Hermitian transpose. Thereafter, we will use $\lambda_1(A) \geq \dots \geq \lambda_N(A)$ to denote the ordered eigenvalues of the $N \times N$ square matrix $A$. The statement $A \succ 0$ (resp. $A \succeq 0$) means that the symmetric matrix $A$ is positive definite (resp. positive semi-definite). The arrow $\xrightarrow{\text{a.s.}}$ designates almost sure convergence, while $\delta_x$ denotes the Dirac measure at point $x$.

II Review of the large-dimensional behavior of non-regularized M-estimators

II-A General form of non-regularized M-estimators

In the non-regularized case, robust M-estimators of covariance matrices are defined as the solution $\hat{C}_N$ (when it exists) to the equation [13]

$\hat{C}_N = \frac{1}{n}\sum_{i=1}^{n} u\!\left(\tfrac{1}{N}\, x_i^* \hat{C}_N^{-1} x_i\right) x_i x_i^* \qquad (1)$

where $X = [x_1, \dots, x_n] \in \mathbb{C}^{N \times n}$ represents the data matrix, and where $u$ satisfies the following properties:

  • $u$ is a nonnegative, nonincreasing, bounded, and continuous function on $[0, \infty)$,

  • $\phi(x) \triangleq x\,u(x)$ is increasing and bounded, with $\phi^\infty \triangleq \lim_{x \to \infty} \phi(x) > 1$.

If well-defined, the solution of (1) can be obtained via an iterative procedure (see, for example, [24, 16]). Intuitively, the $i$-th data sample $x_i$ is given a weight $u\!\left(\tfrac{1}{N}\, x_i^* \hat{C}_N^{-1} x_i\right)$, which should be smaller for outlying samples than for legitimate ones. The choice of the function $u$ determines the degree of robustness of the M-estimator. As a rule of thumb, the more aggressively $u$ downweights samples with large quadratic forms, the more robust the underlying M-estimator to potential extreme outliers [13]. However, such increased robustness is usually achieved at the expense of accuracy [9].
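As an illustration, here is a minimal sketch of such a fixed-point iteration (our own implementation outline, not the specific algorithm of [24, 16]); the weight function u is passed as an argument:

import numpy as np

def maronna_m_estimator(X, u, n_iter=200, tol=1e-8):
    """Iteratively solve the fixed-point equation (1) for a given weight function u.
    X is the N x n data matrix with samples stored as columns."""
    N, n = X.shape
    C = np.eye(N)                                             # initialization
    for _ in range(n_iter):
        Q = np.linalg.solve(C, X)                             # C^{-1} X
        d = np.real(np.einsum('ij,ij->j', X.conj(), Q)) / N   # quadratic forms (1/N) x_i^* C^{-1} x_i
        C_new = (X * u(d)) @ X.conj().T / n                   # weighted SCM update
        if np.linalg.norm(C_new - C, 'fro') <= tol * np.linalg.norm(C, 'fro'):
            return C_new
        C = C_new
    return C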

A related and commonly used estimator is Tyler’s estimator [14], which is associated with the unbounded function $u(x) = 1/x$. Recent papers dealing with the application of Tyler’s estimator in engineering include [25, 26, 27]. We remark, however, that for such a function, the existence of a solution to (1) depends on the data at hand (see, e.g., [14, 28, 22]). To avoid this issue, we focus here on a class of estimators with bounded $u$ functions (as prescribed above). Examples of practical interest, which we study in some detail, include

$u_1(x) = \frac{1+\alpha}{\alpha + x} \qquad (2)$
$u_2(x) = \min\left(1,\ \frac{1+\alpha}{\alpha + x}\right) \qquad (3)$

for some $\alpha > 0$. For a specific $\alpha$, the estimator associated with $u_1$ is known to be the MLE of the true covariance matrix when the $x_i$ are independent, zero-mean, multivariate Student vectors [19], whereas $u_2$ is a modified form of the weight function of the so-called Huber estimator [29]. Observe that for these functions, $\phi^\infty = 1 + \alpha$, such that the robustness of the associated M-estimator to extreme outliers is controlled by $\alpha$ (which sets the scale factor $1+\alpha$ of the weight given to extreme samples). In what follows, with a slight abuse of terminology, we will refer to these estimators as “Tyler’s” and “Huber’s” estimators, respectively.
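In code, and assuming the forms written above for (2) and (3) (the value of alpha is illustrative), the two weight functions read:

import numpy as np

def u_tyler(x, alpha=0.5):
    # "Tyler-like" weight (2): strictly decreasing, approaches 1/x as alpha -> 0
    return (1.0 + alpha) / (alpha + x)

def u_huber(x, alpha=0.5):
    # "Huber-like" weight (3): capped at 1 for small x, ~ (1+alpha)/x for large x
    return np.minimum(1.0, (1.0 + alpha) / (alpha + x))

# Example usage with the earlier sketch: C_hat = maronna_m_estimator(X, u_tyler)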

II-B Asymptotic equivalent form under outlier-free data model

Assume now the following “outlier-free” data model: let $x_1, \dots, x_n$ be $N$-dimensional data samples drawn from $x_i = C_N^{1/2} z_i$, where $C_N \succ 0$ is deterministic and $z_1, \dots, z_n$ are random vectors, the entries of which are independent with zero mean, unit variance and finite moments of sufficiently high order. With this model, we now recall the main result from [16].

Theorem 1.

[16] Assume that $c_N \triangleq N/n \to c \in (0, 1)$ as $N, n \to \infty$. Further assume that $\phi^\infty$ is large enough for a solution to (1) to exist for all large $n$. Then, denoting by $\hat{C}_N$ a solution to (1), we have

$\left\| \hat{C}_N - \frac{1}{\gamma}\,\frac{1}{n}\sum_{i=1}^{n} x_i x_i^* \right\| \xrightarrow{\text{a.s.}} 0,$

where $\gamma > 0$ is a deterministic constant depending only on $c$ and $\phi$.

This shows that, up to a multiplicative constant, and regardless of the choice of $u$, Maronna’s M-estimators behave (asymptotically) like the SCM. As such, in the absence of outliers, no information is lost.

However, Theorem 1 excludes the “under-sampled” case $N > n$. Regularized versions of Maronna’s M-estimators have been proposed to alleviate this issue, in most cases considering regularized versions of Tyler’s estimator ($u(x) = 1/x$) [1, 15, 28, 2], the behavior of which has been studied in [18, 17]. Recently, a regularized M-estimator accounting for a wider class of $u$ functions was introduced in [22], but its large-dimensional behavior remains unknown. We address this in the next section. Moreover, note that Theorem 1 says nothing about the behavior of different estimators, associated with different $u$ functions, in the presence of outlying or contaminating data. While progress towards understanding the effect of outliers was recently made in [19], that study focused on non-regularized estimators. In this work, a new measure characterizing the robustness of different M-estimators will be proposed, allowing us to study both non-regularized and regularized estimators (Sections IV and V).

III Regularized M-estimators: Large-dimensional analysis and calibration

III-A General form of regularized M-estimators

We consider the class of regularized M-estimators introduced in [22], given as the unique solution $\hat{C}_N(\rho)$ to

$\hat{C}_N(\rho) = (1-\rho)\,\frac{1}{n}\sum_{i=1}^{n} u\!\left(\tfrac{1}{N}\, x_i^* \hat{C}_N(\rho)^{-1} x_i\right) x_i x_i^* + \rho\, I_N \qquad (4)$

where $\rho$ is a regularization (or shrinkage) parameter, and where $I_N$ denotes the $N \times N$ identity matrix. The introduction of a regularization parameter allows a solution to exist even when $N > n$. The structure of (4) strongly resembles that of the RSCM, defined as

$\hat{R}_N(\beta) = (1-\beta)\,\frac{1}{n}\sum_{i=1}^{n} x_i x_i^* + \beta\, I_N \qquad (5)$

where $\beta \in (0, 1]$. The RSCM is also referred to as the linear shrinkage estimator [30], the linear combination estimator [31], diagonal loading [21], or ridge regression [32]. Regularized M-estimators are robust versions of the RSCM.
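A sketch of the corresponding computations, under the same conventions as the earlier outline (the fixed-point iteration for (4) is the natural extension of the non-regularized one; names and defaults are our own choices):

import numpy as np

def regularized_m_estimator(X, u, rho, n_iter=200, tol=1e-8):
    """Iteratively solve the regularized fixed-point equation (4), with rho in (0, 1]."""
    N, n = X.shape
    C = np.eye(N)
    for _ in range(n_iter):
        Q = np.linalg.solve(C, X)
        d = np.real(np.einsum('ij,ij->j', X.conj(), Q)) / N
        C_new = (1.0 - rho) * (X * u(d)) @ X.conj().T / n + rho * np.eye(N)
        if np.linalg.norm(C_new - C, 'fro') <= tol * np.linalg.norm(C, 'fro'):
            return C_new
        C = C_new
    return C

def rscm(X, beta):
    """Regularized SCM (5): linear shrinkage of the SCM toward the identity."""
    N, n = X.shape
    return (1.0 - beta) * (X @ X.conj().T) / n + beta * np.eye(N)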

III-B Asymptotic equivalent form under outlier-free data model

Based on recent random matrix theory results, we now characterize the asymptotic behavior of these M-estimators. Under the same data model as that of Section II, we answer the basic question of whether (and to what extent) different regularized estimators, associated with different $u$ functions, are asymptotically equivalent. We need the following assumption on the growth regime and the underlying covariance matrix $C_N$:

Assumption 1.
  1. $c_N \triangleq N/n \to c \in (0, \infty)$ as $N, n \to \infty$.

  2. $C_N \succeq 0$, with $\limsup_N \|C_N\| < \infty$.

  3. The empirical spectral distribution of $C_N$ converges weakly to a limiting measure $\nu$.

Assumption 1 differs slightly from the assumptions of Theorem 1. In particular, the introduction of a regularization parameter now allows $c \geq 1$. Likewise, $C_N$ is now only required to be positive semidefinite.

For each $\rho$, we denote by $\hat{C}_N(\rho)$ the unique solution to (4). We first characterize its behavior in the large-$N, n$ regime. To this end, we need the following assumption:

Assumption 2.

$\phi^\infty < 1$.

We now introduce an additional function, which will be useful in characterizing a deterministic equivalent of $\hat{C}_N(\rho)$.

Definition. Let Assumption 2 hold. Define $v \triangleq u \circ g^{-1}$, where $g^{-1}$ denotes the inverse of an increasing function $g$ (determined by $\phi$, $c$ and $\rho$) which maps $[0, \infty)$ onto $[0, \infty)$.

The function $v$ is continuous, non-increasing and onto. We remark that Assumption 2 guarantees that $g$ (and thus $v$) is properly defined.¹ Note that, importantly, $\phi^\infty$ does not have to be lower bounded by 1, as opposed to the non-regularized setting. With this in hand, we have the following theorem:

¹ Assumption 2 could in fact be relaxed to an inequality involving both $\phi^\infty$ and $\rho$, which therefore enforces a constraint on the choice of both the function $u$ (through $\phi^\infty$) and the regularization parameter $\rho$. The proof of Theorem 2 (provided in the Appendix) considers this more general case. Nevertheless, for simplicity of exposition, we will avoid this technical aspect in the core of the paper.

Theorem 2.

Let $\mathcal{R}$ be a compact set included in the admissible range of $\rho$. Let $\hat{C}_N(\rho)$ be the unique solution to (4). Then, as $N, n \to \infty$, under Assumptions 1-2,

$\sup_{\rho \in \mathcal{R}} \big\| \hat{C}_N(\rho) - \hat{S}_N(\rho) \big\| \xrightarrow{\text{a.s.}} 0,$

where

$\hat{S}_N(\rho) \triangleq (1-\rho)\, v(\gamma_N(\rho))\, \frac{1}{n}\sum_{i=1}^{n} x_i x_i^* + \rho\, I_N,$

with $\gamma_N(\rho)$ the unique positive solution to a deterministic fixed-point equation

(6)

involving $C_N$, $c_N$, $\rho$ and $v$. Furthermore, the function $\rho \mapsto \gamma_N(\rho)$ is bounded, continuous on $\mathcal{R}$ and greater than zero.

The proof of Theorem 2 (as well as that of the other technical results in this section) is provided in Appendix A-A.

Remark 1.

The uniform convergence in Theorem 2 will be important for finding the optimal regularization parameter of a given estimator. As a matter of fact, the set $\mathcal{R}$, required to establish such uniform convergence, can be taken as $[0, 1]$ in the over-sampled case (provided that $c < 1$).

Remark 2.

Note that Theorem 2 (and Propositions 1 and 2 below) resembles similar results obtained in [17]. The key difference in the present work is that our results apply to a wide class of estimators (associated with $u$ functions satisfying the assumptions prescribed above), while [17] focused only on Tyler’s estimator (associated with $u(x) = 1/x$).

Theorem 2 shows that, for every $u$ function, the estimator $\hat{C}_N(\rho)$ asymptotically behaves (uniformly on $\rho \in \mathcal{R}$) like the RSCM, with weights $(1-\rho)\,v(\gamma_N(\rho))$ and $\rho$ in lieu of the parameters $1-\beta$ and $\beta$ in (5). Importantly, the relative weight given to the SCM depends on the underlying $u$ function, which entails that, for a fixed $\rho$, two different estimators may have different asymptotic behaviors. While this is indeed the case, it will be shown in the following that, upon properly choosing the regularization parameter, all regularized M-estimators share the same, optimal asymptotic performance, at least with respect to a quadratic loss.

III-C Optimized regularization and asymptotic equivalence of different regularized M-estimators

First, we will demonstrate that any trace-normalized regularized M-estimator is in fact asymptotically equivalent to the RSCM, up to a simple transformation of the regularization parameter. The result is as follows:

Proposition 1.

Let Assumptions 1-2 hold. For each $\rho$, the parameter $\beta$ defined as

(7)

is such that

(8)

where $\hat{R}_N(\beta)$ is the RSCM defined in (5).

Reciprocally, for each $\beta$, there exists a solution $\rho$ to equation (7) for which equality (8) holds.

Proposition 1 implies that, in the absence of outliers, any (trace-normalized) regularized M-estimator is asymptotically equal to a trace-normalized RSCM estimator with a regularization parameter $\beta$ depending on $\rho$ and on the underlying $u$ function. From Theorem 2, it then follows that the estimator asymptotically behaves like the RSCM estimator with parameter $\beta$.

Thanks to Proposition 1, we may thus look for optimal asymptotic choices of $\rho$. Given an estimator $\hat{C}$ of $C_N$, define the quadratic loss of the associated trace-normalized estimator as:

$L(\hat{C}) \triangleq \frac{1}{N}\left\| \frac{\hat{C}}{\frac{1}{N}\mathrm{tr}\,\hat{C}} - \frac{C_N}{\frac{1}{N}\mathrm{tr}\,C_N} \right\|_F^2.$

We then have the following proposition:

Proposition 2.

(Optimal regularization)

Let Assumptions 1 and 2 hold. Define

where and . Then,

Furthermore, for a solution to ,

(Optimal regularization parameter estimate)

The solution to

satisfies

Proposition 2 states that, irrespective of the choice of $u$, there exists some $\rho$ for which the quadratic loss of the corresponding regularized M-estimator is minimal, this minimum being the same as that achieved by an optimally-regularized RSCM. The last result of Proposition 2 provides a simple, data-driven way to estimate this optimal parameter.

In the following, we validate these theoretical findings through simulation, considering the $u$ functions specified in (2) and (3). As $\rho$ varies, Fig. 1 depicts the expected quadratic loss associated with the solution of (4) and that associated with the RSCM (line curves), along with the expected quadratic loss associated with the random equivalent of Tyler’s and Huber’s estimators (markers).

Fig. 1: Expected quadratic loss of different estimators as $\rho$ varies, averaged over 100 realizations. Arrows indicate the estimated optimal regularization parameters for the considered estimators; the asymptotic, minimal quadratic loss is also indicated.

For both $u$ functions and all $\rho$, there is a close match between the quadratic loss of $\hat{C}_N(\rho)$ and that of $\hat{S}_N(\rho)$. This demonstrates the accuracy of the (asymptotic) equivalence of $\hat{C}_N(\rho)$ and $\hat{S}_N(\rho)$ described in Theorem 2. As suggested by our analysis, while the estimators associated with different $u$ functions have different performances for a given $\rho$, they have the same performance when $\rho$ is optimized, with a quadratic loss approaching the asymptotic minimum for $N, n$ large. Furthermore, the optimal regularization parameter for a given $u$ function is accurately estimated, as shown by the arrows in Fig. 1.
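The experiment of Fig. 1 can be reproduced in spirit with the sketches above (the covariance model, dimensions and loss normalization below are illustrative choices of ours, not the paper’s exact setting):

import numpy as np

def quadratic_loss(C_hat, C_true):
    # Frobenius loss between trace-normalized matrices (one plausible normalization)
    N = C_true.shape[0]
    A = C_hat / (np.trace(C_hat) / N)
    B = C_true / (np.trace(C_true) / N)
    return np.linalg.norm(A - B, 'fro') ** 2 / N

rng = np.random.default_rng(1)
N, n = 100, 200
C_true = np.array([[0.7 ** abs(i - j) for j in range(N)] for i in range(N)])
X = np.linalg.cholesky(C_true) @ rng.standard_normal((N, n))   # outlier-free Gaussian data

for rho in np.linspace(0.05, 0.95, 10):
    C_reg = regularized_m_estimator(X, u_tyler, rho)           # sketches defined earlier
    print(f"rho = {rho:.2f}, loss = {quadratic_loss(C_reg, C_true):.4f}")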

IV Large-dimensional robustness: Non-regularized case

In this section, we turn to the case where the data is contaminated by random outliers and study the robustness of M-estimators associated with distinct $u$ functions. Some initial insight was previously provided in [19] for the non-regularized case. Specifically, that study focused on the comparison of the weights given by the estimator to outlying and legitimate samples. Albeit insightful, the analysis in [19] did not directly assess robustness, understood as the impact of outliers on the estimator’s performance. Moreover, it is not clear how these weights translate into the Frobenius loss, or any other standard metric of the estimator’s performance. Here we propose a different approach to analyzing robustness, by introducing and evaluating a robustness metric which measures the bias induced by data contamination.

We start by studying non-regularized estimators (thereby excluding the case $N > n$), which are technically easier to handle. This will provide insight into the capabilities of different M-estimators to handle outlying samples. Then, in the following section, we will investigate how this study translates to the regularized case. The proofs of the technical results in this section are provided in Appendix A-B.

IV-A Asymptotic equivalent form under outlier data model

We focus on a particular type of contamination model where outlying samples follow a distribution different from that of the legitimate samples. Similar to [19], the data matrix is constructed with the first $(1-\varepsilon_n)n$ data samples being the legitimate data, following the same distribution as in Sections II and III (that is, they verify $x_i = C_N^{1/2} z_i$). The remaining $\varepsilon_n n$ “contaminating” data samples are assumed to be random, independent of the legitimate samples, with $x_i = D_N^{1/2} w_i$, where $D_N$ is deterministic positive definite and the $w_i$ are independent random vectors with i.i.d. zero-mean, unit-variance entries having finite moments of sufficiently high order. This outlier data model is effectively a cluster contamination model [33].
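For concreteness, a minimal sketch of this contamination model (the covariances C_N, D_N, the dimensions and the outlier proportion below are illustrative choices, not those used in the paper):

import numpy as np

rng = np.random.default_rng(2)
N, n, eps = 100, 400, 0.05
n_out = int(eps * n)                            # number of outlying samples

C_N = np.array([[0.7 ** abs(i - j) for j in range(N)] for i in range(N)])
D_N = 3.0 * np.array([[0.2 ** abs(i - j) for j in range(N)] for i in range(N)])

X_clean = np.linalg.cholesky(C_N) @ rng.standard_normal((N, n - n_out))
X_out = np.linalg.cholesky(D_N) @ rng.standard_normal((N, n_out))
X = np.hstack([X_clean, X_out])                 # outliers occupy the last columns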

To characterize the asymptotic behavior of M-estimators for this data model, we require the following assumptions on the growth regime and on the underlying covariance matrices $C_N$ and $D_N$:

Assumption 3.
  1. $c_N \triangleq N/n \to c \in (0, 1)$ and $\varepsilon_n \to \varepsilon \in [0, 1)$ as $N, n \to \infty$.

  2. $C_N \succ 0$, with $\limsup_N \|C_N\| < \infty$.

  3. $D_N \succ 0$, with $\limsup_N \|D_N\| < \infty$.

Let us consider a $u$ function with, now, $\phi^\infty > 1$. For such a function, the equation in $Z$

$Z = \frac{1}{n}\sum_{i=1}^{n} u\!\left(\tfrac{1}{N}\, x_i^* Z^{-1} x_i\right) x_i x_i^* \qquad (9)$

written here for the contaminated data matrix, has a unique solution [13], hereafter referred to as $\hat{C}_N$. Moreover, define $v \triangleq u \circ g^{-1}$, where $g^{-1}$ denotes the inverse of an increasing function $g$ (determined by $\phi$ and $c$) which maps $[0, \infty)$ onto $[0, \infty)$.

In this setting, we have the following result:

Theorem 3.

Let Assumption 3 hold and let $\hat{C}_N$ be the unique solution to (9). Then, as $N, n \to \infty$,

$\big\| \hat{C}_N - \hat{S}_N \big\| \xrightarrow{\text{a.s.}} 0,$

where $\hat{S}_N$ is a weighted SCM in which every legitimate sample $x_i x_i^*$ is scaled by one common positive weight and every outlying sample by another, these two weights being the unique positive solutions to a system of coupled fixed-point equations determined by $\phi$, $c$, $\varepsilon$, $C_N$ and $D_N$.

Theorem 3 shows that $\hat{C}_N$ behaves similarly to a weighted SCM, with the legitimate samples receiving one common weight and the outlying samples another. This result generalizes [19, Corollary 3] to a wider range of $\phi^\infty$ and $c$. Indeed, the applicability of the results of [19] was limited by a constraint relating $\phi^\infty$ and $c$ (along with $c < 1$). This constraint can easily be violated in practice: for moderately large $c$, even a small proportion of outliers can rule out the existence of an admissible $\phi^\infty$.

A scenario which will be of particular interest in the following concerns the case where there is a vanishingly small proportion of outliers. This occurs when $\varepsilon_n \to 0$ as $n \to \infty$ (for instance, $\varepsilon_n$ proportional to $1/n$), in which case $\varepsilon = 0$. For this scenario, the weights given to the legitimate and outlying data are

(10)
(11)

respectively.

In the following, we exploit the form of $\hat{S}_N$ to characterize the effect of random outliers on the estimator $\hat{C}_N$.

IV-B Robustness analysis

Consider, for a given $u$ function, the solution to (1) (outlier-free data) and the solution to (9) (contaminated data). We propose the following metric, termed the measure of influence, to assess the robustness of a given estimator to an $\varepsilon_n$-contamination of the data:

Definition 1.

For $\varepsilon_n \geq 0$, the measure of influence is given by a normalized distance between the (asymptotic equivalents of the) outlier-free and contaminated estimators.

For simplicity, we assume hereafter that $\frac{1}{N}\mathrm{tr}\,C_N = \frac{1}{N}\mathrm{tr}\,D_N = 1$ for all $N$. From Theorems 1 and 3, we have the following:

Corollary 1.

As $N, n \to \infty$,

where

(12)

Note that the measure of influence vanishes when $\varepsilon = 0$, as expected. The result (12) shows that the measure of influence is globally driven by the mismatch between $D_N$ and $C_N$, which is also an intuitive result, since it suggests that the more “different” $D_N$ is from $C_N$, the higher the influence of the outliers on the estimator. To get clearer insight into the effect of a small proportion of outliers, assuming a vanishingly small proportion of outliers ($\varepsilon_n \to 0$), we compute

(13)

which we will refer to as the infinitesimal measure of influence (IMI).²

² As defined, the IMI is reminiscent of a standard robust-statistics metric known as the influence function [34]. A key distinction with our proposed metric is that the influence function measures the effect of an infinitesimal mass contamination placed at an arbitrary point, whereas the IMI focuses on the impact of an infinitesimal random contamination.

From (12) and (13),

(14)

with the weights given to the legitimate and outlying data as in (10) and (11), respectively.

For particular $u$ functions, these general results reduce to even simpler forms: for example, for functions satisfying an additional mild condition met by both $u_1$ and $u_2$, (14) further yields

Further, considering $\alpha$ small, the IMI associated with Tyler’s and Huber’s estimators can be approximated as

(15)

and

(16)

Hence, when $\frac{1}{N}\mathrm{tr}(C_N^{-1} D_N) < 1$, the IMI of Tyler’s estimator exceeds that of Huber’s, which shows that the influence of an infinitesimal fraction of outliers is higher for Tyler’s estimator than for Huber’s. In contrast, when $\frac{1}{N}\mathrm{tr}(C_N^{-1} D_N) > 1$, both Huber’s and Tyler’s estimators exhibit the same IMI.

For comparison, the measure of influence of the SCM can be written as

which is linear in $\varepsilon$. It follows immediately that

(17)

The fact that the measure of influence of the SCM is bounded may seem surprising, since it is known that a single arbitrary outlier can arbitrarily bias the SCM [34]; we recall, however, that the current model focuses on a particular random-outlier scenario. From (12), the SCM is more affected than a given M-estimator by the introduction of outliers if and only if

This further legitimizes the study in [19], which focused on these weights to assess the robustness of a given M-estimator. However, in the regularized case it will be shown that the relationship between the relative weights and robustness is more complex (see Subsection V-B).

Fig. 2 depicts the measure of influence for different $u$ functions, as the proportion of outlying samples increases. For each $u$ function ($u_1$ or $u_2$), we take the same value of $\alpha$. In addition, we show the measure of influence of the SCM, as well as the linear approximation (computed using (15), (16) and (17)) of the measure of influence in the neighborhood of $\varepsilon = 0$. We first set $C_N$ and $D_N$ such that $\frac{1}{N}\mathrm{tr}(C_N^{-1} D_N) > 1$, and then swap the roles of $C_N$ and $D_N$ (such that $\frac{1}{N}\mathrm{tr}(C_N^{-1} D_N) < 1$).

(a) Case $\frac{1}{N}\mathrm{tr}(C_N^{-1} D_N) > 1$.
(b) Case $\frac{1}{N}\mathrm{tr}(C_N^{-1} D_N) < 1$.
Fig. 2: Measure of influence as a function of the proportion of outliers $\varepsilon$, in the non-regularized case.

In the case where $\frac{1}{N}\mathrm{tr}(C_N^{-1} D_N) > 1$, Fig. 2 confirms that the measure of influence of both Tyler’s and Huber’s estimators is lower than that of the SCM, as corroborated by the approximations (15) and (16). This shows that the considered M-estimators are more robust to the introduction of outliers than the SCM. Furthermore, both Tyler’s and Huber’s estimators exhibit the same robustness for small $\varepsilon$. However, in the opposite case where $\frac{1}{N}\mathrm{tr}(C_N^{-1} D_N) < 1$, Tyler’s estimator is much less robust than both Huber’s estimator and the SCM, which are equally robust (for small $\varepsilon$). Since both $C_N$ and $D_N$ are unknown in practice, this suggests that choosing Huber’s estimator is preferable over Tyler’s.
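As a purely numerical companion to this discussion, one can probe robustness empirically by comparing estimates computed with and without outliers; the Frobenius distance below is only a simple proxy for the measure of influence defined earlier, not its exact expression:

import numpy as np

def empirical_influence(estimator, X_clean, X_out_pool, eps_grid):
    """Frobenius distance between trace-normalized estimates with and without
    an eps-fraction of outliers (a numerical proxy, not the paper's formula)."""
    N, n = X_clean.shape
    tn = lambda C: C / (np.trace(C) / N)           # trace normalization
    C_ref = tn(estimator(X_clean))                 # outlier-free reference
    influences = []
    for eps in eps_grid:
        k = int(eps * n)
        X_mix = np.hstack([X_clean[:, :n - k], X_out_pool[:, :k]])
        influences.append(np.linalg.norm(tn(estimator(X_mix)) - C_ref, 'fro'))
    return influences

# Example (reusing the earlier sketches): compare Tyler's and Huber's estimators
# with the SCM on the contaminated data generated previously.
# for est in [lambda X: maronna_m_estimator(X, u_tyler),
#             lambda X: maronna_m_estimator(X, u_huber),
#             lambda X: X @ X.conj().T / X.shape[1]]:
#     print(empirical_influence(est, X_clean, X_out, [0.0, 0.02, 0.05]))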

V Large-dimensional robustness: Regularized case

We now turn to the regularized case, which, in particular, allows $N > n$. The proofs of the technical results in this section are provided in Appendix A-C.

V-A Asymptotic equivalent form under outlier data model

To facilitate the robustness study of regularized M-estimators, we start by analyzing the large-dimensional behavior of these estimators in the presence of outliers.

For $\rho$ in the admissible range discussed in Remark 3 below, we define the regularized estimator $\hat{C}_N(\rho)$ associated with the function $u$ as the unique solution to the equation in $Z$:

$Z = (1-\rho)\,\frac{1}{n}\sum_{i=1}^{n} u\!\left(\tfrac{1}{N}\, x_i^* Z^{-1} x_i\right) x_i x_i^* + \rho\, I_N \qquad (18)$

where the $x_i$ now follow the contaminated data model described above.
Remark 3.

If Assumption 2 holds, then the full range $\rho \in (0, 1]$ is admissible. Furthermore, if $c < 1$, we can in fact also take $\rho = 0$. In the following, we assume that $\rho > 0$. Note that, similar to the outlier-free scenario, the introduction of a regularization parameter allows us to relax the assumption $c < 1$ required in Theorem 3.

Theorem 4.

Assume the same contaminated data model as in Theorem 3. Let Assumptions 1-2 hold and let $\hat{C}_N(\rho)$ be the unique solution to (18). Then, as $N, n \to \infty$, for all admissible $\rho$,

$\big\| \hat{C}_N(\rho) - \hat{S}_N(\rho) \big\| \xrightarrow{\text{a.s.}} 0,$

where $\hat{S}_N(\rho)$ is a regularized weighted SCM (a weighted SCM plus $\rho\, I_N$, in the spirit of Theorem 2), in which the legitimate and the outlying samples each receive their own positive weight, these weights being the unique positive solutions to a system of coupled fixed-point equations (19) determined by $\phi$, $c$, $\varepsilon$, $\rho$, $C_N$ and $D_N$.

Remark 4.

In the case $\varepsilon = 0$ (no outliers), Theorem 4 reduces to Theorem 2, while in the case $\rho = 0$ (if $c < 1$), it reduces to Theorem 3.

V-B Robustness analysis

Similar to the non-regularized case, we next make use of $\hat{S}_N(\rho)$ to study the robustness of $\hat{C}_N(\rho)$. Importantly, introducing a regularization parameter entails an additional variable to consider when studying the robustness of M-estimators.

We consider the solution to (18) for a given $\rho$ when there are no outliers. Similar to the non-regularized case, we define a measure of influence, now a function of both the contamination proportion $\varepsilon_n$ and the regularization parameter $\rho$.

By Theorem 4, we have the following corollary:

Corollary 2.

Let the same assumptions as in Theorem 4 hold. Then,

with

with defined as

Unlike in the non-regularized case, the form of the measure of influence renders the analysis difficult in general. For the specific case $\rho = 1$, however, the measure of influence vanishes for all $\varepsilon$. This is intuitive, and reflects the fact that the more we regularize an estimator, the more robust it becomes (for $\rho = 1$, the estimator boils down to the identity matrix, which is unaffected by the data). This extreme regularization, however, leads to a significant bias, and is therefore not desirable.

In the following, we focus on the infinitesimal measure of influence associated with , which is defined in a similar way as for the non-regularized case: assume for some (