Robustness has been a classical topic in statistics since the work of Hampel [MR0301858, MR0359096], Huber [MR2488795, MR0161415] and Tukey [MR0133937]. Informally, we call an estimator robust if it behaves well (in some sense) even when the data are not independent and sub-gaussian. In this work we deal with two different kinds of robustness:
Robustness against outliers
: The classical setup in statistics assumes that the observations are generated independently from a given probabilistic model. The field of robust statistics aims to relax this strong assumption, as real datasets are typically exposed to some source of contamination. Robustness to outliers was first described in [MR2488795]: in this context, the dataset is “close” to the ideal setup, but has been corrupted. The corruption of the data can be modeled in several ways. A classical example is Huber’s contamination model, where the data are i.i.d. with common distribution
Here is the distribution of the inliers, and is the distribution of the outliers (see [MR2488795]). Another example is the so-called framework, presented in [guillaume2017learning], where outliers may be dependent on each other and on the other, uncorrupted data. The model that we deal with in this work is even more general and is sometimes referred to as “adversarial contamination” (see, for instance, [MR3631028]). The samples are generated by the following process: first, samples are drawn from some unknown distribution; then, an adversary is allowed to look at the samples and arbitrarily corrupt an -fraction of them. In this setup, not only can outliers be correlated with each other and with inliers, but inliers can also be correlated with one another (because the adversary can choose which original samples to keep, thereby correlating the samples he keeps, for instance by keeping only the largest samples when they are real-valued). In recent years, designing outlier-robust estimators for high-dimensional settings has become a challenge in a number of applications, from the analysis of biological datasets to data poisoning attacks [Voro18].
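As a toy illustration (our own, not from the paper), the adversarial contamination process can be simulated as follows; the attack value and the rule of corrupting the largest points are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def adversarial_corrupt(sample, eps, attack):
    """Let an adversary inspect the sample and overwrite the eps*n largest
    values with `attack`.  This both plants outliers and correlates the
    surviving inliers (only the smallest points are kept); the attack rule
    and value are illustrative choices, not the paper's."""
    n = len(sample)
    k = int(eps * n)                  # number of corrupted points
    order = np.argsort(sample)        # the adversary looks at the data
    corrupted = sample.copy()
    corrupted[order[-k:]] = attack    # replace the k largest points
    return corrupted

clean = rng.normal(0.0, 1.0, size=1000)
dirty = adversarial_corrupt(clean, eps=0.05, attack=100.0)
print(np.mean(clean), np.mean(dirty))  # the empirical mean is destroyed
```

Note that the adversary here keeps only the smallest inliers, so even the surviving points are no longer an i.i.d. sample.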
Robustness against heavy-tailed data
: In this work, we are interested in estimators whose risks are controlled with very high probability without making boundedness or Gaussian assumptions on the data (or any other strong concentration assumption). Indeed, we want to avoid assumptions that severely limit the applicability of the results. Notice that it is not sufficient to bound the expected loss of an estimator, since high-probability guarantees derived from such a bound (using, for instance, Markov’s inequality) may be quite weak when the data come from a heavy-tailed distribution.
Let us give a simple example: For univariate mean estimation, if we are given
independent realizations of a random variable with mean
and variance, the empirical mean satisfies, with probability , . This bound deteriorates rapidly for small values of , but it is sharp for the empirical mean in general [AIHPB_2012__48_4_1148_0]. We informally call robust to heavy tails an estimator whose “loss” is bounded with probability by a quantity proportional to when . The definition of the loss may change depending on the problem at hand; in mean estimation, for instance, it is just the squared distance . We say that an estimator achieves a subgaussian rate if it achieves the same upper bound on the loss, up to multiplicative constants, as the empirical risk minimizer of a -sample of i.i.d. Gaussian variables, even when the real sample is heavy-tailed. In a sense, we want our estimator to be as good as if the data were Gaussian, even when they are not!
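The gap between Chebyshev-type guarantees and Gaussian behavior can be seen numerically. The following sketch (our own illustration; the Pareto tail index 2.1 is an arbitrary choice with finite variance only) compares the deviations of the empirical mean for Gaussian and heavy-tailed data on the same scale:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, delta = 100, 5000, 0.01

# Deviations of the empirical mean, in units of sigma / sqrt(n), over many
# repetitions: a Gaussian sample versus a centered, variance-normalized
# Pareto sample (tail index 2.1, so only moments below 2.1 are finite).
gauss_err = np.abs(rng.normal(0.0, 1.0, (reps, n)).mean(axis=1)) * np.sqrt(n)

a = 2.1
var = a / ((a - 1) ** 2 * (a - 2))             # variance of numpy's Pareto(a)
pareto = (rng.pareto(a, (reps, n)) - 1 / (a - 1)) / np.sqrt(var)
pareto_err = np.abs(pareto.mean(axis=1)) * np.sqrt(n)

# Chebyshev guarantees P(err > 1/sqrt(delta)) <= delta for both laws, but
# the extreme quantiles of the heavy-tailed sample are far worse in practice.
print(np.quantile(gauss_err, 0.999), np.quantile(pareto_err, 0.999))
```

Both samples satisfy the Markov/Chebyshev bound, yet the extreme deviations of the heavy-tailed empirical mean are much larger, which is exactly the weakness the subgaussian-rate program addresses.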
These two informal definitions are different, even if somehow related. To make this concrete, let us now consider high-dimensional mean estimation. Let denote i.i.d. random variables with mean and covariance matrix . The estimators described in [minsker2015geometric] or in [JMLR:v17:14-273] both achieve, with probability larger than ,
where is the canonical Euclidean norm on . This bound is proportional to , so these estimators are robust to heavy tails, but they do not achieve a subgaussian rate. Indeed, if are independent identically distributed Gaussian variables , it follows from the Borell–TIS inequality (see [Led01, Theorem 7.1] or [MR2814399, pages 56-57]) that with probability at least ,
where . It is straightforward to check that and , which leads to the rate in (2) (where is an absolute constant), which is called the subgaussian rate.
The main difference with (1) is that, in the subgaussian rate, the complexity term and the failure-dependent factor are not multiplied, but added. This rate is shown to be deviation-minimax up to constant factors in [lugosi2018nearoptimal]. Recently, the seminal paper [lugosi2019sub] described the first estimator to reach the subgaussian rate assuming only a finite second moment, soon followed by other works along the same lines ([guillaume2017learning], [depersin2019robust], [Bartlett19], [hopkins2018sub]).
In this work, we show that the analyses presented in [lugosi2019sub], [guillaume2017learning], [lerasle2019lecture] and [lecu2017robust], all based on the median-of-means principle and the use of Rademacher complexities, can be modified to achieve subgaussian rates for sparse problems assuming only bounded second-order moments. The method developed in [lerasle2019lecture] and [lecu2017robust] requires the data to have at least finite moments (where is the dimension of the space) in order to exploit the sparsity of the problem, offers no guarantees without that requirement, and is the best known to date. We show that we can drop this condition by judiciously introducing VC-dimension into the different proofs, exploiting the sparsity of the problem with only two moments. Classical approaches using local Rademacher complexities cannot achieve this type of subgaussian bound under only a second moment assumption in this setup. Indeed, as shown in the counter-example from Section 3.2.3 of [chinot2018statistical], local Rademacher complexities may scale like whereas the right Gaussian bound should be of the order of . In other words, the classical approach used so far does not capture the right statistical complexity of high-dimensional problems under low-dimensional structural assumptions and only a second moment assumption. Our VC-dimension based approach allows us to overcome this issue and to go beyond the subgaussian moment assumption that has appeared in all works on robust and subgaussian estimation in the high-dimensional framework [lerasle2019lecture]. We also show that this general technique can easily be replicated, and we give new robust estimators that achieve state-of-the-art bounds for different estimation tasks such as:
Regression, already studied in [lecu2017robust], where our estimator’s rate matches the one from [lecu2017robust], and sparse regression, where our estimator’s rate is the first to match the one from [lecu2017robust] with only two moments.
Mean estimation with non-Euclidean norms, studied in [lugosi2018nearoptimal], where our analysis gives a different rate that is better for some norms.
Covariance estimation, studied in [mendelson2018robust] under norm equivalence: we do not need this assumption with our analysis, thus we give the first subgaussian estimator without this assumption.
Robust low-rank covariance estimation, studied in [Wei2017EstimationOT]. We give an estimator that achieves better bounds and is, to our knowledge, the first MOM estimator for that problem.
Using VC-dimension in mean estimation, we lose a nice dependence of the risk bounds on the covariance structure: our rates for (non-sparse) mean estimation depend on the ambient dimension instead of the effective rank . In particular, the general approach does not generalize directly to infinite-dimensional spaces. In the last section, we show that this issue can be overcome if we have some knowledge of the covariance matrix.
2 Warm-up: MOM principle, VC-dimension and mean estimation
We start with the mean estimation problem in , which illustrates our technique: the goal is to estimate the mean
of a random vector in , given a possibly corrupted dataset of i.i.d. copies of . The precise setting is the following:
Let denote independent and identically distributed random vectors in . We want to estimate , assuming that has finite second moment. Let denote the unknown covariance matrix of .
The vectors are not observed; instead, this dataset may have been corrupted, possibly adversarially: there exists a (possibly random) set such that, for any , . The satisfies
The observed dataset is , and we want to recover .
Notice that there is no assumption on the data . In particular, these may depend on , and the may have an arbitrary dependence structure.
We start this part by recalling some basic facts about VC-dimension that appear for instance in [JMLR:v20:17-504].
Let be a set of Boolean functions on any space . We say that a finite set is shattered by if, for every subset , there exists such that . We call the VC-dimension of (and denote it ) the largest integer such that there exists a set of cardinality that is shattered by .
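For small finite classes, the definition of shattering can be checked by brute force. The following sketch (our own illustration; `shatters` and `vc_dimension` are hypothetical helper names) recovers the textbook values for one-dimensional half-lines and intervals:

```python
from itertools import combinations

def shatters(points, classifiers):
    """True if `classifiers` realizes every labeling of `points`."""
    labelings = {tuple(f(x) for x in points) for f in classifiers}
    return len(labelings) == 2 ** len(points)

def vc_dimension(domain, classifiers, max_d=5):
    """Largest d such that some d-point subset of `domain` is shattered."""
    d = 0
    for k in range(1, max_d + 1):
        if any(shatters(s, classifiers) for s in combinations(domain, k)):
            d = k
    return d

domain = [0.0, 1.0, 2.0, 3.0]
# Half-lines {x : x >= t}: VC-dimension 1 (no pair can be labeled (1, 0)).
thresholds = [lambda x, t=t: x >= t for t in [-0.5, 0.5, 1.5, 2.5, 3.5]]
# Intervals [a, b]: VC-dimension 2 (no triple can be labeled (1, 0, 1)).
intervals = [lambda x, a=a, b=b: a <= x <= b
             for a in [-0.5, 0.5, 1.5, 2.5] for b in [0.5, 1.5, 2.5, 3.5]]

print(vc_dimension(domain, thresholds), vc_dimension(domain, intervals))
```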
Whenever is a Euclidean space, we will sometimes abusively call VC-dimension of a set , and denote by , the VC-dimension of the set of half-spaces generated by the vectors of :
Let us recall some basic facts about VC dimensions.
. More generally, if is a set of real-valued functions in a -dimensional linear space, then has VC-dimension (see for instance [dudley1978], Theorem 7.2).
If is a set of functions and , then .
For any , , see Section 6.
Sauer’s Lemma [SAUER1972145]: Let denote a set of functions with VC-dimension and let be a set of points. Let , then
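Sauer’s lemma can be checked numerically on a toy class. In the sketch below (our own illustration), intervals over six integer points have VC-dimension 2, and the number of labelings they induce meets the bound exactly:

```python
from math import comb

def growth(points, classifiers):
    """Number of distinct labelings the class induces on `points`."""
    return len({tuple(f(x) for x in points) for f in classifiers})

def sauer_bound(n, d):
    """Sauer's bound: a class of VC-dimension d induces at most
    sum_{i=0}^{d} C(n, i) labelings on any n points."""
    return sum(comb(n, i) for i in range(d + 1))

n = 6
points = list(range(n))
# Intervals [a, b] over 0..5 have VC-dimension 2; here the bound is tight:
# 21 nonempty contiguous runs plus the empty labeling = 22 = C(6,0)+C(6,1)+C(6,2).
intervals = [lambda x, a=a, b=b: a <= x <= b
             for a in range(-1, n) for b in range(-1, n)]

print(growth(points, intervals), "<=", sauer_bound(n, 2))
```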
This last lemma can be used to prove the following result that is useful to bound the VC dimension of the set of sparse vectors.
Let , …, denote sets of boolean functions, each having VC-dimension . Then,
Lemma 1 is a straightforward extension of Theorem 3 in [JMLR:v20:17-504].
Let be a set shattered by , and . Because is shattered, we have . But we also have , so .
It follows that or . By technical Lemma 4.6 in [10.5555/524030] if , then . Hence, , which implies Lemma 1.
Fix and denote by the set of -sparse vectors; then
To prove Corollary 1, just write the set as a union of -dimensional subspaces. As a side remark, we note that [JMLR:v20:17-504] also shows that this bound is tight up to multiplicative constants: there exists an absolute constant such that . Besides, the result holds even if the set of vectors is not an orthogonal family or not a basis. Let us now recall an important theorem that will be very useful in regression and covariance estimation. Let denote a set of multivariate polynomials. A sign assignment is an element of . The sign assignment is consistent with if there exists such that .
Theorem 1 (Warren, [10.2307/1994937]).
Let denote a set of polynomials of degree at most in real variables with ; then the number of sign assignments consistent with is at most .
Assume that the set of functions can be written , then .
The following example will be useful in some applications.
Let and call the set of rank , symmetric, -dimensional matrices.
Let . Then .
This work uses the median-of-means (MOM) approach which was introduced in [MR702836, MR1688610, MR855970]
and has received a lot of attention recently in the statistics and machine learning communities [MR3124669, LO, MR3576558, MS, minsker2015geometric]. This approach allows one to build estimators that are robust both to outliers and to heavy-tailed data in various settings [MR1688610, MR855970, MR762855]. It can be defined as follows: we first randomly split the data into blocks of equal size (if does not divide , we just remove some data). We then compute the empirical mean within each block: for ,
In the one-dimensional case, the final estimator is the median of the latter empirical means. This estimator has subgaussian deviations as shown in [MR3576558]. The extension of this result to higher dimensions is not trivial as there exist several possible generalizations of the one dimensional median [minsker2015geometric].
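In the one-dimensional case, the whole construction fits in a few lines. This sketch (our own minimal implementation; we use consecutive blocks for simplicity where the text splits at random) shows the median of block means resisting both heavy tails and planted outliers, while the empirical mean is destroyed:

```python
import numpy as np

rng = np.random.default_rng(2)

def median_of_means(x, k):
    """Median-of-means: split into k equal-size blocks (dropping the
    remainder, as in the text; consecutive blocks stand in for the
    random split), average each block, take the median."""
    m = (len(x) // k) * k
    blocks = np.asarray(x[:m]).reshape(k, -1)
    return np.median(blocks.mean(axis=1))

# Heavy-tailed sample (Student t, 2.5 dof, true mean 0) plus 2% of
# adversarially planted outliers; the parameters are illustrative.
x = rng.standard_t(2.5, size=1000)
x[:20] = 1e6
print(np.mean(x), median_of_means(x, k=100))
```

With 100 blocks, the 20 corrupted points can spoil at most two block means, which the median then ignores.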
For any , let and . We start with a basic observation.
When , there are at least blocks on which .
For instance, if , then there exist at least three quarters of the blocks on which . We can now state the main lemma.
Let be a set of Boolean functions satisfying the following assumptions.
For all , .
where is a universal constant.
Then, with probability , for all , there are at least blocks on which .
In words, if each property holds on a non-corrupted block with constant probability (here , but it could be any fixed constant ) and is large enough, then, with very high probability, all properties are “true for most of the blocks”. This result is an alternative to [lugosi2018nearoptimal, Theorem 2], with the complexity measured by VC-dimension instead of Rademacher complexity. We show below that this difference yields substantial improvements in some examples, such as sparse multivariate mean estimation, compared with the bounds in [lugosi2018nearoptimal]. The strength of this result is that it is uniform in and gives an exponentially small failure probability, while its proof is quite simple. The proof is given in Section 6.2.
Clearly, the fraction of the blocks is arbitrary in Lemma 2. In fact, up to modifications of the constants, the same result holds for any fixed fraction .
2.3 Mean estimation
Let denote a norm on and let denote its dual norm. Let denote the unit ball for the norm and the one for the norm . Let denote the set of extremal vectors of . Let where is the Euclidean norm on . Let
There exists a universal constant such that if , then, with probability larger than ,
In particular, for any , there exists an estimator such that
The “outlier” term can be shown to be optimal in important cases (see the remarks after [MR3909640, Theorem 1.3]). The deviation term () is the same as in the Borell–TIS inequality, being the weak variance term. It is optimal, as shown in [lugosi2018nearoptimal]. The difference lies in the complexity term, which is here . Neither [lugosi2018nearoptimal] nor this work builds estimators achieving the subgaussian rate, where this complexity is , being a centered Gaussian vector with the same covariance as . We do not know if this rate can be reached in general.
The inequality gives a general bound on the complexity term.
The complexity term in [lugosi2018nearoptimal], which can also be found in [lerasle2019lecture, Chapter 4, Lemma 47], is , where , being i.i.d. Rademacher variables. Here it is . Which of the two is better depends on the situation. For instance, when one wishes to estimate with respect to , the Euclidean norm on , , while ,
being the largest eigenvalue of , so the former is better. In this example, the VC-dimension bound loses the nice dependence on the covariance structure. On the other hand, suppose that we want to estimate with respect to the sup norm and assume for simplicity that . Then and so
On the other hand, if we only have two moments on the coordinates of , then the best bound on the Rademacher complexity is which is of order in general (to see that, take for a random vector whose coordinates are independent, equal to with probability and otherwise).
The analysis of Section 6.2, and in particular Lemma 4’, shows that the estimator achieves the bound when , where the complexity is the minimum of the VC-dimension and the Rademacher complexity . Therefore both our bounds and the bound of [lugosi2018nearoptimal] hold simultaneously, and we can always keep the “best complexity term” among the VC and Rademacher complexities. As the main novelty here is the introduction of the VC-dimension, we do not restate this fact in each application. The interested reader can keep in mind that, in most examples, the same result holds and the estimators have risk bounds smaller than both complexities. Our aim is to show that VC-type bounds are particularly efficient in and sparse scenarios, where the Rademacher complexity fails to achieve optimal bounds.
3 Sparse setting and other estimation tasks
This section shows that the methodology of Theorem 2 also applies to a great variety of estimation tasks. Let us start with the example of sparse mean estimation for the Euclidean norm.
3.1 Sparse mean estimation
For any , let denote the set of -sparse vectors over the dictionary . We fix for this part the vectors and we write . We consider Setting 1 and assume furthermore that belongs to . We denote by the unit ball for the canonical Euclidean norm in , and we propose the estimator
There exists an absolute constant such that, if , then, with probability larger than ,
Here, is the largest eigenvalue of .
The conclusion of Theorem 3 can be written as follows. For any , there exists an estimator such that
We see that the complexity () is once again decoupled from the deviation (), which is not the case in works such as [JMLR:v17:14-273] where those two terms are multiplied together. The complexity term is not optimal because it does not depend on the structure of (see Section 4 for details). However, our complexity term is interesting for two main reasons:
This is the first sparsity-dependent bound that holds without higher moment conditions than the ones. By contrast, [lerasle2019lecture] and [lecu2017robust] need to assume the existence of subgaussian moments in order to make the sparsity appear, and offer no guarantees without that requirement.
It comes close to the theoretical optimum when .
In this section, we consider the standard linear regression setting, where data are couples and we look for the best linear combination of the coordinates of to predict ; that is, we look for defined as follows: Given (in practice, we will only study or ),
As in the previous section, the observed dataset is a possibly adversarially corrupted version of the i.i.d. dataset . The assumptions made on the good data are gathered in the following setting (see also [lerasle2019lecture] or [audibert2011]).
There exists a (possibly random) set such that, for any , where are independent identically distributed observations in . Let , we make three main assumptions:
has a finite second moment; write its -moments matrix . Let also be the ellipsoid associated with this structure.
Let and assume that .
There exists a universal constant such that, for all , .
Condition 2 is implied by Assumptions 3.5 and 3.7 in [audibert2011], the same assumption is made in [lerasle2019lecture].
Condition 3 is called the “small ball hypothesis”; it is described in detail in [mendelson2017extending] or in [lecu2016regularization], for instance. It is implied by Condition 3.5 in [audibert2011] and stated similarly in [lerasle2019lecture].
The last two conditions may seem exotic; we refer to [audibert2011, Section 3] for detailed discussions and examples where they are satisfied. For the moment, let us emphasize that they involve only first and second moment conditions on and
Our estimator is defined as follows: Let , where
is the first quartile over : for any sequence , if we denote by the corresponding increasingly ordered sequence, then . Then,
This new estimator satisfies the following result.
There exists an absolute constant such that the following holds. Let or and let . Then, with probability ,
For all ,
So, if , then
If , which holds for instance when (linear regression with independent noise), we have : it is interesting to note that in this case the excess loss scales with and not with (and becomes just the variance of the noise).
The conclusion of Theorem 4 can be written as follows: for any , there exists an estimator such that
Once again, we notice the nice decoupling between complexity and deviation. This result is interesting for several reasons. First, this work is the first to give a bound holding with exponentially high probability without assuming more than two moments on the design , even in the sparse setting. By comparison, [lecu2017robust] and [lugosi2017regularization], for instance, assume that subgaussian moments exist to achieve this kind of rate, offer no guarantees without that requirement, and are the best to date.
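The first-quartile-over-blocks device used in the definition of the estimator can be illustrated as follows. This is our own toy sketch, not the paper’s min-max procedure: it only shows how the quartile of block-averaged loss differences separates a good candidate from a bad one under heavy-tailed design and noise.

```python
import numpy as np

rng = np.random.default_rng(3)

def first_quartile(values):
    """First quartile as in the text: the ceil(K/4)-th smallest value."""
    v = np.sort(np.asarray(values))
    return v[int(np.ceil(len(v) / 4)) - 1]

def block_means(values, k):
    """Average `values` within k equal-size consecutive blocks."""
    m = (len(values) // k) * k
    return np.asarray(values[:m]).reshape(k, -1).mean(axis=1)

# Toy data: heavy-tailed design and noise (Student t with 3 dof).
n, d, k = 600, 5, 60
X = rng.standard_t(3.0, size=(n, d))
t_star = np.ones(d)                            # the true regression vector
y = X @ t_star + rng.standard_t(3.0, size=n)
t_bad = np.zeros(d)                            # a bad candidate

loss = lambda t: (y - X @ t) ** 2
# First quartile of the block-averaged loss differences: a negative value
# is strong evidence that t_star beats t_bad on most blocks.
q = first_quartile(block_means(loss(t_star) - loss(t_bad), k))
print(q)
```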
3.3 Covariance estimation
This section studies the problem of robust covariance estimation. Consider Setting 1, and assume that is known, fixed to without loss of generality. We want to estimate . This problem has a number of applications: the bounds we present can, for instance, easily be transposed (via the Davis–Kahan theorem) to the problem of robust PCA. It has already been studied in [Wei2017EstimationOT] and [JMLR:v17:14-273], but these estimators do not exhibit any decoupling between complexity and deviation. In [mendelson2018robust], the authors propose a robust estimator of the covariance using the MOM method and obtain the optimal complexity-deviation decoupling. They also give interesting comments and insights on this estimation problem. However, they need a norm equivalence in order to do so. In addition, they do not study the problem of low-rank estimation that we present here.
For any matrix , define its spectral norm by
Let denote the set of dimensional symmetric positive matrices. Assume that
This quantity is sometimes referred to as the weak variance
of a random matrix [mendelson2018robust]. Our estimator is defined as follows:
It satisfies the following bound.
There exists an absolute constant such that, if , then, with probability larger than ,
Assume that , then, for
The “bounded kurtosis assumption” appears similarly in [JMLR:v17:14-273]. In [JMLR:v17:14-273], the estimator achieves a bound of order , where is the effective rank of the covariance matrix: once again, the complexity is multiplied by the deviation term in that case, while here they are decoupled: the dimension does not multiply in our bound. In [mendelson2018robust], the authors give a better rate (because they only need to be larger than instead of for the bound to hold), but they use the norm equivalence, which is a strong hypothesis we do not need here.
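A minimal MOM-flavored sketch for covariance is given below. This is our own illustration with an entrywise median across blocks, not the paper’s estimator (which is tuned for the spectral norm); it only shows how block-wise voting neutralizes a corrupted fraction of the sample.

```python
import numpy as np

rng = np.random.default_rng(4)

def mom_covariance(X, k):
    """MOM-flavored covariance sketch: empirical covariance on each of k
    consecutive blocks (the mean is known to be 0, as in the text), then
    the entrywise median across blocks."""
    m = (len(X) // k) * k
    blocks = X[:m].reshape(k, -1, X.shape[1])
    covs = np.stack([b.T @ b / b.shape[0] for b in blocks])
    return np.median(covs, axis=0)

d, n, k = 3, 3000, 100
Sigma = np.diag([1.0, 2.0, 3.0])
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
X[:30] = 1e3                       # 1% adversarially corrupted rows

naive = X.T @ X / n                # ruined by the corrupted rows
robust = mom_covariance(X, k)      # the corrupted block is voted out
print(np.linalg.norm(naive - Sigma, 2), np.linalg.norm(robust - Sigma, 2))
```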
To conclude this section, we propose an estimator of the covariance in Frobenius norm. What is interesting here is that, when we know that the covariance matrix has rank , we can exploit the low-rank assumption in the VC-dimension analysis, while it might be harder to grasp using other notions of complexity. This work is the first, to our knowledge, to exhibit the decoupling between complexity and deviation for this problem, and this estimator is the first of its kind. We denote by the Frobenius norm for matrices, and by the inner product for matrices.
There exists a universal constant such that, if and , then, with probability larger than ,
The conclusion of Theorem 6 can be written as follows: for any , there exists an estimator such that
Compared with [JMLR:v17:14-273], we note that this bound on the risk of our estimator does decouple the complexity and the deviation as announced. Again it holds under only a moment assumption, with no extra subgaussian moment needed to get this bound.
4 An algorithm to improve risk bounds
The different applications of Lemma 2 show that, in general, the complexity term derived from this result is not optimal. For example, for mean estimation in Euclidean norm, the complexity term reached by our estimator is proportional to , while the best rate would be .
In this section, we provide an algorithm that leads to better and in some cases optimal complexity rates.
The price to pay is that these new estimators require some knowledge on the covariance matrix .
We will consider the example of sparse mean estimation for the sake of clarity, but we argue that it also holds for sparse (and non-sparse) regression.
We therefore use Setting 1 and assume furthermore that the mean belongs to , where the set of vectors is fixed and known.
Let denote the eigenvalues of in decreasing order, and let
denote a set of normalized corresponding eigenvectors. For any , let denote the largest index such that . In particular, . By convention, let . Finally, we write , with the convention . If we know the matrix
, we can identify the eigenspaces and thus compute the orthogonal projections of the data onto these subspaces: , for and .
In Section 3.1, we described a procedure that takes as input an integer and a (possibly corrupted) dataset having common mean , which is -sparse relative to a set of vectors , and common covariance matrix . The procedure returns satisfying, with probability at least
Let denote the output of this procedure. The idea of the algorithm is to project on the subspaces and to apply this preliminary procedure in those subspaces. Let denote the dimension of . The algorithm is formally defined as follows:
The algorithm produces an estimator satisfying the following result.
The complexity term in Proposition 3 is better than in Theorem 3, because . More importantly, this complexity term depends on the covariance structure of the data through the . In the case of sparse mean estimation, we can deduce a precise estimate of this complexity term: for any , there exists an estimator such that:
This estimate comes from the bounds , where is such that .
We then write .
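The projection-and-recombine idea of this section can be sketched as follows. This is a hedged toy: a coordinatewise median stands in for the Section 3.1 procedure, and the grouping of eigenvalues is passed by hand rather than computed from the thresholds of the text.

```python
import numpy as np

rng = np.random.default_rng(5)

def blockwise_projection_estimate(X, Sigma, cuts, base_estimator):
    """Project the data onto groups of eigenvectors of the known covariance
    `Sigma` (groups delimited by the indices in `cuts`), estimate the mean
    within each subspace with `base_estimator`, and recombine."""
    vals, vecs = np.linalg.eigh(Sigma)
    vecs = vecs[:, np.argsort(vals)[::-1]]     # eigenvalues in decreasing order
    d = X.shape[1]
    estimate = np.zeros(d)
    for lo, hi in zip([0] + cuts, cuts + [d]):
        V = vecs[:, lo:hi]                     # one group of eigenvectors
        estimate += V @ base_estimator(X @ V)  # estimate within the subspace
    return estimate

d = 4
Sigma = np.diag([10.0, 10.0, 0.1, 0.1])        # two well-separated eigenspaces
X = rng.multivariate_normal(np.ones(d), Sigma, size=2000)
# A coordinatewise median stands in for the preliminary procedure.
mu_hat = blockwise_projection_estimate(X, Sigma, [2],
                                       lambda Z: np.median(Z, axis=0))
print(mu_hat)
```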
5 Conclusion: concurrent work and discussion
This work is not the first to deal with robust estimation: many results and algorithms have been developed over the past few years for sparse estimation in the presence of outliers (see for instance [li2017robust]), but most of these works assume that the non-corrupted data are Gaussian. For instance, [MR3845006] already deals with mean and covariance estimation using extensions of the Tukey depth (and using VC-dimension), but their methods rely on the informative data being Gaussian.
Robustness to heavy-tailed data has also been studied in various works, see [lugosi2019mean] for a survey of recent developments. We already mentioned two articles that this work tries to complete and improve: [lugosi2018nearoptimal] for mean estimation under any norm and [lecu2017robust] for sparse regression. Though the techniques involved are close, this work illustrates that using VC-dimension can drastically improve risk bounds in various applications, in particular in the sparse setting.
There are still many exciting open questions. The quantity that is crucial in all the studies is
where are Boolean functions. In mean estimation, for instance, is the important quantity. Bounding this quantity using the VC-dimension of yields a bound independent of the covariance of . On the other hand, bounding that quantity by the Rademacher complexity of the (as in [lecu2017robust], [lugosi2018nearoptimal], or here in Section 6.2), stating that
does not exploit the boundedness of the indicator function and requires unnecessarily strong assumptions on the data. The ideal would be to reconcile both ideas, and to find a nice in-between that takes into account both the boundedness and the dependence on the covariance structure.
The last point we make is about computational issues. The estimators presented cannot be implemented as is. Nevertheless, encouraging recent works have shown that “relaxed”, computable estimators can be derived from this kind of work. For instance, the pioneering work of [hopkins2018sub], followed by [depersin2019robust] and [Bartlett19], derived tractable estimators, running in polynomial time, from the work of [lugosi2019sub]. Even more recently, new tractable estimators for regression and covariance estimation with heavy-tailed data have emerged in [cherapanamjeri2019algorithms]. We can hope for this work to be made tractable as well, which seems to be quite a challenge.
6.1 A fact about VC-dimension
For any Euclidean space , and any
where is a universal constant.
Assume that a set is shattered by . Then, for any , there is a vector so that if and only if . There is a vector so that if and only if . Then we have if and only if , so shatters , and .
Now we see that because if is shattered by then is shattered by . Theorem 1.1 in [vandervaart] states that for some constant , and that concludes the proof.
6.2 General methodology
We want to prove that, with probability ,
If , and it is sufficient to show that by Remark 1. Now we write
By the bounded difference inequality [MR3185193, Theorem 6.2], with probability , .
For the magnitude term, we write
By hypothesis, . Then, we just have to use a classical result of Vapnik-Chervonenkis theory, either in the version of [vershynin_2018, Theorem 8.3.23], or of [Handel_2016, Corollary 7.18]. There exists a universal constant such that
Hence, if ,
Putting everything together, we have the following. If , with probability , . Therefore, by Remark 1, for all
We state a technical lemma that appears in most proofs. Let be any measurable function so that exists. We take
where are any subsets of . We have:
If and if , then, with probability ,
where is a universal constant
is the “weak variance” of the problem.
Let with the universal constant from Lemma 2, let and let
Let . The functions are compositions of the function and of the functions . The VC-dimension of the set of these compositions is smaller than the VC-dimension of the set of indicator functions indexed by , as recalled in basic fact 2 at the beginning of Section 2.1. We then use fact 3 to remove the , and we get for some constant .
By Markov’s inequality, for any ,
By Lemma 2, applied with , the following event has probability