# Robust subgaussian estimation with VC-dimension

Median-of-means (MOM) based procedures provide non-asymptotic and strong deviation bounds even when data are heavy-tailed and/or corrupted. This work proposes a new general way to bound the excess risk for MOM estimators. The core technique is the use of VC-dimension (instead of Rademacher complexity) to measure the statistical complexity. In particular, this allows to give the first robust estimators for sparse estimation which achieves the so-called subgaussian rate only assuming a finite second moment for the uncorrupted data. By comparison, previous works using Rademacher complexities required a number of finite moments that grows logarithmically with the dimension. With this technique, we derive new robust sugaussian bounds for mean estimation in any norm. We also derive a new robust estimator for covariance estimation that is the first to achieve subgaussian bounds without L_4-L_2 norm equivalence.

## Authors

• 5 publications
01/22/2021

### On the robustness to adversarial corruption and to heavy-tailed data of the Stahel-Donoho median of means

We consider median of means (MOM) versions of the Stahel-Donoho outlying...
12/09/2018

### Uniform bounds for robust mean estimators

Median-of-means technique is an elegant and general method for estimatin...
02/01/2021

### Optimal robust mean and location estimation via convex programs with respect to any pseudo-norms

We consider the problem of robust mean and location estimation w.r.t. an...
03/12/2020

### Benign overfitting in the large deviation regime

We investigate the benign overfitting phenomenon in the large deviation ...
03/30/2022

### Robust signal dimension estimation via SURE

The estimation of signal dimension under heavy-tailed latent factor mode...
02/24/2022

### U-statistics of growing order and sub-Gaussian mean estimators with sharp constants

This paper addresses the following question: given a sample of i.i.d. ra...
06/07/2019

### Robust subgaussian estimation of a mean vector in nearly linear time

We construct an algorithm, running in nearly-linear time, which is robus...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Robustness has been a classical topic in statistics since the work of Hampel [MR0301858, MR0359096], Huber([MR2488795, MR0161415]) and Tukey [MR0133937]. Informally, we call an estimator robust if it behaves nicely (in some sense) even when the data are not independent and sub-gaussian. In this work we deal with two different kinds of robustness :

• [label=-, itemsep=0pt]

• Robustness against outlier

: The classical setup in statistics is that the observations were generated independently from a given probabilistic model. The field of robust statistics aims to relax this strong assumption, as real datasets are typically exposed to some source of contamination. Robustness to outliers has first been described in

[MR2488795]: in this context, the dataset is “close” to the ideal setup, but has been corrupted. The corruption of the data can be modeled in several ways. A classical example is Huber’s contamination model, where the data are i.i.d. with common distribution

 dP=(1−ϵ)dPi+ϵdPo.

Here is the distribution of the inliers, and is the distribution of outliers (see [MR2488795]). Another example is the so-called framework, presented in [guillaume2017learning], where outliers may not be independant from each other, nor independant from the other uncorrupted data. The model that we deal with in this work is even more general and is sometimes referred to as “adversarial contamination” (see, for instance, [MR3631028]). The samples are generated from the following process: First, samples are drawn from some unknown distribution. Then, an adversary is allowed to look at the samples and arbitrarily corrupt an -fraction of them. In this setup, not only can outliers be correlated to each other and to inliers, but inliers can also be correlated to one another (because the adversary can choose which original samples to keep and in doing so correlating the samples that he keeps, for instance only keeping the largest samples when they are real-valued). In recent years, designing outlier-robust estimators for high-dimensional settings has become a challenge in a number of applications from the analysis of biological datasets to data poisoning attacks [Voro18].

• Robustness against heavy-tailed data

: In this work, we are interested in estimators whose risks are controlled with very high probability

without making either boundedness or gaussian assumptions on the data (or any other strong concentration assumptions). Indeed, we want to avoid those assumptions that limit severely the applicability of the results. Notice that it is not sufficient to bound the expected loss of an estimator, since high probability guarantees derived from such bound (using for instance Markov’s inequality) may be quite weak when data comes from heavy-tail distribution.

Let us give a simple example: For univariate mean estimation, if we are given

independant realizations of a random variable with mean

and variance

, the empirical mean satisfies, with probability , . This bound rapidly deteriorates for small values of , but it is sharp for the empirical mean in general [AIHPB_2012__48_4_1148_0]. We call informally robust to heavy tail an estimator whose “loss” is bounded with probability by a quantity proportional to when . The definition of the loss may change depending on the problem at stack, in mean estimation for instance it is just the squared distance . We say that an estimator achieves a subgaussian rate if it achieves the same upper bound on the loss, up to multiplicative constants, as the empirical risk minimizer of a -sample of i.i.d. gaussian variables, even when the real sample is heavy tailed. In a sense, we want our estimator to be as good as if the data were Gaussian, even when it is not !

These two informal definitions are different even if somehow related. For this to make sense, let us consider now high dimension mean estimation. Let denote i.i.d. random variables with mean and covariance matrix . The estimators described in [minsker2015geometric] or in [JMLR:v17:14-273] both achieve, with probability larger than ,

 ∥^μ−μ∥2≤dlog(1/δ)N, (1)

where is the canonical Euclidean norm on . This bound is proportional to , thus the estimators are robust to heavy tails, but they do not achieve a subgaussian rate. Indeed, if are independent identically distributed Gaussian variables , it follows from Borell-TIS’s inequality (see [Led01, Theorem 7.1] or [MR2814399, pages 56-57]) that with probability at least ,

where . It is straightforward to check that and , which leads to the rate in (2) (where is an absolute constant), which is called subgaussian rate

 ∥∥¯ZN−μ∥∥22≤C(dN+log(1/δ)N):=Crδ. (2)

The main difference with (1) is that, in the subgaussian rate, the complexity term and the failure-dependant factor are not multiplied, but added instead. This rate is shown to be deviation-minimax up to constant factors in [lugosi2018nearoptimal]. Recently, the seminal paper [lugosi2019sub] described the first estimator to reach the subgaussian rate only assuming finite second moment, soon followed by other works in that path ([guillaume2017learning], [depersin2019robust], [Bartlett19], [hopkins2018sub]).

In this work, we show that the analysis presented in [lugosi2019sub], in [guillaume2017learning], [lerasle2019lecture] or in [lecu2017robust], all based on the Median-of-mean principle and the use of Rademacher complexities, can be modified in order to achieve sub-gaussian rates for sparse problems assuming only bounded two-order moments. The method developed in [lerasle2019lecture] or in [lecu2017robust] requires data to have at least finite moments (where is the dimension of the space) in order to exploit the sparsity of the problem and offers no guarantees without that requirement, and to the date is the best known. We show that we can drop this condition by judiciously introducing VC-dimension in the different proofs, and exploit the sparsity of the problem with only two moments. Classical approaches using local Rademacher complexities cannot achieve this type of subgaussian bounds under only a second moment assumption in this setup. Indeed, as shown in the counter-example from Section 3.2.3 of [chinot2018statistical], local Rademacher complexities may scale like whereas the right Gaussian bound should be of the order of . Somehow the classical approach used so far does not capture the right statistical complexity of high-dimensional problems under low-dimensional structural assumptions and under only a second moment assumption. Our VC-dimension based approach allows to overcome this issue and to go beyond this subgaussian moments assumption that has appeared in all works on robust and subgaussian estimation in the high-dimensional framework [lerasle2019lecture]. We also show that this general technique can be easily replicated and give new robust estimators that achieve state-of-the-art bounds for different estimation tasks such as :

• [label=-, itemsep=0pt]

• Regression, already studied in [lecu2017robust] where our estimator’s rate match the one from [lecu2017robust], and sparse regression where our estimator’s rate is the first to match the one from [lecu2017robust] with only two moments.

• Mean estimation with non-Euclidean norms, studied in [lugosi2018nearoptimal], where our analysis gives a different rate that is better for some norms.

• Covariance estimation, studied in [mendelson2018robust] under norm equivalence: we do not need this assumption with our analysis, thus we give the first subgaussian estimator without this assumption.

• Robust low-rank covariance estimation, studied in [Wei2017EstimationOT]. We give an estimators that achieves better bounds and is to our knowledge the first MOM estimator for that problem.

Using VC-dimension in mean estimation, we loose a nice dependence of the risk bounds in the covariance structure: our rates for (non-sparse) mean estimation depend on the ambient dimension instead of the effective rank . In particular, the general approach does not generalize directly to infinite dimensional spaces. In the last section, we show that this issue can be overcome if we have some knowledge on the covariance matrix.

## 2 Warm-up : MOM principle, VC-dimension and mean estimation

We start with the mean estimation problem in that illustrates our technique: the goal is to estimate the mean

of a random vector

in given a possibly corrupted dataset of i.i.d. copies of . The precise setting is the following:

###### Setting 1.

Let denote independent and identically distributed random vectors in . We want to estimate , assuming that has finite second moment. Let denote the unknown covariance matrix of .

The vectors are not observed, instead, this dataset may have been corrupted, and this corruption may be adversarial: there exists a (possibly random) set such that, for any . The satisfies

The observed dataset is , and we want to recover .

Notice that there are no assumption on the data . In particular these may be dependent of , and the may have arbitrary dependence structure.

### 2.1 VC-dimension

We start this part by recalling some basic facts about VC-dimension that appear for instance in [JMLR:v20:17-504].

###### Definition 1.

Let be a set of Boolean functions on any space . We say that a finite set is shattered by if, for every subset , there exists such that . We call VC-dimension of (and note ) the largest integer such that there exists a set of cardinal that is shattered by .

Whenever is a Euclidean space, we will sometimes abusively call VC-dimension of a set and note the VC-dimension of the set of half-spaces generated by the vectors of :

Let us recall some basic facts about VC dimensions.

1. . More generally, if a set of real-valued functions in a -dimensional linear space, then has VC-dimension (see for instance [dudley1978], Theorem 7.2).

2. If is a set of functions and , then .

3. For any , , see Section 6.

4. Sauer’s Lemma [SAUER1972145]: Let denote a set a functions with VC-dimension and let be a set of points. Let , then

 Card(F∗S)≤(enν)ν.

This last lemma can be used to prove the following result that is useful to bound the VC dimension of the set of sparse vectors.

###### Lemma 1.

Let , …, denote sets of boolean functions, each having VC-dimension . Then,

 VC(F1∪F2∪...∪Fn)≤2ν+2log(n).

Lemma 1 is a straightforward extension of Theorem 3 in [JMLR:v20:17-504].

###### Proof.

Let be a set shattered by , and . Because is shattered, we have . But we also have , so .

It follows that or . By technical Lemma 4.6 in [10.5555/524030] if , then . Hence, , which implies Lemma 1.

###### Corollary 1.

Fix and note the set of s-sparse vectors, then

 VC(Us)≤4slog2(ed/s).

To prove Corollary 1, just write the set as a union of -dimensional subspaces. As a side remark, we note that [JMLR:v20:17-504] also shows that this bound is tight up to multiplicative constants: there exists an absolute constant such that . Besides, the result holds even if the set of vectors is not an orthogonal family or if it is not a base. Let us now recall an important theorem that will be very useful in regression and covariance estimation. Let denote a set of multivariate polynomials. A sign assignment is an element of . The sign assignment is consistent with if there exists such that .

###### Theorem 1 (Warren, [10.2307/1994937]).

Let denote a set of polynomials of degree at most in real variables with , then the number of sign assignments consistent for is at most .

###### Corollary 2.

Assume that the set of functions can be written , then .

The following example will be useful in some applications.

###### Proposition 1.

Let and call the set of rank , symmetric, -dimensional matrices.

Let . Then .

###### Proof.

Any can be written , with . Besides, for any , the function is a polynomial of degree in variables. Hence, the result follows from Corollary 2.

Combining Lemma 1 applied with rank one matrices of the form and Corollary 2 yields the following result.

Let . Then

### 2.2 Median-of-mean

This work uses the median-of-means (MOM) approach which was introduced in [MR702836, MR1688610, MR855970]

and has received a lot of attention recently in the statistical and machine learning communities

[MR3124669, LO, MR3576558, MS, minsker2015geometric]. This approach allows to build estimators that are robust to both outliers and heavy-tail data in various settings [MR1688610, MR855970, MR762855]. It can be defined as follows: we first randomly split the data into blocks of equal-size (if does not divide , we just remove some data). We then compute the empirical mean within each block: for ,

 ¯Xk=1m∑i∈BkXi.

In the one-dimensional case, the final estimator is the median of the latter empirical means. This estimator has subgaussian deviations as shown in [MR3576558]. The extension of this result to higher dimensions is not trivial as there exist several possible generalizations of the one dimensional median [minsker2015geometric].

For any , let and . We start with a basic observation.

###### Remark 1.

When , there is at least blocks on which .

For instance, if , then, there exist at least three quarters of the blocks where . We can now state the main lemma.

###### Lemma 2.

Let be a set of Boolean functions satisfying the following assumptions.

• For all , .

• where is a universal constant.

Then, with probability , for all , there is at least blocks on which .

In words, if each property is true for one non corrupted block with constant probability (here but it could be any fixed constant ) and is large enough, then, with very high probability, all properties are “true for most of the blocks”. This result is an alternative to [lugosi2018nearoptimal, Theorem 2] where the complexity is measured with VC-dimension instead of the Rademacher complexity. We show below that this difference yields to substantial improvements in some examples such as sparse multivariate mean estimation compared with the bounds in [lugosi2018nearoptimal]. The strength of this result is that it is uniform in and gives an exponentially low failure probability, but its proof is quite simple. The proof of this result is given in Section 6.2.

Clearly, the fraction of the block is arbitrary in Lemma 2. In fact, up to some modifications of the constants, the same result holds for any fixed fraction .

### 2.3 Mean estimation

Let denote a norm on and let denote its dual norm. Let denote the unit ball for the norm and the one for the norm . Let denote the set of extremal vectors of . Let where is the Euclidean norm on . Let

###### Theorem 2.

There exists an universival constant such that if , then, with probability larger than ,

 ∥^μK−μ∥≤8 ∣∣∣∣∣∣Σ1/2∣∣∣∣∣∣√KN.

In particular, for any , there exists an estimator such that

 ∥μ−μδ∥≲∣∣∣∣∣∣Σ1/2∣∣∣∣∣∣⎛⎝√VC(B∗0)N+√log(1/δ)N+√ϵ⎞⎠. (3)

The ’outlier’ term can be shown to be optimal in important cases (see the remarks after [MR3909640, Theorem 1.3]). The deviation term () is the same as in the Borel-TIS inequality, being the weak variance term. It is optimal as shown in [lugosi2018nearoptimal]. The difference is the complexity term, which is here . Neither [lugosi2018nearoptimal] nor this work build estimators achieving the subgaussian rate, where this complexity is , being a centered Gaussian vector with the same covariance as . We do not know if this rate can be reached in general.

###### Remark 2.

The inequality gives a general bound on the complexity term.

The complexity term in [lugosi2018nearoptimal], which can also be found in [lerasle2019lecture, Chapter 4, Lemma 47] is where , being i.i.d. Rademacher variables. Here it is . Which of them is the best depends on the situation. For instance, when one wishes to estimate with respect to , the Euclidean norm on , , while ,

being the largest eigenvalue of

, so the former is better. In this example, the bound in VC dimension looses the nice dependence in the covariance structure. On the other hand, suppose that we want to estimate with respect to the sup norm and assume that for simplicity. Then and so

 ∣∣∣∣∣∣Σ1/2∣∣∣∣∣∣√VC(B∗0)/N≃√log(d)/N.

On the other hand, if we only have two moments on the coordinates of , then the best bound on the Rademacher complexity is which is of order in general (to see that, take for a random vector whose coordinates are independent, equal to with probability and otherwise).

###### Remark 3.

The analysis of Section 6.2 and in particular Lemma 4’, shows that the estimator achieves the bound when , where the complexity is the minimum between the VC dimension and the Rademacher complexity . Therefore both our bounds and the bound of [lugosi2018nearoptimal] hold simultaneously and we can always keep the “best complexity term” among VC and Rademacher complexity. As the main novelty here is the introduction of the VC-dimension, we do not remind this fact in each application. The interested reader can have in mind that, in most examples, the same result holds and the estimators have risk bounds smaller than both complexities. Our aim is to show that VC type bounds are particularly efficient in and sparse scenarii, when Rademacher complexity fails to achieve optimal bounds.

## 3 Sparse setting and other estimation tasks

This section shows that the methodology of Theorem 2 also applies to a great variety of estimation tasks. Let us start with the example of sparse mean estimation for the Euclidean norm.

### 3.1 Sparse mean estimation

For any , let denote the set of s-sparse vectors over the dictionary . We fix for this part the vectors and we note . We consider Setting 1 and assume furthermore that belongs to . We note the unit ball for the canonical Euclidean norm in , and we propose the estimator

 ^μK=argmina∈Us maxu∈U2s∩B2 Med⟨¯Xk−a,u⟩.
###### Theorem 3.

There exists an absolute constant such that, if , then, with probability larger than ,

 ∥^μK−μ∥2≤8√λ1(Σ)KN.

Here, is the largest eigenvalue of .

The conclusion of Theorem 3 can be writen as follows. For any , there exists an estimator such that

 ∥μ−μδ∥2≲λ1(Σ)(√slog(d/s)N+√log(1/δ)N+√ϵ).

We see that the complexity () is once again decoupled from the deviation (), which is not the case in works such as [JMLR:v17:14-273] where those two terms are multiplied together. The complexity term is not optimal because it does not depend on the structure of (see Section 4 for details). However, our complexity term is interesting for two main reasons:

• This is the first sparsity dependent bound that holds without higher moments conditions than the ones. By contrast, [lerasle2019lecture] or [lecu2017robust] need to assume the existence subgaussian moments in order to make the sparsity appear, and offer no guarantees without that requirement.

• It comes close to the theoretic optimal when .

### 3.2 Regression

In this section, we consider the standard linear regression setting where data are couples

and we look for the best linear combination of the coordinates of to predict , that is we look for defined as follows: Given (in practice, we will only study or ),

As in the previous section, the observed dataset is a corrupted version of the i.i.d. dataset in a possibly adversarial way. The assumptions made on good data are gathered in the following setting: (see also [lerasle2019lecture] or [audibert2011]).

###### Setting 2.

There exists a (possibly random) set such that, for any , where are independent identically distributed observations in . Let , we make three main assumptions:

• has finite second moment and write its -moments matrix . Let also be the ellipsoid associated with this structure.

• Let and assume that .

• There exists an universal constant such that, for all , .

Condition 2 is implied by Assumptions 3.5 and 3.7 in [audibert2011], the same assumption is made in [lerasle2019lecture].

Condition 3 is called the “small ball hypothesis”, it is described in details in [mendelson2017extending] or in [lecu2016regularization] for instance. It is implied by Condition 3.5 in [audibert2011], it is stated similarly in [lerasle2019lecture].

The two last conditions may seem exotic, we refer to [audibert2011, Section 3] for detailed discussions and examples where these are satisfied. For the moment, we may emphase that they involve only first and second moment conditions on and

Our estimator is defined as follows: Let , where

is the first quartile over

: for any sequence , if we note the corresponding increasingly ordered sequence, . Then,

This new estimator satisfies the following result.

###### Theorem 4.

There exists an absolute constant such that the following holds. Let or and let . Then, with probability ,

For all ,

So, if , then

 l(^β)−l(β∗)≤(r2/γ4+2σr/γ2).

If , which holds for instance when (linear regression with independant noise), we have : it is interesting to note that in this case, the excess loss scales with and not with (and becomes just the variance of the noise).

The conclusion of Theorem 4 can be written as follows: for any , there exists an estimator such that

Once again we notice the nice decoupling between complexity and deviation. This result is interesting for a several reasons. First, this work is the first that gives a bound holding with exponential probability, that holds without assuming more than 2 moments on the design , even in the sparse setting. By comparison, [lecu2017robust] or [lugosi2017regularization] for instance, assume that at least subgaussian moments exist to achieve this kind of rate and offers no guarantees without that requirement and are the best to the date.

### 3.3 Covariance estimation

This section studies the problem of robust covariance estimation. Consider Setting 1, and assume that is known, fixed to without loss of generality. We want to estimate . This problem has a number of applications: the bounds we present can for instance easily be transposed (with the Davis–Kahan theorem) to the problem of robust PCA. It has already been studied in [Wei2017EstimationOT], or [JMLR:v17:14-273], but these estimators do not exhibit any decoupling between complexity and deviation. In [mendelson2018robust], the authors propose a robust estimator for covariance using the MOM method, and get the optimal complexity-deviation decoupling. They also give interesting comments and insights about this estimation problem. However, they need a norm equivalence in order to do so. In addition, they do not study the problem of low rank estimation that we present here.

For any matrix , define its spectral norm by

Let denote the set of dimensional symmetric positive matrices. Assume that

This quantity is sometime refered to as weak variance

of a random matrix

[mendelson2018robust]. Our estimator is defined as follows

It satisfies the following bound.

###### Theorem 5.

There exists an absolute constant such that, if , then, with probability larger than ,

 ∣∣∣∣∣∣^ΣK−Σ∣∣∣∣∣∣≤8 σ√KN.
###### Corollary 3.

Assume that , then, for

 ∣∣∣∣∣∣^ΣK−Σ∣∣∣∣∣∣≤8 R|||Σ|||√KN.

The “bounded kurtosis assumption”

appears similarly in [JMLR:v17:14-273]. In [JMLR:v17:14-273], the estimator achieves a bound of order where is the effective rank of the covariance matrix: once again, the complexity is multiplied by the deviation term in this case, while here they are decoupled : the dimension does not multiply in our bound. In [mendelson2018robust], authors give a better rate (because they only need to be larger than instead of for the bound to hold, but they use the norm equivalence, which is a strong hypothesis we do not need here.

To conclude this section, we propose an estimator for covariance estimation for the Frobenius norm. What is interesting here is that, when we know that the covariance matrix has rank , we can exploit the low-rank assumption in the VC-dimension analysis, while it might be harder to grasp using other forms of complexity. This work is the first, to our knowledge, to exhibit the decoupling between complexity and deviation for this problem, and this estimator is the first of its kind. We note the Frobenius norm for matrices, and the inner product for matrices.

###### Theorem 6.

Let

There exists an universal constant such that, if and , then, with probability larger than ,

 ∥∥^ΣK−Σ∥∥F≤8 σ√KN.

The conclusion of Theorem 6 can be written as follows: for any , there exists an estimator such that

 ||Σδ−Σ||2≲σ(√rdN+√log(1/δ)N+√ϵ).

Compared with [JMLR:v17:14-273], we note that this bound on the risk of our estimator does decouple the complexity and the deviation as announced. Again it holds under only a moment assumption, with no extra subgaussian moment needed to get this bound.

## 4 An algorithm to improve risk bounds

The different applications of Lemma 2 show that, in general, the complexity term derived from this result is not optimal. For example, for mean estimation estimation in Euclidean norm, the complexity term reached by our estimator is proportional to , where the best rate would be .

In this section, we provide an algorithm that leads to better and in some cases optimal complexity rates. The price to pay is that these new estimators require some knowledge on the covariance matrix . We will consider the example of sparse mean estimation for the sake of clarity, but we argue that it also holds for sparse (and non-sparse) regression. We therefore use the setting 1 and assume furthermore that the mean belongs to , where the set of vectors is fixed and known.

Let denote the eigenvalues of in decreasing order, and let

denote a set of normalized corresponding eigenvectors. For any

, let denote the largest index such that . In particular, . By convention, let . Finally, we note , with convention . If we know the matrix

, we can identify the eigenspaces

and thus compute the orthogonal projections of the data on these subspaces: , for and .

In Section 3.1, we described a procedure that takes as input an integer and a (possibly corrupted) dataset having common mean which is -sparse relatively to a set of vectors and common covariance matrix . The procedure returns satisfying, with probability at least

Let denote the output of this procedure. The idea of the algorithm is to project on the subspaces and apply this preliminary procedure on those subspaces. Let the dimension of . The algorithm is formally defined as follows:

The algorithm produces an estimator satisfying the following result.

###### Proposition 3.

Assume setup 1. There exists an absolute constant such that, if , the output of Algorithm 1, satisfies, with probability ,

 ∥^μK−μ∥2≤8log(d)√λ1KN.

The complexity term in Proposition 3 is better than in Theorem 3, because . More importantly, this complexity term depends on the the covariance structure of the data through the . In the case of sparse mean estimation, we can deduce a precise estimate of this complexity term: for any , there exists an estimator such that:

 ∥^μδ−μ∥2≤Clog(d)⎛⎝√log(d/s)∑si=1λiN+√λ1log(1/δ)N+√λ1ϵ⎞⎠.

This estimate comes from the bounds , where is such that . We then write .

Thies result is proved in Section 6.5. We argue that the very same proof can be replicated for the regression problem, if has the same eigenspaces as . This happens for instance when is independent of . We end up with the bound of Theorem 4, holding whenever instead of

## 5 Conclusion : concurrent work and discussion

This work is not the first to deal with robust estimation: a lot of results and algorithms have been developed over the past few years for sparse estimation in presence of outliers (see for instance [li2017robust]) but most of these works assume that non-corrupted data are Gaussian. For instance [MR3845006] already deal with mean and covariance estimation using extensions of the Tukey-depth (and using VC-dimension), but their methods rely on informative data being Gaussian.

Robustness to heavy-tailed data has also been studied in various works, see [lugosi2019mean] for a survey of recent developments. We already mentioned two articles that this work tries to complete and improve: [lugosi2018nearoptimal] for mean estimation under any norm and [lecu2017robust] for sparse regression. Though the techniques involved are close, this work illustrates that using VC-dimension can drastically improve risk bounds in various applications, in particular in the sparse setting.

There are still many exciting open questions. The quantity that is crucial in all the studies is

 E(supfK∑k=0f(Yk)−KE(f(% Yk)))

where are boolean functions. In mean estimation for instance, is the important quantity. Bounding this quantity using the VC-dimension of yields a bound independant of the covariance of . On the other hand, bounding that quantity by the Rademacher complexity of the (like in [lecu2017robust], [lugosi2018nearoptimal] or here in Part 6.2) stating that

does not exploit the boundedness of the indicator function and necessitates unnecessary stronger assumptions on data. The ideal would be to conciliate both ideas, and to find a nice in-between that would take into account both the boudedness and the dependency in the covariance structure.

The last point we make is about computational issues. The estimators presented can not be implemented as is. Nevertheless, encouraging recent works have shown that ”relaxed”, computable estimators can be derived from this kind of work. For instance the pioneer work of [hopkins2018sub], followed by [depersin2019robust] and [Bartlett19] for instance, derived tractable estimators, in polynomial times, from the work of [lugosi2019sub]. Even more recently, some new tractable estimators for regression and covariance estimation with heavy-tailed data have emerged in [cherapanamjeri2019algorithms]. We can hope for this work to be made tractable as well, which seems to be quite a challenge.

## 6 Proofs

### 6.1 A fact about VC-dimension

For any Euclidean space , and any

###### Lemma 3.

where is universal constant.

###### Proof.

Assume that a set is shattered by . Then, for any , there is a vector so that if and only if . There is a vector so that if and only if . Then we have if and only if , so shatters , and .

Now we see that because if is shattered by then is shattered by . Theorem 1.1 in [vandervaart] states that for some constant , and that concludes the proof.

### 6.2 General methodology

We begin by proving the main lemma 2 of Part 2.3

###### Proof.

We want to prove that, with probability ,

 supfK∑k=0f(Xk)≤K/4.

If , and it is sufficient to show that by Remark 1. Now we write

By the bounded difference inequality [MR3185193, Theorem 6.2], with probability , .

For the magnitude term, we write

 M≤E(supfK∑k=0f(Yk)−KE(f(Yk)))+supfKE(f(Yk)).

By hypothesis, . Then, we just have to use a classical result of Vapnik-Chervonenkis theory, either in the version of [vershynin_2018, Theorem 8.3.23], or of [Handel_2016, Corollary 7.18]. There exists a universal constant such that

 E(supfK∑k=0f(Yk)−KE(f(% Yk)))≤C′K√VC(F)K.

Hence, if ,

 E(supfK∑k=0f(Yk)−KE(f(% Yk)))≤K16.

Putting everything together, we have the following. If , with probability , . Therefore, by Remark 1, for all

 K∑k=0f(Xk)≤K/4.

We state a technical lemma that appears in most proofs. Let be any measurable function so that exists. We take

where are any sets of . We have :

###### Lemma 4.

If and if , then, with probability ,

where is a universal constant

is the ”weak variance” of the problem.

###### Proof.

Let with the universal constant from Lemma 2, let and let

.

Let . The function are compositions of the function and of the functions The VC-dimension of the set of these compositions is smaller than the VC-dimension of the set of indicator functions indexed by , as recalled in the basic fact 2 at the beginning of Section 2.1. We just use fact 3 to remove the and we get for some constant .

By Markov’s inequality, for any ,

By Lemma 2, applied with , the following event has probability