# Outlier-robust estimation of a sparse linear model using ℓ_1-penalized Huber's M-estimator

We study the problem of estimating a $p$-dimensional $s$-sparse vector in a linear model with Gaussian design and additive noise. In the case where the labels are contaminated by at most $o$ adversarial outliers, we prove that the $\ell_1$-penalized Huber's $M$-estimator based on $n$ samples attains the optimal rate of convergence $(s/n)^{1/2} + (o/n)$, up to a logarithmic factor. This is proved when the proportion of contaminated samples goes to zero at least as fast as $1/\log(n)$, but we argue that a constant fraction of outliers can be handled by slightly more involved techniques.


### 1 Introduction

Is it possible to attain optimal rates of estimation in outlier-robust sparse regression using penalized empirical risk minimization (PERM) with convex loss and convex penalties? The current literature on robust estimation does not answer this question. Furthermore, it contains some signals suggesting that the answer might be negative. First, it has been shown in (Chen et al., 2013, Theorem 1) that in the case of adversarially corrupted samples, no method based on penalized empirical loss minimization, with convex loss and convex penalty, can lead to consistent support recovery. The authors then advocate for robustifying the $\ell_1$-penalized least-squares estimators by replacing usual scalar products by their trimmed counterparts. Second, (Chen et al., 2018) established that in the multivariate Gaussian model subject to Huber's contamination, the coordinatewise median, which is the ERM for the $\ell_1$-loss, is sub-optimal. A similar result was proved in (Lai et al., 2016, Prop. 2.1) for the geometric median, the ERM corresponding to the $\ell_2$-loss. These negative results prompted researchers to use other techniques, often of higher computational complexity, to solve the problem of outlier-corrupted sparse linear regression.

In the present work, we prove that the $\ell_1$-penalized empirical risk minimizer based on Huber's loss is minimax-rate-optimal, up to possible logarithmic factors. Naturally, this result is not valid in the most general situation, but we demonstrate its validity under the assumptions that the design matrix satisfies some incoherence condition and only the response is subject to contamination. The incoherence condition is shown to be satisfied by the Gaussian design with a covariance matrix whose diagonal entries are bounded and bounded away from zero. This relatively simple setting is chosen in order to convey the main message of this work: for properly chosen convex loss and convex penalty functions, the PERM is minimax-rate-optimal in sparse linear regression with adversarially corrupted labels.

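To describe more precisely the aforementioned optimality result, let $(X_1, y_1^\circ), \ldots, (X_n, y_n^\circ)$ be iid feature-label pairs such that the $X_i \in \mathbb{R}^p$ are Gaussian with zero mean and covariance matrix $\Sigma$ and the $y_i^\circ$ are defined by the linear model

$$y_i^\circ = X_i^\top \beta^* + \xi_i, \qquad i = 1, \ldots, n, \tag{1}$$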

where the random noise $\xi_i$, independent of $X_i$, is Gaussian with zero mean and variance $\sigma^2$. Instead of observing the "clean" data $\{(X_i, y_i^\circ)\}_{i=1}^n$, we have access to a contaminated version of it, $\{(X_i, y_i)\}_{i=1}^n$, in which a small number $o$ of labels $y_i^\circ$ are replaced by an arbitrary value. Setting $\theta_i^* = (y_i - y_i^\circ)/\sqrt{n}$, and using the matrix-vector notation, the described model can be written as

$$Y = X\beta^* + \sqrt{n}\,\theta^* + \xi, \tag{2}$$

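where $X = [X_1, \ldots, X_n]^\top$ is the $n \times p$ design matrix, $Y = (y_1, \ldots, y_n)^\top$ is the response vector, $\theta^* = (\theta_1^*, \ldots, \theta_n^*)^\top$ is the contamination and $\xi = (\xi_1, \ldots, \xi_n)^\top$ is the noise vector. The goal is to estimate the vector $\beta^*$. The dimension $p$ is assumed to be large, possibly larger than $n$ but, for some small value $s$, the vector $\beta^*$ is assumed to be $s$-sparse: $\|\beta^*\|_0 \le s$. In such a setting, it is well-known that if we have access to the clean data and measure the quality of an estimator $\hat\beta$ by the Mahalanobis norm $\|\Sigma^{1/2}(\hat\beta - \beta^*)\|_2$ (in the sequel, we use notation $\|\beta\|_q = (\sum_j |\beta_j|^q)^{1/q}$ for any vector $\beta$ and any $q \ge 1$), the optimal rate is

$$r^\circ(n, p, s) = \sigma \Big( \frac{s \log(p/s)}{n} \Big)^{1/2}. \tag{3}$$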

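In the outlier-contaminated setting, i.e., when the clean data $\{(X_i, y_i^\circ)\}$ is unavailable but one has access to the contaminated data $\{(X_i, y_i)\}$, the minimax-optimal rate (Chen et al., 2016) takes the form

$$r(n, p, s, o) = \sigma \Big( \frac{s \log(p/s)}{n} \Big)^{1/2} + \frac{\sigma o}{n}. \tag{4}$$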

The first estimators proved to attain this rate (Chen et al., 2016; Gao, 2017) were computationally intractable for large $p$, $s$ and $o$, in the sense that there is no algorithm computing these estimators in time polynomial in these quantities. This motivated several authors to search for polynomial-time algorithms attaining a nearly optimal rate; the most relevant results will be reviewed later in this work.

The assumption that only a small number $o$ of labels are contaminated by outliers implies that the vector $\theta^*$ in (2) is $o$-sparse. In order to take advantage of sparsity of both $\beta^*$ and $\theta^*$ while ensuring computational tractability of the resulting estimator, a natural approach studied in several papers (Laska et al., 2009; Nguyen and Tran, 2013; Dalalyan and Chen, 2012) is to use some version of the $\ell_1$-penalized ERM. This corresponds to defining

$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \min_{\theta \in \mathbb{R}^n} \Big\{ \frac{1}{2n} \| Y - X\beta - \sqrt{n}\,\theta \|_2^2 + \lambda_s \|\beta\|_1 + \lambda_o \|\theta\|_1 \Big\}, \tag{5}$$

where $\lambda_s, \lambda_o > 0$ are tuning parameters. This estimator is very attractive from a computational perspective, since it can be seen as the Lasso for the augmented design matrix $[X, \sqrt{n}\, I_n]$, where $I_n$ is the $n \times n$ identity matrix. To date, the best known rate for this type of estimator is

$$\sigma \Big( \frac{s \log p}{n} \Big)^{1/2} + \sigma \Big( \frac{o}{n} \Big)^{1/2}, \tag{6}$$

obtained in (Nguyen and Tran, 2013) under some restrictions on $(n, p, s, o)$. A quick comparison of (4) and (6) shows that the latter is sub-optimal. Indeed, when the outlier term dominates, the ratio of the two rates may be as large as $(n/o)^{1/2}$. The main goal of the present paper is to show that this sub-optimality is not an intrinsic property of the estimator (5), but rather an artefact of previous proof techniques. By using a refined argument, we prove that $\hat\beta$ defined by (5) does attain the optimal rate under very mild assumptions.
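To get a feeling for the gap between the rates (4) and (6), the short sketch below evaluates both for a few numbers of outliers. The function names and the problem sizes are hypothetical and chosen only for illustration; the formulas are transcribed from (4) and (6) with all constants set to one.

```python
import math

def rate_optimal(n, p, s, o, sigma=1.0):
    """Minimax rate (4): sigma * sqrt(s*log(p/s)/n) + sigma * o/n."""
    return sigma * math.sqrt(s * math.log(p / s) / n) + sigma * o / n

def rate_previous(n, p, s, o, sigma=1.0):
    """Previously known rate (6): sigma * sqrt(s*log(p)/n) + sigma * sqrt(o/n)."""
    return sigma * math.sqrt(s * math.log(p) / n) + sigma * math.sqrt(o / n)

# Hypothetical problem sizes, used only to illustrate the comparison.
n, p, s = 10_000, 1_000, 5
for o in (10, 100, 1000):
    r4 = rate_optimal(n, p, s, o)
    r6 = rate_previous(n, p, s, o)
    print(f"o={o:5d}  rate(4)={r4:.4f}  rate(6)={r6:.4f}  ratio={r6 / r4:.2f}")
```

The outlier term of (6) is the square root of the outlier term of (4), which is why the gap is most visible when $o/n$ is small but the outlier term still dominates the sparsity term.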

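In the sequel, we refer to $\hat\beta$ as $\ell_1$-penalized Huber's $M$-estimator. The rationale for this term is that the minimization with respect to $\theta$ in (5) can be done explicitly. It yields (Donoho and Montanari, 2016, Section 6)

$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \Big\{ \lambda_o^2 \sum_{i=1}^n \Phi\Big( \frac{y_i - X_i^\top \beta}{\lambda_o \sqrt{n}} \Big) + \lambda_s \|\beta\|_1 \Big\}, \tag{7}$$

where $\Phi$ is Huber's function defined by $\Phi(u) = u^2/2$ if $|u| \le 1$ and $\Phi(u) = |u| - 1/2$ otherwise.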
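The passage from (5) to (7) can be checked numerically for a single residual $r = y_i - X_i^\top\beta$. The sketch below minimizes the $\theta_i$-part of (5) in closed form by soft-thresholding, compares it with $\lambda_o^2\,\Phi(r/(\lambda_o\sqrt{n}))$, and cross-checks against a brute-force grid search. It assumes the standard piecewise form of Huber's function (quadratic on $[-1,1]$, linear outside); all function names and numerical values are hypothetical.

```python
import math

def huber(u):
    """Huber's function: quadratic near zero, linear in the tails."""
    return 0.5 * u * u if abs(u) <= 1.0 else abs(u) - 0.5

def profiled_objective(r, n, lam_o):
    """min over theta of (r - sqrt(n)*theta)^2 / (2n) + lam_o * |theta|.

    The minimiser is obtained by soft-thresholding r at level lam_o * sqrt(n).
    """
    tau = lam_o * math.sqrt(n)
    t = math.copysign(max(abs(r) - tau, 0.0), r)  # t = sqrt(n) * theta_hat
    return (r - t) ** 2 / (2 * n) + lam_o * abs(t) / math.sqrt(n)

def grid_minimum(r, n, lam_o, half_width=2.0, steps=4001):
    """Brute-force minimisation of the same objective over a grid of theta."""
    best = float("inf")
    for k in range(steps):
        theta = -half_width + 2 * half_width * k / (steps - 1)
        val = (r - math.sqrt(n) * theta) ** 2 / (2 * n) + lam_o * abs(theta)
        best = min(best, val)
    return best

n, lam_o = 25, 0.3  # hypothetical values
for r in (-7.0, -1.0, 0.0, 0.5, 1.5, 4.0):
    closed = profiled_objective(r, n, lam_o)
    via_huber = lam_o ** 2 * huber(r / (lam_o * math.sqrt(n)))
    brute = grid_minimum(r, n, lam_o)
    print(f"r={r:5.1f}  closed-form={closed:.6f}  huber={via_huber:.6f}  grid={brute:.6f}")
```

Residuals smaller than $\lambda_o\sqrt{n}$ in absolute value are treated quadratically, as in least squares, while larger residuals are absorbed by a non-zero $\theta_i$ and contribute only linearly.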

To prove the rate-optimality of the estimator $\hat\beta$, we first establish a risk bound for a general design matrix $X$ not necessarily formed by Gaussian vectors. This is done in the next section. Then, in Section 3, we state and discuss the result showing that all the necessary conditions are satisfied for the Gaussian design. Relevant prior work is presented in Section 4, while Section 5 discusses potential extensions. Section 6 contains a numerical illustration and Section 7 provides a summary of our results and an outlook on future work. The proofs are deferred to the supplementary material.

### 2 Risk bound for the ℓ1-penalized Huber’s M-estimator

This section is devoted to bringing forward sufficient conditions on the design matrix that allow for rate-optimal risk bounds for the estimator $\hat\beta$ defined by (5) or, equivalently, by (7). There are two qualitative conditions that can be easily seen to be necessary: we call them restricted invertibility and incoherence. Indeed, even when there is no contamination, i.e., the number of outliers is known to be $o = 0$, the matrix $\frac{1}{\sqrt{n}} X$ has to satisfy a restricted invertibility condition (such as restricted isometry, restricted eigenvalue or compatibility) in order that the Lasso estimator (5) achieves the optimal rate $r^\circ(n, p, s)$. On the other hand, in the case where $p = n$ and $X = \sqrt{n}\, I_n$, even in the extremely favorable situation where the noise $\xi$ is zero, the only identifiable vector is $\beta^* + \theta^*$. Therefore, it is impossible to consistently estimate $\beta^*$ when the design matrix $X$ is aligned with the identity matrix or close to being so.

The next definition formalizes what we call restricted invertibility and incoherence by introducing three notions: the transfer principle, the incoherence property and the augmented transfer principle. We will show that these notions play a key role in robust estimation by $\ell_1$-penalized least squares.

###### Definition 1.

Let $Z \in \mathbb{R}^{n \times p}$ be a (random) matrix and $\Sigma \in \mathbb{R}^{p \times p}$. We use notation $Z^{(n)} = Z/\sqrt{n}$.

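• We say that $Z$ satisfies the transfer principle with positive numbers $a_1$ and $a_2$, denoted by $\mathrm{TP}_\Sigma(a_1, a_2)$, if for all $v \in \mathbb{R}^p$,

$$\| Z^{(n)} v \|_2 \ge a_1 \| \Sigma^{1/2} v \|_2 - a_2 \| v \|_1. \tag{8}$$

• We say that $Z$ satisfies the incoherence property $\mathrm{IP}_\Sigma(b_1, b_2, b_3)$ for some positive numbers $b_1$, $b_2$ and $b_3$, if for all $[v; u] \in \mathbb{R}^{p+n}$,

$$| u^\top Z^{(n)} v | \le b_1 \| \Sigma^{1/2} v \|_2 \| u \|_2 + b_2 \| v \|_1 \| u \|_2 + b_3 \| \Sigma^{1/2} v \|_2 \| u \|_1. \tag{9}$$

• We say that $Z$ satisfies the augmented transfer principle $\mathrm{ATP}_\Sigma(c_1, c_2, c_3)$ for some positive numbers $c_1$, $c_2$ and $c_3$, if for all $[v; u] \in \mathbb{R}^{p+n}$,

$$\| Z^{(n)} v + u \|_2 \ge c_1 \big\| [\Sigma^{1/2} v; u] \big\|_2 - c_2 \| v \|_1 - c_3 \| u \|_1. \tag{10}$$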

These three properties are inter-related and related to extreme singular values of the matrix $Z^{(n)}$.

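(P1) If $Z$ satisfies $\mathrm{ATP}_\Sigma(c_1, c_2, c_3)$ then it also satisfies $\mathrm{TP}_\Sigma(c_1, c_2)$.

(P2) If $Z$ satisfies $\mathrm{TP}_\Sigma(a_1, a_2)$ and $\mathrm{IP}_\Sigma(b_1, b_2, b_3)$ then it also satisfies $\mathrm{ATP}_\Sigma(c_1, c_2, c_3)$ with constants $c_1$, $c_2$ and $c_3$ that are explicit functions of $a_1, a_2, b_1, b_2, b_3$ and of a free positive parameter (see Lemma 7 in the supplement).

(P3) If $Z$ satisfies $\mathrm{IP}_\Sigma(b_1, b_2, b_3)$, then it also satisfies $\mathrm{IP}_\Sigma(0, b_2, b_1 + b_3)$.

(P4) Any matrix $Z$ satisfies $\mathrm{TP}_I(s_p(Z^{(n)}), 0)$ and $\mathrm{IP}_I(s_1(Z^{(n)}), 0, 0)$, where $s_p(Z^{(n)})$ and $s_1(Z^{(n)})$ are, respectively, the $p$-th largest and the largest singular values of $Z^{(n)}$.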

Claim (P1) is true, since if we choose $u = 0$ in (10) we obtain (8). Claim (P2) coincides with Lemma 7, proved in the supplement. (P3) is a direct consequence of the inequality $\|u\|_2 \le \|u\|_1$, valid for any vector $u$. (P4) is a well-known characterization of the smallest and the largest singular values of a matrix. We will show later on that a Gaussian matrix satisfies with high probability all these conditions with constants $a_1$ and $c_1$ independent of $n$ and $p$, and with the remaining constants of order $n^{-1/2}$, up to logarithmic factors.
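As an illustration of how these properties follow from the definitions, here is the one-line argument behind (P3), written out under the reading of the incoherence property given in (9), with constants $(b_1, b_2, b_3)$ multiplying $\|\Sigma^{1/2}v\|_2\|u\|_2$, $\|v\|_1\|u\|_2$ and $\|\Sigma^{1/2}v\|_2\|u\|_1$, respectively. Bounding $\|u\|_2$ by $\|u\|_1$ in the first term only gives

$$
\begin{aligned}
|u^\top Z^{(n)} v|
&\le b_1 \|\Sigma^{1/2} v\|_2 \|u\|_2 + b_2 \|v\|_1 \|u\|_2 + b_3 \|\Sigma^{1/2} v\|_2 \|u\|_1 \\
&\le b_2 \|v\|_1 \|u\|_2 + (b_1 + b_3) \|\Sigma^{1/2} v\|_2 \|u\|_1 ,
\end{aligned}
$$

which is the incoherence property with constants $(0, b_2, b_1 + b_3)$. The price of removing the first term is therefore a larger third constant.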

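To state the main theorem of this section, we consider the simplified setting in which $\lambda_s = \lambda_o = \lambda$. Recall that in practice it is always recommended to normalize the columns of the matrix $X$ so that their Euclidean norm is of the order $\sqrt{n}$. The more precise version of the next result with better constants is provided in the supplement (see Proposition 1). We recall that a matrix $\Sigma$ is said to satisfy the restricted eigenvalue condition $\mathrm{RE}(s, c_0)$ with some constant $\varkappa > 0$, if $\|\Sigma^{1/2} v\|_2 \ge \varkappa \|v_J\|_2$ for any vector $v \in \mathbb{R}^p$ and any set $J \subset \{1, \ldots, p\}$ such that $\mathrm{Card}(J) \le s$ and $\|v_{J^c}\|_1 \le c_0 \|v_J\|_1$.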

###### Theorem 1.

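Let $\Sigma$ satisfy the restricted eigenvalue condition with constant $\varkappa > 0$. Let $b_1$, $b_2$, $b_3$, $c_1$, $c_2$, $c_3$ be some positive real numbers such that $X$ satisfies the $\mathrm{IP}_\Sigma(0, b_2, b_3)$ and the $\mathrm{ATP}_\Sigma(c_1, c_2, c_3)$. Assume that for some $\delta \in (0, 1)$, the tuning parameter $\lambda$ satisfies

$$\lambda \sqrt{n} \ge \sqrt{8 \log(n/\delta)} \bigvee \Big( \max_{j = 1, \ldots, p} \| X^{(n)}_{\bullet, j} \|_2 \Big) \sqrt{8 \log(p/\delta)}. \tag{11}$$

If the sparsity $s$ and the number of outliers $o$ satisfy the condition

$$\frac{s}{\varkappa^2} + o \le \frac{c_1^2}{400\, (c_2 \vee c_3 \vee 5 b_2 / c_1)^2}, \tag{12}$$

then, with high probability (at a level controlled by $\delta$), we have

$$\big\| \Sigma^{1/2} (\hat\beta - \beta^*) \big\|_2 \le \frac{24 \lambda}{c_1^2} \Big( \frac{2 c_2}{c_1} \bigvee \frac{b_3}{c_1^2} \Big) \Big( \frac{s}{\varkappa^2} + 7 o \Big) + \frac{5 \lambda \sqrt{s}}{6 c_1^2 \varkappa}. \tag{13}$$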

Theorem 1 is somewhat hard to parse. At this stage, let us simply mention that in the case of a Gaussian design considered in the next section, $c_1$ is of order $1$ while $b_2, b_3, c_2, c_3$ are of order $n^{-1/2}$, up to a factor logarithmic in $p$, $n$ and $1/\delta$. Here $\delta$ is an upper bound on the probability that the Gaussian matrix $X$ does not satisfy either the IP or the ATP. Since Theorem 1 allows us to choose $\lambda$ of the order $\{\log((n \vee p)/\delta)/n\}^{1/2}$, we infer from (13) that the error of estimating $\beta^*$, measured in Euclidean norm, is of order $(s/n)^{1/2} + o/n$ up to logarithmic factors, under the assumption that $(s + o)/n$, multiplied by a logarithmic factor, is smaller than a universal constant.

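To complete this section, we present a sketch of the proof of Theorem 1. In order to convey the main ideas without diving too much into technical details, we assume $\Sigma = I_p$. This means that the RE condition is satisfied with $\varkappa = 1$ for any $s$ and $c_0$. From the fact that the ATP holds for $X$, we infer that the augmented matrix $[X^{(n)}, I_n]$ satisfies a restricted eigenvalue condition with a constant of order $c_1$. Using the well-known risk bounds for the Lasso estimator (Bickel et al., 2009), we get

$$\| \hat\beta - \beta^* \|_2^2 + \| \hat\theta - \theta^* \|_2^2 \le C \lambda^2 (s + o) \quad \text{and} \quad \| \hat\beta - \beta^* \|_1 + \| \hat\theta - \theta^* \|_1 \le C \lambda (s + o). \tag{14}$$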

Note that these are the risk bounds established in (Candès and Randall, 2008; Dalalyan and Chen, 2012; Nguyen and Tran, 2013); the first two references deal with the small dimensional case only, that is where $p$ is small relative to $n$. These bounds are most likely unimprovable as long as the estimation of $\theta^*$ is of interest. However, if we focus only on the estimation error of $\beta^*$, considering $\theta^*$ as a nuisance parameter, the following argument leads to a sharper risk bound. First, we note that

$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \| Y - X\beta - \sqrt{n}\, \hat\theta \|_2^2 + \lambda \|\beta\|_1 \Big\}. \tag{15}$$

The KKT conditions of this convex optimization problem take the following form

$$\frac{1}{n} X^\top (Y - X\hat\beta - \sqrt{n}\, \hat\theta) \in \lambda \cdot \mathrm{sgn}(\hat\beta), \tag{16}$$

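where $\mathrm{sgn}(\hat\beta)$ is the subset of $\mathbb{R}^p$ containing all the vectors $w$ such that $w_j \hat\beta_j = |\hat\beta_j|$ and $|w_j| \le 1$ for every $j \in \{1, \ldots, p\}$. Multiplying the last displayed equation from left by $(\beta^* - \hat\beta)^\top$, we get

$$\frac{1}{n} (\beta^* - \hat\beta)^\top X^\top (Y - X\hat\beta - \sqrt{n}\, \hat\theta) \le \lambda \big( \|\beta^*\|_1 - \|\hat\beta\|_1 \big). \tag{17}$$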

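Recall now that $Y = X\beta^* + \sqrt{n}\, \theta^* + \xi$ and set $v = \beta^* - \hat\beta$ and $u = \theta^* - \hat\theta$. We arrive at

$$\frac{1}{n} \| X v \|_2^2 = \frac{1}{n} v^\top X^\top X v \le - v^\top (X^{(n)})^\top u - \frac{1}{n} v^\top X^\top \xi + \lambda \big( \|\beta^*\|_1 - \|\hat\beta\|_1 \big). \tag{18}$$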

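On the one hand, the duality inequality and the lower bound on $\lambda$ imply that $\frac{1}{n} |v^\top X^\top \xi| \le \frac{1}{n} \|v\|_1 \|X^\top \xi\|_\infty \le (\lambda/2) \|v\|_1$. On the other hand, well-known arguments yield $\|\beta^*\|_1 - \|\hat\beta\|_1 \le 2 \|v_S\|_1 - \|v\|_1$, where $S$ is the support of $\beta^*$. Therefore, we have

$$\frac{1}{n} \| X v \|_2^2 \le | v^\top (X^{(n)})^\top u | + \frac{\lambda}{2} \big( 4 \|v_S\|_1 - \|v\|_1 \big). \tag{19}$$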

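Since $X$ satisfies the $\mathrm{ATP}_I(c_1, c_2, c_3)$ that implies the $\mathrm{TP}_I(c_1, c_2)$, we get $c_1^2 \|v\|_2^2 \le \frac{2}{n} \|Xv\|_2^2 + 2 c_2^2 \|v\|_1^2$. Combining with (19) and using the $\mathrm{IP}_I(0, b_2, b_3)$ in the second line, this yields

$$
\begin{aligned}
c_1^2 \|v\|_2^2 &\le 2 | v^\top (X^{(n)})^\top u | + \lambda \big( 4 \|v_S\|_1 - \|v\|_1 \big) + 2 c_2^2 \|v\|_1^2 && \text{(20)} \\
&\le 2 b_3 \|v\|_2 \|u\|_1 + 2 b_2 \|v\|_1 \|u\|_2 + \lambda \big( 4 \|v_S\|_1 - \|v\|_1 \big) + 2 c_2^2 \|v\|_1^2 && \text{(21)} \\
&\le \frac{c_1^2}{2} \|v\|_2^2 + \frac{2 b_3^2}{c_1^2} \|u\|_1^2 + \|v\|_1 \big( 2 b_2 \|u\|_2 - \lambda \big) + 4 \lambda \|v_S\|_1 + 2 c_2^2 \|v\|_1^2. && \text{(22)}
\end{aligned}
$$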

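Using the first inequality in (14) and condition (12), we upper bound $2 b_2 \|u\|_2 - \lambda$ by 0. To upper bound the second last term, we use the Cauchy-Schwarz inequality: $4 \lambda \|v_S\|_1 \le 4 \lambda \sqrt{s}\, \|v\|_2 \le (4/c_1)^2 \lambda^2 s + (c_1^2/4) \|v\|_2^2$. Combining all these bounds and rearranging the terms, we arrive at

$$\frac{c_1^2}{4} \|v\|_2^2 \le 2 \big\{ (b_3/c_1) \vee c_2 \big\}^2 \big( \|u\|_1 + \|v\|_1 \big)^2 + (4/c_1)^2 \lambda^2 s. \tag{23}$$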

Taking the square root of both sides and using the second inequality in (14), we obtain an inequality of the same type as (13) but with slightly larger constants. As a concluding remark for this sketch of proof, let us note that if instead of using the last arguments, we replace all the error terms appearing in (21) by their upper bounds provided by (14), we do not get the optimal rate.

### 3 The case of Gaussian design

Our main result, Theorem 1, shows that if the design matrix satisfies the transfer principle and the incoherence property with suitable constants, then the $\ell_1$-penalized Huber's $M$-estimator achieves the optimal rate under adversarial contamination. As a concrete example of a design matrix for which the aforementioned conditions are satisfied, we consider the case of correlated Gaussian design. As opposed to most of prior work on robust estimation for linear regression with Gaussian design, we allow the covariance matrix to have a non degenerate null space. We will simply assume that the rows of the matrix $X$ are independently drawn from the Gaussian distribution $\mathcal{N}_p(0, \Sigma)$ with a covariance matrix $\Sigma$ satisfying the RE condition. We will also assume in this section that all the diagonal entries of $\Sigma$ are equal to 1: $\Sigma_{jj} = 1$. The more formal statements of the results, provided in the supplementary material, do not require this condition.

###### Theorem 2.

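Let $\delta$ be a tolerance level and let $n$ be sufficiently large. For every positive semi-definite matrix $\Sigma$ with all the diagonal entries bounded by one, with high probability (at a level controlled by $\delta$), the matrix $X$ satisfies the $\mathrm{TP}_\Sigma(a_1, a_2)$, the $\mathrm{IP}_\Sigma(b_1, b_2, b_3)$ and the $\mathrm{ATP}_\Sigma(c_1, c_2, c_3)$ with constants

$$
\begin{aligned}
a_1 &= 1 - \frac{4.3 + \sqrt{2 \log(9/\delta)}}{\sqrt{n}}, & a_2 = b_2 &= 1.2 \sqrt{\frac{2 \log p}{n}}, && \text{(24)} \\
b_1 &= \frac{4.8 \sqrt{2} + \sqrt{2 \log(81/\delta)}}{\sqrt{n}}, & b_3 &= 1.2 \sqrt{\frac{2 \log n}{n}}, && \text{(25)} \\
c_1 &= \frac{3}{4} - \frac{17.5 + 9.6 \sqrt{2 \log(2/\delta)}}{\sqrt{n}}, & c_2 &= 3.6 \sqrt{\frac{2 \log p}{n}}, \quad c_3 = 2.4 \sqrt{\frac{2 \log n}{n}}. && \text{(26)}
\end{aligned}
$$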

The proof of this result is provided in the supplementary material. It relies on by now standard tools such as Gordon's comparison inequality, the Gaussian concentration inequality and the peeling argument. Note that the TP and related results have been obtained in Raskutti et al. (2010); Oliveira (2016); Rudelson and Zhou (2013). The IP is basically a combination of a high probability version of Chevet's inequality (Vershynin, 2018, Exercises 8.7.3-4) and the peeling argument. A property similar to the ATP for Gaussian matrices with non degenerate covariance was established in (Nguyen and Tran, 2013, Lemma 1) under further restrictions on the problem parameters.
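To see the orders of magnitude involved, the constants (24)-(26) can be evaluated for concrete values of $n$, $p$ and $\delta$. The sketch below does so; the function name and the chosen values are hypothetical, and the formulas are transcribed from the display above under the assumption that each fraction is read as "numerator over $\sqrt{n}$".

```python
import math

def theorem2_constants(n, p, delta):
    """Evaluate the constants (24)-(26) for sample size n, dimension p, level delta."""
    rn = math.sqrt(n)
    a1 = 1 - (4.3 + math.sqrt(2 * math.log(9 / delta))) / rn
    a2 = b2 = 1.2 * math.sqrt(2 * math.log(p) / n)
    b1 = (4.8 * math.sqrt(2) + math.sqrt(2 * math.log(81 / delta))) / rn
    b3 = 1.2 * math.sqrt(2 * math.log(n) / n)
    c1 = 0.75 - (17.5 + 9.6 * math.sqrt(2 * math.log(2 / delta))) / rn
    c2 = 3.6 * math.sqrt(2 * math.log(p) / n)
    c3 = 2.4 * math.sqrt(2 * math.log(n) / n)
    return dict(a1=a1, a2=a2, b1=b1, b2=b2, b3=b3, c1=c1, c2=c2, c3=c3)

# Hypothetical settings: the constants a1 and c1 approach 1 and 3/4 as n grows,
# while all the others decay like n^{-1/2} up to logarithmic factors.
for n in (10**4, 10**6):
    consts = theorem2_constants(n, p=10**4, delta=0.05)
    print(n, {k: round(v, 4) for k, v in consts.items()})
```

With these formulas, $c_1$ is positive only once $\sqrt{n}$ exceeds a few dozen, which gives a concrete sense of how large the sample must be for the bounds to be informative.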

###### Theorem 3.

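There exist universal positive constants $d_1$, $d_2$, $d_3$ such that if

$$\frac{s \log p}{\varkappa^2} + o \log n \le d_1 n \quad \text{and} \quad 1/7 \ge \delta \ge 2 e^{-d_2 n}$$

then, with high probability (at a level controlled by $\delta$), the $\ell_1$-penalized Huber's $M$-estimator with appropriately chosen tuning parameters $\lambda_s$ and $\lambda_o$ satisfies

$$\big\| \Sigma^{1/2} (\hat\beta - \beta^*) \big\|_2 \le d_3\, \sigma \Big\{ \Big( \frac{s \log(p/\delta)}{n \varkappa^2} \Big)^{1/2} + \frac{o \log(n/\delta)}{n} \Big\}. \tag{27}$$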

Even though the constants appearing in Theorem 2 are reasonably small and smaller than in the analogous results in prior work, the constants $d_1$, $d_2$ and $d_3$ are large, too large to be of any practical relevance. Finally, let us note that if $s$ and $o$ are known, it is very likely that following the techniques developed in (Bellec et al., 2018, Theorem 4.2), one can replace the terms $\log(p/\delta)$ and $\log(n/\delta)$ in (27) by $\log(p/(s\delta))$ and $\log(n/(o\delta))$, respectively.

Comparing Theorem 3 with (Nguyen and Tran, 2013, Theorem 1), we see that our rate improvement is not only in terms of its dependence on the proportion of outliers, $o/n$, but also in terms of the condition number $\varkappa$, which is now completely decoupled from $o$ in the risk bound.

While our main focus is on the high dimensional situation in which $p$ can be larger than $n$, our result also applies to the case of small dimensional dense vectors, i.e., when $p$ is significantly smaller than $n$. One of the applications of such a setting is the problem of stylized communication considered, for instance, in (Candès and Randall, 2008). The problem is to transmit a signal $\beta^* \in \mathbb{R}^p$ to a remote receiver. What the receiver gets is a linearly transformed codeword $X\beta^*$ corrupted by small noise and malicious errors. While all the entries of the received codeword are affected by noise, only a fraction of them is corrupted by malicious errors, corresponding to outliers. The receiver has access to the corrupted version of $X\beta^*$ as well as to the encoding matrix $X$. Theorem 3.1 from (Candès and Randall, 2008) establishes that the Dantzig selector (Candès and Tao, 2007), for a properly chosen tuning parameter proportional to the noise level, achieves the (sub-optimal) rate $\sigma\{(p/n)^{1/2} + (o/n)^{1/2}\}$, up to a logarithmic factor. A similar result, with a noise-level-free version of the Dantzig selector, was proved in (Dalalyan and Chen, 2012). Our Theorem 3 implies that the error of the $\ell_1$-penalized Huber's estimator goes to zero at the faster rate $\sigma\{(p/n)^{1/2} + o/n\}$.

Finally, one can deduce from Theorem 3 that as soon as the number of outliers satisfies $o = O((sn)^{1/2})$, up to logarithmic factors, the rate of convergence remains the same as in the outlier-free setting.

### 4 Prior work

As attested by early references such as (Tukey, 1960), robust estimation has a long history. A remarkable, by now classic, result by Huber (1964) shows that among all the shift invariant $M$-estimators of a location parameter, the one that minimizes the asymptotic variance corresponds to Huber's loss function. This result was proved in the case when the reference distribution is univariate Gaussian. Apart from some exceptions, such as (Yatracos, 1985), during several decades the literature on robust estimation was mainly exploring the notions of breakdown point, influence function, asymptotic efficiency, etc., see for instance (Donoho and Gasko, 1992; Hampel et al., 2005; Huber and Ronchetti, 2009) and the recent survey (Yu and Yao, 2017). A more recent trend in statistics is to focus on finite sample risk bounds that are minimax-rate-optimal when the sample size $n$, the dimension $p$ of the unknown parameter and the number of outliers $o$ tend jointly to infinity (Chen et al., 2018, 2016; Gao, 2017).

In the problem of estimating the mean of a multivariate Gaussian distribution, it was shown that the optimal rate of the estimation error measured in Euclidean norm scales as $(p/n)^{1/2} + o/n$. Similar results were established for the problem of robust linear regression as well. However, the estimator that was shown to achieve this rate under fairly general conditions on the design is based on minimizing regression depths, which is a hard computational problem. Several alternative robust estimators with polynomial complexity were proposed (Diakonikolas et al., 2016; Lai et al., 2016; Cheng et al., 2019; Collier and Dalalyan, 2017; Diakonikolas et al., 2018).

Many recent papers studied robust linear regression. (Karmalkar and Price, 2018) considered $\ell_1$-constrained minimization of the $\ell_1$-norm of residuals and found a sharp threshold on the proportion of outliers determining whether the error of estimation tends to zero or not, when the noise level goes to zero. From a methodological point of view, the $\ell_1$-penalized Huber's estimator has been considered in (She and Owen, 2011; Lee et al., 2012). These papers also contain a comprehensive empirical evaluation and proposals for a data-driven choice of tuning parameters. Robust sparse regression with an emphasis on contaminated design was investigated in (Chen et al., 2013; Balakrishnan et al., 2017; Diakonikolas et al., 2019; Liu et al., 2018, 2019). Iterative and adaptive hard thresholding approaches were considered in (Bhatia et al., 2017; Suggala et al., 2019). Methods based on penalizing the vector of outliers were studied by Li2013; Foygel and Mackey (2014); Adcock, who adopted a more signal-processing point of view in which the noise vector is known to have a small norm and nothing else is known about it. We should stress that our proof techniques share many common features with those in (Foygel and Mackey, 2014).

The problem of robust estimation of graphical models, closely related to the present work, was addressed in (Balmand and Dalalyan, 2015; Katiyar et al., 2019; Liu et al., 2019). Quite surprisingly, at least to us, the minimax rate of robust estimation of the precision matrix in Frobenius norm is not known yet.

### 5 Extensions

The results presented in the previous sections pave the way for some future investigations, which are discussed below. None of these extensions is carried out in this work; they are listed here as possible avenues for future research.

##### Contaminated design

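In addition to labels, the features also might be corrupted by outliers. This is the case, for instance, in Gaussian graphical models. Formally, this means that instead of observing the clean data $\{(X_i^\circ, y_i^\circ)\}$ satisfying $y_i^\circ = (X_i^\circ)^\top \beta^* + \xi_i$, we observe $\{(X_i, y_i)\}$ such that $(X_i, y_i) = (X_i^\circ, y_i^\circ)$ for all $i$ except for a fraction of outliers $i \in O$. In such a setting, we can set $\theta_i^* = (y_i - X_i^\top \beta^* - \xi_i)/\sqrt{n}$ and recover exactly the same model as in (2).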

The important difference as compared to the setting investigated in the previous section is that it is not reasonable anymore to assume that the feature vectors $X_i$ are iid Gaussian. In the adversarial setting, they may even be correlated with the noise vector $\xi$. It is then natural to remove all the observations for which $\max_j |X_{ij}| > \sqrt{2 \log(np/\delta)}$ and to assume that the $\ell_1$-penalized Huber estimator is applied to data for which $\max_{i,j} |X_{ij}| \le \sqrt{2 \log(np/\delta)}$. This implies that $\lambda$ can be chosen of the order of $\tilde{O}(n^{-1/2} + o/n)$, which is an upper bound on $\|X^\top \xi\|_\infty / n$. Here, $a_n = \tilde{O}(b_n)$ is a shorthand for $a_n \le C b_n \log^c n$ for some $C > 0$, $c > 0$ and for every $n$.

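In addition, the TP is clearly satisfied since it is satisfied for the submatrix $X_{O^c}$ and $\|Xv\|_2 \ge \|X_{O^c} v\|_2$. As for the IP, we know from Theorem 2 that $X_{O^c}$ satisfies the IP with constants $b_1$, $b_2$, $b_3$ of order $\tilde{O}(n^{-1/2})$. On the other hand,

$$| u_O^\top X_O v | \le \|X\|_\infty \|u_O\|_1 \|v\|_1 \le \sqrt{2 o \log(np/\delta)}\, \|u_O\|_2 \|v\|_1. \tag{28}$$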

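This implies that $X$ satisfies the IP with $b_1$ and $b_3$ of order $\tilde{O}(n^{-1/2})$ and $b_2$ of order $\tilde{O}((o/n)^{1/2})$. Applying Theorem 1, we obtain that if condition (12) is satisfied for these constants, then with high probability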

 (29)

This rate of convergence appears to be slower than those obtained by methods tailored to deal with corruption in design, see (Liu et al., 2018, 2019) and the references therein. Using more careful analysis, this rate might be improvable. On the positive side, unlike many of its competitors, the estimator $\hat\beta$ has the advantage of being independent of the covariance matrix $\Sigma$ and of the sparsity $s$. Furthermore, the upper bound does not depend, even logarithmically, on . Finally, if , our bound yields the minimax-optimal rate. To the best of our knowledge, none of the previously studied robust estimators has such a property.

##### Sub-Gaussian design

The proof of Theorem 2 makes use of some results, such as the Gordon-Sudakov-Fernique or the Gaussian concentration inequality, which are specific to the Gaussian distribution. A natural question is whether the rate $r(n, p, s, o)$ can be obtained for more general design distributions. In the case of a sub-Gaussian design with a bounded scale parameter, it should be possible to adapt the methodology developed in this work to show that the TP and the IP are satisfied with high probability. Indeed, in the proof, it is possible to replace Gordon's comparison inequality by Talagrand's sub-Gaussian comparison inequality (Vershynin, 2018, Cor. 8.6.2). The Gaussian concentration inequality can be replaced by generic chaining.

##### Heavier tailed noise distributions

For simplicity, we assumed in the paper that the random variables $\xi_i$ are drawn from a Gaussian distribution. As usual for the Lasso analysis, all the results extend to the case of sub-Gaussian noise, see (Koltchinskii, 2011). Indeed, we only need to control tail probabilities of the random variables $\|X^\top \xi\|_\infty$ and $\|\xi\|_\infty$, which can be done using standard tools. We believe that it is possible to extend our results beyond sub-Gaussian noise, by assuming some type of heavy-tailed distributions. The rationale behind this is that any random variable $\xi$ can be written (in many different ways) as a sum of a sub-Gaussian variable $\xi^{\rm noise}$ and a "sparse" variable $\xi^{\rm out}$. By "sparse" we mean that $\xi^{\rm out}$ takes the value 0 with high probability. The most naive way of getting such a decomposition is to set $\xi^{\rm noise} = \xi\, \mathbb{1}(|\xi| < \tau)$ and $\xi^{\rm out} = \xi\, \mathbb{1}(|\xi| \ge \tau)$ for some threshold $\tau > 0$. The random noise terms $\xi_i^{\rm out}$ can be merged with $\sqrt{n}\, \theta_i^*$ and considered as outliers. We hope that this approach can establish a connection between two types of robustness: robustness to outliers considered in this work and robustness to heavy tails considered in many recent papers (Devroye et al., 2016; Catoni, 2012; Minsker, 2018; Lugosi and Mendelson, 2019; Lecué and Lerasle, 2017).
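The naive decomposition of a heavy-tailed noise into a bounded (hence sub-Gaussian) part and a sparse part can be sketched as follows. The truncation level `tau`, the function names and the choice of a Student distribution with 2 degrees of freedom are assumptions made for this illustration only.

```python
import random

def split_noise(xi, tau):
    """Split each noise value into a bounded part and a sparse part.

    bounded[i] = xi[i] if |xi[i]| < tau, else 0   (bounded, hence sub-Gaussian)
    sparse[i]  = xi[i] - bounded[i]               (zero unless |xi[i]| >= tau)
    """
    bounded = [x if abs(x) < tau else 0.0 for x in xi]
    sparse = [x - b for x, b in zip(xi, bounded)]
    return bounded, sparse

def student_t2():
    """One draw from a Student t distribution with 2 degrees of freedom."""
    z = random.gauss(0.0, 1.0)
    chi2 = random.gauss(0.0, 1.0) ** 2 + random.gauss(0.0, 1.0) ** 2
    return z / (chi2 / 2) ** 0.5

random.seed(0)
tau = 5.0                                   # hypothetical truncation level
xi = [student_t2() for _ in range(2000)]    # heavy-tailed noise sample
bounded_part, sparse_part = split_noise(xi, tau)

n_out = sum(1 for v in sparse_part if v != 0.0)
print(f"{n_out} of {len(xi)} noise terms are treated as outliers")
```

Only the few noise values exceeding the threshold end up in the sparse part, so they increase the effective number of outliers $o$ without affecting the sub-Gaussian analysis of the remaining noise.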

### 6 Numerical illustration

We performed a synthetic experiment to illustrate the obtained theoretical result and to check that it is in line with numerical results. We fixed $n$ and $p$ and considered 3 different levels of sparsity $s$. The noise variance $\sigma^2$ was held fixed and $\beta^*$ was set to have its first $s$ coordinates equal to 10 and the remaining ones equal to zero. Each corrupted response coordinate was replaced by an outlying value. The fraction of outliers ranged between 0 and 0.25, the number of outliers being increased with a step size of 5. The MSE was computed using 200 independent repetitions. The optimisation problem in (5) was solved using the glmnet package with fixed tuning parameters $\lambda_s$ and $\lambda_o$.

The obtained plots clearly demonstrate that the square root of the mean squared error depends linearly on the number of outliers $o$.
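An experiment of this kind can be sketched in a few lines. The code below is not the glmnet-based implementation used for the reported results: it is a minimal proximal-gradient (ISTA) solver for problem (5), written with NumPy, and every numerical value in it (sample size, dimension, outlier magnitude, tuning parameters, number of iterations) is a hypothetical choice made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Entrywise soft-thresholding, the proximal operator of t * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def huber_lasso(X, y, lam_s, lam_o, n_iter=2000):
    """Proximal-gradient (ISTA) solver for problem (5).

    Minimises (1/2n)||y - X b - sqrt(n) t||^2 + lam_s ||b||_1 + lam_o ||t||_1
    jointly over (b, t), i.e. a Lasso on the augmented design [X, sqrt(n) I].
    """
    n, p = X.shape
    M = np.hstack([X, np.sqrt(n) * np.eye(n)])
    step = n / np.linalg.norm(M, 2) ** 2   # inverse Lipschitz constant of the gradient
    thresh = step * np.concatenate([np.full(p, lam_s), np.full(n, lam_o)])
    w = np.zeros(p + n)
    for _ in range(n_iter):
        grad = M.T @ (M @ w - y) / n
        w = soft_threshold(w - step * grad, thresh)
    return w[:p], w[p:]

def simulate(o, n=200, p=50, s=5, sigma=1.0, seed=0):
    """Estimation error of the Huber-Lasso when o labels are corrupted."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:s] = 10.0
    y = X @ beta + sigma * rng.standard_normal(n)
    y[:o] += 30.0                           # hypothetical additive corruption
    lam_s = sigma * np.sqrt(2 * np.log(p) / n)
    lam_o = sigma * np.sqrt(2 * np.log(n) / n)
    beta_hat, _ = huber_lasso(X, y, lam_s, lam_o)
    return np.linalg.norm(beta_hat - beta)

for o in (0, 10, 20, 30):
    print(f"o={o:3d}  error={simulate(o):.3f}")
```

Averaging `simulate` over many seeds and a finer grid of `o` would give curves comparable in spirit to the ones described above; a single run per value of `o`, as here, only gives a noisy indication of the trend.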

### 7 Conclusion

We provided the first proof of the rate-optimality (up to logarithmic terms that can be avoided) of the $\ell_1$-penalized Huber's $M$-estimator in the setting of robust linear regression with adversarial contamination. We established this result under the assumption that the design is Gaussian with a covariance matrix $\Sigma$ that need not be invertible. The condition number governing the risk bound is the ratio of the largest diagonal entry of $\Sigma$ and its restricted eigenvalue. Thus, in addition to improving the rate of convergence, we also relaxed the assumptions on the design. Furthermore, we outlined some possible extensions, namely to corrupted design and/or sub-Gaussian design, which seem to be fairly easy to carry out building on the current work.

Next on our agenda is a more thorough analysis of robust estimation by $\ell_1$-penalization in the case of contaminated design. A possible approach, complementary to the one described in Section 5 above, is to adopt an errors-in-variables point of view similar to that developed in (Belloni et al., 2016). Another interesting avenue for future research is the development of scale-invariant robust estimators and their adaptation to the Gaussian graphical models. This can be done using the methodology brought forward in (Sun and Zhang, 2013; Balmand and Dalalyan, 2015). Finally, we would like to better understand what is the largest fraction of outliers for which the $\ell_1$-penalized Huber's $M$-estimator has a risk, measured in Euclidean norm, upper bounded by the optimal rate (4). Answering this question even under the stringent assumption of independent standard Gaussian design, in an asymptotic regime where $n$ tends to infinity, would be of interest.

### References

• Balakrishnan et al. (2017) Balakrishnan, S., Du, S. S., Li, J., and Singh, A. (2017). Computationally efficient robust sparse estimation in high dimensions. Proceedings of the 2017 Conference on Learning Theory, PMLR, 65:169–212.
• Balmand and Dalalyan (2015) Balmand, S. and Dalalyan, A. S. (2015). Convex programming approach to robust estimation of a multivariate gaussian model. arXiv. 1512.04734.
• Bellec (2017) Bellec, P. C. (2017). Localized Gaussian width of $M$-convex hulls with applications to Lasso and convex aggregation. arXiv e-prints, page arXiv:1705.10696.
• Bellec et al. (2018) Bellec, P. C., Lecué, G., and Tsybakov, A. B. (2018). Slope meets lasso: Improved oracle bounds and optimality. Ann. Statist., 46(6B):3603–3642.
• Belloni et al. (2016) Belloni, A., Rosenbaum, M., and Tsybakov, A. B. (2016). An $\{\ell_1, \ell_2, \ell_\infty\}$-regularization approach to high-dimensional errors-in-variables models. Electron. J. Statist., 10(2):1729–1750.
• Bhatia et al. (2017) Bhatia, K., Jain, P., Kamalaruban, P., and Kar, P. (2017). Consistent robust regression. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 2107–2116.
• Bickel et al. (2009) Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37(4):1705–1732.
• Boucheron et al. (2013) Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press.
• Candès and Randall (2008) Candès, E. and Randall, P. A. (2008). Highly robust error correction by convex programming. IEEE Trans. Inform. Theory, 54(7):2829–2840.
• Candès and Tao (2007) Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35(6):2313–2351.
• Catoni (2012) Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. Ann. Inst. Henri Poincaré Probab. Stat., 48(4):1148–1185.
• Chen et al. (2016) Chen, M., Gao, C., and Ren, Z. (2016). A general decision theory for Huber's $\epsilon$-contamination model. Electron. J. Statist., 10(2):3752–3774.
• Chen et al. (2018) Chen, M., Gao, C., and Ren, Z. (2018). Robust covariance and scatter matrix estimation under Huber’s contamination model. Ann. Statist., 46(5):1932–1960.
• Chen et al. (2013) Chen, Y., Caramanis, C., and Mannor, S. (2013). Robust sparse regression under adversarial corruption. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 774–782. PMLR.
• Cheng et al. (2019) Cheng, Y., Diakonikolas, I., and Ge, R. (2019). High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2755–2771.
• Collier and Dalalyan (2017) Collier, O. and Dalalyan, A. S. (2017). Minimax estimation of a p-dimensional linear functional in sparse Gaussian models and robust estimation of the mean. arXiv e-prints, page arXiv:1712.05495.
• Dalalyan and Chen (2012) Dalalyan, A. S. and Chen, Y. (2012). Fused sparsity and robust estimation for linear models with unknown variance. In Advances in Neural Information Processing Systems 25: NIPS, pages 1268–1276.
• Devroye et al. (2016) Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. Ann. Statist., 44(6):2695–2725.
• Diakonikolas et al. (2016) Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. (2016). Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE.
• Diakonikolas et al. (2018) Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. (2018). Robustly learning a Gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 2683–2702.
• Diakonikolas et al. (2019) Diakonikolas, I., Kong, W., and Stewart, A. (2019). Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2745–2754.
• Donoho and Montanari (2016) Donoho, D. and Montanari, A. (2016). High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probability Theory and Related Fields, 166(3):935–969.
• Donoho and Gasko (1992) Donoho, D. L. and Gasko, M. (1992). Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Ann. Statist., 20(4):1803–1827.
• Foygel and Mackey (2014) Foygel, R. and Mackey, L. (2014). Corrupted sensing: novel guarantees for separating structured signals. IEEE Trans. Inform. Theory, 60(2):1223–1247.
• Gao (2017) Gao, C. (2017). Robust Regression via Mutivariate Regression Depth. arXiv e-prints, page arXiv:1702.04656.
• Hampel et al. (2005) Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (2005). Robust statistics: the approach based on influence functions. Wiley series in probability and mathematical statistics. Probability and mathematical statistics. Wiley.
• Huber (1964) Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101.
• Huber and Ronchetti (2009) Huber, P. J. and Ronchetti, E. M. (2009). Robust statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, NJ, second edition.
• Karmalkar and Price (2018) Karmalkar, S. and Price, E. (2018). Compressed sensing with adversarial sparse noise via l1 regression. arXiv. 1809.08055.
• Katiyar et al. (2019) Katiyar, A., Hoffmann, J., and Caramanis, C. (2019). Robust estimation of tree structured Gaussian Graphical Model. arXiv e-prints, page arXiv:1901.08770.
• Koltchinskii (2011) Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d’Été de Probabilités de Saint-Flour XXXVIII-2008. Lecture Notes in Mathematics. Springer Berlin Heidelberg.
• Lai et al. (2016) Lai, K. A., Rao, A. B., and Vempala, S. (2016). Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE.
• Laska et al. (2009) Laska, J. N., Davenport, M. A., and Baraniuk, R. G. (2009). Exact signal recovery from sparsely corrupted measurements through the pursuit of justice. In Asilomar Conference on Signals, Systems and Computers, pages 1556–1560.
• Lecué and Lerasle (2017) Lecué, G. and Lerasle, M. (2017). Robust machine learning by median-of-means : theory and practice. arXiv e-prints, page arXiv:1711.10306.
• Lee et al. (2012) Lee, Y., MacEachern, S. N., and Jung, Y. (2012). Regularization of case-specific parameters for robustness and efficiency. Statist. Sci., 27(3):350–372.
• Liu et al. (2019) Liu, L., Li, T., and Caramanis, C. (2019). High dimensional robust estimation of sparse models via trimmed hard thresholding. CoRR, abs/1901.08237.
• Liu et al. (2018) Liu, L., Shen, Y., Li, T., and Caramanis, C. (2018). High dimensional robust sparse regression. CoRR, abs/1805.11643.
• Lugosi and Mendelson (2019) Lugosi, G. and Mendelson, S. (2019). Sub-Gaussian estimators of the mean of a random vector. Ann. Statist., 47(2):783–794.
• Minsker (2018) Minsker, S. (2018). Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Ann. Statist., 46(6A):2871–2903.
• Nguyen and Tran (2013) Nguyen, N. H. and Tran, T. D. (2013). Robust lasso with missing and grossly corrupted observations. IEEE Trans. Inform. Theory, 59(4):2036–2058.
• Oliveira (2013) Oliveira, R. (2013). The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties. arXiv. 1312.2903.
• Oliveira (2016) Oliveira, R. (2016). The lower tail of random quadratic forms with applications to ordinary least squares. Probability Theory and Related Fields, 166(3-4):1175–1194.
• Raskutti et al. (2010) Raskutti, G., Wainwright, M. J., and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res., 11:2241–2259.
• Rudelson and Zhou (2013) Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements. IEEE Trans. Inf. Theory, 59(6):3434–3447.
• She and Owen (2011) She, Y. and Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494):626–639.
• Suggala et al. (2019) Suggala, A. S., Bhatia, K., Ravikumar, P., and Jain, P. (2019). Adaptive hard thresholding for near-optimal consistent robust regression. CoRR, abs/1903.08192.
• Sun and Zhang (2013) Sun, T. and Zhang, C.-H. (2013). Sparse matrix inversion with scaled lasso. Journal of Machine Learning Research, 14:3385–3418.
• Tukey (1960) Tukey, J. W. (1960). A survey of sampling from contaminated distributions. Contributions to Probability and Statistics.
• Vershynin (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge. With a foreword by Sara van de Geer.
• Yatracos (1985) Yatracos, Y. G. (1985). Rates of convergence of minimum distance estimators and Kolmogorov’s entropy. Ann. Statist., 13(2):768–774.
• Yu and Yao (2017) Yu, C. and Yao, W. (2017). Robust linear regression: a review and comparison. Comm. Statist. Simulation Comput., 46(8):6261–6282.

### 8 Main technical results for general design matrices

In the sequel, we denote by the unit sphere in with respect to the Euclidean norm centered at the origin. With a slight abuse of notation, will be identified with . The unit ball with respect to the -norm centered at the origin will be denoted by . Given a matrix , we will use the definition without further notice. We will use notation , and . We denote by the support of and by that of . We know that and . Throughout, we set and define the dimension reduction cone , where is a constant.

#### 8.1 Augmented transfer principle implies the sub-optimal rate

This section is devoted to the proof of the fact that the estimators $\hat\beta$ and $\hat\theta$ achieve, up to logarithmic factors, the rates

$$\frac{s}{n\varkappa^2} + \frac{o}{n} \qquad\text{and}\qquad \frac{s}{\sqrt{n}\,\varkappa^2} + \frac{o}{\sqrt{n}}$$

for the squared $\ell_2$ error and the $\ell_1$ error, respectively. This is true under suitable conditions on the design matrix $X$. These rates are not optimal, but they will help us to obtain the optimal rates.

###### Proposition 1.

Let satisfy the with constant . Let and be some positive real numbers satisfying

$$8\,(c_2 \vee \gamma c_3)\Big(\frac{s}{\varkappa^2} + \frac{6.25\,o}{\gamma^2}\Big)^{1/2} \le c_1. \tag{30}$$

Assume that on some event , the following conditions are met:

• satisfies the .

Then, on the same event , we have and

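$$\big\|\Sigma^{1/2}\Delta^{\beta}\big\|_2^2 + \big\|\Delta^{\theta}\big\|_2^2 \le \frac{36}{c_1^4}\Big(\frac{\lambda_s^2 s}{\varkappa^2} + 6.25\,\lambda_o^2 o\Big), \tag{31}$$

$$\lambda_s\big\|\Delta^{\beta}\big\|_1 + \lambda_o\big\|\Delta^{\theta}\big\|_1 \le \frac{24}{c_1^2}\Big(\frac{\lambda_s^2 s}{\varkappa^2} + 6.25\,\lambda_o^2 o\Big). \tag{32}$$

###### Proof.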

First, we use the KKT conditions to infer that for some vectors and such that and , we have

$$\big[X^{(n)}\ \ I_n\big]^\top\big(y^{(n)} - X^{(n)}\hat\beta - \hat\theta\big) = [\lambda_s v;\ \lambda_o u]. \tag{33}$$
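The stationarity condition (33) can be checked numerically on simulated data. The following is only a minimal sketch under several assumptions that are not taken from the text: the estimator is taken to minimize $\tfrac12\|y^{(n)} - X^{(n)}\beta - \theta\|_2^2 + \lambda_s\|\beta\|_1 + \lambda_o\|\theta\|_1$, the scaling is $X^{(n)} = X/\sqrt{n}$ and $y^{(n)} = y/\sqrt{n}$, the data follow $y = X\beta^* + \sqrt{n}\,\theta^* + \xi$, and $v$, $u$ are read as subgradients of the $\ell_1$-norm at $\hat\beta$ and $\hat\theta$. The solver is a plain proximal-gradient (ISTA) loop chosen for brevity, and the names `fit_joint_l1` and `kkt_violation` are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Entrywise soft-thresholding, the proximal map of t * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_joint_l1(Xn, yn, lam_s, lam_o, n_iter=20000):
    """ISTA for the ASSUMED objective
    0.5 * ||yn - Xn b - th||_2^2 + lam_s * ||b||_1 + lam_o * ||th||_1."""
    n, p = Xn.shape
    M = np.hstack([Xn, np.eye(n)])              # augmented design [X^(n), I_n]
    step = 1.0 / np.linalg.norm(M, 2) ** 2      # 1 / (squared spectral norm)
    thresh = step * np.concatenate([np.full(p, lam_s), np.full(n, lam_o)])
    w = np.zeros(p + n)
    for _ in range(n_iter):
        w = soft_threshold(w + step * (M.T @ (yn - M @ w)), thresh)
    return w[:p], w[p:]

def kkt_violation(Xn, yn, beta, theta, lam_s, lam_o):
    """Largest deviation from (33), with v and u read as l1 subgradients."""
    r = yn - Xn @ beta - theta                  # residual of the joint fit

    def block(g, w, lam):
        active = w != 0
        # Active coordinates: g must equal lam * sign(w).
        on = np.abs(g[active] - lam * np.sign(w[active])).max(initial=0.0)
        # Inactive coordinates: |g| must not exceed lam.
        off = np.maximum(np.abs(g[~active]) - lam, 0.0).max(initial=0.0)
        return max(on, off)

    return max(block(Xn.T @ r, beta, lam_s), block(r, theta, lam_o))

# Simulated data (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
n, p = 40, 10
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:2] = [1.0, -1.0]                     # s = 2 nonzero coefficients
theta_star = np.zeros(n)
theta_star[:3] = 1.0                            # o = 3 contaminated labels
y = X @ beta_star + np.sqrt(n) * theta_star + 0.1 * rng.standard_normal(n)

Xn, yn = X / np.sqrt(n), y / np.sqrt(n)
lam_s = lam_o = 0.05
beta_hat, theta_hat = fit_joint_l1(Xn, yn, lam_s, lam_o)
print("max KKT violation:", kkt_violation(Xn, yn, beta_hat, theta_hat, lam_s, lam_o))
```

Under these assumptions, the second block of (33) says that every entry of the residual $y^{(n)} - X^{(n)}\hat\beta - \hat\theta$ is bounded by $\lambda_o$ in absolute value, which is the clipping behaviour that links the joint $\ell_1$-penalized problem to Huber's loss.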

Using the facts that and rearranging the terms, the last display takes the form

$$\big[X^{(n)}\ \ I_n\big]^\top\big[X^{(n)}\ \ I_n\big]\Delta = \big[(X^{(n)})^\top\xi^{(n)};\ \xi^{(n)}\big] + [\lambda_s v;\ \lambda_o u]. \tag{34}$$

Multiplying the last display from the left by $\Delta^\top$, we arrive at

 ∥[X(n)In]Δ∥