1 Introduction
Is it possible to attain optimal rates of estimation in outlier-robust sparse regression using penalized empirical risk minimization (PERM) with convex loss and convex penalties? The current state of the literature on robust estimation does not answer this question. Furthermore, it contains some signals that might suggest that the answer is negative. First, it has been shown in (Chen et al., 2013, Theorem 1) that in the case of adversarially corrupted samples, no method based on penalized empirical loss minimization, with convex loss and convex penalty, can lead to consistent support recovery. The authors then advocate for robustifying the penalized least-squares estimators by replacing usual scalar products by their trimmed counterparts. Second, (Chen et al., 2018) established that in the multivariate Gaussian model subject to Huber’s contamination, the coordinatewise median, which is the ERM for the $\ell_1$ loss, is suboptimal. A similar result was proved in (Lai et al., 2016, Prop. 2.1) for the geometric median, the ERM corresponding to the $\ell_2$ loss. These negative results prompted researchers to use other techniques, often of higher computational complexity, to solve the problem of outlier-corrupted sparse linear regression.
In the present work, we prove that the penalized empirical risk minimizer based on Huber’s loss is minimax-rate-optimal, up to possible logarithmic factors. Naturally, this result is not valid in the most general situation, but we demonstrate its validity under the assumptions that the design matrix satisfies some incoherence condition and only the response is subject to contamination. The incoherence condition is shown to be satisfied by the Gaussian design with a covariance matrix whose diagonal entries are bounded and bounded away from zero. This relatively simple setting is chosen in order to convey the main message of this work: for properly chosen convex loss and convex penalty functions, the PERM is minimax-rate-optimal in sparse linear regression with adversarially corrupted labels.
To describe more precisely the aforementioned optimality result, let be iid feature-label pairs such that are Gaussian with zero mean and covariance matrix and are defined by the linear model
(1) 
where the random noise , independent of
, is Gaussian with zero mean and variance
. Instead of observing the “clean” data , we have access to a contaminated version of it, , in which a small number of labels are replaced by an arbitrary value. Setting , and using the matrix-vector notation, the described model can be written as
(2) 
where is the design matrix, is the response vector, is the contamination and is the noise vector. The goal is to estimate the vector . The dimension is assumed to be large, possibly larger than but, for some small value , the vector is assumed to be sparse: . In such a setting, it is well-known that if we have access to the clean data and measure the quality of an estimator by the Mahalanobis norm (footnote 1: in the sequel, we use notation for any vector and any ), the optimal rate is
(3) 
In the outlier-contaminated setting, i.e., when is unavailable but one has access to , the minimax-optimal rate (Chen et al., 2016) takes the form
(4) 
The first estimators proved to attain this rate (Chen et al., 2016; Gao, 2017) were computationally intractable (footnote 2: in the sense that there is no algorithm computing these estimators in time polynomial in ) for large , and . This motivated several authors to search for polynomial-time algorithms attaining nearly optimal rate; the most relevant results will be reviewed later in this work.
The assumption that only a small number of labels are contaminated by outliers implies that the vector in (2) is sparse. In order to take advantage of sparsity of both and while ensuring computational tractability of the resulting estimator, a natural approach studied in several papers (Laska et al., 2009; Nguyen and Tran, 2013; Dalalyan and Chen, 2012) is to use some version of the penalized ERM. This corresponds to defining
(5) 
where are tuning parameters. This estimator is very attractive from a computational perspective, since it can be seen as the Lasso for the augmented design matrix , where is the identity matrix. To date, the best known rate for this type of estimator is
(6) 
obtained in (Nguyen and Tran, 2013) under some restrictions on . A quick comparison of (4) and (6) shows that the latter is suboptimal. Indeed, the ratio of the two rates may be as large as . The main goal of the present paper is to show that this suboptimality is not an intrinsic property of the estimator (5), but rather an artefact of previous proof techniques. By using a refined argument, we prove that defined by (5) does attain the optimal rate under very mild assumptions.
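The remark that the estimator (5) is a Lasso on an augmented design can be illustrated numerically. The sketch below is not the authors' implementation: it assumes the objective has the form (1/(2n))‖y − Xb − √n·θ‖² + λ_s‖b‖₁ + λ_o‖θ‖₁ (this scaling, the names `augmented_lasso`, `lam_s`, `lam_o`, and all problem sizes and tuning values are assumptions made for illustration), and it uses a plain proximal-gradient loop rather than a dedicated Lasso solver.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding; t may be a scalar or an array."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def augmented_lasso(X, y, lam_s, lam_o, n_iter=3000):
    """Proximal gradient (ISTA) on the ASSUMED objective
        (1/(2n)) * ||y - X b - sqrt(n) * th||_2^2 + lam_s * ||b||_1 + lam_o * ||th||_1,
    i.e. a Lasso on the augmented design M = [X, sqrt(n) * I]
    with two different penalty levels."""
    n, p = X.shape
    M = np.hstack([X, np.sqrt(n) * np.eye(n)])
    # One threshold per coordinate: lam_s for the regression part, lam_o for outliers.
    thresholds = np.concatenate([np.full(p, lam_s), np.full(n, lam_o)])
    # Step size = 1 / Lipschitz constant of the gradient of the quadratic term.
    step = n / np.linalg.norm(M, 2) ** 2
    w = np.zeros(p + n)
    for _ in range(n_iter):
        grad = M.T @ (M @ w - y) / n
        w = soft_threshold(w - step * grad, step * thresholds)
    return w[:p], w[p:]

# Hypothetical toy problem: sizes, signal strength and corruption are illustrative.
rng = np.random.default_rng(0)
n, p, s, o, sigma = 100, 50, 3, 5, 0.5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 2.0
y = X @ beta_star + sigma * rng.standard_normal(n)
y[:o] += 20.0  # corrupt the first o labels

# Illustrative tuning parameters (not the constants required by the theory).
lam_s = sigma * np.sqrt(2 * np.log(p) / n)
lam_o = sigma * np.sqrt(2 * np.log(n) / n)
beta_hat, theta_hat = augmented_lasso(X, y, lam_s, lam_o)
print("l2 error on the regression vector:", np.linalg.norm(beta_hat - beta_star))
```

In this toy run the entries of `theta_hat` with the largest magnitude are expected to sit on the corrupted labels, which is the mechanism by which the outlier block of the augmented design absorbs the contamination.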
In the sequel, we refer to as penalized Huber’s estimator. The rationale for this term is that the minimization with respect to in (5) can be done explicitly. It yields (Donoho and Montanari, 2016, Section 6)
(7) 
where is Huber’s function defined by .
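The explicit minimization over the outlier vector can be checked coordinate by coordinate. The following sketch assumes the standard normalization of Huber's function (quadratic on [-1, 1], linear outside) and a generic scalar problem, minimizing 0.5(r − t)² + λ|t| over t; the helper names are hypothetical. It verifies that the minimizer is the soft-thresholded residual and that the minimal value equals λ² times Huber's function evaluated at r/λ.

```python
def huber(u):
    """Huber's function in its standard normalization (an assumption here):
    quadratic on [-1, 1], linear outside."""
    return 0.5 * u * u if abs(u) <= 1.0 else abs(u) - 0.5

def soft_threshold(r, lam):
    """Minimizer over t of 0.5 * (r - t)**2 + lam * |t|."""
    if r > lam:
        return r - lam
    if r < -lam:
        return r + lam
    return 0.0

def profiled_objective(r, lam):
    """Value of 0.5 * (r - t)**2 + lam * |t| at its minimizer t."""
    t = soft_threshold(r, lam)
    return 0.5 * (r - t) ** 2 + lam * abs(t)

# The profiled value coincides with lam**2 * huber(r / lam).
lam = 1.5
for r in (-3.0, -0.4, 0.0, 0.7, 2.5):
    assert abs(profiled_objective(r, lam) - lam ** 2 * huber(r / lam)) < 1e-12
```

Small residuals are therefore penalized quadratically (the outlier coordinate is set to zero), while large residuals are penalized only linearly, which is what limits the influence of corrupted labels.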
To prove the rate-optimality of the estimator , we first establish a risk bound for a general design matrix not necessarily formed by Gaussian vectors. This is done in the next section. Then, in Section 3, we state and discuss the result showing that all the necessary conditions are satisfied for the Gaussian design. Relevant prior work is presented in Section 4, while Section 5 discusses potential extensions. Section 7 provides a summary of our results and an outlook on future work. The proofs are deferred to the supplementary material.
2 Risk bound for the penalized Huber’s estimator
This section is devoted to bringing forward sufficient conditions on the design matrix that allow for rate-optimal risk bounds for the estimator defined by (5) or, equivalently, by (7). There are two qualitative conditions that can be easily seen to be necessary: we call them restricted invertibility and incoherence. Indeed, even when there is no contamination, i.e., the number of outliers is known to be , the matrix has to satisfy a restricted invertibility condition (such as restricted isometry, restricted eigenvalue or compatibility) in order that the Lasso estimator (5) does achieve the optimal rate . On the other hand, in the case where and , even in the extremely favorable situation where the noise is zero, the only identifiable vector is . Therefore, it is impossible to consistently estimate when the design matrix is aligned with the identity matrix or close to being so.
The next definition formalizes what we call restricted invertibility and incoherence by introducing three notions: the transfer principle, the incoherence property and the augmented transfer principle. We will show that these notions play a key role in robust estimation by penalized least squares.
Definition 1.
Let
be a (random) matrix and
. We use notation .
We say that satisfies the transfer principle with and , denoted by , if for all ,
(8) 
We say that satisfies the incoherence property for some positive numbers , and , if for all ,
(9) 
We say that satisfies the augmented transfer principle for some positive numbers , and , if for all ,
(10)
These three properties are interrelated and related to extreme singular values of the matrix .
(P1) If satisfies then it also satisfies .
(P2) If satisfies and then it also satisfies with , and for any positive .
(P3) If satisfies , then it also satisfies
(P4) Any matrix satisfies , and , where and are, respectively, the th largest and the largest singular values of .
Claim (P1) is true, since if we choose in (10) we obtain (8). Claim (P2) coincides with Lemma 7, proved in the supplement. (P3) is a direct consequence of the inequality , valid for any vector . (P4) is a well-known characterization of the smallest and the largest singular values of a matrix. We will show later on that a Gaussian matrix satisfies with high probability all these conditions with constants and independent of and , , , , of order , up to logarithmic factors.
To state the main theorem of this section, we consider the simplified setting in which . Recall that in practice it is always recommended to normalize the columns of the matrix so that their Euclidean norm is of the order . The more precise version of the next result with better constants is provided in the supplement (see Proposition 1). We recall that a matrix is said to satisfy the restricted eigenvalue condition with some constant , if for any vector and any set such that and .
Theorem 1.
Let satisfy the condition with constant . Let , , , , , be some positive real numbers such that satisfies the and the . Assume that for some , the tuning parameter satisfies
(11) 
If the sparsity and the number of outliers satisfy the condition
(12) 
then, with probability at least , we have
(13) 
Theorem 1 is somewhat hard to parse. At this stage, let us simply mention that in the case of a Gaussian design considered in the next section, is of order while are of order , up to a factor logarithmic in , and . Here is an upper bound on the probability that the Gaussian matrix does not satisfy either or . Since Theorem 1 allows us to choose of the order , we infer from (13) that the error of estimating , measured in Euclidean norm, is of order , under the assumption that is smaller than a universal constant.
To complete this section, we present a sketch of the proof of Theorem 1. In order to convey the main ideas without diving too much into technical details, we assume . This means that the condition is satisfied with for any and . From the fact that the holds for , we infer that satisfies the condition with the constant . Using the well-known risk bounds for the Lasso estimator (Bickel et al., 2009), we get
(14) 
Note that these are the risk bounds established in (Candès and Randall, 2008; Dalalyan and Chen, 2012; Nguyen and Tran, 2013) (footnote 3: the first two references deal with the small dimensional case only, that is where ). These bounds are most likely unimprovable as long as the estimation of is of interest. However, if we focus only on the estimation error of , considering as a nuisance parameter, the following argument leads to a sharper risk bound. First, we note that
(15) 
The KKT conditions of this convex optimization problem take the following form
(16) 
where is the subset of containing all the vectors such that and for every . Multiplying the last displayed equation from left by , we get
(17) 
Recall now that and set and . We arrive at
(18) 
On the one hand, the duality inequality and the lower bound on imply that . On the other hand, well-known arguments yield . Therefore, we have
(19) 
Since satisfies the that implies the , we get . Combining with (19), this yields
(20)  
(21)  
(22) 
Using the first inequality in (14) and condition (12), we upper bound by 0. To upper bound the second-to-last term, we use the Cauchy-Schwarz inequality: . Combining all these bounds and rearranging the terms, we arrive at
(23) 
Taking the square root of both sides and using the second inequality in (14), we obtain an inequality of the same type as (13) but with slightly larger constants. As a concluding remark for this sketch of proof, let us note that if instead of using the last arguments, we replace all the error terms appearing in (21) by their upper bounds provided by (14), we do not get the optimal rate.
3 The case of Gaussian design
Our main result, Theorem 1, shows that if the design matrix satisfies the transfer principle and the incoherence property with suitable constants, then the penalized Huber’s estimator achieves the optimal rate under adversarial contamination. As a concrete example of a design matrix for which the aforementioned conditions are satisfied, we consider the case of correlated Gaussian design. As opposed to most prior work on robust estimation for linear regression with Gaussian design, we allow the covariance matrix to have a non-degenerate null space. We will simply assume that the rows of the matrix are independently drawn from the Gaussian distribution with a covariance matrix satisfying the condition. We will also assume in this section that all the diagonal entries of are equal to 1: . The more formal statements of the results, provided in the supplementary material, do not require this condition.
Theorem 2.
Let be a tolerance level and . For every positive semidefinite matrix with all the diagonal entries bounded by one, with probability at least , the matrix satisfies the , the and the with constants
(24)  
(25)  
(26) 
The proof of this result is provided in the supplementary material. It relies on by now standard tools such as Gordon’s comparison inequality, Gaussian concentration inequality and the peeling argument. Note that the and related results have been obtained in Raskutti et al. (2010); Oliveira (2016); Rudelson and Zhou (2013). The is basically a combination of a high probability version of Chevet’s inequality (Vershynin, 2018, Exercises 8.7.3-4) and the peeling argument. A property similar to the for Gaussian matrices with non-degenerate covariance was established in (Nguyen and Tran, 2013, Lemma 1) under further restrictions on .
Theorem 3.
There exist universal positive constants , , such that if
then, with probability at least , penalized Huber’s estimator with and satisfies
(27) 
Even though the constants appearing in Theorem 2 are reasonably small and smaller than in the analogous results in prior work, the constants , and are large, too large for being of any practical relevance. Finally, let us note that if and are known, it is very likely that following the techniques developed in (Bellec et al., 2018, Theorem 4.2), one can replace the terms and in (27) by and , respectively.
Comparing Theorem 3 with (Nguyen and Tran, 2013, Theorem 1), we see that our rate improvement is not only in terms of its dependence on the proportion of outliers, , but also in terms of the condition number , which is now completely decoupled from in the risk bound.
While our main focus is on the high dimensional situation in which can be larger than , it also applies to the case of small dimensional dense vectors, i.e., when is significantly smaller than . One of the applications of such a setting is the problem of stylized communication considered, for instance, in (Candès and Randall, 2008). The problem is to transmit a signal to a remote receiver. What the receiver gets is a linearly transformed codeword corrupted by small noise and malicious errors. While all the entries of the received codeword are affected by noise, only a fraction of them is corrupted by malicious errors, corresponding to outliers. The receiver has access to the corrupted version of as well as to the encoding matrix . Theorem 3.1 from (Candès and Randall, 2008) establishes that the Dantzig selector (Candès and Tao, 2007), for a properly chosen tuning parameter proportional to the noise level, achieves the (suboptimal) rate , up to a logarithmic factor. A similar result, with a noise-level-free version of the Dantzig selector, was proved in (Dalalyan and Chen, 2012). Our Theorem 3 implies that the error of the penalized Huber’s estimator goes to zero at the faster rate .
Finally, one can deduce from Theorem 3 that as soon as the number of outliers satisfies , the rate of convergence remains the same as in the outlier-free setting.
4 Prior work
As attested by early references such as (Tukey, 1960), robust estimation has a long history. A remarkable, by now classic, result by Huber (1964) shows that among all the shift invariant estimators of a location parameter, the one that minimizes the asymptotic variance corresponds to the loss function . This result was proved in the case when the reference distribution is univariate Gaussian. Apart from some exceptions, such as (Yatracos, 1985), during several decades the literature on robust estimation was mainly exploring the notions of breakdown point, influence function, asymptotic efficiency, etc., see for instance (Donoho and Gasko, 1992; Hampel et al., 2005; Huber and Ronchetti, 2009) and the recent survey (Yu and Yao, 2017). A more recent trend in statistics is to focus on finite sample risk bounds that are minimax-rate-optimal when the sample size , the dimension of the unknown parameter and the number of outliers tend jointly to infinity (Chen et al., 2018, 2016; Gao, 2017).
In the problem of estimating the mean of a multivariate Gaussian distribution, it was shown that the optimal rate of the estimation error measured in Euclidean norm scales as . Similar results were established for the problem of robust linear regression as well. However, the estimator that was shown to achieve this rate under fairly general conditions on the design is based on minimizing regression depths, which is a hard computational problem. Several alternative robust estimators with polynomial complexity were proposed (Diakonikolas et al., 2016; Lai et al., 2016; Cheng et al., 2019; Collier and Dalalyan, 2017; Diakonikolas et al., 2018).
Many recent papers studied robust linear regression. (Karmalkar and Price, 2018) considered constrained minimization of the norm of residuals and found a sharp threshold on the proportion of outliers determining whether the error of estimation tends to zero or not, when the noise level goes to zero. From a methodological point of view, penalized Huber’s estimator has been considered in (She and Owen, 2011; Lee et al., 2012). These papers also contain a comprehensive empirical evaluation and proposals for data-driven choice of tuning parameters. Robust sparse regression with an emphasis on contaminated design was investigated in (Chen et al., 2013; Balakrishnan et al., 2017; Diakonikolas et al., 2019; Liu et al., 2018, 2019). Iterative and adaptive hard thresholding approaches were considered in (Bhatia et al., 2017; Suggala et al., 2019). Methods based on penalizing the vector of outliers were studied by Li2013; Foygel and Mackey (2014); Adcock, who adopted a more signal-processing point of view in which the noise vector is known to have a small norm and nothing else is known about it. We should stress that our proof techniques share many common features with those in (Foygel and Mackey, 2014).
The problem of robust estimation of graphical models, closely related to the present work, was addressed in (Balmand and Dalalyan, 2015; Katiyar et al., 2019; Liu et al., 2019). Quite surprisingly, at least to us, the minimax rate of robust estimation of the precision matrix in Frobenius norm is not known yet.
5 Extensions
The results presented in previous sections pave the way for some future investigations, which are discussed below. None of these extensions is carried out in this work; they are listed here as possible avenues for future research.
Contaminated design
In addition to labels, the features also might be corrupted by outliers. This is the case, for instance, in Gaussian graphical models. Formally, this means that instead of observing the clean data satisfying , we observe such that for all except for a fraction of outliers . In such a setting, we can set and recover exactly the same model as in (2).
The important difference as compared to the setting investigated in the previous section is that it is not reasonable anymore to assume that the feature vectors are iid Gaussian. In the adversarial setting, they may even be correlated with the noise vector . It is then natural to remove all the observations for which and to assume that the penalized Huber estimator is applied to data for which . This implies that can be chosen of the order of (footnote 4: we use notation as a shorthand for for some and for every ) , which is an upper bound on .
In addition, is clearly satisfied since it is satisfied for the submatrix and . As for the , we know from Theorem 2 that satisfies with constants , , of order . On the other hand,
(28) 
This implies that satisfies with , and . Applying Theorem 1, we obtain that if for a sufficiently small constant , then with high probability
(29) 
This rate of convergence appears to be slower than those obtained by methods tailored to deal with corruption in design, see (Liu et al., 2018, 2019) and the references therein. Using more careful analysis, this rate might be improvable. On the positive side, unlike many of its competitors, the estimator has the advantage of being independent of the covariance matrix and of the sparsity . Furthermore, the upper bound does not depend, even logarithmically, on . Finally, if , our bound yields the minimax-optimal rate. To the best of our knowledge, none of the previously studied robust estimators has such a property.
Sub-Gaussian design
The proof of Theorem 2 makes use of some results, such as the Gordon-Sudakov-Fernique or Gaussian concentration inequality, which are specific to the Gaussian distribution. A natural question is whether the rate can be obtained for more general design distributions. In the case of a sub-Gaussian design with the scale parameter , it should be possible to adapt the methodology developed in this work to show that the and the are satisfied with high probability. Indeed, for proving the , it is possible to replace Gordon’s comparison inequality by Talagrand’s sub-Gaussian comparison inequality (Vershynin, 2018, Cor. 8.6.2). The Gaussian concentration inequality can be replaced by generic chaining.
Heavier tailed noise distributions
For simplicity, we assumed in the paper that the random variables are drawn from a Gaussian distribution. As usual for the Lasso analysis, all the results extend to the case of sub-Gaussian noise, see (Koltchinskii, 2011). Indeed, we only need to control tail probabilities of the random variable and , which can be done using standard tools. We believe that it is possible to extend our results beyond sub-Gaussian noise, by assuming some type of heavy-tailed distributions. The rationale behind this is that any random variable can be written (in many different ways) as a sum of a sub-Gaussian variable and a “sparse” variable . By “sparse” we mean that takes the value 0 with high probability. The most naive way for getting such a decomposition is to set and . The random noise terms can be merged with and considered as outliers. We hope that this approach can establish a connection between two types of robustness: robustness to outliers considered in this work and robustness to heavy tails considered in many recent papers (Devroye et al., 2016; Catoni, 2012; Minsker, 2018; Lugosi and Mendelson, 2019; Lecué and Lerasle, 2017).
6 Numerical illustration
We performed a synthetic experiment to illustrate the obtained theoretical result and to check that it is in line with numerical results. We chose and for 3 different levels of sparsity . The noise variance was set to and was set to have its first nonzero coordinates equal to 10. Each corrupted response coordinate was . The fraction of outliers ranged between 0 and 0.25, with a step size of 5 for the number of outliers. The MSE was computed using 200 independent repetitions. The optimisation problem in (5) was solved using the glmnet package with the tuning parameters .
The obtained plots clearly demonstrate that there is a linear dependence on of the square root of the mean squared error.
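A scaled-down version of such an experiment can be sketched as follows. This is not the code behind the reported plots: the paper solved (5) with glmnet, whereas the sketch swaps in a plain proximal-gradient loop on the Huber form of the objective, in which the gradient of the loss is the residual vector clipped at λ_o√n (an assumed scaling). The sample size, dimension, sparsity, noise level, corruption (an additive shift of 50), number of repetitions and tuning parameters are all hypothetical choices; only the value 10 of the nonzero coordinates and the outlier grid 0, 5, ..., 25 follow the description above.

```python
import numpy as np

def penalized_huber(X, y, lam_s, lam_o, n_iter=600):
    """Proximal gradient for an l1-penalized Huber loss.
    Assumed scaling: residuals are clipped at tau = lam_o * sqrt(n)."""
    n, p = X.shape
    tau = lam_o * np.sqrt(n)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz bound of the loss gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        clipped = np.clip(y - X @ beta, -tau, tau)  # derivative of the Huber loss
        z = beta + step * (X.T @ clipped) / n       # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam_s, 0.0)  # l1 prox
    return beta

def root_mse(n_outliers, n=100, p=200, s=5, sigma=1.0, reps=5, seed=0):
    """Square root of the average squared l2 error over `reps` synthetic data sets."""
    rng = np.random.default_rng(seed)
    beta_star = np.zeros(p)
    beta_star[:s] = 10.0
    lam_s = sigma * np.sqrt(2 * np.log(p) / n)  # illustrative tuning
    lam_o = sigma * np.sqrt(2 * np.log(n) / n)  # illustrative tuning
    sq_errors = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = X @ beta_star + sigma * rng.standard_normal(n)
        y[:n_outliers] += 50.0  # hypothetical corruption of the first labels
        beta_hat = penalized_huber(X, y, lam_s, lam_o)
        sq_errors.append(np.sum((beta_hat - beta_star) ** 2))
    return float(np.sqrt(np.mean(sq_errors)))

# Number of outliers from 0 to 25 (fraction 0 to 0.25 when n = 100), step 5.
results = {o: root_mse(o) for o in range(0, 26, 5)}
for o, err in results.items():
    print(f"outliers={o:2d}  root-MSE={err:.3f}")
```

Plotting the values in `results` against the number of outliers gives a quick visual check of how the error grows with the contamination level in this toy configuration.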
7 Conclusion
We provided the first proof of the rate-optimality (up to logarithmic terms that can be avoided) of penalized Huber’s estimator in the setting of robust linear regression with adversarial contamination. We established this result under the assumption that the design is Gaussian with a covariance matrix that need not be invertible. The condition number governing the risk bound is the ratio of the largest diagonal entry of and its restricted eigenvalue. Thus, in addition to improving the rate of convergence, we also relaxed the assumptions on the design. Furthermore, we outlined some possible extensions, namely to corrupted design and/or sub-Gaussian design, which seem to be fairly easy to carry out building on the current work.
Next on our agenda is the more thorough analysis of the robust estimation by penalization in the case of contaminated design. A possible approach, complementary to the one described in Section 5 above, is to adopt an errors-in-variables point of view similar to that developed in (Belloni et al., 2016). Another interesting avenue for future research is the development of scale-invariant robust estimators and their adaptation to the Gaussian graphical models. This can be done using methodology brought forward in (Sun and Zhang, 2013; Balmand and Dalalyan, 2015). Finally, we would like to better understand what is the largest fraction of outliers for which the penalized Huber’s estimator has a risk, measured in Euclidean norm, upper bounded by . Answering this question even under stringent assumptions of independent standard Gaussian design with going to zero as tends to infinity would be of interest.
References
 Balakrishnan et al. (2017) Balakrishnan, S., Du, S. S., Li, J., and Singh, A. (2017). Computationally efficient robust sparse estimation in high dimensions. Proceedings of the 2017 Conference on Learning Theory, PMLR, 65:169–212.
 Balmand and Dalalyan (2015) Balmand, S. and Dalalyan, A. S. (2015). Convex programming approach to robust estimation of a multivariate gaussian model. arXiv. 1512.04734.
 Bellec (2017) Bellec, P. C. (2017). Localized Gaussian width of $M$-convex hulls with applications to Lasso and convex aggregation. arXiv e-prints, page arXiv:1705.10696.
 Bellec et al. (2018) Bellec, P. C., Lecué, G., and Tsybakov, A. B. (2018). Slope meets lasso: Improved oracle bounds and optimality. Ann. Statist., 46(6B):3603–3642.
 Belloni et al. (2016) Belloni, A., Rosenbaum, M., and Tsybakov, A. B. (2016). An $\{\ell_1, \ell_2, \ell_\infty\}$-regularization approach to high-dimensional errors-in-variables models. Electron. J. Statist., 10(2):1729–1750.
 Bhatia et al. (2017) Bhatia, K., Jain, P., Kamalaruban, P., and Kar, P. (2017). Consistent robust regression. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 2107–2116.
 Bickel et al. (2009) Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37(4):1705–1732.
 Boucheron et al. (2013) Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press.
 Candès and Randall (2008) Candès, E. and Randall, P. A. (2008). Highly robust error correction by convex programming. IEEE Trans. Inform. Theory, 54(7):2829–2840.
 Candès and Tao (2007) Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35(6):2313–2351.
 Catoni (2012) Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. Ann. Inst. Henri Poincaré Probab. Stat., 48(4):1148–1185.
 Chen et al. (2016) Chen, M., Gao, C., and Ren, Z. (2016). A general decision theory for Huber’s contamination model. Electron. J. Statist., 10(2):3752–3774.
 Chen et al. (2018) Chen, M., Gao, C., and Ren, Z. (2018). Robust covariance and scatter matrix estimation under Huber’s contamination model. Ann. Statist., 46(5):1932–1960.

 Chen et al. (2013) Chen, Y., Caramanis, C., and Mannor, S. (2013). Robust sparse regression under adversarial corruption. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 774–782. PMLR.
 Cheng et al. (2019) Cheng, Y., Diakonikolas, I., and Ge, R. (2019). High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2755–2771.
 Collier and Dalalyan (2017) Collier, O. and Dalalyan, A. S. (2017). Minimax estimation of a p-dimensional linear functional in sparse Gaussian models and robust estimation of the mean. arXiv e-prints, page arXiv:1712.05495.
 Dalalyan and Chen (2012) Dalalyan, A. S. and Chen, Y. (2012). Fused sparsity and robust estimation for linear models with unknown variance. In Advances in Neural Information Processing Systems 25: NIPS, pages 1268–1276.
 Devroye et al. (2016) Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. Ann. Statist., 44(6):2695–2725.
 Diakonikolas et al. (2016) Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. (2016). Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE.
 Diakonikolas et al. (2018) Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. (2018). Robustly learning a gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 2683–2702.
 Diakonikolas et al. (2019) Diakonikolas, I., Kong, W., and Stewart, A. (2019). Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2745–2754.
 Donoho and Montanari (2016) Donoho, D. and Montanari, A. (2016). High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probability Theory and Related Fields, 166(3):935–969.
 Donoho and Gasko (1992) Donoho, D. L. and Gasko, M. (1992). Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Ann. Statist., 20(4):1803–1827.
 Foygel and Mackey (2014) Foygel, R. and Mackey, L. (2014). Corrupted sensing: novel guarantees for separating structured signals. IEEE Trans. Inform. Theory, 60(2):1223–1247.
 Gao (2017) Gao, C. (2017). Robust Regression via Mutivariate Regression Depth. arXiv e-prints, page arXiv:1702.04656.
 Hampel et al. (2005) Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (2005). Robust statistics: the approach based on influence functions. Wiley series in probability and mathematical statistics. Probability and mathematical statistics. Wiley.
 Huber (1964) Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101.
 Huber and Ronchetti (2009) Huber, P. J. and Ronchetti, E. M. (2009). Robust statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, NJ, second edition.
 Karmalkar and Price (2018) Karmalkar, S. and Price, E. (2018). Compressed sensing with adversarial sparse noise via l1 regression. arXiv. 1809.08055.
 Katiyar et al. (2019) Katiyar, A., Hoffmann, J., and Caramanis, C. (2019). Robust estimation of tree structured Gaussian Graphical Model. arXiv e-prints, page arXiv:1901.08770.
 Koltchinskii (2011) Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d’Été de Probabilités de Saint-Flour XXXVIII-2008. Lecture Notes in Mathematics. Springer Berlin Heidelberg.
 Lai et al. (2016) Lai, K. A., Rao, A. B., and Vempala, S. (2016). Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE.
 Laska et al. (2009) Laska, J. N., Davenport, M. A., and Baraniuk, R. G. (2009). Exact signal recovery from sparsely corrupted measurements through the pursuit of justice. In Asilomar Conference on Signals, Systems and Computers, pages 1556–1560.
 Lecué and Lerasle (2017) Lecué, G. and Lerasle, M. (2017). Robust machine learning by median-of-means: theory and practice. arXiv e-prints, page arXiv:1711.10306.
 Lee et al. (2012) Lee, Y., MacEachern, S. N., and Jung, Y. (2012). Regularization of case-specific parameters for robustness and efficiency. Statist. Sci., 27(3):350–372.
 Liu et al. (2019) Liu, L., Li, T., and Caramanis, C. (2019). High dimensional robust estimation of sparse models via trimmed hard thresholding. CoRR, abs/1901.08237.
 Liu et al. (2018) Liu, L., Shen, Y., Li, T., and Caramanis, C. (2018). High dimensional robust sparse regression. CoRR, abs/1805.11643.
 Lugosi and Mendelson (2019) Lugosi, G. and Mendelson, S. (2019). Sub-Gaussian estimators of the mean of a random vector. Ann. Statist., 47(2):783–794.
 Minsker (2018) Minsker, S. (2018). Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Ann. Statist., 46(6A):2871–2903.
 Nguyen and Tran (2013) Nguyen, N. H. and Tran, T. D. (2013). Robust lasso with missing and grossly corrupted observations. IEEE Trans. Inform. Theory, 59(4):2036–2058.
 Oliveira (2013) Oliveira, R. (2013). The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties. arXiv:1312.2903.

 Oliveira (2016) Oliveira, R. (2016). The lower tail of random quadratic forms with applications to ordinary least squares. Probability Theory and Related Fields, 166(3-4):1175–1194.
 Raskutti et al. (2010) Raskutti, G., Wainwright, M. J., and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res., 11:2241–2259.
 Rudelson and Zhou (2013) Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements. IEEE Trans. Inf. Theory, 59(6):3434–3447.
 She and Owen (2011) She, Y. and Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494):626–639.
 Suggala et al. (2019) Suggala, A. S., Bhatia, K., Ravikumar, P., and Jain, P. (2019). Adaptive hard thresholding for near-optimal consistent robust regression. CoRR, abs/1903.08192.
 Sun and Zhang (2013) Sun, T. and Zhang, C.-H. (2013). Sparse matrix inversion with scaled lasso. Journal of Machine Learning Research, 14:3385–3418.
 Tukey (1960) Tukey, J. W. (1960). A survey of sampling from contaminated distributions. Contributions to Probability and Statistics.

 Vershynin (2018) Vershynin, R. (2018). High-dimensional probability, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge. An introduction with applications in data science, With a foreword by Sara van de Geer.
 Yatracos (1985) Yatracos, Y. G. (1985). Rates of convergence of minimum distance estimators and Kolmogorov’s entropy. Ann. Statist., 13(2):768–774.
 Yu and Yao (2017) Yu, C. and Yao, W. (2017). Robust linear regression: a review and comparison. Comm. Statist. Simulation Comput., 46(8):6261–6282.
8 Main technical results for general design matrices
In the sequel, we denote by the unit sphere in with respect to the Euclidean norm centered at the origin. With a slight abuse of notation, will be identified with . The unit ball with respect to the norm centered at the origin will be denoted by . Given a matrix , we will use the definition without further notice. We will use notation , and . We denote by the support of and by that of . We know that and . Throughout, we set and define the dimension reduction cone , where is a constant.
8.1 Augmented transfer principle implies the suboptimal rate
This section proves that the estimators and achieve, up to logarithmic factors, the rates
for squared error and errors, respectively. This holds under suitable conditions on the design matrix . These rates are not optimal, but they serve as an intermediate step toward the optimal rates.
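The definitions of the two estimators are not reproduced in this section. As a reminder of why a pair of estimators appears in a paper about Huber's loss, the following minimal sketch checks a standard scalar identity numerically: minimizing 0.5 (r - theta)^2 + lam |theta| over theta yields Huber's loss of r with threshold lam, and the minimizer is the soft-thresholded residual. It assumes (this is not stated in the excerpt) that the second estimator plays the role of such a per-observation shift theta. The function names, the value of lam and the list of residuals are illustrative choices only.

```python
def soft_threshold(r, lam):
    """Minimizer over theta of 0.5 * (r - theta)**2 + lam * abs(theta)."""
    if r > lam:
        return r - lam
    if r < -lam:
        return r + lam
    return 0.0


def huber(r, lam):
    """Huber's loss with threshold lam: quadratic near zero, linear in the tails."""
    if abs(r) <= lam:
        return 0.5 * r * r
    return lam * abs(r) - 0.5 * lam * lam


def profiled_loss(r, lam):
    """Value of 0.5 * (r - theta)**2 + lam * abs(theta) at its minimizer theta."""
    theta = soft_threshold(r, lam)
    return 0.5 * (r - theta) ** 2 + lam * abs(theta)


# Illustrative threshold and residuals (hypothetical values, not from the paper).
lam = 1.5
residuals = [-4.0, -1.5, -0.3, 0.0, 0.7, 1.5, 10.0]

# Largest discrepancy between the profiled penalized loss and Huber's loss.
max_gap = max(abs(profiled_loss(r, lam) - huber(r, lam)) for r in residuals)
print(max_gap)
```

Residuals whose absolute value does not exceed lam get a zero shift and are treated quadratically, while larger residuals are absorbed by a nonzero shift and contribute only linearly, which is the mechanism that limits the influence of corrupted responses.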
Proposition 1.
Let satisfy the with constant . Let and be some positive real numbers satisfying
(30) 
Assume that on some event , the following conditions are met:

satisfies the .

Then, on the same event , we have and
(31)  
(32) 
Proof.
First, we use the KKT conditions to infer that for some vectors and such that and , we have
(33) 
Using the facts that and rearranging the terms, the last display takes the form
(34) 
Multiplying the last display from the left by , we arrive at